Career Paths

The Job Ad Collector

Learn how to extract information from websites (web scraping) to automatically collect data, such as job postings.

Our Project: Job Ad Web Scraper

A script that visits a job listings website, finds all positions containing the word "Python," and saves the title, company, and link to a CSV file.

Core Technologies We'll Use:

Python
requests
BeautifulSoup4
csv

Step 1: Library Installation and Setup

We install the necessary libraries and define the basic constants for our web scraper.

Introduction to Web Scraping

Web Scraping is the process of automatically extracting data from websites. Instead of manually copying information, we write a script that "visits" the page, downloads its content (HTML), and parses it to find the data we are interested in. It is an extremely powerful technique for data collection.

Warning: Always check a website's `robots.txt` file (e.g., `https://www.example.com/robots.txt`) and its terms of use to ensure that scraping is allowed. Always be "polite" to servers by making requests at reasonable intervals.
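If you want to automate that check, Python's standard library includes `urllib.robotparser`, which downloads and interprets a site's `robots.txt` for you. Below is a minimal sketch; the example.com addresses are placeholders, not the site we scrape later.

# robots_check.py - a minimal sketch for checking robots.txt before scraping
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")  # placeholder address
parser.read()  # downloads and parses the robots.txt file

# can_fetch() answers: may this user agent visit this URL?
if parser.can_fetch("*", "https://www.example.com/some-page"):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt - do not scrape this page")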

1. Virtual Environment and Installation

Open your terminal and create a new environment for the project:

mkdir job_scraper
cd job_scraper
python -m venv venv
# Activation: source venv/bin/activate  (on Windows: venv\Scripts\activate)
pip install requests beautifulsoup4

2. Initial Script

Create a file `job_scraper.py`. We will start by importing the libraries and defining our constants:

  • `requests`: To make the HTTP request and download the page.
  • `BeautifulSoup` (from `bs4`): To parse the HTML.
  • `csv`: To write the data to a CSV file.
  • `URL`: The address of the website we will scrape.
  • `HEADERS`: It is good practice to send a `User-Agent` header to make our request look like that of a regular browser.

# job_scraper.py
import requests
from bs4 import BeautifulSoup
import csv
import time

# Note: Websites change frequently. The selectors may need adjustment.
URL = "https://www.kariera.gr/jobs?q=python"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

Step 2: Downloading the HTML Content

We use the requests library to send a GET request to the website and receive its HTML as a response.

The first step is to request the content of the webpage from the server. We will use the `requests.get()` function. It is crucial to wrap this call in a `try...except` block to handle potential network problems (e.g., timeout, connection error).

The `response.raise_for_status()` method is very useful, as it automatically checks if the response was successful (status code 2xx) and raises an exception if there was an error (e.g., 404 Not Found, 500 Server Error).


# (In the same file, job_scraper.py)

def scrape_jobs():
    try:
        print(f"Making request to: {URL}")
        response = requests.get(URL, headers=HEADERS, timeout=10)
        # Raises an exception for HTTP errors (e.g., 404, 500)
        response.raise_for_status() 
        print("Page downloaded successfully!")
        return response.content # Return the content for the next step
    except requests.exceptions.RequestException as e:
        print(f"Error during HTTP request: {e}")
        return None

if __name__ == "__main__":
    html_content = scrape_jobs()
    if html_content:
        print("Received", len(html_content), "bytes of HTML.")

Step 3: Parsing HTML with BeautifulSoup

We use BeautifulSoup to "read" the HTML and find the elements that contain the information we want.

Now that we have the HTML, we need to parse it to find the information we are interested in. This is where BeautifulSoup shines. We create a `BeautifulSoup` object, giving it the HTML content and a "parser" (usually `"html.parser"`).
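To get a feel for how this works before touching the real page, here is a small, self-contained sketch. The HTML snippet, tag names, and classes in it are invented for illustration and do not come from kariera.gr.

# bs4_demo.py - a standalone sketch of how BeautifulSoup parsing works
from bs4 import BeautifulSoup

# A tiny, hypothetical HTML snippet - not the real kariera.gr markup
html = """
<div class="job"><h2>Python Developer</h2><a href="/jobs/1">Details</a></div>
<div class="job"><h2>Data Engineer</h2><a href="/jobs/2">Details</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns every element matching the given tag and class
for card in soup.find_all("div", class_="job"):
    title = card.find("h2").get_text(strip=True)  # text inside the <h2>
    link = card.find("a")["href"]                 # value of the href attribute
    print(title, link)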

How do we find the right elements?

This is the most crucial and often the most difficult part of web scraping. It requires "inspecting" the source code of the webpage:

  1. Open the URL in your browser (Chrome, Firefox).
  2. Right-click on a job posting and select "Inspect" or "Inspect Element".
  3. In the window that opens, observe the HTML structure. Look for the tags (e.g., `<div>`, `<h2>`) and classes (`class="..."`) that surround the elements you want (title, company).

For kariera.gr, we observe that each job posting is inside a `<div>` with specific classes. We will use the `soup.find_all()` method to find all these divs.


# (Update scrape_jobs and the main block)

def scrape_jobs():
    try:
        response = requests.get(URL, headers=HEADERS, timeout=10)
        response.raise_for_status() 
    except requests.exceptions.RequestException as e:
        print(f"Error during HTTP request: {e}")
        return

    # Create a BeautifulSoup object to parse the HTML
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Find all the job cards (each div holds one listing)
    job_cards = soup.find_all("div", class_="col-sm-12 col-md-12 col-lg-6 col-xl-4 p-2")
    
    if not job_cards:
        print("No job listings found. The website structure may have changed.")
        return
    
    print(f"Found {len(job_cards)} job cards.")
    # (Data extraction will be done in the next step)

if __name__ == "__main__":
    scrape_jobs()
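As a side note, `find_all()` is not the only way to locate elements: BeautifulSoup also accepts CSS selectors via `select()`, which some people find easier to read. A small self-contained sketch with invented markup (not the real kariera.gr structure):

# select_demo.py - the same idea expressed with a CSS selector instead of find_all()
from bs4 import BeautifulSoup

# Invented markup, for illustration only
html = '<div class="card"><h2 class="title">Python Developer</h2></div>'
soup = BeautifulSoup(html, "html.parser")

# "div.card h2.title" means: an <h2> with class "title" inside a <div> with class "card"
for heading in soup.select("div.card h2.title"):
    print(heading.get_text(strip=True))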

Step 4: Extracting Specific Data

We iterate through the elements we found and extract the text from the title, company, and link for each job posting.

This is the final, complete code for the application. You can copy it, run it locally on your computer (after installing the necessary libraries with `pip`), and experiment by adding your own features!


# job_scraper.py
import requests
from bs4 import BeautifulSoup
import csv
import time

URL = "https://www.kariera.gr/jobs?q=python"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

def scrape_jobs():
    try:
        print(f"Πραγματοποιείται αίτημα στο: {URL}")
        response = requests.get(URL, headers=HEADERS, timeout=10)
        response.raise_for_status() 
    except requests.exceptions.RequestException as e:
        print(f"Σφάλμα κατά το αίτημα HTTP: {e}")
        return

    soup = BeautifulSoup(response.content, "html.parser")
    job_cards = soup.find_all("div", class_="col-sm-12 col-md-12 col-lg-6 col-xl-4 p-2")
    
    if not job_cards:
        print("Δεν βρέθηκαν αγγελίες. Η δομή της ιστοσελίδας μπορεί να έχει αλλάξει.")
        return

    found_jobs = []
    for card in job_cards:
        title_element = card.find("h2", class_="fs-18 mb-1")
        company_element = card.find("a", class_="d-block fw-bold text-dark fs-14 text-decoration-none")
        link_element = card.find("a", class_="text-decoration-none")
        
        if title_element and company_element and link_element and 'href' in link_element.attrs:
            title = title_element.get_text(strip=True)
            company = company_element.get_text(strip=True)
            link = "https://www.kariera.gr" + link_element['href']
            
            found_jobs.append({ "title": title, "company": company, "link": link })
            print(f"Βρέθηκε: {title} στην {company}")
        time.sleep(0.1)

    if not found_jobs:
        print("Δεν βρέθηκε καμία αγγελία με τα συγκεκριμένα κριτήρια.")
        return
        
    csv_filename = "python_jobs.csv"
    try:
        with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
            fieldnames = ["title", "company", "link"]
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(found_jobs)
        print(f"\nΕπιτυχής αποθήκευση {len(found_jobs)} αγγελιών στο '{csv_filename}'")
    except IOError as e:
        print(f"Σφάλμα κατά την εγγραφή του αρχείου CSV: {e}")

if __name__ == "__main__":
    scrape_jobs()
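To check the results without opening a spreadsheet program, you could read the file back with `csv.DictReader`. A small sketch, assuming the scraper above has already created `python_jobs.csv` in the same folder:

# read_jobs.py - a minimal sketch for inspecting the saved CSV
import csv

with open("python_jobs.csv", newline="", encoding="utf-8") as csvfile:
    reader = csv.DictReader(csvfile)  # each row becomes a dict keyed by the header row
    for row in reader:
        print(f"{row['title']} - {row['company']} ({row['link']})")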