Career Paths
The Job Ad Collector
Learn how to extract information from websites (web scraping) to automatically collect data, such as job postings.
Our Project: Job Ad Web Scraper
A script that visits a job listings website, finds all positions containing the word "Python," and saves the title, company, and link to a CSV file.
Core Technologies We'll Use:
- `requests` to download web pages
- `BeautifulSoup` (from `bs4`) to parse HTML
- The built-in `csv` module to save the results
Step 1: Library Installation and Setup
We install the necessary libraries and define the basic constants for our web scraper.
Introduction to Web Scraping
Web Scraping is the process of automatically extracting data from websites. Instead of manually copying information, we write a script that "visits" the page, downloads its content (HTML), and parses it to find the data we are interested in. It is an extremely powerful technique for data collection.
Warning: Always check a website's `robots.txt` file (e.g., `https://www.example.com/robots.txt`) and its terms of use to ensure that scraping is allowed. Always be "polite" to servers by making requests at reasonable intervals.
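If you want to perform this check programmatically, Python's standard library includes `urllib.robotparser`. Here is a minimal sketch, using the example address from above; swap in the site you actually intend to scrape:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # downloads and parses the robots.txt file

# Check whether a given User-Agent is allowed to fetch a given path
print(rp.can_fetch("*", "https://www.example.com/jobs"))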
1. Virtual Environment and Installation
Open your terminal and create a new environment for the project:
mkdir job_scraper
cd job_scraper
python -m venv venv
# Activation...
pip install requests beautifulsoup4
2. Initial Script
Create a file `job_scraper.py`. We will start by importing the libraries and defining our constants:
- `requests`: To make the HTTP request and download the page.
- `BeautifulSoup` (from `bs4`): To parse the HTML.
- `csv`: To write the data to a CSV file.
- `URL`: The address of the website we will scrape.
- `HEADERS`: It is good practice to send a `User-Agent` header to make our request look like that of a regular browser.
# job_scraper.py
import requests
from bs4 import BeautifulSoup
import csv
import time
# Note: Websites change frequently. The selectors may need adjustment.
URL = "https://www.kariera.gr/jobs?q=python"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
Step 2: Downloading the HTML Content
We use the requests library to send a GET request to the website and receive its HTML as a response.
The first step is to request the content of the webpage from the server. We will use the `requests.get()` function. It is crucial to wrap this call in a `try...except` block to handle potential network problems (e.g., timeout, connection error).
The `response.raise_for_status()` method is very useful, as it automatically checks if the response was successful (status code 2xx) and raises an exception if there was an error (e.g., 404 Not Found, 500 Server Error).
# (In the same file, job_scraper.py)
def scrape_jobs():
    try:
        print(f"Making request to: {URL}")
        response = requests.get(URL, headers=HEADERS, timeout=10)
        # Raises an exception for HTTP errors (e.g., 404, 500)
        response.raise_for_status()
        print("Page downloaded successfully!")
        return response.content  # Return the content for the next step
    except requests.exceptions.RequestException as e:
        print(f"Error during HTTP request: {e}")
        return None

if __name__ == "__main__":
    html_content = scrape_jobs()
    if html_content:
        print("Received", len(html_content), "bytes of HTML.")
Step 3: Parsing HTML with BeautifulSoup
We use BeautifulSoup to "read" the HTML and find the elements that contain the information we want.
Now that we have the HTML, we need to parse it to find the information we are interested in. This is where BeautifulSoup shines. We create a `BeautifulSoup` object, giving it the HTML content and a "parser" (usually `"html.parser"`).
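To see this in isolation before working on the real page, here is a minimal, self-contained sketch that parses a small HTML snippet. The snippet, tag names, and class names below are invented purely for illustration:
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet for illustration
html = """
<div class="job-card">
  <h2 class="job-title">Python Developer</h2>
  <a class="company" href="/jobs/123">Acme Ltd</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="job-card")                # first matching element
title = card.find("h2", class_="job-title").get_text(strip=True)
link = card.find("a")["href"]                             # value of the href attribute
print(title, link)  # Python Developer /jobs/123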
How do we find the right elements?
This is the most crucial and often the most difficult part of web scraping. It requires "inspecting" the source code of the webpage:
- Open the URL in your browser (Chrome, Firefox).
- Right-click on a job posting and select "Inspect" or "Inspect Element".
- In the window that opens, observe the HTML structure. Look for the tags (e.g., `<div>`, `<h2>`) and classes (`class="..."`) that surround the elements you want (title, company).
For kariera.gr, we observe that each job posting is inside a `<div>` with specific classes. We will use the `soup.find_all()` method to find all these divs.
# (Update scrape_jobs and the main block)
def scrape_jobs():
    try:
        response = requests.get(URL, headers=HEADERS, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error during HTTP request: {e}")
        return

    # Create a BeautifulSoup object to parse the HTML
    soup = BeautifulSoup(response.content, "html.parser")

    # Find the container that holds all the job listings
    job_cards = soup.find_all("div", class_="col-sm-12 col-md-12 col-lg-6 col-xl-4 p-2")

    if not job_cards:
        print("No job listings found. The website structure may have changed.")
        return

    print(f"Found {len(job_cards)} job cards.")
    # (Data extraction will be done in the next step)

if __name__ == "__main__":
    scrape_jobs()
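If `find_all()` returns an empty list, a quick way to see what was actually downloaded is to print part of the parsed HTML and compare it with what you see in the browser's "Inspect" view, for example:
# (For debugging only) print the first part of the parsed HTML
print(soup.prettify()[:1000])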
Step 4: Extracting Specific Data
We iterate through the elements we found and extract the text from the title, company, and link for each job posting.
Now that we have a list of the job "cards", we will loop through each card. Inside each card, we will use the `.find()` method to locate the specific element that contains the title (e.g., an `<h2>`), the company, and so on. Then, we will use the `.get_text(strip=True)` method to get only the text of the element, clean of HTML tags and whitespace.
For the link, we will find the `<a>` tag and get the value of the `href` attribute.
# (Inside the scrape_jobs function, after finding the job_cards)
found_jobs = []
for card in job_cards:
    # We search *inside* each card for the specific elements
    title_element = card.find("h2", class_="fs-18 mb-1")
    company_element = card.find("a", class_="d-block fw-bold text-dark fs-14 text-decoration-none")
    link_element = card.find("a", class_="text-decoration-none")

    # Check that we found all elements before trying to use them
    if title_element and company_element and link_element and 'href' in link_element.attrs:
        title = title_element.get_text(strip=True)
        company = company_element.get_text(strip=True)

        # Create the full link, since the href is relative
        link = "https://www.kariera.gr" + link_element['href']

        # Store the data in a dictionary
        found_jobs.append({
            "title": title,
            "company": company,
            "link": link
        })
        print(f"Found: {title} at {company}")

    # A small pause to be polite to the server
    time.sleep(0.1)

# (Saving will be done in the next step)
return found_jobs
Step 5: Saving Data to a CSV File
We use Python's built-in csv module to save the data we collected into a structured CSV file, ready for further analysis.
The final step is to save the data we collected. A CSV (Comma-Separated Values) file is an excellent choice, as it is simple and can be easily opened by programs like Excel or imported into Pandas for analysis.
We will use the `csv.DictWriter` class, which is ideal when our data is a list of dictionaries. We tell it what the column names are (`fieldnames`), write the header with `writer.writeheader()`, and then write all the data rows with `writer.writerows()`.
# (Inside the scrape_jobs function, after the loop)
if not found_jobs:
    print("No job listings with the specified criteria were found.")
    return

# Save the data to a CSV file
csv_filename = "python_jobs.csv"
try:
    with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
        # Define the column names
        fieldnames = ["title", "company", "link"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()   # Writes the first row with column names
        writer.writerows(found_jobs)   # Writes all the job listings

    print(f"\nSuccessfully saved {len(found_jobs)} job listings to '{csv_filename}'")
except IOError as e:
    print(f"Error writing the CSV file: {e}")
Project Completion & Next Steps
Congratulations! You have completed the path and now have the full code for the project.
This is the final, complete code for the application. You can copy it, run it locally on your computer (after installing the necessary libraries with `pip`), and experiment by adding your own features!
# job_scraper.py
import requests
from bs4 import BeautifulSoup
import csv
import time
URL = "https://www.kariera.gr/jobs?q=python"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

def scrape_jobs():
    try:
        print(f"Making request to: {URL}")
        response = requests.get(URL, headers=HEADERS, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error during HTTP request: {e}")
        return

    soup = BeautifulSoup(response.content, "html.parser")
    job_cards = soup.find_all("div", class_="col-sm-12 col-md-12 col-lg-6 col-xl-4 p-2")

    if not job_cards:
        print("No job listings found. The website structure may have changed.")
        return

    found_jobs = []
    for card in job_cards:
        title_element = card.find("h2", class_="fs-18 mb-1")
        company_element = card.find("a", class_="d-block fw-bold text-dark fs-14 text-decoration-none")
        link_element = card.find("a", class_="text-decoration-none")

        if title_element and company_element and link_element and 'href' in link_element.attrs:
            title = title_element.get_text(strip=True)
            company = company_element.get_text(strip=True)
            link = "https://www.kariera.gr" + link_element['href']
            found_jobs.append({"title": title, "company": company, "link": link})
            print(f"Found: {title} at {company}")

        time.sleep(0.1)

    if not found_jobs:
        print("No job listings with the specified criteria were found.")
        return

    csv_filename = "python_jobs.csv"
    try:
        with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
            fieldnames = ["title", "company", "link"]
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(found_jobs)
        print(f"\nSuccessfully saved {len(found_jobs)} job listings to '{csv_filename}'")
    except IOError as e:
        print(f"Error writing the CSV file: {e}")

if __name__ == "__main__":
    scrape_jobs()