Career Paths
The Job Ad Collector
Learn how to extract information from websites (web scraping) to automatically collect data, such as job postings.
Our Project: Job Ad Web Scraper
A script that visits a job listings website, finds all positions containing the word "Python," and saves the title, company, and link to a CSV file.
Core Technologies We'll Use:
- `requests` to download web pages
- `BeautifulSoup` (from `bs4`) to parse HTML
- The built-in `csv` module to save the results
Step 1: Library Installation and Setup
We install the necessary libraries and define the basic constants for our web scraper.
Introduction to Web Scraping
Web Scraping is the process of automatically extracting data from websites. Instead of manually copying information, we write a script that "visits" the page, downloads its content (HTML), and parses it to find the data we are interested in. It is an extremely powerful technique for data collection.
Warning: Always check a website's `robots.txt` file (e.g., `https://www.example.com/robots.txt`) and its terms of use to ensure that scraping is allowed. Always be "polite" to servers by making requests at reasonable intervals.
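If you want to perform this check programmatically, Python's standard library includes `urllib.robotparser`. Here is a minimal sketch, using the example address from above; swap in the site you actually intend to scrape:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # downloads and parses the robots.txt file

# Check whether a given User-Agent is allowed to fetch a given path
print(rp.can_fetch("*", "https://www.example.com/jobs"))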
1. Virtual Environment and Installation
Open your terminal and create a new environment for the project:
mkdir job_scraper
cd job_scraper
python -m venv venv
# Activation...
pip install requests beautifulsoup4
2. Initial Script
Create a file `job_scraper.py`. We will start by importing the libraries and defining our constants:
- `requests`: To make the HTTP request and download the page.
- `BeautifulSoup` (from `bs4`): To parse the HTML.
- `csv`: To write the data to a CSV file.
- `URL`: The address of the website we will scrape.
- `HEADERS`: It is good practice to send a `User-Agent` header to make our request look like that of a regular browser.
# job_scraper.py
import requests
from bs4 import BeautifulSoup
import csv
import time
# Note: Websites change frequently. The selectors may need adjustment.
URL = "https://www.kariera.gr/jobs?q=python"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
Step 2: Downloading the HTML Content
We use the requests library to send a GET request to the website and receive its HTML as a response.
The first step is to request the content of the webpage from the server. We will use the `requests.get()` function. It is crucial to wrap this call in a `try...except` block to handle potential network problems (e.g., timeout, connection error).
The `response.raise_for_status()` method is very useful, as it automatically checks if the response was successful (status code 2xx) and raises an exception if there was an error (e.g., 404 Not Found, 500 Server Error).
# (In the same file, job_scraper.py)
def scrape_jobs():
    try:
        print(f"Making request to: {URL}")
        response = requests.get(URL, headers=HEADERS, timeout=10)
        # Raises an exception for HTTP errors (e.g., 404, 500)
        response.raise_for_status()
        print("Page downloaded successfully!")
        return response.content  # Return the content for the next step
    except requests.exceptions.RequestException as e:
        print(f"Error during HTTP request: {e}")
        return None

if __name__ == "__main__":
    html_content = scrape_jobs()
    if html_content:
        print("Received", len(html_content), "bytes of HTML.")
Step 3: Parsing HTML with BeautifulSoup
We use BeautifulSoup to "read" the HTML and find the elements that contain the information we want.
Now that we have the HTML, we need to parse it to find the information we are interested in. This is where BeautifulSoup shines. We create a `BeautifulSoup` object, giving it the HTML content and a "parser" (usually `"html.parser"`).
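To see this in isolation before working on the real page, here is a minimal, self-contained sketch that parses a small HTML snippet. The snippet, tag names, and class names below are invented purely for illustration:
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet for illustration
html = """
<div class="job-card">
  <h2 class="job-title">Python Developer</h2>
  <a class="company" href="/jobs/123">Acme Ltd</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="job-card")                # first matching element
title = card.find("h2", class_="job-title").get_text(strip=True)
link = card.find("a")["href"]                             # value of the href attribute
print(title, link)  # Python Developer /jobs/123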
How do we find the right elements?
This is the most crucial and often the most difficult part of web scraping. It requires "inspecting" the source code of the webpage:
- Open the URL in your browser (Chrome, Firefox).
- Right-click on a job posting and select "Inspect" or "Inspect Element".
- In the window that opens, observe the HTML structure. Look for the tags (e.g., `<div>`, `<h2>`) and classes (`class="..."`) that surround the elements you want (title, company).
For kariera.gr, we observe that each job posting is inside a `<div>` with specific classes. We will use the `soup.find_all()` method to find all these divs.
# (Update scrape_jobs and the main block)
def scrape_jobs():
    try:
        response = requests.get(URL, headers=HEADERS, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error during HTTP request: {e}")
        return

    # Create a BeautifulSoup object to parse the HTML
    soup = BeautifulSoup(response.content, "html.parser")

    # Find the container that holds all the job listings
    job_cards = soup.find_all("div", class_="col-sm-12 col-md-12 col-lg-6 col-xl-4 p-2")

    if not job_cards:
        print("No job listings found. The website structure may have changed.")
        return

    print(f"Found {len(job_cards)} job cards.")
    # (Data extraction will be done in the next step)

if __name__ == "__main__":
    scrape_jobs()
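If `find_all()` returns an empty list, a quick way to see what was actually downloaded is to print part of the parsed HTML and compare it with what you see in the browser's "Inspect" view, for example:
# (For debugging only) print the first part of the parsed HTML
print(soup.prettify()[:1000])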
Step 4: Extracting Specific Data
We iterate through the elements we found and extract the text from the title, company, and link for each job posting.
Now that we have a list of the job "cards", we will loop through each card. Inside each card, we will use the `.find()` method to locate the specific element that contains the title (e.g., an `<h2>`), the company, and so on. Then, we will use the `.get_text(strip=True)` method to get only the text of the element, clean of HTML tags and whitespace.
For the link, we will find the `<a>` tag and get the value of the `href` attribute.
# (Inside the scrape_jobs function, after finding the job_cards)
found_jobs = []
for card in job_cards:
    # We search *inside* each card for the specific elements
    title_element = card.find("h2", class_="fs-18 mb-1")
    company_element = card.find("a", class_="d-block fw-bold text-dark fs-14 text-decoration-none")
    link_element = card.find("a", class_="text-decoration-none")

    # Check that we found all elements before trying to use them
    if title_element and company_element and link_element and 'href' in link_element.attrs:
        title = title_element.get_text(strip=True)
        company = company_element.get_text(strip=True)

        # Create the full link, since the href is relative
        link = "https://www.kariera.gr" + link_element['href']

        # Store the data in a dictionary
        found_jobs.append({
            "title": title,
            "company": company,
            "link": link
        })
        print(f"Found: {title} at {company}")

    # A small pause to be polite to the server
    time.sleep(0.1)

# (Saving will be done in the next step)
return found_jobs
Step 5: Saving Data to a CSV File
We use Python's built-in csv module to save the data we collected into a structured CSV file, ready for further analysis.
The final step is to save the data we collected. A CSV (Comma-Separated Values) file is an excellent choice, as it is simple and can be easily opened by programs like Excel or imported into Pandas for analysis.
We will use the `csv.DictWriter` class, which is ideal when our data is a list of dictionaries. We tell it what the column names are (`fieldnames`), write the header with `writer.writeheader()`, and then write all the data rows with `writer.writerows()`.
# (Inside the scrape_jobs function, after the loop)
if not found_jobs:
    print("No job listings with the specified criteria were found.")
    return

# Save the data to a CSV file
csv_filename = "python_jobs.csv"
try:
    with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
        # Define the column names
        fieldnames = ["title", "company", "link"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()   # Writes the first row with column names
        writer.writerows(found_jobs)   # Writes all the job listings

    print(f"\nSuccessfully saved {len(found_jobs)} job listings to '{csv_filename}'")
except IOError as e:
    print(f"Error writing the CSV file: {e}")
Project Completion & Next Steps
Congratulations! You have completed the path and now have the full code for the project.
This is the final, complete code for the application. You can copy it, run it locally on your computer (after installing the necessary libraries with `pip`), and experiment by adding your own features!
# job_scraper.py
import requests
from bs4 import BeautifulSoup
import csv
import time
URL = "https://www.kariera.gr/jobs?q=python"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

def scrape_jobs():
    try:
        print(f"Making request to: {URL}")
        response = requests.get(URL, headers=HEADERS, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error during HTTP request: {e}")
        return

    soup = BeautifulSoup(response.content, "html.parser")
    job_cards = soup.find_all("div", class_="col-sm-12 col-md-12 col-lg-6 col-xl-4 p-2")

    if not job_cards:
        print("No job listings found. The website structure may have changed.")
        return

    found_jobs = []
    for card in job_cards:
        title_element = card.find("h2", class_="fs-18 mb-1")
        company_element = card.find("a", class_="d-block fw-bold text-dark fs-14 text-decoration-none")
        link_element = card.find("a", class_="text-decoration-none")

        if title_element and company_element and link_element and 'href' in link_element.attrs:
            title = title_element.get_text(strip=True)
            company = company_element.get_text(strip=True)
            link = "https://www.kariera.gr" + link_element['href']
            found_jobs.append({"title": title, "company": company, "link": link})
            print(f"Found: {title} at {company}")

        time.sleep(0.1)

    if not found_jobs:
        print("No job listings with the specified criteria were found.")
        return

    csv_filename = "python_jobs.csv"
    try:
        with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
            fieldnames = ["title", "company", "link"]
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(found_jobs)
        print(f"\nSuccessfully saved {len(found_jobs)} job listings to '{csv_filename}'")
    except IOError as e:
        print(f"Error writing the CSV file: {e}")

if __name__ == "__main__":
    scrape_jobs()