Career Paths
The Data Analyst with Pandas
Learn to manipulate and analyze data from CSV files using the Pandas library, the ultimate tool for data analysis in Python.
Our Project: e-Shop Sales Analysis
We will analyze a CSV file with e-shop sales data to find the best-selling products, total revenue by category, and sales trends over time.
Core Technologies We'll Use:
Step 1: Introduction to Pandas and the DataFrame
We understand what Pandas is and its central structure, the DataFrame, which is the cornerstone of data analysis in Python.
What is Pandas?
Pandas is an open-source library that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It is the de facto tool for cleaning, transforming, manipulating, and analyzing data in Python.
The DataFrame
The heart of Pandas is the DataFrame. You can think of it like an Excel spreadsheet or a table in a SQL database. It is a two-dimensional data structure with labeled axes (rows and columns). Each column can have a different data type (numbers, text, dates, etc.).
Environment Setup
We will need the `pandas` and `matplotlib` libraries. Open your terminal, create a virtual environment, and install them:
# Create project folder and venv
mkdir pandas_sales_analyzer
cd pandas_sales_analyzer
python -m venv venv
# Activation...
# Install libraries
pip install pandas matplotlib
Create a file `sales_analyzer.py`. Also, create a file `sales_data.csv` with the following content and save it in the same folder:
Date,Category,Product,Quantity,Price
2023-01-15,Electronics,Laptop,1,1200
2023-01-20,Books,Python Intro,2,30
2023-02-10,Electronics,Mouse,5,25
2023-02-12,Home Goods,Coffee Maker,1,80
2023-03-05,Books,Data Science Book,3,45
2023-03-22,Electronics,Keyboard,2,75
Step 2: Loading Data from CSV
We learn how to use the powerful pd.read_csv() function to load data from a CSV file directly into a DataFrame.
The first step in any analysis is to import the data. Pandas makes this process extremely simple with the `pd.read_csv()` function.
# sales_analyzer.py
import pandas as pd
def analyze_sales_data(file_path="sales_data.csv"):
try:
# We load the CSV into a DataFrame
df = pd.read_csv(file_path)
print("CSV file loaded successfully into a DataFrame.")
return df
except FileNotFoundError:
print(f"Error: The file '{file_path}' was not found.")
return None
if __name__ == '__main__':
sales_df = analyze_sales_data()
if sales_df is not None:
print("--- Object Type ---")
print(type(sales_df))
Step 3: Data Exploration
We use basic DataFrame methods (.head(), .info(), .describe()) to get a quick first look at the structure and content of our data.
Before diving into the analysis, it is crucial to understand our data. Pandas provides excellent methods for quick exploration:
- `.head(n)`: Displays the first `n` rows of the DataFrame (default is 5).
- `.info()`: Provides a concise summary of the DataFrame, including the data type of each column and non-null values.
- `.describe()`: Generates descriptive statistics for the numerical columns (count, mean, standard deviation, min, max, etc.).
# ... (inside the analyze_sales_data function, after loading) ...
print("\n--- First 5 Rows (head) ---")
print(df.head())
print("\n--- Info ---")
df.info()
print("\n--- Descriptive Statistics (describe) ---")
print(df.describe())
Step 4: Data Cleaning and Preparation
We perform necessary cleaning actions, such as converting data types and creating new columns needed for the analysis.
Data is rarely in perfect shape. Cleaning and preparation is one of the most important steps. In our example:
- The 'Date' column was read as plain text (object). We need to convert it to a real date type using `pd.to_datetime()`. This will allow us to perform time-based analyses.
- We will create a new 'Revenue' column by multiplying 'Quantity' by 'Price'. Pandas allows us to do this operation "vectorized," meaning on entire columns at once.
# ... (inside the analyze_sales_data function) ...
print("\n--- Data Preparation ---")
# Convert 'Date' column to datetime type
df['Date'] = pd.to_datetime(df['Date'])
print("'Date' column converted to datetime.")
# Create new 'Revenue' column
df['Revenue'] = df['Quantity'] * df['Price']
print("'Revenue' column created.")
print("\n--- DataFrame after Preparation ---")
print(df.head())
df.info() # Now 'Date' is datetime64[ns]
Step 5: Analysis and Aggregation
We use the powerful .groupby() method to group data and calculate important statistics, such as total revenue and sales per product.
Now the real analysis begins. The `.groupby()` method is one of the most powerful tools in Pandas. It allows us to:
- Split: Divide the data into groups based on some criteria (e.g., by 'Product' or 'Category').
- Apply: Apply a function to each group independently (e.g., `sum()`, `mean()`).
- Combine: Combine the results into a new data structure.
# ... (inside the analyze_sales_data function) ...
# 1. Total revenue
total_revenue = df['Revenue'].sum()
print(f"\nTotal Revenue: {total_revenue:.2f}€")
# 2. Top-selling products (by quantity)
top_selling_products = df.groupby('Product')['Quantity'].sum().sort_values(ascending=False)
print("\n--- Top 5 Selling Products (units) ---")
print(top_selling_products.head())
# 3. Revenue by category
revenue_by_category = df.groupby('Category')['Revenue'].sum().sort_values(ascending=False)
print("\n--- Revenue by Category ---")
print(revenue_by_category)
Step 6: Visualizing Results
We create a simple bar chart to visualize monthly revenue, making our conclusions immediately understandable.
A table of numbers is useful, but a chart is often much more revealing. Pandas integrates seamlessly with Matplotlib, allowing us to call the `.plot()` method directly on a DataFrame or Series.
To analyze revenue by month, we will use two more powerful methods: `.set_index('Date')` to make the date column the index of the DataFrame, and `.resample('M')` to resample the data to a monthly frequency, applying the `sum()` function.
# At the beginning of the file: import matplotlib.pyplot as plt
# ... (inside the analyze_sales_data function) ...
# Group revenue by month
# 'M' stands for 'Month End Frequency'
monthly_revenue = df.set_index('Date').resample('M')['Revenue'].sum()
print("\n--- Monthly Revenue ---")
print(monthly_revenue)
# Create chart
plt.figure(figsize=(10, 6)) # Define chart size
monthly_revenue.plot(kind='bar', title="Monthly e-Shop Revenue")
plt.xlabel("Month")
plt.ylabel("Revenue (€)")
plt.xticks(rotation=45) # Rotate labels for better readability
plt.tight_layout() # Adjust chart to prevent labels from being cut off
plt.show() # Display the chart (in local execution)
Tip!
Project Completion & Next Steps
Congratulations! You have completed the path and now have the full code for the project.
This is the final, complete code for the application. You can copy it, run it locally on your computer (after installing the necessary libraries with `pip`), and experiment by adding your own features!
# sales_analyzer.py
import pandas as pd
import matplotlib.pyplot as plt
def analyze_sales_data(file_path="sales_data.csv"):
try:
df = pd.read_csv(file_path)
except FileNotFoundError:
print(f"Error: The file '{file_path}' was not found.")
return
print("--- First 5 Rows of Data ---")
print(df.head())
print("\n--- Data Info ---")
df.info()
# --- Data Cleaning ---
# Convert 'Date' column to datetime objects
df['Date'] = pd.to_datetime(df['Date'])
# Calculate revenue column
df['Revenue'] = df['Quantity'] * df['Price']
# --- Data Analysis ---
print("\n--- Basic Analysis ---")
total_revenue = df['Revenue'].sum()
print(f"Total Revenue: {total_revenue:.2f}€")
top_selling_products = df.groupby('Product')['Quantity'].sum().sort_values(ascending=False)
print("\n--- Top 5 Selling Products (units) ---")
print(top_selling_products.head())
revenue_by_category = df.groupby('Category')['Revenue'].sum().sort_values(ascending=False)
print("\n--- Revenue by Category ---")
print(revenue_by_category)
# --- Visualization ---
monthly_revenue = df.set_index('Date').resample('M')['Revenue'].sum()
plt.figure(figsize=(10, 6))
monthly_revenue.plot(kind='bar')
plt.title("Monthly Revenue")
plt.xlabel("Month")
plt.ylabel("Revenue (€)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
if __name__ == '__main__':
# Create a sales_data.csv file for demonstration
data = {
'Date': ['2023-01-15', '2023-01-20', '2023-02-10', '2023-02-12', '2023-03-05'],
'Category': ['Electronics', 'Books', 'Electronics', 'Home Goods', 'Books'],
'Product': ['Laptop', 'Python Intro', 'Mouse', 'Coffee Maker', 'Data Science Book'],
'Quantity': [1, 2, 5, 1, 3],
'Price': [1200, 30, 25, 80, 45]
}
pd.DataFrame(data).to_csv('sales_data.csv', index=False)
print("Created 'sales_data.csv' for testing.")
analyze_sales_data()