Career Paths

The Data Analyst with Pandas

Learn to manipulate and analyze data from CSV files using the Pandas library, the ultimate tool for data analysis in Python.

Our Project: e-Shop Sales Analysis

We will analyze a CSV file with e-shop sales data to find the best-selling products, total revenue by category, and sales trends over time.

Core Technologies We'll Use:

Python
Pandas
Matplotlib
Step 1 / 6

Step 1: Introduction to Pandas and the DataFrame

We understand what Pandas is and its central structure, the DataFrame, which is the cornerstone of data analysis in Python.

What is Pandas?

Pandas is an open-source library that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It is the de facto tool for cleaning, transforming, manipulating, and analyzing data in Python.

The DataFrame

The heart of Pandas is the DataFrame. You can think of it like an Excel spreadsheet or a table in a SQL database. It is a two-dimensional data structure with labeled axes (rows and columns). Each column can have a different data type (numbers, text, dates, etc.).

Environment Setup

We will need the `pandas` and `matplotlib` libraries. Open your terminal, create a virtual environment, and install them:

# Create project folder and venv
mkdir pandas_sales_analyzer
cd pandas_sales_analyzer
python -m venv venv
# Activation...

# Install libraries
pip install pandas matplotlib

Create a file `sales_analyzer.py`. Also, create a file `sales_data.csv` with the following content and save it in the same folder:

Date,Category,Product,Quantity,Price
2023-01-15,Electronics,Laptop,1,1200
2023-01-20,Books,Python Intro,2,30
2023-02-10,Electronics,Mouse,5,25
2023-02-12,Home Goods,Coffee Maker,1,80
2023-03-05,Books,Data Science Book,3,45
2023-03-22,Electronics,Keyboard,2,75
Step 2 / 6

Step 2: Loading Data from CSV

We learn how to use the powerful pd.read_csv() function to load data from a CSV file directly into a DataFrame.

The first step in any analysis is to import the data. Pandas makes this process extremely simple with the `pd.read_csv()` function.


# sales_analyzer.py
import pandas as pd

def analyze_sales_data(file_path="sales_data.csv"):
    try:
        # We load the CSV into a DataFrame
        df = pd.read_csv(file_path)
        print("CSV file loaded successfully into a DataFrame.")
        return df
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' was not found.")
        return None

if __name__ == '__main__':
    sales_df = analyze_sales_data()
    if sales_df is not None:
        print("--- Object Type ---")
        print(type(sales_df))
Step 3 / 6

Step 3: Data Exploration

We use basic DataFrame methods (.head(), .info(), .describe()) to get a quick first look at the structure and content of our data.

Before diving into the analysis, it is crucial to understand our data. Pandas provides excellent methods for quick exploration:

  • `.head(n)`: Displays the first `n` rows of the DataFrame (default is 5).
  • `.info()`: Provides a concise summary of the DataFrame, including the data type of each column and non-null values.
  • `.describe()`: Generates descriptive statistics for the numerical columns (count, mean, standard deviation, min, max, etc.).

# ... (inside the analyze_sales_data function, after loading) ...
    print("\n--- First 5 Rows (head) ---")
    print(df.head())
    
    print("\n--- Info ---")
    df.info()
    
    print("\n--- Descriptive Statistics (describe) ---")
    print(df.describe())
Step 4 / 6

Step 4: Data Cleaning and Preparation

We perform necessary cleaning actions, such as converting data types and creating new columns needed for the analysis.

Data is rarely in perfect shape. Cleaning and preparation is one of the most important steps. In our example:

  1. The 'Date' column was read as plain text (object). We need to convert it to a real date type using `pd.to_datetime()`. This will allow us to perform time-based analyses.
  2. We will create a new 'Revenue' column by multiplying 'Quantity' by 'Price'. Pandas allows us to do this operation "vectorized," meaning on entire columns at once.

# ... (inside the analyze_sales_data function) ...
    
    print("\n--- Data Preparation ---")
    
    # Convert 'Date' column to datetime type
    df['Date'] = pd.to_datetime(df['Date'])
    print("'Date' column converted to datetime.")
    
    # Create new 'Revenue' column
    df['Revenue'] = df['Quantity'] * df['Price']
    print("'Revenue' column created.")

    print("\n--- DataFrame after Preparation ---")
    print(df.head())
    df.info() # Now 'Date' is datetime64[ns]
Step 5 / 6

Step 5: Analysis and Aggregation

We use the powerful .groupby() method to group data and calculate important statistics, such as total revenue and sales per product.

Now the real analysis begins. The `.groupby()` method is one of the most powerful tools in Pandas. It allows us to:

  1. Split: Divide the data into groups based on some criteria (e.g., by 'Product' or 'Category').
  2. Apply: Apply a function to each group independently (e.g., `sum()`, `mean()`).
  3. Combine: Combine the results into a new data structure.

# ... (inside the analyze_sales_data function) ...

    # 1. Total revenue
    total_revenue = df['Revenue'].sum()
    print(f"\nTotal Revenue: {total_revenue:.2f}€")

    # 2. Top-selling products (by quantity)
    top_selling_products = df.groupby('Product')['Quantity'].sum().sort_values(ascending=False)
    print("\n--- Top 5 Selling Products (units) ---")
    print(top_selling_products.head())

    # 3. Revenue by category
    revenue_by_category = df.groupby('Category')['Revenue'].sum().sort_values(ascending=False)
    print("\n--- Revenue by Category ---")
    print(revenue_by_category)
Step 6 / 6

Step 6: Visualizing Results

We create a simple bar chart to visualize monthly revenue, making our conclusions immediately understandable.

A table of numbers is useful, but a chart is often much more revealing. Pandas integrates seamlessly with Matplotlib, allowing us to call the `.plot()` method directly on a DataFrame or Series.

To analyze revenue by month, we will use two more powerful methods: `.set_index('Date')` to make the date column the index of the DataFrame, and `.resample('M')` to resample the data to a monthly frequency, applying the `sum()` function.


# At the beginning of the file: import matplotlib.pyplot as plt
# ... (inside the analyze_sales_data function) ...

    # Group revenue by month
    # 'M' stands for 'Month End Frequency'
    monthly_revenue = df.set_index('Date').resample('M')['Revenue'].sum()
    
    print("\n--- Monthly Revenue ---")
    print(monthly_revenue)

    # Create chart
    plt.figure(figsize=(10, 6)) # Define chart size
    monthly_revenue.plot(kind='bar', title="Monthly e-Shop Revenue")
    plt.xlabel("Month")
    plt.ylabel("Revenue (€)")
    plt.xticks(rotation=45) # Rotate labels for better readability
    plt.tight_layout() # Adjust chart to prevent labels from being cut off
    plt.show() # Display the chart (in local execution)

Project Completion & Next Steps

Congratulations! You have completed the path and now have the full code for the project.

This is the final, complete code for the application. You can copy it, run it locally on your computer (after installing the necessary libraries with `pip`), and experiment by adding your own features!


# sales_analyzer.py
import pandas as pd
import matplotlib.pyplot as plt

def analyze_sales_data(file_path="sales_data.csv"):
    try:
        df = pd.read_csv(file_path)
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' was not found.")
        return

    print("--- First 5 Rows of Data ---")
    print(df.head())
    
    print("\n--- Data Info ---")
    df.info()

    # --- Data Cleaning ---
    # Convert 'Date' column to datetime objects
    df['Date'] = pd.to_datetime(df['Date'])
    # Calculate revenue column
    df['Revenue'] = df['Quantity'] * df['Price']

    # --- Data Analysis ---
    print("\n--- Basic Analysis ---")
    total_revenue = df['Revenue'].sum()
    print(f"Total Revenue: {total_revenue:.2f}€")

    top_selling_products = df.groupby('Product')['Quantity'].sum().sort_values(ascending=False)
    print("\n--- Top 5 Selling Products (units) ---")
    print(top_selling_products.head())

    revenue_by_category = df.groupby('Category')['Revenue'].sum().sort_values(ascending=False)
    print("\n--- Revenue by Category ---")
    print(revenue_by_category)

    # --- Visualization ---
    monthly_revenue = df.set_index('Date').resample('M')['Revenue'].sum()
    
    plt.figure(figsize=(10, 6))
    monthly_revenue.plot(kind='bar')
    plt.title("Monthly Revenue")
    plt.xlabel("Month")
    plt.ylabel("Revenue (€)")
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

if __name__ == '__main__':
    # Create a sales_data.csv file for demonstration
    data = {
        'Date': ['2023-01-15', '2023-01-20', '2023-02-10', '2023-02-12', '2023-03-05'],
        'Category': ['Electronics', 'Books', 'Electronics', 'Home Goods', 'Books'],
        'Product': ['Laptop', 'Python Intro', 'Mouse', 'Coffee Maker', 'Data Science Book'],
        'Quantity': [1, 2, 5, 1, 3],
        'Price': [1200, 30, 25, 80, 45]
    }
    pd.DataFrame(data).to_csv('sales_data.csv', index=False)
    print("Created 'sales_data.csv' for testing.")
    
    analyze_sales_data()