The pandas `groupby()` function is a powerful tool for working with tabular data in Python. It allows you to group rows based on one or more columns and then perform various operations on each group. This functionality is essential for data summarization, analysis, and manipulation.
1. Basic Principle of GroupBy.
- Grouping: You specify a column (or list of columns) in your DataFrame to create groups. Each unique value in the grouping column(s) becomes a group name.
- Operations on Groups: The `groupby()` function returns a `GroupBy` object, which provides methods to perform operations on each group.
2. Common GroupBy Operations.
- There are three main categories of operations you can perform on grouped data:
- Aggregation (agg): This involves summarizing data within each group using functions like `sum()`, `mean()`, `median()`, `count()`, etc.
- Transformation (transform): This applies a function to each group, potentially creating new columns or modifying existing ones. Common functions include `fillna()`, `apply()`, and custom functions.
- Application (apply): This allows you to apply a function to each group (like a mini-DataFrame) to perform more complex computations.
3. Example Data.
- Let’s create a sample DataFrame to illustrate these concepts:
import pandas as pd def create_df_data(): # Define example data data = {'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C', 'Store C'], 'product': ['Shirt', 'Jeans', 'Shirt', 'Hat', 'Shirt', 'Hat'], 'sales': [100, 200, 150, 75, 225, None]} # Include a missing value (None) for demonstration # Create DataFrame from the data df = pd.DataFrame(data) # Print the original DataFrame print('\r\nOriginal df:') print(df) # Return the created DataFrame return df # The following code will only execute if this script is run directly, # not if it is imported as a module in another script. if __name__ == "__main__": # Call the function to create and display the DataFrame df = create_df_data()
- Output.
Original df: store product sales 0 Store A Shirt 100.0 1 Store A Jeans 200.0 2 Store B Shirt 150.0 3 Store B Hat 75.0 4 Store C Shirt 225.0 5 Store C Hat NaN
4. Aggregation (agg).
- Aggregate functions operate on each group and return a single value summarizing that group.
- Here’s how to calculate total sales per store and the average sales per product category:
import pandas as pd def create_df_data(): # Define example data data = {'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C', 'Store C'], 'product': ['Shirt', 'Jeans', 'Shirt', 'Hat', 'Shirt', 'Hat'], 'sales': [100, 200, 150, 75, 225, None]} # Include a missing value (None) for demonstration # Create DataFrame from the data df = pd.DataFrame(data) # Print the original DataFrame print('\r\nOriginal df:') print(df) # Return the created DataFrame return df def test_agg_function(df): # Total sales per store store_totals = df.groupby('store')['sales'].sum() print('\r\nTotal sales per store:') print(store_totals) # Average sales per product product_avg_sales = df.groupby('product')['sales'].mean() print('\r\nAverage sales per product:') print(product_avg_sales) # Median sales per product product_medians = df.groupby('product')['sales'].median() print("\r\nMedian Sales per Product:") print(product_medians) # Count of sales per product product_counts = df.groupby('product')['sales'].count() print("\r\nCount of Sales per Product:") print(product_counts) # The following code will only execute if this script is run directly, # not if it is imported as a module in another script. if __name__ == "__main__": # Call the function to create and display the DataFrame df = create_df_data() test_agg_function(df)
- Output.
Original df: store product sales 0 Store A Shirt 100.0 1 Store A Jeans 200.0 2 Store B Shirt 150.0 3 Store B Hat 75.0 4 Store C Shirt 225.0 5 Store C Hat NaN Total sales per store: store Store A 300.0 Store B 225.0 Store C 225.0 Name: sales, dtype: float64 Average sales per product: product Hat 75.000000 Jeans 200.000000 Shirt 158.333333 Name: sales, dtype: float64 Median Sales per Product: product Hat 75.0 Jeans 200.0 Shirt 150.0 Name: sales, dtype: float64 Count of Sales per Product: product Hat 1 Jeans 1 Shirt 3 Name: sales, dtype: int64
5. Transformation (transform).
- Transformation functions modify or create new columns within each group.
- Here’s an example of filling missing sales values with the group mean and calculating sales percentages within each store:
import pandas as pd def create_df_data(): # Define example data data = {'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C', 'Store C'], 'product': ['Shirt', 'Jeans', 'Shirt', 'Hat', 'Shirt', 'Hat'], 'sales': [100, 200, 150, 75, 225, None]} # Include a missing value (None) for demonstration # Create DataFrame from the data df = pd.DataFrame(data) # Print the original DataFrame print('\r\nOriginal df:') print(df) # Return the created DataFrame return df def test_transformation_function(df): # Fill missing sales with group mean (assuming no missing values here) df['sales_filled'] = df.groupby('store')['sales'].transform('fillna', df['sales'].mean()) # Calculate sales percentages within each store (assuming no zeros in 'sales') df['sales_pct'] = df.groupby('store')['sales_filled'].transform(lambda x: x / x.sum()) print('\r\nTransformaed df:') print(df) # The following code will only execute if this script is run directly, # not if it is imported as a module in another script. if __name__ == "__main__": # Call the function to create and display the DataFrame df = create_df_data() test_transformation_function(df)
- Output.
Original df: store product sales 0 Store A Shirt 100.0 1 Store A Jeans 200.0 2 Store B Shirt 150.0 3 Store B Hat 75.0 4 Store C Shirt 225.0 5 Store C Hat NaN Transformaed df: store product sales sales_filled sales_pct 0 Store A Shirt 100.0 100.0 0.333333 1 Store A Jeans 200.0 200.0 0.666667 2 Store B Shirt 150.0 150.0 0.666667 3 Store B Hat 75.0 75.0 0.333333 4 Store C Shirt 225.0 225.0 0.600000 5 Store C Hat NaN 150.0 0.400000
6. Application (apply).
- The `apply()` method allows you to apply a custom function to each group (like a mini-DataFrame).
- Here’s an example of finding the most expensive product (by sales) within each store:
import pandas as pd def create_df_data(): # Define example data data = {'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C', 'Store C'], 'product': ['Shirt', 'Jeans', 'Shirt', 'Hat', 'Shirt', 'Hat'], 'sales': [100, 200, 150, 75, 225, None]} # Include a missing value (None) for demonstration # Create DataFrame from the data df = pd.DataFrame(data) # Print the original DataFrame print('\r\nOriginal df:') print(df) # Return the created DataFrame return df def get_most_expensive(group): # Define a function to retrieve the row with the maximum sales value within each group return group.loc[group['sales'].idxmax()] def test_apply_function(df): # Group the DataFrame by 'store' and apply the custom function to get the most expensive item in each store most_expensive_per_store = df.groupby('store').apply(get_most_expensive) # Print the result print(most_expensive_per_store) # The following code will only execute if this script is run directly, # not if it is imported as a module in another script. if __name__ == "__main__": # Call the function to create and display the DataFrame df = create_df_data() test_apply_function(df)
- Output.
Original df: store product sales 0 Store A Shirt 100.0 1 Store A Jeans 200.0 2 Store B Shirt 150.0 3 Store B Hat 75.0 4 Store C Shirt 225.0 5 Store C Hat NaN store product sales store Store A Store A Jeans 200.0 Store B Store B Shirt 150.0 Store C Store C Shirt 225.0