In data analysis, extracting specific rows from a multitude of CSV files is a common task. Using Python and the powerful Pandas library, you can efficiently accomplish this task. Suppose you have a directory containing 500 CSV files, and you need to extract rows based on a certain criterion from a subset of files with a particular naming convention, such as files ending with ‘_add.csv‘. Here’s a step-by-step guide on how to achieve this:
1. Setting Up the Environment.
- Before we dive into the code, make sure you have Python and Pandas installed on your system.
- You can run the command pip show pandas to confirm whether it has been installed or not.
- The below output shows that pandas have been installed on your system.
$ pip show pandas Name: pandas Version: 1.5.3 Summary: Powerful data structures for data analysis, time series, and statistics Home-page: https://pandas.pydata.org Author: The Pandas Development Team Author-email: [email protected] License: BSD-3-Clause Location: /Users/songzhao/anaconda3/lib/python3.11/site-packages Requires: numpy, numpy, python-dateutil, pytz Required-by: bokeh, datashader, holoviews, hvplot, panel, seaborn, statsmodels, xarray
- If you can not find pandas on your system, you can install Pandas using the pip command like below.
pip install pandas
- Next, import the required libraries in your Python script.
import os import pandas as pd
2. Example Dataset.
- To illustrate the process, let’s consider a set of CSV files containing information about sales transactions.
- We want to extract rows where the value in the 10th column matches a specific criterion.
- Example CSV File 1: sales_data_1_add.csv.
Date ,Product,Price,Quantity,Total ,CustomerID,Category,Region,Discount,PaymentMethod 2024-01-01,Apple ,1.99 ,5 ,9.95 ,1001 ,Fruit ,North ,0.05 ,Credit Card 2024-01-02,Banana ,0.99 ,10 ,9.90 ,1002 ,Fruit ,South ,0.00 ,PayPal 2024-01-03,Orange ,2.49 ,3 ,7.47 ,1003 ,Fruit ,West ,0.10 ,Cash 2024-01-04,Apple ,1.99 ,7 ,13.93 ,1004 ,Fruit ,East ,0.15 ,Debit Card 2024-01-05,Banana ,0.99 ,8 ,7.92 ,1005 ,Fruit ,North ,0.05 ,Cash 2024-01-06,Orange ,2.49 ,6 ,14.94 ,1006 ,Fruit ,South ,0.00 ,PayPal 2024-01-07,Apple ,1.99 ,4 ,7.96 ,1007 ,Fruit ,West ,0.10 ,Credit Card 2024-01-08,Banana ,0.99 ,9 ,8.91 ,1008 ,Fruit ,East ,0.15 ,PayPal 2024-01-09,Orange ,2.49 ,2 ,4.98 ,1009 ,Fruit ,North ,0.05 ,Debit Card 2024-01-10,Apple ,1.99 ,6 ,11.94 ,1010 ,Fruit ,South ,0.00 ,Cash 2024-01-11,Banana ,0.99 ,7 ,6.93 ,1011 ,Fruit ,West ,0.10 ,PayPal 2024-01-12,Orange ,2.49 ,5 ,12.45 ,1012 ,Fruit ,East ,0.15 ,Credit Card 2024-01-13,Apple ,1.99 ,3 ,5.97 ,1013 ,Fruit ,North ,0.05 ,PayPal 2024-01-14,Banana ,0.99 ,6 ,5.94 ,1014 ,Fruit ,South ,0.00 ,Debit Card 2024-01-15,Orange ,2.49 ,4 ,9.96 ,1015 ,Fruit ,West ,0.10 ,Cash 2024-01-16,Apple ,1.99 ,8 ,15.92 ,1016 ,Fruit ,East ,0.15 ,Credit Card
- Example CSV File 2: sales_data_2_add.csv.
Date ,Product,Price,Quantity,Total ,CustomerID,Category,Region,Discount,PaymentMethod 2024-01-17,Apple ,1.99 ,5 ,9.95 ,1017 ,Fruit ,North ,0.05 ,Cash 2024-01-18,Banana ,0.99 ,10 ,9.90 ,1018 ,Fruit ,South ,0.00 ,Debit Card 2024-01-19,Orange ,2.49 ,3 ,7.47 ,1019 ,Fruit ,West ,0.10 ,Credit Card 2024-01-20,Apple ,1.99 ,7 ,13.93 ,1020 ,Fruit ,East ,0.15 ,PayPal 2024-01-21,Banana ,0.99 ,8 ,7.92 ,1021 ,Fruit ,North ,0.05 ,Cash 2024-01-22,Orange ,2.49 ,6 ,14.94 ,1022 ,Fruit ,South ,0.00 ,Debit Card 2024-01-23,Apple ,1.99 ,4 ,7.96 ,1023 ,Fruit ,West ,0.10 ,Credit Card 2024-01-24,Banana ,0.99 ,9 ,8.91 ,1024 ,Fruit ,East ,0.15 ,PayPal 2024-01-25,Orange ,2.49 ,2 ,4.98 ,1025 ,Fruit ,North ,0.05 ,Cash 2024-01-26,Apple ,1.99 ,6 ,11.94 ,1026 ,Fruit ,South ,0.00 ,Debit Card 2024-01-27,Banana ,0.99 ,7 ,6.93 ,1027 ,Fruit ,West ,0.10 ,PayPal 2024-01-28,Orange ,2.49 ,5 ,12.45 ,1028 ,Fruit ,East ,0.15 ,Credit Card 2024-01-29,Apple ,1.99 ,3 ,5.97 ,1029 ,Fruit ,North ,0.05 ,Cash 2024-01-30,Banana ,0.99 ,6 ,5.94 ,1030 ,Fruit ,South ,0.00 ,Debit Card 2024-01-31,Orange ,2.49 ,4 ,9.96 ,1031 ,Fruit ,West ,0.10 ,Credit Card 2024-02-01,Apple ,1.99 ,8 ,15.92 ,1032 ,Fruit ,East ,0.15 ,PayPal
3. Extracting Rows Based on Criteria.
- Now, let’s proceed with the Python code to extract rows based on a specific criterion.
- In this example, we’ll search for rows where the value in the 10th column (Condition) is “PayPal” from a subset of CSV files ending with ‘_add.csv‘.
3.1 Define File Paths and Parameters.
- Set up the file paths for your source directory containing the CSV files and the destination directory where you want to save the extracted rows.
- Define any parameters such as the column number based on which you want to extract rows (in this case, column number 10).
source_directory = 'path/to/source/directory' destination_directory = 'path/to/destination/directory' target_files_pattern = '_add.csv' column_to_check = 10 desired_value = "PayPal".strip().lower()
3.2 Iterate Through Files and Extract Rows.
- Iterate through each file in the source directory. Check if the file matches the specified pattern.
- If it does, load the CSV file into a Pandas DataFrame and filter rows based on the criterion (value in column number 10). Append the filtered rows to a list.
extracted_rows = [] for file in os.listdir(source_directory): if file.endswith(target_files_pattern): file_path = os.path.join(source_directory, file) df = pd.read_csv(file_path) #print(df) rows = len(df.index) #print(rows) for i in range(rows): column_value = df.iat[i, column_to_check - 1] if column_value.strip().lower() == desired_value: extracted_rows.append(df.iloc[i])
3.3 Concatenate and Save Extracted Rows.
- Concatenate all the extracted rows into a single DataFrame and save it to a new CSV file in the destination directory.
result_df = pd.DataFrame(extracted_rows) print(result_df) result_df.to_csv(os.path.join(destination_directory, 'extracted_rows.csv'), index=False)
4. Full Example Source Code.
- First, save the example dataset in section 2 to two CSV files under the ./resource-files folder.
- Then create a Python file with the name how-to-extract-specific-rows-from-multiple-csv-files-using-pandas.py.
- Copy and paste the below Python source code to the Python file.
import os, pandas as pd def load_rows_from_multiple_csv(): # Define source and destination directories source_directory = './resource-files' destination_directory = './resource-files' # Define pattern for target CSV files and column number to check target_files_pattern = '_add.csv' column_to_check = 10 # Define desired value after stripping whitespace and converting to lowercase desired_value = "PayPal".strip().lower() # Initialize list to store extracted rows extracted_rows = [] # Iterate through files in the source directory for file in os.listdir(source_directory): # Check if the file matches the target pattern if file.endswith(target_files_pattern): file_path = os.path.join(source_directory, file) df = pd.read_csv(file_path) # Get the number of rows in the DataFrame rows = len(df.index) # Iterate through each row for i in range(rows): # Extract the value in the specified column column_value = df.iat[i, column_to_check - 1] # Check if the value matches the desired value after stripping whitespace and converting to lowercase if column_value.strip().lower() == desired_value: # If match, append the entire row to the list of extracted rows extracted_rows.append(df.iloc[i]) # Create a DataFrame from the list of extracted rows result_df = pd.DataFrame(extracted_rows) # Print the resulting DataFrame print(result_df) # Write the resulting DataFrame to a CSV file in the destination directory result_df.to_csv(os.path.join(destination_directory, 'extracted_rows.csv'), index=False) if __name__ == "__main__": # Call the function to load rows from multiple CSV files load_rows_from_multiple_csv()
- When you run the above code, it will generate the target file ./resource-files/extracted_rows.csv.
- Below is the above-generated file content.
Date ,Product,Price,Quantity,Total ,CustomerID,Category,Region,Discount,PaymentMethod 2024-01-20,Apple ,1.99,7,13.93,1020,Fruit ,East ,0.15,PayPal 2024-01-24,Banana ,0.99,9,8.91,1024,Fruit ,East ,0.15,PayPal 2024-01-27,Banana ,0.99,7,6.93,1027,Fruit ,West ,0.1,PayPal 2024-02-01,Apple ,1.99,8,15.92,1032,Fruit ,East ,0.15,PayPal 2024-01-02,Banana ,0.99,10,9.9,1002,Fruit ,South ,0.0,PayPal 2024-01-06,Orange ,2.49,6,14.94,1006,Fruit ,South ,0.0,PayPal 2024-01-08,Banana ,0.99,9,8.91,1008,Fruit ,East ,0.15,PayPal 2024-01-11,Banana ,0.99,7,6.93,1011,Fruit ,West ,0.1,PayPal 2024-01-13,Apple ,1.99,3,5.97,1013,Fruit ,North ,0.05,PayPal
5. Conclusion.
- By utilizing the power of Pandas, we can efficiently extract specific rows from a subset of CSV files based on custom criteria.
- This method proves invaluable when dealing with large datasets spread across multiple files, enabling streamlined data extraction and analysis.
- Now you can adapt this code to your specific requirements, such as modifying the criteria or handling different file structures, to effectively extract the desired data from your CSV files.