This article will show you how to use the python pandas module to read Microsoft Excel file’s one worksheet, multiple worksheets, all worksheets, and specified worksheet columns data.
1. How To Use Python Pandas Module To Read Microsoft Excel Files.
- Open a terminal and run the below command to make sure you have installed the python pandas library in your python environment. You can refer to the article How To Install Python Package Numpy, Pandas, Scipy, Matplotlib On Windows, Mac, And Linux to learn more.
$ pip show pandas Name: pandas Version: 1.2.3 Summary: Powerful data structures for data analysis, time series, and statistics Home-page: https://pandas.pydata.org Author: None Author-email: None License: BSD Location: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages Requires: numpy, python-dateutil, pytz Required-by: xarray, vaex-core, streamlit, statsmodels, plotly-express, bqplot, altair
- Import the python pandas library in your python source code.
import pandas as pd
- Invoke the python pandas module’s read_excel() function to read an excel file worksheet, the first parameter is the excel file path ( you can add r at the beginning of the file path string to avoid character escape issue, for example, the string r’C:\abc\read.xlsx’ will not treat \r as return character. ), the second parameter is the worksheet name. It returns a pandas.core.frame.DataFrame object.
df = pd.read_excel(excel_file_path, worksheet_name)
- If you want to read multiple worksheets in one excel file, you can pass a worksheet name list to the second parameter of the read_excel() function. In this case, the function returns a python dictionary object, you can iterate the python dictionary object to get each worksheet data.
df = pd.read_excel(excel_file_path, ['worksheet_1', 'worksheet_2'])
- When you pass None to the second parameter of the pandas read_excel function, it will return all the worksheets data in a python dictionary object.
df = pd.read_excel(excel_file_path, None)
- If you just want to get the specified excel columns data, you can use the returned pandas.core.frame.DataFrame object’s DataFrame() function.
# Get the pandas.core.frame.DataFrame object df = pd.read_excel(excel_file_path, worksheet_name) # Read the special columns in the worksheet. pd.DataFrame(df_one_sheet, columns= ['Name', 'Salary'])
-
Resolve column name misalignment. The default output result of pandas will cause the problems of column misalignment or incomplete display of multiple rows and columns. These two problems can be solved by using the set_option function.
pd.set_option('display.unicode.east_asian_width', True)
- Solve the problem of incomplete row and column display. By setting the display.max_rows and display.max_columns to the maximum number of rows and columns.
pd.set_option('display.max.rows', 1000) pd.set_option('display.max.columns', 1000)
2. Read Excel File With Pandas Examples.
2.1 The Example Excel WorkSheet Data Introduction.
- In this example, there are 2 worksheets ( sheet_1, sheet_2 ) in the excel file.
- Below is the data in Excel sheet_1.
Name Hire Date Salary 0 jerry 2010-01-01 16000 1 tom 2011-08-19 6000 2 kevin 2009-02-08 13000 3 richard 2012-03-19 5000 4 jackie 2015-06-08 28000 5 steven 2008-02-01 36000 6 jack 2006-09-19 8000 7 gary 2018-01-16 19000 8 john 2017-10-01 16600
- Below is the data in Excel sheet_2.
Job Title Job Description 0 Software Engineer Experience in java, python, c++ 1 QA WebDriver Selenium
2.2 Read Excel File With Pandas Examples.
- The example python file name is ReadExcelFiles.py, it contains 5 example functions, we will introduce them one by one.
2.2.1 read_one_excel_sheet().
- This function will read one worksheet data in the Excel file.
''' Created on Oct 16, 2021 @author: songzhao ''' # Import the python pandas module. import pandas as pd excel_file_path = "./employee_info.xlsx" ''' This function will read the excel file worksheet data with the specified value. ''' def read_work_sheet(excel_file_path, work_sheet_name): # Align column names by setting display.unicode.east_asian_width to True. pd.set_option('display.unicode.east_asian_width', True) # Solve the problem of incomplete row and column display pd.set_option('display.max.rows', 1000) pd.set_option('display.max.columns', 1000) # Invoke the pandas module read_excel method to load the specified excel file worksheet data. df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name) return df # This function print out the specified python object's data type. def print_object_type(prefix, object): object_type = type(object) print(prefix + str(object_type)) # This function will read one worksheet in an excel file. def read_one_excel_sheet(): print("****** read_one_excel_sheet() ******\r\n") # Define the Excel work sheet name. work_sheet_name = "sheet_1" # Get the Excel file work sheet object. df_one_sheet = read_work_sheet(excel_file_path, work_sheet_name) # Print out the work sheet data type. print_object_type("work_sheet_type: ", df_one_sheet) # Print out the loaded excel file work sheet data. print(df_one_sheet) print("=============================") # Print out the first 3 rows in the Excel sheet, if not specified then the default print out row count number is 5. print(df_one_sheet.head(3)) if __name__ == '__main__': read_one_excel_sheet()
- Below is the above example execution output.
****** read_one_excel_sheet() ****** work_sheet_type: <class 'pandas.core.frame.DataFrame'> Name Hire Date Salary 0 jerry 2010-01-01 16000 1 tom 2011-08-19 6000 2 kevin 2009-02-08 13000 3 richard 2012-03-19 5000 4 jackie 2015-06-08 28000 5 steven 2008-02-01 36000 6 jack 2006-09-19 8000 7 gary 2018-01-16 19000 8 john 2017-10-01 16600 ============================= Name Hire Date Salary 0 jerry 2010-01-01 16000 1 tom 2011-08-19 6000 2 kevin 2009-02-08 13000
2.2.2 read_multiple_excel_sheet().
- This function will read multiple Excel worksheets data.
''' Created on Oct 16, 2021 @author: songzhao ''' # Import the python pandas module. import pandas as pd excel_file_path = "./employee_info.xlsx" ''' This function will read the excel file worksheet data with the specified value. ''' def read_work_sheet(excel_file_path, work_sheet_name): # Align column names by setting display.unicode.east_asian_width to True. pd.set_option('display.unicode.east_asian_width', True) # Solve the problem of incomplete row and column display pd.set_option('display.max.rows', 1000) pd.set_option('display.max.columns', 1000) # Invoke the pandas module read_excel method to load the specified excel file worksheet data. df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name) return df # This function print out the specified python object's data type. def print_object_type(prefix, object): object_type = type(object) print(prefix + str(object_type)) # This function will read multiple worksheets in one excel file. def read_multiple_excel_sheet(): print("\r\n****** read_multiple_excel_sheet() ******\r\n") work_sheet_name_array = ["sheet_1", "sheet_2"] df_multiple_sheet = read_work_sheet(excel_file_path, work_sheet_name_array) df_sheet_1 = df_multiple_sheet['sheet_1'] # Print out the work sheet data type. print_object_type("work_sheet_1_type: ", df_sheet_1) print("\n") print(df_sheet_1) print("\n\n") df_sheet_2 = df_multiple_sheet['sheet_2'] # Print out the work sheet data type. print_object_type("work_sheet_2_type: ", df_sheet_2) print("\n") print(df_sheet_2) if __name__ == '__main__': read_multiple_excel_sheet()
- Below is the above example source code execution result.
****** read_multiple_excel_sheet() ****** work_sheet_1_type: <class 'pandas.core.frame.DataFrame'> Name Hire Date Salary 0 jerry 2010-01-01 16000 1 tom 2011-08-19 6000 2 kevin 2009-02-08 13000 3 richard 2012-03-19 5000 4 jackie 2015-06-08 28000 5 steven 2008-02-01 36000 6 jack 2006-09-19 8000 7 gary 2018-01-16 19000 8 john 2017-10-01 16600 work_sheet_2_type: <class 'pandas.core.frame.DataFrame'> Job Title Job Description 0 Software Engineer Experience in java, python, c++ 1 QA WebDriver Selenium
2.2.3 read_all_excel_sheet().
- This function will read all the Excel worksheets data.
''' Created on Oct 16, 2021 @author: songzhao ''' # Import the python pandas module. import pandas as pd excel_file_path = "./employee_info.xlsx" ''' This function will read the excel file worksheet data with the specified value. ''' def read_work_sheet(excel_file_path, work_sheet_name): # Align column names by setting display.unicode.east_asian_width to True. pd.set_option('display.unicode.east_asian_width', True) # Solve the problem of incomplete row and column display pd.set_option('display.max.rows', 1000) pd.set_option('display.max.columns', 1000) # Invoke the pandas module read_excel method to load the specified excel file worksheet data. df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name) return df # This function print out the specified python object's data type. def print_object_type(prefix, object): object_type = type(object) print(prefix + str(object_type)) # This function will read all worksheets in an excel file. def read_all_excel_sheet(): print("****** read_all_excel_sheet() ******\r\n") # return a python dictionary object. df_all_sheet = read_work_sheet(excel_file_path, None) # loop in the python dictionary object. for key in df_all_sheet: # print out the key and value. key is the worksheet name, value is the workhsheet data. print(key) print(df_all_sheet[key]) print("\n\n") if __name__ == '__main__': read_all_excel_sheet()
2.2.4 read_excel_columns_by_filtering_dataframe().
- This function will read all the data in one Excel worksheet, and then filter out 2 columns of data from the returned DataFrame object.
''' Created on Oct 16, 2021 @author: songzhao ''' # Import the python pandas module. import pandas as pd excel_file_path = "./employee_info.xlsx" ''' This function will read the excel file worksheet data with the specified value. ''' def read_work_sheet(excel_file_path, work_sheet_name): # Align column names by setting display.unicode.east_asian_width to True. pd.set_option('display.unicode.east_asian_width', True) # Solve the problem of incomplete row and column display pd.set_option('display.max.rows', 1000) pd.set_option('display.max.columns', 1000) # Invoke the pandas module read_excel method to load the specified excel file worksheet data. df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name) return df # This function print out the specified python object's data type. def print_object_type(prefix, object): object_type = type(object) print(prefix + str(object_type)) # This function read the provided columns in the worksheet. def read_excel_columns_by_filtering_dataframe(): print("****** read_excel_columns_by_filtering_dataframe() ******\r\n") # Define the Excel work sheet name. work_sheet_name = "sheet_1" # Get the whole Excel file work sheet data. df_one_sheet = read_work_sheet(excel_file_path, work_sheet_name) # Return the 2 specified excel columns value in the above Excel worksheet. df_columns = pd.DataFrame(df_one_sheet, columns= ['Name', 'Salary']) # Print the Excel columns value. print(df_columns) if __name__ == '__main__': read_excel_columns_by_filtering_dataframe()
- Below is the above example execution output.
****** read_excel_columns_by_filtering_dataframe() ****** Name Salary 0 jerry 16000 1 tom 6000 2 kevin 13000 3 richard 5000 4 jackie 28000 5 steven 36000 6 jack 8000 7 gary 19000 8 john 16600
2.2.5 read_excel_columns_by_pandas_read_excel_method().
- Besides using the python pandas module’s DataFrame class’s columns attributes to filter out the required columns from the whole worksheet data, you can also specify the columns when invoking the pandas.read_excel() method and provide the usecols parameter like below. This way is a more efficient way because it only returns the required columns from the Excel worksheet, not the whole data of the Excel file worksheet.
df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name, usecols=['Name', 'Salary'])
- The usecols parameter’s value is a python list object, the list element value can be number index or the column name.
# Get the specified columns in the Excel worksheet. # usecols=[0, 1] return the first and second column. df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name, usecols=[0, 1]) # usecols=['Name', 'Salary'] return the Name, Salary column. df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name, usecols=['Name', 'Salary'])
- The below example source code will read the 2 columns ( ‘Name’, ‘Salary’ ) data only from the Excel worksheet and then print out each column and it’s value.
''' Created on Oct 16, 2021 @author: songzhao ''' # Import the python pandas module. import pandas as pd excel_file_path = "./employee_info.xlsx" def read_excel_columns_by_pandas_read_excel_method(): # Align column names by setting display.unicode.east_asian_width to True. pd.set_option('display.unicode.east_asian_width', True) # Solve the problem of incomplete row and column display pd.set_option('display.max.rows', 1000) pd.set_option('display.max.columns', 1000) # Define the Excel work sheet name. work_sheet_name = "sheet_1" # Invoke the pandas module read_excel method to load the specified excel file worksheet data. # Get the specified columns in the Excel worksheet. usecols=[0, 1] return the first and second column, usecols=['Name', 'Salary'] return the Name, Salary column. #df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name, usecols=[0, 1]) df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name, usecols=['Name', 'Salary']) print(df) print_dataframe_object(df) ''' Loop the passed in DataFrame object's columns and print out each column name and column value. ''' def print_dataframe_object(df): print('============ print_dataframe_object ================') for col in df.columns: print('column : ', col) series = df[col] print('series.index : ', series.index) for index in series.index: value = series[index] print(index, ' : ', value) print('\n\n') if __name__ == '__main__': read_excel_columns_by_pandas_read_excel_method()
- Below is the above example execution output.
Name Salary 0 jerry 16000 1 tom 6000 2 kevin 13000 3 richard 5000 4 jackie 28000 5 steven 36000 6 jack 8000 7 gary 19000 8 john 16600 ============ print_dataframe_object ================ column : Name series.index : RangeIndex(start=0, stop=9, step=1) 0 : jerry 1 : tom 2 : kevin 3 : richard 4 : jackie 5 : steven 6 : jack 7 : gary 8 : john column : Salary series.index : RangeIndex(start=0, stop=9, step=1) 0 : 16000 1 : 6000 2 : 13000 3 : 5000 4 : 28000 5 : 36000 6 : 8000 7 : 19000 8 : 16600
2.2.6 read_excel_columns_set_column_index().
- When you call the python pandas module’s read_excel() method, you can pass a header input parameter, this parameter value defines which row data is used as the column index.
- Below is an example, I pass header=1 to the read_excel() function, so it will use the first row’s data in the Excel worksheet as the column index.
''' Created on Oct 16, 2021 @author: songzhao ''' # Import the python pandas module. import pandas as pd excel_file_path = "./employee_info.xlsx" def read_excel_columns_set_column_index(): # Align column names by setting display.unicode.east_asian_width to True. pd.set_option('display.unicode.east_asian_width', True) # Solve the problem of incomplete row and column display pd.set_option('display.max.rows', 1000) pd.set_option('display.max.columns', 1000) # Define the Excel work sheet name. work_sheet_name = "sheet_1" # Invoke the pandas module read_excel method to load the specified excel file worksheet data. # Set the second column as the row index (index_col=1). #df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name, index_col=0) # Set the first row as the column index (header=1).You can set header value to None to use number as the column index. df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name, header=1) print(df) print_dataframe_object(df) ''' Loop the passed in DataFrame object's columns and print out each column name and column value. ''' def print_dataframe_object(df): print('============ print_dataframe_object ================') for col in df.columns: print('column : ', col) series = df[col] print('series.index : ', series.index) for index in series.index: value = series[index] print(index, ' : ', value) print('\n\n') if __name__ == '__main__': read_excel_columns_set_column_index()
- Below is the above example execution output. The Excel worksheet first row’s value jerry, 2010-01-01, 16000 will be used as the column index.
jerry 2010-01-01 16000 0 tom 2011-08-19 6000 1 kevin 2009-02-08 13000 2 richard 2012-03-19 5000 3 jackie 2015-06-08 28000 4 steven 2008-02-01 36000 5 jack 2006-09-19 8000 6 gary 2018-01-16 19000 7 john 2017-10-01 16600 ============ print_dataframe_object ================ column : jerry series.index : RangeIndex(start=0, stop=8, step=1) 0 : tom 1 : kevin 2 : richard 3 : jackie 4 : steven 5 : jack 6 : gary 7 : john column : 2010-01-01 series.index : RangeIndex(start=0, stop=8, step=1) 0 : 2011-08-19 1 : 2009-02-08 2 : 2012-03-19 3 : 2015-06-08 4 : 2008-02-01 5 : 2006-09-19 6 : 2018-01-16 7 : 2017-10-01 column : 16000 series.index : RangeIndex(start=0, stop=8, step=1) 0 : 6000 1 : 13000 2 : 5000 3 : 28000 4 : 36000 5 : 8000 6 : 19000 7 : 16600
2.2.7 read_excel_columns_set_row_index().
- This function will pass the input parameter index_col=0 to the pandas.read_excel() method.
- It will use the first Excel worksheet column ( ‘Name’ ) as each row’s index. Below is the source code.
''' Created on Oct 16, 2021 @author: songzhao ''' # Import the python pandas module. import pandas as pd excel_file_path = "./employee_info.xlsx" def read_excel_columns_set_row_index(): # Align column names by setting display.unicode.east_asian_width to True. pd.set_option('display.unicode.east_asian_width', True) # Solve the problem of incomplete row and column display pd.set_option('display.max.rows', 1000) pd.set_option('display.max.columns', 1000) # Define the Excel work sheet name. work_sheet_name = "sheet_1" # Invoke the pandas module read_excel method to load the specified excel file worksheet data. # Set the second column as the row index (index_col=1). df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name, index_col=0) print(df) print_dataframe_object(df) ''' Loop the passed in DataFrame object's columns and print out each column name and column value. ''' def print_dataframe_object(df): print('============ print_dataframe_object ================') for col in df.columns: print('column : ', col) series = df[col] print('series.index : ', series.index) for index in series.index: value = series[index] print(index, ' : ', value) print('\n\n') if __name__ == '__main__': read_excel_columns_set_row_index()
- When you run the above example, it will generate the below output. You can see for each column data, the row data index is ‘Name’.
Hire Date Salary Name jerry 2010-01-01 16000 tom 2011-08-19 6000 kevin 2009-02-08 13000 richard 2012-03-19 5000 jackie 2015-06-08 28000 steven 2008-02-01 36000 jack 2006-09-19 8000 gary 2018-01-16 19000 john 2017-10-01 16600 ============ print_dataframe_object ================ column : Hire Date series.index : Index(['jerry', 'tom', 'kevin', 'richard', 'jackie', 'steven', 'jack', 'gary', 'john'], dtype='object', name='Name') jerry : 2010-01-01 tom : 2011-08-19 kevin : 2009-02-08 richard : 2012-03-19 jackie : 2015-06-08 steven : 2008-02-01 jack : 2006-09-19 gary : 2018-01-16 john : 2017-10-01 column : Salary series.index : Index(['jerry', 'tom', 'kevin', 'richard', 'jackie', 'steven', 'jack', 'gary', 'john'], dtype='object', name='Name') jerry : 16000 tom : 6000 kevin : 13000 richard : 5000 jackie : 28000 steven : 36000 jack : 8000 gary : 19000 john : 16600
References