12/1/2021

Pandas Excel

Python Pandas is a Python data analysis library. It can read, filter and re-arrange small and large data sets and output them in a range of formats including Excel. Pandas writes Excel files using the Xlwt module for xls files and the Openpyxl or XlsxWriter modules for xlsx files. Using XlsxWriter with Pandas. We import the pandas module, including ExcelFile. The method readexcel reads the data into a Pandas Data Frame, where the first parameter is the filename and the second parameter is the sheet. The list of columns will be called df.columns. Import pandas as pd.

Introduction

The purpose of this article is to show some common Excel tasks and howyou would execute similar tasks in pandas. Some of the examples are somewhat trivialbut I think it is important to show the simple as well as the morecomplex functions you can find elsewhere. As an added bonus, I’m going to do some fuzzy stringmatching to show a little twist to the process and show how pandas canutilize the full python system of modules to do something simply inpython that would be complex in Excel.

Make sense? Let’s get started.

Pandas

Adding a Sum to a Row

The first task I’ll cover is summing some columns to add a total column.

We will start by importing our excel data into a pandas dataframe.

accountnamestreetcitystatepostal-codeJanFebMar
0 211829 Kerluke, Koepp and Hilpert 34456 Sean Highway New Jaycob Texas 28752 10000 62000 35000
1 320563 Walter-Trantow 1311 Alvis Tunnel Port Khadijah NorthCarolina 38365 95000 45000 35000
2 648336 Bashirian, Kunde and Price 62184 Schamberger Underpass Apt. 231 New Lilianland Iowa 76517 91000 120000 35000
3 109996 D’Amore, Gleichner and Bode 155 Fadel Crescent Apt. 144 Hyattburgh Maine 46021 45000 120000 10000
4 121213 Bauch-Goldner 7274 Marissa Common Shanahanchester California 49681 162000 120000 35000

We want to add a total column to show total sales for Jan, Feb and Mar.

This is straightforward in Excel and in pandas. For Excel, I have added the formula sum(G2:I2) in column J.Here is what it looks like in Excel:

Next, here is how we do it in pandas:

accountnamestreetcitystatepostal-codeJanFebMartotal
0 211829 Kerluke, Koepp and Hilpert 34456 Sean Highway New Jaycob Texas 28752 10000 62000 35000 107000
1 320563 Walter-Trantow 1311 Alvis Tunnel Port Khadijah NorthCarolina 38365 95000 45000 35000 175000
2 648336 Bashirian, Kunde and Price 62184 Schamberger Underpass Apt. 231 New Lilianland Iowa 76517 91000 120000 35000 246000
3 109996 D’Amore, Gleichner and Bode 155 Fadel Crescent Apt. 144 Hyattburgh Maine 46021 45000 120000 10000 175000
4 121213 Bauch-Goldner 7274 Marissa Common Shanahanchester California 49681 162000 120000 35000 317000

Next, let’s get some totals and other values for each month. Here is whatwe are trying to do as shown in Excel:

As you can see, we added a SUM(G2:G16) in row 17 in each of the columns to gettotals by month.

Performing column level analysis is easy in pandas. Here are a couple of examples.

Now, we want to add a total by month and grand total. This is wherepandas and Excel diverge a little. It is very simple to add totals incells in Excel for each month. Because pandas need to maintain theintegrity of the entire DataFrame, there are a couple more steps.

First, create a sum for the month and total columns.

This is fairly intuitive however, if you want to add totals as a row,you need to do some minor manipulations.

We need to transpose the data and convert the Series to a DataFrame sothat it is easier to concat onto our existing data. The T functionallows us to switch the data from being row-based to column-based.

The final thing we need to do before adding the totals back is to addthe missing columns. We use reindex to do this for us. The trick is toadd all of our columns and then allow pandas to fill in the values thatare missing.

accountnamestreetcitystatepostal-codeJanFebMartotal
0NaNNaNNaNNaNNaNNaN 1462000 1507000 717000 3686000

Now that we have a nicely formatted DataFrame, we can add it to ourexisting one using append.

accountnamestreetcitystatepostal-codeJanFebMartotal
11 231907 Hahn-Moore 18115 Olivine Throughway Norbertomouth NorthDakota 31415 150000 10000 162000 322000
12 242368 Frami, Anderson and Donnelly 182 Bertie Road East Davian Iowa 72686 162000 120000 35000 317000
13 268755 Walsh-Haley 2624 Beatty Parkways Goodwinmouth RhodeIsland 31919 55000 120000 35000 210000
14 273274 McDermott PLC 8917 Bergstrom Meadow Kathryneborough Delaware 27933 150000 120000 70000 340000
15 NaN NaN NaN NaN NaN NaN 1462000 1507000 717000 3686000

Additional Data Transforms

Pandas Excel

For another example, let’s try to add a state abbreviation to the data set.

From an Excel perspective the easiest way is probably to add a newcolumn, do a vlookup on the state name and fill in the abbreviation.

I did this and here is a snapshot of what the results looks like:

You’ll notice that after performing the vlookup, there are some values that are notcoming through correctly. That’s because we misspelled some of thestates. Handling this in Excel would be really challenging (on big data sets).

Fortunately with pandas we have the full power of the python ecosystemat our disposal. In thinking about how to solve this type of messy data problem,I thought about trying to do some fuzzy text matching to determine the correct value.

Fortunately someone else has done a lot of work in this are. The fuzzy wuzzy library has some pretty usefulfunctions for this type of situation. Make sure to get it and install it first.

The other piece of code we need is a state name to abbreviation mapping. Instead of trying totype it myself, a little googling found this code.

Get started by importing the appropriate fuzzywuzzy functions and define our state map dictionary.

Here are some example of how the fuzzy text matching function works.

Now that we know how this works, we create our function to take the state column and convert it to a validabbreviation. We use the 80 score_cutoff for this data. You can playwith it to see what number works for your data. You’ll notice that we either return a validabbreviation or an np.nan so that we have some valid values in the field.

Add the column in the location we want and fill it with NaN values

accountnamestreetcitystatepostal-codeabbrevJanFebMartotal
0 211829 Kerluke, Koepp and Hilpert 34456 Sean Highway New Jaycob Texas 28752NaN 10000 62000 35000 107000
1 320563 Walter-Trantow 1311 Alvis Tunnel Port Khadijah NorthCarolina 38365NaN 95000 45000 35000 175000
2 648336 Bashirian, Kunde and Price 62184 Schamberger Underpass Apt. 231 New Lilianland Iowa 76517NaN 91000 120000 35000 246000
3 109996 D’Amore, Gleichner and Bode 155 Fadel Crescent Apt. 144 Hyattburgh Maine 46021NaN 45000 120000 10000 175000
4 121213 Bauch-Goldner 7274 Marissa Common Shanahanchester California 49681NaN 162000 120000 35000 317000

We use apply to add the abbreviations into the approriate column.

accountnamestreetcitystatepostal-codeabbrevJanFebMartotal
11 231907 Hahn-Moore 18115 Olivine Throughway Norbertomouth NorthDakota 31415ND 150000 10000 162000 322000
12 242368 Frami, Anderson and Donnelly 182 Bertie Road East Davian Iowa 72686IA 162000 120000 35000 317000
13 268755 Walsh-Haley 2624 Beatty Parkways Goodwinmouth RhodeIsland 31919RI 55000 120000 35000 210000
14 273274 McDermott PLC 8917 Bergstrom Meadow Kathryneborough Delaware 27933DE 150000 120000 70000 340000
15 NaN NaN NaN NaN NaN NaN NaN 1462000 1507000 717000 3686000

I think this is pretty cool. We have developed a very simple process to intelligently clean upthis data. Obviously when you only have 15 or so rows, this is not a big deal. However, what if you had15,000? You would have to do something manual in Excel to clean this up.

Subtotals

For the final section of this article, let’s get some subtotals by state.

In Excel, we would use the subtotal tool to do this for us.

The output would look like this:

Creating a subtotal in pandas, is accomplished using groupby

JanFebMartotal
abbrev
AR 150000 120000 35000 305000
CA 162000 120000 35000 317000
DE 150000 120000 70000 340000
IA 253000 240000 70000 563000
ID 70000 120000 35000 225000
ME 45000 120000 10000 175000
MS 62000 120000 70000 252000
NC 95000 45000 35000 175000
ND 150000 10000 162000 322000
PA 70000 95000 35000 200000
RI 200000 215000 70000 485000
TN 45000 120000 55000 220000
TX 10000 62000 35000 107000

Next, we want to format the data as currency by using applymap to all the values in thedata frame.

JanFebMartotal
abbrev
AR $150,000 $120,000 $35,000 $305,000
CA $162,000 $120,000 $35,000 $317,000
DE $150,000 $120,000 $70,000 $340,000
IA $253,000 $240,000 $70,000 $563,000
ID $70,000 $120,000 $35,000 $225,000
ME $45,000 $120,000 $10,000 $175,000
MS $62,000 $120,000 $70,000 $252,000
NC $95,000 $45,000 $35,000 $175,000
ND $150,000 $10,000 $162,000 $322,000
PA $70,000 $95,000 $35,000 $200,000
RI $200,000 $215,000 $70,000 $485,000
TN $45,000 $120,000 $55,000 $220,000
TX $10,000 $62,000 $35,000 $107,000

The formatting looks good, now we can get the totals like we did earlier.

Convert the values to columns and format it.

JanFebMartotal
0 $1,462,000 $1,507,000 $717,000 $3,686,000

Finally, add the total value to the DataFrame.

Pandas Excel Sheet Name

JanFebMartotal
AR $150,000 $120,000 $35,000 $305,000
CA $162,000 $120,000 $35,000 $317,000
DE $150,000 $120,000 $70,000 $340,000
IA $253,000 $240,000 $70,000 $563,000
ID $70,000 $120,000 $35,000 $225,000
ME $45,000 $120,000 $10,000 $175,000
MS $62,000 $120,000 $70,000 $252,000
NC $95,000 $45,000 $35,000 $175,000
ND $150,000 $10,000 $162,000 $322,000
PA $70,000 $95,000 $35,000 $200,000
RI $200,000 $215,000 $70,000 $485,000
TN $45,000 $120,000 $55,000 $220,000
TX $10,000 $62,000 $35,000 $107,000
0 $1,462,000 $1,507,000 $717,000 $3,686,000

You’ll notice that the index is ‘0’ for the total line. We want to changethat using rename.

JanFebMartotal
AR $150,000 $120,000 $35,000 $305,000
CA $162,000 $120,000 $35,000 $317,000
DE $150,000 $120,000 $70,000 $340,000
IA $253,000 $240,000 $70,000 $563,000
ID $70,000 $120,000 $35,000 $225,000
ME $45,000 $120,000 $10,000 $175,000
MS $62,000 $120,000 $70,000 $252,000
NC $95,000 $45,000 $35,000 $175,000
ND $150,000 $10,000 $162,000 $322,000
PA $70,000 $95,000 $35,000 $200,000
RI $200,000 $215,000 $70,000 $485,000
TN $45,000 $120,000 $55,000 $220,000
TX $10,000 $62,000 $35,000 $107,000
Total $1,462,000 $1,507,000 $717,000 $3,686,000
This process is somewhat manual and does not work for doing multiple subtotals.The sidetable package makes this process much easier and is more robust.

Conclusion

Pandas Excel To Dataframe

By now, most people know that pandas can do a lot of complexmanipulations on data - similar to Excel. As I have been learning aboutpandas, I still find myself trying to remember how to do things that Iknow how to do in Excel but not in pandas. I realize that thiscomparison may not be exactly fair - they are different tools. However,I hope to reach people that know Excel and want to learn whatalternatives are out there for their data processing needs. I hopethese examples will help others feel confident that they can replace a lotof their crufty Excel data manipulations with pandas.

I found this exercise helpful to cement these ideas in my mind. I hope it worksfor you as well. If you have other Excel tasks that you would like to learn howto do in pandas, let me know via the comments below and I will try to help.

Updates

  • 25-June-2020 Add a link to sidetable for creating subtotals.

Comments

Pandas DataFrames and Series can be used as function arguments and return types for Excel worksheet functions using the decorator xl_func.

When used as an argument, the range specified in Excel will be converted into a Pandas DataFrame or Series as specified by the function signature.

When returning a DataFrame or Series, a range of data will be returned to Excel. PyXLL can automatically resize the range of the array formula to match the returned data by setting auto_resize=True in xl_func.

The following code shows a function that returns a random dataframe, including the index:

A function can also take a DataFrame or Series as one its arguments. When passing a DataFrame or Series to a function the whole data area must be selected in Excel and used as the argument to the function.

The following function takes a DataFrame including the column headers row, but not including the index column and returns the sum of a single column.:

See also Pandas DataFrame Formatting.

Pandas Excel Nan

Pandas Types Options ¶

The following options are available for the dataframe and series argument and return types:

  • dataframe, when used as an argument type

    dataframe<index=0,columns=1,dtype=None,dtypes=None,index_dtype=None>

    index

    Number of columns to use as the DataFrame’s index. Specifying more than one will result in a DataFrame where the index is a MultiIndex.

    columns

    Number of rows to use as the DataFrame’s columns. Specifying more than one will result in a DataFrame where the columns is a MultiIndex. If used in conjunction with index then any column headers on the index columns will be used to name the index.

    dtype

    Datatype for the values in the dataframe. May not be set with dtypes.

    dtypes

    Dictionary of column name -> datatype for the values in the dataframe. May not be set with dtype.

    index_dtype

    Datatype for the values in the dataframe’s index.

  • dataframe, when used as a return type

    dataframe<index=None,columns=True>

    index

    If True include the index when returning to Excel, if False don’t. If None, only include if the index is named.

    columns

    If True include the column headers, if False don’t.

  • series, when used as an argument type

    series<index=1,transpose=None,dtype=None,index_dtype=None>

    index

    Number of columns (or rows, depending on the orientation of the Series) to use as the Series index.

    transpose

    Set to True if the Series is arranged horizontally or False if vertically. By default the orientation will be guessed from the structure of the data.

    dtype

    Datatype for the values in the Series.

    index_dtype

    Datatype for the values in the Series’ index.

  • series, when used as a return type

    series<index=True,transpose=False>

    index

    If True include the index when returning to Excel, if False don’t.

    transpose

    Set to True if the Series should be arranged horizontally, or False if vertically.

Passing as Python objects instead of Excel arrays ¶

When passing large DataFrames between Python functions, it is not always necessary to return the full DataFrame to Excel and it can be expensive reconstructing the DataFrame from the Excel range each time. In those cases you can use the object return type to return a handle to the Python object. Functions taking the dataframe and series types can accept object handles.

The following returns a random DataFrame as a Python object, so will appear in Excel as a single cell with a handle to that object:

The result of a function like this can be passed to another function that expects a DataFrame:

This allows for large datasets to be used in Excel efficiently, especially where the data set would be cumbersome to deal with in Excel when unpacked.

Using the Pandas type converters outside of a UDF ¶

Sometimes it’s useful to be able to convert a range of data into a DataFrame, or a DataFrame into a range of data for Excel, in a context other than function decorated with xl_func. Or, you might have a function that takes the var type, which could be a DataFrame depending on other arguments.

Pandas Excel Multiple Sheets

Pandas excel append

In these cases the function get_type_converter can be used. For example:

Or the other way:

Pandas Excel

All the parameters for the dataframe and series types can be used to control how the conversion is performed.