# Pandas Excel

Python Pandas is a Python data analysis library. It can read, filter and re-arrange small and large data sets and output them in a range of formats including Excel. Pandas writes Excel files using the Xlwt module for xls files and the Openpyxl or XlsxWriter modules for xlsx files. Using XlsxWriter with Pandas. We import the pandas module, including ExcelFile. The method readexcel reads the data into a Pandas Data Frame, where the first parameter is the filename and the second parameter is the sheet. The list of columns will be called df.columns. Import pandas as pd.

- Pandas Excel Sheet Name
- Pandas Excel To Dataframe
- Pandas Excel Nan
- Pandas Excel Multiple Sheets
- Pandas Excel

## Introduction

The purpose of this article is to show some common Excel tasks and howyou would execute similar tasks in pandas. Some of the examples are somewhat trivialbut I think it is important to show the simple as well as the morecomplex functions you can find elsewhere. As an added bonus, I’m going to do some fuzzy stringmatching to show a little twist to the process and show how pandas canutilize the full python system of modules to do something simply inpython that would be complex in Excel.

Make sense? Let’s get started.

## Adding a Sum to a Row

The first task I’ll cover is summing some columns to add a total column.

We will start by importing our excel data into a pandas dataframe.

account | name | street | city | state | postal-code | Jan | Feb | Mar | |
---|---|---|---|---|---|---|---|---|---|

0 | 211829 | Kerluke, Koepp and Hilpert | 34456 Sean Highway | New Jaycob | Texas | 28752 | 10000 | 62000 | 35000 |

1 | 320563 | Walter-Trantow | 1311 Alvis Tunnel | Port Khadijah | NorthCarolina | 38365 | 95000 | 45000 | 35000 |

2 | 648336 | Bashirian, Kunde and Price | 62184 Schamberger Underpass Apt. 231 | New Lilianland | Iowa | 76517 | 91000 | 120000 | 35000 |

3 | 109996 | D’Amore, Gleichner and Bode | 155 Fadel Crescent Apt. 144 | Hyattburgh | Maine | 46021 | 45000 | 120000 | 10000 |

4 | 121213 | Bauch-Goldner | 7274 Marissa Common | Shanahanchester | California | 49681 | 162000 | 120000 | 35000 |

We want to add a total column to show total sales for Jan, Feb and Mar.

This is straightforward in Excel and in pandas. For Excel, I have added the formula `sum(G2:I2)`

in column J.Here is what it looks like in Excel:

Next, here is how we do it in pandas:

account | name | street | city | state | postal-code | Jan | Feb | Mar | total | |
---|---|---|---|---|---|---|---|---|---|---|

0 | 211829 | Kerluke, Koepp and Hilpert | 34456 Sean Highway | New Jaycob | Texas | 28752 | 10000 | 62000 | 35000 | 107000 |

1 | 320563 | Walter-Trantow | 1311 Alvis Tunnel | Port Khadijah | NorthCarolina | 38365 | 95000 | 45000 | 35000 | 175000 |

2 | 648336 | Bashirian, Kunde and Price | 62184 Schamberger Underpass Apt. 231 | New Lilianland | Iowa | 76517 | 91000 | 120000 | 35000 | 246000 |

3 | 109996 | D’Amore, Gleichner and Bode | 155 Fadel Crescent Apt. 144 | Hyattburgh | Maine | 46021 | 45000 | 120000 | 10000 | 175000 |

4 | 121213 | Bauch-Goldner | 7274 Marissa Common | Shanahanchester | California | 49681 | 162000 | 120000 | 35000 | 317000 |

Next, let’s get some totals and other values for each month. Here is whatwe are trying to do as shown in Excel:

As you can see, we added a `SUM(G2:G16)`

in row 17 in each of the columns to gettotals by month.

Performing column level analysis is easy in pandas. Here are a couple of examples.

Now, we want to add a total by month and grand total. This is wherepandas and Excel diverge a little. It is very simple to add totals incells in Excel for each month. Because pandas need to maintain theintegrity of the entire DataFrame, there are a couple more steps.

First, create a sum for the month and total columns.

This is fairly intuitive however, if you want to add totals as a row,you need to do some minor manipulations.

We need to transpose the data and convert the Series to a DataFrame sothat it is easier to concat onto our existing data. The `T`

functionallows us to switch the data from being row-based to column-based.

The final thing we need to do before adding the totals back is to addthe missing columns. We use `reindex`

to do this for us. The trick is toadd all of our columns and then allow pandas to fill in the values thatare missing.

account | name | street | city | state | postal-code | Jan | Feb | Mar | total | |
---|---|---|---|---|---|---|---|---|---|---|

0 | NaN | NaN | NaN | NaN | NaN | NaN | 1462000 | 1507000 | 717000 | 3686000 |

Now that we have a nicely formatted DataFrame, we can add it to ourexisting one using `append`

.

account | name | street | city | state | postal-code | Jan | Feb | Mar | total | |
---|---|---|---|---|---|---|---|---|---|---|

11 | 231907 | Hahn-Moore | 18115 Olivine Throughway | Norbertomouth | NorthDakota | 31415 | 150000 | 10000 | 162000 | 322000 |

12 | 242368 | Frami, Anderson and Donnelly | 182 Bertie Road | East Davian | Iowa | 72686 | 162000 | 120000 | 35000 | 317000 |

13 | 268755 | Walsh-Haley | 2624 Beatty Parkways | Goodwinmouth | RhodeIsland | 31919 | 55000 | 120000 | 35000 | 210000 |

14 | 273274 | McDermott PLC | 8917 Bergstrom Meadow | Kathryneborough | Delaware | 27933 | 150000 | 120000 | 70000 | 340000 |

15 | NaN | NaN | NaN | NaN | NaN | NaN | 1462000 | 1507000 | 717000 | 3686000 |

## Additional Data Transforms

For another example, let’s try to add a state abbreviation to the data set.

From an Excel perspective the easiest way is probably to add a newcolumn, do a vlookup on the state name and fill in the abbreviation.

I did this and here is a snapshot of what the results looks like:

You’ll notice that after performing the vlookup, there are some values that are notcoming through correctly. That’s because we misspelled some of thestates. Handling this in Excel would be really challenging (on big data sets).

Fortunately with pandas we have the full power of the python ecosystemat our disposal. In thinking about how to solve this type of messy data problem,I thought about trying to do some fuzzy text matching to determine the correct value.

Fortunately someone else has done a lot of work in this are. The fuzzy wuzzy library has some pretty usefulfunctions for this type of situation. Make sure to get it and install it first.

The other piece of code we need is a state name to abbreviation mapping. Instead of trying totype it myself, a little googling found this code.

Get started by importing the appropriate fuzzywuzzy functions and define our state map dictionary.

Here are some example of how the fuzzy text matching function works.

Now that we know how this works, we create our function to take the state column and convert it to a validabbreviation. We use the 80 score_cutoff for this data. You can playwith it to see what number works for your data. You’ll notice that we either return a validabbreviation or an `np.nan`

so that we have some valid values in the field.

Add the column in the location we want and fill it with NaN values

account | name | street | city | state | postal-code | abbrev | Jan | Feb | Mar | total | |
---|---|---|---|---|---|---|---|---|---|---|---|

0 | 211829 | Kerluke, Koepp and Hilpert | 34456 Sean Highway | New Jaycob | Texas | 28752 | NaN | 10000 | 62000 | 35000 | 107000 |

1 | 320563 | Walter-Trantow | 1311 Alvis Tunnel | Port Khadijah | NorthCarolina | 38365 | NaN | 95000 | 45000 | 35000 | 175000 |

2 | 648336 | Bashirian, Kunde and Price | 62184 Schamberger Underpass Apt. 231 | New Lilianland | Iowa | 76517 | NaN | 91000 | 120000 | 35000 | 246000 |

3 | 109996 | D’Amore, Gleichner and Bode | 155 Fadel Crescent Apt. 144 | Hyattburgh | Maine | 46021 | NaN | 45000 | 120000 | 10000 | 175000 |

4 | 121213 | Bauch-Goldner | 7274 Marissa Common | Shanahanchester | California | 49681 | NaN | 162000 | 120000 | 35000 | 317000 |

We use `apply`

to add the abbreviations into the approriate column.

account | name | street | city | state | postal-code | abbrev | Jan | Feb | Mar | total | |
---|---|---|---|---|---|---|---|---|---|---|---|

11 | 231907 | Hahn-Moore | 18115 Olivine Throughway | Norbertomouth | NorthDakota | 31415 | ND | 150000 | 10000 | 162000 | 322000 |

12 | 242368 | Frami, Anderson and Donnelly | 182 Bertie Road | East Davian | Iowa | 72686 | IA | 162000 | 120000 | 35000 | 317000 |

13 | 268755 | Walsh-Haley | 2624 Beatty Parkways | Goodwinmouth | RhodeIsland | 31919 | RI | 55000 | 120000 | 35000 | 210000 |

14 | 273274 | McDermott PLC | 8917 Bergstrom Meadow | Kathryneborough | Delaware | 27933 | DE | 150000 | 120000 | 70000 | 340000 |

15 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1462000 | 1507000 | 717000 | 3686000 |

I think this is pretty cool. We have developed a very simple process to intelligently clean upthis data. Obviously when you only have 15 or so rows, this is not a big deal. However, what if you had15,000? You would have to do something manual in Excel to clean this up.

## Subtotals

For the final section of this article, let’s get some subtotals by state.

In Excel, we would use the `subtotal`

tool to do this for us.

The output would look like this:

Creating a subtotal in pandas, is accomplished using `groupby`

Jan | Feb | Mar | total | |
---|---|---|---|---|

abbrev | ||||

AR | 150000 | 120000 | 35000 | 305000 |

CA | 162000 | 120000 | 35000 | 317000 |

DE | 150000 | 120000 | 70000 | 340000 |

IA | 253000 | 240000 | 70000 | 563000 |

ID | 70000 | 120000 | 35000 | 225000 |

ME | 45000 | 120000 | 10000 | 175000 |

MS | 62000 | 120000 | 70000 | 252000 |

NC | 95000 | 45000 | 35000 | 175000 |

ND | 150000 | 10000 | 162000 | 322000 |

PA | 70000 | 95000 | 35000 | 200000 |

RI | 200000 | 215000 | 70000 | 485000 |

TN | 45000 | 120000 | 55000 | 220000 |

TX | 10000 | 62000 | 35000 | 107000 |

Next, we want to format the data as currency by using `applymap`

to all the values in thedata frame.

Jan | Feb | Mar | total | |
---|---|---|---|---|

abbrev | ||||

AR | $150,000 | $120,000 | $35,000 | $305,000 |

CA | $162,000 | $120,000 | $35,000 | $317,000 |

DE | $150,000 | $120,000 | $70,000 | $340,000 |

IA | $253,000 | $240,000 | $70,000 | $563,000 |

ID | $70,000 | $120,000 | $35,000 | $225,000 |

ME | $45,000 | $120,000 | $10,000 | $175,000 |

MS | $62,000 | $120,000 | $70,000 | $252,000 |

NC | $95,000 | $45,000 | $35,000 | $175,000 |

ND | $150,000 | $10,000 | $162,000 | $322,000 |

PA | $70,000 | $95,000 | $35,000 | $200,000 |

RI | $200,000 | $215,000 | $70,000 | $485,000 |

TN | $45,000 | $120,000 | $55,000 | $220,000 |

TX | $10,000 | $62,000 | $35,000 | $107,000 |

The formatting looks good, now we can get the totals like we did earlier.

Convert the values to columns and format it.

Jan | Feb | Mar | total | |
---|---|---|---|---|

0 | $1,462,000 | $1,507,000 | $717,000 | $3,686,000 |

Finally, add the total value to the DataFrame.

### Pandas Excel Sheet Name

Jan | Feb | Mar | total | |
---|---|---|---|---|

AR | $150,000 | $120,000 | $35,000 | $305,000 |

CA | $162,000 | $120,000 | $35,000 | $317,000 |

DE | $150,000 | $120,000 | $70,000 | $340,000 |

IA | $253,000 | $240,000 | $70,000 | $563,000 |

ID | $70,000 | $120,000 | $35,000 | $225,000 |

ME | $45,000 | $120,000 | $10,000 | $175,000 |

MS | $62,000 | $120,000 | $70,000 | $252,000 |

NC | $95,000 | $45,000 | $35,000 | $175,000 |

ND | $150,000 | $10,000 | $162,000 | $322,000 |

PA | $70,000 | $95,000 | $35,000 | $200,000 |

RI | $200,000 | $215,000 | $70,000 | $485,000 |

TN | $45,000 | $120,000 | $55,000 | $220,000 |

TX | $10,000 | $62,000 | $35,000 | $107,000 |

0 | $1,462,000 | $1,507,000 | $717,000 | $3,686,000 |

You’ll notice that the index is ‘0’ for the total line. We want to changethat using `rename`

.

Jan | Feb | Mar | total | |
---|---|---|---|---|

AR | $150,000 | $120,000 | $35,000 | $305,000 |

CA | $162,000 | $120,000 | $35,000 | $317,000 |

DE | $150,000 | $120,000 | $70,000 | $340,000 |

IA | $253,000 | $240,000 | $70,000 | $563,000 |

ID | $70,000 | $120,000 | $35,000 | $225,000 |

ME | $45,000 | $120,000 | $10,000 | $175,000 |

MS | $62,000 | $120,000 | $70,000 | $252,000 |

NC | $95,000 | $45,000 | $35,000 | $175,000 |

ND | $150,000 | $10,000 | $162,000 | $322,000 |

PA | $70,000 | $95,000 | $35,000 | $200,000 |

RI | $200,000 | $215,000 | $70,000 | $485,000 |

TN | $45,000 | $120,000 | $55,000 | $220,000 |

TX | $10,000 | $62,000 | $35,000 | $107,000 |

Total | $1,462,000 | $1,507,000 | $717,000 | $3,686,000 |

## Conclusion

### Pandas Excel To Dataframe

By now, most people know that pandas can do a lot of complexmanipulations on data - similar to Excel. As I have been learning aboutpandas, I still find myself trying to remember how to do things that Iknow how to do in Excel but not in pandas. I realize that thiscomparison may not be exactly fair - they are different tools. However,I hope to reach people that know Excel and want to learn whatalternatives are out there for their data processing needs. I hopethese examples will help others feel confident that they can replace a lotof their crufty Excel data manipulations with pandas.

I found this exercise helpful to cement these ideas in my mind. I hope it worksfor you as well. If you have other Excel tasks that you would like to learn howto do in pandas, let me know via the comments below and I will try to help.

## Updates

**25-June-2020**Add a link to sidetable for creating subtotals.

## Comments

Pandas DataFrames and Series can be used as function arguments and return types for Excel worksheet functions using the decorator `xl_func`

.

When used as an argument, the range specified in Excel will be converted into a Pandas DataFrame or Series as specified by the function signature.

When returning a DataFrame or Series, a range of data will be returned to Excel. PyXLL can automatically resize the range of the array formula to match the returned data by setting `auto_resize=True`

in `xl_func`

.

The following code shows a function that returns a random dataframe, including the index:

A function can also take a DataFrame or Series as one its arguments. When passing a DataFrame or Series to a function the whole data area must be selected in Excel and used as the argument to the function.

The following function takes a DataFrame including the column headers row, but not including the index column and returns the sum of a single column.:

See also Pandas DataFrame Formatting.

### Pandas Excel Nan

## Pandas Types Options ¶

The following options are available for the `dataframe`

and `series`

argument and return types:

**dataframe**, when used as an argument type`dataframe<index=0,columns=1,dtype=None,dtypes=None,index_dtype=None>`

`index`

Number of columns to use as the DataFrame’s index. Specifying more than one will result in a DataFrame where the index is a MultiIndex.

`columns`

Number of rows to use as the DataFrame’s columns. Specifying more than one will result in a DataFrame where the columns is a MultiIndex. If used in conjunction with index then any column headers on the index columns will be used to name the index.

`dtype`

Datatype for the values in the dataframe. May not be set with dtypes.

`dtypes`

Dictionary of column name -> datatype for the values in the dataframe. May not be set with dtype.

`index_dtype`

Datatype for the values in the dataframe’s index.

**dataframe**, when used as a return type`dataframe<index=None,columns=True>`

`index`

If True include the index when returning to Excel, if False don’t. If None, only include if the index is named.

`columns`

If True include the column headers, if False don’t.

**series**, when used as an argument type`series<index=1,transpose=None,dtype=None,index_dtype=None>`

`index`

Number of columns (or rows, depending on the orientation of the Series) to use as the Series index.

`transpose`

Set to True if the Series is arranged horizontally or False if vertically. By default the orientation will be guessed from the structure of the data.

`dtype`

Datatype for the values in the Series.

`index_dtype`

Datatype for the values in the Series’ index.

**series**, when used as a return type`series<index=True,transpose=False>`

`index`

If True include the index when returning to Excel, if False don’t.

`transpose`

Set to True if the Series should be arranged horizontally, or False if vertically.

## Passing as Python objects instead of Excel arrays ¶

When passing large DataFrames between Python functions, it is not always necessary to return the full DataFrame to Excel and it can be expensive reconstructing the DataFrame from the Excel range each time. In those cases you can use the `object`

return type to return a handle to the Python object. Functions taking the `dataframe`

and `series`

types can accept object handles.

The following returns a random DataFrame as a Python object, so will appear in Excel as a single cell with a handle to that object:

The result of a function like this can be passed to another function that expects a DataFrame:

This allows for large datasets to be used in Excel efficiently, especially where the data set would be cumbersome to deal with in Excel when unpacked.

## Using the Pandas type converters outside of a UDF ¶

Sometimes it’s useful to be able to convert a range of data into a DataFrame, or a DataFrame into a range of data for Excel, in a context other than function decorated with `xl_func`

. Or, you might have a function that takes the `var`

type, which could be a DataFrame depending on other arguments.

### Pandas Excel Multiple Sheets

In these cases the function `get_type_converter`

can be used. For example:

Or the other way:

### Pandas Excel

All the parameters for the `dataframe`

and `series`

types can be used to control how the conversion is performed.