Taming Pandas

Andrei Zhozhin

12 minute read

Why learn new tools

In the modern world there is a high demand for strong analytical skills: the amount of data is growing dramatically, and the number of different databases and data stores keeps increasing. Previously, most data lived in relational databases and SQL was the lingua franca for querying it. Today, not every data store supports SQL, and we often cannot put all our data into a single storage system, which means we need tools that can work with different data sources and provide at least the same level of flexibility as SQL.

If you are already proficient in SQL, this article will be useful for you, as it provides python/pandas equivalents for familiar SQL constructs. If you are not familiar with SQL, I recommend starting with SQLZoo to learn SQL through practice.

Meet Pandas

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. It provides very flexible interfaces for both software developers and analysts. I think pandas is a big part of why Python became so popular beyond the developer community.

In pandas it is not necessary to build a single statement that produces the final result: we can easily break a calculation into stages and build up the result gradually - this is one of the major benefits of using pandas. SQL does not support references to common blocks of code, so you have to repeat yourself. There is an excellent article, Against SQL, covering most of the issues with SQL. Even with all those issues, SQL is still the standard for databases, and even new products (which may not even be relational) often implement a subset of some SQL dialect.
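For example, here is a minimal sketch of breaking a calculation into named, reusable stages (using a hypothetical orders data frame that is not part of our data set):

# hypothetical orders frame with Region and Amount columns
large_orders = orders[orders.Amount > 100]                          # stage 1: filter
amount_by_region = large_orders.groupby('Region')['Amount'].sum()   # stage 2: aggregate
top_regions = amount_by_region.nlargest(5)                          # stage 3: take top 5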

The data set

We will look at a reduced version of the AdventureWorksLT database. In fact we will use only a couple of tables to show how SQL queries can be translated into python/pandas expressions producing the same result.

Schema of the database

To read data from the SQLite db into pandas data frames we need to execute the following lines:

import pandas as pd
import sqlite3

con = sqlite3.connect("AdventureWorksLT.db")
products = pd.read_sql_query("select * from Product", con)
product_categories = pd.read_sql_query("select * from ProductCategory", con)

Preparing data set

Pandas uses heuristics to detect column types, but sometimes, due to type incompatibility between the SQL database and Python, a clean mapping is not possible, so pandas falls back to the object dtype to support NULL values in typed columns (such as integer).

We have such a case in the Product table:

CREATE TABLE [Product](
...
    [Weight] INTEGER NULL,
...
)

We will convert NULL values of Weight to 0

products['Weight'] = pd.to_numeric(products['Weight']).fillna(0)

Data Extraction

As we are working with a big data set, we need to limit the number of rows returned. For that we will use LIMIT in SQL and .head() in python to reduce the output of our example queries.

Select

All queries start with select *:

SELECT *
FROM   Product

which is equivalent to referencing the data frame itself:

products

… Limit X

From this point on, all queries use explicit column names. Using select * in code is considered bad practice for multiple reasons.

SELECT ProductID,
       Name,
       ProductNumber
FROM   Product
LIMIT  5
ProductID Name ProductNumber
1 680 HL Road Frame - Black, 58 FR-R92B-58
2 706 HL Road Frame - Red, 58 FR-R92R-58
3 707 Sport-100 Helmet, Red HL-U509-R
4 708 Sport-100 Helmet, Black HL-U509
5 709 Mountain Bike Socks, M SO-B909-M
products[['ProductID', 'Name', 'ProductNumber']] \
    .head(5)
ProductID Name ProductNumber
0 680 HL Road Frame - Black, 58 FR-R92B-58
1 706 HL Road Frame - Red, 58 FR-R92R-58
2 707 Sport-100 Helmet, Red HL-U509-R
3 708 Sport-100 Helmet, Black HL-U509
4 709 Mountain Bike Socks, M SO-B909-M

Please note: the row index in the SQL output starts at 1 (it is a row number), while the index in the python output starts at 0 and is the position of the row in the data set.
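If you prefer SQL-like numbering, one option is to reset the index and shift it to start at 1 (a small sketch, not required for the rest of the article):

result = products[['ProductID', 'Name', 'ProductNumber']].head(5).reset_index(drop=True)
result.index = result.index + 1  # make the index start at 1, like a SQL row number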

… Where X>Y

SELECT Name,
       ProductNumber,
       StandardCost
FROM   Product
WHERE  StandardCost > 2000
LIMIT  5  
Name ProductNumber StandardCost
1 Road-150 Red, 62 BK-R93R-62 2171.2942
2 Road-150 Red, 44 BK-R93R-44 2171.2942
3 Road-150 Red, 48 BK-R93R-48 2171.2942
4 Road-150 Red, 52 BK-R93R-52 2171.2942
5 Road-150 Red, 56 BK-R93R-56 2171.2942
products[['Name', 'ProductNumber', 'StandardCost']] \
   [products.StandardCost > 2000] \
   .head(5)
Name ProductNumber StandardCost
44 Road-150 Red, 62 BK-R93R-62 2171.2942
45 Road-150 Red, 44 BK-R93R-44 2171.2942
46 Road-150 Red, 48 BK-R93R-48 2171.2942
47 Road-150 Red, 52 BK-R93R-52 2171.2942
48 Road-150 Red, 56 BK-R93R-56 2171.2942

Please note: the indexes of the python rows are different - filtering preserves the original row labels.

SELECT NAME,
       ProductNumber,
       StandardCost
FROM   Product
WHERE  StandardCost > 10
       AND StandardCost < 12
Name ProductNumber StandardCost
1 Cable Lock LO-C100 10.3125
2 Mountain Pump PU-M044 10.3084
3 ML Mountain Tire TI-M602 11.2163
4 Touring Tire TI-T723 10.8423
products[['Name', 'ProductNumber', 'StandardCost']] \
 [(products.StandardCost > 10) & (products.StandardCost < 12)]

Alternatively we can utilize .query() API

products[['Name', 'ProductNumber', 'StandardCost']] \
 .query('StandardCost > 10 & StandardCost < 12')
Name ProductNumber StandardCost
138 Cable Lock LO-C100 10.3125
140 Mountain Pump PU-M044 10.3084
224 ML Mountain Tire TI-M602 11.2163
229 Touring Tire TI-T723 10.8423

If you want the equivalent of SQL OR, you need to use | in the python expression (or the keyword or inside .query()).
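For example, a sketch of an OR filter over the same products frame:

# StandardCost below 1 OR above 2000
products[['Name', 'ProductNumber', 'StandardCost']] \
    [(products.StandardCost < 1) | (products.StandardCost > 2000)] \
    .head(5)

# the same condition with the .query() API
products[['Name', 'ProductNumber', 'StandardCost']] \
    .query('StandardCost < 1 or StandardCost > 2000') \
    .head(5)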

… Where X in (…)

SELECT Name,
       ProductNumber,
       Color,
       StandardCost
FROM   Product
WHERE  Color IN ( 'Red', 'Black', 'Silver' )
       AND StandardCost > 8
       AND StandardCost < 14
Name ProductNumber Color StandardCost
1 Sport-100 Helmet, Red HL-U509-R Red 13.0863
2 Sport-100 Helmet, Black HL-U509 Black 13.0863
3 Half-Finger Gloves, S GL-H102-S Black 9.1593
4 Half-Finger Gloves, M GL-H102-M Black 9.1593
5 Half-Finger Gloves, L GL-H102-L Black 9.1593
6 Chain CH-0234 Silver 8.9866
colors = ['Red', 'Black', 'Silver']
products[['Name', 'ProductNumber', 'Color', 'StandardCost']] \
    [(products.Color.isin(colors)) \
     & (products.StandardCost > 8) \
     & (products.StandardCost < 14)] 

Or with .query() API

colors = ['Red', 'Black', 'Silver']
products[['Name', 'ProductNumber', 'Color', 'StandardCost']] \
    .query('Color in @colors and StandardCost > 8 and StandardCost < 14')
Name ProductNumber Color StandardCost
2 Sport-100 Helmet, Red HL-U509-R Red 13.0863
3 Sport-100 Helmet, Black HL-U509 Black 13.0863
153 Half-Finger Gloves, S GL-H102-S Black 9.1593
154 Half-Finger Gloves, M GL-H102-M Black 9.1593
155 Half-Finger Gloves, L GL-H102-L Black 9.1593
247 Chain CH-0234 Silver 8.9866
SELECT Name,
       ProductNumber,
       Color,
       StandardCost
FROM   Product
WHERE  Color NOT IN ( 'Red', 'Black', 'Blue', 'White',
                      'Silver', 'Yellow' )
LIMIT  5  
Name ProductNumber Color StandardCost
1 AWC Logo Cap CA-1098 Multi 6.9223
2 Long-Sleeve Logo Jersey, S LJ-0192-S Multi 38.4923
3 Long-Sleeve Logo Jersey, M LJ-0192-M Multi 38.4923
4 Long-Sleeve Logo Jersey, L LJ-0192-L Multi 38.4923
5 Long-Sleeve Logo Jersey, XL LJ-0192-X Multi 38.4923
colors=['Red', 'Black', 'Blue', 'White', 'Silver', 'Yellow']
products[['Name', 'ProductNumber', 'Color', 'StandardCost']] \
    [(~products.Color.isin(colors))] \
    .head(5)

Or with .query() API

colors=['Red', 'Black', 'Blue', 'White', 'Silver', 'Yellow']
products[['Name', 'ProductNumber', 'Color', 'StandardCost']] \
    .query('Color not in @colors') \
    .head(5)
Name ProductNumber Color StandardCost
7 AWC Logo Cap CA-1098 Multi 6.9223
8 Long-Sleeve Logo Jersey, S LJ-0192-S Multi 38.4923
9 Long-Sleeve Logo Jersey, M LJ-0192-M Multi 38.4923
10 Long-Sleeve Logo Jersey, L LJ-0192-L Multi 38.4923
11 Long-Sleeve Logo Jersey, XL LJ-0192-X Multi 38.4923

Select distinct

SELECT DISTINCT Color
FROM   Product
LIMIT  5  
pd.DataFrame(products['Color'].unique(), columns=['Color']) \
    .head(5)

Please note products.Color.unique() is not a data frame but an array, so we need to convert it into a data frame explicitly.
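An alternative that keeps everything in data frame form is drop_duplicates():

products[['Color']].drop_duplicates().head(5)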

Order by

SELECT Name,
       ProductNumber,
       Color,
       Weight
FROM   Product
WHERE  Color = 'Yellow'
       AND StandardCost > 1000
ORDER  BY Weight DESC
LIMIT  5  
Name ProductNumber Color Weight
1 Touring-1000 Yellow, 60 BK-T79Y-60 Yellow 11747.98
2 Touring-1000 Yellow, 54 BK-T79Y-54 Yellow 11648.19
3 Touring-1000 Yellow, 50 BK-T79Y-50 Yellow 11530.26
4 Touring-1000 Yellow, 46 BK-T79Y-46 Yellow 11398.72
5 Road-350-W Yellow, 48 BK-R79Y-48 Yellow 7447.95
products[['Name', 'ProductNumber', 'Color', 'Weight']] \
    [(products.Color == 'Yellow') & (products.StandardCost > 1000)] \
    .sort_values('Weight', ascending = False) \
    .head(5)
Name ProductNumber Color Weight
252 Touring-1000 Yellow, 60 BK-T79Y-60 Yellow 11747.98
251 Touring-1000 Yellow, 54 BK-T79Y-54 Yellow 11648.19
250 Touring-1000 Yellow, 50 BK-T79Y-50 Yellow 11530.26
249 Touring-1000 Yellow, 46 BK-T79Y-46 Yellow 11398.72
271 Road-350-W Yellow, 48 BK-R79Y-48 Yellow 7447.95

Aggregates

Aggregate functions in pandas return one row per aggregate function, with the aggregated columns as columns, while SQL returns a single row with one column per aggregate. To get the same shape in the result we transpose the resulting data set with .T.

Count

SELECT Count(NAME)
FROM   Product  
count(Name)
1 295
products[['Name']] \
    .aggregate(['count']) \
    .T
count
Name 295

Max, Min, Avg, Sum

SELECT Min(StandardCost),
       Max(StandardCost),
       Avg(StandardCost)
FROM   Product  
min(StandardCost) max(StandardCost) avg(StandardCost)
1 0.8565 2171.2942 438.220347457627
products[['StandardCost']] \
    .aggregate(['min', 'max', 'average']) \
    .T
min max average
StandardCost 0.8565 2171.2942 438.220347
SELECT Count(NAME),
       Sum(StandardCost),
       Max(ListPrice)
FROM   Product  
count(Name) sum(StandardCost) max(ListPrice)
1 295 129275.0025 3578.27

Note: this returns a pandas Series (one value per aggregated column) rather than a single-row result as in SQL; one way to reshape it is shown after the output below.

products \
    .agg({'Name':'count', 'StandardCost':'sum', 'ListPrice':'max'})
Name 295.0000
StandardCost 129275.0025
ListPrice 3578.2700
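One way to get the single-row shape of the SQL result is to convert the Series back into a one-row data frame:

products \
    .agg({'Name': 'count', 'StandardCost': 'sum', 'ListPrice': 'max'}) \
    .to_frame().T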

Group by

SELECT Color,
       Count(Color)
FROM   Product
WHERE  Color <> ''
GROUP  BY Color
ORDER  BY Count(*) DESC
LIMIT  5  
Color count(Color)
1 Black 89
2 Red 38
3 Yellow 36
4 Silver 36
5 Blue 26

Note: count() in pandas returns the number of non-null values, so size() should be used instead to count rows (an alternative using value_counts() is shown after the output below).

products \
    [products.Color != ''] \
    .groupby(['Color']) \
    .size().reset_index(name='Count') \
    .sort_values('Count', ascending=False) \
    .head(5)
Color Count
0 Black 89
4 Red 38
5 Silver 36
8 Yellow 36
1 Blue 26
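If only the counts are needed, value_counts() is a more idiomatic shortcut (it already sorts in descending order):

products.loc[products.Color != '', 'Color'] \
    .value_counts() \
    .head(5)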

Having

SELECT color,
       Count(color)
FROM   Product
WHERE  color <> ''
GROUP  BY color
HAVING Count(color) > 30  
Color count(Color)
1 Black 89
2 Red 38
3 Silver 36
4 Yellow 36
products \
    [products.Color!=''] \
    .groupby(['Color']) \
    .filter(lambda g: len(g) > 30) \
    .groupby(['Color']) \
    .size().reset_index(name='Count')
Color Count
0 Black 89
1 Red 38
2 Silver 36
3 Yellow 36
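An alternative sketch that aggregates first and then filters the aggregated result, which is closer to how HAVING works in SQL:

products[products.Color != ''] \
    .groupby(['Color']) \
    .size().reset_index(name='Count') \
    .query('Count > 30')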

Top N records

SELECT Name,
       ProductNumber,
       ListPrice
FROM   Product
ORDER  BY ListPrice DESC
LIMIT  3  
Name ProductNumber ListPrice
1 Road-150 Red, 62 BK-R93R-62 3578.27
2 Road-150 Red, 44 BK-R93R-44 3578.27
3 Road-150 Red, 48 BK-R93R-48 3578.27
products[['Name', 'ProductNumber', 'ListPrice']] \
    .nlargest(3, 'ListPrice')
Name ProductNumber ListPrice
44 Road-150 Red, 62 BK-R93R-62 3578.27
45 Road-150 Red, 44 BK-R93R-44 3578.27
46 Road-150 Red, 48 BK-R93R-48 3578.27

Please note: This method is equivalent to df.sort_values(columns, ascending=False).head(n), but more performant.

Pagination

SELECT Name,
       ProductNumber,
       ListPrice
FROM   Product
ORDER  BY ListPrice DESC
LIMIT  3 offset 3  
products[['Name', 'ProductNumber', 'ListPrice']] \
    .nlargest(3+3, 'ListPrice').tail(3)
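An equivalent sketch using an explicit sort and positional slicing (skip 3 rows, take the next 3):

products[['Name', 'ProductNumber', 'ListPrice']] \
    .sort_values('ListPrice', ascending=False) \
    .iloc[3:6]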

Join

SELECT p.Name,
       p.Color,
       pc.Name
FROM   Product p
       JOIN ProductCategory pc
         ON p.ProductCategoryID = pc.ProductCategoryID
WHERE  p.Color <> ''
LIMIT  5  
products[products.Color != ''] \
    .merge(product_categories, on='ProductCategoryID', suffixes=['', '_Category']) \
    [['Name', 'Color', 'Name_Category']] \
    .head(5)

We have to use suffixes here as products and product_categories have columns with identical names, and we don’t want to use defaults ['_x', '_y']

Union

-- bike categories
SELECT ProductCategoryID,
       NAME
FROM   ProductCategory
WHERE  NAME LIKE '% Bikes'
UNION ALL
-- bike parts categories
SELECT ProductCategoryID,
       NAME
FROM   ProductCategory
WHERE  NAME LIKE 'Bike %'  
categories = product_categories
bike_categories = categories[categories.Name.str.endswith(' Bikes')]
bike_parts_categories = categories[categories.Name.str.startswith('Bike ')]

pd.concat([bike_categories, bike_parts_categories])[['ProductCategoryID', 'Name']]
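Note that pd.concat mirrors UNION ALL; to emulate plain UNION, which removes duplicate rows, drop duplicates after concatenation:

pd.concat([bike_categories, bike_parts_categories]) \
    .drop_duplicates() \
    [['ProductCategoryID', 'Name']]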

Rank

If we need to find the most expensive product in every category, we might use the rank(), dense_rank(), or row_number() SQL functions. Because there are several products with the same top price, we use row_number() over (partition by) to avoid duplicates (if the price is the same, only one of the tied products is selected).

SELECT CategoryName,
       ProductName,
       ListPrice
FROM   (
       SELECT pc.NAME AS CategoryName ,
              p.NAME  AS ProductName ,
              p.ListPrice ,
              Row_number() OVER 
                  (PARTITION BY pc.Name ORDER BY p.ListPrice DESC) AS rn
       FROM   Product p
       JOIN   ProductCategory pc
       ON     p.ProductCategoryID = pc.ProductCategoryID )
WHERE  rn=1
ORDER BY CategoryName 
LIMIT  5 
CategoryName ProductName ListPrice
1 Bib-Shorts Men’s Bib-Shorts, S 89.99
2 Bike Racks Hitch Rack - 4-Bike 120
3 Bike Stands All-Purpose Bike Stand 159
4 Bottles and Cages Mountain Bottle Cage 9.99
5 Bottom Brackets HL Bottom Bracket 121.49
products_with_categories = products \
    .merge(product_categories, on='ProductCategoryID', suffixes=['', '_Category'])

products_with_categories['rn'] = products_with_categories \
    .sort_values(['Name_Category', 'ListPrice'], ascending=[True, False]) \
    .groupby(['Name_Category']) \
    .cumcount() + 1  # cumcount starts from 0, add 1 to mimic SQL row_number()

products_with_categories[['Name_Category','Name', 'ListPrice']] \
    [products_with_categories.rn == 1] \
    .sort_values('Name_Category') \
    .head(5)
Name_Category Name ListPrice
197 Bib-Shorts Men’s Bib-Shorts, S 89.99
223 Bike Racks Hitch Rack - 4-Bike 120.00
226 Bike Stands All-Purpose Bike Stand 159.00
210 Bottles and Cages Mountain Bottle Cage 9.99
294 Bottom Brackets HL Bottom Bracket 121.49
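A more compact alternative sketch uses groupby().rank(), which mirrors row_number() directly (ties here are broken by the current row order rather than alphabetically):

products_with_categories['rn'] = products_with_categories \
    .groupby('Name_Category')['ListPrice'] \
    .rank(method='first', ascending=False) \
    .astype(int)

products_with_categories \
    .query('rn == 1') \
    [['Name_Category', 'Name', 'ListPrice']] \
    .sort_values('Name_Category') \
    .head(5)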

Iterate

Iterate over DataFrame rows as (index, Series) pairs. Please note: because iterrows returns a Series for each row, it does not preserve dtypes across the rows.

for index, row in df.iterrows():
    print(row)

Iterate over DataFrame rows as namedtuples.

for row in df.itertuples(index=False):
    print(row)

Please note: You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

Data manipulation

Insert

In SQL, tables may have autoincrement and default columns, so we can omit them during an insert:

INSERT INTO ProductCategory
            (NAME,
             rowguid)
VALUES     ('Category X',
            '12d8351e-9d34-4797-957a-99a3acdf4d31')  

But in python we don’t have them, so we need to emulate the same behavior if required. We also need to specify all column values during the insert.

# emulate autoincrement: take the current max id and add 1
max_pid = product_categories['ProductCategoryID'].max()
product_categories.loc[product_categories.index.size] = \
    [max_pid + 1, None, 'Category X', '12d8351e-9d34-4797-957a-99a3acdf4d31', None]

In SQL it is possible to insert data into a table based on a select from another table:

INSERT INTO Products
SELECT *
FROM   tmp_Products  

In python, we can concatenate one data frame with the other to get a similar effect (DataFrame.append is deprecated and was removed in pandas 2.0, so we use pd.concat):

products = pd.concat([products, tmp_products])

Update

UPDATE ProductCategory
SET    NAME = 'New Bikes'
WHERE  NAME = 'Bikes'  
product_categories\
    .loc[product_categories.Name == 'Bikes', 'Name'] = 'New Bikes'
UPDATE Product
SET    StandardCost = StandardCost * 1.10  
products['StandardCost'] = products['StandardCost'] * 1.10

Delete

DELETE FROM Product
WHERE  Color = 'Grey'  

The easiest way to delete rows is simply to keep the required subset, or to use drop(), which works for both columns and rows:

products = products[products.Color != 'Grey']

Or with drop(), passing the index labels of the matching rows:

products.drop(products[products.Color == 'Grey'].index, inplace=True)

Alter

In SQL, to compute a new column we first need to add it to the table:

ALTER TABLE Product
  ADD COLUMN [Profit] INTEGER;

UPDATE Product
SET    Profit = ListPrice - StandardCost;  

In python/pandas we just assign values to a new column, and the column is created and populated in one go:

products['Profit'] = products['ListPrice'] - products['StandardCost']

To drop a column we again have to alter the table:

ALTER TABLE Product
  DROP COLUMN [Profit]  

In python/pandas we can drop the column directly (drop returns a new data frame, so we assign it back):

products = products.drop('Profit', axis='columns')

Summary

We have reviewed multiple cases where SQL can be directly mapped to python/pandas. It might take some time to get familiar with a new library and its API, but it is definitely worth it. I hope this article helps you progress faster.
