Taming Pandas

Andrei Zhozhin

September 12, 2021 | 12 minutes

Technology

Why learn new tools

In the modern world there is high demand for strong analytical skills, amount of data is growing dramatically, number of different databases and data stores is also increasing. Previously all data was stored in relational databases and SQL language was lingua franca to query it. But now not all databases support SQL and we technically could not put all our data into single storage, that means we need to find new tools capable to work with different data sources and provide at least the same level of flexibility like SQL.

If you already proficient in SQL this article would be useful for you as it provides you with python/pandas equivalents for known SQL constructs. If you are not familiar with SQL I can recommend to start with SQLZoo to learn SQL through practice.

Meet Pandas

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. It provides very flexible interfaces for software developers and analysts. I think python was heavily popularized beyond developers community because of pandas.

In pandas it is not necessary to create single statement that would produce final result, we can easily break calculation into stages and produce result gradually - this is one of the major benefits of using pandas. SQL does not support references to common blocks of code thus you have to repeat yourself. Here is an excellent article Against SQL covering most of issues of SQL. Even though SQL has a lot of issues it is still a standard for databases and even new products (that could be not even relational) in some cases implement subset of SQL dialect.

The data set

We will look at a reduced version of the AdventureWorksLT database. Actually we would use only couple of tables to show how SQL queries could be translated to python/pandas expressions producing the same result.

To read data from SQLite db into pandas data frames we need to execute the following lines

import pandas as pd
import sqlite3

con = sqlite3.connect("AdventureWorksLT.db")
products = pd.read_sql_query("select * from Product", con)
product_categories = pd.read_sql_query("select * from ProductCategory", con)

Preparing data set

Pandas can use some heuristics to detect types of the columns, but sometimes due to type incompatibility between SQL database and Python it is not possible to map types, so Pandas can use the object type to support NULL values for typed columns (like integer)

We have such a case in the Products table

CREATE TABLE [Product](
...
    [Weight] INTEGER NULL,
...
)

We will convert NULL values of Weight to 0

products['Weight'] = pd.to_numeric(products['Weight']).fillna(0)

Data Extraction

As we are working with a big data set we need to limit the number of rows returned as a result. For that, there will be LIMIT used for SQL and .head() for python to reduce output of our example queries.

Select

All queries start with select *:

SELECT *
FROM   Product

is equal to data frame reference itself

products

… Limit X

Starting from this point all queries would be using explicit column names. It is considered bad practice to use select * in code due to multiple reasons.

SELECT ProductID,
       Name,
       ProductNumber
FROM   Product
LIMIT  5

	ProductID	Name	ProductNumber
1	680	HL Road Frame - Black, 58	FR-R92B-58
2	706	HL Road Frame - Red, 58	FR-R92R-58
3	707	Sport-100 Helmet, Red	HL-U509-R
4	708	Sport-100 Helmet, Black	HL-U509
5	709	Mountain Bike Socks, M	SO-B909-M

products[['ProductID', 'Name', 'ProductNumber']] \
    .head(5)

	ProductID	Name	ProductNumber
0	680	HL Road Frame - Black, 58	FR-R92B-58
1	706	HL Road Frame - Red, 58	FR-R92R-58
2	707	Sport-100 Helmet, Red	HL-U509-R
3	708	Sport-100 Helmet, Black	HL-U509
4	709	Mountain Bike Socks, M	SO-B909-M

Please note: index of returned rows in SQL starts with 1(one) and it is row_number, but indexes of rows in python starts with 0(zero) and this is an index in data set.

… Where X>Y

SELECT Name,
       ProductNumber,
       StandardCost
FROM   Product
WHERE  StandardCost > 2000
LIMIT  5

	Name	ProductNumber	StandardCost
1	Road-150 Red, 62	BK-R93R-62	2171.2942
2	Road-150 Red, 44	BK-R93R-44	2171.2942
3	Road-150 Red, 48	BK-R93R-48	2171.2942
4	Road-150 Red, 52	BK-R93R-52	2171.2942
5	Road-150 Red, 56	BK-R93R-56	2171.2942

products[['Name', 'ProductNumber', 'StandardCost']] \
   [products.StandardCost > 2000] \
   .head(5)

	Name	ProductNumber	StandardCost
44	Road-150 Red, 62	BK-R93R-62	2171.2942
45	Road-150 Red, 44	BK-R93R-44	2171.2942
46	Road-150 Red, 48	BK-R93R-48	2171.2942
47	Road-150 Red, 52	BK-R93R-52	2171.2942
48	Road-150 Red, 56	BK-R93R-56	2171.2942

Please note: indexes of python rows are different

SELECT NAME,
       ProductNumber,
       StandardCost
FROM   Product
WHERE  StandardCost > 10
       AND StandardCost < 12

	Name	ProductNumber	StandardCost
1	Cable Lock	LO-C100	10.3125
2	Mountain Pump	PU-M044	10.3084
3	ML Mountain Tire	TI-M602	11.2163
4	Touring Tire	TI-T723	10.8423

products[['Name', 'ProductNumber', 'StandardCost']] \
 [(products.StandardCost > 10) & (products.StandardCost < 12)]

Alternatively we can utilize .query() API

products[['Name', 'ProductNumber', 'StandardCost']] \
 .query('StandardCost > 10 & StandardCost < 12')

	Name	ProductNumber	StandardCost
138	Cable Lock	LO-C100	10.3125
140	Mountain Pump	PU-M044	10.3084
224	ML Mountain Tire	TI-M602	11.2163
229	Touring Tire	TI-T723	10.8423

If you want to use SQL OR to need to use | in python equivalent.

… Where X in (…)

SELECT *
FROM   Product
WHERE  Color IN ( 'Red', 'Black', 'Silver' )
       AND StandardCost > 8
       AND StandardCost < 14
LIMIT  5

	Name	ProductNumber	Color	StandardCost
1	Sport-100 Helmet, Red	HL-U509-R	Red	13.0863
2	Sport-100 Helmet, Black	HL-U509	Black	13.0863
3	Half-Finger Gloves, S	GL-H102-S	Black	9.1593
4	Half-Finger Gloves, M	GL-H102-M	Black	9.1593
5	Half-Finger Gloves, L	GL-H102-L	Black	9.1593
6	Chain	CH-0234	Silver	8.9866

colors = ['Red', 'Black', 'Silver']
products[['Name', 'ProductNumber', 'Color', 'StandardCost']] \
    [(products.Color.isin(colors)) \
     & (products.StandardCost > 8) \
     & (products.StandardCost < 14)]

Or with .query() API

colors = ['Red', 'Black', 'Silver']
products[['Name', 'ProductNumber', 'Color', 'StandardCost']] \
    .query('Color in @colors and StandardCost > 8 and StandardCost < 14')

	Name	ProductNumber	Color	StandardCost
2	Sport-100 Helmet, Red	HL-U509-R	Red	13.0863
3	Sport-100 Helmet, Black	HL-U509	Black	13.0863
153	Half-Finger Gloves, S	GL-H102-S	Black	9.1593
154	Half-Finger Gloves, M	GL-H102-M	Black	9.1593
155	Half-Finger Gloves, L	GL-H102-L	Black	9.1593
247	Chain	CH-0234	Silver	8.9866

SELECT Name,
       ProductNumber,
       Color,
       StandardCost
FROM   Product
WHERE  Color NOT IN ( 'Red', 'Black', 'Blue', 'White',
                      'Silver', 'Yellow' )
LIMIT  5

	Name	ProductNumber	Color	StandardCost
1	AWC Logo Cap	CA-1098	Multi	6.9223
2	Long-Sleeve Logo Jersey, S	LJ-0192-S	Multi	38.4923
3	Long-Sleeve Logo Jersey, M	LJ-0192-M	Multi	38.4923
4	Long-Sleeve Logo Jersey, L	LJ-0192-L	Multi	38.4923
5	Long-Sleeve Logo Jersey, XL	LJ-0192-X	Multi	38.4923

colors=['Red', 'Black', 'Blue', 'White', 'Silver', 'Yellow']
products[['Name', 'ProductNumber', 'Color', 'StandardCost']] \
    [(~products.Color.isin(colors))] \
    .head(5)

Or with .query() API

colors=['Red', 'Black', 'Blue', 'White', 'Silver', 'Yellow']
products[['Name', 'ProductNumber', 'Color', 'StandardCost']] \
    .query('Color not in @colors') \
    .head(5)

	Name	ProductNumber	Color	StandardCost
7	AWC Logo Cap	CA-1098	Multi	6.9223
8	Long-Sleeve Logo Jersey, S	LJ-0192-S	Multi	38.4923
9	Long-Sleeve Logo Jersey, M	LJ-0192-M	Multi	38.4923
10	Long-Sleeve Logo Jersey, L	LJ-0192-L	Multi	38.4923
11	Long-Sleeve Logo Jersey, XL	LJ-0192-X	Multi	38.4923

Select distinct

SELECT DISTINCT Color
FROM   Product
LIMIT  5

pd.DataFrame(products['Color'].unique(), columns=['Color']) \
    .head(5)

Please note products.Color.unique() is not a data frame but an array, so we need to convert it into a data frame explicitly.

Order by

SELECT Name,
       ProductNumber,
       Color,
       Weight
FROM   Product
WHERE  Color = 'Red'
ORDER  BY Weight
LIMIT  5

	Name	ProductNumber	Color	Weight
1	Touring-1000 Yellow, 60	BK-T79Y-60	Yellow	11747.98
2	Touring-1000 Yellow, 54	BK-T79Y-54	Yellow	11648.19
3	Touring-1000 Yellow, 50	BK-T79Y-50	Yellow	11530.26
4	Touring-1000 Yellow, 46	BK-T79Y-46	Yellow	11398.72
5	Road-350-W Yellow, 48	BK-R79Y-48	Yellow	7447.95

products[['Name', 'ProductNumber', 'Color', 'Weight']] \
    [(products.Color == 'Yellow') & (products.StandardCost > 1000)] \
    .sort_values('Weight', ascending = False) \
    .head(5)

	Name	ProductNumber	Color	Weight
252	Touring-1000 Yellow, 60	BK-T79Y-60	Yellow	11747.98
251	Touring-1000 Yellow, 54	BK-T79Y-54	Yellow	11648.19
250	Touring-1000 Yellow, 50	BK-T79Y-50	Yellow	11530.26
249	Touring-1000 Yellow, 46	BK-T79Y-46	Yellow	11398.72
271	Road-350-W Yellow, 48	BK-R79Y-48	Yellow	7447.95

Aggregates

Aggregate functions in pandas return results in a row rather than in column (like in SQL), so to get the same data shape in the result we need to transpose the resulting data set with .T function

Count

SELECT Count(NAME)
FROM   Product

	count(Name)
1	295

products[['Name']] \
    .aggregate(['count']) \
    .T

	count
Name	295

Max, Min, Avg, Sum

SELECT Min(StandardCost),
       Max(StandardCost),
       Avg(StandardCost)
FROM   Product

	min(StandardCost)	max(StandardCost)	avg(StandardCost)
1	0.8565	2171.2942	438.220347457627

products[['StandardCost']] \
    .aggregate(['min', 'max', 'average']) \
    .T

	min	max	average
StandardCost	0.8565	2171.2942	438.220347

SELECT Count(NAME),
       Sum(StandardCost),
       Max(ListPrice)
FROM   Product

	count(Name)	sum(StandardCost)	max(ListPrice)
1	295	129275.0025	3578.27

TODO: this return wrong shape of data

products \
    .agg({'Name':'count', 'StandardCost':'sum', 'ListPrice':'max'})


Name	295.0000
StandardCost	129275.0025
ListPrice	3578.2700

Group by

SELECT Color,
       Count(Color)
FROM   Product
WHERE  Color <> ''
GROUP  BY Color
ORDER  BY Count(*) DESC
LIMIT  5

	Color	count(Color)
1	Black	89
2	Red	38
3	Yellow	36
4	Silver	36
5	Blue	26

Note: count() in pandas return number of non-null values, thus size() should be used instead.

products \
    [products.Color != ''] \
    .groupby(['Color']) \
    .size().reset_index(name='Count') \
    .sort_values('Count', ascending=False) \
    .head(5)

	Color	Count
0	Black	89
4	Red	38
5	Silver	36
8	Yellow	36
1	Blue	26

Having

SELECT color,
       Count(color)
FROM   Product
WHERE  color <> ''
GROUP  BY color
HAVING Count(color) > 30

	Color	count(Color)
1	Black	89
2	Red	38
3	Silver	36
4	Yellow	36

products \
    [products.Color!=''] \
    .groupby(['Color']) \
    .filter(lambda g: len(g) > 30) \
    .groupby(['Color']) \
    .size().reset_index(name='Count')

	Color	Count
0	Black	89
1	Red	38
2	Silver	36
3	Yellow	36

Top N records

SELECT Name,
       ProductNumber,
       ListPrice
FROM   Product
ORDER  BY ListPrice DESC
LIMIT  3

	Name	ProductNumber	ListPrice
1	Road-150 Red, 62	BK-R93R-62	3578.27
2	Road-150 Red, 44	BK-R93R-44	3578.27
3	Road-150 Red, 48	BK-R93R-48	3578.27

products[['Name', 'ProductNumber', 'ListPrice']] \
    .nlargest(3, 'ListPrice')

	Name	ProductNumber	ListPrice
44	Road-150 Red, 62	BK-R93R-62	3578.27
45	Road-150 Red, 44	BK-R93R-44	3578.27
46	Road-150 Red, 48	BK-R93R-48	3578.27

Please note: This method is equivalent to df.sort_values(columns, ascending=False).head(n), but more performant.

Pagination

SELECT Name,
       ProductNumber,
       ListPrice
FROM   Product
ORDER  BY ListPrice DESC
LIMIT  3 offset 3

products[['Name', 'ProductNumber', 'ListPrice']] \
    .nlargest(3+3, 'ListPrice').tail(3)

Join

SELECT p.Name,
       p.Color,
       pc.Name
FROM   Product p
       JOIN ProductCategory pc
         ON p.ProductCategoryID = pc.ProductCategoryID
WHERE  p.Color <> ''
LIMIT  5

products \
    .merge(product_categories, on='ProductCategoryID', suffixes=['', '_Category']) \
    [products.Color!=''] \
    [['Name', 'Color', 'Name_Category']] \
 .head(5)

We have to use suffixes here as products and product_categories have columns with identical names, and we don’t want to use defaults ['_x', '_y']

Union

-- bike categories
SELECT ProductCategoryID,
       NAME
FROM   ProductCategory
WHERE  NAME LIKE '% Bikes'
UNION ALL
-- bike parts categories
SELECT ProductCategoryID,
       NAME
FROM   ProductCategory
WHERE  NAME LIKE 'Bike %'

categories = product_categories
bike_categories = categories[categories.Name.str.endswith(' Bikes')]
bike_parts_categories = categories[categories.Name.str.startswith('Bike ')]

pd.concat([bike_categories, bike_parts_categories])[['ProductCategoryID', 'Name']]

Rank

If we need to find the most expensive products in every category we might need to use rank(), dense_rank(), or row_number() SQL functions. Because we have several products with same top price, we would use row_number() over (partition by) to avoid duplicates (if price is the same, select first using alphabetical order).

SELECT CategoryName,
       ProductName,
       ListPrice
FROM   (
       SELECT pc.NAME AS CategoryName ,
              p.NAME  AS ProductName ,
              p.ListPrice ,
              Row_number() OVER 
                  (PARTITION BY pc.Name ORDER BY p.ListPrice DESC) AS rn
       FROM   Product p
       JOIN   ProductPategory pc
       ON     p.ProductCategoryID = pc.ProductCategoryID )
WHERE  rn=1
ORDER BY CategoryName 
LIMIT  5

	CategoryName	ProductName	ListPrice
1	Bib-Shorts	Men’s Bib-Shorts, S	89.99
2	Bike Racks	Hitch Rack - 4-Bike	120
3	Bike Stands	All-Purpose Bike Stand	159
4	Bottles and Cages	Mountain Bottle Cage	9.99
5	Bottom Brackets	HL Bottom Bracket	121.49

products_with_categories = products \
    .merge(product_categories, on='ProductCategoryID', suffixes=['', '_Category'])

products_with_categories['rn'] = products_with_categories \
    .sort_values(['Name_Category', 'ListPrice'], ascending=[True, False]) \ 
    .groupby(['Name_Category']) \
    .cumcount() + 1  # count starts from 0 in python, to simulate SQL we add 1

products_with_categories[['Name_Category','Name', 'ListPrice']] \
    [products_with_categories.rn == 1] \
    .sort_values('Name_Category') \
    .head(5)

	Name_Category	Name	ListPrice
197	Bib-Shorts	Men’s Bib-Shorts, S	89.99
223	Bike Racks	Hitch Rack - 4-Bike	120.00
226	Bike Stands	All-Purpose Bike Stand	159.00
210	Bottles and Cages	Mountain Bottle Cage	9.99
294	Bottom Brackets	HL Bottom Bracket	121.49

Iterate

Iterate over DataFrame rows as (index, Series) pairs. Please note: because iterrows returns a Series for each row, it does not preserve dtypes across the rows.

for i, v in df.iterrows():
    print(row)

Iterate over DataFrame rows as namedtuples.

for row in df.itertuples(index=False):
    print(row)

Please note: You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

Data manipulation

Insert

In SQL we have tables that might have autoincrement and default columns, thus we can omit these during insert

INSERT INTO ProductCategory
            (NAME,
             rowguid)
VALUES     ('Category X',
            '12d8351e-9d34-4797-957a-99a3acdf4d31')

But in python we don’t have them, so we need to emulate same behavior if required. We also need to specify all column values during insert.

max_pid = product_categories.agg({'ProductCategoryID':'max'})
product_categories.loc[product_categories.index.size] = \ 
    [max_pid, None, 'Category X', '12d8351e-9d34-4797-957a-99a3acdf4d31', None]

In SQL there is a possibility to insert data into a table based on select from the other table

INSERT INTO Products
SELECT *
FROM   tmp_Products

In python, we can just append one data frame to another to get a similar effect

products = products.append(tmp_products)

Update

UPDATE ProductCategory
SET    NAME = 'New Bikes'
WHERE  NAME = 'Bikes'

product_categories\
    .loc[product_categories.Name=='Bikes', 'Name'] = 'New Bikes'

UPDATE Product
SET    StandardCost = StandardCost * 1.10

products['StandardCost'] = products['StandardCost']*1.10

Delete

DELETE FROM Product
WHERE  Color = 'Grey'

The easiest way to delete is just to leave the required subset, or to use drop() which works for columns and rows

products = products[products.Color!='Gray']

Or with .query() API

products.drop(products.Color=='Gray', inplace=True)

Alter

In SQL to create new columns we need to add it first to the table

ALTER TABLE Product
  ADD COLUMN [Profit] INTEGER;

UPDATE Product
SET    Profit = ListPrice - StandardCost;

In python/pandas we just assign value to new column and it adds column with values in one go

products['Profit'] = products['ListPrice'] - products['StandardCost']

To drop column we have to again alter table

ALTER TABLE Product
  DROP COLUMN [Profit]

In python/pandas we can drop column

products.drop('Profit', axis='columns')

Summary

We have reviewed multiple cases where SQL could be directly mapped to python/pandas. It might take some time to get yourself familiar with new library and it’s API but it definitely worth it. I hope this article would help you to progress faster.

Taming Pandas

Why learn new tools

Meet Pandas

The data set

Schema of the database

Preparing data set

Data Extraction

Select

… Limit X

… Where X>Y

… Where X in (…)

Select distinct

Order by

Aggregates

Count

Max, Min, Avg, Sum

Group by

Having

Top N records

Join

Union

Rank

Iterate

Data manipulation

Insert

Update

Delete

Alter

Summary

Links

Related content