Taming Pandas
Andrei Zhozhin
| 12 minutes
Why learn new tools
In the modern world there is high demand for strong analytical skills, amount of data is growing dramatically, number of different databases and data stores is also increasing. Previously all data was stored in relational databases and SQL language was lingua franca to query it. But now not all databases support SQL and we technically could not put all our data into single storage, that means we need to find new tools capable to work with different data sources and provide at least the same level of flexibility like SQL.
If you already proficient in SQL this article would be useful for you as it provides you with python/pandas equivalents for known SQL constructs. If you are not familiar with SQL I can recommend to start with SQLZoo to learn SQL through practice.
Meet Pandas
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. It provides very flexible interfaces for software developers and analysts. I think python was heavily popularized beyond developers community because of pandas.
In pandas it is not necessary to create single statement that would produce final result, we can easily break calculation into stages and produce result gradually - this is one of the major benefits of using pandas. SQL does not support references to common blocks of code thus you have to repeat yourself. Here is an excellent article Against SQL covering most of issues of SQL. Even though SQL has a lot of issues it is still a standard for databases and even new products (that could be not even relational) in some cases implement subset of SQL dialect.
The data set
We will look at a reduced version of the AdventureWorksLT database. Actually we would use only couple of tables to show how SQL queries could be translated to python/pandas expressions producing the same result.
To read data from SQLite db into pandas data frames we need to execute the following lines
import pandas as pd
import sqlite3
con = sqlite3.connect("AdventureWorksLT.db")
products = pd.read_sql_query("select * from Product", con)
product_categories = pd.read_sql_query("select * from ProductCategory", con)
Preparing data set
Pandas can use some heuristics to detect types of the columns, but sometimes due to type incompatibility between SQL database and Python it is not possible to map types, so Pandas can use the object
type to support NULL
values for typed columns (like integer
)
We have such a case in the Products
table
CREATE TABLE [Product](
...
[Weight] INTEGER NULL,
...
)
We will convert NULL
values of Weight
to 0
products['Weight'] = pd.to_numeric(products['Weight']).fillna(0)
Data Extraction
As we are working with a big data set we need to limit the number of rows returned as a result. For that, there will be LIMIT
used for SQL and .head()
for python to reduce output of our example queries.
Select
All queries start with select *
:
SELECT *
FROM Product
is equal to data frame reference itself
products
… Limit X
Starting from this point all queries would be using explicit column names. It is considered bad practice to use select *
in code due to multiple reasons.
SELECT ProductID,
Name,
ProductNumber
FROM Product
LIMIT 5
ProductID | Name | ProductNumber | |
---|---|---|---|
1 | 680 | HL Road Frame - Black, 58 | FR-R92B-58 |
2 | 706 | HL Road Frame - Red, 58 | FR-R92R-58 |
3 | 707 | Sport-100 Helmet, Red | HL-U509-R |
4 | 708 | Sport-100 Helmet, Black | HL-U509 |
5 | 709 | Mountain Bike Socks, M | SO-B909-M |
products[['ProductID', 'Name', 'ProductNumber']] \
.head(5)
ProductID | Name | ProductNumber | |
---|---|---|---|
0 | 680 | HL Road Frame - Black, 58 | FR-R92B-58 |
1 | 706 | HL Road Frame - Red, 58 | FR-R92R-58 |
2 | 707 | Sport-100 Helmet, Red | HL-U509-R |
3 | 708 | Sport-100 Helmet, Black | HL-U509 |
4 | 709 | Mountain Bike Socks, M | SO-B909-M |
Please note: index of returned rows in SQL starts with 1
(one) and it is row_number, but indexes of rows in python starts with 0
(zero) and this is an index in data set.
… Where X>Y
SELECT Name,
ProductNumber,
StandardCost
FROM Product
WHERE StandardCost > 2000
LIMIT 5
Name | ProductNumber | StandardCost | |
---|---|---|---|
1 | Road-150 Red, 62 | BK-R93R-62 | 2171.2942 |
2 | Road-150 Red, 44 | BK-R93R-44 | 2171.2942 |
3 | Road-150 Red, 48 | BK-R93R-48 | 2171.2942 |
4 | Road-150 Red, 52 | BK-R93R-52 | 2171.2942 |
5 | Road-150 Red, 56 | BK-R93R-56 | 2171.2942 |
products[['Name', 'ProductNumber', 'StandardCost']] \
[products.StandardCost > 2000] \
.head(5)
Name | ProductNumber | StandardCost | |
---|---|---|---|
44 | Road-150 Red, 62 | BK-R93R-62 | 2171.2942 |
45 | Road-150 Red, 44 | BK-R93R-44 | 2171.2942 |
46 | Road-150 Red, 48 | BK-R93R-48 | 2171.2942 |
47 | Road-150 Red, 52 | BK-R93R-52 | 2171.2942 |
48 | Road-150 Red, 56 | BK-R93R-56 | 2171.2942 |
Please note: indexes of python rows are different
SELECT NAME,
ProductNumber,
StandardCost
FROM Product
WHERE StandardCost > 10
AND StandardCost < 12
Name | ProductNumber | StandardCost | |
---|---|---|---|
1 | Cable Lock | LO-C100 | 10.3125 |
2 | Mountain Pump | PU-M044 | 10.3084 |
3 | ML Mountain Tire | TI-M602 | 11.2163 |
4 | Touring Tire | TI-T723 | 10.8423 |
products[['Name', 'ProductNumber', 'StandardCost']] \
[(products.StandardCost > 10) & (products.StandardCost < 12)]
Alternatively we can utilize .query()
API
products[['Name', 'ProductNumber', 'StandardCost']] \
.query('StandardCost > 10 & StandardCost < 12')
Name | ProductNumber | StandardCost | |
---|---|---|---|
138 | Cable Lock | LO-C100 | 10.3125 |
140 | Mountain Pump | PU-M044 | 10.3084 |
224 | ML Mountain Tire | TI-M602 | 11.2163 |
229 | Touring Tire | TI-T723 | 10.8423 |
If you want to use SQL OR
to need to use |
in python equivalent.
… Where X in (…)
SELECT *
FROM Product
WHERE Color IN ( 'Red', 'Black', 'Silver' )
AND StandardCost > 8
AND StandardCost < 14
LIMIT 5
Name | ProductNumber | Color | StandardCost | |
---|---|---|---|---|
1 | Sport-100 Helmet, Red | HL-U509-R | Red | 13.0863 |
2 | Sport-100 Helmet, Black | HL-U509 | Black | 13.0863 |
3 | Half-Finger Gloves, S | GL-H102-S | Black | 9.1593 |
4 | Half-Finger Gloves, M | GL-H102-M | Black | 9.1593 |
5 | Half-Finger Gloves, L | GL-H102-L | Black | 9.1593 |
6 | Chain | CH-0234 | Silver | 8.9866 |
colors = ['Red', 'Black', 'Silver']
products[['Name', 'ProductNumber', 'Color', 'StandardCost']] \
[(products.Color.isin(colors)) \
& (products.StandardCost > 8) \
& (products.StandardCost < 14)]
Or with .query()
API
colors = ['Red', 'Black', 'Silver']
products[['Name', 'ProductNumber', 'Color', 'StandardCost']] \
.query('Color in @colors and StandardCost > 8 and StandardCost < 14')
Name | ProductNumber | Color | StandardCost | |
---|---|---|---|---|
2 | Sport-100 Helmet, Red | HL-U509-R | Red | 13.0863 |
3 | Sport-100 Helmet, Black | HL-U509 | Black | 13.0863 |
153 | Half-Finger Gloves, S | GL-H102-S | Black | 9.1593 |
154 | Half-Finger Gloves, M | GL-H102-M | Black | 9.1593 |
155 | Half-Finger Gloves, L | GL-H102-L | Black | 9.1593 |
247 | Chain | CH-0234 | Silver | 8.9866 |
SELECT Name,
ProductNumber,
Color,
StandardCost
FROM Product
WHERE Color NOT IN ( 'Red', 'Black', 'Blue', 'White',
'Silver', 'Yellow' )
LIMIT 5
Name | ProductNumber | Color | StandardCost | |
---|---|---|---|---|
1 | AWC Logo Cap | CA-1098 | Multi | 6.9223 |
2 | Long-Sleeve Logo Jersey, S | LJ-0192-S | Multi | 38.4923 |
3 | Long-Sleeve Logo Jersey, M | LJ-0192-M | Multi | 38.4923 |
4 | Long-Sleeve Logo Jersey, L | LJ-0192-L | Multi | 38.4923 |
5 | Long-Sleeve Logo Jersey, XL | LJ-0192-X | Multi | 38.4923 |
colors=['Red', 'Black', 'Blue', 'White', 'Silver', 'Yellow']
products[['Name', 'ProductNumber', 'Color', 'StandardCost']] \
[(~products.Color.isin(colors))] \
.head(5)
Or with .query()
API
colors=['Red', 'Black', 'Blue', 'White', 'Silver', 'Yellow']
products[['Name', 'ProductNumber', 'Color', 'StandardCost']] \
.query('Color not in @colors') \
.head(5)
Name | ProductNumber | Color | StandardCost | |
---|---|---|---|---|
7 | AWC Logo Cap | CA-1098 | Multi | 6.9223 |
8 | Long-Sleeve Logo Jersey, S | LJ-0192-S | Multi | 38.4923 |
9 | Long-Sleeve Logo Jersey, M | LJ-0192-M | Multi | 38.4923 |
10 | Long-Sleeve Logo Jersey, L | LJ-0192-L | Multi | 38.4923 |
11 | Long-Sleeve Logo Jersey, XL | LJ-0192-X | Multi | 38.4923 |
Select distinct
SELECT DISTINCT Color
FROM Product
LIMIT 5
pd.DataFrame(products['Color'].unique(), columns=['Color']) \
.head(5)
Please note products.Color.unique()
is not a data frame but an array, so we need to convert it into a data frame explicitly.
Order by
SELECT Name,
ProductNumber,
Color,
Weight
FROM Product
WHERE Color = 'Red'
ORDER BY Weight
LIMIT 5
Name | ProductNumber | Color | Weight | |
---|---|---|---|---|
1 | Touring-1000 Yellow, 60 | BK-T79Y-60 | Yellow | 11747.98 |
2 | Touring-1000 Yellow, 54 | BK-T79Y-54 | Yellow | 11648.19 |
3 | Touring-1000 Yellow, 50 | BK-T79Y-50 | Yellow | 11530.26 |
4 | Touring-1000 Yellow, 46 | BK-T79Y-46 | Yellow | 11398.72 |
5 | Road-350-W Yellow, 48 | BK-R79Y-48 | Yellow | 7447.95 |
products[['Name', 'ProductNumber', 'Color', 'Weight']] \
[(products.Color == 'Yellow') & (products.StandardCost > 1000)] \
.sort_values('Weight', ascending = False) \
.head(5)
Name | ProductNumber | Color | Weight | |
---|---|---|---|---|
252 | Touring-1000 Yellow, 60 | BK-T79Y-60 | Yellow | 11747.98 |
251 | Touring-1000 Yellow, 54 | BK-T79Y-54 | Yellow | 11648.19 |
250 | Touring-1000 Yellow, 50 | BK-T79Y-50 | Yellow | 11530.26 |
249 | Touring-1000 Yellow, 46 | BK-T79Y-46 | Yellow | 11398.72 |
271 | Road-350-W Yellow, 48 | BK-R79Y-48 | Yellow | 7447.95 |
Aggregates
Aggregate functions in pandas return results in a row rather than in column (like in SQL), so to get the same data shape in the result we need to transpose the resulting data set with .T
function
Count
SELECT Count(NAME)
FROM Product
count(Name) | |
---|---|
1 | 295 |
products[['Name']] \
.aggregate(['count']) \
.T
count | |
---|---|
Name | 295 |
Max, Min, Avg, Sum
SELECT Min(StandardCost),
Max(StandardCost),
Avg(StandardCost)
FROM Product
min(StandardCost) | max(StandardCost) | avg(StandardCost) | |
---|---|---|---|
1 | 0.8565 | 2171.2942 | 438.220347457627 |
products[['StandardCost']] \
.aggregate(['min', 'max', 'average']) \
.T
min | max | average | |
---|---|---|---|
StandardCost | 0.8565 | 2171.2942 | 438.220347 |
SELECT Count(NAME),
Sum(StandardCost),
Max(ListPrice)
FROM Product
count(Name) | sum(StandardCost) | max(ListPrice) | |
---|---|---|---|
1 | 295 | 129275.0025 | 3578.27 |
TODO: this return wrong shape of data
products \
.agg({'Name':'count', 'StandardCost':'sum', 'ListPrice':'max'})
Name | 295.0000 |
StandardCost | 129275.0025 |
ListPrice | 3578.2700 |
Group by
SELECT Color,
Count(Color)
FROM Product
WHERE Color <> ''
GROUP BY Color
ORDER BY Count(*) DESC
LIMIT 5
Color | count(Color) | |
---|---|---|
1 | Black | 89 |
2 | Red | 38 |
3 | Yellow | 36 |
4 | Silver | 36 |
5 | Blue | 26 |
Note: count()
in pandas return number of non-null values, thus size()
should be used instead.
products \
[products.Color != ''] \
.groupby(['Color']) \
.size().reset_index(name='Count') \
.sort_values('Count', ascending=False) \
.head(5)
Color | Count | |
---|---|---|
0 | Black | 89 |
4 | Red | 38 |
5 | Silver | 36 |
8 | Yellow | 36 |
1 | Blue | 26 |
Having
SELECT color,
Count(color)
FROM Product
WHERE color <> ''
GROUP BY color
HAVING Count(color) > 30
Color | count(Color) | |
---|---|---|
1 | Black | 89 |
2 | Red | 38 |
3 | Silver | 36 |
4 | Yellow | 36 |
products \
[products.Color!=''] \
.groupby(['Color']) \
.filter(lambda g: len(g) > 30) \
.groupby(['Color']) \
.size().reset_index(name='Count')
Color | Count | |
---|---|---|
0 | Black | 89 |
1 | Red | 38 |
2 | Silver | 36 |
3 | Yellow | 36 |
Top N records
SELECT Name,
ProductNumber,
ListPrice
FROM Product
ORDER BY ListPrice DESC
LIMIT 3
Name | ProductNumber | ListPrice | |
---|---|---|---|
1 | Road-150 Red, 62 | BK-R93R-62 | 3578.27 |
2 | Road-150 Red, 44 | BK-R93R-44 | 3578.27 |
3 | Road-150 Red, 48 | BK-R93R-48 | 3578.27 |
products[['Name', 'ProductNumber', 'ListPrice']] \
.nlargest(3, 'ListPrice')
Name | ProductNumber | ListPrice | |
---|---|---|---|
44 | Road-150 Red, 62 | BK-R93R-62 | 3578.27 |
45 | Road-150 Red, 44 | BK-R93R-44 | 3578.27 |
46 | Road-150 Red, 48 | BK-R93R-48 | 3578.27 |
Please note: This method is equivalent to df.sort_values(columns, ascending=False).head(n)
, but more performant.
Pagination
SELECT Name,
ProductNumber,
ListPrice
FROM Product
ORDER BY ListPrice DESC
LIMIT 3 offset 3
products[['Name', 'ProductNumber', 'ListPrice']] \
.nlargest(3+3, 'ListPrice').tail(3)
Join
SELECT p.Name,
p.Color,
pc.Name
FROM Product p
JOIN ProductCategory pc
ON p.ProductCategoryID = pc.ProductCategoryID
WHERE p.Color <> ''
LIMIT 5
products \
.merge(product_categories, on='ProductCategoryID', suffixes=['', '_Category']) \
[products.Color!=''] \
[['Name', 'Color', 'Name_Category']] \
.head(5)
We have to use suffixes
here as products
and product_categories
have columns with identical names, and we don’t want to use defaults ['_x', '_y']
Union
-- bike categories
SELECT ProductCategoryID,
NAME
FROM ProductCategory
WHERE NAME LIKE '% Bikes'
UNION ALL
-- bike parts categories
SELECT ProductCategoryID,
NAME
FROM ProductCategory
WHERE NAME LIKE 'Bike %'
categories = product_categories
bike_categories = categories[categories.Name.str.endswith(' Bikes')]
bike_parts_categories = categories[categories.Name.str.startswith('Bike ')]
pd.concat([bike_categories, bike_parts_categories])[['ProductCategoryID', 'Name']]
Rank
If we need to find the most expensive products in every category we might need to use rank()
, dense_rank()
, or row_number()
SQL functions. Because we have several products with same top price, we would use row_number() over (partition by)
to avoid duplicates (if price is the same, select first using alphabetical order).
SELECT CategoryName,
ProductName,
ListPrice
FROM (
SELECT pc.NAME AS CategoryName ,
p.NAME AS ProductName ,
p.ListPrice ,
Row_number() OVER
(PARTITION BY pc.Name ORDER BY p.ListPrice DESC) AS rn
FROM Product p
JOIN ProductPategory pc
ON p.ProductCategoryID = pc.ProductCategoryID )
WHERE rn=1
ORDER BY CategoryName
LIMIT 5
CategoryName | ProductName | ListPrice | |
---|---|---|---|
1 | Bib-Shorts | Men’s Bib-Shorts, S | 89.99 |
2 | Bike Racks | Hitch Rack - 4-Bike | 120 |
3 | Bike Stands | All-Purpose Bike Stand | 159 |
4 | Bottles and Cages | Mountain Bottle Cage | 9.99 |
5 | Bottom Brackets | HL Bottom Bracket | 121.49 |
products_with_categories = products \
.merge(product_categories, on='ProductCategoryID', suffixes=['', '_Category'])
products_with_categories['rn'] = products_with_categories \
.sort_values(['Name_Category', 'ListPrice'], ascending=[True, False]) \
.groupby(['Name_Category']) \
.cumcount() + 1 # count starts from 0 in python, to simulate SQL we add 1
products_with_categories[['Name_Category','Name', 'ListPrice']] \
[products_with_categories.rn == 1] \
.sort_values('Name_Category') \
.head(5)
Name_Category | Name | ListPrice | |
---|---|---|---|
197 | Bib-Shorts | Men’s Bib-Shorts, S | 89.99 |
223 | Bike Racks | Hitch Rack - 4-Bike | 120.00 |
226 | Bike Stands | All-Purpose Bike Stand | 159.00 |
210 | Bottles and Cages | Mountain Bottle Cage | 9.99 |
294 | Bottom Brackets | HL Bottom Bracket | 121.49 |
Iterate
Iterate over DataFrame rows as (index, Series) pairs. Please note: because iterrows
returns a Series for each row, it does not preserve dtypes across the rows.
for i, v in df.iterrows():
print(row)
Iterate over DataFrame rows as namedtuples
.
for row in df.itertuples(index=False):
print(row)
Please note: You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
Data manipulation
Insert
In SQL we have tables that might have autoincrement
and default
columns, thus we can omit these during insert
INSERT INTO ProductCategory
(NAME,
rowguid)
VALUES ('Category X',
'12d8351e-9d34-4797-957a-99a3acdf4d31')
But in python we don’t have them, so we need to emulate same behavior if required. We also need to specify all column values during insert.
max_pid = product_categories.agg({'ProductCategoryID':'max'})
product_categories.loc[product_categories.index.size] = \
[max_pid, None, 'Category X', '12d8351e-9d34-4797-957a-99a3acdf4d31', None]
In SQL there is a possibility to insert data into a table based on select from the other table
INSERT INTO Products
SELECT *
FROM tmp_Products
In python, we can just append one data frame to another to get a similar effect
products = products.append(tmp_products)
Update
UPDATE ProductCategory
SET NAME = 'New Bikes'
WHERE NAME = 'Bikes'
product_categories\
.loc[product_categories.Name=='Bikes', 'Name'] = 'New Bikes'
UPDATE Product
SET StandardCost = StandardCost * 1.10
products['StandardCost'] = products['StandardCost']*1.10
Delete
DELETE FROM Product
WHERE Color = 'Grey'
The easiest way to delete is just to leave the required subset, or to use drop()
which works for columns and rows
products = products[products.Color!='Gray']
Or with .query()
API
products.drop(products.Color=='Gray', inplace=True)
Alter
In SQL to create new columns we need to add it first to the table
ALTER TABLE Product
ADD COLUMN [Profit] INTEGER;
UPDATE Product
SET Profit = ListPrice - StandardCost;
In python/pandas we just assign value to new column and it adds column with values in one go
products['Profit'] = products['ListPrice'] - products['StandardCost']
To drop column we have to again alter table
ALTER TABLE Product
DROP COLUMN [Profit]
In python/pandas we can drop column
products.drop('Profit', axis='columns')
Summary
We have reviewed multiple cases where SQL could be directly mapped to python/pandas. It might take some time to get yourself familiar with new library and it’s API but it definitely worth it. I hope this article would help you to progress faster.
Links
- https://sqlzoo.net/wiki/SQL_Tutorial
- https://sqlzoo.net/wiki/AdventureWorks
- https://github.com/nuitsjp/AdventureWorks-for-SQLite
- https://pandas.pydata.org/docs/reference/index.html
- https://devarea.com/pandas-for-sql-users/
- https://medium.com/jbennetcodes/how-to-rewrite-your-sql-queries-in-pandas-and-more-149d341fc53e
- https://scattered-thoughts.net/writing/against-sql/