Pandas Cheat Sheet
https://pypi.org/project/pandas/
Import
import pandas as pd
Series
Example inputs:
# list:
= [1, 7, 2]
a
# dictionary:
= {"day1": 420, "day2": 380, "day3": 390} kv
Simple series, no labels:
= pd.Series(a)
myseries1
print(myseries1)
0 1
1 7
2 2 dtype: int64
Series with labels:
= pd.Series(a, index=["x","y","z"])
myseries2
print(myseries2)
x 1
y 7
z 2 dtype: int64
Key-value as series:
= pd.Series(kv)
mykvseries
print(mykvseries)
day1 420
day2 380
day3 390 dtype: int64
Subset of key-value input:
= pd.Series(kv, index = ["day1","day2"])
mykvseries_filtered
print(mykvseries_filtered)
day1 420
day2 380 dtype: int64
Dataframes
Input:
= {
mydataset 'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
Load into a dataframe:
= pd.DataFrame(mydataset)
mydataframe
print(mydataframe)
cars passings
0 BMW 3
1 Volvo 7 2 Ford 2
Load from a File
CSV:
= pd.read_csv('data.csv') df
JSON:
= pd.read_json('data.json') df
Simple Analysis
First 10 rows:
print(df.head(10))
Last 5 rows (default):
print(df.tail())
Dataset info:
print(df.info())
The result tells us the following:
- Row count and column count
- The name of each column, with the data type
- How many non-null values there are present in each column
Clean Empty Cells
Drop empty cells, placing the results in a new dataframe:
= df.dropna() new_df
Drop empty cells, modifying the original dataframe:
= True) df.dropna(inplace
Replace empty cells with a default value (130 in this example):
# WARNING: This affects all columns!
130, inplace = True) df.fillna(
Replace with a default value in a specific column:
"Calories": 130}, inplace=True) df.fillna({
Replace using the mean:
# Mean is the average value (the sum of all values divided by number of values).
= df["Calories"].mean()
x
"Calories": x}, inplace=True) df.fillna({
Replace using the median:
# Median is the value in the middle, after you have sorted all values ascending.
= df["Calories"].median()
x
"Calories": x}, inplace=True) df.fillna({
Replace using the mode:
# Mode is the value that appears most frequently.
= df["Calories"].mode()[0]
x
"Calories": x}, inplace=True) df.fillna({
Clean Wrong Format
This example assumes that we have values that are not in a consistent format, but that can still be converted to a date:
'Date'] = pd.to_datetime(df['Date'], format='mixed') df[
But, there may be some that can’t be converted at all. They will end up with NaT (not a time) values. We can remove them with this:
=['Date'], inplace = True) df.dropna(subset
Clean Wrong Data
Sometimes, data is just wrong, e.g., typos.
For simple fixes, we can update the row directly:
# Assign a value of 45 to the Duration column in row 7:
7, 'Duration'] = 45 df.loc[
For large data sets, use rules-based updating:
# For each row with a Duration value larger than 120, assign a new value of 120:
for x in df.index:
if df.loc[x, "Duration"] > 120:
"Duration"] = 120 df.loc[x,
Remove bad rows altogether:
# For each row with a Duration value larger than 120, drop the row:
for x in df.index:
if df.loc[x, "Duration"] > 120:
= True) df.drop(x, inplace
Remove Duplicates
Find duplicates:
print(df.duplicated())
Remove them:
= True) df.drop_duplicates(inplace
Correlation
The corr()
method calculates the relationship between each column in a data set. The closer to 1 a correlation value is, the more closely related the columns are.
A positive correlation means values are likely to move together, e.g., if one goes up, the other probably will too. A negative correlation shows the opposite, e.g., if one goes up, the other is likely to go down.
df.corr()
Example output:
Duration | Pulse | Maxpulse | Calories | |
---|---|---|---|---|
Duration | 1.000000 | -0.155408 | 0.009403 | 0.922717 |
Pulse | -0.155408 | 1.000000 | 0.786535 | 0.025121 |
Maxpulse | 0.009403 | 0.786535 | 1.000000 | 0.203813 |
Calories | 0.922717 | 0.025121 | 0.203813 | 1.000000 |
Plotting
Import matplotlib
:
import matplotlib.pyplot as plt
Line plot (default):
df.plot()
plt.show()
Scatter plot:
# You can use .corr() to check for strong correlation and determine good
# argument candidates for a scatter plot.
df.corr()
= 'scatter', x = 'Duration', y = 'Calories')
df.plot(kind
plt.show()
Histogram:
"Duration"].plot(kind = 'hist') df[