My Data Science Cheatsheet
As a budding data scientist intern, there are always the same few questions that I keep googling for. So I thought why not compile all the answers here so I have a one-stop resource for my FAQ?
Will keep adding to this as I go on, please feel free to use this too!
Configuring Jupyter Notebooks
Changing how many rows and columns you can see in a dataframe
Used when your df has a lot of rows or columns that you want to see in the notebook
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 25)
Interactive Plots
Too much data in a matplotlib plot, you want to zoom in but you are using a notebook
- Restart your kernel
- run this cell first
import matplotlib matplotlib.use('TKAgg')
- then run this cell
import matplotlib.pyplot as plt print(matplotlib.get_backend()) ### test if it works plt.plot(range(10)) plt.show()
Getting and Saving Data
- Saving all your dfs into a
data
folder makes things neater! - use
index = False
: don’t save the unnamed index column into the excel file
CSV vs Excel
- csv can be loaded alot faster
- excel is better if you think you need to edit specific portions that might be too tedious to do on Python.
import pandas as pd
# saving dataframes
your_df.to_excel('data/your_df.xlsx', index = False)
your_df.to_csv('data/your_df.csv', index = False)
# loading dfs
your_df = pd.read_excel('data/your_df.xlsx')
your_df = pd.read_csv('data/your_df.csv')
Accessing Data
Handling Null values
1. You can visualize the severity of missing values quickly with missingno.matrix(df)
.
import missingno
missingno.matrix(train)
2. Getting the count of how many nan rows there are
train.isnull().sum() # get null counts of each column
3. Getting all rows with NaN for entire dataframe
- reccomended to use
df.isna()
insteaddf.isnull()
as isnull is going to be deprecated. It’s also consistent with other functions likefillna()
df[df.isnull().any(axis = 1)]
4. Getting all rows with NaN for specific column
- reccomended to use
df.isna()
insteaddf.isnull()
as isnull is going to be deprecated. It’s also consistent with other functions likefillna()
df[df['column name'].isna().any(axis = 1)]