My Data Science Cheatsheet

As a budding data scientist intern, I keep googling the same few questions. So I thought, why not compile all the answers here as a one-stop resource for my own FAQ?

I will keep adding to this as I go. Please feel free to use it too!

Configuring Jupyter Notebooks

Changing how many rows and columns you can see in a dataframe

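Useful when your df has more rows or columns than the notebook displays by default.

import pandas as pd

pd.set_option('display.max_rows', 500)     # max rows shown before truncating
pd.set_option('display.max_columns', 50)   # max columns shown before truncating
pd.set_option('display.max_colwidth', 25)  # max characters shown per cell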
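If you only want the bigger display for one cell instead of the whole notebook, pd.option_context applies the settings temporarily and puts the old values back afterwards. A minimal sketch; demo_df and the numbers 500 and 50 are made up for illustration:

```python
import pandas as pd

# hypothetical dataframe, only here so there is something to display
demo_df = pd.DataFrame({"a": range(100), "b": range(100)})

before = pd.get_option("display.max_rows")

# the settings only apply inside this with-block
with pd.option_context("display.max_rows", 500, "display.max_columns", 50):
    inside = pd.get_option("display.max_rows")
    shown = repr(demo_df)  # what the notebook would render as text

# back to whatever the setting was before the block
after = pd.get_option("display.max_rows")
print(before, inside, after)
```

This is handy when one huge df needs a full look but you don't want every later cell to print 500 rows.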

Interactive Plots

When a matplotlib plot has too much data and you want to zoom in, but you are working in a notebook:

  1. Restart your kernel
  2. Run this cell first
    import matplotlib
    matplotlib.use('TkAgg')  # set the backend before importing pyplot; plots open in a separate window
    
  3. Then run this cell
    import matplotlib.pyplot as plt
    print(matplotlib.get_backend())
    # test if it works
    plt.plot(range(10))
    plt.show()
    

Getting and Saving Data

  • Saving all your dfs into a data folder makes things neater!
  • Use index = False so that the unnamed index column is not saved into the file

CSV vs Excel

  • CSV loads a lot faster
  • Excel is better if you expect to hand-edit specific portions that would be too tedious to do in Python.

import pandas as pd

# saving dataframes (to_excel needs an Excel engine such as openpyxl installed)
your_df.to_excel('data/your_df.xlsx', index = False)
your_df.to_csv('data/your_df.csv', index = False)

# loading dfs
your_df = pd.read_excel('data/your_df.xlsx')
your_df = pd.read_csv('data/your_df.csv')
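To see what index = False actually changes, here is a small round trip through CSV. This is only a sketch: the toy your_df and the temporary folder are assumptions so the example doesn't touch a real data folder.

```python
import os
import tempfile

import pandas as pd

# toy dataframe, made up for illustration
your_df = pd.DataFrame({"name": ["a", "b"], "score": [1, 2]})

# create the data folder first; to_csv will not create missing folders
data_dir = os.path.join(tempfile.mkdtemp(), "data")
os.makedirs(data_dir, exist_ok=True)
csv_path = os.path.join(data_dir, "your_df.csv")

# default: the index is written as an extra column with no name
your_df.to_csv(csv_path)
with_index = pd.read_csv(csv_path)

# index=False: only the real columns are written
your_df.to_csv(csv_path, index=False)
without_index = pd.read_csv(csv_path)

print(list(with_index.columns))     # includes an 'Unnamed: 0' column
print(list(without_index.columns))  # just the original columns
```

In a real project you would replace the temporary folder with your own data folder; the os.makedirs call with exist_ok=True is safe to rerun.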

Accessing Data

Handling Null values

1. You can visualize the severity of missing values quickly with missingno.matrix(df).

import missingno

missingno.matrix(train)

2. Getting the count of NaN values in each column

train.isna().sum() # null count for each column
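Raw counts are hard to judge without knowing how big the df is, so it can help to put the percentage next to them. A rough sketch; the train dataframe below is made-up data, and null_count and null_pct are column names I picked for illustration:

```python
import numpy as np
import pandas as pd

# made-up stand-in for the real train dataframe
train = pd.DataFrame({
    "age": [22, np.nan, 35, np.nan],
    "city": ["SG", "KL", None, "BKK"],
    "id": [1, 2, 3, 4],
})

null_counts = train.isna().sum()      # number of missing values per column
null_pct = train.isna().mean() * 100  # share of missing values per column, in %

# one table, worst columns first
summary = pd.DataFrame({"null_count": null_counts, "null_pct": null_pct})
summary = summary.sort_values("null_count", ascending=False)
print(summary)
```

The mean() trick works because True counts as 1 and False as 0, so the mean of the mask is the fraction of missing values.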

3. Getting all rows with NaN for entire dataframe

  • recommended to use df.isna() instead of df.isnull(): the two are aliases and behave identically, but isna() is consistent with other functions like fillna() and dropna()
df[df.isna().any(axis = 1)]

4. Getting all rows with NaN for specific column

  • a single column is a Series, so isna() already gives one True/False per row and no .any(axis = 1) is needed
df[df['column name'].isna()]
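Here are the two row filters side by side on a tiny example, plus the opposite filter for fully complete rows. A minimal sketch; the df and its values are invented for illustration:

```python
import numpy as np
import pandas as pd

# invented example data
df = pd.DataFrame({
    "column name": [1.0, np.nan, 3.0],
    "other": ["x", "y", None],
})

# rows with a NaN in any column (the mask is a DataFrame, so reduce across columns)
any_nan_rows = df[df.isna().any(axis=1)]

# rows with a NaN in one specific column (the mask is already a Series)
col_nan_rows = df[df["column name"].isna()]

# the opposite: rows with no NaN at all
complete_rows = df[df.notna().all(axis=1)]

print(any_nan_rows.index.tolist())
print(col_nan_rows.index.tolist())
print(complete_rows.index.tolist())
```

The difference is the shape of the mask: a whole df gives a 2D mask that needs any(axis=1) to collapse it to one value per row, while a single column gives a 1D mask that can be used directly.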