
Python Pandas 2

#python

Thanks to info from: talkpython.fm, the Excel to Python course, and Udemy courses.

General Code Snippets

JUPYTER

  • df. and press TAB will display available attributes/methods
  • ?df will open the help
  • df.head and SHIFT + TAB will show the signature
  • SHIFT + double TAB will show additional info
  • SHIFT + triple TAB shows the full signature and docstring
  • %%timeit runs the cell several times to give an estimated execution time
  • %ls runs the ls command on the current folder right from Jupyter

CONDA

DETAIL                      COMMAND
LIST ALL ENVIRONMENTS       conda info --envs
CREATE A NEW ENV            conda create -n myenv python=3
INSTALL A PACKAGE           conda install pandas
INSTALL WITH PIP            python -m pip install <package>
ACTIVATE AN ENV             conda activate myenv
INSTALL GIT IN ANACONDA     conda install -c anaconda git

Pandas Notes

General Pandas Notes

  • df.sample(3) ==> get a sample of 3 random rows from your dataset
  • df.info()
  • df.shape
  • df.describe()
  • df.corr() ==> display the correlation between columns
  • NOTE you can apply changes to the dataframe by passing 'inplace=True' to most functions
  • AXIS by default axis=0, which refers to rows; to drop a column, for example, pass axis=1 as a parameter
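A minimal sketch of the two notes above (toy dataframe, hypothetical column names), showing axis=1 versus the default, and that inplace=True mutates the dataframe instead of returning a copy:

```python
import pandas as pd

# Toy dataframe for illustration
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# axis=1 targets columns: returns a new frame without column "c"
dropped = df.drop("c", axis=1)
print(list(dropped.columns))   # ['a', 'b']
print(list(df.columns))        # ['a', 'b', 'c']  (df unchanged)

# inplace=True mutates df itself and returns None
df.drop("c", axis=1, inplace=True)
print(list(df.columns))        # ['a', 'b']
```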

Columns

DETAIL                                      COMMAND
COUNT NON-NULL VALUES                       df['column name'].count()
COUNT DISTINCT VALUES                       df['column name'].nunique()
SUM                                         df['column name'].sum()
MEAN                                        df['column name'].mean()
FREQUENCY TABLE                             df['column name'].value_counts()
NEW COLUMN FROM AN EXISTING ONE             df['new col'] = df['existing col'] * 0.15
NEW COLUMN WITH A CONSTANT VALUE            df['new col'] = 'XX'
VIEW INFO AND DATA TYPES                    df.info()
GET INFO ON THE DATA FRAME                  df.describe()
GET INFO ON THE TEXT COLUMNS                df.describe(include='object')
INDEX - SET A COLUMN AS THE INDEX           df.set_index('col_name')
INDEX - ADD A NEW INDEX                     df.reset_index()
SORT THE INDEX (USEFUL FOR PERFORMANCE)     df.sort_index(ascending=False)
SORT BY ANOTHER COLUMN                      df.sort_values('col_name')
SORT BY 2 OR MORE COLS (LIST!)              df.sort_values(['col_1', 'col_2'])
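A short worked example of a few commands from the table (toy data; the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "banana", "apple", "kiwi"],
    "qty":   [3, 1, 2, 5],
})

# Frequency table: counts per distinct value, most frequent first
counts = df["fruit"].value_counts()
print(counts["apple"])             # 2

# Sort by two columns (note the list)
ordered = df.sort_values(["fruit", "qty"])
print(ordered.iloc[0]["qty"])      # 2  (the smaller of the two apple rows)

# set_index returns a new frame indexed by the chosen column
indexed = df.set_index("fruit")
print(indexed.loc["kiwi", "qty"])  # 5
```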

Sum up multiple columns (e.g. to filter)

(StackOverflow link at the bottom)

cols_to_sum = df.columns[:df.shape[1]]
# NOTE. df.shape[1] GIVES THE NUMBER OF COLUMNS
df['Fruit Total'] = df[cols_to_sum].sum(axis=1)

Another option:

df['Fruit Total'] = df.iloc[:, -4:-1].sum(axis=1)  # -4:-1 excludes the last column (Kiwis)
print(df)
  Apples  Bananas  Grapes  Kiwis  Fruit Total
0     2.0      3.0     NaN    1.0          5.0
1     1.0      3.0     7.0    NaN         11.0
2     NaN      NaN     2.0    3.0          2.0

Data

DETAIL COMMAND
DETAIL                                      COMMAND
GET DATA BY ROW/COLUMN INDEX                df.loc[rows, columns]
GET DATA FROM ROW 0 AND SHOW ALL COLUMNS    df.loc[0, :]
GET DATA ROWS 1, 2, 3                       df.loc[[1, 2, 3]]
CHOOSE COLUMNS BY INDEX NUMBER              df.iloc[:, 1:5]
SEE UNIQUE VALUES IN A COL                  df.col_name.unique()
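A quick sketch of loc (label-based) versus iloc (position-based) selection, using a toy dataframe with made-up column names:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["ann", "bob", "cat", "dan"], "score": [10, 20, 30, 40]}
)

# loc selects by label; with the default RangeIndex the labels are 0..3
row0 = df.loc[0, :]
print(row0["name"])                 # ann

# A list of labels selects several rows
subset = df.loc[[1, 2, 3]]
print(len(subset))                  # 3

# iloc selects by position: all rows, columns 1 up to (not incl.) 2
scores = df.iloc[:, 1:2]
print(list(scores.columns))         # ['score']

# Distinct values in a column
print(sorted(df["name"].unique()))  # ['ann', 'bob', 'cat', 'dan']
```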

Data Types

DTYPE             WHEN TO USE     CONVERSION
object            text            .astype('str')
datetime64        date & time     pd.to_datetime()
int64 / float64   number          pd.to_numeric()

  • Get a field's data type with df['col_name'].dtype
  • others: Category, Boolean, String
  • Categorical data type
  • convert to categorical type in one of two ways:
    • df['col_name'].astype('category')
    • pd.Categorical(df['col_name'])
  • Numerical string conversion: to convert a number stored as text (e.g. '31.0') to an int, go through float first, since astype('int') fails directly on '31.0': df['col_name'] = df.col_name.astype('str').astype('float').astype('int')
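The conversions above in a runnable sketch (toy data, hypothetical column names):

```python
import pandas as pd

df = pd.DataFrame({
    "when":  ["2024-01-05", "2024-02-10"],
    "size":  ["31.0", "7.5"],          # numbers stored as text
    "group": ["a", "b"],
})

# Text -> datetime64, which unlocks the .dt accessor
df["when"] = pd.to_datetime(df["when"])
print(df["when"].dt.month.tolist())   # [1, 2]

# str -> float -> int: astype('int') straight from '31.0' would raise
df["size"] = df["size"].astype("float").astype("int")
print(df["size"].tolist())            # [31, 7]

# One of the two ways to get a categorical dtype
df["group"] = df["group"].astype("category")
print(df["group"].dtype)              # category
```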

Filtering data: use Boolean indexing

  • create filter(s):
    • filter_purchase_date = df['purchase_date'].dt.month == 11
    • filter_product_type = df['product'] == 'book'
  • apply filters to the dataframe:
    • df[filter_purchase_date & filter_product_type]
  • display only a column or columns:
    • df["column_name"]
    • df.column_name
    • df[["col_1", "col_2"]]
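The purchase_date/product filters above, run end to end on toy data (the column names follow the bullets; the rows are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "purchase_date": pd.to_datetime(
        ["2023-11-02", "2023-11-15", "2023-12-01"]
    ),
    "product": ["book", "pen", "book"],
})

# Each filter is a boolean Series aligned with df's index
filter_purchase_date = df["purchase_date"].dt.month == 11
filter_product_type = df["product"] == "book"

# Combining with & keeps only rows where both are True
result = df[filter_purchase_date & filter_product_type]
print(len(result))                # 1
print(result.iloc[0]["product"])  # book
```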

Filtering on Rows (masks)

  • TWO ways to filter, df.loc and df.query:
    • df.loc[ df['quantity'] > 12 ]
    • df.query('quantity > 12')
  • filter on rows in a col (similar to AutoFilter in Excel):
    • df[df.col_name == "criterion"]
    • NOTE that df.col_name == "criterion" is a boolean filter returning a Series of True/False
    • (df.col_name == "criterion").sum() to get the number of matching records
  • filter on multiple conditions:
  • AND :
    filter_1 = (df.price < 100) & (df.minimum_nights < 3)
    filter_1.sum() # RETURNS THE NUMBER OF RECORDS THAT MATCH OUR FILTER
    df[filter_1] # APPLY THE FILTER ON THE DATAFRAME
  • OR :
    filter_2 = (df.reviews_per_month > 3) | (df.number_of_reviews > 50)
  • Numpy OR :
    filter_3 = np.logical_or((df.reviews_per_month > 3), (df.number_of_reviews > 50))
  • INVERT a filter : ~filter_1 (the tilde in front inverts the filter, i.e. logical NOT)
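The AND / OR / invert masks above on toy data (rows invented; column names match the bullets):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [80, 150, 90],
    "minimum_nights": [2, 1, 5],
    "reviews_per_month": [4.0, 1.0, 2.0],
    "number_of_reviews": [10, 60, 5],
})

# AND: both conditions must hold
filter_1 = (df.price < 100) & (df.minimum_nights < 3)
print(filter_1.sum())              # 1  (only the first row matches)

# OR: either condition suffices; np.logical_or is equivalent
filter_2 = (df.reviews_per_month > 3) | (df.number_of_reviews > 50)
filter_3 = np.logical_or(df.reviews_per_month > 3, df.number_of_reviews > 50)
print(filter_2.equals(filter_3))   # True

# ~ inverts a mask (logical NOT)
print((~filter_1).sum())           # 2
```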

PIVOT TABLES

 pd.pivot_table(
      df,
      index=['col_name_rows'],
      columns=['col_name_cols'],
      values=['col_name_measure'],
      aggfunc='sum',
      margins=True, #TOTALS COLUMN
      fill_value=0 #FILLER FOR ERRORS/EMPTIES
 )
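The template above filled in with toy sales data (region/product/sales are made-up names), showing what margins and fill_value do:

```python
import pandas as pd

df = pd.DataFrame({
    "region":  ["north", "north", "south", "south"],
    "product": ["book", "pen", "book", "book"],
    "sales":   [10, 5, 7, 3],
})

pivot = pd.pivot_table(
    df,
    index="region",
    columns="product",
    values="sales",
    aggfunc="sum",
    margins=True,     # adds an 'All' row/column of totals
    fill_value=0,     # south has no pens, so that cell becomes 0
)
print(pivot.loc["south", "book"])  # 10
print(pivot.loc["south", "pen"])   # 0
print(pivot.loc["All", "All"])     # 25
```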

ADD / REMOVE DATA

  • remove a row or column :
    • df.drop('col_name', axis=1)
    • df.drop(columns="col_name")
    • df.drop(index_num) # THIS DELETES A ROW
  • add rows:
    • df.append({'col_1': 'value', 'col_2': 'value'}, ignore_index=True)
  • append two dataframes:
    • df.append(df_other, ignore_index=True)
  • NOTE: DataFrame.append was deprecated and removed in pandas 2.0; use pd.concat([df, df_other], ignore_index=True) instead
  • NOTE on ignore_index=True : this ignores the index of the second dataframe and keeps incrementing the index of the existing dataframe
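A sketch of the pandas 2 way to do both operations with pd.concat (toy dataframes, made-up column name):

```python
import pandas as pd

df = pd.DataFrame({"col": ["a", "b"]})
df_other = pd.DataFrame({"col": ["c", "d"]})

# pd.concat replaces the removed DataFrame.append (pandas >= 2.0);
# ignore_index=True renumbers the result 0..n-1
combined = pd.concat([df, df_other], ignore_index=True)
print(combined["col"].tolist())   # ['a', 'b', 'c', 'd']
print(combined.index.tolist())    # [0, 1, 2, 3]

# Adding a single row: wrap the dict in a one-row DataFrame first
new_row = pd.DataFrame([{"col": "e"}])
combined = pd.concat([combined, new_row], ignore_index=True)
print(len(combined))              # 5
```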

TIME SERIES

CODE                                                  NOTE
pd.to_datetime(df['col_name'])                        convert to date type
pd.to_datetime(df['col_name'], format="%m/%d/%Y")     specify that the date is in 12/31/1900 format
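The format argument in action on toy data (hypothetical column name):

```python
import pandas as pd

df = pd.DataFrame({"when": ["12/31/1900", "01/15/2024"]})

# format tells pandas the strings are month/day/year
df["when"] = pd.to_datetime(df["when"], format="%m/%d/%Y")
print(df["when"].dt.year.tolist())   # [1900, 2024]
```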

General Notes

 .
 ├── 1-Data_Prep.ipynb  # Data prep notebook
 ├── 2-EDA.ipynb        # Final analysis notebook
 ├── data               # Categorized data files
 │   ├── external       # External data files  
 │   ├── interim        # Working folder
 │   ├── processed      # Cleaned and ready to use
 │   └── raw            # Unmodified originals
 └── reports            # Final reports

Udemy Data Manipulation in Python : A Pandas Crash Course