
Python Pandas 2

#python

Thanks to info from: talkpython.fm, the Excel to Python course, and Udemy courses.

General Code Snippets

JUPYTER

  • df. and press TAB will display available attributes/methods
  • ?df will open the help
  • df.head and SHIFT + TAB will show the signature
  • SHIFT + double TAB will show additional info
  • SHIFT + triple TAB shows the full signature and docstring
  • %%timeit runs the cell several times to give an estimated execution time
  • %ls runs the ls command on the current folder right from Jupyter

CONDA

DETAIL                      COMMAND
LIST ALL ENVIRONMENTS       conda info --envs
CREATE A NEW ENV            conda create -n myenv python=3
INSTALL A PACKAGE           conda install pandas
INSTALL WITH PIP            python -m pip install <package>
ACTIVATE AN ENV             conda activate myenv
INSTALL GIT IN ANACONDA     conda install -c anaconda git

Pandas Notes

General Pandas Notes

  • df.sample(3) ==> get a sample of 3 random rows from your dataset
  • df.info()
  • df.shape
  • df.describe()
  • df.corr() ==> display the correlation between columns
  • NOTE you can apply changes to the dataframe by passing 'inplace=True' to most functions
  • AXIS by default axis=0, which refers to rows; to drop a column, for example, pass axis=1 as a parameter
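A minimal sketch of the two notes above (toy dataframe, hypothetical column names), showing axis=1 versus the default, and that inplace=True mutates the dataframe instead of returning a copy:

```python
import pandas as pd

# Toy dataframe for illustration
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# axis=1 targets columns: returns a new frame without column "c"
dropped = df.drop("c", axis=1)
print(list(dropped.columns))   # ['a', 'b']
print(list(df.columns))        # ['a', 'b', 'c']  (df unchanged)

# inplace=True mutates df itself and returns None
df.drop("c", axis=1, inplace=True)
print(list(df.columns))        # ['a', 'b']
```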

Columns

DETAIL                                      COMMAND
COUNT NON-NULL VALUES                       df['column name'].count()
COUNT DISTINCT VALUES                       df['column name'].nunique()
SUM                                         df['column name'].sum()
MEAN                                        df['column name'].mean()
FREQUENCY TABLE                             df['column name'].value_counts()
NEW COLUMN FROM AN EXISTING ONE             df['new col'] = df['existing col'] * 0.15
NEW COLUMN WITH A CONSTANT VALUE            df['new col'] = 'XX'
VIEW INFO AND DATA TYPES                    df.info()
GET INFO ON THE DATA FRAME                  df.describe()
GET INFO ON THE TEXT COLUMNS                df.describe(include='object')
INDEX - SET A COLUMN AS THE INDEX           df.set_index('col_name')
INDEX - ADD A NEW INDEX                     df.reset_index()
SORT THE INDEX (USEFUL FOR PERFORMANCE)     df.sort_index(ascending=False)
SORT BY ANOTHER COLUMN                      df.sort_values('col_name')
SORT BY 2 OR MORE COLS (LIST!)              df.sort_values(['col_1', 'col_2'])
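A short worked example of a few commands from the table (toy data; the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "banana", "apple", "kiwi"],
    "qty":   [3, 1, 2, 5],
})

# Frequency table: counts per distinct value, most frequent first
counts = df["fruit"].value_counts()
print(counts["apple"])             # 2

# Sort by two columns (note the list)
ordered = df.sort_values(["fruit", "qty"])
print(ordered.iloc[0]["qty"])      # 2  (the smaller of the two apple rows)

# set_index returns a new frame indexed by the chosen column
indexed = df.set_index("fruit")
print(indexed.loc["kiwi", "qty"])  # 5
```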

Sum up multiple columns (e.g. to filter)

(StackOverflow link at the bottom)

cols_to_sum = df.columns[:df.shape[1]]
# NOTE. df.shape[1] GIVES THE NUMBER OF COLUMNS
df['Fruit Total'] = df[cols_to_sum].sum(axis=1)

Another option:

df['Fruit Total'] = df.iloc[:, -4:-1].sum(axis=1)  # -4:-1 excludes the last column (Kiwis)
print(df)
  Apples  Bananas  Grapes  Kiwis  Fruit Total
0     2.0      3.0     NaN    1.0          5.0
1     1.0      3.0     7.0    NaN         11.0
2     NaN      NaN     2.0    3.0          2.0

Data

DETAIL COMMAND
DETAIL                                      COMMAND
GET DATA BY ROW/COLUMN INDEX                df.loc[rows, columns]
GET DATA FROM ROW 0 AND SHOW ALL COLUMNS    df.loc[0, :]
GET DATA ROWS 1, 2, 3                       df.loc[[1, 2, 3]]
CHOOSE COLUMNS BY INDEX NUMBER              df.iloc[:, 1:5]
SEE UNIQUE VALUES IN A COL                  df.col_name.unique()
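A quick sketch of loc (label-based) versus iloc (position-based) selection, using a toy dataframe with made-up column names:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["ann", "bob", "cat", "dan"], "score": [10, 20, 30, 40]}
)

# loc selects by label; with the default RangeIndex the labels are 0..3
row0 = df.loc[0, :]
print(row0["name"])                 # ann

# A list of labels selects several rows
subset = df.loc[[1, 2, 3]]
print(len(subset))                  # 3

# iloc selects by position: all rows, columns 1 up to (not incl.) 2
scores = df.iloc[:, 1:2]
print(list(scores.columns))         # ['score']

# Distinct values in a column
print(sorted(df["name"].unique()))  # ['ann', 'bob', 'cat', 'dan']
```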

Data Types

DTYPE             WHEN TO USE     CONVERSION
object            text            .astype('str')
datetime64        date & time     pd.to_datetime()
int64 / float64   number          pd.to_numeric()

  • Get a field's data type with df['col_name'].dtype
  • others: Category, Boolean, String
  • Categorical data type
  • convert to categorical type in one of two ways:
    • df['col_name'].astype('category')
    • pd.Categorical(df['col_name'])
  • Numerical string conversion: to convert a number stored as text (e.g. '31.0') to an int, go through float first, since astype('int') fails directly on '31.0': df['col_name'] = df.col_name.astype('str').astype('float').astype('int')
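The conversions above in a runnable sketch (toy data, hypothetical column names):

```python
import pandas as pd

df = pd.DataFrame({
    "when":  ["2024-01-05", "2024-02-10"],
    "size":  ["31.0", "7.5"],          # numbers stored as text
    "group": ["a", "b"],
})

# Text -> datetime64, which unlocks the .dt accessor
df["when"] = pd.to_datetime(df["when"])
print(df["when"].dt.month.tolist())   # [1, 2]

# str -> float -> int: astype('int') straight from '31.0' would raise
df["size"] = df["size"].astype("float").astype("int")
print(df["size"].tolist())            # [31, 7]

# One of the two ways to get a categorical dtype
df["group"] = df["group"].astype("category")
print(df["group"].dtype)              # category
```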

Filtering data: use Boolean indexing

  • create filter(s):
    • filter_purchase_date = df['purchase_date'].dt.month == 11
    • filter_product_type = df['product'] == 'book'
  • apply filters to the dataframe:
    • df[filter_purchase_date & filter_product_type]
  • display only a column or columns:
    • df["column_name"]
    • df.column_name
    • df[["col_1", "col_2"]]
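The purchase_date/product filters above, run end to end on toy data (the column names follow the bullets; the rows are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "purchase_date": pd.to_datetime(
        ["2023-11-02", "2023-11-15", "2023-12-01"]
    ),
    "product": ["book", "pen", "book"],
})

# Each filter is a boolean Series aligned with df's index
filter_purchase_date = df["purchase_date"].dt.month == 11
filter_product_type = df["product"] == "book"

# Combining with & keeps only rows where both are True
result = df[filter_purchase_date & filter_product_type]
print(len(result))                # 1
print(result.iloc[0]["product"])  # book
```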

Filtering on Rows (masks)

  • TWO ways to filter, df.loc and df.query:
    • df.loc[ df['quantity'] > 12 ]
    • df.query('quantity > 12')
  • filter on rows in a col (similar to AutoFilter in Excel):
    • df[df.col_name == "criterion"]
    • NOTE that df.col_name == "criterion" is a boolean filter returning a Series of True/False
    • (df.col_name == "criterion").sum() to get the number of matching records
  • filter on multiple conditions:
  • AND :
    filter_1 = (df.price < 100) & (df.minimum_nights < 3)
    filter_1.sum() # RETURNS THE NUMBER OF RECORDS THAT MATCH OUR FILTER
    df[filter_1] # APPLY THE FILTER ON THE DATAFRAME
  • OR :
    filter_2 = (df.reviews_per_month > 3) | (df.number_of_reviews > 50)
  • Numpy OR :
    filter_3 = np.logical_or((df.reviews_per_month > 3), (df.number_of_reviews > 50))
  • INVERT a filter : ~filter_1 (the tilde in front inverts the filter, i.e. logical NOT)
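The AND / OR / invert masks above on toy data (rows invented; column names match the bullets):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [80, 150, 90],
    "minimum_nights": [2, 1, 5],
    "reviews_per_month": [4.0, 1.0, 2.0],
    "number_of_reviews": [10, 60, 5],
})

# AND: both conditions must hold
filter_1 = (df.price < 100) & (df.minimum_nights < 3)
print(filter_1.sum())              # 1  (only the first row matches)

# OR: either condition suffices; np.logical_or is equivalent
filter_2 = (df.reviews_per_month > 3) | (df.number_of_reviews > 50)
filter_3 = np.logical_or(df.reviews_per_month > 3, df.number_of_reviews > 50)
print(filter_2.equals(filter_3))   # True

# ~ inverts a mask (logical NOT)
print((~filter_1).sum())           # 2
```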

PIVOT TABLES

 pd.pivot_table(
      df,
      index=['col_name_rows'],
      columns=['col_name_cols'],
      values=['col_name_measure'],
      aggfunc='sum',
      margins=True, #TOTALS COLUMN
      fill_value=0 #FILLER FOR ERRORS/EMPTIES
 )
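The template above filled in with toy sales data (region/product/sales are made-up names), showing what margins and fill_value do:

```python
import pandas as pd

df = pd.DataFrame({
    "region":  ["north", "north", "south", "south"],
    "product": ["book", "pen", "book", "book"],
    "sales":   [10, 5, 7, 3],
})

pivot = pd.pivot_table(
    df,
    index="region",
    columns="product",
    values="sales",
    aggfunc="sum",
    margins=True,     # adds an 'All' row/column of totals
    fill_value=0,     # south has no pens, so that cell becomes 0
)
print(pivot.loc["south", "book"])  # 10
print(pivot.loc["south", "pen"])   # 0
print(pivot.loc["All", "All"])     # 25
```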

ADD / REMOVE DATA

  • remove a row or column :
    • df.drop('col_name', axis=1)
    • df.drop(columns="col_name")
    • df.drop(index_num) # THIS DELETES A ROW
  • add rows:
    • df.append({'col_1': 'value', 'col_2': 'value'}, ignore_index=True)
  • append two dataframes:
    • df.append(df_other, ignore_index=True)
  • NOTE: DataFrame.append was deprecated and removed in pandas 2.0; use pd.concat([df, df_other], ignore_index=True) instead
  • NOTE on ignore_index=True : this ignores the index of the second dataframe and keeps incrementing the index of the existing dataframe
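A sketch of the pandas 2 way to do both operations with pd.concat (toy dataframes, made-up column name):

```python
import pandas as pd

df = pd.DataFrame({"col": ["a", "b"]})
df_other = pd.DataFrame({"col": ["c", "d"]})

# pd.concat replaces the removed DataFrame.append (pandas >= 2.0);
# ignore_index=True renumbers the result 0..n-1
combined = pd.concat([df, df_other], ignore_index=True)
print(combined["col"].tolist())   # ['a', 'b', 'c', 'd']
print(combined.index.tolist())    # [0, 1, 2, 3]

# Adding a single row: wrap the dict in a one-row DataFrame first
new_row = pd.DataFrame([{"col": "e"}])
combined = pd.concat([combined, new_row], ignore_index=True)
print(len(combined))              # 5
```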

TIME SERIES

CODE                                                  NOTE
pd.to_datetime(df['col_name'])                        convert to date type
pd.to_datetime(df['col_name'], format="%m/%d/%Y")     specify that the date is in 12/31/1900 format
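The format argument in action on toy data (hypothetical column name):

```python
import pandas as pd

df = pd.DataFrame({"when": ["12/31/1900", "01/15/2024"]})

# format tells pandas the strings are month/day/year
df["when"] = pd.to_datetime(df["when"], format="%m/%d/%Y")
print(df["when"].dt.year.tolist())   # [1900, 2024]
```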

General Notes

 .
 ├── 1-Data_Prep.ipynb  # Data prep notebook
 ├── 2-EDA.ipynb        # Final analysis notebook
 ├── data               # Categorized data files
 │   ├── external       # External data files  
 │   ├── interim        # Working folder
 │   ├── processed      # Cleaned and ready to use
 │   └── raw            # Unmodified originals
 └── reports            # Final reports

Udemy Data Manipulation in Python : A Pandas Crash Course