Python Pandas 2
#python
Thanks to info from :
- talkpython.fm - Excel to Python course
- Udemy Courses
General Code Snippets
JUPYTER
- `df.` + TAB : displays the available options/methods
- `?df` : opens the help
- `df.head` + SHIFT + TAB : shows the signature
- SHIFT + double TAB : shows additional info
- SHIFT + triple TAB : shows the full signature
- `%%timeit` : runs the cell several times and reports an average execution time
- `%ls` : runs the `ls` command on the current folder right from Jupyter
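Outside Jupyter, the standard-library `timeit` module gives the same kind of measurement as `%%timeit`. A minimal sketch (the expression being timed is just a placeholder):

```python
import timeit

# Run the statement many times and report the total elapsed time,
# similar in spirit to what %%timeit does for a whole cell
elapsed = timeit.timeit("sum(range(100))", number=10_000)
print(f"{elapsed:.4f} s for 10,000 runs")
```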
CONDA

| DETAIL | COMMAND |
| --- | --- |
| LIST ALL ENVIRONMENTS | `conda info --envs` |
| CREATE A NEW ENV | `conda create -n myenv python=3` |
| INSTALL A PACKAGE | `conda install pandas` |
| INSTALL WITH PIP | `python -m pip install <package>` |
| ACTIVATE AN ENV | `conda activate myenv` |
| INSTALL GIT IN ANACONDA | `conda install -c anaconda git` |
Pandas Notes
General Pandas Notes
- `df.sample(3)` ==> get a sample of 3 random rows from your dataset
- `df.info()`
- `df.shape`
- `df.describe()`
- `df.corr()` ==> display the correlation between numeric columns
- NOTE : you can apply changes to the dataframe by passing `inplace=True` to most functions
- AXIS : by default `axis=0`, which refers to rows. To, e.g., drop a column, pass `axis=1` as a parameter
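The `inplace` and `axis` notes above, sketched on a throwaway frame (the column names are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# axis=1 targets columns: this returns a NEW frame without 'c'
dropped = df.drop("c", axis=1)

# inplace=True modifies df itself (and returns None)
df.drop("c", axis=1, inplace=True)
```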
Columns

| DETAIL | COMMAND |
| --- | --- |
| COUNT THE VALUES | `df['column name'].count()` |
| COUNT THE UNIQUE VALUES | `df['column name'].nunique()` |
| SUM THE VALUES | `df['column name'].sum()` |
| AVERAGE THE VALUES | `df['column name'].mean()` |
| FREQUENCY TABLE | `df['column name'].value_counts()` |
| NEW COLUMN FROM AN EXISTING ONE | `df['new col'] = df['existing col'] * 0.15` |
| NEW COLUMN WITH A CONSTANT VALUE | `df['new col'] = 'XX'` |
| view info and data types | `df.info()` |
| get info on the data frame | `df.describe()` |
| get info on the text (object) columns | `df.describe(include='object')` |
| INDEX - SET A COLUMN AS THE INDEX | `df.set_index('col_name')` |
| INDEX - ADD A NEW INDEX | `df.reset_index()` |
| sort the index (useful for performance) | `df.sort_index(ascending=False)` |
| possible to sort by another column | `df.sort_values('col_name')` |
| sort by 2 or more cols (LIST!) | `df.sort_values(['col_1', 'col_2'])` |
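A few of the column operations above on a small invented fruit-price frame:

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "apple", "pear"],
                   "price": [1.0, 1.2, 2.0]})

counts = df["fruit"].value_counts()   # frequency table
total = df["price"].sum()             # 1.0 + 1.2 + 2.0
df["tax"] = df["price"] * 0.15        # new column from an existing one
df["currency"] = "EUR"                # new column with a constant value
```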
Sum up multiple columns (e.g. to filter)
(StackOverflow link at the bottom)

```python
cols_to_sum = df.columns[: df.shape[1]]  # NOTE : df.shape[1] gives the number of columns
df['Fruit Total'] = df[cols_to_sum].sum(axis=1)
```

Another option (NOTE : `-4:-1` stops one column short, so `Kiwis` is left out of the sum, which is what the output below shows) :

```python
df['Fruit Total'] = df.iloc[:, -4:-1].sum(axis=1)
print(df)
```

```
   Apples  Bananas  Grapes  Kiwis  Fruit Total
0     2.0      3.0     NaN    1.0          5.0
1     1.0      3.0     7.0    NaN         11.0
2     NaN      NaN     2.0    3.0          2.0
```
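A runnable reconstruction of the fruit example, using the data from the printed frame. Note that summing all four columns (the first option) gives row totals of 6 / 11 / 5, unlike the printed output, which came from the `iloc[:, -4:-1]` variant:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Apples":  [2.0, 1.0, np.nan],
    "Bananas": [3.0, 3.0, np.nan],
    "Grapes":  [np.nan, 7.0, 2.0],
    "Kiwis":   [1.0, np.nan, 3.0],
})

# df.shape[1] gives the number of columns, so this selects all of them
cols_to_sum = df.columns[: df.shape[1]]

# row-wise sum; NaN values are skipped
df["Fruit Total"] = df[cols_to_sum].sum(axis=1)
```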
Data

| DETAIL | COMMAND |
| --- | --- |
| GET DATA BY ROW/COLUMN INDEX | `df.loc[rows, columns]` |
| GET DATA FROM ROW 0 AND SHOW ALL COLUMNS | `df.loc[0, :]` |
| GET DATA ROWS 1, 2, 3 | `df.loc[[1, 2, 3]]` |
| CHOOSE COLUMNS BY INDEX NUMBER | `df.iloc[:, 1:5]` |
| see unique values in a col | `df.col_name.unique()` |
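A quick sketch of the selectors above on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "d"],
                   "qty": [10, 20, 30, 40]})

row0 = df.loc[0, :]          # row 0, all columns (label-based)
rows123 = df.loc[[1, 2, 3]]  # rows whose index labels are 1, 2 and 3
some_cols = df.iloc[:, 0:1]  # columns chosen by position
uniques = df.name.unique()   # distinct values in a column
```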
Data Types

| Dtype | When to use | Conversion |
| --- | --- | --- |
| object | text | `.astype('str')` |
| datetime64 | date & time | `pd.to_datetime()` |
| int64 / float64 | number | `pd.to_numeric()` |
- Get a field's data type with `df['col_name'].dtype`
- others : Category, Boolean, String
- Categorical data type
	- convert to categorical type in one of two ways :
		- `df['col_name'].astype('category')`
		- `pd.Categorical(df['col_name'])`
- Numerical string conversion : to turn a numeric string into an int (e.g. '31.0' into 31), you first need to go through float :
	`df['col_name'] = df.col_name.astype('str').astype('float').astype('int')`
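The conversions above in one runnable sketch (the column name and values are invented):

```python
import pandas as pd

df = pd.DataFrame({"val": ["31.0", "7.5", "12.0"]})

# A numeric string can't be cast straight to int; go through float first
df["as_int"] = df["val"].astype("float").astype("int")

# The other conversion helpers from the table above
dates = pd.to_datetime(pd.Series(["2021-01-31"]))
cats = df["val"].astype("category")
```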
Filtering data : use Boolean indexing
- create filter(s) :
	- `filter_purchase_date = df['purchase_date'].dt.month == 11`
	- `filter_product_type = df['product'] == 'book'`
- apply filters to dataframe :
	`df[filter_purchase_date & filter_product_type]`
- display only a column or columns :
	- `df["column_name"]`
	- `df.column_name`
	- `df[["col_1", "col_2"]]`
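The purchase-date example above, fleshed out into a runnable sketch (the data is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["book", "pen", "book"],
    "purchase_date": pd.to_datetime(["2021-11-05", "2021-11-20", "2021-03-01"]),
})

filter_purchase_date = df["purchase_date"].dt.month == 11
filter_product_type = df["product"] == "book"

# combine both boolean masks
november_books = df[filter_purchase_date & filter_product_type]
```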
Filtering on Rows (masks)
- TWO ways to filter : `df.loc` and `df.query` :
	- `df.loc[ df['quantity'] > 12 ]`
	- `df.query('quantity > 12')`
- filter on rows in a col (similar to AutoFilter in Excel) :
	`df[df.col_name == "criterion"]`
- NOTE that `df.col_name == "criterion"` is a boolean filter returning a Series of True/False ; use `(df.col_name == "criterion").sum()` to get the number of matching records
- filter on multiple conditions:
- AND :
filter_1 = (df.price < 100) & (df.minimum_nights < 3)
filter_1.sum() # RETURNS THE NUMBER OF RECORDS THAT MATCH OUR FILTER
df[filter_1] # APPLY THE FILTER ON THE DATAFRAME
- OR :
filter_2 = (df.reviews_per_month > 3) | (df.number_of_reviews > 50)
- Numpy OR :
filter_3 = np.logical_or((df.reviews_per_month > 3), (df.number_of_reviews > 50))
- INVERT a filter :
~filter_1
The tilde in front inverts the filter (not)
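The AND / OR / invert patterns above, run against a small invented listings frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [50, 150, 80],
                   "minimum_nights": [1, 2, 5]})

filter_1 = (df.price < 100) & (df.minimum_nights < 3)
matches = filter_1.sum()          # number of rows matching the filter
cheap_short = df[filter_1]        # apply the filter

filter_3 = np.logical_or(df.price > 100, df.minimum_nights > 3)

inverted = df[~filter_1]          # rows that did NOT match filter_1
```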
PIVOT TABLES
pd.pivot_table(
df,
index=['col_name_rows'],
columns=['col_name_cols'],
values=['col_name_measure'],
aggfunc='sum',
margins=True, #TOTALS COLUMN
fill_value=0 #FILLER FOR ERRORS/EMPTIES
)
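The pivot-table call above, applied to a tiny invented sales frame:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["book", "pen", "book", "pen"],
    "sales": [10, 20, 30, 40],
})

pivot = pd.pivot_table(
    df,
    index=["region"],
    columns=["product"],
    values=["sales"],
    aggfunc="sum",
    margins=True,   # adds an 'All' row/column with totals
    fill_value=0,   # filler for empty cells
)
```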
ADD / REMOVE DATA
- remove a row or column :
	- `df.drop('col_name', axis=1)`
	- `df.drop(columns="col_name")`
	- `df.drop(index_num)` # THIS DELETES A ROW
- add rows :
	`df.append({'col_1': 'value_1', 'col_2': 'value_2'}, ignore_index=True)`
- append two dataframes :
	`df.append(df_other, ignore_index=True)`
- NOTE on `ignore_index=True` : this ignores the index on the second dataframe and keeps incrementing the index on the existing dataframe
- NOTE : `DataFrame.append` was deprecated and removed in pandas 2.0 ; use `pd.concat` instead
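Since `DataFrame.append` was removed in pandas 2.0, the same operation via `pd.concat` (a sketch with invented frames):

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2]})
df_other = pd.DataFrame({"col": [3]})

# ignore_index=True discards df_other's index and keeps
# incrementing the index of the combined frame
combined = pd.concat([df, df_other], ignore_index=True)
```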
TIME SERIES

| CODE | NOTE |
| --- | --- |
| `pd.to_datetime(df['col_name'])` | convert to date type |
| `pd.to_datetime(df['col_name'], format="%m/%d/%Y")` | specify that the date is in 12/31/1900 (month/day/year) format |
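The date-parsing rows above as a runnable sketch:

```python
import pandas as pd

s = pd.Series(["12/31/1900", "01/15/2020"])

# format="%m/%d/%Y" tells pandas the strings are month/day/year
dates = pd.to_datetime(s, format="%m/%d/%Y")
```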
General Notes
Recommended folder structure
.
├── 1-Data_Prep.ipynb # Data prep notebook
├── 2-EDA.ipynb # Final analysis notebook
├── data # Categorized data files
│ ├── external # External data files
│ ├── interim # Working folder
│ ├── processed # Cleaned and ready to use
│ └── raw # Unmodified originals
└── reports # Final reports
Links
Udemy Data Manipulation in Python : A Pandas Crash Course