Open In Colab

Introduction to Pandas

Learning Objectives:

  • Gain an introduction to the DataFrame and Series data structures of the pandas library
  • Access and manipulate data within a DataFrame and Series

pandas is a column-oriented data analysis API. It’s a great tool for handling and analyzing input data, and many ML frameworks support pandas data structures as inputs.

Although a comprehensive introduction to the API would span many pages, the core concepts are fairly straightforward, and we’ll present them below. For a more complete reference, the pandas docs site contains extensive documentation and many tutorials. (Note that Colab may use a slightly older version number, but the parts of pandas covered here are unlikely to differ from version to version.)

Let’s import pandas library

import pandas as pd
pd.__version__
'1.5.3'

The primary data structures in pandas are implemented as two classes:

  • DataFrame, which you can imagine as a relational data table, with rows and named columns.
  • Series, which is a single column. Each row can be labeled via an index. A DataFrame contains one or more Series and a name for each Series.

The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in Spark and R.

from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)
print(type(diabetes['data']))
<class 'pandas.core.frame.DataFrame'>
?load_diabetes
print(diabetes.DESCR)
.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
df = diabetes['data']

Creating pandas dataframe from series

cities = pd.Series(['Mumbai', 'Bangalore', 'Chennai', 'Delhi'])
population = pd.Series([17000000, 13000000, 6000000])
city_info_df = pd.DataFrame({'City': cities, 'Population': population})
type(city_info_df)
pandas.core.frame.DataFrame
city_info_df
City Population
0 Mumbai 17000000.0
1 Bangalore 13000000.0
2 Chennai 6000000.0
3 Delhi NaN

Exploring data in dataframe

Find out the number of rows and columns in the dataframe.

df.shape
(442, 10)

In this case, there are 442 rows and 10 columns.

Let’s see the names of the columns

df.columns
Index(['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], dtype='object')
list(df.columns)
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    float64
 1   sex     442 non-null    float64
 2   bmi     442 non-null    float64
 3   bp      442 non-null    float64
 4   s1      442 non-null    float64
 5   s2      442 non-null    float64
 6   s3      442 non-null    float64
 7   s4      442 non-null    float64
 8   s5      442 non-null    float64
 9   s6      442 non-null    float64
dtypes: float64(10)
memory usage: 34.7 KB

Quickly examine a few entries in the dataframe. Say first 5 and last 5.

? df.head
df.head()
age sex bmi bp s1 s2 s3 s4 s5 s6
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068332 -0.092204
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592 0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022688 -0.009362
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031988 -0.046641
df.tail()
age sex bmi bp s1 s2 s3 s4 s5 s6
437 0.041708 0.050680 0.019662 0.059744 -0.005697 -0.002566 -0.028674 -0.002592 0.031193 0.007207
438 -0.005515 0.050680 -0.015906 -0.067642 0.049341 0.079165 -0.028674 0.034309 -0.018114 0.044485
439 0.041708 0.050680 -0.015906 0.017293 -0.037344 -0.013840 -0.024993 -0.011080 -0.046883 0.015491
440 -0.045472 -0.044642 0.039062 0.001215 0.016318 0.015283 -0.028674 0.026560 0.044529 -0.025930
441 -0.045472 -0.044642 -0.073030 -0.081413 0.083740 0.027809 0.173816 -0.039493 -0.004222 0.003064

Let’s get summary statistics on the dataframe

df.describe()
age sex bmi bp s1 s2 s3 s4 s5 s6
count 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02
mean -2.511817e-19 1.230790e-17 -2.245564e-16 -4.797570e-17 -1.381499e-17 3.918434e-17 -5.777179e-18 -9.042540e-18 9.293722e-17 1.130318e-17
std 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02
min -1.072256e-01 -4.464164e-02 -9.027530e-02 -1.123988e-01 -1.267807e-01 -1.156131e-01 -1.023071e-01 -7.639450e-02 -1.260971e-01 -1.377672e-01
25% -3.729927e-02 -4.464164e-02 -3.422907e-02 -3.665608e-02 -3.424784e-02 -3.035840e-02 -3.511716e-02 -3.949338e-02 -3.324559e-02 -3.317903e-02
50% 5.383060e-03 -4.464164e-02 -7.283766e-03 -5.670422e-03 -4.320866e-03 -3.819065e-03 -6.584468e-03 -2.592262e-03 -1.947171e-03 -1.077698e-03
75% 3.807591e-02 5.068012e-02 3.124802e-02 3.564379e-02 2.835801e-02 2.984439e-02 2.931150e-02 3.430886e-02 3.243232e-02 2.791705e-02
max 1.107267e-01 5.068012e-02 1.705552e-01 1.320436e-01 1.539137e-01 1.987880e-01 1.811791e-01 1.852344e-01 1.335973e-01 1.356118e-01
df.describe(percentiles=[0.2, 0.6, 0.8])
age sex bmi bp s1 s2 s3 s4 s5 s6
count 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02
mean -2.511817e-19 1.230790e-17 -2.245564e-16 -4.797570e-17 -1.381499e-17 3.918434e-17 -5.777179e-18 -9.042540e-18 9.293722e-17 1.130318e-17
std 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02
min -1.072256e-01 -4.464164e-02 -9.027530e-02 -1.123988e-01 -1.267807e-01 -1.156131e-01 -1.023071e-01 -7.639450e-02 -1.260971e-01 -1.377672e-01
20% -4.547248e-02 -4.464164e-02 -4.048038e-02 -4.009893e-02 -3.871969e-02 -3.695017e-02 -3.971921e-02 -3.949338e-02 -4.117617e-02 -3.835666e-02
50% 5.383060e-03 -4.464164e-02 -7.283766e-03 -5.670422e-03 -4.320866e-03 -3.819065e-03 -6.584468e-03 -2.592262e-03 -1.947171e-03 -1.077698e-03
60% 1.628068e-02 5.068012e-02 5.218854e-03 8.100982e-03 8.062710e-03 8.706873e-03 8.142084e-03 -2.592262e-03 1.255119e-02 7.206516e-03
80% 4.170844e-02 5.068012e-02 4.229559e-02 4.941519e-02 3.943444e-02 3.952068e-02 3.759519e-02 3.430886e-02 3.885335e-02 3.620126e-02
max 1.107267e-01 5.068012e-02 1.705552e-01 1.320436e-01 1.539137e-01 1.987880e-01 1.811791e-01 1.852344e-01 1.335973e-01 1.356118e-01
df.describe().T
count mean std min 25% 50% 75% max
age 442.0 -2.511817e-19 0.047619 -0.107226 -0.037299 0.005383 0.038076 0.110727
sex 442.0 1.230790e-17 0.047619 -0.044642 -0.044642 -0.044642 0.050680 0.050680
bmi 442.0 -2.245564e-16 0.047619 -0.090275 -0.034229 -0.007284 0.031248 0.170555
bp 442.0 -4.797570e-17 0.047619 -0.112399 -0.036656 -0.005670 0.035644 0.132044
s1 442.0 -1.381499e-17 0.047619 -0.126781 -0.034248 -0.004321 0.028358 0.153914
s2 442.0 3.918434e-17 0.047619 -0.115613 -0.030358 -0.003819 0.029844 0.198788
s3 442.0 -5.777179e-18 0.047619 -0.102307 -0.035117 -0.006584 0.029312 0.181179
s4 442.0 -9.042540e-18 0.047619 -0.076395 -0.039493 -0.002592 0.034309 0.185234
s5 442.0 9.293722e-17 0.047619 -0.126097 -0.033246 -0.001947 0.032432 0.133597
s6 442.0 1.130318e-17 0.047619 -0.137767 -0.033179 -0.001078 0.027917 0.135612

Selection

df.columns
Index(['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], dtype='object')
df['age']
0      0.038076
1     -0.001882
2      0.085299
3     -0.089063
4      0.005383
         ...   
437    0.041708
438   -0.005515
439    0.041708
440   -0.045472
441   -0.045472
Name: age, Length: 442, dtype: float64
type(df['age'])
pandas.core.series.Series
df['age'][0]
0.038075906433423026
df['age'][:5]
0    0.038076
1   -0.001882
2    0.085299
3   -0.089063
4    0.005383
Name: age, dtype: float64
df['age'][-5:]
437    0.041708
438   -0.005515
439    0.041708
440   -0.045472
441   -0.045472
Name: age, dtype: float64
df['age'][100:200]
100    0.016281
101    0.016281
102   -0.092695
103    0.059871
104   -0.027310
         ...   
195    0.027178
196   -0.023677
197    0.048974
198   -0.052738
199    0.041708
Name: age, Length: 100, dtype: float64
df[['age', 'sex']]
age sex
0 0.038076 0.050680
1 -0.001882 -0.044642
2 0.085299 0.050680
3 -0.089063 -0.044642
4 0.005383 -0.044642
... ... ...
437 0.041708 0.050680
438 -0.005515 0.050680
439 0.041708 0.050680
440 -0.045472 -0.044642
441 -0.045472 -0.044642

442 rows × 2 columns

df[['age', 'sex']][:5]
age sex
0 0.038076 0.050680
1 -0.001882 -0.044642
2 0.085299 0.050680
3 -0.089063 -0.044642
4 0.005383 -0.044642
df[['age', 'sex']][-5:]
age sex
437 0.041708 0.050680
438 -0.005515 0.050680
439 0.041708 0.050680
440 -0.045472 -0.044642
441 -0.045472 -0.044642
df[['age', 'sex']]
age sex
0 0.038076 0.050680
1 -0.001882 -0.044642
2 0.085299 0.050680
3 -0.089063 -0.044642
4 0.005383 -0.044642
... ... ...
437 0.041708 0.050680
438 -0.005515 0.050680
439 0.041708 0.050680
440 -0.045472 -0.044642
441 -0.045472 -0.044642

442 rows × 2 columns

  • .loc
  • .iloc
df.iloc[441]
age   -0.045472
sex   -0.044642
bmi   -0.073030
bp    -0.081413
s1     0.083740
s2     0.027809
s3     0.173816
s4    -0.039493
s5    -0.004222
s6     0.003064
Name: 441, dtype: float64
df.iloc[0]
age    0.038076
sex    0.050680
bmi    0.061696
bp     0.021872
s1    -0.044223
s2    -0.034821
s3    -0.043401
s4    -0.002592
s5     0.019907
s6    -0.017646
Name: 0, dtype: float64
df.loc[0]
age    0.038076
sex    0.050680
bmi    0.061696
bp     0.021872
s1    -0.044223
s2    -0.034821
s3    -0.043401
s4    -0.002592
s5     0.019907
s6    -0.017646
Name: 0, dtype: float64
df.head(n=5)
age sex bmi bp s1 s2 s3 s4 s5 s6
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068332 -0.092204
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592 0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022688 -0.009362
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031988 -0.046641
df.loc[4, 'age']
0.005383060374248237
df.loc[4, ['age', 'sex']]
age    0.005383
sex   -0.044642
Name: 4, dtype: float64
df.iloc[4, [1, 5, 7, 9]]
sex   -0.044642
s2     0.015596
s4    -0.002592
s6    -0.046641
Name: 4, dtype: float64
rows_condition_met = df.age > 5.383060e-03
df.loc[rows_condition_met]
age sex bmi bp s1 s2 s3 s4 s5 s6
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019907 -0.017646
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592 0.002861 -0.025930
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031988 -0.046641
7 0.063504 0.050680 -0.001895 0.066629 0.090620 0.108914 0.022869 0.017703 -0.035816 0.003064
8 0.041708 0.050680 0.061696 -0.040099 -0.013953 0.006202 -0.028674 -0.002592 -0.014960 0.011349
... ... ... ... ... ... ... ... ... ... ...
431 0.070769 0.050680 -0.030996 0.021872 -0.037344 -0.047034 0.033914 -0.039493 -0.014960 -0.001078
432 0.009016 -0.044642 0.055229 -0.005670 0.057597 0.044719 -0.002903 0.023239 0.055686 0.106617
434 0.016281 -0.044642 0.001339 0.008101 0.005311 0.010899 0.030232 -0.039493 -0.045424 0.032059
437 0.041708 0.050680 0.019662 0.059744 -0.005697 -0.002566 -0.028674 -0.002592 0.031193 0.007207
439 0.041708 0.050680 -0.015906 0.017293 -0.037344 -0.013840 -0.024993 -0.011080 -0.046883 0.015491

228 rows × 10 columns

df.loc[df.age > 5.383060e-03]
age sex bmi bp s1 s2 s3 s4 s5 s6
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019907 -0.017646
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592 0.002861 -0.025930
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031988 -0.046641
7 0.063504 0.050680 -0.001895 0.066629 0.090620 0.108914 0.022869 0.017703 -0.035816 0.003064
8 0.041708 0.050680 0.061696 -0.040099 -0.013953 0.006202 -0.028674 -0.002592 -0.014960 0.011349
... ... ... ... ... ... ... ... ... ... ...
431 0.070769 0.050680 -0.030996 0.021872 -0.037344 -0.047034 0.033914 -0.039493 -0.014960 -0.001078
432 0.009016 -0.044642 0.055229 -0.005670 0.057597 0.044719 -0.002903 0.023239 0.055686 0.106617
434 0.016281 -0.044642 0.001339 0.008101 0.005311 0.010899 0.030232 -0.039493 -0.045424 0.032059
437 0.041708 0.050680 0.019662 0.059744 -0.005697 -0.002566 -0.028674 -0.002592 0.031193 0.007207
439 0.041708 0.050680 -0.015906 0.017293 -0.037344 -0.013840 -0.024993 -0.011080 -0.046883 0.015491

228 rows × 10 columns

age_df_temp = df.loc[df.age < 5.383060e-03]
age_df_temp = df[df.age < 5.383060e-03]
age_df_temp.head()
age sex bmi bp s1 s2 s3 s4 s5 s6
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068332 -0.092204
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022688 -0.009362
5 -0.092695 -0.044642 -0.040696 -0.019442 -0.068991 -0.079288 0.041277 -0.076395 -0.041176 -0.096346
6 -0.045472 0.050680 -0.047163 -0.015999 -0.040096 -0.024800 0.000779 -0.039493 -0.062917 -0.038357
9 -0.070900 -0.044642 0.039062 -0.033213 -0.012577 -0.034508 -0.024993 -0.002592 0.067737 -0.013504
age_df_temp.iloc[2]
age   -0.092695
sex   -0.044642
bmi   -0.040696
bp    -0.019442
s1    -0.068991
s2    -0.079288
s3     0.041277
s4    -0.076395
s5    -0.041176
s6    -0.096346
Name: 5, dtype: float64
age_df_temp.loc[1]
age   -0.001882
sex   -0.044642
bmi   -0.051474
bp    -0.026328
s1    -0.008449
s2    -0.019163
s3     0.074412
s4    -0.039493
s5    -0.068332
s6    -0.092204
Name: 1, dtype: float64
age_df_temp = df.loc[(df.age < 5.383060e-03) & (df.sex > -4.464164e-02 )]
age_df_temp.shape
(214, 10)
age_df_temp
age sex bmi bp s1 s2 s3 s4 s5 s6
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068332 -0.092204
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022688 -0.009362
5 -0.092695 -0.044642 -0.040696 -0.019442 -0.068991 -0.079288 0.041277 -0.076395 -0.041176 -0.096346
6 -0.045472 0.050680 -0.047163 -0.015999 -0.040096 -0.024800 0.000779 -0.039493 -0.062917 -0.038357
9 -0.070900 -0.044642 0.039062 -0.033213 -0.012577 -0.034508 -0.024993 -0.002592 0.067737 -0.013504
... ... ... ... ... ... ... ... ... ... ...
435 -0.012780 -0.044642 -0.023451 -0.040099 -0.016704 0.004636 -0.017629 -0.002592 -0.038460 -0.038357
436 -0.056370 -0.044642 -0.074108 -0.050427 -0.024960 -0.047034 0.092820 -0.076395 -0.061176 -0.046641
438 -0.005515 0.050680 -0.015906 -0.067642 0.049341 0.079165 -0.028674 0.034309 -0.018114 0.044485
440 -0.045472 -0.044642 0.039062 0.001215 0.016318 0.015283 -0.028674 0.026560 0.044529 -0.025930
441 -0.045472 -0.044642 -0.073030 -0.081413 0.083740 0.027809 0.173816 -0.039493 -0.004222 0.003064

214 rows × 10 columns

list("ABCD")
['A', 'B', 'C', 'D']
list("abcdefghi")
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
range(100)
range(0, 100)
import numpy as np
another_df = pd.DataFrame(
    np.random.rand(100, 4),
    index=range(10, 110),
    columns=list("ABCD"))
another_df.head()
A B C D
10 0.128088 0.963425 0.450839 0.887946
11 0.441873 0.635796 0.576356 0.206467
12 0.467190 0.512398 0.255862 0.829151
13 0.350556 0.748146 0.649348 0.158734
14 0.861936 0.355931 0.235678 0.166667
another_df.tail()
A B C D
105 0.531472 0.838437 0.540441 0.691833
106 0.253700 0.651600 0.810953 0.275363
107 0.932707 0.076388 0.415450 0.918451
108 0.176879 0.149722 0.242837 0.214613
109 0.908052 0.453071 0.427724 0.433188
df = pd.DataFrame(np.random.rand(9, 4), index=list("abcdefghi"), columns=list("ABCD"))
df.shape
(9, 4)
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 9 entries, a to i
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       9 non-null      float64
 1   B       9 non-null      float64
 2   C       9 non-null      float64
 3   D       9 non-null      float64
dtypes: float64(4)
memory usage: 660.0+ bytes
df.head()
A B C D
a 0.511163 0.077225 0.712719 0.703220
b 0.144652 0.319507 0.059459 0.076181
c 0.668516 0.924316 0.190403 0.144573
d 0.470595 0.358167 0.526109 0.548248
e 0.840948 0.981394 0.959002 0.870479
df.index
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'], dtype='object')
df.loc['a']
A    0.511163
B    0.077225
C    0.712719
D    0.703220
Name: a, dtype: float64
df.loc['a', 'A':'D']
A    0.511163
B    0.077225
C    0.712719
D    0.703220
Name: a, dtype: float64
df.loc['a':'d', :]
A B C D
a 0.511163 0.077225 0.712719 0.703220
b 0.144652 0.319507 0.059459 0.076181
c 0.668516 0.924316 0.190403 0.144573
d 0.470595 0.358167 0.526109 0.548248
df.iloc[:4, 0:3]
A B C
a 0.511163 0.077225 0.712719
b 0.144652 0.319507 0.059459
c 0.668516 0.924316 0.190403
d 0.470595 0.358167 0.526109
df.iloc[:4, 1:3]
B C
a 0.077225 0.712719
b 0.319507 0.059459
c 0.924316 0.190403
d 0.358167 0.526109
df.iloc[0:4, :]
A B C D
a 0.511163 0.077225 0.712719 0.703220
b 0.144652 0.319507 0.059459 0.076181
c 0.668516 0.924316 0.190403 0.144573
d 0.470595 0.358167 0.526109 0.548248
df.iloc[0:4]
A B C D
a 0.511163 0.077225 0.712719 0.703220
b 0.144652 0.319507 0.059459 0.076181
c 0.668516 0.924316 0.190403 0.144573
d 0.470595 0.358167 0.526109 0.548248
df.iloc[:, 1]
a    0.077225
b    0.319507
c    0.924316
d    0.358167
e    0.981394
f    0.259232
g    0.969513
h    0.788978
i    0.121502
Name: B, dtype: float64
selector = lambda df: df['A'] > 0
selector
<function __main__.<lambda>(df)>
df.loc[selector]
A B C D
a 0.511163 0.077225 0.712719 0.703220
b 0.144652 0.319507 0.059459 0.076181
c 0.668516 0.924316 0.190403 0.144573
d 0.470595 0.358167 0.526109 0.548248
e 0.840948 0.981394 0.959002 0.870479
f 0.227027 0.259232 0.627810 0.249644
g 0.802435 0.969513 0.604462 0.871050
h 0.660405 0.788978 0.023313 0.524337
i 0.551329 0.121502 0.321013 0.481932
selector = lambda df: df['A'] > 0.5
df.loc[selector]
A B C D
a 0.511163 0.077225 0.712719 0.703220
c 0.668516 0.924316 0.190403 0.144573
e 0.840948 0.981394 0.959002 0.870479
g 0.802435 0.969513 0.604462 0.871050
h 0.660405 0.788978 0.023313 0.524337
i 0.551329 0.121502 0.321013 0.481932
selector = lambda df: (df['A'] > 0.5)&(df['B'] < 0.2)
df.loc[selector]
A B C D
a 0.511163 0.077225 0.712719 0.703220
i 0.551329 0.121502 0.321013 0.481932
condition_for_selection = (df['A'] > 0.5)&(df['B'] < 0.2)
condition_for_selection
a     True
b    False
c    False
d    False
e    False
f    False
g    False
h    False
i     True
dtype: bool
df[condition_for_selection]
A B C D
a 0.511163 0.077225 0.712719 0.703220
i 0.551329 0.121502 0.321013 0.481932
condition_for_selection = (df['A'] > 0.5) | ~(df['B'] < 0.2)
df[condition_for_selection]
A B C D
a 0.511163 0.077225 0.712719 0.703220
b 0.144652 0.319507 0.059459 0.076181
c 0.668516 0.924316 0.190403 0.144573
d 0.470595 0.358167 0.526109 0.548248
e 0.840948 0.981394 0.959002 0.870479
f 0.227027 0.259232 0.627810 0.249644
g 0.802435 0.969513 0.604462 0.871050
h 0.660405 0.788978 0.023313 0.524337
i 0.551329 0.121502 0.321013 0.481932

OR operator: | NOT operator: ~

Adding a column in the dataframe

df['E'] = df['A']*100
df
A B C D E
a 0.511163 0.077225 0.712719 0.703220 51.116259
b 0.144652 0.319507 0.059459 0.076181 14.465201
c 0.668516 0.924316 0.190403 0.144573 66.851581
d 0.470595 0.358167 0.526109 0.548248 47.059507
e 0.840948 0.981394 0.959002 0.870479 84.094789
f 0.227027 0.259232 0.627810 0.249644 22.702716
g 0.802435 0.969513 0.604462 0.871050 80.243516
h 0.660405 0.788978 0.023313 0.524337 66.040476
i 0.551329 0.121502 0.321013 0.481932 55.132923
df["F"] = df["A"] + df["C"]
df
A B C D E F
a 0.511163 0.077225 0.712719 0.703220 51.116259 1.223882
b 0.144652 0.319507 0.059459 0.076181 14.465201 0.204111
c 0.668516 0.924316 0.190403 0.144573 66.851581 0.858919
d 0.470595 0.358167 0.526109 0.548248 47.059507 0.996704
e 0.840948 0.981394 0.959002 0.870479 84.094789 1.799950
f 0.227027 0.259232 0.627810 0.249644 22.702716 0.854837
g 0.802435 0.969513 0.604462 0.871050 80.243516 1.406897
h 0.660405 0.788978 0.023313 0.524337 66.040476 0.683717
i 0.551329 0.121502 0.321013 0.481932 55.132923 0.872342
criteria = df['A'] < 0.2
criteria
a    False
b     True
c    False
d    False
e    False
f    False
g    False
h    False
i    False
Name: A, dtype: bool
df.loc[criteria, 'A'] = 0
df
A B C D E F
a 0.511163 0.077225 0.712719 0.703220 51.116259 1.223882
b 0.000000 0.319507 0.059459 0.076181 14.465201 0.204111
c 0.668516 0.924316 0.190403 0.144573 66.851581 0.858919
d 0.470595 0.358167 0.526109 0.548248 47.059507 0.996704
e 0.840948 0.981394 0.959002 0.870479 84.094789 1.799950
f 0.227027 0.259232 0.627810 0.249644 22.702716 0.854837
g 0.802435 0.969513 0.604462 0.871050 80.243516 1.406897
h 0.660405 0.788978 0.023313 0.524337 66.040476 0.683717
i 0.551329 0.121502 0.321013 0.481932 55.132923 0.872342
cities = ['Mumbai', 'Delhi', 'Chennai', 'Kolkata', 'Bengalure', 'Hyderabad', 'Pune', 'Ahmedabad', 'Indore']
df['city'] = cities
df_copy = df.copy()
df
A B C D E F city
a 0.511163 0.077225 0.712719 0.703220 51.116259 1.223882 Mumbai
b 0.000000 0.319507 0.059459 0.076181 14.465201 0.204111 Delhi
c 0.668516 0.924316 0.190403 0.144573 66.851581 0.858919 Chennai
d 0.470595 0.358167 0.526109 0.548248 47.059507 0.996704 Kolkata
e 0.840948 0.981394 0.959002 0.870479 84.094789 1.799950 Bengalure
f 0.227027 0.259232 0.627810 0.249644 22.702716 0.854837 Hyderabad
g 0.802435 0.969513 0.604462 0.871050 80.243516 1.406897 Pune
h 0.660405 0.788978 0.023313 0.524337 66.040476 0.683717 Ahmedabad
i 0.551329 0.121502 0.321013 0.481932 55.132923 0.872342 Indore
cities_new = ['Mumbai', 'Delhi', 'Chennai',
              'Kolkata', 'Bengalure', 'Hyderabad',
              'Pune', 'Ahmedabad', 'Guwahati']
df_copy['new_city'] = cities_new
df_copy
A B C D E F city new_city
a 0.511163 0.077225 0.712719 0.703220 51.116259 1.223882 Mumbai Mumbai
b 0.000000 0.319507 0.059459 0.076181 14.465201 0.204111 Delhi Delhi
c 0.668516 0.924316 0.190403 0.144573 66.851581 0.858919 Chennai Chennai
d 0.470595 0.358167 0.526109 0.548248 47.059507 0.996704 Kolkata Kolkata
e 0.840948 0.981394 0.959002 0.870479 84.094789 1.799950 Bengalure Bengalure
f 0.227027 0.259232 0.627810 0.249644 22.702716 0.854837 Hyderabad Hyderabad
g 0.802435 0.969513 0.604462 0.871050 80.243516 1.406897 Pune Pune
h 0.660405 0.788978 0.023313 0.524337 66.040476 0.683717 Ahmedabad Ahmedabad
i 0.551329 0.121502 0.321013 0.481932 55.132923 0.872342 Indore Guwahati
criteria = df_copy['city'].isin(['Pune', 'Bengaluru', 'Hyderabad'])
df_copy.loc[df.city == 'Bengalure', ['city', 'new_city']] = 'Bengaluru'
df_copy
A B C D E F city new_city new_cities
a 0.511163 0.077225 0.712719 0.703220 51.116259 1.223882 Mumbai Mumbai NaN
b 0.000000 0.319507 0.059459 0.076181 14.465201 0.204111 Delhi Delhi NaN
c 0.668516 0.924316 0.190403 0.144573 66.851581 0.858919 Chennai Chennai NaN
d 0.470595 0.358167 0.526109 0.548248 47.059507 0.996704 Kolkata Kolkata NaN
e 0.840948 0.981394 0.959002 0.870479 84.094789 1.799950 Bengaluru Bengaluru Bengaluru
f 0.227027 0.259232 0.627810 0.249644 22.702716 0.854837 Hyderabad Hyderabad NaN
g 0.802435 0.969513 0.604462 0.871050 80.243516 1.406897 Pune Pune NaN
h 0.660405 0.788978 0.023313 0.524337 66.040476 0.683717 Ahmedabad Ahmedabad NaN
i 0.551329 0.121502 0.321013 0.481932 55.132923 0.872342 Indore Guwahati NaN
?df_copy.drop
df_copy.drop(['new_cities'], axis=1)
A B C D E F city new_city
a 0.511163 0.077225 0.712719 0.703220 51.116259 1.223882 Mumbai Mumbai
b 0.000000 0.319507 0.059459 0.076181 14.465201 0.204111 Delhi Delhi
c 0.668516 0.924316 0.190403 0.144573 66.851581 0.858919 Chennai Chennai
d 0.470595 0.358167 0.526109 0.548248 47.059507 0.996704 Kolkata Kolkata
e 0.840948 0.981394 0.959002 0.870479 84.094789 1.799950 Bengaluru Bengaluru
f 0.227027 0.259232 0.627810 0.249644 22.702716 0.854837 Hyderabad Hyderabad
g 0.802435 0.969513 0.604462 0.871050 80.243516 1.406897 Pune Pune
h 0.660405 0.788978 0.023313 0.524337 66.040476 0.683717 Ahmedabad Ahmedabad
i 0.551329 0.121502 0.321013 0.481932 55.132923 0.872342 Indore Guwahati
criteria
a    False
b    False
c    False
d    False
e     True
f     True
g     True
h    False
i    False
Name: city, dtype: bool
df_copy.loc[:, 'city'] = 'Chennai'
df_copy = df_copy.drop(['new_cities'], axis=1)
df_copy
A B C D E F city new_city
a 0.511163 0.077225 0.712719 0.70322 51.116259 1.223882 Chennai Mumbai
b 0.0 0.319507 0.059459 0.076181 14.465201 0.204111 Chennai Delhi
c 0.668516 0.924316 0.190403 0.144573 66.851581 0.858919 Chennai Chennai
d 0.470595 0.358167 0.526109 0.548248 47.059507 0.996704 Chennai Kolkata
e 0.840948 0.981394 0.959002 0.870479 84.094789 1.79995 Chennai Bengaluru
f 0.227027 0.259232 0.62781 0.249644 22.702716 0.854837 Chennai Hyderabad
g 0.802435 0.969513 0.604462 0.87105 80.243516 1.406897 Chennai Pune
h 0.660405 0.788978 0.023313 0.524337 66.040476 0.683717 Chennai Ahmedabad
i 0.551329 0.121502 0.321013 0.481932 55.132923 0.872342 Chennai Guwahati
city Chennai Chennai Chennai Chennai Chennai Chennai Chennai Chennai
df_copy = df_copy.drop(['city'])
df_copy
A B C D E F city new_city
a 0.511163 0.077225 0.712719 0.70322 51.116259 1.223882 Chennai Mumbai
b 0.0 0.319507 0.059459 0.076181 14.465201 0.204111 Chennai Delhi
c 0.668516 0.924316 0.190403 0.144573 66.851581 0.858919 Chennai Chennai
d 0.470595 0.358167 0.526109 0.548248 47.059507 0.996704 Chennai Kolkata
e 0.840948 0.981394 0.959002 0.870479 84.094789 1.79995 Chennai Bengaluru
f 0.227027 0.259232 0.62781 0.249644 22.702716 0.854837 Chennai Hyderabad
g 0.802435 0.969513 0.604462 0.87105 80.243516 1.406897 Chennai Pune
h 0.660405 0.788978 0.023313 0.524337 66.040476 0.683717 Chennai Ahmedabad
i 0.551329 0.121502 0.321013 0.481932 55.132923 0.872342 Chennai Guwahati
?df_copy.sample
df_copy.sample(3)
A B C D E F city new_city
a 0.511163 0.077225 0.712719 0.70322 51.116259 1.223882 Chennai Mumbai
g 0.802435 0.969513 0.604462 0.87105 80.243516 1.406897 Chennai Pune
h 0.660405 0.788978 0.023313 0.524337 66.040476 0.683717 Chennai Ahmedabad
df_copy.sample(3)
A B C D E F city new_city
g 0.802435 0.969513 0.604462 0.87105 80.243516 1.406897 Chennai Pune
a 0.511163 0.077225 0.712719 0.70322 51.116259 1.223882 Chennai Mumbai
f 0.227027 0.259232 0.62781 0.249644 22.702716 0.854837 Chennai Hyderabad
df_copy.sample(3, random_state=42)
A B C D E F city new_city
h 0.660405 0.788978 0.023313 0.524337 66.040476 0.683717 Chennai Ahmedabad
b 0.0 0.319507 0.059459 0.076181 14.465201 0.204111 Chennai Delhi
f 0.227027 0.259232 0.62781 0.249644 22.702716 0.854837 Chennai Hyderabad
df_copy.sample(3, random_state=42)
A B C D E F city new_city
h 0.660405 0.788978 0.023313 0.524337 66.040476 0.683717 Chennai Ahmedabad
b 0.0 0.319507 0.059459 0.076181 14.465201 0.204111 Chennai Delhi
f 0.227027 0.259232 0.62781 0.249644 22.702716 0.854837 Chennai Hyderabad
df_copy.sample(3, random_state=42, replace=True)
A B C D E F city new_city
g 0.802435 0.969513 0.604462 0.87105 80.243516 1.406897 Chennai Pune
d 0.470595 0.358167 0.526109 0.548248 47.059507 0.996704 Chennai Kolkata
h 0.660405 0.788978 0.023313 0.524337 66.040476 0.683717 Chennai Ahmedabad
import pandas as pd

cities = ['Mumbai', 'Chennai', 'Pune', 'Ahmedabad', 'Kolkata', 'Kanpur', 'Delhi']
city_df = pd.DataFrame(cities)
city_df.columns = ['City_Name']
city_df
City_Name
0 Mumbai
1 Chennai
2 Pune
3 Ahmedabad
4 Kolkata
5 Kanpur
6 Delhi
condition_met = city_df.City_Name == 'Mumbai'
type(condition_met)
pandas.core.series.Series
city_df[condition_met]
City_Name
0 Mumbai
condition_met
0     True
1    False
2    False
3    False
4    False
5    False
6    False
Name: City_Name, dtype: bool
city_df[city_df.City_Name == 'Pune']
City_Name
2 Pune

Pandas cheatsheet

Agggregation and grouping

import numpy as np
random_state = np.random.RandomState(100)
random_series = pd.Series(random_state.rand(10))
random_series
0    0.543405
1    0.278369
2    0.424518
3    0.844776
4    0.004719
5    0.121569
6    0.670749
7    0.825853
8    0.136707
9    0.575093
dtype: float64

Now that the series is created, we can try aggregation functions like mean, std etc.

random_series.mean()
0.4425757785871915
random_series.std()
0.2988992029497061
random_series.sum()
4.425757785871915

Let’s create a dataframe with 5 rows and 2 columns containing random numbers.

df = pd.DataFrame({'A': random_state.rand(5),
                   'B': random_state.rand(5)})
df
A B
0 0.891322 0.978624
1 0.209202 0.811683
2 0.185328 0.171941
3 0.108377 0.816225
4 0.219697 0.274074
df.sum()
A    1.613927
B    3.052546
dtype: float64
df.mean()
A    0.322785
B    0.610509
dtype: float64

Try these operations row-wise

df.mean(axis=1)
0    0.934973
1    0.510443
2    0.178635
3    0.462301
4    0.246886
dtype: float64
df.sum(axis=1)
0    1.869946
1    1.020885
2    0.357269
3    0.924602
4    0.493771
dtype: float64

Groupby

Three stages * Split - we split dataframe into multiple smaller dataframe based on the values of keys * Apply - we apply desired aggregation/transformation on each dataframe. * Combine - we combine results from apply state into a dataframe

split-apply-combine.png
list("ABCABC")
['A', 'B', 'C', 'A', 'B', 'C']
df = pd.DataFrame({'key': ['A','B','C']*2, #list("ABCABC"), ['A','B','C','A','B','C']
                   'data': range(6)})
df
key data
0 A 0
1 B 1
2 C 2
3 A 3
4 B 4
5 C 5
df.groupby("key")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x78e73c1c8040>
df.groupby("key").sum()
data
key
A 3
B 5
C 7