Introduction to Pandas

Learning Objectives:

Gain an introduction to the DataFrame and Series data structures of the pandas library
Access and manipulate data within a DataFrame and Series

pandas is a column-oriented data analysis API. It’s a great tool for handling and analyzing input data, and many ML frameworks support pandas data structures as inputs.

Although a comprehensive introduction to the API would span many pages, the core concepts are fairly straightforward, and we’ll present them below. For a more complete reference, the pandas docs site contains extensive documentation and many tutorials. (Note that Colab may use a slightly older version number, but the parts of pandas covered here are unlikely to differ from version to version.)

Let’s import pandas library

import pandas as pd
pd.__version__

'1.5.3'

The primary data structures in pandas are implemented as two classes:

DataFrame, which you can imagine as a relational data table, with rows and named columns.
Series, which is a single column. Each row can be labeled via an index. A DataFrame contains one or more Series and a name for each Series.

The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in Spark and R.

from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)
print(type(diabetes['data']))

<class 'pandas.core.frame.DataFrame'>

?load_diabetes

print(diabetes.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

df = diabetes['data']

Creating pandas dataframe from series

cities = pd.Series(['Mumbai', 'Bangalore', 'Chennai', 'Delhi'])
population = pd.Series([17000000, 13000000, 6000000])

city_info_df = pd.DataFrame({'City': cities, 'Population': population})

type(city_info_df)

pandas.core.frame.DataFrame

city_info_df

	City	Population
0	Mumbai	17000000.0
1	Bangalore	13000000.0
2	Chennai	6000000.0
3	Delhi	NaN

Exploring data in dataframe

Find out the number of rows and columns in the dataframe.

df.shape

(442, 10)

In this case, there are 442 rows and 10 columns.

Let’s see the names of the columns

df.columns

Index(['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], dtype='object')

list(df.columns)

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    float64
 1   sex     442 non-null    float64
 2   bmi     442 non-null    float64
 3   bp      442 non-null    float64
 4   s1      442 non-null    float64
 5   s2      442 non-null    float64
 6   s3      442 non-null    float64
 7   s4      442 non-null    float64
 8   s5      442 non-null    float64
 9   s6      442 non-null    float64
dtypes: float64(10)
memory usage: 34.7 KB

Quickly examine a few entries in the dataframe. Say first 5 and last 5.

? df.head

df.head()

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6
0	0.038076	0.050680	0.061696	0.021872	-0.044223	-0.034821	-0.043401	-0.002592	0.019907	-0.017646
1	-0.001882	-0.044642	-0.051474	-0.026328	-0.008449	-0.019163	0.074412	-0.039493	-0.068332	-0.092204
2	0.085299	0.050680	0.044451	-0.005670	-0.045599	-0.034194	-0.032356	-0.002592	0.002861	-0.025930
3	-0.089063	-0.044642	-0.011595	-0.036656	0.012191	0.024991	-0.036038	0.034309	0.022688	-0.009362
4	0.005383	-0.044642	-0.036385	0.021872	0.003935	0.015596	0.008142	-0.002592	-0.031988	-0.046641

df.tail()

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6
437	0.041708	0.050680	0.019662	0.059744	-0.005697	-0.002566	-0.028674	-0.002592	0.031193	0.007207
438	-0.005515	0.050680	-0.015906	-0.067642	0.049341	0.079165	-0.028674	0.034309	-0.018114	0.044485
439	0.041708	0.050680	-0.015906	0.017293	-0.037344	-0.013840	-0.024993	-0.011080	-0.046883	0.015491
440	-0.045472	-0.044642	0.039062	0.001215	0.016318	0.015283	-0.028674	0.026560	0.044529	-0.025930
441	-0.045472	-0.044642	-0.073030	-0.081413	0.083740	0.027809	0.173816	-0.039493	-0.004222	0.003064

Let’s get summary statistics on the dataframe

df.describe()

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6
count	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02
mean	-2.511817e-19	1.230790e-17	-2.245564e-16	-4.797570e-17	-1.381499e-17	3.918434e-17	-5.777179e-18	-9.042540e-18	9.293722e-17	1.130318e-17
std	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02
min	-1.072256e-01	-4.464164e-02	-9.027530e-02	-1.123988e-01	-1.267807e-01	-1.156131e-01	-1.023071e-01	-7.639450e-02	-1.260971e-01	-1.377672e-01
25%	-3.729927e-02	-4.464164e-02	-3.422907e-02	-3.665608e-02	-3.424784e-02	-3.035840e-02	-3.511716e-02	-3.949338e-02	-3.324559e-02	-3.317903e-02
50%	5.383060e-03	-4.464164e-02	-7.283766e-03	-5.670422e-03	-4.320866e-03	-3.819065e-03	-6.584468e-03	-2.592262e-03	-1.947171e-03	-1.077698e-03
75%	3.807591e-02	5.068012e-02	3.124802e-02	3.564379e-02	2.835801e-02	2.984439e-02	2.931150e-02	3.430886e-02	3.243232e-02	2.791705e-02
max	1.107267e-01	5.068012e-02	1.705552e-01	1.320436e-01	1.539137e-01	1.987880e-01	1.811791e-01	1.852344e-01	1.335973e-01	1.356118e-01

df.describe(percentiles=[0.2, 0.6, 0.8])

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6
count	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02	4.420000e+02
mean	-2.511817e-19	1.230790e-17	-2.245564e-16	-4.797570e-17	-1.381499e-17	3.918434e-17	-5.777179e-18	-9.042540e-18	9.293722e-17	1.130318e-17
std	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02	4.761905e-02
min	-1.072256e-01	-4.464164e-02	-9.027530e-02	-1.123988e-01	-1.267807e-01	-1.156131e-01	-1.023071e-01	-7.639450e-02	-1.260971e-01	-1.377672e-01
20%	-4.547248e-02	-4.464164e-02	-4.048038e-02	-4.009893e-02	-3.871969e-02	-3.695017e-02	-3.971921e-02	-3.949338e-02	-4.117617e-02	-3.835666e-02
50%	5.383060e-03	-4.464164e-02	-7.283766e-03	-5.670422e-03	-4.320866e-03	-3.819065e-03	-6.584468e-03	-2.592262e-03	-1.947171e-03	-1.077698e-03
60%	1.628068e-02	5.068012e-02	5.218854e-03	8.100982e-03	8.062710e-03	8.706873e-03	8.142084e-03	-2.592262e-03	1.255119e-02	7.206516e-03
80%	4.170844e-02	5.068012e-02	4.229559e-02	4.941519e-02	3.943444e-02	3.952068e-02	3.759519e-02	3.430886e-02	3.885335e-02	3.620126e-02
max	1.107267e-01	5.068012e-02	1.705552e-01	1.320436e-01	1.539137e-01	1.987880e-01	1.811791e-01	1.852344e-01	1.335973e-01	1.356118e-01

df.describe().T

	count	mean	std	min	25%	50%	75%	max
age	442.0	-2.511817e-19	0.047619	-0.107226	-0.037299	0.005383	0.038076	0.110727
sex	442.0	1.230790e-17	0.047619	-0.044642	-0.044642	-0.044642	0.050680	0.050680
bmi	442.0	-2.245564e-16	0.047619	-0.090275	-0.034229	-0.007284	0.031248	0.170555
bp	442.0	-4.797570e-17	0.047619	-0.112399	-0.036656	-0.005670	0.035644	0.132044
s1	442.0	-1.381499e-17	0.047619	-0.126781	-0.034248	-0.004321	0.028358	0.153914
s2	442.0	3.918434e-17	0.047619	-0.115613	-0.030358	-0.003819	0.029844	0.198788
s3	442.0	-5.777179e-18	0.047619	-0.102307	-0.035117	-0.006584	0.029312	0.181179
s4	442.0	-9.042540e-18	0.047619	-0.076395	-0.039493	-0.002592	0.034309	0.185234
s5	442.0	9.293722e-17	0.047619	-0.126097	-0.033246	-0.001947	0.032432	0.133597
s6	442.0	1.130318e-17	0.047619	-0.137767	-0.033179	-0.001078	0.027917	0.135612

Selection

df.columns

Index(['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], dtype='object')

df['age']

0      0.038076
1     -0.001882
2      0.085299
3     -0.089063
4      0.005383
         ...   
437    0.041708
438   -0.005515
439    0.041708
440   -0.045472
441   -0.045472
Name: age, Length: 442, dtype: float64

type(df['age'])

pandas.core.series.Series

df['age'][0]

0.038075906433423026

df['age'][:5]

0    0.038076
1   -0.001882
2    0.085299
3   -0.089063
4    0.005383
Name: age, dtype: float64

df['age'][-5:]

437    0.041708
438   -0.005515
439    0.041708
440   -0.045472
441   -0.045472
Name: age, dtype: float64

df['age'][100:200]

100    0.016281
101    0.016281
102   -0.092695
103    0.059871
104   -0.027310
         ...   
195    0.027178
196   -0.023677
197    0.048974
198   -0.052738
199    0.041708
Name: age, Length: 100, dtype: float64

df[['age', 'sex']]

	age	sex
0	0.038076	0.050680
1	-0.001882	-0.044642
2	0.085299	0.050680
3	-0.089063	-0.044642
4	0.005383	-0.044642
...	...	...
437	0.041708	0.050680
438	-0.005515	0.050680
439	0.041708	0.050680
440	-0.045472	-0.044642
441	-0.045472	-0.044642

442 rows × 2 columns

df[['age', 'sex']][:5]

	age	sex
0	0.038076	0.050680
1	-0.001882	-0.044642
2	0.085299	0.050680
3	-0.089063	-0.044642
4	0.005383	-0.044642

df[['age', 'sex']][-5:]

	age	sex
437	0.041708	0.050680
438	-0.005515	0.050680
439	0.041708	0.050680
440	-0.045472	-0.044642
441	-0.045472	-0.044642

df[['age', 'sex']]

	age	sex
0	0.038076	0.050680
1	-0.001882	-0.044642
2	0.085299	0.050680
3	-0.089063	-0.044642
4	0.005383	-0.044642
...	...	...
437	0.041708	0.050680
438	-0.005515	0.050680
439	0.041708	0.050680
440	-0.045472	-0.044642
441	-0.045472	-0.044642

442 rows × 2 columns

.loc
.iloc

df.iloc[441]

age   -0.045472
sex   -0.044642
bmi   -0.073030
bp    -0.081413
s1     0.083740
s2     0.027809
s3     0.173816
s4    -0.039493
s5    -0.004222
s6     0.003064
Name: 441, dtype: float64

df.iloc[0]

age    0.038076
sex    0.050680
bmi    0.061696
bp     0.021872
s1    -0.044223
s2    -0.034821
s3    -0.043401
s4    -0.002592
s5     0.019907
s6    -0.017646
Name: 0, dtype: float64

df.loc[0]

age    0.038076
sex    0.050680
bmi    0.061696
bp     0.021872
s1    -0.044223
s2    -0.034821
s3    -0.043401
s4    -0.002592
s5     0.019907
s6    -0.017646
Name: 0, dtype: float64

df.head(n=5)

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6
0	0.038076	0.050680	0.061696	0.021872	-0.044223	-0.034821	-0.043401	-0.002592	0.019907	-0.017646
1	-0.001882	-0.044642	-0.051474	-0.026328	-0.008449	-0.019163	0.074412	-0.039493	-0.068332	-0.092204
2	0.085299	0.050680	0.044451	-0.005670	-0.045599	-0.034194	-0.032356	-0.002592	0.002861	-0.025930
3	-0.089063	-0.044642	-0.011595	-0.036656	0.012191	0.024991	-0.036038	0.034309	0.022688	-0.009362
4	0.005383	-0.044642	-0.036385	0.021872	0.003935	0.015596	0.008142	-0.002592	-0.031988	-0.046641

df.loc[4, 'age']

0.005383060374248237

df.loc[4, ['age', 'sex']]

age    0.005383
sex   -0.044642
Name: 4, dtype: float64

df.iloc[4, [1, 5, 7, 9]]

sex   -0.044642
s2     0.015596
s4    -0.002592
s6    -0.046641
Name: 4, dtype: float64

rows_condition_met = df.age > 5.383060e-03

df.loc[rows_condition_met]

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6
0	0.038076	0.050680	0.061696	0.021872	-0.044223	-0.034821	-0.043401	-0.002592	0.019907	-0.017646
2	0.085299	0.050680	0.044451	-0.005670	-0.045599	-0.034194	-0.032356	-0.002592	0.002861	-0.025930
4	0.005383	-0.044642	-0.036385	0.021872	0.003935	0.015596	0.008142	-0.002592	-0.031988	-0.046641
7	0.063504	0.050680	-0.001895	0.066629	0.090620	0.108914	0.022869	0.017703	-0.035816	0.003064
8	0.041708	0.050680	0.061696	-0.040099	-0.013953	0.006202	-0.028674	-0.002592	-0.014960	0.011349
...	...	...	...	...	...	...	...	...	...	...
431	0.070769	0.050680	-0.030996	0.021872	-0.037344	-0.047034	0.033914	-0.039493	-0.014960	-0.001078
432	0.009016	-0.044642	0.055229	-0.005670	0.057597	0.044719	-0.002903	0.023239	0.055686	0.106617
434	0.016281	-0.044642	0.001339	0.008101	0.005311	0.010899	0.030232	-0.039493	-0.045424	0.032059
437	0.041708	0.050680	0.019662	0.059744	-0.005697	-0.002566	-0.028674	-0.002592	0.031193	0.007207
439	0.041708	0.050680	-0.015906	0.017293	-0.037344	-0.013840	-0.024993	-0.011080	-0.046883	0.015491

228 rows × 10 columns

df.loc[df.age > 5.383060e-03]

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6
0	0.038076	0.050680	0.061696	0.021872	-0.044223	-0.034821	-0.043401	-0.002592	0.019907	-0.017646
2	0.085299	0.050680	0.044451	-0.005670	-0.045599	-0.034194	-0.032356	-0.002592	0.002861	-0.025930
4	0.005383	-0.044642	-0.036385	0.021872	0.003935	0.015596	0.008142	-0.002592	-0.031988	-0.046641
7	0.063504	0.050680	-0.001895	0.066629	0.090620	0.108914	0.022869	0.017703	-0.035816	0.003064
8	0.041708	0.050680	0.061696	-0.040099	-0.013953	0.006202	-0.028674	-0.002592	-0.014960	0.011349
...	...	...	...	...	...	...	...	...	...	...
431	0.070769	0.050680	-0.030996	0.021872	-0.037344	-0.047034	0.033914	-0.039493	-0.014960	-0.001078
432	0.009016	-0.044642	0.055229	-0.005670	0.057597	0.044719	-0.002903	0.023239	0.055686	0.106617
434	0.016281	-0.044642	0.001339	0.008101	0.005311	0.010899	0.030232	-0.039493	-0.045424	0.032059
437	0.041708	0.050680	0.019662	0.059744	-0.005697	-0.002566	-0.028674	-0.002592	0.031193	0.007207
439	0.041708	0.050680	-0.015906	0.017293	-0.037344	-0.013840	-0.024993	-0.011080	-0.046883	0.015491

228 rows × 10 columns

age_df_temp = df.loc[df.age < 5.383060e-03]

age_df_temp = df[df.age < 5.383060e-03]

age_df_temp.head()

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6
1	-0.001882	-0.044642	-0.051474	-0.026328	-0.008449	-0.019163	0.074412	-0.039493	-0.068332	-0.092204
3	-0.089063	-0.044642	-0.011595	-0.036656	0.012191	0.024991	-0.036038	0.034309	0.022688	-0.009362
5	-0.092695	-0.044642	-0.040696	-0.019442	-0.068991	-0.079288	0.041277	-0.076395	-0.041176	-0.096346
6	-0.045472	0.050680	-0.047163	-0.015999	-0.040096	-0.024800	0.000779	-0.039493	-0.062917	-0.038357
9	-0.070900	-0.044642	0.039062	-0.033213	-0.012577	-0.034508	-0.024993	-0.002592	0.067737	-0.013504

age_df_temp.iloc[2]

age   -0.092695
sex   -0.044642
bmi   -0.040696
bp    -0.019442
s1    -0.068991
s2    -0.079288
s3     0.041277
s4    -0.076395
s5    -0.041176
s6    -0.096346
Name: 5, dtype: float64

age_df_temp.loc[1]

age   -0.001882
sex   -0.044642
bmi   -0.051474
bp    -0.026328
s1    -0.008449
s2    -0.019163
s3     0.074412
s4    -0.039493
s5    -0.068332
s6    -0.092204
Name: 1, dtype: float64

age_df_temp = df.loc[(df.age < 5.383060e-03) & (df.sex > -4.464164e-02 )]

age_df_temp.shape

(214, 10)

age_df_temp

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6
1	-0.001882	-0.044642	-0.051474	-0.026328	-0.008449	-0.019163	0.074412	-0.039493	-0.068332	-0.092204
3	-0.089063	-0.044642	-0.011595	-0.036656	0.012191	0.024991	-0.036038	0.034309	0.022688	-0.009362
5	-0.092695	-0.044642	-0.040696	-0.019442	-0.068991	-0.079288	0.041277	-0.076395	-0.041176	-0.096346
6	-0.045472	0.050680	-0.047163	-0.015999	-0.040096	-0.024800	0.000779	-0.039493	-0.062917	-0.038357
9	-0.070900	-0.044642	0.039062	-0.033213	-0.012577	-0.034508	-0.024993	-0.002592	0.067737	-0.013504
...	...	...	...	...	...	...	...	...	...	...
435	-0.012780	-0.044642	-0.023451	-0.040099	-0.016704	0.004636	-0.017629	-0.002592	-0.038460	-0.038357
436	-0.056370	-0.044642	-0.074108	-0.050427	-0.024960	-0.047034	0.092820	-0.076395	-0.061176	-0.046641
438	-0.005515	0.050680	-0.015906	-0.067642	0.049341	0.079165	-0.028674	0.034309	-0.018114	0.044485
440	-0.045472	-0.044642	0.039062	0.001215	0.016318	0.015283	-0.028674	0.026560	0.044529	-0.025930
441	-0.045472	-0.044642	-0.073030	-0.081413	0.083740	0.027809	0.173816	-0.039493	-0.004222	0.003064

214 rows × 10 columns

list("ABCD")

['A', 'B', 'C', 'D']

list("abcdefghi")

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

range(100)

range(0, 100)

import numpy as np

another_df = pd.DataFrame(
    np.random.rand(100, 4),
    index=range(10, 110),
    columns=list("ABCD"))

another_df.head()

	A	B	C	D
10	0.128088	0.963425	0.450839	0.887946
11	0.441873	0.635796	0.576356	0.206467
12	0.467190	0.512398	0.255862	0.829151
13	0.350556	0.748146	0.649348	0.158734
14	0.861936	0.355931	0.235678	0.166667

another_df.tail()

	A	B	C	D
105	0.531472	0.838437	0.540441	0.691833
106	0.253700	0.651600	0.810953	0.275363
107	0.932707	0.076388	0.415450	0.918451
108	0.176879	0.149722	0.242837	0.214613
109	0.908052	0.453071	0.427724	0.433188

df = pd.DataFrame(np.random.rand(9, 4), index=list("abcdefghi"), columns=list("ABCD"))

df.shape

(9, 4)

df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9 entries, a to i
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       9 non-null      float64
 1   B       9 non-null      float64
 2   C       9 non-null      float64
 3   D       9 non-null      float64
dtypes: float64(4)
memory usage: 660.0+ bytes

df.head()

	A	B	C	D
a	0.511163	0.077225	0.712719	0.703220
b	0.144652	0.319507	0.059459	0.076181
c	0.668516	0.924316	0.190403	0.144573
d	0.470595	0.358167	0.526109	0.548248
e	0.840948	0.981394	0.959002	0.870479

df.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'], dtype='object')

df.loc['a']

A    0.511163
B    0.077225
C    0.712719
D    0.703220
Name: a, dtype: float64

df.loc['a', 'A':'D']

A    0.511163
B    0.077225
C    0.712719
D    0.703220
Name: a, dtype: float64

df.loc['a':'d', :]

	A	B	C	D
a	0.511163	0.077225	0.712719	0.703220
b	0.144652	0.319507	0.059459	0.076181
c	0.668516	0.924316	0.190403	0.144573
d	0.470595	0.358167	0.526109	0.548248

df.iloc[:4, 0:3]

	A	B	C
a	0.511163	0.077225	0.712719
b	0.144652	0.319507	0.059459
c	0.668516	0.924316	0.190403
d	0.470595	0.358167	0.526109

df.iloc[:4, 1:3]

	B	C
a	0.077225	0.712719
b	0.319507	0.059459
c	0.924316	0.190403
d	0.358167	0.526109

df.iloc[0:4, :]

	A	B	C	D
a	0.511163	0.077225	0.712719	0.703220
b	0.144652	0.319507	0.059459	0.076181
c	0.668516	0.924316	0.190403	0.144573
d	0.470595	0.358167	0.526109	0.548248

df.iloc[0:4]

	A	B	C	D
a	0.511163	0.077225	0.712719	0.703220
b	0.144652	0.319507	0.059459	0.076181
c	0.668516	0.924316	0.190403	0.144573
d	0.470595	0.358167	0.526109	0.548248

df.iloc[:, 1]

a    0.077225
b    0.319507
c    0.924316
d    0.358167
e    0.981394
f    0.259232
g    0.969513
h    0.788978
i    0.121502
Name: B, dtype: float64

selector = lambda df: df['A'] > 0

selector

<function __main__.<lambda>(df)>

df.loc[selector]

	A	B	C	D
a	0.511163	0.077225	0.712719	0.703220
b	0.144652	0.319507	0.059459	0.076181
c	0.668516	0.924316	0.190403	0.144573
d	0.470595	0.358167	0.526109	0.548248
e	0.840948	0.981394	0.959002	0.870479
f	0.227027	0.259232	0.627810	0.249644
g	0.802435	0.969513	0.604462	0.871050
h	0.660405	0.788978	0.023313	0.524337
i	0.551329	0.121502	0.321013	0.481932

selector = lambda df: df['A'] > 0.5

df.loc[selector]

	A	B	C	D
a	0.511163	0.077225	0.712719	0.703220
c	0.668516	0.924316	0.190403	0.144573
e	0.840948	0.981394	0.959002	0.870479
g	0.802435	0.969513	0.604462	0.871050
h	0.660405	0.788978	0.023313	0.524337
i	0.551329	0.121502	0.321013	0.481932

selector = lambda df: (df['A'] > 0.5)&(df['B'] < 0.2)

df.loc[selector]

	A	B	C	D
a	0.511163	0.077225	0.712719	0.703220
i	0.551329	0.121502	0.321013	0.481932

condition_for_selection = (df['A'] > 0.5)&(df['B'] < 0.2)

condition_for_selection

a     True
b    False
c    False
d    False
e    False
f    False
g    False
h    False
i     True
dtype: bool

df[condition_for_selection]

	A	B	C	D
a	0.511163	0.077225	0.712719	0.703220
i	0.551329	0.121502	0.321013	0.481932

condition_for_selection = (df['A'] > 0.5) | ~(df['B'] < 0.2)

df[condition_for_selection]

	A	B	C	D
a	0.511163	0.077225	0.712719	0.703220
b	0.144652	0.319507	0.059459	0.076181
c	0.668516	0.924316	0.190403	0.144573
d	0.470595	0.358167	0.526109	0.548248
e	0.840948	0.981394	0.959002	0.870479
f	0.227027	0.259232	0.627810	0.249644
g	0.802435	0.969513	0.604462	0.871050
h	0.660405	0.788978	0.023313	0.524337
i	0.551329	0.121502	0.321013	0.481932

OR operator: | NOT operator: ~

Adding a column in the dataframe

df['E'] = df['A']*100

df

	A	B	C	D	E
a	0.511163	0.077225	0.712719	0.703220	51.116259
b	0.144652	0.319507	0.059459	0.076181	14.465201
c	0.668516	0.924316	0.190403	0.144573	66.851581
d	0.470595	0.358167	0.526109	0.548248	47.059507
e	0.840948	0.981394	0.959002	0.870479	84.094789
f	0.227027	0.259232	0.627810	0.249644	22.702716
g	0.802435	0.969513	0.604462	0.871050	80.243516
h	0.660405	0.788978	0.023313	0.524337	66.040476
i	0.551329	0.121502	0.321013	0.481932	55.132923

df["F"] = df["A"] + df["C"]

df

	A	B	C	D	E	F
a	0.511163	0.077225	0.712719	0.703220	51.116259	1.223882
b	0.144652	0.319507	0.059459	0.076181	14.465201	0.204111
c	0.668516	0.924316	0.190403	0.144573	66.851581	0.858919
d	0.470595	0.358167	0.526109	0.548248	47.059507	0.996704
e	0.840948	0.981394	0.959002	0.870479	84.094789	1.799950
f	0.227027	0.259232	0.627810	0.249644	22.702716	0.854837
g	0.802435	0.969513	0.604462	0.871050	80.243516	1.406897
h	0.660405	0.788978	0.023313	0.524337	66.040476	0.683717
i	0.551329	0.121502	0.321013	0.481932	55.132923	0.872342

criteria = df['A'] < 0.2

criteria

a    False
b     True
c    False
d    False
e    False
f    False
g    False
h    False
i    False
Name: A, dtype: bool

df.loc[criteria, 'A'] = 0

df

	A	B	C	D	E	F
a	0.511163	0.077225	0.712719	0.703220	51.116259	1.223882
b	0.000000	0.319507	0.059459	0.076181	14.465201	0.204111
c	0.668516	0.924316	0.190403	0.144573	66.851581	0.858919
d	0.470595	0.358167	0.526109	0.548248	47.059507	0.996704
e	0.840948	0.981394	0.959002	0.870479	84.094789	1.799950
f	0.227027	0.259232	0.627810	0.249644	22.702716	0.854837
g	0.802435	0.969513	0.604462	0.871050	80.243516	1.406897
h	0.660405	0.788978	0.023313	0.524337	66.040476	0.683717
i	0.551329	0.121502	0.321013	0.481932	55.132923	0.872342

cities = ['Mumbai', 'Delhi', 'Chennai', 'Kolkata', 'Bengalure', 'Hyderabad', 'Pune', 'Ahmedabad', 'Indore']

df['city'] = cities

df_copy = df.copy()

df

	A	B	C	D	E	F	city
a	0.511163	0.077225	0.712719	0.703220	51.116259	1.223882	Mumbai
b	0.000000	0.319507	0.059459	0.076181	14.465201	0.204111	Delhi
c	0.668516	0.924316	0.190403	0.144573	66.851581	0.858919	Chennai
d	0.470595	0.358167	0.526109	0.548248	47.059507	0.996704	Kolkata
e	0.840948	0.981394	0.959002	0.870479	84.094789	1.799950	Bengalure
f	0.227027	0.259232	0.627810	0.249644	22.702716	0.854837	Hyderabad
g	0.802435	0.969513	0.604462	0.871050	80.243516	1.406897	Pune
h	0.660405	0.788978	0.023313	0.524337	66.040476	0.683717	Ahmedabad
i	0.551329	0.121502	0.321013	0.481932	55.132923	0.872342	Indore

cities_new = ['Mumbai', 'Delhi', 'Chennai',
              'Kolkata', 'Bengalure', 'Hyderabad',
              'Pune', 'Ahmedabad', 'Guwahati']

df_copy['new_city'] = cities_new

df_copy

	A	B	C	D	E	F	city	new_city
a	0.511163	0.077225	0.712719	0.703220	51.116259	1.223882	Mumbai	Mumbai
b	0.000000	0.319507	0.059459	0.076181	14.465201	0.204111	Delhi	Delhi
c	0.668516	0.924316	0.190403	0.144573	66.851581	0.858919	Chennai	Chennai
d	0.470595	0.358167	0.526109	0.548248	47.059507	0.996704	Kolkata	Kolkata
e	0.840948	0.981394	0.959002	0.870479	84.094789	1.799950	Bengalure	Bengalure
f	0.227027	0.259232	0.627810	0.249644	22.702716	0.854837	Hyderabad	Hyderabad
g	0.802435	0.969513	0.604462	0.871050	80.243516	1.406897	Pune	Pune
h	0.660405	0.788978	0.023313	0.524337	66.040476	0.683717	Ahmedabad	Ahmedabad
i	0.551329	0.121502	0.321013	0.481932	55.132923	0.872342	Indore	Guwahati

criteria = df_copy['city'].isin(['Pune', 'Bengaluru', 'Hyderabad'])

df_copy.loc[df.city == 'Bengalure', ['city', 'new_city']] = 'Bengaluru'

df_copy

	A	B	C	D	E	F	city	new_city	new_cities
a	0.511163	0.077225	0.712719	0.703220	51.116259	1.223882	Mumbai	Mumbai	NaN
b	0.000000	0.319507	0.059459	0.076181	14.465201	0.204111	Delhi	Delhi	NaN
c	0.668516	0.924316	0.190403	0.144573	66.851581	0.858919	Chennai	Chennai	NaN
d	0.470595	0.358167	0.526109	0.548248	47.059507	0.996704	Kolkata	Kolkata	NaN
e	0.840948	0.981394	0.959002	0.870479	84.094789	1.799950	Bengaluru	Bengaluru	Bengaluru
f	0.227027	0.259232	0.627810	0.249644	22.702716	0.854837	Hyderabad	Hyderabad	NaN
g	0.802435	0.969513	0.604462	0.871050	80.243516	1.406897	Pune	Pune	NaN
h	0.660405	0.788978	0.023313	0.524337	66.040476	0.683717	Ahmedabad	Ahmedabad	NaN
i	0.551329	0.121502	0.321013	0.481932	55.132923	0.872342	Indore	Guwahati	NaN

?df_copy.drop

df_copy.drop(['new_cities'], axis=1)

	A	B	C	D	E	F	city	new_city
a	0.511163	0.077225	0.712719	0.703220	51.116259	1.223882	Mumbai	Mumbai
b	0.000000	0.319507	0.059459	0.076181	14.465201	0.204111	Delhi	Delhi
c	0.668516	0.924316	0.190403	0.144573	66.851581	0.858919	Chennai	Chennai
d	0.470595	0.358167	0.526109	0.548248	47.059507	0.996704	Kolkata	Kolkata
e	0.840948	0.981394	0.959002	0.870479	84.094789	1.799950	Bengaluru	Bengaluru
f	0.227027	0.259232	0.627810	0.249644	22.702716	0.854837	Hyderabad	Hyderabad
g	0.802435	0.969513	0.604462	0.871050	80.243516	1.406897	Pune	Pune
h	0.660405	0.788978	0.023313	0.524337	66.040476	0.683717	Ahmedabad	Ahmedabad
i	0.551329	0.121502	0.321013	0.481932	55.132923	0.872342	Indore	Guwahati

criteria

a    False
b    False
c    False
d    False
e     True
f     True
g     True
h    False
i    False
Name: city, dtype: bool

df_copy.loc[:, 'city'] = 'Chennai'

df_copy = df_copy.drop(['new_cities'], axis=1)

df_copy

	A	B	C	D	E	F	city	new_city
a	0.511163	0.077225	0.712719	0.70322	51.116259	1.223882	Chennai	Mumbai
b	0.0	0.319507	0.059459	0.076181	14.465201	0.204111	Chennai	Delhi
c	0.668516	0.924316	0.190403	0.144573	66.851581	0.858919	Chennai	Chennai
d	0.470595	0.358167	0.526109	0.548248	47.059507	0.996704	Chennai	Kolkata
e	0.840948	0.981394	0.959002	0.870479	84.094789	1.79995	Chennai	Bengaluru
f	0.227027	0.259232	0.62781	0.249644	22.702716	0.854837	Chennai	Hyderabad
g	0.802435	0.969513	0.604462	0.87105	80.243516	1.406897	Chennai	Pune
h	0.660405	0.788978	0.023313	0.524337	66.040476	0.683717	Chennai	Ahmedabad
i	0.551329	0.121502	0.321013	0.481932	55.132923	0.872342	Chennai	Guwahati
city	Chennai	Chennai	Chennai	Chennai	Chennai	Chennai	Chennai	Chennai

df_copy = df_copy.drop(['city'])

df_copy

	A	B	C	D	E	F	city	new_city
a	0.511163	0.077225	0.712719	0.70322	51.116259	1.223882	Chennai	Mumbai
b	0.0	0.319507	0.059459	0.076181	14.465201	0.204111	Chennai	Delhi
c	0.668516	0.924316	0.190403	0.144573	66.851581	0.858919	Chennai	Chennai
d	0.470595	0.358167	0.526109	0.548248	47.059507	0.996704	Chennai	Kolkata
e	0.840948	0.981394	0.959002	0.870479	84.094789	1.79995	Chennai	Bengaluru
f	0.227027	0.259232	0.62781	0.249644	22.702716	0.854837	Chennai	Hyderabad
g	0.802435	0.969513	0.604462	0.87105	80.243516	1.406897	Chennai	Pune
h	0.660405	0.788978	0.023313	0.524337	66.040476	0.683717	Chennai	Ahmedabad
i	0.551329	0.121502	0.321013	0.481932	55.132923	0.872342	Chennai	Guwahati

?df_copy.sample

df_copy.sample(3)

	A	B	C	D	E	F	city	new_city
a	0.511163	0.077225	0.712719	0.70322	51.116259	1.223882	Chennai	Mumbai
g	0.802435	0.969513	0.604462	0.87105	80.243516	1.406897	Chennai	Pune
h	0.660405	0.788978	0.023313	0.524337	66.040476	0.683717	Chennai	Ahmedabad

df_copy.sample(3)

	A	B	C	D	E	F	city	new_city
g	0.802435	0.969513	0.604462	0.87105	80.243516	1.406897	Chennai	Pune
a	0.511163	0.077225	0.712719	0.70322	51.116259	1.223882	Chennai	Mumbai
f	0.227027	0.259232	0.62781	0.249644	22.702716	0.854837	Chennai	Hyderabad

df_copy.sample(3, random_state=42)

	A	B	C	D	E	F	city	new_city
h	0.660405	0.788978	0.023313	0.524337	66.040476	0.683717	Chennai	Ahmedabad
b	0.0	0.319507	0.059459	0.076181	14.465201	0.204111	Chennai	Delhi
f	0.227027	0.259232	0.62781	0.249644	22.702716	0.854837	Chennai	Hyderabad

df_copy.sample(3, random_state=42)

	A	B	C	D	E	F	city	new_city
h	0.660405	0.788978	0.023313	0.524337	66.040476	0.683717	Chennai	Ahmedabad
b	0.0	0.319507	0.059459	0.076181	14.465201	0.204111	Chennai	Delhi
f	0.227027	0.259232	0.62781	0.249644	22.702716	0.854837	Chennai	Hyderabad

df_copy.sample(3, random_state=42, replace=True)

	A	B	C	D	E	F	city	new_city
g	0.802435	0.969513	0.604462	0.87105	80.243516	1.406897	Chennai	Pune
d	0.470595	0.358167	0.526109	0.548248	47.059507	0.996704	Chennai	Kolkata
h	0.660405	0.788978	0.023313	0.524337	66.040476	0.683717	Chennai	Ahmedabad

import pandas as pd

cities = ['Mumbai', 'Chennai', 'Pune', 'Ahmedabad', 'Kolkata', 'Kanpur', 'Delhi']
city_df = pd.DataFrame(cities)

city_df.columns = ['City_Name']

city_df

	City_Name
0	Mumbai
1	Chennai
2	Pune
3	Ahmedabad
4	Kolkata
5	Kanpur
6	Delhi

condition_met = city_df.City_Name == 'Mumbai'

type(condition_met)

pandas.core.series.Series

city_df[condition_met]

	City_Name
0	Mumbai

condition_met

0     True
1    False
2    False
3    False
4    False
5    False
6    False
Name: City_Name, dtype: bool

city_df[city_df.City_Name == 'Pune']

	City_Name
2	Pune

Pandas cheatsheet

Agggregation and grouping

import numpy as np

random_state = np.random.RandomState(100)
random_series = pd.Series(random_state.rand(10))

random_series

0    0.543405
1    0.278369
2    0.424518
3    0.844776
4    0.004719
5    0.121569
6    0.670749
7    0.825853
8    0.136707
9    0.575093
dtype: float64

Now that the series is created, we can try aggregation functions like mean, std etc.

random_series.mean()

0.4425757785871915

random_series.std()

0.2988992029497061

random_series.sum()

4.425757785871915

Let’s create a dataframe with 5 rows and 2 columns containing random numbers.

df = pd.DataFrame({'A': random_state.rand(5),
                   'B': random_state.rand(5)})
df

	A	B
0	0.891322	0.978624
1	0.209202	0.811683
2	0.185328	0.171941
3	0.108377	0.816225
4	0.219697	0.274074

df.sum()

A    1.613927
B    3.052546
dtype: float64

df.mean()

A    0.322785
B    0.610509
dtype: float64

Try these operations row-wise

df.mean(axis=1)

0    0.934973
1    0.510443
2    0.178635
3    0.462301
4    0.246886
dtype: float64

df.sum(axis=1)

0    1.869946
1    1.020885
2    0.357269
3    0.924602
4    0.493771
dtype: float64

Groupby

Three stages * Split - we split dataframe into multiple smaller dataframe based on the values of keys * Apply - we apply desired aggregation/transformation on each dataframe. * Combine - we combine results from apply state into a dataframe

list("ABCABC")

['A', 'B', 'C', 'A', 'B', 'C']

df = pd.DataFrame({'key': ['A','B','C']*2, #list("ABCABC"), ['A','B','C','A','B','C']
                   'data': range(6)})

df

	key	data
0	A	0
1	B	1
2	C	2
3	A	3
4	B	4
5	C	5

df.groupby("key")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x78e73c1c8040>

df.groupby("key").sum()

	data
key
A	3
B	5
C	7