Introduction to Pandas
Learning Objectives:
Gain an introduction to the DataFrame and Series data structures of the pandas library
Access and manipulate data within a DataFrame and Series
pandas is a column-oriented data analysis API. It’s a great tool for handling and analyzing input data, and many ML frameworks support pandas data structures as inputs.
Although a comprehensive introduction to the API would span many pages, the core concepts are fairly straightforward, and we’ll present them below. For a more complete reference, the pandas docs site contains extensive documentation and many tutorials. (Note that Colab may use a slightly older version number, but the parts of pandas covered here are unlikely to differ from version to version.)
Let’s import pandas library
import pandas as pd
pd.__version__
The primary data structures in pandas are implemented as two classes:
DataFrame , which you can imagine as a relational data table, with rows and named columns.
Series , which is a single column. Each row can be labeled via an index. A DataFrame contains one or more Series and a name for each Series.
The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in Spark and R.
from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame= True )
print (type (diabetes['data' ]))
<class 'pandas.core.frame.DataFrame'>
.. _diabetes_dataset:
Diabetes dataset
----------------
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
**Data Set Characteristics:**
:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, total serum cholesterol
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, total cholesterol / HDL
- s5 ltg, possibly log of serum triglycerides level
- s6 glu, blood sugar level
Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).
Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
Creating pandas dataframe from series
cities = pd.Series(['Mumbai' , 'Bangalore' , 'Chennai' , 'Delhi' ])
population = pd.Series([17000000 , 13000000 , 6000000 ])
city_info_df = pd.DataFrame({'City' : cities, 'Population' : population})
pandas.core.frame.DataFrame
0
Mumbai
17000000.0
1
Bangalore
13000000.0
2
Chennai
6000000.0
3
Delhi
NaN
Exploring data in dataframe
Find out the number of rows and columns in the dataframe.
In this case, there are 442 rows and 10 columns.
Let’s see the names of the columns
Index(['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], dtype='object')
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 442 non-null float64
1 sex 442 non-null float64
2 bmi 442 non-null float64
3 bp 442 non-null float64
4 s1 442 non-null float64
5 s2 442 non-null float64
6 s3 442 non-null float64
7 s4 442 non-null float64
8 s5 442 non-null float64
9 s6 442 non-null float64
dtypes: float64(10)
memory usage: 34.7 KB
Quickly examine a few entries in the dataframe. Say first 5 and last 5.
0
0.038076
0.050680
0.061696
0.021872
-0.044223
-0.034821
-0.043401
-0.002592
0.019907
-0.017646
1
-0.001882
-0.044642
-0.051474
-0.026328
-0.008449
-0.019163
0.074412
-0.039493
-0.068332
-0.092204
2
0.085299
0.050680
0.044451
-0.005670
-0.045599
-0.034194
-0.032356
-0.002592
0.002861
-0.025930
3
-0.089063
-0.044642
-0.011595
-0.036656
0.012191
0.024991
-0.036038
0.034309
0.022688
-0.009362
4
0.005383
-0.044642
-0.036385
0.021872
0.003935
0.015596
0.008142
-0.002592
-0.031988
-0.046641
437
0.041708
0.050680
0.019662
0.059744
-0.005697
-0.002566
-0.028674
-0.002592
0.031193
0.007207
438
-0.005515
0.050680
-0.015906
-0.067642
0.049341
0.079165
-0.028674
0.034309
-0.018114
0.044485
439
0.041708
0.050680
-0.015906
0.017293
-0.037344
-0.013840
-0.024993
-0.011080
-0.046883
0.015491
440
-0.045472
-0.044642
0.039062
0.001215
0.016318
0.015283
-0.028674
0.026560
0.044529
-0.025930
441
-0.045472
-0.044642
-0.073030
-0.081413
0.083740
0.027809
0.173816
-0.039493
-0.004222
0.003064
Let’s get summary statistics on the dataframe
count
4.420000e+02
4.420000e+02
4.420000e+02
4.420000e+02
4.420000e+02
4.420000e+02
4.420000e+02
4.420000e+02
4.420000e+02
4.420000e+02
mean
-2.511817e-19
1.230790e-17
-2.245564e-16
-4.797570e-17
-1.381499e-17
3.918434e-17
-5.777179e-18
-9.042540e-18
9.293722e-17
1.130318e-17
std
4.761905e-02
4.761905e-02
4.761905e-02
4.761905e-02
4.761905e-02
4.761905e-02
4.761905e-02
4.761905e-02
4.761905e-02
4.761905e-02
min
-1.072256e-01
-4.464164e-02
-9.027530e-02
-1.123988e-01
-1.267807e-01
-1.156131e-01
-1.023071e-01
-7.639450e-02
-1.260971e-01
-1.377672e-01
25%
-3.729927e-02
-4.464164e-02
-3.422907e-02
-3.665608e-02
-3.424784e-02
-3.035840e-02
-3.511716e-02
-3.949338e-02
-3.324559e-02
-3.317903e-02
50%
5.383060e-03
-4.464164e-02
-7.283766e-03
-5.670422e-03
-4.320866e-03
-3.819065e-03
-6.584468e-03
-2.592262e-03
-1.947171e-03
-1.077698e-03
75%
3.807591e-02
5.068012e-02
3.124802e-02
3.564379e-02
2.835801e-02
2.984439e-02
2.931150e-02
3.430886e-02
3.243232e-02
2.791705e-02
max
1.107267e-01
5.068012e-02
1.705552e-01
1.320436e-01
1.539137e-01
1.987880e-01
1.811791e-01
1.852344e-01
1.335973e-01
1.356118e-01
df.describe(percentiles= [0.2 , 0.6 , 0.8 ])
count
4.420000e+02
4.420000e+02
4.420000e+02
4.420000e+02
4.420000e+02
4.420000e+02
4.420000e+02
4.420000e+02
4.420000e+02
4.420000e+02
mean
-2.511817e-19
1.230790e-17
-2.245564e-16
-4.797570e-17
-1.381499e-17
3.918434e-17
-5.777179e-18
-9.042540e-18
9.293722e-17
1.130318e-17
std
4.761905e-02
4.761905e-02
4.761905e-02
4.761905e-02
4.761905e-02
4.761905e-02
4.761905e-02
4.761905e-02
4.761905e-02
4.761905e-02
min
-1.072256e-01
-4.464164e-02
-9.027530e-02
-1.123988e-01
-1.267807e-01
-1.156131e-01
-1.023071e-01
-7.639450e-02
-1.260971e-01
-1.377672e-01
20%
-4.547248e-02
-4.464164e-02
-4.048038e-02
-4.009893e-02
-3.871969e-02
-3.695017e-02
-3.971921e-02
-3.949338e-02
-4.117617e-02
-3.835666e-02
50%
5.383060e-03
-4.464164e-02
-7.283766e-03
-5.670422e-03
-4.320866e-03
-3.819065e-03
-6.584468e-03
-2.592262e-03
-1.947171e-03
-1.077698e-03
60%
1.628068e-02
5.068012e-02
5.218854e-03
8.100982e-03
8.062710e-03
8.706873e-03
8.142084e-03
-2.592262e-03
1.255119e-02
7.206516e-03
80%
4.170844e-02
5.068012e-02
4.229559e-02
4.941519e-02
3.943444e-02
3.952068e-02
3.759519e-02
3.430886e-02
3.885335e-02
3.620126e-02
max
1.107267e-01
5.068012e-02
1.705552e-01
1.320436e-01
1.539137e-01
1.987880e-01
1.811791e-01
1.852344e-01
1.335973e-01
1.356118e-01
age
442.0
-2.511817e-19
0.047619
-0.107226
-0.037299
0.005383
0.038076
0.110727
sex
442.0
1.230790e-17
0.047619
-0.044642
-0.044642
-0.044642
0.050680
0.050680
bmi
442.0
-2.245564e-16
0.047619
-0.090275
-0.034229
-0.007284
0.031248
0.170555
bp
442.0
-4.797570e-17
0.047619
-0.112399
-0.036656
-0.005670
0.035644
0.132044
s1
442.0
-1.381499e-17
0.047619
-0.126781
-0.034248
-0.004321
0.028358
0.153914
s2
442.0
3.918434e-17
0.047619
-0.115613
-0.030358
-0.003819
0.029844
0.198788
s3
442.0
-5.777179e-18
0.047619
-0.102307
-0.035117
-0.006584
0.029312
0.181179
s4
442.0
-9.042540e-18
0.047619
-0.076395
-0.039493
-0.002592
0.034309
0.185234
s5
442.0
9.293722e-17
0.047619
-0.126097
-0.033246
-0.001947
0.032432
0.133597
s6
442.0
1.130318e-17
0.047619
-0.137767
-0.033179
-0.001078
0.027917
0.135612
Selection
Index(['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], dtype='object')
0 0.038076
1 -0.001882
2 0.085299
3 -0.089063
4 0.005383
...
437 0.041708
438 -0.005515
439 0.041708
440 -0.045472
441 -0.045472
Name: age, Length: 442, dtype: float64
pandas.core.series.Series
0 0.038076
1 -0.001882
2 0.085299
3 -0.089063
4 0.005383
Name: age, dtype: float64
437 0.041708
438 -0.005515
439 0.041708
440 -0.045472
441 -0.045472
Name: age, dtype: float64
100 0.016281
101 0.016281
102 -0.092695
103 0.059871
104 -0.027310
...
195 0.027178
196 -0.023677
197 0.048974
198 -0.052738
199 0.041708
Name: age, Length: 100, dtype: float64
0
0.038076
0.050680
1
-0.001882
-0.044642
2
0.085299
0.050680
3
-0.089063
-0.044642
4
0.005383
-0.044642
...
...
...
437
0.041708
0.050680
438
-0.005515
0.050680
439
0.041708
0.050680
440
-0.045472
-0.044642
441
-0.045472
-0.044642
442 rows × 2 columns
0
0.038076
0.050680
1
-0.001882
-0.044642
2
0.085299
0.050680
3
-0.089063
-0.044642
4
0.005383
-0.044642
437
0.041708
0.050680
438
-0.005515
0.050680
439
0.041708
0.050680
440
-0.045472
-0.044642
441
-0.045472
-0.044642
0
0.038076
0.050680
1
-0.001882
-0.044642
2
0.085299
0.050680
3
-0.089063
-0.044642
4
0.005383
-0.044642
...
...
...
437
0.041708
0.050680
438
-0.005515
0.050680
439
0.041708
0.050680
440
-0.045472
-0.044642
441
-0.045472
-0.044642
442 rows × 2 columns
age -0.045472
sex -0.044642
bmi -0.073030
bp -0.081413
s1 0.083740
s2 0.027809
s3 0.173816
s4 -0.039493
s5 -0.004222
s6 0.003064
Name: 441, dtype: float64
age 0.038076
sex 0.050680
bmi 0.061696
bp 0.021872
s1 -0.044223
s2 -0.034821
s3 -0.043401
s4 -0.002592
s5 0.019907
s6 -0.017646
Name: 0, dtype: float64
age 0.038076
sex 0.050680
bmi 0.061696
bp 0.021872
s1 -0.044223
s2 -0.034821
s3 -0.043401
s4 -0.002592
s5 0.019907
s6 -0.017646
Name: 0, dtype: float64
0
0.038076
0.050680
0.061696
0.021872
-0.044223
-0.034821
-0.043401
-0.002592
0.019907
-0.017646
1
-0.001882
-0.044642
-0.051474
-0.026328
-0.008449
-0.019163
0.074412
-0.039493
-0.068332
-0.092204
2
0.085299
0.050680
0.044451
-0.005670
-0.045599
-0.034194
-0.032356
-0.002592
0.002861
-0.025930
3
-0.089063
-0.044642
-0.011595
-0.036656
0.012191
0.024991
-0.036038
0.034309
0.022688
-0.009362
4
0.005383
-0.044642
-0.036385
0.021872
0.003935
0.015596
0.008142
-0.002592
-0.031988
-0.046641
df.loc[4 , ['age' , 'sex' ]]
age 0.005383
sex -0.044642
Name: 4, dtype: float64
sex -0.044642
s2 0.015596
s4 -0.002592
s6 -0.046641
Name: 4, dtype: float64
rows_condition_met = df.age > 5.383060e-03
df.loc[rows_condition_met]
0
0.038076
0.050680
0.061696
0.021872
-0.044223
-0.034821
-0.043401
-0.002592
0.019907
-0.017646
2
0.085299
0.050680
0.044451
-0.005670
-0.045599
-0.034194
-0.032356
-0.002592
0.002861
-0.025930
4
0.005383
-0.044642
-0.036385
0.021872
0.003935
0.015596
0.008142
-0.002592
-0.031988
-0.046641
7
0.063504
0.050680
-0.001895
0.066629
0.090620
0.108914
0.022869
0.017703
-0.035816
0.003064
8
0.041708
0.050680
0.061696
-0.040099
-0.013953
0.006202
-0.028674
-0.002592
-0.014960
0.011349
...
...
...
...
...
...
...
...
...
...
...
431
0.070769
0.050680
-0.030996
0.021872
-0.037344
-0.047034
0.033914
-0.039493
-0.014960
-0.001078
432
0.009016
-0.044642
0.055229
-0.005670
0.057597
0.044719
-0.002903
0.023239
0.055686
0.106617
434
0.016281
-0.044642
0.001339
0.008101
0.005311
0.010899
0.030232
-0.039493
-0.045424
0.032059
437
0.041708
0.050680
0.019662
0.059744
-0.005697
-0.002566
-0.028674
-0.002592
0.031193
0.007207
439
0.041708
0.050680
-0.015906
0.017293
-0.037344
-0.013840
-0.024993
-0.011080
-0.046883
0.015491
228 rows × 10 columns
df.loc[df.age > 5.383060e-03 ]
0
0.038076
0.050680
0.061696
0.021872
-0.044223
-0.034821
-0.043401
-0.002592
0.019907
-0.017646
2
0.085299
0.050680
0.044451
-0.005670
-0.045599
-0.034194
-0.032356
-0.002592
0.002861
-0.025930
4
0.005383
-0.044642
-0.036385
0.021872
0.003935
0.015596
0.008142
-0.002592
-0.031988
-0.046641
7
0.063504
0.050680
-0.001895
0.066629
0.090620
0.108914
0.022869
0.017703
-0.035816
0.003064
8
0.041708
0.050680
0.061696
-0.040099
-0.013953
0.006202
-0.028674
-0.002592
-0.014960
0.011349
...
...
...
...
...
...
...
...
...
...
...
431
0.070769
0.050680
-0.030996
0.021872
-0.037344
-0.047034
0.033914
-0.039493
-0.014960
-0.001078
432
0.009016
-0.044642
0.055229
-0.005670
0.057597
0.044719
-0.002903
0.023239
0.055686
0.106617
434
0.016281
-0.044642
0.001339
0.008101
0.005311
0.010899
0.030232
-0.039493
-0.045424
0.032059
437
0.041708
0.050680
0.019662
0.059744
-0.005697
-0.002566
-0.028674
-0.002592
0.031193
0.007207
439
0.041708
0.050680
-0.015906
0.017293
-0.037344
-0.013840
-0.024993
-0.011080
-0.046883
0.015491
228 rows × 10 columns
age_df_temp = df.loc[df.age < 5.383060e-03 ]
age_df_temp = df[df.age < 5.383060e-03 ]
1
-0.001882
-0.044642
-0.051474
-0.026328
-0.008449
-0.019163
0.074412
-0.039493
-0.068332
-0.092204
3
-0.089063
-0.044642
-0.011595
-0.036656
0.012191
0.024991
-0.036038
0.034309
0.022688
-0.009362
5
-0.092695
-0.044642
-0.040696
-0.019442
-0.068991
-0.079288
0.041277
-0.076395
-0.041176
-0.096346
6
-0.045472
0.050680
-0.047163
-0.015999
-0.040096
-0.024800
0.000779
-0.039493
-0.062917
-0.038357
9
-0.070900
-0.044642
0.039062
-0.033213
-0.012577
-0.034508
-0.024993
-0.002592
0.067737
-0.013504
age -0.092695
sex -0.044642
bmi -0.040696
bp -0.019442
s1 -0.068991
s2 -0.079288
s3 0.041277
s4 -0.076395
s5 -0.041176
s6 -0.096346
Name: 5, dtype: float64
age -0.001882
sex -0.044642
bmi -0.051474
bp -0.026328
s1 -0.008449
s2 -0.019163
s3 0.074412
s4 -0.039493
s5 -0.068332
s6 -0.092204
Name: 1, dtype: float64
age_df_temp = df.loc[(df.age < 5.383060e-03 ) & (df.sex > - 4.464164e-02 )]
1
-0.001882
-0.044642
-0.051474
-0.026328
-0.008449
-0.019163
0.074412
-0.039493
-0.068332
-0.092204
3
-0.089063
-0.044642
-0.011595
-0.036656
0.012191
0.024991
-0.036038
0.034309
0.022688
-0.009362
5
-0.092695
-0.044642
-0.040696
-0.019442
-0.068991
-0.079288
0.041277
-0.076395
-0.041176
-0.096346
6
-0.045472
0.050680
-0.047163
-0.015999
-0.040096
-0.024800
0.000779
-0.039493
-0.062917
-0.038357
9
-0.070900
-0.044642
0.039062
-0.033213
-0.012577
-0.034508
-0.024993
-0.002592
0.067737
-0.013504
...
...
...
...
...
...
...
...
...
...
...
435
-0.012780
-0.044642
-0.023451
-0.040099
-0.016704
0.004636
-0.017629
-0.002592
-0.038460
-0.038357
436
-0.056370
-0.044642
-0.074108
-0.050427
-0.024960
-0.047034
0.092820
-0.076395
-0.061176
-0.046641
438
-0.005515
0.050680
-0.015906
-0.067642
0.049341
0.079165
-0.028674
0.034309
-0.018114
0.044485
440
-0.045472
-0.044642
0.039062
0.001215
0.016318
0.015283
-0.028674
0.026560
0.044529
-0.025930
441
-0.045472
-0.044642
-0.073030
-0.081413
0.083740
0.027809
0.173816
-0.039493
-0.004222
0.003064
214 rows × 10 columns
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
another_df = pd.DataFrame(
np.random.rand(100 , 4 ),
index= range (10 , 110 ),
columns= list ("ABCD" ))
10
0.128088
0.963425
0.450839
0.887946
11
0.441873
0.635796
0.576356
0.206467
12
0.467190
0.512398
0.255862
0.829151
13
0.350556
0.748146
0.649348
0.158734
14
0.861936
0.355931
0.235678
0.166667
105
0.531472
0.838437
0.540441
0.691833
106
0.253700
0.651600
0.810953
0.275363
107
0.932707
0.076388
0.415450
0.918451
108
0.176879
0.149722
0.242837
0.214613
109
0.908052
0.453071
0.427724
0.433188
df = pd.DataFrame(np.random.rand(9 , 4 ), index= list ("abcdefghi" ), columns= list ("ABCD" ))
Index(['A', 'B', 'C', 'D'], dtype='object')
<class 'pandas.core.frame.DataFrame'>
Index: 9 entries, a to i
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 9 non-null float64
1 B 9 non-null float64
2 C 9 non-null float64
3 D 9 non-null float64
dtypes: float64(4)
memory usage: 660.0+ bytes
a
0.511163
0.077225
0.712719
0.703220
b
0.144652
0.319507
0.059459
0.076181
c
0.668516
0.924316
0.190403
0.144573
d
0.470595
0.358167
0.526109
0.548248
e
0.840948
0.981394
0.959002
0.870479
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'], dtype='object')
A 0.511163
B 0.077225
C 0.712719
D 0.703220
Name: a, dtype: float64
A 0.511163
B 0.077225
C 0.712719
D 0.703220
Name: a, dtype: float64
a
0.511163
0.077225
0.712719
0.703220
b
0.144652
0.319507
0.059459
0.076181
c
0.668516
0.924316
0.190403
0.144573
d
0.470595
0.358167
0.526109
0.548248
a
0.511163
0.077225
0.712719
b
0.144652
0.319507
0.059459
c
0.668516
0.924316
0.190403
d
0.470595
0.358167
0.526109
a
0.077225
0.712719
b
0.319507
0.059459
c
0.924316
0.190403
d
0.358167
0.526109
a
0.511163
0.077225
0.712719
0.703220
b
0.144652
0.319507
0.059459
0.076181
c
0.668516
0.924316
0.190403
0.144573
d
0.470595
0.358167
0.526109
0.548248
a
0.511163
0.077225
0.712719
0.703220
b
0.144652
0.319507
0.059459
0.076181
c
0.668516
0.924316
0.190403
0.144573
d
0.470595
0.358167
0.526109
0.548248
a 0.077225
b 0.319507
c 0.924316
d 0.358167
e 0.981394
f 0.259232
g 0.969513
h 0.788978
i 0.121502
Name: B, dtype: float64
selector = lambda df: df['A' ] > 0
<function __main__.<lambda>(df)>
a
0.511163
0.077225
0.712719
0.703220
b
0.144652
0.319507
0.059459
0.076181
c
0.668516
0.924316
0.190403
0.144573
d
0.470595
0.358167
0.526109
0.548248
e
0.840948
0.981394
0.959002
0.870479
f
0.227027
0.259232
0.627810
0.249644
g
0.802435
0.969513
0.604462
0.871050
h
0.660405
0.788978
0.023313
0.524337
i
0.551329
0.121502
0.321013
0.481932
selector = lambda df: df['A' ] > 0.5
a
0.511163
0.077225
0.712719
0.703220
c
0.668516
0.924316
0.190403
0.144573
e
0.840948
0.981394
0.959002
0.870479
g
0.802435
0.969513
0.604462
0.871050
h
0.660405
0.788978
0.023313
0.524337
i
0.551329
0.121502
0.321013
0.481932
selector = lambda df: (df['A' ] > 0.5 )& (df['B' ] < 0.2 )
a
0.511163
0.077225
0.712719
0.703220
i
0.551329
0.121502
0.321013
0.481932
condition_for_selection = (df['A' ] > 0.5 )& (df['B' ] < 0.2 )
a True
b False
c False
d False
e False
f False
g False
h False
i True
dtype: bool
df[condition_for_selection]
a
0.511163
0.077225
0.712719
0.703220
i
0.551329
0.121502
0.321013
0.481932
condition_for_selection = (df['A' ] > 0.5 ) | ~ (df['B' ] < 0.2 )
df[condition_for_selection]
a
0.511163
0.077225
0.712719
0.703220
b
0.144652
0.319507
0.059459
0.076181
c
0.668516
0.924316
0.190403
0.144573
d
0.470595
0.358167
0.526109
0.548248
e
0.840948
0.981394
0.959002
0.870479
f
0.227027
0.259232
0.627810
0.249644
g
0.802435
0.969513
0.604462
0.871050
h
0.660405
0.788978
0.023313
0.524337
i
0.551329
0.121502
0.321013
0.481932
OR operator: | NOT operator: ~
Adding a column in the dataframe
a
0.511163
0.077225
0.712719
0.703220
51.116259
b
0.144652
0.319507
0.059459
0.076181
14.465201
c
0.668516
0.924316
0.190403
0.144573
66.851581
d
0.470595
0.358167
0.526109
0.548248
47.059507
e
0.840948
0.981394
0.959002
0.870479
84.094789
f
0.227027
0.259232
0.627810
0.249644
22.702716
g
0.802435
0.969513
0.604462
0.871050
80.243516
h
0.660405
0.788978
0.023313
0.524337
66.040476
i
0.551329
0.121502
0.321013
0.481932
55.132923
df["F" ] = df["A" ] + df["C" ]
a
0.511163
0.077225
0.712719
0.703220
51.116259
1.223882
b
0.144652
0.319507
0.059459
0.076181
14.465201
0.204111
c
0.668516
0.924316
0.190403
0.144573
66.851581
0.858919
d
0.470595
0.358167
0.526109
0.548248
47.059507
0.996704
e
0.840948
0.981394
0.959002
0.870479
84.094789
1.799950
f
0.227027
0.259232
0.627810
0.249644
22.702716
0.854837
g
0.802435
0.969513
0.604462
0.871050
80.243516
1.406897
h
0.660405
0.788978
0.023313
0.524337
66.040476
0.683717
i
0.551329
0.121502
0.321013
0.481932
55.132923
0.872342
a False
b True
c False
d False
e False
f False
g False
h False
i False
Name: A, dtype: bool
df.loc[criteria, 'A' ] = 0
a
0.511163
0.077225
0.712719
0.703220
51.116259
1.223882
b
0.000000
0.319507
0.059459
0.076181
14.465201
0.204111
c
0.668516
0.924316
0.190403
0.144573
66.851581
0.858919
d
0.470595
0.358167
0.526109
0.548248
47.059507
0.996704
e
0.840948
0.981394
0.959002
0.870479
84.094789
1.799950
f
0.227027
0.259232
0.627810
0.249644
22.702716
0.854837
g
0.802435
0.969513
0.604462
0.871050
80.243516
1.406897
h
0.660405
0.788978
0.023313
0.524337
66.040476
0.683717
i
0.551329
0.121502
0.321013
0.481932
55.132923
0.872342
cities = ['Mumbai' , 'Delhi' , 'Chennai' , 'Kolkata' , 'Bengalure' , 'Hyderabad' , 'Pune' , 'Ahmedabad' , 'Indore' ]
a
0.511163
0.077225
0.712719
0.703220
51.116259
1.223882
Mumbai
b
0.000000
0.319507
0.059459
0.076181
14.465201
0.204111
Delhi
c
0.668516
0.924316
0.190403
0.144573
66.851581
0.858919
Chennai
d
0.470595
0.358167
0.526109
0.548248
47.059507
0.996704
Kolkata
e
0.840948
0.981394
0.959002
0.870479
84.094789
1.799950
Bengalure
f
0.227027
0.259232
0.627810
0.249644
22.702716
0.854837
Hyderabad
g
0.802435
0.969513
0.604462
0.871050
80.243516
1.406897
Pune
h
0.660405
0.788978
0.023313
0.524337
66.040476
0.683717
Ahmedabad
i
0.551329
0.121502
0.321013
0.481932
55.132923
0.872342
Indore
cities_new = ['Mumbai' , 'Delhi' , 'Chennai' ,
'Kolkata' , 'Bengalure' , 'Hyderabad' ,
'Pune' , 'Ahmedabad' , 'Guwahati' ]
df_copy['new_city' ] = cities_new
a
0.511163
0.077225
0.712719
0.703220
51.116259
1.223882
Mumbai
Mumbai
b
0.000000
0.319507
0.059459
0.076181
14.465201
0.204111
Delhi
Delhi
c
0.668516
0.924316
0.190403
0.144573
66.851581
0.858919
Chennai
Chennai
d
0.470595
0.358167
0.526109
0.548248
47.059507
0.996704
Kolkata
Kolkata
e
0.840948
0.981394
0.959002
0.870479
84.094789
1.799950
Bengalure
Bengalure
f
0.227027
0.259232
0.627810
0.249644
22.702716
0.854837
Hyderabad
Hyderabad
g
0.802435
0.969513
0.604462
0.871050
80.243516
1.406897
Pune
Pune
h
0.660405
0.788978
0.023313
0.524337
66.040476
0.683717
Ahmedabad
Ahmedabad
i
0.551329
0.121502
0.321013
0.481932
55.132923
0.872342
Indore
Guwahati
criteria = df_copy['city' ].isin(['Pune' , 'Bengaluru' , 'Hyderabad' ])
df_copy.loc[df.city == 'Bengalure' , ['city' , 'new_city' ]] = 'Bengaluru'
a
0.511163
0.077225
0.712719
0.703220
51.116259
1.223882
Mumbai
Mumbai
NaN
b
0.000000
0.319507
0.059459
0.076181
14.465201
0.204111
Delhi
Delhi
NaN
c
0.668516
0.924316
0.190403
0.144573
66.851581
0.858919
Chennai
Chennai
NaN
d
0.470595
0.358167
0.526109
0.548248
47.059507
0.996704
Kolkata
Kolkata
NaN
e
0.840948
0.981394
0.959002
0.870479
84.094789
1.799950
Bengaluru
Bengaluru
Bengaluru
f
0.227027
0.259232
0.627810
0.249644
22.702716
0.854837
Hyderabad
Hyderabad
NaN
g
0.802435
0.969513
0.604462
0.871050
80.243516
1.406897
Pune
Pune
NaN
h
0.660405
0.788978
0.023313
0.524337
66.040476
0.683717
Ahmedabad
Ahmedabad
NaN
i
0.551329
0.121502
0.321013
0.481932
55.132923
0.872342
Indore
Guwahati
NaN
df_copy.drop(['new_cities' ], axis= 1 )
a
0.511163
0.077225
0.712719
0.703220
51.116259
1.223882
Mumbai
Mumbai
b
0.000000
0.319507
0.059459
0.076181
14.465201
0.204111
Delhi
Delhi
c
0.668516
0.924316
0.190403
0.144573
66.851581
0.858919
Chennai
Chennai
d
0.470595
0.358167
0.526109
0.548248
47.059507
0.996704
Kolkata
Kolkata
e
0.840948
0.981394
0.959002
0.870479
84.094789
1.799950
Bengaluru
Bengaluru
f
0.227027
0.259232
0.627810
0.249644
22.702716
0.854837
Hyderabad
Hyderabad
g
0.802435
0.969513
0.604462
0.871050
80.243516
1.406897
Pune
Pune
h
0.660405
0.788978
0.023313
0.524337
66.040476
0.683717
Ahmedabad
Ahmedabad
i
0.551329
0.121502
0.321013
0.481932
55.132923
0.872342
Indore
Guwahati
a False
b False
c False
d False
e True
f True
g True
h False
i False
Name: city, dtype: bool
df_copy.loc[:, 'city' ] = 'Chennai'
df_copy = df_copy.drop(['new_cities' ], axis= 1 )
a
0.511163
0.077225
0.712719
0.70322
51.116259
1.223882
Chennai
Mumbai
b
0.0
0.319507
0.059459
0.076181
14.465201
0.204111
Chennai
Delhi
c
0.668516
0.924316
0.190403
0.144573
66.851581
0.858919
Chennai
Chennai
d
0.470595
0.358167
0.526109
0.548248
47.059507
0.996704
Chennai
Kolkata
e
0.840948
0.981394
0.959002
0.870479
84.094789
1.79995
Chennai
Bengaluru
f
0.227027
0.259232
0.62781
0.249644
22.702716
0.854837
Chennai
Hyderabad
g
0.802435
0.969513
0.604462
0.87105
80.243516
1.406897
Chennai
Pune
h
0.660405
0.788978
0.023313
0.524337
66.040476
0.683717
Chennai
Ahmedabad
i
0.551329
0.121502
0.321013
0.481932
55.132923
0.872342
Chennai
Guwahati
city
Chennai
Chennai
Chennai
Chennai
Chennai
Chennai
Chennai
Chennai
df_copy = df_copy.drop(['city' ])
a
0.511163
0.077225
0.712719
0.70322
51.116259
1.223882
Chennai
Mumbai
b
0.0
0.319507
0.059459
0.076181
14.465201
0.204111
Chennai
Delhi
c
0.668516
0.924316
0.190403
0.144573
66.851581
0.858919
Chennai
Chennai
d
0.470595
0.358167
0.526109
0.548248
47.059507
0.996704
Chennai
Kolkata
e
0.840948
0.981394
0.959002
0.870479
84.094789
1.79995
Chennai
Bengaluru
f
0.227027
0.259232
0.62781
0.249644
22.702716
0.854837
Chennai
Hyderabad
g
0.802435
0.969513
0.604462
0.87105
80.243516
1.406897
Chennai
Pune
h
0.660405
0.788978
0.023313
0.524337
66.040476
0.683717
Chennai
Ahmedabad
i
0.551329
0.121502
0.321013
0.481932
55.132923
0.872342
Chennai
Guwahati
a
0.511163
0.077225
0.712719
0.70322
51.116259
1.223882
Chennai
Mumbai
g
0.802435
0.969513
0.604462
0.87105
80.243516
1.406897
Chennai
Pune
h
0.660405
0.788978
0.023313
0.524337
66.040476
0.683717
Chennai
Ahmedabad
g
0.802435
0.969513
0.604462
0.87105
80.243516
1.406897
Chennai
Pune
a
0.511163
0.077225
0.712719
0.70322
51.116259
1.223882
Chennai
Mumbai
f
0.227027
0.259232
0.62781
0.249644
22.702716
0.854837
Chennai
Hyderabad
df_copy.sample(3 , random_state= 42 )
h
0.660405
0.788978
0.023313
0.524337
66.040476
0.683717
Chennai
Ahmedabad
b
0.0
0.319507
0.059459
0.076181
14.465201
0.204111
Chennai
Delhi
f
0.227027
0.259232
0.62781
0.249644
22.702716
0.854837
Chennai
Hyderabad
df_copy.sample(3 , random_state= 42 )
h
0.660405
0.788978
0.023313
0.524337
66.040476
0.683717
Chennai
Ahmedabad
b
0.0
0.319507
0.059459
0.076181
14.465201
0.204111
Chennai
Delhi
f
0.227027
0.259232
0.62781
0.249644
22.702716
0.854837
Chennai
Hyderabad
df_copy.sample(3 , random_state= 42 , replace= True )
g
0.802435
0.969513
0.604462
0.87105
80.243516
1.406897
Chennai
Pune
d
0.470595
0.358167
0.526109
0.548248
47.059507
0.996704
Chennai
Kolkata
h
0.660405
0.788978
0.023313
0.524337
66.040476
0.683717
Chennai
Ahmedabad
import pandas as pd
cities = ['Mumbai' , 'Chennai' , 'Pune' , 'Ahmedabad' , 'Kolkata' , 'Kanpur' , 'Delhi' ]
city_df = pd.DataFrame(cities)
city_df.columns = ['City_Name' ]
0
Mumbai
1
Chennai
2
Pune
3
Ahmedabad
4
Kolkata
5
Kanpur
6
Delhi
condition_met = city_df.City_Name == 'Mumbai'
pandas.core.series.Series
0 True
1 False
2 False
3 False
4 False
5 False
6 False
Name: City_Name, dtype: bool
city_df[city_df.City_Name == 'Pune' ]
Pandas cheatsheet
Agggregation and grouping
random_state = np.random.RandomState(100 )
random_series = pd.Series(random_state.rand(10 ))
0 0.543405
1 0.278369
2 0.424518
3 0.844776
4 0.004719
5 0.121569
6 0.670749
7 0.825853
8 0.136707
9 0.575093
dtype: float64
Now that the series is created, we can try aggregation functions like mean, std etc.
Let’s create a dataframe with 5 rows and 2 columns containing random numbers.
df = pd.DataFrame({'A' : random_state.rand(5 ),
'B' : random_state.rand(5 )})
df
0
0.891322
0.978624
1
0.209202
0.811683
2
0.185328
0.171941
3
0.108377
0.816225
4
0.219697
0.274074
A 1.613927
B 3.052546
dtype: float64
A 0.322785
B 0.610509
dtype: float64
Try these operations row-wise
0 0.934973
1 0.510443
2 0.178635
3 0.462301
4 0.246886
dtype: float64
0 1.869946
1 1.020885
2 0.357269
3 0.924602
4 0.493771
dtype: float64
Groupby
Three stages * Split - we split dataframe into multiple smaller dataframe based on the values of keys * Apply - we apply desired aggregation/transformation on each dataframe. * Combine - we combine results from apply state into a dataframe
['A', 'B', 'C', 'A', 'B', 'C']
df = pd.DataFrame({'key' : ['A' ,'B' ,'C' ]* 2 , #list("ABCABC"), ['A','B','C','A','B','C']
'data' : range (6 )})
0
A
0
1
B
1
2
C
2
3
A
3
4
B
4
5
C
5
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x78e73c1c8040>