
Pandas Introduction

This notebook is an introduction to pandas, a Python library for loading, analyzing, and transforming data.

Anatomy of a DataFrame

A DataFrame has three parts: columns, index, and data (values)

import pandas as pd

# Create a sample DataFrame
data = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]}
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])
df
col1 col2 col3
row1 1 4 7
row2 2 5 8
row3 3 6 9
# Create the same DataFrame without row labels; pandas assigns the default index 0, 1, 2
data = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]}
df_default = pd.DataFrame(data)
df_default
col1 col2 col3
0 1 4 7
1 2 5 8
2 3 6 9
# Display columns
df.columns
Index(['col1', 'col2', 'col3'], dtype='object')
# Display index (row labels)
df.index
Index(['row1', 'row2', 'row3'], dtype='object')

# Display data (values)
df.values
array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])
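
The three parts shown above can be pulled apart and put back together. The following is a minimal sketch that reuses the sample data from this section; the name `rebuilt` is only illustrative.

```python
import pandas as pd

data = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]}
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])

# shape is (number of rows, number of columns)
print(df.shape)

# Rebuild a DataFrame from its values, index and columns
rebuilt = pd.DataFrame(df.values, index=df.index, columns=df.columns)

# equals() compares both the labels and the data
print(rebuilt.equals(df))
```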

Creating a DataFrame

  • From lists of lists
  • From series objects
  • From dictionaries
# From lists of lists
list_of_lists_data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df_from_list = pd.DataFrame(list_of_lists_data, columns=['Age', 'B', 'C'])
df_from_list
Age B C
0 1 2 3
1 4 5 6
2 7 8 9
# A Series is a one-dimensional labeled array
pd.Series([1, 2, 3])
0
0 1
1 2
2 3

# From series objects
series_data = {'col1': pd.Series([1, 2, 3]),
               'col2': pd.Series([4, 5, 6])}
df_from_series = pd.DataFrame(series_data)
print("\nDataFrame from series objects:")
df_from_series

DataFrame from series objects:
col1 col2
0 1 4
1 2 5
2 3 6
# From dictionaries
dictionary_data = {
    'col1': [1, 2, 3],
    'col2': [4, 5, 6]
    }

df_from_dict = pd.DataFrame(dictionary_data)
print("\nDataFrame from dictionaries:")
df_from_dict

DataFrame from dictionaries:
col1 col2
0 1 4
1 2 5
2 3 6
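
A related form that the examples above do not show is a list of dictionaries, where each dictionary is one row. This is a small sketch with made-up values; `records` and `df_from_records` are illustrative names.

```python
import pandas as pd

# Each dictionary is one row; the keys become column names
records = [
    {'col1': 1, 'col2': 4},
    {'col1': 2, 'col2': 5},
    {'col1': 3},  # 'col2' is missing in this row
]
df_from_records = pd.DataFrame(records)

# The missing entry is filled with NaN
print(df_from_records)
```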

Loading Datasets

  • Load data from a CSV file
  • Load data from a JSON file
  • Load tables from an HTML page
df_csv = pd.read_csv("/content/sample_data/california_housing_test.csv")
df_csv
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -122.05 37.37 27.0 3885.0 661.0 1537.0 606.0 6.6085 344700.0
1 -118.30 34.26 43.0 1510.0 310.0 809.0 277.0 3.5990 176500.0
2 -117.81 33.78 27.0 3589.0 507.0 1484.0 495.0 5.7934 270500.0
3 -118.36 33.82 28.0 67.0 15.0 49.0 11.0 6.1359 330000.0
4 -119.67 36.33 19.0 1241.0 244.0 850.0 237.0 2.9375 81700.0
... ... ... ... ... ... ... ... ... ...
2995 -119.86 34.42 23.0 1450.0 642.0 1258.0 607.0 1.1790 225000.0
2996 -118.14 34.06 27.0 5257.0 1082.0 3496.0 1036.0 3.3906 237200.0
2997 -119.70 36.30 10.0 956.0 201.0 693.0 220.0 2.2895 62000.0
2998 -117.12 34.10 40.0 96.0 14.0 46.0 14.0 3.2708 162500.0
2999 -119.63 34.42 42.0 1765.0 263.0 753.0 260.0 8.5608 500001.0

3000 rows × 9 columns

df_json = pd.read_json("/content/sample_data/anscombe.json")
df_json
Series X Y
0 I 10 8.04
1 I 8 6.95
2 I 13 7.58
3 I 9 8.81
4 I 11 8.33
5 I 14 9.96
6 I 6 7.24
7 I 4 4.26
8 I 12 10.84
9 I 7 4.81
10 I 5 5.68
11 II 10 9.14
12 II 8 8.14
13 II 13 8.74
14 II 9 8.77
15 II 11 9.26
16 II 14 8.10
17 II 6 6.13
18 II 4 3.10
19 II 12 9.13
20 II 7 7.26
21 II 5 4.74
22 III 10 7.46
23 III 8 6.77
24 III 13 12.74
25 III 9 7.11
26 III 11 7.81
27 III 14 8.84
28 III 6 6.08
29 III 4 5.39
30 III 12 8.15
31 III 7 6.42
32 III 5 5.73
33 IV 8 6.58
34 IV 8 5.76
35 IV 8 7.71
36 IV 8 8.84
37 IV 8 8.47
38 IV 8 7.04
39 IV 8 5.25
40 IV 19 12.50
41 IV 8 5.56
42 IV 8 7.91
43 IV 8 6.89
import pandas as pd

# URL of the Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'

# Read the tables from the URL
tables = pd.read_html(url)

# read_html returns a list of DataFrames, one per table found on the page.
# Assuming the first table is the one we want
print("DataFrame loaded from Wikipedia table:")
tables[0]
DataFrame loaded from Wikipedia table:
Country or territory Population (1 July 2022) Population (1 July 2023) Change (%) UN continental region[1] UN statistical subregion[1]
0 World 8021407192 8091734930 +0.88%
1 India 1425423212 1438069596 +0.89% Asia Southern Asia
2 China[a] 1425179569 1422584933 −0.18% Asia Eastern Asia
3 United States 341534046 343477335 +0.57% Americas Northern America
4 Indonesia 278830529 281190067 +0.85% Asia South-eastern Asia
... ... ... ... ... ... ...
233 Montserrat (United Kingdom) 4453 4420 −0.74% Americas Caribbean
234 Falkland Islands (United Kingdom) 3490 3477 −0.37% Americas South America
235 Tokelau (New Zealand) 2290 2397 +4.67% Oceania Polynesia
236 Niue (New Zealand) 1821 1817 −0.22% Oceania Polynesia
237 Vatican City[w] 505 496 −1.78% Europe Southern Europe

238 rows × 6 columns

tables[1]
vteLists of countries by population statistics vteLists of countries by population statistics.1
0 Global Current population United Nations Demographics...
1 Continents/subregions Africa Antarctica Asia Europe North America Ca...
2 Intercontinental Americas Arab world Commonwealth of Nations Eu...
3 Cities/urban areas World cities National capitals Megacities Mega...
4 Past and future Past and future population Estimates of histor...
5 Population density Current density Past and future population den...
6 Growth indicators Population growth rate Natural increase Net re...
7 Life expectancy World Africa Asia Europe North America Oceania...
8 Other demographics Age at childbearing Age at first marriage Age ...
9 Health Antidepressant consumption Antiviral medicatio...
10 Education and innovation Bloomberg Innovation Index Education Index Glo...
11 Economic Access to financial services Development aid d...
12 List of international rankings Lists by country List of international rankings Lists by country
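
The loaders above depend on files in the Colab sample folder and on a live web page. To try `read_csv` and `read_json` without either, the text can be supplied from memory through `io.StringIO`. This is only a sketch: the two strings below are tiny hand-written excerpts, not the real files.

```python
import io
import pandas as pd

# A two-row CSV held in a string instead of a file on disk
csv_text = "longitude,latitude\n-122.05,37.37\n-118.30,34.26\n"
df_small_csv = pd.read_csv(io.StringIO(csv_text))

# A JSON array of records, shaped like the anscombe data
json_text = '[{"Series": "I", "X": 10, "Y": 8.04}, {"Series": "I", "X": 8, "Y": 6.95}]'
df_small_json = pd.read_json(io.StringIO(json_text))

print(df_small_csv)
print(df_small_json)
```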

Datatypes of columns

  • int
  • float
  • category
  • object
df_csv.dtypes
0
longitude float64
latitude float64
housing_median_age float64
total_rooms float64
total_bedrooms float64
population float64
households float64
median_income float64
median_house_value float64

tables[0].dtypes
0
Country or territory object
Population (1 July 2022) int64
Population (1 July 2023) int64
Change (%) object
UN continental region[1] object
UN statistical subregion[1] object
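
The list above mentions the category dtype, but neither dataset uses it when loaded. A text column with few distinct values can be converted with `astype('category')`. The sketch below uses a small made-up Series rather than the Wikipedia table.

```python
import pandas as pd

regions = pd.Series(['Asia', 'Asia', 'Americas', 'Asia', 'Europe'])

# Text columns typically load as object (or a string dtype on newer pandas)
print(regions.dtype)

# Convert to category: the distinct labels are stored once
regions_cat = regions.astype('category')
print(regions_cat.dtype)
print(list(regions_cat.cat.categories))
```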

Summarizing dataframes

  • describe
  • missing values
  • value_counts() for categorical columns
# Work on a copy so that later edits to df do not change df_csv
df = df_csv.copy()
df.describe()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 3000.000000 3000.00000 3000.000000 3000.000000 3000.000000 3000.000000 3000.00000 3000.000000 3000.00000
mean -119.589200 35.63539 28.845333 2599.578667 529.950667 1402.798667 489.91200 3.807272 205846.27500
std 1.994936 2.12967 12.555396 2155.593332 415.654368 1030.543012 365.42271 1.854512 113119.68747
min -124.180000 32.56000 1.000000 6.000000 2.000000 5.000000 2.00000 0.499900 22500.00000
25% -121.810000 33.93000 18.000000 1401.000000 291.000000 780.000000 273.00000 2.544000 121200.00000
50% -118.485000 34.27000 29.000000 2106.000000 437.000000 1155.000000 409.50000 3.487150 177650.00000
75% -118.020000 37.69000 37.000000 3129.000000 636.000000 1742.750000 597.25000 4.656475 263975.00000
max -114.490000 41.92000 52.000000 30450.000000 5419.000000 11935.000000 4930.00000 15.000100 500001.00000
df.isnull().sum()
0
longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 0
population 0
households 0
median_income 0
median_house_value 0

tables[0]["UN continental region[1]"].value_counts()
count
UN continental region[1]
Africa 58
Americas 55
Asia 51
Europe 50
Oceania 23
1
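
By default `value_counts()` leaves out missing values, so its counts may not add up to the number of rows. A small sketch on a made-up table (`toy` is an illustrative name) shows the `dropna` and `normalize` options.

```python
import pandas as pd

toy = pd.DataFrame({
    'region': ['Asia', 'Asia', 'Europe', None],
    'population': [100, 200, 50, 8000],
})

# Missing values per column
print(toy.isnull().sum())

# Missing values are skipped unless dropna=False is passed
print(toy['region'].value_counts())
print(toy['region'].value_counts(dropna=False))

# normalize=True returns shares instead of counts
print(toy['region'].value_counts(normalize=True))
```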

Selecting data

  • Select Column(s)
  • Select row(s)
  • Select columns and rows
  • Conditional selection
  • Special selection methods
    • query
    • select_dtypes
  • Order columns
#Select Column(s)
df[['longitude', 'latitude']]
longitude latitude
0 -122.05 37.37
1 -118.30 34.26
2 -117.81 33.78
3 -118.36 33.82
4 -119.67 36.33
... ... ...
2995 -119.86 34.42
2996 -118.14 34.06
2997 -119.70 36.30
2998 -117.12 34.10
2999 -119.63 34.42

3000 rows × 2 columns

# Select row(s)
df.iloc[0:3]
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -122.05 37.37 27.0 3885.0 661.0 1537.0 606.0 6.6085 344700.0
1 -118.30 34.26 43.0 1510.0 310.0 809.0 277.0 3.5990 176500.0
2 -117.81 33.78 27.0 3589.0 507.0 1484.0 495.0 5.7934 270500.0
# Select rows and columns by position with iloc (the end of a slice is excluded)
df.iloc[0:3, 0:2] # rows, columns
longitude latitude
0 -122.05 37.37
1 -118.30 34.26
2 -117.81 33.78
# Select rows and columns by label with loc (the end label 3 is included)
df.loc[0:3, ['longitude', 'latitude']]
longitude latitude
0 -122.05 37.37
1 -118.30 34.26
2 -117.81 33.78
3 -118.36 33.82
# Boolean mask: True for rows where population is above 1000
df['population'] > 1000
population
0 True
1 False
2 True
3 False
4 False
... ...
2995 True
2996 True
2997 False
2998 False
2999 False

3000 rows × 1 columns


# Conditional selection
df[df['population'] > 1000]
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -122.05 37.37 27.0 3885.0 661.0 1537.0 606.0 6.6085 344700.0
2 -117.81 33.78 27.0 3589.0 507.0 1484.0 495.0 5.7934 270500.0
7 -120.65 35.48 19.0 2310.0 471.0 1341.0 441.0 3.2250 166900.0
8 -122.84 38.40 15.0 3080.0 617.0 1446.0 599.0 3.6696 194400.0
9 -118.02 34.08 31.0 2402.0 632.0 2830.0 603.0 2.3333 164200.0
... ... ... ... ... ... ... ... ... ...
2988 -122.01 36.97 43.0 2162.0 509.0 1208.0 464.0 2.5417 260900.0
2989 -122.02 37.60 32.0 1295.0 295.0 1097.0 328.0 3.2386 149600.0
2990 -118.23 34.09 49.0 1638.0 456.0 1500.0 430.0 2.6923 150000.0
2995 -119.86 34.42 23.0 1450.0 642.0 1258.0 607.0 1.1790 225000.0
2996 -118.14 34.06 27.0 5257.0 1082.0 3496.0 1036.0 3.3906 237200.0

1798 rows × 9 columns

# Special selection methods
# query
df.query('population > 1000')
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -122.05 37.37 27.0 3885.0 661.0 1537.0 606.0 6.6085 344700.0
2 -117.81 33.78 27.0 3589.0 507.0 1484.0 495.0 5.7934 270500.0
7 -120.65 35.48 19.0 2310.0 471.0 1341.0 441.0 3.2250 166900.0
8 -122.84 38.40 15.0 3080.0 617.0 1446.0 599.0 3.6696 194400.0
9 -118.02 34.08 31.0 2402.0 632.0 2830.0 603.0 2.3333 164200.0
... ... ... ... ... ... ... ... ... ...
2988 -122.01 36.97 43.0 2162.0 509.0 1208.0 464.0 2.5417 260900.0
2989 -122.02 37.60 32.0 1295.0 295.0 1097.0 328.0 3.2386 149600.0
2990 -118.23 34.09 49.0 1638.0 456.0 1500.0 430.0 2.6923 150000.0
2995 -119.86 34.42 23.0 1450.0 642.0 1258.0 607.0 1.1790 225000.0
2996 -118.14 34.06 27.0 5257.0 1082.0 3496.0 1036.0 3.3906 237200.0

1798 rows × 9 columns

df.query('population > 1000 and housing_median_age < 10')
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
33 -118.08 34.55 5.0 16181.0 2971.0 8152.0 2651.0 4.5237 141800.0
45 -117.24 33.17 4.0 9998.0 1874.0 3925.0 1672.0 4.2826 237500.0
93 -117.50 33.87 4.0 6755.0 1017.0 2866.0 850.0 5.0493 239800.0
153 -118.38 34.27 8.0 3248.0 847.0 2608.0 731.0 2.8214 158300.0
182 -122.24 37.55 3.0 6164.0 1175.0 2198.0 975.0 6.7413 435900.0
... ... ... ... ... ... ... ... ... ...
2899 -121.92 38.02 8.0 2750.0 479.0 1526.0 484.0 5.1020 156500.0
2913 -122.39 37.78 3.0 3464.0 1179.0 1441.0 919.0 4.7105 275000.0
2930 -121.84 37.29 4.0 2937.0 648.0 1780.0 665.0 4.3851 160400.0
2936 -119.75 36.87 3.0 13802.0 2244.0 5226.0 1972.0 5.0941 143700.0
2969 -118.11 34.68 6.0 7430.0 1184.0 3489.0 1115.0 5.3267 140100.0

139 rows × 9 columns

df.query('population > 1000 or housing_median_age < 10')
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -122.05 37.37 27.0 3885.0 661.0 1537.0 606.0 6.6085 344700.0
2 -117.81 33.78 27.0 3589.0 507.0 1484.0 495.0 5.7934 270500.0
7 -120.65 35.48 19.0 2310.0 471.0 1341.0 441.0 3.2250 166900.0
8 -122.84 38.40 15.0 3080.0 617.0 1446.0 599.0 3.6696 194400.0
9 -118.02 34.08 31.0 2402.0 632.0 2830.0 603.0 2.3333 164200.0
... ... ... ... ... ... ... ... ... ...
2988 -122.01 36.97 43.0 2162.0 509.0 1208.0 464.0 2.5417 260900.0
2989 -122.02 37.60 32.0 1295.0 295.0 1097.0 328.0 3.2386 149600.0
2990 -118.23 34.09 49.0 1638.0 456.0 1500.0 430.0 2.6923 150000.0
2995 -119.86 34.42 23.0 1450.0 642.0 1258.0 607.0 1.1790 225000.0
2996 -118.14 34.06 27.0 5257.0 1082.0 3496.0 1036.0 3.3906 237200.0

1843 rows × 9 columns

df.select_dtypes(include=['number']) # Selects all numeric columns
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -122.05 37.37 27.0 3885.0 661.0 1537.0 606.0 6.6085 344700.0
1 -118.30 34.26 43.0 1510.0 310.0 809.0 277.0 3.5990 176500.0
2 -117.81 33.78 27.0 3589.0 507.0 1484.0 495.0 5.7934 270500.0
3 -118.36 33.82 28.0 67.0 15.0 49.0 11.0 6.1359 330000.0
4 -119.67 36.33 19.0 1241.0 244.0 850.0 237.0 2.9375 81700.0
... ... ... ... ... ... ... ... ... ...
2995 -119.86 34.42 23.0 1450.0 642.0 1258.0 607.0 1.1790 225000.0
2996 -118.14 34.06 27.0 5257.0 1082.0 3496.0 1036.0 3.3906 237200.0
2997 -119.70 36.30 10.0 956.0 201.0 693.0 220.0 2.2895 62000.0
2998 -117.12 34.10 40.0 96.0 14.0 46.0 14.0 3.2708 162500.0
2999 -119.63 34.42 42.0 1765.0 263.0 753.0 260.0 8.5608 500001.0

3000 rows × 9 columns

df.select_dtypes(include=['int64', 'float64']) # Selects specific numeric types
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -122.05 37.37 27.0 3885.0 661.0 1537.0 606.0 6.6085 344700.0
1 -118.30 34.26 43.0 1510.0 310.0 809.0 277.0 3.5990 176500.0
2 -117.81 33.78 27.0 3589.0 507.0 1484.0 495.0 5.7934 270500.0
3 -118.36 33.82 28.0 67.0 15.0 49.0 11.0 6.1359 330000.0
4 -119.67 36.33 19.0 1241.0 244.0 850.0 237.0 2.9375 81700.0
... ... ... ... ... ... ... ... ... ...
2995 -119.86 34.42 23.0 1450.0 642.0 1258.0 607.0 1.1790 225000.0
2996 -118.14 34.06 27.0 5257.0 1082.0 3496.0 1036.0 3.3906 237200.0
2997 -119.70 36.30 10.0 956.0 201.0 693.0 220.0 2.2895 62000.0
2998 -117.12 34.10 40.0 96.0 14.0 46.0 14.0 3.2708 162500.0
2999 -119.63 34.42 42.0 1765.0 263.0 753.0 260.0 8.5608 500001.0

3000 rows × 9 columns

tables[0].select_dtypes(exclude= ["number"]).dtypes
0
Country or territory object
Change (%) object
UN continental region[1] object
UN statistical subregion[1] object

# Pick 5 rows at random (the result differs on each run)
df.sample(5)
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
334 -118.13 34.01 45.0 1179.0 268.0 736.0 252.0 2.7083 161800.0
649 -122.27 37.80 39.0 1715.0 623.0 1327.0 467.0 1.8477 179200.0
604 -121.93 38.01 9.0 2294.0 389.0 1142.0 365.0 5.3363 160800.0
716 -121.17 37.97 28.0 1374.0 248.0 769.0 229.0 3.6389 130400.0
2981 -120.66 35.49 17.0 4422.0 945.0 2307.0 885.0 2.8285 171300.0
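
The list at the start of this section mentions ordering columns, which has no example above. Reading that as changing the order of the columns, one possible sketch on a small made-up table follows. Sorting rows by a column with `sort_values` is the other plausible reading and is included as well.

```python
import pandas as pd

toy = pd.DataFrame({'longitude': [-122.05, -118.30],
                    'latitude': [37.37, 34.26],
                    'population': [1537.0, 809.0]})

# Pass the column names in the order you want
reordered = toy[['population', 'latitude', 'longitude']]

# Or sort the column names alphabetically
alphabetical = toy[sorted(toy.columns)]

# Sort the rows by the values in one column
by_population = toy.sort_values('population')

print(reordered)
print(alphabetical)
print(by_population)
```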

Mathematical operations on series

  • addition, subtraction, multiplication and division
    • by constant
    • by another series
df['housing_median_age'].head()
housing_median_age
0 27.0
1 43.0
2 27.0
3 28.0
4 19.0

df['housing_median_age'] * 2
housing_median_age
0 54.0
1 86.0
2 54.0
3 56.0
4 38.0
... ...
2995 46.0
2996 54.0
2997 20.0
2998 80.0
2999 84.0

3000 rows × 1 columns


df["total_rooms"] + df["total_bedrooms"]
0
0 4546.0
1 1820.0
2 4096.0
3 82.0
4 1485.0
... ...
2995 2092.0
2996 6339.0
2997 1157.0
2998 110.0
2999 2028.0

3000 rows × 1 columns
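
The cells above show multiplication by a constant and addition of two Series. Subtraction and division work the same way. The sketch below uses three made-up values per Series and also shows that two Series are matched by index label, not by position.

```python
import pandas as pd

rooms = pd.Series([3885.0, 1510.0, 3589.0])
bedrooms = pd.Series([661.0, 310.0, 507.0])

print(rooms - 100)        # subtract a constant
print(rooms / 2)          # divide by a constant
print(rooms - bedrooms)   # subtract another Series
print(rooms / bedrooms)   # divide by another Series

# Values are paired by index label
shifted = pd.Series([1.0, 1.0], index=[1, 2])
total = rooms + shifted   # label 0 has no partner, so the result there is NaN
print(total)
```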


Creating, renaming

  • create rows and columns
  • rename columns
# Create a new column from two existing columns
df["extra column"] = df["total_rooms"] + df["total_bedrooms"]
df.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value extra column
0 -122.05 37.37 27.0 3885.0 661.0 1537.0 606.0 6.6085 344700.0 4546.0
1 -118.30 34.26 43.0 1510.0 310.0 809.0 277.0 3.5990 176500.0 1820.0
2 -117.81 33.78 27.0 3589.0 507.0 1484.0 495.0 5.7934 270500.0 4096.0
3 -118.36 33.82 28.0 67.0 15.0 49.0 11.0 6.1359 330000.0 82.0
4 -119.67 36.33 19.0 1241.0 244.0 850.0 237.0 2.9375 81700.0 1485.0
# Build a DataFrame row by row using loc:
# assigning to a label that does not exist yet creates that row
df_new = pd.DataFrame()

for i in range(10):
    df_new.loc[i, "new_col"] = (i + 2) / 2
df_new
new_col
0 1.0
1 1.5
2 2.0
3 2.5
4 3.0
5 3.5
6 4.0
7 4.5
8 5.0
9 5.5
df
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value extra column
0 -122.05 37.37 27.0 3885.0 661.0 1537.0 606.0 6.6085 344700.0 4546.0
1 -118.30 34.26 43.0 1510.0 310.0 809.0 277.0 3.5990 176500.0 1820.0
2 -117.81 33.78 27.0 3589.0 507.0 1484.0 495.0 5.7934 270500.0 4096.0
3 -118.36 33.82 28.0 67.0 15.0 49.0 11.0 6.1359 330000.0 82.0
4 -119.67 36.33 19.0 1241.0 244.0 850.0 237.0 2.9375 81700.0 1485.0
... ... ... ... ... ... ... ... ... ... ...
2995 -119.86 34.42 23.0 1450.0 642.0 1258.0 607.0 1.1790 225000.0 2092.0
2996 -118.14 34.06 27.0 5257.0 1082.0 3496.0 1036.0 3.3906 237200.0 6339.0
2997 -119.70 36.30 10.0 956.0 201.0 693.0 220.0 2.2895 62000.0 1157.0
2998 -117.12 34.10 40.0 96.0 14.0 46.0 14.0 3.2708 162500.0 110.0
2999 -119.63 34.42 42.0 1765.0 263.0 753.0 260.0 8.5608 500001.0 2028.0

3000 rows × 10 columns

# Create a new row using loc; columns that are not set become NaN
df.loc[3000, "longitude"] = 30
df
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value extra column
0 -122.05 37.37 27.0 3885.0 661.0 1537.0 606.0 6.6085 344700.0 4546.0
1 -118.30 34.26 43.0 1510.0 310.0 809.0 277.0 3.5990 176500.0 1820.0
2 -117.81 33.78 27.0 3589.0 507.0 1484.0 495.0 5.7934 270500.0 4096.0
3 -118.36 33.82 28.0 67.0 15.0 49.0 11.0 6.1359 330000.0 82.0
4 -119.67 36.33 19.0 1241.0 244.0 850.0 237.0 2.9375 81700.0 1485.0
... ... ... ... ... ... ... ... ... ... ...
2996 -118.14 34.06 27.0 5257.0 1082.0 3496.0 1036.0 3.3906 237200.0 6339.0
2997 -119.70 36.30 10.0 956.0 201.0 693.0 220.0 2.2895 62000.0 1157.0
2998 -117.12 34.10 40.0 96.0 14.0 46.0 14.0 3.2708 162500.0 110.0
2999 -119.63 34.42 42.0 1765.0 263.0 753.0 260.0 8.5608 500001.0 2028.0
3000 30.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN

3001 rows × 10 columns

df.isnull().sum()
0
longitude 0
latitude 1
housing_median_age 1
total_rooms 1
total_bedrooms 1
population 1
households 1
median_income 1
median_house_value 1
extra column 1

a_row = df.iloc[0]
a_row
0
longitude -122.0500
latitude 37.3700
housing_median_age 27.0000
total_rooms 3885.0000
total_bedrooms 661.0000
population 1537.0000
households 606.0000
median_income 6.6085
median_house_value 344700.0000
extra column 4546.0000

# Create a new row using loc by assigning a whole row (a Series)
df.loc[3001] = a_row
df
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value extra column
0 -122.05 37.37 27.0 3885.0 661.0 1537.0 606.0 6.6085 344700.0 4546.0
1 -118.30 34.26 43.0 1510.0 310.0 809.0 277.0 3.5990 176500.0 1820.0
2 -117.81 33.78 27.0 3589.0 507.0 1484.0 495.0 5.7934 270500.0 4096.0
3 -118.36 33.82 28.0 67.0 15.0 49.0 11.0 6.1359 330000.0 82.0
4 -119.67 36.33 19.0 1241.0 244.0 850.0 237.0 2.9375 81700.0 1485.0
... ... ... ... ... ... ... ... ... ... ...
2997 -119.70 36.30 10.0 956.0 201.0 693.0 220.0 2.2895 62000.0 1157.0
2998 -117.12 34.10 40.0 96.0 14.0 46.0 14.0 3.2708 162500.0 110.0
2999 -119.63 34.42 42.0 1765.0 263.0 753.0 260.0 8.5608 500001.0 2028.0
3000 30.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3001 -122.05 37.37 27.0 3885.0 661.0 1537.0 606.0 6.6085 344700.0 4546.0

3002 rows × 10 columns

# Rename a column (rename returns a new DataFrame; df itself is not changed)
df.rename(columns={'extra column': 'sum_column'})
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value sum_column
0 -122.05 37.37 27.0 3885.0 661.0 1537.0 606.0 6.6085 344700.0 4546.0
1 -118.30 34.26 43.0 1510.0 310.0 809.0 277.0 3.5990 176500.0 1820.0
2 -117.81 33.78 27.0 3589.0 507.0 1484.0 495.0 5.7934 270500.0 4096.0
3 -118.36 33.82 28.0 67.0 15.0 49.0 11.0 6.1359 330000.0 82.0
4 -119.67 36.33 19.0 1241.0 244.0 850.0 237.0 2.9375 81700.0 1485.0
... ... ... ... ... ... ... ... ... ... ...
2997 -119.70 36.30 10.0 956.0 201.0 693.0 220.0 2.2895 62000.0 1157.0
2998 -117.12 34.10 40.0 96.0 14.0 46.0 14.0 3.2708 162500.0 110.0
2999 -119.63 34.42 42.0 1765.0 263.0 753.0 260.0 8.5608 500001.0 2028.0
3000 30.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3001 -122.05 37.37 27.0 3885.0 661.0 1537.0 606.0 6.6085 344700.0 4546.0

3002 rows × 10 columns

# Rename several columns at once
df.rename(columns={'longitude': 'long', 'latitude': 'lat'})
long lat housing_median_age total_rooms total_bedrooms population households median_income median_house_value extra column
0 -122.05 37.37 27.0 3885.0 661.0 1537.0 606.0 6.6085 344700.0 4546.0
1 -118.30 34.26 43.0 1510.0 310.0 809.0 277.0 3.5990 176500.0 1820.0
2 -117.81 33.78 27.0 3589.0 507.0 1484.0 495.0 5.7934 270500.0 4096.0
3 -118.36 33.82 28.0 67.0 15.0 49.0 11.0 6.1359 330000.0 82.0
4 -119.67 36.33 19.0 1241.0 244.0 850.0 237.0 2.9375 81700.0 1485.0
... ... ... ... ... ... ... ... ... ... ...
2997 -119.70 36.30 10.0 956.0 201.0 693.0 220.0 2.2895 62000.0 1157.0
2998 -117.12 34.10 40.0 96.0 14.0 46.0 14.0 3.2708 162500.0 110.0
2999 -119.63 34.42 42.0 1765.0 263.0 753.0 260.0 8.5608 500001.0 2028.0
3000 30.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3001 -122.05 37.37 27.0 3885.0 661.0 1537.0 606.0 6.6085 344700.0 4546.0

3002 rows × 10 columns

# df still has the original names because the renamed result was not assigned back
df.columns
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'extra column'],
      dtype='object')
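
To keep a renamed column, the result of `rename` has to be assigned back. Rows can also be added with `pd.concat` instead of `loc`. The sketch below works on a small made-up table; `toy` and `new_row` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({'total_rooms': [3885.0, 1510.0],
                    'total_bedrooms': [661.0, 310.0]})
toy['extra column'] = toy['total_rooms'] + toy['total_bedrooms']

# Assign the result back so the new name is kept
toy = toy.rename(columns={'extra column': 'sum_column'})

# Add a row by concatenating a one-row DataFrame;
# ignore_index=True renumbers the rows 0, 1, 2
new_row = pd.DataFrame({'total_rooms': [3589.0],
                        'total_bedrooms': [507.0],
                        'sum_column': [4096.0]})
toy = pd.concat([toy, new_row], ignore_index=True)

print(toy)
```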
# groupby
# pivot
# melting
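
The three topics named above have no cells yet. As a rough sketch of what each one does, here is a small made-up table (not taken from the datasets in this notebook); all names and values are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'region': ['Asia', 'Asia', 'Europe', 'Europe'],
    'year': ['2022', '2023', '2022', '2023'],
    'population': [10, 12, 5, 6],
})

# groupby: split the rows by region, then aggregate each group
totals = toy.groupby('region')['population'].sum()

# pivot: one row per region, one column per year
wide = toy.pivot(index='region', columns='year', values='population')

# melt: turn the wide table back into long form
long_form = wide.reset_index().melt(id_vars='region',
                                    var_name='year',
                                    value_name='population')

print(totals)
print(wide)
print(long_form)
```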