Pandas Introduction

https://pandas.pydata.org/docs/user_guide/10min.html#min
https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
Pandas Tutorial (Data Analysis In Python)

This is an introduction to pandas, a Python library for loading, analyzing, and transforming data.

Anatomy of a dataframe

Columns, index and data

import pandas as pd

# Create a sample DataFrame
data = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]}
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])
df

	col1	col2	col3
row1	1	4	7
row2	2	5	8
row3	3	6	9

# Create the same DataFrame without an explicit index; pandas assigns the default labels 0, 1, 2
data = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]}
df_default = pd.DataFrame(data)
df_default

	col1	col2	col3
0	1	4	7
1	2	5	8
2	3	6	9

# Display columns
df.columns

Index(['col1', 'col2', 'col3'], dtype='object')

# Display index (row labels)
df.index

Index(['row1', 'row2', 'row3'], dtype='object')


# Display data (values)
df.values

array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])
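
The three parts shown above (index, columns and values) are enough to rebuild the frame. The following is a minimal sketch using the row-labelled sample data from this section; the names `df_labeled` and `rebuilt` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same sample data as above, with explicit row labels
data = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]}
df_labeled = pd.DataFrame(data, index=['row1', 'row2', 'row3'])

# Reassemble a DataFrame from its three parts: values, index and columns
rebuilt = pd.DataFrame(
    df_labeled.values,
    index=df_labeled.index,
    columns=df_labeled.columns,
)

print(df_labeled.shape)            # (number of rows, number of columns)
print(rebuilt.equals(df_labeled))  # compares labels, values and dtypes
```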

Creating Dataframe

From lists of lists
From series objects
From dictionaries

# From lists of lists
list_of_lists_data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df_from_list = pd.DataFrame(list_of_lists_data, columns=['Age', 'B', 'C'])
df_from_list

	Age	B	C
0	1	2	3
1	4	5	6
2	7	8	9

# A Series is a one-dimensional labeled array; each DataFrame column is a Series
pd.Series([1, 2, 3])

	0
0	1
1	2
2	3

dtype: int64

# From series objects
series_data = {'col1': pd.Series([1, 2, 3]),
               'col2': pd.Series([4, 5, 6])}
df_from_series = pd.DataFrame(series_data)
print("\nDataFrame from series objects:")
df_from_series


DataFrame from series objects:

	col1	col2
0	1	4
1	2	5
2	3	6

# From dictionaries
dictionary_data = {
    'col1': [1, 2, 3],
    'col2': [4, 5, 6]
    }

df_from_dict = pd.DataFrame(dictionary_data)
print("\nDataFrame from dictionaries:")
df_from_dict


DataFrame from dictionaries:

	col1	col2
0	1	4
1	2	5
2	3	6
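
When a DataFrame is built from Series objects, pandas lines the values up by index label rather than by position. The sketch below uses made-up labels (`'a'`, `'b'`, `'c'`) to show what happens when the Series do not share the same index; it is an illustration, not part of the original notebook.

```python
import pandas as pd

# Two Series with partly different index labels (hypothetical data)
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5], index=['a', 'b'])

# Values are aligned on the index; labels missing from a Series become NaN
df_aligned = pd.DataFrame({'col1': s1, 'col2': s2})
print(df_aligned)
```

Because `col2` now holds a missing value, pandas stores that column as float rather than int.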

Loading Datasets

Load data from a CSV file
Load data from a JSON file
Load data from an HTML page

df_csv = pd.read_csv("/content/sample_data/california_housing_test.csv")
df_csv

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
0	-122.05	37.37	27.0	3885.0	661.0	1537.0	606.0	6.6085	344700.0
1	-118.30	34.26	43.0	1510.0	310.0	809.0	277.0	3.5990	176500.0
2	-117.81	33.78	27.0	3589.0	507.0	1484.0	495.0	5.7934	270500.0
3	-118.36	33.82	28.0	67.0	15.0	49.0	11.0	6.1359	330000.0
4	-119.67	36.33	19.0	1241.0	244.0	850.0	237.0	2.9375	81700.0
...	...	...	...	...	...	...	...	...	...
2995	-119.86	34.42	23.0	1450.0	642.0	1258.0	607.0	1.1790	225000.0
2996	-118.14	34.06	27.0	5257.0	1082.0	3496.0	1036.0	3.3906	237200.0
2997	-119.70	36.30	10.0	956.0	201.0	693.0	220.0	2.2895	62000.0
2998	-117.12	34.10	40.0	96.0	14.0	46.0	14.0	3.2708	162500.0
2999	-119.63	34.42	42.0	1765.0	263.0	753.0	260.0	8.5608	500001.0

3000 rows × 9 columns

df_json = pd.read_json("/content/sample_data/anscombe.json")
df_json

	Series	X	Y
0	I	10	8.04
1	I	8	6.95
2	I	13	7.58
3	I	9	8.81
4	I	11	8.33
5	I	14	9.96
6	I	6	7.24
7	I	4	4.26
8	I	12	10.84
9	I	7	4.81
10	I	5	5.68
11	II	10	9.14
12	II	8	8.14
13	II	13	8.74
14	II	9	8.77
15	II	11	9.26
16	II	14	8.10
17	II	6	6.13
18	II	4	3.10
19	II	12	9.13
20	II	7	7.26
21	II	5	4.74
22	III	10	7.46
23	III	8	6.77
24	III	13	12.74
25	III	9	7.11
26	III	11	7.81
27	III	14	8.84
28	III	6	6.08
29	III	4	5.39
30	III	12	8.15
31	III	7	6.42
32	III	5	5.73
33	IV	8	6.58
34	IV	8	5.76
35	IV	8	7.71
36	IV	8	8.84
37	IV	8	8.47
38	IV	8	7.04
39	IV	8	5.25
40	IV	19	12.50
41	IV	8	5.56
42	IV	8	7.91
43	IV	8	6.89

import pandas as pd

# URL of the Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'

# Read every HTML table on the page into a list of DataFrames
# (read_html needs an HTML parser such as lxml to be installed)
tables = pd.read_html(url)

# Assuming the first table is the one we want
print("DataFrame loaded from Wikipedia table:")
tables[0]

DataFrame loaded from Wikipedia table:

	Country or territory	Population (1 July 2022)	Population (1 July 2023)	Change (%)	UN continental region[1]	UN statistical subregion[1]
0	World	8021407192	8091734930	+0.88%	–	–
1	India	1425423212	1438069596	+0.89%	Asia	Southern Asia
2	China[a]	1425179569	1422584933	−0.18%	Asia	Eastern Asia
3	United States	341534046	343477335	+0.57%	Americas	Northern America
4	Indonesia	278830529	281190067	+0.85%	Asia	South-eastern Asia
...	...	...	...	...	...	...
233	Montserrat (United Kingdom)	4453	4420	−0.74%	Americas	Caribbean
234	Falkland Islands (United Kingdom)	3490	3477	−0.37%	Americas	South America
235	Tokelau (New Zealand)	2290	2397	+4.67%	Oceania	Polynesia
236	Niue (New Zealand)	1821	1817	−0.22%	Oceania	Polynesia
237	Vatican City[w]	505	496	−1.78%	Europe	Southern Europe

238 rows × 6 columns

tables[1]

	vteLists of countries by population statistics	vteLists of countries by population statistics.1
0	Global	Current population United Nations Demographics...
1	Continents/subregions	Africa Antarctica Asia Europe North America Ca...
2	Intercontinental	Americas Arab world Commonwealth of Nations Eu...
3	Cities/urban areas	World cities National capitals Megacities Mega...
4	Past and future	Past and future population Estimates of histor...
5	Population density	Current density Past and future population den...
6	Growth indicators	Population growth rate Natural increase Net re...
7	Life expectancy	World Africa Asia Europe North America Oceania...
8	Other demographics	Age at childbearing Age at first marriage Age ...
9	Health	Antidepressant consumption Antiviral medicatio...
10	Education and innovation	Bloomberg Innovation Index Education Index Glo...
11	Economic	Access to financial services Development aid d...
12	List of international rankings Lists by country	List of international rankings Lists by country

Datatypes of columns

int
float
category
object

df_csv.dtypes

	0
longitude	float64
latitude	float64
housing_median_age	float64
total_rooms	float64
total_bedrooms	float64
population	float64
households	float64
median_income	float64
median_house_value	float64

dtype: object

tables[0].dtypes

	0
Country or territory	object
Population (1 July 2022)	int64
Population (1 July 2023)	int64
Change (%)	object
UN continental region[1]	object
UN statistical subregion[1]	object

dtype: object
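
The loaded datasets only show float, int and object columns. To see all four dtypes listed at the top of this section side by side, here is a small sketch with hypothetical data; the frame `df_types` and its column names are assumptions made for the example.

```python
import pandas as pd

df_types = pd.DataFrame({
    'rooms': [3, 5, 2],                                                  # int
    'income': [6.6, 3.6, 5.8],                                           # float
    'region': pd.Series(['Asia', 'Europe', 'Asia'], dtype='category'),  # category
    'name': pd.Series(['a', 'b', 'c'], dtype='object'),                 # object
})
print(df_types.dtypes)

# astype converts a column to another dtype and returns a new Series
rooms_as_float = df_types['rooms'].astype('float64')
print(rooms_as_float)
```

The `category` dtype suits columns with a small set of repeated values, such as the continental region column in the Wikipedia table.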

Summarizing dataframes

describe
missing values
value_counts() for categorical columns

# Work on a copy so that later changes to df do not modify df_csv
df = df_csv.copy()

df.describe()

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
count	3000.000000	3000.00000	3000.000000	3000.000000	3000.000000	3000.000000	3000.00000	3000.000000	3000.00000
mean	-119.589200	35.63539	28.845333	2599.578667	529.950667	1402.798667	489.91200	3.807272	205846.27500
std	1.994936	2.12967	12.555396	2155.593332	415.654368	1030.543012	365.42271	1.854512	113119.68747
min	-124.180000	32.56000	1.000000	6.000000	2.000000	5.000000	2.00000	0.499900	22500.00000
25%	-121.810000	33.93000	18.000000	1401.000000	291.000000	780.000000	273.00000	2.544000	121200.00000
50%	-118.485000	34.27000	29.000000	2106.000000	437.000000	1155.000000	409.50000	3.487150	177650.00000
75%	-118.020000	37.69000	37.000000	3129.000000	636.000000	1742.750000	597.25000	4.656475	263975.00000
max	-114.490000	41.92000	52.000000	30450.000000	5419.000000	11935.000000	4930.00000	15.000100	500001.00000

df.isnull().sum()

	0
longitude	0
latitude	0
housing_median_age	0
total_rooms	0
total_bedrooms	0
population	0
households	0
median_income	0
median_house_value	0

dtype: int64

tables[0]["UN continental region[1]"].value_counts()

	count
UN continental region[1]
Africa	58
Americas	55
Asia	51
Europe	50
Oceania	23
–	1

dtype: int64
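
The housing data has no missing values, so `isnull().sum()` returns all zeros above. The sketch below uses a small invented frame (`toy`) that does contain gaps, to show how the three summaries react to them.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
toy = pd.DataFrame({
    'age': [27.0, 43.0, np.nan, 28.0],
    'region': ['Asia', 'Europe', 'Asia', None],
})

summary = toy.describe()               # numeric columns only; NaN is skipped
missing = toy.isnull().sum()           # number of missing values per column
counts = toy['region'].value_counts()  # missing values are left out by default

print(summary)
print(missing)
print(counts)
```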

Selecting data

Select Column(s)
Select row(s)
Select columns and rows
Conditional selection
Special selection methods
- query
- select_dtypes
Order columns

#Select Column(s)
df[['longitude', 'latitude']]

	longitude	latitude
0	-122.05	37.37
1	-118.30	34.26
2	-117.81	33.78
3	-118.36	33.82
4	-119.67	36.33
...	...	...
2995	-119.86	34.42
2996	-118.14	34.06
2997	-119.70	36.30
2998	-117.12	34.10
2999	-119.63	34.42

3000 rows × 2 columns

# Select row(s)
df.iloc[0:3]

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
0	-122.05	37.37	27.0	3885.0	661.0	1537.0	606.0	6.6085	344700.0
1	-118.30	34.26	43.0	1510.0	310.0	809.0	277.0	3.5990	176500.0
2	-117.81	33.78	27.0	3589.0	507.0	1484.0	495.0	5.7934	270500.0

# Select rows and columns by position: iloc[rows, columns]
# The end of each slice is excluded, so 0:3 gives rows 0, 1 and 2
df.iloc[0:3, 0:2]

	longitude	latitude
0	-122.05	37.37
1	-118.30	34.26
2	-117.81	33.78

# Select rows and columns by label: loc[rows, columns]
# With loc the end label is included, so 0:3 gives rows 0, 1, 2 and 3
df.loc[0:3, ['longitude', 'latitude']]

	longitude	latitude
0	-122.05	37.37
1	-118.30	34.26
2	-117.81	33.78
3	-118.36	33.82
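
The two cells above return three and four rows for what looks like the same slice. A compact sketch of that difference, using a five-row frame with values copied from the first rows of the housing data (the name `small` is illustrative):

```python
import pandas as pd

small = pd.DataFrame({
    'longitude': [-122.05, -118.30, -117.81, -118.36, -119.67],
    'latitude': [37.37, 34.26, 33.78, 33.82, 36.33],
})

by_position = small.iloc[0:3]  # positions 0, 1, 2 (end excluded)
by_label = small.loc[0:3]      # labels 0, 1, 2, 3 (end included)

print(len(by_position), len(by_label))
```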

# boolean indexing
df['population'] > 1000

	population
0	True
1	False
2	True
3	False
4	False
...	...
2995	True
2996	True
2997	False
2998	False
2999	False

3000 rows × 1 columns

dtype: bool

# Conditional selection
df[df['population'] > 1000]

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
0	-122.05	37.37	27.0	3885.0	661.0	1537.0	606.0	6.6085	344700.0
2	-117.81	33.78	27.0	3589.0	507.0	1484.0	495.0	5.7934	270500.0
7	-120.65	35.48	19.0	2310.0	471.0	1341.0	441.0	3.2250	166900.0
8	-122.84	38.40	15.0	3080.0	617.0	1446.0	599.0	3.6696	194400.0
9	-118.02	34.08	31.0	2402.0	632.0	2830.0	603.0	2.3333	164200.0
...	...	...	...	...	...	...	...	...	...
2988	-122.01	36.97	43.0	2162.0	509.0	1208.0	464.0	2.5417	260900.0
2989	-122.02	37.60	32.0	1295.0	295.0	1097.0	328.0	3.2386	149600.0
2990	-118.23	34.09	49.0	1638.0	456.0	1500.0	430.0	2.6923	150000.0
2995	-119.86	34.42	23.0	1450.0	642.0	1258.0	607.0	1.1790	225000.0
2996	-118.14	34.06	27.0	5257.0	1082.0	3496.0	1036.0	3.3906	237200.0

1798 rows × 9 columns

# Special selection methods
# query
df.query('population > 1000')

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
0	-122.05	37.37	27.0	3885.0	661.0	1537.0	606.0	6.6085	344700.0
2	-117.81	33.78	27.0	3589.0	507.0	1484.0	495.0	5.7934	270500.0
7	-120.65	35.48	19.0	2310.0	471.0	1341.0	441.0	3.2250	166900.0
8	-122.84	38.40	15.0	3080.0	617.0	1446.0	599.0	3.6696	194400.0
9	-118.02	34.08	31.0	2402.0	632.0	2830.0	603.0	2.3333	164200.0
...	...	...	...	...	...	...	...	...	...
2988	-122.01	36.97	43.0	2162.0	509.0	1208.0	464.0	2.5417	260900.0
2989	-122.02	37.60	32.0	1295.0	295.0	1097.0	328.0	3.2386	149600.0
2990	-118.23	34.09	49.0	1638.0	456.0	1500.0	430.0	2.6923	150000.0
2995	-119.86	34.42	23.0	1450.0	642.0	1258.0	607.0	1.1790	225000.0
2996	-118.14	34.06	27.0	5257.0	1082.0	3496.0	1036.0	3.3906	237200.0

1798 rows × 9 columns

df.query('population > 1000 and housing_median_age < 10')

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
33	-118.08	34.55	5.0	16181.0	2971.0	8152.0	2651.0	4.5237	141800.0
45	-117.24	33.17	4.0	9998.0	1874.0	3925.0	1672.0	4.2826	237500.0
93	-117.50	33.87	4.0	6755.0	1017.0	2866.0	850.0	5.0493	239800.0
153	-118.38	34.27	8.0	3248.0	847.0	2608.0	731.0	2.8214	158300.0
182	-122.24	37.55	3.0	6164.0	1175.0	2198.0	975.0	6.7413	435900.0
...	...	...	...	...	...	...	...	...	...
2899	-121.92	38.02	8.0	2750.0	479.0	1526.0	484.0	5.1020	156500.0
2913	-122.39	37.78	3.0	3464.0	1179.0	1441.0	919.0	4.7105	275000.0
2930	-121.84	37.29	4.0	2937.0	648.0	1780.0	665.0	4.3851	160400.0
2936	-119.75	36.87	3.0	13802.0	2244.0	5226.0	1972.0	5.0941	143700.0
2969	-118.11	34.68	6.0	7430.0	1184.0	3489.0	1115.0	5.3267	140100.0

139 rows × 9 columns

df.query('population > 1000 or housing_median_age < 10')

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
0	-122.05	37.37	27.0	3885.0	661.0	1537.0	606.0	6.6085	344700.0
2	-117.81	33.78	27.0	3589.0	507.0	1484.0	495.0	5.7934	270500.0
7	-120.65	35.48	19.0	2310.0	471.0	1341.0	441.0	3.2250	166900.0
8	-122.84	38.40	15.0	3080.0	617.0	1446.0	599.0	3.6696	194400.0
9	-118.02	34.08	31.0	2402.0	632.0	2830.0	603.0	2.3333	164200.0
...	...	...	...	...	...	...	...	...	...
2988	-122.01	36.97	43.0	2162.0	509.0	1208.0	464.0	2.5417	260900.0
2989	-122.02	37.60	32.0	1295.0	295.0	1097.0	328.0	3.2386	149600.0
2990	-118.23	34.09	49.0	1638.0	456.0	1500.0	430.0	2.6923	150000.0
2995	-119.86	34.42	23.0	1450.0	642.0	1258.0	607.0	1.1790	225000.0
2996	-118.14	34.06	27.0	5257.0	1082.0	3496.0	1036.0	3.3906	237200.0

1843 rows × 9 columns
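
The `and` / `or` conditions inside `query` can also be written with boolean indexing, using `&` and `|` with parentheses around each comparison. The sketch below checks on a small invented frame (`toy`) that both spellings select the same rows.

```python
import pandas as pd

# Hypothetical four-row sample with the two columns used in the queries above
toy = pd.DataFrame({
    'population': [1537.0, 809.0, 8152.0, 49.0],
    'housing_median_age': [27.0, 43.0, 5.0, 8.0],
})

# Both conditions must hold
mask_and = (toy['population'] > 1000) & (toy['housing_median_age'] < 10)
via_and = toy[mask_and]

# At least one condition must hold
mask_or = (toy['population'] > 1000) | (toy['housing_median_age'] < 10)
via_or = toy[mask_or]

# The query strings from above give the same selections
same_and = via_and.equals(toy.query('population > 1000 and housing_median_age < 10'))
same_or = via_or.equals(toy.query('population > 1000 or housing_median_age < 10'))
print(same_and, same_or)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.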

df.select_dtypes(include=['number']) # Selects all numeric columns

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
0	-122.05	37.37	27.0	3885.0	661.0	1537.0	606.0	6.6085	344700.0
1	-118.30	34.26	43.0	1510.0	310.0	809.0	277.0	3.5990	176500.0
2	-117.81	33.78	27.0	3589.0	507.0	1484.0	495.0	5.7934	270500.0
3	-118.36	33.82	28.0	67.0	15.0	49.0	11.0	6.1359	330000.0
4	-119.67	36.33	19.0	1241.0	244.0	850.0	237.0	2.9375	81700.0
...	...	...	...	...	...	...	...	...	...
2995	-119.86	34.42	23.0	1450.0	642.0	1258.0	607.0	1.1790	225000.0
2996	-118.14	34.06	27.0	5257.0	1082.0	3496.0	1036.0	3.3906	237200.0
2997	-119.70	36.30	10.0	956.0	201.0	693.0	220.0	2.2895	62000.0
2998	-117.12	34.10	40.0	96.0	14.0	46.0	14.0	3.2708	162500.0
2999	-119.63	34.42	42.0	1765.0	263.0	753.0	260.0	8.5608	500001.0

3000 rows × 9 columns

df.select_dtypes(include=['int64', 'float64']) # Selects specific numeric types

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
0	-122.05	37.37	27.0	3885.0	661.0	1537.0	606.0	6.6085	344700.0
1	-118.30	34.26	43.0	1510.0	310.0	809.0	277.0	3.5990	176500.0
2	-117.81	33.78	27.0	3589.0	507.0	1484.0	495.0	5.7934	270500.0
3	-118.36	33.82	28.0	67.0	15.0	49.0	11.0	6.1359	330000.0
4	-119.67	36.33	19.0	1241.0	244.0	850.0	237.0	2.9375	81700.0
...	...	...	...	...	...	...	...	...	...
2995	-119.86	34.42	23.0	1450.0	642.0	1258.0	607.0	1.1790	225000.0
2996	-118.14	34.06	27.0	5257.0	1082.0	3496.0	1036.0	3.3906	237200.0
2997	-119.70	36.30	10.0	956.0	201.0	693.0	220.0	2.2895	62000.0
2998	-117.12	34.10	40.0	96.0	14.0	46.0	14.0	3.2708	162500.0
2999	-119.63	34.42	42.0	1765.0	263.0	753.0	260.0	8.5608	500001.0

3000 rows × 9 columns

tables[0].select_dtypes(exclude=['number']).dtypes # Keeps only the non-numeric columns

	0
Country or territory	object
Change (%)	object
UN continental region[1]	object
UN statistical subregion[1]	object

dtype: object

# Randomly sample 5 rows (the rows returned differ from run to run)
df.sample(5)

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
334	-118.13	34.01	45.0	1179.0	268.0	736.0	252.0	2.7083	161800.0
649	-122.27	37.80	39.0	1715.0	623.0	1327.0	467.0	1.8477	179200.0
604	-121.93	38.01	9.0	2294.0	389.0	1142.0	365.0	5.3363	160800.0
716	-121.17	37.97	28.0	1374.0	248.0	769.0	229.0	3.6389	130400.0
2981	-120.66	35.49	17.0	4422.0	945.0	2307.0	885.0	2.8285	171300.0
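
The outline of this section lists "Order columns", which the cells above do not demonstrate. Assuming that item means rearranging the column order, one possible sketch follows. It also shows `random_state`, which makes `sample` repeatable. The frame `toy` and all variable names are invented for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    'longitude': [-122.05, -118.30, -117.81],
    'latitude': [37.37, 34.26, 33.78],
    'population': [1537.0, 809.0, 1484.0],
})

# Put the columns in an explicit order by listing them
reordered = toy[['population', 'latitude', 'longitude']]

# Or sort the column labels alphabetically
alphabetical = toy.sort_index(axis=1)

# A fixed random_state returns the same sample on every call
sample_a = toy.sample(2, random_state=0)
sample_b = toy.sample(2, random_state=0)

print(list(reordered.columns))
print(list(alphabetical.columns))
```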

Mathematical operations on series

addition, subtraction, multiplication and division
- by constant
- by another series

df['housing_median_age'].head()

	housing_median_age
0	27.0
1	43.0
2	27.0
3	28.0
4	19.0

dtype: float64

df['housing_median_age'] * 2

	housing_median_age
0	54.0
1	86.0
2	54.0
3	56.0
4	38.0
...	...
2995	46.0
2996	54.0
2997	20.0
2998	80.0
2999	84.0

3000 rows × 1 columns

dtype: float64

df["total_rooms"] + df["total_bedrooms"]

	0
0	4546.0
1	1820.0
2	4096.0
3	82.0
4	1485.0
...	...
2995	2092.0
2996	6339.0
2997	1157.0
2998	110.0
2999	2028.0

3000 rows × 1 columns

dtype: float64
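
The cells above show multiplication by a constant and addition of two Series. The remaining operations from the list work the same way, element by element. A short sketch with round, invented numbers so the results are easy to check by hand:

```python
import pandas as pd

# Hypothetical values chosen to divide evenly
toy = pd.DataFrame({
    'total_rooms': [10.0, 20.0, 30.0],
    'total_bedrooms': [1.0, 4.0, 6.0],
    'households': [2.0, 4.0, 5.0],
})

other_rooms = toy['total_rooms'] - toy['total_bedrooms']     # Series minus Series
rooms_per_household = toy['total_rooms'] / toy['households']  # Series divided by Series
rooms_minus_one = toy['total_rooms'] - 1                      # Series minus constant

print(other_rooms.tolist())          # [9.0, 16.0, 24.0]
print(rooms_per_household.tolist())  # [5.0, 5.0, 6.0]
print(rooms_minus_one.tolist())      # [9.0, 19.0, 29.0]
```

Operations between two Series are matched on index labels, not on position, so both operands should share the same index.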

Creating, renaming

create rows and columns
rename columns

# Create a new column from two existing columns
df["extra column"] = df["total_rooms"] + df["total_bedrooms"]
df.head()

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	extra column
0	-122.05	37.37	27.0	3885.0	661.0	1537.0	606.0	6.6085	344700.0	4546.0
1	-118.30	34.26	43.0	1510.0	310.0	809.0	277.0	3.5990	176500.0	1820.0
2	-117.81	33.78	27.0	3589.0	507.0	1484.0	495.0	5.7934	270500.0	4096.0
3	-118.36	33.82	28.0	67.0	15.0	49.0	11.0	6.1359	330000.0	82.0
4	-119.67	36.33	19.0	1241.0	244.0	850.0	237.0	2.9375	81700.0	1485.0

# Build a new DataFrame one row at a time using loc
# (assigning to a label that does not exist yet adds that row)

df_new = pd.DataFrame()

for i in range(10):
    df_new.loc[i, "new_col"] = (i + 2) / 2
df_new

	new_col
0	1.0
1	1.5
2	2.0
3	2.5
4	3.0
5	3.5
6	4.0
7	4.5
8	5.0
9	5.5
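
Adding rows in a loop is fine for ten values, but each assignment enlarges the frame, which can get slow for large data. A common alternative, sketched here under the assumption that all values are known up front, is to compute the list first and build the frame in one step. The names `values` and `df_new_fast` are illustrative.

```python
import pandas as pd

# Compute all values first, then create the DataFrame once
values = [(i + 2) / 2 for i in range(10)]
df_new_fast = pd.DataFrame({'new_col': values})

print(df_new_fast)
```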

df

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	extra column
0	-122.05	37.37	27.0	3885.0	661.0	1537.0	606.0	6.6085	344700.0	4546.0
1	-118.30	34.26	43.0	1510.0	310.0	809.0	277.0	3.5990	176500.0	1820.0
2	-117.81	33.78	27.0	3589.0	507.0	1484.0	495.0	5.7934	270500.0	4096.0
3	-118.36	33.82	28.0	67.0	15.0	49.0	11.0	6.1359	330000.0	82.0
4	-119.67	36.33	19.0	1241.0	244.0	850.0	237.0	2.9375	81700.0	1485.0
...	...	...	...	...	...	...	...	...	...	...
2995	-119.86	34.42	23.0	1450.0	642.0	1258.0	607.0	1.1790	225000.0	2092.0
2996	-118.14	34.06	27.0	5257.0	1082.0	3496.0	1036.0	3.3906	237200.0	6339.0
2997	-119.70	36.30	10.0	956.0	201.0	693.0	220.0	2.2895	62000.0	1157.0
2998	-117.12	34.10	40.0	96.0	14.0	46.0	14.0	3.2708	162500.0	110.0
2999	-119.63	34.42	42.0	1765.0	263.0	753.0	260.0	8.5608	500001.0	2028.0

3000 rows × 10 columns

# Create a new row using loc; columns that are not assigned are filled with NaN
df.loc[3000, "longitude"] = 30
df

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	extra column
0	-122.05	37.37	27.0	3885.0	661.0	1537.0	606.0	6.6085	344700.0	4546.0
1	-118.30	34.26	43.0	1510.0	310.0	809.0	277.0	3.5990	176500.0	1820.0
2	-117.81	33.78	27.0	3589.0	507.0	1484.0	495.0	5.7934	270500.0	4096.0
3	-118.36	33.82	28.0	67.0	15.0	49.0	11.0	6.1359	330000.0	82.0
4	-119.67	36.33	19.0	1241.0	244.0	850.0	237.0	2.9375	81700.0	1485.0
...	...	...	...	...	...	...	...	...	...	...
2996	-118.14	34.06	27.0	5257.0	1082.0	3496.0	1036.0	3.3906	237200.0	6339.0
2997	-119.70	36.30	10.0	956.0	201.0	693.0	220.0	2.2895	62000.0	1157.0
2998	-117.12	34.10	40.0	96.0	14.0	46.0	14.0	3.2708	162500.0	110.0
2999	-119.63	34.42	42.0	1765.0	263.0	753.0	260.0	8.5608	500001.0	2028.0
3000	30.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

3001 rows × 10 columns

df.isnull().sum()

	0
longitude	0
latitude	1
housing_median_age	1
total_rooms	1
total_bedrooms	1
population	1
households	1
median_income	1
median_house_value	1
extra column	1

dtype: int64

a_row = df.iloc[0]
a_row

	0
longitude	-122.0500
latitude	37.3700
housing_median_age	27.0000
total_rooms	3885.0000
total_bedrooms	661.0000
population	1537.0000
households	606.0000
median_income	6.6085
median_house_value	344700.0000
extra column	4546.0000

dtype: float64

# Append a copy of an existing row under a new label using loc
df.loc[3001] = a_row
df

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	extra column
0	-122.05	37.37	27.0	3885.0	661.0	1537.0	606.0	6.6085	344700.0	4546.0
1	-118.30	34.26	43.0	1510.0	310.0	809.0	277.0	3.5990	176500.0	1820.0
2	-117.81	33.78	27.0	3589.0	507.0	1484.0	495.0	5.7934	270500.0	4096.0
3	-118.36	33.82	28.0	67.0	15.0	49.0	11.0	6.1359	330000.0	82.0
4	-119.67	36.33	19.0	1241.0	244.0	850.0	237.0	2.9375	81700.0	1485.0
...	...	...	...	...	...	...	...	...	...	...
2997	-119.70	36.30	10.0	956.0	201.0	693.0	220.0	2.2895	62000.0	1157.0
2998	-117.12	34.10	40.0	96.0	14.0	46.0	14.0	3.2708	162500.0	110.0
2999	-119.63	34.42	42.0	1765.0	263.0	753.0	260.0	8.5608	500001.0	2028.0
3000	30.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3001	-122.05	37.37	27.0	3885.0	661.0	1537.0	606.0	6.6085	344700.0	4546.0

3002 rows × 10 columns
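
Besides assigning through `loc`, rows can be added by concatenating DataFrames with `pd.concat`, which is convenient when several rows arrive at once. A minimal sketch with invented data; `base`, `new_rows` and `combined` are hypothetical names.

```python
import pandas as pd

base = pd.DataFrame({
    'longitude': [-122.05, -118.30],
    'latitude': [37.37, 34.26],
})

# The new row has no latitude, so that cell becomes NaN after concatenation
new_rows = pd.DataFrame({'longitude': [30.0]})

# ignore_index=True renumbers the rows 0, 1, 2 instead of keeping the old labels
combined = pd.concat([base, new_rows], ignore_index=True)
print(combined)
```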

# Rename a column (rename returns a new DataFrame; df itself is not changed)
df.rename(columns={'extra column': 'sum_column'})

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	sum_column
0	-122.05	37.37	27.0	3885.0	661.0	1537.0	606.0	6.6085	344700.0	4546.0
1	-118.30	34.26	43.0	1510.0	310.0	809.0	277.0	3.5990	176500.0	1820.0
2	-117.81	33.78	27.0	3589.0	507.0	1484.0	495.0	5.7934	270500.0	4096.0
3	-118.36	33.82	28.0	67.0	15.0	49.0	11.0	6.1359	330000.0	82.0
4	-119.67	36.33	19.0	1241.0	244.0	850.0	237.0	2.9375	81700.0	1485.0
...	...	...	...	...	...	...	...	...	...	...
2997	-119.70	36.30	10.0	956.0	201.0	693.0	220.0	2.2895	62000.0	1157.0
2998	-117.12	34.10	40.0	96.0	14.0	46.0	14.0	3.2708	162500.0	110.0
2999	-119.63	34.42	42.0	1765.0	263.0	753.0	260.0	8.5608	500001.0	2028.0
3000	30.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3001	-122.05	37.37	27.0	3885.0	661.0	1537.0	606.0	6.6085	344700.0	4546.0

3002 rows × 10 columns

# Rename several columns at once (again, the result is a new DataFrame)
df.rename(columns={'longitude': 'long', 'latitude': 'lat'})

	long	lat	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	extra column
0	-122.05	37.37	27.0	3885.0	661.0	1537.0	606.0	6.6085	344700.0	4546.0
1	-118.30	34.26	43.0	1510.0	310.0	809.0	277.0	3.5990	176500.0	1820.0
2	-117.81	33.78	27.0	3589.0	507.0	1484.0	495.0	5.7934	270500.0	4096.0
3	-118.36	33.82	28.0	67.0	15.0	49.0	11.0	6.1359	330000.0	82.0
4	-119.67	36.33	19.0	1241.0	244.0	850.0	237.0	2.9375	81700.0	1485.0
...	...	...	...	...	...	...	...	...	...	...
2997	-119.70	36.30	10.0	956.0	201.0	693.0	220.0	2.2895	62000.0	1157.0
2998	-117.12	34.10	40.0	96.0	14.0	46.0	14.0	3.2708	162500.0	110.0
2999	-119.63	34.42	42.0	1765.0	263.0	753.0	260.0	8.5608	500001.0	2028.0
3000	30.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3001	-122.05	37.37	27.0	3885.0	661.0	1537.0	606.0	6.6085	344700.0	4546.0

3002 rows × 10 columns

# df still has its original column names because the renamed results were not assigned back
df.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'extra column'],
      dtype='object')
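
To keep the new names, the result of `rename` has to be stored. A small sketch on an invented two-column frame (`toy`), showing that the original is untouched while the assigned result carries the new name:

```python
import pandas as pd

toy = pd.DataFrame({
    'longitude': [-122.05, -118.30],
    'extra column': [4546.0, 1820.0],
})

# rename returns a new DataFrame; assign it to keep the change
renamed = toy.rename(columns={'extra column': 'sum_column'})

print(list(toy.columns))      # original names
print(list(renamed.columns))  # new names
```

Assigning back to the same variable (`toy = toy.rename(...)`) is the usual way to make the change stick.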

# groupby
# pivot
# melting