apply_and_map – Data Science Lab

# Applying a Function on a Pandas Series

The apply() method is one of the most common methods of data preprocessing. It simplifies applying a function to each element in a pandas Series and to each row or column in a pandas DataFrame. In this tutorial, we’ll learn how to use the apply() method in pandas. You’ll need to know the fundamentals of Python and lambda functions. If you aren’t familiar with these or need to brush up on your Python skills, you might like to try our free Python Fundamentals course.

Series form the basis of pandas. A Series is a one-dimensional array whose elements carry axis labels, collectively called the index.

There are different ways of creating a Series object (e.g., we can initialize a Series with lists or dictionaries). Let’s define a Series object from two lists: one containing student labels to use as the index, and one containing the students’ heights in centimeters to use as the data:

import pandas as pd
import numpy as np
from IPython.display import display

students = pd.Series(data=[180, 175, 168, 190],
                     index=['A', 'B', 'C', 'D'])
display(students)
print(type(students))

A    180
B    175
C    168
D    190
dtype: int64

<class 'pandas.core.series.Series'>

The code above displays the content of the students object and prints its type.
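As mentioned earlier, a Series can also be initialized from a dictionary. The following is a minimal sketch of that alternative; the names heights_by_label and students_from_dict are illustrative and do not appear elsewhere in this tutorial.

```python
import pandas as pd

# The dictionary keys become the index labels and the values become the data.
heights_by_label = {'A': 180, 'B': 175, 'C': 168, 'D': 190}
students_from_dict = pd.Series(heights_by_label)

print(students_from_dict)
```

The result should hold the same labels and heights as the students object built from two lists.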

Because the students object is a Series, we can apply any function to its data using the apply() method. Let’s see how we can convert the heights of the students from centimeters to feet:

def cm_to_feet(h):
    return np.round(h/30.48, 2)

print(students.apply(cm_to_feet))

A    5.91
B    5.74
C    5.51
D    6.23
dtype: float64

The students’ heights are converted to feet and rounded to two decimal places. To do so, we first defined a function that does the conversion, then passed the function name without parentheses to the apply() method. The apply() method takes each element in the Series and applies the cm_to_feet() function to it.
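For simple arithmetic like this, pandas can usually do the same work without apply(), by operating on the whole Series at once. The sketch below compares the two approaches on the same data; the variable names applied and vectorized are illustrative only.

```python
import numpy as np
import pandas as pd

students = pd.Series(data=[180, 175, 168, 190],
                     index=['A', 'B', 'C', 'D'])

def cm_to_feet(h):
    return np.round(h / 30.48, 2)

# Element-by-element conversion with apply()
applied = students.apply(cm_to_feet)

# Whole-Series arithmetic followed by Series.round()
vectorized = (students / 30.48).round(2)

# Check that both approaches agree (up to floating-point tolerance)
print(np.allclose(applied, vectorized))
```

The apply() form remains the more general tool, since it accepts any Python function, while the arithmetic form only works when the operation can be expressed with Series operators and methods.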

data1 = pd.DataFrame({'EmployeeName': ['Callen Dunkley', 'Sarah Rayner', 'Jeanette Sloan', 'Kaycee Acosta', 'Henri Conroy', 'Emma Peralta', 'Martin Butt', 'Alex Jensen', 'Kim Howarth', 'Jane Burnett'],
                    'Department': ['Accounting', 'Engineering', 'Engineering', 'HR', 'HR', 'HR', 'Data Science', 'Data Science', 'Accounting', 'Data Science'],
                    'HireDate': [2010, 2018, 2012, 2014, 2014, 2018, 2020, 2018, 2020, 2012],
                    'Sex': ['M', 'F', 'F', 'F', 'M', 'F', 'M', 'M', 'M', 'F'],
                    'Birthdate': ['04/09/1982', '14/04/1981', '06/05/1997', '08/01/1986', '10/10/1988', '12/11/1992', '10/04/1991', '16/07/1995', '08/10/1992', '11/10/1979'],
                    'Weight': [78, 80, 66, 67, 90, 57, 115, 87, 95, 57],
                    'Height': [176, 160, 169, 157, 185, 164, 195, 180, 174, 165],
                    'Kids': [2, 1, 0, 1, 1, 0, 2, 0, 3, 1]
                    })
display(data1)

	EmployeeName	Department	HireDate	Sex	Birthdate	Weight	Height	Kids
0	Callen Dunkley	Accounting	2010	M	04/09/1982	78	176	2
1	Sarah Rayner	Engineering	2018	F	14/04/1981	80	160	1
2	Jeanette Sloan	Engineering	2012	F	06/05/1997	66	169	0
3	Kaycee Acosta	HR	2014	F	08/01/1986	67	157	1
4	Henri Conroy	HR	2014	M	10/10/1988	90	185	1
5	Emma Peralta	HR	2018	F	12/11/1992	57	164	0
6	Martin Butt	Data Science	2020	M	10/04/1991	115	195	2
7	Alex Jensen	Data Science	2018	M	16/07/1995	87	180	0
8	Kim Howarth	Accounting	2020	M	08/10/1992	95	174	3
9	Jane Burnett	Data Science	2012	F	11/10/1979	57	165	1

In this section, we’ll work on dummy requests initiated by the company’s HR team. We’ll learn how to use the apply() method by going through different scenarios. We’ll explore a new use case in each scenario and solve it using the apply() method.

Scenario 1: Let’s assume that the HR team wants to send an invitation email that starts with a friendly greeting to all the employees (e.g., Hey, Sarah!). They asked you to create two columns that store the employees’ first and last names separately, which makes it easy to refer to employees by first name. To do so, we can use a lambda function that calls the split() method on each full name. split() breaks a string at the specified separator and returns a list of the pieces; by default, the separator is any whitespace. Let’s look at the code:

data1['FirstName'] = data1['EmployeeName'].apply(lambda x : x.split()[0])
data1['LastName'] = data1['EmployeeName'].apply(lambda x : x.split()[1])
display(data1)

	EmployeeName	Department	HireDate	Sex	Birthdate	Weight	Height	Kids	FirstName	LastName
0	Callen Dunkley	Accounting	2010	M	04/09/1982	78	176	2	Callen	Dunkley
1	Sarah Rayner	Engineering	2018	F	14/04/1981	80	160	1	Sarah	Rayner
2	Jeanette Sloan	Engineering	2012	F	06/05/1997	66	169	0	Jeanette	Sloan
3	Kaycee Acosta	HR	2014	F	08/01/1986	67	157	1	Kaycee	Acosta
4	Henri Conroy	HR	2014	M	10/10/1988	90	185	1	Henri	Conroy
5	Emma Peralta	HR	2018	F	12/11/1992	57	164	0	Emma	Peralta
6	Martin Butt	Data Science	2020	M	10/04/1991	115	195	2	Martin	Butt
7	Alex Jensen	Data Science	2018	M	16/07/1995	87	180	0	Alex	Jensen
8	Kim Howarth	Accounting	2020	M	08/10/1992	95	174	3	Kim	Howarth
9	Jane Burnett	Data Science	2012	F	11/10/1979	57	165	1	Jane	Burnett

In the code above, we applied the lambda function on the EmployeeName column, which is technically a Series object. The lambda function splits the employees’ full names into first and last names. Thus, the code creates two more columns that contain the first and last names of employees.
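The sketch below shows how the first names could then be turned into the greeting from the scenario, and what happens when a name has more than two parts. The name 'Mary Ann Smith' is a hypothetical example that is not in the employee data, and treating everything after the first space as the last name is an assumption made for illustration.

```python
import pandas as pd

# Two names from the employee table plus one hypothetical three-part name
names = pd.Series(['Callen Dunkley', 'Sarah Rayner', 'Mary Ann Smith'])

first_names = names.apply(lambda full: full.split()[0])

# maxsplit=1 splits only at the first whitespace, so the remainder of the
# name stays together instead of being cut off after the second word
last_names = names.apply(lambda full: full.split(maxsplit=1)[1])

# Build the greeting text from each first name
greetings = first_names.apply(lambda first: f'Hey, {first}!')

print(greetings.tolist())
print(last_names.tolist())
```

Note that indexing with [1] still raises an IndexError for a single-word name, so real data may need an extra check.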

# Sample flight data, defined as one Series per column
airline_series = pd.Series(['IndiGo', 'Air India', 'Jet Airways', 'SpiceJet', 'Vistara'])
duration_series = pd.Series(['2h 5m', '2h', '50m', '1h 30m', '3h 20m'])
arrival_time_series = pd.Series(['21:45', '10:00', '00:50', '15:30', '19:15'])
departure_time_series = pd.Series(['19:40', '07:55', '00:44', '14:00', '15:55'])
distance_series = pd.Series([2050, 2000, 220, 800, 3000])

# Create DataFrame
data = pd.DataFrame({
    'Airline': airline_series,
    'Duration': duration_series,
    'Arrival Time': arrival_time_series,
    'Departure Time': departure_time_series,
    'Distance (km)': distance_series
})

# Display DataFrame
data

	Airline	Duration	Arrival Time	Departure Time	Distance (km)
0	IndiGo	2h 5m	21:45	19:40	2050
1	Air India	2h	10:00	07:55	2000
2	Jet Airways	50m	00:50	00:44	220
3	SpiceJet	1h 30m	15:30	14:00	800
4	Vistara	3h 20m	19:15	15:55	3000

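def min_sec(s):
    # Convert a duration string such as "2h 30m" to a total number of seconds
    sec = 0
    for part in s.split():  # e.g. ["2h", "30m"]
        if "h" in part:
            sec += int(part[:-1]) * 60 * 60  # hours to seconds
        if "m" in part:
            sec += int(part[:-1]) * 60  # minutes to seconds
    return sec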

data["new_duration"] = data['Duration'].apply(min_sec)

data

	Airline	Duration	Arrival Time	Departure Time	Distance (km)	new_duration
0	IndiGo	2h 5m	21:45	19:40	2050	7500
1	Air India	2h	10:00	07:55	2000	7200
2	Jet Airways	50m	00:50	00:44	220	3000
3	SpiceJet	1h 30m	15:30	14:00	800	5400
4	Vistara	3h 20m	19:15	15:55	3000	12000
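The same conversion can also be sketched with regular expressions, which do not depend on the hour and minute parts being separated by a space. The function name duration_to_seconds is illustrative, and the assumption is that durations only ever contain an optional "<number>h" part and an optional "<number>m" part.

```python
import re

import pandas as pd

def duration_to_seconds(text):
    """Convert strings such as '2h 5m', '2h' or '50m' to seconds."""
    hours = re.search(r'(\d+)h', text)
    minutes = re.search(r'(\d+)m', text)
    total = 0
    if hours:
        total += int(hours.group(1)) * 3600
    if minutes:
        total += int(minutes.group(1)) * 60
    return total

durations = pd.Series(['2h 5m', '2h', '50m', '1h 30m', '3h 20m'])
seconds = durations.apply(duration_to_seconds)

print(seconds.tolist())
# [7500, 7200, 3000, 5400, 12000]
```

These values match the new_duration column shown above.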

## Converting the ‘Departure Time’ and ‘Arrival Time’ columns to parts of the day

First, we reduce each value to its hour component. For instance, if an entry is 10:05, the corresponding value should be 10. Then we convert the hour into one of four categories:

5 <= hour < 12: Morning
12 <= hour < 17: Afternoon
17 <= hour < 20: Evening
hour >= 20 or hour < 5: Night

def hr(time):
    # Extract the hour component, e.g. '10:05' -> 10
    hour = int(time.split(':')[0])
    if 5 <= hour < 12:
        return 'Morning'
    if 12 <= hour < 17:
        return 'Afternoon'
    if 17 <= hour < 20:
        return 'Evening'
    # Remaining hours (20 and later, or before 5) wrap around midnight
    return 'Night'

data['Arrival Time'] = data['Arrival Time'].apply(hr)

data

	Airline	Duration	Arrival Time	Departure Time	Distance (km)	new_duration
0	IndiGo	2h 5m	Night	19:40	2050	7500
1	Air India	2h	Morning	07:55	2000	7200
2	Jet Airways	50m	Night	00:44	220	3000
3	SpiceJet	1h 30m	Afternoon	14:00	800	5400
4	Vistara	3h 20m	Evening	15:55	3000	12000
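Only the ‘Arrival Time’ column is converted above. The sketch below applies the same categories to the departure times. It uses a standalone function named part_of_day and writes to a new Series named departure_part, both of which are illustrative names, so the original values stay available.

```python
import pandas as pd

def part_of_day(time_text):
    """Map an 'HH:MM' string to Morning, Afternoon, Evening or Night."""
    hour = int(time_text.split(':')[0])
    if 5 <= hour < 12:
        return 'Morning'
    if 12 <= hour < 17:
        return 'Afternoon'
    if 17 <= hour < 20:
        return 'Evening'
    return 'Night'  # hour >= 20 or hour < 5

departure = pd.Series(['19:40', '07:55', '00:44', '14:00', '15:55'])
departure_part = departure.apply(part_of_day)

print(departure_part.tolist())
# ['Evening', 'Morning', 'Night', 'Afternoon', 'Afternoon']
```

Overwriting a column in place, as done for ‘Arrival Time’, is convenient but means the function cannot be applied to that column a second time, because the values are no longer in HH:MM form.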

# Mapping Values using a Dictionary

In this example, we’ll use a dictionary to map existing values in a Series to new values.

# Sample data
data_map = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40]}

# Create DataFrame
df_map = pd.DataFrame(data_map)

# Define a mapping dictionary
mapping = {'Alice': 'A', 'Bob': 'B', 'Charlie': 'C', 'David': 'D'}

# Map values in 'Name' column using the dictionary
df_map['Mapped Name'] = df_map['Name'].map(mapping)

# Display DataFrame
df_map

	Name	Age	Mapped Name
0	Alice	25	A
1	Bob	30	B
2	Charlie	35	C
3	David	40	D
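One behavior worth knowing when mapping with a dictionary: values that have no matching key become NaN rather than raising an error. The sketch below illustrates this with a hypothetical name, 'Eve', that is deliberately missing from the mapping, and shows one possible way to fill the gap.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', 'Eve'])  # 'Eve' is hypothetical and unmapped
mapping = {'Alice': 'A', 'Bob': 'B'}

# Values without a key in the dictionary are mapped to NaN
mapped = names.map(mapping)
print(mapped.isna().tolist())
# [False, False, True]

# One option is to replace the missing results with a default label
mapped_filled = names.map(mapping).fillna('Unknown')
print(mapped_filled.tolist())
# ['A', 'B', 'Unknown']
```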

# Mapping Values using a Function

In this example, we’ll use a function to map existing values in a Series to new values.

# Define a function to map ages to age groups
def map_age(age):
    if age < 30:
        return 'Young'
    elif age < 40:
        return 'Middle-aged'
    else:
        return 'Senior'

# Map values in 'Age' column using the function
df_map['Age Group'] = df_map['Age'].map(map_age)

# Display DataFrame
df_map

	Name	Age	Mapped Name	Age Group
0	Alice	25	A	Young
1	Bob	30	B	Middle-aged
2	Charlie	35	C	Middle-aged
3	David	40	D	Senior
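For numeric binning like this, pandas also offers pd.cut(), which assigns labels from a list of bin edges. The following is a sketch of an alternative to the map_age() approach, not a replacement for it; the bin edges are chosen to reproduce the same three groups.

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, 30, 35, 40])

# right=False makes each bin include its left edge and exclude its right edge:
# [-inf, 30) -> Young, [30, 40) -> Middle-aged, [40, inf) -> Senior
age_groups = pd.cut(
    ages,
    bins=[-np.inf, 30, 40, np.inf],
    right=False,
    labels=['Young', 'Middle-aged', 'Senior'],
)

print(age_groups.astype(str).tolist())
# ['Young', 'Middle-aged', 'Middle-aged', 'Senior']
```

Unlike map() with a function, pd.cut() returns a categorical Series, which is why the sketch converts it to strings before printing.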
