compare – Data Science Lab

In Pandas, the compare() function provides a way to compare two DataFrame objects and generate a DataFrame highlighting the differences between them. This is particularly useful when you have two datasets and want to identify discrepancies or changes between them.

##DataFrame.compare(other, align_axis=1, keep_shape=False, keep_equal=False)

So, let’s understand each of its parameters –

other : The DataFrame object to compare with the current DataFrame.
align_axis : The axis along which the comparison results are laid out (default 1). 0 or index : the differences are stacked vertically, with rows drawn alternately from self and other. 1 or columns : the differences are placed side by side horizontally, with columns drawn alternately from self and other.
keep_shape : Whether to keep all rows and columns in the output or only those that contain at least one difference. It is of bool type and defaults to False, i.e. by default only the rows and columns with differing values are displayed.
keep_equal : Whether to display equal values in the output. If True, equal values are shown as they are; if False (the default), equal values are shown as NaN.

Returns another DataFrame containing the differences between the two DataFrames.
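To make the keep_shape and keep_equal descriptions concrete, here is a minimal sketch on a tiny pair of frames. The names left and right are illustrative only and are not part of the example that follows.

```python
import pandas as pd

# Two small frames that differ in exactly one cell (row 1, column "x")
left = pd.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0]})
right = left.copy()
right.loc[1, "x"] = 99

# Default: only rows and columns containing a difference are kept
default_diff = left.compare(right)

# keep_shape=True: every row and column is kept, equal cells become NaN
full_shape = left.compare(right, keep_shape=True)

# keep_shape=True with keep_equal=True: equal cells show their actual values
with_equal = left.compare(right, keep_shape=True, keep_equal=True)

print(default_diff.shape)  # (1, 2)
print(full_shape.shape)    # (3, 4)
print(with_equal)
```

With the defaults the result shrinks to the single differing row, while keep_shape=True preserves the original layout so the output can be lined up against the inputs.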

Let’s create a DataFrame and see how compare() works.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    },
    columns=["col1", "col2", "col3"],
)

print(df)

  col1  col2  col3
0    a   1.0   1.0
1    a   2.0   2.0
2    b   3.0   3.0
3    b   NaN   4.0
4    a   5.0   5.0

Let’s create a copy of the df DataFrame, called df2, and make some changes to it. Then we will use compare() to see the result.

# Using compare() before making any changes: df and df2 are still identical
df2 = df.copy()

df.compare(df2)
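When the two DataFrames are identical, compare() has nothing to report and returns an empty DataFrame. The following self-contained sketch rebuilds the same df and checks this; the variable name diff is illustrative. Note that the NaN in col2 is not reported as a difference, because compare() treats NaNs in the same position as equal.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
df2 = df.copy()

# No cell differs, so the result has no rows to show
diff = df.compare(df2)
print(diff.empty)  # True
```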

#Making some changes in df2

df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

print(df2)

  col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0

#Let's see how compare works.

df.compare(df2)

	col1		col3
	self	other	self	other
0	a	c	NaN	NaN
2	NaN	NaN	3.0	4.0

df.compare(df2, align_axis = 0)

		col1	col3
0	self	a	NaN
0	other	c	NaN
2	self	NaN	3.0
2	other	NaN	4.0

# Let's make changes in all three columns of a new copy, df3, and see the result of compare()
df3 = df.copy()
df3.loc[0,"col1"] = "c"
df3.loc[1,"col2"] = 100
df3.loc[2,"col3"] = 4.0

df3

	col1	col2	col3
0	c	1.0	1.0
1	a	100.0	2.0
2	b	3.0	4.0
3	b	NaN	4.0
4	a	5.0	5.0

Now apply compare() to the two DataFrames df and df3. The output contains a top-level column for each column that changed (col1, col2 and col3), and each of these has two sub-columns called self and other. When we run df.compare(df3), the values from df are placed in self and the values from df3 are placed in other.

df.compare(df3)

	col1		col2		col3
	self	other	self	other	self	other
0	a	c	NaN	NaN	NaN	NaN
1	NaN	NaN	2.0	100.0	NaN	NaN
2	NaN	NaN	NaN	NaN	3.0	4.0
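Because the result uses two-level column labels (the original column name on top, self / other underneath), the usual MultiIndex selection tools apply. The sketch below shows one possible way to pull pieces out of the result; the variable names diff, changed_columns, col2_changes and new_values are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
df3 = df.copy()
df3.loc[0, "col1"] = "c"
df3.loc[1, "col2"] = 100
df3.loc[2, "col3"] = 4.0

diff = df.compare(df3)

# Names of the columns that contain at least one difference
changed_columns = list(diff.columns.get_level_values(0).unique())

# The self/other pair for a single column, keeping only rows that changed
col2_changes = diff["col2"].dropna()

# Only the values coming from df3 (the "other" side), one column per original column
new_values = diff.xs("other", axis=1, level=1)

print(changed_columns)
print(col2_changes)
print(new_values)
```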

##Comparing various columns instead of whole dataframe

df['col2'].equals(df2['col2'])

True

##Comparing elements of the same column in two DataFrames

# Element-wise comparison; NaN == NaN evaluates to False, so row 3 comes out as False
output = df['col2'] == df2['col2']
output

0     True
1     True
2     True
3    False
4     True
Name: col2, dtype: bool

df.loc[output]

	col1	col2	col3
0	a	1.0	1.0
1	a	2.0	2.0
2	b	3.0	3.0
4	a	5.0	5.0
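Row 3 is dropped above even though both columns hold NaN there, because the == operator never treats NaN as equal to NaN, whereas equals() and compare() do. If the element-wise result should follow the same convention, one possible approach is to combine == with isna(). This is a minimal sketch with illustrative names (a, b, elementwise, nan_aware), not the only way to do it.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0, np.nan, 5.0], name="col2")
b = a.copy()

# Plain element-wise equality: the NaN position is False
elementwise = a == b

# Treat positions where both sides are NaN as equal
nan_aware = elementwise | (a.isna() & b.isna())

print(elementwise.tolist())  # [True, True, True, False, True]
print(nan_aware.tolist())    # [True, True, True, True, True]
print(a.equals(b))           # True
```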
