###In pandas, the compare() method compares two DataFrame objects and returns a DataFrame that shows only the differences between them. This is particularly useful when you have two versions of a dataset and want to identify discrepancies or changes between them.

##DataFrame.compare(other, align_axis=1, keep_shape=False, keep_equal=False)

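Let's go through its parameters. other is the DataFrame to compare against; it must have the same row and column labels as the calling DataFrame. align_axis controls where the compared values are placed: 1 (the default) puts them side by side in columns, while 0 stacks them in alternating rows. keep_shape, when True, keeps every row and column in the result instead of only those that contain a difference. keep_equal, when True, shows values that are equal instead of replacing them with NaN.

The method returns another DataFrame containing the differences between the two DataFrames.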

Let's create a DataFrame and see how compare() works.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    },
    columns=["col1", "col2", "col3"],
)

print(df)
  col1  col2  col3
0    a   1.0   1.0
1    a   2.0   2.0
2    b   3.0   3.0
3    b   NaN   4.0
4    a   5.0   5.0

First, let's copy df into a new DataFrame, df2, and compare the two before changing anything. Then we will change a few values in df2 and use compare() again to see the result.

# Using compare() before making any changes: df and df2 are identical,
# so the result is an empty DataFrame
df2 = df.copy()
df.compare(df2)

# Making some changes in df2

df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

print(df2)
  col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0
#Let's see how compare works.

df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0

df.compare(df2, align_axis = 0)
        col1  col3
0 self     a   NaN
  other    c   NaN
2 self   NaN   3.0
  other  NaN   4.0

# Let's change one value in each of the three columns of a new copy, df3, and see the result of compare()
df3 = df.copy()
df3.loc[0,"col1"] = "c"
df3.loc[1,"col2"] = 100
df3.loc[2,"col3"] = 4.0

df3
  col1   col2  col3
0    c    1.0   1.0
1    a  100.0   2.0
2    b    3.0   4.0
3    b    NaN   4.0
4    a    5.0   5.0

Now let's apply compare() to the two DataFrames df and df3. Every column that contains a difference (here col1, col2 and col3) appears in the output with two sub-columns, self and other. When we run df.compare(df3), the values from df go under self and the values from df3 go under other; cells that are equal in both DataFrames are shown as NaN.

df.compare(df3)
  col1       col2        col3
  self other self  other self other
0    a     c  NaN    NaN  NaN   NaN
1  NaN   NaN  2.0  100.0  NaN   NaN
2  NaN   NaN  NaN    NaN  3.0   4.0
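
The keep_shape and keep_equal parameters from the signature are not exercised in the examples so far. The following is a minimal sketch of what they do, rebuilding df and df3 so that it runs on its own; the names full_shape and with_values are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the example frames so this snippet is self-contained.
df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
df3 = df.copy()
df3.loc[0, "col1"] = "c"
df3.loc[1, "col2"] = 100.0
df3.loc[2, "col3"] = 4.0

# keep_shape=True keeps every row and column of the inputs;
# cells that are equal in both frames are still shown as NaN.
full_shape = df.compare(df3, keep_shape=True)

# Adding keep_equal=True shows the actual values for equal cells instead of NaN.
with_values = df.compare(df3, keep_shape=True, keep_equal=True)

print(full_shape)
print(with_values)
```

With keep_shape=True the result has all five rows and a self/other pair for each of the three columns, which can be handy when the output needs to line up with the original data.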

##Comparing individual columns instead of the whole DataFrame

df['col2'].equals(df2['col2'])
True
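
equals() only reports whether two columns match as a whole. To see which rows differ, one option is Series.compare(), sketched below under the assumption that pandas 1.1 or later is installed. The frames are rebuilt so the snippet runs on its own, and the variable names same_col2, same_col3 and col3_diff are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild df and df2 as in the walkthrough above.
df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

# equals() treats NaN values in the same position as equal.
same_col2 = df["col2"].equals(df2["col2"])  # col2 was not changed
same_col3 = df["col3"].equals(df2["col3"])  # col3 differs at index 2

# Series.compare() returns only the rows that differ,
# with the two values in columns named self and other.
col3_diff = df["col3"].compare(df2["col3"])

print(same_col2, same_col3)
print(col3_diff)
```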

##Comparing the elements of a column across two DataFrames

# Element-wise comparison; NaN never equals NaN, so row 3 comes out False
output = df['col2'] == df2['col2']
output
0     True
1     True
2     True
3    False
4     True
Name: col2, dtype: bool
df.loc[output]
  col1  col2  col3
0    a   1.0   1.0
1    a   2.0   2.0
2    b   3.0   3.0
4    a   5.0   5.0
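
Row 3 is missing from this result even though col2 is identical in both DataFrames, because the == operator treats NaN as unequal to everything, including another NaN. If missing values in the same position should count as a match, one possible approach is to combine == with isna(), as in this small sketch. The series left and right stand in for df['col2'] and df2['col2'], and the names plain and nan_aware are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-ins for df['col2'] and df2['col2'] from the walkthrough.
left = pd.Series([1.0, 2.0, 3.0, np.nan, 5.0], name="col2")
right = left.copy()

# Plain element-wise equality: the NaN row compares as False.
plain = left == right

# NaN-aware equality: also accept rows where both sides are missing.
nan_aware = plain | (left.isna() & right.isna())

print(plain.tolist())
print(nan_aware.tolist())
```

This mirrors the behaviour of equals() and compare(), both of which treat NaN values in the same location as equal.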