python - Compare 2 Pandas dataframes, row by row, cell by cell
I have 2 dataframes, df1 and df2, and want to do the following, storing the results in df3:
    for each row in df1:
        for each row in df2:
            create a new row in df3 (called "df1-1, df2-1" or whatever) to store the results
            for each cell (column) in df1:
                for the cell in df2 whose column name is the same as the cell in df1:
                    compare the cells (using some comparing function func(a,b)) and,
                    depending on the result of the comparison, write the result into
                    the appropriate column of the "df1-1, df2-1" row of df3
For example, something like:

    df1
    a    b     c       d
    foo  bar   foobar  7
    gee  whiz  herp    10

    df2
    a    b    c       d
    zoo  car  foobar  8

    df3
    df1-df2  a              b               c                    d
    foo-zoo  func(foo,zoo)  func(bar,car)   func(foobar,foobar)  func(7,8)
    gee-zoo  func(gee,zoo)  func(whiz,car)  func(herp,foobar)    func(10,8)
I've started with this:

    for r1 in df1.iterrows():
        for r2 in df2.iterrows():
            for c1 in r1:
                for c2 in r2:

but I am not sure what to do with it, and would appreciate some help.
So, to continue the discussion in the comments: you can use vectorization, which is one of the selling points of a library like pandas or numpy. Ideally, you shouldn't ever be calling iterrows(). To be a little more explicit with my suggestion:
    # with df1 and df2 provided as above, an example
    df3 = df1['a'] * 3 + df2['a']

    # recall that df2 only has the one row, so pandas will broadcast a NaN there
    df3
    0    foofoofoozoo
    1             NaN
    Name: a, dtype: object

    # more generally
    # we know that df1 and df2 share column names, so we can initialize df3 with those names
    df3 = pd.DataFrame(columns=df1.columns)
    for colname in df1:
        df3[colname] = func(df1[colname], df2[colname])
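The loop above compares rows that share an index label, which is why the second row comes out as NaN. If the goal is to compare every row of df1 with every row of df2, as in the question's df3, one possible sketch is to build all row pairs with a cross merge first. The equality-based func, the "_1"/"_2" suffixes and the column name "a" for the first column are assumptions made for illustration, and how="cross" needs pandas 1.2 or later.

```python
import pandas as pd

# Sample data from the question (first column assumed to be named "a").
df1 = pd.DataFrame({"a": ["foo", "gee"], "b": ["bar", "whiz"],
                    "c": ["foobar", "herp"], "d": [7, 10]})
df2 = pd.DataFrame({"a": ["zoo"], "b": ["car"], "c": ["foobar"], "d": [8]})


def func(x, y):
    # Hypothetical comparison: element-wise equality of two Series.
    return x == y


# Pair every row of df1 with every row of df2 (pandas >= 1.2).
pairs = df1.merge(df2, how="cross", suffixes=("_1", "_2"))

# Compare the paired columns one column at a time, with no row loops.
df3 = pd.DataFrame({col: func(pairs[col + "_1"], pairs[col + "_2"])
                    for col in df1.columns})

# Label each row with the two "a" values it compares, e.g. "foo-zoo".
df3.index = pairs["a_1"] + "-" + pairs["a_2"]
print(df3)
```

The cross merge grows as len(df1) * len(df2) rows, so this is only reasonable when that product fits comfortably in memory.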
Now, you can even have different functions applied to different columns by, say, creating lambda functions and zipping them with the column names:
    # some example functions
    colafunc = lambda x, y: x + y
    colbfunc = lambda x, y: x - y
    ....
    columnfunctions = [colafunc, colbfunc, ...]

    # initialize df3 as above
    df3 = pd.DataFrame(columns=df1.columns)
    for func, colname in zip(columnfunctions, df1.columns):
        df3[colname] = func(df1[colname], df2[colname])
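A variation on the same idea, as a rough sketch: keeping the functions in a dict keyed by column name means the pairing does not depend on the list order matching df1.columns. The numeric frames and the column_functions name here are made up for illustration.

```python
import pandas as pd

# Hypothetical numeric frames with the same columns and index.
df1 = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
df2 = pd.DataFrame({"a": [3, 4], "b": [1, 2]})

# One function per column, keyed by column name (illustrative choices).
column_functions = {
    "a": lambda x, y: x + y,
    "b": lambda x, y: x - y,
}

# Apply each column's function to the matching pair of Series.
df3 = pd.DataFrame({col: f(df1[col], df2[col])
                    for col, f in column_functions.items()})
print(df3)
```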
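The only "gotcha" that comes to mind is that you need to be sure your function is applicable to the data in your columns. For instance, if you were to try something like df1['a'] - df2['a'] (with df1, df2 as you have provided), that would raise a TypeError, as the subtraction of two strings is undefined. Just something to be aware of.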
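One way to guard against this, sketched under the assumption that non-numeric columns should simply be left as missing values, is to check the dtypes before applying the operation. subtract_if_numeric is a hypothetical helper, not something pandas provides; is_numeric_dtype comes from pandas.api.types.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical frames mixing a string column and a numeric column.
df1 = pd.DataFrame({"a": ["foo", "gee"], "d": [7, 10]})
df2 = pd.DataFrame({"a": ["zoo", "bar"], "d": [8, 3]})


def subtract_if_numeric(x, y):
    # Hypothetical helper: subtract numeric columns, return NaN otherwise.
    if is_numeric_dtype(x) and is_numeric_dtype(y):
        return x - y
    return pd.Series(float("nan"), index=x.index)


df3 = pd.DataFrame({col: subtract_if_numeric(df1[col], df2[col])
                    for col in df1.columns})
print(df3)
```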
Edit, re: your comment: that is doable as well. Iterate over whichever dfX.columns is larger, so you don't run into a KeyError, and throw an if statement in there:
    # all the other jazz
    # let's say df1 is [['a', 'b', 'c']] and df2 is [['a', 'b', 'c', 'd']]
    # iterate over df2 columns
    for colname in df2:
        if colname not in df1:
            df3[colname] = np.nan  # be sure to import numpy as np
        else:
            df3[colname] = func(df1[colname], df2[colname])
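For reference, here is a self-contained version of that loop that can be run as-is. The sample frames and the equality-based func are placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: df2 has an extra column "d" that df1 lacks.
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
df2 = pd.DataFrame({"a": [1, 0], "b": [3, 0], "c": [0, 6], "d": [9, 9]})


def func(x, y):
    # Placeholder comparison: element-wise equality.
    return x == y


# Start from df2's index so every new column has the same length.
df3 = pd.DataFrame(index=df2.index)
for colname in df2.columns:
    if colname in df1.columns:
        df3[colname] = func(df1[colname], df2[colname])
    else:
        # No counterpart in df1, so there is nothing to compare against.
        df3[colname] = np.nan
print(df3)
```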