Yona February 2016

How to avoid mutiple columns on Pandas.Merge

Imagine I have the following DataFrames on Pandas:

In [7]: A= pd.DataFrame([['foo'],['bar'],['quz'],['baz']],columns=['key'])

In [8]: A['value'] = 'None'

In [9]: A
Out[9]:
   key value
0  foo  None
1  bar  None
2  quz  None
3  baz  None

In [10]: B = pd.DataFrame([['foo',5],['bar',6],['quz',7]],columns= ['key','value'])

In [11]: B
Out[11]:
   key  value
0  foo      5
1  bar      6
2  quz      7

In [12]: pd.merge(A,B, on='key', how='outer')
Out[12]:
   key value_x  value_y
0  foo    None        5
1  bar    None        6
2  quz    None        7
3  baz    None      NaN

But what I want is (avoiding the repeat column basically):

   key  value
0  foo  5
1  bar  6
2  quz  7
3  baz  NaN

I suppose I can take the output and drop the _x value and rename the _y but that seems like an overkill. On SQL this would be trivial.

EDIT:

John as recomended to use:

In [1]: A.set_index('key', inplace=True)
        A.update(B.set_index('key'), join='left', overwrite=True)
        A.reset_index(inplace=True)

This works and does what I asked for.

Answers


johnchase February 2016

In the example you are merging two dataframes with the same column, one contains strings ('None') the other integers, pandas doesn't know which column value you want to keep and which should be replaced, so it creates a column for both.

You can use update instead

In [10]: A.update(B, join='left', overwrite=True)
In [11]: A
Out[11]:

    key value
0   foo 5
1   bar 6
2   quz 7
3   baz NaN

Another solution would be to just state the values that you want for the given column:

In [15]: A.loc[B.index, 'value'] = B.value
In [16]: A
Out[16]:

    key value
0   foo 5
1   bar 6
2   quz 7
3   baz NaN

Personally I prefer the second solution because I know exactly what is happening, but the first is probably closer to what you are looking for in your question.

EDIT:

If the indices don't match, I'm not quite sure how to make this happen. Hence I would suggest making them match:

In [1]: A.set_index('key', inplace=True)
        A.update(B.set_index('key'), join='left', overwrite=True)
        A.reset_index(inplace=True)

It may be that there is a better way to do this, but I don't believe pandas has a way to perform this operation outright.

The second solution can also be used with the updated index:

In [24]: A.set_index('key', inplace=True)
         A.loc[B.key, 'value'] = B.value.tolist()

Post Status

Asked in February 2016
Viewed 1,758 times
Voted 4
Answered 1 times

Search




Leave an answer