ojdo February 2016

Aggregate/Remove duplicate rows in DataFrame based on swapped index levels

Sample input

import pandas as pd
df = pd.DataFrame([
        ['A', 'B', 1, 5],
        ['B', 'C', 2, 2],
        ['B', 'A', 1, 1],
        ['C', 'B', 1, 3]], 
        columns=['from', 'to', 'type', 'value']) 
df = df.set_index(['from', 'to', 'type'])

Which looks like this:

                  value
from  to    type    
A     B     1     5
B     C     2     2
      A     1     1
C     B     1     3

Goal

I now want to remove "duplicate" rows from this in the following sense: for each row with an arbitrary index (A, B, type), if there exists a row (B, A, type), the value of the second row should be added to the first row and the second row discarded. In the example above, the row (B, A, 1) with value 1 should be added to the first row and discarded, leading to the following desired result.

Sample result

                  value
from  to  type
A     B   1       6
B     C   2       2
C     B   1       3

This is my best try so far. It feels unnecessarily verbose and clunky:

# aggregate val of rows with (a,b,t) == (b,a,t) 
df2 = df.reset_index()
df3 = df2.rename(columns={'from':'to', 'to':'from'})
df_both = df.join(df3.set_index(
                    ['from', 'to', 'type']), 
                    rsuffix='_b').sum(axis=1)

# then remove the second, i.e. (b,a,t) row
rows_to_keep = []
rows_to_remove = []
for a,b,t in df_both.index:
    if (b,a,t) in df_both.index and not (b,a,t) in rows_to_keep:
        rows_to_keep.append((a,b,t))
        rows_to_remove.append((b,a,t))

df_final = df_both.drop(rows_to_remove)
df_final

Especially the second "de-duplication" step feels very unpythonic. (How) can I improve these steps?

Answers


howMuchCheeseIsTooMuchCheese February 2016

Not sure how much better this is, but it's certainly different

  import pandas as pd
  from collections import Counter

  df = pd.DataFrame([
          ['A', 'B', 1, 5],
          ['B', 'C', 2, 2],
          ['B', 'A', 1, 1],
          ['C', 'B', 1, 3]], 
          columns=['from', 'to', 'type', 'value']) 
  df = df.set_index(['from', 'to', 'type'])
  ls = df.to_records()
  ls = list(ls)
  ls2=[]
  for l in ls:
      i=0
      while i <= l[3]:
          ls2.append(list(l)[:3])
          i+=1
  counted = Counter(tuple(sorted(entry)) for entry in ls2)

Post Status

Asked in February 2016
Viewed 1,029 times
Voted 6
Answered 1 times

Search




Leave an answer