Kevin February 2016

pandas top n values when group name in multiple columns

I'm trying to find the top n values from a large DataFrame. The keys are combinations of similarly named objects spread across my first two columns, and I'd like to find the max for each key regardless of which column the key appears in. This is better demonstrated by example:

import itertools
import numpy as np
import pandas as pd

np.random.seed(10)

pairs = [combo for combo in itertools.combinations(['apple','banana','pear','orange'], 2)]

df = pd.DataFrame(pairs, columns=['a','b'])
df['score'] = np.random.rand(6)

The original DataFrame:

In [2]: df
Out[2]:
        a       b     score
0   apple  banana  0.771321
1   apple    pear  0.020752
2   apple  orange  0.633648
3  banana    pear  0.748804
4  banana  orange  0.498507
5    pear  orange  0.224797

Here is how I would accomplish my task w/ SQL, assuming I have a database table named fruits that emulates df above:

uniq = pd.unique(df[['a', 'b']].values.ravel())

# engine is a SQLAlchemy engine connected to the database holding the fruits table
df_sql = pd.DataFrame()
for fruit in uniq:
    dfsql_tmp = pd.read_sql_query(
        """SELECT a, b, score FROM fruits
           WHERE a = %s
           OR b = %s
           ORDER BY score DESC
           LIMIT 1;""",
        engine, params=[fruit, fruit])

    df_sql = pd.concat([df_sql, dfsql_tmp], ignore_index=True)

This gets me exactly what I'm asking for: the top N scores (here N = 1) for each unique value in the union of df['a'] and df['b']. Desired output:

In [5]: df_sql
Out[5]:
        a       b     score
0   apple  banana  0.771321  # highest apple score
1   apple  banana  0.771321  # highest banana score
2   apple  orange  0.633648  # highest orange score
3  banana    pear  0.748804  # highest pear score

EDIT

This also does the trick, but it is slow at scale:

N = 1
df_new = pd.DataFrame()
for fruit in uniq:
    df_tmp = df[(df['a'] == fruit) | (df['b'] == fruit)].sort_values('score', ascending=False).head(N)
    df_new = pd.concat([df_new, df_tmp], ignore_index=True)

Answers


Randy C February 2016

This is not a pretty solution, and I suspect there are better ones out there, but here's a crack at it. It creates a ~550k row x 5 column DataFrame and runs in about 4 seconds on my laptop.

import string
import itertools
from functools import reduce

import numpy as np
import pandas as pd

np.random.seed(10)
# ~550k 4-character combinations drawn from letters + digits
pairs = [combo for combo in itertools.combinations(string.ascii_letters + string.digits, 4)]

df = pd.DataFrame(pairs, columns=['a', 'b', 'c', 'd'])
df['score'] = np.random.rand(len(df))

cols = ['a', 'b', 'c', 'd']
indexes = []

# For each value appearing anywhere in the key columns, record the index of
# the highest-scoring row that contains it in any of those columns.
for c in pd.concat([df[col] for col in cols]).unique():
    indexes.append(df[reduce(lambda x, y: x | y, [df[col] == c for col in cols])]['score'].idxmax())

print(df.loc[indexes])

Add in a .reset_index() at the end if you don't want the original indexes preserved in the output.

For top N, instead of doing .idxmax(), sort the reduced frame and take the first N indexes with .iloc[:N].
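For example, a rough sketch of that top-N variant, reusing df, cols, and reduce from the snippet above (N = 2 and the name top_indexes are arbitrary choices for illustration):

# Sketch: for each unique key value, keep the indexes of its N highest-scoring
# rows instead of the single idxmax(); N and top_indexes are illustrative names.
N = 2
top_indexes = []

for c in pd.concat([df[col] for col in cols]).unique():
    mask = reduce(lambda x, y: x | y, [df[col] == c for col in cols])
    top_indexes.extend(df[mask].sort_values('score', ascending=False).iloc[:N].index)

print(df.loc[top_indexes].reset_index(drop=True))

As with the top-1 version, a row can appear more than once in the output if it is the best match for several key values.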

