Noobie February 2016

How to sum across many columns with pandas groupby?

I have a dataframe that looks like

day  type  col  d_1  d_2  d_3  d_4  d_5...
1    A     1    1    0    1    0
1    A     2    1    0    1    0
2    B     1    1    1    0    0

That is, I have one ordinary column (col) and many columns whose names start with d_.

I need to group by day and type and compute the sum of every d_ column for each day-type combination. I also need to apply other aggregation functions to the remaining columns in my data (such as col in the example).

I can use:

agg_df=df.groupby(['day','type']).agg({'d_1': 'sum', 'col': 'mean'})

but this computes the sum only for one d_ column. How can I specify all the possible d_ columns in my data?

In other words, I would like to write something like

agg_df=df.groupby(['day','type']).agg({'d_*': 'sum', 'col': 'mean'})

so that the expected output is:

day  type  col  d_1  d_2  d_3  d_4  d_5...
1    A     1.5  2    0    2    0    ...
2    B     1    1    1    0    0

As you can see, col is aggregated by mean, while the d_ columns are summed.

Thanks for your help!

Answers


Colonel Beauvel February 2016

You can use filter:

In [23]: df.groupby(['day','type'], as_index=False)[df.filter(regex='d_.*').columns].sum()

Out[23]:
   day type  d_1  d_2  d_3  d_4
0    1    A    2    0    2    0
1    2    B    1    1    0    0
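
One thing to watch for: filter(regex=...) matches with re.search, so 'd_.*' also selects any column whose name merely contains d_ somewhere. Below is a minimal, self-contained sketch with an anchored pattern. The frame is rebuilt from the sample rows in the question, and the variable names d_cols and sums are only illustrative.

```python
import pandas as pd

# Sample data copied from the question (only d_1..d_4 are shown there).
df = pd.DataFrame({
    'day':  [1, 1, 2],
    'type': ['A', 'A', 'B'],
    'col':  [1, 2, 1],
    'd_1':  [1, 1, 1],
    'd_2':  [0, 0, 1],
    'd_3':  [1, 1, 0],
    'd_4':  [0, 0, 0],
})

# "^d_" anchors the match at the start of the name, so a column such as
# "grand_total" (which contains "d_") would not be picked up.
d_cols = df.filter(regex=r'^d_').columns.tolist()

# Sum only the selected columns within each (day, type) group.
sums = df.groupby(['day', 'type'], as_index=False)[d_cols].sum()
print(sums)
```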

To apply all the aggregations in one call, build the dictionary programmatically:

dic = {i: 'sum' for i in df.filter(regex='d_.*').columns}
dic['col'] = 'mean'

In [48]: df.groupby(['day','type'], as_index=False).agg(dic)
#Out[48]:
#   day type  d_2  d_3  d_1  col  d_4
#0    1    A    0    2    2  1.5    0
#1    2    B    1    0    1  1.0    0
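
For a version that can be pasted and run as is, here is a hedged sketch of the same dictionary approach. It assumes the sample frame from the question, uses the string names 'sum' and 'mean', and reorders the result explicitly so that col comes first as in the expected output. The names spec and d_cols are illustrative, not part of pandas.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'day':  [1, 1, 2],
    'type': ['A', 'A', 'B'],
    'col':  [1, 2, 1],
    'd_1':  [1, 1, 1],
    'd_2':  [0, 0, 1],
    'd_3':  [1, 1, 0],
    'd_4':  [0, 0, 0],
})

# Every column whose name starts with "d_".
d_cols = [c for c in df.columns if c.startswith('d_')]

# One aggregation per column: mean for col, sum for each d_ column.
spec = {'col': 'mean'}
spec.update({c: 'sum' for c in d_cols})

agg_df = df.groupby(['day', 'type'], as_index=False).agg(spec)

# Fix the column order explicitly instead of relying on the dict order.
agg_df = agg_df[['day', 'type', 'col'] + d_cols]
print(agg_df)
```

Any other per-column aggregation can be added to spec in the same way before calling agg.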


Anton Protopopov February 2016

IIUC, you need to restrict the groupby to your d_* columns. You can find those columns with str.contains and pass them to the groupby object:

cols = df.columns[df.columns.str.contains('^d_|^col$')]
agg_df=df.groupby(['day','type'])[cols].sum()


In [150]: df
Out[150]:
   day type  col  d_1  d_2  d_3  d_4
0    1    A    1    1    0    1    0
1    1    A    2    1    0    1    0
2    2    B    1    1    1    0    0

In [155]: agg_df
Out[155]:
          col  d_1  d_2  d_3  d_4
day type
1   A       3    2    0    2    0
2   B       1    1    1    0    0

Note: I added the col column to the contains pattern as you requested. You can use any regular expression you like, joining alternatives with the | symbol. Be aware that this sums col as well (3 for the first group) rather than taking its mean.
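
If col should be averaged while the d_ columns are summed, one possible variation on this approach is to aggregate the two parts separately and join them on the group index. This is only a sketch under the assumption that the data looks like the sample in the question. It uses str.startswith instead of a regex, and the names grouped and d_cols are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'day':  [1, 1, 2],
    'type': ['A', 'A', 'B'],
    'col':  [1, 2, 1],
    'd_1':  [1, 1, 1],
    'd_2':  [0, 0, 1],
    'd_3':  [1, 1, 0],
    'd_4':  [0, 0, 0],
})

# Boolean mask over the column labels; a plain prefix needs no regex.
d_cols = list(df.columns[df.columns.str.startswith('d_')])

grouped = df.groupby(['day', 'type'])

# Mean of col and sums of the d_ columns, aligned on the (day, type) index.
agg_df = pd.concat([grouped['col'].mean(), grouped[d_cols].sum()], axis=1)
print(agg_df)
```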

