Big Dogg February 2016

R: Split weighted column into equal-sized buckets

I would like to use something like dplyr's cut_number to split a column into buckets with approximately the same number of observations, where my dataset is in a compact form where each row has a weight (number of observations).

Example data frame:

df <- data.frame(
    x=c(18,17,18.5,20,20.5,24,24.4,18.3,31,34,39,20,19,34,23),
    weight=c(1,10,3,6,19,20,34,66,2,3,1,6,9,15,21)
)

If there were one observation of x per row, I would simply use df$bucket <- cut_number(df$x,3) to segment x into 3 buckets with approximately the same number of observations. But how do I take into account the fact that each row is weighted with some number of observations? I'd like to avoid splitting each row into weight rows since the original dataframe already has millions of rows.

Answers


Heroka February 2016

Here's another approach, based on my assumption that you have in total x1*weight1 + x2*weight2 +..... observations. Furthermore, each 'unique' observation can only be in one bucket. The approach uses sorting and the cumulative sum of the weights to create the buckets.

#sort data
df <- df[order(df$x),]

#calculate cumulative weights (this is why we sort)
df$cumulative_weight <- cumsum(df$weight)

#create bucket by cumulative weight
n_buckets <- 3
df$bucket <- cut(df$cumulative_weight, n_buckets)

#check: calculate total number of observations per bucket   


> aggregate(weight~bucket,FUN=sum, data=df)
       bucket weight
1 (9.79,78.7]     77
2  (78.7,147]     64
3   (147,216]     75


Megatron February 2016

Based on the comments, I think this may be the interval set you are seeking. Apologies for the general un-R-ness of it:

dfTest <- data.frame(x=1:6, weight=c(1,1,1,1,4,1))

f <- function(df, n) {
  interval <- round(sum(df$weight) / n)
  buckets <- vector(mode="integer", length(nrow(df)))
  bucketNum <- 1
  count <- 0
  for (i in 1:nrow(df)) {
    count <- count + df$weight[i]
    buckets[i] <- bucketNum
    if (count >= interval) {
      bucketNum <- bucketNum + 1
      count <- 0
    }
  }
  return(buckets)
}

Running this function buckets items as follows:

dfTest$bucket <- f(dfTest, 3)

#    x weight bucket
#  1 1      1      1
#  2 2      1      1
#  3 3      1      1
#  4 4      1      2
#  5 5      4      2
#  6 6      1      3

For your example:

df$bucket <- f(df, 3)
#        x weight bucket
#  1  18.0      1      1
#  2  17.0     10      1
#  3  18.5      3      1
#  4  20.0      6      1
#  5  20.5     19      1
#  6  24.0     20      1
#  7  24.4     34      1
#  8  18.3     66      2
#  9  31.0      2      2
#  10 34.0      3      2
#  11 39.0      1      2
#  12 20.0      6      3
#  13 19.0      9      3
#  14 34.0     15      3
#  15 23.0     21      3

Post Status

Asked in February 2016
Viewed 2,650 times
Voted 8
Answered 2 times

Search




Leave an answer