# Developers Planet

Big Dogg February 2016

### R: Split weighted column into equal-sized buckets

I would like to use something like dplyr's `cut_number` to split a column into buckets with approximately the same number of observations, where my dataset is in a compact form where each row has a weight (number of observations).

Example data frame:

``````df <- data.frame(
x=c(18,17,18.5,20,20.5,24,24.4,18.3,31,34,39,20,19,34,23),
weight=c(1,10,3,6,19,20,34,66,2,3,1,6,9,15,21)
)
``````

If there were one observation of x per row, I would simply use `df\$bucket <- cut_number(df\$x,3)` to segment `x` into 3 buckets with approximately the same number of observations. But how do I take into account the fact that each row is weighted with some number of observations? I'd like to avoid splitting each row into `weight` rows since the original dataframe already has millions of rows.

Heroka February 2016

Here's another approach, based on my assumption that you have in total x1*weight1 + x2*weight2 +..... observations. Furthermore, each 'unique' observation can only be in one bucket. The approach uses sorting and the cumulative sum of the weights to create the buckets.

``````#sort data
df <- df[order(df\$x),]

#calculate cumulative weights (this is why we sort)
df\$cumulative_weight <- cumsum(df\$weight)

#create bucket by cumulative weight
n_buckets <- 3
df\$bucket <- cut(df\$cumulative_weight, n_buckets)

#check: calculate total number of observations per bucket

> aggregate(weight~bucket,FUN=sum, data=df)
bucket weight
1 (9.79,78.7]     77
2  (78.7,147]     64
3   (147,216]     75
``````

Megatron February 2016

Based on the comments, I think this may be the interval set you are seeking. Apologies for the general un-R-ness of it:

``````dfTest <- data.frame(x=1:6, weight=c(1,1,1,1,4,1))

f <- function(df, n) {
interval <- round(sum(df\$weight) / n)
buckets <- vector(mode="integer", length(nrow(df)))
bucketNum <- 1
count <- 0
for (i in 1:nrow(df)) {
count <- count + df\$weight[i]
buckets[i] <- bucketNum
if (count >= interval) {
bucketNum <- bucketNum + 1
count <- 0
}
}
return(buckets)
}
``````

Running this function buckets items as follows:

``````dfTest\$bucket <- f(dfTest, 3)

#    x weight bucket
#  1 1      1      1
#  2 2      1      1
#  3 3      1      1
#  4 4      1      2
#  5 5      4      2
#  6 6      1      3
``````

``````df\$bucket <- f(df, 3)
#        x weight bucket
#  1  18.0      1      1
#  2  17.0     10      1
#  3  18.5      3      1
#  4  20.0      6      1
#  5  20.5     19      1
#  6  24.0     20      1
#  7  24.4     34      1
#  8  18.3     66      2
#  9  31.0      2      2
#  10 34.0      3      2
#  11 39.0      1      2
#  12 20.0      6      3
#  13 19.0      9      3
#  14 34.0     15      3
#  15 23.0     21      3
``````

#### Post Status

Asked in February 2016
Viewed 2,650 times
Voted 8