Big Dogg February 2016
### R: Split weighted column into equal-sized buckets

I would like to use something like dplyr's `cut_number`

to split a column into buckets with approximately the same number of observations, where my dataset is in a compact form where each row has a weight (number of observations).

Example data frame:

```
df <- data.frame(
x=c(18,17,18.5,20,20.5,24,24.4,18.3,31,34,39,20,19,34,23),
weight=c(1,10,3,6,19,20,34,66,2,3,1,6,9,15,21)
)
```

If there were one observation of x per row, I would simply use `df$bucket <- cut_number(df$x,3)`

to segment `x`

into 3 buckets with approximately the same number of observations. But how do I take into account the fact that each row is weighted with some number of observations? I'd like to avoid splitting each row into `weight`

rows since the original dataframe already has millions of rows.

Heroka February 2016

Here's another approach, based on my assumption that you have in total x1*weight1 + x2*weight2 +..... observations. Furthermore, each 'unique' observation can only be in one bucket. The approach uses sorting and the cumulative sum of the weights to create the buckets.

```
#sort data
df <- df[order(df$x),]
#calculate cumulative weights (this is why we sort)
df$cumulative_weight <- cumsum(df$weight)
#create bucket by cumulative weight
n_buckets <- 3
df$bucket <- cut(df$cumulative_weight, n_buckets)
#check: calculate total number of observations per bucket
> aggregate(weight~bucket,FUN=sum, data=df)
bucket weight
1 (9.79,78.7] 77
2 (78.7,147] 64
3 (147,216] 75
```

Megatron February 2016

Based on the comments, I think this may be the interval set you are seeking. Apologies for the general un-R-ness of it:

```
dfTest <- data.frame(x=1:6, weight=c(1,1,1,1,4,1))
f <- function(df, n) {
interval <- round(sum(df$weight) / n)
buckets <- vector(mode="integer", length(nrow(df)))
bucketNum <- 1
count <- 0
for (i in 1:nrow(df)) {
count <- count + df$weight[i]
buckets[i] <- bucketNum
if (count >= interval) {
bucketNum <- bucketNum + 1
count <- 0
}
}
return(buckets)
}
```

Running this function buckets items as follows:

```
dfTest$bucket <- f(dfTest, 3)
# x weight bucket
# 1 1 1 1
# 2 2 1 1
# 3 3 1 1
# 4 4 1 2
# 5 5 4 2
# 6 6 1 3
```

For your example:

```
df$bucket <- f(df, 3)
# x weight bucket
# 1 18.0 1 1
# 2 17.0 10 1
# 3 18.5 3 1
# 4 20.0 6 1
# 5 20.5 19 1
# 6 24.0 20 1
# 7 24.4 34 1
# 8 18.3 66 2
# 9 31.0 2 2
# 10 34.0 3 2
# 11 39.0 1 2
# 12 20.0 6 3
# 13 19.0 9 3
# 14 34.0 15 3
# 15 23.0 21 3
```

Asked in February 2016

Viewed 2,650 times

Voted 8

Answered 2 times

Viewed 2,650 times

Voted 8

Answered 2 times