josliber February 2016

### Use rle to group by runs when using dplyr

In R, I want to summarize my data after grouping it by the runs of a variable `x` (that is, each group is a stretch of consecutive rows whose `x` values are the same). For instance, consider the following data frame, where I want to compute the average `y` value within each run of `x`:

``````
(dat <- data.frame(x=c(1, 1, 1, 2, 2, 1, 2), y=1:7))
#   x y
# 1 1 1
# 2 1 2
# 3 1 3
# 4 2 4
# 5 2 5
# 6 1 6
# 7 2 7
``````

In this example, the `x` variable has runs of length 3, then 2, then 1, and finally 1, taking values 1, 2, 1, and 2 in those four runs. The corresponding means of `y` in those groups are 2, 4.5, 6, and 7.

It is easy to carry out this grouped operation in base R using `tapply`, passing `dat$y` as the data, using `rle` to compute the run number from `dat$x`, and passing the desired summary function:

``````
tapply(dat$y, with(rle(dat$x), rep(seq_along(lengths), lengths)), mean)
#   1   2   3   4
# 2.0 4.5 6.0 7.0
``````
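To make the `rle` step easier to follow, here is a minimal sketch that splits the one-liner above into its parts. The intermediate names `r`, `run_id`, and `means` are only for illustration and do not appear in the original code.

```r
dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# rle() returns the length and the value of each run of consecutive equal elements
r <- rle(dat$x)
r$lengths  # 3 2 1 1
r$values   # 1 2 1 2

# Repeat each run's index by that run's length to get one run number per row
run_id <- rep(seq_along(r$lengths), r$lengths)
run_id     # 1 1 1 2 2 3 4

# Average y within each run
means <- tapply(dat$y, run_id, mean)
means
```

Grouping on `run_id` rather than on `x` itself is what keeps the two separate runs of `x == 1` apart.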

I figured I would be able to pretty directly carry over this logic to dplyr, but my attempts so far have all ended in errors:

``````
library(dplyr)
# First attempt
dat %>%
  group_by(with(rle(x), rep(seq_along(lengths), lengths))) %>%
  summarize(mean(y))
# Error: cannot coerce type 'closure' to vector of type 'integer'

# Attempt 2 -- maybe "with" is the problem?
dat %>%
  group_by(rep(seq_along(rle(x)$lengths), rle(x)$lengths)) %>%
  summarize(mean(y))
# Error: invalid subscript type 'closure'
``````
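One way to keep the `rle` logic readable inside a pipeline is to move it into a small function, so that `lengths` is only ever looked up inside that function's body. The sketch below assumes a helper named `run_id`, which is hypothetical (it is not part of dplyr or base R), and it has not been checked against the dplyr version that produced the errors above.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr or base R):
# returns one integer run number per element of v
run_id <- function(v) {
  r <- rle(v)
  rep(seq_along(r$lengths), r$lengths)
}

dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

res <- dat %>%
  group_by(run = run_id(x)) %>%
  summarize(mean_y = mean(y))
res
```

Naming the summary column (`mean_y` here, an arbitrary choice) also makes the result easier to refer to later than the default `mean(y)` column name.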

For completeness, I could reimplement the `rle` run id myself using `cumsum`, `head`, and `tail` to get around this, but it makes the grouping code tougher to read and involves a bit of reinventing the wheel:

``````
dat %>%
  group_by(run=cumsum(c(1, head(x, -1) != tai
``````
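The idea behind the `cumsum` approach can be checked in base R alone. The following is a sketch, not a reconstruction of the truncated code above: the names `starts`, `run_cumsum`, and `run_rle` are made up for illustration, and the sketch assumes `x` is non-empty and contains no `NA` values.

```r
dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)
x <- dat$x

# TRUE wherever a value differs from the one before it;
# the first element always starts a new run
starts <- c(TRUE, head(x, -1) != tail(x, -1))

# The running count of run starts is the run number of each row
run_cumsum <- cumsum(starts)

# The same run numbers computed with rle, for comparison
r <- rle(x)
run_rle <- rep(seq_along(r$lengths), r$lengths)

all(run_cumsum == run_rle)
```

Both vectors can be passed to `tapply` or `group_by` in the same way. The `cumsum` version avoids `rle` at the cost of the less obvious `head`/`tail` comparison.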
### Answers

Neal Fultz February 2016

If you explicitly create a grouping variable `g`, it more or less works:

``````
> dat %>% transform(g=with(rle(dat$x),{ rep(seq_along(lengths), lengths)}))%>% group_by(g) %>% summarize(mean(y))
Source: local data frame [4 x 2]

      g mean(y)
  (int)   (dbl)
1     1     2.0
2     2     4.5
3     3     6.0
4     4     7.0
``````

I used `transform` here because `mutate` throws an error.

docendo discimus February 2016

One option seems to be the use of `{}` as in:

``````
dat %>%
  group_by(yy = {yy = rle(x); rep(seq_along(yy$lengths), yy$lengths)}) %>%
  summarize(mean(y))
#Source: local data frame [4 x 2]
#
#     yy mean(y)
#  (int)   (dbl)
#1     1     2.0
#2     2     4.5
#3     3     6.0
#4     4     7.0
``````

It would be nice if future dplyr versions also had an equivalent of data.table's `rleid` function.

I noticed that this problem occurs when using a `data.frame` or `tbl_df` input but not when using a `tbl_dt` or `data.table` input:

``````
dat %>%
  tbl_df %>%
  group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
  summarize(mean(y))
Error: cannot coerce type 'closure' to vector of type 'integer'

dat %>%
  tbl_dt %>%
  group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
  summarize(mean(y))
Source: local data table [4 x 2]

     yy mean(y)
  (int)   (dbl)
1     1     2.0
2     2     4.5
3     3     6.0
4     4     7.0
``````

I reported this as an issue on dplyr's GitHub page.
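Since data.table's `rleid` is mentioned above, here is a minimal sketch of calling it directly from a dplyr pipeline. It assumes both dplyr and data.table are installed. The column names `run` and `mean_y` are arbitrary choices, and the sketch has not been tested against the 2016 dplyr release discussed in this thread.

```r
library(dplyr)

dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# data.table::rleid() returns one integer run number per element of x,
# so no manual rle()/rep() bookkeeping is needed
res <- dat %>%
  group_by(run = data.table::rleid(x)) %>%
  summarize(mean_y = mean(y))
res
```

Calling the function as `data.table::rleid` avoids attaching data.table alongside dplyr. Newer dplyr releases (1.1.0 and later, as far as I know) also provide `consecutive_id()`, which serves a similar purpose.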
