josliber February 2016
### Use rle to group by runs when using dplyr

In R, I want to summarize my data after grouping it by the runs of a variable `x` (that is, each group is a subset of rows where consecutive `x` values are the same). For instance, consider the following data frame, where I want to compute the average `y` value within each run of `x`:

```
(dat <- data.frame(x=c(1, 1, 1, 2, 2, 1, 2), y=1:7))
#   x y
# 1 1 1
# 2 1 2
# 3 1 3
# 4 2 4
# 5 2 5
# 6 1 6
# 7 2 7
```

In this example, the `x` variable has runs of length 3, then 2, then 1, and finally 1, taking values 1, 2, 1, and 2 in those four runs. The corresponding means of `y` in those groups are 2, 4.5, 6, and 7.

It is easy to carry out this grouped operation in base R using `tapply`, passing `dat$y` as the data, using `rle` to compute the run number from `dat$x`, and passing the desired summary function:

```
tapply(dat$y, with(rle(dat$x), rep(seq_along(lengths), lengths)), mean)
#   1   2   3   4
# 2.0 4.5 6.0 7.0
```
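To make the one-liner above easier to follow, here is a minimal sketch that splits it into steps in base R. The variable names `r`, `run`, and `run_means` are chosen here for illustration only.

```r
dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# rle() records the length and the value of each run of consecutive equal values
r <- rle(dat$x)
r$lengths  # 3 2 1 1
r$values   # 1 2 1 2

# Repeat each run's index by that run's length to label every row with its run
run <- rep(seq_along(r$lengths), r$lengths)
run        # 1 1 1 2 2 3 4

# Summarize y within each run
run_means <- tapply(dat$y, run, mean)
```

The labels in `run` depend only on where `x` changes, so the two separate runs with `x == 1` get different ids (1 and 3) and are not merged.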

I figured I would be able to pretty directly carry over this logic to dplyr, but my attempts so far have all ended in errors:

```
library(dplyr)

# First attempt
dat %>%
  group_by(with(rle(x), rep(seq_along(lengths), lengths))) %>%
  summarize(mean(y))
# Error: cannot coerce type 'closure' to vector of type 'integer'

# Attempt 2 -- maybe "with" is the problem?
dat %>%
  group_by(rep(seq_along(rle(x)$lengths), rle(x)$lengths)) %>%
  summarize(mean(y))
# Error: invalid subscript type 'closure'
```

For completeness, I could reimplement the `rle` run id myself using `cumsum`, `head`, and `tail` to get around this, but it makes the grouping code tougher to read and involves a bit of reinventing the wheel:

```
dat %>%
  group_by(run = cumsum(c(1, head(x, -1) != tail(x, -1)))) %>%
  summarize(mean(y))
```
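As a sanity check on the `cumsum` idea, the following base R sketch compares a `cumsum`-based run id with the `rle`-based one on the example vector. The names `changed`, `run_cumsum`, and `run_rle` are illustrative, and the `tail(x, -1)` half of the comparison is my assumption about how the `head`/`tail` pairing is meant to work.

```r
x <- c(1, 1, 1, 2, 2, 1, 2)

# TRUE wherever an element differs from the one immediately before it
changed <- head(x, -1) != tail(x, -1)

# The first element always starts run 1; every change starts a new run
run_cumsum <- cumsum(c(1, changed))

# Same labels computed from the run lengths
run_rle <- with(rle(x), rep(seq_along(lengths), lengths))
```

On this input both vectors are 1, 1, 1, 2, 2, 3, 4. One difference to keep in mind: with `NA` values, the `!=` comparison yields `NA` and `cumsum` propagates it, so this sketch assumes `x` has no missing values.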

### Answers

Neal Fultz February 2016
If you explicitly create a grouping variable `g`, it more or less works:

```
> dat %>% transform(g = with(rle(dat$x), {rep(seq_along(lengths), lengths)})) %>%
    group_by(g) %>% summarize(mean(y))
Source: local data frame [4 x 2]

      g mean(y)
  (int)   (dbl)
1     1     2.0
2     2     4.5
3     3     6.0
4     4     7.0
```

I used `transform` here because `mutate` throws an error.
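The same "compute the grouping column first" idea can be sketched without any pipeline at all. This is only an illustration in base R, not what the answer above runs; the column name `g` follows the answer, while `res` is a name picked here.

```r
dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# Store the run id as an ordinary column, evaluated outside any grouping verb
dat$g <- with(rle(dat$x), rep(seq_along(lengths), lengths))

# Any grouped summary can then refer to g by name
res <- aggregate(y ~ g, data = dat, FUN = mean)
```

Because `g` is already a plain column by the time the summary runs, the grouping step never has to evaluate the `rle` expression itself.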

docendo discimus February 2016
One option seems to be the use of `{}`, as in:

```
dat %>%
  group_by(yy = {yy = rle(x); rep(seq_along(yy$lengths), yy$lengths)}) %>%
  summarize(mean(y))
#Source: local data frame [4 x 2]
#
#     yy mean(y)
#  (int)   (dbl)
#1     1     2.0
#2     2     4.5
#3     3     6.0
#4     4     7.0
```

It would be nice if future dplyr versions also had an equivalent of data.table's `rleid` function.
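Until something like that exists, a small helper can play the same role. The function `run_id` below is hypothetical (it is not part of dplyr or data.table) and is just a base R wrapper around `rle`; the comparison with `data.table::rleid` only runs if data.table happens to be installed.

```r
# Hypothetical helper: label each element with the index of the run it belongs to
run_id <- function(v) {
  r <- rle(v)
  rep(seq_along(r$lengths), r$lengths)
}

x <- c(1, 1, 1, 2, 2, 1, 2)
ids <- run_id(x)

# Optional cross-check against data.table's rleid(), if the package is available
if (requireNamespace("data.table", quietly = TRUE)) {
  stopifnot(all(data.table::rleid(x) == ids))
}
```

Note that `rle` treats every missing value as its own run, so this sketch is best used on vectors without `NA`s.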

I noticed that this problem occurs when using a `data.frame` or `tbl_df` input, but not when using a `tbl_dt` or `data.table` input:

```
dat %>%
  tbl_df %>%
  group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
  summarize(mean(y))
Error: cannot coerce type 'closure' to vector of type 'integer'

dat %>%
  tbl_dt %>%
  group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
  summarize(mean(y))
Source: local data table [4 x 2]

     yy mean(y)
  (int)   (dbl)
1     1     2.0
2     2     4.5
3     3     6.0
4     4     7.0
```

I reported this as an issue on dplyr's GitHub page.

