hullibulli February 2016
### Use groups in scipy.stats.kruskal similar to R cran kruskal.test

I'm trying to replace some rpy2 code in a Python script with Python (scipy). In that context I need to replace a Kruskal-Wallis test (R:`kruskal.test()`

) with (Python:`scipy.stats.kruskal`

).

`scipy.stats.kruskal`

returns a similar H-statistic and P-value when comparing integers/floats only. However, I have some difficulty applying groups represented as strings.

Below is a subsample of the data:

```
y = [4.33917022422, 2.96541899883, 6.70475220836, 9.19889096119, 2.14087398016,
5.39520023918, 1.58443224287, 3.59625224078, 4.01998599966, 2.58058624352]
x = ['High_O2', 'High_O2', 'High_O2', 'High_O2', 'Low_O2',
'Low_O2', 'Low_O2', 'Low_O2', 'Mid_O2', 'Mid_O2']
```

In R one would just type:

```
kruskal.test(y,as.factor(x))
```

Doing the same thing in Python (2.7) using scipy (0.17):

```
from scipy import stats
stats.kruskal(y,x)
```

However, I get very low p values `(p<e-07)`

and quite high H-statistics `(26)`

when using scipy, which is incorrect. I have tried to replace the `x`

list with `{0,1,2}`

with no improvement.

How can I tell scipy to treat `x`

as groups during the ranking?

Each non-keyword argument passed to `scipy.stats.kruskal`

is treated as a separate group of y-values. By passing `x`

as one of the arguments, `kruskal`

attempts to treat your label strings as though they were a second group of y-values. The strings will get cast to NaNs (which ought to raise a `RuntimeWarning`

).

Instead, you need to group your `y`

values by label, then pass them as *separate* input arrays to `kruskal`

. For example:

```
# convert `y` to a numpy array for more convenient indexing
y = np.array(y)
# find unique group labels and their corresponding indices
label, idx = np.unique(x, return_inverse=True)
# make a list of arrays containing the y-values corresponding to each unique label
groups = [y[idx == i] for i, l in enumerate(label)]
# use `*` to unpack the list as a sequence of arguments to `stats.kruskal`
H, p = stats.kruskal(*groups)
print(H, p)
# 2.94545454545 0.22929927
```

Asked in February 2016

Viewed 2,178 times

Voted 4

Answered 1 times

Viewed 2,178 times

Voted 4

Answered 1 times