hullibulli February 2016

Use groups in scipy.stats.kruskal similar to R cran kruskal.test

I'm trying to replace some rpy2 code in a Python script with Python (scipy). In that context I need to replace a Kruskal-Wallis test (R:kruskal.test()) with (Python:scipy.stats.kruskal).

scipy.stats.kruskal returns a similar H-statistic and P-value when comparing integers/floats only. However, I have some difficulty applying groups represented as strings.

Below is a subsample of the data:

y = [4.33917022422, 2.96541899883, 6.70475220836, 9.19889096119, 2.14087398016,
     5.39520023918, 1.58443224287, 3.59625224078, 4.01998599966, 2.58058624352]
x = ['High_O2', 'High_O2', 'High_O2', 'High_O2', 'Low_O2',
      'Low_O2',  'Low_O2',  'Low_O2',  'Mid_O2',  'Mid_O2']

In R one would just type:

kruskal.test(y,as.factor(x))

Doing the same thing in Python (2.7) using scipy (0.17):

from scipy import stats
stats.kruskal(y,x)

However, I get very low p values (p<e-07) and quite high H-statistics (26) when using scipy, which is incorrect. I have tried to replace the x list with {0,1,2} with no improvement.

How can I tell scipy to treat x as groups during the ranking?

Answers


ali_m February 2016

Each non-keyword argument passed to scipy.stats.kruskal is treated as a separate group of y-values. By passing x as one of the arguments, kruskal attempts to treat your label strings as though they were a second group of y-values. The strings will get cast to NaNs (which ought to raise a RuntimeWarning).

Instead, you need to group your y values by label, then pass them as separate input arrays to kruskal. For example:

# convert `y` to a numpy array for more convenient indexing
y = np.array(y)

# find unique group labels and their corresponding indices
label, idx = np.unique(x, return_inverse=True)

# make a list of arrays containing the y-values corresponding to each unique label
groups = [y[idx == i] for i, l in enumerate(label)]

# use `*` to unpack the list as a sequence of arguments to `stats.kruskal`
H, p = stats.kruskal(*groups)

print(H, p)
# 2.94545454545 0.22929927

Post Status

Asked in February 2016
Viewed 2,178 times
Voted 4
Answered 1 times

Search




Leave an answer