JasonD February 2016

Weighting classes in a machine learning task

I'm trying out a machine learning task (binary classification) using caret and was wondering if there is a way to incorporate information about an "uncertain" class, or to weight the classes differently.

As an illustration, I've cut and pasted some of the code from the caret homepage working with the Sonar dataset (placeholder code - could be anything):

library(mlbench)
testdat <- get(data(Sonar))
set.seed(946)
# simulate a "Source" column; levels A-C are drawn twice as often as D-F
testdat$Source <- as.factor(sample(c(LETTERS[1:6], LETTERS[1:3]), nrow(testdat), replace = TRUE))

yielding:

summary(testdat$Source)  
 A  B  C  D  E  F   
49 51 44 17 28 19   

after which I would continue with a typical train, tune, and test routine once I decide on a model.

What I've added here is another factor column, Source, indicating where the corresponding "Class" label came from. As an arbitrary example, say these were 6 different people who each assigned "Class" using slightly different methods, and I want to put greater importance on A's classification method than on B's, but less than on C's, and so forth.

The actual data look something like this, with class imbalances both in the class itself (true/false, M/R, or whatever it happens to be) and among these Sources. From the vignettes and examples I have found, the former I would address by using a metric like ROC during tuning (roughly what I mean is sketched after the list below), but I'm not sure how to even incorporate the latter. Two ideas I've considered so far:

  • separating the original data by Source and cycling through the factor levels one at a time, using the current level to build a model and the rest of the data to test it

  • instead of classification, turning it into a hybrid classification/regression problem, where I model the ranks of the sources: if A is considered best, then an "A positive" would get a score of +6, an "A negative" a score of -6, and so on. Then perform a regression fit on these values, ignoring the Class column.
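
For reference, the kind of ROC-based tuning I have in mind looks roughly like this (the model and resampling scheme are placeholders, not something I've settled on):

library(caret)
# compute class probabilities during resampling so ROC can be used as the tuning metric
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
set.seed(946)
fit <- train(Class ~ . - Source, data = testdat,
             method = "glmnet",   # placeholder model (any probability-producing classifier)
             metric = "ROC",      # tune on area under the ROC curve
             trControl = ctrl)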

Answers


lejlot February 2016

There is no difference between weighting "because of importance" and weighting "because of imbalance". They are exactly the same setting: both answer the question "how strongly should I penalize the model for misclassifying a sample from a particular class?". So you do not need any regression (and should not do one! this is a perfectly well-stated classification problem, and you are simply overthinking it); just provide sample weights, that's all. There are many models in caret that accept this kind of setting, including glmnet, glm, cforest, etc. If you want to use an SVM you should switch packages (as ksvm does not support such weighting), for example to https://cran.r-project.org/web/packages/gmum.r/gmum.r.pdf (for sample or class weighting) or https://cran.r-project.org/web/packages/e1071/e1071.pdf (if class weighting is enough).
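
For illustration, something like this in caret (the weight values below are made up, and glm is just one of the models that accepts case weights):

library(caret)
# testdat and its Source column as constructed in the question
# relative importance of each Source; these particular numbers are arbitrary
source_wts <- c(A = 3, B = 1, C = 5, D = 1, E = 1, F = 1)
w <- unname(source_wts[as.character(testdat$Source)])

set.seed(946)
fit <- train(Class ~ . - Source, data = testdat,
             method = "glm",   # glm supports case weights
             weights = w,      # per-sample weights, passed through by caret
             trControl = trainControl(method = "cv", number = 10))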

Post Status

Asked in February 2016
Viewed 2,501 times
Voted 13
Answered 1 time
