bobo32 February 2016

model.predictProbabilities() for LogisticRegression in Spark?

I'm running a multi-class Logistic Regression (with LBFGS) with Spark 1.6.

Given x and the possible labels {1.0, 2.0, 3.0}, the final model only outputs the best prediction, say 2.0.

If I'm interested in the second best prediction, say 3.0, how can I retrieve that information?

With NaiveBayes I would use the model.predictProbabilities() function, which for each sample outputs a vector with the probabilities for every possible outcome.
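Once you have such a per-class probability vector, picking the second best label is just a sort. A plain-Scala sketch (the `probs` values here are made up; a real vector would come from model.predictProbabilities(x)):

```scala
// Hypothetical per-class probabilities for labels {1.0, 2.0, 3.0};
// in practice this array would come from model.predictProbabilities(x).
val labels = Array(1.0, 2.0, 3.0)
val probs  = Array(0.15, 0.60, 0.25)

// Pair each label with its probability and sort best-first.
val ranked = labels.zip(probs).sortBy(-_._2)
// ranked(0) is the best prediction (2.0), ranked(1) the second best (3.0)
```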

Answers


Daniel Darabos February 2016

There are two ways to do logistic regression in Spark: spark.ml and spark.mllib.

With DataFrames you can use spark.ml:

import org.apache.spark
import sqlContext.implicits._

def p(label: Double, a: Double, b: Double) =
  new spark.mllib.regression.LabeledPoint(
    label, new spark.mllib.linalg.DenseVector(Array(a, b)))

val data = sc.parallelize(Seq(p(1.0, 0.0, 0.5), p(0.0, 0.5, 1.0)))
val df = data.toDF

val model = new spark.ml.classification.LogisticRegression().fit(df)
model.transform(df).show

You get the raw predictions and probabilities:

+-----+---------+--------------------+--------------------+----------+
|label| features|       rawPrediction|         probability|prediction|
+-----+---------+--------------------+--------------------+----------+
|  1.0|[0.0,0.5]|[-19.037302860930...|[5.39764620520461...|       1.0|
|  0.0|[0.5,1.0]|[18.9861466274786...|[0.99999999431904...|       0.0|
+-----+---------+--------------------+--------------------+----------+
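The probability column holds the full per-class vector, so the second-best class can be read out of it directly. A hedged sketch continuing from the model and df above (in Spark 1.6, spark.ml stores these columns as spark.mllib.linalg.Vector):

```scala
import org.apache.spark.mllib.linalg.Vector

// For each row, pair every class index with its probability
// and sort best-first: (probability, classIndex) pairs.
val ranked = model.transform(df).select("probability").rdd.map { row =>
  row.getAs[Vector](0).toArray.zipWithIndex.sortBy(-_._1).toSeq
}
ranked.collect().foreach(println)
```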

With RDDs you can use spark.mllib:

val model = new spark.mllib.classification.LogisticRegressionWithLBFGS().run(data)

This model does not expose the raw predictions and probabilities. You can take a look at predictPoint. It multiplies the vectors and picks the class with the highest prediction. The weights are publicly accessible, so you could copy that algorithm and save the predictions instead of just returning the highest one.


bobo32 February 2016

Following the suggestions from @Daniel Darabos:

  1. I tried to use the LogisticRegression class from ml instead of mllib. Unfortunately, in Spark 1.6 it only supports binary logistic regression, not multi-class.
  2. I took a look at predictPoint and modified it so that it returns the probabilities for each class. Here is what it looks like:
    import scala.collection.mutable.HashMap
    import org.apache.spark.mllib.linalg.Vector

    def predictPointForMulticlass(featurizedVector: Vector, weightsArray: Vector, intercept: Double, numClasses: Int, numFeatures: Int): Seq[(String, Double)] = {

        val weightsArraySize = weightsArray.size
        val dataWithBiasSize = weightsArraySize / (numClasses - 1)
        val withBias = false

        var bestClass = 0
        var maxMargin = 0.0
        val margins = new Array[Double](numClasses - 1)
        val temp_marginMap = new HashMap[Int, Double]()
        val res = new HashMap[Int, Double]()

        (0 until numClasses - 1).foreach { i =>
          var margin = 0.0
          var index = 0
          featurizedVector.toArray.foreach { value =>
            if (value != 0.0) {
              margin += value * weightsArray((i * dataWithBiasSize) + index)
            }
            index += 1
          }
          // Intercept is required to be added into margin.
          if (withBias) {
            margin += weightsArray((i * dataWithBiasSize) + featurizedVector.size)
          }
          margins(i) = margin
          temp_marginMap += (i -> margin)

          if (margin > maxMargin) {
            maxMargin = margin
            bestClass = i + 1
          }
        }

        for ((k, v) <- temp_marginMap) {
          // probCalc: the poster's margin-to-probability helper (its definition was cut off in the post)
          val calc = probCalc(maxMargin, v)
          res += (k -> calc)
        }
        res.toSeq.map { case (k, v) => (k.toString, v) } // match the declared Seq[(String, Double)]
    }
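For completeness, the way mllib's multinomial model turns the K-1 margins into probabilities (class 0 is the pivot with margin 0) can be sketched in plain Scala. This is a hedged reconstruction of what a probCalc-style helper ultimately needs to compute, not code from the post:

```scala
// Multinomial logistic probabilities from K-1 margins, with class 0 as pivot:
// P(y=0) = 1 / (1 + sum_i exp(m_i)),  P(y=i) = exp(m_i) / (1 + sum_i exp(m_i))
def marginsToProbs(margins: Array[Double]): Array[Double] = {
  val expM = margins.map(math.exp)
  val denom = 1.0 + expM.sum
  (1.0 / denom) +: expM.map(_ / denom)
}

val probs = marginsToProbs(Array(1.0, -1.0))
// probs has 3 entries (pivot class plus 2) and sums to 1.0
```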

Post Status

Asked in February 2016
Viewed 1,775 times
Voted 9
Answered 2 times
