Developers Planet

mithrix February 2016

How to compute summary statistic on Cassandra table with Spark DataFrame?

I'm trying to compute the min, max, and mean of some Cassandra data through Spark, but I need to do it in Java.

import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.*;

DataFrame df = sqlContext.read()
        .format("org.apache.spark.sql.cassandra")
        .option("table", "someTable")
        .option("keyspace", "someKeyspace")
        .load();

df.groupBy(col("keyColumn"))
        .agg(min("valueColumn"), max("valueColumn"), avg("valueColumn"))
        .show();

Edited to show the working version: make sure to put double quotes (") around someTable and someKeyspace so that they are passed as string literals.

Answers


MarcintheCloud February 2016

I suggest checking out the connector demos, which come in both Scala and the equivalent Java: https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector-demos

You can also check out the Spark documentation, which has many examples that you can flip between Scala, Java, and Python versions: http://spark.apache.org/documentation.html

I'm almost 100% certain that between those two links, you'll find exactly what you're looking for.

If there's anything you're having trouble with after that, feel free to update your question with a more specific error/problem.


purplebee February 2016

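In general, you can compile the Scala version and inspect the result from the Java side:

compile the Scala file: $ scalac Main.scala

list the Java-visible members of the resulting Main.class file: $ javap Main

Note that javap prints class and method signatures rather than full Java source; recovering source code requires a separate decompiler. More info is available at the following URL: http://alvinalexander.com/scala/scala-class-to-decompiled-java-source-code-classes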


zero323 February 2016

Just import your data as a DataFrame and apply required aggregations:

import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.*;

DataFrame df = sqlContext.read()
        .format("org.apache.spark.sql.cassandra")
        .option("table", someTable)
        .option("keyspace", someKeyspace)
        .load();

df.groupBy(col("keyColumn"))
        .agg(min("valueColumn"), max("valueColumn"), avg("valueColumn"))
        .show();

where someTable and someKeyspace are String variables holding the table name and the keyspace, respectively.
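
To make it concrete what the groupBy/agg call above returns, here is a minimal plain-Java sketch (no Spark, no Cassandra) that computes the same per-key min, max, and mean over a small in-memory list. The class name SummaryStatsSketch, the Row type, and the sample values are all made up for illustration; it only mirrors the semantics of the aggregation, not how Spark executes it.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java illustration (no Spark): per-key min, max, and mean,
// mirroring groupBy("keyColumn").agg(min, max, avg) on a tiny list.
final class SummaryStatsSketch {

    // Hypothetical row type standing in for (keyColumn, valueColumn).
    static final class Row {
        final String key;
        final double value;

        Row(String key, double value) {
            this.key = key;
            this.value = value;
        }
    }

    // Groups rows by key and collects count/min/max/average for each group.
    // A TreeMap keeps the keys sorted so the printed order is stable.
    static Map<String, DoubleSummaryStatistics> summarize(List<Row> rows) {
        return rows.stream().collect(
                Collectors.groupingBy(
                        (Row r) -> r.key,
                        TreeMap::new,
                        Collectors.summarizingDouble((Row r) -> r.value)));
    }

    // Made-up sample data, not taken from any real table.
    static List<Row> sampleRows() {
        return Arrays.asList(
                new Row("a", 1.0),
                new Row("a", 3.0),
                new Row("b", 10.0));
    }

    public static void main(String[] args) {
        // Print one line per key, similar in spirit to DataFrame.show().
        summarize(sampleRows()).forEach((key, stats) ->
                System.out.println(key
                        + " min=" + stats.getMin()
                        + " max=" + stats.getMax()
                        + " avg=" + stats.getAverage()));
    }
}
```

In the Spark version each group becomes one row of the resulting DataFrame, with one column per aggregate; here each group becomes one map entry holding a DoubleSummaryStatistics.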

Post Status

Asked in February 2016
Viewed 1,454 times
Voted 13
Answered 3 times
