Mohitt February 2016

Spark: Force two RDD[Key, Value] with co-located partitions using custom partitioner

I have two RDD[K,V], where K=Long and V=Object. Let's call them rdd1 and rdd2. I have a common custom Partitioner for both, and I am trying to find a way to take a union or a join while avoiding or minimizing data movement.

val kafkaRdd1 = /* from kafka sources */
val kafkaRdd2 = /* from kafka sources */

val rdd1 = kafkaRdd1.partitionBy(new MyCustomPartitioner(24))
val rdd2 = kafkaRdd2.partitionBy(new MyCustomPartitioner(24))

val rdd3 = rdd1.union(rdd2)         // without shuffle
val rdd4 = rdd1.leftOuterJoin(rdd2) // without shuffle

Is it safe to assume (or is there a way to enforce) that the nth partition of both rdd1 and rdd2 ends up on the same worker node?
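(MyCustomPartitioner is not shown here; a minimal, hypothetical sketch of such a partitioner, hashing the Long key into a fixed number of partitions, might look like this.)

import org.apache.spark.Partitioner

// Hypothetical sketch of a custom partitioner like the one used above:
// it maps the Long key to one of numPartitions buckets.
class MyCustomPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    // non-negative modulo, so negative keys still land in a valid bucket
    case k: Long => (((k % numPartitions) + numPartitions) % numPartitions).toInt
    case _ => 0
  }
  // Partitioner equality is how Spark decides two RDDs are co-partitioned;
  // without equals/hashCode, union/join on rdd1 and rdd2 would still shuffle.
  override def equals(other: Any): Boolean = other match {
    case p: MyCustomPartitioner => p.numPartitions == numPartitions
    case _ => false
  }
  override def hashCode: Int = numPartitions
}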

Answers


zero323 February 2016

It is not possible to enforce colocation in Spark, but the method you use will minimize data movement. When a PartitionerAwareUnionRDD is created, the input RDDs are analyzed to choose optimal output locations based on the number of records per location. See the getPreferredLocations method for details.
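For what it's worth, you can confirm this behavior from the shell: when both inputs share an equal partitioner, union produces a PartitionerAwareUnionRDD and the join lineage shows no additional shuffle stage. A minimal sketch (assuming the hypothetical MyCustomPartitioner sketched in the question, and sc being spark-shell's SparkContext):

val rdd1 = sc.parallelize(Seq(1L -> "a", 2L -> "b")).partitionBy(new MyCustomPartitioner(24))
val rdd2 = sc.parallelize(Seq(2L -> "c", 3L -> "d")).partitionBy(new MyCustomPartitioner(24))

val unioned = rdd1.union(rdd2)
println(unioned.partitioner)   // Some(MyCustomPartitioner@...): partitioner preserved
println(unioned.toDebugString) // lineage shows PartitionerAwareUnionRDD, no ShuffledRDD

val joined = rdd1.leftOuterJoin(rdd2)
println(joined.toDebugString)  // CoGroupedRDD with narrow deps on rdd1/rdd2, no extra shuffle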

Post Status

Asked in February 2016
Viewed 2,232 times
Voted 11
Answered 1 time
