Spark: force two RDD[K, V]s to have co-located partitions using a custom partitioner
I have two RDD[K, V]s, where K = Long and V = Object. Let's call them rdd1 and rdd2. Both use a common custom Partitioner. I am trying to find a way to take a union or join while avoiding, or at least minimizing, data movement.
val kafkaRdd1 = /* from kafka sources */
val kafkaRdd2 = /* from kafka sources */
val rdd1 = kafkaRdd1.partitionBy(new MyCustomPartitioner(24))
val rdd2 = kafkaRdd2.partitionBy(new MyCustomPartitioner(24))
val rdd3 = rdd1.union(rdd2) // Without shuffle
val rdd4 = rdd1.leftOuterJoin(rdd2) // Without shuffle
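For illustration, MyCustomPartitioner is not shown in the question; a minimal sketch of such a partitioner, assuming it simply buckets the Long key by modulo, might look like this:

import org.apache.spark.Partitioner

// Hypothetical sketch: maps each Long key to a fixed partition so that
// both RDDs agree on key placement.
class MyCustomPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Long]
    // Non-negative modulo, so negative keys still land in [0, numPartitions)
    (((k % numPartitions) + numPartitions) % numPartitions).toInt
  }
  // equals/hashCode matter: Spark only treats two RDDs as co-partitioned
  // when their partitioners compare equal.
  override def equals(other: Any): Boolean = other match {
    case p: MyCustomPartitioner => p.numPartitions == numPartitions
    case _ => false
  }
  override def hashCode: Int = numPartitions
}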
Is it safe to assume (or is there a way to enforce) that the nth partition of rdd1 and the nth partition of rdd2 end up on the same worker node?
It is not possible to enforce co-location in Spark, but the approach you use will minimize data movement. Because both RDDs share the same partitioner, the union produces a PartitionerAwareUnionRDD rather than a plain UnionRDD, and the join can use narrow dependencies instead of a shuffle. When a PartitionerAwareUnionRDD is created, the input RDDs are analyzed to choose optimal output locations based on the number of records per location; see its getPreferredLocations method for details.
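You can check this yourself. The snippet below is a sketch (any SparkContext sc and a MyCustomPartitioner like the one sketched above are assumed):

val a = sc.parallelize(Seq(1L -> "a", 2L -> "b")).partitionBy(new MyCustomPartitioner(24))
val b = sc.parallelize(Seq(1L -> "x", 3L -> "y")).partitionBy(new MyCustomPartitioner(24))

val u = a.union(b)
println(u.partitioner)       // Some(MyCustomPartitioner...) - the partitioner-aware path was taken
println(u.partitions.length) // 24, not 48: partitions were merged pairwise, not concatenated

val j = a.leftOuterJoin(b)
println(j.toDebugString)     // no extra ShuffledRDD beyond the two original partitionBy steps

If the partitioners were not equal, the union would instead return a plain UnionRDD with 48 partitions and no partitioner, and the join would introduce a new shuffle stage.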