Ramesh February 2016

Spark on YARN - saveAsTextFile() method creating a lot of empty part files

I am running a Spark job on a Hadoop YARN cluster.

I am using the saveAsTextFile() method to store an RDD as text files.

I can see that more than 150 of the 250 part files created are empty.

Is there a way to avoid this?

Answers


whaleberg February 2016

Each partition is written to its own file, so empty partitions will be written as empty files.

In order to avoid writing the empty files you can either coalesce or repartition your RDD into a smaller number of partitions.
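To make the mechanics concrete, here is a plain-Python toy model (not actual Spark code) of the one-file-per-partition behavior and of what coalescing does. The numbers mirror the question: 250 partitions, roughly 150 left empty after a filter; the round-robin merge below is only an illustration of coalesce-style merging, not Spark's real placement logic.

```python
# Toy model (plain Python, not Spark) of why saveAsTextFile() emits one
# part file per partition, and how coalescing avoids empty outputs.

def write_part_files(partitions):
    """Mimic saveAsTextFile(): one part file per partition, empty or not.
    Returns (file names, count of empty files)."""
    files = [f"part-{i:05d}" for i in range(len(partitions))]
    empty = sum(1 for p in partitions if not p)
    return files, empty

def coalesce(partitions, n):
    """Mimic RDD.coalesce(n): merge the existing partitions into n buckets
    without a full shuffle (round-robin here, purely for illustration)."""
    buckets = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        buckets[i % n].extend(part)
    return buckets

# 250 partitions; only the first 100 kept any records after a filter.
parts = [[f"rec{i}"] if i < 100 else [] for i in range(250)]

files, empty = write_part_files(parts)
print(len(files), empty)   # 250 files, 150 of them empty

files2, empty2 = write_part_files(coalesce(parts, 100))
print(len(files2), empty2) # 100 files, none empty
```

In real Spark code the equivalent would be something like `rdd.coalesce(100).saveAsTextFile(path)`; coalesce is usually preferred over repartition when only shrinking the partition count, since it avoids a full shuffle.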

If you didn't expect to have empty partitions, it may be worth investigating why you have them. Empty partitions can happen either due to a filtering step which removed all the elements from some partitions, or due to a bad hash function. If the hashCode() for your RDD's elements doesn't distribute the elements well, it's possible to end up with an unbalanced RDD that has empty partitions.
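The hash-function point can be illustrated with a small plain-Python sketch (again, a model rather than Spark itself). Spark's HashPartitioner assigns a key to partition `hash(key) mod numPartitions`; a degenerate hash that returns the same value for every key piles all records into one partition and leaves the rest empty:

```python
# Plain-Python illustration (not Spark) of how a poor hash function
# leaves partitions empty under hash partitioning.

def partition_counts(keys, num_partitions, hash_fn):
    """Count how many records land in each of num_partitions partitions
    when keys are assigned by hash_fn(key) % num_partitions."""
    counts = [0] * num_partitions
    for k in keys:
        counts[hash_fn(k) % num_partitions] += 1
    return counts

keys = list(range(1000))

# A reasonable hash spreads the keys across all 8 partitions.
good = partition_counts(keys, 8, hash)

# A degenerate hashCode() (constant for every key) sends everything to
# a single partition and leaves the other 7 empty.
bad = partition_counts(keys, 8, lambda k: 42)

print(sum(1 for c in good if c == 0))  # 0 empty partitions
print(sum(1 for c in bad if c == 0))   # 7 empty partitions
```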

Post Status

Asked in February 2016
Viewed 2,106 times
Voted 7
Answered 1 time
