
Developers Planet


Ramesh February 2016

Spark on YARN - saveAsTextFile() method creating a lot of empty part files

I am running the Spark job on Hadoop YARN Cluster.

I am using the saveAsTextFile() method to store the RDD as a text file.

I can see more than 150 empty part files created out of 250 files.

Is there a way we can avoid this?


whaleberg February 2016

Each partition is written to its own file. Empty partitions will be written as empty files.

To avoid writing the empty files, you can coalesce or repartition your RDD into a smaller number of partitions before saving.
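To see why reducing the partition count helps, here is a plain-Python sketch (a model only, not Spark itself — in Spark you would simply call rdd.coalesce(n) or rdd.repartition(n) before saveAsTextFile()). Merging 250 sparse partitions down to 10 leaves none of them empty, so no empty part files would be written. The toy data and the coalesce helper are hypothetical:

```python
# Model an RDD's partitions as lists of records. Most are empty,
# roughly as described in the question (hypothetical toy data).
partitions = [["r%d" % i] if i % 5 == 0 else [] for i in range(250)]
empty_before = sum(1 for p in partitions if not p)

def coalesce(parts, n):
    """Merge adjacent partitions into n buckets, loosely like
    RDD.coalesce(n) without a shuffle: records from neighboring
    partitions are concatenated, and none are dropped."""
    buckets = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        buckets[i * n // len(parts)].extend(part)
    return buckets

merged = coalesce(partitions, 10)
empty_after = sum(1 for p in merged if not p)
print(empty_before, empty_after)  # 200 0
```

coalesce() is usually preferred over repartition() for shrinking the partition count, because it avoids a full shuffle; repartition() reshuffles all the data but gives more evenly sized output files.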

If you didn't expect to have empty partitions, it may be worth investigating why you have them. Empty partitions can arise either from a filtering step that removed every element from some partitions, or from a poor hash function: if the hashCode() of your RDD's elements doesn't distribute them well, you can end up with an unbalanced RDD that has empty partitions.
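The hash effect is easy to demonstrate in a plain-Python sketch (Spark's HashPartitioner similarly assigns each key to hashCode % numPartitions; the assign helper and toy keys below are hypothetical):

```python
def assign(keys, num_partitions, hash_fn=hash):
    """Mimic hash partitioning: each key goes to hash(key) % numPartitions."""
    parts = [[] for _ in range(num_partitions)]
    for k in keys:
        parts[hash_fn(k) % num_partitions].append(k)
    return parts

keys = list(range(100))

# A reasonable hash spreads keys across all partitions.
good = assign(keys, 10)
# A degenerate hashCode() (here a constant) piles every key into
# one partition and leaves the other nine empty.
bad = assign(keys, 10, hash_fn=lambda k: 42)

print(sum(1 for p in good if not p))  # 0
print(sum(1 for p in bad if not p))   # 9
```

The same imbalance happens with any key whose hashCode() collapses many distinct values onto a few hash values.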

Post Status

Asked in February 2016
Viewed 2,106 times
Voted 7
Answered 1 time


