
Developers Planet


user3783034 February 2016

Writing a Spark RDD as a gzipped file in Amazon S3

I have an output RDD in my Spark code, written in Python. I want to save it in Amazon S3 as a gzipped file. I have tried the following functions. The function below correctly saves the output RDD to S3, but not in gzipped format.


The function below returns an error: TypeError: saveAsHadoopFile() takes at least 3 arguments (3 given)


Please guide me toward the correct way to do this.


Avihoo Mamka February 2016

You need to specify the output format as well: the second required argument to saveAsHadoopFile is the output format class, which is why your call without it raised the TypeError above.

Try this:

output_rdd.saveAsHadoopFile(
    "s3://<name-of-bucket>/",
    "org.apache.hadoop.mapred.TextOutputFormat",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)

You can use any of the Hadoop-supported compression codecs:

  • gzip: org.apache.hadoop.io.compress.GzipCodec
  • bzip2: org.apache.hadoop.io.compress.BZip2Codec
  • LZO: com.hadoop.compression.lzo.LzopCodec
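If you only need gzipped text output, PySpark's saveAsTextFile also accepts a compressionCodecClass keyword argument, which avoids spelling out the output format class entirely. Below is a minimal sketch: the bucket path is a placeholder, the Spark call itself is commented out because it needs a live SparkContext and S3 credentials, and the local round-trip only illustrates that GzipCodec writes standard gzip.

```python
import gzip

# Hedged sketch, not the asker's original code; the path is a placeholder:
#
# output_rdd.saveAsTextFile(
#     "s3://<name-of-bucket>/output/",
#     compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
# )
#
# Spark writes one part-NNNNN.gz file per partition. GzipCodec produces
# plain gzip, so any standard tool can read the parts back:
part_file = gzip.compress(b"line1\nline2\n")  # stand-in for a part-00000.gz
lines = gzip.decompress(part_file).decode().splitlines()
print(lines)  # ['line1', 'line2']
```

Reading the output back with sc.textFile("s3://<name-of-bucket>/output/") should also work transparently, since Spark decompresses recognized codecs by file extension.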
