user3783034 February 2016

Writing Spark RDD as Gzipped file in Amazon s3

I have an output RDD in my Spark code, written in Python, that I want to save to Amazon S3 as a gzipped file. I have tried the following approaches. The call below correctly saves the RDD to S3, but not in gzipped format.

output_rdd.saveAsTextFile("s3://<name-of-bucket>/")

The call below raises an error: TypeError: saveAsHadoopFile() takes at least 3 arguments (3 given)

output_rdd.saveAsHadoopFile("s3://<name-of-bucket>/", 
                        compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
                       )

What is the correct way to do this?

Answers


Avihoo Mamka February 2016

You need to specify the output format as well.

Try this:

output_rdd.saveAsHadoopFile("s3://<name-of-bucket>/",
                        "org.apache.hadoop.mapred.TextOutputFormat",
                        compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
                       )

You can use any of the Hadoop-supported compression codecs:

  • gzip: org.apache.hadoop.io.compress.GzipCodec
  • bzip2: org.apache.hadoop.io.compress.BZip2Codec
  • LZO: com.hadoop.compression.lzo.LzopCodec
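For a plain RDD of strings there is also a simpler route: since Spark 1.1, saveAsTextFile accepts a compressionCodecClass argument directly, so no output format class has to be specified. A minimal sketch, assuming S3 credentials are already configured on the cluster and using a hypothetical RDD and placeholder bucket path:

from pyspark import SparkContext

sc = SparkContext(appName="gzip-to-s3")

# Hypothetical RDD standing in for output_rdd; replace with your own data.
output_rdd = sc.parallelize(["line one", "line two", "line three"])

# Pass the codec straight to saveAsTextFile (Spark 1.1+).
# Replace <name-of-bucket> with a real bucket/path.
output_rdd.saveAsTextFile(
    "s3://<name-of-bucket>/output/",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)

Either way, Spark writes one gzipped part file per partition (part-00000.gz, part-00001.gz, ...) under the given prefix, not a single .gz file.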

Post Status

Asked in February 2016
Viewed 3,328 times
Voted 9
Answered 1 time
