4a616e February 2016

Loading numpy arrays stored in an npz archive in PySpark

I have a large number of NumPy arrays in S3, stored in npz archives. What is the best way to load them into a PySpark RDD/DataFrame of NumPy arrays? I have tried to load the files using the sc.wholeTextFiles API.

rdd = sc.wholeTextFiles("s3://[bucket]/[folder_containing_npz_files]")

However, numpy.load requires a filename or file-like object, and loading the file contents into memory as a string takes up a lot of memory.
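For illustration, a minimal local sketch of the mismatch (the test file name is hypothetical; this assumes NumPy is available on the driver):

import numpy as np
from io import BytesIO

# Create a small npz archive locally (hypothetical test file).
np.savez("test.npz", a=np.arange(3))

# binaryFiles-style: raw bytes wrapped in a file-like object works.
with open("test.npz", "rb") as fh:
    data = np.load(BytesIO(fh.read()))
    print(data["a"])  # [0 1 2]

# wholeTextFiles-style: the archive's *contents* come back as a decoded
# string; np.load has no way to consume that, and the decode itself can
# corrupt the binary data.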

Answers


zero323 February 2016

You cannot do much about the memory requirements, but otherwise BytesIO should work just fine:

from io import BytesIO

import numpy as np

def extract(kv):
    # kv is a (path, content) pair as returned by sc.binaryFiles
    k, v = kv
    with BytesIO(v) as r:
        # np.load on an npz archive yields a dict-like NpzFile;
        # emit one (path + "\t" + array name, array) record per entry
        for f, x in np.load(r).items():
            yield "{0}\t{1}".format(k, f), x

sc.binaryFiles(inputPath).flatMap(extract)
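For reference, a hedged usage sketch: binding the result to a name (the rdd variable is assumed here, not part of the answer above) lets you inspect the records.

rdd = sc.binaryFiles(inputPath).flatMap(extract)

# Each record pairs "path<TAB>array name" with the ndarray itself.
for key, arr in rdd.take(2):
    print(key, arr.shape, arr.dtype)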

Post Status

Asked in February 2016
Viewed 3,507 times
Voted 8
Answered 1 time
