
I have a CSV on the Hadoop file system (HDFS) that I want to convert into multiple serialized Java objects using this framework:

https://github.com/clarkduvall/serpy

I have heard of Avro and Parquet, but I don't want to use those; I want to output plain serialized binary files. My CSV file contains records like:

Name, Age, Date
Jordan, 1, 1/1/2017
John, 5, 2/2/2017

Is this possible using Hadoop or Spark? The output objects should be readable by a normal non-hadoop/spark related Java program. Any example would be appreciated!

Rolando

1 Answer


The output objects should be readable by a normal non-hadoop/spark related Java program

For that to work you will need to save your results outside of HDFS. So what you could do is:

  • Read the CSV data from HDFS using SparkContext.textFile in Spark
  • Grab a limited number of rows into your driver using RDD.take()
    • The argument here will be the number of rows you want e.g. myRdd.take(1000) to grab 1000 rows
  • myRdd.collect() will grab everything, but if you have a lot of data, that can cause an OutOfMemoryError on your spark driver
  • Now that you have all the rows as an array, you can store them using a basic Java serialiser

Sample Code:

val sc = new SparkContext(conf)
val myRdd = sc.textFile("hdfs://namenode/mypath/myfile.csv")
val myArray = myRdd.take(100000)
//Store myArray to file using java serialiser
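The storage step in the comment above can be done with standard Java serialization, which needs nothing from Hadoop or Spark on the classpath. This is a minimal sketch, assuming the collected rows are plain `String`s; the class name `RowStore` and the file path are made up for illustration:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;

public class RowStore {
    // Write the collected rows to disk with standard Java serialization.
    static void save(String[] rows, String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(rows);
        }
    }

    // Any plain Java program (no Hadoop/Spark dependencies) can read it back.
    static String[] load(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            return (String[]) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        String[] rows = {"Jordan, 1, 1/1/2017", "John, 5, 2/2/2017"};
        save(rows, "myrows.ser");
        System.out.println(Arrays.toString(load("myrows.ser")));
    }
}
```

The same `load` method is all the non-Spark consumer needs, as long as the file was written with `ObjectOutputStream` rather than a Hadoop-specific format.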

If you want to store as serialized data on HDFS, you can do this:

val sc = new SparkContext(conf)
val myRdd = sc.textFile("hdfs://namenode/mypath/myfile.csv")
myRdd.saveAsObjectFile("hdfs://namenode/mypath/myoutput.obj")

This will save an Array[String]. You can transform your RDD between lines 2 and 3 to make this more useful to the program that reads it.
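If the goal is typed objects rather than raw lines, the consuming Java program needs a matching Serializable class on its classpath. A minimal sketch of such a class for the CSV above (the class name `PersonRecord` is made up, and the date is kept as a plain string for simplicity):

```java
import java.io.Serializable;

// Mirrors one CSV row: Name, Age, Date.
public class PersonRecord implements Serializable {
    private static final long serialVersionUID = 1L;

    public final String name;
    public final int age;
    public final String date; // kept as text; parse further if needed

    public PersonRecord(String name, int age, String date) {
        this.name = name;
        this.age = age;
        this.date = date;
    }

    // Parse one CSV line like "Jordan, 1, 1/1/2017" into a record.
    public static PersonRecord fromCsvLine(String line) {
        String[] parts = line.split(",");
        return new PersonRecord(parts[0].trim(),
                Integer.parseInt(parts[1].trim()),
                parts[2].trim());
    }
}
```

You could map each line through a parser like this before serializing, so the reader gets fields rather than unsplit strings.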

Al M
  • So if I wanted to use serpy as the serializer.. which appears to be a custom serializer, how do I do this? It looks like saveAsObjectFile is a generic serializer? – Rolando Dec 16 '17 at 16:42
  • Are you saving it to local disk or HDFS? For local disk you can use my first snippet. For saving to HDFS you will need my second snippet, along with a custom serialiser, some details on those here: https://stackoverflow.com/questions/36144618/spark-kryo-register-a-custom-serializer – Al M Dec 17 '17 at 17:36