
I have a CSV on the Hadoop file system (HDFS) that I want to convert into multiple serialized Java objects using this framework:

https://github.com/clarkduvall/serpy

I have heard of Avro and Parquet, but I don't want to use those; I want to output plain serialized binary files. My CSV file contains records like:

Name, Age, Date
Jordan, 1, 1/1/2017
John, 5, 2/2/2017

Is this possible using Hadoop or Spark? The output objects should be readable by a normal non-hadoop/spark related Java program. Any example would be appreciated!

Rolando

1 Answer


The output objects should be readable by a normal non-hadoop/spark related Java program

For that to work you will need to save your results outside of HDFS. So what you could do is:

  • Read the CSV data from HDFS using SparkContext.textFile in Spark
  • Grab a limited number of rows into your driver using RDD.take()
    • The argument here will be the number of rows you want e.g. myRdd.take(1000) to grab 1000 rows
  • myRdd.collect() will grab everything, but if you have a lot of data, that can cause an OutOfMemoryError on your spark driver
  • Now that you have all the rows as an array, you can store them using a basic Java serialiser

Sample Code:

val sc = new SparkContext(conf)
val myRdd = sc.textFile("hdfs://namenode/mypath/myfile.csv")
val myArray = myRdd.take(100000)
//Store myArray to file using java serialiser
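The storage step in the comment above can be done with standard Java serialization, which needs nothing from Hadoop or Spark on the classpath. This is a minimal sketch, assuming the collected rows are plain `String`s; the class name `RowStore` and the file path are made up for illustration:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;

public class RowStore {
    // Write the collected rows to disk with standard Java serialization.
    static void save(String[] rows, String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(rows);
        }
    }

    // Any plain Java program (no Hadoop/Spark dependencies) can read it back.
    static String[] load(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            return (String[]) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        String[] rows = {"Jordan, 1, 1/1/2017", "John, 5, 2/2/2017"};
        save(rows, "myrows.ser");
        System.out.println(Arrays.toString(load("myrows.ser")));
    }
}
```

The same `load` method is all the non-Spark consumer needs, as long as the file was written with `ObjectOutputStream` rather than a Hadoop-specific format.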

If you want to store as serialized data on HDFS, you can do this:

val sc = new SparkContext(conf)
val myRdd = sc.textFile("hdfs://namenode/mypath/myfile.csv")
myRdd.saveAsObjectFile("hdfs://namenode/mypath/myoutput.obj")

This will save an Array[String]. You can transform your RDD between lines 2 and 3 to make this more useful to the program that reads it.
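If the goal is typed objects rather than raw lines, the consuming Java program needs a matching Serializable class on its classpath. A minimal sketch of such a class for the CSV above (the class name `PersonRecord` is made up, and the date is kept as a plain string for simplicity):

```java
import java.io.Serializable;

// Mirrors one CSV row: Name, Age, Date.
public class PersonRecord implements Serializable {
    private static final long serialVersionUID = 1L;

    public final String name;
    public final int age;
    public final String date; // kept as text; parse further if needed

    public PersonRecord(String name, int age, String date) {
        this.name = name;
        this.age = age;
        this.date = date;
    }

    // Parse one CSV line like "Jordan, 1, 1/1/2017" into a record.
    public static PersonRecord fromCsvLine(String line) {
        String[] parts = line.split(",");
        return new PersonRecord(parts[0].trim(),
                Integer.parseInt(parts[1].trim()),
                parts[2].trim());
    }
}
```

You could map each line through a parser like this before serializing, so the reader gets fields rather than unsplit strings.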

Al M
  • So if I wanted to use serpy as the serializer.. which appears to be a custom serializer, how do I do this? It looks like saveAsObjectFile is a generic serializer? – Rolando Dec 16 '17 at 16:42
  • Are you saving it to local disk or HDFS? For local disk you can use my first snippet. For saving to HDFS you will need my second snippet, along with a custom serialiser, some details on those here: https://stackoverflow.com/questions/36144618/spark-kryo-register-a-custom-serializer – Al M Dec 17 '17 at 17:36