Parse JSON data with Apache Spark and Scala - scala

I have this type of file where each line is a JSON object except for the first few words (see attached image). I want to parse this type of file using Spark and Scala. I have tried sqlContext.read.json("path to json file"), but it gives me an error (corrupt data) because the whole line is not a JSON object. How do I parse this type of file into a SQL DataFrame?

Try this:
val rawRdd = sc.textFile("path-to-the-file")
val jsonRdd = rawRdd.map(_.substring(32)) //32 - number of first characters to ignore
val df = spark.read.json(jsonRdd)
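If the number of characters to skip is not the same on every line, a variant of the same idea (just a sketch, assuming each JSON object starts at the first '{' on its line) is to strip everything before that character:

val jsonRdd2 = rawRdd
  .map(line => line.substring(line.indexOf('{').max(0)))
  .filter(_.startsWith("{")) // drop lines that contain no JSON object at all
val df2 = spark.read.json(jsonRdd2)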

Related

How can I convert a Spark dataframe to an RDD[(String, PortableDataStream)]?

I have an issue. In my company, for reading an Avro file we do something like this:
val rdd = spark.sparkContext.binaryFiles(path)
implicit val streamEncoder: Encoder[(String, PortableDataStream)] = Encoders.kryo[(String, PortableDataStream)]
spark.createDataset(rdd)
This format allows us to easily push the data to a server without changing the Avro. But now we need to modify the Avro file to add some new columns before sending it to the server, by doing something like this:
// spark.read.avro requires the spark-avro package (import com.databricks.spark.avro._)
val avroDF2 = spark.read.avro(restitutionPath2)
val df = avroDF2.select(col("*") +: predefinedList.map(x => lit("").alias(x)): _*)
Now my problem is that I can't find a way to convert my DataFrame (df) to an RDD[(String, PortableDataStream)] in order to directly push it to my server instead of writing the Avro file first.

How to convert a text file string to a Map in one variable, and how to extract a value by passing a key, in Spark Scala?

I am reading a text file from the local file system. I want to convert the String to a Dictionary (Map), store it in one variable, and extract a value by passing a key. I am new to Spark Scala.
scala> val file = sc.textFile("file:///test/prod_details.txt");
scala> file.foreach(println)
{"00000006-0000-0000": "AWS", "00000009-0000-0000": "JIRA", "00000010-0000-0000-0000": "BigData", "00000011-0000-0000-0000": "CVS"}
scala> val rowRDD=file.map(_.split(","))
Expected Result is :
If I pass the key as "00000010-0000-0000-0000",
the function should return the value as BigData
Since your file is in JSON format and is not big, you can read it with the Spark JSON reader and then build a map from the column names (the keys) and the row values:
val df = session.read.json("path to file")
val keys = df.columns
val values = df.collect().last.toSeq
val map = keys.zip(values).toMap
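You can then look up a value by key. Note that the values come out of a Row, so they are typed as Any; a small usage sketch:

val value = map.get("00000010-0000-0000-0000").map(_.toString) // Some("BigData") for the sample file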

Spark Scala - textFile() and sequenceFile() RDDs

I'm successfully loading my sequence files into a DataFrame with some code like this:
import org.apache.hadoop.io.LongWritable

val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val file = sc.sequenceFile[LongWritable, String](src)
val jsonRecs = file.map(record => record._2) // keep only the JSON payload
val df = sqlContext.read.json(jsonRecs)
I'd like to do the same with some text files. The text files have a similar format to the sequence files (a timestamp, a tab character, then the JSON). But the problem is textFile() returns an RDD[String] instead of an RDD[(LongWritable, String)] like the sequenceFile() method.
My goal is to be able to test the program with either sequence files or text files as input.
How could I convert the RDD[String] coming from textFile() into an RDD[(LongWritable, String)]? Or is there a better solution?
Assuming that your text file is a CSV file, you can use the following code for reading it into a DataFrame, where spark is the SparkSession:
val df = spark.read.option("header", "false").csv("file.txt")
Like the header option, there are multiple options you can provide depending upon your requirement. Check this for more details.
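For this particular layout (timestamp, tab character, then JSON), a sketch using the sep option may be closer to what is needed; whether the JSON column survives intact depends on the CSV quoting rules, so treat this as an assumption to verify:

// _c0 = timestamp, _c1 = the JSON payload
val df = spark.read
  .option("header", "false")
  .option("sep", "\t")
  .csv("file.txt")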
Thanks for the responses. It's not a CSV but I guess it could be. It's just the text output of doing this on a sequence file in HDFS:
hdfs dfs -text /path/to/my/file > myFile.txt
Anyway, I found a solution that works for both sequence and text files for my use case. This code ends up setting the variable 'file' to an RDD[(String, String)] in both cases, and I can work with that.
var file = if (inputType.equalsIgnoreCase("text")) {
  // Split each line on the first tab only, in case the JSON itself contains tabs
  sc.textFile(src).map { line =>
    val fields = line.split("\t", 2)
    (fields(0), fields(1))
  }
} else { // Default to assuming sequence files are input
  sc.sequenceFile[String, String](src)
}
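Either way, the downstream steps from the question stay the same; a short sketch reusing the names above:

// Works for both input types: keep the JSON payload and parse it
val jsonRecs = file.map(_._2)
val df = sqlContext.read.json(jsonRecs)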

How to remove all records in an RDD that include null?

I loaded an RDD from a CSV file. However, this file includes invalid data, so when I tried to output the content of this RDD with first(), the exception is
Caused by: java.lang.NumberFormatException: empty String
I hope to find a solution that removes all records from the RDD where a record includes an empty string. In addition, this RDD includes so many fields that it is difficult to handle every field one by one. I remember that a DataFrame has such a function, na.drop(). I need this kind of function to work on an RDD.
The code I used is like:
//using case class
case class Flight(dest_id: Long, dest: String, crsdeptime: Double, deptime: Double, depdelaymins: Double, crsarrtime: Double)
//defining function
def parseFlight(str: String): Flight = {
  val line = str.split(",")
  // toLong/toDouble throw NumberFormatException on empty fields
  Flight(line(0).toLong, line(1), line(2).toDouble, line(3).toDouble, line(4).toDouble, line(5).toDouble)
}
//loading data
val textRDD = sc.textFile("/root/data/data.csv")
val flightsRDD = textRDD.map(parseFlight)
Update:
When I use an RDD converted from a DataFrame, I find that every element of the RDD is a Row object. How do I extract some fields of one Row to build an Edge object?
If the header in the csv file matches the variable names in the case class, then it's easier to read the data as a dataframe and then use na.drop().
// .as[Flight] needs import spark.implicits._ in scope
val flightsDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/root/data/data.csv")
  .na.drop()
  .as[Flight]
If you want an RDD, it is always possible to convert it afterwards with flightsDf.rdd.
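For the update about building Edge objects: a minimal sketch, assuming GraphX's Edge, that dest_id was inferred as a long, and reading fields from each Row by name with getAs. Flight has no source-vertex field, so the 0L below is a placeholder, not part of the original data:

import org.apache.spark.graphx.Edge

val edgesRDD = flightsDf.toDF().rdd.map { row =>
  val destId = row.getAs[Long]("dest_id")
  val delay = row.getAs[Double]("depdelaymins")
  Edge(0L, destId, delay) // srcId placeholder; replace with a real vertex id
}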

filter and store the result using spark

I have a file which contains data like below:
100|hyd|xxx|32
101|chn|yyy|98
103|chn|abc|87
104|hyd|nbx|56
Here I want to filter the data based on location (hyd, chn) and store it in a text file.
I tried the below code.
val file=sc.textFile("/home/cloudera/abc.txt")
val file2=file.map(line=>line.split("\\|"))
val file3 = file2.filter(line=>line.apply(1).matches("hyd")).saveAsTextFile("/home/cloudera/hyd")
When I check the /home/cloudera/hyd/part-00000 path, the data is stored as the array's default toString:
[Ljava.lang.String;@679e1175
I want the data to be stored in plain text format.
100|hyd|xxx|32
104|hyd|nbx|56
Thank you.
You are just missing one thing: converting the Array[String] back to a String!
This can easily be done this way:
val file = sc.textFile("/home/cloudera/abc.txt")
val file2 = file.map(line => line.split("\\|"))
val file3 = file2
  .filter(line => line(1).matches("hyd"))
  .map(line => line.mkString("|")) // join the fields back into one line
  .saveAsTextFile("/home/cloudera/hyd")
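An alternative sketch that avoids the split-then-rejoin round trip is to keep each original line intact and split only inside the filter predicate:

val hydLines = file.filter(line => line.split("\\|")(1) == "hyd")
hydLines.saveAsTextFile("/home/cloudera/hyd")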