Copy all elements in RDD to Array - scala

So, I'm reading data from a JSON file and creating a DataFrame. Usually, I would use
sqlContext.read.json("//line//to//some-file.json")
Problem is that my JSON file isn't consistent. For each line in the file, there are two JSON objects. Each line looks like this:
{...data I don't need....}, {...data I need....}
I only need my DataFrame to be formed from the data I need, i.e. the second JSON object of each line. So I read each line as a string and substring the part that I need, like so:
val lines = sc.textFile(link, 2)
val part = lines.map( x => x.substring(x.lastIndexOf('{')).trim)
I want to get all the elements in 'part' as an Array[String], then turn the Array[String] into one string and make the DataFrame, like so:
val strings = part.collect() // doesn't work
val strings = part.take(1000) // works
val jsonStr = "[".concat(strings.mkString(", ")).concat("]")
The problem is that part.collect() doesn't work, but part.take(N) does. However, I'd like to get all my data and not just the first N.
Also, if I try part.take(part.count().toInt) it still doesn't work.
Any Ideas??
EDIT
I realized my problem after a good sleep. It was a silly mistake on my part. The very last line of my input file had a different format from the rest of the file.
So part.take(N) would work for all N less than part.count(). That's why part.collect() wasn't working.
Thanks for the help though!
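For the record, a minimal sketch (assuming Spark 1.x with a sqlContext, as in the question) that avoids collecting to the driver at all: filter out any line that isn't a complete JSON object and hand the cleaned RDD[String] straight to the JSON reader.
val lines = sc.textFile(link)
val part = lines.map(x => x.substring(x.lastIndexOf('{')).trim)
// keep only lines that look like a complete JSON object, so an odd trailing line can't break the job
val cleaned = part.filter(s => s.startsWith("{") && s.endsWith("}"))
val df = sqlContext.read.json(cleaned) // read.json accepts an RDD[String] in Spark 1.x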

Related

Read one-line file into dataframe

I have the task of reading a one-line JSON file into Spark. I've thought about either modifying the input file so that it fits spark.read.json(path), or reading the whole file and modifying it in memory to make it fit, as shown below:
import spark.implicits._
val file = sc.textFile(path).collect()(0)
val data = file.split("},").map(json => if (json.endsWith("}")) json else s"$json}") // re-append the brace removed by split
val ds = data.toSeq.toDF()
Is there a way of directly reading the json or read the one line file into multiple rows?
Edit:
Sorry, I didn't clearly explain the JSON format; all the JSON objects are on the same line:
{"key":"value"},{"key":"value2"},{"key":"value2"}
If imported with spark.read.json(path), it would only take the first value.
Welcome to SO, HugoDife! I believe single-line load is what spark.read.json() already does, and you are perhaps looking for this answer. If not, maybe you want to adjust your question with a data example.
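For completeness, a minimal sketch of the in-memory approach (assuming Spark 2.2+ and that the whole file really is a single line of comma-separated objects): wrap the line in brackets so it becomes a JSON array, then hand it to the JSON reader as a Dataset[String], which gives one row per object.
import spark.implicits._
val line = spark.read.textFile(path).first() // the entire file is one line
val asArray = s"[$line]" // {"k":"v"},{"k":"v2"} becomes [{"k":"v"},{"k":"v2"}]
val df = spark.read.json(Seq(asArray).toDS()) // one row per JSON object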

Saving pyspark dataframe with complicated schema in plain text for testing

How do I make clean test data for pyspark? I have figured something out that seems pretty good, but parts seem a little awkward, so I'm posting.
Let's say I have a dataframe df with a complicated schema and a small number of rows. I want test data checked into my repo. I don't want a binary file. At this point, I'm not sure of the best way to proceed, but I'm thinking I have a file like
test_fn.py
and it has this in it
schema_str='struct<eventTimestamp:timestamp,list_data:array<struct<valueA:string,valueB:string,valueC:boolean>>>'
That gets the schema in text format, using the df.schema.simpleString() function. Then, to get the rows, I do
from pyspark.sql import functions as F, types as T
lns = [row.json_txt for row in df.select(F.to_json(F.struct('*')).alias('json_txt')).collect()]
Now I put those lines in my test_fn.py file, or I could keep a .json file in the repo.
Now, to run the test, I have to make a dataframe with the correct schema and data from this text. It seems the only way Spark will parse the simple string is if I create a dataframe with it; that is, it seems I can't pass that simple string to the from_json function. This is a little awkward, which is why I thought I'd post:
schema2 = spark.createDataFrame(data=[], schema=schema_str).schema
lns = # say I read the lns back from above
df_txt = spark.createDataFrame(data=lns, schema=T.StringType())
I see df_txt just has one column called 'value'
df_json = df_txt.select(F.from_json('value', schema=schema2).alias('xx'))
sel = ['xx.%s' % nm for nm in df_json.select('xx').schema.fields[0].dataType.fieldNames()]
df2 = df_json.select(*sel)
Now df2 should be the same as the original df - which I see is the case using the deepdiff module.

How to remove header by using filter function in spark?

I want to remove the header from a file. But, since the file will be split into partitions, I can't just drop the first item. So I was using a filter function to figure it out, and here is the code I am using:
val noHeaderRDD = baseRDD.filter(line=>!line.contains("REPORTDATETIME"));
and the error I am getting says "error: not found: value line". What could be the issue here with this code?
I don't think anybody answered the obvious, whereby line.contains is also possible:
val noHeaderRDD = baseRDD.filter(line => !(line contains("REPORTDATETIME")))
You were nearly there, just a syntax issue, but that is significant of course!
Using textFile as below:
val rdd = sc.textFile(<<path>>)
rdd.filter(x => !x.startsWith(<<"Header Text">>))
Or
In Spark 2.0:
spark.read.option("header","true").csv("filePath")
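Another common pattern, as a sketch (it assumes the header is simply the first line of the file, so it lands in partition 0, rather than matching on the header text): drop the first line of the first partition only, using mapPartitionsWithIndex.
val noHeaderRDD = baseRDD.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter // partition 0 starts with the header line
}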

Bypass last line of each file in Spark (Scala)

This question is related to this.
I am processing an S3 folder containing csv.gz files in Spark. Each csv.gz file has a header that contains column names. This has been solved by the above SO link and the solution looks like this:
val rdd = sc.textFile("s3://.../my-s3-path").mapPartitions(_.drop(1))
The problem now is that it looks like some of the files have a newline ('\n') at the end (we are not sure which files). So when converting the RDD to a DataFrame, I'm getting errors. The question now is:
How do I get rid of the last line of each file if it is '\n'?
Why not a simple filter:
val rdd = sc.textFile("s3...").filter(line => !line.equalsIgnoreCase("\n")).mapPartition...
Or filter any empty line:
val rdd = sc.textFile("s3...").filter(line => !line.trim().isEmpty)...
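Putting the two steps together, a minimal sketch (path elided as in the question): drop the per-file header inside each partition, then discard blank lines, which also takes care of a trailing '\n'.
val rdd = sc.textFile("s3://.../my-s3-path")
  .mapPartitions(_.drop(1)) // drop the header line of each partition/file
  .filter(line => line.trim.nonEmpty) // drop empty lines, including a trailing newline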

Spark: RDD.saveAsTextFile when using a pair of (K,Collection[V])

I have a dataset of employees and their leave-records. Every record (of type EmployeeRecord) contains EmpID (of type String) and other fields. I read the records from a file and then transform into PairRDDFunctions:
val empRecords = sc.textFile(args(0))
....
val empsGroupedByEmpID = this.groupRecordsByEmpID(empRecords)
At this point, 'empsGroupedByEmpID' is of type RDD[(String, Iterable[EmployeeRecord])]. I transform this into PairRDDFunctions:
val empsAsPairRDD = new PairRDDFunctions[String,Iterable[EmployeeRecord]](empsGroupedByEmpID)
Then I process the records as per the logic of the application. Finally, I get an RDD of type RDD[Iterable[EmployeeRecord]]:
val finalRecords: RDD[Iterable[EmployeeRecord]] = <result of a few computations and transformation>
When I try to write the contents of this RDD to a text file using the available API thus:
finalRecords.saveAsTextFile("./path/to/save")
I find that in the file every record begins with ArrayBuffer(...). What I need is a file with one EmployeeRecord on each line. Is that not possible? Am I missing something?
I have spotted the missing API. It is well...flatMap! :-)
By using flatMap with identity, I can get rid of the Iterable wrapper and 'unpack' the contents, like so:
finalRecords.flatMap(identity).saveAsTextFile("./path/to/file")
That solves the problem I have been having.
I also found this post suggesting the same thing. I wish I had seen it a bit earlier.
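For reference, the same idea with an explicit formatting step, in case EmployeeRecord's toString is not the line format you want. The field names below are made up for illustration; adjust them to your record.
finalRecords
  .flatMap(identity) // RDD[Iterable[EmployeeRecord]] -> RDD[EmployeeRecord]
  .map(r => s"${r.empID},${r.daysOnLeave}") // hypothetical fields
  .saveAsTextFile("./path/to/save")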