How to remove header by using filter function in spark? - scala

I want to remove header from a file. But, since the file will be split into partitions, I can't just drop the first item. So I was using a filter function to figure it out and here below is the code I am using :
val noHeaderRDD = baseRDD.filter(line=>!line.contains("REPORTDATETIME"));
and the error I am getting says "error not found value line "what could be the issue here with this code?

I don't think anybody answered the obvious, whereby line.contains also possible:
val noHeaderRDD = baseRDD.filter(line => !(line contains("REPORTDATETIME")))
You were nearly there, just a syntax issue, but that is significant of course!

Using textFile as below:
val rdd = sc.textFile(<<path>>)
rdd.filter(x => !x.startsWith(<<"Header Text">>))
Or
In Spark 2.0:
spark.read.option("header","true").csv("filePath")

Related

Write Header only CSV record from Spark Scala DataFrame

My requirement is to write only Header CSV record using Spark Scala DataFrame. Can any one help me on this.
val OHead1 = "/xxxxx/xxxx/xxxx/xxx/OHead1/"
val sc = sparkFile.sparkContext
val outDF = csvDF.select("col_01", "col_02", "col_03").schema
sc.parallelize(Seq(outDF.fieldNames.mkString("\t"))).coalesce(1).saveAsTextFile(s"$OHead1")
The above one is working and able to create header in the CSV with tab delimiter. Since I am using spark session I am creating sparkContext in the second line. outDF is my dataframe created before these statements.
Two things are outstanding, can you one of you help me.
1. The above working code is not overriding the files, so every time I need to delete the files manually. I could not find override option, can you help me.
2. Since I am doing a select statement and schema, will it be consider as action and start another lineage for this statement. If it is true then this would degrade the performance.
If you need to output only header you can use this code:
df.schema.fieldNames.reduce(_ + "," + _)
It will create line of CSV with names of columns
I tested and the solution below did not affect any performance.
val OHead1 = "/xxxxx/xxxx/xxxx/xxx/OHead1/"
val sc = sparkFile.sparkContext
val outDF = csvDF.select("col_01", "col_02", "col_03").schema
sc.parallelize(Seq(outDF.fieldNames.mkString("\t"))).coalesce(1).saveAsTextFile(s"$OHead1")
I got a solution to handle this situation. Define the columns in the configuration file and write those columns in an file. Here is the snipet.
val Header = prop.getProperty("OUT_HEADER_COLUMNS").replaceAll("\"","").replaceAll(",","\t")
scala.tools.nsc.io.File(s"$HeadOPath").writeAll(s"$Header")

Copy all elements in RDD to Array

So, I'm reading data from a JSON file and creating a DataFrame. Usually, I would use
sqlContext.read.json("//line//to//some-file.json")
Problem is that my JSON file isn't consistent. So, for each line in the file, there are 2 JSONs. Each line looks like this
{...data I don't need....}, {...data I need....}
I only need my DataFrame to be formed from the data I need, i.e. the second JSON of each line. So I read each line as a string and substring the part that I need, like so
val lines = sc.textFile(link, 2)
val part = lines.map( x => x.substring(x.lastIndexOf('{')).trim)
I want to get all the elements in 'part' as an Array[String] then turn the Array[String] into one string and make the DataFrame. Like so
val strings = part .collect() //doesn't work
val strings = part.take(1000) //works
val jsonStr = "[".concat(strings.mkString(", ")).concat("]")
The problem is, if I call part.collect(), it doesn't work but if I call part.take(N) it works. However, I'd like to get all my data and not just the first N.
Also, if I try part.take(part.count().toInt) it still doesn't work.
Any Ideas??
EDIT
I realized my problem after a good sleep. It was a silly mistake on my part. The very last line of my input file had a different format from the rest of the file.
So part.take(N) would work for all N less than part.count(). That's why part.collect() wasn't working.
Thanks for the help though!

I need help parsing a file in scala for running a spark job

I'm running a Spark Job in Scala and I'm struck with parsing the input file.
The Input file(TAB separated) is something like,
date=20160701 name=mike age=26
date=20160402 name=john age=33
I want to parse it and extract only values and not the keys, such as,
20160701 mike 26
20160402 john 33
How can this be achieved in SCALA?
I'm using,
SCALA VERSION: 2.11
You can use CSVParser() and you know the location for key, it will be easy and clean
Test data
val data = "date=20160701\tname=mike\tage=26\ndate=20160402 name=john\tage=33\n"
One statement to do what you asked
val rdd = sc.parallelize(data.split('\n'))
.map(_.split('\t') // split into key=value
.map(_.split('=')(1))) // split those at "=" and select only the value
Display what we got
rdd.collect().foreach(r=>println(r.mkString(",")))
// 20160701,mike,26
// 20160402,john,33
But don't do this for real code. It's very fragile in the face of data format errors, etc. Use CSVParser or something instead as Narendra Parmar suggests.
val rdd = sc.textFile()
rdd.map(x => x.split("\t")).map(x => x.split("=")(1)).map(x => x.mkstring("\t")).saveAsTextFile("")

Bypass last line of each file in Spark (Scala)

This question is related to this.
I am processing an S3 folder containing csv.gz files in Spark. Each csv.gz file has a header that contains column names. This has been solved by the above SO link and the solution looks like this:
val rdd = sc.textFile("s3://.../my-s3-path").mapPartitions(_.drop(1))
The problem now is that it looks like some of the files have newline ('\n') at the end (we assume we are not sure which file). So when converting the RDD to DataFrame, I'm getting some error. The question now is:
How do I get rid of the last line of each file if it is '\n'?
Why not a simple filter:
val rdd = sc.textFile("s3...").filter(line => !line.equalsIgnoreCase("\n")).mapPartition...
Or filter any empty line:
val rdd = sc.textFile("s3...").filter(line => !line.trim().isEmpty)...

Spark: RDD.saveAsTextFile when using a pair of (K,Collection[V])

I have a dataset of employees and their leave-records. Every record (of type EmployeeRecord) contains EmpID (of type String) and other fields. I read the records from a file and then transform into PairRDDFunctions:
val empRecords = sc.textFile(args(0))
....
val empsGroupedByEmpID = this.groupRecordsByEmpID(empRecords)
At this point, 'empsGroupedByEmpID' is of type RDD[String,Iterable[EmployeeRecord]]. I transform this into PairRDDFunctions:
val empsAsPairRDD = new PairRDDFunctions[String,Iterable[EmployeeRecord]](empsGroupedByEmpID)
Then, I go for processing the records as per the logic of the application. Finally, I get an RDD of type [Iterable[EmployeeRecord]]
val finalRecords: RDD[Iterable[EmployeeRecord]] = <result of a few computations and transformation>
When I try to write the contents of this RDD to a text file using the available API thus:
finalRecords.saveAsTextFile("./path/to/save")
the I find that in the file every record begins with an ArrayBuffer(...). What I need is a file with one EmployeeRecord in each line. Is that not possible? Am I missing something?
I have spotted the missing API. It is well...flatMap! :-)
By using flatMap with identity, I can get rid of the Iterator and 'unpack' the contents, like so:
finalRecords.flatMap(identity).saveAsTextFile("./path/to/file")
That solves the problem I have been having.
I also have found this post suggesting the same thing. I wish I saw it a bit earlier.