How read a csv with quotes using sparkcontext - scala

I've started recently to use scala spark, in particular I'm trying to use GraphX in order to make a graph from a csv. To read a csv file with spark context I always do this:
val rdd = sc.textFile("file/path")
.map(line => line.split(","))
In this way I obtain an RDD of objects Array[String].
My problem is that the csv file contains strings delimited by quotes ("") and number without quotes, an example of some lines inside the file is the following:
"Luke",32,"Rome"
"Mary",43,"London"
"Mario",33,"Berlin"
If I use the method split(",") I obtain String objects that inside contain quotes, for instance the string Luke is saved as "Luke" and not as Luke.
How can I do to not consider quotes and make the correct string objects?
I hope I was clear to explain my problem

you can let the Spark DataFrame level CSV parser resolve that for you
val rdd=spark.read.csv("file/path").rdd.map(_.mkString(",")).map(_.split(","))
by the way, you can transform the Row directly to VertexId, (String,String) in the first map based on the Row fields

Try with below example.
import org.apache.spark.sql.SparkSession
object DataFrameFromCSVFile {
def main(args:Array[String]):Unit= {
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExample")
.getOrCreate()
val filePath="C://zipcodes.csv"
//Chaining multiple options
val df2 = spark.read.options(Map("inferSchema"->"true","sep"->",","header"->"true")).csv(filePath)
df2.show(false)
df2.printSchema()
}
}

Related

Spark scala read multiple files from S3 using Seq(paths)

I have a scala program that reads json files into a DataFrame using DataFrameReader, using a file pattern like "s3n://bucket/filepath/*.json" to specify files. Now I need to read both ".json" and ".json.gz" (gzip) files into the dataframe.
Since current approach uses a wildcard, like this:
session.read().json("s3n://bucket/filepath/*.json")
I want to read both json and json-gzip files, but I have not found documentation for the wildcard pattern expression. I was tempted to compose a more complex wildcard, but the lack of wildcard documentation motivated me to consider another approach.
Reading the documentation for Spark, it says that the DataFrameReader has these relevant methods,
json(path: String): DataFrame
json(paths: String*): DataFrame
Which would produce code more like this:
// spark.isInstanceOf[SparkSession]
// val reader: DataFrameReader = spark.read
val df: DataFrame = spark.read.json(path: String)
// or
val df: DataFrame = spark.read.json(paths: String*)
I need to read json and json-gzip files, but I may need to read other filename formats. The second method (above) accepts a Scala Seq(uence), which means I could provide a Seq(uence), which I could later add other filename wildcards.
// session.isInstanceOf[SparkSession]
val s3json: String = "s3n://bucket/filepath/*.json"
val s3gzip: String = "s3n://bucket/filepath/*.json.gz"
val paths: Seq[String] = Seq(s3json, s3gzip)
val df: DataFrame = session.read().json(paths)
Please comment on this approach, and is this idionatic?
I have also seen examples of the last line with the splat operator ("_") added to the paths sequence. Is that needed? Can you explain what the ": _" part does?
val df: DataFrame = session.read().json(paths: _*)
Example of the splat operator use are here:
How to read multiple directories in s3 in spark Scala?
How to pass a list of paths to spark.read.load?
Adding further to blackbishop's answer, you can use val df = spark.read.json(paths: _*) for reading files from entirely independent buckets/folders.
val paths = Seq("s3n://bucket1/filepath1/","s3n://bucket2/filepath/2")
val df = spark.read.json(paths: _*)
The _* converts a Seq to variable arguments needed by path function.
You can use brace expansions in your path to include the 2 extensions:
val df = spark.read.json("s3n://bucket/filepath/{*.json,*.json.gz}")
If your bucket contains only .json and .json.gz files, you can actually read all the files:
val df = spark.read.json("s3n://bucket/filepath/")

Best practice to define implicit/explicit encoding in dataframe column value extraction without RDD

I am trying to get column data in a collection without RDD map api (doing the pure dataframe way)
object CommonObject{
def doSomething(...){
.......
val releaseDate = tableDF.where(tableDF("item") <=> "releaseDate").select("value").map(r => r.getString(0)).collect.toList.head
}
}
this is all good except Spark 2.3 suggests
No implicits found for parameter evidence$6: Encoder[String]
between map and collect
map(r => r.getString(0))(...).collect
I understand to add
import spark.implicits._
before the process however it requires a spark session instance
it's pretty annoying especially when there is no spark session instance in a method. As a Spark newbie how to nicely resolve the implicit encoding parameter in the context?
You can always add a call to SparkSession.builder.getOrCreate() inside your method. Spark will find the already existing SparkSession and won't create a new one, so there is no performance impact. Then you can import explicits which will work for all case classes. This is easiest way to add encoding. Alternatively an explicit encoder can be added using Encoders class.
val spark = SparkSession.builder
.appName("name")
.master("local[2]")
.getOrCreate()
import spark.implicits._
The other way is to get SparkSession from the dataframe dataframe.sparkSession
def dummy (df : DataFrame) = {
val spark = df.sparkSession
import spark.implicits._
}

How to remove all records in a RDD including null?

I loaded an RDD from a csv file. However, this file includes invalid data. So, when I tried to output the contact of this RDD with first. The exception is
Caused by: java.lang.NumberFormatException: empty String
I hope to find solution to remove all records in the RDD when one record includes empty string. In addition, this RDD includes so many fields, so it is difficult to handle every field one by one. I remembers that DataFrame has such function, such as na.drop(). I need that this kind of function will work for RDD.
The code I used is like:
//using case class
case class Flight(dest_id:Long, dest:String, crsdeptime:Double, deptime:Double, depdelaymins:Double, crsarrtime:Double)
//defining function
def parseFlight(str: String): Flight = {
val line = str.split(",")
Flight(line(0), line(1), line(2), line(3), line(4).toInt, line(5).toLong)
}
//loading data
val textRDD = sc.textFile("/root/data/data.csv")
val flightsRDD = textRDD.map(parseFlight)
update
When I used RDD converted by DateFrame. I found every line of RDD is Row object. How to extract some fields of one Row to build Edge object?
If the header in the csv file matches the variable names in the case class, then it's easier to read the data as a dataframe and then use na.drop().
val flightsDf = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv("/root/data/data.csv")
.na.drop()
.as[Flight]
If you want a rdd, it is always possible to convert it afterwards with flightsDf.rdd.

Spark: Write each record in RDD to individual files in HDFS directory

I have a requirement where I want to write each individual records in an RDD to an individual file in HDFS.
I did it for the normal filesystem but obviously, it doesn't work for HDFS.
stream.foreachRDD{ rdd =>
if(!rdd.isEmpty()) {
rdd.foreach{
msg =>
val value = msg._2
println(value)
val fname = java.util.UUID.randomUUID.toString
val path = dir + fname
write(path, value)
}
}
}
where write is a function which writes to the filesystem.
Is there a way to do it within spark so that for each record I can natively write to the HDFS, without using any other tool like Kafka Connect or Flume??
EDIT: More Explanation
For eg:
If my DstreamRDD has the following records,
abcd
efgh
ijkl
mnop
I need different files for each record, so different file for "abcd", different for "efgh" and so on.
I tried creating an RDD within the streamRDD but I learnt it's not allowed as the RDD's are not serializable.
You can forcefully repartition the rdd to no. of partitions as many no. of records and then save
val rddCount = rdd.count()
rdd.repartition(rddCount).saveAsTextFile("your/hdfs/loc")
You can do in couple of ways..
From rdd, you can get the sparkCOntext, once you got the sparkCOntext, you can use parallelize method and pass the String as List of String.
For example:
val sc = rdd.sparkContext
sc.parallelize(Seq("some string")).saveAsTextFile(path)
Also, you can use sqlContext to convert the string to DF then write in the file.
for Example:
import sqlContext.implicits._
Seq(("some string")).toDF

skip header of csv while reading multiple files into rdd in scala

I am trying to read multiple csvs into an rdd from a path. This path has many csvs Is there a way I can avoid the headers while reading all the csvs into rdd? or use spotsRDD to omit out the header without having to use filter or deal with each csv individually and then union them?
val path ="file:///home/work/csvs/*"
val spotsRDD= sc.textFile(path)
println(spotsRDD.count())
Thanks
That is pity you are using spark 1.0.0.
You can use CSV Data Source for Apache Spark but this library requires Spark 1.3+ and btw. this library was inlined to Spark 2.x.
But we can analyse and implement something similar.
When we look into the com/databricks/spark/csv/DefaultSource.scala there is
val useHeader = parameters.getOrElse("header", "false")
and then in the com/databricks/spark/csv/CsvRelation.scala there is
// If header is set, make sure firstLine is materialized before sending to executors.
val filterLine = if (useHeader) firstLine else null
baseRDD().mapPartitions { iter =>
// When using header, any input line that equals firstLine is assumed to be header
val csvIter = if (useHeader) {
iter.filter(_ != filterLine)
} else {
iter
}
parseCSV(csvIter, csvFormat)
so if we assume the first line is only once in RDD (our csv rows) we can do something like in the example below:
CSV example file:
Latitude,Longitude,Name
48.1,0.25,"First point"
49.2,1.1,"Second point"
47.5,0.75,"Third point"
scala> val csvData = sc.textFile("test.csv")
csvData: org.apache.spark.rdd.RDD[String] = test.csv MapPartitionsRDD[24] at textFile at <console>:24
scala> val header = csvDataRdd.first
header: String = Latitude,Longitude,Name
scala> val csvDataWithoutHeaderRdd = csvDataRdd.mapPartitions{iter => iter.filter(_ != header)}
csvDataWithoutHeaderRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[25] at mapPartitions at <console>:28
scala> csvDataWithoutHeaderRdd.foreach(println)
49.2,1.1,"Second point"
48.1,0.25,"First point"
47.5,0.75,"Third point"