How to parse a csv string into a Spark dataframe using scala? - scala

I would like to convert a RDD containing records of strings, like below, to a Spark dataframe.
"Mike,2222-003330,NY,34"
"Kate,3333-544444,LA,32"
"Abby,4444-234324,MA,56"
....
The schema line is not inside the same RDD, but in a another variable:
val header = "name,account,state,age"
So now my question is, how do I use the above two, to create a dataframe in Spark? I am using Spark version 2.2.
I did search and saw a post:
Can I read a CSV represented as a string into Apache Spark using spark-csv
.
However it's not exactly what I need and I can't figure out a way to modify this piece of code to work in my case.
Your help is greatly appreciated.

The easier way would probably be to start from the CSV file and read it directly as a dataframe (by specifying the schema). You can see an example here: Provide schema while reading csv file as a dataframe.
When the data already exists in an RDD you can use toDF() to convert to a dataframe. This function also accepts column names as input. To use this functionality, first import the spark implicits using the SparkSession object:
val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._
Since the RDD contains strings it needs to first be converted to tuples representing the columns in the dataframe. In this case, this will be a RDD[(String, String, String, Int)] since there are four columns (the last age column is changed to int to illustrate how it can be done).
Assuming the input data are in rdd:
val header = "name,account,state,age"
val df = rdd.map(row => row.split(","))
.map{ case Array(name, account, state, age) => (name, account, state, age.toInt)}
.toDF(header.split(","):_*)
Resulting dataframe:
+----+-----------+-----+---+
|name| account|state|age|
+----+-----------+-----+---+
|Mike|2222-003330| NY| 34|
|Kate|3333-544444| LA| 32|
|Abby|4444-234324| MA| 56|
+----+-----------+-----+---+

Related

Create an empty DF using schema from another DF (Scala Spark)

I have to compare a DF with another one that is the same schema readed from a specific path, but maybe in that path there are not files so I've thought that I have to compare it with a null DF with the same columns as the original.
So I am trying to create a DF with the schema from another DF that contains a lot of columns but I can't find a solution for this. I have been reading the following posts but no one helps me:
How to create an empty DataFrame with a specified schema?
How to create an empty DataFrame? Why "ValueError: RDD is empty"?
How to create an empty dataFrame in Spark
How can I do it in scala? Or is better take other option?
originalDF.limit(0) will return an empty dataframe with the same schema.

How to Create data frame schema from a header file

I have 2 data files:
1 file is header file and other is data file.
Header file is having 2 columns (Id,Tags):header.txt
Id,Tags
Now I am trying to create a dataFrame Schema Out of the header file:(I have to use this approach as in real time ,there are 1000 of columns in header.txt and data.txt. So,manually creating case class with 1000 column is not possible.
val dataFile=sparkSession.read.format("text").load("data.txt")
val headerFile=sparkSession.sparkContext.textFile("header.txt")
val fields=
headerFile.flatMap(x=>x.split(",")).map(fieldName=>StructField(fieldName,StringType,true))
val schema=StructType(fields)
But above line is failing with Cannot resolve overloaded method StructType.
Can some one please help
StructType needs an array of StructField and you are using fields variable which is an RDD[String] so collect the rdd to create StructType.
val fields= headerFile.flatMap(x=>x.split(","))
.map(fieldName=>StructField(fieldName,StringType,true))
val schema=StructType(fields.collect)

Difference between sparksession text and textfile methods? [duplicate]

This question already has an answer here:
Difference between sc.textFile and spark.read.text in Spark
(1 answer)
Closed 3 years ago.
I am working with Spark scala shell and trying to create dataframe and datasets from a text file.
For getting datasets from a text file, there are two options, text and textFile methods as follows:
scala> spark.read.
csv format jdbc json load option options orc parquet schema table text textFile
Here is how i am gettting datasets and dataframe from both these methods:
scala> val df = spark.read.text("/Users/karanverma/Documents/logs1.txt")
df: org.apache.spark.sql.DataFrame = [value: string]
scala> val df = spark.read.textFile("/Users/karanverma/Documents/logs1.txt")
df: org.apache.spark.sql.Dataset[String] = [value: string]
So my question is what is the difference between the two methods for text file?
When to use which methods?
As I've noticed that they are almost having the same functionality,
It just that spark.read.text transform data to Dataset which is a distributed collection of data, while spark.read.textFile transform data to Dataset[Type] which is consist of Dataset organized into named columns.
Hope it helps.

How do I use a from_json() dataframe in Spark?

I'm trying to create a dataset from a json-string within a dataframe in Databricks 3.5 (Spark 2.2.1). In the code block below 'jsonSchema' is a StructType with the correct layout for the json-string which is in the 'body' column of the dataframe.
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema))
This returns a dataframe where the root object is
jsontostructs(CAST(body AS STRING)):struct
followed by the fields in the schema (looks correct). When I try another select on the newDF
val transform = newDF.select($"propertyNameInTheParsedJsonObject")
it throws the exception
org.apache.spark.sql.AnalysisException: cannot resolve '`columnName`' given
input columns: [jsontostructs(CAST(body AS STRING))];;
I'm aparently missing something. I hoped from_json would return a dataframe I could manipulate further.
My ultimate objective is to cast the json-string within the oldDF body-column to a dataset.
from_json returns a struct or (array<struct<...>>) column. It means it is a nested object. If you've provided a meaningful name:
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema) as "parsed")
and the schema describes a plain struct you could use standard methods like
newDF.select($"parsed.propertyNameInTheParsedJsonObject")
otherwise please follow the instructions for accessing arrays.

Spark's toDS vs to DF

I understand that one can convert an RDD to a Dataset using rdd.toDS. However there also exists rdd.toDF. Is there really any benefit of one over the other?
After playing with the Dataset API for a day, I find out that almost any operation takes me out to a DataFrame (for instance withColumn). After converting an RDD with toDS, I often find out that another conversion to a DataSet is needed, because something brought me to a DataFrame again.
Am I using the API wrongly? Should I stick with .toDF and only convert to a DataSet in the end of a chain of operations? Or is there a benefit to using toDS earlier?
Here is a small concrete example
spark
.read
.schema (...)
.json (...)
.rdd
.zipWithUniqueId
.map[(Integer,String,Double)] { case (row,id) => ... }
.toDS // now with a Dataset API (should use toDF here?)
.withColumnRenamed ("_1", "id" ) // now back to a DataFrame, not type safe :(
.withColumnRenamed ("_2", "text")
.withColumnRenamed ("_2", "overall")
.as[ParsedReview] // back to a Dataset
Michael Armburst nicely explained that shift to dataset and dataframe and the difference between the two. Basically in spark 2.x they converged dataset and dataframe API into one with slight difference.
"DataFrame is just DataSet of generic row objects. When you don't know all the fields, DF is the answer".