Convert an RDD to a DataFrame in Spark using Scala

I have textRDD: org.apache.spark.rdd.RDD[(String, String)]
I would like to convert it to a DataFrame. The columns correspond to the title and content of each page (row).

Use toDF() and provide the column names if you have them:
val textDF = textRDD.toDF("title", "content")
textDF: org.apache.spark.sql.DataFrame = [title: string, content: string]
or
val textDF = textRDD.toDF()
textDF: org.apache.spark.sql.DataFrame = [_1: string, _2: string]
The shell auto-imports the needed implicits (I am using version 1.5), but in a standalone application you may need import sqlContext.implicits._.
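For reference, a minimal sketch of what that looks like in a standalone Spark 1.x application (the app name and the sample data here are made up for illustration; put the vals inside your main or paste them into the shell):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("rdd-to-df"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // brings toDF() into scope for RDDs of tuples and case classes

val textRDD = sc.parallelize(Seq(("title1", "content1"), ("title2", "content2")))
val textDF = textRDD.toDF("title", "content")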

I usually do this like the following:
Create a case class like this:
case class DataFrameRecord(property1: String, property2: String)
Then you can use map to convert into the new structure using the case class:
textRDD.map(p => DataFrameRecord(p._1, p._2)).toDF()
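Putting the case-class route together, a small sketch reusing the textRDD above (the DataFrameRecord field names are arbitrary):

case class DataFrameRecord(title: String, content: String)
// each (String, String) pair becomes one typed record; toDF() derives the column names from the fields
val recordDF = textRDD.map { case (title, content) => DataFrameRecord(title, content) }.toDF()
// recordDF: org.apache.spark.sql.DataFrame = [title: string, content: string]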

Related

Spark: Create Single Column Complex Type Dataframe

Suppose I have a case class as follows:
final case class Person(name: String, age: Int)
I'd like to create a single-column DataFrame that has a complex StructType of Person, and I want Spark to infer the schema.
val data = Seq(Person("Tom", 30), Person("Anna", 35))
val df = spark.createDataFrame(data)
I want Spark to infer that the DataFrame has a single column with the complex type Person. Currently, it splits Person up into multiple columns.
You can map the data to the desired structure.
A helper class:
case class PersonWrapper(person: Person)
Now there are two options:
Mapping the Scala sequence before creating the Spark DataFrame:
val df = spark.createDataFrame(data.map(PersonWrapper(_)))
or
mapping the Spark dataframe/dataset:
val df = spark.createDataset(data).map(PersonWrapper(_))
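Either way the result should end up as a single struct column; a quick check (the exact nullability flags may differ):

df.printSchema()
// root
//  |-- person: struct (nullable = true)
//  |    |-- name: string (nullable = true)
//  |    |-- age: integer (nullable = false)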
You can either use:
final case class PersonAttributes(name: String, age: Int)
final case class Person(attributes: PersonAttributes)
then:
val data = Seq(
Person(PersonAttributes("Tom", 30)),
Person(PersonAttributes("Anna", 35))
)
Or you can create the Dataset as you are, then use withColumn with struct to create the complex structure you want:
.withColumn("data", struct(col("name"), col("age")))
Good luck!

How do I getAs[Location]("location") from a dataframe row?

I have a case class Location(lat, lon), and I created a DataFrame with df = Seq(Location(1,2), Location(3,4)).toDF. When I try to do this:
df.map(row =>
row.getAs[Location]("location")
)
it fails because there is no encoder for Location. But how am I supposed to convert it into a Dataset of Location?
I tried:
df.map { row =>
  val seq = row.getAs[Seq[Int]]("location")
  Location(seq(0), seq(1))
}
But it doesn't work either.
I am really confused. How do I solve this problem?
If you have case class Location(lat: Int, lon: Int) followed by val df = Seq(Location(1,2), Location(3,4)).toDF, you could convert this DataFrame df into a Dataset. Or change that line to val ds = Seq(Location(1,2), Location(3,4)).toDS, where ds is
ds: org.apache.spark.sql.Dataset[Location] = [lat: int, lon: int]
which is what you said you wanted in one of the comments.
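A small sketch of both routes (assuming spark.implicits._ is in scope, as it is in the shell):

case class Location(lat: Int, lon: Int)

// build the Dataset directly ...
val ds = Seq(Location(1, 2), Location(3, 4)).toDS()
// ds: org.apache.spark.sql.Dataset[Location] = [lat: int, lon: int]

// ... or convert an existing DataFrame; note the columns are lat and lon,
// so there is no single "location" column to getAs
val df = Seq(Location(1, 2), Location(3, 4)).toDF()
val ds2 = df.as[Location]
ds2.map(loc => loc.lat + loc.lon).show()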

Spark Scala: Cannot up cast from string to int as it may truncate

I got this exception while playing with Spark.
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Cannot up cast price from string to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "price")
- root class: "org.spark.code.executable.Main.Record"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
How can this exception be solved? Here is the code:
import java.sql.Timestamp
import org.apache.spark.sql.Encoders

object Main {

  case class Record(transactionDate: Timestamp, product: String, price: Int, paymentType: String, name: String,
                    city: String, state: String, country: String, accountCreated: Timestamp,
                    lastLogin: Timestamp, latitude: String, longitude: String)

  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "C:\\winutils\\")

    val schema = Encoders.product[Record].schema
    val df = SparkConfig.sparkSession.read
      .option("header", "true")
      .csv("SalesJan2009.csv")

    import SparkConfig.sparkSession.implicits._
    val ds = df.as[Record]
    //ds.groupByKey(body => body.state).count().show()

    import org.apache.spark.sql.expressions.scalalang.typed.{
      count => typedCount,
      sum => typedSum
    }

    ds.groupByKey(body => body.state)
      .agg(typedSum[Record](_.price).name("sum(price)"))
      .withColumnRenamed("value", "group")
      .alias("Summary by state")
      .show()
  }
}
You read the CSV file first and then tried to convert it to a Dataset that has a different schema. It's better to pass the schema derived from the case class while reading the CSV file, as below:
import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder()
  .master("local")
  .appName("test")
  .getOrCreate()

val schema = Encoders.product[Record].schema

val ds = spark.read
  .option("header", "true")
  .schema(schema) // pass the schema derived from the case class
  .option("timestampFormat", "MM/dd/yyyy HH:mm") // custom timestamp format used in the file
  .csv(path) // path to the CSV file
  .as[Record] // convert to Dataset[Record]
The default timestampFormat is yyyy-MM-dd'T'HH:mm:ss.SSSXXX, so you also need to pass your custom timestampFormat.
Hope this helps
In my case, the problem was that I was using this:
case class OriginalData(ORDER_ID: Int, USER_ID: Int, ORDER_NUMBER: Int, ORDER_DOW: Int, ORDER_HOUR_OF_DAY: Int, DAYS_SINCE_PRIOR_ORDER: Double, ORDER_DETAIL: String)
However, the CSV file contained values such as "Friday" in the ORDER_DOW column.
Yes, having "Friday" where only integers representing days of the week should appear means that I need to clean the data. However, to be able to read my CSV file at all using spark.read.csv("data/jaimemontoya/01.csv"), I used the following case class, where ORDER_DOW is now a String rather than an Int:
case class OriginalData(ORDER_ID: Int, USER_ID: Int, ORDER_NUMBER: Int, ORDER_DOW: String, ORDER_HOUR_OF_DAY: Int, DAYS_SINCE_PRIOR_ORDER: Double, ORDER_DETAIL: String)
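A rough sketch of the subsequent clean-up, reading ORDER_DOW as a string and mapping the day name to a number before casting (the "Friday" -> 5 mapping is my own illustrative choice):

import org.apache.spark.sql.functions.{col, when}

val raw = spark.read.option("header", "true").csv("data/jaimemontoya/01.csv")
// replace the textual day name with a numeric code, then cast the column to int
val cleaned = raw.withColumn("ORDER_DOW",
  when(col("ORDER_DOW") === "Friday", "5").otherwise(col("ORDER_DOW")).cast("int"))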
Add this option on read:
.option("inferSchema", true)

Extracting `Seq[(String,String,String)]` from spark DataFrame

I have a Spark DataFrame whose rows contain a Seq[(String, String, String)]. I'm trying to do some kind of flatMap with that, but anything I try ends up throwing
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple3
I can take a single row or multiple rows from the DF just fine
df.map{ r => r.getSeq[Feature](1)}.first
returns
Seq[(String, String, String)] = WrappedArray([ancient,jj,o], [olympia_greece,nn,location] .....
and the data type of the RDD seems correct.
org.apache.spark.rdd.RDD[Seq[(String, String, String)]]
The schema of the df is
root
|-- article_id: long (nullable = true)
|-- content_processed: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- lemma: string (nullable = true)
| | |-- pos_tag: string (nullable = true)
| | |-- ne_tag: string (nullable = true)
I know this problem is related to Spark SQL treating the RDD rows as org.apache.spark.sql.Row, even though it idiotically says that it's a Seq[(String, String, String)]. There's a related question (link below), but the answer to that question doesn't work for me. I am also not familiar enough with Spark to figure out how to turn it into a working solution.
Are the rows Row[Seq[(String, String, String)]] or Row[(String, String, String)] or Seq[Row[(String, String, String)]] or something even crazier?
I'm trying to do something like
df.map{ r => r.getSeq[Feature](1)}.map(_(1)._1)
which appears to work, but doesn't actually:
df.map{ r => r.getSeq[Feature](1)}.map(_(1)._1).first
throws the above error. So how am I supposed to (for instance) get the first element of the second tuple on each row?
Also, WHY has Spark been designed to do this? It seems idiotic to claim that something is of one type when in fact it isn't and cannot be converted to the claimed type.
Related question: GenericRowWithSchema exception in casting ArrayBuffer to HashSet in DataFrame to RDD from Hive table
Related bug report: http://search-hadoop.com/m/q3RTt2bvwy19Dxuq1&subj=ClassCastException+when+extracting+and+collecting+DF+array+column+type
Well, it doesn't claim that it is a tuple. It claims it is a struct which maps to Row:
import org.apache.spark.sql.Row
case class Feature(lemma: String, pos_tag: String, ne_tag: String)
case class Record(id: Long, content_processed: Seq[Feature])
val df = Seq(
Record(1L, Seq(
Feature("ancient", "jj", "o"),
Feature("olympia_greece", "nn", "location")
))
).toDF
val content = df.select($"content_processed").rdd.map(_.getSeq[Row](0))
You'll find exact mapping rules in the Spark SQL programming guide.
Since Row is not exactly a pretty structure, you'll probably want to map it to something useful:
content.map(_.map {
case Row(lemma: String, pos_tag: String, ne_tag: String) =>
(lemma, pos_tag, ne_tag)
})
or:
content.map(_.map ( row => (
row.getAs[String]("lemma"),
row.getAs[String]("pos_tag"),
row.getAs[String]("ne_tag")
)))
Finally a slightly more concise approach with Datasets:
df.as[Record].rdd.map(_.content_processed)
or
df.select($"content_processed").as[Seq[(String, String, String)]]
although this seems to be slightly buggy at this moment.
There is an important difference between the first approach (Row.getAs) and the second one (Dataset.as). The former extracts objects as Any and applies asInstanceOf. The latter uses encoders to transform between internal types and the desired representation.
import org.apache.spark.sql.{Row, SparkSession}

object ListSerdeTest extends App {

  implicit val spark: SparkSession = SparkSession
    .builder
    .master("local[2]")
    .getOrCreate()

  import spark.implicits._

  val myDS = spark.createDataset(
    Seq(
      MyCaseClass(mylist = Array(("asd", "aa"), ("dd", "ee")))
    )
  )

  myDS.toDF().printSchema()

  myDS.toDF().foreach(
    row => {
      row.getSeq[Row](row.fieldIndex("mylist"))
        .foreach {
          case Row(a, b) => println(a, b)
        }
    }
  )
}

case class MyCaseClass(
  mylist: Seq[(String, String)]
)
The code above is yet another way to deal with the nested structure. Spark's default Encoder encodes TupleX as a nested struct, which is why you are seeing this strange behaviour. And as others said in the comments, you can't just do getAs[T](), since it is just a cast (x.asInstanceOf[T]) and will therefore give you runtime exceptions.
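If you would rather avoid Row entirely, a brief sketch of the encoder route using the same myDS and imports from the snippet above (it stays inside the typed Dataset API, so the tuples come back as tuples):

// decode straight back into the case class, no Row pattern matching needed
myDS.flatMap(_.mylist).collect().foreach { case (a, b) => println(a, b) }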

How to map a SchemaRDD to a PairRDD

I am trying to figure out how to map a SchemaRDD object that I retrieved from a SQL HiveContext over to a PairRDDFunctions[String, Vector] object, where the String value is the name column in the SchemaRDD and the rest of the columns (BytesIn, BytesOut, etc...) are the vector.
Assuming you have columns: "name", "bytesIn", "bytesOut"
val schemaRDD: SchemaRDD = ...
val pairs: RDD[(String, (Long, Long))] =
  schemaRDD.select("name", "bytesIn", "bytesOut").rdd.map {
    case Row(name: String, bytesIn: Long, bytesOut: Long) =>
      name -> (bytesIn, bytesOut)
  }
// To import PairRDDFunctions via implicits
import SparkContext._
pairs.groupByKey ... etc
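If you really do want the values as a Vector, as stated in the question, here is a sketch that builds on the pairs RDD above using MLlib's dense vectors (whether that is the Vector type you had in mind is an assumption on my part):

import org.apache.spark.mllib.linalg.Vectors

// turn each (bytesIn, bytesOut) pair into a dense vector keyed by name
val nameToVector = pairs.mapValues { case (bytesIn, bytesOut) =>
  Vectors.dense(bytesIn.toDouble, bytesOut.toDouble)
}
nameToVector.groupByKey().mapValues(_.toSeq).take(5).foreach(println)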