schema error while converting Vector collection to dataframe - scala

I have a Vector collection named values which I'm trying to convert to a DataFrame:
scala.collection.immutable.Vector[(String, Double)] = Vector((1,1.0), (2,2.4), (3,3.7), (4,5.0), (5,4.9))
I have defined a custom schema as follows and tried to do the conversion.
val customSchema = new StructType()
.add("A", IntegerType, true)
.add("B", DoubleType, true)
val df = values.toDF.schema(customSchema)
This gives me an error saying,
error: overloaded method value apply with alternatives:
(fieldIndex: Int)org.apache.spark.sql.types.StructField <and>
(names: Set[String])org.apache.spark.sql.types.StructType <and>
(name: String)org.apache.spark.sql.types.StructField
cannot be applied to (org.apache.spark.sql.types.StructType)
I've tried all the methods described here and here, as well as the StructType documentation, to create the schema. However, all of them lead to the same customSchema: org.apache.spark.sql.types.StructType = StructType(StructField(A,IntegerType,true), StructField(B,DoubleType,true))
The toDF method works just fine without a custom schema. However, I want to force a custom schema. Can anyone tell me what I'm doing wrong here?

schema is a property, not a setter; you use schema when you want to get the StructType of a DataFrame or Dataset.
val df = values.toDF
df.schema
//prints
StructType(StructField(_1,IntegerType,false), StructField(_2,DoubleType,false))
To build a DataFrame or Dataset with an explicit schema, you can use spark.createDataFrame or spark.createDataset. These methods are overloaded; the variants that take schema information expect an RDD, JavaRDD or java.util.List of Row together with the schema. You can do the following to convert your Vector into a DataFrame:
val df = spark.createDataFrame(values.toDF.rdd, customSchema)
df.schema
//prints
StructType(StructField(A,IntegerType,true), StructField(B,DoubleType,true))
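Alternatively, a minimal sketch (assuming the pairs really hold integer-like keys, as the printed output above suggests): skip the intermediate toDF and build Rows that match the custom schema yourself.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, IntegerType, DoubleType}

val customSchema = new StructType()
  .add("A", IntegerType, true)
  .add("B", DoubleType, true)
// convert the first element explicitly, since the declared tuple type is String but the schema wants Int
val rows = values.map { case (a, b) => Row(a.toString.toInt, b) }
val df2 = spark.createDataFrame(spark.sparkContext.parallelize(rows), customSchema)
df2.printSchema()
If all you actually need are the column names, values.toDF("A", "B") renames the columns without changing the inferred types.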
I hope it helps!

Related

unable to store row elements of a dataset, via mapPartitions(), in variables

I am trying to create a Spark Dataset and then, using mapPartitions, access each of its elements and store those in variables. I am using the below piece of code for this:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val df = spark.sql("select col1,col2,col3 from table limit 10")
val schema = StructType(Seq(
StructField("col1", StringType),
StructField("col2", StringType),
StructField("col3", StringType)))
val encoder = RowEncoder(schema)
df.mapPartitions { iterator =>
  val myList = iterator.toList
  myList.map { x =>
    val value1 = x.getString(0)
    val value2 = x.getString(1)
    val value3 = x.getString(2)
  }.iterator
}(encoder)
The error I am getting against this code is:
<console>:39: error: type mismatch;
found : org.apache.spark.sql.catalyst.encoders.ExpressionEncoder[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Encoder[Unit]
val value3 = x.getString(2)}).iterator}} (encoder)
Ultimately, I am aiming to store the row elements in variables and perform some operations with them. Not sure what I am missing here. Any help would be highly appreciated!
Actually, there are several problems with your code:
Your map statement has no return value, so its result type is Unit
If you return a tuple of Strings from mapPartitions, you don't need a RowEncoder (because you don't return a Row but a Tuple3, which does not need an explicit encoder because it's a Product)
You can write your code like this:
df
  .mapPartitions { itr => itr.map(x => (x.getString(0), x.getString(1), x.getString(2))) }
  .toDF("col1", "col2", "col3") // convert the Dataset to a DataFrame and get the desired field names
But you could also just use a simple select statement in the DataFrame API; there is no need for mapPartitions here:
df
.select($"col1",$"col2",$"col3")

How to use from_json standard function with custom schema (error: overloaded method value from_json with alternative)?

I am consuming JSON data from an AWS Kinesis stream, but I am getting the following error when I try to use the from_json() standard function:
command-5804948:32: error: overloaded method value from_json with alternatives:
(e: org.apache.spark.sql.Column,schema: org.apache.spark.sql.Column)org.apache.spark.sql.Column <and>
(e: org.apache.spark.sql.Column,schema: org.apache.spark.sql.types.DataType)org.apache.spark.sql.Column <and>
(e: org.apache.spark.sql.Column,schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.Column
cannot be applied to (String, org.apache.spark.sql.types.StructType)
.select(from_json("jsonData", dataSchema).as("devices"))
I have tried both of the following to define my schema:
val dataSchema = new StructType()
.add("ID", StringType)
.add("Type", StringType)
.add("Body", StringType)
.add("Submitted", StringType)
val dataSchema = StructType(Seq(StructField("ID",StringType,true), StructField("Type",StringType,true), StructField("Body",StringType,true), StructField("Submitted",StringType,true)))
Here is my code:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import java.nio.ByteBuffer
import scala.util.Random
val dataSchema = new StructType()
.add("ID", StringType)
.add("Type", StringType)
.add("Body", StringType)
.add("Submitted", StringType)
// val dataSchema = StructType(Seq(StructField("ID",StringType,true), StructField("Type",StringType,true), StructField("Body",StringType,true), StructField("Submitted",StringType,true)))
val kinesisDF = spark.readStream
.format("kinesis")
.option("streamName", "**************************")
.option("region", "********")
.option("initialPosition", "TRIM_HORIZON")
.option("awsAccessKey", "****************")
.option("awsSecretKey", "************************************")
.load()
val schemaDF = kinesisDF
.selectExpr("cast (data as STRING) jsonData")
.select(from_json("jsonData", dataSchema).as("devices"))
.select("devices.*")
.load()
display(schemaDF)
If you do the following:
val str_data = kinesisDF
.selectExpr("cast (data as STRING) jsonData")
display(str_data)
you can see that the stream data looks like:
{"ID":"1266ee3d99bc-96f942a6-434c-6442-a762","Type":"BT","Body":"{\"TN\":\"ND\",\"TD\":\"JSON:{\\"vw\\":\\"CV\\"}\",\"LT\":\"BT\",\"TI\":\"9ff2-4749250dd142-793ffb20-eb8e-47f7\",\"CN\":\"OD\",\"CI\":\"eb\",\"UI\":\"abc004\",\"AN\":\"1234567\",\"TT\":\"2019-09-15T09:48:25.0395209Z\",\"FI\":\"N/A\",\"HI\":\"N/A\",\"SV\":6}","Submitted":"2019-09-15 09:48:26.079"}
{"ID":"c8eb956ee98c-68d668b7-e7a6-9ea2-49a5","Type":"MS","Body":"{\"MT\":\"N/A\",\"EP\":\"N/A\",\"RQ\":\"{\\"IA]\\":false,\\"AN\\":null,\\"ACI\\":\\"1266ee3d99bc-96f942a6-434c-6442-a762\\",\\"CI\\":\\"ebb\\",\\"CG\\":\\"8b8a-4ab17555f2fa-da0c8047-b5a6-4ebe\\",\\"UI\\":\\"def211\\",\\"UR\\":\\"EscC\\",\\"UL\\":\\"SCC\\",\\"TI\\":\\"b9d2-d4f646a15d66-dc519f4a-48c3-4e7b\\",\\"TN\\":null,\\"MN\\":null,\\"CTZ\\":null,\\"PM\\":null,\\"TS\\":null,\\"CI\\":\\"ebc\\",\\"ALDC\\":null}","Submitted":"2019-09-15 09:49:46.901"}
The value for the "Body" key is another JSON/nested JSON that is why I have put it as a StringType in the schema so that gets stored in the column as is.
I get the error shown at the top when I run the above code. How do I fix it?
That part of the error says it all:
cannot be applied to (String, org.apache.spark.sql.types.StructType)
That means that there are three different alternatives of the from_json standard function, and all of them expect a Column object, not a String.
You can simply fix it by using the $ syntax (or the col standard function) as follows:
.select(from_json($"jsonData", dataSchema).as("devices"))
Note the $ before the column name, which (implicitly) turns it into a Column object.
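For completeness, a sketch of the corrected pipeline from the question (using col rather than $; the trailing .load() from the question is dropped, since load belongs to the DataFrameReader, not to a DataFrame):
import org.apache.spark.sql.functions.{col, from_json}

val schemaDF = kinesisDF
  .selectExpr("cast (data as STRING) jsonData")
  .select(from_json(col("jsonData"), dataSchema).as("devices"))
  .select("devices.*")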

Encoder[Row] in Scala Spark

I'm trying to perform a simple map on a Dataset[Row] (DataFrame) in Spark 2.0.0. Something as simple as this
val df: Dataset[Row] = ...
df.map { r: Row => r }
But the compiler is complaining that I'm not providing the implicit Encoder[Row] argument to the map function:
not enough arguments for method map: (implicit evidence$7:
Encoder[Row]).
Everything works fine if I convert to an RDD first, ds.rdd.map { r: Row => r }, but shouldn't there be an easy way to get an Encoder[Row] like there is for tuple types, e.g. Encoders.product[(Int, Double)]?
[Note that my Row is dynamically sized in such a way that it can't easily be converted into a strongly-typed Dataset.]
An Encoder needs to know how to pack the elements inside the Row. So you could write your own Encoder[Row] by using the row's schema (row.schema, a StructType), which describes the elements of your Row at runtime, and building the corresponding field encoders.
Or if you know more about the data that goes into Row, you could use https://github.com/adelbertc/frameless/
SSry to be a "bit" late. Hopefully this helps to someone who is hitting the problem right now. Easiest way to define encoder is deriving the structure from existing DataFrame:
val df = Seq((1, "a"), (2, "b"), (3, "c").toDF("id", "name")
val myEncoder = RowEndocer(df.schema)
Such an approach can be useful when you need to alter existing fields of your original DataFrame.
If you're dealing with a completely new structure, use an explicit definition relying on StructType and StructField (as suggested in @Reactormonk's slightly cryptic response).
Example defining the same encoder:
val myEncoder2 = RowEncoder(StructType(
Seq(StructField("id", IntegerType),
StructField("name", StringType)
)))
Please remember that org.apache.spark.sql._, org.apache.spark.sql.types._ and org.apache.spark.sql.catalyst.encoders.RowEncoder have to be imported.
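Putting it together, a minimal sketch (assuming df is the Dataset[Row] from the question): once an implicit Encoder[Row] is in scope, the original map compiles.
import org.apache.spark.sql.{Encoder, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

implicit val rowEncoder: Encoder[Row] = RowEncoder(df.schema)
val mapped = df.map { r: Row => r } // no more "not enough arguments for method map"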
In your specific case where the mapped function does not change the schema, you can pass in the encoder of the DataFrame itself:
df.map(r => r)(df.encoder)

How to convert a case-class-based RDD into a DataFrame?

The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass), but my DataFrame ends up empty. Here's my Scala code:
// sc is the SparkContext, while sqlContext is the SQLContext.
// Define the case class and raw data
case class Dog(name: String)
val data = Array(
Dog("Rex"),
Dog("Fido")
)
// Create an RDD from the raw data
val dogRDD = sc.parallelize(data)
// Print the RDD for debugging (this works, shows 2 dogs)
dogRDD.collect().foreach(println)
// Create a DataFrame from the RDD
val dogDF = sqlContext.createDataFrame(dogRDD, classOf[Dog])
// Print the DataFrame for debugging (this fails, shows 0 dogs)
dogDF.show()
The output I'm seeing is:
Dog(Rex)
Dog(Fido)
++
||
++
||
||
++
What am I missing?
Thanks!
All you need is just
val dogDF = sqlContext.createDataFrame(dogRDD)
The second parameter is part of the Java API and expects your class to follow the Java Beans convention (getters/setters). Your case class doesn't follow this convention, so no properties are detected, which leads to an empty DataFrame with no columns.
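For illustration, a sketch of what that two-argument overload actually expects (DogBean is a hypothetical bean-style class, not something from the question):
import scala.beans.BeanProperty

// a Java-bean-style class with a getter/setter that createDataFrame can introspect
class DogBean(@BeanProperty var name: String) extends Serializable

val beanRDD = sc.parallelize(Seq(new DogBean("Rex"), new DogBean("Fido")))
val beanDF = sqlContext.createDataFrame(beanRDD, classOf[DogBean])
beanDF.show() // now prints a "name" column with Rex and Fido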
You can create a DataFrame directly from a Seq of case class instances using toDF as follows:
val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF
The case class approach won't work in cluster mode; it'll give a ClassNotFoundException for the case class you defined.
Convert it to an RDD[Row], define the schema of your RDD with StructField, and then call createDataFrame, like:
val rdd = data.map { attrs => Row(attrs(0),attrs(1)) }
val rddStruct = new StructType(Array(StructField("id", StringType, nullable = true),StructField("pos", StringType, nullable = true)))
sqlContext.createDataFrame(rdd,rddStruct)
toDF() won't work either.

toDF() not handling RDD

I have an RDD of Rows called RowRDD. I am simply trying to convert it into a DataFrame. From the examples I have seen on the internet in various places, I see that I should be trying RowRDD.toDF(). I am getting the error:
value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
It doesn't work because Row is not a Product type, and createDataFrame with a single RDD argument is defined only for RDD[A] where A <: Product.
If you want to use RDD[Row] you have to provide a schema as the second argument. If you think about it, this should be obvious: Row is just a container of Any values, and as such it doesn't provide enough information for schema inference.
Assuming this is the same RDD as defined in your previous question, the schema is easy to generate:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

val rowRdd: RDD[Row] = ???
val schema = StructType(
  (1 to rowRdd.first.size).map(i => StructField(s"_$i", StringType, false))
)
val df = sqlContext.createDataFrame(rowRdd, schema)
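Alternatively, a minimal sketch (assuming two String columns, matching the generated schema above): map each Row to a tuple, which is a Product, and then toDF works directly.
import sqlContext.implicits._

val tupleRdd = rowRdd.map(r => (r.getString(0), r.getString(1)))
val df2 = tupleRdd.toDF("_1", "_2")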