How to use from_json standard function with custom schema (error: overloaded method value from_json with alternative)? - scala

I am consuming JSON data from AWS Kinesis stream, but I am getting the following error when I try to use the from_json() standard function:
command-5804948:32: error: overloaded method value from_json with alternatives:
(e: org.apache.spark.sql.Column,schema: org.apache.spark.sql.Column)org.apache.spark.sql.Column <and>
(e: org.apache.spark.sql.Column,schema: org.apache.spark.sql.types.DataType)org.apache.spark.sql.Column <and>
(e: org.apache.spark.sql.Column,schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.Column
cannot be applied to (String, org.apache.spark.sql.types.StructType)
.select(from_json("jsonData", dataSchema).as("devices"))
I have tried both of the below to define my schema:
val dataSchema = new StructType()
.add("ID", StringType)
.add("Type", StringType)
.add("Body", StringType)
.add("Submitted", StringType)
val dataSchema = StructType(Seq(StructField("ID",StringType,true), StructField("Type",StringType,true), StructField("Body",StringType,true), StructField("Submitted",StringType,true)))
Here is my code:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import java.nio.ByteBuffer
import scala.util.Random
val dataSchema = new StructType()
.add("ID", StringType)
.add("Type", StringType)
.add("Body", StringType)
.add("Submitted", StringType)
// val dataSchema = StructType(Seq(StructField("ID",StringType,true), StructField("Type",StringType,true), StructField("Body",StringType,true), StructField("Submitted",StringType,true)))
val kinesisDF = spark.readStream
.format("kinesis")
.option("streamName", "**************************")
.option("region", "********")
.option("initialPosition", "TRIM_HORIZON")
.option("awsAccessKey", "****************")
.option("awsSecretKey", "************************************")
.load()
val schemaDF = kinesisDF
.selectExpr("cast (data as STRING) jsonData")
.select(from_json("jsonData", dataSchema).as("devices"))
.select("devices.*")
.load()
display(schemaDF)
If you do the following:
val str_data = kinesisDF
.selectExpr("cast (data as STRING) jsonData")
display(str_data)
you can see that the stream data looks like:
{"ID":"1266ee3d99bc-96f942a6-434c-6442-a762","Type":"BT","Body":"{\"TN\":\"ND\",\"TD\":\"JSON:{\\"vw\\":\\"CV\\"}\",\"LT\":\"BT\",\"TI\":\"9ff2-4749250dd142-793ffb20-eb8e-47f7\",\"CN\":\"OD\",\"CI\":\"eb\",\"UI\":\"abc004\",\"AN\":\"1234567\",\"TT\":\"2019-09-15T09:48:25.0395209Z\",\"FI\":\"N/A\",\"HI\":\"N/A\",\"SV\":6}","Submitted":"2019-09-15 09:48:26.079"}
{"ID":"c8eb956ee98c-68d668b7-e7a6-9ea2-49a5","Type":"MS","Body":"{\"MT\":\"N/A\",\"EP\":\"N/A\",\"RQ\":\"{\\"IA]\\":false,\\"AN\\":null,\\"ACI\\":\\"1266ee3d99bc-96f942a6-434c-6442-a762\\",\\"CI\\":\\"ebb\\",\\"CG\\":\\"8b8a-4ab17555f2fa-da0c8047-b5a6-4ebe\\",\\"UI\\":\\"def211\\",\\"UR\\":\\"EscC\\",\\"UL\\":\\"SCC\\",\\"TI\\":\\"b9d2-d4f646a15d66-dc519f4a-48c3-4e7b\\",\\"TN\\":null,\\"MN\\":null,\\"CTZ\\":null,\\"PM\\":null,\\"TS\\":null,\\"CI\\":\\"ebc\\",\\"ALDC\\":null}","Submitted":"2019-09-15 09:49:46.901"}
The value for the "Body" key is another JSON/nested JSON that is why I have put it as a StringType in the schema so that gets stored in the column as is.
I get the error shown at the top of this question when I run the above code.
How to fix it?

That part of the error says it all:
cannot be applied to (String, org.apache.spark.sql.types.StructType)
There are three overloads of the from_json standard function, and all of them expect a Column as the first argument, not a String.
You can fix it by using the $ syntax (which needs import spark.implicits._ in scope) or the col standard function, as follows:
.select(from_json($"jsonData", dataSchema).as("devices"))
Note the $ before the column name that turns it (implicitly) into a Column object.
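To make the difference concrete, here is a minimal sketch of both variants on a static DataFrame instead of the Kinesis stream. The local SparkSession, the sample record and the names jsonDF, viaDollar and viaCol are assumptions for illustration only; in a notebook the spark value already exists.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Local session for illustration only; a Databricks notebook already provides `spark`.
val spark = SparkSession.builder().master("local[1]").appName("from-json-sketch").getOrCreate()
import spark.implicits._ // enables the $"..." column syntax and toDF on a Seq

val dataSchema = new StructType()
  .add("ID", StringType)
  .add("Type", StringType)
  .add("Body", StringType)
  .add("Submitted", StringType)

// Made-up record standing in for one message from the stream.
val sample = Seq("""{"ID":"a1","Type":"BT","Body":"{\"TN\":\"ND\"}","Submitted":"2019-09-15 09:48:26.079"}""")
val jsonDF = sample.toDF("jsonData")

// Both calls pass a Column, so the (Column, StructType) overload is selected.
val viaDollar = jsonDF.select(from_json($"jsonData", dataSchema).as("devices")).select("devices.*")
val viaCol = jsonDF.select(from_json(col("jsonData"), dataSchema).as("devices")).select("devices.*")
```

Because Body is declared as StringType, the nested JSON should stay in that column as a plain string, which matches what the question wants.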


Spark use dbutils.fs.ls().toDF in .jar file

I'm trying to package my jar based off of code in a databricks notebook.
I have the following line that works in databricks but is throwing an error in the scala code:
import com.databricks.dbutils_v1.DBUtilsHolder.dbutils
val spark = SparkSession
.builder()
.appName("myApp")
.master("local")
.enableHiveSupport()
.getOrCreate()
val sc = SparkContext.getOrCreate()
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import spark.implicits._
import sqlContext.implicits._
...
var file_details = dbutils.fs.ls(folder_path2).toDF()
Which gives the error:
error: value toDF is not a member of Seq[com.databricks.backend.daemon.dbutils.FileInfo]
Does anyone know how to use dbutils.fs.ls().toDF() in a Scala .jar?
Edit: I found a similar question for pyspark that I'm trying to translate to Scala:
val dbutils = com.databricks.service.DBUtils
val ddlSchema = new ArrayType(
new StructType()
.add("path",StringType)
.add("name",StringType)
.add("size",IntegerType)
, true)
var folder_path = "abfss://container#storage.dfs.core.windows.net"
var file_details = dbutils.fs.ls(folder_path)
var df = spark.createDataFrame(sc.parallelize(file_details),ddlSchema)
but I'm getting this error:
error: overloaded method value createDataFrame with alternatives:
(data: java.util.List[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rows: java.util.List[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.rdd.RDD[com.databricks.service.FileInfo], org.apache.spark.sql.types.ArrayType)
var df = spark.createDataFrame(sc.parallelize(file_details),ddlSchema)
Ok I got it!!! Here is the code I used:
var file_details = dbutils.fs.ls(folder_path)
var fileData = file_details.map(x => (x.path, x.name, x.size.toString))
var rdd = sc.parallelize(fileData)
val rowRDD = rdd.map(attributes => Row(attributes._1, attributes._2, attributes._3.toInt))
val schema = StructType( Array(
StructField("path", StringType,true),
StructField("name", StringType,true),
StructField("size", IntegerType,true)
))
var fileDf = spark.createDataFrame(rowRDD, schema)
To trigger the implicit conversion to a Dataset-like container and get toDF(), you also need an implicit Spark Encoder in scope (in addition to the spark.implicits._ import you already have).
I think this auto-derivation will work and will make toDF() available:
implicit val encoder: org.apache.spark.sql.Encoder[com.databricks.backend.daemon.dbutils.FileInfo] = org.apache.spark.sql.Encoders.product[com.databricks.backend.daemon.dbutils.FileInfo]
Otherwise yeah you can work directly with RDDs.
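Another option that avoids the encoder question entirely is to map the listing to plain tuples before calling toDF. This is only a sketch: FileEntry is a hypothetical stand-in for the dbutils FileInfo class, and the paths and sizes are made up.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for the dbutils FileInfo class, for illustration only.
case class FileEntry(path: String, name: String, size: Long)

val spark = SparkSession.builder().master("local[1]").appName("todf-sketch").getOrCreate()
import spark.implicits._ // brings toDF and the tuple encoders into scope

// Made-up listing standing in for the result of dbutils.fs.ls(folder_path).
val listing = Seq(
  FileEntry("/mnt/data/a.csv", "a.csv", 120L),
  FileEntry("/mnt/data/b.csv", "b.csv", 45L)
)

// Tuples of Strings and Longs already have implicit encoders, so toDF is available.
val fileDf = listing.map(f => (f.path, f.name, f.size)).toDF("path", "name", "size")
```

With the real listing, the same map over dbutils.fs.ls(folder_path) should work as long as the fields you pick are simple types.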

unable to store row elements of a dataset, via mapPartitions(), in variables

I am trying to create a Spark Dataset, and then using mapPartitions, trying to access each of its elements and store those in variables. Using below piece of code for the same:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val df = spark.sql("select col1,col2,col3 from table limit 10")
val schema = StructType(Seq(
StructField("col1", StringType),
StructField("col2", StringType),
StructField("col3", StringType)))
val encoder = RowEncoder(schema)
df.mapPartitions{iterator => { val myList = iterator.toList
myList.map(x=> { val value1 = x.getString(0)
val value2 = x.getString(1)
val value3 = x.getString(2)}).iterator}} (encoder)
The error I am getting against this code is:
<console>:39: error: type mismatch;
found : org.apache.spark.sql.catalyst.encoders.ExpressionEncoder[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Encoder[Unit]
val value3 = x.getString(2)}).iterator}} (encoder)
Eventually, I am targeting to store the row elements in variables, and perform some operation with these. Not sure what am I missing here. Any help towards this would be highly appreciated!
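Actually, there are several problems with your code:
Your map function has no return value (its last statement is a val definition), so its result type is Unit. That is why the compiler asks for an Encoder[Unit].
If you return a tuple of Strings from mapPartitions, you don't need a RowEncoder: you are returning a Tuple3, not a Row, and because a Tuple3 is a Product its encoder is provided implicitly by import spark.implicits._.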
You can write your code like this:
df
.mapPartitions{itr => itr.map(x=> (x.getString(0),x.getString(1),x.getString(2)))}
.toDF("col1","col2","col3") // Convert Dataset to Dataframe, get desired field names
But you could just use a simple select in the DataFrame API; there is no need for mapPartitions here:
df
.select($"col1",$"col2",$"col3")
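If you really do want the row elements in local variables before working with them, a sketch along these lines should do it. The sample data, the local SparkSession and the name joined are assumptions for illustration; the point is that the last expression inside map is what gets returned for each row.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("mappartitions-sketch").getOrCreate()
import spark.implicits._ // supplies the tuple encoder that mapPartitions needs

// Made-up rows standing in for the result of the SQL query.
val df = Seq(("a", "b", "c"), ("d", "e", "f")).toDF("col1", "col2", "col3")

val joined = df.mapPartitions { rows =>
  rows.map { r =>
    val value1 = r.getString(0)
    val value2 = r.getString(1)
    val value3 = r.getString(2)
    // The last expression is the value returned for this row.
    (value1, s"$value2-$value3")
  }
}.toDF("col1", "col2_col3")
```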

schema error while converting Vector collection to dataframe

I have a vector collection named values which I'm trying to convert to a dataframe
scala.collection.immutable.Vector[(String, Double)] = Vector((1,1.0), (2,2.4), (3,3.7), (4,5.0), (5,4.9))
I have defined a custom schema as follows and tried to do the conversion.
val customSchema = new StructType()
.add("A", IntegerType, true)
.add("B", DoubleType, true)
val df = values.toDF.schema(customSchema)
This gives me an error saying,
error: overloaded method value apply with alternatives:
(fieldIndex: Int)org.apache.spark.sql.types.StructField <and>
(names: Set[String])org.apache.spark.sql.types.StructType <and>
(name: String)org.apache.spark.sql.types.StructField
cannot be applied to (org.apache.spark.sql.types.StructType)
I've tried all the methods described here and here as well as the StructType documentation to create the schema. However all methods lead to the same custom schema, customSchema: org.apache.spark.sql.types.StructType = StructType(StructField(A,IntegerType,true), StructField(B,DoubleType,true))
toDF method works just fine without a custom schema. However I want to force a custom schema. Can anyone tell me what I'm doing wrong here?
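schema is a property, not a setter. values.toDF.schema(customSchema) first reads the StructType of the DataFrame and then calls its apply method with your schema as the argument, which is the overload error you see. Use schema only when you want to get the StructType of a DataFrame or Dataset.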
val df = values.toDF
df.schema
//prints
StructType(StructField(_1,IntegerType,false), StructField(_2,DoubleType,false))
To convert a vector to a DataFrame or Dataset, you can use spark.createDataFrame or spark.createDataset. These methods are overloaded; the variants that accept a schema expect an RDD, JavaRDD or java.util.List of Row together with a StructType. You can do the following to convert your Vector into a DataFrame:
val df = spark.createDataFrame(values.toDF.rdd, customSchema)
df.schema
//prints
StructType(StructField(A,IntegerType,true), StructField(B,DoubleType,true))
I hope it helps!
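A variant that skips the intermediate toDF is to build the Rows by hand and pair them with the schema. This is a sketch that assumes the vector holds (Int, Double) pairs, as the inferred schema above suggests; the local SparkSession and the name rowRdd are for illustration only.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructType}

val spark = SparkSession.builder().master("local[1]").appName("schema-sketch").getOrCreate()

// Assumed element type: (Int, Double).
val values = Vector((1, 1.0), (2, 2.4), (3, 3.7))

val customSchema = new StructType()
  .add("A", IntegerType, true)
  .add("B", DoubleType, true)

// Build the Rows by hand, then pair them with the custom schema.
val rowRdd = spark.sparkContext.parallelize(values.map { case (a, b) => Row(a, b) })
val df = spark.createDataFrame(rowRdd, customSchema)
```

The value types inside each Row have to match the schema, otherwise the job fails when the data is actually evaluated.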

Not able to create parquet files in hdfs using spark shell

I want to create a parquet file in hdfs and then read it through hive as an external table. I'm stuck with stage failures in spark-shell while writing parquet files.
Spark Version: 1.5.2
Scala Version: 2.10.4
Java: 1.7
Input file:(employee.txt)
1201,satish,25
1202,krishna,28
1203,amith,39
1204,javed,23
1205,prudvi,23
In Spark-Shell:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val employee = sc.textFile("employee.txt")
employee.first()
val schemaString = "id name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType, StructField, StringType};
val schema = StructType(schemaString.split(" ").map(fieldName ⇒ StructField(fieldName, StringType, true)))
val rowRDD = employee.map(_.split(",")).map(e ⇒ Row(e(0).trim.toInt, e(1), e(2).trim.toInt))
val employeeDF = sqlContext.createDataFrame(rowRDD, schema)
val finalDF = employeeDF.toDF();
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
var WriteParquet= finalDF.write.parquet("/user/myname/schemaParquet")
When I type the last command I get,
ERROR
SPARK APPLICATION MANAGER
I even tried increasing the executor memory, but it's still failing.
Also, importantly, finalDF.show() produces the same error.
So, I believe I have made a logical error here.
Thanks for supporting
The issue here is that your schema declares every column as StringType, but the values placed in each Row for id and age are converted to Int by the code (e(0).trim.toInt and e(2).trim.toInt). That mismatch throws a scala.MatchError at run time.
The data types of columns in the schema should match the data type of values being passed to it. Try the below code.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val employee = sc.textFile("employee.txt")
employee.first()
//val schemaString = "id name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types._;
val schema = StructType(StructField("id", IntegerType, true) :: StructField("name", StringType, true) :: StructField("age", IntegerType, true) :: Nil)
val rowRDD = employee.map(_.split(",")).map(e ⇒ Row(e(0).trim.toInt, e(1), e(2).trim.toInt))
val employeeDF = sqlContext.createDataFrame(rowRDD, schema)
val finalDF = employeeDF.toDF();
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
var WriteParquet= finalDF.write.parquet("/user/myname/schemaParquet")
This code should run fine.

toDF() not handling RDD

I have an RDD of Rows called RowRDD. I am simply trying to convert it into a DataFrame. From the examples I have seen in various places on the internet, I should be calling RowRDD.toDF(), but I am getting the error:
value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
It doesn't work because Row is not a Product type, and createDataFrame with a single RDD argument is defined only for RDD[A] where A <: Product.
If you want to use RDD[Row], you have to provide a schema as the second argument. If you think about it, it should be obvious: Row is just a container of Any, and as such it doesn't provide enough information for schema inference.
Assuming this is the same RDD as defined in your previous question then schema is easy to generate:
import org.apache.spark.sql.types._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
val rowRdd: RDD[Row] = ???
val schema = StructType(
(1 to rowRdd.first.size).map(i => StructField(s"_$i", StringType, false))
)
val df = sqlContext.createDataFrame(rowRdd, schema)