Spark create dataframe in IDE (using databricks-connect) - scala

I'm attempting to run some code from my Databricks notebook in an IDE using databricks-connect. I can't seem to figure out how to create a simple DataFrame.
Using:
import spark.implicits._
var Table_Count = Seq((cdpos_df.count(),I_count,D_count,U_count)).toDF("Table_Count","I_Count","D_Count","U_Count")
gives the error message value toDF is not a member of Seq[(Long, Long, Long, Long)].
Trying to create the dataframe from scratch:
var dataRow = Seq((cdpos_df.count(),I_count,D_count,U_count))
var schemaRow = List(
  StructField("Table_Count", LongType, true),
  StructField("I_Count", LongType, true),
  StructField("D_Count", LongType, true),
  StructField("U_Count", LongType, true)
)
var TableCount = spark.createDataFrame(
  sc.parallelize(dataRow),
  StructType(schemaRow)
)
Gives the error message
overloaded method value createDataFrame with alternatives:
(data: java.util.List[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rows: java.util.List[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.rdd.RDD[(Long, Long, Long, Long)], org.apache.spark.sql.types.StructType)

Combining the methods using:
var TableCount = spark.createDataFrame(
  sc.parallelize(dataRow)
  // StructType(schemaRow)
).toDF("Table_Count","I_Count","D_Count","U_Count")
gets rid of the errors, but I still need to build on this a bit...
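One pattern that should sidestep the overload error (not confirmed in this thread, but it mirrors the working answer to the next question below): map the tuples to Row objects first, since createDataFrame with a StructType only accepts an RDD of Rows, not an RDD of tuples. A minimal sketch, assuming spark, sc, and the count variables (all Longs, per the first error message) are already in scope:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Build Rows rather than plain tuples so the RDD[Row] + StructType overload of createDataFrame applies
val dataRows = Seq(Row(cdpos_df.count(), I_count, D_count, U_count))
val countSchema = StructType(List(
  StructField("Table_Count", LongType, true),
  StructField("I_Count", LongType, true),
  StructField("D_Count", LongType, true),
  StructField("U_Count", LongType, true)
))
val tableCount = spark.createDataFrame(sc.parallelize(dataRows), countSchema)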

Related

Spark use dbutils.fs.ls().toDF in .jar file

I'm trying to package my jar based on code in a Databricks notebook.
I have the following line that works in Databricks but throws an error when compiled as plain Scala code:
import com.databricks.dbutils_v1.DBUtilsHolder.dbutils
val spark = SparkSession
  .builder()
  .appName("myApp")
  .master("local")
  .enableHiveSupport()
  .getOrCreate()
val sc = SparkContext.getOrCreate()
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import spark.implicits._
import sqlContext.implicits._
...
var file_details = dbutils.fs.ls(folder_path2).toDF()
Which gives the error:
error: value toDF is not a member of Seq[com.databricks.backend.daemon.dbutils.FileInfo]
Does anyone know how to use dbutils.fs.ls().toDF() in a Scala .jar?
Edit: I found a similar question for PySpark that I'm trying to translate to Scala:
val dbutils = com.databricks.service.DBUtils
val ddlSchema = new ArrayType(
  new StructType()
    .add("path", StringType)
    .add("name", StringType)
    .add("size", IntegerType),
  true)
var folder_path = "abfss://container@storage.dfs.core.windows.net"
var file_details = dbutils.fs.ls(folder_path)
var df = spark.createDataFrame(sc.parallelize(file_details),ddlSchema)
but I'm getting this error:
error: overloaded method value createDataFrame with alternatives:
(data: java.util.List[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rows: java.util.List[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.rdd.RDD[com.databricks.service.FileInfo], org.apache.spark.sql.types.ArrayType)
var df = spark.createDataFrame(sc.parallelize(file_details),ddlSchema)
Ok I got it!!! Here is the code I used:
var file_details = dbutils.fs.ls(folder_path)
var fileData = file_details.map(x => (x.path, x.name, x.size.toString))
var rdd = sc.parallelize(fileData)
val rowRDD = rdd.map(attributes => Row(attributes._1, attributes._2, attributes._3.toInt))
val schema = StructType(Array(
  StructField("path", StringType, true),
  StructField("name", StringType, true),
  StructField("size", IntegerType, true)
))
var fileDf = spark.createDataFrame(rowRDD, schema)
To trigger the implicit conversion to a Dataset-like container and have toDF() available, you also need an implicit Spark Encoder (in addition to the spark.implicits._ import that is already there).
I think this auto-derivation will work and will make toDF() available:
implicit val encoder = org.apache.spark.sql.Encoders.product[com.databricks.backend.daemon.dbutils.FileInfo]
Otherwise, yes, you can work directly with RDDs as you did.

how to create a spark DataFrame using a listOfData and schema

I am trying to create a DataFrame from a list of data and also want to apply schema on it.
From the Spark Scala docs, I am trying to use this createDataFrame signature, which accepts a list of Rows and a schema as a StructType.
def createDataFrame(rows: List[Row], schema: StructType): DataFrame
The sample code I am trying is below:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val simpleData = List(
  Row("James", "Sales", 3000),
  Row("Michael", "Sales", 4600),
  Row("Robert", "Sales", 4100),
  Row("Maria", "Finance", 3000)
)
val schema = StructType(Array(
  StructField("name", StringType, false),
  StructField("department", StringType, false),
  StructField("salary", IntegerType, false)))
val df = spark.createDataFrame(simpleData,schema)
But I am getting the error below:
command-3391230614683259:15: error: overloaded method value createDataFrame with alternatives:
(data: java.util.List[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rows: java.util.List[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (List[org.apache.spark.sql.Row], org.apache.spark.sql.types.StructType)
val df = spark.createDataFrame(simpleData,schema)
Please suggest what I am doing wrong.
The error is telling you that it needs a Java List, not a Scala List:
import scala.jdk.CollectionConverters._
val df = spark.createDataFrame(simpleData.asJava, schema)
See this question for alternatives to CollectionConverters if you are using a Scala version earlier than 2.13.
Another option is to pass an RDD:
val df = spark.createDataFrame(sc.parallelize(simpleData), schema)
sc being the SparkContext object.
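Putting the RDD variant together as a self-contained sketch (assuming a SparkSession named spark; spark.sparkContext plays the role of sc here):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val simpleData = List(
  Row("James", "Sales", 3000),
  Row("Michael", "Sales", 4600)
)
val schema = StructType(Array(
  StructField("name", StringType, false),
  StructField("department", StringType, false),
  StructField("salary", IntegerType, false)
))
// RDD[Row] + StructType matches one of the createDataFrame overloads directly, so no Java conversion is needed
val df = spark.createDataFrame(spark.sparkContext.parallelize(simpleData), schema)
df.show()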

How to use from_json standard function with custom schema (error: overloaded method value from_json with alternative)?

I am consuming JSON data from AWS Kinesis stream, but I am getting the following error when I try to use the from_json() standard function:
command-5804948:32: error: overloaded method value from_json with alternatives:
(e: org.apache.spark.sql.Column,schema: org.apache.spark.sql.Column)org.apache.spark.sql.Column <and>
(e: org.apache.spark.sql.Column,schema: org.apache.spark.sql.types.DataType)org.apache.spark.sql.Column <and>
(e: org.apache.spark.sql.Column,schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.Column
cannot be applied to (String, org.apache.spark.sql.types.StructType)
.select(from_json("jsonData", dataSchema).as("devices"))
I have tried both of the below to define my schema:
val dataSchema = new StructType()
  .add("ID", StringType)
  .add("Type", StringType)
  .add("Body", StringType)
  .add("Submitted", StringType)
val dataSchema = StructType(Seq(StructField("ID",StringType,true), StructField("Type",StringType,true), StructField("Body",StringType,true), StructField("Submitted",StringType,true)))
Here is my code:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import java.nio.ByteBuffer
import scala.util.Random
val dataSchema = new StructType()
  .add("ID", StringType)
  .add("Type", StringType)
  .add("Body", StringType)
  .add("Submitted", StringType)
// val dataSchema = StructType(Seq(StructField("ID",StringType,true), StructField("Type",StringType,true), StructField("Body",StringType,true), StructField("Submitted",StringType,true)))
val kinesisDF = spark.readStream
  .format("kinesis")
  .option("streamName", "**************************")
  .option("region", "********")
  .option("initialPosition", "TRIM_HORIZON")
  .option("awsAccessKey", "****************")
  .option("awsSecretKey", "************************************")
  .load()
val schemaDF = kinesisDF
  .selectExpr("cast (data as STRING) jsonData")
  .select(from_json("jsonData", dataSchema).as("devices"))
  .select("devices.*")
  .load()
display(schemaDF)
If you do the following:
val str_data = kinesisDF
  .selectExpr("cast (data as STRING) jsonData")
display(str_data)
you can see that the stream data looks like:
{"ID":"1266ee3d99bc-96f942a6-434c-6442-a762","Type":"BT","Body":"{\"TN\":\"ND\",\"TD\":\"JSON:{\\"vw\\":\\"CV\\"}\",\"LT\":\"BT\",\"TI\":\"9ff2-4749250dd142-793ffb20-eb8e-47f7\",\"CN\":\"OD\",\"CI\":\"eb\",\"UI\":\"abc004\",\"AN\":\"1234567\",\"TT\":\"2019-09-15T09:48:25.0395209Z\",\"FI\":\"N/A\",\"HI\":\"N/A\",\"SV\":6}","Submitted":"2019-09-15 09:48:26.079"}
{"ID":"c8eb956ee98c-68d668b7-e7a6-9ea2-49a5","Type":"MS","Body":"{\"MT\":\"N/A\",\"EP\":\"N/A\",\"RQ\":\"{\\"IA]\\":false,\\"AN\\":null,\\"ACI\\":\\"1266ee3d99bc-96f942a6-434c-6442-a762\\",\\"CI\\":\\"ebb\\",\\"CG\\":\\"8b8a-4ab17555f2fa-da0c8047-b5a6-4ebe\\",\\"UI\\":\\"def211\\",\\"UR\\":\\"EscC\\",\\"UL\\":\\"SCC\\",\\"TI\\":\\"b9d2-d4f646a15d66-dc519f4a-48c3-4e7b\\",\\"TN\\":null,\\"MN\\":null,\\"CTZ\\":null,\\"PM\\":null,\\"TS\\":null,\\"CI\\":\\"ebc\\",\\"ALDC\\":null}","Submitted":"2019-09-15 09:49:46.901"}
The value for the "Body" key is another JSON/nested JSON that is why I have put it as a StringType in the schema so that gets stored in the column as is.
I get the following error when I run the above code:
How to fix it?
That part of the error says it all:
cannot be applied to (String, org.apache.spark.sql.types.StructType)
That means that there are three different alternatives of the from_json standard function, and all of them expect a Column object, not a String.
You can simply fix it by using the $ syntax (or the col standard function) as follows:
.select(from_json($"jsonData", dataSchema).as("devices"))
Note the $ before the column name, which (implicitly) turns it into a Column object.
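For reference, a sketch of the equivalent using the col function instead of $ (a purely stylistic choice; both produce a Column). The trailing .load() from the question is omitted here, since the select already returns a DataFrame:

import org.apache.spark.sql.functions.{col, from_json}

val schemaDF = kinesisDF
  .selectExpr("cast (data as STRING) jsonData")
  .select(from_json(col("jsonData"), dataSchema).as("devices"))
  .select("devices.*")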

Spark not able to create dataframe with single MapType as column using scala

I am using Spark version 2.2. I am trying to create a DataFrame with one column of MapType.
I have tried following things for that:
scala> val mapSeq = Seq((Map(1-> 2, 11-> 22))).toDF("num")
<console>:23: error: value toDF is not a member of Seq[scala.collection.immutable.Map[Int,Int]]
val mapSeq = Seq((Map(1-> 2, 11-> 22))).toDF("num")
and
scala> spark.createDataFrame(List(Map("a" -> "b")), StructType(MapType(StringType, StringType)))
error: overloaded method value apply with alternatives:
(fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
cannot be applied to (org.apache.spark.sql.types.MapType)
spark.createDataFrame(List(Map("a" -> "b")), StructType(MapType(StringType, StringType)))
Both methods throw an error.
I tried searching and found this Medium blog post: https://medium.com/@mrpowers/working-with-spark-arraytype-and-maptype-columns-4d85f3c8b2b3, but it uses an older version of Spark.
Any ideas on how I can create this?
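The thread has no accepted fix, but the second error hints at one: StructType expects StructFields, not a bare MapType, so the map type needs to be wrapped in a named field. A minimal sketch under that reading, using the explicit Row-plus-schema route (which does not depend on the toDF implicits that failed above):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{MapType, StringType, StructField, StructType}

// Wrap the MapType in a StructField, which is what StructType actually takes
val schema = StructType(Seq(StructField("num", MapType(StringType, StringType), true)))
val rows = spark.sparkContext.parallelize(Seq(Row(Map("a" -> "b"))))
val df = spark.createDataFrame(rows, schema)
df.show(false)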

scala.collection.immutable.Iterable[org.apache.spark.sql.Row] to DataFrame ? error: overloaded method value createDataFrame with alternatives

I have some sql.Row objects that I wish to convert to a DataFrame in Spark 1.6.x
My Rows look like:
events: scala.collection.immutable.Iterable[org.apache.spark.sql.Row] = List([14183197,Browse,80161702,8702170626376335,59,527780275219,List(NavigationLevel, Session)], [14183197,Browse,80161356,8702171157207449,72,527780278061,List(StartPlay, Action, Session)])
Printed Out:
events.foreach(println)
[14183197,Browse,80161702,8702170626376335,59,527780275219,List(NavigationLevel, Session)]
[14183197,Browse,80161356,8702171157207449,72,527780278061,List(StartPlay, Action, Session)]
So I created a schema for the data:
val schema = StructType(Array(
  StructField("trackId", IntegerType, true),
  StructField("location", StringType, true),
  StructField("videoId", IntegerType, true),
  StructField("id", StringType, true),
  StructField("sequence", IntegerType, true),
  StructField("time", StringType, true),
  StructField("type", ArrayType(StringType), true)
))
And then I attempt to create the DataFrame with:
val df = sqlContext.createDataFrame(events, schema)
But I get the following error:
error: overloaded method value createDataFrame with alternatives:
(data: java.util.List[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rows: java.util.List[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (scala.collection.immutable.Iterable[org.apache.spark.sql.Row], org.apache.spark.sql.types.StructType)
I'm not sure why I get this. Is it because the underlying data in the Row has no type information?
Any help is greatly appreciated.
You have to parallelize:
val sc: SparkContext = ???
val df = sqlContext.createDataFrame(sc.parallelize(events), schema)
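One caveat worth adding (my note, not part of the original answer): sc.parallelize expects a Seq, and events is statically typed as an Iterable in the question, so an explicit conversion such as .toSeq may be needed first. A small sketch:

// events is an Iterable[Row]; parallelize wants a Seq, so convert before parallelizing
val rowRDD = sc.parallelize(events.toSeq)
val df = sqlContext.createDataFrame(rowRDD, schema)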