Spark use dbutils.fs.ls().toDF in .jar file - scala

I'm trying to package a jar based on code from a Databricks notebook.
I have the following code, which works in Databricks but throws an error when compiled into a standalone Scala jar:
import com.databricks.dbutils_v1.DBUtilsHolder.dbutils

val spark = SparkSession
  .builder()
  .appName("myApp")
  .master("local")
  .enableHiveSupport()
  .getOrCreate()
val sc = SparkContext.getOrCreate()
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import spark.implicits._
import sqlContext.implicits._
...
var file_details = dbutils.fs.ls(folder_path2).toDF()
Which gives the error:
error: value toDF is not a member of Seq[com.databricks.backend.daemon.dbutils.FileInfo]
Does anyone know how to use dbutils.fs.ls().toDF() in a Scala .jar?
Edit: I found a similar question for pyspark that I'm trying to translate to Scala:
val dbutils = com.databricks.service.DBUtils
val ddlSchema = new ArrayType(
  new StructType()
    .add("path", StringType)
    .add("name", StringType)
    .add("size", IntegerType),
  true)
var folder_path = "abfss://container#storage.dfs.core.windows.net"
var file_details = dbutils.fs.ls(folder_path)
var df = spark.createDataFrame(sc.parallelize(file_details), ddlSchema)
but I'm getting this error:
error: overloaded method value createDataFrame with alternatives:
(data: java.util.List[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rows: java.util.List[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.rdd.RDD[com.databricks.service.FileInfo], org.apache.spark.sql.types.ArrayType)
var df = spark.createDataFrame(sc.parallelize(file_details),ddlSchema)

OK, I got it! Here is the code I used:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// List the folder, then pull out (path, name, size) as plain tuples
var file_details = dbutils.fs.ls(folder_path)
var fileData = file_details.map(x => (x.path, x.name, x.size.toString))

// Parallelize into an RDD and turn each tuple into a Row
var rdd = sc.parallelize(fileData)
val rowRDD = rdd.map(attributes => Row(attributes._1, attributes._2, attributes._3.toInt))

// Schema matching the Row fields
val schema = StructType(Array(
  StructField("path", StringType, true),
  StructField("name", StringType, true),
  StructField("size", IntegerType, true)
))
var fileDf = spark.createDataFrame(rowRDD, schema)
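Alternatively (a minimal sketch, assuming spark.implicits._ is in scope and that FileInfo exposes path, name and size as on the Databricks runtime), you can skip the Row/schema plumbing by mapping the results to tuples, which already have built-in encoders:
// Sketch: Seq[(String, String, Long)] gets an implicit encoder from spark.implicits._,
// so toDF() resolves without a hand-built schema.
import spark.implicits._
val fileDf2 = dbutils.fs.ls(folder_path)
  .map(f => (f.path, f.name, f.size))
  .toDF("path", "name", "size")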

In order to trigger the implicit conversion to a Dataset-like container, and thus have toDF() available, you also need an implicit Spark Encoder in scope (besides the already present spark.implicits._).
I think this auto-derivation will work and will make toDF() available:
implicit val encoder = org.apache.spark.sql.Encoders.product[com.databricks.backend.daemon.dbutils.FileInfo]
Otherwise you can work directly with RDDs, as above.
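For reference, a minimal usage sketch (assuming FileInfo is a case class available on the classpath, as on the Databricks runtime, and folder_path2 as in the question):
// Hypothetical sketch: with an Encoder[FileInfo] in implicit scope,
// toDF() on the Seq returned by dbutils.fs.ls should resolve.
import org.apache.spark.sql.{Encoder, Encoders}
import com.databricks.backend.daemon.dbutils.FileInfo

implicit val fileInfoEncoder: Encoder[FileInfo] = Encoders.product[FileInfo]
val fileDf = dbutils.fs.ls(folder_path2).toDF()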

Related

Spark create dataframe in IDE (using databricks-connect)

I'm attempting to run some code from my Databricks notebook in an IDE using databricks-connect. I can't seem to figure out how to create a simple DataFrame.
Using:
import spark.implicits._
var Table_Count = Seq((cdpos_df.count(),I_count,D_count,U_count)).toDF("Table_Count","I_Count","D_Count","U_Count")
gives the error message value toDF is not a member of Seq[(Long, Long, Long, Long)].
Trying to create the dataframe from scratch:
var dataRow = Seq((cdpos_df.count(), I_count, D_count, U_count))
var schemaRow = List(
  StructField("Table_Count", LongType, true),
  StructField("I_Count", LongType, true),
  StructField("D_Count", LongType, true),
  StructField("U_Count", LongType, true)
)
var TableCount = spark.createDataFrame(
  sc.parallelize(dataRow),
  StructType(schemaRow)
)
Gives the error message
overloaded method value createDataFrame with alternatives:
(data: java.util.List[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rows: java.util.List[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.rdd.RDD[(Long, Long, Long, Long)], org.apache.spark.sql.types.StructType)
Combining the methods using:
var TableCount = spark.createDataFrame(
  sc.parallelize(dataRow)
  // StructType(schemaRow)
).toDF("Table_Count", "I_Count", "D_Count", "U_Count")
gets rid of the errors, but I still need to build this in a bit...
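A minimal sketch of how the RDD[Row] + StructType overload could be satisfied here (assuming dataRow, schemaRow and sc as defined above; this is a sketch, not the asker's final code):
// Map the tuples to Rows so the (RDD[Row], StructType) overload of createDataFrame applies
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

val rowRDD = sc.parallelize(dataRow).map { case (t, i, d, u) => Row(t, i, d, u) }
var TableCount = spark.createDataFrame(rowRDD, StructType(schemaRow))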

how to create a spark DataFrame using a listOfData and schema

I am trying to create a DataFrame from a list of data and also want to apply a schema to it.
From the Spark Scala docs I am trying to use this createDataFrame signature, which accepts a list of Rows and a schema as a StructType:
def createDataFrame(rows: List[Row], schema: StructType): DataFrame
Sample code I am trying is below:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val simpleData = List(
  Row("James", "Sales", 3000),
  Row("Michael", "Sales", 4600),
  Row("Robert", "Sales", 4100),
  Row("Maria", "Finance", 3000)
)
val schema = StructType(Array(
  StructField("name", StringType, false),
  StructField("department", StringType, false),
  StructField("salary", IntegerType, false)))
val df = spark.createDataFrame(simpleData, schema)
But I am getting below error
command-3391230614683259:15: error: overloaded method value createDataFrame with alternatives:
(data: java.util.List[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
(rows: java.util.List[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
(rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (List[org.apache.spark.sql.Row], org.apache.spark.sql.types.StructType)
val df = spark.createDataFrame(simpleData,schema)
Please suggest what I am doing wrong.
The error is telling you that it needs a Java List not a Scala List:
import scala.jdk.CollectionConverters._
val df = spark.createDataFrame(simpleData.asJava, schema)
See this question for alternatives to CollectionConverters if you are using an earlier version of Scala than 2.13.
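For example (a sketch for Scala 2.12 and earlier):
// JavaConverters provides the same .asJava conversion on older Scala versions
import scala.collection.JavaConverters._
val df = spark.createDataFrame(simpleData.asJava, schema)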
Another option is to pass an RDD:
val df = spark.createDataFrame(sc.parallelize(simpleData), schema)
sc being the SparkContext object.

How to use from_json standard function with custom schema (error: overloaded method value from_json with alternative)?

I am consuming JSON data from an AWS Kinesis stream, but I am getting the following error when I try to use the from_json() standard function:
command-5804948:32: error: overloaded method value from_json with alternatives:
(e: org.apache.spark.sql.Column,schema: org.apache.spark.sql.Column)org.apache.spark.sql.Column <and>
(e: org.apache.spark.sql.Column,schema: org.apache.spark.sql.types.DataType)org.apache.spark.sql.Column <and>
(e: org.apache.spark.sql.Column,schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.Column
cannot be applied to (String, org.apache.spark.sql.types.StructType)
.select(from_json("jsonData", dataSchema).as("devices"))
I have tried both of the below to define my schema:
val dataSchema = new StructType()
  .add("ID", StringType)
  .add("Type", StringType)
  .add("Body", StringType)
  .add("Submitted", StringType)
val dataSchema = StructType(Seq(StructField("ID",StringType,true), StructField("Type",StringType,true), StructField("Body",StringType,true), StructField("Submitted",StringType,true)))
Here is my code:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import java.nio.ByteBuffer
import scala.util.Random
val dataSchema = new StructType()
  .add("ID", StringType)
  .add("Type", StringType)
  .add("Body", StringType)
  .add("Submitted", StringType)
// val dataSchema = StructType(Seq(StructField("ID",StringType,true), StructField("Type",StringType,true), StructField("Body",StringType,true), StructField("Submitted",StringType,true)))

val kinesisDF = spark.readStream
  .format("kinesis")
  .option("streamName", "**************************")
  .option("region", "********")
  .option("initialPosition", "TRIM_HORIZON")
  .option("awsAccessKey", "****************")
  .option("awsSecretKey", "************************************")
  .load()

val schemaDF = kinesisDF
  .selectExpr("cast (data as STRING) jsonData")
  .select(from_json("jsonData", dataSchema).as("devices"))
  .select("devices.*")
  .load()

display(schemaDF)
If you do the following:
val str_data = kinesisDF
.selectExpr("cast (data as STRING) jsonData")
display(str_data)
you can see that the stream data looks like:
{"ID":"1266ee3d99bc-96f942a6-434c-6442-a762","Type":"BT","Body":"{\"TN\":\"ND\",\"TD\":\"JSON:{\\"vw\\":\\"CV\\"}\",\"LT\":\"BT\",\"TI\":\"9ff2-4749250dd142-793ffb20-eb8e-47f7\",\"CN\":\"OD\",\"CI\":\"eb\",\"UI\":\"abc004\",\"AN\":\"1234567\",\"TT\":\"2019-09-15T09:48:25.0395209Z\",\"FI\":\"N/A\",\"HI\":\"N/A\",\"SV\":6}","Submitted":"2019-09-15 09:48:26.079"}
{"ID":"c8eb956ee98c-68d668b7-e7a6-9ea2-49a5","Type":"MS","Body":"{\"MT\":\"N/A\",\"EP\":\"N/A\",\"RQ\":\"{\\"IA]\\":false,\\"AN\\":null,\\"ACI\\":\\"1266ee3d99bc-96f942a6-434c-6442-a762\\",\\"CI\\":\\"ebb\\",\\"CG\\":\\"8b8a-4ab17555f2fa-da0c8047-b5a6-4ebe\\",\\"UI\\":\\"def211\\",\\"UR\\":\\"EscC\\",\\"UL\\":\\"SCC\\",\\"TI\\":\\"b9d2-d4f646a15d66-dc519f4a-48c3-4e7b\\",\\"TN\\":null,\\"MN\\":null,\\"CTZ\\":null,\\"PM\\":null,\\"TS\\":null,\\"CI\\":\\"ebc\\",\\"ALDC\\":null}","Submitted":"2019-09-15 09:49:46.901"}
The value for the "Body" key is itself nested JSON, which is why I have put it as a StringType in the schema, so that it gets stored in the column as-is.
I get the error shown above when I run this code. How do I fix it?
That part of the error says it all:
cannot be applied to (String, org.apache.spark.sql.types.StructType)
That means that there are three alternatives (overloads) of the from_json standard function, and all of them expect a Column object, not a String, as the first argument.
You can simply fix it by using $ syntax (or using col standard function) as follows:
.select(from_json($"jsonData", dataSchema).as("devices"))
Note the $ before the column name, which (implicitly) turns it into a Column object.
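Equivalently, using the col standard function (a sketch reusing kinesisDF and dataSchema from the question):
// Same fix, spelled with col() instead of the $ interpolator
import org.apache.spark.sql.functions.{col, from_json}

val schemaDF = kinesisDF
  .selectExpr("cast (data as STRING) jsonData")
  .select(from_json(col("jsonData"), dataSchema).as("devices"))
  .select("devices.*")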

Spark 2 to Spark 1.6

I am trying to convert the following code to run on Spark 1.6, but I am facing issues while converting the SparkSession to a context.
object TestData {
  def makeIntegerDf(spark: SparkSession, numbers: Seq[Int]): DataFrame =
    spark.createDataFrame(
      spark.sparkContext.makeRDD(numbers.map(Row(_))),
      StructType(List(StructField("column", IntegerType, nullable = false)))
    )
}
How do I convert it to run on Spark 1.6?
SparkSession is supported from Spark 2.0 onwards only. So if you want to use Spark 1.6, you would need to create a SparkContext and SQLContext in the driver class and pass them to the function.
So you can create:
val conf = new SparkConf().setAppName("simple")
val sparkContext = new SparkContext(conf)
val sqlContext = new SQLContext(sparkContext)
and then call the function as
val callFunction = makeIntegerDf(sparkContext, sqlContext, numbers)
And your function should be:
def makeIntegerDf(sparkContext: SparkContext, sqlContext: SQLContext, numbers: Seq[Int]): DataFrame =
  sqlContext.createDataFrame(
    sparkContext.makeRDD(numbers.map(Row(_))),
    StructType(List(StructField("column", IntegerType, nullable = false)))
  )
The only main difference here is the use of spark, which is a SparkSession, as opposed to a SQLContext.
So you would do something like this:
object TestData {
  def makeIntegerDf(sc: SparkContext, sqlContext: SQLContext, numbers: Seq[Int]): DataFrame =
    sqlContext.createDataFrame(
      sc.makeRDD(numbers.map(Row(_))),
      StructType(List(StructField("column", IntegerType, nullable = false)))
    )
}
Of course, you would need to create a SparkContext (and SQLContext) instead of a SparkSession in order to provide them to the function, as sketched below.
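A minimal driver-side sketch for Spark 1.6 (the local master and sample numbers are assumptions for illustration):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Build the Spark 1.6 entry points and hand them to the function
val conf = new SparkConf().setAppName("simple").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

val df = TestData.makeIntegerDf(sc, sqlContext, Seq(1, 2, 3))
df.show()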

Not able to create parquet files in hdfs using spark shell

I want to create a parquet file in HDFS and then read it through Hive as an external table. I'm stuck with stage failures in spark-shell while writing the parquet files.
Spark Version: 1.5.2
Scala Version: 2.10.4
Java: 1.7
Input file (employee.txt):
1201,satish,25
1202,krishna,28
1203,amith,39
1204,javed,23
1205,prudvi,23
In Spark-Shell:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val employee = sc.textFile("employee.txt")
employee.first()
val schemaString = "id name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType, StructField, StringType};
val schema = StructType(schemaString.split(" ").map(fieldName ⇒ StructField(fieldName, StringType, true)))
val rowRDD = employee.map(_.split(",")).map(e ⇒ Row(e(0).trim.toInt, e(1), e(2).trim.toInt))
val employeeDF = sqlContext.createDataFrame(rowRDD, schema)
val finalDF = employeeDF.toDF();
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
var WriteParquet= finalDF.write.parquet("/user/myname/schemaParquet")
When I type the last command I get an ERROR (Spark application manager output not reproduced here).
I even tried increasing the executor memory; it's still failing.
Also, importantly, finalDF.show() produces the same error.
So I believe I have made a logical error here.
Thanks for your support.
The issue here is that you are creating a schema with every field/column typed as StringType, but the code converts the values of id and age to Integer when building the Rows. Hence it throws a MatchError at runtime.
The data types of the columns in the schema should match the data types of the values being passed in. Try the code below.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val employee = sc.textFile("employee.txt")
employee.first()
//val schemaString = "id name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types._;
val schema = StructType(StructField("id", IntegerType, true) :: StructField("name", StringType, true) :: StructField("age", IntegerType, true) :: Nil)
val rowRDD = employee.map(_.split(",")).map(e ⇒ Row(e(0).trim.toInt, e(1), e(2).trim.toInt))
val employeeDF = sqlContext.createDataFrame(rowRDD, schema)
val finalDF = employeeDF.toDF();
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
var WriteParquet= finalDF.write.parquet("/user/myname/schemaParquet")
This code should run fine.
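To then read the output through Hive as an external table (a sketch; the table name is hypothetical and the location is the path written above):
// Hypothetical DDL, issued through the hiveContext created earlier in the session
hiveContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS employee_parquet (id INT, name STRING, age INT)
  STORED AS PARQUET
  LOCATION '/user/myname/schemaParquet'
""")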