Spark SQL query in Play Framework shows empty result - Scala

When I run a GroupBy query in spark-shell it shows perfect results, but when I run the same query with Spark SQL inside Play Framework, using the same Spark version, it returns an empty result. I have a similar issue with join operations: cross joins work in Play Framework with sbt, but the other join types return empty results.
I am using Spark 2.4.0, Play Framework 2.7 and sbt 1.2.8.
Here is my code:
var file_name1 = "excel1.xlsx"
var file_name2 = "excel2.xlsx"

var df1 = Spark.sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("sheetName", "sheet1") // Required
  .option("useHeader", "true") // Required
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", true)
  .option("addColorColumns", "false")
  .option("timestampFormat", "MM-dd-yyyy HH:mm:ss")
  .load(file_name1)
  .limit(10)

var df2 = Spark.sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("sheetName", "sheet1") // Required
  .option("useHeader", "true") // Required
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", true) // Optional, default: false
  .option("addColorColumns", "false") // Optional, default: false
  .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
  .load(file_name2)
  .limit(10)

df1.registerTempTable("temp1")
df2.registerTempTable("temp2")

val df = Spark.sqlContext.sql("select * from temp1 inner join temp2 on temp1.a == temp2.a")
df.show()
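One way to narrow this down is a minimal sketch that expresses the same inner join through the DataFrame API, reusing df1 and df2 from above (assuming column a exists in both): if this returns rows while the SQL version stays empty, the issue is in the temp-table/SQL path rather than in the data itself.
// Hedged sketch: same inner join via the DataFrame API instead of a temp table.
val joinedViaApi = df1.join(df2, df1("a") === df2("a"), "inner")
joinedViaApi.show()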

Related

Replace "" (empty string) with null using the Spark dataframe.write method while writing to CSV

While writing a Spark DataFrame to a CSV file using the write method, null strings are written to the CSV as "":
101|abc|""|555
102|""|xyz|743
This is the code being used:
dataFrame
  .coalesce(16)
  .write
  .format("csv")
  .option("delimiter", "|")
  .option("treatEmptyValuesAsNulls", "true")
  .option("nullValue", null)
  .option("emptyValue", null)
  .mode(SaveMode.Overwrite)
  .save(path)
Expected output:
101|abc|null|555
102|null|xyz|743
Spark version 3.2 and Scala version 2.1
The issue seems to be in the option definitions; the option values should be specified as the String "null" instead of null, like:
dataFrame.coalesce(16).write.format("csv")
  .option("delimiter", "|")
  .option("treatEmptyValuesAsNulls", "true")
  .option("nullValue", "null")
  .option("emptyValue", "null")
  .mode(SaveMode.Overwrite).save(path)
Alternatively, a null character can be used as the marker:
dataFrame.coalesce(16).write.format("csv")
  .option("delimiter", "|")
  .option("treatEmptyValuesAsNulls", "true")
  .option("nullValue", "\u0000")
  .option("emptyValue", "\u0000")
  .mode(SaveMode.Overwrite).save(path)
This solved the issue.
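As a quick sanity check, the written files can be read back with a matching nullValue option so the literal "null" marker is interpreted as a SQL NULL again (a sketch, assuming the same spark session and path as above):
// Hypothetical verification read; `spark` and `path` are assumed from the answer above.
val checkDf = spark.read
  .format("csv")
  .option("delimiter", "|")
  .option("nullValue", "null") // map the literal "null" marker back to NULL
  .load(path)
checkDf.show()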

Spark Scala code not behaving the same as its PySpark version

I have a general question about Spark.
Should PySpark and Scala Spark always behave the same way when we use the exact same code?
If yes, how can you explain this example?
Scala version:
val inputDf = spark
  .readStream
  .format("csv")
  .schema(schema)
  .option("ignoreChanges", "true")
  .option("delimiter", ";").option("header", true)
  .load("/input/")

def processIsmedia(df: DataFrame, batchId: Long): Unit = {
  val ids = df
    .select("id").distinct().collect().toList
    .map(el => s"$el")
  ids.foreach { id =>
    val datedDf = df.filter(col("id") === id)
    datedDf
      .write
      .format("delta")
      .option("mergeSchema", "true")
      .partitionBy("id")
      .option("replaceWhere", s"id == '$id'")
      .mode("overwrite")
      .save("/res/")
  }
}

inputDf
  .writeStream
  .format("delta")
  .foreachBatch(processIsmedia _)
  .queryName("tgte")
  .option("checkpointLocation", "/check")
  .trigger(Trigger.Once)
  .start()
Python version:
inputDf = spark \
    .readStream \
    .format("csv") \
    .schema(schema) \
    .option("ignoreChanges", "true") \
    .option("delimiter", ";").option("header", True) \
    .load("/in/")

def processDf(df, epoch_id):
    PartitionKey = "id"
    df.cache()
    ids = [x.id for x in df.select("id").distinct().collect()]
    for idd in ids:
        idd = str(idd)
        tmp = df.filter(df.id == idd)
        tmp.write.format("delta").option("mergeSchema", "true").partitionBy(PartitionKey).option("replaceWhere", "id == '$i'".format(i=idd)).save("/res/")

inputDf.writeStream.format("delta").foreachBatch(processDf).queryName("aaaa").option("checkpointLocation", "/check").trigger(once=True).start()
Both code versions are meant to be exactly equivalent.
They are supposed to write data (append new partitions and overwrite existing ones).
With Scala it works perfectly fine.
With Python I am getting an error:
Data written out does not match replaceWhere 'id == '$i''.
So my question is: isn't Spark the same thing whether it is used with Scala, Java, Python or even R? How can this error be possible then?
The Python code is not substituting the value of idd, so the resulting predicate is the literal string "id == '$i'", which is not what your Scala code produces. That is,
.option("replaceWhere", "id == '$i'".format(i=idd))
should be
.option("replaceWhere", "id == '{i}'".format(i=idd))
Let me know if this change works for you.

Except and ExceptAll functions for Apache Spark's Dataset are giving an empty DataFrame during streaming

The except and exceptAll functions for Apache Spark's Dataset are giving an empty DataFrame during streaming.
I am using two datasets, both batch;
the left one has a few rows that are not present in the right one.
When run, it gives the correct output, i.e. the rows that are in the left dataset but not in the right one.
Now I am repeating the same thing in streaming: the left dataset is streaming while the right one is a batch source. In this scenario I get an empty DataFrame, and dataframe.writeStream throws a None.get exception.
package exceptPOC

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object exceptPOC {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("pocExcept")
      .setMaster("local")
      .set("spark.driver.host", "localhost")
    val sparkSession =
      SparkSession.builder().config(sparkConf).getOrCreate()

    val schema = new StructType()
      .add("sepal_length", DoubleType, true)
      .add("sepal_width", DoubleType, true)
      .add("petal_length", DoubleType, true)
      .add("petal_width", DoubleType, true)
      .add("species", StringType, true)

    var df1 = sparkSession.readStream
      .option("header", "true")
      .option("inferSchema", "true")
      .option("treatEmptyValuesAsNulls", "false")
      .option("delimiter", ",")
      .schema(schema)
      .csv(
        "some/path/exceptPOC/streamIris" // contains iris1
      )

    var df2 = sparkSession.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("treatEmptyValuesAsNulls", "false")
      .option("delimiter", ",")
      .schema(schema)
      .csv(
        "/some/path/exceptPOC/iris2.csv"
      )

    val exceptDF = df1.except(df2)
    exceptDF.writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
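For comparison, here is a minimal sketch of the batch-vs-batch case the question describes as working, reusing sparkSession and schema from the snippet above (the left-hand file name is an assumption):
// Hedged batch-only sketch: both sides are static DataFrames, so except()
// returns the rows present in the left dataset but not in the right one.
val batchLeft = sparkSession.read
  .option("header", "true")
  .schema(schema)
  .csv("some/path/exceptPOC/streamIris/iris1.csv") // hypothetical path to the left-hand file
val batchRight = sparkSession.read
  .option("header", "true")
  .schema(schema)
  .csv("/some/path/exceptPOC/iris2.csv")
val exceptBatchDF = batchLeft.except(batchRight)
exceptBatchDF.show()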

Checkpoint for many streaming sources

I am working with Zeppelin, and I read many files from many sources in Spark Streaming like this:
val var1 = spark
  .readStream
  .schema(var1_raw)
  .option("sep", ",")
  .option("mode", "PERMISSIVE")
  .option("maxFilesPerTrigger", 100)
  .option("treatEmptyValuesAsNulls", "true")
  .option("newFilesOnly", "true")
  .csv(path_var1)

val chekpoint_var1 = var1
  .writeStream
  .format("csv")
  .option("checkpointLocation", path_checkpoint_var1)
  .option("path", path_checkpoint)
  .option("header", true)
  .outputMode("Append")
  .queryName("var1_backup")
  .start().awaitTermination()

val var2 = spark
  .readStream
  .schema(var2_raw)
  .option("sep", ",")
  .option("mode", "PERMISSIVE")
  .option("maxFilesPerTrigger", 100)
  .option("treatEmptyValuesAsNulls", "true")
  .option("newFilesOnly", "true")
  .csv(path_var2)

val chekpoint_var2 = var2
  .writeStream
  .format("csv")
  .option("checkpointLocation", path_checkpoint_var2)
  .option("path", path_checkpoint_2)
  .option("header", true)
  .outputMode("Append")
  .queryName("var2_backup")
  .start().awaitTermination()
When I rerun the job I get this message:
java.lang.IllegalArgumentException: Cannot start query with name var1_backup as a query with that name is already active
*****************the solution*******************
val spark = SparkSession
  .builder
  .appName("test")
  .config("spark.local", "local[*]")
  .getOrCreate()

spark.sparkContext.setCheckpointDir(path_checkpoint)
and afterwards I call the checkpoint function on the DataFrame.
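A minimal sketch of the two steps the answer describes, assuming df is a hypothetical batch DataFrame (checkpoint() is not available on streaming DataFrames):
// Set the checkpoint directory once on the SparkContext ...
spark.sparkContext.setCheckpointDir(path_checkpoint)
// ... then call checkpoint() on the DataFrame: it materializes the data in
// that directory and truncates the lineage.
val checkpointedDf = df.checkpoint()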

Spark: how to read from multiple Elasticsearch clusters

I need to read data from two different Elasticsearch clusters, one for logs and one for product data. I tried to use a different SparkConf() when creating each SparkSession, but it seems to work only with the first SparkSession I created.
val config1 = new SparkConf().setAppName("test")
  .set("spark.driver.allowMultipleContexts", "true")
  .set("es.index.auto.create", "true")
  .set("es.nodes.discovery", "false")
  .set("es.nodes.wan.only", "true")
  .set("es.nodes.client.only", "false")
  .set("es.nodes", s"$esNode1:$esPort1")

val config2 = new SparkConf().setAppName("test")
  .set("spark.driver.allowMultipleContexts", "true")
  .set("es.index.auto.create", "true")
  .set("es.nodes.discovery", "false")
  .set("es.nodes.wan.only", "true")
  .set("es.nodes.client.only", "false")
  .set("es.nodes", s"$esNode2:$esPort2")

val session1 = SparkSession.builder.master("local").config(config1).getOrCreate()
val session2 = SparkSession.builder.master("local").config(config2).getOrCreate()

session1.read.format("org.elasticsearch.spark.sql").load(path)
session2.read.format("org.elasticsearch.spark.sql").load(path)
It seems Spark does not support multiple sessions with the same format, because I am using the same SparkSession with MySQL (JDBC) too and it works well. Is there an alternative way to get data from multiple Elasticsearch clusters?
Create only one session per Spark application. Then read 2 DataFrames this way:
val config = new SparkConf().setAppName("test")
  .set("spark.driver.allowMultipleContexts", "true")
  .set("es.index.auto.create", "true")
  .set("es.nodes.discovery", "false")
  .set("es.nodes.wan.only", "true")
  .set("es.nodes.client.only", "false")

val session = SparkSession.builder.master("local").config(config).getOrCreate

val df1 = session.read.format("org.elasticsearch.spark.sql")
  .option("es.nodes", s"$esNode1:$esPort1").load(path)
val df2 = session.read.format("org.elasticsearch.spark.sql")
  .option("es.nodes", s"$esNode2:$esPort2").load(path)
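This works because the elasticsearch-hadoop connector also picks up its es.* settings from the per-read options, so each load() can target a different cluster from a single SparkSession.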