Spark Scala code not working similarly to its PySpark version - scala

I have a general question about Spark.
Should PySpark and Scala Spark always have the same behaviour when we use the exact same code?
If yes, how can you explain this example:
Scala version:
val inputDf = spark
  .readStream
  .format("csv")
  .schema(schema)
  .option("ignoreChanges", "true")
  .option("delimiter", ";").option("header", true)
  .load("/input/")

def processIsmedia(df: DataFrame, batchId: Long): Unit = {
  val ids = df
    .select("id").distinct().collect().toList
    .map(el => s"$el")

  ids.foreach { id =>
    val datedDf = df.filter(col("id") === id)
    datedDf
      .write
      .format("delta")
      .option("mergeSchema", "true")
      .partitionBy("id")
      .option("replaceWhere", s"id == '$id'")
      .mode("overwrite")
      .save("/res/")
  }
}

inputDf
  .writeStream
  .format("delta")
  .foreachBatch(processIsmedia _)
  .queryName("tgte")
  .option("checkpointLocation", "/check")
  .trigger(Trigger.Once)
  .start()
Python version:
inputDf = spark \
    .readStream \
    .format("csv") \
    .schema(schema) \
    .option("ignoreChanges", "true") \
    .option("delimiter", ";").option("header", True) \
    .load("/in/")

def processDf(df, epoch_id):
    PartitionKey = "id"
    df.cache()
    ids = [x.id for x in df.select("id").distinct().collect()]
    for idd in ids:
        idd = str(idd)
        tmp = df.filter(df.id == idd)
        tmp.write.format("delta").option("mergeSchema", "true").partitionBy(PartitionKey).option("replaceWhere", "id == '$i'".format(i=idd)).save("/res/")

inputDf.writeStream.format("delta").foreachBatch(processDf).queryName("aaaa").option("checkpointLocation", "/check").trigger(once=True).start()
Both codes are exactly equivalent.
They are supposed to write data (append new partitions and overwrite existing ones).
With Scala it works perfectly fine.
With Python I get an error:
Data written out does not match replaceWhere 'id == '$i''.
So my question is: isn't Spark the same thing whether it is used with Scala, Java, Python or even R? How can this error be possible then?

The Python code is not substituting the value of idd: str.format only replaces placeholders written with braces, so the resulting string is literally "id == '$i'". That is not the case in your Scala code, where the s-interpolator does expand $id. In other words,
.option("replaceWhere", "id == '$i'".format(i=idd))
should be
.option("replaceWhere", "id == '{i}'".format(i=idd))
Let me know if this change works for you.

Related

How to create Predicate for reading data using Spark SQL in Scala

I can read the Oracle table using this simple Scala program:
val spark = SparkSession
  .builder
  .master("local[4]")
  .config("spark.sql.sources.partitionColumnTypeInference.enabled", false)
  .config("spark.executor.memory", "8g")
  .config("spark.executor.cores", 4)
  .config("spark.task.cpus", 1)
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcl")
  .option("dbtable", "big_table")
  .option("user", "test")
  .option("password", "123456")
  .load()

jdbcDF.show()
However, the table is huge and each node has to read only part of it, so I must use a hash function to distribute the data among the Spark nodes. For that, Spark has predicates. In fact, I already did this in Python: the table has a column named NUM, and the hash function receives each value and returns an integer between 0 and num_partitions. The predicate list is the following:
hash_function = lambda x: 'ora_hash({}, {})'.format(x, num_partitions)
hash_df = connection.read_sql_full(
    'SELECT distinct {0} hash FROM {1}'.format(hash_function(var.hash_col), source_table_name))
hash_values = list(hash_df.loc[:, 'HASH'])
hash_values for num_partitions=19 is:
hash_values=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]
predicates = [
    "to_date({0},'YYYYMMDD','nls_calendar=persian') = to_date({1},'YYYYMMDD','nls_calendar=persian') "
    "and hash_func({2},{3}) = {4}"
    .format(partition_key, current_date, hash_col, num_partitions, hash_val)
    for hash_val in hash_values
]
Then I read the table based on the predicates like this:
dataframe = spark.read \
    .option('driver', 'oracle.jdbc.driver.OracleDriver') \
    .jdbc(url=spark_url,
          table=table_name,
          predicates=predicates)
Would you please guide me on how to create the predicates list in Scala, as I explained for Python?
Any help is really appreciated.
Problem solved.
I changed the code like this, and then it works:
import org.apache.spark.sql.SparkSession
import java.sql.Connection
import oracle.jdbc.pool.OracleDataSource

object main extends App {

  def read_spark(): Unit = {
    val numPartitions = 19
    val partitionColumn = "name"
    val field_date = "test"
    val current_date = "********"

    // Define JDBC properties
    val url = "jdbc:oracle:thin:@//x.x.x.x:1521/orcl"
    val properties = new java.util.Properties()
    properties.put("url", url)
    properties.put("user", "user")
    properties.put("password", "pass")

    // Define the where clauses to assign each row to a partition
    val predicateFct = (partition: Int) => s"""ora_hash("$partitionColumn",$numPartitions) = $partition"""
    val predicates = (0 until numPartitions).map { partition => predicateFct(partition) }.toArray

    val test_table = s"(SELECT * FROM table where $field_date=$current_date) dbtable"

    // Load the table into Spark
    val df = spark.read
      .format("jdbc")
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .option("dbtable", test_table)
      .jdbc(url, test_table, predicates, properties)

    df.show()
  }

  val spark = SparkSession
    .builder
    .master("local[4]")
    .config("spark.sql.sources.partitionColumnTypeInference.enabled", false)
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", 4)
    .config("spark.task.cpus", 1)
    .appName("Spark SQL basic example")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()

  read_spark()
}
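As a side note of mine (not part of the original answer): each predicate string becomes exactly one JDBC partition, so inside read_spark, after df is defined, you could verify the distribution like this:
// Each element of `predicates` maps to one partition of the resulting DataFrame,
// so this should print 19 for the code above.
println(df.rdd.getNumPartitions)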

I don't know how to do the same using a parquet file

Link to (data.csv) and (output.csv)
import org.apache.spark.sql._

object Test {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("Test")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val tempDF = spark.read.csv("data.csv")
    tempDF.coalesce(1).write.parquet("Parquet")

    val rdd = sc.textFile("Parquet")
I converted data.csv into an optimised parquet file and then loaded it. Now I want to do all the transformations on the parquet file, just like I did on the csv file given below, and then save it as a parquet file. Link to (data.csv) and (output.csv).
    val header = rdd.first
    val rdd1 = rdd.filter(_ != header)

    val resultRDD = rdd1.map { r =>
      val Array(country, values) = r.split(",")
      country -> values
    }.reduceByKey((a, b) => a.split(";").zip(b.split(";")).map { case (i1, i2) => i1.toInt + i2.toInt }.mkString(";"))

    import spark.sqlContext.implicits._
    val dataSet = resultRDD.map { case (country: String, values: String) => CountryAgg(country, values) }.toDS()
    dataSet.coalesce(1).write.option("header", "true").csv("output")
  }

  case class CountryAgg(country: String, values: String)
}
I reckon you are trying to add up the corresponding elements of the array based on Country. I have done this using the DataFrame API, which makes the job easier.
Code for your reference:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("path", "/path/to/input/data.csv")
  .load()

val df1 = df.select(
    $"Country",
    (split($"Values", ";"))(0).alias("c1"),
    (split($"Values", ";"))(1).alias("c2"),
    (split($"Values", ";"))(2).alias("c3"),
    (split($"Values", ";"))(3).alias("c4"),
    (split($"Values", ";"))(4).alias("c5")
  )
  .groupBy($"Country")
  .agg(
    sum($"c1" cast "int").alias("s1"),
    sum($"c2" cast "int").alias("s2"),
    sum($"c3" cast "int").alias("s3"),
    sum($"c4" cast "int").alias("s4"),
    sum($"c5" cast "int").alias("s5")
  )
  .select(
    $"Country",
    concat(
      $"s1", lit(";"),
      $"s2", lit(";"),
      $"s3", lit(";"),
      $"s4", lit(";"),
      $"s5"
    ).alias("Values")
  )

df1.repartition(1)
  .write
  .format("csv")
  .option("delimiter", ",")
  .option("header", "true")
  .option("path", "/path/to/output")
  .save()
Here is the output for your reference.
scala> df1.show()
+-------+-------------------+
|Country| Values|
+-------+-------------------+
|Germany| 144;166;151;172;70|
| China| 218;239;234;209;75|
| India| 246;153;148;100;90|
| Canada| 183;258;150;263;71|
|England|178;114;175;173;153|
+-------+-------------------+
P.S.:
You can change the output format to parquet/orc or anything you wish; a parquet variant is sketched below.
I have repartitioned df1 into 1 partition just so that you get a single output file. You can choose to repartition or not based on your use case.
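For instance, a minimal sketch of the parquet variant (the output path below is a placeholder of mine):
// Sketch only: the same df1 as above, written as parquet instead of csv.
df1.repartition(1)                    // single output file, as with the csv write above
  .write
  .mode("overwrite")                  // assumption: overwriting the target location is fine
  .parquet("/path/to/output_parquet") // placeholder path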
Hope this helps.
You could just read the file as parquet and perform the same operations on the resulting dataframe:
val spark = SparkSession.builder()
  .appName("Test")
  .master("local[*]")
  .getOrCreate()

// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("data.parquet")
If you need an rdd you can then just call:
val rdd = parquetFileDF.rdd
Then you can proceed with the transformations as before and write the result as parquet, as in your question (a sketch follows below).
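For completeness, a rough sketch of that end-to-end path. It assumes data.csv was read with the header option before being written to the "Parquet" directory (so the parquet file contains only data rows with two string columns, Country and Values), and it reuses the CountryAgg case class from your question; the output path is a placeholder:
// Sketch only: read the parquet written earlier, redo the RDD-based aggregation,
// and persist the result as parquet instead of csv.
val parquetFileDF = spark.read.parquet("Parquet")   // schema is preserved by parquet

import spark.implicits._
val resultDS = parquetFileDF.rdd
  .map(r => r.getString(0) -> r.getString(1))       // (Country, Values)
  .reduceByKey { (a, b) =>
    a.split(";").zip(b.split(";")).map { case (x, y) => x.toInt + y.toInt }.mkString(";")
  }
  .map { case (country, values) => CountryAgg(country, values) }
  .toDS()

resultDS.coalesce(1).write.mode("overwrite").parquet("output_parquet")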

spark sql query in playframework shows empty result

When I run a GroupBy query in spark-shell it shows me correct results, but when I run the same query using Spark SQL in Play Framework with the same Spark version it shows me an empty result. I have a similar issue with join operations: cross joins work in Play Framework with sbt, but other joins show empty results.
I am using Spark 2.4.0, Play Framework 2.7 and sbt 1.2.8.
Here is my code:
var file_name1 = "excel1.xlsx"
var file_name2 = "excel2.xlsx"

var df1 = Spark.sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("sheetName", "sheet1") // Required
  .option("useHeader", "true") // Required
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", true)
  .option("addColorColumns", "false")
  .option("timestampFormat", "MM-dd-yyyy HH:mm:ss")
  .load(file_name1)
  .limit(10)

var df2 = Spark.sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("sheetName", "sheet1") // Required
  .option("useHeader", "true") // Required
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", true) // Optional, default: false
  .option("addColorColumns", "false") // Optional, default: false
  .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
  .load(file_name2)
  .limit(10)

df1.registerTempTable("temp1")
df2.registerTempTable("temp2")

val df = sqlContext.sql("select * from temp1 inner join temp2 on temp1.a == temp2.a")
df.show()

How to join 2 spark sql streams

ENV:
Scala spark version: 2.1.1
These are my streams (read from Kafka):
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

val conf = new SparkConf()
  .setMaster("local[1]")
  .setAppName("JoinStreams")
val spark = SparkSession.builder().config(conf).getOrCreate()
import spark.implicits._

val schema = StructType(
  List(
    StructField("t", DataTypes.StringType),
    StructField("dst", DataTypes.StringType),
    StructField("dstPort", DataTypes.IntegerType),
    StructField("src", DataTypes.StringType),
    StructField("srcPort", DataTypes.IntegerType),
    StructField("ts", DataTypes.LongType),
    StructField("len", DataTypes.IntegerType),
    StructField("cpu", DataTypes.DoubleType),
    StructField("l", DataTypes.StringType),
    StructField("headers", DataTypes.createArrayType(DataTypes.StringType))
  )
)

val baseDataFrame = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .select($"data.*")

val requestsDataFrame = baseDataFrame
  .filter("t = 'REQUEST'")
  .repartition($"dst")
  .withColumn("rowId", monotonically_increasing_id())

val responseDataFrame = baseDataFrame
  .filter("t = 'RESPONSE'")
  .repartition($"src")
  .withColumn("rowId", monotonically_increasing_id())

responseDataFrame.createOrReplaceTempView("responses")
requestsDataFrame.createOrReplaceTempView("requests")

val dataFrame = spark.sql("select * from requests left join responses ON requests.rowId = responses.rowId")
I get this ERROR when starting the application:
org.apache.spark.sql.AnalysisException: Left outer/semi/anti joins with a streaming DataFrame/Dataset on the right is not supported;;
How can I join these two streams?
I also tried a direct join and got the same error.
Should I first save one stream to a file and then read it again?
What is the best practice?
It seems you need Spark 2.3:
"In Spark 2.3, we have added support for stream-stream joins..."
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins
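For what it's worth, with Spark 2.3+ a stream-stream join needs watermarks on the streams and, for outer joins, an explicit time-range condition. Below is only a sketch, reusing requestsDataFrame / responseDataFrame from your question; it assumes ts is epoch milliseconds, and the join keys (dst/dstPort vs src/srcPort) and the 10-minute bound are my assumptions, not something stated in the question:
// Sketch only (Spark 2.3+): event-time columns, watermarks, and a time-bounded join condition.
import org.apache.spark.sql.functions._

val requests = requestsDataFrame
  .withColumn("reqTime", (col("ts") / 1000).cast("timestamp"))   // assumes ts is in milliseconds
  .withWatermark("reqTime", "10 minutes")

val responses = responseDataFrame
  .withColumn("respTime", (col("ts") / 1000).cast("timestamp"))
  .withWatermark("respTime", "10 minutes")

val joined = requests.as("req").join(
  responses.as("resp"),
  expr("""
    req.dst = resp.src AND
    req.dstPort = resp.srcPort AND
    respTime >= reqTime AND
    respTime <= reqTime + interval 10 minutes
  """),
  "leftOuter"   // outer stream-stream joins require the watermark plus the time bound above
)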

How to persist output of window() function in JDBC with Spark SQL DataFrame?

When the following snippet executes:
...
stream
  .map(_.value())
  .flatMap(MyParser.parse(_))
  .foreachRDD(rdd => {
    val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._

    val dataFrame = rdd.toDF()
    val countsDf = dataFrame.groupBy($"action", window($"time", "1 hour")).count()
    val query = countsDf.write.mode("append").jdbc(url, "stats_table", prop)
  })
....
This error happens: java.lang.IllegalArgumentException: Can't get JDBC type for struct<start:timestamp,end:timestamp>
How would one go about saving the output of org.apache.spark.sql.functions.window() function to a MySQL DB?
I ran into the same problem using Spark SQL:
val query3 = dataFrame
  .groupBy(org.apache.spark.sql.functions.window($"timeStamp", "10 minutes"), $"data")
  .count()
  .writeStream
  .outputMode(OutputMode.Complete())
  .options(prop)
  .option("checkpointLocation", "file:///tmp/spark-checkpoint1")
  .option("table", "temp")
  .format("com.here.olympus.jdbc.sink.OlympusDBSinkProvider")
  .start
And I solved it by adding a user-defined function:
val toString = udf { (window: GenericRowWithSchema) => window.mkString("-") }
For me a String works, but you can change the function according to your needs; you could even have two functions that return start and end separately (see the sketch after the query below).
My query changed to:
val query3 = dataFrame
  .groupBy(org.apache.spark.sql.functions.window($"timeStamp", "10 minutes"), $"data")
  .count()
  .withColumn("window", toString($"window"))
  .writeStream
  .outputMode(OutputMode.Complete())
  .options(prop)
  .option("checkpointLocation", "file:///tmp/spark-checkpoint1")
  .option("table", "temp")
  .format("com.here.olympus.jdbc.sink.OlympusDBSinkProvider")
  .start
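Alternatively (a sketch of my own, not part of the original answer): instead of stringifying the whole struct, you could expose window.start and window.end as plain timestamp columns, which a JDBC sink can store directly. The window_start/window_end names and the second checkpoint path are placeholders:
// Sketch only: replace the struct column with two timestamp columns before writing.
val query4 = dataFrame
  .groupBy(org.apache.spark.sql.functions.window($"timeStamp", "10 minutes"), $"data")
  .count()
  .withColumn("window_start", $"window.start")
  .withColumn("window_end", $"window.end")
  .drop("window")
  .writeStream
  .outputMode(OutputMode.Complete())
  .options(prop)
  .option("checkpointLocation", "file:///tmp/spark-checkpoint2")   // placeholder checkpoint path
  .option("table", "temp")
  .format("com.here.olympus.jdbc.sink.OlympusDBSinkProvider")
  .start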