Why can't a DataFrame be accessed inside a UDF? [Apache Spark Scala] [duplicate]

This question already has answers here:
Why accessing DataFrame from UDF results in NullPointerException?
(2 answers)
Closed 3 years ago.
I am currently working on a streaming project using Apache Spark. I have two data sources: from the first I get news data from Kafka, which is updated continuously. From the second I get a masterWord dictionary; this variable contains a DataFrame of words and their unique word keys.
I want to process the news data and convert it from a Seq of words into a Seq of word ids by matching the words against the masterWord dictionary. But I have problems accessing the masterWord DataFrame in my UDF. When I try to access the DataFrame inside the UDF, Spark returns this error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4, localhost, executor driver): java.lang.NullPointerException
Why can't the DataFrame be accessed inside the UDF?
What is the best practice for getting values from another DataFrame?
This is my code:
// read data stream from Kafka
val kafka = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", PropertiesLoader.kafkaBrokerUrl)
.option("subscribe", PropertiesLoader.kafkaTopic)
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger", "100")
.load()
// Transform data stream to Dataframe
val kafkaDF = kafka.selectExpr("CAST(value AS STRING)").as[(String)]
.select(from_json($"value", ColsArtifact.rawSchema).as("data"))
.select("data.*")
.withColumn("raw_text", concat(col("title"), lit(" "), col("text"))) // add column aggregate title and text
// read master word dictionary
val readConfig = ReadConfig(Map("uri" -> "mongodb://10.252.37.112/prayuga", "database" -> "prayuga", "collection" -> "master_word_2"))
var masterWord = MongoSpark.load(spark, readConfig)
// call UDF
val aggregateDF = kafkaDF.withColumn("text_aggregate", aggregateMongo(col("text_selected")))
// UDF
val aggregateMongo = udf((content: Seq[String]) => {
masterWord.show()
...
// code for query masterWord whether var content exist or not in masterWord dictionary
})

The DataFrame lives in the Spark context, and as such it is only available on the driver.
Each task sees only its fraction (partition) of the data and can work with that. If you want to make the data in the DataFrame available inside a UDF, you have to collect it to the driver and then broadcast it (or pass it as a parameter, which effectively does the same thing) to the UDF, in which case Spark will send the whole thing to every instance of the UDF that runs.

If you want to use a DataFrame inside a UDF, you must create a Broadcast for it:
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
import spark.implicits._
val df_name = Seq("Raphael").toDF("name")
val bc_df_name: Broadcast[DataFrame] = spark.sparkContext.broadcast(df_name)
// use the broadcast df_name inside the udf
val udf_doSomething = udf(() => bc_df_name.value.as[String].first())
Seq(1,2,3)
.toDF("i")
.withColumn("test",udf_doSomething())
.show()
gives
+---+-------+
| i| test|
+---+-------+
| 1|Raphael|
| 2|Raphael|
| 3|Raphael|
+---+-------+
This at least works in local mode; I'm not sure whether it also works on clusters. Anyway, I would not recommend this approach. Better to convert (collect) the content of the DataFrame into a Scala data structure on the driver (e.g. a Map) and broadcast that variable, or use a join instead.
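A minimal sketch of that recommended alternative, applied back to the masterWord dictionary from the question; the word/word_id column names and the Long id type are assumptions about the master_word_2 collection:
import org.apache.spark.sql.functions.udf
// Collect the dictionary to the driver as a plain Scala Map and broadcast it.
val wordToId: Map[String, Long] = masterWord
  .select("word", "word_id")
  .collect()
  .map(r => r.getString(0) -> r.getLong(1))
  .toMap
val bcWordToId = spark.sparkContext.broadcast(wordToId)
// The UDF now only touches the broadcast Map, never a DataFrame.
val aggregateMongo = udf((content: Seq[String]) => content.flatMap(w => bcWordToId.value.get(w)))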

Related

How to parallelize operations on partitions of a dataframe

I have a dataframe df =
+--------------------+
| id|
+-------------------+
|113331567dc042f...|
|5ffbbd1adb4c413...|
|08c782a4ae854e8...|
|24cee418e04a461...|
|a65f47c2aecc455...|
|a74355ef35d442d...|
|86f1a9b7ffc843b...|
|25c8abd6895e445...|
|b89ce33788f4484...|
.....................
with million elements.
I want to repartition the dataframe into multiple partitions and pass each partition's elements as a list to a database API call that returns a Spark dataset.
Something like this.
df2 = df.repartition(10)
df2.foreach-partition { partition =>
val result = spark.read
.format("custom.databse")
.where(__key in partition.toList)
.load
}
And at the end I would like to do a union of all the result datasets returned for each of the partitions.
The expected output will be a final dataset of strings.
+--------------------+
| customer names|
+-------------------+
|eddy |
|jaman |
|cally |
|sunny |
|adam |
.....................
Can anyone help me convert this into real code in Spark Scala?
From what I see in the documentation, it should be possible to do something like this. You'll have to use the RDD API and SparkContext, so you can use parallelize to partition your data into n partitions. After that you can call foreachPartition, which already gives you an iterator over your data directly, so there is no need to collect the data.
Conceptually, what you are asking is not really possible in Spark.
Your API call is a SparkContext-dependent function (i.e. spark.read), and one cannot use a SparkContext inside a partition function. In simpler words, you cannot pass the spark object to executors.
To picture it even more simply: think of a Dataset in which each row is itself a Dataset. Is that even possible? No.
In your case there are two ways to solve this:
Case 1: One by one, then union
Convert the keys to a list and split them evenly.
For each split, call the spark.read API and keep unioning the results.
// split the collected keys into 10000-sized lists
val listOfListOfKeys: List[List[String]] = df.collect().map(_.getString(0)).toList.grouped(10000).toList
// bring the Dataset for the first 10000 keys (first list);
// .where(__key in ...) stays pseudocode for your custom source's key filter
var resultDf = spark.read.format("custom.databse")
  .where(__key in listOfListOfKeys.head).load
// bring the rest, unioning as we go (the first list is skipped since it is already loaded)
for (listOfKeys <- listOfListOfKeys.drop(1)) {
  val tempDf = spark.read.format("custom.databse")
    .where(__key in listOfKeys).load
  resultDf = resultDf.union(tempDf)
}
There will be scaling issues with this approach because of the data collected on the driver, but if you want to use the "spark.read" API, this might be the only easy way.
Case 2: foreachPartition + a normal DB call which returns an iterator
If you can find another way to get the data from your DB that returns an iterator or any single-threaded, Spark-independent object, then you can achieve exactly what you want by applying what Filip has answered, i.e. df.repartition.rdd.foreachPartition(yourDbCallFunction())
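A hedged sketch of that second approach, reusing df and its id column from the question, with a hypothetical single-threaded client MyDbClient (not a real library); it uses mapPartitions rather than foreachPartition so the looked-up names come back as an RDD that can become the final dataset of strings:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
// Hypothetical, Spark-independent client; the real database lookup goes here.
class MyDbClient {
  def lookupNames(keys: Seq[String]): Seq[String] =
    keys.map(k => s"name-for-$k") // placeholder: replace with the actual DB call
}
val names = df.repartition(10)
  .select("id").as[String]
  .rdd
  .mapPartitions { partition =>
    val client = new MyDbClient() // created on the executor, no SparkContext involved
    client.lookupNames(partition.toSeq).iterator
  }
val finalDf = names.toDF("customer names")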

How to append an index column to a spark data frame using spark streaming in scala?

I am using something like this:
df.withColumn("idx", monotonically_increasing_id())
But I get an exception as it is NOT SUPPORTED:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Expression(s): monotonically_increasing_id() is not supported with streaming DataFrames/Datasets;;
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:143)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:250)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:316)
Any ideas on how to add an index or row number column to a Spark streaming DataFrame in Scala?
Full stacktrace: https://justpaste.it/5bdqr
There are a few operations that cannot exist anywhere in a streaming plan of Spark Streaming, unfortunately including monotonically_increasing_id(). To double-check this fact, transformed1 below fails with the error from your question; the check itself lives in Spark's UnsupportedOperationChecker (the class in your stack trace):
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(("one", 1), ("two", 2)).toDF("foo", "bar")
val schema = df.schema
df.write.parquet("/tmp/out")
val input = spark.readStream.format("parquet").schema(schema).load("/tmp/out")
val transformed1 = input.withColumn("id", monotonically_increasing_id())
transformed1.writeStream.format("parquet")
  .option("format", "append").option("path", "/tmp/out2")
  .option("checkpointLocation", "/tmp/checkpoint_path")
  .outputMode("append").start()
import org.apache.spark.sql.expressions.Window
val windowSpecRowNum = Window.partitionBy("foo").orderBy("foo")
val transformed2 = input.withColumn("row_num", row_number.over(windowSpecRowNum))
transformed2.writeStream.format("parquet")
  .option("format", "append").option("path", "/tmp/out2")
  .option("checkpointLocation", "/tmp/checkpoint_path")
  .outputMode("append").start()
I also tried to add indexing with a Window over a column in the DF (transformed2 in the snippet above); it also failed, but with a different error:
"Non-time-based windows are not supported on streaming
DataFrames/Datasets"
You can find all the unsupported operator checks for Spark Streaming in that same UnsupportedOperationChecker; it seems the traditional ways of adding an index column in batch Spark don't work in Spark Streaming.
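Not part of the answer above, but if a per-micro-batch index is acceptable, one commonly used workaround (assuming Spark 2.4+ for foreachBatch, with illustrative output/checkpoint paths) is to assign the ids inside foreachBatch, where each batch is a plain, non-streaming DataFrame and monotonically_increasing_id() is allowed again; note the ids are only unique within each micro-batch, not across the whole stream:
import org.apache.spark.sql.DataFrame
input.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // the batch DataFrame is not streaming, so the streaming-plan check does not apply here
    batchDf
      .withColumn("idx", monotonically_increasing_id())
      .write.mode("append").parquet("/tmp/out3")
  }
  .option("checkpointLocation", "/tmp/checkpoint_path_2")
  .start()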

spark scala reducekey dataframe operation

I'm trying to do a count in Scala with a DataFrame. My data has 3 columns and I've already loaded the data and split it by tab. So I want to do something like this:
val file = file.map(line=>line.split("\t"))
val x = file1.map(line=>(line(0), line(2).toInt)).reduceByKey(_+_,1)
I want to put the data in a DataFrame, and I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the DataFrame.
There are many ways to specify the schema; here is one option:
import spark.implicits._
val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(2).toInt)) // at this point Spark knows the number of columns and their types; the third column holds the number, as in the question
  .toDF("a", "b")               // give the columns names for ease of use
df
  .groupBy('a)
  .count()
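If the intent of the original reduceByKey(_+_) was a sum per key rather than a row count, the same grouped DataFrame can aggregate column b instead:
import org.apache.spark.sql.functions.sum
df
  .groupBy('a)
  .agg(sum("b"))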

How can I save an RDD into HDFS and later read it back?

I have an RDD whose elements are of type (Long, String). For some reason, I want to save the whole RDD into the HDFS, and later also read that RDD back in a Spark program. Is it possible to do that? And if so, how?
It is possible.
On an RDD you have the saveAsObjectFile and saveAsTextFile functions. Tuples are stored as (value1, value2), so you can parse them back later.
Reading can be done with the textFile function from SparkContext and then a .map to strip the parentheses and parse the values.
So:
Version 1:
rdd.saveAsTextFile("hdfs:///test1/")
// later, in another program
val newRdds = sparkContext.textFile("hdfs:///test1/part-*").map { x =>
  // strip the surrounding parentheses and parse back into (Long, String)
  val s = x.stripPrefix("(").stripSuffix(")")
  val i = s.indexOf(',')
  (s.substring(0, i).toLong, s.substring(i + 1))
}
Version 2:
rdd.saveAsObjectFile("hdfs:///test1/")
// later, in another program - objectFile is the read counterpart of saveAsObjectFile,
// so you get the tuples back out of the box :)
val newRdds = sparkContext.objectFile[(Long, String)]("hdfs:///test1/")
I would recommend using a DataFrame if your RDD is in a tabular format. A DataFrame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable and each row contains one case.
A DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query,
whereas an RDD is a Resilient Distributed Dataset, more of a black box or core abstraction over data that cannot be optimized in the same way.
However, you can go from a DataFrame to an RDD and vice versa; you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
The following is an example of creating and storing a DataFrame in CSV and Parquet format in HDFS:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
val conf = new SparkConf().setAppName("Spark-HDFS-Read-Write")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val hdfs = "hdfs:///"
val df = Seq((1, "Name1")).toDF("id", "name")
// Writing file in CSV format
df.write.format("com.databricks.spark.csv").mode("overwrite").save(hdfs + "user/hdfs/employee/details.csv")
// Writing file in PARQUET format
df.write.format("parquet").mode("overwrite").save(hdfs + "user/hdfs/employee/details")
// Reading CSV files from HDFS
val dfIncsv = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load(hdfs + "user/hdfs/employee/details.csv")
// Reading PARQUET files from HDFS
val dfInParquet = sqlContext.read.parquet(hdfs + "user/hdfs/employee/details")
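Tying this back to the question's RDD[(Long, String)], a sketch of converting it with the toDF method before writing, under the same sqlContext and imports as above; the rdd_backup path and the id/value column names are just illustrative:
// rdd: RDD[(Long, String)] from the question
val rddDf = rdd.toDF("id", "value")
rddDf.write.format("parquet").mode("overwrite").save(hdfs + "user/hdfs/rdd_backup")
// read it back later and go back to an RDD of tuples if needed
val restoredRdd = sqlContext.read.parquet(hdfs + "user/hdfs/rdd_backup")
  .rdd
  .map(r => (r.getLong(0), r.getString(1)))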

Trying to execute a spark sql query from a UDF

I am trying to write an inline function in the Spark framework using Scala which will take a string input, execute a SQL statement and return a String value:
val testFunc: (String => String) = (arg1: String) => {
  val k = sqlContext.sql("""select c_code from r_c_tbl where x_nm = "something" """)
  k.head().getString(0)
}
I am registering this Scala function as a UDF:
val testFunc_test = udf(testFunc)
I have a DataFrame over a Hive table:
val df = sqlContext.table("some_table")
Then I am calling the UDF in a withColumn and trying to save the result in a new DataFrame:
val new_df = df.withColumn("test", testFunc_test($"col1"))
But every time I try to do this I get an error:
16/08/10 21:17:08 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 10.0.1.5): java.lang.NullPointerException
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:41)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
at org.apache.spark.sql.DataFrame.foreach(DataFrame.scala:1434)
I am relatively new to Spark and Scala, but I am not sure why this code should not run. Any insights or a workaround will be highly appreciated.
Please note that I have not pasted the whole error stack. Please let me know if it is required.
You can't use sqlContext in your UDF - UDFs must be serializable to be shipped to executors, and the context (which can be thought of as a connection to the cluster) can't be serialized and sent to the node - only the driver application (where the UDF is defined, but not executed) can use the sqlContext.
It looks like your use case (performing a select from table X per record in table Y) would be better accomplished by using a join.
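A hedged sketch of that join, reusing the asker's table and column names; the assumption that col1 in some_table matches x_nm in r_c_tbl is mine:
// Load the lookup values once as a DataFrame instead of querying per row inside a UDF.
val lookup = sqlContext.sql("select x_nm, c_code from r_c_tbl")
val df = sqlContext.table("some_table")
// A left join keeps every row of some_table and attaches the matching c_code as "test".
val new_df = df
  .join(lookup, df("col1") === lookup("x_nm"), "left_outer")
  .withColumnRenamed("c_code", "test")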