How to parallelize operations on partitions of a dataframe - scala

I have a dataframe df =
+------------------+
|                id|
+------------------+
|113331567dc042f...|
|5ffbbd1adb4c413...|
|08c782a4ae854e8...|
|24cee418e04a461...|
|a65f47c2aecc455...|
|a74355ef35d442d...|
|86f1a9b7ffc843b...|
|25c8abd6895e445...|
|b89ce33788f4484...|
.....................
with a million elements.
I want to repartition the dataframe into multiple partitions and pass each partition's elements as a list to a database API call that returns a Spark dataset.
Something like this.
df2 = df.repartition(10)
df2.foreachPartition { partition =>
  val result = spark.read
    .format("custom.databse")
    .where(__key in partition.toList)
    .load
}
And at the end I would like to do a union of all the result datasets returned for each partition.
The expected output will be a final dataset of strings.
+--------------+
|customer names|
+--------------+
|          eddy|
|         jaman|
|         cally|
|         sunny|
|          adam|
................
Can anyone help me convert this into real Spark Scala code?

From what I see in the documentation it should be possible to do something like this. You'll have to use the RDD API and SparkContext, so you could use parallelize to partition your data into n partitions. After that you can call foreachPartition, which already gives you an iterator over your data directly; there is no need to collect the data.

Conceptually, what you are asking is not really possible in Spark.
Your API call is a SparkContext-dependent function (i.e. spark.read), and one cannot use a SparkContext inside a partition function. In simpler words, you cannot pass the spark object to executors.
For an even simpler picture: think of a Dataset having each row as a Dataset. Is that even possible? No.
In your case there are two ways to solve this:
Case 1: One by one, then union
Convert the keys to a list and split them evenly.
For each split, call the spark.read API and keep unioning the results.
// Split the collected keys into lists of 10000 each
// (map each Row to its string key before grouping)
val listOfListOfKeys: List[List[String]] =
  df.collect().map(_.getString(0)).toList.grouped(10000).toList

// Bring the Dataset for the first 10000 keys (1st list);
// `.where(__key in ...)` is placeholder syntax for the custom data source
var resultDf = spark.read.format("custom.databse")
  .where(__key in listOfListOfKeys.head)
  .load

// Bring the rest of them; List.drop returns a new list and union
// returns a new Dataset, so the result must be reassigned
for (listOfKeys <- listOfListOfKeys.drop(1)) {
  val tempDf = spark.read.format("custom.databse")
    .where(__key in listOfKeys)
    .load
  resultDf = resultDf.union(tempDf)
}
There will be scaling issues with this approach because of the data collected on the driver, but if you want to use the spark.read API, this might be the only easy way.
Case 2: foreachPartition + a normal DB call which returns an iterator
If you can find another way to get the data from your DB which returns an iterator or any other single-threaded, Spark-independent object, then you can achieve exactly what you want by applying what Filip has answered, i.e. df.repartition(10).rdd.foreachPartition(yourDbCallFunction()).
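As a rough illustration of Case 2, here is a minimal sketch. It assumes a hypothetical, purely single-threaded client function fetchCustomersByKeys (not part of any real library) that takes a list of keys and returns the matching customer names, and it uses mapPartitions rather than foreachPartition so the fetched names flow back as an RDD that can be turned into a Dataset:

import spark.implicits._

// Hypothetical non-Spark client: given a list of keys, returns the
// matching customer names. Replace with your database's own client API.
def fetchCustomersByKeys(keys: List[String]): List[String] = ???

val namesRdd = df
  .repartition(10)
  .rdd
  .mapPartitions { partition =>
    // Runs on the executors; no SparkSession or SparkContext is used here.
    val keys = partition.map(_.getString(0)).toList
    fetchCustomersByKeys(keys).iterator
  }

// Back on the driver: a Dataset of strings, as in the expected output.
val customerNames = namesRdd.toDF("customer names")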

Related

Spark Repeatable/Deterministic Results [duplicate]

This question already has answers here: Why does df.limit keep changing in Pyspark? (3 answers). Closed 2 years ago.
I'm running the Spark code below (basically created as an MVE) which does a:
Read parquet and limit
Partition by
Join
Filter
I'm struggling to understand why I get a different number of rows in the joined dataframe i.e. the dataframe after stage 3 above each time I run the application. Why is this happening?
The reason I think that shouldn't be happening is that the limit is deterministic so each time the same rows should be in the partitioned dataframe, albeit in a different order. In the join I am joining on the field that the partition was done on. I am expecting to have every combination of pairs within a partition, but I think this should equate to the same number each time.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
import org.apache.spark.storage.StorageLevel

def main(args: Array[String]) {
  val maxRows = args(0)
  val spark = SparkSession.builder.getOrCreate()
  val windowSpec = Window.partitionBy("epoch_1min").orderBy("epoch")

  val data = spark.read.parquet("srcfile.parquet").limit(maxRows.toInt)
  val partitionDf = data.withColumn("row", row_number().over(windowSpec))
  partitionDf.persist(StorageLevel.MEMORY_ONLY)
  logger.debug(s"${partitionDf.count()} rows in partitioned data")

  val dfOrig = partitionDf.withColumnRenamed("epoch_1min", "epoch_1min_orig").withColumnRenamed("row", "row_orig")
  val dfDest = partitionDf.withColumnRenamed("epoch_1min", "epoch_1min_dest").withColumnRenamed("row", "row_dest")

  val joined = dfOrig.join(dfDest, dfOrig("epoch_1min_orig") === dfDest("epoch_1min_dest"), "inner")
  logger.debug(s"Rows in joined dataframe ${joined.count()}")

  val filtered = joined.filter(col("row_orig") < col("row_dest"))
  logger.debug(s"Rows in filtered dataframe ${filtered.count()}")
}
There could be underlying data changes if you start a new application.
Otherwise, using Spark SQL, just like ANSI SQL on an RDBMS, there is no guaranteed ordering of data when ORDER BY is not used. So, with varying executor allocation, you cannot assume that the processing will be the same the second time around unless you order/sort the data first.
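To illustrate the ordering point against the code in the question, one common sketch is to impose a total order on a unique key before taking the limit, so the same rows are selected on every run; the column name id here is only a placeholder assumption about the parquet schema:

// Sorting on a unique key before limit makes the selected rows
// deterministic across runs, at the cost of a full sort.
val data = spark.read.parquet("srcfile.parquet")
  .orderBy("id") // hypothetical unique key column
  .limit(maxRows.toInt)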

Spark multiple dynamic aggregate functions, countDistinct not working

Aggregation on Spark dataframe with multiple dynamic aggregation operations.
I want to do aggregation on a Spark dataframe using Scala with multiple dynamic aggregation operations (passed by the user in JSON). I'm converting the JSON to a Map.
Below is some sample data:
colA colB colC colD
1 2 3 4
5 6 7 8
9 10 11 12
The Spark aggregation code which I am using:
var cols = List("colA", "colB")
var aggFuncMap = Map("colC"-> "sum", "colD"-> "countDistinct")
var aggregatedDF = currentDF.groupBy(cols.head, cols.tail: _*).agg(aggFuncMap)
I have to pass aggFuncMap as Map only, so that user can pass any number of aggregations through the JSON configuration.
The above code is working fine for some aggregations, including sum, min, max, avg and count.
However, unfortunately this code is not working for countDistinct (maybe because it is camel case?).
When running the above code, I am getting this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Undefined function: 'countdistinct'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'
Any help will be appreciated!
It's currently not possible to use agg with countDistinct inside a Map. From the documentation we see:
The available aggregate methods are avg, max, min, sum, count.
A possible fix would be to change the Map to a Seq[Column],
val cols = Seq("colA", "colB")
val aggFuncs = Seq(sum("colC"), countDistinct("colD"))
val df2 = df.groupBy(cols.head, cols.tail: _*).agg(aggFuncs.head, aggFuncs.tail: _*)
but that won't help very much if the user is to specify the aggregations in a configuration file.
Another approach would be to use expr; this function evaluates a string and gives back a Column. However, expr won't accept "countDistinct"; "count(distinct(...))" needs to be used instead.
This could be coded as follows:
val aggFuncs = Seq("sum(colC)", "count(distinct(colD))").map(e => expr(e))
val df2 = df.groupBy(cols.head, cols.tail: _*).agg(aggFuncs.head, aggFuncs.tail: _*)
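To keep the configuration-driven Map from the question, one possible sketch is to translate each entry into an expr string before aggregating; the special-casing of the "countDistinct" spelling is an assumption about how the user's JSON is written:

import org.apache.spark.sql.functions.expr

val cols = Seq("colA", "colB")
val aggFuncMap = Map("colC" -> "sum", "colD" -> "countDistinct")

// Rewrite "countDistinct" into the SQL form that expr understands,
// and pass every other function name through unchanged.
val aggExprs = aggFuncMap.map { case (colName, fn) =>
  if (fn.equalsIgnoreCase("countDistinct")) expr(s"count(distinct($colName))")
  else expr(s"$fn($colName)")
}.toSeq

val aggregatedDF = currentDF.groupBy(cols.head, cols.tail: _*).agg(aggExprs.head, aggExprs.tail: _*)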

How to use a specific row to modify the others in a dataframe

Given the following dataframe
id | value
1 | 2
2 | 3
3 | 4
I want to divide all values by a reference value: the value associated with id 1.
I've come up with :
df.cache()
val ref = df
  .filter($"id" === 1)
  .withColumnRenamed("value", "ref")
df
  .crossJoin(broadcast(ref))
  .withColumn("div", $"value" / $"ref")
Pros:
avoids collect(), so data is not sent to the Spark driver;
uses cache() to avoid computing the input DataFrame twice.
Is there a better way ?
uses cache() to avoid computing the input DataFrame twice.
This is not correct. cache, like all other Spark operations, is lazy and will not be evaluated until some action is called. So when you do
val ref = df
  .filter($"id" === 1)
  .withColumnRenamed("value", "ref")
df
  .crossJoin(broadcast(ref))
  .withColumn("div", $"value" / $"ref")
and eventually call an action such as save, the two operations will be evaluated at the same time; the cache will not have been populated beforehand. If you really want to cache the data immediately, you need to call some action on the cached data, for example df.count().
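A minimal sketch of that ordering, restating the question's code with an action that forces the cache to be materialized before the join is built (the broadcast and column names are taken from the question):

import org.apache.spark.sql.functions.broadcast
import spark.implicits._

df.cache()
df.count() // action: materializes the cached DataFrame before it is reused

val ref = df
  .filter($"id" === 1)
  .withColumnRenamed("value", "ref")

val result = df
  .crossJoin(broadcast(ref))
  .withColumn("div", $"value" / $"ref")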

How do I convert Array[Row] to RDD[Row]

I have a scenario where I want to convert the result of a dataframe, which is in the format Array[Row], to RDD[Row]. I have tried using parallelize, but I don't want to use it, as it needs to hold the entire data in a single system, which is not feasible on the production box.
val Bid = spark.sql("select Distinct DeviceId, ButtonName from stb").collect()
val bidrdd = sparkContext.parallelize(Bid)
How do I achieve this? I tried the approach given in this link (How to convert DataFrame to RDD in Scala?), but it didn't work for me.
val bidrdd1 = Bid.map(x => (x(0).toString, x(1).toString)).rdd
It gives the error: value rdd is not a member of Array[(String, String)]
The variable Bid which you've created here is not a DataFrame, it is an Array[Row], that's why you can't use .rdd on it. If you want to get an RDD[Row], simply call .rdd on the DataFrame (without calling collect):
val rdd = spark.sql("select Distinct DeviceId, ButtonName from stb").rdd
Your post contains some misconceptions worth noting:
... a dataframe which is in the format Array[Row] ...
Not quite - the Array[Row] is the result of collecting the data from the DataFrame into Driver memory - it's not a DataFrame.
... I don't want to use it as it needs to contain entire data in a single system ...
Note that as soon as you use collect on the DataFrame, you've already collected entire data into a single JVM's memory. So using parallelize is not the issue.
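If the goal was the (String, String) pairs from the attempted map, the same idea works directly on the DataFrame's RDD without collecting; this sketch assumes DeviceId and ButtonName are both string columns:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val rdd: RDD[Row] = spark.sql("select Distinct DeviceId, ButtonName from stb").rdd

// Stays distributed: each Row is converted on the executors.
val pairs: RDD[(String, String)] = rdd.map(r => (r.getString(0), r.getString(1)))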

Spark word2vec findSynonyms on Dataframes

I am trying to use the findSynonyms operation without collecting (an action). Here is an example. I have a DataFrame which holds vectors.
df.show()
+--------------------+
| result|
+--------------------+
|[-0.0081423431634...|
|[0.04309031420520...|
|[0.03857229948043...|
+--------------------+
I want to use findSynonyms on this DataFrame. I tried
df.map{case Row(vector:Vector) => model.findSynonyms(vector)}
but it throws a null pointer exception. Then I learned that Spark does not support nested transformations or actions. One possible way would be to collect this DataFrame and then run findSynonyms. How can I do this operation at the DataFrame level?
If I have understood correctly, you want to perform a function on each row in the DataFrame. To do that, you can declare a User Defined Function (UDF). In your case the UDF will take a vector as input.
import org.apache.spark.sql.functions._
val func = udf((vector: Vector) => model.findSynonyms(vector, 5)) // findSynonyms also requires the number of synonyms to return
df.withColumn("synonymes", func($"result"))
A new column "synonymes" will be created using the results from the func function.
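For completeness, a hedged sketch of how this can work in practice: it assumes Spark 2.2+, that model is an org.apache.spark.ml.feature.Word2VecModel that can be shipped to the executors inside the UDF closure, and that findSynonymsArray is used, since it returns a plain Array[(String, Double)] that is safe inside a UDF (findSynonyms itself returns a DataFrame and cannot be called on executors); the number of synonyms is an arbitrary choice here:

import org.apache.spark.ml.feature.Word2VecModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf
import spark.implicits._

val numSynonyms = 5 // arbitrary choice of how many synonyms to return

// Produces an array<string> column with the synonym words for each vector.
val synonymsUdf = udf((vector: Vector) =>
  model.findSynonymsArray(vector, numSynonyms).map(_._1))

val withSynonyms = df.withColumn("synonymes", synonymsUdf($"result"))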