Spark multiple dynamic aggregate functions, countDistinct not working - scala

Aggregation on Spark dataframe with multiple dynamic aggregation operations.
I want to do aggregation on a Spark dataframe using Scala with multiple dynamic aggregation operations (passed by the user in JSON). I'm converting the JSON to a Map.
Below is some sample data:
colA colB colC colD
1    2    3    4
5    6    7    8
9    10   11   12
The Spark aggregation code which I am using:
var cols = Seq("colA", "colB")
var aggFuncMap = Map("colC" -> "sum", "colD" -> "countDistinct")
var aggregatedDF = currentDF.groupBy(cols.head, cols.tail: _*).agg(aggFuncMap)
I have to pass aggFuncMap as a Map only, so that the user can pass any number of aggregations through the JSON configuration.
The above code works fine for several aggregations, including sum, min, max, avg and count.
However, unfortunately it does not work for countDistinct (maybe because it is camel case?).
When running the above code, I am getting this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Undefined function: 'countdistinct'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'
Any help will be appreciated!

It's currently not possible to use agg with countDistinct inside a Map. From the documentation we see:
The available aggregate methods are avg, max, min, sum, count.
A possible fix would be to change the Map to a Seq[Column],
val cols = Seq("colA", "colB")
val aggFuncs = Seq(sum("colC"), countDistinct("colD"))
val df2 = df.groupBy(cols.head, cols.tail: _*).agg(aggFuncs.head, aggFuncs.tail: _*)
but that won't help very much if the user is to specify the aggregations in a configuration file.
Another approach is to use expr; this function evaluates a string and gives back a Column. However, expr won't accept "countDistinct"; "count(distinct(...))" needs to be used instead.
This could be coded as follows:
val aggFuncs = Seq("sum(colC)", "count(distinct(colD))").map(e => expr(e))
val df2 = df.groupBy(cols.head, cols.tail: _*).agg(aggFuncs.head, aggFuncs.tail: _*)
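Combining the two answers above, here is a minimal sketch of how the Map-based configuration from the question could be translated into expr strings before aggregating; the special-casing of "countDistinct" is an assumption about how the function names arrive from the JSON:
import org.apache.spark.sql.functions.expr

// User-supplied configuration, as in the question.
val cols = Seq("colA", "colB")
val aggFuncMap = Map("colC" -> "sum", "colD" -> "countDistinct")

// Translate each (column, functionName) pair into a SQL expression string.
// "countDistinct" has no SQL function of that name, so rewrite it explicitly.
val aggExprs = aggFuncMap.map {
  case (colName, "countDistinct") => expr(s"count(distinct $colName)")
  case (colName, func)            => expr(s"$func($colName)")
}.toSeq

val aggregatedDF = currentDF
  .groupBy(cols.head, cols.tail: _*)
  .agg(aggExprs.head, aggExprs.tail: _*)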

Related

How to parallelize operations on partitions of a dataframe

I have a dataframe df =
+------------------+
|                id|
+------------------+
|113331567dc042f...|
|5ffbbd1adb4c413...|
|08c782a4ae854e8...|
|24cee418e04a461...|
|a65f47c2aecc455...|
|a74355ef35d442d...|
|86f1a9b7ffc843b...|
|25c8abd6895e445...|
|b89ce33788f4484...|
...
with a million elements.
I want to repartition the dataframe into multiple partitions and pass each partition's elements as a list to a database API call that returns a Spark dataset.
Something like this:
val df2 = df.repartition(10)
df2.foreachPartition { partition =>
  val result = spark.read
    .format("custom.databse")
    .where(__key in partition.toList)
    .load
}
And at the end I would like to do a union of all the result datasets returned for each of the partitions.
The expected output will be a final dataset of strings.
+--------------+
|customer names|
+--------------+
|          eddy|
|         jaman|
|         cally|
|         sunny|
|          adam|
...
Can anyone help me convert this to real code in Spark Scala?
From what I see in the documentation, it could be possible to do something like this. You'll have to use the RDD API and SparkContext, so you could use parallelize to partition your data into n partitions. After that you can call foreachPartition, which already gives you an iterator over your data directly; there is no need to collect the data.
Conceptually, what you are asking is not really possible in Spark.
Your API call is a SparkContext-dependent function (i.e. spark.read), and one cannot use a SparkContext inside a partition function. In simpler words, you cannot pass the spark object to the executors.
To put it even more simply: think of a Dataset in which each row is itself a Dataset. Is that even possible? No.
In your case there are two ways to solve this:
Case 1: One by one, then union
Convert the keys to a list and split them evenly.
For each split, call the spark.read API and keep unioning the results.
// split the collected keys into lists of 10000
val listOfListOfKeys: List[List[String]] =
  df.as[String].collect().toList.grouped(10000).toList

// bring the Dataset for the first 10000 keys (first list)
var resultDf = spark.read.format("custom.databse")
  .where(__key in listOfListOfKeys.head).load

// bring the rest of them, unioning as we go
for (listOfKeys <- listOfListOfKeys.drop(1)) {
  val tempDf = spark.read.format("custom.databse")
    .where(__key in listOfKeys).load
  resultDf = resultDf.union(tempDf)
}
There will be scaling issues with this approach because of the data collected on the driver, but if you want to use the spark.read API, this might be the only easy way.
Case 2: foreachPartition + a normal DB call which returns an iterator
If you can find another way to get the data from your DB, one that returns an iterator or any single-threaded, Spark-independent object, then you can achieve exactly what you want by applying what Filip has answered, i.e. df.repartition(n).rdd.foreachPartition(yourDbCallFunction).
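A rough sketch of case 2, using mapPartitions instead of foreachPartition so that the per-partition results come back as a Dataset; MyDbClient and its fetchNames method are hypothetical stand-ins for whatever single-threaded, Spark-independent client your database offers:
import spark.implicits._

val names = df.as[String]
  .repartition(10)
  .mapPartitions { partition =>
    // Create the plain (non-Spark) client on the executor, never on the driver.
    val client = new MyDbClient()      // hypothetical client
    val keys = partition.toList        // the keys of this partition only
    client.fetchNames(keys).iterator   // assumed to return the customer names for those keys
  }
  .toDF("customer names")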

Spark DataFrame orderBy and DataFrameWriter sortBy, is there a difference?

Is there a difference in the output between sorting before or after the .write command on a DataFrame?
val people: Dataset[Person]
people
  .orderBy("name")
  .write
  .mode(SaveMode.Append)
  .format("parquet")
  .saveAsTable("test_segments")
and
val people: Dataset[Person]
people
  .write
  .sortBy("name")
  .mode(SaveMode.Append)
  .format("parquet")
  .saveAsTable("test_segments")
The difference between them is explained in the comments within the Spark code:
orderBy: a Dataset/DataFrame operation. Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.
sortBy: a DataFrameWriter operation. Sorts the output in each bucket by the given columns.
The sortBy method will only work when you are also defining buckets (bucketBy). Otherwise you will get an exception:
if (sortColumnNames.isDefined && numBuckets.isEmpty) {
  throw new AnalysisException("sortBy must be used together with bucketBy")
}
The columns defined in sortBy are used in the BucketSpec as sortColumnNames, as shown below:
Params:
numBuckets – number of buckets.
bucketColumnNames – the names of the columns that used to generate the bucket id.
sortColumnNames – the names of the columns that used to sort data in each bucket.
case class BucketSpec(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String])
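For completeness, a small sketch of a write where sortBy is accepted because bucketBy is also specified; the bucket count and bucket column are arbitrary choices for illustration:
people
  .write
  .bucketBy(8, "name")   // bucketing is required for sortBy to be accepted
  .sortBy("name")        // sorts the rows within each bucket
  .mode(SaveMode.Append)
  .format("parquet")
  .saveAsTable("test_segments")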

Converting Column of Dataframe to Seq[Columns] Scala

I am trying to perform the following operation:
var test = df.groupBy(keys.map(col(_)): _*).agg(sequence.head, sequence.tail: _*)
I know that the required parameter inside agg should be a Seq[Column].
I then have a dataframe "expr" containing the following:
sequences
count(col("colname1"),"*")
count(col("colname2"),"*")
count(col("colname3"),"*")
count(col("colname4"),"*")
The sequences column is of string type and I want to use the values of each row as input to agg, but I am not able to access them.
Any idea how to do this?
If you can change the strings in the sequences column to be SQL expressions, then this can be solved. Spark provides a function expr that takes a SQL string and converts it into a Column. Example dataframe with working expressions:
val df2 = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")
To convert the dataframe to a Seq[Column], do:
val seqs = df2.as[String].collect().map(expr(_))
Then the groupBy and agg:
df.groupBy(...).agg(seqs.head, seqs.tail:_*)
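Putting the pieces together, a small self-contained sketch; the column names and example data here are made up purely for illustration:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// The data to aggregate and the user-supplied expressions, both as dataframes.
val df = Seq(("x", 1, 2), ("x", 3, 4), ("y", 5, 6)).toDF("key", "A", "B")
val exprDf = Seq("sum(A) as A", "count(B) as B").toDF("sequences")

val keys = Seq("key")
val seqs = exprDf.as[String].collect().map(expr(_))

df.groupBy(keys.map(col(_)): _*)
  .agg(seqs.head, seqs.tail: _*)
  .show()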

How to use a specific row to modify the others in a dataframe

Given the following dataframe
id | value
1 | 2
2 | 3
3 | 4
I want to divide all values by a reference value: the value associated with id 1.
I've come up with:
df.cache()
val ref = df
  .filter($"id" === 1)
  .withColumnRenamed("value", "ref")
df
  .crossJoin(broadcast(ref))
  .withColumn("div", $"value" / $"ref")
Pros:
avoids collect(), so data is not sent to the Spark driver.
uses cache() to avoid computing the input DataFrame twice.
Is there a better way?
uses cache() to avoid computing the input DataFrame twice.
This is not correct. cache, like all other Spark operations, is lazy and will not be evaluated unless some action is called. So when you do
val ref = df
  .filter($"id" === 1)
  .withColumnRenamed("value", "ref")
df
  .crossJoin(broadcast(ref))
  .withColumn("div", $"value" / $"ref")
and eventually call save, the two operations will be evaluated simultaneously. If you really want to cache the data immediately, you need to call some action on the cached data first, for example:
df.count
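A small sketch of the eager-caching variant described above; calling an action such as count right after cache() forces the cached data to be materialized before the two downstream uses (imports shown for completeness):
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

df.cache()
df.count()   // action: materializes the cache, so df is computed only once

val ref = df
  .filter($"id" === 1)
  .withColumnRenamed("value", "ref")

val result = df
  .crossJoin(broadcast(ref))
  .withColumn("div", $"value" / $"ref")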

How do I convert Array[Row] to RDD[Row]

I have a scenario where I want to convert the result of a dataframe, which is in the format Array[Row], to RDD[Row]. I have tried using parallelize, but I don't want to use it, as it needs to hold the entire data on a single machine, which is not feasible on the production box.
val Bid = spark.sql("select Distinct DeviceId, ButtonName from stb").collect()
val bidrdd = sparkContext.parallelize(Bid)
How do I achieve this? I tried the approach given in this link (How to convert DataFrame to RDD in Scala?), but it didn't work for me.
val bidrdd1 = Bid.map(x => (x(0).toString, x(1).toString)).rdd
It gives the error: value rdd is not a member of Array[(String, String)]
The variable Bid which you've created here is not a DataFrame; it is an Array[Row], which is why you can't use .rdd on it. If you want an RDD[Row], simply call .rdd on the DataFrame (without calling collect):
val rdd = spark.sql("select Distinct DeviceId, ButtonName from stb").rdd
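If tuples of (DeviceId, ButtonName) are what you eventually need (as in your second snippet), you can map over the RDD's Rows without ever collecting; a small sketch, assuming both columns are strings:
// Still distributed: no collect(), the mapping runs on the executors.
val bidRdd = spark.sql("select Distinct DeviceId, ButtonName from stb")
  .rdd
  .map(row => (row.getString(0), row.getString(1)))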
Your post contains some misconceptions worth noting:
... a dataframe which is in the format Array[Row] ...
Not quite - the Array[Row] is the result of collecting the data from the DataFrame into Driver memory - it's not a DataFrame.
... I don't want to use it as it needs to contain entire data in a single system ...
Note that as soon as you use collect on the DataFrame, you've already collected the entire data into a single JVM's memory, so using parallelize is not the issue.