I'm using Spark 2.4.0 and Scala 2.11.
I have a Dataset[Users], where Users consists of (country, id, url).
I want to group this Dataset by country and, for each group,
make a request to the URL to get details about the users from that country.
What is the best approach?
Using mapPartitions? foreachPartition?

mapPartitions and foreachPartition originated with RDDs; a Dataset now supports mapPartitions as well.
In general you should prefer the Spark DSL or Spark SQL APIs on DataFrames or Datasets. These go through the Catalyst optimizer, so there is less to work out by hand, and they still execute in parallel. Here is an example for a DataFrame; a Dataset works similarly:
import org.apache.spark.sql.functions._
import spark.implicits._
//import org.apache.spark.sql._
//import org.apache.spark.sql.types._
val df = Seq(
("green","y", 4),
("blue","n", 7),
("red","y", 7),
("yellow","y", 7),
("cyan","y", 7)
).toDF("colour", "status", "freq")
val df2 = df.where("status = 'y'")
  .select($"freq", $"colour")
  .groupBy("freq")
  .agg(collect_list($"colour"))
df2.show(false)
|4 |[green] |
|7 |[red, yellow, cyan] |
But as in case of RDDs you can use mapPartitions on a DS.
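For the question as asked (one request per country), one possible approach is groupByKey followed by mapGroups, which hands each country's rows to a single function call. The following is only a minimal sketch meant for spark-shell: the Users fields come from the question, while CountryDetails, fetchDetails, the sample rows and the URLs are hypothetical placeholders, and fetchDetails merely stands in for a real HTTP call.

```scala
import org.apache.spark.sql.SparkSession

// Fields taken from the question: (country, id, url).
case class Users(country: String, id: Long, url: String)
// Hypothetical result type, introduced only for this sketch.
case class CountryDetails(country: String, userCount: Int, details: String)

val spark = SparkSession.builder().master("local[*]").appName("per-country").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the real HTTP request; swap in your own client call.
def fetchDetails(url: String): String = s"details from $url"

// Made-up sample data.
val ds = Seq(
  Users("US", 1L, "http://example.com/us"),
  Users("US", 2L, "http://example.com/us"),
  Users("FR", 3L, "http://example.com/fr")
).toDS()

// groupByKey shuffles all rows of a country to one task;
// mapGroups then runs once per country, so one request is made per group.
val perCountry = ds
  .groupByKey(_.country)
  .mapGroups { (country, users) =>
    val rows = users.toList // materializes the group; assumes it fits in memory
    CountryDetails(country, rows.size, fetchDetails(rows.head.url))
  }

perCountry.show(false)
```

If the request does not depend on seeing a whole country at once, mapPartitions avoids the shuffle and lets you reuse one HTTP client per partition instead.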


Spark Scala - Identify the gap between dates across multiple rows

I am new to Apache Spark, and my use case is to identify gaps between date ranges spread across multiple rows.
In the above example, the member had a gap between 2018-02-01 and 2018-02-14. How can I find this in Apache Spark 2.3.4 using Scala?
The expected output for the above scenario is:
You could use datediff along with Window function lag to check for day-gaps between current and previous rows, and compute the missing date ranges with some date functions:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
import java.sql.Date
val df = Seq(
(1, Date.valueOf("2018-01-01"), Date.valueOf("2018-01-31")),
(1, Date.valueOf("2018-02-16"), Date.valueOf("2018-02-28")),
(1, Date.valueOf("2018-03-01"), Date.valueOf("2018-03-31")),
(2, Date.valueOf("2018-07-01"), Date.valueOf("2018-07-31")),
(2, Date.valueOf("2018-08-16"), Date.valueOf("2018-08-31"))
).toDF("MemberId", "StartDate", "EndDate")
val win = Window.partitionBy("MemberId").orderBy("StartDate", "EndDate")
df.
  withColumn("PrevEndDate", coalesce(lag($"EndDate", 1).over(win), date_sub($"StartDate", 1))).
  withColumn("DayGap", datediff($"StartDate", $"PrevEndDate")).
  where($"DayGap" > 1).
  select($"MemberId", date_add($"PrevEndDate", 1).as("StartDateGap"), date_sub($"StartDate", 1).as("EndDateGap")).
  show
// +--------+------------+----------+
// |MemberId|StartDateGap|EndDateGap|
// +--------+------------+----------+
// | 1| 2018-02-01|2018-02-15|
// | 2| 2018-08-01|2018-08-15|
// +--------+------------+----------+

Can't query Spark DF from Hive after `saveAsTable` - Spark SQL specific format, which is NOT compatible with Hive

I am trying to save a dataframe as an external table which will be queried both with Spark and possibly with Hive, but somehow I cannot query or see any data with Hive. It works only in Spark.
Here is how to reproduce the problem:
scala> println(spark.conf.get("spark.sql.catalogImplementation"))
scala> spark.conf.set("hive.exec.dynamic.partition", "true")
scala> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
scala> spark.conf.set("spark.sql.sources.bucketing.enabled", true)
scala> spark.conf.set("hive.exec.dynamic.partition", "true")
scala> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
scala> spark.conf.set("hive.enforce.bucketing","true")
scala> spark.conf.set("optimize.sort.dynamic.partitionining","true")
scala> spark.conf.set("hive.vectorized.execution.enabled","true")
scala> spark.conf.set("hive.enforce.sorting","true")
scala> spark.conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
scala> spark.conf.set("hive.metastore.uris", "thrift://localhost:9083")
scala> var df = spark.range(20).withColumn("random", round(rand()*90))
df: org.apache.spark.sql.DataFrame = [id: bigint, random: double]
scala> df.head
res19: org.apache.spark.sql.Row = [0,46.0]
scala> df.repartition(10, col("random")).write.mode("overwrite").option("compression", "snappy").option("path", "s3a://company-bucket/dev/hive_confs/").format("orc").bucketBy(10, "random").sortBy("random").saveAsTable("hive_random")
19/08/01 19:26:55 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`hive_random` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
Here is how I query in hive:
Beeline version 2.3.4-amzn-2 by Apache Hive
0: jdbc:hive2://localhost:10000/default> select * from hive_random;
| hive_random.col |
No rows selected (0.213 seconds)
But it works fine in spark:
scala> spark.sql("SELECT * FROM hive_random").show
| id|random|
| 3| 13.0|
| 15| 13.0|
| 8| 46.0|
| 9| 65.0|
There is a warning after your saveAsTable call. That's where the hint lies:
'Persisting bucketed data source table default.hive_random into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.'
The reason is that saveAsTable creates RDD partitions but not Hive partitions. The workaround is to create the table via HQL before calling DataFrame.saveAsTable.
I suggest trying a couple of things. First, try setting the Hive execution engine to Spark:
set hive.execution.engine=spark;
Second, try creating the external table in the metastore and then saving the data to that table.
The semantics of bucketed tables differ between Spark and Hive.
The doc has details of the differences in semantics.
It states that
Data is written to bucketed tables but the output does not adhere with expected
bucketing spec. This leads to incorrect results when one tries to consume the
Spark written bucketed table from Hive.
Workaround: If reading from both engines is the requirement, writes need to happen from Hive

Filling blank field in a DataFrame with previous field value

I am working with Scala and Spark and I am relatively new to programming in Scala, so maybe my question has a simple solution.
I have one DataFrame that keeps information about the active and deactivate clients in some promotion. That DataFrame shows the Client Id, the action that he/she took (he can activate or deactivate from the promotion at any time) and the Date that he or she took this action. Here is an example of that format:
Example of how the DataFrame works
I want a daily monitoring of the clients that are active and wish to see how this number varies through the days, but I am not able to code anything that works like that.
My idea was to crossJoin two DataFrames, one with only the Client IDs and another with only the dates, so that every date is paired with every Client ID and I only need to look up the client's status on each date (active or inactive). I then left-joined this new DataFrame with the DataFrame that relates Client IDs to events, but the result has many dates with a "null" status, and I don't know how to fill them with the correct status. Here's the example:
Example of the final DataFrame
I have already tried to use the lag function, but it did not solve my problem. Does anyone have any idea that could help me?
Thank You!
This is a slightly expensive operation, because Spark SQL has restrictions on correlated sub-queries with <, <=, >, >=.
Starting from your second DataFrame with NULLs, and assuming a large enough system and a manageable volume of data:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._ // needed for toDF and the $"..." column syntax
// My sample input
val df = Seq(
(1,"2018-03-12", "activate"),
(1,"2018-03-13", null),
(1,"2018-03-14", null),
(1,"2018-03-15", "deactivate"),
(1,"2018-03-16", null),
(1,"2018-03-17", null),
(1,"2018-03-18", "activate"),
(2,"2018-03-13", "activate"),
(2,"2018-03-14", "deactivate"),
(2,"2018-03-15", "activate")
).toDF("ID", "dt", "act")
val w = Window.partitionBy("ID").orderBy(col("dt").asc)
val df2 = df.withColumn("rank", dense_rank().over(w)).select("ID", "dt","act", "rank") //.where("rank == 1")
val df3 = df2.filter($"act".isNull)
val df4 = df2.filter(!($"act".isNull)).toDF("ID2", "dt2", "act2", "rank2")
val df5 = df3.join(df4, (df3("ID") === df4("ID2")) && (df4("rank2") < df3("rank")),"inner")
val w2 = Window.partitionBy("ID", "rank").orderBy(col("rank2").desc)
val df6 = df5.withColumn("rank_final", dense_rank().over(w2)).where("rank_final == 1").select("ID", "dt","act2").toDF("ID", "dt", "act")
val df7 = df.filter(!($"act".isNull))
val dfFinal = df6.union(df7)
dfFinal.show(false)
|ID |dt |act |
|1 |2018-03-13|activate |
|1 |2018-03-14|activate |
|1 |2018-03-16|deactivate|
|1 |2018-03-17|deactivate|
|1 |2018-03-12|activate |
|1 |2018-03-15|deactivate|
|1 |2018-03-18|activate |
|2 |2018-03-13|activate |
|2 |2018-03-14|deactivate|
|2 |2018-03-15|activate |
I solved this step by step and in a rush, so the approach may not be that obvious.
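A different technique from the join above, sketched here as an alternative and not as the answer's method, is a forward fill with last(..., ignoreNulls = true) over a window that spans from the start of each ID's partition up to the current row. The sample rows mirror the ones above; the SparkSession setup and the names events, ffill and filled are assumptions made so that the snippet stands alone in spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last}

val spark = SparkSession.builder().master("local[*]").appName("forward-fill").getOrCreate()
import spark.implicits._

// Same shape as the sample input above: null means no event on that day.
val events = Seq[(Int, String, String)](
  (1, "2018-03-12", "activate"),
  (1, "2018-03-13", null),
  (1, "2018-03-14", null),
  (1, "2018-03-15", "deactivate"),
  (1, "2018-03-16", null),
  (1, "2018-03-17", null),
  (1, "2018-03-18", "activate"),
  (2, "2018-03-13", "activate"),
  (2, "2018-03-14", "deactivate"),
  (2, "2018-03-15", "activate")
).toDF("ID", "dt", "act")

// Frame: every row of the same ID from the first date up to the current one.
val ffill = Window
  .partitionBy("ID")
  .orderBy("dt")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// last(..., ignoreNulls = true) picks the most recent non-null action in the frame.
val filled = events.withColumn("act", last(col("act"), ignoreNulls = true).over(ffill))

filled.orderBy("ID", "dt").show(false)
```

This needs a single pass with one window and no self-join; it relies on Window.unboundedPreceding and Window.currentRow, which are available from Spark 2.1 onward.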

Concat of all data frame columns using fold, reduce with Spark / Scala

The following works fine with a dynamic column generation:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
import org.apache.spark.sql.DataFrame
val input = sc.parallelize(Seq(
("a", "5a", "7w", "9", "a12", "a13")
)).toDF("ID", "var1", "var2", "var3", "var4", "var5")
val columns_to_concat = input.columns
input.select(concat(columns_to_concat.map(c => col(c)): _*).as("concat_column")).show(false)
|a5a7w9a12a13 |
How can I do this with foldLeft, reduceLeft - whilst retaining the dynamic column generation?
I always get either an error, or a null value returned. Whilst concat suffices, I am curious as to how fold, etc. could work.
It is definitely not the way to go*, but if you treat it as a programming exercise:
import org.apache.spark.sql.functions.{col, concat, lit}
input.columns.map(col(_)).reduce(concat(_, _))
or
input.columns.map(col(_)).foldLeft(lit(""))(concat(_, _))
* Because:
It is a convoluted solution for something that is already provided by a high-level API.
It requires additional work from the planner / optimizer to flatten the recursive expression, not to mention that the expressions don't use tail-call recursion and can simply overflow.

spark (Scala) dataframe filtering (FIR)

Let's say I have a dataframe (stored in a Scala val as df) which contains the data from a CSV file:
I have no problem reading this from file as a Spark dataframe in Scala.
I would like to add a filtered column (by filter I meant signal processing moving average filtering), (say I want to do (T[n]+T[n-1])/2.0):
(Actually, say for the first row, I want 32.5 instead of (65+0)/2.0. I wrote it to clarify the expected 2-time-step filtering operation output)
So how can I achieve this? I am not familiar with Spark dataframe operations that combine rows iteratively along a column...
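Spark 3.1+
Replace $"time".cast("timestamp") in the Spark 2.0+ snippet below with timestamp_seconds:
import org.apache.spark.sql.functions.timestamp_seconds
timestamp_seconds($"time")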
Spark 2.0+
In Spark 2.0 and later it is possible to use the window function as an input for groupBy. It allows you to specify windowDuration, slideDuration and startTime (offset). It works only with a TimestampType column, but it is not that hard to find a workaround for that. In your case it will require some additional steps to correct for boundaries, but the general solution can be expressed as shown below:
import org.apache.spark.sql.functions.{window, avg}
df
  .withColumn("ts", $"time".cast("timestamp"))
  .groupBy(window($"ts", windowDuration="2 seconds", slideDuration="1 second"))
  .agg(avg($"temperature")) // column name assumed from the Spark < 2.0 example below
Spark < 2.0
If there is a natural way to partition your data you can use window functions as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.mean
val w = Window.partitionBy($"id").orderBy($"time").rowsBetween(-1, 0)
val df = sc.parallelize(Seq(
  (1L, 0, 65), (1L, 1, 67), (1L, 2, 62), (1L, 3, 59)
)).toDF("id", "time", "temperature")
df.select($"*", mean($"temperature").over(w).alias("temperatureAvg")).show
// +---+----+-----------+--------------+
// | id|time|temperature|temperatureAvg|
// +---+----+-----------+--------------+
// | 1| 0| 65| 65.0|
// | 1| 1| 67| 66.0|
// | 1| 2| 62| 64.5|
// | 1| 3| 59| 60.5|
// +---+----+-----------+--------------+
You can create windows with arbitrary weights using lead / lag functions:
lit(0.6) * $"temperature" +
lit(0.3) * lag($"temperature", 1) +
lit(0.2) * lag($"temperature", 2)
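The fragment above still has to be bound to a window before it can be evaluated. A minimal end-to-end sketch, assuming the same sample data and the weights from the fragment, could look like the following; the names readings, ordered, weighted and temperatureWeighted are made up for illustration. The window deliberately has no rowsBetween frame, because lag is evaluated against the ordering itself.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, lit}

val spark = SparkSession.builder().master("local[*]").appName("weighted-window").getOrCreate()
import spark.implicits._

// Same sample readings as in the moving-average example above.
val readings = Seq(
  (1L, 0, 65), (1L, 1, 67), (1L, 2, 62), (1L, 3, 59)
).toDF("id", "time", "temperature")

// Ordered window without an explicit frame, as lag requires.
val ordered = Window.partitionBy(col("id")).orderBy(col("time"))

// Weighted sum of the current value and the two previous ones.
val weighted = readings.withColumn(
  "temperatureWeighted",
  lit(0.6) * col("temperature") +
    lit(0.3) * lag(col("temperature"), 1).over(ordered) +
    lit(0.2) * lag(col("temperature"), 2).over(ordered)
)

weighted.orderBy("time").show()
```

The first two rows come out as null because lag has no earlier row to read; passing a default as the third argument of lag is one way to handle that boundary.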
It is still possible without a partitionBy clause, but it will be extremely inefficient. If this is the case, you won't be able to use DataFrames. Instead you can use sliding over an RDD (see for example Operate on neighbor elements in RDD in Spark). There is also the spark-timeseries package, which you may find useful.
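As a rough illustration of the RDD route, the sliding method that spark-mllib adds to RDDs through RDDFunctions can produce the two-step moving average. This sketch assumes spark-mllib is on the classpath and that the RDD is already in time order; the names temps and movingAvg are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.rdd.RDDFunctions._

val spark = SparkSession.builder().master("local[*]").appName("rdd-sliding").getOrCreate()

// Sample temperatures, assumed to be already sorted by time.
val temps = spark.sparkContext.parallelize(Seq(65.0, 67.0, 62.0, 59.0), numSlices = 2)

// sliding(2) yields each pair of consecutive elements, including pairs
// that straddle a partition boundary.
val movingAvg = temps.sliding(2).map(pair => pair.sum / pair.length)

movingAvg.collect().foreach(println)
```

Unlike the window-function version, this produces one value fewer than the input, since the first element has no predecessor to pair with.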