Spark/Scala - RDD fill with last non null value - scala

I have an rdd that looks like this:
timestamp,user_id,search_id
[2021-08-14 14:38:31,user_a,null]
[2021-08-14 14:42:01,user_a,ABC]
[2021-08-14 14:55:12,user_a,null]
[2021-08-14 14:56:19,user_a,null]
[2021-08-14 15:01:36,user_a,null]
[2021-08-14 15:02:22,user_a,null]
[2021-08-15 07:38:07,user_b,XYZ]
[2021-08-15 07:39:59,user_b,null]
I would like to associate the events that do not have a search_id with previous search_ids by filling the null values in "search_id" with the latest non-null value (when there is one), grouped by user_id.
Therefore, my output would look like this:
timestamp,user_id,search_id
[2021-08-14 14:38:31,user_a,null]
[2021-08-14 14:42:01,user_a,ABC]
[2021-08-14 14:55:12,user_a,ABC]
[2021-08-14 14:56:19,user_a,ABC]
[2021-08-14 15:01:36,user_a,ABC]
[2021-08-14 15:02:22,user_a,ABC]
[2021-08-15 07:38:07,user_b,XYZ]
[2021-08-15 07:39:59,user_b,XYZ]
I found a solution for Spark dataframes that uses org.apache.spark.sql.functions.last and a window here --> Spark Window function last not null value, but my context doesn't allow me to convert the RDD to a dataframe at the moment, so I was wondering if any of you had an idea of how this could be done.

I guess groupBy user_id (https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/rdd/RDD.html#groupBy(scala.Function1,%20scala.reflect.ClassTag) ) and then flatMap over each group (don't forget to sort the grouped items, because groupBy doesn't preserve order), which will fix your search ids; see the sketch below. All this presumes you don't have too many items per user.
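A rough sketch of that comment's idea on a plain RDD (just an illustration, assuming your RDD, called rdd here, holds (timestamp, user_id, search_id) tuples with the missing search_id represented as Option[String], and that the ISO timestamps sort correctly as strings):

// Pull all events of one user together, sort them, then carry the last seen search_id forward.
val filled = rdd
  .groupBy { case (_, userId, _) => userId }
  .flatMap { case (_, events) =>
    val sorted = events.toSeq.sortBy { case (ts, _, _) => ts } // groupBy does not preserve order
    var lastSeen: Option[String] = None
    sorted.map { case (ts, userId, searchId) =>
      lastSeen = searchId.orElse(lastSeen) // keep the latest non-null search_id
      (ts, userId, lastSeen)
    }
  }

This keeps a user's leading rows as None until the first real search_id appears, matching the expected output above.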

One way to get this done is to work out the maximum number of times we need to call the lag() function.
Try this.
Input:
val df1=spark.sql("""
select timestamp'2021-08-14 14:38:31' timestamp, 'user_a' user_id, 'null' search_id union all
select '2021-08-14 14:42:01' , 'user_a', 'ABC' union all
select '2021-08-14 14:55:12' , 'user_a', 'null' union all
select '2021-08-14 14:56:19' , 'user_a', 'null' union all
select '2021-08-14 15:01:36' , 'user_a', 'null' union all
select '2021-08-14 15:02:22' , 'user_a', 'null' union all
select '2021-08-15 07:38:07' , 'user_b', 'XYZ' union all
select '2021-08-15 07:39:59' , 'user_b', 'null'
""")
df1.orderBy("timestamp").show(false)
df1.printSchema
df1.createOrReplaceTempView("df1")
+-------------------+-------+---------+
|timestamp |user_id|search_id|
+-------------------+-------+---------+
|2021-08-14 14:38:31|user_a |null |
|2021-08-14 14:42:01|user_a |ABC |
|2021-08-14 14:55:12|user_a |null |
|2021-08-14 14:56:19|user_a |null |
|2021-08-14 15:01:36|user_a |null |
|2021-08-14 15:02:22|user_a |null |
|2021-08-15 07:38:07|user_b |XYZ |
|2021-08-15 07:39:59|user_b |null |
+-------------------+-------+---------+
Now calculate the maximum number of rows per user_id, which is the most times we may need to apply lag():
val max_count = spark.sql(" select max(c) from (select count(*) c from df1 group by user_id)").as[Long].first
max_count: Long = 6
Declare the dataframe as a var so that we can loop and keep reassigning the result to the same reference.
import org.apache.spark.sql.functions.expr

var df2 = df1
for (i <- 1 to max_count.toInt) {
  df2 = df2.withColumn("search_id",
    expr(""" case when search_id <> 'null' then search_id
             else lag(search_id) over(partition by user_id order by timestamp) end """))
}
df2.orderBy("timestamp").show(false)
+-------------------+-------+---------+
|timestamp |user_id|search_id|
+-------------------+-------+---------+
|2021-08-14 14:38:31|user_a |null |
|2021-08-14 14:42:01|user_a |ABC |
|2021-08-14 14:55:12|user_a |ABC |
|2021-08-14 14:56:19|user_a |ABC |
|2021-08-14 15:01:36|user_a |ABC |
|2021-08-14 15:02:22|user_a |ABC |
|2021-08-15 07:38:07|user_b |XYZ |
|2021-08-15 07:39:59|user_b |XYZ |
+-------------------+-------+---------+
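If looping over lag() ever becomes a concern for long runs of nulls, the same fill can usually be done in a single pass with last(..., ignoreNulls = true) over an ordered window, which is the approach in the answer the question links to. A sketch against the same df1 as above (note it treats the string 'null' as missing and produces a real null for leading rows):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last, when}

// Window from the start of each user's history up to the current row, ordered by time.
val w = Window.partitionBy("user_id").orderBy("timestamp")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val df3 = df1.withColumn("search_id",
  last(when(col("search_id") =!= "null", col("search_id")), ignoreNulls = true).over(w))

df3.orderBy("timestamp").show(false)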

Related

Spark (Scala) Pairwise subtract all rows in data frame

I have a data frame that looks something along the lines of:
+-----+-----+------+-----+
|col1 |col2 |col3 |col4 |
+-----+-----+------+-----+
|1.1 |2.3 |10.0 |1 |
|2.2 |1.5 |5.0 |1 |
|3.3 |1.3 |1.5 |1 |
|4.4 |0.5 |7.0 |1 |
|5.5 |1.2 |8.1 |2 |
|6.6 |2.3 |8.2 |2 |
|7.7 |4.5 |10.3 |2 |
+-----+-----+------+-----+
I would like to subtract each row from the row above, but only if they have the same entry in col4, so row 2 - row 1 and row 3 - row 2, but not row 5 - row 4. Also, col4 should not be changed, so the result would be:
+-----+-----+------+------+
|col1 |col2 |col3 |col4 |
+-----+-----+------+------+
|1.1 |-0.8 |-5.0 |1 |
|1.1 |-0.2 |-3.5 |1 |
|1.1 |-0.8 |5.5 |1 |
|1.1 |1.1 |0.1 |2 |
|1.1 |2.2 |2.1 |2 |
+-----+-----+------+------+
This sounds like it'd be simple, but I can't seem to figure it out.
You could accomplish this using Spark SQL, i.e. by creating a temporary view from your dataframe and applying the SQL below. It uses the window function LAG to subtract the previous row's values, ordered by col1 and partitioned by col4. The first row of each col4 group is identified using ROW_NUMBER and filtered out.
df.createOrReplaceTempView("my_temp_view")
val results = spark.sql("<insert sql below here>")
SELECT
  col1,
  col2,
  col3,
  col4
FROM (
  SELECT
    (col1 - LAG(col1, 1, 0) OVER (PARTITION BY col4 ORDER BY col1)) AS col1,
    (col2 - LAG(col2, 1, 0) OVER (PARTITION BY col4 ORDER BY col1)) AS col2,
    (col3 - LAG(col3, 1, 0) OVER (PARTITION BY col4 ORDER BY col1)) AS col3,
    col4,
    ROW_NUMBER() OVER (PARTITION BY col4 ORDER BY col1) AS rn
  FROM my_temp_view
) t
WHERE rn <> 1
db-fiddle
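For reference, roughly the same lag-and-filter logic expressed with the DataFrame API (a sketch; it assumes your dataframe is named df with the columns shown above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, row_number}

val w = Window.partitionBy("col4").orderBy("col1")

val result = df
  .withColumn("rn", row_number().over(w))
  .withColumn("d1", col("col1") - lag("col1", 1).over(w))
  .withColumn("d2", col("col2") - lag("col2", 1).over(w))
  .withColumn("d3", col("col3") - lag("col3", 1).over(w))
  .where(col("rn") =!= 1) // drop the first row of each col4 group, which has nothing above it
  .select(col("d1").as("col1"), col("d2").as("col2"), col("d3").as("col3"), col("col4"))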
Here is just the idea: a self-join based on an RDD with zipWithIndex and back to a DF (some overhead, which you can tailor), z being your col4.
At scale I am not sure what performance the Catalyst Optimizer will deliver; I looked at .explain(true) and am not entirely convinced, but I find the output hard to interpret sometimes. Ordering of the data is guaranteed.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val df = sc.parallelize(Seq((1.0, 2.0, 1), (0.0, -1.0, 1), (3.0, 4.0, 1), (6.0, -2.3, 4))).toDF("x", "y", "z")
val newSchema = StructType(df.schema.fields ++ Array(StructField("rowid", LongType, false)))
val rddWithId = df.rdd.zipWithIndex
val dfZippedWithId = spark.createDataFrame(
  rddWithId.map { case (row, index) => Row.fromSeq(row.toSeq ++ Array(index)) }, newSchema)
dfZippedWithId.show(false)
dfZippedWithId.printSchema()

val res = dfZippedWithId.as("dfZ1")
  .join(dfZippedWithId.as("dfZ2"),
    $"dfZ1.z" === $"dfZ2.z" && $"dfZ1.rowid" === $"dfZ2.rowid" - 1,
    "inner")
  .withColumn("newx", $"dfZ2.x" - $"dfZ1.x") //.explain(true)
res.show(false)
returns the input:
+---+----+---+-----+
|x |y |z |rowid|
+---+----+---+-----+
|1.0|2.0 |1 |0 |
|0.0|-1.0|1 |1 |
|3.0|4.0 |1 |2 |
|6.0|-2.3|4 |3 |
+---+----+---+-----+
and the result which you can tailor by selecting and adding extra calculations:
+---+----+---+-----+---+----+---+-----+----+
|x |y |z |rowid|x |y |z |rowid|newx|
+---+----+---+-----+---+----+---+-----+----+
|1.0|2.0 |1 |0 |0.0|-1.0|1 |1 |-1.0|
|0.0|-1.0|1 |1 |3.0|4.0 |1 |2 |3.0 |
+---+----+---+-----+---+----+---+-----+----+

Spark Dataframe filldown

I would like to do a "filldown" type operation on a dataframe in order to remove nulls and make sure the last row is a kind of summary row, containing the last known values for each column based on the timestamp, grouped by itemId. As I'm using Azure Synapse Notebooks, the language can be Scala, PySpark, Spark SQL or even C#. However, the real solution has up to millions of rows and hundreds of columns, so I need a dynamic solution that can take advantage of Spark. We can provision a big cluster; how do we make sure we take good advantage of it?
Sample data:
// Assign sample data to dataframe
val df = Seq(
( 1, "10/01/2021", 1, "abc", null ),
( 2, "11/01/2021", 1, null, "bbb" ),
( 3, "12/01/2021", 1, "ccc", null ),
( 4, "13/01/2021", 1, null, "ddd" ),
( 5, "10/01/2021", 2, "eee", "fff" ),
( 6, "11/01/2021", 2, null, null ),
( 7, "12/01/2021", 2, null, null )
).
toDF("eventId", "timestamp", "itemId", "attrib1", "attrib2")
df.show
Expected results with rows 4 and 7 as summary rows:
+-------+----------+------+-------+-------+
|eventId| timestamp|itemId|attrib1|attrib2|
+-------+----------+------+-------+-------+
| 1|10/01/2021| 1| abc| null|
| 2|11/01/2021| 1| abc| bbb|
| 3|12/01/2021| 1| ccc| bbb|
| 4|13/01/2021| 1| ccc| ddd|
| 5|10/01/2021| 2| eee| fff|
| 6|11/01/2021| 2| eee| fff|
| 7|12/01/2021| 2| eee| fff|
+-------+----------+------+-------+-------+
I have reviewed this option but had trouble adapting it for my use case.
Spark / Scala: forward fill with last observation
I have a kind-of-working Spark SQL solution, but it will be very verbose for the high volume of columns; I'm hoping for something easier to maintain:
%%sql
WITH cte AS (
SELECT
eventId,
itemId,
ROW_NUMBER() OVER( PARTITION BY itemId ORDER BY timestamp ) AS rn,
attrib1,
attrib2
FROM df
)
SELECT
eventId,
itemId,
CASE rn WHEN 1 THEN attrib1
ELSE COALESCE( attrib1, LAST_VALUE(attrib1, true) OVER( PARTITION BY itemId ) )
END AS attrib1_xlast,
CASE rn WHEN 1 THEN attrib2
ELSE COALESCE( attrib2, LAST_VALUE(attrib2, true) OVER( PARTITION BY itemId ) )
END AS attrib2_xlast
FROM cte
ORDER BY eventId
For many columns you could create an expression as below
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, last}
import spark.implicits._ // for the $"col" syntax, if not already in scope

val window = Window.partitionBy($"itemId").orderBy($"timestamp")
// Instead of listing columns by hand, build the coalesce/last expression for every column
val expr = df.columns
  .map(c => coalesce(col(c), last(col(c), true).over(window)).as(c))
df.select(expr: _*).show(false)
Update:
val mainColumns = df.columns.filterNot(_.startsWith("attrib"))
val aggColumns = df.columns.diff(mainColumns).map(c => coalesce(col(c), last(col(c), true).over(window)).as(c))
df.select(( mainColumns.map(col) ++ aggColumns): _*).show(false)
Result:
+-------+----------+------+-------+-------+
|eventId|timestamp |itemId|attrib1|attrib2|
+-------+----------+------+-------+-------+
|1 |10/01/2021|1 |abc |null |
|2 |11/01/2021|1 |abc |bbb |
|3 |12/01/2021|1 |ccc |bbb |
|4 |13/01/2021|1 |ccc |ddd |
|5 |10/01/2021|2 |eee |fff |
|6 |11/01/2021|2 |eee |fff |
|7 |12/01/2021|2 |eee |fff |
+-------+----------+------+-------+-------+

I need a type of group-sort that I couldn't figure out with ROW_NUMBER on T-SQL

I have a table with a table_id column and 2 other columns. I want a kind of numbering with the ROW_NUMBER function, and I want the result to look like this:
id |col1 |col2 |what I want
------------------------------
1 |x |a |1
2 |x |b |2
3 |x |a |3
4 |x |a |3
5 |x |c |4
6 |x |c |4
7 |x |c |4
Please consider that there's only one x, so "partition by col1" is OK. Other than that, there are two sequences of a's, and they'll be counted separately (not 1,2,1,1,3,3,3), and sorting must be by id, not by col2 (so order by col2 is NOT OK).
I want that number to increase by one any time col2 changes compared to the previous line.
row_number() over (partition by col1 order by col2) DOESN'T WORK, because I want it ordered by id.
Using LAG and a windowed COUNT appears to get you what you are after:
WITH Previous AS(
SELECT V.id,
V.col1,
V.col2,
V.[What I want],
LAG(V.Col2,1,V.Col2) OVER (ORDER BY ID ASC) AS PrevCol2
FROM (VALUES(1,'x','a',1),
(2,'x','b',2),
(3,'x','a',3),
(4,'x','a',3),
(5,'x','c',4),
(6,'x','c',4),
(7,'x','c',4))V(id, col1, col2, [What I want]))
SELECT P.id,
P.col1,
P.col2,
P.[What I want],
COUNT(CASE P.Col2 WHEN P.PrevCol2 THEN NULL ELSE 1 END) OVER (ORDER BY P.ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) +1 AS [What you get]
FROM Previous P;
DB<>Fiddle

Scala: Using a spark sql function when selecting column from a dataframe

I have a two table/dataframe: A and B
A has following columns: cust_id, purch_date
B has one column: cust_id, col1 (col1 is not needed)
Following sample shows content of each table:
Table A
cust_id purch_date
34564 2017-08-21
34564 2017-08-02
34564 2017-07-21
23847 2017-09-13
23423 2017-06-19
Table B
cust_id col1
23442 x
12452 x
12464 x
23847 x
24354 x
I want to select the cust_id and the first day of the month of purch_date, for those cust_ids that are not present in B.
This can be achieved in SQL with the following command:
select a.cust_id, trunc(purch_date, 'MM') as mon
from a
left join b
on a.cust_id = b.cust_id
where b.cust_id is null
group by cust_id, mon;
The following will be the output:
cust_id mon
34564 2017-08-01
34564 2017-07-01
23423 2017-06-01
I tried the following to implement the same in Scala:
import org.apache.spark.sql.functions._
val a = spark.sql("select * from db.a")
val b = spark.sql("select * from db.b")
var out = a.join(b, Seq("cust_id"), "left")
.filter("col1 is null")
.select("cust_id", trunc("purch_date", "month"))
.distinct()
But I am getting different errors like:
error: type mismatch; found: StringContext required: ?{def $: ?}
I am stuck here and couldn't find enough documentation/answers on the net.
select should be given Columns instead of Strings; trunc returns a Column, so you can't mix it with String column names in the same select:
Input:
df1:
+-------+----------+
|cust_id|purch_date|
+-------+----------+
| 34564|2017-08-21|
| 34564|2017-08-02|
| 34564|2017-07-21|
| 23847|2017-09-13|
| 23423|2017-06-19|
+-------+----------+
df2:
+-------+----+
|cust_id|col1|
+-------+----+
| 23442| X|
| 12452| X|
| 12464| X|
| 23847| X|
| 24354| X|
+-------+----+
Change your query as below:
df1.join(df2, Seq("cust_id"), "left").filter("col1 is null")
.select($"cust_id", trunc($"purch_date", "MM"))
.distinct()
.show()
Output:
+-------+---------------------+
|cust_id|trunc(purch_date, MM)|
+-------+---------------------+
| 23423| 2017-06-01|
| 34564| 2017-07-01|
| 34564| 2017-08-01|
+-------+---------------------+
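If you prefer a friendlier column name than trunc(purch_date, MM), you could alias the expression, e.g. trunc($"purch_date", "MM").as("mon"); the name mon here just mirrors the alias in the original SQL.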

Postgresql select rows(a result) as array

Assume I have the following table, plus some data.
create table "common"."log"("id" bigserial primary key,
"level" int not null default 0);
Now I have this select query that would return something like this.
select * from common.log where id=147;
+------+--------+
|id |level |
+------+--------+
|147 |1 |
|147 |2 |
|147 |2 |
|147 |6 |
|147 |90 |
+------+--------+
Now I would like to have something like the following instead of the above:
+------+---------------+
|id |arr_level |
+------+---------------+
|147 |{1,2,2,6,90} |
+------+---------------+
So is there any implicit select clause/way of doing this? Thanks.
pgsql v9.3
You can use the array function like this:
SELECT '147' AS id, array(SELECT level FROM common.log WHERE id = 147) AS arr_level;
Another way, probably more useful if you have more than one id to query:
SELECT id, array_agg(level) FROM common.log GROUP BY id;
See: aggregate functions.