How to generate cumulative concatenation in Spark SQL - scala

My input for Spark is below:
Col_1  Col_2  Amount
1      0      35/310320
1      1      35/5
1      1      180/-310350
17     1      0/1000
17     17     0/-1000
17     17     74/314322
17     17     74/5
17     17     185/-3142
I want to generate the output below using Spark SQL:
Output
35/310320
35/310320/35/5
35/310320/35/5/180/-310350
0/1000
0/1000/0/-1000
0/1000/0/-1000/74/314322
0/1000/0/-1000/74/314322/74/5
0/1000/0/-1000/74/314322/74/5/185/-3142
Conditions & Procedure: If col_1 and col_2 values are not the same then consider the current amount value for the new Output column but both are the same then concatenate the previous all amount value by /.
i.e. 17 from col_1 where col_1 & col_2 value are different so consider current amount 0/1000. Next step both column values is the same so the value is 0/1000/0/-1000 and so on. Need to create this logic for dynamic data in spark SQL or Spark Scala.

You can use concat_ws on the list of amount values obtained from collect_list over an appropriate window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df2 = df.withColumn(
  "output",
  concat_ws(
    "/",
    collect_list("amount").over(
      Window.partitionBy("col_1")
        .orderBy("col_2")
        .rowsBetween(Window.unboundedPreceding, 0)
    )
  )
)
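For reference, a minimal end-to-end sketch with the sample data from the question (the local SparkSession setup and show() call are only illustrative; note that orderBy("col_2") alone does not break ties between rows with equal col_2, so a real dataset may need an explicit ordering column):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Input rows taken from the question.
val df = Seq(
  (1, 0, "35/310320"),
  (1, 1, "35/5"),
  (1, 1, "180/-310350"),
  (17, 1, "0/1000"),
  (17, 17, "0/-1000"),
  (17, 17, "74/314322"),
  (17, 17, "74/5"),
  (17, 17, "185/-3142")
).toDF("col_1", "col_2", "amount")

val df2 = df.withColumn(
  "output",
  concat_ws(
    "/",
    collect_list("amount").over(
      Window.partitionBy("col_1")
        .orderBy("col_2")
        .rowsBetween(Window.unboundedPreceding, 0)
    )
  )
)

// Should print the cumulative concatenations shown in the question,
// e.g. 35/310320, then 35/310320/35/5, and so on.
df2.show(false)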

Related

Persisting loop dataframes for group concat functions in Pyspark

I'm trying to aggregate a Spark dataframe up to a unique ID, selecting the first non-null value from each column for that ID given a sort column, basically replicating MySQL's group_concat function.
The SO post Spark SQL replacement for MySQL's GROUP_CONCAT aggregate function was very helpful in replicating group_concat for a single column. I need to do this for a dynamic list of columns.
I would rather not have to copy this code for each column (a dozen or more, and it could be dynamic in the future), so I am trying to implement it in a loop (frowned on in Spark, I know!) given a list of column names. The loop runs successfully, but the previous iterations don't persist even when the intermediate df is cached/persisted (re: Cacheing and Loops in (Py)Spark).
Any help, pointers or a more elegant non-looping solution would be appreciated (not afraid to try a bit of Scala if there is a functional programming approach more suitable)!
Given the following df:
unique_id  row_id   first_name  last_name  middle_name  score
1000000    1000002  Simmons     Bonnie     Darnell      88
1000000    1000006  Dowell      Crawford   Anne         87
1000000    1000007  NULL        Eric       Victor       89
1000000    1000000  Zachary     Fields     Narik        86
1000000    1000003  NULL        NULL       Warren       92
1000000    1000008  Paulette    Ronald     Irvin        85
from pyspark.sql.functions import col, collect_list, concat_ws, slice, sort_array, struct

group_column = "unique_id"
concat_list = ['first_name','last_name','middle_name']
sort_column = "score"
sort_order = False

df_final = df.select(group_column).distinct()

for i in concat_list:
    df_helper = df
    df_helper = df_helper.groupBy(group_column)\
        .agg(sort_array(collect_list(struct(sort_column, i)), sort_order).alias('collect_list'))\
        .withColumn("sorted_list", col("collect_list." + str(i)))\
        .withColumn("first_item", slice(col("sorted_list"), 1, 1))\
        .withColumn(i, concat_ws(",", col("first_item")))\
        .drop("collect_list")\
        .drop("sorted_list")\
        .drop("first_item")
    print(i)
    df_final = df_final.join(df_helper, group_column, "inner")

df_final.cache()
df_final.display()  # I'm using Databricks
My result looks like:
unique_id  middle_name
1000000    Warren
My desired result is:
unique_id  first_name  last_name  middle_name
1000000    Simmons     Eric       Warren
I found a solution to my own question: Add a .collect() call on my dataframe as I join to it, not a persist() or cache(); this will produce the expected dataframe.
group_column = "unique_id"
concat_list = ['first_name','last_name','middle_name']
sort_column = "score"
sort_order = False

df_final = df.select(group_column).distinct()

for i in concat_list:
    df_helper = df
    df_helper = df_helper.groupBy(group_column)\
        .agg(sort_array(collect_list(struct(sort_column, i)), sort_order).alias('collect_list'))\
        .withColumn("sorted_list", col("collect_list." + str(i)))\
        .withColumn("first_item", slice(col("sorted_list"), 1, 1))\
        .withColumn(i, concat_ws(",", col("first_item")))\
        .drop("collect_list")\
        .drop("sorted_list")\
        .drop("first_item")
    print(i)
    df_final = df_final.join(df_helper, group_column, "inner")
    df_final.collect()

df_final.display()  # I'm using Databricks
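Since the question mentions being open to a bit of Scala, here is a hedged, non-looping sketch of the same idea: build one aggregation expression per column with a comprehension and run a single groupBy/agg, so no per-column join is needed. It assumes df is the same table loaded as a Scala DataFrame and mirrors the sort-descending/take-first rule of the loop body; whether that rule handles NULL values exactly as desired should be verified against the data.

import org.apache.spark.sql.functions._

// Same configuration as the Python version.
val groupColumn = "unique_id"
val concatList = Seq("first_name", "last_name", "middle_name")
val sortColumn = "score"

// One expression per column: sort the (score, value) structs descending and
// keep the value from the first (highest-score) entry, as the loop body does.
val aggExprs = concatList.map { c =>
  sort_array(collect_list(struct(col(sortColumn), col(c))), asc = false)(0)(c).alias(c)
}

val dfFinal = df.groupBy(groupColumn).agg(aggExprs.head, aggExprs.tail: _*)
dfFinal.show(false)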

Flatten Json Key, values in Pyspark

In a table with 2 columns and 2 records:
Record 1: column my_col has the value {"XXX": ["123","456"],"YYY": ["246","135"]} and column ID has the value A123
Record 2: column my_col has the value {"XXX": ["123","456"],"YYY": ["246","135"],"ZZZ":["333","444"]} and column ID has the value B222
I need to parse/flatten this using PySpark.
Expectation:
Key  Value  ID
XXX  123    A123
XXX  456    A123
YYY  246    A123
YYY  135    A123
ZZZ  333    B222
ZZZ  444    B222
If your column is a string, you may use from_json with a custom schema to convert it to a MapType before using explode to extract the desired results. I assume that your initial column is named my_col and that your data is in a dataframe named input_df.
An example is shown below
Approach 1: Using the PySpark API
from pyspark.sql import functions as F
from pyspark.sql import types as T

custom_schema = T.MapType(T.StringType(), T.ArrayType(T.StringType()))

output_df = (
    input_df.select(
        F.from_json(F.col('my_col'), custom_schema).alias('my_col_json')
    )
    .select(F.explode('my_col_json'))
    .select(
        F.col('key'),
        F.explode('value')
    )
)
Approach 2: Using Spark SQL
# Step 1: Create a temporary view that may be queried
input_df.createOrReplaceTempView("input_df")

# Step 2: Run the following sql on your spark session
output_df = sparkSession.sql("""
    SELECT
        key,
        EXPLODE(value)
    FROM (
        SELECT
            EXPLODE(from_json(my_col, "MAP<STRING,ARRAY<STRING>>"))
        FROM
            input_df
    ) t
""")
If the column is already a parsed map (rather than a JSON string), the from_json step can be skipped:
from pyspark.sql import functions as F

output_df = (
    input_df.select(F.explode('my_col_json'))
    .select(
        F.col('key'),
        F.explode('value')
    )
)
or
# Step 1: Create a temporary view that may be queried
input_df.createOrReplaceTempView("input_df")

# Step 2: Run the following sql on your spark session
output_df = sparkSession.sql("""
    SELECT
        key,
        EXPLODE(value)
    FROM (
        SELECT
            EXPLODE(my_col)
        FROM
            input_df
    ) t
""")
Let me know if this works for you.

KDB/Q-sql Dynamic Grouping and Concatenating Columns in Output

I have a table where I have to group by dynamic columns and perform an aggregation; each result row should be labelled by concatenating the group-by values with the aggregated column, both supplied by the user.
For example:
g1 g2 g3 g4 col1 col2
A D F H 10 20
A E G I 11 21
B D G J 12 22
B E F L 13 23
C D F M 14 24
C D G M 15 25
If I need to group by g1, g2, g4 and perform an avg aggregation on col1, the output should look like this:
filed val
Avg[A-D-H-col1] 10.0
Avg[A-E-I-col1] 11.0
Avg[B-D-J-col1] 12.0
Avg[B-E-L-col1] 13.0
Avg[C-D-M-col1] 14.5
I am able to do this with q-sql if my group-by columns are fixed:
t:([]g1:`A`A`B`B`C`C;g2:`D`E`D`E`D`D;g3:`F`G`G`F`F`G;g4:`H`I`J`L`M`M;col1:10 11 12 13 14 15;col2:20 21 22 23 24 25)
select filed:first ("Avg[",/:(({"-" sv x} each string (g1,'g2,'g4)),\:"-col1]")),val: avg col1 by g1,g2,g4 from t
I want to use a functional query for the same, i.e. a function which takes a list of group-by columns, the aggregation to perform, the column name and the table name as input, and produces output like the query above. I can perform the group-by easily with dynamic columns, but I am not able to concatenate them into the filed column. The function signature will be something like this:
fun{[glist; agg; col; t] .. ; ... }[g1g2g4; avg; col1; t]
Please help me make the above query dynamic.
You may try the following function:
specialGroup: {[glist;agg;col;table]
  res: ?[table;();{x!x}glist; enlist[`val]!enlist(agg;col)];
  aggname: string agg;
  aggname: upper[1#aggname], 1_aggname;
  res: ![res;();0b;enlist[`filed]!enlist({(y,"["),/:("-"sv/:string flip x),\:"]"};enlist,glist,enlist[enlist col];aggname)];
  res
  };
specialGroup[`g1`g2`g4;avg;`col1;t]
specialGroup aggregates values into the val column first and populates the filed column after grouping. This avoids generating duplicates in filed and then having to select the first of them.
If you modify Anton's code as follows, it will derive the aggregation name in the output dynamically:
specialGroup: {[glist;agg;col;table]
  res: ?[table;();{x!x}glist; enlist[`val]!enlist(agg;col)];
  res: ![res;();0b;enlist[`filed]!enlist({(#[string[y];0;upper],"["),/:("-"sv/:string flip x),\:"]"}[;agg];enlist,glist,enlist[enlist col])];
  res
  };
As the part of the code that builds that string is inside another function, you need to pass the agg parameter to the inner function.

Spark Scala Loop according to a specific Column

I have a table that looks like the one given below:
Col_1 Col_2 Col_3 Col_4 Col_5
1 1a data data data
1 1b data data data
1 1c data data data
1 1d data data data
2 2a data data data
2 2b data data data
2 2c data data data
Col_1 is associated with Col_2.
When I query this for an individual record:
val A = spark.table("table_name").filter($"Col_1" === "1").select("Col_2").distinct.show()
I get the output as:
Col_2
1a
1b
1c
1d
Now I want to run this as a loop, taking each value from Col_1 and giving the output associated with Col_2.
Could someone help me with this?
One possible solution:
import org.apache.spark.sql.DataFrame
import spark.implicits._

val inDf: DataFrame = spark.table("table_name")

val res: Array[DataFrame] = inDf.select($"Col_1")
  .distinct()
  .collect().map(_.get(0))
  .par // optional, to execute in parallel
  .map { col1 =>
    inDf.filter($"Col_1" === col1)
      .select($"Col_2")
  }.toArray
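If the goal, as in the original per-record query, is to look at the distinct Col_2 values for each Col_1, the returned array can simply be traversed; a small illustrative follow-up:

// Show the distinct Col_2 values for each Col_1 group in turn.
res.foreach(_.distinct().show(false))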

Appending multiple samples of a column into a dataframe in Spark

I have n values in a Spark column. I want to create a Spark dataframe of k columns (where k is the number of samples) and m rows (where m is the sample size). I tried using withColumn, but it does not work. A join via a newly created unique id would be very inefficient for me.
E.g. the Spark column has the following values:
102
320
11
101
2455
124
I want to create 2 samples of fraction 0.5 as columns in a dataframe, so the sampled dataframe will be something like:
sample1  sample2
320      101
124      2455
2455     11
Let df have a column UNIQUE_ID_D; I need k samples from this column. Here is sample code for k = 2:
var df1 = df.select("UNIQUE_ID_D").sample(false, 0.1).withColumnRenamed("UNIQUE_ID_D", "ID_1")
var df2 = df.select("UNIQUE_ID_D").sample(false, 0.1).withColumnRenamed("UNIQUE_ID_D", "ID_2")
df1.withColumn("NEW_UNIQUE_ID", df2.col("ID_2")).show
This won't work, since withColumn cannot access a column of df2.
The only way is to join df1 and df2 by adding a sequence column (join column) to both dataframes.
That is very inefficient for my use case: if I want to take 100 samples, I need to join 100 times in a loop for a single column, and I need to perform this operation for all columns in the original df.
How could I achieve this?
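One join-free sketch in Scala, under the assumption that UNIQUE_ID_D holds long values and that the k samples are small enough to collect to the driver (all names and sizes below are illustrative, not a definitive implementation):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

import spark.implicits._

val k = 2            // number of samples
val fraction = 0.5   // sample fraction

// Draw k independent samples of the column and pull each one to the driver.
val samples: Seq[Array[Long]] =
  (1 to k).map(_ => df.select("UNIQUE_ID_D").sample(false, fraction).as[Long].collect())

// Truncate to the shortest sample so the rows line up, then zip the samples column-wise.
val m = samples.map(_.length).min
val rows = (0 until m).map(i => Row.fromSeq(samples.map(_(i))))
val schema = StructType((1 to k).map(j => StructField(s"sample$j", LongType)))

val sampledDf = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
sampledDf.show()

This trades driver memory (proportional to the sample sizes) for avoiding any join or sequence-id bookkeeping.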