Spark Scala Loop according to a specific Column - scala

I have a table that looks like the one given below:
Col_1 Col_2 Col_3 Col_4 Col_5
1 1a data data data
1 1b data data data
1 1c data data data
1 1d data data data
2 2a data data data
2 2b data data data
2 2c data data data
Col_1 is associated with Col_2.
When I query this for an individual record:
spark.table("table_name").filter($"Col_1" === "1").select("Col_2").distinct.show()
I get the output as:
Col_2
1a
1b
1c
1d
Now I want to run this in a loop: take each value from Col_1 and output the Col_2 values associated with it.
Could someone help me with this?

One possible solution:
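import org.apache.spark.sql.DataFrame
import spark.implicits._
val inDf: DataFrame = spark.table("table_name")
// One DataFrame of Col_2 values per distinct Col_1 value
val res: Array[DataFrame] = inDf.select($"Col_1")
  .distinct()
  .collect().map(_.get(0))
  .par // optional: run in parallel (Scala 2.13+ needs the scala-parallel-collections module)
  .map { col1 =>
    inDf.filter($"Col_1" === col1)
      .select($"Col_2")
  }.toArray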
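If the goal is just the list of Col_2 values for every Col_1 value, the per-key loop can be replaced by a single grouping pass (in Spark, a groupBy on Col_1 with collect_set on Col_2). The following is a minimal plain-Python sketch of that grouping on the sample rows, without Spark; the helper name col2_by_col1 is hypothetical.

```python
from collections import defaultdict

# Sample rows from the question, reduced to (Col_1, Col_2)
rows = [
    ("1", "1a"), ("1", "1b"), ("1", "1c"), ("1", "1d"),
    ("2", "2a"), ("2", "2b"), ("2", "2c"),
]

def col2_by_col1(rows):
    """Collect the distinct Col_2 values for each Col_1 value, in input order."""
    grouped = defaultdict(list)
    for col1, col2 in rows:
        if col2 not in grouped[col1]:
            grouped[col1].append(col2)
    return dict(grouped)

result = col2_by_col1(rows)
for key, values in result.items():
    print(key, values)
```

A single grouped result avoids launching one Spark job per key, which tends to matter when Col_1 has many distinct values.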

Pyspark fill null value of a column based on value of another column

I have a dataframe with 2 columns: col1 and col2:
col1 col2
aaa 111
null 222
ccc 333
I want to fill the null values (here the 2nd row of col1).
Here for example the logic I want to use is: if col2 is 222 and col1 is null, use the arbitrary string "zzz". For each possibility in col2, I have an arbitrary string I want to fill col1 if it's null (if it's not, I just want to get the value that is already in col1).
My idea was to do something like this:
mapping = {"222":"zzz", "444":"fff"}
df = df.select(F.when(F.col('col1').isNull(), mapping[F.col('col2')]).otherwise(F.col('col1')))
I know F.col() returns a Column object, so I can't simply use it as a dictionary key like this.
What is the simplest way to achieve the result I want with PySpark?
This should work:
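from itertools import chain
from pyspark.sql.functions import col, create_map, lit, when
mapping = {"222": "zzz", "444": "fff"}
# Build a literal map column from the alternating keys and values
mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])
# Fill col1 from the map only where it is null; col2 is kept untouched
df = df.withColumn('col1', when(col('col1').isNull(), mapping_expr[col('col2')]).otherwise(col('col1')))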
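To see why the answer flattens the dictionary with chain, note that create_map expects its arguments as alternating keys and values. The plain-Python sketch below (no Spark needed) shows the flattened list and mimics the row-wise fill logic on the question's sample data; fill_col1 is a hypothetical helper used only for illustration, and it returns None when col2 has no entry in the mapping.

```python
from itertools import chain

mapping = {"222": "zzz", "444": "fff"}

# chain(*mapping.items()) interleaves keys and values: key1, value1, key2, value2, ...
flat = list(chain(*mapping.items()))
print(flat)

def fill_col1(col1, col2, mapping):
    """Return col1 unless it is missing, in which case look col2 up in the mapping."""
    if col1 is None:
        return mapping.get(col2)
    return col1

# Sample rows from the question as (col1, col2); None stands for null
rows = [("aaa", "111"), (None, "222"), ("ccc", "333")]
filled = [(fill_col1(c1, c2, mapping), c2) for c1, c2 in rows]
print(filled)
```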

Can we do transformations in Druid?

I have a scenario where I receive data in CSV files and need to derive some new columns from the existing ones.
Example:
Col_1 Col_2 Col_3 Col_4
abc 1 No 123
xyz 2 Yes 123
def 1 Yes 345
Expected:
Col_1 Col_2 Col_3 Col_4 Col_5 Col_6
abc 1 No 123 1 1
xyz 2 Yes 123 0 0
def 1 Yes 345 0 0
Col_5 Condition : if Col_1 = 'abc' then 1 else 0 end
Col_6 Condition : max(Col_5) over (Col_2)
I know we can apply transformations in Druid while loading a file. I tried a simpler condition and it works fine, but I doubt whether aggregates and other transformations like Col_6 can be done this way.
We also need to aggregate across data from different files. Assume we receive 2 files today and load them into a Druid table, and tomorrow we receive 3 more files containing data for the same ID (Col_2 here). The aggregation, such as the Col_6 calculation, then has to cover all the records loaded so far.
Is this possible in Druid?
Take a look at https://druid.apache.org/docs/latest/misc/math-expr.html
which contains many transform expressions you can use.
In particular, I tested your use case with the wikipedia demo data by creating the following expressions:
{
  "type": "expression",
  "name": "isNB",
  "expression": "case_simple(\"namespace\", 'Main', 1, 0)"
},
{
  "type": "expression",
  "name": "combined_calc",
  "expression": "greatest(case_simple(\"IsNew\", True, 1, 0), case_simple(\"namespace\", 'Main', 1, 0))"
}
One thing to note is that transform expressions cannot refer to other transform expressions, so calculations need to all be done from the raw input fields.
Col_5 Condition : if Col_1 = 'abc' then 1 else 0
You can use the following:
from pyspark.sql import Window, functions as f
df = df.withColumn('Col_5', f.when(f.col('Col_1') == 'abc', 1).otherwise(0))
Col_6 Condition : max(Col_5) over (Col_2)
You can apply a window operation:
# Rank rows within each Col_2 partition, highest Col_5 first
windowSpec = Window.partitionBy("Col_2").orderBy(f.col("Col_5").desc())
df_max = df.withColumn("row_number", f.row_number().over(windowSpec)).filter(
    f.col("row_number") == 1
)
df_max now holds one row per Col_2 carrying the maximum Col_5; join it back to your main df on Col_2.
The snippet above is in Python, but the Spark API is the same in other languages, so you can use it with minimal changes.
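The two column conditions can also be checked by hand. Below is a minimal plain-Python sketch (no Spark or Druid involved) that applies the Col_5 flag and then the per-Col_2 maximum to the question's three sample rows; the function name add_flags is hypothetical.

```python
# Sample rows from the question
rows = [
    {"Col_1": "abc", "Col_2": 1, "Col_3": "No", "Col_4": 123},
    {"Col_1": "xyz", "Col_2": 2, "Col_3": "Yes", "Col_4": 123},
    {"Col_1": "def", "Col_2": 1, "Col_3": "Yes", "Col_4": 345},
]

def add_flags(rows):
    """Add Col_5 (1 if Col_1 == 'abc' else 0) and Col_6 (max of Col_5 per Col_2)."""
    flagged = [dict(r, Col_5=1 if r["Col_1"] == "abc" else 0) for r in rows]
    # First pass: maximum Col_5 seen for each Col_2 value
    group_max = {}
    for r in flagged:
        group_max[r["Col_2"]] = max(group_max.get(r["Col_2"], 0), r["Col_5"])
    # Second pass: attach the group maximum to every row
    return [dict(r, Col_6=group_max[r["Col_2"]]) for r in flagged]

for row in add_flags(rows):
    print(row["Col_1"], row["Col_5"], row["Col_6"])
```

Under a strict reading of max(Col_5) over (Col_2), the def row gets Col_6 = 1 because it shares Col_2 = 1 with the abc row, whereas the expected table in the question shows 0 for that row.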
The first type, if Col_1 = 'abc' then 1 else 0, would not be hard; e.g., see this article with similar examples.
The second, aggregating over one of the columns, doesn't sound possible. We can aggregate over all the dimensions taken together (like a primary key), but not over one single dimension, afaik.

Persisting loop dataframes for group concat functions in Pyspark

I'm trying to aggregate a spark dataframe up to a unique ID, selecting the first non-null value from that column for that ID given a sort column. Basically replicating MySQL's group_concat function.
The SO post "Spark SQL replacement for MySQL's GROUP_CONCAT aggregate function" was very helpful in replicating group_concat for a single column, but I need to do this for a dynamic list of columns.
I would rather not copy this code for each column (a dozen or more, and the list could be dynamic in the future), so I am trying to implement it in a loop over a list of column names (frowned on in Spark, I know!). The loop runs successfully, but the results of previous iterations don't persist, even when the intermediate df is cached/persisted (re: "Cacheing and Loops in (Py)Spark").
Any help, pointers, or a more elegant non-looping solution would be appreciated (I'm not afraid to try a bit of Scala if a functional programming approach is more suitable)!
Given the following df:
unique_id row_id first_name last_name middle_name score
1000000 1000002 Simmons Bonnie Darnell 88
1000000 1000006 Dowell Crawford Anne 87
1000000 1000007 NULL Eric Victor 89
1000000 1000000 Zachary Fields Narik 86
1000000 1000003 NULL NULL Warren 92
1000000 1000008 Paulette Ronald Irvin 85
group_column = "unique_id"
concat_list = ['first_name','last_name','middle_name']
sort_column = "score"
sort_order = False
df_final=df.select(group_column).distinct()
for i in concat_list:\
df_helper=df
df_helper=df_helper.groupBy(group_column)\
.agg(sort_array(collect_list(struct(sort_column,i)),sort_order).alias('collect_list'))\
.withColumn("sorted_list",col("collect_list."+str(i)))\
.withColumn("first_item",slice(col("sorted_list"),1,1))\
.withColumn(i,concat_ws(",",col("first_item")))\
.drop("collect_list")\
.drop("sorted_list")\
.drop("first_item")
print(i)
df_final=df_final.join(df_helper,group_column,"inner")
df_final.cache()
df_final.display() #I'm using databricks
My result looks like:
unique_id middle_name
1000000 Warren
My desired result is:
unique_id first_name last_name middle_name
1000000 Simmons Eric Warren
I found a solution to my own question: Add a .collect() call on my dataframe as I join to it, not a persist() or cache(); this will produce the expected dataframe.
from pyspark.sql.functions import col, collect_list, concat_ws, slice, sort_array, struct
group_column = "unique_id"
concat_list = ['first_name','last_name','middle_name']
sort_column = "score"
sort_order = False
df_final=df.select(group_column).distinct()
# Note: no backslash after the colon; "for ...:\" would fold the next line into a one-line loop body
for i in concat_list:
    df_helper=df
    df_helper=df_helper.groupBy(group_column)\
        .agg(sort_array(collect_list(struct(sort_column,i)),sort_order).alias('collect_list'))\
        .withColumn("sorted_list",col("collect_list."+str(i)))\
        .withColumn("first_item",slice(col("sorted_list"),1,1))\
        .withColumn(i,concat_ws(",",col("first_item")))\
        .drop("collect_list")\
        .drop("sorted_list")\
        .drop("first_item")
    print(i)
    df_final=df_final.join(df_helper,group_column,"inner")
    df_final.collect()
df_final.display() #I'm using databricks
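For reference, the intended result (the first non-null value of each column per unique_id, taken in descending score order) can be stated as a small plain-Python sketch on the sample rows. It does not use Spark, and first_non_null is a hypothetical helper name; it only pins down the logic the loop is meant to implement.

```python
# Sample rows from the question; None stands for NULL
columns = ["unique_id", "row_id", "first_name", "last_name", "middle_name", "score"]
data = [
    (1000000, 1000002, "Simmons", "Bonnie", "Darnell", 88),
    (1000000, 1000006, "Dowell", "Crawford", "Anne", 87),
    (1000000, 1000007, None, "Eric", "Victor", 89),
    (1000000, 1000000, "Zachary", "Fields", "Narik", 86),
    (1000000, 1000003, None, None, "Warren", 92),
    (1000000, 1000008, "Paulette", "Ronald", "Irvin", 85),
]
rows = [dict(zip(columns, values)) for values in data]

def first_non_null(rows, group_column, concat_list, sort_column, descending=True):
    """For each group, keep the first non-null value of every listed column
    when rows are visited in sort_column order."""
    result = {}
    ordered = sorted(rows, key=lambda r: r[sort_column], reverse=descending)
    for r in ordered:
        picked = result.setdefault(r[group_column], {})
        for c in concat_list:
            if c not in picked and r[c] is not None:
                picked[c] = r[c]
    return result

summary = first_non_null(rows, "unique_id", ["first_name", "last_name", "middle_name"], "score")
print(summary)
```

Because every column is resolved in one pass over the sorted rows, no per-column join is needed in this formulation.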

How to generate cumulative concatenation in Spark SQL

My Input for spark is below:
Col_1 Col_2 Amount
1 0 35/310320
1 1 35/5
1 1 180/-310350
17 1 0/1000
17 17 0/-1000
17 17 74/314322
17 17 74/5
17 17 185/-3142
I want to generate the below Output using spark SQL:
Output
35/310320
35/310320/35/5
35/310320/35/5/180/-310350
0/1000
0/1000/0/-1000
0/1000/0/-1000/74/314322
0/1000/0/-1000/74/314322/74/5
0/1000/0/-1000/74/314322/74/5/185/-3142
Conditions and procedure: if the col_1 and col_2 values differ, use the current Amount value as the new Output value; if they are the same, concatenate all previous Amount values with the current one, separated by /.
For example, for col_1 = 17 the first row has different col_1 and col_2 values, so the output is the current amount, 0/1000. In the next row both columns have the same value, so the output is 0/1000/0/-1000, and so on. I need to implement this logic for dynamic data in Spark SQL or Spark Scala.
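You can use concat_ws on a list of amounts obtained from collect_list over an appropriate window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, concat_ws}
val df2 = df.withColumn(
  "output",
  concat_ws(
    "/",
    // running list of amounts from the start of each col_1 partition up to the current row
    collect_list("amount").over(
      Window.partitionBy("col_1")
        .orderBy("col_2")
        .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )
  )
)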
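The running window above amounts to a per-group cumulative join of the amounts. Here is a minimal plain-Python sketch of that logic on the question's sample data, without Spark; cumulative_concat is a hypothetical helper, and it assumes the rows already arrive in the desired order within each Col_1 group (ordering by col_2 alone leaves ties between rows with equal col_2).

```python
from itertools import accumulate, groupby

# Sample rows from the question as (Col_1, Col_2, Amount), in input order
rows = [
    (1, 0, "35/310320"),
    (1, 1, "35/5"),
    (1, 1, "180/-310350"),
    (17, 1, "0/1000"),
    (17, 17, "0/-1000"),
    (17, 17, "74/314322"),
    (17, 17, "74/5"),
    (17, 17, "185/-3142"),
]

def cumulative_concat(rows, sep="/"):
    """Within each run of equal Col_1 values, join all amounts seen so far with sep."""
    out = []
    for _, group in groupby(rows, key=lambda r: r[0]):
        amounts = [r[2] for r in group]
        # accumulate yields the first amount, then first+second, and so on
        out.extend(accumulate(amounts, lambda acc, cur: acc + sep + cur))
    return out

output = cumulative_concat(rows)
for line in output:
    print(line)
```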

Appending multiple samples of a column into dataframe in spark

I have n values in a Spark column. I want to create a Spark dataframe with k columns (where k is the number of samples) and m rows (where m is the sample size). I tried using withColumn, but it does not work. Joining on a generated unique id would be very inefficient for me.
e.g. the Spark column has the following values:
102
320
11
101
2455
124
I want to create 2 samples of fraction 0.5 as columns in a dataframe.
So the sampled dataframe will look something like:
sample1,sample2
320,101
124,2455
2455,11
Let df have a column UNIQUE_ID_D from which I need k samples. Here is sample code for k = 2:
var df1 = df.select("UNIQUE_ID_D").sample(false, 0.1).withColumnRenamed("UNIQUE_ID_D", "ID_1")
var df2 = df.select("UNIQUE_ID_D").sample(false, 0.1).withColumnRenamed("UNIQUE_ID_D", "ID_2")
df1.withColumn("NEW_UNIQUE_ID", df2.col("ID_2")).show
This won't work, since withColumn cannot access a column of df2.
The only way is to join df1 and df2 after adding a sequence column (the join column) to both dataframes.
That is very inefficient for my use case: to take 100 samples I would need to join 100 times in a loop for a single column, and I need to perform this operation for every column in the original df.
How could I achieve this?
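To make the target layout concrete, here is a minimal plain-Python sketch (no Spark) that draws k samples of m values each and zips them into rows of k columns. The helper sample_columns and its seed parameter are hypothetical, and it assumes an exact sample size of m = n * fraction, whereas Spark's sample(fraction) only returns approximately that many rows.

```python
import random

def sample_columns(values, k, fraction, seed=None):
    """Draw k samples (without replacement within a sample) and return them
    as rows with one column per sample."""
    rng = random.Random(seed)
    m = int(len(values) * fraction)  # rows per sample
    samples = [rng.sample(values, m) for _ in range(k)]
    # zip(*samples) turns k lists of length m into m rows of k columns
    return list(zip(*samples))

values = [102, 320, 11, 101, 2455, 124]
table = sample_columns(values, k=2, fraction=0.5, seed=0)
print("sample1,sample2")
for row in table:
    print(",".join(str(v) for v in row))
```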