I'm trying to aggregate a Spark dataframe up to a unique ID, selecting the first non-null value for each column for that ID given a sort column, basically replicating MySQL's group_concat function.
The SO post Spark SQL replacement for MySQL's GROUP_CONCAT aggregate function was very helpful in replicating group_concat for a single column. I need to do this for a dynamic list of columns.
I would rather not copy this code for each column (a dozen plus, could be dynamic in the future), so I am trying to implement it in a loop (frowned on in Spark, I know!) given a list of column names. The loop runs successfully, but the results of previous iterations don't persist, even when the intermediate df is cached/persisted (re: Cacheing and Loops in (Py)Spark).
Any help, pointers or a more elegant non-looping solution would be appreciated (not afraid to try a bit of Scala if there is a more suitable functional programming approach)!
Given the following df:
unique_id  row_id   first_name  last_name  middle_name  score
1000000    1000002  Simmons     Bonnie     Darnell      88
1000000    1000006  Dowell      Crawford   Anne         87
1000000    1000007  NULL        Eric       Victor       89
1000000    1000000  Zachary     Fields     Narik        86
1000000    1000003  NULL        NULL       Warren       92
1000000    1000008  Paulette    Ronald     Irvin        85
from pyspark.sql.functions import col, collect_list, concat_ws, slice, sort_array, struct

group_column = "unique_id"
concat_list = ['first_name', 'last_name', 'middle_name']
sort_column = "score"
sort_order = False

df_final = df.select(group_column).distinct()

for i in concat_list:
    df_helper = df
    df_helper = df_helper.groupBy(group_column)\
        .agg(sort_array(collect_list(struct(sort_column, i)), sort_order).alias('collect_list'))\
        .withColumn("sorted_list", col("collect_list." + str(i)))\
        .withColumn("first_item", slice(col("sorted_list"), 1, 1))\
        .withColumn(i, concat_ws(",", col("first_item")))\
        .drop("collect_list")\
        .drop("sorted_list")\
        .drop("first_item")
    print(i)
    df_final = df_final.join(df_helper, group_column, "inner")

df_final.cache()
df_final.display()  # I'm using Databricks
My result looks like:
unique_id  middle_name
1000000    Warren
My desired result is:
unique_id  first_name  last_name  middle_name
1000000    Simmons     Eric       Warren
I found a solution to my own question: add a .collect() call on my dataframe as I join to it, rather than a persist() or cache(); this produces the expected dataframe.
from pyspark.sql.functions import col, collect_list, concat_ws, slice, sort_array, struct

group_column = "unique_id"
concat_list = ['first_name', 'last_name', 'middle_name']
sort_column = "score"
sort_order = False

df_final = df.select(group_column).distinct()

for i in concat_list:
    df_helper = df
    df_helper = df_helper.groupBy(group_column)\
        .agg(sort_array(collect_list(struct(sort_column, i)), sort_order).alias('collect_list'))\
        .withColumn("sorted_list", col("collect_list." + str(i)))\
        .withColumn("first_item", slice(col("sorted_list"), 1, 1))\
        .withColumn(i, concat_ws(",", col("first_item")))\
        .drop("collect_list")\
        .drop("sorted_list")\
        .drop("first_item")
    print(i)
    df_final = df_final.join(df_helper, group_column, "inner")

df_final.collect()
df_final.display()  # I'm using Databricks
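For reference, a non-looping alternative is also possible: first() with ignorenulls=True over a window ordered by the sort column picks the first non-null value per unique_id for every column in a single pass. This is only a minimal sketch (not the code from the post above), assuming the same df and variables, with a descending sort to match sort_order = False:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window spanning each unique_id group, ordered by score descending
# (matches sort_order = False above); the unbounded frame lets first()
# see the whole group from every row.
w = (Window.partitionBy(group_column)
           .orderBy(F.col(sort_column).desc())
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df_final = (df.select(
                group_column,
                *[F.first(c, ignorenulls=True).over(w).alias(c) for c in concat_list])
              .dropDuplicates([group_column]))
This computes every column in one select, so no per-column join is needed.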
I have a scenario where I will be receiving data in CSV files, and I need to generate some columns from the existing ones.
Example:
Col_1 Col_2 Col_3 Col_4
abc 1 No 123
xyz 2 Yes 123
def 1 Yes 345
Expected:
Col_1 Col_2 Col_3 Col_4 Col_5 Col_6
abc 1 No 123 1 1
xyz 2 Yes 123 0 0
def 1 Yes 345 0 0
Col_5 Condition : if Col_1 = 'abc' then 1 else 0 end
Col_6 Condition : max(Col_5) over (Col_2)
I know we can perform transformations in Druid when loading the file into it. I tried a simpler condition, which works fine for me, but I am doubtful about performing aggregates and other transformations like Col_6 here.
We also need to perform aggregations across the data from the different files we will receive. Assume we get 2 files today and load the data into a Druid table; tomorrow we get 3 more files containing data for the same ID (which is Col_2 here). We then need to aggregate based on all the records we have, for example to generate Col_6...
Will this be possible in Druid?
Take a look at https://druid.apache.org/docs/latest/misc/math-expr.html, which contains many transform expressions you can use.
In particular, I tested your use case with the wikipedia demo data by creating the following expressions:
{
"type": "expression",
"name": "isNB",
"expression": "case_simple(\"namespace\", 'Main',1,0)"
},
{
"type": "expression",
"expression": "greatest( case_simple(\"IsNew\", True, 1, 0), case_simple(\"namespace\", 'Main',1,0)",
"name": "combined_calc"
}
One thing to note is that transform expressions cannot refer to other transform expressions, so calculations need to all be done from the raw input fields.
Col_5 Condition : if Col_1 = 'abc' then 1 else 0
You can use the following:
from pyspark.sql import functions as f

df = df.withColumn('Col_5', f.when((f.col('Col_1') == 'abc'), 1).otherwise(0))
Col_6 Condition : max(Col_5) over (Col_2)
You can apply a window operation:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

windowSpec = Window.partitionBy("Col_2").orderBy(f.col("Col_5").desc())
df_max = df.withColumn("row_number", row_number().over(windowSpec)).filter(
    f.col("row_number") == 1
)
Now remove duplicates for each Col_2 and then join the df_max with your main df.
The above code snippets are in Python, but the Spark API is the same across languages, so you can use them with minimal changes.
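For completeness, if the goal is simply to attach Col_6 = max(Col_5) per Col_2 group to every row, a plain window aggregate can do both steps without the row_number/de-duplication/join. A minimal sketch, assuming the same column names as in the question:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# Col_5: 1 where Col_1 = 'abc', else 0
df = df.withColumn("Col_5", f.when(f.col("Col_1") == "abc", 1).otherwise(0))

# Col_6: max(Col_5) over each Col_2 partition, attached to every row directly
df = df.withColumn("Col_6", f.max("Col_5").over(Window.partitionBy("Col_2")))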
The first type, if Col_1 = 'abc' then 1 else 0, would not be hard. E.g., see this article with similar examples.
The second, aggregating over one of the columns, doesn't sound possible. We can aggregate over all the dimensions taken together (like a primary key), but not over one single dimension, afaik.
I have the following table t:
t:([]sym:3#`ibm;time:10:01:01 10:01:04 10:01:08;price:100 101 105;val:("hello";"world";"test"))
How do I perform the following query:
select from t where val in ("hello"; "test")
Wherein I am expecting the following result:
sym time price val
---------------------------
ibm 10:01:01 100 hello
ibm 10:01:08 105 test
It looks like your query does return the result you require.
Alternatively, the keyword 'like' can be used.
When we use a where clause at the end of a select statement, the 'where' section expects a single boolean per row to tell it whether that row should be selected.
When we do where val in "hello", it will actually return a boolean for each element of the string that it matches on (when it isn't wrapped):
q)val:"hello"
q)val in "hello"
11111b
Therefore, to obtain a single boolean we use the keyword like:
q)val like "hello"
1b
Furthermore, when passing a list of strings to a where clause, an 'each-right' adverb should be used to instruct the where clause to operate on each instance of the list.
q)val like/: ("hello";"test")
10b
However, we are again faced with multiple booleans, where the where clause expects a single one.
Therefore we use the keyword any to return results when either "hello" or "test" is present.
q)any val like/: ("hello";"test")
1b
And we can see this gives the required results:
q)select from t where any val like/: ("hello";"test")
sym time price val
--------------------------
ibm 10:01:01 100 "hello"
ibm 10:01:08 105 "test"
Hope this helps
q) select from t where any val like/: ("hello"; "test")
Output:
sym time price val
---------------------------
ibm 10:01:01 100 hello
ibm 10:01:08 105 test
Currently I have an input file (millions of records) where every record contains a 2-character identifier. Multiple lines in this input file will be concatenated into a single record in the output file, and how this is determined is based solely on the sequential order of the identifier.
For example, the records would begin as below:
1A
1B
1C
2A
2B
2C
1A
1C
2B
2C
1A
1B
1C
1A marks the beginning of a new record, so the output file would have 3 records in this case. Everything between the "1A"s will be combined into one record
1A+1B+1C+2A+2B+2C
1A+1C+2B+2C
1A+1B+1C
The number of records between the "1A"s varies, so I have to iterate through and check the Identifier.
I am unsure how to approach this situation using scala/spark.
My strategy is to:
1. Load the input file into a dataframe.
2. Create an Identifier column based on a substring of the record.
3. Create a new column, TempID, and a variable x that is set to 0.
4. Iterate through the dataframe: if Identifier = 1A, then x = x + 1, and TempID = x.
5. Then create a UDF to concatenate records with the same TempID.
To summarize my question:
How would I iterate through the dataframe, check the value of the Identifier column, and then assign a TempID (whose value increases by 1 whenever the value of the identifier column is 1A)?
This is dangerous. The issue is that Spark is not guaranteed to keep the same order among elements, especially since they might cross partition boundaries, so when you iterate over them you could get a different order back. This also has to happen entirely sequentially, so at that point why not just skip Spark entirely and run it as regular Scala code as a preprocessing step before getting to Spark?
My recommendation would be to either look into writing a custom input format / data source, or perhaps use "1A" as a record delimiter, similar to this question.
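To illustrate the record-delimiter idea, here is a rough sketch in PySpark (the same Hadoop setting can be used from Scala). The path is hypothetical, and the split matches the literal string "1A" anywhere in the file, so treat this as a starting point rather than a finished solution:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Read the file with "1A" as the record delimiter instead of newline,
# so everything between consecutive "1A" markers arrives as one value.
conf = {"textinputformat.record.delimiter": "1A"}
records = (
    sc.newAPIHadoopFile(
        "/path/to/input.txt",  # hypothetical path
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf=conf,
    )
    .map(lambda kv: kv[1])                     # keep only the text payload
    .filter(lambda rec: rec.strip() != "")     # drop the empty split before the first "1A"
    .map(lambda rec: "+".join(["1A"] + rec.strip().splitlines()))  # restore the marker, join lines
)
# records now holds strings like "1A+1B+1C+2A+2B+2C"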
First - usually "iterating" over a DataFrame (or Spark's other distributed collection abstractions like RDD and Dataset) is either wrong or impossible. The term simply does not apply. You should transform these collections using Spark's functions instead of trying to iterate over them.
You can achieve your goal (or - almost, details to follow) using Window Functions. The idea here would be to (1) add an "id" column to sort by, (2) use a Window function (based on that ordering) to count the number of previous instances of "1A", and then (3) use these "counts" as the "group id" that ties all records of each group together, and group by it:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
// sample data:
val df = Seq("1A", "1B", "1C", "2A", "2B", "2C", "1A", "1C", "2B", "2C", "1A", "1B", "1C").toDF("val")
val result = df.withColumn("id", monotonically_increasing_id()) // add row ID
  .withColumn("isDelimiter", when($"val" === "1A", 1).otherwise(0)) // add group "delimiter" indicator
  .withColumn("groupId", sum("isDelimiter").over(Window.orderBy($"id"))) // add groupId using Window function
  .groupBy($"groupId").agg(collect_list($"val") as "list") // NOTE: order of list might not be guaranteed!
  .orderBy($"groupId").drop("groupId") // removing groupId
result.show(false)
// +------------------------+
// |list |
// +------------------------+
// |[1A, 1B, 1C, 2A, 2B, 2C]|
// |[1A, 1C, 2B, 2C] |
// |[1A, 1B, 1C] |
// +------------------------+
(if having the result as a list does not fit your needs, I'll leave it to you to transform this column to whatever you need)
The major caveat here is that collect_list does not necessarily guarantee preserving order - once you use groupBy, the order is potentially lost. So - the order within each resulting list might be wrong (the separation to groups, however, is necessarily correct). If that's important to you, it can be worked around by collecting a list of a column that also contains the "id" column and using it later to sort these lists.
EDIT: realizing this answer isn't complete without solving this caveat, and realizing it's not trivial - here's how you can solve it:
Define the following UDF (it needs scala.collection.mutable and org.apache.spark.sql.Row in scope):
import scala.collection.mutable
import org.apache.spark.sql.Row

val getSortedValues = udf { (input: mutable.Seq[Row]) => input
  .map { case Row(id: Long, v: String) => (id, v) }
  .sortBy(_._1)
  .map(_._2)
}
Then, replace the row .groupBy($"groupId").agg(collect_list($"val") as "list") in the suggested solution above with these rows:
.groupBy($"groupId")
.agg(collect_list(struct($"id" as "_1", $"val" as "_2")) as "list")
.withColumn("list", getSortedValues($"list"))
This way we necessarily preserve the order (with the price of sorting these small lists).
I am comparing values across different tables to get the COLUMN_NAME of the MAXIMUM value.
Examples.
These are example tables: Fruit_tb, Vegetable_tb, State_tb, Foods_tb
Under Fruit_tb
fr_id fruit_one fruit_two
1 20 50
Under Vegetables_tb (v = Vegetables)
v_id v_one V_two
1 10 9
Under State_tb
stateid stateOne stateTwo
1 70 87
Under Food_tb
foodid foodOne foodTwo
1 10 3
Now here is the scenario, I want to get the COLUMN NAMES of the max or greatest value in each table.
You can maybe find out the row which contains the max value of a column. For example:
SELECT fr_id , MAX(fruit_one) FROM Fruit_tb GROUP BY fr_id;
In order to find out the max value of a table:
SELECT fr_id ,fruit_one FROM Fruit_tb WHERE fruit_one<(SELECT max(fruit_one ) from Fruit_tb) ORDER BY fr_id DESC limit 1;
A follow-up SO post for the above scenario.
Maybe you can use GREATEST in order to get the column name which has the max value. But what I'm not sure about is whether you'll be able to retrieve all the columns of different tables at once. You can do something like this to retrieve from a single table:
SELECT CASE GREATEST(`id`,`fr_id`)
WHEN `id` THEN `id`
WHEN `fr_id` THEN `fr_id`
ELSE 0
END AS maxcol,
GREATEST(`id`,`fr_id`) as maxvalue FROM Fruit_tb;
Maybe this SO post could help you. Hope it helps!
I'm trying to find a way to use Perl to further process PostgreSQL output. If there's a better way to do this via PostgreSQL, please let me know. I basically need to take certain columns (Realtime, Value) from a file and concatenate them into a single row per group while keeping ID and CAT.
First time posting, so please let me know if I missed anything.
Input:
ID CAT Realtime Value
A 1 time1 55
A 1 time2 57
B 1 time3 75
C 2 time4 60
C 3 time5 66
C 3 time6 67
Output:
ID CAT Time Values
A 1 time1,time2 55,57
B 1 time3 75
C 2 time4 60
C 3 time5,time6 66,67
You could do this most simply in Postgres like so (using array columns; note that values is a reserved word, hence the quoting):
CREATE TEMP TABLE output AS SELECT
  id, cat, ARRAY_AGG(realtime) AS time, ARRAY_AGG(value) AS "values"
FROM input GROUP BY id, cat;
Then select whatever you want out of the output table.
SELECT id
     , cat
     , string_agg(realtime, ',') AS realtimes
     , string_agg(value::text, ',') AS "values"
FROM input
GROUP BY 1, 2
ORDER BY 1, 2;
string_agg() requires PostgreSQL 9.0 or later and concatenates all values into a delimiter-separated string, while array_agg() (v8.4+) creates an array out of the input values (the cast to text is needed if value is a numeric column).
About 1, 2: I quote the manual on the SELECT command:
GROUP BY clause
expression can be an input column name, or the name or ordinal number
of an output column (SELECT list item), or ...
ORDER BY clause
Each expression can be the name or ordinal number of an output column
(SELECT list item), or
Emphasis mine. So that's just notational convenience. Especially handy with complex expressions in the SELECT list.