How do you select the 'maximum' struct from each group - pyspark

I have a DataFrame that contains an id column and a struct of two values, order_value:
example_input = spark.createDataFrame([(1, (1,2)), (1, (2,1)), (2, (1,2))], ["id", "order_value"])
I would like to keep one record per id: the one with the maximum order_value. Specifically, the maximum of the order (the first part of order_value), with ties broken by the maximum of the value (the second part of order_value).
How can this be done?
example_input.groupby('id').max() doesn't seem to work as it complains that order_value is not numeric.
My desired output is given by:
example_output = spark.createDataFrame([(1, (2,1)), (2, (1,2))], ["id", "order_value"])

Try with the array_max function in Spark.
Example:
# group by id, then collect_list to build an array and take the max struct from it
from pyspark.sql.functions import array_max, collect_list, col

example_input.groupBy("id").\
    agg(array_max(collect_list(col("order_value"))).alias("order_value")).\
    show(10, False)
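If you prefer not to collect each group into an array, a window-based sketch also works; it assumes the struct fields are named _1 and _2, which is what createDataFrame infers from plain tuples:

from pyspark.sql import Window
from pyspark.sql import functions as F

# rank rows within each id by order desc, then value desc, and keep the top row
w = Window.partitionBy("id").orderBy(F.col("order_value._1").desc(), F.col("order_value._2").desc())
example_input.withColumn("rn", F.row_number().over(w)).filter(F.col("rn") == 1).drop("rn").show()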

Related

Pyspark dynamic column name

I have a dataframe which contains months and will change quite frequently. I am saving these dataframe values as a list, e.g. months = ['202111', '202112', '202201']. I am using a for loop to iterate through all list elements and trying to provide dynamic column names with the following code:
for i in months:
    df = (
        adjustment_1_prepared_df.select("product", "mnth", "col1", "col2")
        .groupBy("product")
        .agg(
            f.min(f.when(condition, f.col("col1")).otherwise(9999999)).alias(
                concat("col3_"), f.lit(i.col)
            )
        )
    )
So basically, in alias I am trying to give the column name as a combination of a constant (minInv_) and a variable (e.g. 202111), but I am getting an error. How can I give a column name as a combination of a fixed string and a variable?
Thanks in advance!
Since i is already a string taken from the months list, pass it straight to alias:
.alias("col3_" + str(i))
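A minimal sketch of the whole loop with that fix (the f.when condition is a hypothetical placeholder, since the real condition isn't shown, and note that df is overwritten on each iteration):

import pyspark.sql.functions as f

for i in months:
    df = (
        adjustment_1_prepared_df.select("product", "mnth", "col1", "col2")
        .groupBy("product")
        .agg(
            # the condition here is assumed to be a month filter; substitute the real one
            f.min(f.when(f.col("mnth") == i, f.col("col1")).otherwise(9999999))
            .alias("col3_" + str(i))
        )
    )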

How can I refer to a column by its index?

I can use the col("mycolumnname") function to get the column object.
Based on the documentation the only possible parameter is the name of the column.
Is there any way to get the column object by its index?
Try this:
Let n be the index variable (integer).
df.select(df.columns[n]).show()
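If you need the actual Column object rather than just selecting it, a small sketch (n is assumed to be a valid 0-based index into df.columns):

import pyspark.sql.functions as F

n = 1  # hypothetical index
col_by_index = F.col(df.columns[n])   # Column object for the n-th column
# df[df.columns[n]] is an equivalent way to get the same Column object
df.select(col_by_index).show()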
Is the expected result like this?
import pyspark.sql.functions as F
...
data = [
    (1, 'AC Milan'),
    (2, 'Real Madrid'),
    (3, 'Bayern Munich')
]
df = spark.createDataFrame(data, ['id', 'club'])
df.select(F.col('club')).show()
df.select(df['club']).show()

Spark Scala - how to compare certain element in one row with another element in a different row

I have a joined RDD of type RDD[(Int, (String, String), (String, String))].
for example:
(1, (UserID1, pwd1),(UserID2, pwd2))
(2, (UserID2, pwd2),(UserID3, pwd3))
(3, (UserID3, pwd3),(UserID4, pwd4))
As you can see, these 3 rows are chained together: the 3rd element of a given row matches the 2nd element of the next row (row 1 and row 2 are linked by (UserID2, pwd2); rows 2 and 3 are linked by (UserID3, pwd3)).
How do I process the data so that I can dedupe the common items and get the result as an RDD:
((UserID1, pwd1),(UserID2, pwd2),(UserID3, pwd3),(UserID4, pwd4))
It could be achieved using flatMap and then distinct.
scala> rdd.map(s=>Seq(s._2,s._3)).flatMap(s=>s).distinct.foreach(println(_))
output:
(UserID4,pwd4)
(UserID2,pwd2)
(UserID3,pwd3)
(UserID1,pwd1)

pyspark group by sum

I have a pyspark dataframe with 4 columns.
id/ number / value / x
I want to group by columns id and number, and then add a new column with the sum of value per id and number. I want to keep column x without doing anything to it.
df= df.select("id","number","value","x")
.groupBy( 'id', 'number').withColumn("sum_of_value",df.value.sum())
At the end I want a dataframe with 5 columns: id / number / value / x / sum_of_value.
Can anyone help?
The result you are trying to achieve doesn't make sense. Your output dataframe will only have columns that were grouped by or aggregated (summed in this case). x and value would have multiple values when you group by id and number.
You can have a 3-column output (id, number and sum(value)) like this:
df_summed = df.groupBy('id', 'number').sum('value')
Let's say your DataFrame df initially has the columns id, number, value and x.
df1 = df.groupBy("id","number").count()
Now df1 will contain three columns: id, number and count.
Now you can join df1 and df based on columns "id" and "number" and select whatever columns you would like to select.
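A minimal sketch of that join-back approach, using sum("value") in place of count so it matches the question, and keeping the x column:

import pyspark.sql.functions as F

sums = df.groupBy("id", "number").agg(F.sum("value").alias("sum_of_value"))
# joining the per-group sums back keeps every original row, including x
result = df.join(sums, on=["id", "number"], how="left")
result.select("id", "number", "value", "x", "sum_of_value").show()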
Hope it helps.

to find even odd and plain series in postgresql

This might be naive, but I want to know whether there is any way to find which rows have all even entries and which have all odd entries.
I have data in this format
"41,43,45,49,35,39,47,37"
"12,14,18,16,20,24,22,10"
"1,2,3,4,5,6"
"1,7,521,65,32"
What I have tried is to split these values against an id column and then check them with even/odd functions, but that takes too much time.
Is there a query or a function through which I can find out which of the rows are even sequences, odd sequences, or arbitrary?
Thanks in advance.
Assuming your table has some way of (uniquely) identifying each of those lists, this should work:
create table data (id integer, list text);

insert into data
values
    (1, '41,43,45,49,35,39,47,37'),
    (2, '12,14,18,16,20,24,22,10'),
    (3, '1,2,3,4,5,6'),
    (4, '1,7,521,65,32');

with normalized as (
    select id, unnest(string_to_array(list, ',')::int[]) as val
    from data
)
select id,
       bool_and((val % 2) = 0) as all_even,
       bool_and((val % 2) <> 0) as all_odd
from normalized
group by id;
Rows where both all_even and all_odd come back false are the mixed ("arbitrary") case. Not sure if that is fast enough for your needs, though.