Average per group in PySpark - pyspark

I have the PySpark dataframe below:
cust | amount |
----------------
A | 5 |
A | 1 |
A | 3 |
B | 4 |
B | 4 |
B | 2 |
C | 2 |
C | 1 |
C | 7 |
C | 5 |
I need to group by column 'cust' and calculate the average per group.
Expected result:
cust | avg_amount
-------------------
A | 3
B | 3.333
C | 3.75
I've been using the code below, but it gives me an error.
data.withColumn("avg_amount", F.avg("amount"))
Any idea how I can make this average?

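Use groupBy with agg to compute the number of transactions and the average amount per customer:
from pyspark.sql import functions as F
# One output row per customer; "n_transactions" is an illustrative name for the count column
data = data.groupBy("cust")\
.agg(
F.count("*").alias("n_transactions"),
# avg() reads the original "amount" column
F.avg("amount").alias("avg_amount")
)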
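As a sanity check on the expected values, the grouped average can be worked out by hand. The following is a plain-Python sketch (no Spark involved) of what groupBy("cust") followed by avg("amount") computes on the sample data. The helper name group_averages is made up for this illustration.

```python
from collections import defaultdict

# Sample rows from the question: (cust, amount)
rows = [
    ("A", 5), ("A", 1), ("A", 3),
    ("B", 4), ("B", 4), ("B", 2),
    ("C", 2), ("C", 1), ("C", 7), ("C", 5),
]

def group_averages(rows):
    """Return {cust: average amount}, mirroring groupBy + avg (illustrative helper)."""
    totals = defaultdict(lambda: [0, 0])  # cust -> [running sum, row count]
    for cust, amount in rows:
        totals[cust][0] += amount
        totals[cust][1] += 1
    return {cust: total / count for cust, (total, count) in totals.items()}

averages = group_averages(rows)
for cust in sorted(averages):
    print(cust, round(averages[cust], 3))
```

For the sample data this gives 3.0 for A, roughly 3.333 for B, and 3.75 for C (15 / 4).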

Related

postgres sql : getting unified rows

I have one table where I dump all records from different sources (x, y, z) like below
+----+--------+
| id | source |
+----+--------+
| 1 | x |
| 2 | y |
| 3 | x |
| 4 | x |
| 5 | y |
| 6 | z |
| 7 | z |
| 8 | x |
| 9 | z |
| 10 | z |
+----+--------+
Then I have one mapping table where I map values between sources based on my usecase like below
+----+-----------+
| id | mapped_id |
+----+-----------+
| 1 | 2 |
| 1 | 9 |
| 3 | 7 |
| 4 | 10 |
| 5 | 1 |
+----+-----------+
I want merged results where I can see only unique results like
+-----+------------+
| id | mapped_ids |
+-----+------------+
| 1 | 2,9,5 |
| 3 | 7 |
| 4 | 10 |
| 6 | null |
| 8 | null |
+-----+------------+
I have tried different options but could not figure this out. Is there a way to write joins that do this? I have to use the mapping table, where the associations are stored, to identify the unique records along with the records that are not mapped anywhere.
My understanding is that you want to see all dump_table IDs that do not appear in the mapped_id column, and then aggregate the mapped_ids for those that are left:
select d1.id,
array_agg(m1.mapped_id order by m1.mapped_id) filter (where m1.mapped_id is not null) as mapped_ids
from dump_table d1
left join mapping_table m1 using (id)
where not exists (select *
from mapping_table m2
where m2.mapped_id = d1.id)
group by d1.id;
Online example: https://rextester.com/JQZ17650
Try something like this:
SELECT id, name, ARRAY_AGG(mapped_id) AS mapped_ids
FROM table1 AS t1
LEFT JOIN table2 AS t2 USING (id)
GROUP BY id, name
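Another possible reading of the expected output is that the mappings are symmetric, so the row 5 -> 1 places 5 in the same group as 1. The following plain-Python sketch works under that assumption. It is not taken from either answer, it uses a union-find structure rather than joins, and the name unify is hypothetical. Each group is reported under its smallest id.

```python
from collections import defaultdict

# Data from the question
dump_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mappings = [(1, 2), (1, 9), (3, 7), (4, 10), (5, 1)]

def unify(ids, pairs):
    """Group ids connected by any mapping (assumed symmetric); illustrative helper."""
    parent = {i: i for i in ids}

    def find(x):
        # Walk up to the group's representative, shortening the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the representative of the merged group
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = defaultdict(list)
    for i in ids:
        groups[find(i)].append(i)
    # Drop the representative from its own list; unmapped ids get None
    return {
        root: [m for m in members if m != root] or None
        for root, members in sorted(groups.items())
    }

merged = unify(dump_ids, mappings)
for root, mapped in merged.items():
    print(root, mapped)
```

Here the mapped ids come out in ascending order rather than in the order shown in the question.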

Transform structure of Spark DF. Create one column or row for each value in a column. Impute values [duplicate]

This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 4 years ago.
I have a Spark DF with the following structure:
+--------------------------------------+
| user| time | counts |
+--------------------------------------+
| 1 | 2018-06-04 16:00:00.0 | 5 |
| 1 | 2018-06-04 17:00:00.0 | 7 |
| 1 | 2018-06-04 17:30:00.0 | 7 |
| 1 | 2018-06-04 18:00:00.0 | 8 |
| 1 | 2018-06-04 18:30:00.0 | 10 |
| 1 | 2018-06-04 19:00:00.0 | 9 |
| 1 | 2018-06-04 20:00:00.0 | 7 |
| 2 | 2018-06-04 17:00:00.0 | 4 |
| 2 | 2018-06-04 18:00:00.0 | 4 |
| 2 | 2018-06-04 18:30:00.0 | 5 |
| 2 | 2018-06-04 19:30:00.0 | 7 |
| 3 | 2018-06-04 16:00:00.0 | 6 |
+--------------------------------------+
It was obtained from an event-log table using the following code:
ranked.groupBy($"user", sql.functions.window($"timestamp", "30 minutes"))
.agg(sum("id").as("counts"))
.withColumn("time", $"window.start")
As can be seen looking at the time column, not all 30-min intervals registered events for each user, i.e. not all user groups of frames are of equal lengths. I'd like to impute (possibly with NA's or 0's) missing time values and create a table (or RDD) like the following:
+-----------------------------------------------------------------------------+
| user| 2018-06-04 16:00:00 | 2018-06-04 16:30:00 | 2018-06-04 17:00:00 | ... |
+-----------------------------------------------------------------------------+
| 1 | 5 | NA (or 0) | 7 | ... |
| 2 | NA (or 0) | NA (or 0) | 4 | ... |
| 3 | 6 | NA (or 0) | NA (or 0) | ... |
+-----------------------------------------------------------------------------+
The transpose of the table above (with a time column and one column holding the counts for each user) would theoretically work too, but I am not sure it would be optimal Spark-wise, as I have almost a million different users.
How can I restructure the table as described?
If each time window appears for at least one user, a simple pivot will do the trick (and put null for missing values). With millions of rows, that should be the case.
val reshaped_df = df.groupBy("user").pivot("time").agg(sum('counts))
In case a column is still missing, you could access the list of the columns with reshaped_df.columns and then add the missing ones. You would need to generate the list of columns that you expect (expected_columns) and then generate the missing ones as follows:
val expected_columns = ???
var result = reshaped_df
expected_columns
.foreach{ c =>
if(! result.columns.contains(c))
result = result.withColumn(c, lit(null))
}
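To make the two steps concrete (build the list of expected columns, then fill in whatever is missing), here is a small plain-Python sketch that does not use Spark. The helper names half_hour_slots and pivot_counts, the chosen start and end times, and the shortened sample rows are all assumptions for illustration.

```python
from datetime import datetime, timedelta

def half_hour_slots(start, end):
    """List every 30-minute window start between start and end, inclusive."""
    slots = []
    current = start
    while current <= end:
        slots.append(current.strftime("%Y-%m-%d %H:%M:%S"))
        current += timedelta(minutes=30)
    return slots

def pivot_counts(rows, columns):
    """Build {user: {column: counts or None}} with one entry per expected column."""
    table = {}
    for user, time, counts in rows:
        # Each user starts with every expected column set to None (the "NA" case)
        table.setdefault(user, dict.fromkeys(columns))[time] = counts
    return table

# A few (user, time, counts) rows from the question
rows = [
    (1, "2018-06-04 16:00:00", 5),
    (1, "2018-06-04 17:00:00", 7),
    (2, "2018-06-04 17:00:00", 4),
    (3, "2018-06-04 16:00:00", 6),
]

# Assumed range: 16:00 to 17:00 on 2018-06-04
columns = half_hour_slots(datetime(2018, 6, 4, 16, 0), datetime(2018, 6, 4, 17, 0))
pivoted = pivot_counts(rows, columns)
for user, values in pivoted.items():
    print(user, values)
```

A list like columns plays the role of expected_columns in the Scala snippet. The 16:30 slot has no events, so it shows up as None for every user.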

how to get multiple rows from one row in spark scala [duplicate]

This question already has an answer here:
Flattening Rows in Spark
(1 answer)
Closed 5 years ago.
I have a dataframe in Spark like the one below, and I want to convert all the code columns into separate rows, keyed by the first column, id.
+----------------------------------+
| id code1 code2 code3 code4 code5 |
+----------------------------------+
| 1 A B C D E |
| 1 M N O P Q |
| 1 P Q R S T |
| 2 P A C D F |
| 2 S D F R G |
+----------------------------------+
I want the output in the format below:
+-------------+
| id code |
+-------------+
| 1 A |
| 1 B |
| 1 C |
| 1 D |
| 1 E |
| 1 M |
| 1 N |
| 1 O |
| 1 P |
| 1 Q |
| 1 P |
| 1 Q |
| 1 R |
| 1 S |
| 1 T |
| 2 P |
| 2 A |
| 2 C |
| 2 D |
| 2 F |
| 2 S |
| 2 D |
| 2 F |
| 2 R |
| 2 G |
+-------------+
Can anyone please help me get the above output with Spark and Scala?
Using the array, explode and drop functions should give you the desired output:
df.withColumn("code", explode(array("code1", "code2", "code3", "code4", "code5")))
.drop("code1", "code2", "code3", "code4", "code5")
OR
as suggested by undefined_variable, you can just use select:
df.select($"id", explode(array("code1", "code2", "code3", "code4", "code5")).as("code"))
df.select(col("id"), explode(split(concat_ws(",", col("code1"), col("code2"), col("code3"), col("code4"), col("code5")), ",")).as("code"))
The basic idea is to first concatenate all the required columns into one string, then split that string back into an array and explode it.
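The effect of array plus explode can be illustrated without Spark. This plain-Python sketch emits one (id, code) pair per code column, in column order. The function name explode_codes is made up for the example.

```python
# Rows from the question: (id, code1, code2, code3, code4, code5)
rows = [
    (1, "A", "B", "C", "D", "E"),
    (1, "M", "N", "O", "P", "Q"),
    (1, "P", "Q", "R", "S", "T"),
    (2, "P", "A", "C", "D", "F"),
    (2, "S", "D", "F", "R", "G"),
]

def explode_codes(rows):
    """Turn each wide row into one (id, code) row per code column (illustrative helper)."""
    return [(row[0], code) for row in rows for code in row[1:]]

flat = explode_codes(rows)
for row_id, code in flat:
    print(row_id, code)
```

Five input rows with five code columns each give 25 output rows, matching the desired output in the question.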

How to set sequence number of sub-elements in T-SQL using same element as parent?

I need to set a sequence number in T-SQL where the first column holds a (repeating) sequence marker and another column is used for ordering.
It is hard to explain, so I will try with an example.
This is what I need:
|------------|-------------|----------------|
| Group Col | Order Col | Desired Result |
|------------|-------------|----------------|
| D | 1 | NULL |
| A | 2 | 1 |
| C | 3 | 1 |
| E | 4 | 1 |
| A | 5 | 2 |
| B | 6 | 2 |
| C | 7 | 2 |
| A | 8 | 3 |
| F | 9 | 3 |
| T | 10 | 3 |
| A | 11 | 4 |
| Y | 12 | 4 |
|------------|-------------|----------------|
So my marker is A (each time I meet A, I must start a new group in my result). All rows before the first A must be set to NULL.
I know that I can achieve that with a loop, but that would be a slow solution and I need to update a lot of rows (sometimes several thousand).
Is there a way to achieve this without a loop?
You can use the window version of COUNT to get the desired result:
SELECT [Group Col], [Order Col],
COUNT(CASE WHEN [Group Col] = 'A' THEN 1 END)
OVER
(ORDER BY [Order Col]) AS [Desired Result]
FROM mytable
COUNT returns 0 for the rows before the first A. If you need those rows set to NULL, use SUM instead of COUNT, since SUM over only NULL inputs yields NULL.
Demo here
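The running aggregate can be traced by hand. The plain-Python sketch below mimics the windowed COUNT and SUM over rows that are already sorted by [Order Col]. The function name running_marker_count and its null_before_first flag are assumptions for illustration and not part of T-SQL.

```python
def running_marker_count(groups, marker="A", null_before_first=True):
    """Running count of marker rows; groups must already be in [Order Col] order."""
    result = []
    seen = 0
    for value in groups:
        if value == marker:
            seen += 1  # a new group starts at every marker row
        if seen == 0 and null_before_first:
            result.append(None)  # SUM-like: no marker seen yet, so NULL
        else:
            result.append(seen)  # COUNT-like: plain running count
    return result

# [Group Col] values from the question, ordered by [Order Col]
groups = ["D", "A", "C", "E", "A", "B", "C", "A", "F", "T", "A", "Y"]

print(running_marker_count(groups))
print(running_marker_count(groups, null_before_first=False))
```

With null_before_first=True the first row comes out as None, matching the Desired Result column. With False it comes out as 0, which is what the COUNT version returns.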

Finding value difference in column pairs

I'm using SQL Server 2008 R2 and I have a view which returns the following:
+----+-------+-------+-------+-------+-------+-------+
| ID | col1A | col1B | col2A | col2B | col3A | col3B |
+----+-------+-------+-------+-------+-------+-------+
| 1 | 1 | 1 | 3 | 5 | 4 | 4 |
| 2 | 1 | 1 | 5 | 5 | 5 | 4 |
| 3 | 3 | 4 | 5 | 5 | 4 | 4 |
| 4 | 1 | 2 | 5 | 5 | 4 | 3 |
| 5 | 1 | 1 | 2 | 2 | 3 | 3 |
+----+-------+-------+-------+-------+-------+-------+
As you can see this view contains column pairs (col1A and col1B), (col2A and col2B), (col3A and col3B).
I need to query this view and find rows where the column pairs contain different values.
So I would be looking to return:
+----+------------+---+-----+
| ID | ColumnType | A | B |
+----+------------+---+-----+
| 1 | Col2 | 3 | 5 |
| 2 | Col3 | 5 | 4 |
| 3 | Col1 | 3 | 4 |
| 4 | Col1 | 1 | 2 |
| 4 | Col3 | 4 | 3 |
+----+------------+---+-----+
I think I need to use UNPIVOT, but I'm not sure how. I'd appreciate any suggestions.
Since you are using SQL Server 2008+, you can use CROSS APPLY with a VALUES clause to unpivot each pair of columns, and then compare the values in A and B to return the rows that don't match:
select t.ID,
c.ColumnType,
c.A,
c.B
from [dbo].[yourview] t
cross apply
(
values
('Col1', Col1A, Col1B),
('Col2', Col2A, Col2B),
('Col3', Col3A, Col3B)
) c (ColumnType, A, B)
where c.A <> c.B;
If you have different datatypes in your columns, then you'll need to convert the data to the same type. You can do this conversion within the VALUES clause:
select t.ID,
c.ColumnType,
c.A,
c.B
from [dbo].[yourview] t
cross apply
(
values
('Col1', cast(Col1A as varchar(50)), Col1B),
('Col2', cast(Col2A as varchar(50)), Col2B),
('Col3', cast(Col3A as varchar(50)), Col3B)
) c (ColumnType, A, B)
where c.A <> c.B
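The logic of the CROSS APPLY ... VALUES query can be checked against the sample data in plain Python: turn each row into three (ColumnType, A, B) tuples and keep those where A and B differ. This is only an illustration of the unpivot-and-filter step, and the function name mismatched_pairs is hypothetical.

```python
# Rows from the view: (ID, col1A, col1B, col2A, col2B, col3A, col3B)
rows = [
    (1, 1, 1, 3, 5, 4, 4),
    (2, 1, 1, 5, 5, 5, 4),
    (3, 3, 4, 5, 5, 4, 4),
    (4, 1, 2, 5, 5, 4, 3),
    (5, 1, 1, 2, 2, 3, 3),
]

def mismatched_pairs(rows):
    """Unpivot each row into (ID, ColumnType, A, B) and keep pairs where A != B."""
    result = []
    for row_id, a1, b1, a2, b2, a3, b3 in rows:
        # Same role as the VALUES list inside CROSS APPLY
        for column_type, a, b in (("Col1", a1, b1), ("Col2", a2, b2), ("Col3", a3, b3)):
            if a != b:
                result.append((row_id, column_type, a, b))
    return result

differences = mismatched_pairs(rows)
for row in differences:
    print(row)
```

On the sample data this returns the five rows listed in the question's expected result, and ID 5 drops out because all of its pairs match.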