how to get multiple rows from one row in spark scala [duplicate] - scala

This question already has an answer here:
Flattening Rows in Spark
(1 answer)
Closed 5 years ago.
I have a dataframe in Spark like the one below, and I want to convert all the code columns into separate rows keyed by the first column, id.
+---+-----+-----+-----+-----+-----+
| id|code1|code2|code3|code4|code5|
+---+-----+-----+-----+-----+-----+
|  1|    A|    B|    C|    D|    E|
|  1|    M|    N|    O|    P|    Q|
|  1|    P|    Q|    R|    S|    T|
|  2|    P|    A|    C|    D|    F|
|  2|    S|    D|    F|    R|    G|
+---+-----+-----+-----+-----+-----+
I want the output in the format below:
+---+----+
| id|code|
+---+----+
|  1|   A|
|  1|   B|
|  1|   C|
|  1|   D|
|  1|   E|
|  1|   M|
|  1|   N|
|  1|   O|
|  1|   P|
|  1|   Q|
|  1|   P|
|  1|   Q|
|  1|   R|
|  1|   S|
|  1|   T|
|  2|   P|
|  2|   A|
|  2|   C|
|  2|   D|
|  2|   F|
|  2|   S|
|  2|   D|
|  2|   F|
|  2|   R|
|  2|   G|
+---+----+
Can anyone please help me get the above output with Spark and Scala?

Using the array, explode and drop functions should get you the desired output:
import org.apache.spark.sql.functions.{array, explode}

df.withColumn("code", explode(array("code1", "code2", "code3", "code4", "code5")))
  .drop("code1", "code2", "code3", "code4", "code5")
Or, as suggested by undefined_variable, you can do it in a single select (the $ syntax requires import spark.implicits._):
df.select($"id", explode(array("code1", "code2", "code3", "code4", "code5")).as("code"))

df.select(
  col("id"),
  explode(split(concat_ws(",", col("code1"), col("code2"), col("code3"), col("code4"), col("code5")), ",")).as("code")
)
Basically the idea is to first concatenate all the required columns into one comma-separated string and then explode it. Because concat_ws returns a single string rather than an array, it has to be split back into an array before the explode; the array-based approaches above avoid that round trip through strings.
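For completeness, here is a minimal self-contained sketch of the select + explode approach, assuming a local SparkSession (the session setup and variable names are illustrative, not from the thread):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, explode}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Rebuild the question's input dataframe.
val df = Seq(
  (1, "A", "B", "C", "D", "E"),
  (1, "M", "N", "O", "P", "Q"),
  (1, "P", "Q", "R", "S", "T"),
  (2, "P", "A", "C", "D", "F"),
  (2, "S", "D", "F", "R", "G")
).toDF("id", "code1", "code2", "code3", "code4", "code5")

// Each input row becomes five (id, code) rows, 25 in total.
df.select($"id", explode(array("code1", "code2", "code3", "code4", "code5")).as("code"))
  .show(25)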

Related

Average per group in PySpark

I have the PySpark dataframe below:
cust | amount
-----+-------
A    | 5
A    | 1
A    | 3
B    | 4
B    | 4
B    | 2
C    | 2
C    | 1
C    | 7
C    | 5
I need to group by the 'cust' column and calculate the average per group.
Expected result:
cust | avg_amount
-----+-----------
A    | 3
B    | 3.333
C    | 3.75
I've been using the code below, but it gives me an error:
data.withColumn("avg_amount", F.avg("amount"))
Any idea how I can compute this average?
F.avg is an aggregate function, so calling it inside withColumn without a window specification raises an error. Use groupBy instead to get the number of transactions and the average amount per customer:
from pyspark.sql import functions as F

data = data.groupBy("cust") \
    .agg(
        F.count("*").alias("count"),          # transactions per customer
        F.avg("amount").alias("avg_amount")   # average amount per customer
    )

Replace null with negative id numbers in non-consecutive rows in Hive

I have this table in my database:
| id   | desc |
|------|------|
| 1    | A    |
| 2    | B    |
| NULL | C    |
| 3    | D    |
| NULL | D    |
| NULL | E    |
| 4    | F    |
And I want to transform it into a table where the nulls are replaced by consecutive negative ids:
| id | desc |
|----|------|
|  1 | A    |
|  2 | B    |
| -1 | C    |
|  3 | D    |
| -2 | D    |
| -3 | E    |
|  4 | F    |
Does anyone know how I can do this in Hive?
The approach below works. All NULL ids land in the same ROW_NUMBER partition, so they are numbered 1, 2, 3, ... and the unary minus makes them negative. Note that desc is a reserved keyword in Hive, so it has to be quoted with backticks, and the window needs an ORDER BY to make the numbering deterministic:
SELECT COALESCE(id, -ROW_NUMBER() OVER (PARTITION BY id ORDER BY `desc`)) AS id,
       `desc`
FROM database_name.table_name;

Spotfire moving average by group

I'm using software named Spotfire, in which I can set "custom expressions" with MDX. That said, you don't need to know this software to answer this question; I expect general answers that could help other people even if they don't use Spotfire.
I need to get the moving average for each group. I have a lot of groups, I can't make a table per group.
Below is an example of my table:
ID | GROUP | DATE | VALUE | MVG_AVG
------------------------------------------
a | A | 05/10 | 5 |
b | B | 05/10 | 4 |
c | A | 05/11 | 9 |
d | B | 05/11 | 7 |
e | B | 05/12 | 7 |
f | B | 05/13 | 7 |
g | A | 05/12 | 1 |
h | B | 05/14 | 1 |
I found the LastPeriods function, but I can't get it to work per group. I use n=3 for the function.
Here is the same table with expected results for moving average :
ID | GROUP | DATE | VALUE | MVG_AVG
------------------------------------------
a | A | 05/10 | 5 | 5 #because no previous value (for group A)
b | B | 05/10 | 4 | 4
c | A | 05/11 | 9 | 5 # =(5+9+1)/3
d | B | 05/11 | 7 | 6
e | B | 05/12 | 7 | 7
f | B | 05/13 | 7 | 5
g | A | 05/12 | 1 | 1 #because no next value (for group A)
h | B | 05/14 | 1 | 1
Here is my current custom expression in Spotfire; it doesn't take groups into account:
Sum([VALUE]) OVER (LastPeriods(3,[DATE])) / 3
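No answer is recorded here. In Spotfire itself the usual fix is reportedly to intersect the node navigation with the grouping column, e.g. Sum([VALUE]) OVER (Intersect([GROUP], LastPeriods(3, [DATE]))) / 3, though treat that expression as unverified. Since the asker explicitly welcomes general, non-Spotfire answers, here is a hedged Spark (Scala) sketch of a per-group centered 3-row moving average (all names and the session setup are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  ("a", "A", "05/10", 5), ("b", "B", "05/10", 4),
  ("c", "A", "05/11", 9), ("d", "B", "05/11", 7),
  ("e", "B", "05/12", 7), ("f", "B", "05/13", 7),
  ("g", "A", "05/12", 1), ("h", "B", "05/14", 1)
).toDF("ID", "GROUP", "DATE", "VALUE")

// Centered 3-row frame per group: one row before through one row after.
val w = Window.partitionBy("GROUP").orderBy("DATE").rowsBetween(-1, 1)
df.withColumn("MVG_AVG", avg($"VALUE").over(w)).orderBy("GROUP", "DATE").show()

Note that avg over a rows frame divides by however many rows actually fall inside it, so edge rows average over two values instead of echoing the raw value as in the expected table above.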

How to adapt a query so that it can perform all possible combinations (cross join) between elements of the same table?

Considering the following table:
group | type | element
------+------+--------
1 | 1 | A
1 | 1 | B
1 | 2 | C
1 | 2 | D
1 | 3 | E
1 | 3 | F
2 | 4 | G
2 | 4 | H
2 | 5 | I
2 | 5 | J
2 | 5 | K
3 | 4 | L
3 | 4 | M
3 | 4 | N
3 | 5 | O
3 | 5 | P
3 | 6 | Q
3 | 7 | R
3 | 7 | S
3 | 7 | T
I need to select all possible combinations of values from the element column, restricted to a single group and taking one element from each type within that group.
With the following query I can do this for the elements of group 1 (which has types 1, 2 and 3):
SELECT T1.*, T2.type, T2.element, T3.type, T3.element
FROM (SELECT * FROM test WHERE "group" = 1 AND type = 1) AS T1
CROSS JOIN (SELECT * FROM test WHERE "group" = 1 AND type = 2) AS T2
CROSS JOIN (SELECT * FROM test WHERE "group" = 1 AND type = 3) AS T3;
Obtaining the following result:
group | type | element | type | element | type | element
------+------+---------+------+---------+------+--------
1 | 1 | A | 2 | C | 3 | E
1 | 1 | A | 2 | C | 3 | F
1 | 1 | A | 2 | D | 3 | E
1 | 1 | A | 2 | D | 3 | F
1 | 1 | B | 2 | C | 3 | E
1 | 1 | B | 2 | C | 3 | F
1 | 1 | B | 2 | D | 3 | E
1 | 1 | B | 2 | D | 3 | F
To filter by the elements of group 2, I need a different query with fewer cross joins, because group 2 has fewer types (only 4 and 5):
SELECT T1.*, T2.type, T2.element
FROM (SELECT * FROM test WHERE "group" = 2 AND type = 4) AS T1
CROSS JOIN (SELECT * FROM test WHERE "group" = 2 AND type = 5) AS T2;
Obtaining the following result:
group | type | element | type | element
------+------+---------+------+--------
2 | 4 | G | 5 | I
2 | 4 | G | 5 | J
2 | 4 | G | 5 | K
2 | 4 | H | 5 | I
2 | 4 | H | 5 | J
2 | 4 | H | 5 | K
And finally, to select the elements of group 3, I need more cross joins, because this group has 4 types (4, 5, 6 and 7):
SELECT T1.*, T2.type, T2.element, T3.type, T3.element, T4.type, T4.element
FROM (SELECT * FROM test WHERE "group" = 3 AND type = 4) AS T1
CROSS JOIN (SELECT * FROM test WHERE "group" = 3 AND type = 5) AS T2
CROSS JOIN (SELECT * FROM test WHERE "group" = 3 AND type = 6) AS T3
CROSS JOIN (SELECT * FROM test WHERE "group" = 3 AND type = 7) AS T4;
Obtaining the following result:
group | type | element | type | element | type | element | type | element
------+------+---------+------+---------+------+---------+------+--------
3 | 4 | L | 5 | O | 6 | Q | 7 | R
3 | 4 | L | 5 | O | 6 | Q | 7 | S
3 | 4 | L | 5 | O | 6 | Q | 7 | T
3 | 4 | L | 5 | P | 6 | Q | 7 | R
3 | 4 | L | 5 | P | 6 | Q | 7 | S
3 | 4 | L | 5 | P | 6 | Q | 7 | T
3 | 4 | M | 5 | O | 6 | Q | 7 | R
3 | 4 | M | 5 | O | 6 | Q | 7 | S
3 | 4 | M | 5 | O | 6 | Q | 7 | T
3 | 4 | M | 5 | P | 6 | Q | 7 | R
3 | 4 | M | 5 | P | 6 | Q | 7 | S
3 | 4 | M | 5 | P | 6 | Q | 7 | T
3 | 4 | N | 5 | O | 6 | Q | 7 | R
3 | 4 | N | 5 | O | 6 | Q | 7 | S
3 | 4 | N | 5 | O | 6 | Q | 7 | T
3 | 4 | N | 5 | P | 6 | Q | 7 | R
3 | 4 | N | 5 | P | 6 | Q | 7 | S
3 | 4 | N | 5 | P | 6 | Q | 7 | T
My question: how can I write a single query that gives me these results regardless of the number of distinct groups, types, and elements?
To better understand the relationships between the group, type, and element columns:
Group -> Type[element A, element B ...]
1 -> 1[A,B], 2[C,D], 3[E,F]
2 -> 4[G,H], 5[I,J,K]
3 -> 4[L,M,N], 5[O,P], 6[Q], 7[R,S,T]
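The thread records no answer, but the reasoning above (one CROSS JOIN per distinct type) can be made dynamic by generating the joins in code. Here is a hedged Spark (Scala) sketch rather than a single SQL query, with all names mine; the type is encoded into the column name, and each group yields its own result table because groups have different numbers of types:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val test = Seq(
  (1, 1, "A"), (1, 1, "B"), (1, 2, "C"), (1, 2, "D"), (1, 3, "E"), (1, 3, "F"),
  (2, 4, "G"), (2, 4, "H"), (2, 5, "I"), (2, 5, "J"), (2, 5, "K"),
  (3, 4, "L"), (3, 4, "M"), (3, 4, "N"), (3, 5, "O"), (3, 5, "P"),
  (3, 6, "Q"), (3, 7, "R"), (3, 7, "S"), (3, 7, "T")
).toDF("group", "type", "element")

// For each group, fold one join per type; joining on "group" within a
// single group is equivalent to a CROSS JOIN of that group's type slices.
val results: Map[Int, DataFrame] =
  test.select("group").distinct.as[Int].collect.map { g =>
    val sub   = test.filter($"group" === g)
    val types = sub.select("type").distinct.as[Int].collect.sorted
    val combo = types.map { t =>
      sub.filter($"type" === t)
         .select($"group", $"element".as(s"element_type_$t"))
    }.reduce(_.join(_, "group"))
    g -> combo
  }.toMap

results(1).show()  // 8 rows, matching the group 1 output above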

Finding value difference in column pairs

I'm using SQL server 2008R2 and I have a view which returns the following:
+----+-------+-------+-------+-------+-------+-------+
| ID | col1A | col1B | col2A | col2B | col3A | col3B |
+----+-------+-------+-------+-------+-------+-------+
| 1 | 1 | 1 | 3 | 5 | 4 | 4 |
| 2 | 1 | 1 | 5 | 5 | 5 | 4 |
| 3 | 3 | 4 | 5 | 5 | 4 | 4 |
| 4 | 1 | 2 | 5 | 5 | 4 | 3 |
| 5 | 1 | 1 | 2 | 2 | 3 | 3 |
+----+-------+-------+-------+-------+-------+-------+
As you can see this view contains column pairs (col1A and col1B), (col2A and col2B), (col3A and col3B).
I need to query this view and find rows where the column pairs contain different values.
So I would be looking to return:
+----+------------+---+-----+
| ID | ColumnType | A | B |
+----+------------+---+-----+
| 1 | Col2 | 3 | 5 |
| 2 | Col3 | 5 | 4 |
| 3 | Col1 | 3 | 4 |
| 4 | Col1 | 1 | 2 |
| 4 | Col3 | 4 | 3 |
+----+------------+---+-----+
I think I need to use UNPIVOT, but I'm not sure how. I'd appreciate any suggestions.
Since you are using SQL Server 2008+, you can use CROSS APPLY to unpivot the column pairs and then easily compare the values in A and B, returning the rows that don't match:
select t.ID,
       c.ColumnType,
       c.A,
       c.B
from [dbo].[yourview] t
cross apply
(
    values
        ('Col1', Col1A, Col1B),
        ('Col2', Col2A, Col2B),
        ('Col3', Col3A, Col3B)
) c (ColumnType, A, B)
where c.A <> c.B;
If you have different datatypes in your columns, then you'll need to convert the data to the same type. You can do this conversion within the VALUES clause:
select t.ID,
       c.ColumnType,
       c.A,
       c.B
from [dbo].[yourview] t
cross apply
(
    values
        ('Col1', cast(Col1A as varchar(50)), Col1B),
        ('Col2', cast(Col2A as varchar(50)), Col2B),
        ('Col3', cast(Col3A as varchar(50)), Col3B)
) c (ColumnType, A, B)
where c.A <> c.B;
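As an aside for readers coming from the Spark questions above: the same unpivot-and-compare idea can be sketched with Spark's stack function. This is a hypothetical Scala translation of the CROSS APPLY (VALUES ...) pattern, not part of the original answer:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Rebuild the view's sample rows.
val view = Seq(
  (1, 1, 1, 3, 5, 4, 4),
  (2, 1, 1, 5, 5, 5, 4),
  (3, 3, 4, 5, 5, 4, 4),
  (4, 1, 2, 5, 5, 4, 3),
  (5, 1, 1, 2, 2, 3, 3)
).toDF("ID", "col1A", "col1B", "col2A", "col2B", "col3A", "col3B")

// stack(3, ...) emits three (ColumnType, A, B) rows per input row.
view.selectExpr(
      "ID",
      "stack(3, 'Col1', col1A, col1B, 'Col2', col2A, col2B, 'Col3', col3A, col3B) as (ColumnType, A, B)")
    .where($"A" =!= $"B")
    .show()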