How to join text with group by in PySpark?

I have a PySpark DataFrame:
id  events
a0  a-markets-l1
a0  a-markets-watch
a0  a-markets-buy
c7  a-markets-z2
c7  scroll_down
a0  a-markets-sell
b2  next_screen
I am trying to join the events into one string per id.
Here's my Python (pandas) code:
df_events_userpath = df_events.groupby('id').agg({'events': lambda x: ' '.join(x)}).reset_index()
The output I am looking for:
id  events
a0  a-markets-l1 a-markets-watch a-markets-buy a-markets-sell
c7  a-markets-z2 scroll_down
b2  next_screen

In PySpark I have tried using collect_set:
df.groupBy("id").agg(f.collect_set("events").alias("events"))
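collect_set returns an array column and also drops duplicates and ordering. To get a space-joined string like the pandas output, a common sketch is collect_list combined with concat_ws; the DataFrame and column names below are taken from the question, and this assumes Spark 2.4+ where concat_ws accepts an array column:
from pyspark.sql import functions as f

# Group by id and join the collected events into one space-separated string.
# collect_list keeps duplicates but does not guarantee any particular order;
# if event order matters, sort explicitly (e.g. by a timestamp column) first.
df_events_userpath = (
    df_events
    .groupBy("id")
    .agg(f.concat_ws(" ", f.collect_list("events")).alias("events"))
)

df_events_userpath.show(truncate=False)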

Related

How to change "65→67→69" to "J7,G2,P9" in SQL/PostgreSQL/MySQL? Or use split fields/value mapper in Pentaho Data Integration (Spoon) to realize it?

I use KETTLE (Pentaho Data Integration/Spoon) to insert data into PostgreSQL from other databases. I have a field with the data below:
value
-----------
65→67→69
15→19→17
25→23→45
19→28→98
and a mapping table:
ID value
--------
65 J7
67 G2
69 P9
15 A8
19 b9
17 C1
25 b12
23 e12
45 A23
28 C17
98 F18
How can I change the above values into the values below? Is there an SQL way or a KETTLE way to do it?
new_value
-----------
J7,G2,P9
A8,b9,C1
b12,e12,A23
b9,C17,F18
Thanks so much for any advice.
Assuming these tables:
create table table1 (value text);
insert into table1 (value)
values
('65→67→69'),
('15→19→17'),
('25→23→45'),
('19→28→98')
;
create table table2 (id int, value text);
insert into table2 (id, value)
values
(65, 'J7'),
(67, 'G2'),
(69, 'P9'),
(15, 'A8'),
(19, 'b9'),
(17, 'C1'),
(25, 'b12'),
(23, 'e12'),
(45, 'A23'),
(28, 'C17'),
(98, 'F18')
;
In Postgres you can use a scalar subselect:
select t1.value,
       (select string_agg(t2.value, ',' order by t.idx)
        from table2 t2
          join lateral unnest(string_to_array(t1.value, '→')) with ordinality as t(val, idx)
            on t2.id::text = t.val
       ) as new_value
from table1 t1;
Online example

SQL: Data Cleaning

I am facing a problem which I do not know how to categorize. So, pardon me for the generic title. I have a dataset like:
Table1: Column1, Column2, Column3.
According to my business logic, for a given 'Column1, Column2' pair, Column3 can have only one unique value. So the table below is problematic because of the second entry:
Table1
Column1 Column2 Column3
A1 B1 R
A1 B1 O << ERROR! for A1-B1 pair only one value on column3 is accepted
A2 B2 R
A2 B3 J
A3 B3 K
A4 B5 K
From the above table I would like to find the problematic entries:
A1 B1 R
A1 B1 O
Thanks in advance for your help!
Using your example column names, you can run the following query to see just the Column1/Column2 pairs that have more than one value in Column3.
SELECT Column1, Column2, COUNT(DISTINCT Column3) as Column3
FROM Table1
GROUP BY Column1, Column2
HAVING COUNT(DISTINCT Column3) > 1
You can omit the HAVING line to see the complete list of Column1/Column2 pairs.
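If you also want to pull back the full problematic rows (as shown in the question) rather than just the pairs, a sketch using the same table and column names is to join the table back against that aggregate:
SELECT t.Column1, t.Column2, t.Column3
FROM Table1 t
JOIN (SELECT Column1, Column2
      FROM Table1
      GROUP BY Column1, Column2
      HAVING COUNT(DISTINCT Column3) > 1) dup
  ON t.Column1 = dup.Column1
 AND t.Column2 = dup.Column2;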

sql query to count number of users based on event sequence

I have a table called test which is sorted by time.
user_id event time
1 e1 t1
1 e3 t2
1 e2 t3
2 e2 t4
2 e1 t5
2 e5 t6
3 e2 t7
3 e4 t8
I have to find out how many unique user_ids there are for which event e1 happens before e2. Here the answer is one, user_id 1.
I am using postgresql.
Any help would be much appreciated.
This is probably your solution, using a sub-select (ev2) of the rows where the event is e2:
WITH event(user_id, event, time) AS (
  VALUES (1, 'e1', 't1'),
         (1, 'e3', 't2'),
         (1, 'e2', 't3'),
         (2, 'e2', 't4'),
         (2, 'e1', 't5'),
         (2, 'e5', 't6'),
         (3, 'e2', 't7'),
         (3, 'e4', 't8'))
SELECT count(event.event)
FROM event
JOIN (SELECT user_id, time
      FROM event
      WHERE event = 'e2') AS ev2 ON event.user_id = ev2.user_id
WHERE event.time < ev2.time
  AND event.event = 'e1'
Filter to the rows that occur before e2 takes place and whose event is e1:
SELECT e.user_id,
       count(e.event)
FROM event e
JOIN (SELECT user_id, time
      FROM event
      WHERE event = 'e2') AS ee ON e.user_id = ee.user_id
WHERE e.time < ee.time
  AND e.event = 'e1'
GROUP BY e.user_id
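As a variation, if a user could have several e1 or e2 rows and should still only be counted once, a PostgreSQL sketch (assuming the time column compares in chronological order) is:
SELECT count(*) AS users
FROM (SELECT user_id
      FROM event
      GROUP BY user_id
      -- at least one e1 must precede an e2; users missing either event
      -- yield NULL here, so the comparison fails and they are excluded
      HAVING min(time) FILTER (WHERE event = 'e1')
           < max(time) FILTER (WHERE event = 'e2')) q;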

PostgreSQL / Hive join multiple tables

Table a:
id value0
101 a1
102 a2
103 a3
Table b:
id value1
101 b1
101 b2
101 b3
Table c:
id value2
101 c1
103 c3
103 c4
Result table:
id value0 value1 value2
101 a1 b1 0
101 a1 b2 0
101 a1 b3 0
101 a1 0 c1
102 a2 0 0
103 a3 0 c3
103 a3 0 c4
Is it possible to produce the result table from tables a, b, c with one query (without creating two intermediate tables and joining them)? Maybe it is possible to do it using only left joins?
This may help you:
select t1.id, t2.id, t3.id
from tablea t1
inner join tableb t2 on t1.id = t2.id
inner join tablec t3 on t2.id = t3.id
group by t1.id, t2.id, t3.id
If you have a base table, select that and do a left join to the others. If none of your tables can act as a base table, you can use full joins (both work as outer joins):
select *
from table_a
full join table_b using (id)
full join table_c using (id)
This will produce SQL NULLs where there is no data, but you can use COALESCE(value0, 'N/A'), etc. to supply some default value.
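A minimal sketch of that COALESCE idea, using '0' as the placeholder shown in the expected result table (table names as in the query above):
select id,
       coalesce(value0, '0') as value0,
       coalesce(value1, '0') as value1,
       coalesce(value2, '0') as value2
from table_a
full join table_b using (id)
full join table_c using (id);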

How can I query the latest data from 3 tables?

I have 3 tables, for example A, B, C.
The relation between them is A one-to-many B, and B one-to-many C.
C is a table that stores photos.
I want to get data from A, but only the last photo for each A.
The columns look like this:
table a:
id a_msg
a1 msg in a
a2 msg in a
a3 msg in a
table b:
id b_msg a_id
b1 some data in b a1
b2 some data in b a1
b3 some data in b a2
b4 some data in b a3
table c:
id url createdate c_msg b_id
c1 /file/1.jpg 2014-12-01 06:55:54.600 some data in c b1
c2 /file/2.jpg 2014-12-01 06:55:54.601 some data in c b1
c3 /file/3.jpg 2014-12-01 06:55:54.602 some data in c b1
c4 /file/4.jpg 2014-12-01 06:55:54.603 some data in c b2
c5 /file/5.jpg 2014-12-01 06:55:54.604 some data in c b2
c6 /file/6.jpg 2014-12-01 06:55:54.605 some data in c b3
The result I want to get:
c_id url createdate c_msg b_msg b_id a_msg a_id
c6 /file/6.jpg 2014-12-01 06:55:54.605 some data in c some data in b b3 msg in a a1
c5 /file/5.jpg 2014-12-01 06:55:54.604 some data in c some data in b b2 msg in a a1
Sorry, I don't know how to use a tool to describe the tables; I hope you can easily understand what I mean.
If my description is not clear enough, I will edit the question. Thank you to anyone who can help.
Consider the following as an example:
create table table_a (id int,a_msg text);
create table table_b (id int,b_msg text,a_id int);
create table table_c (id int,url text,createdate timestamp with time zone,c_msg text ,b_id int);
and the data
insert into table_a values (1,'msg in table_a')
,(2,'2nd msg in table_a')
,(3,'3rd msg in table_a');
insert into table_b values (20,'msg in table_b',1)
,(21,'2nd msg in table_b',2)
,(22,'3rd msg in table_b',3);
insert into table_c values (30,'url','2014-12-01 06:55:54.600','msg in table_c',20)
,(31,'url 1','2014-12-01 06:55:54.604','2nd msg in table_c',21)
,(32,'url 2','2014-12-01 06:55:54.605','3rd msg in table_c',22);
To get the result you need to use INNER JOIN, and to get the last two rows use ORDER BY createdate DESC LIMIT 2:
select c.id as c_id,
       c.url,
       c.createdate,
       c.c_msg,
       b.b_msg,
       b.id as b_id,
       a.a_msg,
       a.id as a_id
from table_c c
inner join table_b b on c.b_id = b.id   /* to get data from table_b */
inner join table_a a on b.a_id = a.id   /* to get data from table_a */
order by createdate desc
limit 2   /* DESC sorts from the newest createdate and LIMIT 2 returns two rows */
SQLFiddle demo with the OP's data
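If the goal is the single newest photo per table_a row rather than the two newest rows overall, a PostgreSQL-only sketch (DISTINCT ON is not standard SQL; table names as above) would be:
select distinct on (a.id)
       c.id as c_id,
       c.url,
       c.createdate,
       c.c_msg,
       b.b_msg,
       b.id as b_id,
       a.a_msg,
       a.id as a_id
from table_c c
inner join table_b b on c.b_id = b.id
inner join table_a a on b.a_id = a.id
-- DISTINCT ON keeps the first row per a.id, so order each group newest-first
order by a.id, c.createdate desc;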