group rows into a list in pyspark


I have a Spark DataFrame with a structure similar to the following table:
**col1** **col2**
A 1
B 2
A 3
B 4
C 1
A 2
I want to group it on col1 and collect the values of col2 into a list. The output should look like this:
**col1** **list**
A [1,3,2]
B [2, 4]
C [1]
Can someone point me to any references?

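This should do the job. The $"col1" column syntax is Scala; in PySpark, pass the column names as strings:
from pyspark.sql.functions import collect_list
df.groupBy("col1").agg(collect_list("col2").alias("list"))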
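As a rough illustration of what collect_list produces, here is a plain-Python sketch (no Spark involved) that groups the sample rows by col1 and gathers col2 into lists. The `rows` data and the `group_into_lists` helper are hypothetical names used only for this example. In Spark itself, the order of elements inside each list is not guaranteed after a shuffle, whereas this sketch simply keeps input order.

```python
from collections import defaultdict


def group_into_lists(rows):
    """Group (key, value) pairs into {key: [values...]}, keeping input order."""
    grouped = defaultdict(list)
    for key, value in rows:
        grouped[key].append(value)
    return dict(grouped)


# Sample rows from the question, as (col1, col2) pairs.
rows = [("A", 1), ("B", 2), ("A", 3), ("B", 4), ("C", 1), ("A", 2)]

result = group_into_lists(rows)
for key, values in result.items():
    print(key, values)
```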

Related

Replace first n entries in a column in kdb

How can I replace the values in the first n columns of my table?
i.e. mycol:(1 2 3 4) to mycol:(a a 3 4)
Thank you in advance!

If it's the values within mycol that you want updated then they will need to be of the same type as the existing values. See below.
q)t:([]mycol:`$string 1+til 4;mycol2:til 4)
q)update mycol:`a from t where i<2
mycol mycol2
------------
a 0
a 1
3 2
4 3
One way around this, though, is to enlist each value of mycol so that the column becomes a general list; updates of any type can then be made.
q)t:([]mycol:1+til 4;mycol2:til 4)
q)update mycol:`a from(update enlist each mycol from t)where i<2
mycol mycol2
------------
`a 0
`a 1
,3 2
,4 3
q)meta update mycol:`a from(update enlist each mycol from t)where i<2
c | t f a
------| -----
mycol |
mycol2| j
It's unclear from your question whether you want the column names or the column values changed. If it's the column names, you can use xcol.
q)(2#`a)xcol([]w:3#til 3;x:3#.Q.a;y:`;z:0N)
a a y z
-------
0 a
1 b
2 c

PostgreSQL: Count from x to y

How do I count in PostgreSQL from X to Y where X and Y come from table fields.
Similar to how a for loop works.
I tried to use ranges.

Use generate_series:
It is not quite clear what your data looks like:
A: You have one table with two columns x and y and exactly one row:
x | y
--+--
2 | 5
B: You have one table with two columns x and y and more than one row:
x | y
--+--
2 | 5
3 |10
It's not quite clear what you mean by "count":
I.: You want one row per element:
count
-----
2
3
4
5
II.: You want an array of all elements:
count
-----
{2,3,4,5}
A.I: You have a table with two columns x and y with exactly one row and want one row per element:
SELECT generate_series(x, y)
FROM values;
Result as seen in I. above
B.I: You have many rows:
Same query, result is the union of the generated rows:
count
-----
2
3
4
5
3
4
5
6
7
8
9
10
A.II: One row, want an array:
You have to group your generated series
SELECT array_agg(gs)
FROM (
SELECT x, generate_series(x, y) as gs
FROM values
) s
GROUP BY x
Result as seen in II.
B.II: Many rows, want an array:
Same query, result looks like:
count
------------------
{2,3,4,5}
{3,4,5,6,7,8,9,10}
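The cases above can be sketched in plain Python to show what each result shape looks like. This is only an illustration of the semantics, not PostgreSQL code: the `generate_series` helper here is a hypothetical stand-in for the SQL function (integer arguments, step 1, inclusive bounds), and `table` is made-up data matching case B. Note that the SQL query groups by x, while this sketch builds one array per input row.

```python
def generate_series(start, stop):
    """Inclusive integer range, mimicking generate_series(start, stop) with step 1."""
    return list(range(start, stop + 1))


# Hypothetical table rows as (x, y) pairs, matching case B.
table = [(2, 5), (3, 10)]

# Case I: one output row per element; rows generated from every input row are concatenated.
per_element = [n for x, y in table for n in generate_series(x, y)]

# Case II: one array per input row.
per_row_arrays = [generate_series(x, y) for x, y in table]

print(per_element)
print(per_row_arrays)
```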

SQL Renumbering index after group by

I have the following input table:
Seq Group GroupSequence
1 0
2 4 A
3 4 B
4 4 C
5 0
6 6 A
7 6 B
8 0
Output table is:
Line NewSeq GroupSequence
1 1
2 2 A
3 2 B
4 2 C
5 3
6 4 A
7 4 B
8 5
The rules for the input table are:
Any positive integer in the Group column indicates that the rows are grouped together. The entire field may be NULL or blank. A null or 0 indicates that the row is processed on its own. In the above example there are two groups and three 'single' rows.
The GroupSequence column is a single character that sorts within the group. NULL, blank, 'A', 'B', 'C' and 'D' are the only values allowed.
If Group has a positive integer, there must be an alphabetic character in GroupSequence.
I need a query that creates the output table with a new column that sequences as shown.
External apps need to iterate through this table in either Line or NewSeq order (same order, different values).
I've tried variations on GROUP BY, PARTITION BY, OVER(), etc., with no success.
Any help much appreciated.

Perhaps this will help
The only trick here is Flg which will indicate a new Group Sequence (values will be 1 or 0). Then it is a small matter to sum(Flg) via a window function.
Edit - Updated Flg method
Example
Declare @YourTable Table ([Seq] int,[Group] int,[GroupSequence] varchar(50))
Insert Into @YourTable Values
(1,0,null)
,(2,4,'A')
,(3,4,'B')
,(4,4,'C')
,(5,0,null)
,(6,6,'A')
,(7,6,'B')
,(8,0,null)
Select Line = Row_Number() over (Order by Seq)
,NewSeq = Sum(Flg) over (Order By Seq)
,GroupSequence
From (
Select *
-- Flg is 1 when the row's Group differs from the previous row's Group, else 0
,Flg = case when [Group] = lag([Group],1) over (Order by Seq) then 0 else 1 end
From @YourTable
) A
Order By Line
Returns
Line NewSeq GroupSequence
1 1 NULL
2 2 A
3 2 B
4 2 C
5 3 NULL
6 4 A
7 4 B
8 5 NULL
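The flag-and-running-sum idea can be sketched in plain Python to make the logic easier to follow. This is a sketch under stated assumptions, not a translation of the query: `renumber` is a hypothetical helper, and it deliberately differs from the SQL in one respect. The SQL only compares each row's Group with the previous row's, so two adjacent rows that both have Group 0 would share a NewSeq; the sketch instead always starts a new sequence for a row whose Group is 0 or None, which follows the rule that such rows are processed on their own.

```python
def renumber(rows):
    """Return (line, new_seq, group_sequence) for rows of (seq, group, group_sequence)."""
    result = []
    new_seq = 0
    prev_group = None
    ordered = sorted(rows, key=lambda r: r[0])  # order by Seq
    for line, (seq, group, group_sequence) in enumerate(ordered, start=1):
        is_single = not group  # Group of 0 or None: row stands alone (assumption)
        flag = 1 if is_single or group != prev_group else 0
        new_seq += flag  # running sum of the flag
        result.append((line, new_seq, group_sequence))
        prev_group = group
    return result


# Sample data from the question: (Seq, Group, GroupSequence).
rows = [
    (1, 0, None), (2, 4, "A"), (3, 4, "B"), (4, 4, "C"),
    (5, 0, None), (6, 6, "A"), (7, 6, "B"), (8, 0, None),
]

output = renumber(rows)
for row in output:
    print(row)
```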

Functional update - multivariable function with dynamic columns

Any help with the following would be much appreciated!
I have two tables: table1 is a summary table whilst table2 is a list of all data points. I want to be able to summarise the information in table2 for each row in table1.
table1:flip `grp`constraint!(`a`b`c`d; 10 10 20 20);
table2:flip `grp`cat`constraint`val!(`a`a`a`a`a`b`b`b;`cl1`cl1`cl1`cl2`cl2`cl2`cl2`cl1; 10 10 10 10 10 10 20 10; 1 2 3 4 5 6 7 8);
function:{[grpL;constraintL;catL] first exec total: sum val from table2 where constraint=constraintL, grp=grpL,cat=catL};
update cl1:function'[grp;constraint;`cl1], cl2:function'[grp;constraint;`cl2] from table1;
The fourth line of this code achieves what I want for the two categories:cl1 and cl2
In table1 I want to name a new column with the name of the category (cl1, cl2, etc.) and I want the values in that column to be the output from running the function over that column.
However, I have hundreds of different categories, so don't want to have to list them out manually as in the fourth line. How would I pass in a list of categories, e.g. below?
`cl1`cl2`cl3

Sticking to your approach, you would just have to make your update statement functional and then iterate over the columns like so:
{![`table1;();0b;(1#x)!enlist ((';function);`grp;`constraint;1#x)]} each `cl1`cl2
Assuming you can amend table1 in place. If you must retain the original table1 then you can pass it by value though it will consume more memory
{![x;();0b;(1#y)!enlist ((';function);`grp;`constraint;1#y)]}/[table1;`cl1`cl2]
Another approach would be to aggregate, pivot and join though it's not necessarily a better solution as you get nulls rather than zeros
a:select sum val by cat,grp,constraint from table2
p:exec (exec distinct cat from a)#cat!val by grp,constraint from a
table1 lj p

There are several different methods you can look into.
The easiest method would be a functional update - http://code.kx.com/wiki/JB:QforMortals2/queries_q_sql#Functional_update
The approach below, though, should prove more useful, quicker and neater:
Your problem can be split into 2 parts. For the first part, you are looking to create a sum of each category by grp and constraint within table2. As for the second part, you are looking to join these results (the lookups) onto the corresponding records from table1.
You can create the necessary groups using by
q)exec val,cat by grp,constraint from table2
grp constraint| val cat
--------------| ------------------------------
a 10 | 1 2 3 4 5 `cl1`cl1`cl1`cl2`cl2
b 10 | 6 8 `cl2`cl1
b 20 | ,7 ,`cl2
Note though, this will only create nested lists of the columns in your select query
Next is to sum each of the cat groups
q)exec sum each val group cat by grp,constraint from table2
grp constraint|
--------------| ------------
a 10 | `cl1`cl2!6 9
b 10 | `cl2`cl1!6 8
b 20 | (,`cl2)!,7
Then, to create the cat's columns you can use a pivot like syntax - http://code.kx.com/wiki/Pivot
q)cats:asc exec distinct cat from table2
q)exec cats#sum each val group cat by grp,constraint from table2
grp constraint| cl1 cl2
--------------| -------
a 10 | 6 9
b 10 | 8 6
b 20 | 7
Now you can use this lookup table and index into each row from table1
q)(exec cats#sum each val group cat by grp,constraint from table2)[table1]
cl1 cl2
-------
6 9
8 6
To fill the nulls with zeros, use the caret symbol - http://code.kx.com/wiki/Reference/Caret
q)0^(exec cats#sum each val group cat by grp,constraint from table2)[table1]
cl1 cl2
-------
6 9
8 6
0 0
0 0
And now you can join on each row from table1 to your results using join-each
q)table1,'0^(exec cats#sum each val group cat by grp,constraint from table2)[table1]
grp constraint cl1 cl2
----------------------
a 10 6 9
b 10 8 6
c 20 0 0
d 20 0 0
HTH, Sean
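For readers less familiar with q, the aggregate, pivot, fill and join steps above can be sketched in plain Python. This is an illustration of the logic only, not kdb+ code; the tuple layouts and the `totals`, `cats` and `summary` names are assumptions made for this example, with the data copied from the question's table1 and table2.

```python
from collections import defaultdict

# table1 rows as (grp, constraint); table2 rows as (grp, cat, constraint, val).
table1 = [("a", 10), ("b", 10), ("c", 20), ("d", 20)]
table2 = [
    ("a", "cl1", 10, 1), ("a", "cl1", 10, 2), ("a", "cl1", 10, 3),
    ("a", "cl2", 10, 4), ("a", "cl2", 10, 5), ("b", "cl2", 10, 6),
    ("b", "cl2", 20, 7), ("b", "cl1", 10, 8),
]

# Step 1: sum val per (grp, constraint) and per cat.
totals = defaultdict(lambda: defaultdict(int))
for grp, cat, constraint, val in table2:
    totals[(grp, constraint)][cat] += val

# Step 2: the distinct categories become the new column names.
cats = sorted({cat for _, cat, _, _ in table2})

# Step 3: look up each table1 row, filling missing sums with 0, and join.
summary = []
for grp, constraint in table1:
    sums = totals.get((grp, constraint), {})
    row = {"grp": grp, "constraint": constraint}
    row.update({cat: sums.get(cat, 0) for cat in cats})
    summary.append(row)

for row in summary:
    print(row)
```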

This approach is the easiest way to pass in a list of categories
{table1^flip x!function'[table1`grp;table1`constraint;]each x}`cl1`cl2

T-SQL table variable data order

I have a UDF which returns table variable like
--
--
RETURNS @ElementTable TABLE
(
ElementID INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
ElementValue VARCHAR(MAX)
)
AS
--
--
Is the order of data in this table variable guaranteed to be the same as the order in which the data was inserted? For example, if I issue
INSERT INTO @ElementTable(ElementValue) VALUES ('1')
INSERT INTO @ElementTable(ElementValue) VALUES ('2')
INSERT INTO @ElementTable(ElementValue) VALUES ('3')
I expect the data to always be returned in that order when I run
select ElementValue from @ElementTable -- no ORDER BY here
EDIT:
If the order is not guaranteed, then the following query
SELECT T1.ElementValue,T2.ElementValue FROM dbo.MyFunc() T1
Cross Apply dbo.MyFunc() T2
order by t1.elementid
will not produce the 9-row (3 x 3) result as
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
consistently.
Is there any possibility that it could be like
1 2
1 1
1 3
2 3
2 2
2 1
3 1
3 2
3 3
How to do it using my above function?

No, the order is not guaranteed to be the same.
Unless, of course you are using ORDER BY. Then it is guaranteed to be the same.

Given your update, you obtain it in the obvious way - you ask the system to give you the results in the order you want:
SELECT T1.ElementValue,T2.ElementValue FROM dbo.MyFunc() T1
Cross join dbo.MyFunc() T2
order by t1.elementid, t2.elementid
You are guaranteed that if you're using inefficient single row inserts within your UDF, that the IDENTITY values will match the order in which the individual INSERT statements were specified.

Order is not guaranteed.
But if all you want is to get your records back in the same order you inserted them, then just order by your primary key. Since you already have that field set up as an auto-increment, it should suffice.

...or use a deterministic function
SELECT TOP 9
M1 = (ROW_NUMBER() OVER(ORDER BY id) + 2) / 3,
M2 = (ROW_NUMBER() OVER(ORDER BY id) + 2) % 3 + 1
FROM
sysobjects
M1 M2
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
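The arithmetic behind M1 and M2 can be checked with a short Python sketch. This only verifies the integer division and modulo expressions for row numbers 1 to 9; it does not involve SQL Server, and the `pairs` name is just for this example. The comparison against `itertools.product` shows that the formula yields the same pairs, in the same order, as an explicitly ordered 3 x 3 cross join.

```python
from itertools import product

# n plays the role of ROW_NUMBER(), running from 1 to 9.
# (n + 2) // 3 gives 1,1,1,2,2,2,3,3,3 and (n + 2) % 3 + 1 gives 1,2,3 repeating.
pairs = [((n + 2) // 3, (n + 2) % 3 + 1) for n in range(1, 10)]

for m1, m2 in pairs:
    print(m1, m2)

# Same pairs as the cross join of {1, 2, 3} with itself, ordered by both keys.
expected = list(product(range(1, 4), repeat=2))
print(pairs == expected)
```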

We Keep Coding

