How can I evenly divide records into N groups based on the values? - group-by

For a table as follows, how can I divide these records evenly into 3 groups based on the value of “factor_value”?
sym date factor_value
------ ---------- ------------
100000 2022.04.27 1
100001 2022.04.27 2
100002 2022.04.27 3
100003 2022.04.27 4
100004 2022.04.27 5
100005 2022.04.27 6
100006 2022.04.27 7
100007 2022.04.27 8
100008 2022.04.27 9
100009 2022.04.27 10
100010 2022.04.28
100000 2022.04.28
100001 2022.04.28
100002 2022.04.28 3
100003 2022.04.28 4
100004 2022.04.28 5
100005 2022.04.28 6
100006 2022.04.28 7
100007 2022.04.28 8
100008 2022.04.28 9

This can be implemented with the DolphinDB functions cutPoints and asof.
sym = take(string(100000..100010), 20)
date = sort(take(2022.04.27..2022.04.28, 20))
factor_value = 1..10 join take(int(), 3) join 3..9
tb = table(sym, date, factor_value)
select *, asof(cutPoints(int(factor_value*100000), 3), factor_value*100000)+1 as factor_quantile from tb context by date csort factor_value having size(distinct(factor_value*100000))>3
First, use context by with csort to sort the column factor_value within each date. Then cutPoints computes the cut points that divide the sorted values into 3 even groups, and asof maps each value to its group; adding 1 numbers the groups from 1. The having clause keeps only the dates with more than 3 distinct factor values.
output:
sym date factor_value factor_quantile
------ ---------- ------------ ---------------
100000 2022.04.27 1 1
100001 2022.04.27 2 1
100002 2022.04.27 3 1
100003 2022.04.27 4 1
100004 2022.04.27 5 2
100005 2022.04.27 6 2
100006 2022.04.27 7 2
100007 2022.04.27 8 3
100008 2022.04.27 9 3
100009 2022.04.27 10 3
100010 2022.04.28 1
100000 2022.04.28 1
100001 2022.04.28 1
100002 2022.04.28 3 1
100003 2022.04.28 4 2
100004 2022.04.28 5 2
100005 2022.04.28 6 2
100006 2022.04.28 7 3
100007 2022.04.28 8 3
100008 2022.04.28 9 3
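For readers more familiar with pandas, roughly the same per-date grouping can be sketched as below. This is an illustrative analogue, not part of the DolphinDB solution: pd.qcut uses quantile edges, which will not always coincide with the cut points chosen by cutPoints (the 2022.04.28 group, which contains nulls, bins slightly differently), but the 2022.04.27 result matches the output above.

```python
import pandas as pd

# Rebuild the example table; the three missing factor values on 2022.04.28 become NaN.
df = pd.DataFrame({
    "sym": [str(100000 + i) for i in range(10)] + ["100010"] + [str(100000 + i) for i in range(9)],
    "date": ["2022.04.27"] * 10 + ["2022.04.28"] * 10,
    "factor_value": list(range(1, 11)) + [None, None, None] + list(range(3, 10)),
})

# For each date, bin the non-null factor values into 3 roughly equal-count groups.
df["factor_quantile"] = df.groupby("date")["factor_value"].transform(
    lambda s: pd.qcut(s, 3, labels=False) + 1
)
```

On the 2022.04.27 rows this yields quantiles 1,1,1,1,2,2,2,3,3,3, the same grouping as the DolphinDB query.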

Related

How to perform a pivot() and write.parquet on each partition of pyspark dataframe?

I have a spark dataframe df as below:
key date val col3
1   1    10  1
1   2    12  1
2   1    5   1
2   2    7   1
3   1    30  2
3   2    20  2
4   1    12  2
4   2    8   2
5   1    0   2
5   2    12  2
I want to:
1) df_pivot = df.groupBy(['date', 'col3']).pivot('key').sum('val')
2) df_pivot.write.parquet('location')
But my data can get really big with millions of unique keys and unique col3.
Is there any way where I can do the above operations per partition of col3?
Eg: For partition where col3==1, do the pivot and write the parquet
Note: I do not want to use a for loop!
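To make the desired per-partition result concrete, here is a small pandas sketch (pandas stands in for Spark purely for illustration; pivot_table is the pandas analogue of the groupBy(...).pivot(...).sum(...) chain) showing the pivot for the col3 == 1 partition alone:

```python
import pandas as pd

# The sample data from the question.
df = pd.DataFrame({
    "key":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "date": [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
    "val":  [10, 12, 5, 7, 30, 20, 12, 8, 0, 12],
    "col3": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
})

# Pivot only the col3 == 1 partition: one column per key present in that partition.
part = df[df["col3"] == 1]
pivoted = part.pivot_table(index=["date", "col3"], columns="key",
                           values="val", aggfunc="sum").reset_index()
```

Restricting the pivot to one partition first means the output only grows columns for the keys actually present in that partition, which is the point of doing it per col3 value.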

Mutual followers postgresql query

How can I find mutual followers? I am new to postgres and I am having a hard time constructing this.
followers
u_id f_id
3 1
2 5
4 3
1 2
4 1
1 3
2 8
2 10
1 4
2 4
3 4
4 5
4 8
example of expected output:
u_id f_id
3 1
1 3
4 3
3 4
...
...
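The condition being expressed is that both (u_id, f_id) and the reversed pair (f_id, u_id) appear in the table, which in SQL is typically a self-join of followers against itself on swapped columns. A quick Python sketch of that condition on the sample data:

```python
# The sample data from the question as (u_id, f_id) pairs.
follows = {(3, 1), (2, 5), (4, 3), (1, 2), (4, 1), (1, 3), (2, 8),
           (2, 10), (1, 4), (2, 4), (3, 4), (4, 5), (4, 8)}

# A pair is mutual when the reversed pair is also present.
mutual = sorted(p for p in follows if (p[1], p[0]) in follows)
print(mutual)  # [(1, 3), (1, 4), (3, 1), (3, 4), (4, 1), (4, 3)]
```

This matches the expected output: 1 and 3, 1 and 4, and 3 and 4 each follow one another, and each mutual relationship appears once per direction.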

How to do a simultaneous ascending and descending sort in KDB/Q

In SQL, one can do
SELECT * FROM tbl ORDER BY col1, col2 DESC
In KDB, one can do
`col1 xasc select from tbl
or
`col2 xdesc select from tbl
But how does one sort by col1 ascending then by col2 descending in KDB/Q?
Two sorts: apply xdesc on the secondary column first, then xasc on the primary column. Because q's sorts are stable, the descending b order is preserved within each group of equal a.
Create example data:
q)show tbl:([]a:10?10;b:10?10;c:10?10)
a b c
-----
8 4 8
1 9 1
7 2 9
2 7 5
4 0 4
5 1 6
4 9 6
2 2 1
7 1 8
8 8 5
Do sorting:
q)`a xasc `b xdesc tbl
a b c
-----
1 9 1
2 7 5
2 2 1
4 9 6
4 0 4
5 1 6
7 2 9
7 1 8
8 8 5
8 4 8
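The same composition works with any stable sort, not just q's. A Python sketch of the identical trick on the example rows (Python's sorted is also stable):

```python
# The example rows (a, b, c) from the q table above.
rows = [(8, 4, 8), (1, 9, 1), (7, 2, 9), (2, 7, 5), (4, 0, 4),
        (5, 1, 6), (4, 9, 6), (2, 2, 1), (7, 1, 8), (8, 8, 5)]

# Sort by b descending first, then stably re-sort by a ascending;
# the descending b order survives within each a group.
by_b_desc = sorted(rows, key=lambda r: r[1], reverse=True)
result = sorted(by_b_desc, key=lambda r: r[0])
```

The same result can also be had in one pass with a composite key such as key=lambda r: (r[0], -r[1]) when the secondary column is numeric.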

KDB: Aggregate across consecutive rows with common label

I would like to sum across consecutive rows that share the same label. Any very simple ways to do this?
Example: I start with this table...
qty flag
1 OFF
3 ON
2 ON
2 OFF
9 OFF
4 ON
... and would like to generate...
qty flag
1 OFF
5 ON
11 OFF
4 ON
One method:
q)show t:flip`qty`flag!(1 3 2 2 9 4;`OFF`ON`ON`OFF`OFF`ON)
qty flag
--------
1 OFF
3 ON
2 ON
2 OFF
9 OFF
4 ON
q)show result:select sum qty by d:sums differ flag,flag from t
d flag1| qty
----------| ---
1 OFF | 1
2 ON | 5
3 OFF | 11
4 ON | 4
Then to get it in the format you require:
q)`qty`flag#0!result
qty flag
--------
1 OFF
5 ON
11 OFF
4 ON
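The sums differ flag expression numbers each run of consecutive equal flags, which is what makes the grouping work. itertools.groupby expresses the same run-based grouping in Python; this is an illustrative sketch, not part of the q answer:

```python
from itertools import groupby

# The sample table from the question.
qty = [1, 3, 2, 2, 9, 4]
flag = ["OFF", "ON", "ON", "OFF", "OFF", "ON"]

# Group consecutive rows sharing the same flag and sum qty within each run.
result = [(sum(q for q, _ in grp), f)
          for f, grp in groupby(zip(qty, flag), key=lambda r: r[1])]
print(result)  # [(1, 'OFF'), (5, 'ON'), (11, 'OFF'), (4, 'ON')]
```

Note that groupby, like differ, only merges adjacent equal values, so the two OFF runs stay separate, exactly as in the desired output.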

How do I get a list of all descendant nodes related to their parent nodes in rows in T-SQL?

As a sample, my table (table name: hier) looks like this:
parentID childID
-------- -------
0 1
1 2
1 3
2 5
2 8
3 4
3 6
3 7
4 9
and I want it to output this:
parentID RelatedID
-------- ---------
0 1
0 2
0 3
0 4
0 5
0 6
0 7
0 8
0 9
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
2 5
2 8
3 4
3 6
3 7
3 9
4 9
The CTE must be recursive, otherwise it only reaches grandchildren (a single join step) and misses deeper descendants such as 0 -> 9:
WITH cte (parentID, RelatedID) AS
(
    SELECT parentID, childID FROM hier
    UNION ALL
    SELECT c.parentID, h.childID
    FROM cte c INNER JOIN hier h ON c.RelatedID = h.parentID
)
SELECT DISTINCT parentID, RelatedID FROM cte
ORDER BY parentID, RelatedID;
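The essential idea is a transitive closure: keep joining the result back onto the edge set until no new (ancestor, descendant) pairs appear, which is exactly what a recursive CTE iterates to a fixed point. A small Python sketch of that fixed-point computation on the sample data:

```python
# The hier table as (parentID, childID) edges.
edges = {(0, 1), (1, 2), (1, 3), (2, 5), (2, 8),
         (3, 4), (3, 6), (3, 7), (4, 9)}

# Repeatedly extend the closure by one more join step until it stops growing.
closure = set(edges)
while True:
    step = {(p, c2) for (p, c1) in closure for (e1, c2) in edges if c1 == e1}
    if step <= closure:
        break
    closure |= step
```

On this data the closure contains 24 pairs, matching the expected output above (including the deep pairs 0 -> 9 and 3 -> 9).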