KDB: Aggregate across consecutive rows with common label

I would like to sum across consecutive rows that share the same label. Any very simple ways to do this?
Example: I start with this table...
qty flag
1 OFF
3 ON
2 ON
2 OFF
9 OFF
4 ON
... and would like to generate...
qty flag
1 OFF
5 ON
11 OFF
4 ON

One method:
q)show t:flip`qty`flag!(1 3 2 2 9 4;`OFF`ON`ON`OFF`OFF`ON)
qty flag
--------
1   OFF
3   ON
2   ON
2   OFF
9   OFF
4   ON
q)show result:select sum qty by d:sums differ flag,flag from t
d flag| qty
------| ---
1 OFF | 1
2 ON  | 5
3 OFF | 11
4 ON  | 4
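The grouping key d comes from sums differ flag: differ returns 1b on every row where flag changes from the previous row (and on the first row), and sums turns those booleans into a running group number, so each run of consecutive identical labels gets its own key:
q)differ t`flag
110101b
q)sums differ t`flag
1 2 2 3 3 4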
Then to get it in the format you require:
q)`qty`flag#0!result
qty flag
--------
1   OFF
5   ON
11  OFF
4   ON

Related

Create a range of dates in a pyspark DataFrame

I have the following abstracted DataFrame (my original DF has 60+ billion rows):
Id Date Val1 Val2
1 2021-02-01 10 2
1 2021-02-05 8 4
2 2021-02-03 2 0
1 2021-02-07 12 5
2 2021-02-05 1 3
My expected output is:
Id Date Val1 Val2
1 2021-02-01 10 2
1 2021-02-02 10 2
1 2021-02-03 10 2
1 2021-02-04 10 2
1 2021-02-05 8 4
1 2021-02-06 8 4
1 2021-02-07 12 5
2 2021-02-03 2 0
2 2021-02-04 2 0
2 2021-02-05 1 3
Basically, what I need is: if Val1 or Val2 changes over a period of time, all the rows between those two dates must carry the values from the previous date. (To see this more clearly, look at Id 2.)
I know that I can do this in many ways (window function, udf, ...), but my question is: since my original DF has more than 60 billion rows, what is the best approach for this processing?
I think the best approach (performance-wise) is to perform an inner join (probably with broadcasting). If you are worried about the number of records, I suggest running them in batches (split by record count, by date, or even by a random key). The general idea is simply to avoid running everything at once.
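A minimal sketch of that idea (assuming Spark 2.4+ for sequence and the Id/Date/Val1/Val2 columns from the question; it builds the full calendar per Id, left-joins the observations back on, and forward-fills with a small per-Id window rather than a pure broadcast inner join):
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, '2021-02-01', 10, 2), (1, '2021-02-05', 8, 4), (2, '2021-02-03', 2, 0),
     (1, '2021-02-07', 12, 5), (2, '2021-02-05', 1, 3)],
    ['Id', 'Date', 'Val1', 'Val2']).withColumn('Date', F.to_date('Date'))

# One calendar row per Id for every day between its first and last observation
calendar = (df.groupBy('Id')
              .agg(F.min('Date').alias('start'), F.max('Date').alias('end'))
              .select('Id', F.explode(F.sequence(F.col('start'), F.col('end'),
                                                 F.expr('interval 1 day'))).alias('Date')))

# Join the observations back on and forward-fill the gaps within each Id
w = Window.partitionBy('Id').orderBy('Date').rowsBetween(Window.unboundedPreceding, 0)
filled = (calendar.join(df, ['Id', 'Date'], 'left')
                  .withColumn('Val1', F.last('Val1', ignorenulls=True).over(w))
                  .withColumn('Val2', F.last('Val2', ignorenulls=True).over(w)))

filled.orderBy('Id', 'Date').show()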

How to perform a pivot() and write.parquet on each partition of pyspark dataframe?

I have a spark dataframe df as below:
key | date | val | col3
1   | 1    | 10  | 1
1   | 2    | 12  | 1
2   | 1    | 5   | 1
2   | 2    | 7   | 1
3   | 1    | 30  | 2
3   | 2    | 20  | 2
4   | 1    | 12  | 2
4   | 2    | 8   | 2
5   | 1    | 0   | 2
5   | 2    | 12  | 2
I want to:
1) df_pivot = df.groupBy(['date', 'col3']).pivot('key').sum('val')
2) df_pivot.write.parquet('location')
But my data can get really big, with millions of unique keys and unique col3 values.
Is there any way to do the above operations per partition of col3?
Eg: For partition where col3==1, do the pivot and write the parquet
Note: I do not want to use a for loop!
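One pattern to consider (a sketch only, assuming the column names and the 'location' path from the question; it does not shrink the pivot itself, it just splits the write) is to pivot once and let the writer partition the output by col3, so each col3 value lands in its own sub-directory without an explicit loop:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1, 10, 1), (1, 2, 12, 1), (2, 1, 5, 1), (2, 2, 7, 1), (3, 1, 30, 2),
     (3, 2, 20, 2), (4, 1, 12, 2), (4, 2, 8, 2), (5, 1, 0, 2), (5, 2, 12, 2)],
    ['key', 'date', 'val', 'col3'])

# 1) pivot as before
df_pivot = df.groupBy('date', 'col3').pivot('key').sum('val')

# 2) one sub-directory per col3 value, e.g. location/col3=1/, location/col3=2/
df_pivot.write.partitionBy('col3').parquet('location')
If the number of distinct keys is the real problem, passing an explicit list of values to pivot('key', values) at least avoids the extra pass Spark needs to discover them.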

Mutual followers postgresql query

How can I find mutual followers? I am new to postgres and I am having a hard time constructing this.
followers
u_id f_id
3 1
2 5
4 3
1 2
4 1
1 3
2 8
2 10
1 4
2 4
3 4
4 5
4 8
Example of expected output:
u_id f_id
3 1
1 3
4 3
3 4
...
...
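One way to express this (a sketch, assuming the followers table above and that "mutual" means both the (u_id, f_id) and the reversed (f_id, u_id) rows exist) is a self-join:
SELECT f1.u_id, f1.f_id
FROM followers f1
JOIN followers f2
  ON f1.u_id = f2.f_id
 AND f1.f_id = f2.u_id;
Add a condition such as f1.u_id < f1.f_id if you want each mutual pair reported only once rather than in both directions.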

How to do a simultaneous ascending and descending sort in KDB/Q

In SQL, one can do
SELECT * FROM tbl ORDER BY col1, col2 DESC
In KDB, one can do
`col1 xasc select from tbl
or
`col2 xdesc select from tbl
But how does one sort by col1 ascending then by col2 descending in KDB/Q?
Two sorts: apply the descending sort on the secondary column first, then the ascending sort on the primary column. Because xasc is stable, the descending order is preserved within each group of the primary column.
Create example data:
q)show tbl:([]a:10?10;b:10?10;c:10?10)
a b c
-----
8 4 8
1 9 1
7 2 9
2 7 5
4 0 4
5 1 6
4 9 6
2 2 1
7 1 8
8 8 5
Do sorting:
q)`a xasc `b xdesc tbl
a b c
-----
1 9 1
2 7 5
2 2 1
4 9 6
4 0 4
5 1 6
7 2 9
7 1 8
8 8 5
8 4 8

How do I get a list of all descendant nodes, related to their parent nodes, as rows in T-SQL?

As a sample, my table (table name: hier) looks like this:
parentID childID
-------- -------
0 1
1 2
1 3
2 5
2 8
3 4
3 6
3 7
4 9
and I want it to output this:
parentID RelatedID
-------- ---------
0 1
0 2
0 3
0 4
0 5
0 6
0 7
0 8
0 9
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
2 5
2 8
3 4
3 6
3 7
3 9
4 9
A recursive CTE will walk every level of the hierarchy:
With cte (parentID, RelatedID)
As
(
    Select parentID, childID From hier           -- anchor: direct parent/child edges
    Union All
    Select c.parentID, h.childID                 -- step down one more level
    From cte c
    Inner Join hier h On c.RelatedID = h.parentID
)
Select parentID, RelatedID From cte
Order By parentID, RelatedID