Collapse by values - aggregate

I have a dataset that looks like this (var1, var2, var3, ... have possible values of 1,2,3,4,5,6).
var1 (3,3,4,1,2,4,5,6...) var2 (4,5,2,2,3,1,6,6...) var3 (1,2,2,4,2,3,6,5...)
And I'd like to count how often each value (1,2,3,4,5,6) occurs in var1, var2, var3, ... and create a dataset like this:
value(1,2,3,4,5,6)
var1_count(1,1,2,2,1,1)
var2_count(1,2,1,1,1,2)
var3_count(1,3,1,1,1,1)
I tried the command collapse(count), but I don't have a grouping variable. Is there a way to aggregate variables by their values?

Please read advice on MCVE: https://stackoverflow.com/help/mcve
Your data can be read in as follows.
clear
mat var1 = (3,3,4,1,2,4,5,6)
mat var2 = (4,5,2,2,3,1,6,6)
mat var3 = (1,2,2,4,2,3,6,5)
set obs 8
forval j = 1/3 {
    gen var`j' = var`j'[1, _n]
}
list, sep(0)
     +--------------------+
     | var1   var2   var3 |
     |--------------------|
  1. |    3      4      1 |
  2. |    3      5      2 |
  3. |    4      2      2 |
  4. |    1      2      4 |
  5. |    2      3      2 |
  6. |    4      1      3 |
  7. |    5      6      6 |
  8. |    6      6      5 |
     +--------------------+
One way to get that tabulation is to install tabm using
ssc install tab_chi
help tabm
tabm var?
           |                        values
  variable |      1      2      3      4      5      6 |     Total
-----------+-------------------------------------------+----------
      var1 |      1      1      2      2      1      1 |         8
      var2 |      1      2      1      1      1      2 |         8
      var3 |      1      3      1      1      1      1 |         8
-----------+-------------------------------------------+----------
     Total |      3      6      4      4      3      4 |        24
tabm also offers a replace option to save the tabulation as a new dataset.
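For comparison only, here is a rough Python/pandas sketch of the same counts-per-value table; the data and column names just mirror the example above, and none of this is part of the tabm answer.
# Illustrative pandas analogue of the tabulation above (not Stata, not tabm).
import pandas as pd

df = pd.DataFrame({
    "var1": [3, 3, 4, 1, 2, 4, 5, 6],
    "var2": [4, 5, 2, 2, 3, 1, 6, 6],
    "var3": [1, 2, 2, 4, 2, 3, 6, 5],
})
# Count each value 1..6 per column; reindex so values that never occur show as 0.
counts = df.apply(pd.Series.value_counts).reindex(range(1, 7)).fillna(0).astype(int)
counts.index.name = "value"
print(counts)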

Related

pySpark sequential window function / access previously computed value (previous row)

I have a PySpark dataframe and need to compute a column whose value depends on the value of the same column in the previous row. However, instead of the previous row's original value, I need its new value, i.e. the one to which the calculation has already been applied.
Specifically, I need to compute the following: given a dataframe with a column A, ordered by a column order, compute a column B sequentially as B = MAX(0, LAG(B) - LAG(A)), starting with a default value of 0 for the first row.
Example:
Input:
order | A
------|----
    0 | -1
    1 | -2
    2 |  4
    3 |  4
    4 | -1
    5 |  4
    6 | -1
Wanted output:
order | A  | B
------|----|---
    0 | -1 | 0   <- B is set to the default 0 here
    1 | -2 | 1
    2 |  4 | 3
    3 |  4 | 0
    4 | -1 | 0
    5 |  4 | 1
    6 | -1 | 0
Using the standard F.lag window function does not work, since it only gives access to the previous row's original value; using the newly computed value would require the rows to be processed sequentially, which is exactly what the distributed window evaluation cannot do.
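No answer is recorded here, but it may help to make the sequential rule concrete: B starts at 0 and each later row is max(0, previous B - previous A). A minimal plain-Python sketch (not a distributed PySpark solution; it assumes the rows are already sorted by order) reproduces the wanted output:
# Plain-Python illustration of B = MAX(0, LAG(B) - LAG(A)), rows assumed sorted by `order`.
A = [-1, -2, 4, 4, -1, 4, -1]
B = [0]                                    # first row gets the default 0
for i in range(1, len(A)):
    B.append(max(0, B[i - 1] - A[i - 1]))  # uses the already-computed previous B
print(B)                                   # [0, 1, 3, 0, 0, 1, 0]
Because each B depends on the freshly computed previous B, the loop is inherently sequential, which is exactly why a plain F.lag window expression cannot produce it.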

A Postgres query to get subtraction of a value in a row by the value in the next row

I have a table (mytable) like this:
id | value
=========
1 | 4
2 | 5
3 | 8
4 | 16
5 | 8
...
I need a query that returns, for each row, its value minus the previous row's value:
id | value | diff
=================
1 | 4 | 4 (4-Null)
2 | 5 | 1 (5-4)
3 | 8 | 3 (8-5)
4 | 16 | 8 (16-8)
5 | 8 | -8 (8-16)
...
Right now I use a Python script to do this, but I guess it would be faster to create a view on this table.
You should use window functions - LAG() in this case:
SELECT id, value, value - LAG(value, 1) OVER (ORDER BY id) AS diff
FROM mytable
ORDER BY id;
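Two asides, not part of the answer above: LAG() returns NULL for the first row, so the first diff will be NULL rather than the 4 shown in the example (wrap the LAG expression in COALESCE(..., 0) if you want 4). And since the question mentions currently using a Python script, the same computation in pandas is a short sketch with shift():
# Pandas sketch of the same diff, in case the Python-script route is kept.
# Column names (id, value) follow the example table.
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "value": [4, 5, 8, 16, 8]}).sort_values("id")
df["diff"] = df["value"] - df["value"].shift(1)   # value - previous value; first row is NaN
print(df)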

kdb+/q: Fast vector update given a list of keys and values to be updated

Given a list of ids/keys and a set of corresponding values for a constant column:
q)ikeys: 1 2 3 5;
q)ivals: 100 100.5 101.5 99.5;
What is the fastest way to update the `toupd column in the following table so that the rows matching the given ikeys are updated to the new values in ivals, i.e.
q)show tab
ikeys| toupd noupd
-----| -----------
1    | 0.5   1
2    | 100.5 2
3    | 500.5 4
4    | 400.5 8
5    | 400.5 16
6    | 600.5 32
7    | 700.5 64
is updated to:
q)show restab
ikeys| toupd noupd
-----| -----------
1    | 100   1
2    | 100.5 2
3    | 101.5 4
4    | 400.5 8
5    | 99.5  16
6    | 600.5 32
7    | 700.5 64
Furthermore, is there a canonical method with which one could update multiple columns in this manner?
Thanks.
A dot amend is another approach, and one which generalises more easily to more than one column. It can also take advantage of amend-in-place, which is the most memory-efficient approach since it doesn't create a duplicate copy of the table in memory (this assumes the table is a global).
ikeys:1 2 3 5
ivals:100 100.5 101.5 99.5
tab:([ikeys:1+til 7]toupd:.5 100.5 500.5 400.5 400.5 600.5 700.5;noupd:1 2 4 8 16 32 64)
q).[tab;(([]ikeys);`toupd);:;ivals]
ikeys| toupd noupd
-----| -----------
1    | 100   1
2    | 100.5 2
3    | 101.5 4
4    | 400.5 8
5    | 99.5  16
6    | 600.5 32
7    | 700.5 64
/amend in place
.[`tab;(([]ikeys);`toupd);:;ivals]
/generalise to two columns
q).[tab;(([]ikeys);`toupd`noupd);:;flip(ivals;1000 2000 3000 4000)]
ikeys| toupd noupd
-----| -----------
1    | 100   1000
2    | 100.5 2000
3    | 101.5 3000
4    | 400.5 8
5    | 99.5  4000
6    | 600.5 32
7    | 700.5 64
/you could amend in place here too
.[`tab;(([]ikeys);`toupd`noupd);:;flip(ivals;1000 2000 3000 4000)]
Here are two different ways of doing it.
tab lj ([ikeys] toupd: ivals)
or
m: ikeys
update toupd: ivals from tab where ikeys in m
I'm sure there are plenty more ways. If you want to find out which is fastest for your purpose (and your data), time them with \t:1000 yourCodeHere on large tables and see which suits you best.
As for which is the canonical way for multiple columns, I imagine it would be the update, but it's a matter of personal preference; just do whatever is fastest.
A dictionary is also a common method of updating values given a mapping. Indexing the dictionary with the ikeys column gives the new values and then we fill in nulls with the old toupd column values.
q)show d:ikeys!ivals
1| 100
2| 100.5
3| 101.5
5| 99.5
q)update toupd:toupd^d ikeys from tab
ikeys| toupd noupd
-----| -----------
1    | 100   1
2    | 100.5 2
3    | 101.5 4
4    | 400.5 8
5    | 99.5  16
6    | 600.5 32
7    | 700.5 64
It is also worth noting that the update with the where clause is not guaranteed to work in all cases, e.g. if you have more mapping values than keys that appear in your ikeys column:
q)m:ikeys:1 2 3 5 7 11
q)ivals:100 100.5 101.5 99.5 100 100
q)update toupd: ivals from tab where ikeys in m
'length
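As a cross-language illustration only (Python/pandas, not kdb+/q), the dictionary answer's pattern of "index the mapping with the key column, then fall back to the old values where there is no mapping" looks like this; the column names simply mirror the q example:
# Pandas sketch of the map-then-fill pattern from the q dictionary answer.
import pandas as pd

tab = pd.DataFrame({
    "ikeys": [1, 2, 3, 4, 5, 6, 7],
    "toupd": [0.5, 100.5, 500.5, 400.5, 400.5, 600.5, 700.5],
    "noupd": [1, 2, 4, 8, 16, 32, 64],
})
d = dict(zip([1, 2, 3, 5], [100, 100.5, 101.5, 99.5]))    # the ikeys!ivals mapping
tab["toupd"] = tab["ikeys"].map(d).fillna(tab["toupd"])   # new value where mapped, else old
print(tab)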

PostgreSQL WITH RECURSIVE query to get ordered parent-child chain by a Partition Key

I'm having trouble writing a SQL script on PostgreSQL 9.6.6 that orders the steps in a process by using the steps' parent-child IDs, grouped/partitioned per process ID. I couldn't find this particular case covered here, so I apologize if I missed it; please point me to an existing solution in the comments if so.
The case: I have a table which looks like this:
processID | stepID | parentID
        1 |      1 |     NULL
        1 |      3 |        5
        1 |      2 |        4
        1 |      4 |        3
        1 |      5 |        1
        2 |      1 |     NULL
        2 |      3 |        5
        2 |      2 |        4
        2 |      4 |        3
        2 |      5 |        1
Now I have to order the steps for each processID, starting with the step whose parentID is NULL.
Note: I cannot simply order by stepID or parentID, because steps inserted into an existing process get a higher stepID than the last step of that process (the stepID is a continuously generated surrogate key).
I have to order the steps for every processID so that I receive the following output:
processID | stepID | parentID
        1 |      1 |     NULL
        1 |      5 |        1
        1 |      3 |        5
        1 |      4 |        3
        1 |      2 |        4
        2 |      1 |     NULL
        2 |      5 |        1
        2 |      3 |        5
        2 |      4 |        3
        2 |      2 |        4
I tried to do this with a recursive CTE (WITH RECURSIVE):
WITH RECURSIVE
starting (processID, stepID, parentID) AS
(
    SELECT b.processID, b.stepID, b.parentID
    FROM process b
    WHERE b.parentID IS NULL
),
descendants (processID, stepID, parentID) AS
(
    SELECT b.processID, b.stepID, b.parentID
    FROM starting b
    UNION ALL
    SELECT b.processID, b.stepID, b.parentID
    FROM process b
    JOIN descendants AS c ON b.parentID = c.stepID
)
SELECT * FROM descendants
The result is not what I am searching for. As we have hundreds of processes, I receive a list where the first records are the rows of all the different processIDs that have a NULL value as parentID. I guess I have to make the recursion run per processID as well, but I have no idea how.
Thank you for your help!
You should calculate the level of each step:
with recursive starting as (
    select processid, stepid, parentid, 0 as level
    from process
    where parentid is null
    union all
    select p.processid, p.stepid, p.parentid, s.level + 1
    from starting s
    join process p on s.stepid = p.parentid and s.processid = p.processid
)
select *
from starting
order by processid, level
 processid | stepid | parentid | level
-----------+--------+----------+-------
         1 |      1 |          |     0
         1 |      5 |        1 |     1
         1 |      3 |        5 |     2
         1 |      4 |        3 |     3
         1 |      2 |        4 |     4
         2 |      1 |          |     0
         2 |      5 |        1 |     1
         2 |      3 |        5 |     2
         2 |      4 |        3 |     3
         2 |      2 |        4 |     4
(10 rows)
Of course, you can skip the last column in the final select if you do not need it.
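As an aside in plain Python (not part of the SQL answer), the recursion is just "start at the root of each process and keep following child links, counting the depth"; a small sketch of that idea, using data that mirrors the example table, can be handy for checking the expected order:
# Walk each process's parent-child chain from the root, printing steps with their level.
# Assumes each step has at most one child, as in the example; data mirrors the question.
rows = [
    (1, 1, None), (1, 3, 5), (1, 2, 4), (1, 4, 3), (1, 5, 1),
    (2, 1, None), (2, 3, 5), (2, 2, 4), (2, 4, 3), (2, 5, 1),
]  # (processID, stepID, parentID)

child_of = {(proc, parent): step for proc, step, parent in rows}

for proc in sorted({r[0] for r in rows}):
    level, parent = 0, None
    while (proc, parent) in child_of:
        step = child_of[(proc, parent)]
        print(proc, step, parent, level)
        parent, level = step, level + 1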

Add a key element for n rows in PySpark Dataframe

I have a dataframe like the one shown below.
id | run_id
--------------
4 | 12345
6 | 12567
10 | 12890
13 | 12450
I wish to add a new column, say key, that will have the value 1 for the first n rows and 2 for the next n rows. The result will be like:
id | run_id | key
----------------------
4 | 12345 | 1
6 | 12567 | 1
10 | 12890 | 2
13 | 12450 | 2
Is it possible to do this with PySpark? Thanks in advance for the help.
Here is one way to do it using zipWithIndex:
# sample rdd
rdd = sc.parallelize([[4, 12345], [6, 12567], [10, 12890], [13, 12450]])
# group size for key
n = 2
# add a row number with zipWithIndex, then label rows in batches of size n
rdd = rdd.zipWithIndex().map(lambda row_index: row_index[0] + [row_index[1] // n + 1])
# convert to dataframe
df = rdd.toDF(schema=['id', 'run_id', 'key'])
df.show(4)
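A possible DataFrame-API variant (an assumption on my part, not from the answer above; it assumes a SparkSession named spark and that ordering by id gives the intended row order) numbers the rows with a window and integer-divides them into batches of n. Note that a window with no partitionBy pulls all rows into one partition:
# Sketch: row_number over an ordering window, then bucket into groups of n.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame([(4, 12345), (6, 12567), (10, 12890), (13, 12450)],
                           ['id', 'run_id'])
n = 2
w = Window.orderBy('id')                     # no partitionBy: single partition for the window
df = df.withColumn('key', ((F.row_number().over(w) - 1) / n).cast('int') + 1)
df.show()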