Join Keyed tables in a list - kdb

I've got a list of keyed tables:
ktbls:(([filter: `a`b] user1: 3 4f);([filter: `a`c] user2: 3 4f);([filter: `$()] user3: "f"$()))
I want to join the tables by column, so I want to run ktbls[0],'ktbls[1],'ktbls[2], which results in the keyed table below:
filter| user1 user2 user3
------| -----------------
a     | 3     3     0n
b     | 4     0n    0n
c     | 0n    4     0n
Now, since the length of the keyed table list can vary, I need to somehow functionalise ktbls[0],'ktbls[1],'ktbls[2],'... for any number of tables.
However, I can't seem to figure it out.

Using your syntax:
q){x,'y}/[ktbls] / alternate forms ,'/[ktbls] or (,')/[ktbls]
filter| user1 user2 user3
------| -----------------
a     | 3     3
b     | 4
c     |       4
But perhaps union join (uj) could work too?
q)(uj/)ktbls
filter| user1 user2 user3
------| -----------------
a     | 3     3
b     | 4
c     |       4
Alternative syntax uj/[ktbls].
See the documentation on this use case of over /.
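As a quick sanity check, the reduction reproduces the hand-written chain for the three-table example (a sketch; r is just an illustrative name, and 1b assumes the ktbls defined in the question):
q)r:(,')/[ktbls]                      / over reduces ,' across a list of any length
q)r ~ ktbls[0],'ktbls[1],'ktbls[2]    / should match the manual chain
1b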

Related

Split key-value pair data into new columns

My table looks something like this:
id | data
---|--------------
1  | A=1000 B=2000
2  | A=200 C=300
In kdb, is there a way to normalize the data such that the final table is as follows:
id | data.1 | data.2
---|--------|-------
1  | A      | 1000
1  | B      | 2000
2  | A      | 200
2  | C      | 300
One option would be to make use of 0: and its key-value parsing functionality, documented here: https://code.kx.com/q/ref/file-text/#key-value-pairs e.g.
q)ungroup delete data from {x,`data1`data2!"S= "0:x`data}'[t]
id data1 data2
---------------
1  A     "1000"
1  B     "2000"
2  A     "200"
2  C     "300"
Assuming you want data2 to be a long datatype (j), you can do:
update "J"$data2 from ungroup delete data from {x,`data1`data2!"S= "0:x`data}'[t]
You could use a combination of vs (vector from scalar), each-both ' and ungroup:
q)t:([]id:1 2;data:("A=1000 B=2000";"A=200 C=300"))
q)t
id data
------------------
1  "A=1000 B=2000"
2  "A=200 C=300"
q)select id, dataA:`$data[;0], dataB:"J"$data[;1] from
ungroup update data: "=" vs '' " " vs ' data from t
id dataA dataB
--------------
1  A     1000
1  B     2000
2  A     200
2  C     300
I wouldn't recommend naming the columns with a dot, e.g. data.1.
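For reference, the intermediate split steps in that update look like this (a sketch reusing the t above; s1 and s2 are just illustrative names):
q)s1:" " vs' t`data    / per row: ("A=1000";"B=2000") and ("A=200";"C=300")
q)s2:"=" vs'' s1       / per row: (("A";"1000");("B";"2000")) and (("A";"200");("C";"300"))
ungroup then flattens the nested pairs into one row per key=value pair.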

PostgreSQL - Setting null values to missing rows in a join statement

SQL newbie here. I'm trying to write a query (on PostgreSQL) that generates a scoring table, returning null for a student's grade in any module whose exam they haven't yet taken.
So I start with tables that look something like this:
student_evaluation:
| student_id | module_id | course_id | grade |
|------------|-----------|-----------|-------|
| 1          | 1         | 1         | 3     |
| 1          | 1         | 1         | 7     |
| 1          | 2         | 1         | 8     |
| 2          | 4         | 2         | 9     |
course_module:
| module_id | course_id |
|-----------|-----------|
| 1         | 1         |
| 2         | 1         |
| 3         | 1         |
| 4         | 2         |
In our use case, a course is made up of several modules. Each module has a single exam, but a student who failed their exam may have a couple of retries. The same module may also be present in different courses, but an exam attempt only counts for one instance of the module (i.e. student A passed module 1's exam in course 1; if course 2 also has module 1, student A has to retake the same exam for course 2 if he also has access to that course).
So the output should look like this:
| student_id | module_id | course_id | grade |
|------------|-----------|-----------|-------|
| 1          | 1         | 1         | 3     |
| 1          | 1         | 1         | 7     |
| 1          | 2         | 1         | 8     |
| 1          | 3         | 1         | null  |
| 2          | 4         | 2         | 9     |
I feel like this should have been a simple task, but I think I have a very flawed understanding of how outer and cross joins work. I have tried stuff like:
SELECT se.student_id, se.module_id, se.course_id, se.grade FROM student_evaluation se
RIGHT OUTER JOIN course_module ON course_module.course_id = se.course_id
AND course_module.module_id = se.module_id
or
SELECT se.student_id, se.module_id, se.course_id, se.grade FROM student_evaluation se
CROSS JOIN course_module WHERE course_module.course_id = se.course_id
Neither worked. These all feel wrong, but I'm lost as to what would be the proper way to go about this.
Thank you in advance.
I think you need both join types: first use a cross join to build a list of all combinations of students and courses, then use an outer join to add the grades.
SELECT sc.student_id,
       sc.module_id,
       sc.course_id,
       se.grade
FROM student_evaluation se
RIGHT JOIN (SELECT s.student_id,
                   c.module_id,
                   c.course_id
            FROM (SELECT DISTINCT student_id
                  FROM student_evaluation) AS s
            CROSS JOIN course_module AS c) AS sc
  USING (student_id, module_id, course_id);
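The same query can be written with a LEFT JOIN from the cross-joined grid to the grades, which some find easier to read; it's just the reversed join direction, not a different method:
SELECT sc.student_id,
       sc.module_id,
       sc.course_id,
       se.grade
FROM (SELECT s.student_id,
             c.module_id,
             c.course_id
      FROM (SELECT DISTINCT student_id
            FROM student_evaluation) AS s
      CROSS JOIN course_module AS c) AS sc
LEFT JOIN student_evaluation AS se
       ON se.student_id = sc.student_id
      AND se.module_id = sc.module_id
      AND se.course_id = sc.course_id;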

How would you create a group identifier based on one column, but sorted by another?

I am attempting to create column Group via T-SQL.
If a cluster of the same account appears in consecutive rows, consider that one group. If the account is seen again lower in the list (clustered or not), consider it a new group. This seems straightforward, but I cannot seem to see the solution... Below there are three clusters of account 3456, each with a different group number (Groups 1, 4, and 6).
+-------+---------+------+
| Group | Account | Sort |
+-------+---------+------+
| 1 | 3456 | 1 |
| 1 | 3456 | 2 |
| 2 | 9878 | 3 |
| 3 | 5679 | 4 |
| 4 | 3456 | 5 |
| 4 | 3456 | 6 |
| 4 | 3456 | 7 |
| 5 | 1295 | 8 |
| 6 | 3456 | 9 |
+-------+---------+------+
UPDATE: I left this out of the original requirements, but a cluster of accounts could have more than two accounts. I updated the example data to include this scenario.
Here's how I'd do it:
--Sample Data
DECLARE @table TABLE (Account INT, Sort INT);
INSERT @table
VALUES (3456,1),(3456,2),(9878,3),(5679,4),(3456,5),(3456,6),(1295,7),(3456,8);
--Solution
SELECT [Group] = DENSE_RANK() OVER (ORDER BY grouper.groupID), grouper.Account, grouper.Sort
FROM
(
SELECT t.*, groupID = ROW_NUMBER() OVER (ORDER BY t.sort) +
CASE t.Account WHEN LEAD(t.Account,1) OVER (ORDER BY t.sort) THEN 1 ELSE 0 END
FROM @table AS t
) AS grouper;
Results:
Group Account Sort
------- ----------- -----------
1 3456 1
1 3456 2
2 9878 3
3 5679 4
4 3456 5
4 3456 6
5 1295 7
6 3456 8
Update based on OP's comment below (2019-05-08)
I spent a couple of days banging my head on how to handle groups of three or more; it was surprisingly difficult, but what I came up with handles bigger clusters and is way better than my first answer. I updated the sample data to include bigger clusters.
Note that I include a UNIQUE constraint on the sort column, which creates a unique index. You don't need the constraint for this solution to work, but having an index on that column (clustered, nonclustered unique, or just nonclustered) will improve the performance dramatically.
--Sample Data
DECLARE @table TABLE (Account INT, Sort INT UNIQUE);
INSERT @table
VALUES (3456,1),(3456,2),(9878,3),(5679,4),(3456,5),(3456,6),(1295,7),(1295,8),(1295,9),(1295,10),(3456,11);
-- Better solution
WITH Groups AS
(
SELECT t.*, Grouper =
CASE t.Account WHEN LAG(t.Account,1,t.Account) OVER (ORDER BY t.Sort) THEN 0 ELSE 1 END
FROM @table AS t
)
SELECT [Group] = SUM(sg.Grouper) OVER (ORDER BY sg.Sort)+1, sg.Account, sg.Sort
FROM Groups AS sg;
Results:
Group Account Sort
----------- ----------- -----------
1 3456 1
1 3456 2
2 9878 3
3 5679 4
4 3456 5
4 3456 6
5 1295 7
5 1295 8
5 1295 9
5 1295 10
6 3456 11
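For intuition, you can run the CTE on its own to see the start-of-group flags before the running sum (same sample data; purely illustrative):
WITH Groups AS
(
SELECT t.*, Grouper =
CASE t.Account WHEN LAG(t.Account,1,t.Account) OVER (ORDER BY t.Sort) THEN 0 ELSE 1 END
FROM @table AS t
)
SELECT * FROM Groups;
-- Grouper is 1 exactly where Account differs from the previous row (ordered by Sort),
-- so SUM(Grouper) OVER (ORDER BY Sort) numbers the islands.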

How to preserve order of a DataFrame when writing it as CSV with partitioning by columns?

I sort the rows of a DataFrame and write it out to disk like so:
df.
orderBy("foo").
write.
partitionBy("bar", "moo").
option("compression", "gzip").
csv(outDir)
When I look into the generated .csv.gz files, their order is not preserved. Is this the way Spark does this? Is there a way to preserve order when writing a DF to disk with a partitioning?
Edit: To be more precise: it's not the order of the CSV files that is off, but the order inside them. Let's say I have the following after df.orderBy (for simplicity, I now only partition by one column):
foo | bar | baz
===============
1   | 1   | 1
1   | 2   | 2
1   | 1   | 3
2   | 3   | 4
2   | 1   | 5
3   | 2   | 6
3   | 3   | 7
3   | 1   | 8
4   | 2   | 9
4   | 1   | 10
I expect it to be like this, e.g. for files in folder bar=1:
part-00000-NNN.csv.gz:
1,1
1,3
2,5
part-00001-NNN.csv.gz:
3,8
4,10
But what I actually get is:
part-00000-NNN.csv.gz:
1,1
2,5
1,3
part-00001-NNN.csv.gz:
4,10
3,8
It's been a while, but I ran into this again and finally came across a workaround.
Suppose your schema is like:
time: bigint
channel: string
value: double
If you do:
df.sortBy("time").write.partitionBy("channel").csv("hdfs:///foo")
the timestamps in the individual part-* files get tossed around.
If you do:
df.sortBy("channel", "time").write.partitionBy("channel").csv("hdfs:///foo")
the order is correct.
I think it has to do with shuffling. So, as a workaround, I now sort first by the columns I want the data partitioned by, and then by the column I want the rows sorted on within the individual files.
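Applied to the original example, the workaround would look something like this (a sketch assuming the same df, outDir and partition columns as in the question):
df.
  orderBy("bar", "moo", "foo"). // partition columns first, then the in-file sort column
  write.
  partitionBy("bar", "moo").
  option("compression", "gzip").
  csv(outDir)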

Tableau calculated field summing up the values

I have a table like this:
ID | Name | Value
-----------------
1  | Bob  | 4
2  | Mary | 3
3  | Bob  | 5
4  | Jane | 3
5  | Jane | 1
Is there any way to create a calculated field where, if the name is "Bob", it'll sum up all the values that have the name "Bob"?
Thanks in advance!
IF [Name] = "Bob" THEN [Value] END
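Tableau will then aggregate that field when you drop it on a view; if you want the summing built into the calculation itself, a variant like this should work:
SUM(IF [Name] = "Bob" THEN [Value] END)
Rows where Name isn't "Bob" return NULL and are ignored by SUM.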