How to do a between join clause in KDB? - kdb

Suppose I have a table A with the columns bucket_start_date, bucket_end_date,
A
bucket_start_date | bucket_end_date
2015.05.02 | 2015.05.08
2015.05.08 | 2015.05.12
Also suppose I have a table B with the columns date, coins.
B
date | coins
2015.05.02 | 5
2015.05.06 | 11
2015.05.09 | 32
How do I do a join in kdb that logically looks like
select A.bucket_start_date, A.bucket_end_date, sum(coins) from A join B where B.date BETWEEN A.bucket_start_date and A.bucket_end_date group by A.bucket_start_date, A.bucket_end_date
So I want the result to look like
bucket_start_date | bucket_end_date | sum(coins)
2015.05.02 | 2015.05.08 | 16
2015.05.08 | 2015.05.12 | 32

A window join is a natural way of achieving this result. Below is a wj1 call that will get what you are after:
q)wj1[A`bucket_start_date`bucket_end_date;`date;A;(B;(sum;`coins))]
bucket_start_date bucket_end_date coins
---------------------------------------
2015.05.02 2015.05.08 16
2015.05.08 2015.05.12 32
The first argument is a pair of lists of dates, the first list holding the window start dates and the second the window end dates.
The second argument is the common column, in this case date, since you are looking for the window each date falls into.
The third and fourth arguments are the two simple tables to join, and finally (sum;`coins) is a list of the function to apply and the column to apply it to. In this case you are summing the coins column within each window.
A wj considers prevailing values on entry to each interval, whilst wj1 considers only values occurring in each interval. You can change wj1 to wj in the call to see the difference.
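The interval-join logic itself is simple; here is a minimal Python sketch of what the wj1 call computes (illustrative only, using the example data above — not the kdb implementation):

```python
from datetime import date

# Bucket windows from table A: (start, end) pairs
buckets = [(date(2015, 5, 2), date(2015, 5, 8)),
           (date(2015, 5, 8), date(2015, 5, 12))]

# Rows from table B: (date, coins)
rows = [(date(2015, 5, 2), 5),
        (date(2015, 5, 6), 11),
        (date(2015, 5, 9), 32)]

def bucket_sums(buckets, rows):
    """Sum coins whose date falls within each [start, end] window,
    mirroring the inclusive-bounds semantics of q's within."""
    return [(s, e, sum(c for d, c in rows if s <= d <= e))
            for s, e in buckets]

print(bucket_sums(buckets, rows))
```

This reproduces the 16 and 32 totals from the expected output.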

Firstly, it is good practice not to use _ in names, as _ is also the drop operator in q.
q)data:([]bucketSt:2015.05.02 2015.05.08;bucketEnd:2015.05.08 2015.05.12)
q)daterange:([]date:2015.05.02 2015.05.06 2015.05.09; coins: 5 11 32)
But the solution to the question without window join can be a fairly straightforward select statement.
update coins:({exec sum coins from daterange where date within x} each get each data) from data
Starting from the inside of the brackets:
q)get each data
2015.05.02 2015.05.08
2015.05.08 2015.05.12
returns the start and end dates for each row.
A simple exec statement with aggregation then gets the necessary result from the daterange table for each pair, and an update statement writes the new values back onto the original table, returning:
bucketSt bucketEnd coins
---------------------------
2015.05.02 2015.05.08 16
2015.05.08 2015.05.12 32
A window join is also possible and would be more efficient, but this version should be easy to follow. Hope it helps!

Related

Postgres aggregate function for selecting best / first element in a group

I have a table of elements with a number of items that need de-duping based on a priority. The following is a grossly simplified but representative example:
sophia=> select * from numbers order by value, priority;
value | priority | label
-------+----------+-------
1 | 1 | One
1 | 2 | Eins
2 | 1 | Two
2 | 2 | Zwei
3 | 2 | Drei
4 | 1 | Four
4 | 2 | Vier
(7 rows)
I want to restrict this to returning only a single row per number. Easy enough, I can use the first() aggregate function detailed in https://wiki.postgresql.org/wiki/First/last_(aggregate)
sophia=> select value, first(label) from numbers group by value order by value;
value | first
-------+-------
1 | One
2 | Two
3 | Drei
4 | Four
(4 rows)
sophia=>
The problem with this is that the order isn't well defined so if the DB rows were inserted in a different order, I might get this:
sophia=> select value, first(label) from numbers group by value order by value;
value | first
-------+-------
1 | Eins
2 | Zwei
3 | Drei
4 | Vier
(4 rows)
Of course, the solution to that also seems simple, in that I could just do an order by:
sophia=> select value, first(label) from (select * from numbers order by priority) foo group by value order by value;
value | first
-------+-------
1 | One
2 | Two
3 | Drei
4 | Four
(4 rows)
sophia=>
However, the problem here is that the query optimizer is free to discard the order by rules in subqueries meaning that this doesn't always work and breaks in random nasty places.
I have a solution that I'm currently using in a handful of places that relies on array_agg.
sophia=> select value, (array_agg(label order by priority))[1] as best_label from numbers group by value;
value | best_label
-------+------------
1 | One
2 | Two
3 | Drei
4 | Four
(4 rows)
sophia=>
This provides robust ordering but involves creating a bunch of extra arrays during query time that just get thrown away and hence the performance on larger datasets rather sucks.
So the question is, is there a better, cleaner, faster way of dealing with this?
Your last attempt includes the answer to your question, you just didn't realise it:
array_agg(label order by priority)
Note the order by clause inside the aggregate function. This isn't special to array_agg, but is a general part of the syntax for using aggregate functions:
Ordinarily, the input rows are fed to the aggregate function in an unspecified order. In many cases this does not matter; for example, min produces the same result no matter what order it receives the inputs in. However, some aggregate functions (such as array_agg and string_agg) produce results that depend on the ordering of the input rows. When using such an aggregate, the optional order_by_clause can be used to specify the desired ordering. The order_by_clause has the same syntax as for a query-level ORDER BY clause, as described in Section 7.5, except that its expressions are always just expressions and cannot be output-column names or numbers.
Thus the solution to your problem is simply to put an order by inside your first aggregate expression:
select value, first(label order by priority) from numbers group by value order by value;
Given how elegant this is, I'm surprised that first and last are still not implemented as built-in aggregates.
The Postgres select statement has a clause called DISTINCT ON which is extremely useful in the case when you would like to return one of a group. In this case, you would use:
SELECT DISTINCT ON (value) value, label
FROM numbers
ORDER BY value, priority;
Using DISTINCT ON is generally faster than other methods involving groups or window functions.
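The DISTINCT ON semantics — one row per value, taking the first by priority — can be mimicked in plain Python for illustration (toy data from the question, not the Postgres implementation):

```python
# (value, priority, label) rows from the question
rows = [(1, 1, "One"), (1, 2, "Eins"), (2, 1, "Two"), (2, 2, "Zwei"),
        (3, 2, "Drei"), (4, 1, "Four"), (4, 2, "Vier")]

def distinct_on(rows):
    """Keep the first row per value after sorting by (value, priority),
    mirroring SELECT DISTINCT ON (value) ... ORDER BY value, priority."""
    best = {}
    for value, priority, label in sorted(rows):
        best.setdefault(value, label)  # first (lowest priority) wins
    return best

print(distinct_on(rows))  # {1: 'One', 2: 'Two', 3: 'Drei', 4: 'Four'}
```

Unlike the first() workaround, the ordering here is explicit, which is exactly what DISTINCT ON guarantees via its ORDER BY.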

Postgis DB Structure: Identical tables for multiple years?

I have multiple datasets for different years as a shapefile and converted them to postgres tables, which confronts me with the following situation:
I got tables boris2018, boris2017, boris2016 and so on.
They all share an identical schema, for now let's focus on the following columns (example is one row out of the boris2018 table). The rows represent actual postgis geometry with certain properties.
brw | brwznr | gema | entw | nuta
-----+--------+---------+------+------
290 | 285034 | Sieglar | B | W
The 'brwznr' column is an ID of some kind, but it does not seem to be entirely consistent across all years for each geometry.
Then again, most of the tables contain duplicate information. The geometry should be identical in every year, although this is not guaranteed either.
What I first did was to match the brwznr of each year with the 2018 data, adding a brw17, brw2016, ... column to my boris2018 data, like so:
brw18 | brw17 | brw16 | brwznr | gema | entw | nuta
-------+-------+-------+--------+---------+------+------
290 | 260 | 250 | 285034 | Sieglar | B | W
This led to some data getting lost (because there was no matching brwznr found), some data wrongly matched (because some matching was wrong due to inconsistencies in the data) and it didn't feel right.
What I actually want to achieve is fast queries that get me the different brw values for a certain coordinate, something along the lines of
SELECT ortst, brw, gema, gena
FROM boris2018, boris2017, boris2016
WHERE ST_Intersects(geom,ST_SetSRID(ST_Point(7.130577, 50.80292),4326));
or
SELECT ortst, brw18, brw17, brw16, gema, gena
FROM boris
WHERE ST_Intersects(geom,ST_SetSRID(ST_Point(7.130577, 50.80292),4326));
although this is obviously wrong / has its deficits.
Since I am new to databases in general, I can't really tell whether this is a querying problem or a database structure problem.
I hope anyone can help, your time and effort is highly appreciated!
Tim
Have you tried using a CTE?
WITH j AS (
SELECT ortst, brw, gema, gena, geom FROM boris2016
UNION
SELECT ortst, brw, gema, gena, geom FROM boris2017
UNION
SELECT ortst, brw, gema, gena, geom FROM boris2018)
SELECT * FROM j
WHERE ST_Intersects(j.geom,ST_SetSRID(ST_Point(7.130577, 50.80292),4326));
Depending on your needs, you might want to use UNION ALL. Note that this approach might not be the fastest when dealing with very large tables. If that is the case, consider merging the results of these three queries into another table and creating an index on the geom field. Let me know in the comments if that is the case.
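The difference between UNION and UNION ALL is just deduplication; a quick Python illustration with made-up rows:

```python
# Two toy result sets sharing one identical row
a = [("x", 1), ("y", 2)]
b = [("y", 2), ("z", 3)]

union_all = a + b                   # UNION ALL: keeps duplicates
union = list(dict.fromkeys(a + b))  # UNION: deduplicates, keeps first occurrence

print(len(union_all), len(union))   # 4 3
```

If the yearly tables cannot contain identical rows, UNION ALL skips the deduplication work and is cheaper.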

Using filtered results as field for calculated field in Tableau

I have a table that looks like this:
+------------+-----------+---------------+
| Invoice_ID | Charge_ID | Charge_Amount |
+------------+-----------+---------------+
| 1 | A | $10 |
| 1 | B | $20 |
| 2 | A | $10 |
| 2 | B | $20 |
| 2 | C | $30 |
| 3 | C | $30 |
| 3 | D | $40 |
+------------+-----------+---------------+
In Tableau, how can I have a field that SUMs the Charge_Amount for the Charge_IDs B, C and D, where the invoice has a Charge_ID of A? The result would be $70.
My datasource is SQL Server, so I was thinking that I could add a field (called Has_ChargeID_A) to the SQL Server Table that tells if the invoice has a Charge_ID of A, and then in Tableau just do a SUM of all the rows where Has_ChargeID_A is true and Charge_ID is either B, C or D. But I would prefer if I can do this directly in Tableau (not this exactly, but anything that will get me to the same result).
Your intuition is steering you in the right direction. You do want to filter to only Invoices that contain row with a Charge_ID of A, and you can do this directly in Tableau.
First place Invoice_ID on the filter shelf, then select the Condition tab for the filter. Then select the "By formula" option on the condition tab and enter the formula you wish to use to determine which invoice_ids are included by the filter.
Here is a formula for your example:
count(if Charge_ID = 'A' then 'Y' end) > 0
For each data row, it will calculate the value of the expression inside the parentheses, and then only include invoice_ids with at least one non-null value for the internal expression. (The implicit else of the if statement returns null.)
The condition tab for a dimension field equates to a HAVING clause in SQL.
If condition formulas get complex, it's often a good idea to define them with a calculated field -- or a combination of several simpler calculated fields, just to keep things manageable.
Finally, if you end up working with sets of dimensions like this frequently, you can define them as sets. You can still drop sets on the filter shelf, but then can reuse them in other ways: like testing set membership in a calculated field (like a SQL IN clause), or by creating new sets using intersection and union operators. You can think of sets like named filters, such as the set of invoices that contain type A charge.
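The effect of that condition filter can be sketched in Python (toy data from the question; this is not how Tableau evaluates it internally): first find invoices containing a Charge_ID of A, then sum the B/C/D charges on those invoices:

```python
# (Invoice_ID, Charge_ID, Charge_Amount) rows from the question
rows = [(1, "A", 10), (1, "B", 20), (2, "A", 10), (2, "B", 20),
        (2, "C", 30), (3, "C", 30), (3, "D", 40)]

# Invoices passing the condition: at least one Charge_ID = 'A'
has_a = {inv for inv, charge, _ in rows if charge == "A"}

# Sum of B/C/D charges on those invoices
total = sum(amt for inv, charge, amt in rows
            if inv in has_a and charge in ("B", "C", "D"))

print(total)  # 70
```

Invoices 1 and 2 contain charge A; their B/C/D charges sum to 20 + 20 + 30 = 70, matching the expected result.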

PostgreSQL amount for each day summed up in weeks

I've been trying to find a solution to this challenge all day.
I've got a table:
id | amount | type | date | description | club_id
-----+---------+------+----------------------------+---------------------------------------+---------
783 | 10000 | 5 | 2011-08-23 12:52:19.995249 | Sign on fee | 7
The table has a lot more data than this.
What I'm trying to do is get the sum of amount for each week, given a specific club_id.
The last thing I ended up with was this, but it doesn't work:
WITH RECURSIVE t AS (
SELECT EXTRACT(WEEK FROM date) AS week, amount FROM club_expenses WHERE club_id = 20 AND EXTRACT(WEEK FROM date) < 10 ORDER BY week
UNION ALL
SELECT week+1, amount FROM t WHERE week < 3
)
SELECT week, amount FROM t;
I'm not sure why it doesn't work, but it complains about the UNION ALL.
I'll be off to bed in a minute, so I won't be able to see any answers before tomorrow (sorry).
I hope I've described it adequately.
Thanks in advance!
It looks to me like you are trying to use the UNION ALL to retrieve a subset of the first part of the query. That won't work. You have two options. The first is to use user-defined functions to add behavior as you need it, and the second is to nest your WITH clauses. I tend to prefer the former, but you may prefer the latter.
To do the functions/table methods approach you create a function which accepts as input a row from a table and does not hit the table directly. This provides a bunch of benefits including the ability to easily index the output. Here the function would look like:
CREATE FUNCTION week(club_expenses) RETURNS int LANGUAGE SQL IMMUTABLE AS $$
select EXTRACT(WEEK FROM $1.date)::int
$$;
Now you have a usable macro which can be used where you would use a column. You can then:
SELECT c.week, sum(amount) FROM club_expenses c
GROUP BY c.week;
Note that the c. before week is not optional: the parser converts c.week into week(c). If you want to limit this to a specific year, you can do the same with years.
This is a really neat, useful feature of Postgres.
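The grouping itself is straightforward; a Python sketch of what the query computes, on made-up sample rows (the real table has far more data):

```python
from datetime import date

# (amount, date, club_id) rows -- made-up sample data
expenses = [(10000, date(2011, 8, 23), 7),
            (500,   date(2011, 8, 24), 7),
            (200,   date(2011, 8, 30), 7),
            (999,   date(2011, 8, 23), 9)]

def weekly_sums(expenses, club_id):
    """Sum amounts per ISO week for one club, mirroring
    GROUP BY EXTRACT(WEEK FROM date) with a club_id filter."""
    totals = {}
    for amount, d, club in expenses:
        if club == club_id:
            week = d.isocalendar()[1]  # ISO week number, like EXTRACT(WEEK ...)
            totals[week] = totals.get(week, 0) + amount
    return totals

print(weekly_sums(expenses, 7))
```

No recursion is needed for this: each row contributes to exactly one week bucket.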

Postgresql finding and counting unique tuples

I have a problem where I must find unique tuples in a table that have never appeared in another table. I then must count those tuples and display the ones that occur more than 10 times. That is, some jobs don't require job staff; I am to find the jobs that have never required a job staff and have run in more than 10 terms.
create table JobStaff (
Job integer references Job(id),
staff integer references Staff(id),
role integer references JobRole(id),
primary key (job,staff,role)
);
create table Job (
id integer,
branch integer not null references Branches(id),
term integer not null references Terms(id),
primary key (id)
);
Essentially my code exists as:
CREATE OR REPLACE VIEW example AS
SELECT * FROM Job
WHERE id NOT IN (SELECT DISTINCT job FROM JobStaff);
create or replace view exampleTwo as
select branch, count(*) as ct from example group by 1;
create or replace view sixThree as
select branch, ct from exampleTwo where ct > 30;
This seems to return two extra rows then the expected result. I asked my lecturer and he said that it's because I'm counting courses that SOMETIMES
EDIT: this means, that for all the terms the job was available, there were no job staff assigned to it
EDIT2: expected output and what I get:
what i got:
branch | cou
---------+-----
7973 | 34
7978 | 31
8386 | 33
8387 | 32
8446 | 32
8447 | 32
8448 | 31
61397 | 31
62438 | 32
63689 | 31
expected:
branch | cou
---------+-----
7973 | 34
8387 | 32
8446 | 32
8447 | 32
8448 | 31
61397 | 31
62438 | 32
63689 | 31
You have to understand that with SQL you must know exactly what you want before you can design the query.
In your question you write that you are looking for jobs that are not in JobStaff, but now that we have the answer here, it's clear that you were looking for branches that are not in JobStaff. My advice to you: take your time to word exactly what you want (in your own language, or in English, but not in SQL) BEFORE trying to implement it. Once you are more experienced this isn't always necessary, but for a beginner it's the best way to learn SQL.
The Solution:
The thing here is that you don't need views to cascade queries; you can just select from inner queries. Also, filtering on a computed value (like a count) can be done with the HAVING clause.
The count(DISTINCT ...) causes duplicate entries to be only counted once, so if a Job is in a term twice, it now gets counted only once.
The query below selects all branches that ever got a jobstaff and then looks for jobs that are not in this list.
As far as I understood your question this should help you:
SELECT branch, count(DISTINCT term) AS cou
FROM Job j
WHERE j.branch NOT IN (SELECT branch FROM Job WHERE Job.id IN (SELECT job FROM JobStaff))
GROUP BY branch
HAVING count(DISTINCT term) > 10;
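The shape of that query — an anti-join on branches, then counting distinct terms with a HAVING cutoff — can be sketched in Python on toy data (hypothetical ids and branch names, threshold lowered to 1 so the small sample produces a result):

```python
# Toy data: jobs as (id, branch, term); staffed_jobs plays the role of JobStaff
jobs = [(1, "b1", 1), (2, "b1", 2), (3, "b2", 1), (4, "b2", 1)]
staffed_jobs = {3}

# Branches that ever had a staffed job (the NOT IN subquery)
staffed_branches = {branch for jid, branch, term in jobs if jid in staffed_jobs}

# Distinct terms per remaining branch (anti-join + GROUP BY)
terms = {}
for jid, branch, term in jobs:
    if branch not in staffed_branches:
        terms.setdefault(branch, set()).add(term)

# HAVING count(DISTINCT term) > threshold (10 in the question; 1 here)
result = {branch: len(ts) for branch, ts in terms.items() if len(ts) > 1}
print(result)  # {'b1': 2}
```

Note the distinct-term sets: a branch running the same term twice is counted once, which is what count(DISTINCT term) does and why the two extra rows disappear.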