Pivot table with columns as year/date in KDB+ - kdb

I am trying to create a pivot table with years as columns from a simple table:
q)growth:([] stock:asc 9#`goog`apple`nokia; year: 9#2015 2016 2017; returns:9?20 )
q)growth
stock year returns
------------------
apple 2015 9
apple 2016 18
apple 2017 17
goog 2015 8
goog 2016 13
goog 2017 17
nokia 2015 12
nokia 2016 12
nokia 2017 2
But I cannot get the correct structure: it still returns a dictionary per row rather than multiple year columns.
q)exec (distinct growth`year)#year!returns by stock:stock from growth
stock|
-----| ----------------------
apple| 2015 2016 2017!9 18 17
goog | 2015 2016 2017!8 13 17
nokia| 2015 2016 2017!12 12 2
Am I doing anything wrong?

You need to convert the years to symbols in order to use them as column headers. In this case I have updated the growth table first then performed the pivot:
q)exec distinct[year]#year!returns by stock:stock from update `$string year from growth
stock| 2015 2016 2017
-----| --------------
apple| 12 8 10
goog | 1 9 11
nokia| 5 6 1
Additionally you may see that I have changed to distinct[year] from (distinct growth`year) as this yields the same result with year being pulled from the updated table.
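
For readers who do not use q, the reshaping itself can be sketched in plain Python. This is only an illustration of the idea (group rows by stock, then map each year to its returns), not how kdb+ implements it. The rows are copied from the question's table, and the helper name `pivot` is made up for this example.

```python
from collections import defaultdict

# Rows copied from the question's growth table: (stock, year, returns).
growth = [
    ("apple", 2015, 9), ("apple", 2016, 18), ("apple", 2017, 17),
    ("goog", 2015, 8), ("goog", 2016, 13), ("goog", 2017, 17),
    ("nokia", 2015, 12), ("nokia", 2016, 12), ("nokia", 2017, 2),
]

def pivot(rows):
    """Hypothetical helper: return {stock: {column_name: returns}}."""
    table = defaultdict(dict)
    for stock, year, returns in rows:
        # str(year) plays the role of `$string year in q:
        # the year becomes a column *name* rather than a number.
        table[stock][str(year)] = returns
    return dict(table)

wide = pivot(growth)
for stock, columns in wide.items():
    print(stock, columns)
```

Each inner dictionary corresponds to one row of the pivoted table, keyed by the year converted to a name.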

Column names of a table in kdb+ must be symbols, not any other data type.
In your pivot, the 'year' column has an integer type (int/long), which is why a proper table is not produced.
If you cast it to symbol, it works:
q)growth:([] stock:asc 9#`goog`apple`nokia; year: 9#2015 2016 2017; returns:9?20 )
q)growth:update `$string year from growth
q)exec (distinct growth`year)#year!returns by stock:stock from growth
stock| 2015 2016 2017
-----| --------------
apple| 9 18 17
goog | 8 13 17
nokia| 12 12 2
Alternatively, you can pivot on 'stock' rather than 'year', which works on the original table unchanged because the stock values are already symbols:
q)growth:([] stock:asc 9#`goog`apple`nokia; year: 9#2015 2016 2017; returns:9?20 )
q)show exec (distinct growth`stock)#stock!returns by year:year from growth
year| apple goog nokia
----| ----------------
2015| 4 2 4
2016| 5 13 12
2017| 12 6 1

Related

Is there a way to do a selective sum using a time interval in Postgres?

I have two tables. The first has columns id, start_date, and end_date; the second has columns id, timestamp, and value. Is there a way to sum value in table 2 over the intervals defined in table 1?
Table 1:
 id | start_date          | end_date
----+---------------------+---------------------
  5 | 2000-01-01 01:00:00 | 2000-01-05 02:45:00
  5 | 2000-01-10 01:00:00 | 2000-01-15 02:45:00
  6 | 2000-01-01 01:00:00 | 2000-01-05 02:45:00
  6 | 2000-01-11 01:00:00 | 2000-01-12 02:45:00
  6 | 2000-01-15 01:00:00 | 2000-01-20 02:45:00
Table 2:
 id | timestamp           | value
----+---------------------+-------
  5 | 2000-01-01 05:00:00 |     1
  5 | 2000-01-01 06:00:00 |     2
  6 | 2000-01-01 05:00:00 |     1
  6 | 2000-01-11 05:00:00 |     2
  6 | 2000-01-15 05:00:00 |     2
  6 | 2000-01-15 05:30:00 |     2
Desired result:
 id | start_date          | end_date            | Sum
----+---------------------+---------------------+------
  5 | 2000-01-01 01:00:00 | 2000-01-05 02:45:00 | 3
  5 | 2000-01-10 01:00:00 | 2000-01-15 02:45:00 | null
  6 | 2000-01-01 01:00:00 | 2000-01-05 02:45:00 | 1
  6 | 2000-01-11 01:00:00 | 2000-01-12 02:45:00 | 2
  6 | 2000-01-15 01:00:00 | 2000-01-20 02:45:00 | 4
Try this:
SELECT a.id, a.start_date, a.end_date, sum(b.value) AS sum
FROM table1 AS a
LEFT JOIN table2 AS b
ON b.id = a.id
AND b.timestamp >= a.start_date
AND b.timestamp < a.end_date
GROUP BY a.id, a.start_date, a.end_date
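
One way to check the query against the sample data is to run it in an in-memory SQLite database from Python. This is a sketch under assumptions: SQLite stands in for Postgres, the timestamps are stored as text (the fixed-width format makes string comparison order them correctly), and an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER, start_date TEXT, end_date TEXT);
CREATE TABLE table2 (id INTEGER, timestamp TEXT, value INTEGER);
""")

# Sample rows copied from the question.
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)", [
    (5, "2000-01-01 01:00:00", "2000-01-05 02:45:00"),
    (5, "2000-01-10 01:00:00", "2000-01-15 02:45:00"),
    (6, "2000-01-01 01:00:00", "2000-01-05 02:45:00"),
    (6, "2000-01-11 01:00:00", "2000-01-12 02:45:00"),
    (6, "2000-01-15 01:00:00", "2000-01-20 02:45:00"),
])
conn.executemany("INSERT INTO table2 VALUES (?, ?, ?)", [
    (5, "2000-01-01 05:00:00", 1),
    (5, "2000-01-01 06:00:00", 2),
    (6, "2000-01-01 05:00:00", 1),
    (6, "2000-01-11 05:00:00", 2),
    (6, "2000-01-15 05:00:00", 2),
    (6, "2000-01-15 05:30:00", 2),
])

# The answer's query; ORDER BY is an addition for stable output.
query = """
SELECT a.id, a.start_date, a.end_date, sum(b.value) AS sum
FROM table1 AS a
LEFT JOIN table2 AS b
  ON b.id = a.id
 AND b.timestamp >= a.start_date
 AND b.timestamp < a.end_date
GROUP BY a.id, a.start_date, a.end_date
ORDER BY a.id, a.start_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The LEFT JOIN keeps intervals with no matching rows, so the second interval for id 5 comes back with a NULL sum (None in Python), as in the desired result.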

SQL Query to display Calculated fields on a year, monthly basis

I need help writing this SQL query (PostgreSQL) to display results in the form below:
--------------------------------------------------------------------------------
State | Jan '17 | Feb '17 | Mar '17 | Apr '17 | May '17 ... Dec '18
--------------------------------------------------------------------------------
Principal Outs. |700,839 |923,000 |953,000 |6532,293 | 789,000 ... 913,212
Disbursal Amount |23,000 |25,000 |23,992 | 23,627 | 25,374 ... 23,209
Interest |113,000 |235,000 |293,992 |322,627 |323,374 ... 267,209
There are multiple tables but I would be okay joining them.

How to iterate through rows after group by in spark scala dataframe?

I have a dataframe, df1, with the columns below.
Following the example there:
Project_end_date  I_date       Project_start_date  id
Jan 30 2017       Jan 10 2017  Jan 1 2017          1
Jan 30 2017       Jan 15 2017  Jan 1 2017          1
Jan 30 2017       Jan 20 2017  Jan 1 2017          1
Here you would first find the differences between I_date and the start date, which would be 10, 15, and 20 days. Then you would express those as a percentage of the project's duration, so 100*10/30=33%, 100*15/30=50%, 100*20/30=67%. Then you would obtain the mean (50%), min (33%), max (67%), etc. of these.
How can I achieve this after grouping by id?
df.groupby("id"). ?
The easiest way is to add the value you care about as a column just before the groupBy:
import org.apache.spark.sql.{functions => F}
import spark.implicits._

// Assumes the three date columns are already DateType;
// parse string columns first, e.g. with F.to_date and a matching format.
df.withColumn("ival",
    F.datediff($"I_date", $"Project_start_date") /
    F.datediff($"Project_end_date", $"Project_start_date"))
  .groupBy("id")
  .agg(
    F.min($"ival").as("min"),
    F.max($"ival").as("max"),
    F.avg($"ival").as("avg")
  )
If you want to avoid the withColumn you can just get the expression for ival inside F.min, F.max and F.avg, but that's more verbose.
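
The same computation can be worked through in plain Python to see what numbers to expect. This is a sketch of the arithmetic only, not Spark code; the rows are the three sample rows from the question, and the helper name `days_between` is made up here. Exact calendar arithmetic gives 9, 14 and 19 days out of 29, so the fractions differ slightly from the rounded figures in the question.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%b %d %Y"  # matches strings such as "Jan 30 2017" (English month names)

# (Project_end_date, I_date, Project_start_date, id), copied from the question.
rows = [
    ("Jan 30 2017", "Jan 10 2017", "Jan 1 2017", 1),
    ("Jan 30 2017", "Jan 15 2017", "Jan 1 2017", 1),
    ("Jan 30 2017", "Jan 20 2017", "Jan 1 2017", 1),
]

def days_between(later, earlier):
    """Hypothetical helper: whole days from `earlier` to `later`."""
    return (datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)).days

# Step 1: compute the per-row fraction (the "ival" column).
ivals = defaultdict(list)
for end, i_date, start, id_ in rows:
    ivals[id_].append(days_between(i_date, start) / days_between(end, start))

# Step 2: aggregate per id, like groupBy("id").agg(min, max, avg).
stats = {
    id_: {"min": min(v), "max": max(v), "avg": sum(v) / len(v)}
    for id_, v in ivals.items()
}
print(stats)
```

Computing the derived value first and aggregating second mirrors the withColumn-then-groupBy order used in the answer.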

Select from table removing similar rows - PostgreSQL

There is a table with document revisions and authors. Looks like this:
doc_id rev_id rev_date editor title,content so on....
123 1 2016-01-01 03:20 Bill ......
123 2 2016-01-01 03:40 Bill
123 3 2016-01-01 03:50 Bill
123 4 2016-01-01 04:10 Bill
123 5 2016-01-01 08:40 Alice
123 6 2016-01-01 08:41 Alice
123 7 2016-01-01 09:00 Bill
123 8 2016-01-01 10:40 Cate
942 9 2016-01-01 11:10 Alice
942 10 2016-01-01 11:15 Bill
942 15 2016-01-01 11:17 Bill
I need to find the moments when a document was handed over to another editor, i.e. only the first row of every editing series.
Like so:
doc_id rev_id rev_date editor title,content so on....
123 1 2016-01-01 03:20 Bill ......
123 5 2016-01-01 08:40 Alice
123 7 2016-01-01 09:00 Bill
123 8 2016-01-01 10:40 Cate
942 9 2016-01-01 11:10 Alice
942 10 2016-01-01 11:15 Bill
If I use DISTINCT ON (doc_id, editor), it re-sorts the table and I see only one row per document and editor, which is incorrect.
Of course I can dump everything and filter with shell tools like awk | sort | uniq, but that is not practical for big tables.
Window functions like FIRST_ROW do not help much, because partitioning by doc_id, editor would merge separate editing series by the same editor.
Is there a better way?
Thank you.
You can use lag() to get the previous value, and then a simple comparison:
select t.*
from (select t.*,
lag(editor) over (partition by doc_id order by rev_date) as prev_editor
from t
) t
where prev_editor is null or prev_editor <> editor;
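
The lag() comparison can be mimicked in plain Python to check the logic against the sample rows: walk the revisions in (doc_id, rev_date) order and keep a row whenever the previous editor of the same document is missing or different. This is an illustrative sketch, not a replacement for the SQL; the function name `first_of_each_series` is made up here.

```python
# (doc_id, rev_id, rev_date, editor), copied from the question.
revisions = [
    (123, 1, "2016-01-01 03:20", "Bill"),
    (123, 2, "2016-01-01 03:40", "Bill"),
    (123, 3, "2016-01-01 03:50", "Bill"),
    (123, 4, "2016-01-01 04:10", "Bill"),
    (123, 5, "2016-01-01 08:40", "Alice"),
    (123, 6, "2016-01-01 08:41", "Alice"),
    (123, 7, "2016-01-01 09:00", "Bill"),
    (123, 8, "2016-01-01 10:40", "Cate"),
    (942, 9, "2016-01-01 11:10", "Alice"),
    (942, 10, "2016-01-01 11:15", "Bill"),
    (942, 15, "2016-01-01 11:17", "Bill"),
]

def first_of_each_series(rows):
    """Keep rows whose editor differs from the previous revision of the same doc."""
    prev_editor = {}  # doc_id -> editor of the previous revision (the "lag")
    kept = []
    # Sorting by (doc_id, rev_date) mirrors PARTITION BY doc_id ORDER BY rev_date.
    for doc_id, rev_id, rev_date, editor in sorted(rows, key=lambda r: (r[0], r[2])):
        # A missing entry behaves like "prev_editor IS NULL".
        if prev_editor.get(doc_id) != editor:
            kept.append((doc_id, rev_id, rev_date, editor))
        prev_editor[doc_id] = editor
    return kept

kept = first_of_each_series(revisions)
for row in kept:
    print(row)
```

Because the previous editor is tracked per document, Bill's second series on document 123 (rev_id 7) is kept even though he edited the same document earlier.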

function to calculate aggregate sum count in postgresql

Is there a function that calculates the total count for the complete month, as shown below? I am not sure whether Postgres has one. I am looking for the grand total value.
2012-08=# select date_trunc('day', time), count(distinct column) from table_name group by 1 order by 1;
date_trunc | count
---------------------+-------
2012-08-01 00:00:00 | 22
2012-08-02 00:00:00 | 34
2012-08-03 00:00:00 | 25
2012-08-04 00:00:00 | 30
2012-08-05 00:00:00 | 27
2012-08-06 00:00:00 | 31
2012-08-07 00:00:00 | 23
2012-08-08 00:00:00 | 28
2012-08-09 00:00:00 | 28
2012-08-10 00:00:00 | 28
2012-08-11 00:00:00 | 24
2012-08-12 00:00:00 | 36
2012-08-13 00:00:00 | 28
2012-08-14 00:00:00 | 23
2012-08-15 00:00:00 | 23
2012-08-16 00:00:00 | 30
2012-08-17 00:00:00 | 20
2012-08-18 00:00:00 | 30
2012-08-19 00:00:00 | 20
2012-08-20 00:00:00 | 24
2012-08-21 00:00:00 | 20
2012-08-22 00:00:00 | 17
2012-08-23 00:00:00 | 23
2012-08-24 00:00:00 | 25
2012-08-25 00:00:00 | 35
2012-08-26 00:00:00 | 18
2012-08-27 00:00:00 | 16
2012-08-28 00:00:00 | 11
2012-08-29 00:00:00 | 22
2012-08-30 00:00:00 | 26
2012-08-31 00:00:00 | 17
(31 rows)
--------------------------------
Total | 12345
As best I can guess from your question and comments you want sub-totals of the distinct counts by month. You can't do this with group by date_trunc('month',time) because that'll do a count(distinct column) that's distinct across all days.
For this you need a subquery or CTE:
WITH day_counts(day,day_col_count) AS (
select date_trunc('day', time), count(distinct column)
from table_name group by 1
)
SELECT 'Day', day, day_col_count
FROM day_counts
UNION ALL
SELECT 'Month', date_trunc('month', day), sum(day_col_count)
FROM day_counts
GROUP BY 2
ORDER BY 2;
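
The difference between summing daily distinct counts and counting distinct values over the whole month can be seen on a tiny example. This Python sketch uses made-up sample data (the events list is not from the question) and only illustrates why the two totals differ.

```python
from collections import defaultdict

# Made-up sample data: (day, value). "a" occurs on both days.
events = [
    ("2012-08-01", "a"), ("2012-08-01", "b"), ("2012-08-01", "a"),
    ("2012-08-02", "a"), ("2012-08-02", "c"),
]

# Daily distinct counts, like count(distinct column) grouped by day.
per_day = defaultdict(set)
for day, value in events:
    per_day[day].add(value)
day_counts = {day: len(values) for day, values in per_day.items()}

# Monthly subtotal as the SUM of the daily distinct counts (the CTE approach).
month_sum = defaultdict(int)
for day, n in day_counts.items():
    month_sum[day[:7]] += n  # day[:7] is the 'YYYY-MM' prefix

# Distinct count over the whole month (grouping by month directly).
month_distinct = defaultdict(set)
for day, value in events:
    month_distinct[day[:7]].add(value)

print(day_counts)                        # per-day distinct counts
print(month_sum["2012-08"])              # sum of daily distinct counts
print(len(month_distinct["2012-08"]))    # distinct values across the month
```

Here "a" is counted once per day in the summed total but only once in the month-wide distinct count, so the two figures differ (4 versus 3).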
My earlier guess before comments was: Group by month?
select date_trunc('month', time), count(distinct column)
from table_name
group by date_trunc('month', time)
order by 1
Or are you trying to include running totals or subtotal lines? For running totals you need to use sum as a window function. Subtotals are just a pain, as SQL doesn't really lend itself to them; you need to UNION two queries, then wrap them in an outer ORDER BY.
select *
from (
  select
    date_trunc('day', time)::text as "date",
    count(distinct column) as count
  from table_name
  group by 1
  union
  select
    'Total',
    count(distinct column)
  from table_name
  group by date_trunc('month', time)
) t
-- false sorts before true, so the 'Total' line comes last
order by "date" = 'Total', "date"