kdb/q: use function in a select from partitioned table - kdb

I'm trying to get max drawdown from a partitioned table across multiple dates. The query works fine when run with a date constrained to a specific day. E.g.
select {max neg x-maxs x} pnl from trades where date=last date
It's getting map-reduced over multiple dates so the above query no longer works. I can make the query run over multiple dates by adding another aggregation:
select max {max neg x-maxs x} pnl from trades
but it's not getting the max drawdown from continuous sequence of trades but a maximum of daily drawdowns.
I wonder if there's a way to make it work with a single select without chaining selects like
select {max neg x-maxs x} pnl from select pnl from trades
I've got a rather big query to pull a lot of various metrics on the trades where max drawdown is just one of them. Using chained select means that I need to break the big query into two queries, map-reduced and non-map-reduced, and then join them back which would make the query look ugly.
Thanks!

Select query runs on each date in partition db and apply function to each date values and finally aggregates them depending upon the call (user defined function behaves differently than plain 'q' functions).
So I don't think you can combine that into one query. But there are ways you can look for to make your query more generalized and reusable for different scenarios.
For ex. convert your query to functional form and use variables in that query for column name and user function. Put this in one function which will accept column name and user function. Now you can call this function with different set of (column ;function). Something like :
runF:{[col;usrfunc] funtional_query_uses_col_userfunc }
All this depends on your use cases. Also check for memory usage as you'll be taking lot of data into memory.

Related

Getting the total number of records in a single knexjs query when using the limit() method

I use knexjs and postgresql. Is it possible in knexjs to get the total of records from the same query in which the limit is used?
For example:
knex.select().from('project').limit(50)
Is it possible to somehow get the total number of records in the same query if there are more than 50?
The question arose due to the fact that my query is much more complex, which uses a lot of subqueries and conditions, and I would not like to make this query twice to get the data in one query and the total number of records (I use the .count() method) from another.
I do not know your obscurification manager (knexjs?) but I would think you should be able to add the window version of the count() function to your select list. In plain SQL something like: Where ... represents your current select list. (see demo)
select ..., count(*) over() total_rows
from project
limit 5;
This works because the window count function counts all rows selected, after all rows selected, but before the LIMIT clause is applied. Note: This adds a column to the result set with the same value in every row.

From group by to window function postgres

my goal is to compare Event Engagement Rate = ( Event Subscribed / Event Participated ) for the newsletter source vs Email Source by using window functions
i Managed to do it by using group by
select source ,sum(event_subscribed/event_participated) as "engagement rate" from events
group by source
i keep failing with the windows function by having too many rows and wrong engagement rates
Thanks for your help.
First you need to understand the difference between the Aggregate functions and the Window variant of the same function. The aggregate functions builds groups as specified in the 'Group By' clause and reduces the result to a single row per group. The Window version also builds groups as specified in the 'Partition By' clause. However, it does not reduce the number of rows, instead it generated the same result in each of the rows within the group. See example. To reduce the Window version to a single row per group use Distinct on expression.
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of
each set of rows where the given expressions evaluate to equal. The
DISTINCT ON expressions are interpreted using the same rules as for
ORDER BY (see above). Note that the “first row” of each set is
unpredictable unless ORDER BY is used to ensure that the desired row
appears first.
Your query becomes
select distinct on (source)
source
, sum(event_subscribed/event_participated) as "engagement rate"
from events
order by source.

DAX: Distinct and then aggregate twice

I'm trying to create a Measure in Power BI using DAX that achieves the below.
The data set has four columns, Name, Month, Country and Value. I have duplicates so first I need to dedupe across all four columns, then group by Month and sum up the value. And then, I need to average across the Month to arrive at a single value. How would I achieve this in DAX?
I figured it out. Reply by #OscarLar was very close but nested SUMMARIZE causes problems because it cannot aggregate values calculated dynamically within the query itself (https://www.sqlbi.com/articles/nested-grouping-using-groupby-vs-summarize/).
I kept the inner SUMMARIZE from #OscarLar's answer changed the outer SUMMARIZE with a GROUPBY. Here's the code that worked.
AVERAGEX(GROUPBY(SUMMARIZE(Data, Data[Name], Data[Month], Data[Country], Data[Value]), Data[Month], "Month_Value", sumx(CURRENTGROUP(), Data[Value])), [Month_Value])
Not sure I completeley understood the question since you didn't provide example data or some DAX code you've already tried. Please do so next time.
I'm assuming parts of this can not (for reasons) be done using power query so that you have to use DAX. Then I think this will do what you described.
Create a temporary data table called Data_reduced in which duplicate rows have been removed.
Data_reduced =
SUMMARIZE(
'Data';
[Name];
[Month];
[Country];
[Value]
)
Then create the averaging measure like this
AveragePerMonth =
AVERAGEX(
SUMMARIZE(
'Data_reduced';
'Data_reduced'[Month];
"Sum_month"; SUM('Data_reduced'[Value])
);
[Sum_month]
)
Where Data is the name of the table.

Query distinct values from historical database

If I run this query on large Historical database without specifying a date, will KDB be smart enough to retrive status values from index and not bring database down?
select distinct status from trades
The only way kdb can possibly tell all the distinct status is by reading from every partition. Yes this will take a lot of memory but unless you yourself want to maintain a cache of all distinct status, there is nothing else you can do. As previous mentioned an attribute will speed the query up but the query time will still only scale with the number of partitions.
To retrieve using index, kdb provides 'g#' attribute. Distinct alone can take more time which depends on size of your table(it will be linear search without `g# attribute).
Check this-> http://code.kx.com/q4m3/8_Tables/#88-attributes
Let's look at simple example:
q) a: 10000000#1 2 3 5
q) b:`g#a
q) \ts distinct a
68 134217888
q) \ts distinct b
0 288
Difference shows that g# attribute makes a lot of difference in time and space taken during searching. It is becauseg# attribute creates and maintains index on vector.

Annualize data - Tableau

I'm trying to annualise my data in tableau, but get an error in the Calculated Field.
"Cannot mix aggregate and non-aggregate arguments to function"
my formula is
sum(profit)/month(selected date) *12
How do I get an integer for the current month? That seems to be the problem, it tries to aggregate the month as well.
Thanks.
Short answer: wrap the call to month in a call to min() -- which works well if you have MONTH([selected date]) on the visualization as a dimension.
There are three types of calculated fields in Tableau:
row level calculations which act on a single data row. They can read from values of other fields in the same row and return a single value per row.
aggregate calculations which act on a partition or block of data rows. They can reference the result of aggregating the values for a field across the entire partition, using a an aggregate function like SUM() or MIN().
table calculations which act on an entire table of aggregated results.
You can't mix and match. Everything in a calculated field must be all at one level or another -- either all referenced fields must use aggregation functions (for aggregate calculated fields) or no referenced fields must use aggregation functions (for data row level calculated fields).
Hence the error message you saw.
Sometimes you know that all values for a field will be the same in a partition based on your visualization, so the aggregation function seems unnecessary. But Tableau still requires you to be explicit about how to turn a block of values into a single value, because the calculation must be defined even when the visualization is partitioned differently. In these cases, you can use min(), max(), avg(), or perhaps attr() because they all return the same value for a list of identical values.
The first two types are typically executed on the server (i.e. they are implemented by Tableau emitting SQL to send to the database server). Table calculations are executed by Tableau on the client site to post-process the results from the database server.
Table calcs are the most complicated type, but can be very useful. Explaining them is a post for another day.