We have a table t, as below:
q)t:([] sym:10?`GOOG`AMZN`IBM; px:10?100.; size:10?1000; mkt:10?`ab`cd`ef)
Our requirement is to group table t by column sym where column mkt is `ef; for the rest of the markets (`ab`cd) we need all the rows, ungrouped.
For this use case I have written the query below, which works as expected:
q)(select px, size, sym, mkt from select by sym from t where mkt=`ef), select px, size, sym, mkt from t where mkt in `ab`cd
Please help me optimize the above query along these lines, in pseudocode:
if mkt=`ef:
    use group by on table
else if mkt in `ab`cd:
    don't use group by on table
I have found two ways to write your query that differ from the one you have provided.
You can use the following query to accomplish what you want in one select statement:
select from t where (mkt<>`ef)|(mkt=`ef)&i=(last;i)fby ([]sym;mkt)
However if you compare its speed:
q)\t:1000 select from t where (mkt<>`ef)|(mkt=`ef)&i=(last;i)fby ([]sym;mkt)
68
to your original query:
q)\t:1000 (select px, size, sym, mkt from select by sym from t where mkt=`ef), select px, size, sym, mkt from t where mkt in `ab`cd
40
You can see that your query is faster.
Additionally, you can try the following, which does not require explicitly listing every mkt in t that you do not wish to group by sym:
(0!select by sym from t where mkt=`ef),select from t where mkt<>`ef
But again this ends up being around the same speed as your original solution:
q)\t:1000 (0!select by sym from t where mkt=`ef),select from t where mkt<>`ef
42
So in terms of optimization it seems your query works well for what you want it to accomplish.
This isn't any quicker either (as Rob says, your query is already good in terms of speed), but it is at least shorter:
delete x from select by sym,(1+i)*`ef<>mkt from t
...provided you don't mind the order changing a little.
In fby form:
select from t where i=(last;i)fby([]sym;(1+i)*`ef<>mkt)
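To see why the sym,(1+i)*`ef<>mkt grouping works: rows with mkt=`ef get a second grouping key of 0, so they collapse to one row per sym, while every other row gets the unique key 1+i and so stays a group of its own. A minimal sketch showing just that key column (same table t):
q)update grp:(1+i)*`ef<>mkt from t  / `ef rows get grp=0; all others get a unique non-zero grp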
Related
I have a table with sym-date indexing.
I'm trying to get the same table back, but skipping the first 252 rows for each symbol.
I expected it would be:
ungroup 252_select by sym from t
but this doesn't work. What am I doing wrong?
You are looking for something like this:
select from t where 252<=(rank;i) fby sym
where rank returns each item's position in the sorted list, and fby applies this function to each subset of i when split on sym.
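For illustration, rank returns where each item would fall in ascending sort order, so applied to the already-ascending i of a group it simply numbers that group's rows 0, 1, 2, and so on:
q)rank 10 30 20 40
0 2 1 3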
Reasons why your attempt wasn't working
select by sym from t returns only the last row for each sym
therefore when you drop rows using 252_ you are dropping 252 of those one-per-sym rows, not the first 252 rows within each sym
ungroup is then likely failing because you have two or more columns with different-length vector elements
If you wanted to do this via ungroup, you could do the following, using xgroup so as to keep all the rows in the grouping:
ungroup 252_/:/:`sym xgroup t
Another fby form uses an explicit lambda; note that the position must be computed within each group rather than compared against the global row index:
select from t where ({252<=til count x};i) fby sym
I came up with an admittedly more convoluted solution:
t:([] date:.z.D+til 1008;sym:(504#`A),(504#`B);px:1008?1.0); / test table
s:252; / # of elements to skip
ungroup (key tt)!{flip (x)_flip y}/: [s;tt[key tt:?[t;();((,)`sym)!(,)`sym;`date`px!`date`px]]]
The logic involves:
grouping by sym
assigning the result on the fly to table tt
processing the grouped dictionary one by one
reconstructing the table
Now, I initially benchmarked against the fby solution proposed above using a very small table, where the fby solution takes about half the time:
t:([] date:.z.D+til 10;sym:(5#`A),(5#`B);px:10?1.0);
s:2;
\t:100000 ungroup (key tt)!{flip (x)_flip y}/: [s;tt[key tt:?[t;();((,)`sym)!(,)`sym;`sym`px!`sym`px]]]
796
\t:100000 select from t where s<=(rank;i) fby sym
396
However, when the larger table proposed at the beginning (1008 rows in total, first 252 skipped per ticker) is used, the performance ranking changes:
\t:100000 select from t where s<=(rank;i) fby sym
2384
\t:100000 ungroup (key tt)!{flip (x)_flip y}/: [s;tt[key tt:?[t;();((,)`sym)!(,)`sym;`sym`px!`sym`px]]]
1679
I’d like to sort my Postgres results by some fancy ranking function, but for the sake of simplicity, let’s say that I’d like to add two custom columns and sort by them.
SELECT my_table.*,
extract(epoch from (age(current_date, '2012-09-12 10:43:40'::date)))/3600 AS age_in_hours,
Fancy_function_counting_distance() AS distance
FROM my_table
ORDER BY distance + age_in_hours;
However, it doesn’t work; I’m getting the error: ERROR: column "distance" does not exist.
Is it possible to order my results by those custom-named columns?
I’m running postgres 9.1.x
Aliases in the SELECT list can be used in ORDER BY only as bare output-column names; inside an expression such as distance + age_in_hours, names are resolved against the input columns, so the aliases are not visible.
You can use column-position specification (e.g. ORDER BY 1, 2), but that doesn't accept an expression; you cannot ORDER BY 1+2, for example. So you need to use a subquery to generate the result set, then sort it in an outer query:
SELECT *
FROM (
SELECT my_table.*,
extract(epoch from (age(current_date, '2012-09-12 10:43:40'::date)))/3600 AS age_in_hours,
Fancy_function_counting_distance() AS distance
FROM my_table
) x
ORDER BY distance + age_in_hours;
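Alternatively (a sketch under the same assumptions; the alias score is illustrative), compute the combined expression inside the subquery and order the outer query by its bare alias:
SELECT *
FROM (
    SELECT my_table.*,
           extract(epoch from (age(current_date, '2012-09-12 10:43:40'::date)))/3600
             + Fancy_function_counting_distance() AS score
    FROM my_table
) x
ORDER BY score;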
I've got a SQL Server 2005 database. I need to get distinct values in addition to calling a function on those distinct values. I'm not sure how the distinct works when there is a function call involved. For example, I have this query:
SELECT DISTINCT a, b, c, dbo.fcn_DoSomething(a, b, c) AS z FROM users
I'm guessing that the function (fcn_DoSomething) is being called for all of the values in the table, not the distinct values. Am I correct? If so, how can I write the query to call the function only on distinct values of a,b,c? I know one option is to use a temporary table, but if anyone has better ideas that would be great.
Thanks
This got me curious, so I did a bit of basic testing. I created a small table with some distinct and some repeating values, a function that just does string concatenation, and then looked at the execution plans for:
Go
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
select distinct cola, colb, dbo.sillyfunc(cola, colb)
from distincttest
--Clear the cache
Go
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
select cola, colb, dbo.sillyfunc(cola, colb)
from (select distinct cola, colb from distincttest) as t
In this case, the execution plans showed clearly that the first one ran the concatenation function for every single row, but the second did the sort for distinct values first, then ran the function. But for a small number of rows, they had the same execution time, and when run together they showed each one using 50% of the total query resources.
So, I added a few hundred thousand repeating rows and tried again. This changed the query plan so it was doing a hash match to get distinctness rather than the former sort, and now the second version, which forced it to select for distinctness first, executed more than ten times faster.
Finally, I thought there was a chance that this might just be because SQL Server had my sillyfunc marked as nondeterministic (select OBJECTPROPERTYEX(object_id('dbo.sillyfunc'), 'IsDeterministic') returned 0), so I switched to patindex, which is a built-in function and considered deterministic. This gave the same results, with the function being called for every row in the first version and just for the few distinct ones in the second.
So it's possible that further testing would find situations that could coax the optimizer into doing something more sophisticated, but it appears that if you want the distinct applied before the function is called, then you need to use something like a subquery, CTE, or temp table to limit what the function has access to.
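For example, the CTE form (a sketch, using the hypothetical users table and fcn_DoSomething from the question; scalar UDFs must be schema-qualified):
WITH distinct_vals AS (
    SELECT DISTINCT a, b, c FROM users
)
SELECT a, b, c, dbo.fcn_DoSomething(a, b, c) AS z
FROM distinct_vals;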
This would ensure that the function only got called on distinct values.
select *, dbo.fcn_DoSomething(a, b, c)
from
(select distinct a, b, c from users) v
However, I believe that the function call will be optimised, so it may not make a difference. Give it a try.
Table 'animals':
animal_name animal_type
Tom Cat
Jerry Mouse
Kermit Frog
Query:
SELECT
array_to_string(array_agg(animal_name),';') animal_names,
array_to_string(array_agg(animal_type),';') animal_types
FROM animals;
Expected result:
Tom;Jerry;Kermit, Cat;Mouse;Frog
OR
Tom;Kermit;Jerry, Cat;Frog;Mouse
Can I be sure that the order in the first aggregate function will always be the same as in the second?
I mean, I wouldn't like to get:
Tom;Jerry;Kermit, Frog;Mouse;Cat
Use an ORDER BY, like this example from the manual:
SELECT array_agg(a ORDER BY b DESC) FROM table;
If you are on a PostgreSQL version < 9.0 then:
From: http://www.postgresql.org/docs/8.4/static/functions-aggregate.html
In the current implementation, the order of the input is in principle unspecified. Supplying the input values from a sorted subquery will usually work, however. For example:
SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;
So in your case you would write:
SELECT
array_to_string(array_agg(animal_name),';') animal_names,
array_to_string(array_agg(animal_type),';') animal_types
FROM (SELECT animal_name, animal_type FROM animals) AS x;
The input to array_agg would then be unordered, but it would be the same in both columns. And if you like, you could add an ORDER BY clause to the subquery, as shown below.
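For example (the sort key is illustrative; any key works, as long as both aggregates receive the same ordered input):
SELECT
array_to_string(array_agg(animal_name),';') animal_names,
array_to_string(array_agg(animal_type),';') animal_types
FROM (SELECT animal_name, animal_type FROM animals ORDER BY animal_name) AS x;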
According to Tom Lane:
... If I read it right, the OP wants to be sure that the two aggregate functions will see the data in the *same* unspecified order. I think that's a pretty safe assumption. The server would have to go way out of its way to do differently, and it doesn't.
... So it is documented behavior that an aggregate without its own ORDER BY will see the rows in whatever order the FROM clause supplies them.
So I think it's fine to assume that all the aggregates in your query, none of which uses its own ORDER BY, will see the input data in the same order. The order itself is unspecified, though (it depends on the order in which the FROM clause supplies rows).
Source: PostgreSQL mailing list
Do this (note that both aggregates order by the same key; ordering each by its own column would break the name/type pairing):
SELECT
array_to_string(array_agg(animal_name order by animal_name),';') animal_names,
array_to_string(array_agg(animal_type order by animal_name),';') animal_types
FROM
animals;
How do you do LIMIT in DB2 for iSeries?
I have a table with more than 50,000 records and I want to return records 0 to 10,000, and records 10,000 to 20,000.
I know that in MySQL you write LIMIT 0,10000 at the end of the query for rows 0 to 10,000, and LIMIT 10000,10000 for rows 10,000 to 20,000.
So, how is this done in DB2? What's the code and syntax?
(full query example is appreciated)
Using FETCH FIRST [n] ROWS ONLY:
http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp?topic=/com.ibm.db29.doc.perf/db2z_fetchfirstnrows.htm
SELECT LASTNAME, FIRSTNAME, EMPNO, SALARY
FROM EMP
ORDER BY SALARY DESC
FETCH FIRST 20 ROWS ONLY;
To get ranges, you'd have to use ROW_NUMBER() (available since v5r4) and use that within the WHERE clause (stolen from here: http://www.justskins.com/forums/db2-select-how-to-123209.html):
SELECT code, name, address
FROM (
SELECT row_number() OVER ( ORDER BY code ) AS rid, code, name, address
FROM contacts
WHERE name LIKE '%Bob%'
) AS t
WHERE t.rid BETWEEN 20 AND 25;
I developed this method:
You NEED a table that has a unique value that can be ordered.
If you want rows 10,000 to 25,000 and your table has 40,000 rows, first you need to get the starting point and total rows:
int start = 40000 - 10000;
int total = 25000 - 10000;
And then pass these by code to the query:
SELECT * FROM
(SELECT * FROM schema.mytable
ORDER BY userId DESC fetch first {start} rows only ) AS mini
ORDER BY mini.userId ASC fetch first {total} rows only
Support for OFFSET and LIMIT was recently added to DB2 for i 7.1 and 7.2. You need the following DB PTF group levels to get this support:
SF99702 level 9 for IBM i 7.2
SF99701 level 38 for IBM i 7.1
See here for more information: OFFSET and LIMIT documentation, DB2 for i Enhancement Wiki
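With that support in place, a paging query can also use the ANSI OFFSET/FETCH form, for example (a sketch assuming the PTF levels above, reusing the EMP table from the earlier answer):
SELECT LASTNAME, FIRSTNAME, EMPNO, SALARY
FROM EMP
ORDER BY SALARY DESC
OFFSET 10000 ROWS
FETCH FIRST 10000 ROWS ONLY;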
Here's the solution I came up with:
select FIELD from TABLE where FIELD > LASTVAL order by FIELD fetch first N rows only;
By initializing LASTVAL to 0 (or '' for a text field) and then setting it to the last value in the most recent set of records, this steps through the table in chunks of N records.
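For example, with a hypothetical numeric key MYID (all values illustrative):
select * from MYTABLE where MYID > 0 order by MYID fetch first 1000 rows only;
-- suppose the last MYID returned was 10432; the next chunk is then:
select * from MYTABLE where MYID > 10432 order by MYID fetch first 1000 rows only;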
#elcool's solution is a smart idea, but you need to know the total number of rows (which can even change while you are executing the query!). So I propose a modified version, which unfortunately needs 3 subqueries instead of 2:
select * from (
select * from (
select * from MYLIB.MYTABLE
order by MYID asc
fetch first {last} rows only
) I
order by MYID desc
fetch first {length} rows only
) II
order by MYID asc
where {last} should be replaced with row number of the last record I need and {length} should be replaced with the number of rows I need, calculated as last row - first row + 1.
E.g. if I want rows 10 to 25 (16 rows in total), {last} will be 25 and {length} will be 25-10+1=16.
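With those values substituted:
select * from (
select * from (
select * from MYLIB.MYTABLE
order by MYID asc
fetch first 25 rows only
) I
order by MYID desc
fetch first 16 rows only
) II
order by MYID asc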
Try this (TABLE is a reserved word, so a placeholder name is used, and the derived table needs a correlation name):
SELECT * FROM
(
SELECT T.*, ROW_NUMBER() OVER() AS R FROM MYTABLE T
) AS X
WHERE R BETWEEN 10000 AND 20000
The LIMIT clause allows you to limit the number of rows returned by the query. The LIMIT clause is an extension of the SELECT statement that has the following syntax:
SELECT select_list
FROM table_name
ORDER BY sort_expression
LIMIT n [OFFSET m];
In this syntax:
n is the number of rows to be returned.
m is the number of rows to skip before returning the n rows.
Another shorter version of LIMIT clause is as follows:
LIMIT m, n;
This syntax means skipping m rows and returning the next n rows from the result set.
A table may store rows in an unspecified order. If you don’t use the ORDER BY clause with the LIMIT clause, the returned rows are also unspecified. Therefore, it is a good practice to always use the ORDER BY clause with the LIMIT clause.
See Db2 LIMIT for more details.
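For example, to return rows 10,001 to 20,000 of a hypothetical orders table:
SELECT order_id, order_date
FROM orders
ORDER BY order_id
LIMIT 10000 OFFSET 10000;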
You should also consider the OPTIMIZE FOR n ROWS clause. More details on all of this are in the DB2 LUW documentation, in the Guidelines for restricting SELECT statements topic:
The OPTIMIZE FOR clause declares the intent to retrieve only a subset of the result or to give priority to retrieving only the first few rows. The optimizer can then choose access plans that minimize the response time for retrieving the first few rows.
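For example (reusing the EMP query from above; the clause is only a hint, so the full result set remains available):
SELECT LASTNAME, FIRSTNAME, EMPNO, SALARY
FROM EMP
ORDER BY SALARY DESC
OPTIMIZE FOR 20 ROWS;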
There are 2 solutions to paginate efficiently on a DB2 table:
1 - the technique using the row_number() function and the OVER clause, presented in another post ("SELECT row_number() OVER ( ORDER BY ... )"). On some big tables, I sometimes noticed degraded performance.
2 - the technique using a scrollable cursor. The implementation depends on the language used; this technique seems more robust on big tables.
I presented the 2 techniques, implemented in PHP, during a seminar last year. The slides are available at this link:
http://gregphplab.com/serendipity/uploads/slides/DB2_PHP_Best_practices.pdf
Sorry, but this document is only in French.
There are these options available:
DB2 has several strategies to cope with this problem.
You can use the "scrollable cursor" feature.
In this case you can open a cursor and, instead of re-issuing a query, you can FETCH forward and backward.
This works great if your application can hold state, since it doesn't require DB2 to rerun the query every time.
You can use the ROW_NUMBER() OLAP function to number rows and then return the subset you want.
This is ANSI SQL.
You can use the ROWNUM pseudo-column, which does the same as ROW_NUMBER() but is suitable if you have Oracle skills.
You can use LIMIT and OFFSET if you are more used to a MySQL or PostgreSQL dialect.