Count from KDB table with where clause - kdb

I know count table tells you how many rows are in a table, but how do you count from a table with a where clause as a filter? I tried count table where PERIOD=x but I am getting the error 'PERIOD, even though PERIOD is a column in the table.

Use qSQL to filter and then count the result:
count select from table where PERIOD=x

If you only need the count, do
exec sum PERIOD=x from table
If the table has many columns, this can be much faster than
count select from table where PERIOD=x
Please note that this computes a sum of booleans as a 32-bit int, so if your table has more than a billion rows, you may want to add a cast:
exec sum "j"$PERIOD=x from table

The following will be the most efficient:
select count i from table where PERIOD=x
@jomahony's solution will require all columns to be read from disk (if the table is on disk) before doing the count, so it can be inefficient.
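For a quick sanity check, here is a minimal sketch of the three approaches against a toy in-memory table (the names are illustrative; the performance differences discussed above only matter for large on-disk tables):
table:([] PERIOD:1 1 2 3 1; SYM:`a`b`c`d`e)   / toy table with a PERIOD column
x:1
count select from table where PERIOD=x        / 3
exec sum PERIOD=x from table                  / 3i
select count i from table where PERIOD=x      / one-row table containing the count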

Related

Postgres: counting records in two groups (existing foreign key or null)

I have a table items and a table batches. A batch can have n items associated via items.batch_id.
I'd like to write a query that returns item counts in two groups, batched and unbatched:
items WHERE batch_id IS NOT NULL (batched)
items WHERE batch_id IS NULL (unbatched)
The result should look like this:
batched    unbatched
1200000    100
Any help appreciated, thank you!
EDIT:
I got stuck with using GROUP BY which turned out to be the wrong tool for the job.
You can use COUNT with a FILTER (WHERE ...) clause; this is called a conditional count.
CREATE TABLE items(item_id int, batch_id int);
CREATE TABLE
INSERT INTO items VALUES (1,NULL),(2,NULL),(3,1);
INSERT 0 3
CREATE TABLE batch (batch_id int);
CREATE TABLE
select
  count(*) filter (WHERE batch_id IS NOT NULL) as "matched",
  count(*) filter (WHERE batch_id IS NULL) as "unmatched"
from items;
 matched | unmatched
---------+-----------
       1 |         2
SELECT 1
fiddle
The count() function seems to be the most likely basic tool here. Given an expression, it returns a count of the number of rows where that expression evaluates to non-null. Given the argument *, it counts all rows in the group.
To the extent that there is a trick, it is getting the batched and unbatched counts in the same result row. There are at least three ways to do that:
Using subqueries:
select
(select count(batch_id) from items) as batched,
(select count(*) from items where batch_id is null) as unbatched
-- no FROM
That's pretty straightforward. Each subquery is executed and produces one column of the result. Because no FROM clause is given in the outer query, there will be exactly one result row.
Using window functions:
select
count(batch_id) over () as batched,
(count(*) over () - count(batch_id) over ()) as unbatched
from items
limit 1
That will compute the batched and unbatched results for the whole table on every result row, one per row of the items table, but only one result row is actually returned. It is reasonable to hope (though you would definitely want to test) that Postgres doesn't actually compute those counts for all the rows that are culled by the LIMIT clause. You might, for example, compare the performance of this option with that of the previous one.
Using count() with a filter clause, as described in detail in another answer.
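To make the count() behaviour described above concrete, here is a small sketch against the sample items data from the other answer (two rows with a NULL batch_id and one with a non-NULL value):
SELECT count(*)                   AS all_rows,   -- 3: counts every row
       count(batch_id)            AS batched,    -- 1: counts only non-null batch_id
       count(*) - count(batch_id) AS unbatched   -- 2
FROM items;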

Redshift large 'in' clause best practices

We have a query in which a list of parameter values is provided in the "IN" clause. Some time back this query failed to execute because the data in the "IN" clause grew quite large and the resulting query exceeded Redshift's 16 MB query size limit. We then tried processing the data in batches so as to limit the size and not breach the 16 MB limit.
My question is: what are the factors/pitfalls to keep in mind while supplying such large data in the "IN" clause of a query, and is there an alternative way to deal with such a large list of values?
If you have control over how you are generating your code, you could split it up as follows.
First, drop and recreate the filter table:
drop table if exists myfilter;
create table myfilter (filter_text varchar(max));
Second, populate the filter table in batches of a suitable size, e.g. 1000 values at a time:
insert into myfilter
values ({{myvalue1}}),({{myvalue2}}),({{myvalue3}});  -- and so on, up to 1000 values per insert
Repeat this step until all of your values have been inserted.
Then, use that filter table as follows
select * from master_table
where some_value in (select filter_text from myfilter);
drop table myfilter;
A large IN list is not a best practice in itself; for long lists it is better to use a join:
construct a virtual table with a subquery
join your target table to the virtual table
like this:
with
your_list as (
select 'first_value' as search_value
union select 'second_value'
...
)
select ...
from target_table t1
join your_list t2
on t1.col=t2.search_value

Join two tables where both joined columns have a large set of different values

I am currently trying to join two tables, where both tables have very many distinct values in the columns I am joining on.
Here's the T-SQL:
SELECT AVG(Position) AS Position
FROM MonitoringGsc_Keywords AS sk
JOIN GSC_RankingData ON sk.Id = GSC_RankingData.KeywordId
GROUP BY sk.Id
The execution plan shows me that the join takes a lot of time. I think it is because a huge set of values from the first table has to be compared with a huge set of values in the second table.
MonitoringGsc_Keywords.Id has about 60,000 distinct values.
GSC_RankingData has about 100,000,000 rows.
MonitoringGsc_Keywords.Id is the primary key of MonitoringGsc_Keywords; GSC_RankingData.KeywordId is indexed.
So, what can I do to increase performance?
Is the Position column from the GSC_RankingData table? If yes, then the JOIN is redundant and the query should look like this:
SELECT AVG(rd.Position) as Position
FROM GSC_RankingData rd
GROUP BY rd.KeywordId;
If the Position column is in the GSC_RankingData table, then the index on GSC_RankingData should include this column and look like this:
CREATE INDEX IX_GSC_RankingData_KeywordId_Position ON GSC_RankingData(KeywordId) INCLUDE(Position);
You should check index fragmentation for these tables; to do this you could use this query:
SELECT * FROM sys.dm_db_index_physical_stats(db_id(), object_id('MonitoringGsc_Keywords'), null, null, 'DETAILED')
If avg_fragmentation_in_percent is greater than 5% and less than 30%:
ALTER INDEX [index name] ON [table name] REORGANIZE;
If avg_fragmentation_in_percent is 30% or more:
ALTER INDEX [index name] ON [table name] REBUILD;
It could also be a problem with statistics; you can check them with this query:
SELECT
sp.stats_id, name, filter_definition, last_updated, rows, rows_sampled,
steps, unfiltered_rows, modification_counter
FROM sys.stats AS stat
CROSS APPLY sys.dm_db_stats_properties(stat.object_id, stat.stats_id) AS sp
WHERE stat.object_id = object_id('GSC_RankingData');
Check the last update date and the row counts; if they are not current, update the statistics. It is also possible that the statistics do not exist, in which case you must create them.
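For reference, a sketch of the two maintenance statements mentioned above (the statistics name is illustrative, and WITH FULLSCAN is just one sampling option):
UPDATE STATISTICS GSC_RankingData WITH FULLSCAN;   -- refresh existing statistics on the large table
CREATE STATISTICS ST_RankingData_KeywordId         -- create statistics on the join column if none exist (hypothetical name)
    ON GSC_RankingData (KeywordId);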

how to efficiently select first or last rows from a KDB table stored on disk

For an in-memory table, I can use sublist or take syntax to retrieve the first or last x rows.
How can I do this efficiently for an on-disk table, which may be very large? The constraint is that I don't want to pull all of the table's data into memory to run the query.
.Q.ind takes a table and (long!) indices into the table, and returns the appropriate rows:
http://code.kx.com/q/ref/dotq/#qind-partitioned-index
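For example, a sketch using .Q.ind to take the first or last n rows (assuming a date-partitioned trade table that is already mapped, so count trade returns the total row count across partitions):
n:3
.Q.ind[trade; til n]                       / first n rows
.Q.ind[trade; (count[trade]-n) + til n]    / last n rows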
I suppose you can use the virtual i column, which is the row number (per partition!) on a historical database.
So the first row would be: select from trade where date=first date, i=0
The last row would, I guess, be: select from trade where date=last date, i=max i
This assumes the usual date-partitioned setup. If you just have a non-partitioned table, select from trade where i=0 would be fine.

Number of rows returned in a sqlite statement

Is there any easy way to get the number of rows returned by a sqlite statement? I don't want to have to go through the process of doing a COUNT() first. Thanks.
On each call to sqlite3_step, increment a variable by 1.
If you want the row count in advance, then there's no easy way.
To count matching entries in a table, you can use the following SQL statement:
SELECT COUNT(*) FROM "mytable" where something=42;
Or just the following to get all entries:
SELECT COUNT(*) FROM "mytable";
In case you have already done the query and just want the number of entries returned, you can use sqlite3_data_count() or sqlite3_column_count(), depending on what you want to count (note that both of these report column counts for the statement, not row counts).