Querying timestamp column In q - kdb

I want to count the number of records inserted in a kdb+ database using a q query.
Currently, using below query:
count select from executionTable where ingestTimeStamp within 2019.09.07D00:00:00.000000000 2019.09.08D00:00:00.000000000
It works but not highly performant. Any recommendations to make it efficient is highly appreciated.
Thank you for your help.

If you only want count then use 'count i' inside select like below:
q) select count i from executionTable where ingestTimeStamp within 2019.09.07D00:00:00.000000000 2019.09.08D00:00:00.000000000
This will only get the count instead of fetching full data which is what your query is doing and that's one of the reasons for taking more time.
And if it is a partitioned database, then add 'date' in the filter as #Callum Biggs mentioned.

Given the information you have provided I'm assuming you're querying on-disk data, likely saved in a standard date partitioned structure. In this case, you should be specifying a date clause before you specify a time clause, this will prevent searching all the date directories.
select from executionTable where date=2019.09.07, ingestTimeStamp within 2019.09.07D00:00:00.000000000 2019.09.08D00:00:00.000000000
I'd suggest reading through the whitepaper on query optimization, it will give some guidance in good query structure, and how to take advantage of map reduction in kdb.

Related

DB2 Optimize for n rows

I'm learning DB2, and I came across this clause: OPTIMIZE FOR 1 ROW right after FETCH FIRST 100 ROWS ONLY.
I understand that FETCH FIRST 100 ROWS ONLY would give me the first 100 rows that qualified. But I don't understand what the OPTIMIZE FOR 1 ROW really doing here. I read this DB2 documentation, it says
Use OPTIMIZE FOR 1 ROW clause to influence the access path. OPTIMIZE FOR 1 ROW tells Db2 to select an access path that returns the first qualifying row quickly.
and this DB2 documentation, it says
In general, if you are retrieving only a few rows, specify OPTIMIZE FOR 1 ROW to influence the access path that Db2 selects.
But I'm still confused. Is using OPTIMIZE FOR n ROWS would make a query more efficient?
I also found this post on SO and it seems like OPTIMIZE FOR n ROWS is equivalent to FETCH FIRST n ROWS ONLY per the accepted answer.
But when I experimented it myself using OPTIMIZE FOR n ROWS instead of FETCH FIRST n ROWS ONLY, the result set was not the same. With OPTIMIZE FOR n ROWS, the query returns all qualifying rows.
Could someone please explain it to me what OPTIMIZE FOR n ROWS really does? Thanks!
Is using OPTIMIZE FOR n ROWS would make a query more efficient?
Not necessarily. However, it might cause your application to start receiving rows earlier than it otherwise would, if there is an access plan alternative that can find the first row matching the query criteria faster although the entire query will as a result run longer.
There's this bit in the Db2 for LUW docs that gives some examples specific to that platform:
Try specifying OPTIMIZE FOR n ROWS along with FETCH FIRST n ROWS ONLY, to encourage query access plans that return rows directly from the referenced tables, without first performing a buffering operation such as inserting into a temporary table, sorting, or inserting into a hash join hash table.
Applications that specify OPTIMIZE FOR n ROWS to encourage query access plans that avoid buffering operations, yet retrieve the entire result set, might experience poor performance. This is because the query access plan that returns the first n rows fastest might not be the best query access plan if the entire result set is being retrieved.

kdb: getting one row from HDB

For a normal table, we can select one row using select[1] from t. How can I do this for HDB?
I tried select[1] from t where date=2021.02.25 but it gives error
Not yet implemented: it probably makes sense, but it’s not defined nor implemented, and needs more thinking about as the language evolves
select[n] syntax works only if table is already loaded in memory.
The easiest way to get 1st row of HDB table is:
1#select from t where date=2021.02.25
select[n] will work if applied on already loaded data, e.g.
select[1] from select from t where date=2021.02.25
I've done this before for ad-hoc queries by using the virtual index i, which should avoid the cost of pulling all data into memory just to select a couple of rows. If your query needs to map constraints in first before pulling a subset, this is a reasonable solution.
It will however pull N rows for each date partition selected due to the way that q queries work under the covers. So YMMV and this might not be the best solution if it was behind an API for example.
/ 5 rows (i[5] is the 6th row)
select from t where date=2021.02.25, sum=`abcd, price=1234.5, i<i[5]
If your table is date partitioned, you can simply run
select col1,col2 from t where date=2021.02.25,i=0
That will get the first record from 2021.02.25's partition, and avoid loading every record into memory.
Per your first request (which is different to above) select[1] from t, you can achieve that with
.Q.ind[t;enlist 0]

Data Lake Analytics - Large vertex query

I have a simple query which make a GROUP BY using two fields:
#facturas =
SELECT a.CodFactura,
Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,
SUM(a.Consumo) AS Consumo
FROM #table_facturas AS a
GROUP BY a.CodFactura, a.DateKey;
#table_facturas has 4100 rows but query takes several minutes to finish. Seeing the graph explorer I see it uses 2500 vertices because I'm having 2500 CodFactura+DateKey unique rows. I don't know if it normal ADAL behaviour. Is there any way to reduce the vertices number and execute this query faster?
First: I am not sure your query actually will compile. You would need the Convert expression in your GROUP BY or do it in a previous SELECT statement.
Secondly: In order to answer your question, we would need to know how the full query is defined. Where does #table_facturas come from? How was it produced?
Without this information, I can only give some wild speculative guesses:
If #table_facturas is coming from an actual U-SQL Table, your table is over partitioned/fragmented. This could be because:
you inserted a lot of data originally with a distribution on the grouping columns and you either have a predicate that reduces the number of rows per partition and/or you do not have uptodate statistics (run CREATE STATISTICS on the columns).
you did a lot of INSERT statements, each inserting a small number of rows into the table, thus creating a big number of individual files. This will "scale-out" the processing as well. Use ALTER TABLE REBUILD to recompact.
If it is coming from a fileset, you may have too many small files in the input. See if you can merge them into less, larger files.
You can also try to hint a small number of rows in your query that creates #table_facturas if the above does not help by adding OPTION(ROWCOUNT=4000).

Tableau: Create a table calculation that sums distinct string values (names) when condition is met

I am getting my data from denormalized table, where I keep names and actions (apart from other things). I want to create a calculated field that will return sum of workgroup names but only when there are more than five actions present in DB for given workgroup.
Here's how I have done it when I wanted to check if certain action has been registered for workgroup:
WINDOW_SUM(COUNTD(IF [action] = "ADD" THEN [workgroup_name] END))
When I try to do similar thing with count, I am getting "Cannot mix aggregate and non-aggregate arguments":
WINDOW_SUM(COUNTD(IF COUNT([Number of Records]) > 5 THEN [workgroup_name] END))
I know that there's problem with the IF clause, but don't know how to fix it.
How to change the IF to be valid? Maybe there's an easier way to do it, that I am missing?
EDIT:
(after Inox's response)
I know that my problem is mixing aggregate with non-aggregate fields. I can't use filter to do it, because I want to use it later as a part of more complicated view - filtering would destroy the whole idea.
No, the problem is to mix aggregated arguments (e.g., sum, count) with non aggregate ones (e.g., any field directly). And that's what you're doing mixing COUNT([Number of Records]) with [workgroup_name]
If your goal is to know how many workgroup_name (unique) has more than 5 records (seems like that by the idea of your code), I think it's easier to filter then count.
So first you drag workgroup_name to Filter, go to tab conditions, select By field, Number of Records, Count, >, 5
This way you'll filter only the workgroup_name that has more than 5 records.
Now you can go with a simple COUNTD(workgroup_name)
EDIT: After clarification
Okay, than you need to add a marker that is fixed in your database. So table calculations won't help you.
By definition table calculation depends on the fields that are on the worksheet (and how you decide to use those fields to partition or address), and it's only calculated AFTER being called in a sheet. That way, each time you call the function it will recalculate, and for some analysis you may want to do, the fields you need to make the table calculation correct won't be there.
Same thing applies to aggregations (counts, sums,...), the aggregation depends, well, on the level of aggregation you have.
In this case it's better that you manipulate your data prior to connecting it to Tableau. I don't see a direct way (a single calculated field that would solve your problem). What can be done is to generate a db from Tableau (with the aggregation of number of records for each workgroup_name) then export it to csv or mdb and then reconnect it to Tableau. But if you can manipulate your database outside Tableau, it's usually a better solution

Sorting data in a oracle table

Is it possible to sort the data inside of an oracle table? Like ascending/descending via a certain column, alphabetically. Oracle 10g express.
You could try
Select *
from some_table
order by some_column asc
This will sort the results by some_column and place them in ascending order. Use desc instead of asc if you want descending order. Or did you mean to have the ordering in the physical storage itself?
I believe it's possible to specify the ordering/sorting of an indexed column in storage. It's probably closest to what you want. I don't usually use this index sort feature, but for more info see: http://www.stanford.edu/dept/itss/docs/oracle/10g/server.101/b10759/statements_5010.htm#i2062718
Perhaps you could use an index organized table - IOT to ensure that the data is stored ordered by index.
Have a look at the physical properties clause of the CREATE TABLE statement:
http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_7002.htm#i2128663
What is the problem that you are trying to solve though? An IOT may or may not be what you should be using.
As defined by the relational model, the rows and columns in a table are unordered. That's the theory at least.
In practice, if you want the data to come out of a query in a particular order then you should always use the ORDER BY clause. The order of the output is not guarantee unless you provide that.
It would be possible to use an ORDER BY when inserting into a table but that doesn't guarantee the order that data will be output. A query might come out in the same order every time.... but that doesn't mean it will come out in the same order next time.
There were issues when Oracle 10g came out where aggregate queries (with GROUP BY) were not coming out sorted because users had come to rely on the data being sorted as a side-effect of the grouping. With the introduction of the HASH GROUP BY in addition to the SORT GROUP BY people were caught out. This was a useful reminder that ORDER BY should always be used.
What do you really mean ?
Are you just asking for the order by clause ?
http://www.1keydata.com/sql/sqlorderby.html
Oracle 10g Express supports ANSI SQL like most RDBM's so you can sort in the standard manner:
SELECT * FROM Persons ORDER BY LastName
A good tutorial on SQL can be found here: w3schools SQL
Oracle Express does have some limitations compared to the Enterprise Edition but not in the basic SQL dialect it supports.