kdb: Is there a way to efficiently merge two sorted tables? - kdb

Say we have two tables both sorted on the time column:
t1:`time xasc ([]time:5?100;v:5?1000)
t2:`time xasc ([]time:5?100;v:5?1000)
Is there an efficient way to get the same result as `time xasc t1,t2 , using the fact that the two tables are already sorted? I looked at aj but I wasn't able to find the "combine two tables" functionality I need here.

There is no native merge-sort/binary-sort in kdb so the optimal available approach is asc x,y. If you go down the path of replicating a merge/binary sort in kdb then you're unlikely to get it faster than the native asc x,y. You could alternatively try to write a merge/binary sort in C and import a shared library to use in kdb

Related

How to use OPTIMIZE ZORDER BY in Databricks

I have two dataframes(from a delta lake table) that do a left join via an id column.
sd1, sd2
%sql
select
a.columnA,
b.columnB,
from sd1 a
left outer join sd2 b
on a.id = b.id
The problem is that my query takes a long time, looking for ways to improve the results I have found OPTIMIZE ZORDER BY Youtube video
according to the video seems to be useful when ordering columns if they are going to be part of the where condition`.
But since the two dataframes use the id in the join condition, could it be interesting to order that column?
spark.sql(f'OPTIMIZE delta.`{sd1_delta_table_path}` ZORDER BY (id)')
the logic that follows in my head is that if we first order that column then it will take less time to look for them to make the match. Is this correct ?
Thanks ind advance
OPTIMIZE ZORDER may help a bit by placing related data together, but it's usefulness may depend on the data type used for ID column. OPTIMIZE ZORDER relies on the data skipping functionality that just gives you min & max statistics, but may not be useful when you have big ranges in your joins.
You can also tune a file sizes, to avoid scanning of too many smaller files.
But from my personal experience, for joins, bloom filters give better performance because they allow to skip files more efficiently than data skipping. Just build bloom filter on the ID column...

How to sort columns in a HDB to apply the p attribute

I have a HDB that is date partitioned. I want to apply the p attribute historically to a specific column. As far as I am aware, to do this, I need to first ensure this column is sorted in a way that all common occurrences are adjacent. Currently, this is not the case. How can I sort this HDB so that this column in each partition has common values adjacent to each other.
Thank you!
You can use xasc on disk.
https://code.kx.com/q/ref/asc/#xasc
You'd want to sort each partition and apply the parted attribute. Could build up the paths with .Q.PD & .Q.PV as I don't think this is something that exists in dbmaint.q.
This is just a general idea, it is untested so use on some test data and modify to meet your hdb structure if needed.
You may need to modify the xasc part if you want additional sorting within each part.
{[tbl;sortPartCol]
{[sortPartCol;path] sortPartCol xasc path;#[path;sortPartCol;`p#]
}[sortPartCol] each distinct ` sv/: (.Q.PD cross `$string .Q.PV) cross tbl
}
https://code.kx.com/q/ref/dotq/#qpv-partition-values
https://code.kx.com/q/ref/dotq/#qpd-partition-locations

Querying timestamp column In q

I want to count the number of records inserted in a kdb+ database using a q query.
Currently, using below query:
count select from executionTable where ingestTimeStamp within 2019.09.07D00:00:00.000000000 2019.09.08D00:00:00.000000000
It works but not highly performant. Any recommendations to make it efficient is highly appreciated.
Thank you for your help.
If you only want count then use 'count i' inside select like below:
q) select count i from executionTable where ingestTimeStamp within 2019.09.07D00:00:00.000000000 2019.09.08D00:00:00.000000000
This will only get the count instead of fetching full data which is what your query is doing and that's one of the reasons for taking more time.
And if it is a partitioned database, then add 'date' in the filter as #Callum Biggs mentioned.
Given the information you have provided I'm assuming you're querying on-disk data, likely saved in a standard date partitioned structure. In this case, you should be specifying a date clause before you specify a time clause, this will prevent searching all the date directories.
select from executionTable where date=2019.09.07, ingestTimeStamp within 2019.09.07D00:00:00.000000000 2019.09.08D00:00:00.000000000
I'd suggest reading through the whitepaper on query optimization, it will give some guidance in good query structure, and how to take advantage of map reduction in kdb.

Where clause versus join clause in Spark SQL

I am writing a query to get records from Table A which satisfies a condition from records in Table B. For example:
Table A is:
Name Profession City
John Engineer Palo Alto
Jack Doctor SF
Table B is:
Profession City NewJobOffer
Engineer SF Yes
and I'm interested to get Table c:
Name Profession City NewJobOffer
Jack Engineer SF Yes
I can do this in two ways using where clause or join query which one is faster and why in spark sql?
Where clause to compare the columns add select those records or join on the column itself, which is better?
It's better to provide filter in WHERE clause. These two expressions are not equivalent.
When you provide filtering in JOIN clause, you will have two data sources retrieved and then joined on specified condition. Since join is done through shuffling (redistributing between executors) data first, you are going to shuffle a lot of data.
When you provide filter in WHERE clause, Spark can recognize it and you will have two data sources filtered and then joined. This way you will shuffle less amount of data. What might be even more important is that this way Spark may also be able to do a filter-pushdown, filtering data at datasource level, which means even less network pressure.

Sorting data in a oracle table

Is it possible to sort the data inside of an oracle table? Like ascending/descending via a certain column, alphabetically. Oracle 10g express.
You could try
Select *
from some_table
order by some_column asc
This will sort the results by some_column and place them in ascending order. Use desc instead of asc if you want descending order. Or did you mean to have the ordering in the physical storage itself?
I believe it's possible to specify the ordering/sorting of an indexed column in storage. It's probably closest to what you want. I don't usually use this index sort feature, but for more info see: http://www.stanford.edu/dept/itss/docs/oracle/10g/server.101/b10759/statements_5010.htm#i2062718
Perhaps you could use an index organized table - IOT to ensure that the data is stored ordered by index.
Have a look at the physical properties clause of the CREATE TABLE statement:
http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_7002.htm#i2128663
What is the problem that you are trying to solve though? An IOT may or may not be what you should be using.
As defined by the relational model, the rows and columns in a table are unordered. That's the theory at least.
In practice, if you want the data to come out of a query in a particular order then you should always use the ORDER BY clause. The order of the output is not guarantee unless you provide that.
It would be possible to use an ORDER BY when inserting into a table but that doesn't guarantee the order that data will be output. A query might come out in the same order every time.... but that doesn't mean it will come out in the same order next time.
There were issues when Oracle 10g came out where aggregate queries (with GROUP BY) were not coming out sorted because users had come to rely on the data being sorted as a side-effect of the grouping. With the introduction of the HASH GROUP BY in addition to the SORT GROUP BY people were caught out. This was a useful reminder that ORDER BY should always be used.
What do you really mean ?
Are you just asking for the order by clause ?
http://www.1keydata.com/sql/sqlorderby.html
Oracle 10g Express supports ANSI SQL like most RDBM's so you can sort in the standard manner:
SELECT * FROM Persons ORDER BY LastName
A good tutorial on SQL can be found here: w3schools SQL
Oracle Express does have some limitations compared to the Enterprise Edition but not in the basic SQL dialect it supports.