OrientDB query and schema patterns to speed up the reading phase

I have some performance issues on a fairly big data store.
To optimize the insert phase, we created a document store rather than a graph, because edge creation was too slow.
Essentially, we now have a class A (with about 30M documents) with a link (say, a field fieldL) to a class B (about 500 documents).
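In OrientDB SQL terms, the schema looks roughly like this (a sketch, not our actual DDL):
CREATE CLASS B
CREATE CLASS A
CREATE PROPERTY A.fieldL LINK B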
The query structure is like:
select from A where field1='field1value' and field2='field2value' and field3>0 ... and fieldL in (select from B where ...)
The first issue I've found is this:
I created n indexes on the n properties involved in the where condition, but the explain command showed that OrientDB uses only one of them: https://github.com/orientechnologies/orientdb/issues/3626
So I created a composite index, and if I run a query that involves only the indexed fields, say
select from A where field1='field1value' and field2='field2value' and field3>0
the result is really fast
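For context, the composite index was created along these lines (the index name is just illustrative):
CREATE INDEX A.field1_field2_field3 ON A (field1, field2, field3) NOTUNIQUE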
The issue is with the second part of the query, involving fieldL and the links.
I've tried the [#rid,...] syntax, but it does not seem to perform well.
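That attempt looked roughly like this (the record ids are hypothetical placeholders for B documents):
SELECT FROM A
WHERE field1 = 'field1value' AND field2 = 'field2value' AND field3 > 0
  AND fieldL IN [#13:42, #13:97]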
I've also tried changing the schema with a different approach: class B holds multiple links to class A, and the query pattern changes accordingly (say the field containing the links is fieldL1):
select * from (select expand(fieldL1) from B where ...) where field1='field1value' and field2='field2value' and field3>0
In this case the subquery effectively partitions the data, but unfortunately we lose the indexes on the result set, so the outer where clause (field1='field1value' and field2='field2value' and field3>0) is really slow.
My question is: is there a better query pattern to execute this kind of query faster?
Thank you very much.
By the way, during performance tuning it seems really awkward to count the documents involved in a query (https://github.com/orientechnologies/orientdb/issues/3462).

If you use the following query
select * from (select expand(fieldL1) from B where ...) where field1='field1value' and field2='field2value' and field3>0
it doesn't use the index, because there seem to be problems when combining subqueries and indexes.
For more information, you can look at this thread:
https://groups.google.com/forum/#!topic/orient-database/7jWEGpkIzXQ

Related

Get all ltree nodes at depth

I have an ltree column containing a tree with a depth of 3. I'm trying to write a query that can select all children at a specific depth (level 1 = get all parents, 2 = get all children, 3 = get all grandchildren). I know this is pretty straightforward with nlevel:
SELECT path FROM hierarchies
WHERE
nlevel(path) = 1
LIMIT 1000;
I have 200,000 dummy records and it's pretty fast (~170 ms). However, this query uses a sequential scan. I think it'd be better to write it in a way that takes advantage of the ltree operators supported by the GiST index. Frustratingly, I can't seem to wrap my brain around them, and I haven't found a similar question on SO or DBA (besides this one, on finding leaves).
Any advice is appreciated!
The only index that could support your query is a simple b-tree index on an expression.
create index on hierarchies((nlevel(path)))
Note, however, that it is quite possible for the planner to choose a sequential scan anyway, for example when the number of rows at level 1 is much larger than at the other levels.
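Once the index exists, a quick way to check whether the planner uses it is (with the table and column names from the question):
ANALYZE hierarchies;
EXPLAIN ANALYZE
SELECT path FROM hierarchies WHERE nlevel(path) = 1 LIMIT 1000;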

DB2 Optimize for n rows

I'm learning DB2, and I came across this clause: OPTIMIZE FOR 1 ROW right after FETCH FIRST 100 ROWS ONLY.
I understand that FETCH FIRST 100 ROWS ONLY would give me the first 100 rows that qualified. But I don't understand what OPTIMIZE FOR 1 ROW is really doing here. I read this DB2 documentation, which says
Use OPTIMIZE FOR 1 ROW clause to influence the access path. OPTIMIZE FOR 1 ROW tells Db2 to select an access path that returns the first qualifying row quickly.
and this DB2 documentation, which says
In general, if you are retrieving only a few rows, specify OPTIMIZE FOR 1 ROW to influence the access path that Db2 selects.
But I'm still confused. Does using OPTIMIZE FOR n ROWS make a query more efficient?
I also found this post on SO, and per the accepted answer it seems like OPTIMIZE FOR n ROWS is equivalent to FETCH FIRST n ROWS ONLY.
But when I experimented with it myself, using OPTIMIZE FOR n ROWS instead of FETCH FIRST n ROWS ONLY, the result sets were not the same: with OPTIMIZE FOR n ROWS, the query returns all qualifying rows.
Could someone please explain it to me what OPTIMIZE FOR n ROWS really does? Thanks!
Does using OPTIMIZE FOR n ROWS make a query more efficient?
Not necessarily. However, it might cause your application to start receiving rows earlier than it otherwise would, if there is an access plan alternative that can find the first row matching the query criteria faster, even though the entire query will run longer as a result.
There's this bit in the Db2 for LUW docs that gives some examples specific to that platform:
Try specifying OPTIMIZE FOR n ROWS along with FETCH FIRST n ROWS ONLY, to encourage query access plans that return rows directly from the referenced tables, without first performing a buffering operation such as inserting into a temporary table, sorting, or inserting into a hash join hash table.
Applications that specify OPTIMIZE FOR n ROWS to encourage query access plans that avoid buffering operations, yet retrieve the entire result set, might experience poor performance. This is because the query access plan that returns the first n rows fastest might not be the best query access plan if the entire result set is being retrieved.
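As a rough illustration of the combination described above (table and column names are made up):
-- cap the result at 100 rows, but bias the optimizer toward an access
-- path that delivers the first matching row quickly
SELECT order_id, order_date
FROM orders
WHERE customer_id = 12345
ORDER BY order_date DESC
FETCH FIRST 100 ROWS ONLY
OPTIMIZE FOR 1 ROW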

kdb: getting one row from HDB

For a normal table, we can select one row using select[1] from t. How can I do this for an HDB?
I tried select[1] from t where date=2021.02.25, but it gives an error:
Not yet implemented: it probably makes sense, but it’s not defined nor implemented, and needs more thinking about as the language evolves
The select[n] syntax works only if the table is already loaded into memory.
The easiest way to get the first row of an HDB table is:
1#select from t where date=2021.02.25
select[n] will work if applied to already-loaded data, e.g.
select[1] from select from t where date=2021.02.25
I've done this before for ad-hoc queries by using the virtual index i, which should avoid the cost of pulling all data into memory just to select a couple of rows. If your query needs to map constraints in first before pulling a subset, this is a reasonable solution.
It will, however, pull N rows for each date partition selected, due to the way that q queries work under the covers. So YMMV, and this might not be the best solution if it were behind an API, for example.
/ 5 rows (i[5] is the 6th row)
select from t where date=2021.02.25, sum=`abcd, price=1234.5, i<i[5]
If your table is date partitioned, you can simply run
select col1,col2 from t where date=2021.02.25,i=0
That will get the first record from 2021.02.25's partition, and avoid loading every record into memory.
Per your first request (which is different from the above), select[1] from t, you can achieve that with
.Q.ind[t;enlist 0]

What's the utility of the array type?

I'm a total newbie with PostgreSQL, but I have good experience with MySQL. I was reading the documentation and discovered that PostgreSQL has an array type. I'm quite confused, since I can't understand in which context this type can be useful within an RDBMS. Why would I choose this type instead of a classical one-to-many relationship?
Thanks in advance.
I've used them to make working with trees (such as comment threads) easier. You can store the path from the tree's root to a single node in an array; each number in the array is the branch number for that node. Then, you can do things like this:
SELECT id, content
FROM nodes
WHERE tree = X
ORDER BY path -- The array is here.
PostgreSQL will compare arrays element by element in the natural fashion, so ORDER BY path will dump the tree in a sensible linear display order; then you check the length of path to figure out a node's depth, which gives you the indentation to get the rendering right.
The above approach gets you from the database to the rendered page with one pass through the data.
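As a rough sketch of the kind of schema and query this implies (the names are illustrative, not from the original answer):
CREATE TABLE nodes (
    id      serial PRIMARY KEY,
    tree    integer   NOT NULL,  -- which tree/thread the node belongs to
    path    integer[] NOT NULL,  -- branch numbers from the root down to this node
    content text
);

-- depth (for indentation) is just the array length
SELECT id, content, array_length(path, 1) AS depth
FROM nodes
WHERE tree = 1
ORDER BY path;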
PostgreSQL also has geometric types, simple key/value types, and supports the construction of various other composite types.
Usually it is better to use traditional association tables but there's nothing wrong with having more tools in your toolbox.
One SO user is using it for what appears to be machine-aided translation. The comments to a follow-up question might be helpful in understanding his approach.
I've been using them successfully to aggregate recursive tree references using triggers.
For instance, suppose you have a tree of categories, and you want to find products in any of the categories (1,2,3) or any of their subcategories.
One way to do it is to use an ugly with recursive statement. Doing so will output a plan stuffed with merge/hash joins on entire tables and an occasional materialize.
with recursive categories as (
select id
from categories
where id in (1,2,3)
union all
...
)
select products.*
from products
join product2category on...
join categories on ...
group by products.id, ...
order by ... limit 10;
Another is to pre-aggregate the needed data:
create table categories (
  id int primary key,
  parents int[] -- (array_agg(parent_id) from parents) || id
);
create table products (
  id int primary key,
  categories int[] -- array_agg(category_id) from product2category
);
create index on categories using gin (parents);
create index on products using gin (categories);
select products.*
from products
where categories && array(
select id from categories where parents && array[1,2,3]
)
order by ... limit 10;
One issue with the above approach is that row estimates for the && operator are junk. (The selectivity is a stub function that has yet to be written, and results in something like 1/200 rows irrespective of the values in your aggregates.) Put another way, you may very well end up with an index scan where a seq scan would be correct.
To work around it, I increased the statistics on the gin-indexed column, and I periodically look into pg_stats to extract more appropriate stats. When a cursory look at those stats reveals that using && for the specified values will return an incorrect plan, I rewrite the applicable occurrences of && with arrayoverlap() (the latter has a stub selectivity of 1/3), e.g.:
select products.*
from products
where arrayoverlap(categories, array(
select id from categories where arrayoverlap(parents, array[1,2,3])
))
order by ... limit 10;
(The same goes for the <@ operator...)
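For the statistics tweak mentioned above, the idea is along these lines (using the illustrative table and column names from the sketch):
-- raise the statistics target for the gin-indexed array column, then refresh stats
ALTER TABLE products ALTER COLUMN categories SET STATISTICS 1000;
ANALYZE products;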

T-SQL speed comparison between LEFT() and the LIKE operator

I'm creating result paging based on the first letter of a certain nvarchar column, rather than the usual kind that pages on the number of results.
And now I'm faced with the question of whether to filter results using the LIKE operator or the equality (=) operator.
select *
from table
where name like @firstletter + '%'
vs.
select *
from table
where left(name, 1) = @firstletter
I've tried searching the net for speed comparison between the two, but it's hard to find any results, since most search results are related to LEFT JOINs and not LEFT function.
"Left" vs "Like" -- one should always use "Like" when possible where indexes are implemented because "Like" is not a function and therefore can utilize any indexes you may have on the data.
"Left", on the other hand, is function, and therefore cannot make use of indexes. This web page describes the usage differences with some examples. What this means is SQL server has to evaluate the function for every record that's returned.
"Substring" and other similar functions are also culprits.
Your best bet would be to measure the performance on real production data rather than trying to guess (or ask us). That's because performance can sometimes depend on the data you're processing, although in this case it seems unlikely (but I don't know that, which is why you should check).
If this is a query you will be doing a lot, you should consider another (indexed) column which contains the lowercased first letter of name and have it set by an insert/update trigger.
This will, at the cost of a minimal storage increase, make this query blindingly fast:
select * from table where name_first_char_lower = @firstletter
That's because most databases are read far more often than written, and this will amortise the cost of the calculation (done only for writes) across all reads.
It introduces redundant data but it's okay to do that for performance as long as you understand (and mitigate, as in this suggestion) the consequences and need the extra performance.
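One way to get that derived column without hand-writing a trigger is a persisted computed column, which SQL Server keeps in sync automatically (a sketch; all names are placeholders):
ALTER TABLE dbo.[table]
    ADD name_first_char_lower AS LOWER(LEFT(name, 1)) PERSISTED;

CREATE INDEX IX_table_name_first_char_lower
    ON dbo.[table] (name_first_char_lower);

-- the paging query then becomes a straightforward index seek
SELECT * FROM dbo.[table] WHERE name_first_char_lower = @firstletter;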
I had a similar question, and ran tests on both. Here is my code.
where (VOUCHER like 'PCNSF%'
or voucher like 'PCLTF%'
or VOUCHER like 'PCACH%'
or VOUCHER like 'PCWP%'
or voucher like 'PCINT%')
Returned 1434 rows in 1 min 51 seconds.
vs
where (LEFT(VOUCHER,5) = 'PCNSF'
or LEFT(VOUCHER,5)='PCLTF'
or LEFT(VOUCHER,5) = 'PCACH'
or LEFT(VOUCHER,4)='PCWP'
or LEFT (VOUCHER,5) ='PCINT')
Returned 1434 rows in 1 min 27 seconds.
My data is faster with the LEFT(VOUCHER,5) version. As an aside, my overall query does hit some indexes.
I would always suggest using the LIKE operator when the search column has an index. I tested the above query in my production environment with select count(column_name) from table_name where left(column_name,3)='AAA' OR left(column_name,3)='ABA' OR ... up to 9 OR clauses. My count returns 7,301,477 records in 4 seconds with LEFT and in 1 second with LIKE, i.e. where column_name like 'AAA%' OR column_name like 'ABA%' OR ... up to 9 LIKE clauses.
Calling a function in the where clause is not a best practice. See http://blog.sqlauthority.com/2013/03/12/sql-server-avoid-using-function-in-where-clause-scan-to-seek/
Entity Framework Core users
You can use EF.Functions.Like(columnName, searchString + "%") instead of columnName.StartsWith(...), and you'll get just a LIKE predicate in the generated SQL instead of all this 'LEFT' craziness!
Depending upon your needs you will probably need to preprocess searchString.
See also https://github.com/aspnet/EntityFrameworkCore/issues/7429
This function isn't present in Entity Framework's (non-Core) EntityFunctions, so I'm not sure how to do it for EF6.