I have some serious performance problems using orientdb.
I have got a plocal graph Database with a sheme like the following, the data is imported from JSON:
PersonA --hasInterest-> InterestA
PersonA --hasInterest-> InterestB
PersonB --hasInterest-> InterestA
PersonB --hasInterest-> InterestB
My goal is to find Interests that occur in combination with a given Interest. So my query looks like:
SELECT * FROM ( TRAVERSE out_hasInterest FROM ( SELECT FROM ( TRAVERSE in_hasInterest FROM #12:33 ) WHERE $depth > 0 )) WHERE $depth > 0
Where #12:33 is an Interest.
My real data is a bit bigger than this small snippet so for a concrete Interest there are ~500,000 Persons associated which have an average of ~3 Interests. So I would Traverse 500,000 + 500,000 * 3 = 2,000,000 Vertices. That seems not to be that much.
The query needs ~100 seconds. This is far to much for my application.
I think I am doing something terribly wrong, I can't believe the performance is that bad.
Any help is greatly appreciated!
Best regards
Ludwig
Version: 1.7-rc1
Why are you using traverse? If I understand correctly your goal you could do:
SELECT expand( in('hasInterest').out('hasInterest') ) FROM #12:33
Related
I am writing dynamic sql code and it would be easier to use a generic where column in (<comma-seperated values>) clause, even when the clause might have 1 term (it will never have 0).
So, does this query:
select * from table where column in (value1)
have any different performance than
select * from table where column=value1
?
All my test result in the same execution plans, but if there is some knowledge/documentation that sets it to stone, it would be helpful.
This might not hold true for each and any RDBMS as well as for each an any query with its specific circumstances.
The engine will translate WHERE id IN(1,2,3) to WHERE id=1 OR id=2 OR id=3.
So your two ways to articulate the predicate will (probably) lead to exactly the same interpretation.
As always: We should not really bother about the way the engine "thinks". This was done pretty well by the developers :-) We tell - through a statement - what we want to get and not how we want to get this.
Some more details here, especially the first part.
I Think this will depend on platform you are using (optimizer of the given SQL engine).
I did a little test using MySQL Server and:
When I query select * from table where id = 1; i get 1 total, Query took 0.0043 seconds
When I query select * from table where id IN (1); i get 1 total, Query took 0.0039 seconds
I know this depends on Server and PC and what.. But The results are very close.
But you have to remember that IN is non-sargable (non search argument able), it will not use the index to resolve the query, = is sargable and support the index..
If you want the best one to use, You should test them in your environment because they both work so good!!
Hello and thank you for reading my question!
Currently, we use PostgreSQL v.10 on 3 nodes through stolon (https://github.com/sorintlab/stolon)
We have 3 tables (I want to make my question simple):
Invoice (150 000 000 records)
User (35 000 000 records)
User_Address (20 000 000 records)
The main query looks like this (The original query is a large, using a temporary table and have a lot of where conditions, but the sample shows my problem.)
select
i.*
from invoice as i
inner join get_similar_name('Jon') as s on i.name ilike s.name
left join user_address as a on i.user_id = a.user_id
where
a.state = 'OH'
and
i.last_name = 'Smith'
and
i.date between '2016-01-01'::date and '2018-12-31'::date;
The function get_similar_name returns similar names (example: get_similar_name('Jon') will return John, Jonny, Jonathan ... etc) average 200 - 1000 names. I must use the function :\
The query was executed a long time, around 30 - 120 seconds,
but if I exclude the function get_similar_name from the query, then execution time will be not more then 1 second.
I already configured PostgreSQL and the server working pretty good. I also created indexes and my query don't use seq scan and etc.
We don't have the possibility to make partitioned tables because we have a lot of columns for this. We can't divide a table only by one row.
I think about migrating my warehouse to MongoDB
My questions are:
Am I right about moving to MongoDB?
Increase performance if I move warehouse from PostgreSQL to 20-40 nodes under MongoDB control?
Is it possible to have the function get_similar_name on MongoDB or similar solution? If yes, how?
Do you have good experience to use fulltext search in MongoDB?
Is it right way to use MongoDB on production?
Can you please advise a "google-vector" to right solution on your opinion?
I don't know if moving to MongoDB will solve a text search problem, but Postgres has excellent features like Vector and trigram. Have you tired any of this?
https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/
https://www.postgresql.org/docs/9.6/pgtrgm.html
On my previous project, we used pg_trgm and were pretty happy with its performance.
User.find(:all, :order => "RANDOM()", :limit => 10) was the way I did it in Rails 3.
User.all(:order => "RANDOM()", :limit => 10) is how I thought Rails 4 would do it, but this is still giving me a Deprecation warning:
DEPRECATION WARNING: Relation#all is deprecated. If you want to eager-load a relation, you can call #load (e.g. `Post.where(published: true).load`). If you want to get an array of records from a relation, you can call #to_a (e.g. `Post.where(published: true).to_a`).
You'll want to use the order and limit methods instead. You can get rid of the all.
For PostgreSQL and SQLite:
User.order("RANDOM()").limit(10)
Or for MySQL:
User.order("RAND()").limit(10)
As the random function could change for different databases, I would recommend to use the following code:
User.offset(rand(User.count)).first
Of course, this is useful only if you're looking for only one record.
If you wanna get more that one, you could do something like:
User.offset(rand(User.count) - 10).limit(10)
The - 10 is to assure you get 10 records in case rand returns a number greater than count - 10.
Keep in mind you'll always get 10 consecutive records.
I think the best solution is really ordering randomly in database.
But if you need to avoid specific random function from database, you can use pluck and shuffle approach.
For one record:
User.find(User.pluck(:id).shuffle.first)
For more than one record:
User.where(id: User.pluck(:id).sample(10))
I would suggest making this a scope as you can then chain it:
class User < ActiveRecord::Base
scope :random, -> { order(Arel::Nodes::NamedFunction.new('RANDOM', [])) }
end
User.random.limit(10)
User.active.random.limit(10)
While not the fastest solution, I like the brevity of:
User.ids.sample(10)
The .ids method yields an array of User IDs and .sample(10) picks 10 random values from this array.
Strongly Recommend this gem for random records, which is specially designed for table with lots of data rows:
https://github.com/haopingfan/quick_random_records
All other answers perform badly with large database, except this gem:
quick_random_records only cost 4.6ms totally.
the accepted answer User.order('RAND()').limit(10) cost 733.0ms.
the offset approach cost 245.4ms totally.
the User.all.sample(10) approach cost 573.4ms.
Note: My table only has 120,000 users. The more records you have, the more enormous the difference of performance will be.
UPDATE:
Perform on table with 550,000 rows
Model.where(id: Model.pluck(:id).sample(10)) cost 1384.0ms
gem: quick_random_records only cost 6.4ms totally
For MYSQL this worked for me:
User.order("RAND()").limit(10)
You could call .sample on the records, like: User.all.sample(10)
The answer of #maurimiranda User.offset(rand(User.count)).first is not good in case we need get 10 random records because User.offset(rand(User.count) - 10).limit(10) will return a sequence of 10 records from the random position, they are not "total randomly", right? So we need to call that function 10 times to get 10 "total randomly".
Beside that, offset is also not good if the random function return a high value. If your query looks like offset: 10000 and limit: 20 , it is generating 10,020 rows and throwing away the first 10,000 of them,
which is very expensive. So call 10 times offset.limit is not efficient.
So i thought that in case we just want to get one random user then User.offset(rand(User.count)).first maybe better (at least we can improve by caching User.count).
But if we want 10 random users or more then User.order("RAND()").limit(10) should be better.
Here's a quick solution.. currently using it with over 1.5 million records and getting decent performance. The best solution would be to cache one or more random record sets, and then refresh them with a background worker at a desired interval.
Created random_records_helper.rb file:
module RandomRecordsHelper
def random_user_ids(n)
user_ids = []
user_count = User.count
n.times{user_ids << rand(1..user_count)}
return user_ids
end
in the controller:
#users = User.where(id: random_user_ids(10))
This is much quicker than the .order("RANDOM()").limit(10) method - I went from a 13 sec load time down to 500ms.
I've some performance issue on a quite big data store.
For optimizing the insert phase, we created a document store and not a graph, infact the edge creation performance was too slow.
Essentially now we have a class A (with about 30M documents) with a link (say field fieldL) to a class B (about 500 documents).
The query structure is like:
select from A where field1='field1value' and field2='field2value' and field3>0 ... and fieldL in (select from B where ...)
The first issue i've found is this:
I've created n indexes on the n properties engaged in the where condition, but the explain command showed me orient uses only one... https://github.com/orientechnologies/orientdb/issues/3626
So I've created a composite index and if I perform a query involving only the index, say
select from A where field1='field1value' and field2='field2value' and field3>0
the result is really fast
The issue is about the second part of the query, involving the fieldL and the links.
I've tried with the [#rid,...] syntax but it seems not perform well.
I've also tried to change the schema using a different approach: class B with multiple links to class A, using a different query pattern (say the field containing the links fieldL1):
select * from (select expand(fieldL1) from B where ...) where field1='field1value' and field2='field2value' and field3>0
In this case the subquery executes a sort of partition of the data, but unfortunatelly we lose the indexes on the result set, so we have really slow performances on the second where clause (field1='field1value' and field2='field2value' and field3>0).
My question is: Does it exist a better query pattern to execute these kind of query faster?
Thank you very much.
By the way during the performance tuning it seems really awkward to perform a count of the documents involved in a query. (https://github.com/orientechnologies/orientdb/issues/3462)
If you use the following query
select * from (select expand(fieldL1) from B where ...) where field1='field1value' and field2='field2value' and field3>0
it doesn't use the index because seems that there are problems when using the subqueries and the indexes
For more information, you can look at this link
https://groups.google.com/forum/#!topic/orient-database/7jWEGpkIzXQ
For example, I want to retrieve all data from citizen table which contains about 18K rows.
String sqlResult = "SELECT * FROM CITIZEN";
Query query = getEntityManager().createNativeQuery(sqlResult);
query.setFirstResult(searchFrom);
query.setMaxResults(searchCount); // searchCount is 20
List<Object[]> listStayCit = query.getResultList();
Everything was fine until "searchFrom" offset was large ( 17K or something ). For example, it took 3-4 mins to get 20 rows ( 17,000 to 17,020 ). So is there any better way to get it faster but not via tunning the DB ?
P/s: Sorry for my bad English
You Could use batch queries.
A good article explaining solution to ur problem is available here:
http://java-persistence-performance.blogspot.in/2010/08/batch-fetching-optimizing-object-graph.html