MongoDB indexed find performance

I have a simple question.
If we do a db.collection.find({_id: ObjectId("an id")}) on a collection with 1 million documents, does it take the same time as on a collection with 1 billion documents?
If possible, please explain why it does or does not take the same time, given that _id is an indexed field.

MongoDB uses B-trees for its indexes, and a B-tree search has O(log n) time complexity, so the lookup is not constant-time, but it grows only logarithmically with collection size. Using base-10 logarithms for illustration:
log(1 million) = 6
log(1 billion) = 9
So a lookup over 1 billion documents takes roughly 50% longer than a lookup over 1 million documents, not 1000 times longer.
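As a rough back-of-the-envelope check, here is a small Python sketch of that argument (the branching factor of 100 is an illustrative assumption, not a MongoDB internal):

import math

# A B-tree lookup visits about log_b(n) nodes, where b is the branching factor.
# With an assumed b = 100, the depth barely changes between 1M and 1B documents.
branching_factor = 100
for n in (1_000_000, 1_000_000_000):
    depth = math.ceil(math.log(n, branching_factor))
    print(f"{n:>13,} docs -> ~{depth} node visits")

# ~3 node visits for 1 million docs vs ~5 for 1 billion: a small, nearly constant
# increase, nowhere near 1000x.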

Related

Performing a text search for each row in 2.5 million rows of a Postgresql Database

I have a list of search terms that I read in for use with Pandas:
terms_list = np.genfromtxt('terms_list.csv', delimiter=',', dtype=str)
I need to pull about 2.5 million rows from three joined tables in my PostgreSQL database and loop through them in Pandas, checking the text in several columns of each row for a match against my terms_list. In the past I have used a chunking function to query and text-search 500 rows at a time; however, that took six minutes for 12,000 rows, so the strategy won't scale to 2.5 million rows.
I'm not very familiar with big data, so I'm wondering whether there are any good strategies for iterating through large volumes of data on an underpowered PostgreSQL database?
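For reference, the chunked-query approach described above looks roughly like this with Pandas; the connection string, query, and column names are placeholders, not taken from the question:

import re
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

terms_list = np.genfromtxt('terms_list.csv', delimiter=',', dtype=str)
# Build one regex alternation so each chunk is checked with a single vectorized
# str.contains call instead of a Python-level loop over every term and row.
pattern = "|".join(re.escape(t) for t in terms_list)

engine = create_engine("postgresql://user:password@localhost/mydb")  # placeholder
query = "SELECT id, col_a, col_b, col_c FROM joined_tables"          # placeholder

matches = []
for chunk in pd.read_sql(query, engine, chunksize=50_000):
    # Concatenate the text columns per row, then keep only rows matching a term.
    text = chunk[["col_a", "col_b", "col_c"]].astype(str).agg(" ".join, axis=1)
    matches.append(chunk[text.str.contains(pattern, case=False, regex=True)])
result = pd.concat(matches, ignore_index=True)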

What kind of index should I use in postgresql for a column with 3 values

I have a table with more than 100 million records and 237 fields.
One of the fields is a varchar(1) field with three possible values (Y, N, I).
I need to find all of the records with N.
Right now I have a B-tree index on that column, and the query below takes about 20 minutes to run.
Is there another index I can use to get better performance?
SELECT * FROM tableone WHERE export_value='N';
Assuming your values are roughly equally distributed (say at least 15% of each value) and roughly evenly spread throughout the table (some physically at the beginning, some in the middle, some at the end), then no.
If you think about it you'll see why. You would have to look up tens of millions of entries in the index and then fetch the corresponding disk blocks one by one. By the time you have done that, it would have been quicker to just scan the whole table and pick out the rows as they match. The planner knows this and would probably not use the index at all.
However, if you only have 17 rows with "N", or they were all added to the table very recently and so happen to be physically close to each other, then yes, an index can help.
If you only had a few rows with "N" you would have mentioned it, so we can ignore that case.
If, however, you mostly append to this table, you might find a BRIN index helpful. It lets the planner see that, for example, the first 80% of your table contains no blocks with "N", so it only needs to look at the last part.
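A minimal sketch of the suggested BRIN index using psycopg2, with the table and column names taken from the query in the question (connection details are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder connection
with conn, conn.cursor() as cur:
    # A BRIN index stores one tiny summary per block range, so it stays small
    # even on 100M+ rows and lets the planner skip ranges that cannot hold 'N'.
    cur.execute(
        "CREATE INDEX IF NOT EXISTS tableone_export_value_brin "
        "ON tableone USING brin (export_value)"
    )
    cur.execute("EXPLAIN SELECT * FROM tableone WHERE export_value = 'N'")
    for (line,) in cur.fetchall():
        print(line)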

Sphinx composite (distributed) big indexes

I'm having trouble indexing a large amount of content data and am looking for a suitable solution.
The logic is as follows:
A robot uploads content to the database every day.
The Sphinx index must reindex only the new (daily) data, i.e. previously indexed content never changes.
Sphinx delta indexing is an exact fit for this, but once there is too much content the error appears: too many string attributes (current index format allows up to 4 GB).
Distributed indexing seems usable, but how can the indexed data be added and split dynamically (without dirty hacks)?
For example: on day 1 there are 10,000 rows in total, on day 2 there are 20,000 rows, and so on. The index hits the >4 GB error at around 60,000 rows.
The expected index flow: on days 1-5 there is one index (distributed or not, it doesn't matter); on days 6-10, one distributed (composite) index (50,000 + 50,000 rows); and so on.
The question is: how do I fill a distributed index dynamically?
Daily iteration sample:
main index
chunk1 - 50000 rows
chunk2 - 50000 rows
chunk3 - 35000 rows
delta index
10000 new rows
rotate "delta"
merge "delta" into "main"
Please advise.
Thanks to @barryhunter.
RT (real-time) indexes are the solution here.
A good manual is here: https://www.sphinxconnector.net/Tutorial/IntroductionToRealTimeIndexes
I've tested match queries on 3,000,000,000 characters. The speed is close to that of the "plain" index type. The total index size on HDD is about 2 GB.
Populating the Sphinx RT index:
CPU usage: ~50% of 1 core (out of 8 cores)
RAM usage: ~0.5% of 32 GB
Speed: as quick as an ordinary select/insert (it mostly depends on whether you use batch inserts or row-by-row inserts)
NOTE:
"SELECT MAX(id) FROM sphinx_index_name" will produce the error "fullscan requires extern docinfo", and setting docinfo = extern does not fix it. So simply keep the counter in a MySQL table, as with the Sphinx delta-index approach: http://sphinxsearch.com/docs/current.html#delta-updates

MongoDB performance issue: Single Huge collection vs Multiple Small Collections

I tested two scenarios, a single huge collection vs. multiple small collections, and found a huge difference in query performance. Here is what I did.
Case 1: I created a product collection containing 10 million records across 10 different product types, with exactly 1 million records per product type, and I created an index on ProductType. A sample query with the condition ProductType=1 and ProductPrice>100 and limit(10), returning 10 records of ProductType=1 whose price is greater than 100, took about 35 milliseconds when the collection had many products priced above 100, but the same query took about 8000 milliseconds (8 seconds) when very few products of ProductType=1 had a price greater than 100.
Case 2: I created 10 different product collections, one per ProductType, each containing 1 million records. In collection 1, which holds the records for ProductType 1, the same sample query with the condition ProductPrice>100 and limit(10), returning 10 products whose price is greater than 100, took about 2.5 milliseconds when the collection had many products priced above 100, and about 1500 milliseconds (1.5 seconds) when very few products had a price greater than 100.
So why is there such a big difference? The only difference between case one and case two is one huge collection vs. multiple smaller collections, but I created the index on ProductType in the single huge collection of case one. I assume the performance difference is caused by that index, yet I need it in case one; without it, performance would be even worse. I expected some slowdown in case one because of the index, but I didn't expect it to be about 10 times slower.
So, 8000 milliseconds vs. 1500 milliseconds, one huge collection vs. multiple small collections. Why?
Separating the collections gives you a free index without any real overhead. There is overhead for an index scan, especially when the index doesn't really cut down the number of results that have to be examined (if there are a million entries in the index but you still have to scan and inspect them all, it's not going to help you much).
In short, separating them out is a valid optimization, but you should make your indexes better for your queries before you decide to take that route, which I consider a drastic measure (an index on product price might help you more in this case, as sketched below).
Using explain() can help you understand how queries work. Some basics: ideally you want a low nscanned-to-n ratio, you don't want scanAndOrder = true, and you usually don't want BasicCursor (it means you're not using an index at all).
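A minimal sketch of that suggestion with PyMongo, assuming the field names from the question and a local mongod (the database and collection names are illustrative):

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]  # illustrative names

# A compound index on (ProductType, ProductPrice) lets the query walk only the
# index range for ProductType=1 with price > 100 instead of scanning a million
# ProductType entries and inspecting each document's price.
products.create_index([("ProductType", ASCENDING), ("ProductPrice", ASCENDING)])

plan = (
    products.find({"ProductType": 1, "ProductPrice": {"$gt": 100}})
    .limit(10)
    .explain()
)
print(plan.get("queryPlanner", plan))  # look for an IXSCAN on the compound index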

SQLite vs Memory

I have a situation with my app.
Suppose I have 6 users, and each user can have up to 9 score entries (e.g. a score of 1000 points at 8:00 pm with 3 gold collected, 4 silver, etc.), say one score per stage across 9 stages.
All these scores come from an API call, so they can be updated at intervals of 3+ minutes.
The operations I need to perform on this data are:
finding the nearest min and max records for, say, stage 4,
and a few more operations such as adding or subtracting two scores, etc.
All 6 users and their score records are already in the database and are updated as needed after the API call.
Now my question is:
Is it better, for this kind of data (the scores here), to keep all the data for all 6 users in memory in an NSArray or NSDictionary and find the min and max in that array with a min-max algorithm?
OR
Should they be fetched from the database with queries like "WHERE score <= 200" and "WHERE score >= 200", in short two database queries that return the nearest min and max record respectively, without keeping all the data in memory?
What we are focusing on is both speed and memory usage. The point is: would a DB call be faster and more efficient at finding the min and max, or would a search for the min and max in an array of all the records from the DB?
All records amount to 6 users * 9 scores each = 54.
Update time for records can be 3+ minutes.
The frequency of finding the min and max for certain values is high.
Please ask, if any more details are required.
Thanks in advance.
You're working with such a small amount of data that I wouldn't imagine it would be worth worrying about. Do whichever method makes your development process easiest!
Edit:
If I had a lot of data (hundreds of competitors) I'd use SQLite. You can do queries like the following:
SELECT MIN(`score`) FROM `T_SCORE` WHERE `stage` = '4';
That way you can let the database handle doing the calculation for you, so you never have to fetch all the results.
My SQL-fu isn't the most awesome, but I think you can also do this:
SELECT `stage`, MIN(`score`) AS min, MAX(`score`) AS max FROM `T_SCORE` GROUP BY `stage`
That would do all the calculations in one single query.
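A small self-contained sketch of that query with Python's sqlite3 module, using an in-memory database and made-up rows (the T_SCORE table and columns follow the answer's example):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_SCORE (user_id INTEGER, stage INTEGER, score INTEGER)")
conn.executemany(
    "INSERT INTO T_SCORE (user_id, stage, score) VALUES (?, ?, ?)",
    [(1, 4, 1000), (2, 4, 750), (1, 5, 1200), (2, 5, 900)],
)

# One query returns the min and max per stage, so the app never has to hold
# all score rows in memory just to find the extremes.
for stage, lo, hi in conn.execute(
    "SELECT stage, MIN(score) AS min, MAX(score) AS max FROM T_SCORE GROUP BY stage"
):
    print(stage, lo, hi)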