MongoDB performance issue: single huge collection vs multiple small collections

I tested two scenarios, a single huge collection vs. multiple small collections, and found a huge difference in query performance. Here is what I did.
Case 1: I created a product collection containing 10 million records for 10 different product types, exactly 1 million records per type, and I created an index on ProductType. I ran a sample query with the condition ProductType=1 and ProductPrice>100 and limit(10) to return 10 records of ProductType=1 whose price is greater than 100. It took about 35 milliseconds when the collection had many products of that type priced over 100, but about 8000 milliseconds (8 seconds) when very few products in ProductType=1 had a price greater than 100.
Case 2: I created 10 different product collections, one per ProductType, each containing 1 million records. In collection 1, which contains the records for ProductType 1, I ran the same sample query with the condition ProductPrice>100 and limit(10) to return 10 records whose price is greater than 100. It took about 2.5 milliseconds when the collection had many products priced over 100, but about 1500 milliseconds (1.5 seconds) when very few products had a price greater than 100.
So why is there such a big difference? The only difference between case one and case two is one huge collection vs. multiple smaller collections, but I have an index on ProductType in the single huge collection of case one. I guess the performance difference is caused by that index, but I need it there, otherwise performance would be even worse. I expected some slowdown in the first case due to the index, but I didn't expect a difference this huge, about 10 times slower.
So 8000 milliseconds vs. 1500 milliseconds for one huge collection vs. multiple small collections. Why?

Separating the collections gives you a free index without any real overhead. There is overhead for an index scan, especially if the index is not really helping you cut down on the number of results it has to scan (if you have a million entries in the index but you have to scan and inspect them all, it's not going to help you much).
In short, separating them out is a valid optimization, but you should make your indexes better for your queries before you actually decide to take that route, which I consider a drastic measure (an index on product price might help you more in this case).
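For example, a compound index covering both the equality filter and the range filter would let MongoDB walk only the matching slice of the index. A minimal sketch in the mongo shell, assuming a collection named products with the fields from the question:

// Compound index: equality field first, then the range field.
// The query can then seek straight to ProductType=1 and scan only
// the index entries with ProductPrice > 100, instead of inspecting
// all one million entries for that product type.
db.products.ensureIndex({ProductType: 1, ProductPrice: 1})
db.products.find({ProductType: 1, ProductPrice: {$gt: 100}}).limit(10)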
Using explain() can help you understand how your queries work. Some basics: ideally you want a low nscanned-to-n ratio. You don't want scanAndOrder = true, and you usually don't want BasicCursor (it means you're not using an index at all).
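A hedged example of checking those numbers on the same hypothetical products collection:

// In the 2.x-era explain() output, compare n (documents returned)
// with nscanned (index entries examined); a BtreeCursor with
// nscanned close to n is good, while BasicCursor means a full
// collection scan.
db.products.find({ProductType: 1, ProductPrice: {$gt: 100}}).limit(10).explain()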

Related

What kind of index should I use in PostgreSQL for a column with 3 values

I have a table with 100 million+ records and 237 fields.
One of the fields is a varchar(1) field with three possible values (Y, N, I).
I need to find all of the records with N.
Right now I have a B-tree index built, and the query below takes about 20 minutes to run.
Is there another index I can use to get better performance?
SELECT * FROM tableone WHERE export_value='N';
Assuming your values are roughly equally distributed (say, at least 15% of each value) and roughly evenly spread throughout the table (some physically at the beginning, some in the middle, some at the end), then no.
If you think about it you'll see why: you would have to look up tens of millions of disk blocks in the index and then fetch them from the disk one by one. By the time you have done that, it would have been quicker to just scan the whole table and pick out the matching values. The planner knows this and would probably not use the index at all.
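You can confirm what the planner chooses with EXPLAIN; a quick check, reusing the table and column from the question:

-- If the planner expects ~33% of rows to match, the plan will show
-- a Seq Scan on tableone rather than an Index Scan.
EXPLAIN SELECT * FROM tableone WHERE export_value = 'N';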
However, if you only have 17 rows with "N", or they were all added to the table very recently and so physically happen to be close to each other, then yes, an index can help.
If you only had a few rows with "N" you would have mentioned it, so we can ignore that one.
If, however, you mostly insert into this table, you might find a BRIN index helpful. That can let the planner see that, for example, the first 80% of your table doesn't have any "N" blocks, so it only needs to look at the last part.
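A minimal sketch of creating one, again reusing the question's table and column names (BRIN requires PostgreSQL 9.5 or later):

-- A BRIN index stores per-block-range summaries, so it stays tiny
-- even on a 100M-row table and lets the planner skip block ranges
-- that cannot contain 'N'.
CREATE INDEX tableone_export_value_brin ON tableone USING brin (export_value);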

Long-running queries and new data

I'm looking at a Postgres system with tables containing tens or hundreds of millions of rows, fed at a rate of a few rows per second.
I need to do some processing on the rows of these tables, so I plan to run some simple select queries: select * with a where clause on a range (each row contains a timestamp, which is what I'll use for the ranges). It may be a "closed range", with a start and an end that I know are contained in the table and into which I know no new data will fall, or an "open range", i.e. one of the range boundaries might not be "in the table yet", so rows being fed into the table might still fall inside that range.
Since the result will itself contain millions of rows, and the processing per row can take some time (tens of milliseconds), I'm fully aware I'll use a cursor and fetch, say, a few thousand rows at a time. My question is:
If I run an "open range" query, will I only get the result as it was when I started the query, or will new rows that are inserted into the table and fall within the range while I'm fetching show up?
(I tend to think that no, I won't see new rows, but I'd like confirmation...)
Update:
It should not happen under any isolation level:
https://www.postgresql.org/docs/current/static/transaction-iso.html
A single statement always runs against a single snapshot, so new rows cannot show up mid-query; strictly speaking, though, Postgres only guarantees a stable snapshot across the whole transaction at the Repeatable Read and Serializable isolation levels.
Well, I think when you run a query, it effectively runs against its own snapshot, and it will not see data committed by any other transaction after that point.
So, basically, "you only get the result as it was when you started the query".
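A minimal sketch of the cursor-plus-fetch pattern the question describes, using a hypothetical events table with a ts timestamp column:

-- The cursor's query takes its snapshot when it is opened; rows
-- inserted afterwards, even ones falling inside the open range,
-- will not appear in later FETCH calls within this transaction.
BEGIN;
DECLARE range_cur CURSOR FOR
    SELECT * FROM events WHERE ts >= '2014-01-01';
FETCH 1000 FROM range_cur;  -- repeat until it returns no rows
CLOSE range_cur;
COMMIT;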

Executing a query on a table with over a billion rows

I have a table, say T, in kdb which has over 6 billion rows. When I try to execute a query like this:
select from T where i < 10
it throws a wsfull exception. Is there any way I can execute queries like this on a table with such a large amount of data?
10#T
The expression as you wrote it first builds a boolean vector marking the elements where i (the row number) is less than 10, which is as long as one of your columns. It then does where (which just yields til 10) and then fetches those rows from each column. You can save the last step with:
T[til 10]
but 10#T is shorter.
Assuming you have a partitioned table here, it is normally beneficial to have the partitioning column (date, int, etc.) as the first item in the where clause of your query; otherwise, as mentioned previously, you are reading a six-billion-item list into memory, which will result in a 'wsfull signal on any machine with less than the requisite amount of RAM.
Bear in mind that the row index i starts at 0 within each partition and does not reflect position in the overall table. The query you gave as an example in your question would return the first ten rows of each partition of table T in your database.
In order to do this without reaching your memory limit, you can try running the following (if your database is date-partitioned):
raze{10#select from T where date=x}each date

MongoDB embedded documents: larger documents with more embedded, or smaller documents with fewer embedded

Here is an example of my case:
There are 3 factories: A, B, and C.
Each of them produces the same 30 products: P1, P2, ..., P30.
There are two possible structures for the document:
{
    id: 1,
    base: 'A',
    date: '01/01/2014',
    products: [
        {name: 'P1', quantity: 20, weight: 4},
        {name: 'P2', quantity: 19, weight: 5},
        ...
        {name: 'P30', quantity: 25, weight: 6}
    ]
}
or
{
    id: 1,
    product: 'P1',
    date: '01/01/2014',
    bases: [
        {name: 'A', quantity: 20, weight: 4},
        {name: 'B', quantity: 19, weight: 5},
        {name: 'C', quantity: 25, weight: 6}
    ]
}
The difference is the total number of documents and the number of embedded documents in each.
Actually, in terms of convenience of use, it doesn't make much difference to me.
I want to ask what the pros and cons of these two structures are in MongoDB, and whether the choice makes a significant difference in performance.
For each day, there will be:
three documents, each with 30 embedded documents (the first structure), OR
thirty documents, each with 3 embedded documents (the second structure)
Usually, the tasks will be just:
calculate the sum of quantity for each product over a time period
calculate the sum of produced quantity for each base over a time period
check the production detail of a base for a given day (the output is similar to the first structure)
check the production detail of a product for a given day (the output is similar to the second structure)
Thanks in advance.
three documents, each with 30 embedded documents (the first structure), OR
The only difference I can see here is that with embedded documents (the first schema) you might suffer fragmentation under MongoDB's default document padding unless powerOf2Sizes is enabled: http://docs.mongodb.org/manual/reference/command/collMod/#usePowerOf2Sizes . Of course, if you are using v2.6+ then this setting is already on by default and you will see no difference.
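On older versions you can enable it per collection; a minimal sketch, assuming the collection is named production (the name is not given in the question):

// usePowerOf2Sizes makes record allocations power-of-2 sized, so a
// document that grows as embedded entries are added can often be
// rewritten in place instead of moved, reducing fragmentation.
db.runCommand({collMod: "production", usePowerOf2Sizes: true})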
calculate the sum of quantity for each product over a time period
This is favoured by the second schema; however:
calculate the sum of produced quantity for each base over a time period
check the production detail of a base for a given day (the output is similar to the first structure)
check the production detail of a product for a given day (the output is similar to the second structure)
are actually favoured by the first schema, querying by base.
So currently I would say go with the first, since it will make it easier to satisfy your queries.
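To make the trade-off concrete, here is a hedged sketch of the first task (sum of quantity per product over a period) against the first schema. The collection name production is assumed, and the date filter presumes dates stored as real Date values rather than the '01/01/2014' strings in the example:

// Hypothetical period boundaries.
var startDate = ISODate("2014-01-01"), endDate = ISODate("2014-01-31");
// Unwind the embedded products array, then group by product name.
db.production.aggregate([
    {$match: {date: {$gte: startDate, $lte: endDate}}},
    {$unwind: "$products"},
    {$group: {_id: "$products.name", total: {$sum: "$products.quantity"}}}
])

Under the second schema the same sum needs no $unwind over 30-element arrays, which is why that particular task favours it.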

SQLite vs Memory

I have a situation with my app.
Suppose I have 6 users, and each user can have up to 9 score entries (e.g. a score of 1000 points at 8:00 pm, with 3 gold collected, 4 silver, etc.), say one score per stage across 9 stages.
All these scores are taken from an API call, so they can update at an interval of 3+ minutes.
The operations I need to do on this data are:
find the nearest min and max records around a given score for stage 4,
and some more operations like adding or subtracting two scores, etc.
All 6 users and their score records are already in the database, updated as needed after the API call.
Now my question is:
Is it better for this kind of data (the score data here) to keep everything for all 6 users in memory in an NSArray or NSDictionary, and find the min and max in that array with a min-max algorithm,
OR
should it be fetched from the database with queries like "WHERE score <= 200" and "WHERE score >= 200", in short, 2 database queries which each return the nearest min or max record, without keeping all the data in memory?
We are focusing on both speed and memory usage. The point is: would a DB call be faster and more efficient for finding the min and max, or a search for the min and max in an array of all the records from the DB?
In total there can be 6 users * 9 scores each = 54 records.
Update time for records can be 3+ minutes.
The frequency of finding the min and max for certain values is high.
Please ask, if any more details are required.
Thanks in advance.
You're working with such a small amount of data that I wouldn't imagine it would be worth worrying about. Do whichever method makes your development process easiest!
Edit:
If I had a lot of data (hundreds of competitors) I'd use SQLite. You can do queries like the following:
SELECT MIN(`score`) FROM `T_SCORE` WHERE `stage` = '4';
That way you can let the database handle doing the calculation for you, so you never have to fetch all the results.
My SQL-fu isn't the most awesome, but I think you can also do this:
SELECT `stage`, MIN(`score`) AS min, MAX(`score`) AS max FROM `T_SCORE` GROUP BY `stage`
That would do all the calculations in one single query.
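For the "nearest record around a value" part of the question, here is a hedged sketch along the lines the asker suggested, reusing the T_SCORE table from above and an example threshold of 200 on stage 4:

-- Nearest record at or below 200: the highest of the lower side.
SELECT * FROM `T_SCORE` WHERE `stage` = '4' AND `score` <= 200 ORDER BY `score` DESC LIMIT 1;
-- Nearest record at or above 200: the lowest of the upper side.
SELECT * FROM `T_SCORE` WHERE `stage` = '4' AND `score` >= 200 ORDER BY `score` ASC LIMIT 1;

With an index on (stage, score) each query touches only a handful of rows, though at 54 records either approach will be effectively instant.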