Couchbase N1QL Query getting distinct on the basis of particular fields - nosql

I have a document structure which looks something like this:
{
...
"groupedFieldKey": "groupedFieldVal",
"otherFieldKey": "otherFieldVal",
"filterFieldKey": "filterFieldVal"
...
}
I am trying to fetch all documents which are unique with respect to groupedFieldKey. I also want to fetch otherFieldKey from ANY of these documents. The otherFieldKey value varies slightly from one document to another, but I am comfortable with getting ANY of these values.
SELECT DISTINCT groupedFieldKey, otherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal";
This query fetches all the documents because of the minor variations in otherFieldKey.
SELECT groupedFieldKey, maxOtherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal"
GROUP BY groupedFieldKey
LETTING maxOtherFieldKey= MAX(otherFieldKey);
This query works as expected, but is taking a long time due to the GROUP BY step. As this query is used to show products in UI, this is not a desired behaviour. I have tried applying indexes, but it has not given fast results.
Actual details of the records:
Number of records = 100,000
Size per record = Approx 10 KB
Time taken to load the first 10 records: 3s
Is there a better way to do this? A way of getting DISTINCT only on particular fields will be good.
EDIT 1:
You can follow this discussion thread in Couchbase forum: https://forums.couchbase.com/t/getting-distinct-on-the-basis-of-a-field-with-other-fields/26458

GROUP BY must materialize all the qualifying documents before it can return results. You can try a covering index:
CREATE INDEX ix1 ON bucket(filterFieldKey, groupedFieldKey, otherFieldKey);
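To make the intended result concrete, here is a minimal Python sketch of what the GROUP BY query computes: one entry per groupedFieldKey, carrying the maximum otherFieldKey seen for that key. The sample documents and the function name are hypothetical, and this only models the result, not how Couchbase executes the query or uses the index.

```python
def distinct_by_group(docs, filter_value):
    """Return {groupedFieldKey: max otherFieldKey} for docs matching the filter."""
    best = {}
    for doc in docs:
        # Equivalent of: WHERE filterFieldKey = "filterFieldVal"
        if doc.get("filterFieldKey") != filter_value:
            continue
        key = doc["groupedFieldKey"]
        other = doc["otherFieldKey"]
        # Equivalent of: GROUP BY groupedFieldKey ... MAX(otherFieldKey)
        if key not in best or other > best[key]:
            best[key] = other
    return best


# Hypothetical sample data, not taken from the question.
docs = [
    {"groupedFieldKey": "a", "otherFieldKey": "x1", "filterFieldKey": "f"},
    {"groupedFieldKey": "a", "otherFieldKey": "x2", "filterFieldKey": "f"},
    {"groupedFieldKey": "b", "otherFieldKey": "y1", "filterFieldKey": "f"},
    {"groupedFieldKey": "c", "otherFieldKey": "z1", "filterFieldKey": "other"},
]
result = distinct_by_group(docs, "f")
```

With unsorted input, the loop has to visit every matching document before any group's value is known to be final, which mirrors the answer's point about GROUP BY.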

Spring boot mongo template remove with limit query

I am trying to delete a limited set of MongoDB documents from a collection. The documents have an id less than 10, and I want to remove them in sets of 3, so I tried using limit, but it still deletes all the matching documents and ignores the limit.
Query query = new Query();
query.addCriteria(Criteria.where("_id").lt(id)).limit(3);
mongoTemplate.remove(query,TestCollection.class);
When I run mongoTemplate.find(query,TestCollection.class); the limit works fine and returns 3 elements at a time, but with remove it doesn't work.
Is there any other way to do the delete in a single query?
To achieve this, do it in two passes:
Find the 3 ids to delete, as you are doing currently.
Call mongoTemplate.remove with Criteria.where("_id").in(id1, id2, id3).
I would also add a sort criterion before applying the limit. Otherwise the documents that get deleted might depend on the index used.
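The two passes can be sketched in plain Python against an in-memory list standing in for the collection. This is only an illustration of the logic (find a sorted, limited set of ids, then delete by id); the function name and sample data are hypothetical and it does not use the Spring or MongoDB APIs.

```python
def delete_in_batch(collection, max_id, batch_size):
    """Delete up to batch_size documents with _id < max_id; return the deleted ids."""
    # Pass 1: find matching ids, sorted so the chosen batch is deterministic,
    # then apply the limit.
    ids = sorted(doc["_id"] for doc in collection if doc["_id"] < max_id)[:batch_size]

    # Pass 2: remove exactly those ids (the equivalent of an "_id in [...]" delete).
    id_set = set(ids)
    collection[:] = [doc for doc in collection if doc["_id"] not in id_set]
    return ids


# Hypothetical collection with ids 1..12.
collection = [{"_id": i} for i in range(1, 13)]
deleted = delete_in_batch(collection, max_id=10, batch_size=3)
```

Calling the function repeatedly removes the next batch each time until no document with an id below the bound is left.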

Firestore 1 global index vs 1 index per query what is better?

I'm working on my app and I just ran into a dilemma regarding what's the best way to handle indexes for firestore.
I have a query that searches for publications in a specific community that contain at least one of the tags and fall within a geohash range. The index for that query looks like this:
community Ascending tag Ascending location.geohash Ascending
Now if my user doesn't need to filter by tag, I run the query without the arrayContains(tag), which prompts me to create another index:
community Ascending location.geohash Ascending
My question is: is it better to create that second index, or to just use the first one and specify all possible tags in arrayContains when the user wants no filter on tag?
Neither is inherently better; it's a typical space vs. time tradeoff.
Adding the extra tags in the query adds some overhead there, but it saves you the (storage) cost for the additional index. So you're trading some small amount of runtime performance for a small amount of space/cost savings.
One thing to check is whether the query with tags can actually run on just the second index, as Firestore may be able to do a zigzag merge join. In that case you could keep only the second, smaller index and avoid the runtime cost of the additional clauses, but you would then see a (similarly small) performance difference on the query where you do specify one or more tags.

Why MongoDB find has same performance as count

I am running tests against my MongoDB and for some reason find has the same performance as count.
Stats:
orders collection size: ~20M,
orders with product_id 6: ~5K
product_id is indexed for improved performance.
Query: db.orders.find({product_id: 6}) vs db.orders.find({product_id: 6}).count()
Result: the orders for the product vs 5K, each after 0.08ms
Why isn't count dramatically faster? It could find the positions of the first and last elements using the product_id index.
As the Mongo documentation for count states, calling count is the same as calling find, but instead of returning the docs, it just counts them. To perform this count, it iterates over the cursor. It can't just read the index and determine the number of documents from the first and last value of some ID, especially since the index can be on some other field that's not the ID (and Mongo IDs are not auto-incrementing). So find and count are basically the same operation, but instead of returning the documents, count goes over them, sums their number, and returns it to you.
Also, if you want a faster result, you could use estimatedDocumentCount, which goes straight to the collection's metadata. It does not take a query filter, though, so you lose the ability to ask "What number of documents can I expect if I trigger this query?". If you need a count for a specific query, you could use countDocuments, which is a wrapper around an aggregate query. From my knowledge of Mongo, that aggregation looks like the fastest way to count query results without calling count. I guess this should be the preferred way, performance-wise, to count docs from now on (it was introduced in version 4.0.3).

Are Postgres WHERE clauses run sequentially?

I'm looking at using Postgres as a database to let our clients segment their customers.
The idea is that they can select a bunch of conditions in our front-end admin, and these conditions will get mapped to a SQL query. Right now, I'm thinking the best structure could be something like this:
SELECT DISTINCT id FROM users
WHERE id IN (
-- condition 1
)
AND id IN (
-- condition 2
)
AND id IN (
-- etc
)
Efficiency and query speed are super important to us, and I'm wondering if this is the best way of structuring things. When going through each of the WHERE clauses, will Postgres pass the id values from one to the next?
The ideal scenario would be, for a group of 1m users:
Query 1 filters down to 100k
Query 2 filters down from 100k to 10k
Query 3 filters down from 10k to 5k
As opposed to:
Query 1 filters from 1m to 100k
Query 2 filters down from 1m to 50k
Query 3 filters down from 1m to 80k
The results of all queries are intersected, down to 5k
Maybe I'm misunderstanding something here, I'd love to get your thoughts!
Thanks!
Postgres uses a query planner to figure out how to most efficiently apply your query. It may reorder things or change how certain query operations (such as joins) are implemented, based on statistical information periodically collected in the background.
To determine how the query planner will structure a given query, stick EXPLAIN in front of it:
EXPLAIN SELECT DISTINCT id FROM users ...;
This will output the query plan for that query. Note that an empty table may get a totally different query plan from a table with (say) 10,000 rows, so be sure to test on real(istic) data.
Database engines are much more sophisticated than that.
The specific order of the conditions should not matter. The engine will take your query as a whole and try to figure out the best way to get the data according to all the conditions you specified, the indexes that each table has, the number of records each condition will filter out, etc.
If you want to get an idea of how your query will actually be solved you can ask the engine to "explain" it for you: http://www.postgresql.org/docs/current/static/sql-explain.html
However, please note that there is a lot of technical background on how DB engines actually work in order to understand what that explanation means.
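As a rough illustration of why the planner is free to choose, the following Python sketch runs the two strategies from the question (narrowing step by step vs. evaluating each condition against all users and intersecting) and shows that they produce the same set of ids. The user ids and the three conditions are made up for the example, and this does not model how Postgres actually plans or executes a query.

```python
def sequential(users, conditions):
    """Apply each condition to the survivors of the previous one."""
    remaining = list(users)
    for cond in conditions:
        remaining = [u for u in remaining if cond(u)]
    return set(remaining)


def independent(users, conditions):
    """Evaluate every condition against all users, then intersect the results."""
    users = list(users)
    matches = [{u for u in users if cond(u)} for cond in conditions]
    return set.intersection(*matches)


# Hypothetical data: 1000 user ids and three arbitrary conditions.
users = range(1, 1001)
conditions = [
    lambda u: u % 2 == 0,
    lambda u: u % 3 == 0,
    lambda u: u % 5 == 0,
]

sequential_result = sequential(users, conditions)
independent_result = independent(users, conditions)
reordered_result = sequential(users, list(reversed(conditions)))
```

Since every ordering yields the same ids, only the cost differs, and that cost is what EXPLAIN lets you inspect on realistic data.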

What is the recommended max number of emits in a map function?

I am new to MongoDB and am planning to use map reduce for computing over a large amount of data.
As you know, we have a map function to match the criteria and then emit the required data for a given field. In my map function I have multiple emits. As of now I have 50 fields emitted from a single document. That means a single document in a collection explodes to 40 documents in the temp table. So if I have 1 million documents to be processed it will 1million * 40 documents in temp table by end of map function.
The next step is to sort on this collection. (I haven't used sort param of map will it help?)
I thought of splitting the map function into two, but there is one more problem: if I run into an exception while performing the map function, I want to skip the entire document (i.e. not emit any data from that document), and if I split the function I won't be able to do that.
In mongoDB.org i found a comment which said..."When I run MR job, with sort - it takes 1.5 days to reach 23% at first stage of MR. When I run MR job, without sort, it takes about 24-36 hours for all job.Also when turn off jsMode is speed up my MR twice ( before i turn off sorting )"
Will enabling sort help? or Will turning OFF jsmode help? i am using mongo 2.0.5
Any suggestion?
Thanks in advance .G
The next step is to sort on this collection. (I haven't used sort param of map will it help?)
I don't know what you mean; MRs don't have sort params, only the incoming query has a sort param. The sort param of the incoming query only sorts the data going in. Unless you are looking for some specific behaviour where sorting the input avoids sorting the final output, you don't normally need to sort.
How are you looking to use this MR? Obviously it won't be in realtime, or else you would just kill your servers, so I'm going to guess it is a background process that runs and formats data the way you want. I would suggest looking into incremental MRs so that you do delta updates throughout the day to limit the amount of resources used at any given time.
So if I have 1 million documents to be processed it will 1million * 40 documents in temp table by end of map function.
Are you emitting multiple times? If not, then the temp table should have only one key per row, with documents of the format:
{
_id: emitted_id
[{ //each of your docs that you emit }]
}
This is shown: http://kylebanker.com/blog/2009/12/mongodb-map-reduce-basics/
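The grouping described above can be modelled with a small in-memory Python sketch: each document emits several key/value pairs, and the intermediate result holds one entry per key with a list of the emitted values. It also shows one way to skip a whole document when its map step raises, by collecting that document's emits before committing any of them. The document shape (a "fields" sub-document) and the function names are assumptions for illustration; this is not MongoDB's mapReduce API.

```python
from collections import defaultdict


def map_doc(doc):
    """Emit one (key, value) pair per field; raises KeyError on malformed docs."""
    for field, value in doc["fields"].items():
        yield field, value


def run_map(docs):
    """Group all emitted values by key, skipping any document whose map fails."""
    grouped = defaultdict(list)
    for doc in docs:
        try:
            # Collect every emit first, so a failing document contributes nothing.
            emitted = list(map_doc(doc))
        except Exception:
            continue
        for key, value in emitted:
            grouped[key].append(value)
    return dict(grouped)


# Hypothetical input: the third document is malformed and gets skipped entirely.
docs = [
    {"fields": {"a": 1, "b": 2}},
    {"fields": {"a": 3}},
    {"oops": True},
]
grouped = run_map(docs)
```

The number of intermediate entries is bounded by the number of distinct keys, not by documents times emits, although the total number of stored values still grows with every emit.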
or Will turning OFF jsmode help? i am using mongo 2.0.5
Turning off jsmode is unlikely to do anything significant and results from it have varied.