Storing millions of values against a key: database modelling - MongoDB

I want to store all tags against the documents in which they appear and make them searchable by some other service/client. Scale:
10 billion search queries per day
10 million new-tag CRUD operations per day (tags deleted from or appended to documents)
Suppose "hello" appears in 10 million documents. When a user queries for "hello", I want to return the list of document_ids in which it occurs.
How should I model the data for this?
Option 1: use a key-value NoSQL store like DynamoDB
key: "hello"
value: [doc_id1, doc_id2, .......]
Issue: whenever a document associated with this tag changes, we have to read the whole value and rewrite it.
Option 2: store individual rows, using something like MongoDB
"hello": doc_id1
"hello": doc_id2
Issue: suppose doc_id122 removes the "hello" tag; then we have to fetch all entries to delete this one, as our database will be sharded on tag_name (a sketch of this option with a compound index follows at the end of the question).
Option 3: column-oriented store (e.g. Cassandra)
Option 4: Elasticsearch
Extended requirements for the same are:
we want to support autosuggest on tags in our tag service.
results should be returned according to some ranking (we can't return 1 million in the first go), so return the first 50 most popular documents (most viewed, most clapped, etc.). I think Elasticsearch internally gives the option to rank documents higher based on the TF-IDF algorithm.
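For what it's worth, here is a minimal sketch of option 2 in MongoDB with the Node.js driver. It assumes a tag_docs collection with one row per (tag, doc_id) pair and a popularity field maintained elsewhere; all names here are illustrative, not part of the original question.

const { MongoClient } = require('mongodb');

async function main() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const tagDocs = client.db('tags_db').collection('tag_docs');

  // Compound unique index: lookups by tag are index-backed, and removing one
  // (tag, doc_id) pair is a point delete rather than a scan of the tag's entries.
  await tagDocs.createIndex({ tag: 1, doc_id: 1 }, { unique: true });
  // Secondary index to serve "first 50 most popular documents for a tag" queries.
  await tagDocs.createIndex({ tag: 1, popularity: -1 });

  // Appending / removing a tag on a document touches exactly one row.
  await tagDocs.updateOne(
    { tag: 'hello', doc_id: 'doc_id1' },
    { $set: { popularity: 42 } }, // popularity is a hypothetical ranking signal
    { upsert: true }
  );
  await tagDocs.deleteOne({ tag: 'hello', doc_id: 'doc_id122' });

  // First page of results for the "hello" query, ranked by popularity.
  const top50 = await tagDocs
    .find({ tag: 'hello' })
    .sort({ popularity: -1 })
    .limit(50)
    .toArray();
  console.log(top50.map((d) => d.doc_id));

  await client.close();
}

main().catch(console.error);

If the collection is sharded on the tag (or a hashed tag), both the delete and the read target a single shard; autosuggest and richer relevance ranking are where Elasticsearch would come in.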

There is a doc in Firestore that has an ID_number field when created. If I want to create another doc, how do I make its ID equal to the previous ID + 1?

I'm using Flutter and Firestore, and I want to be able to create documents with an assigned ID field inside them, but I don't know how to make the new document's ID field equal to the last document's ID field number + 1.
For example, if the last document's correlativeNumber is 86, I want to make the next document's correlativeNumber equal to that one + 1 = 87.
The best option you have is to store the last correlativeNumber in a document. Each time a new document is added, increment that number by 1. In this way, you always know which number was used previously.
But there is something you should take into consideration. When it comes to document IDs, according to the official documentation:
Do not use monotonically increasing document IDs such as:
Customer1, Customer2, Customer3, ...
Product 1, Product 2, Product 3, ...
Such sequential IDs can lead to hotspots that impact latency.
So it's best to use Firestore's built-in identifiers, which are by definition completely unique.
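A minimal sketch of that counter approach, shown with the Node.js Admin SDK rather than Flutter (the same runTransaction pattern exists in the Flutter plugin). The counters/correlative document, the last field and the items collection are invented names for illustration, not anything from the original question:

const admin = require('firebase-admin');
admin.initializeApp();

const db = admin.firestore();

// Allocate the next correlativeNumber inside a transaction so two concurrent
// writers can never be handed the same number.
async function createWithNextNumber(data) {
  const counterRef = db.collection('counters').doc('correlative');
  const newRef = db.collection('items').doc(); // auto-generated ID, so no sequential-ID hotspot

  return db.runTransaction(async (tx) => {
    const snap = await tx.get(counterRef);
    const next = (snap.exists ? snap.data().last : 0) + 1;
    tx.set(counterRef, { last: next }, { merge: true });
    tx.set(newRef, { ...data, correlativeNumber: next });
    return next;
  });
}

createWithNextNumber({ name: 'example' }).then((n) => console.log('assigned number', n));

Note that the document ID itself stays random; only the correlativeNumber field is sequential, which sidesteps the hotspotting warning quoted above.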

Postgres full text search against arbitrary data - possible or not?

I was hoping to get some advice or guidance around a problem I'm having.
We currently store event data in Postgres 12 (AWS RDS) - this data could contain anything. To reduce the amount of data (a lot of keys, for example, are common across all events) we flatten it and store it across 3 tables -
event_keys - the key names from events
event_values - the values from events
event_key_values - a lookup table containing the event_id, key_id and value_id.
We first insert the key and value (or return the existing ids), and finally store the ids in the event_key_values table. So two simple events such as
[
  {
    "event_id": 1,
    "name": "updates",
    "meta": {
      "id": 1,
      "value": "some random value"
    }
  },
  {
    "event_id": 2,
    "meta": {
      "id": 2,
      "value": "some random value"
    }
  }
]
would become
event_keys
id key
1 name
2 meta.id
3 meta.value
event_values
id value
1 updates
2 1
3 some random value
4 2
event_key_values
event_id key_id value_id
1 1 1
1 2 2
1 3 3
2 2 4
2 3 3
All values are converted to text before storing, and a GIN index has been added to the event_keys and event_values tables.
When attempting to search this data, we are able to retrieve results, however once we start hitting 1 million or more rows (we are expecting billions!) this can take anywhere from 10 seconds to minutes to find data. The key-values could have multiple search operations applied to them - equality, contains (case-sensitive and case-insensitive) and regex. To complicate things a bit more, the user can also search against all events, or a filtered selection (so only search against the last 10 days, events belonging to a certain application, etc.).
Some things I have noticed from testing:
when searching with multiple WHERE conditions on the same key, e.g. meta.id, the GIN index is used. However, a WHERE condition with multiple keys does not hit the index.
searching with multiple WHERE conditions on both the event_keys and event_values tables does not hit the GIN index.
using 'raw' SQL - we use jOOQ in this project and this was to rule out any issues caused by its SQL generation.
I have tried a few things:
denormalising the data and storing everything in one table - however this resulted in the database (200 GB disk) filling up within a few hours, with the index taking up more space than the data.
storing the key-values as a JSONB value against an event_id, with the JSON blob containing the flattened key-value pairs as a map - this had the same issue as above, with the index taking up 1.5 times the space of the data.
building a document from the available key-values by concatenation, using both a sub-query and a CTE - from testing with a few million rows this takes forever, even when attempting to tune parameters such as work_mem!
From reading solutions and examples here, it seems full text search provides the most benefit and performance when applied against known columns, e.g. a table with first_name and last_name and a GIN index against those two columns, but I hope I am wrong. I don't believe the JOINs across tables are an issue, or that event_values needing to be stored in TOAST storage due to its size is an issue (I have tried with truncated test values, all of the same length, 128 chars, and the results still take 60+ seconds).
From running EXPLAIN ANALYSE it appears that, no matter how I tweak the queries or tables, most of the time is spent scanning the tables sequentially.
Am I simply spending time trying to make Postgres and full text search fit a problem they may never work for (or at least never perform acceptably on)? Or should I look at other solutions? One possible advantage of the data is that it is 'immutable' and never updated once persisted, so syncing it to something like Elasticsearch and running the search queries against that might be a solution.
I would really like to use Postgres for this, as I've seen it is possible and have read several articles where fantastic performance has been achieved - but maybe this data just isn't suited?
Edit:
Due to the size of the values (some of these could be large JSON blobs of several hundred KB), the GIN index on event_values is based on the MD5 hash - the index is used for equality checks but, as expected, never for searching. For event_keys the GIN index is against the key_name column. Users can search against key names, values or both, for example "List all event keys beginning with 'meta.hashes'".
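Not an answer to the overall design question, but since the slow predicates are contains/regex-style searches ending in sequential scans, one thing worth trying is a trigram index. A rough sketch via the Node.js pg client - the index names are invented, the table/column names follow the post, and with values of several hundred KB each the index itself may become very large:

const { Pool } = require('pg');
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function main() {
  // pg_trgm lets GIN indexes back LIKE/ILIKE and (to a degree) regex predicates,
  // which otherwise fall back to sequential scans.
  await pool.query('CREATE EXTENSION IF NOT EXISTS pg_trgm');
  await pool.query(
    'CREATE INDEX IF NOT EXISTS event_values_value_trgm ON event_values USING gin (value gin_trgm_ops)'
  );
  await pool.query(
    'CREATE INDEX IF NOT EXISTS event_keys_key_name_trgm ON event_keys USING gin (key_name gin_trgm_ops)'
  );

  // Example: "list all event keys beginning with 'meta.hashes'".
  const { rows } = await pool.query(
    "SELECT id, key_name FROM event_keys WHERE key_name LIKE 'meta.hashes%'"
  );
  console.log(rows);

  await pool.end();
}

main().catch(console.error);

Whether this holds up at billions of rows is another question; the immutable-data-plus-Elasticsearch route mentioned above is a common escape hatch for exactly this kind of arbitrary-text search.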

Couchbase N1QL query: getting DISTINCT on the basis of particular fields

I have a document structure which looks something like this:
{
  ...
  "groupedFieldKey": "groupedFieldVal",
  "otherFieldKey": "otherFieldVal",
  "filterFieldKey": "filterFieldVal"
  ...
}
I am trying to fetch all documents which are unique with respect to groupedFieldKey. I also want to fetch otherFieldKey from ANY of these documents. otherFieldKey has minor variations from one document to another, but I am comfortable with getting ANY of these values.
SELECT DISTINCT groupedFieldKey, otherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal";
This query fetches all the documents because of the minor variations.
SELECT groupedFieldKey, maxOtherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal"
GROUP BY groupedFieldKey
LETTING maxOtherFieldKey = MAX(otherFieldKey);
This query works as expected, but it takes a long time due to the GROUP BY step. As this query is used to show products in the UI, that is not acceptable. I have tried applying indexes, but they have not given fast results.
Actual details of the records:
Number of records = 100,000
Size per record = Approx 10 KB
Time taken to load the first 10 records: 3s
Is there a better way to do this? A way of getting DISTINCT on only particular fields would be good.
EDIT 1:
You can follow this discussion thread in Couchbase forum: https://forums.couchbase.com/t/getting-distinct-on-the-basis-of-a-field-with-other-fields/26458
GROUP BY must materialize all the matching documents. You can try a covering index:
CREATE INDEX ix1 ON bucket(filterFieldKey, groupedFieldKey, otherFieldKey);
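A rough sketch of running the grouped query against that index with the Couchbase Node SDK; connection details are placeholders, and because the statement only references fields present in ix1 it can be answered from the index alone (covered), without fetching the 10 KB documents:

const couchbase = require('couchbase');

async function main() {
  const cluster = await couchbase.connect('couchbase://localhost', {
    username: 'Administrator',
    password: 'password',
  });

  // MAX() in the projection replaces the LETTING clause; every field referenced
  // here appears in ix1, so the GROUP BY can be served from the index.
  const statement = `
    SELECT groupedFieldKey, MAX(otherFieldKey) AS maxOtherFieldKey
    FROM bucket
    WHERE filterFieldKey = "filterFieldVal"
    GROUP BY groupedFieldKey
    LIMIT 10`;

  const result = await cluster.query(statement);
  console.log(result.rows);
}

main().catch(console.error);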

MongoDB: compare two big data collections

I want to compare two very big collections; the main aim of the operation is to know which elements have changed or been deleted.
My collections 1 and 2 have the same structure and have more than 3 million records each.
Example:
record 1 {id:'7865456465465',name:'tototo', info:'tototo'}
So I want to know: which elements have changed, and which elements are not present in collection 2.
What is the best solution to do this?
1) Define what equality of 2 documents means. For me it would be: both documents should contain all fields with exactly the same values, given that their ids are unique. Note that Mongo does not guarantee field order, and if you update a field it might move to the end of the document, which is fine.
2) I would use some framework that can connect to Mongo and fetch data while converting it to a map-like data structure or even JSON. For instance I would go with Scala + Lift Record (db.coll.findAll()) + Lift JSON. The Lift JSON library has a Diff function that will give you a diff of 2 JSON docs.
3) Finally I would sort both collections by ids, open db cursors, iterate and compare (a rough sketch of this follows below).
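To make step 3 concrete, here is a sketch in Node.js rather than the Scala/Lift stack mentioned above; it assumes both collections live in one database, that every document carries a unique id field (as in the example record), and that an index on id exists so the sorted scans are cheap:

const { MongoClient } = require('mongodb');
const assert = require('assert');

// Field order is not guaranteed by Mongo, so compare field-by-field, ignoring _id.
function sameDoc(a, b) {
  const { _id: ia, ...restA } = a;
  const { _id: ib, ...restB } = b;
  try { assert.deepStrictEqual(restA, restB); return true; } catch { return false; }
}

async function main() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('mydb');
  const cur1 = db.collection('collection1').find().sort({ id: 1 });
  const cur2 = db.collection('collection2').find().sort({ id: 1 });

  // Walk both sorted cursors in lockstep, like a merge.
  let a = await cur1.next();
  let b = await cur2.next();
  while (a !== null) {
    if (b === null || a.id < b.id) {
      console.log('not present in collection 2:', a.id);
      a = await cur1.next();
    } else if (a.id > b.id) {
      b = await cur2.next(); // only in collection 2; report or ignore as needed
    } else {
      if (!sameDoc(a, b)) console.log('changed:', a.id);
      a = await cur1.next();
      b = await cur2.next();
    }
  }
  await client.close();
}

main().catch(console.error);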
If the schema is flat, which in your case it is, you can use a free tool (dataq.io) to compare the data in the two tables.
Disclaimer: I am the founder of this product.

MongoDB: speed of field ("inside record") search in comparison with speed of search in "global scope"

My question may not be very well formulated because I haven't worked with MongoDB yet, so I'd like to know one thing.
I have an object (record/document/whatever) in my database - in global scope.
And I have a really huge array of other objects inside this object.
So, what about the speed of a search in global scope vs a search "inside" the object? Is it possible to index all the "inner" records?
Thanks in advance.
So, like this
users: {
  ..
  user_maria: {
    age: "18",
    best_comments: {
      goodnight: "23rr",
      sleeptired: "dsf3"
      ..
    }
  },
  user_ben: {
    age: "18",
    best_comments: {
      one: "23rr",
      two: "dsf3"
      ..
    }
  }
}
So, how can I make it fast to find user_maria->best_comments->goodnight (i.e. index the contents of the "best_comments" sub-documents)?
First of all, your example schema is very questionable. If you want to embed comments (which is a big if), you'd want to store them in an array for appropriate indexing. Also, post your schema in JSON format so we don't have to parse the whole name/value thing :
db.users {
  name: "maria",
  age: 18,
  best_comments: [
    {
      title: "goodnight",
      comment: "23rr"
    },
    {
      title: "sleeptired",
      comment: "dsf3"
    }
  ]
}
With that schema in mind you can put an index on name and best_comments.title, for example like so:
db.users.ensureIndex({name: 1, 'best_comments.title': 1})
Then, when you want the query you mentioned, simply do
db.users.find({name: "maria", 'best_comments.title': "goodnight"})
And the database will hit the index and will return this document very fast.
Now, all that said, your schema is very questionable. You mention you want to query specific comments, but that requires either keeping comments in a separate collection or filtering the comments array app-side. Additionally, having huge, ever-growing embedded arrays in documents can become a problem. Documents have a 16 MB limit, and if documents increase in size all the time Mongo will have to continuously move them on disk.
My advice:
Put comments in a separate collection.
Either do one document per comment or make comment bucket documents (say, 100 comments per document) - see the sketch below.
Read up on Mongo/NoSQL schema design. You always query for root documents, so if you end up needing only a small part of a large embedded structure, you need to re-examine your schema or you'll be pumping huge documents over the connection and filtering app-side.
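As a rough illustration of the bucket idea only (the comment_buckets collection and the user/count fields are invented names, not from the answer), each bucket holds up to ~100 comments for a user, and the title field stays indexable much like in the embedded-array schema:

const { MongoClient } = require('mongodb');

// Push a comment into a bucket for this user that still has room;
// the upsert creates a fresh bucket when none has fewer than 100 comments.
async function addComment(db, userName, comment) {
  await db.collection('comment_buckets').updateOne(
    { user: userName, count: { $lt: 100 } },
    { $push: { comments: comment }, $inc: { count: 1 } },
    { upsert: true }
  );
}

async function main() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('mydb');

  // Index comparable to {name: 1, 'best_comments.title': 1} in the embedded design.
  await db.collection('comment_buckets').createIndex({ user: 1, 'comments.title': 1 });

  await addComment(db, 'maria', { title: 'goodnight', comment: '23rr' });
  const buckets = await db
    .collection('comment_buckets')
    .find({ user: 'maria', 'comments.title': 'goodnight' })
    .toArray();
  console.log(buckets);

  await client.close();
}

main().catch(console.error);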
I'm not sure I understand your question but it sounds like you have one record with many attributes.
record = {'attr1':1, 'attr2':2, etc.}
You can create an index on any single attribute or any combination of attributes. Also, you can create any number of indices on a single collection (MongoDB collection == MySQL table), whether or not each record in the collection has the attributes being indexed on.
edit: I don't know what you mean by 'global scope' within MongoDB. To insert any data, you must define a database and collection to insert that data into.
// database 'Example', collection 'table1'
db.table1.insert({a: 1, b: 1, c: 1})
db.table1.insert({a: 1, b: 2, d: 1})
db.table1.insert({a: 1, c: 1, d: 1})
// This indexes on a, then by d; the fact that the first record doesn't have an
// attribute 'd' doesn't matter, and it will increase query performance:
db.table1.ensureIndex({a: 1, d: 1})
Edit 2:
Well, first of all, in your table here you are assigning multiple values to the attributes "name" and "value". MongoDB will ignore/overwrite the original instantiations of them, so only the final ones will be included in the collection.
I think you need to reconsider your schema here. You're trying to use it as a series of key-value pairs, which it is not specifically suited for (if you really want key-value pairs, check out Redis).
Check out: http://www.jonathanhui.com/mongodb-query