Couchbase document list model - database-schema

How should I model this kind of structure in Couchbase with performance in mind?
The list is similar to, for example, a list of products on eBay. But in this case, besides sorting and choosing categories, we have a couple of write-heavy ints, and we need to show the data in near real time (those ints can be a little out of date).
We have a list that we present to our users. Every document in the list looks something like this:
{
String1 : "some text",
String2: "some text",
String3: "some more text",
String4Categories (can be represented as ints too): enumerable (around 10 choices),
Timestamp: someTimestamp,
int1: someInt,
int2: someInt2,
int3: someInt3,
}
All the strings, one int and the timestamp do not change at all, but two of the ints change (incrementing/decrementing) A LOT.
Users get 100 results from this list sorted by timestamp, but they can make a new request and sort it ascending/descending by any of the attributes of these documents (string1/timestamp/int2 etc.). Or they can group results by categories, so they see only the results from category1, or from category1 combined with category2, etc.
The collection currently holds around 100k of these documents; it's not growing very fast, less than 200 a day.
What is the best way to implement this in Couchbase if performance (throughput and latency) is a concern?

Related

Best way to structure MongoDB with the following use cases?

Sorry to have to ask this, but I am new to MongoDB (I only have experience with relational databases) and was curious how you would structure your MongoDB collections.
The documents will be in the format of JSONs with some of the following fields:
{
"url": "http://....",
"text": "entire ad content including HTML (very long)",
"body": "text (50-200 characters)",
"date": "01/01/1990",
"phone": "8001112222",
"posting_title": "buy now"
}
Some of the values will be very long strings.
Each document is essentially an ad from a certain city. We are storing all ads for a lot of big cities in the US (about 422). We are storing more ads every day, and the amount of ads per city varies from as little as 0 to as big as 2000. The average is probably around 700-900.
We need to do the following types of queries, in almost instant time (if possible):
Get all ads for any specific city, for any specific date range.
Get all ads that were posted by a specific phone number, for any city, for any date range.
What would you recommend? I'm thinking I should have 422 collections - one for each city. I'm just worried about the query time when we query for phone numbers because it needs to go through each collection. I have an iterable list of all collection names.
Or would it be faster to just have one collection so that I don't have to switch through 422 collections?
Thank you so much, everyone. I'm here to answer any questions!
EDIT:
Here is my "iterating through all collections" snippet:
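import glob

# Raw string: in a normal string literal "\N" is treated as an escape sequence
for name in glob.glob(r"Data\Nov. 12 - 5pm\*"):
    # Keep the file name that follows the "5pm\" folder, minus ".json"
    val = name.split("5pm")[1].split(".json")[0][1:]
    coll = db[val]
    # Add into collection here...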
A MongoDB find() query reads from a single collection, so putting your data in multiple collections is not advisable in this case: the phone-number lookup would need one query per city collection.
You can considerably speed up all the use-cases you mentioned by creating indexes for them. When you have a very large dataset and always query for exact equality, then hashed indexes are the fastest.
When you query a range of dates (between day x and day y), you should use the Date type and not strings, because this not only lets you use lots of handy date operators in aggregation but also lets you speed up range queries and sorts with ascending or descending indexes.
Maybe I'm missing something, but wouldn't making "city" a field in your JSON solve your problem? That way you only need to do something like this db.posts.find({ city: {$in: ['Boston', 'Michigan']}})
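To make the two suggestions above concrete, here is a minimal sketch in plain Python. It assumes the dates arrive as MM/DD/YYYY strings (as in the sample document) and that each ad carries a "city" field; the helper names parse_ad_date, city_date_filter and matches are hypothetical. The filter dictionary only mirrors the shape a MongoDB driver would accept, and nothing here talks to a database.

```python
from datetime import datetime

def parse_ad_date(text):
    """Parse the ad's date string, assuming MM/DD/YYYY as in the sample document."""
    return datetime.strptime(text, "%m/%d/%Y")

def city_date_filter(cities, start, end):
    """Build a filter for some cities and a half-open date range [start, end)."""
    return {"city": {"$in": list(cities)}, "date": {"$gte": start, "$lt": end}}

def matches(doc, flt):
    """Evaluate the filter in memory, standing in for the database."""
    date_range = flt["date"]
    return (doc["city"] in flt["city"]["$in"]
            and date_range["$gte"] <= doc["date"] < date_range["$lt"])

ads = [
    {"city": "Boston", "date": parse_ad_date("01/01/1990"), "phone": "8001112222"},
    {"city": "Chicago", "date": parse_ad_date("01/15/1990"), "phone": "8001112222"},
    {"city": "Boston", "date": parse_ad_date("03/01/1990"), "phone": "8003334444"},
]

query = city_date_filter(["Boston"], datetime(1990, 1, 1), datetime(1990, 2, 1))
hits = [ad for ad in ads if matches(ad, query)]
print(len(hits))
```

Because the dates are real datetime values, the range comparison is chronological; comparing the original strings would order "03/01/1990" before "12/01/1989".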

MongoDB schema design for unbounded growing table

I'm practicing MongoDB through a small personal project,
in which I may need to store some intermediate data abstracted as an unbounded growing table. Both rows and columns would grow without bound.
The usage of this abstract table is that I want to be able to
know the corresponding column for each entry in a row
know the corresponding row for each entry in a column
Or, in other words, know the index of each table entry.
Hence two choices come up for modeling the table:
Make two collections:
one holds each row as a document which embeds a growing structure as row entries to have reference to the corresponding columns;
and similarly, another collection holds each column as a document embedding a growing structure to reference to the corresponding rows.
Make a single separate collection that holds each table entry as a document. Hence each document size is fixed.
The first model has a problem with document growth (in fact, in my application the table grows rather lopsidedly, so only one of the two collections would run into the document growth issue). The second model seems fine to me. Is there some pitfall or other issue that I should be aware of? And what is the common practice for dealing with such a problem?
UPDATE: explain things in more detail
I am trying to do automatic summarization of an ongoing conversation. The input is a corpus of sentences, and terms are extracted from each sentence. For example, English terms are stemmed, and sentences in CJK languages are segmented. This yields a term-sentence matrix. One of the methods then needs to compute the (sparse) SVD of that term-sentence matrix.
The sentences and extracted terms would be stored into the database. But the term-sentence matrix would grow unbounded.
(Or one can think of the problem of storing a mapping between tweets and hashtags)
Two draft schemas come to mind:
choice one (hold two-way linkages between sentences and terms)
{ // sentence collection doc
"_id" : // generated by timestamp
, "text" : //
, "contained_terms" : [
// an array of "_id"s in term collection
]
}
{ // term collection doc
"_id" : // use term name
, "in_sentences" : [
// an array of "_id"s in sentence collection
]
}
choice two (make linkages into a separate collection)
{ // linkage collection doc (as matrix entries)
"_id" : // generated by timestamp
, "term" : // an "_id" in term collection
, "in_sentence" : // an "_id" in sentence collection
}
{ // sentence collection doc
"_id" : // generated by time stamp
, "text" : //
}
{ // term collection doc
"_id" : // use term name
}
Choice one runs into the document growth problem, because the "in_sentences" array of a term collection doc is very likely to grow beyond the limit when sentences come in nonstop.
Choice two extracts the linkage between terms and sentences into a separate collection and hence avoids the document growth. Querying "which sentences contain the term" costs more, but in the end it seems I don't actually need that operation much.
Currently, I think choice two suits my needs better. The linkage collection seems to match the input of sparse SVD. To speed up computation, very high-frequency terms can be filtered out if a term frequency field is added to each term collection doc (or kept in a separate collection when there is more than one conversation). This filtering seems fine in the case of automatic summarization.
But I still wonder:
Are there any issues or pitfalls that I should be aware of?
What is the common practice for similar situations?
My understanding of mongodb is that you need to design your schema around your queries. So how you save your data is highly dependent on what data will you be querying. So even for the same set of data, your schema can vary depending on the actual use case. Additionally, data redundancy is quite common in NoSql database design. In case you are going to need some data again and again, there is no point in saving it in a separate collection. You can duplicate it in 2 collections, and that's a fair enough cost for faster querying. Memory is cheap, processing isn't! Additionally, pre-aggregation helps in case of mongo for huge data sets. Your queries will work fine for decent number of documents, but once you go into the realm of millions of records, you may face problems with a certain class of queries like counts, aggregation, etc. Pre-aggregation helps in keeping things real time, though it may have a higher write/insertion overhead. Always avoid a full table scan, whenever you can.
Above are some broad level concepts that I find relevant to your question. I'll try and explain it in your context with some examples (as I am not sure what data you are eventually going to need, or the queries you will do).
Let's say you are going to need terms per sentence frequently, to highlight them. In that case the recommended schema will be:
{ "_id" : // sentence id - you will query on this
, "text" : // sentence text
, "terms" : ["term1", "term2", "term3"]
}
So for each new sentence, you extract all the terms and save it (not the id) along with the sentence. The advantage here being that you will not need to query for the term separately. You can get all the terms for a given sentence in a single query. Additionally, the document size doesn't grow, and hence no document relocation.
Let's say you also want to have a unique list of terms and some per term meta data. You can have a separate terms collection which has a list of all the unique terms:
{ "_id" : //
, "term" : // term
, "meaning" : //
, "metadata" : //
, "count" : 1
}
You can have a unique index on term. Each time you extract terms from a sentence, you look up for it in this collection, and in case you don't find it, you insert it. Now let's say you also want to maintain a count of term appearance. So each time you find a term in a sentence and do a lookup in terms collection, you can increment (atomic) the count as well - pre-aggregation. If you add an index on count, you can get the top 100 terms, etc. easily on the fly.
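The pre-aggregation idea can be sketched in plain Python as follows. This is only an in-memory stand-in for the two collections: the dictionaries sentences and term_counts and the helpers record_sentence and top_terms are hypothetical names, and the counter update plays the role that an atomic increment on an upserted term document would play in MongoDB. The sketch counts each term once per sentence, which is an assumption about what "term appearance" means here.

```python
def record_sentence(sentences, term_counts, sentence_id, text, terms):
    """Store a sentence with its terms embedded and bump each term's counter."""
    sentences[sentence_id] = {"_id": sentence_id, "text": text, "terms": list(terms)}
    for term in set(terms):
        # Stand-in for "look the term up, insert it if missing, increment count"
        term_counts[term] = term_counts.get(term, 0) + 1

def top_terms(term_counts, n):
    """Return the n most frequent terms, ties broken alphabetically."""
    return sorted(term_counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

sentences, term_counts = {}, {}
record_sentence(sentences, term_counts, 1, "cats chase mice", ["cat", "chase", "mouse"])
record_sentence(sentences, term_counts, 2, "mice fear cats", ["mouse", "fear", "cat"])

print(top_terms(term_counts, 2))
print(sentences[1]["terms"])
```

The counts are maintained at write time, so reading the top terms never has to scan the sentences.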
Now let's say you want to query/count all the sentences with a given term. You can add an index on terms array and directly look up for all the sentences with a given term:
Sentence.where(:terms => "term1").count # mongoid query
Again, you are achieving this with a single query, as opposed to getting a term id first in your case, and then the sentences.
Other than this it's always advisable to ensure that your working set and indexes fit into RAM for best performance.
So again, there are no right and wrong answers for schema design and it definitely depends on the queries you will be doing. I would also advise you to unlearn some of your relational DB concepts when trying to design for NoSQL databases. I learned it the hard way =) Hope some of this helps you in coming up with an efficient schema for your use case.
If you are trying to model a matrix with the whole collection representing the matrix, I think the go-to model should be to have each entry (row i, column j) as a document. If you put in a field like "index" : { "row" : i, "column" : j} and appropriate indices then it's easy and fast to do fun things like
get the entry at (i, j)
get row i
get column j
The matrix is represented sparsely so if row i only has 10 columns with values, row i is just 10 documents. If the rows/columns really do grow unboundedly to very large sizes then modeling a document as a row or column or something of "1 dimension" could hit the hard 16MB BSON document size limit.
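As a rough illustration of the entry-per-document model, the sketch below keeps one small dictionary per matrix entry and maintains two lookup tables that stand in for indexes on "index.row" and "index.column". The class SparseMatrixStore and its method names are hypothetical, and everything lives in memory rather than in MongoDB.

```python
from collections import defaultdict

class SparseMatrixStore:
    """One document per (row, column) entry, plus row and column lookups."""

    def __init__(self):
        self.entries = {}                   # (row, column) -> entry document
        self.by_row = defaultdict(set)      # row -> columns that have a value
        self.by_column = defaultdict(set)   # column -> rows that have a value

    def put(self, row, column, value):
        self.entries[(row, column)] = {
            "index": {"row": row, "column": column},
            "value": value,
        }
        self.by_row[row].add(column)
        self.by_column[column].add(row)

    def get(self, row, column):
        return self.entries.get((row, column))

    def row(self, row):
        return [self.entries[(row, c)] for c in sorted(self.by_row.get(row, ()))]

    def column(self, column):
        return [self.entries[(r, column)] for r in sorted(self.by_column.get(column, ()))]

store = SparseMatrixStore()
store.put(0, 1, 2.0)
store.put(0, 3, 1.0)
store.put(2, 1, 5.0)

print([e["value"] for e in store.row(0)])
print([e["value"] for e in store.column(1)])
```

Each entry has a fixed size, so a row or column can keep growing without any single document growing with it.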
I'm thinking the biggest drawback could be large index sizes given that every entry is its own document.

Referencing Other Documents by String rather than ObjectId

Let's say I have two collections:
Products and Categories.
The former collection's documents have 2 fields:
_id (BSON ObjectId)
Name (String)
The latter collection's documents have 3 fields:
_id (BSON ObjectId)
Name (String)
Products (Array of Strings)
Assume I have the following Product document:
{ "_id" : ObjectId("AAA"), "name" : "Shovel" }
Let's say I have the following Category document:
{ "_id" : ObjectId("BBB"), "Name" : "Gardening", "Products" : ["AAA"] }
For purposes of this example, assume that AAA and BBB are legitimate ObjectId's - example: ObjectId("523c7df5c30cc960b235ddee") where they would equal the inner ObjectId's string.
Should the Products field be stored as ObjectId(...)'s rather than as Strings?
I don't think it really matters that much.
I'm pretty sure that the ObjectId format encodes a hex number, so it is probably slightly more efficient with memory and bandwidth. I have done it both ways. As long as you decide, for each field, how you are going to encode it, either will work just fine.
As long as you consistently use the same type (so that comparisons happen correctly), the difference is:
An ObjectId cannot be compared to a String representation of the same ObjectId value. Thus, ObjectId("523c7df5c30cc960b235ddee") is not equal to "523c7df5c30cc960b235ddee".
ObjectIds, when stored natively, will be stored as 12 bytes, plus field name
An ObjectId, when stored as a string, will be commonly stored in 24 bytes (as it will be converted to a hexadecimal number), plus field name
Comparisons can be made SLIGHTLY more efficiently with the 12 byte number, as it's comparing fewer bytes. It won't matter in most types of usage though, so it's a micro-optimization (but something you should know)
Bonus -- if you don't use short abbreviated field names, the size benefit of using an ObjectId natively as 12 bytes really won't matter, as the field names will far outweigh the size of bytes when stored as a string.
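The size and comparison points above can be checked with plain Python, without any MongoDB driver. In this sketch the 12 raw bytes stand in for a natively stored ObjectId and the 24-character hex string for the string form; products_by_id is a hypothetical in-memory lookup table, not a real collection.

```python
hex_id = "523c7df5c30cc960b235ddee"   # the string form from the question
raw_id = bytes.fromhex(hex_id)         # stand-in for the native 12-byte value

print(len(hex_id), len(raw_id))

# A lookup keyed by the native value does not match the string form.
products_by_id = {raw_id: {"name": "Shovel"}}
print(products_by_id.get(hex_id))

# Converting the string back to the native form makes the lookup work.
print(products_by_id[bytes.fromhex(hex_id)]["name"])
```

This is why the type has to be chosen per field and then used consistently: a value stored in one form is never found by a query using the other.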
I'd recommend storing them as native ObjectIds. Some drivers can optionally and transparently translate an ObjectId to a String and back so that the client code can more easily manipulate it. The C# driver for example can do this, and I've used it so that when serializing to JSON, the ObjectId is in a simple format that is easily consumed in JavaScript.
This will matter most when you try to find the details of a product starting from the Categories collection.
Since there is no server-side JOIN in Mongo, your code will have to match documents together. ObjectIDs are encoded as 12 bytes, which you can easily compare in any language. Using either strings or object ids does not really matter.
The real issue you are facing is one of data normalization (or lack thereof). If you store the Name field in your Categories documents, instead of the ObjectID, you will be able to return the products names in a single call (instead of multiple calls, 1 for each products of the category).
It feels wrong the first time you do it. After all, you will have to update many documents if you ever change the name of a product, which might or might not be frequent. You have to model your data by thinking of the way your application will use it.
Finally, index the Name attribute in the Products collection. Getting the details of a product, starting with the string you found in a Categories document, will be fast.
Another way to do it is not to have a Categories collection at all, but to add a Category attribute to your Products document. You can find documents that have the {'Category':'Gardening'}. Indexing the Category field will probably be a good idea.
Again, ObjectID or String does not matter much. It is about modeling your data thinking of how your application will use it.

Efficient document format to store "Votes" in Mongo DB?

I'm trying to store "Votes" in MongoDB and I am stuck on how to proceed in an efficient way.
Basically, I have a question with several options like A B C D ... (6 total).
I am giving voters the option to choose an option and want to save the "Vote" with fields like:
MongoDate, option, voter name, and maybe couple more fields.
I am planning to have unlimited "Votes" in the thousands and even in millions on a given question.
In terms of retrieving the data : I would like to be able to query it mainly by Date and present in charts, like a stock price with hourly, daily, monthly... intervals
In other words it is like time series.
I am not sure about the "format" of the document in MongoDB.
One reasonable way to do it would be to have a votes collection, where each document looks like:
{
v: 'a', //voted for the first option
d: new Date(), //the date (plain Date() would return a string)
n: 'Bob',
...
}
Then, index on the date field. Be careful not to shard on the date field alone, though, if you end up having to shard this: dates only ever increase, so every new vote would be routed to the same shard. I listed the field names as single characters because the name of every field is stored in each MongoDB document, so for better space efficiency, you should use shorter names. If you aren't concerned about space, a longer, more informative name is probably fine.
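To show how such vote documents could feed the hourly, daily and monthly charts from the question, here is a minimal in-memory sketch. The function bucket_votes and the sample votes are hypothetical; the field names v, d and n follow the document above, and a real deployment would more likely do this grouping inside the database.

```python
from collections import Counter
from datetime import datetime

def bucket_votes(votes, fmt):
    """Count votes per (time bucket, option); fmt is a strftime pattern."""
    counts = Counter()
    for vote in votes:
        counts[(vote["d"].strftime(fmt), vote["v"])] += 1
    return counts

votes = [
    {"v": "a", "d": datetime(2013, 5, 1, 9, 15), "n": "Bob"},
    {"v": "a", "d": datetime(2013, 5, 1, 9, 40), "n": "Ann"},
    {"v": "b", "d": datetime(2013, 5, 1, 10, 5), "n": "Cy"},
]

hourly = bucket_votes(votes, "%Y-%m-%d %H:00")
daily = bucket_votes(votes, "%Y-%m-%d")
monthly = bucket_votes(votes, "%Y-%m")

print(hourly[("2013-05-01 09:00", "a")])
print(daily[("2013-05-01", "b")])
```

The same raw documents serve every interval; only the bucket pattern changes.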

How to deal with Many-to-Many relations in MongoDB when Embedding is not the answer?

Here's the deal. Let's suppose we have the following data schema in MongoDB:
items: a collection with large documents that hold some data (it's absolutely irrelevant what it actually is).
item_groups: a collection with documents that contain a list of items._id called item_groups.items plus some extra data.
So, these two are tied together with a Many-to-Many relationship. But there's one tricky thing: for a certain reason I cannot store items within item groups, so -- just as the title says -- embedding is not the answer.
The query I'm really worried about is intended to find some particular groups that contain some particular items (i.e. I've got a set of criteria for each collection). In fact it also has to say how many items within each found group fit the criteria (no items means the group is not found).
The only viable solution I have come up with so far is to use a Map/Reduce approach with a dummy reduce function:
function map () {
    // imagine that item_criteria came from the scope.
    // it's a mongodb query object.
    item_criteria._id = {$in: this.items};
    var group_size = db.items.count(item_criteria);
    // this group holds no relevant items, skip it
    if (group_size == 0) return;
    var key = this._id.str;
    var value = {size: group_size, ...};
    emit(key, value);
}
function reduce (key, values) {
    // since the map function emits each group just once,
    // values will always be a list with length=1
    return values[0];
}
db.runCommand({
    mapreduce: "item_groups", // the collection name is passed as a string
    map: map,
    reduce: reduce,
    query: item_groups_criteria,
    scope: {item_criteria: item_criteria},
});
The problem line is:
item_criteria._id = {$in: this.items};
What if this.items.length == 5000 or even more? My RDBMS background cries out loud:
SELECT ... FROM ... WHERE whatever_id IN (over 9000 comma-separated IDs)
is definitely not a good way to go.
Thank you sooo much for your time, guys!
I hope the best answer will be something like "you're stupid, stop thinking in RDBMS style, use $its_a_kind_of_magicSphere from the latest release of MongoDB" :)
I think you are struggling with the separation of domain/object modeling from database schema modeling. I too struggled with this when trying out MongoDb.
For the sake of semantics and clarity, I'm going to substitute Groups with the word Categories
Essentially your theoretical model is a "many to many" relationship in that each Item can belong to many Categories, and each Category can in turn possess many Items.
This is best handled in your domain object modeling, not in DB schema, especially when implementing a document database (NoSQL). In your MongoDb schema you "fake" a "many to many" relationship, by using a combination of top-level document models, and embedding.
Embedding is hard to swallow for folks coming from SQL persistence back-ends, but it is an essential part of the answer. The trick is deciding whether or not it is shallow or deep, one-way or two-way, etc.
Top Level Document Models
Because your Category documents contain some data of their own and are heavily referenced by a vast number of Items, I agree with you that fully embedding them inside each Item is unwise.
Instead, treat both Item and Category objects as top-level documents. Ensure that your MongoDb schema gives each one its own collection (the equivalent of a table), so that each document has its own ObjectId.
The next step is to decide where and how much to embed... there is no right answer as it all depends on how you use it and what your scaling ambitions are...
Embedding Decisions
1. Items
At minimum, your Item objects should have a collection property for its categories. At the very least this collection should contain the ObjectId for each Category.
My suggestion would be to add to this collection, the data you use when interacting with the Item most often...
For example, if I want to list a bunch of items on my web page in a grid, and show the names of the categories they are part of. It is obvious that I don't need to know everything about the Category, but if I only have the ObjectId embedded, a second query would be necessary to get any detail about it at all.
Instead what would make most sense is to embed the Category's Name property in the collection along with the ObjectId, so that pulling back an Item can now display its category names without another query.
The biggest thing to remember is that the key/value objects embedded in your Item that "represent" a Category do not have to match the real Category document model... It is not OOP or relational database modeling.
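A small sketch of this shallow embedding, using plain Python dictionaries in place of MongoDB documents. The sample data, the string ids and the helpers embed_categories and rename_category are all hypothetical; the point is only that each Item carries the id plus the name of its categories, and that the duplicated name has to be kept in sync when a Category is renamed.

```python
# Top-level Category documents, keyed by id (strings stand in for ObjectIds).
categories = {
    "cat1": {"_id": "cat1", "name": "Gardening", "description": "Tools and plants"},
    "cat2": {"_id": "cat2", "name": "Outdoor", "description": "Everything for outside"},
}

def embed_categories(item, category_ids):
    """Embed only the id and name of each category into the item."""
    item["categories"] = [
        {"_id": cid, "name": categories[cid]["name"]} for cid in category_ids
    ]
    return item

def rename_category(items, category_id, new_name):
    """Update the category and every embedded copy of its name."""
    categories[category_id]["name"] = new_name
    for item in items:
        for ref in item["categories"]:
            if ref["_id"] == category_id:
                ref["name"] = new_name

shovel = embed_categories({"_id": "item1", "name": "Shovel"}, ["cat1", "cat2"])

# Listing the item's category names needs no second lookup.
print([ref["name"] for ref in shovel["categories"]])
```

The embedded entries deliberately omit the description: they hold just what the item listing needs, not the whole Category model.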
2. Categories
In reverse you might choose to leave embedding one-way, and not have any Item info in your Category documents... or you might choose to add a collection for Item data much like above (ObjectId, or ObjectId + Name)...
In this direction, I would personally lean toward having nothing embedded... more than likely if I want Item information for my category, I want lots of it, more than just a name... and deep-embedding a top-level document (Item) makes no sense. I would simply resign myself to querying the database for the Items whose collection of Categories contains the ObjectId of my Category.
Phew... confusing for sure. The point is, you will have some data duplication and you will have to tweak your models to your usage for best performance. The good news is that that is what MongoDb and other document databases are good at...
Why not use the opposite design?
You are storing items and item_groups. If your first idea was to store items in item_group entries, then maybe the opposite is not a bad idea :-)
Let me explain:
In each item you store the groups it belongs to. (You are in NoSQL, data duplication is OK!)
For example, let's say each item entry stores a list called groups, so your items look like:
{ _id : ....
, name : ....
, groups : [ ObjectId(...), ObjectId(...),ObjectId(...)]
}
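Then the map/reduce idea becomes much more powerful:
map = function() {
    var item = this; // `this` is not the document inside the callback below
    this.groups.forEach(function(groupKey) {
        emit(groupKey, {items : [item]});
    });
}
reduce = function(key, values) {
    // merge the partial item lists; the result has the same shape as an emit
    var merged = [];
    values.forEach(function(value) { merged = merged.concat(value.items); });
    return {items : merged};
}
db.runCommand({
    mapreduce : "items",
    map : map,
    reduce : reduce,
    query : {_id : {$in : [...,....,.....] }} // put your item ids here
})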
You can add some parameters (finalize, for instance, to modify the output of the map/reduce), but this might help you.
Of course, you need another collection that stores the details of item_groups if you need them, but in some cases (if this information about item_groups does not exist, or doesn't change, or you don't mind not having the most up-to-date version of it) you don't need it at all!
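The effect of this inverted design on the original question (which groups contain matching items, and how many) can be sketched in plain Python. The sample items, the predicate and the helper matching_items_per_group are hypothetical, and the loop runs in memory; it only illustrates that once each item lists its groups, one pass over the matching items yields the per-group counts without any large $in list.

```python
from collections import defaultdict

def matching_items_per_group(items, item_matches):
    """Count, for every group, how many of its items satisfy the criteria."""
    counts = defaultdict(int)
    for item in items:
        if item_matches(item):
            for group_id in item["groups"]:
                counts[group_id] += 1
    return dict(counts)

items = [
    {"_id": 1, "name": "shovel", "groups": ["g1", "g2"]},
    {"_id": 2, "name": "rake", "groups": ["g1"]},
    {"_id": 3, "name": "lamp", "groups": ["g3"]},
]

counts = matching_items_per_group(
    items, lambda item: item["name"] in {"shovel", "rake"}
)
print(counts)
```

Groups with no matching item simply never appear in the result, which matches the "no items means the group is not found" requirement.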
Does that give you a hint about a solution to your problem?