How do I manage a sublist in MongoDB? - mongodb

I have different types of data that would be difficult to model and scale with a relational database (e.g., a product type).
I'm interested in using MongoDB to solve this problem.
I am referencing the documentation on MongoDB's website:
http://docs.mongodb.org/manual/tutorial/model-referenced-one-to-many-relationships-between-documents/
For the data type that I am storing, I also need to maintain a list of ids where this particular product is available (e.g., store location ids).
In their example regarding "one-to-many relationships with embedded documents", they have the following:
{
    name: "O'Reilly Media",
    founded: 1980,
    location: "CA",
    books: [12346789, 234567890, ...]
}
I am currently importing the data from a spreadsheet and want to use a batch insert.
To avoid duplicates, I assume that:
1) I need to ensure an index on the ID and ignore errors on the insert?
2) Do I then need to loop through all the IDs to add a new related ID to the books array?

Your question could be defined a little better, but let's consider the case where you have rows in a spreadsheet or another source that are all de-normalized in some way. In a JSON representation the rows would look something like this:
{
    "publisher": "O'Reilly Media",
    "founded": 1980,
    "location": "CA",
    "book": 12346789
},
{
    "publisher": "O'Reilly Media",
    "founded": 1980,
    "location": "CA",
    "book": 234567890
}
In order to get those row results into the structure you want, one way to do it is to use the "upsert" functionality of the .update() method. Assuming you have some way of looping the input values and they are identified with some structure, an analog to this would be something like:
books.forEach(function(book) {
    db.publishers.update(
        { "name": book.publisher },
        {
            "$setOnInsert": {
                "founded": book.founded,
                "location": book.location
            },
            "$addToSet": { "books": book.book }
        },
        { "upsert": true }
    );
})
This essentially lets MongoDB do all of the data collection work for you. Where the "name" of the publisher is considered to be unique, the statement first searches for a document in the collection that matches the query condition given, i.e. the "name".
In the case where that document is not found, a new document is inserted. Either the database or the driver takes care of creating the new _id value for this document, and your query "condition" is also automatically written to the new document since it is an implied value that should exist.
The usage of the $setOnInsert operator is to say that those fields will only be set when a new document is created. The final part uses $addToSet in order to "push" book values that are not already present into the "books" array (or set).
The reason for the separation is for when a document is actually found to exist with the specified "publisher" name. In that case, all of the fields under $setOnInsert are ignored as they should already be in the document, and only the $addToSet operation is processed in order to add the new entry to the "books" array (set) where it does not already exist.
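Since the upsert matches on the publisher "name", it is also worth having an index on that field so the lookup does not require a collection scan; a minimal sketch, assuming the collection is named publishers as above:
// Index the field used as the upsert match key; making it unique also
// guards against duplicate publisher documents.
db.publishers.ensureIndex({ "name": 1 }, { "unique": true })   // createIndex() in newer shells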
That upsert approach is simpler logic than aggregating the new records in code before sending an insert operation. However it is not very "batch"-like, as you are still performing one operation against the server for each row.
This is fixed in MongoDB version 2.6 and above as there is now the ability to do "batch" updates. So with a similar analog:
var batch = [];
books.forEach(function(book) {
    batch.push({
        "q": { "name": book.publisher },
        "u": {
            "$setOnInsert": {
                "founded": book.founded,
                "location": book.location
            },
            "$addToSet": { "books": book.book }
        },
        "upsert": true
    });
    if ( ( batch.length % 500 ) == 0 ) {
        db.runCommand({ "update": "publishers", "updates": batch });
        batch = [];
    }
});
if ( batch.length > 0 ) {
    db.runCommand({ "update": "publishers", "updates": batch });
}
What this does is build up the constructed update statements and send them to the server in a single call, with a sensible number of operations per batch; in this case once for every 500 items processed. The actual limit is the BSON document maximum of 16MB, so this can be adjusted as appropriate to your data.
If your MongoDB version is lower than 2.6 then you can either use the first form, or do something similar to the second form using the existing batch insert functionality. But if you choose to insert, then you need to do all the pre-aggregation work within your own code.
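As a rough illustration of that pre-aggregation route (a sketch only, assuming the rows fit in memory and that the publishers are not already in the collection):
// Group the de-normalized rows by publisher in application code first.
var byPublisher = {};
books.forEach(function(book) {
    if (!byPublisher[book.publisher]) {
        byPublisher[book.publisher] = {
            "name": book.publisher,
            "founded": book.founded,
            "location": book.location,
            "books": []
        };
    }
    byPublisher[book.publisher].books.push(book.book);
});

// Then send the aggregated documents as a batch insert.
var docs = Object.keys(byPublisher).map(function(name) {
    return byPublisher[name];
});
db.publishers.insert(docs);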
All of these methods are of course supported by the PHP driver, so it is just a matter of adapting this to your actual code and deciding which course you want to take.

Related

Mongodb query getting slow even after indexing

I have a collection, let's say Fruits, in my db which has the following fields:
{
    "_id": ObjectId(...),
    "item": "Banana",
    "category": ["food", "produce", "grocery"],
    "location": "4th Street Store",
    "stock": 4,
    "type": "cases"
}
There is an index by default on _id, and I added another index, which is:
{
    "item": "1",
    "category": "1",
    "stock": "1",
    "type": "1"
}
This collection has thousands of documents, and my query response is slow.
After adding the index mentioned above, do I need to include all of these fields in my query, or can I use any of the keys added in the index?
Currently my queries are like:
fruits.find({item: 'new'});
fruits.find({item: 'new', category: 'history'});
fruits.find({stock: '5', category: 'drama'});
fruits.find({type: 'new'});
Is my index, which has all these keys, enough for this, or do I need to create different indexes for all the combinations of keys which I mentioned above?
Sometimes I query directly, and sometimes I run an aggregation on other collections with a $lookup into this fruits collection and then search on that.
{
    "item": "1",
    "category": "1",
    "stock": "1",
    "type": "1"
}
This index will partially work for the following.
fruits.find({item: 'new'}); **Will work (Partially)**
fruits.find({item: 'new', category: 'history'}); **Will work (Partially)**
fruits.find({stock: '5', category: 'drama'}); **Won't work**
fruits.find({type: 'new'}); **Won't work**
Partially => An index is basically an additional B-Tree data structure in MongoDB that maps back to documents in the collection. The index prefix on item allows the index to be used for the first and second queries you mentioned, but the third and last ones will result in a collection scan.
Read about prefixes here.
In the long run you need to properly understand indexes; for specific queries you can ask for help, but the knowledge gap will become a problem. This brief read will be really useful.
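To see this for yourself, a quick sketch (assuming the collection is named fruits; index keys are written as numbers, which is the usual form):
// Compound index; its usable prefixes are {item}, {item, category} and {item, category, stock}.
db.fruits.createIndex({ "item": 1, "category": 1, "stock": 1, "type": 1 })

// These can use the index because the query includes the "item" prefix:
db.fruits.find({ "item": "new" }).explain("executionStats")           // IXSCAN
db.fruits.find({ "item": "new", "category": "history" }).explain()    // IXSCAN

// No usable prefix starts with "stock" or "type", so these scan the collection:
db.fruits.find({ "stock": "5", "category": "drama" }).explain()       // COLLSCAN
db.fruits.find({ "type": "new" }).explain()                           // COLLSCAN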
Edit
Aggregation => It depends on the stage: essentially only an initial $match (and sort) can use an index; everything after that happens in memory (check this for more details). For $lookup, the data is fetched using an index on the other collection if you have one on the matched field, but whatever you do with that data afterwards is done in memory. Logically, the fetching of data is where indexes are used anyway; for the sorting part, read the document linked above.
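As a rough sketch of that shape (the orders collection and its fruitId field are made up purely for illustration):
// The $match on an indexed field runs first and can use the index;
// the $lookup and anything after it are resolved in memory.
db.fruits.aggregate([
    { "$match": { "item": "Banana" } },
    { "$lookup": {
        "from": "orders",          // hypothetical other collection
        "localField": "_id",
        "foreignField": "fruitId", // hypothetical reference field
        "as": "orders"
    }}
])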

index Elasticsearch document with existing "id" field

I have documents that I want to index into Elasticsearch with an existing unique "id" field.
I get an array of documents from a REST API endpoint (e.g.: http://some.url/api/products) in no particular order, and if a document with that _id already exists in Elasticsearch it should be updated and reindexed.
I want to create a new document if no document with that _id exists in Elasticsearch, and update the document if it matches an existing one.
This could be done with:
PUT products/product/un1qu3-1d-b718-105973677e95
{
    "id": "un1qu3-1d-b718-105973677e95",
    "state": "packaged"
}
The basic idea is to use the provided "id" field to create or update a document. Extraction of _id from document fields seems to be deprecated (link), but indexing/reindexing documents with the "id" field can be done manually very easily with the Kibana dev tools, with Postman or with a cURL request.
I want to achieve this (re-)indexing of documents that I receive over this api endpoint programmatically.
Is it possible to achieve this with logstash or a simple cronjob? Does Elasticsearch provide any functionality for this? Or do I need to write some custom backend to achieve this?
I thought of either:
1) index the document into Elasticsearch with the "id" field of my document or
2) find an Elasticsearch query that first searches for the document with the specific "id" field and then updates the document.
I was unable to find a solution for either way and have no clue what a good approach would look like.
Can anyone point me into the right direction on how to achieve this, suggest a better approach or provide a solution?
Any help much appreciated!
Update
I solved the problem with the help of the accepted answer. I used Logstash, the Http_poller input plugin, this article: https://www.elastic.co/blog/new-way-to-ingest-part-1 and this elastic.co question: https://discuss.elastic.co/t/upsert-with-logstash/59116
My output of logstash looks like this at the moment:
output {
    elasticsearch {
        index => "products"
        document_type => "product"
        pipeline => "rename_id"
        document_id => "%{id}"
        doc_as_upsert => true
        action => "update"
    }
}
Update 2
just for the sake of completeness I added the "rename_id" pipeline
{
    "rename_id": {
        "description": "_description",
        "processors": [
            {
                "set": {
                    "field": "_id",
                    "value": "{{id}}"
                }
            }
        ]
    }
}
It works this way!
Thanks a lot!
Peter,
If I understand correctly, you want to ingest your documents into Elasticsearch and will have some updates for these documents in the future?
If that's the case,
- Use your document's primary key as the id for the Elasticsearch document.
- You can ingest the entire document with updated values; Elasticsearch will replace the previous document with the new one, given the primary key is the same. The old document with the same id will be deleted.
We use this approach for our search data.
You can use ingest pipelines to extract the id from the body, and the _create endpoint to only create a document if it does not already exist. Minor note: if you could specify the id on the client side, indexing would be faster, as adding a pipeline adds a certain overhead.
PUT _ingest/pipeline/my_pipeline
{
    "description": "_description",
    "processors": [
        {
            "set": {
                "field": "_id",
                "value": "{{id}}"
            }
        }
    ]
}
PUT twitter/tweet/1?op_type=create&pipeline=my_pipeline
{
    "foo" : "bar",
    "id" : "123"
}

GET twitter/tweet/123

# this call will fail, as the pipeline sets _id to 123 and that document already exists
PUT twitter/tweet/1?op_type=create&pipeline=my_pipeline
{
    "foo" : "bar",
    "id" : "123"
}
You can use a script to upsert (update or insert) your document:
POST /products/product/un1qu3-1d-b718-105973677e95/_update
{
    "script": {
        "inline": "ctx._source.state = \"packaged\"",
        "lang": "painless"
    },
    "upsert": {
        "id": "un1qu3-1d-b718-105973677e95",
        "state": "packaged"
    }
}
The above query finds the document with _id = "un1qu3-1d-b718-105973677e95".
If it finds such a document, it updates state to "packaged"; otherwise it creates a new document with the fields "id" and "state" (you can insert as many fields as you want).

Push Item to Array and Delete in the Same Request

I have a document that stores sensor data where the sensor readings are objects stored in an array. Example:
{
    "readings": [
        {
            "timestamp": 1499475320,
            "temperature": 121
        },
        {
            "timestamp": 1499475326,
            "temperature": 93
        },
        {
            "timestamp": 1499475340,
            "temperature": 142
        }
    ]
}
I know how to push/add an item to the "readings" array. But what I need is when I add an item to the array, I also want to "clean" the array by removing items that have "timestamp" value older than a cutoff time.
Is this possible in mongodb?
The way I see this you basically have two options here that have varying approaches.
Restrict Arrays to Capped Size
The first option here is "not exactly" what you are asking for, but it is the option with the least implementation and execution overhead. The variance from your question is that instead of "removing past a certain age", we instead simply place a "limit/cap" on the total number of entries in the array.
This is actually done using the $slice modifier to $push:
Model.update(
    { "_id": docId },
    { "$push": {
        "readings": {
            "$each": [{ "timestamp": 1499478496679, "temperature": 100 }],
            "$slice": -10
        }
    }}
)
In this case the -10 argument restricts the array to only the "last ten" entries counted from the end, since we are "appending" with $push. If you instead wanted the "latest" entry to be first, you would add $position and provide a "positive" value to $slice, which means "first ten" by contrast.
So it's not the same thing you asked for, but it is practical since the arrays do not have "unlimited growth" and you can simply "cap" them as each update is made and the "oldest" item will be removed once at the maximum length. This means the overall document never actually grows beyond a set size, and this is a very good thing for MongoDB.
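For completeness, a sketch of the $position variant mentioned above (prepending the new reading and keeping only the first ten):
Model.update(
    { "_id": docId },
    { "$push": {
        "readings": {
            "$each": [{ "timestamp": 1499478496679, "temperature": 100 }],
            "$position": 0,   // insert at the start of the array
            "$slice": 10      // keep only the first ten entries
        }
    }}
)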
Issue with Bulk Operations
The next case, which actually does exactly what you ask, uses "Bulk Operations" to issue "two" update operations in a "single" request to the server. The reason why it is "two" is because there is a rule that you cannot have different update operators "assigned to the same path" in a single update operation.
Therefore what you want actually involves a $push AND a $pull operation, and on the "same array path" we need to issue those as "separate" operations. This is where the Bulk API can help:
Model.collection.bulkWrite([
    { "updateOne": {
        "filter": { "_id": docId },
        "update": {
            "$pull": {
                "readings": { "timestamp": { "$lt": cutOff } }
            }
        }
    }},
    { "updateOne": {
        "filter": { "_id": docId },
        "update": {
            "$push": {
                "readings": { "timestamp": 1499478496679, "temperature": 100 }
            }
        }
    }}
])
This uses the .bulkWrite() method from the underlying driver which you access from the model via .collection as shown. This will actually return a BulkWriteOpResult within the callback or Promise which contains information about the actual operations performed within the "batch". In this case it will be the "matched" and "modified" numbers which will be appropriate to the operations that were actually performed.
Hence if the $pull did not actually "remove" anything since the timestamp values were actually newer than the given constraint, then the modified count would only reflect the $push operation. But most of the time this need not concern you, where instead you would just accept that the operations completed without error and did something according to what you actually asked.
Conclude
So the general case of "both" is that it's really all done in one request and one response. The difference is that "under the hood" the second approach, which matches your request, actually does "two" operations per request and therefore takes microseconds longer.
There is actually no reason why you could not "combine" the logic of both, removing entries past your cutOff as well as keeping a "cap" on the overall array size (see the sketch below). But the general idea here is that the first implementation, though not exactly what was asked, will do a "good enough" job of "housekeeping" with little to no additional overhead on the request, or indeed in the implementation of the actual code.
Also, whilst you can always "read the data" -> "modify" -> "save", that is not a really great pattern. For best performance as well as "consistency" without conflict, you should be using the atomic operations to modify, in just the same way as outlined here.
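For reference, a sketch of that combined form, again issuing two operations in one bulk request (docId and cutOff as before):
Model.collection.bulkWrite([
    // Remove anything older than the cutoff
    { "updateOne": {
        "filter": { "_id": docId },
        "update": {
            "$pull": { "readings": { "timestamp": { "$lt": cutOff } } }
        }
    }},
    // Append the new reading and cap the array at the last ten entries
    { "updateOne": {
        "filter": { "_id": docId },
        "update": {
            "$push": {
                "readings": {
                    "$each": [{ "timestamp": 1499478496679, "temperature": 100 }],
                    "$slice": -10
                }
            }
        }
    }}
])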

Why should I have multiple collections in a NoSql database (MongoDB)

Example:
If I have a database, let's call it world, then I can have a collection of countries.
The collection can have documents, one example on a document:
"_id" : "hiqgfuywefowehfnqfdqfe",
"name" : "Italy",
cities : {{"_id": "ihevfbwyfv", name : "Napoli"}, {"_id: "hjbyiu", name: "Milano"}}
Why or when should I create a new collection of documents? I could expand my documents inside my world collection instead of creating a new collection.
Is creating a new collection not like creating a new domain?
Separate collections offer the greatest querying flexibility. They are good if you need to select individual documents, need more control over querying, or have huge documents, and they require more work. Selecting embedded documents is more limited: a document, including all its embedded documents and arrays, cannot exceed 16 MB. Embedded documents are easy and fast, and good when you want the entire document, the document with a $slice of comments, or with no comments at all.
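For example, with embedded cities you can return just part of the array with a $slice projection, whereas a separate collection is queried in its own right; a rough sketch (collection and field names assumed for illustration):
// Embedded: return Italy with only the first two cities from the array
db.countries.find({ "name": "Italy" }, { "cities": { "$slice": 2 } })

// Separate collection: cities can be queried and indexed in their own right
// ("countryId" is a hypothetical reference field)
db.cities.find({ "countryId": "someCountryId", "name": "Milano" })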
A world collection should look like this:
You should create one document per country.
The following is a single document:
{
    "_id": "objectKey", // generated by mongo
    "countryName": "Italy",
    "cities": [
        {
            "id": 1,
            "cityName": "Napoli"
        },
        {
            "id": 2,
            "cityName": "Milano"
        }
    ]
}
Try to extend this collection with other countries, but there are some other situations where you should consider creating a new collection and some sort of relation:
If you use the mmapv1 storage engine and you wish to add cities over time, you should consider the relational way, because mmapv1 does not tolerate rapidly growing arrays. When you create a document with mmapv1, the storage engine gives the document padding for growing arrays. If this padding is filled, the document gets relocated on the hard drive, which results in disk fragmentation and performance loss. By the way, the default storage engine is now WiredTiger, which does not have this problem.
Do not nest documents too deeply. It will make it hard to run complex queries on them. For example, do not create documents like the following:
{
    "continentName": "Europe",
    "continentCountries": [
        {
            "countryName": "France",
            "cities": [
                {
                    "cityId": 1,
                    "cityName": "Paris"
                },
                {
                    another city
                }
            ]
        },
        {
            another country
        }
    ],
    "TimeZone": "GMT+1",
    "continentId": 1
}
In this situation you should instead give your country object a continent property with the continent name or id in it.
With MongoDB you should avoid relations most of the time, but it is not forbidden to create them. It is better to create a relation than to nest documents too deeply, but the best solution is to flip the hierarchical level of the documents, for example:
use 'country -> cities, continentId, etc...' instead of 'continent -> country -> city'
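A rough sketch of that flipped shape (field names are illustrative):
// One document per country; the continent becomes a simple reference field
{
    "_id": "someCountryId",        // hypothetical id
    "countryName": "France",
    "continentId": 1,
    "cities": [
        { "cityId": 1, "cityName": "Paris" },
        { "cityId": 2, "cityName": "Lyon" }
    ]
}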
useful reading: https://docs.mongodb.org/manual/core/data-modeling-introduction/

mongo operation speed : array $addToSet/$pull vs object $set/$unset

I have an index collection containing lots of terms, and a field items containing identifiers from another collection. Currently that field stores an array of documents, and docs are added with $addToSet, but I have some performance issues. It seems an $unset operation executes faster, so I plan to change the array of documents to a document of embedded documents.
Am I right to think that $set/$unset of fields is faster than $push/$pull of embedded documents into arrays?
EDIT:
After small tests, we see that set/unset is 4 times faster. On the
other hand, if I use an object instead of an array, it's a little
harder to count the number of properties (vs the length of the array),
and we were counting that a lot. But we can consider using $set every
time and adding a field with the number of items.
This is a document of the current index:
{
    "_id": ObjectId("5594dea2b693fffd8e8b48d3"),
    "term": "clock",
    "nbItems": NumberLong("1"),
    "items": [
        {
            "_id": ObjectId("55857b10b693ff18948ca216"),
            "id": NumberLong("123")
        },
        {
            "_id": ObjectId("55857b10b693ff18948ca217"),
            "id": NumberLong("456")
        }
    ]
}
Frequent update operations are:
* remove item : {$pull: {"items": {"id": 123}}}
* add item : {$addToSet: {"items": {"_id": ObjectId("55857b10b693ff18948ca216"), "id": 123}}}
* I can change $addToSet to $push and check for duplicates beforehand if performance is better
And this is what I plan to do:
{
    "_id": ObjectId("5594dea2b693fffd8e8b48d3"),
    "term": "clock",
    "nbItems": NumberLong("1"),
    "items": {
        "123": {
            "_id": ObjectId("55857b10b693ff18948ca216")
        },
        "456": {
            "_id": ObjectId("55857b10b693ff18948ca217")
        }
    }
}
* remove item : {$unset: {"items.123": true}}
* add item : {$set: {"items.123": {"_id": ObjectId("55857b10b693ff18948ca216"), "id": 123}}}
For information, these operations are made with pymongo (or can be done with PHP if there is a good reason to), but I don't think this is relevant.
As with any performance question, there are a number of factors which can come into play with an issue like this, such as indexes, need to hit disk, etc.
That being said, I suspect you are likely correct that adding a new field to or removing an old field from a MongoDB document will be slightly faster than appending to or removing from an array, as array types are less easy to traverse when searching for duplicates.
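If you do go the object route, the counting concern from the edit can be handled by maintaining the nbItems field in the same update; a sketch in shell syntax (the collection name "index" is assumed from the question's description):
// Add item "123" and bump the counter only if the key is not already present
db.index.update(
    { "term": "clock", "items.123": { "$exists": false } },
    {
        "$set": { "items.123": { "_id": ObjectId("55857b10b693ff18948ca216") } },
        "$inc": { "nbItems": 1 }
    }
)

// Remove item "123" and decrement the counter only if the key exists
db.index.update(
    { "term": "clock", "items.123": { "$exists": true } },
    {
        "$unset": { "items.123": true },
        "$inc": { "nbItems": -1 }
    }
)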