How to append to existing documents using mongoimport and csv files - mongodb

I am trying to use mongoimport to translate a one-to-many relational structure into MongoDB using CSV files. My approach is to import the "one" file and then use the upsert option to append the "many" records as a nested array, but it looks like the upsert only replaces the original document instead of appending.
Is this a limitation of mongoimport or could I be doing something wrong?

You can do upserts when using mongoimport, but you cannot use complex operators to modify the data as you would with a normal update operation. This is a limitation of mongoimport: essentially, each piece of data you import must be insert-ready even though you are using the upsert functionality, which basically works as a de-duplication mechanism for your input data.
If you wish to merge in a more sophisticated manner, it would be best to use one of the drivers and merge the data in your language of choice. This also has the advantage of avoiding potential issues with type fidelity and allows you to code around exceptions, etc.
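For example, a minimal sketch of that driver-based merge with pymongo, assuming hypothetical files parents.csv and children.csv where the parent rows carry an _id column and the child rows carry a parent_id column (adjust names and connection details to your data):

import csv
from pymongo import MongoClient, UpdateOne

# Assumed connection string and database/collection names.
coll = MongoClient("mongodb://localhost:27017")["mydb"]["parents"]

# Import the "one" side: one document per row of parents.csv.
with open("parents.csv", newline="") as f:
    coll.insert_many([dict(row) for row in csv.DictReader(f)])

# Merge the "many" side: push each child row into its parent's nested array.
with open("children.csv", newline="") as f:
    ops = [
        UpdateOne({"_id": row.pop("parent_id")},       # match the parent document
                  {"$push": {"children": dict(row)}})  # append the child record
        for row in csv.DictReader(f)
    ]
    coll.bulk_write(ops, ordered=False)

For very large "many" files you would flush the operations in batches rather than building one big list, but the shape of the update is the same.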

Related

Is there a way to back up only the structure without data when backing up mongodb?

When backing up MongoDB, I want to back up only the collection information for the database. Is there a way?
I am asking whether there is a way to back up only the collection structure, without the data, using mongodump.
It has been noted that MongoDB is and remains schema-less, so unless you use Mongoose or another ORM, it is difficult to get the schema directly from the data.
That being said, there is a tool called mongodb-schema that can read your documents and INFER a schema from them. It creates a probability metric for each of the potential fields and also the assigned type. It may be useful if you want to retrospectively analyse the collections without resorting to a dump and manual inspection.
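If you prefer to script a rough version of that inference yourself with a driver, a pymongo sketch that samples documents and tallies the top-level fields and their types (database and collection names are placeholders) could look like this:

from collections import Counter
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["mydb"]["mycollection"]  # assumed names

# Sample documents and count which top-level fields appear, and with which Python-side types.
docs = list(coll.aggregate([{"$sample": {"size": 1000}}]))  # $sample requires MongoDB 3.2+
field_types = Counter()
for doc in docs:
    for field, value in doc.items():
        field_types[(field, type(value).__name__)] += 1

for (field, type_name), count in field_types.most_common():
    print(f"{field}: {type_name} ({count}/{len(docs)} sampled documents)")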
You can also use MongoDB Compass to analyse your schemas. This is, again, based on the sampling of your data.
I believe this is the desired output, but if your use case needs something more specific, please update the question accordingly.

Faster way to remove 2TB of data from single collection (without sharding)

We collect a lot of data and have decided to migrate it from MongoDB into a data lake. We are going to keep only a portion of our data in Mongo and use it as our operational database, holding only the newest, most relevant data. We have a replica set, but we don't use sharding. I suspect that with a sharded cluster we could achieve the necessary result much more simply, but this is a one-time operation, so setting up a cluster just for it looks like a very complex solution (plus I also suspect that converting such a collection into a sharded collection would be a very long-running operation, but I could be completely wrong here).
One of our collections is about 2 TB in size right now. We want to remove the old data from the original database as fast as possible, but it looks like the standard remove operation is very slow, even if we use an unorderedBulkOperation.
I found a few suggestions to copy the data we want to keep into another collection and then just drop the original collection, instead of trying to remove data (i.e. migrate the data we want to keep rather than removing the data we don't). There are a few different ways I found to copy a portion of the data from the original collection into another collection:
Extract the data and insert it into the other collection one document at a time, or extract a portion of the data and insert it in bulk using insertMany(). It looks faster than just removing the data, but still not fast enough.
Use the $out operator with the aggregation framework to extract portions of the data. It's very fast! But it writes every portion of data into a separate collection and has no ability to append data in the current MongoDB version, so we would need to combine all the exported portions into one final collection, which is slow again. I see that $out will be able to append data in the next release of Mongo (https://jira.mongodb.org/browse/SERVER-12280), but we need a solution now, and unfortunately we won't be able to do a quick upgrade of our Mongo version anyway.
mongoexport / mongoimport - export a portion of the data into a JSON file and append it to another collection using import. It's quite fast too, so it looks like a good option.
Currently it looks like the best choice for improving the performance of the migration is a combination of the $out and mongoexport/mongoimport approaches, plus multithreading to perform several of the described operations at once.
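For concreteness, the baseline batched copy-then-drop loop I'm comparing against looks roughly like this (pymongo, with placeholder connection details, collection names, and keep_filter):

from datetime import datetime, timedelta
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]   # assumed connection and database
source, target = db["events"], db["events_keep"]        # placeholder collection names
keep_filter = {"createdAt": {"$gte": datetime.utcnow() - timedelta(days=90)}}  # placeholder retention rule

# Copy the documents we want to keep, in batches, instead of deleting the rest.
batch = []
for doc in source.find(keep_filter, batch_size=1000):
    batch.append(doc)
    if len(batch) == 1000:
        target.insert_many(batch, ordered=False)
        batch = []
if batch:
    target.insert_many(batch, ordered=False)

# Once the copy is verified, drop the original and rename the new collection into place.
source.drop()
target.rename("events")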
But is there an even faster option that I might have missed?

Update a collection of documents with mongodb

I have the following two documents in a mongo collection:
{
_id: "123",
name: "n1"
}
{
_id: "234",
name: "n2"
}
Let's suppose I read those two documents, and make changes, for example, add "!" to the end of the name.
I now want to save the two documents back.
For a single document there's save, and for new documents I can use insert to save an array of documents.
What is the solution for saving updates to those two documents? The update command asks for a query, but I don't need a query; I already have the documents, I just want to save them back...
I can update them one by one, but if there were 2 million documents instead of just two, this would not work so well.
One thing to add: we are currently using Mongo v2.4; we can move to 2.6 if bulk operations are the only solution for this (as they were added in 2.6).
For this you have two options (available in 2.6):
Bulk operations via tools such as mongoimport and mongorestore.
An upsert command for each document.
The first option works better with a huge number of documents (which is your case). With mongoimport you can use the --upsert flag to overwrite existing documents, or add the --drop flag to drop the existing collection and load the new documents.
This option scales well to large amounts of data in terms of I/O and system utilization.
An upsert works on the in-place update principle. You can use it with a filter, but the drawback is that it runs serially and shouldn't be used for huge data sizes; it performs well only with small amounts of data.
If you switch off write concern, a save doesn't block until the database has written and returns almost immediately. So with WriteConcern.Unacknowledged, storing 2 million documents with save is a lot quicker than you would think. But an unacknowledged write concern has the drawback that you won't get any errors back from the database.
When you don't want to save them one-by-one, bulk operations are the way to go.
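As an illustration, a sketch of the bulk route with the Python driver (this relies on the bulk write support that arrived with 2.6; the "!" modification comes from the question, the connection and collection names are assumptions):

from pymongo import MongoClient, UpdateOne

coll = MongoClient("mongodb://localhost:27017")["mydb"]["things"]  # assumed names

# Re-save the modified documents in batches instead of one round trip per update.
ops = []
for doc in coll.find({}, batch_size=1000):
    ops.append(UpdateOne({"_id": doc["_id"]},
                         {"$set": {"name": doc["name"] + "!"}}))
    if len(ops) == 1000:
        coll.bulk_write(ops, ordered=False)
        ops = []
if ops:
    coll.bulk_write(ops, ordered=False)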

Index Markdown Files with MongoDB

I am looking for a Document-Oriented-Database solution - MongoDB preferred - to index a continuously growing and frequently changing number of (pandoc) markdown files.
I read that MongoDB has a text indexer, but I have not worked with MongoDB before, and the only related thing I found was an indexing process for preprocessed HTML. The scenario I am thinking about is: an automatic indexing of the markdown files where the markdown syntax is used to create keys (for example ## FOO -> header2: FOO) and where the hierarchical structure of the key/value pairs is preserved as it appears in the document.
Is this possible with MongoDB alone, or do I always need a preprocessing step in which I transform the markdown into something like a BSON document and then ingest it into MongoDB?
Why do you want to use MongoDB for this? I think ElasticSearch is a much better fit for this purpose; it's basically built for indexing text. However - the same as with MongoDB - you won't get anything automatic and will need to process the document before saving it if you want to improve the precision of finding the documents. The whole document needs to be sent to ElasticSearch as a JSON object, but you can also store the whole unprocessed markdown text inside a property.
I'm not sure about MongoDB full-text indexes, but ElasticSearch also combines all indexed properties of a document for the full-text search. Additionally, you can define the importance of different properties in your index. For instance, the title might be more important than the rest of the text, ...
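Whichever store you choose, the preprocessing step could be as small as this sketch, which turns heading lines into header1/header2/... keys and stores the result in MongoDB via pymongo (the names and the flat "sections" convention are assumptions, not something MongoDB provides out of the box):

import re
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["docs"]["markdown"]  # assumed names

def markdown_to_doc(text, source_name):
    # Map "# Title" -> header1, "## Section" -> header2, ..., and keep the raw text too.
    doc = {"source": source_name, "raw": text, "sections": []}
    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)$", line)
        if m:
            doc["sections"].append({"header" + str(len(m.group(1))): m.group(2).strip()})
    return doc

with open("example.md", encoding="utf-8") as f:  # placeholder file name
    coll.insert_one(markdown_to_doc(f.read(), "example.md"))

Rebuilding the full hierarchy (nesting header3 sections under their header2 parent, and so on) takes a bit more bookkeeping, but the principle is the same: you parse the markdown yourself and hand the store a ready-made document.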

Create schema.xml automatically for Solr from mongodb

Is there an option to automatically generate a schema.xml for Solr from MongoDB? E.g. each field of a document, and of its subdocuments, from a collection should be indexed and become searchable by default.
As written in this SO answer, Solr's Schemaless Mode could help you:
Solr supports a Schemaless Mode. When starting Solr this way, you are initially not bound to a schema. When you give Solr a first document it will guess the appropriate field types and generate a schema that includes those field types for you. These fields are then fixed. You may still add new fields on the fly that way.
What you still need to do is create an import route of some kind from your MongoDB into Solr.
After googling a bit, you may stumble over the SO question - solr Data Import Handlers for MongoDB - which may help you on that part too.
It would probably be simpler to create a Mongo query whose result contains all the relevant information you require, save the result as JSON, and send that to Solr's direct update handler, which can parse JSON.
So in short
Create a new, empty core in Schemaless Mode
Create an import of some kind that covers all entities and attributes you want
Run the import
Check if the result is as you want it to be
As long as (4) is not satisfied you may delete the core and repeat these steps.
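A minimal sketch of that export-and-post idea, assuming a local Solr core named mycore, pymongo plus the requests library, and placeholder field names you would adapt to your documents:

import requests
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["mydb"]["articles"]  # assumed names

# Project only the fields you want indexed; schemaless mode will guess their types.
docs = list(coll.find({}, {"_id": 1, "title": 1, "body": 1}))
for doc in docs:
    doc["id"] = str(doc.pop("_id"))  # give Solr a plain string id; ObjectId is not JSON-serializable

# Post the JSON array to the core's update handler and commit in one go.
resp = requests.post(
    "http://localhost:8983/solr/mycore/update?commit=true",  # assumed core name and URL
    json=docs,
)
resp.raise_for_status()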
No, MongoDB does not provide this option. You will have to create a script that maps documents to XML.