Index Markdown Files with MongoDB

I am looking for a document-oriented database solution - MongoDB preferred - to index a continuously growing and frequently changing set of (pandoc) markdown files.
I read that MongoDB has a full-text indexer, but I have not worked with MongoDB before, and the only related thing I found was an indexing process for preprocessed HTML. The scenario I am thinking about is: automatic indexing of the markdown files, where the markdown syntax is used to create keys (for example, ## FOO -> header2: FOO) and where the hierarchical structure of the key/value pairs is preserved as it appears in the document.
Is this possible with MongoDB alone, or do I always need a preprocessing step in which I transform the markdown into something like a BSON document and then ingest it into MongoDB?

Why do you want to use MongoDB for this? I think ElasticSearch is a much better fit for this purpose; it's basically built for indexing text. However - the same as with MongoDB - you won't get anything automatic and will need to process the document before saving it if you want to improve the precision of finding the documents. The whole document needs to be sent to ElasticSearch as a JSON object, but you can also store the whole unprocessed markdown text in a property.
I'm not sure about MongoDB's full-text indexes, but ElasticSearch also combines all indexed properties of a document for the full-text search. Additionally, you can define the importance of different properties in your index; for instance, the title might be more important than the rest of the text.
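To make that preprocessing step concrete, here is a minimal sketch in Node.js (18+, for the built-in fetch), assuming a local ElasticSearch instance and an index named docs; the field names and the heading regex are illustrative, not a fixed schema:

const fs = require('fs');

// Turn markdown headings into separate fields so they can be weighted
// independently at search time; the raw markdown is kept in "raw".
function markdownToDoc(markdown) {
    const doc = { raw: markdown, header1: [], header2: [] };
    for (const line of markdown.split('\n')) {
        const m = line.match(/^(#{1,2})\s+(.*)$/);
        if (m) doc['header' + m[1].length].push(m[2].trim());
    }
    return doc;
}

async function indexFile(path, id) {
    const doc = markdownToDoc(fs.readFileSync(path, 'utf8'));
    // PUT the document into ElasticSearch; the index is created on first use.
    await fetch(`http://localhost:9200/docs/_doc/${id}`, {
        method: 'PUT',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(doc),
    });
}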


Explain Like I'm Five: Form w/ Text and Image Field > Routes > Controller > Write to MongoDB Document - GridFS goes where?

I have been trying to read the documentation for GridFS and MongoDB for a while. They just keep repeating the same thing, and I can't make sense of it.
Desired output: the user submits a form; that form contains many fields, but one is an image. The request needs to store the data in a collection and create a new document, which can be retrieved later. My main question is how to use GridFS in this situation to store an image in that document.
It says GridFS creates two collections in my database, files and chunks. So how do those collections relate to my other collection, which holds the other form data?
I assume there is a reference to these files and chunks collections; however, I can't make any sense of this. It's been a few days, and I feel like it's time to reach out to my StackOverflow community.
Can someone please explain to me the program flow and key points for how I can achieve my goal of storing an image in a document using gridfs?
GridFS is mentioned everywhere and seems to be popular, but I can't make sense of it. These moments of utter confusion usually result in a breakthrough, so I'm eager to learn from veterans and experts.
The GridFS collections are an internal implementation detail of GridFS. There is no linking between them and your other data.
To write to and read from GridFS, you would use the GridFS APIs provided by your driver. Generally this means that if you are saving, for example, some fields and a binary blob like an image, you would perform the save in two steps (one insert/update operation for the fields and a separate GridFS operation for the binary blob).
Can someone please explain to me the program flow and key points for how I can achieve my goal of storing an image in a document using gridfs?
You wouldn't store the image in your document. You would store the image in GridFS, and in your document you could include a reference to the GridFS file (those have their own ids).
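As a minimal sketch of that two-step flow with the Node.js driver (the database and collection names here are made up for illustration):

const { MongoClient, GridFSBucket } = require('mongodb');
const fs = require('fs');

async function saveFormSubmission(formFields, imagePath) {
    const client = await MongoClient.connect('mongodb://localhost:27017');
    const db = client.db('myapp');

    // Step 1: stream the image into GridFS; this is what fills the
    // internal fs.files and fs.chunks collections for you.
    const bucket = new GridFSBucket(db);
    const imageId = await new Promise((resolve, reject) => {
        const upload = bucket.openUploadStream('photo.png');
        fs.createReadStream(imagePath)
            .pipe(upload)
            .on('error', reject)
            .on('finish', () => resolve(upload.id));
    });

    // Step 2: save the ordinary form fields in your own collection,
    // keeping only the GridFS file id as a reference.
    await db.collection('submissions').insertOne({ ...formFields, imageId: imageId });
    await client.close();
}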

How to store lookup values in MongoDB?

I have a collection in the db which represents media files.
Among other info, I should store the format name. I wonder if there are best practices for storing info like that. Is it better to create a new collection for file formats and use a link to that collection, or to store the format name right in the file documents as plain text? What about performance and compression? There are supposed to be more than a billion documents in the db. What would Mongo experts suggest in this situation?
Embedded documents are the preferred approach.
In your case, this means it is better to store the file format in the same collection.
Putting the file format into a separate collection means creating a new file on disk.
That is the slower option and should only be used if any one of your documents exceeds 16 MB in size.
See these links for more information:
6 Rules of Thumb for MongoDB Schema Design
and
How to Program with MongoDB Using the .NET Driver
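In shell terms, the two options look roughly like this (collection and field names are made up):

// Embedded: the format name lives directly on each media document.
db.mediafiles.insertOne({ name: "clip.mkv", format: "matroska" });

// Referenced: a separate formats collection, joined at read time with $lookup.
db.formats.insertOne({ _id: 1, name: "matroska" });
db.mediafiles.insertOne({ name: "clip.mkv", formatId: 1 });
db.mediafiles.aggregate([
    { $lookup: { from: "formats", localField: "formatId", foreignField: "_id", as: "format" } }
]);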
I've done some benchmarks and figured out that, in my case, storing "lookup values" as plain text is more efficient in terms of disk space than either an embedded document or a reference to a separate collection. Sorry for the poor terminology.

Mongo search structure

I have several collections that should be searched when searching for words.
Those collections are then merged into a single search collection.
I have some fields that are full text and some fields that are an array of text tags. Those tags come from several contexts.
Considering that Mongo uses just ONE index per query, and that a compound index can only contain one array field, I am inclined to put all the tags into a single set-like array field of this search structure.
My question is: what about the full-text fields? Should I just concatenate them and add them as one huge text field?
Or should I skip Mongo altogether and put my fields into some other database?
Having used my solution in production for some time, I have reached the following conclusions:
Keeping all the arrays in a single field for indexing, even for large datasets, is good enough;
A specialized text engine, although it increases the technology stack, is more powerful and scalable, and I would recommend ElasticSearch.
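For reference, the MongoDB side of this can be sketched roughly as follows (collection and field names are made up). Note that a single text index can already span several fields with per-field weights, so concatenating them by hand is not strictly required:

// One multikey index for the tag array, one text index over the full-text fields.
db.search.createIndex({ tags: 1 });
db.search.createIndex(
    { title: "text", body: "text" },
    { weights: { title: 5, body: 1 } }
);
// A query can then combine a text search with a tag filter.
db.search.find({ $text: { $search: "keyword" }, tags: "news" });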

Create schema.xml automatically for Solr from mongodb

Is there an option to automatically generate a schema.xml for Solr from MongoDB? E.g., each field of a document, including subdocuments, from a collection should be indexed and become searchable by default.
As written in this SO answer, Solr's Schemaless Mode could help you:
Solr supports a Schemaless Mode. When starting Solr this way, you are initially not bound to a schema. When you give Solr a first document it will guess the appropriate field types and generate a schema that includes those field types for you. These fields are then fixed. You may still add new fields on the fly that way.
What you still need to do is create an import route of some kind from your MongoDB into Solr.
After googling a bit, you may stumble over the SO question solr Data Import Handlers for MongoDB, which may help you with that part too.
Probably simpler would be to create a Mongo query whose result contains all the relevant information you require, save the result as JSON, and send that to Solr's direct update handler, which can parse JSON (a sketch of this follows below).
So, in short:
1. Create a new, empty core in Schemaless Mode.
2. Create an import of some kind that covers all the entities and attributes you want.
3. Run the import.
4. Check whether the result is as you want it to be.
As long as step 4 is not satisfied, you may delete the core and repeat these steps.
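A minimal sketch of the import in Node.js (18+, for the built-in fetch), assuming a core called mycore and a collection called docs; both names are made up, and nested subdocuments may still need flattening since Solr fields are flat:

const { MongoClient } = require('mongodb');

async function importToSolr() {
    const client = await MongoClient.connect('mongodb://localhost:27017');
    const docs = await client.db('mydb').collection('docs').find().toArray();

    // Solr's JSON update handler accepts an array of documents;
    // map _id to Solr's "id" field and commit in the same request.
    const payload = docs.map(({ _id, ...fields }) => ({ id: _id.toString(), ...fields }));
    await fetch('http://localhost:8983/solr/mycore/update?commit=true', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(payload),
    });
    await client.close();
}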
No, MongoDB does not provide this option. You will have to create a script that maps documents to XML.

How do I sanitize sensitive fields in MongoDB?

Some of the fields in my MongoDB documents contain sensitive data, and when I use this data for testing I need to sanitize them.
The data was previously stored in MySQL and I did this with something like REPEAT('x', LENGTH(fieldName)).
I would like to keep the length of the sanitized fields the same as they were and ideally preserve whitespace.
Can anyone suggest a good way to do this in MongoDB?
Update
The sensitive data is stuff like performance review feedback that has been provided for employees, so when testers are using the app they must not see this data. I want to preserve the length of the strings and the whitespace so that the layout of the text is similar to what it is in production.
I was wondering if it would be possible to do this using some simple MongoDB operators, but haven't been able to find what I'm looking for.
The application is written in Java, and I am using Spring Data. In the case of MySQL, replacing characters with 'x' in the strings in Java and then updating the rows was slow, which is why I resorted to using REPEAT even though I lost the whitespace in the strings.
You can do this from a MongoDB shell:
db.myColl.find().forEach(function(doc) {
    doc.myField = Array(doc.myField.length + 1).join("X");
    db.myColl.save(doc);
});
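If you also want to preserve whitespace, as the question asks, a variant that masks only the non-whitespace characters keeps both the length and the layout intact (replaceOne assumes a reasonably recent shell, where save() is deprecated):

db.myColl.find().forEach(function(doc) {
    // \S matches any non-whitespace character, so spacing and line breaks survive.
    doc.myField = doc.myField.replace(/\S/g, "X");
    db.myColl.replaceOne({ _id: doc._id }, doc);
});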