Link PDF with collection/database in MongoDB

I am new to MongoDB. I have the following data: Empid, Name, Salary, and Resume (the resume is a PDF file).
I am able to insert the id, name, and salary using the mongo shell as follows:
db.test.insert({empid:100,Name:'Gaurav',Salary:1000});
I am using the mongofiles command to upload the resume to the database:
mongofiles -d test put "C:\resume.pdf"
So I am able to insert the data as well as the PDF into the database.
My question is how to relate/map empid 100 to the resume.

Because you're using the mongofiles utility to insert into GridFS, the file's metadata is stored in the fs.files collection and its contents in fs.chunks. The files live in separate collections because GridFS splits each file into chunks and keeps them apart from your own documents; it is a storage convention on top of ordinary collections, not a different engine.
mongofiles works only with file names, so you can either store the file name and query for it as shown below, or parse the utility's output after you call it.
After executing your mongofiles command you'll have:
db.fs.files.find();
{
_id: ObjectId('530191f8fc518a0ecdfd45a6'),
filename: "file.pdf",
chunkSize: 262144,
uploadDate: new Date(1392611832917),
md5: "b1ee25bcd665e2d6d7c4f4d6f08f44a3",
length: 40098
}
To link with your employee entry:
> db.test.insert({empid:100,Name:'Gaurav',Salary:1000, file: ObjectId("530191f8fc518a0ecdfd45a6")});
> db.test.find();
{ "_id" : ObjectId("530195bd58f2d10f8b6703a4"), "empid" : 100, "Name" : "Gaurav", "Salary" : 1000, "file" : ObjectId("530191f8fc518a0ecdfd45a6") }
If you also need to specify the database and collection, use a DBRef (http://docs.mongodb.org/manual/reference/database-references/).
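The same link can be made from application code instead of the shell. The following is a minimal sketch in Python, not a definitive implementation: the helper names store_resume and load_resume are hypothetical, fs is assumed to be a gridfs.GridFS instance, and employees is assumed to be a pymongo collection. Only put, get, insert_one, and find_one are called on them.

```python
def store_resume(fs, employees, empid, name, salary, pdf_bytes, filename):
    """Upload the PDF to GridFS and save its _id on the employee document."""
    file_id = fs.put(pdf_bytes, filename=filename)  # _id of the new fs.files entry
    employees.insert_one(
        {"empid": empid, "Name": name, "Salary": salary, "file": file_id}
    )
    return file_id


def load_resume(fs, employees, empid):
    """Follow the stored reference from the employee back to the PDF bytes."""
    employee = employees.find_one({"empid": empid})
    if employee is None or "file" not in employee:
        return None
    return fs.get(employee["file"]).read()


# Example wiring (assumes pymongo is installed and mongod runs locally):
#   from pymongo import MongoClient
#   import gridfs
#   db = MongoClient()["test"]
#   fs = gridfs.GridFS(db)
#   with open(r"C:\resume.pdf", "rb") as f:
#       store_resume(fs, db.test, 100, "Gaurav", 1000, f.read(), "resume.pdf")
```

Storing the id at upload time avoids having to look the file up by name afterwards, which matters if two employees upload files with the same name.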

You have a couple of options:
1. Add a field to your employee document that references the files collection by _id:
db.test.insert({empid: 100, Name: 'Gaurav', Salary: 1000, fileId: ObjectId("53019397a8f26bc570896972")});
I prefer this option because it lets you use files for different purposes without polluting it with fields created for specific needs like employee info. When you use mongofiles to put the file, it returns the ID of the newly created document. Use it as the value for fileId. The same works if you use a MongoDB driver and get back an id.
2. Add an empid field to the files collection. GridFS stores files in two collections: chunks and files (the GridFS documentation lists their fields). Not perfect, for the same reason as option 3.
3. Move all fields from your employee doc to the files collection. This is not the best practice if you plan to use files for anything other than resume storage.

Related

How to search values in real time on a badly designed database?

I have a collection named Company which has the following structure:
{
"_id" : ObjectId("57336ea1a7454c0100d889e4"),
"currentMonth" : 62,
"variables1": { ... },
...
"variables61": { ... },
"variables62" : {
"name" : "Test",
"email": "email@test.com",
...
},
"country" : "US",
}
My need is to be able to search for companies by name with up-to-date data. I don't have permission to change this data structure because many applications still use it. For the moment I haven't found a way to index these variables with this data structure, which makes the search slow.
Today each of these documents can be several megabytes in size and there are over 20,000 of them in this collection.
The system I want to implement uses a search engine to index the names of companies, but for that it needs to be able to detect changes in the collection.
MongoDB's change stream seems like a viable option but I'm not sure how to make it scalable and efficient.
Do you have any suggestions that would help me solve this problem? Any suggestion on the steps needed to set up the above system?
Usually with MongoDB you can add new fields to documents and existing applications would simply ignore the extra fields (though they naturally would not be populated by old code). Therefore:
1. Create a task that is regularly executed which goes through all documents in your collection, figures out the name for each document from its fields, then writes the name into a top-level field.
2. Add an index on that field.
3. In your search code, look up by the values of that field.
4. Compare the calculated name to the source-of-truth name. If different, discard the document.
If names don't change once set, step 1 only needs to go through documents that are missing the top-level name and step 4 is not needed.
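Steps 1 and 4 can be sketched in Python on plain dictionaries. This is only an illustration under assumptions: the top-level field name searchName is hypothetical, and the source-of-truth name is assumed to live at variables<currentMonth>.name, as in the sample document from the question.

```python
def current_name(company):
    """Return the name stored under variables<currentMonth>, or None."""
    month = company.get("currentMonth")
    variables = company.get(f"variables{month}")
    if not isinstance(variables, dict):
        return None
    return variables.get("name")


def name_update(company, field="searchName"):
    """Step 1: build the $set update, or None if the copy is already current."""
    name = current_name(company)
    if name is None or company.get(field) == name:
        return None
    return {"$set": {field: name}}


def is_fresh(company, field="searchName"):
    """Step 4: keep a search hit only if the copy matches the source of truth."""
    name = current_name(company)
    return name is not None and company.get(field) == name


company = {"currentMonth": 62, "variables62": {"name": "Test"}, "country": "US"}
print(name_update(company))
```

The update returned by name_update would be passed to an update call filtered on the document's _id, so existing applications that ignore the extra field are unaffected.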
Using the change detection pattern with monstache, I was able to synchronise MongoDB with Elasticsearch in real time, applying a filter based on the current month and then mapping the resulting variables to be indexed 🎊

Representing a file in MongoDB

I would like to process a CSV or Excel file, convert it into JSON and store it in MongoDB for a particular user. I would then like to do queries that filter depending on the user id, file name, or by attributes in the cells.
The method suggested to me is that each document would represent a row from the CSV/Excel file. I would add the filename and username to every single row.
Here is an example of one document (row)
{ user_id: 1, file_name: "fileName.csv", name: "Michael", surname: "Smith"},
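As a rough sketch of this row-per-document layout, the Python snippet below converts CSV text into documents shaped like the example above. The function name rows_to_documents and the sample data are made up for illustration.

```python
import csv
import io


def rows_to_documents(csv_text, user_id, file_name):
    """Turn each CSV row into one document tagged with its user and file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Column values are spread last, so a CSV column named user_id or
    # file_name would overwrite the tags; rename such columns beforehand.
    return [
        {"user_id": user_id, "file_name": file_name, **row}
        for row in reader
    ]


sample = "name,surname\nMichael,Smith\nAnna,Jones\n"
docs = rows_to_documents(sample, 1, "fileName.csv")
print(docs[0])
```

With documents in this shape, a compound index on user_id and file_name (an assumption about how the collection would be set up, not something stated in the question) lets MongoDB avoid scanning rows that belong to other users or files.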
The problem I have with this is that every time a query is executed it will have to go through the whole database and filter out any rows not associated with that user id or filename. If the database contained tens of millions of rows then surely this would be very slow?
The structure I think would be better is the one below, but I've been told it wouldn't be fast to query. I would have thought it would be quicker, as you just need to find one entry by user id, then the files you want to query, then the rows.
{
"user_id":1,
"files":[
{
"file_name":"fileName.csv",
"rows":[
{
"name":"Michael",
"surname":"Smith"
}
]
}
]
}
I'm still rather new to MongoDB so I'm sure it's just a lack of understanding on my part.
What is the best representation of the data?

Insert ObjectID in to mongo array elements in Talend tool

We are migrating data from Oracle to MongoDB using the Talend tool, and we need to add an ObjectId to each object inside an array. We tried using the attribute #type with the fixed value ObjectId, but it didn't work.
We need the output as below:
{
"_id":"12243",
"name": "ABCD",
"city":"XYZ",
"requests":[
{
"_id" : ObjectId("5efdcf15ea9355c419fc9699"), // How to generate this ObjectId using talend tool in Mongo
"type":"department",
"value":"Science"
},
{
"_id" : ObjectId("K279kkqasj8ac023878hjc"), // How to generate this ObjectId using talend tool in Mongo
"type":"department",
"value":"Commerce"
}
]
}
Based on your needs, I assume that manually generating the ObjectId should be enough. I suggest using:
either the standard MongoDB BSON Java library (recommended)
or generating this element yourself, as long as you follow the official MongoDB conventions: https://docs.mongodb.com/manual/reference/bson-types/#objectid
The recommended way means that you have to add the MongoDB BSON Java library to your Talend project, most likely by including it as a JAR (see links below). I will not explain how, as it is out of scope. Then simply do the following to add a correct _id to your embedded elements:
import org.bson.types.ObjectId;
// Either form creates a new, unique ObjectId.
ObjectId id = new ObjectId();
// or
ObjectId id = ObjectId.get();
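For the second route, generating the value yourself, the sketch below shows the 12-byte layout described in the MongoDB documentation: a 4-byte timestamp, a 5-byte random value fixed per process, and a 3-byte counter. It is written in Python for brevity and is only an illustration. The function name new_object_id_hex is hypothetical, and the same layout would have to be reproduced in a Talend Java routine.

```python
import itertools
import os
import struct
import time

_PROCESS_RANDOM = os.urandom(5)  # 5-byte random value, fixed per process
_COUNTER = itertools.count(int.from_bytes(os.urandom(3), "big"))  # random start


def new_object_id_hex(now=None):
    """Return a 24-character hex string laid out like a MongoDB ObjectId."""
    seconds = int(time.time() if now is None else now)
    timestamp = struct.pack(">I", seconds & 0xFFFFFFFF)       # 4 bytes, big-endian
    counter = (next(_COUNTER) & 0xFFFFFF).to_bytes(3, "big")  # 3 bytes, wraps around
    return (timestamp + _PROCESS_RANDOM + counter).hex()


print(new_object_id_hex())
```

Note that a valid ObjectId is exactly 24 hexadecimal characters, so a value such as "K279kkqasj8ac023878hjc" from the desired output would not be accepted as one.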
Related materials:
https://docs.mongodb.com/manual/reference/method/ObjectId/
How to generate unique object id in mongodb
https://mongodb.github.io/mongo-java-driver/4.0/bson/installation-guide/
https://mvnrepository.com/artifact/org.mongodb/bson

mongoimport .csv into existing collection and database

I have a database that contains a collection that already has documents in it. Now I'm trying to insert another .csv into the same database and collection. For example, one document in db.collection has:
Name: Bob
Age: 25
and an entry from the CSV I'm trying to upload looks like this:
Name: Bob
Age: 27
How can I import the new CSV without replacing any documents, just adding to the database so that both entries end up in db.collection?
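Assuming you're using mongoimport, take a look at the --mode option. The default mode, insert, only adds documents, so a CSV row is stored as a new document next to the existing one as long as it does not reuse an existing _id. If you want to update matching documents instead, the documentation describes merge mode as follows: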
With --mode merge, mongoimport enables you to merge fields from a new
record with an existing document in the database. Documents that do
not match an existing document in the database are inserted as usual.

Updating mongoData with MongoSpark

From the following tutorial provided by Mongo:
MongoSpark.save(centenarians.write.option("collection", "hundredClub").mode("overwrite"))
Am I correct in understanding that what is essentially happening is that Mongo is first dropping the collection, and then it's overwriting that collection with the new data?
My question is then is it possible to use the MongoSpark connector to actually update records in Mongo,
let's say I've got data that looks like
{"_id" : ObjectId(12345), "name" : "John" , "Occupation" : "Baker"}
What I would then like to do is to merge the record of the person from another file that has more details, i.e. that file looks like
{"name" : "John", "address" : "1800 some street"}
The goal is to update the record in Mongo so now the JSON looks like
{"_id" : ObjectId(12345), "name" : "John" , "address" : "1800 some street", "Occupation" : "Baker"}
Now here's the thing, let's assume that we just want to update John, and that there are millions of other records that we would like to leave as is.
There are a few questions here, I'll try to break them down.
What is essentially happening is that Mongo is first dropping the collection, and then it's overwriting that collection with the new data?
Correct. As of mongo-spark v2.x, if you specify mode overwrite, the MongoDB Connector for Spark will first drop the collection and then save the new result into it. See source snippet for more information.
My question is then is it possible to use the MongoSpark connector to actually update records in Mongo,
With the patch described in SPARK-66 (mongo-spark v1.1+), if a dataframe contains an _id field, the data will be upserted. This means any existing documents with the same _id value will be updated, and new documents whose _id value does not exist in the collection will be inserted.
What I would then like to do is to merge the record of the person from another file that has more details
As mentioned above, you need to know the _id value from your collection. Example steps:
1. Create a dataframe (A) by reading from your Person collection to retrieve John's _id value, i.e. ObjectId(12345).
2. Merge the _id value of ObjectId(12345) into your dataframe (B, from the other file with more information), using a unique field value to join the two dataframes (A and B).
3. Save the merged dataframe (C) without specifying overwrite mode.
we just want to update John, and that there are millions of other records that we would like to leave as is.
In that case, before you merge the two dataframes, filter out any unwanted records from dataframe B (the one from the other file with more details). In addition, when you call save(), specify mode append.
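The join itself can be illustrated without Spark. The sketch below uses plain Python lists of dictionaries in place of dataframes A and B, with name assumed to be the unique join field. The function name attach_ids is hypothetical, and this is not connector code.

```python
def attach_ids(existing, incoming, key="name"):
    """Inner-join incoming records onto existing ones and carry over _id."""
    by_key = {doc[key]: doc for doc in existing}
    merged = []
    for doc in incoming:
        match = by_key.get(doc.get(key))
        if match is not None:                # rows with no match are dropped
            merged.append({**match, **doc})  # existing fields plus new details
    return merged


people = [{"_id": 12345, "name": "John", "Occupation": "Baker"}]  # dataframe A
details = [{"name": "John", "address": "1800 some street"}]       # dataframe B
print(attach_ids(people, details))                                # dataframe C
```

Carrying the existing fields into the merged record is a deliberate choice here. As far as I know, some connector versions replace the whole document on save unless a write option such as replaceDocument is turned off, so check the write configuration for your version before relying on a partial update.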