MongoDB in Luigi Python

I would like to know if there is a way to output to MongoDB in Luigi. I see in the documentation that Luigi supports files (local FS, HDFS), S3 and PostgreSQL, but not MongoDB. If not, could someone explain why not? Maybe it is a bad idea to have it? I would like to store the data in a database so that I can then explore it by querying it. However, I am already using MongoDB and would not like to install another database. I do not need a relational database, since I only use the database to store and query (NoSQL) without relationships, so MongoDB is the best option for me.
Basically I need a task that reads the data and saves it in the database, and then a next task that takes this output and processes the data.
Any recommendation, suggestion or clarification is more than welcome. Thanks!

You can try using mortar-luigi.
It includes MongoDB tasks and an example showing how to use them.
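If you prefer to avoid an extra dependency, a plain Luigi Target backed by MongoDB is not much code. Below is a minimal sketch using pymongo; the database, collection and marker names are just illustrative assumptions, and newer Luigi releases also ship a luigi.contrib.mongodb module that is worth a look:

import luigi
from pymongo import MongoClient


class MongoCollectionTarget(luigi.Target):
    """Marks a task as done once a marker document exists in MongoDB."""

    def __init__(self, uri, database, collection, task_id):
        client = MongoClient(uri)
        self.data = client[database][collection]
        self.markers = client[database]["task_markers"]
        self.task_id = task_id

    def exists(self):
        # Luigi calls exists() to decide whether the task still has to run.
        return self.markers.find_one({"_id": self.task_id}) is not None

    def write(self, documents):
        self.data.insert_many(documents)
        self.markers.insert_one({"_id": self.task_id})  # flips exists() to True


class LoadRawData(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return MongoCollectionTarget(
            "mongodb://localhost:27017", "etl", "raw_data", f"load-{self.date}"
        )

    def run(self):
        docs = [{"value": i, "date": str(self.date)} for i in range(10)]
        self.output().write(docs)

A downstream task can then require LoadRawData and read the raw_data collection with pymongo in its own run() method.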

Related

Automated functions in MongoDB

I was wondering whether there is something like a script I can write on the MongoDB side that would, for example, delete an item in a list if it is older than a week.
I want the DB to do this check every day. Is there some kind of automated function I can set up on the DB to do this?
I could just write a few small methods to do it on the application side myself, but I remember my old SQL DB having this feature. Any help would be appreciated. Thanks.
If you are using self-managed MongoDB, the answer is no.
As with SQL, MongoDB does not support scheduled transactions. However, you can run scheduled jobs from your programming language and perform the operations there. For example: https://thecodebarbarian.com/node.js-task-scheduling-with-agenda-and-mongodb
If you are using MongoDB Atlas, then check out scheduled triggers: https://docs.mongodb.com/realm/triggers/scheduled-triggers/
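If you go the "scheduled job in your own code" route, here is a minimal Python sketch using pymongo and the standard library instead of the Node/Agenda example from that link; the database, collection and field names are assumptions:

import time
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
items = client["mydb"]["items"]

def purge_old_items():
    # Delete every document whose createdAt is more than a week old.
    cutoff = datetime.now(timezone.utc) - timedelta(days=7)
    result = items.delete_many({"createdAt": {"$lt": cutoff}})
    print("removed %d documents older than a week" % result.deleted_count)

while True:
    purge_old_items()
    time.sleep(24 * 60 * 60)  # once a day; a cron job would work just as well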
You might also want to check out the MongoDB TTL feature:
https://docs.mongodb.com/manual/tutorial/expire-data/
It won't run once a day like a script, but it will automatically remove data after a given amount of time.
Hope it helps!
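For the specific "delete items older than a week" case, a TTL index does exactly that. A minimal pymongo sketch (names are illustrative):

from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
items = client["mydb"]["items"]

# The TTL monitor removes documents once createdAt is older than
# expireAfterSeconds; it runs roughly every 60 seconds, not on a daily schedule.
items.create_index("createdAt", expireAfterSeconds=7 * 24 * 60 * 60)

items.insert_one({"name": "example", "createdAt": datetime.now(timezone.utc)})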

Use two MongoDB databases in one Spring Boot application

My Spring Boot API is supposed to read data from a collection in one database and, before returning the response, insert a document into a collection in another database.
I am looking for a quick and efficient way to do this. I searched and found that I can put two sets of entries in my application.properties and create two different MongoTemplate connections from them. But I am looking for a cleaner and more compact way to do this (if any).
Refer to https://github.com/Mohit-Hurkat/spring-boot-multi-mongo
It uses two templates, but in a clean and simple way.
See also https://github.com/Mohit-Hurkat/multi-tenant-spring-mongodb
You can use the change stream concept in MongoDB.
Whenever something changes in one database, you can automatically write the change to the other database.
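For what it's worth, here is that change-stream idea sketched with pymongo rather than Spring Data, since the mechanism is the same on the server side; the database and collection names are made up, and change streams require a replica set:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
source = client["read_db"]["orders"]
audit = client["write_db"]["order_audit"]

# Copy every newly inserted document from the source collection into a
# collection in the second database.
pipeline = [{"$match": {"operationType": "insert"}}]
with source.watch(pipeline) as stream:
    for change in stream:
        audit.insert_one(change["fullDocument"])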

Autocomplete on a large data set

I'm writing a project where I need autocomplete over a data set of 5 million objects (the schema differs between objects).
My first thought was to use SQL, but since the schema keeps changing it will not be fast, so I thought about MongoDB.
Two questions:
1 - Do you have working sample code that I can use?
2 - Is Mongo the best solution here? Will it be fast? Is there another NoSQL database I could use instead?
If time is critical and you want the fastest database, then Redis may be what you are looking for. Here is a link to the autocomplete blog post using Redis.
MongoDB is a great database with many great features, so it may be a good choice as well.
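The Redis approach from that blog post boils down to prefix lookups on a sorted set. A minimal sketch with redis-py (key names are illustrative):

import redis

r = redis.Redis()

# Index: every term gets score 0, so ZRANGEBYLEX orders members lexicographically.
for term in ["monitor", "mongodb", "mongoose", "mouse"]:
    r.zadd("autocomplete:terms", {term: 0})

def complete(prefix, limit=10):
    # "[prefix" .. "[prefix\xff" selects every member that starts with the prefix.
    p = prefix.encode()
    members = r.zrangebylex("autocomplete:terms", b"[" + p, b"[" + p + b"\xff",
                            start=0, num=limit)
    return [m.decode() for m in members]

print(complete("mon"))  # ['mongodb', 'mongoose', 'monitor']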

MongoDB to Redshift

We have a few collections in MongoDB that we wish to transfer to Redshift (on an automatic, incremental, daily basis).
How can we do it? Should we export the Mongo data to CSV?
I wrote some code to export data from Mixpanel into Redshift for a client. Initially the client was exporting to Mongo, but we found Redshift offered very large performance improvements for querying. So first of all we transferred the data out of Mongo into Redshift, and then we came up with a direct solution that transfers the data from Mixpanel to Redshift.
To store JSON data in Redshift you first need to create the SQL DDL that defines the schema in Redshift, i.e. a CREATE TABLE script.
You can use a tool like Variety to help as it can give you some insight into your Mongo schema. However it does struggle with big datasets - you might need to subsample your dataset.
Alternatively, DDLgenerator can generate DDL from various sources including CSV or JSON. It also struggles with large datasets (the dataset I was dealing with was 120 GB).
So in theory you could use mongoexport to generate CSV or JSON from Mongo and then run it through DDLgenerator to get a DDL.
In practice I found using JSON export a little easier because you don't need to specify the fields you want to extract. You need to select the JSON array format. Specifically:
mongoexport --db <your db> --collection <your_collection> --jsonArray > data.json
head data.json > sample.json
ddlgenerator postgresql sample.json
Here - because I am using head - I use a sample of the data to show the process works. However, if your database has schema variation, you want to compute the schema based on the whole database which could take several hours.
Next you upload the data into Redshift.
If you have exported JSON, you need to use Redshift's COPY from JSON feature, and you need to define a JSONPaths file to do this.
For more information check out the Snowplow blog - they use JSONPaths to map the JSON onto a relational schema. See their blog post about why people might want to read JSON into Redshift.
Turning the JSON into columns allows much faster querying than other approaches such as using JSON_EXTRACT_PATH_TEXT.
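If you drive the load from Python, the COPY with a JSONPaths file looks roughly like this with psycopg2; the table, bucket and IAM role names are placeholders:

import psycopg2

copy_sql = """
    COPY events
    FROM 's3://my-bucket/mongo-export/data.json'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
    JSON 's3://my-bucket/mongo-export/events_jsonpaths.json'
    TIMEFORMAT 'auto';
"""

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="warehouse", user="loader", password="...",
)
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # the with-block commits on success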
For incremental backups, it depends whether data is only being added or existing data is changing. For analytics it is normally the former. The approach I used is to export the analytics data once a day, then copy it into Redshift incrementally.
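One way to make the daily export incremental is to pass mongoexport a query that only matches yesterday's window; the createdAt field and the names below are assumptions, and the query uses MongoDB extended JSON:

import json
import subprocess
from datetime import datetime, timedelta, timezone

end = datetime.now(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
start = end - timedelta(days=1)

query = json.dumps({"createdAt": {
    "$gte": {"$date": start.strftime("%Y-%m-%dT%H:%M:%SZ")},
    "$lt": {"$date": end.strftime("%Y-%m-%dT%H:%M:%SZ")},
}})

subprocess.run([
    "mongoexport", "--db", "analytics", "--collection", "events",
    "--query", query, "--out", "events-%s.json" % start.date(),
], check=True)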
Here are some related resources although in the end I did not use them:
Spotify has an open-source project called Luigi - this code claims to upload JSON to Redshift, but I haven't used it so I don't know if it works.
Amiato have a web page that says they offer a commercial solution for loading JSON data into Redshift - but there is not much information beyond that.
This blog post discusses performing ETL on JSON datasources such as Mixpanel into Redshift.
A related Reddit question
A blog post about dealing with JSON arrays in Redshift
Honestly, I'd recommend using a third party here. I've used Panoply (panoply.io) and would recommend it. It will take your Mongo collections and flatten them into their own tables in Redshift.
AWS Database Migration Service (DMS) has added support for MongoDB and Amazon DynamoDB, so I think the best option to migrate from MongoDB to Redshift is now DMS.
MongoDB versions 2.6.x and 3.x as a database source
Document mode and table mode supported
Supports change data capture (CDC)
Details - http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
A few questions whose answers would help:
Is this an add-only, always-increasing incremental sync, i.e. is data only being added and never updated or removed, or is your Redshift instance interested only in additions?
Is data inconsistency due to deletes/updates happening at the source and not being fed to the Redshift instance acceptable?
Does it need to be a daily incremental batch, or can it happen in real time as the changes occur?
Depending on your situation, mongoexport may work for you, but you have to understand its shortcomings, which are described at http://docs.mongodb.org/manual/reference/program/mongoexport/ .
I had to tackle the same issue (though not on a daily basis).
As mentioned above, you can use mongoexport to export the data, but keep in mind that Redshift doesn't support arrays, so if your collection data contains arrays you'll find it a bit problematic.
My solution was to pipe the mongoexport output into a small utility program I wrote that transforms the mongoexport JSON rows into my desired CSV output.
Piping the output also allows you to run the process in parallel.
mongoexport allows you to add a MongoDB query to the command, so if your collection data supports it you can spawn N different mongoexport processes, pipe their results into the other program, and decrease the total runtime of the migration.
Later on, I uploaded the files to S3 and performed a COPY into the relevant table.
This should be a pretty easy solution.
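For reference, the piping utility described above can be very small. Here is a sketch that reads mongoexport's one-document-per-line output on stdin and writes CSV to stdout, JSON-encoding array and object fields since Redshift has no array type (the column names are illustrative):

import csv
import json
import sys

COLUMNS = ["_id", "name", "tags", "createdAt"]

writer = csv.writer(sys.stdout)
writer.writerow(COLUMNS)

for line in sys.stdin:
    doc = json.loads(line)
    row = []
    for col in COLUMNS:
        value = doc.get(col, "")
        if isinstance(value, (list, dict)):
            value = json.dumps(value)  # flatten arrays/objects into a string
        row.append(value)
    writer.writerow(row)

Usage would then be something like: mongoexport --db mydb --collection mycoll | python to_csv.py > out.csv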
Stitch Data is the best tool I've seen for replicating data incrementally from MongoDB to Redshift within a few clicks and minutes.
It automatically and dynamically detects DML and DDL changes for the replicated tables.

How to extract data from a Mongo collection for data warehouse use

My company is starting to use Mongo, and we are beginning to think about the best way to extract data from MongoDB and send it to our data warehouse.
My question focuses on the extract part of the process. As I see it, the best way is to expose an API on the service built on top of Mongo, which the ETL process (invoked by a job from the data warehouse) will call with a specific query, probably filtering on a time window (i.e. a start date and end date for every record).
Does that sound right, am I missing something, or is there maybe a better way?
Initially I was thinking about running mongoexport at some fixed interval, but according to the documentation that does not seem great performance-wise.
Thanks in advance!
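To illustrate the windowed extraction the question describes, here is a minimal pymongo sketch of the query such an API (or the ETL job itself) would run; the updatedAt field and the names are assumptions:

from datetime import datetime

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["appdb"]["events"]

def extract_window(start, end):
    # Return documents modified in [start, end) for the ETL job to load.
    return list(events.find({"updatedAt": {"$gte": start, "$lt": end}})
                      .sort("updatedAt", 1))

batch = extract_window(datetime(2023, 1, 1), datetime(2023, 1, 2))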
Give Pentaho Kettle a try:
https://anonymousbi.wordpress.com/2012/07/25/creating-pentaho-reports-from-mongodb/
I am using Alteryx Designer to extract from MongoDB with the dedicated connector and prepare the data to load into Tableau, with optional data prep in between.
It works pretty well!
Alteryx can write to most DBs, though...