How to extract data from a MongoDB collection for data warehouse use

My company is starting to use MongoDB, and we are thinking about the best way to extract data from it and send it to our data warehouse.
My question focuses on the extract part of the process. As I see it, the best way is to expose an API on the service that is built on top of MongoDB, which the ETL process (invoked by a job from the data warehouse) will call with a query for a specific time range (i.e. a start date and end date for every record).
Does that sound right, am I missing something, or is there a better way?
Initially I was thinking about running mongoexport at some fixed interval, but according to the documentation it does not seem great performance-wise.
Thanks in advance!
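For illustration only, here is a minimal sketch of the kind of time-range pull described above, using pymongo; the connection string, database, collection and the updatedAt field are placeholders, not anything from the actual setup:

from datetime import datetime
from pymongo import MongoClient

def extract_window(start, end):
    """Pull documents whose updatedAt falls inside [start, end)."""
    client = MongoClient("mongodb://localhost:27017")  # placeholder URI
    coll = client["mydb"]["events"]                    # placeholder db/collection
    # Each ETL run only reads documents touched inside the window,
    # so the job can move the window forward instead of re-reading everything.
    return list(coll.find({"updatedAt": {"$gte": start, "$lt": end}}))

batch = extract_window(datetime(2020, 1, 1), datetime(2020, 1, 2))

The same query could just as well sit behind the service's API rather than be run directly against MongoDB.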

Give Pentaho Kettle a try:
https://anonymousbi.wordpress.com/2012/07/25/creating-pentaho-reports-from-mongodb/

I am using Alteryx Designer to extract from MongoDB with the dedicated connector and prep my data to load into Tableau, with optional data prep in between.
Works pretty well!
Alteryx can write to most DBs, though...

Related

Azure Data Factory - querying Cosmos DB using dynamic timestamps

I want to create and maintain a snapshot of a collection in Cosmos DB.
Periodically, I want to retrieve only the delta (new or modified documents) from Cosmos and write them to the snapshot, which will be stored in an Azure Data Explorer cluster.
I wish to get the delta using the _ts member of the documents. In other words, I will fetch only records for which the _ts is between some range.
The range will be the range of a time window, which I get using a tumbling window trigger in the data factory.
The issue is that if I print the dynamic timestamps which I create in the query and hard-code them into the query, it works. But if I let the query generate them, I don't get any results.
For example:
I'm using those values to simulate the window range of the trigger.
I use this query to create timestamps in Unix time,
and I see that the timestamps created are correct.
If I run my query using those hardcoded timestamps, I get results.
But if I run a query using the code that just creates those timestamps, I get no results from the query.
This is the code that creates the timestamps:
select
DateTimeToTimestamp('#{formatDateTime('2020-05-20T12:00:00.0000000Z','yyyy-MM-ddTHH:mm:ss.fffffffZ')}')/1000,
DateTimeToTimestamp('#{formatDateTime('2020-08-20T12:00:00.0000000Z','yyyy-MM-ddTHH:mm:ss.fffffffZ')}')/1000
Does anyone have a clue as to what might be the issue?
Any other way to achieve this is also welcome.
Thanks
EDIT: I managed to work around this by taking the other, simpler option:
where TimestampToDateTime(c._ts*1000)> "#{formatDateTime(pipeline().parameters.windowStart,'yyyy-MM-ddTHH:mm:ss.fffffffZ')}"
We are glad that you resolved this problem. You managed to work around it by taking the other, simpler option:
where TimestampToDateTime(c._ts*1000)> "#{formatDateTime(pipeline().parameters.windowStart,'yyyy-MM-ddTHH:mm:ss.fffffffZ')}"
I think the error in the first option is most likely caused by a data type mismatch between c._ts and DateTimeToTimestamp('#{formatDateTime('2020-05-20T12:00:00.0000000Z','yyyy-MM-ddTHH:mm:ss.fffffffZ')}')/1000.

Automated functions in MongoDB

I was wondering if there is something like a script that I can write on the MongoDB side that would, for example, delete an item in a list if it is older than a week.
I want the DB to run this check every day. Is there some kind of automated function that I can set up on the DB to do this?
I could just write a few small methods to do it on the application side myself, but I remember my old SQL DB having this feature. Any help would be appreciated. Thanks.
If you are using self-managed MongoDB, the answer is no.
Like SQL, MongoDB doesn't support scheduled transactions. However, you can run scheduled jobs in your programming language and perform the operations there (a sketch of this is below). For example: https://thecodebarbarian.com/node.js-task-scheduling-with-agenda-and-mongodb
If you are using MongoDB Atlas, then check out scheduled triggers: https://docs.mongodb.com/realm/triggers/scheduled-triggers/
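For illustration, a minimal sketch of the application-side scheduled cleanup mentioned above, using pymongo; the URI, database, collection and the createdAt field are placeholder assumptions:

from datetime import datetime, timedelta
from pymongo import MongoClient

def purge_old_items():
    """Delete documents whose createdAt is more than a week old."""
    coll = MongoClient("mongodb://localhost:27017")["mydb"]["items"]  # placeholders
    cutoff = datetime.utcnow() - timedelta(weeks=1)
    result = coll.delete_many({"createdAt": {"$lt": cutoff}})
    print(f"deleted {result.deleted_count} documents")

# Run once a day from cron, APScheduler, Celery beat, or any other scheduler.
purge_old_items()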
You might also want to check out the MongoDB TTL feature:
https://docs.mongodb.com/manual/tutorial/expire-data/
It won't run every day like a script, but it will automatically remove data after some time.
Hope it helps!
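A minimal sketch of a TTL index with pymongo; the URI, collection and the createdAt field are placeholder assumptions. MongoDB's background TTL monitor then removes expired documents on its own (roughly once a minute), so no daily script is needed:

from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["mydb"]["items"]  # placeholders

# Documents are deleted automatically about one week after their createdAt value.
coll.create_index("createdAt", expireAfterSeconds=7 * 24 * 60 * 60)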

MongoDB in Luigi Python

I would like to know if there is a way to output to MongoDB in Luigi. I see in the documentation that it supports files (local FS, HDFS), S3 and PostgreSQL, but not MongoDB. If not, could someone explain why? Maybe it is a bad idea to have it?
I would like to store the data in a database so that I can explore it by querying it. However, I am already using MongoDB and would rather not install another database. I do not need a relational database, since I only need to store and query data without relationships (NoSQL), so MongoDB is the best option for me.
Basically I need one task that reads the data and saves it to the database, and then a next task that takes this output and processes the data.
Any recommendation, suggestion or clarification is more than welcome. Thanks!
You can try using mortar-luigi.
Check out this link for MongoDB tasks and this example.
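If mortar-luigi doesn't fit, a rough sketch of rolling it yourself with a plain Luigi task plus pymongo is below; the input path, URI, collection and marker file are placeholder assumptions, and a local marker file stands in for a real MongoDB target:

import json
import luigi
from pymongo import MongoClient

class LoadIntoMongo(luigi.Task):
    """Read records from a JSON-lines file and insert them into MongoDB."""
    input_path = luigi.Parameter(default="data.jsonl")  # placeholder input

    def output(self):
        # Marker file recording that the load completed, so downstream tasks can require() this one.
        return luigi.LocalTarget("load_into_mongo.done")

    def run(self):
        coll = MongoClient("mongodb://localhost:27017")["mydb"]["records"]  # placeholders
        with open(self.input_path) as fh:
            docs = [json.loads(line) for line in fh if line.strip()]
        if docs:
            coll.insert_many(docs)
        with self.output().open("w") as marker:
            marker.write("ok")

if __name__ == "__main__":
    luigi.build([LoadIntoMongo()], local_scheduler=True)

A downstream processing task can then list LoadIntoMongo in its requires() and read the collection back with pymongo.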

MongoDB to DynamoDB

I have a database currently in Mongo running on an EC2 instance and would like to migrate the data to DynamoDB. Is this possible, and what is the most cost-effective way to achieve it?
When you ask for a "cost effective way" to migrate data, I assume you are looking for existing technologies that can ease your life. If so, you could do the following:
Export your MongoDB data to a text file, say in TSV format, using mongoexport.
Upload that file somewhere in S3.
Import the data from S3 into DynamoDB using AWS Data Pipeline (the first two steps are sketched below).
Of course, you should design & finalize your DynamoDB table schema before doing all this.
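For illustration, a minimal sketch of the first two steps driven from Python (using CSV output here); it assumes the mongoexport binary is on the PATH and boto3 credentials are configured, and the database, collection, field list and bucket are placeholders:

import subprocess
import boto3

# Step 1: export a collection with mongoexport (CSV output requires --fields).
subprocess.run(
    [
        "mongoexport",
        "--db", "mydb",                            # placeholder database
        "--collection", "orders",                  # placeholder collection
        "--type", "csv",
        "--fields", "orderId,amount,createdAt",    # placeholder field list
        "--out", "orders.csv",
    ],
    check=True,
)

# Step 2: upload the export to S3 so AWS Data Pipeline can import it into DynamoDB.
boto3.client("s3").upload_file("orders.csv", "my-bucket", "exports/orders.csv")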
Whenever you are changing databases, you have to be very careful about the way you migrate data. Certain data formats maintain type consistency, while others do not.
Then there are data formats that simply cannot handle your schema. For example, CSV is great when the data is one row per entry, but how do you render an embedded array in CSV? It really isn't possible. JSON is good at this, but JSON has its own problems.
The easiest example of this is JSON and DateTime. JSON does not have a specification for storing DateTime values; they can end up as ISO 8601 dates, UNIX epoch timestamps, or really anything a developer can dream up. What about Longs, Doubles and Ints? JSON has a single number type, which can cause loss of precision if values are not deserialized carefully.
This makes it very important that you choose the appropriate translation medium. That generally means you have to roll your own solution: load up the drivers for both databases, read an entry from one, translate it, and write it to the other. This is the best way to be absolutely sure that errors are handled properly for your environment, that types are kept consistent, and that the code properly translates the schema from source to destination (if necessary).
What does this all mean for you? It means a lot of legwork. It is possible somebody has already rolled something broad enough for your case, but I have found in the past that it is best to do it yourself.
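A minimal sketch of that read-translate-write loop, assuming pymongo and boto3; the collection, table name and translation rules below are placeholders that would have to match your real schema:

from datetime import datetime
from decimal import Decimal
import boto3
from bson import ObjectId
from pymongo import MongoClient

def translate(value):
    """Map BSON values onto types the DynamoDB document API accepts."""
    if isinstance(value, ObjectId):
        return str(value)
    if isinstance(value, datetime):
        return value.isoformat()        # dates stored as ISO 8601 strings
    if isinstance(value, float):
        return Decimal(str(value))      # DynamoDB numbers must be Decimal, not float
    if isinstance(value, dict):
        return {k: translate(v) for k, v in value.items()}
    if isinstance(value, list):
        return [translate(v) for v in value]
    return value

source = MongoClient("mongodb://localhost:27017")["mydb"]["orders"]  # placeholder source
target = boto3.resource("dynamodb").Table("orders")                  # placeholder table

with target.batch_writer() as batch:
    for doc in source.find():
        batch.put_item(Item=translate(doc))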
I know this post is old, but Amazon has since made this possible with AWS DMS; check this document:
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
Some relevant parts:
Using an Amazon DynamoDB Database as a Target for AWS Database Migration Service
You can use AWS DMS to migrate data to an Amazon DynamoDB table. Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. AWS DMS supports using a relational database or MongoDB as a source.

MongoDB to Redshift

We have a few collections in MongoDB that we wish to transfer to Redshift (on an automatic, incremental, daily basis).
How can we do it? Should we export the Mongo data to CSV?
I wrote some code to export data from Mixpanel into Redshift for a client. Initially the client was exporting to Mongo, but we found Redshift offered very large performance improvements for querying. So first of all we transferred the data out of Mongo into Redshift, and then we came up with a direct solution that transfers the data from Mixpanel to Redshift.
To store JSON data in Redshift you first need to create the SQL DDL to hold the schema in Redshift, i.e. a CREATE TABLE script.
You can use a tool like Variety to help, as it can give you some insight into your Mongo schema. However, it does struggle with big datasets - you might need to subsample your dataset.
Alternatively, DDLgenerator can generate DDL from various sources including CSV or JSON. It also struggles with large datasets (the dataset I was dealing with was 120 GB).
So in theory you could use mongoexport to generate CSV or JSON from Mongo and then run it through DDLgenerator to get a DDL.
In practice I found the JSON export a little easier because you don't need to specify the fields you want to extract. You need to select the JSON array format. Specifically:
mongoexport --db <your db> --collection <your_collection> --jsonArray > data.json
head data.json > sample.json
ddlgenerator postgresql sample.json
Here - because I am using head - I use a sample of the data to show that the process works. However, if your database has schema variation, you will want to compute the schema based on the whole database, which could take several hours.
Next you upload the data into Redshift.
If you have exported JSON, you need to use Redshift's COPY from JSON feature, and you need to define a JSONPaths file to do this (see the sketch below).
For more information check out the Snowplow blog - they use JSONPaths to map the JSON onto a relational schema. See their blog post about why people might want to read JSON into Redshift.
Turning the JSON into columns allows much faster queries than other approaches such as JSON_EXTRACT_PATH_TEXT.
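For illustration, a minimal sketch of the COPY step driven from Python with psycopg2, assuming the export and a JSONPaths file already sit in S3; the table, bucket, IAM role and connection details are placeholders:

import psycopg2

COPY_SQL = """
    COPY analytics.events
    FROM 's3://my-bucket/exports/data.json'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS JSON 's3://my-bucket/exports/jsonpaths.json';
"""

# Placeholder connection details for the Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="warehouse",
    user="etl_user",
    password="...",
)
with conn, conn.cursor() as cur:
    # The JSONPaths file maps JSON fields onto the table's columns.
    cur.execute(COPY_SQL)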
For incremental backups, it depends whether data is being added or data is changing. For analytics it is normally the former. The approach I used is to export the analytic data once a day, then copy it into Redshift incrementally.
Here are some related resources although in the end I did not use them:
Spotify has an open-source project called Luigi - this code claims to upload JSON to Redshift, but I haven't used it so I don't know if it works.
Amiato has a web page saying they offer a commercial solution for loading JSON data into Redshift - but there is not much information beyond that.
This blog post discusses performing ETL on JSON datasources such as Mixpanel into Redshift.
Related Reddit question
Blog post about dealing with JSON arrays in Redshift
Honestly, I'd recommend using a third party here. I've used Panoply (panoply.io) and would recommend it. It'll take your Mongo collections and flatten them into their own tables in Redshift.
AWS Database Migration Service (DMS) has added support for MongoDB and Amazon DynamoDB, so I think the best option going forward for migrating from MongoDB to Redshift is DMS.
MongoDB versions 2.6.x and 3.x supported as a database source
Document mode and table mode supported
Supports change data capture (CDC)
Details - http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
A few questions whose answers would be helpful to know:
Is this an add-only, always-increasing incremental sync, i.e. is data only being added and not updated or removed, or is your Redshift instance interested only in additions?
Is data inconsistency caused by deletes / updates happening at the source and not being fed to the Redshift instance acceptable?
Does it need to be a daily incremental batch, or can it be realtime as it happens?
Depending on your situation, mongoexport may work for you, but you have to understand its shortcomings, which can be found at http://docs.mongodb.org/manual/reference/program/mongoexport/ .
I had to tackle the same issue (though not on a daily basis).
As mentioned, you can use mongoexport to export the data, but keep in mind that Redshift doesn't support arrays, so if your collections' data contains arrays you'll find it a bit problematic.
My solution was to pipe the mongoexport output into a small utility program I wrote that transforms the mongoexport JSON rows into my desired CSV output (sketched below).
Piping the output also allows you to parallelize the process.
mongoexport allows you to add a MongoDB query to the command, so if your collection data supports it you can spawn N different mongoexport processes, pipe their results into the other program, and decrease the total runtime of the migration.
Later on, I uploaded the files to S3 and performed a COPY into the relevant table.
This should be a pretty easy solution.
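A minimal sketch of that kind of filter, assuming mongoexport's default one-JSON-document-per-line output is piped in on stdin; the field list is a placeholder, arrays are flattened into a single delimited string because Redshift has no array type, and any Extended JSON values (e.g. {"$oid": ...}) are kept as JSON text:

import csv
import json
import sys

FIELDS = ["_id", "name", "tags", "createdAt"]  # placeholder column order

writer = csv.writer(sys.stdout)
writer.writerow(FIELDS)

for line in sys.stdin:
    if not line.strip():
        continue
    doc = json.loads(line)
    row = []
    for field in FIELDS:
        value = doc.get(field)
        if isinstance(value, list):
            value = "|".join(str(v) for v in value)  # flatten arrays for Redshift
        elif isinstance(value, dict):
            value = json.dumps(value)                # keep sub-documents / Extended JSON as text
        row.append(value)
    writer.writerow(row)

It can then be run as, for example, mongoexport --db mydb --collection mycoll | python to_csv.py > mycoll.csv, and several such pipelines can run in parallel over disjoint query ranges.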
Stitch Data is the best tool I've ever seen for replicating incrementally from MongoDB to Redshift within a few clicks and minutes.
It automatically and dynamically detects DML and DDL changes in the tables being replicated.