MongoDB into AWS Redshift

We've got a pretty big MongoDB instance with sharded collections. It's reached a point where it's becoming too expensive to rely on MongoDB's query capabilities (including the aggregation framework) for insight into the data.
I've looked around for options to make the data available and easier to consume, and have settled on two promising options:
AWS Redshift
Hadoop + Hive
We want to be able to use a SQL-like syntax to analyze our data, and we want close to real-time access to the data (a few minutes' latency is fine, we just don't want to wait for the whole MongoDB to sync overnight).
As far as I can gather, for option 2, one can use this https://github.com/mongodb/mongo-hadoop to move data over from MongoDB to a Hadoop cluster.
I've looked high and low, but I'm struggling to find a similar solution for getting MongoDB into AWS Redshift. From looking at Amazon articles, it seems like the correct way to go about it is to use AWS Kinesis to get the data into Redshift. That said, I can't find any example of someone that did something similar, and I can't find any libraries or connectors to move data from MongoDB into a Kinesis stream. At least nothing that looks promising.
Has anyone done something like this?

I ended up coding up our own migrator using NodeJS.
I got a bit irritated with answers explaining what Redshift and MongoDB are, so I decided to take the time to share what I had to do in the end.
Timestamped data
Basically we ensure that all our MongoDB collections that we want to be migrated to tables in redshift are timestamped, and indexed according to that timestamp.
Plugins returning cursors
We then code up a plugin for each migration that we want to do from a mongo collection to a redshift table. Each plugin returns a cursor, which takes the last migrated date into account (passed to it from the migrator engine), and only returns the data that has changed since the last successful migration for that plugin.
How the cursors are used
The migrator engine then uses this cursor, and loops through each record.
It calls back to the plugin for each record, to transform the document into an array, which the migrator then uses to create a delimited line which it streams to a file on disk. We use tabs to delimit this file, as our data contained a lot of commas and pipes.
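The original migrator was written in NodeJS; purely to illustrate the plugin shape described above (a cursor that respects the last migrated date, plus a transform), here is a minimal Python sketch. The collection name, the updated_at field and the column list are assumptions for illustration, not the author's actual schema.

# Illustrative sketch only (the real migrator was NodeJS). Assumes pymongo
# and an indexed `updated_at` timestamp on every collection to be migrated.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]

def employees_cursor(last_migrated):
    """Return only documents changed since the last successful migration."""
    query = {"updated_at": {"$gt": last_migrated}} if last_migrated else {}
    return db["employees"].find(query).sort("updated_at", 1)

def employees_transform(doc):
    """Flatten a document into an ordered list of column values."""
    return [str(doc["_id"]), doc.get("name", ""), doc.get("is_deleted", False)]

def export_to_tsv(cursor, transform, path):
    """Stream each record as a tab-delimited line to a file on disk."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in cursor:
            f.write("\t".join(str(v) for v in transform(doc)) + "\n")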
Delimited exports from S3 into a table on redshift
The migrator then uploads the delimited file onto S3, and runs the redshift copy command to load the file from S3 into a temp table, using the plugin configuration to get the name and a convention to denote it as a temporary table.
So for example, if I had a plugin configured with a table name of employees, it would create a temp table with the name of temp_employees.
Now we've got data in this temp table, and the records in it keep their ids from the originating MongoDB collection. This allows us to run a delete against the target table, in our example the employees table, for every id present in the temp table. If any of the tables don't exist, they get created on the fly, based on a schema provided by the plugin. Then we insert all the records from the temp table into the target table. This caters for both new records and updated records. We only do soft deletes on our data, so deleted records are simply updated with an is_deleted flag in Redshift.
Once this whole process is done, the migrator engine stores a timestamp for the plugin in a Redshift table, in order to keep track of when the migration last ran successfully for it. This value is then passed to the plugin the next time the engine decides it should migrate data, allowing the plugin to use the timestamp in the cursor it provides to the engine.
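For concreteness, here is a hedged sketch of that S3-to-Redshift merge, assuming psycopg2, an employees table keyed on id, and a migration_runs bookkeeping table; the bucket, IAM role and connection details are placeholders, not the author's actual setup.

import psycopg2

STATEMENTS = [
    # staging table shaped like the target
    "CREATE TEMP TABLE temp_employees (LIKE employees)",
    # load the tab-delimited export from S3
    """COPY temp_employees
       FROM 's3://my-bucket/exports/employees.tsv'
       IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
       DELIMITER '\\t'""",
    # upsert: delete rows that are about to be replaced, then insert everything
    "DELETE FROM employees USING temp_employees WHERE employees.id = temp_employees.id",
    "INSERT INTO employees SELECT * FROM temp_employees",
    # remember when this plugin last ran successfully
    "INSERT INTO migration_runs (plugin, last_migrated_at) VALUES ('employees', GETDATE())",
]

conn = psycopg2.connect("host=my-cluster dbname=analytics user=loader password=...")
with conn, conn.cursor() as cur:      # commits on successful exit
    for statement in STATEMENTS:
        cur.execute(statement)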
So in summary, each plugin/migration provides the following to the engine:
A cursor, which optionally uses the last migrated date passed to it from the engine, in order to ensure that only deltas are moved across.
A transform function, which the engine uses to turn each document in the cursor into a delimited string, which gets appended to an export file.
A schema file: a SQL file containing the schema for the table in Redshift.

Redshift is a data warehousing product and MongoDB is a NoSQL DB. Clearly, they are not replacements for each other; they can co-exist and serve different purposes. Now, how to save and update records in both places:
You can move all MongoDB data to Redshift as a one-time activity.
Redshift is not a good fit for real-time writes. For a near-real-time sync to Redshift, you should modify the program that writes into MongoDB.
Let that program also write into an S3 location. The S3-to-Redshift movement can then be done at a regular interval.
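As a rough sketch of that dual-write idea (not a production design; the bucket name and key layout are placeholders, and boto3 is assumed):

# Alongside the normal MongoDB write, also drop the record into S3.
# A scheduled job (cron, Lambda, etc.) can then COPY the accumulated
# objects into Redshift at a regular interval.
import json
import uuid
import boto3

s3 = boto3.client("s3")

def write_record(collection, record):
    collection.insert_one(record)                      # existing MongoDB write path
    key = f"incoming/{uuid.uuid4()}.json"              # hypothetical key layout
    s3.put_object(Bucket="my-staging-bucket", Key=key,
                  Body=json.dumps(record, default=str))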

MongoDB being a document storage engine, Apache Solr and Elasticsearch can be considered as possible replacements. But they do not support SQL-type querying capabilities; they basically use a different filtering mechanism. For example, for Solr you might need to use the DisMax query parser.
In the cloud, Amazon CloudSearch or Azure Search would be compelling options to try as well.

You can now easily use AWS DMS to migrate data to Redshift, and you can also replicate ongoing changes in real time with it.

Related

streaming PostgreSQL tables into Google BigQuery

I would like to automatically stream data from an external PostgreSQL database into a Google Cloud Platform BigQuery database in my GCP account. So far, I have seen that one can query external databases (MySQL or PostgreSQL) with the EXTERNAL_QUERY() function, e.g.:
https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries
But for that to work, the database has to be in GCP Cloud SQL. I tried to see what options there are for streaming from the external PostgreSQL into a Cloud SQL PostgreSQL database, but I could only find information about replicating it as a one-time copy, not streaming:
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external
The reason I want this streaming into BigQuery is that I am using Google Data Studio to create reports from the external PostgreSQL, which works great, but GDS can only accept SQL query parameters if the data source is a Google BigQuery database. E.g. if we have a table with 1M entries and we want a Google Data Studio parameter to be supplied by the user, this will turn into:
SELECT * from table WHERE id=#parameter;
which means that the query will be faster, and won't hit the 100K records limit in Google Data Studio.
What's the best way of creating a connection between an external PostgreSQL (read-only access) and Google BigQuery so that when querying via BigQuery, one gets the same live results as querying the external PostgreSQL?
Perhaps you missed the options stated in the Google Cloud user guide?
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external#setup-replication
Notice in this section, it says:
"When you set up your replication settings, you can also decide whether the Cloud SQL replica should stay in-sync with the source database server after the initial import is complete. A replica that should stay in-sync is online. A replica that is only updated once, is offline."
I suspect online mode is what you are looking for.
What you are looking for will require some architecture design based on your needs, and some coding. There isn't a feature to automatically sync your PostgreSQL database with BigQuery (apart from the EXTERNAL_QUERY() functionality, which has some limitations: one connection per DB, performance, the total number of connections, etc.).
In case you are not looking for the data in real time, what you can do is, with Airflow for instance, have a DAG that connects to all your DBs once per day (using KubernetesPodOperator for instance), extracts the data (from the past day) and loads it into BQ. A typical ETL process, but in this case more EL(T). You can run this process more often if you cannot wait one day for the previous day of data.
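A minimal sketch of such a daily EL(T) DAG, assuming Airflow 2.4+ with the CNCF Kubernetes provider installed; the image name and the load script it runs are hypothetical:

from datetime import datetime
from airflow import DAG
# import path varies slightly across provider versions
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(
    dag_id="postgres_to_bigquery_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_and_load = KubernetesPodOperator(
        task_id="extract_and_load",
        name="extract-and-load",
        image="myrepo/pg-to-bq-loader:latest",   # hypothetical image with the extract/load script
        cmds=["python", "load.py"],
        arguments=["--since", "{{ ds }}"],       # pull the previous day's data
    )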
On the other hand, if streaming is what you are looking for, then I can think of a Dataflow job. I guess you can connect using a JDBC connector.
In addition, depending on how your pipeline is structured, it might be easier to implement (but harder to maintain) if, at the same moment you write to your PostgreSQL DB, you also stream your data into BigQuery.
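A sketch of that dual-write variant, using the google-cloud-bigquery streaming insert API; the table ID, row shape and SQL are placeholders, and a DB-API PostgreSQL connection is assumed:

from google.cloud import bigquery

bq = bigquery.Client()
TABLE_ID = "my-project.my_dataset.my_table"   # hypothetical table

def save(pg_conn, row):
    # write to PostgreSQL first (the existing source of truth) ...
    with pg_conn.cursor() as cur:
        cur.execute("INSERT INTO my_table (id, payload) VALUES (%s, %s)",
                    (row["id"], row["payload"]))
    pg_conn.commit()
    # ... then stream the same row into BigQuery
    errors = bq.insert_rows_json(TABLE_ID, [row])
    if errors:
        raise RuntimeError(f"BigQuery streaming insert failed: {errors}")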
Not sure if you have tried this already, but instead of adding a parameter, if you add a dropdown filter based on a dimension, Data Studio will push that down to the underlying Postgres db in this form:
SELECT * from table WHERE id=$filter_value;
This should achieve the same results you want without going through BigQuery.

IBM DB2 and IBM IMS Change Data Capture Capabilities

I'd like to understand whether CDC-enabled IBM IMS segments and IBM DB2 table sources would be able to provide both the before and after snapshot change values (like the Oracle .OLD and .NEW values in a trigger) so that both could be used for further processing.
Note:
We are supposed to retrieve these values through Informatica PowerExchange, process them and push them to targets.
As of now, we need to know whether we would be able to retrieve both before-snapshot and after-snapshot values from IBM DB2 and IBM IMS (.OLD and .NEW as in Oracle triggers - not an exactly similar example, but mentioned just as an aid to understanding).
Any help is much appreciated, Thanks.
I don't believe CDC captures before-images in the change messages it compiles from the DBMS log data. Its main purpose is to issue the minimum number of commands needed to replicate the data from one database to another. You'll want to take a snapshot of your replica database prior to processing the change messages if you want to preserve the state of the data so that you can query it.
Alternatively for Db2, it's probably easier to work with the temporal tables feature added in Db2 10 as that allows you to define what changes should drive a snapshot. You can then access the temporal data using a temporal SQL query.
SELECT ... FROM <table> <period specification>, e.g. FOR SYSTEM_TIME AS OF <timestamp>
Example trigger with old and new referencing...
CREATE TRIGGER danny117
NO CASCADE BEFORE UPDATE ON mylib.myfile
REFERENCING NEW AS N OLD AS O
FOR EACH ROW
-- don't let the claim change and force upper case
-- just do something automatically on update blah...
BEGIN ATOMIC
  SET N.claim = UCASE(O.claim);
END
With respect to PowerExchange 9.1.0 & 9.6:
Before-snapshot data can't be processed via PowerExchange for the DB2 database. I recently worked on a migration project and thought that, like Oracle CDC which uses SCN numbers, there should be something for DB2 to start the logger from any desired point. But to my surprise, Informatica global support confirmed that before-snapshot data can't be captured by PowerExchange.
They talk about materialize and de-materialize targets, which was outside my knowledge at the time; later I found out they meant the export and import of history data.
Even if you have a table with CDC enabled, you can't capture the before-snapshot data from PWX.
The capture data is read from the DB2 logs, which carry a marking for the operation (U/I/D); that's enough for PowerExchange to progress.

Sync postgreSql data with ElasticSearch

Ultimately I want to have a scalable search solution for the data in PostgreSQL. My findings point me towards using Logstash to ship write events from Postgres to Elasticsearch, however I have not found a usable solution. The solutions I have found involve using the jdbc input to query all data from Postgres on an interval, and delete events are not captured.
I think this is a common use case so I hope you guys could share with me your experience, or give me some pointers to proceed.
If you need to also be notified on DELETEs and delete the respective record in Elasticsearch, it is true that the Logstash jdbc input will not help. You'd have to use a solution built around the database's change log (the WAL in Postgres, analogous to MySQL's binlog), as suggested here.
However, if you still want to use the Logstash jdbc input, what you could do is simply soft-delete records in PostgreSQL, i.e. create a new BOOLEAN column in order to mark your records as deleted. The same flag would then exist in Elasticsearch and you can exclude them from your searches with a simple term query on the deleted field.
Whenever you need to perform some cleanup, you can delete all records flagged deleted in both PostgreSQL and Elasticsearch.
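For illustration, a query that excludes soft-deleted records, assuming the Elasticsearch 8.x Python client, a hypothetical index name, and a boolean deleted field mirrored from PostgreSQL:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="my_index",                                 # hypothetical index name
    query={
        "bool": {
            "must": {"match": {"title": "search terms"}},
            "must_not": {"term": {"deleted": True}},  # hide soft-deleted records
        }
    },
)
print(resp["hits"]["total"])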
You can also take a look at PGSync.
It's similar to Debezium but a lot easier to get up and running.
PGSync is a Change data capture tool for moving data from Postgres to Elasticsearch.
It allows you to keep Postgres as your source of truth and expose structured, denormalized documents in Elasticsearch.
You simply define a JSON schema describing the structure of the data in Elasticsearch.
Here is an example schema (you can also have nested objects):
{
  "nodes": {
    "table": "book",
    "columns": [
      "isbn",
      "title",
      "description"
    ]
  }
}
PGSync generates queries for your document on the fly, so there is no need to write queries as with Logstash. It also supports and tracks deletion operations.
It operates with both a polling and an event-driven model: the initial sync polls the database for changes since the last time the daemon was run, and thereafter event notifications (based on triggers and handled by pg_notify) pick up changes to the database as they occur.
It has very little development overhead:
Create a schema as described above
Point pgsync at your Postgres database and Elasticsearch cluster
Start the daemon.
You can easily create a document that includes multiple relations as nested objects. PGSync tracks any changes for you.
Have a look at the github repo for more details.
You can install the package from PyPI
Please take a look at Debezium. It's a change data capture (CDC) platform which allows you to stream your data.
I created a simple GitHub repository which shows how it works.

Mongodb to redshift

We have a few collections in mongodb that we wish to transfer to redshift (on an automatic incremental daily basis).
How can we do it? Should we export the mongo to csv?
I wrote some code to export data from Mixpanel into Redshift for a client. Initially the client was exporting to Mongo but we found Redshift offered very large performance improvements for query. So first of all we transferred the data out of Mongo into Redshift, and then we came up with a direct solution that transfers the data from Mixpanel to Redshift.
To store JSON data in Redshift you first need to create the SQL DDL to hold the schema in Redshift, i.e. a CREATE TABLE script.
You can use a tool like Variety to help as it can give you some insight into your Mongo schema. However it does struggle with big datasets - you might need to subsample your dataset.
Alternatively DDLgenerator can generate DDL from various sources including CSV or JSON. This also struggles with large datasets (well the dataset I was dealing with was 120GB).
So in theory you could use MongoExport to generate CSV or JSON from Mongo and then run it through DDL generator to get a DDL.
In practice I found using JSON export a little easier because you don't need to specify the fields you want to extract. You need to select the JSON array format. Specifically:
mongoexport --db <your db> --collection <your_collection> --jsonArray > data.json
head data.json > sample.json
ddlgenerator postgresql sample.json
Here - because I am using head - I use a sample of the data to show the process works. However, if your database has schema variation, you want to compute the schema based on the whole database which could take several hours.
Next you upload the data into Redshift.
If you have exported JSON, you need to use Redshift's Copy from JSON feature. You need to define a JSONPaths file to do this.
For more information check out the Snowplow blog - they use JSONpaths to map the JSON on to a relational schema. See their blog post about why people might want to read JSON to Redshift.
Turning the JSON into columns allows much faster queries than other approaches such as using JSON_EXTRACT_PATH_TEXT.
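As a hedged sketch of that COPY-from-JSON step (the bucket, IAM role, table and field names are placeholders, and boto3 and psycopg2 are assumed):

import json
import boto3
import psycopg2

# A JSONPaths file maps JSON fields to table columns, in column order.
jsonpaths = {"jsonpaths": ["$._id", "$.name", "$.created_at"]}
boto3.client("s3").put_object(
    Bucket="my-bucket",
    Key="redshift/employees.jsonpaths",
    Body=json.dumps(jsonpaths),
)

COPY_SQL = """
COPY employees
FROM 's3://my-bucket/exports/employees.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
JSON 's3://my-bucket/redshift/employees.jsonpaths';
"""

conn = psycopg2.connect("host=my-cluster dbname=analytics user=loader password=...")
with conn, conn.cursor() as cur:
    cur.execute(COPY_SQL)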
For incremental backups, it depends if data is being added or data is changing. For analytics, it's normally the former. The approach I used is to export the analytic data once a day, then copy it into Redshift in an incremental fashion.
Here are some related resources although in the end I did not use them:
Spotify has an open-source project called Luigi - this code claims to upload JSON to Redshift but I haven't used it so I don't know if it works.
Amiato have a web page that says they offer a commercial solution for loading JSON data into Redshift - but there is not much information beyond that.
This blog post discusses performing ETL on JSON datasources such as Mixpanel into Redshift.
Related Reddit question
Blog post about dealing with JSON arrays in Redshift
Honestly, I'd recommend using a third party here. I've used Panoply (panoply.io) and would recommend it. It'll take your mongo collections and flatten them into their own tables in redshift.
AWS Database Migration Service (DMS) has added support for MongoDB and Amazon DynamoDB, so I think the best option to migrate from MongoDB to Redshift is now DMS.
MongoDB versions 2.6.x and 3.x as a database source
Document Mode and Table Mode supported
Supports change data capture (CDC)
Details - http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
A few questions it would be helpful to know the answers to:
Is this an add-only, ever-increasing incremental sync, i.e. is data only being added and never updated/removed, or is your Redshift instance interested only in additions?
Is data inconsistency due to deletes/updates happening at the source and not being fed to the Redshift instance acceptable?
Does it need to be a daily incremental batch, or can it also be real time as it happens?
Depending on your situation, maybe mongoexport works for you, but you have to understand its shortcomings, which can be found at http://docs.mongodb.org/manual/reference/program/mongoexport/ .
I had to tackle the same issue (not on a daily basis though).
As mentioned in another answer, you can use mongoexport in order to export the data, but keep in mind that Redshift doesn't support arrays, so if your collection data contains arrays you'll find it a bit problematic.
My solution to this was to pipe the mongoexport into a small utility program I wrote that transforms the mongoexport json rows into my desired csv output.
Piping the output also allows you to make the process parallel.
mongoexport allows you to add a MongoDB query to the command, so if your collection data supports it you can spawn N different mongoexport processes, pipe their results into the other program and decrease the total runtime of the migration process.
Later on, I uploaded the files to S3, and performed a COPY into the relevant table.
This should be a pretty easy solution.
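The author's transform utility isn't shown; as a rough stand-in, here is a small Python filter that reads mongoexport's one-document-per-line output on stdin and writes tab-delimited rows to stdout. The field list is hypothetical; arrays and sub-documents are JSON-encoded into a single column since Redshift has no array type.

import csv
import json
import sys

FIELDS = ["_id", "name", "tags", "updated_at"]   # hypothetical column order

writer = csv.writer(sys.stdout, delimiter="\t")
for line in sys.stdin:
    doc = json.loads(line)
    row = []
    for field in FIELDS:
        value = doc.get(field, "")
        if isinstance(value, (list, dict)):      # flatten arrays/objects to a string
            value = json.dumps(value)
        row.append(value)
    writer.writerow(row)

It could then sit in a pipe, e.g. mongoexport --db mydb --collection employees | python transform.py > employees.tsv, before the S3 upload and COPY.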
Stitch Data is the best tool I've ever seen for replicating incrementally from MongoDB to Redshift within a few clicks and minutes.
It automatically and dynamically detects DML and DDL changes on the tables being replicated.

Loading DB2 table rows as Marklogic documents

Is there any tool to quickly convert a DB2 table rows into collection of XML documents that we can load to Marklogic?
DB2 supports the SQL/XML publishing extensions that were introduced in SQL:2003. These functions include XMLSERIALIZE, XMLELEMENT, XMLATTRIBUTE, and XMLFOREST, and are easily added to a SQL SELECT statement to produce a simple, well-formed XML document for each row in the result set. By writing queries that retrieve the table names and column layouts from DB2's catalog views, it is possible to automate the creation of the XML-publishing SELECT statements for a large number of tables.
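As a hedged illustration of that approach, here is a sketch that runs such a publishing SELECT and writes one XML file per row, ready for bulk-loading into MarkLogic. It assumes the ibm_db driver; the table, columns and connection string are placeholders.

import ibm_db

# One well-formed XML document per row of the (hypothetical) employee table.
SQL = """
SELECT XMLSERIALIZE(CONTENT
         XMLELEMENT(NAME "employee",
           XMLATTRIBUTES(e.empno AS "id"),
           XMLFOREST(e.firstname AS "firstName",
                     e.lastname  AS "lastName",
                     e.hiredate  AS "hireDate"))
         AS CLOB(1M))
FROM employee e
"""

conn = ibm_db.connect("DATABASE=sample;HOSTNAME=db2host;PORT=50000;"
                      "UID=user;PWD=secret;", "", "")
stmt = ibm_db.exec_immediate(conn, SQL)
i = 0
row = ibm_db.fetch_tuple(stmt)
while row:
    with open(f"employee-{i}.xml", "w", encoding="utf-8") as out:
        out.write(row[0])
    i += 1
    row = ibm_db.fetch_tuple(stmt)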
One way of doing this would be to use the MLSQL toolkit ( http://developer.marklogic.com/code/mlsql ). It allows accessing relational databases from within your XQuery code in MarkLogic. I'm not sure what the returned data actually looks like, but it should be easy to process it within XQuery and insert your data as XML into MarkLogic.
Just make sure not to try to load a million records in one statement; instead, spawn batches of, let's say, 1000 records at a time. Spawning will also allow the work to be handled by multiple threads, so it should be faster for that reason too.
HTH!
Do you need to stream from DB2 to MarkLogic? Or can you temporarily dump all the documents to an intermediary filesystem and then read them in? If you can dump, then simply use some DB2 tooling (like @Fred's answer above) to export the rows to a bunch of XML documents in a filesystem and use one of the many methods for reading a directory full of XML files into MarkLogic (like Information Studio (UI or APIs), RecordLoader, and so on).
If you don't want to store them in the filesystem as an intermediary, then you could write an Information Studio plugin for MarkLogic that will pull out each row and insert a document into MarkLogic. You'd likely need some web service or REST endpoint that the plugin could call to extract the document data from DB2.
Alternatively, I suspect you could use the DB2 tooling (described by @Fred) that lets you execute some code per row of your table. If you can do that in Java (or .NET), then pull in the MarkLogic XCC APIs, which will give you the ability to write documents into MarkLogic.