Redis Pub/Sub ETL -> Postgres

I have a simple task:
1. Subscribe to messages on a Redis channel.
2. Transform each message, e.g. a hash with key '<user_id>|<user_type>|<event_type>|...' and items { 'param_1': 'param_1_value', 'param_2': 'param_2_value', ... }, into tabular form:
user_id | event_type | param_1 | param_2 | ...
<user_id> | <event_type> | cleaned(param_1_value) | cleaned(param_2_value) | ...
3. Append the rows to an existing table in Postgres.
Additional context:
The scale of events is rather small
The Postgres table must be refreshed at least every ~15 minutes (i.e. data may lag by up to ~15 minutes)
Solution must be deployable on premises
Using anything other than Redis as the queue is not an option
The best solution I came up with is to use Kafka, with the Kafka Redis Source Connector (https://github.com/jaredpetersen/kafka-connect-redis) followed by the Kafka Postgres Sink Connector (https://github.com/ibm-messaging/kafka-connect-jdbc-sink). It seems reasonable, but the task looks like generic Redis-to-Postgres ETL, and I'm wondering whether there really is no easier out-of-the-box solution.

You could just write a script and run it via cron. But take a look at the Benthos project: it is easy to run on-prem, and what you describe can be done entirely via configuration for Redis -> Postgres.
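If you go the script route, the core really is only a few dozen lines. Here is a minimal sketch in Java (Jedis + JDBC), assuming the published message is the hash key described above; the channel name, table, columns and connection details are placeholders, and note that a pub/sub subscriber has to stay running (messages published while nothing is subscribed are dropped), so under cron you would need to guard against starting it twice:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

public class RedisToPostgres {

    public static void main(String[] args) throws Exception {
        try (Jedis subscriber = new Jedis("localhost", 6379);        // connection that blocks on SUBSCRIBE
             Jedis reader = new Jedis("localhost", 6379);            // separate connection for HGETALL
             Connection pg = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/etl", "etl", "secret")) {

            PreparedStatement insert = pg.prepareStatement(
                    "INSERT INTO events (user_id, event_type, param_1, param_2) VALUES (?, ?, ?, ?)");

            subscriber.subscribe(new JedisPubSub() {
                @Override
                public void onMessage(String channel, String hashKey) {
                    try {
                        // key looks like '<user_id>|<user_type>|<event_type>|...'
                        String[] keyParts = hashKey.split("\\|");
                        Map<String, String> params = reader.hgetAll(hashKey);

                        insert.setString(1, keyParts[0]);                  // user_id
                        insert.setString(2, keyParts[2]);                  // event_type
                        insert.setString(3, clean(params.get("param_1")));
                        insert.setString(4, clean(params.get("param_2")));
                        insert.executeUpdate();
                    } catch (Exception e) {
                        e.printStackTrace();                               // keep the subscriber alive
                    }
                }
            }, "events_channel");                                          // placeholder channel name
        }
    }

    // Placeholder for whatever cleaning the param values need.
    private static String clean(String raw) {
        return raw == null ? null : raw.trim();
    }
}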

Related

Can I configure AWS RDS to only stream INSERT operations to AWS DMS?

My requirement is to stream only INSERTs on a specific table in my db to a Kinesis data stream.
I have configured this pipeline in my AWS environment:
RDS Postgres 13 -> DMS (Database Migration Service) -> KDS (Kinesis Data Stream)
This setup works correctly but it processes all changes, even UPDATEs and DELETEs, on my source table.
What I've tried:
Looking for config options in the Postgres logical decoding plugin. DMS uses the test_decoding PG plugin which does not accept options to include/exclude data changes by operation type.
Looking at the DMS selection and filtering rules. Still didn't see anything that might help.
Of course I could simply ignore records originating from non-INSERT operations in my Kinesis consumer, but this doesn't look like a cost-efficient implementation.
Is there any way to meet my requirements using these AWS services (RDS -> DMS -> Kinesis)?
DMS does not have this capability.
If you want only INSERTs to be sent to Kinesis, you can invoke a Lambda function on every INSERT in RDS: configure a trigger on the table that fires only for INSERT and invokes the Lambda, and have the Lambda write to Kinesis directly.
This will also cost less: with DMS you pay for the replication instance even when it is not in use.
For a detailed reference, see Stream changes from Amazon RDS for PostgreSQL using Amazon Kinesis Data Streams and AWS Lambda.
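As a rough illustration of the Lambda side, a handler (AWS SDK for Java v2) that forwards the inserted row to Kinesis might look like this. It assumes the trigger passes the new row as a JSON-like map (for example via the aws_lambda extension on RDS for PostgreSQL); the stream name and the "id" field are illustrative only:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

import java.util.Map;

public class InsertToKinesisHandler implements RequestHandler<Map<String, Object>, String> {

    private final KinesisClient kinesis = KinesisClient.create();

    @Override
    public String handleRequest(Map<String, Object> newRow, Context context) {
        // Partition on the row's primary key so records for one key stay ordered (field name is assumed).
        String partitionKey = String.valueOf(newRow.getOrDefault("id", "unknown"));

        kinesis.putRecord(PutRecordRequest.builder()
                .streamName("rds-insert-stream")                     // illustrative stream name
                .partitionKey(partitionKey)
                .data(SdkBytes.fromUtf8String(newRow.toString()))    // use a proper JSON serializer in real code
                .build());

        return "ok";
    }
}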

Flink - multiple instances of flink application deployment on kubernetes

I need help with Flink application deployment on Kubernetes.
We have 3 sources that send trigger conditions in the form of SQL queries. There are ~3-6k queries in total, which is effectively a heavy load for a single Flink instance. I tried to execute them, but it was very slow and took a long time to start.
Because of the high volume of queries, we decided to create one Flink app instance per source, so effectively each Flink instance will execute only ~1-2k queries.
example: sql query sources are A, B, C
Flink instance:
App A --> responsible for handling source A queries only
App B --> responsible for handling source B queries only
App C --> responsible for handling source C queries only
I want to deploy these instances on Kubernetes
Question:
a) Is it possible to deploy a standalone Flink jar with the built-in mini cluster? I.e. just start the main method with java -cp, where sourceName is a command-line argument (A/B/C).
b) If one pod / Flink instance goes down on Kubernetes, how can we handle its work in another pod or Flink instance? Is it possible to hand the work over to another pod or Flink instance?
Sorry if I mixed two or more things together :(
I appreciate your help, thanks.
Leaving aside issues of exactly-once semantics, one way to handle this would be to have a parallel source function that emits the SQL queries (one per sub-task), and a downstream FlatMapFunction that executes the query (one per sub-task). Your source could then send out updates to the query without forcing you to restart the workflow.
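As a sketch of that shape (not a drop-in for your job), here is a minimal Java DataStream program: a RichParallelSourceFunction spreads the SQL strings across its sub-tasks and a downstream FlatMapFunction "executes" each one. The sourceName argument, the loadQueriesFor stub and the round-robin distribution are assumptions for illustration. Run with java -cp and the Flink dependencies on the classpath and getExecutionEnvironment() falls back to a local mini cluster, which also speaks to question (a):

import java.util.Arrays;
import java.util.List;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;
import org.apache.flink.util.Collector;

public class QueryRunnerJob {

    public static void main(String[] args) throws Exception {
        String sourceName = args.length > 0 ? args[0] : "A";   // A / B / C from the command line

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.addSource(new QuerySource(sourceName))   // each parallel sub-task emits a share of the queries
           .flatMap(new QueryExecutor())             // each sub-task executes the queries it receives
           .print();

        env.execute("query-runner-" + sourceName);
    }

    // Emits the SQL strings for one source, spread round-robin over the parallel sub-tasks.
    public static class QuerySource extends RichParallelSourceFunction<String> {
        private final String sourceName;
        private volatile boolean running = true;

        public QuerySource(String sourceName) { this.sourceName = sourceName; }

        @Override
        public void run(SourceContext<String> ctx) {
            int subtask = getRuntimeContext().getIndexOfThisSubtask();
            int parallelism = getRuntimeContext().getNumberOfParallelSubtasks();
            List<String> queries = loadQueriesFor(sourceName);
            for (int i = 0; i < queries.size() && running; i++) {
                if (i % parallelism == subtask) {
                    ctx.collect(queries.get(i));
                }
            }
        }

        @Override
        public void cancel() { running = false; }

        // Stand-in: in practice read the trigger SQL from wherever it is stored.
        private static List<String> loadQueriesFor(String sourceName) {
            return Arrays.asList("SELECT 1 /* " + sourceName + " */", "SELECT 2 /* " + sourceName + " */");
        }
    }

    // Stand-in executor; replace with real JDBC / evaluation logic and emit whatever results you need.
    public static class QueryExecutor implements FlatMapFunction<String, String> {
        @Override
        public void flatMap(String query, Collector<String> out) {
            out.collect("executed: " + query);
        }
    }
}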

ETL Spring batch, Spring cloud data flow (SCDF)

We have a use case where data can be sourced from different sources (DB, file, etc.), transformed, and stored to various sinks (Cassandra, DB or file). We would want the ability to split the jobs and do parallel loads - looks like Spring Batch RemoteChunking provides that ability.
I am new to SCDF and Spring batch and wondering what is the best way to use it.
Is there a way to provide configuration for these jobs (source connection details, table and query), and can this be done through a UI (the SCDF Server UI)? Is it possible to compose the flow?
This will run on Kubernetes and our applications are deployed through Jenkins pipeline.
We would want the ability to split the jobs and do parallel loads - looks like Spring Batch RemoteChunking provides that ability.
I don't think you need remote chunking, you can rather run parallel jobs, where each job handles an ETL process (for a particular file, db table).
Is there a way to provide configuration for these jobs (source connection details, table and query)
Yes, those can be configured like any regular Spring Batch job is configured.
and can this be done through a UI (the SCDF Server UI)?
If you make them configurable through properties of your job, you can specify them through the UI when you run the task.
Is it possible to compose the flow?
Yes, this is possible with Composed Task.
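To make the configurability point concrete, here is a rough sketch of a step-scoped reader whose SQL arrives as a job parameter, so SCDF can supply it when launching the task. The bean name and the sourceQuery parameter are illustrative, not a prescribed layout; the writer and the rest of the job would be wired up in the usual Spring Batch way:

import java.util.Map;

import javax.sql.DataSource;

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.ColumnMapRowMapper;

@Configuration
public class EtlJobConfig {

    // The query arrives as a job parameter, e.g. launched from SCDF with sourceQuery='SELECT * FROM customers'.
    @Bean
    @StepScope
    public JdbcCursorItemReader<Map<String, Object>> sourceReader(
            DataSource dataSource,
            @Value("#{jobParameters['sourceQuery']}") String sourceQuery) {

        return new JdbcCursorItemReaderBuilder<Map<String, Object>>()
                .name("sourceReader")
                .dataSource(dataSource)
                .sql(sourceQuery)
                .rowMapper(new ColumnMapRowMapper())   // generic rows, since the query is not known up front
                .build();
    }
}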

How to setup cross region replica of AWS RDS for PostgreSQL

I have an RDS for PostgreSQL instance set up in Asia and would like to have a read copy in the US.
But unfortunately I just found on the official site that only RDS for MySQL has cross-region replicas, not RDS for PostgreSQL.
I also saw this page, which introduces other ways to migrate data into and out of RDS for PostgreSQL.
Short of buying an EC2 instance and installing PostgreSQL myself in the US, is there any way to synchronize data from the Asia RDS to a US RDS?
It all depends on the purpose of your replication. Is it to provide a local data source and avoid network latencies?
Assuming that your goal is to have cross-region replication, you have a couple of options.
Custom EC2 Instances
You can create your own EC2 instances and install PostgreSQL so you can customize replication behavior.
I've documented configuring master-slave replication with PostgreSQL on my blog: http://thedulinreport.com/2015/01/31/configuring-master-slave-replication-with-postgresql/
Of course, you lose some of the benefits of AWS RDS, namely automated multi-AZ redundancy, etc., and now all of a sudden you have to be responsible for maintaining your configuration. This is far from perfect.
Two-Phase Commit
An alternate option is to build replication into your application. One approach is to use a database driver that can do this, or to do your own two-phase commit. If you are using Java, some ideas are described here: JDBC - Connect Multiple Databases
Use SQS to uncouple database writes
Ok, so this one is the one I would personally prefer. For all of your database writes you should use SQS and have background writer processes that take messages off the queue.
You will need to have a writer in Asia and a writer in the US regions. To publish on SQS across regions you can utilize SNS configuration that publishes messages onto multiple queues: http://docs.aws.amazon.com/sns/latest/dg/SendMessageToSQS.html
Of course, unlike a two-phase commit, this approach is subject to bugs and it is possible for your US database to get out of sync. You will need to implement a reconciliation process -- a simple one can be a weekly pg_dump from the Asia database and pg_restore into the US one to re-sync it, for instance. Another approach can do something like a Cassandra read-repair: for every 10 reads out of your US database, spin up a background process to run the same query against the Asia database, and if they return different results, kick off a process to replay some messages.
This approach is common, actually, and I've seen it used on Wall St.
So, pick your battle: either you create your own EC2 instances and take ownership of configuration and devops (yuck), implement a two-phase commit that guarantees consistency, or relax consistency requirements and use SQS and asynchronous writers.
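For the SQS option, a minimal sketch of one regional background writer (AWS SDK for Java v2) might look like the following. It assumes raw message delivery is enabled on the SNS-to-SQS subscription so the body is your own payload, and the queue URL, target table and JSON payload column are placeholders:

import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class RegionalWriter {

    public static void main(String[] args) throws Exception {
        String queueUrl = args[0];   // the region-local queue fed by the SNS topic

        try (SqsClient sqs = SqsClient.create();
             Connection db = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/app", "app", "secret")) {

            // Assumes each message body is the JSON payload to store.
            PreparedStatement insert = db.prepareStatement(
                    "INSERT INTO events (payload) VALUES (?::jsonb)");

            while (true) {
                List<Message> messages = sqs.receiveMessage(ReceiveMessageRequest.builder()
                        .queueUrl(queueUrl)
                        .waitTimeSeconds(20)          // long polling
                        .maxNumberOfMessages(10)
                        .build()).messages();

                for (Message m : messages) {
                    insert.setString(1, m.body());    // write first, delete only after success
                    insert.executeUpdate();
                    sqs.deleteMessage(DeleteMessageRequest.builder()
                            .queueUrl(queueUrl)
                            .receiptHandle(m.receiptHandle())
                            .build());
                }
            }
        }
    }
}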
This is now directly supported by RDS.
Example of creating a cross region replica using the CLI:
aws rds create-db-instance-read-replica \
--db-instance-identifier DBInstanceIdentifier \
--region us-west-2 \
--source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:my-postgres-instance

How to continuously write mongodb data into a running hdinsight cluster

I want to keep a Windows Azure HDInsight cluster always running so that I can periodically write updates from my master data store (which is MongoDB) and have it process map-reduce jobs on demand.
How can I periodically sync data from MongoDB to the HDInsight service? I'm trying not to have to upload all the data whenever a new query is submitted (which can happen at any time), but instead have it somehow pre-warmed.
Is that possible on HDInsight? Is it even possible with Hadoop?
Thanks,
It is certainly possible to have that data pushed from Mongo into Hadoop.
Unfortunately HDInsight does not support HBase (yet); otherwise you could use something like ZeroWing, a solution from Stripe that reads the MongoDB oplog (used by Mongo for replication) and writes it out to HBase.
Another solution might be to write out documents from your Mongo to Azure Blob storage. This means you wouldn't have to have the cluster up all the time, but you would still be able to use it to do periodic map-reduce analytics against the files in the storage vault.
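A rough sketch of that export step, using the MongoDB Java driver and the Azure Storage Blob SDK (both assumptions about your toolchain), dumping recently changed documents as JSON blobs; the database, collection and container names and the updatedAt field are illustrative only:

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Date;

import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class MongoToBlobExport {

    public static void main(String[] args) {
        Date since = new Date(System.currentTimeMillis() - 60 * 60 * 1000);   // e.g. everything from the last hour

        BlobContainerClient container = new BlobServiceClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                .buildClient()
                .getBlobContainerClient("mongo-export");                       // illustrative container name

        try (MongoClient mongo = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events = mongo.getDatabase("app").getCollection("events");

            // Export every document changed since the last run as its own JSON blob.
            for (Document doc : events.find(Filters.gte("updatedAt", since))) {
                byte[] json = doc.toJson().getBytes(StandardCharsets.UTF_8);
                container.getBlobClient("events/" + doc.get("_id") + ".json")
                         .upload(new ByteArrayInputStream(json), json.length, true);   // overwrite = true
            }
        }
    }
}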
Your best method is undoubtedly to use the Mongo Hadoop connector. This can be installed in HDInsight, but it's a bit fiddly. I've blogged a method here.