How to copy Druid datasource data from prod server to QA server (like Hive distcp action)

I wanted to check if there is a way to copy Druid datasource data (segments) from one server to another. Our requirement is to load new data into the prod Druid cluster (using SQL queries) and then copy the same data to the QA Druid cluster. We are using the Hive Druid storage handler to load the data, and HDFS as deep storage.
I read the Druid documentation but did not find any useful information.

There is currently no way to do this cleanly in Druid.
If you really need this feature, please request it by creating a GitHub issue at https://github.com/apache/druid/issues .
A workaround is documented here: https://docs.imply.io/latest/migrate/#the-new-cluster-has-no-data-and-can-access-the-old-clusters-deep-storage
Full disclosure: I work for Imply.
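Since both clusters use HDFS as deep storage, the linked workaround essentially comes down to making the QA cluster's metadata store aware of the segments the prod cluster already wrote. The following is only a rough sketch of that metadata-copy idea, assuming a MySQL metadata store with Druid's default druid_segments table; the hostnames, credentials, and datasource name are placeholders, and the linked documentation remains the authoritative procedure.

```python
# Very rough sketch of the metadata-copy part of the workaround: copy segment
# records from the prod metadata store to the QA metadata store, assuming both
# clusters point at the same HDFS deep storage and both use MySQL with the
# default druid_segments table. Hostnames, credentials, and the datasource
# name are placeholders.
import pymysql

prod = pymysql.connect(host="prod-metadata-host", user="druid", password="...", database="druid")
qa = pymysql.connect(host="qa-metadata-host", user="druid", password="...", database="druid")

with prod.cursor() as src, qa.cursor() as dst:
    # Pull the used segments for one datasource from the prod metadata store.
    src.execute(
        "SELECT * FROM druid_segments WHERE dataSource = %s AND used = 1",
        ("my_datasource",),
    )
    rows = src.fetchall()
    # Re-insert the same rows into the QA metadata store.
    placeholders = ", ".join(["%s"] * len(src.description))
    dst.executemany(f"REPLACE INTO druid_segments VALUES ({placeholders})", rows)

qa.commit()
prod.close()
qa.close()
```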

Related

Can I configure AWS RDS to only stream INSERT operations to AWS DMS?

My requirement is to stream only INSERTs on a specific table in my db to a Kinesis data stream.
I have configured this pipeline in my AWS environment:
RDS Postgres 13 -> DMS (Database Migration Service) -> KDS (Kinesis Data Stream)
This setup works correctly but it processes all changes, even UPDATEs and DELETEs, on my source table.
What I've tried:
Looking for config options in the Postgres logical decoding plugin. DMS uses the test_decoding PG plugin which does not accept options to include/exclude data changes by operation type.
Looking at the DMS selection and filtering rules. Still didn't see anything that might help.
Of course I could simply ignore records originating from non-INSERT operations in my Kinesis consumer, but this doesn't look like a cost-efficient implementation.
Is there any way to meet my requirements using these AWS services (RDS -> DMS -> Kinesis)?
DMS does not have this capability.
If you want only INSERTs to be sent to Kinesis, you can use a Lambda function that fires on every INSERT in RDS. The Lambda function can be configured as a trigger for INSERTs, so it is invoked only for INSERTs and writes to Kinesis directly.
Cost-wise this is also cheaper: with DMS you pay for the replication instance even when it is not in use.
For a detailed reference, see Stream changes from Amazon RDS for PostgreSQL using Amazon Kinesis Data Streams and AWS Lambda.
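As a sketch of the Lambda side of that approach: the function simply takes whatever row payload the database trigger sends and forwards it to Kinesis. The stream name, partition key field, and event shape below are assumptions for illustration, not anything defined by RDS or DMS.

```python
# Sketch of a Lambda that forwards an inserted row to Kinesis.
# Stream name, partition key field, and the event shape are assumptions.
import json

import boto3

kinesis = boto3.client("kinesis")


def handler(event, context):
    # event is assumed to be the inserted row, serialized by the database trigger.
    kinesis.put_record(
        StreamName="rds-insert-stream",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event.get("id", "default")),
    )
    return {"status": "forwarded"}
```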

In Snowplow, is it compulsory to use DynamoDB in the stream enrich process?

I am trying to develop a working example of Snowplow click tracking. I have to set up the enrichment process to enrich raw data on a Kinesis stream. But when I run the JAR file, I get this error:
ERROR com.amazonaws.services.kinesis.leases.impl.LeaseManager - Failed to get table status for SnowplowEnrich-${enrich.streams.in.raw}
Is DynamoDB a necessity for the enrichment process?
It depends: in batch mode DynamoDB is not necessary for the enrichment process; DynamoDB is used in the RDB Shredder.
Which release are (or were) you trying to install? For a PoC you can use Snowplow Mini.
The Snowplow community is active at discourse.snowplowanalytics.com

Loading data from S3 to PostgreSQL RDS

We are planning to go for PostgreSQL RDS in the AWS environment. There are some files in S3 that we will need to load every week. I don't see any option in the AWS documentation for loading data from S3 into PostgreSQL RDS. I see it is possible for Aurora but cannot find anything for PostgreSQL.
Any help will be appreciated.
One option is to use AWS Data Pipeline. It's essentially a JSON script that allows you to orchestrate the flow of data between sources on AWS.
There's a template offered by AWS that's setup to move data between S3 and MySQL. You can find it here. You can easily follow this and swap out the MySQL parameters with those associated with your Postgres instance. Data Pipeline simply looks for RDS as the type and does not distinguish between MySQL and Postgres instances.
Scheduling is also supported by Data Pipeline, so you can automate your weekly file transfers.
To start this:
Go to the Data Pipeline service in your AWS console
Select "Build from template" under source
Select the "Load S3 to MySQL table" template
Fill in the rest of the fields and create the pipeline
From there, you can monitor the progress of the pipeline in the console!
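If a scripted job fits better than the console workflow, the same weekly transfer can also be done with a few lines of boto3 and psycopg2. This is only a minimal sketch and not part of the Data Pipeline answer above; the bucket, key, table name, and connection details are placeholders, and it assumes the weekly files are CSVs with a header row.

```python
# Minimal sketch of a scripted weekly load from S3 into RDS PostgreSQL.
# Bucket, key, table, and connection details are placeholders.
import io

import boto3
import psycopg2

s3 = boto3.client("s3")

# Pull the weekly CSV from S3 into memory.
obj = s3.get_object(Bucket="my-bucket", Key="weekly/data.csv")
buf = io.BytesIO(obj["Body"].read())

# COPY the file into the RDS Postgres table over a normal connection.
conn = psycopg2.connect(
    host="mydb.xxxxxxxx.us-east-1.rds.amazonaws.com",
    dbname="mydb",
    user="loader",
    password="...",
)
with conn, conn.cursor() as cur:
    cur.copy_expert("COPY my_table FROM STDIN WITH CSV HEADER", buf)
conn.close()
```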

How to replicate MySQL database to Cloud SQL Database

I have read that you can replicate a Cloud SQL database to MySQL. Instead, I want to replicate from a MySQL database (that the business uses to keep inventory) to Cloud SQL so it can have up-to-date inventory levels for use on a web site.
Is it possible to replicate MySQL to Cloud SQL? If so, how do I configure that?
This is something that is not yet possible in Cloud SQL.
I'm using DBSync to do it, and it is working fine: http://dbconvert.com/mysql.php
The Sync version does what you want. It works well with App Engine and Cloud SQL. You must authorize external connections first.
This is a rather old question, but it might be worth noting that this now seems possible by Configuring External Masters.
The high-level steps are:
Create a dump of the data from the master and upload the file to a storage bucket (see the sketch after this list)
Create a master instance in Cloud SQL
Set up a replica of that instance, using the external master IP, username and password. Also provide the dump file location
Set up additional replicas if needed
Voilà!
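As a rough illustration of the first step only (dumping the external master and uploading the file to a bucket), something like the following could work. The mysqldump flags, hostnames, and bucket name here are just examples; the Cloud SQL external-master documentation lists the exact dump options it requires.

```python
# Rough sketch of step 1: dump the external MySQL master and upload the file
# to a Cloud Storage bucket. Flags, hostnames, and names are illustrative only.
import subprocess

from google.cloud import storage

# Dump the source database to a local file.
with open("inventory.sql", "wb") as out:
    subprocess.run(
        ["mysqldump", "--single-transaction", "--hex-blob",
         "-h", "onprem-mysql-host", "-u", "repl_user", "-pSECRET", "inventory"],
        stdout=out,
        check=True,
    )

# Upload the dump to the bucket the Cloud SQL replica setup will read from.
client = storage.Client()
client.bucket("my-cloudsql-dumps").blob("inventory.sql").upload_from_filename("inventory.sql")
```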

How to continuously write mongodb data into a running hdinsight cluster

I want to keep a Windows Azure HDInsight cluster always running so that I can periodically write updates from my master data store (which is MongoDB) and have it process MapReduce jobs on demand.
How can I periodically sync data from MongoDB to the HDInsight service? I'm trying to avoid having to upload all the data whenever a new query is submitted (which can happen at any time), and instead have it somehow pre-warmed.
Is that possible with HDInsight? Is it even possible with Hadoop?
Thanks,
It is certainly possible to have that data pushed from Mongo into Hadoop.
Unfortunately HDInsight does not support HBase (yet); otherwise you could use something like ZeroWing, a solution from Stripe that reads the MongoDB oplog (used by Mongo for replication) and then writes it out to HBase.
Another solution might be to write out documents from your Mongo to Azure Blob storage. That way you wouldn't have to keep the cluster up all the time, but you would still be able to use it for periodic MapReduce analytics against the files in the storage account.
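As a minimal sketch of that second option, a small job could export documents from MongoDB as JSON blobs for the cluster to read later. The connection strings, database, collection, container name, and the synced flag below are all placeholders chosen for illustration.

```python
# Minimal sketch: export MongoDB documents to Azure Blob storage as JSON files
# that an HDInsight job can process later. All names and connection strings
# below are placeholders.
from azure.storage.blob import BlobServiceClient
from bson import json_util
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
collection = mongo["mydb"]["events"]

blob_service = BlobServiceClient.from_connection_string("<azure-storage-connection-string>")
container = blob_service.get_container_client("hdinsight-input")

# Export each not-yet-synced document as its own blob; json_util handles BSON
# types such as ObjectId and dates.
for doc in collection.find({"synced": {"$ne": True}}):
    container.upload_blob(
        name=f"events/{doc['_id']}.json",
        data=json_util.dumps(doc),
        overwrite=True,
    )
    collection.update_one({"_id": doc["_id"]}, {"$set": {"synced": True}})
```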
Your best method is undoubtedly to use the Mongo Hadoop connector. This can be installed in HDInsight, but it's a bit fiddly. I've blogged a method here.