How to dump file data into AWS Elasticsearch

I have a large JSON text file (newline-delimited JSON documents generated by querying DynamoDB based on filter criteria). What is the best way to index these documents into the AWS managed Elasticsearch service?

If your data is in the form of a local file or lives on EC2, you can consider using Logstash.
The open source version of Logstash (Logstash OSS) provides a convenient way to use the bulk API to upload data into your Amazon Elasticsearch Service (Amazon ES) domain.
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-managedomains-logstash.html
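If you'd rather script the upload than run Logstash, the same bulk API can be called directly. Here is a minimal sketch using the official Python client; the domain endpoint, index name, and file path are placeholder assumptions, and authentication (e.g. request signing for an access-controlled domain) is omitted.

    # Minimal sketch: bulk-index a newline-delimited JSON file into an
    # Elasticsearch domain. Endpoint, index, and path are placeholders.
    import json
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("https://my-domain.us-east-1.es.amazonaws.com")

    def actions(path, index):
        # Yield one bulk action per non-empty line of the NDJSON file.
        with open(path) as f:
            for line in f:
                if line.strip():
                    yield {"_index": index, "_source": json.loads(line)}

    # helpers.bulk batches the actions into _bulk requests for you.
    helpers.bulk(es, actions("dynamodb-export.json", "my-index"))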

Related

How to copy Druid data source data from prod server to QA server (like hive distcp action)

I wanted to check whether there is a way to copy Druid datasource data (segments) from one server to another. We want to load new data into the prod Druid cluster (using SQL queries) and copy the same data to the QA Druid cluster. We are using the Hive Druid storage handler to load the data, and HDFS as deep storage.
I read the Druid documentation but did not find any useful information.
There is currently no clean way to do this in Druid.
If you really need this feature, please request it by opening a GitHub issue at https://github.com/apache/druid/issues .
A workaround is documented here: https://docs.imply.io/latest/migrate/#the-new-cluster-has-no-data-and-can-access-the-old-clusters-deep-storage
Full disclosure: I work for Imply.

Loading data from S3 to PostgreSQL RDS

We are planning to use PostgreSQL RDS in our AWS environment. There are some files in S3 that we will need to load every week. I don't see any option in the AWS documentation for loading data from S3 into PostgreSQL RDS. I see it is possible for Aurora, but I cannot find anything for PostgreSQL.
Any help will be appreciated.
One option is to use AWS Data Pipeline. It's essentially a JSON script that allows you to orchestrate the flow of data between sources on AWS.
There's a template offered by AWS that's set up to move data between S3 and MySQL. You can find it here. You can easily follow this and swap out the MySQL parameters for those associated with your Postgres instance. Data Pipeline simply looks for RDS as the type and does not distinguish between MySQL and Postgres instances.
Scheduling is also supported by Data Pipeline, so you can automate your weekly file transfers.
To start this:
Go to the Data Pipeline service in your AWS console
Select "Build from template" under source
Select the "Load S3 to MySQL table" template
Fill in the rest of the fields and create the pipeline
From there, you can monitor the progress of the pipeline in the console!
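If you'd rather drive this from code than from the console, the same pipeline can be created and activated with boto3. The sketch below is illustrative only: the names are placeholders, and the actual pipeline definition (S3 data node, RDS database node, copy activity) is easiest to export from the console template rather than written by hand.

    # Minimal sketch: create and activate a Data Pipeline with boto3.
    import boto3

    dp = boto3.client("datapipeline")

    created = dp.create_pipeline(name="s3-to-postgres-weekly",
                                 uniqueId="s3-to-postgres-weekly")
    pipeline_id = created["pipelineId"]

    dp.put_pipeline_definition(
        pipelineId=pipeline_id,
        pipelineObjects=[
            # Only the default object is shown; the S3DataNode, RdsDatabase,
            # SqlDataNode, and CopyActivity objects from the exported
            # template definition would go here as well.
            {"id": "Default",
             "name": "Default",
             "fields": [{"key": "scheduleType", "stringValue": "ondemand"}]},
        ],
    )
    dp.activate_pipeline(pipelineId=pipeline_id)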

How can I test the MariaDB CONNECT Storage Engine with MongoDB - Does a connector exist?

According to a few articles I've read (e.g. here and here), MariaDB supports connecting to, sending commands to, and querying external data sources using the CONNECT storage engine.
Specifically, I'd like to test with MongoDB. Is there a connector I can download, along with documentation specific to MongoDB? My Google searches so far have come up short.
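For what it's worth, recent MariaDB releases document a MONGO table type for the CONNECT engine, which would be the thing to test here. Below is a hedged sketch of the kind of statement to try, issued through the pymysql client; the table, collection, and option names are assumptions to check against the MariaDB knowledge base for your version.

    # Hedged sketch: expose a MongoDB collection as a CONNECT table.
    # All names and connection values are hypothetical placeholders.
    import pymysql

    conn = pymysql.connect(host="localhost", user="root",
                           password="secret", database="test")
    with conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE mongo_people (name VARCHAR(64), age INT)
            ENGINE=CONNECT TABLE_TYPE=MONGO
            TABNAME='people'
            CONNECTION='mongodb://localhost:27017'
            DB_NAME='mydb'
        """)
        cur.execute("SELECT name, age FROM mongo_people LIMIT 5")
        print(cur.fetchall())
    conn.close()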

How to continuously write mongodb data into a running hdinsight cluster

I want to keep a Windows Azure HDInsight cluster running at all times so that I can periodically write updates from my master data store (which is MongoDB) and have it process map-reduce jobs on demand.
How can I periodically sync data from MongoDB to the HDInsight service? I'm trying to avoid uploading all the data every time a new query is submitted (which can happen at any time), and instead have it somehow pre-warmed.
Is that possible on HDInsight? Is it even possible with Hadoop?
Thanks,
It is certainly possible to have that data pushed from Mongo into Hadoop.
Unfortunately, HDInsight does not support HBase (yet); otherwise you could use something like ZeroWing, a solution from Stripe that reads the MongoDB oplog (which Mongo uses for replication) and writes it out to HBase.
Another solution might be to write documents from your Mongo out to Azure Blob storage. This means you wouldn't have to keep the cluster up all the time, but you would still be able to use it for periodic map-reduce analytics against the files in the storage vault.
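As a rough illustration of that Blob-storage approach, here is a minimal sketch that copies recently changed documents out of MongoDB as newline-delimited JSON using the current pymongo and azure-storage-blob clients; the database, collection, container, and the updated_at field are hypothetical placeholders.

    # Minimal sketch: export recently updated MongoDB documents to Azure
    # Blob storage as newline-delimited JSON for later map-reduce runs.
    import json
    from datetime import datetime, timedelta

    from azure.storage.blob import BlobServiceClient
    from pymongo import MongoClient

    mongo = MongoClient("mongodb://localhost:27017")
    docs = mongo["mydb"]["events"].find(
        {"updated_at": {"$gt": datetime.utcnow() - timedelta(hours=1)}}
    )

    # One JSON document per line; default=str handles ObjectId and dates.
    payload = "\n".join(json.dumps(d, default=str) for d in docs)

    blobs = BlobServiceClient.from_connection_string("<connection-string>")
    blobs.get_blob_client(
        container="hdinsight-input",
        blob=f"events-{datetime.utcnow():%Y%m%d%H%M}.json",
    ).upload_blob(payload)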
Your best method is undoubtedly to use the Mongo Hadoop connector. This can be installed in HDInsight, but it's a bit fiddly. I've blogged a method here.

Using Amazon S3 as a File System for MongoDB

I have decided to use MongoDB as the document-management DB in my application. Initially I was thinking of using S3 as the data store, but it seems MongoDB uses the local file system to store data. Can I use S3 as the data store for MongoDB?
Thanks
No: S3 is object storage, not a block-level file system, so MongoDB cannot use it as its data store. For AWS deployments, EBS volumes with Provisioned IOPS are ideal for MongoDB.
This link has notes about running MongoDB on AWS and is rather useful.