Druid ingestion across multiple Druid environments/clusters - druid

Is there a simple way to ingest raw data into one Druid environment and then use the result stored in Druid deep storage to re-ingest it into a different Druid environment (a different Druid cluster)? In other words, can I ingest from one Druid cluster into another?
FROM: Raw Data --> data pipeline/Airflow --> Druid (environment 1)
TO: Raw Data --> Airflow --> Druid (environment 1) --> Druid (environment 2)
I'm looking to achieve this because of the time it takes to ingest raw data into Druid. Instead of ingesting the raw data for each environment, I would like to ingest it once and copy the result into the other Druid environment.
Deep storage uses S3, so I can copy data from S3 (environment 1) to S3 (environment 2). However, the metadata needs to be updated as well, and this looks like a hacky way to achieve it.
I'm also looking for best practices for this scenario, since I want to avoid duplicating the data pipeline for each Druid environment.

Yes, this is possible. If your metadata is stored in, for example, MySQL, you can simply copy the relevant records and insert them into your second environment.
All segment metadata is stored in the metadata store (MySQL in this case). This sounds complicated, but it is not. Just take a look at the druid_segments table and filter on your dataSource.
Copy over the records you want to "move", and make sure that the deep storage location (path) of each segment is accessible from your second environment. If needed, you can alter these paths in the "payload" field.
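As an illustration, here is a minimal Python sketch of that metadata copy, assuming MySQL metadata stores with the same Druid schema version on both sides, the pymysql client, and S3 deep storage; all host names, credentials, datasource and bucket names are placeholders:

    # Hypothetical sketch: copy segment metadata rows for one dataSource from the
    # source cluster's metadata store into the target cluster's metadata store.
    import json
    import pymysql

    SOURCE = dict(host="metadata-env1.example.com", user="druid", password="secret", database="druid")
    TARGET = dict(host="metadata-env2.example.com", user="druid", password="secret", database="druid")
    DATASOURCE = "my_datasource"

    src = pymysql.connect(cursorclass=pymysql.cursors.DictCursor, **SOURCE)
    dst = pymysql.connect(**TARGET)

    with src.cursor() as read_cur, dst.cursor() as write_cur:
        read_cur.execute(
            "SELECT * FROM druid_segments WHERE dataSource = %s AND used = 1",
            (DATASOURCE,),
        )
        for row in read_cur.fetchall():
            # Optionally rewrite the deep-storage location inside the payload,
            # e.g. when the second cluster reads from a different S3 bucket.
            payload = json.loads(row["payload"])
            payload["loadSpec"]["bucket"] = "deep-storage-env2"  # assumes an S3 loadSpec
            row["payload"] = json.dumps(payload)

            columns = ", ".join(f"`{c}`" for c in row)
            placeholders = ", ".join(["%s"] * len(row))
            write_cur.execute(
                f"INSERT INTO druid_segments ({columns}) VALUES ({placeholders})",
                list(row.values()),
            )
    dst.commit()

Once the rows are in place, the coordinator in the second environment should pick up the segments and ask the historicals to load them from deep storage.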
See also this page with some useful tips:
https://support.imply.io/hc/en-us/articles/115004960053-Migrate-existing-Druid-Cluster-to-a-new-Imply-cluster

Related

Ingest a big local JSON file into Druid

This is my first Druid experience.
I have a local setup of Druid on my local machine.
Now I'd like to run some query performance tests. My test data is a huge local JSON file of 1.2 GB.
The idea was to load it into Druid and run the required SQL queries. The file gets parsed and successfully processed (I'm using the Druid web-based UI to submit an ingestion task).
The problem I run into is the datasource size. It doesn't make sense that 1.2 GB of raw JSON data results in a 35 MB datasource. Is there any limitation in a locally running Druid setup? I think the test data is only partially processed, but unfortunately I didn't find any relevant config to change. I would appreciate it if someone could shed some light on this.
Thanks in advance
With Druid, 80-90 percent compression is expected. I have seen a 2 GB CSV file reduced to a 200 MB Druid datasource.
Can you query the row count to make sure all data was ingested? Also, please disable the approximate HyperLogLog algorithm to get exact counts: Druid SQL switches to exact distinct counts if you set "useApproximateCountDistinct" to "false", either through the query context or through broker configuration (see http://druid.io/docs/latest/querying/sql.html).
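For example, a small Python sketch against the Druid SQL endpoint on the broker (broker URL and datasource name are placeholders) could look like this:

    # Hypothetical sketch: run an exact count against the Druid SQL endpoint,
    # disabling approximate distinct counts via the query context.
    import requests

    BROKER = "http://localhost:8082"   # default broker port
    DATASOURCE = "my_datasource"

    response = requests.post(
        f"{BROKER}/druid/v2/sql",
        json={
            "query": f'SELECT COUNT(*) AS row_count FROM "{DATASOURCE}"',
            "context": {"useApproximateCountDistinct": False},
        },
    )
    response.raise_for_status()
    print(response.json())  # compare row_count with the number of records in the raw file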
Also check the logs for exceptions and error messages. If Druid has a problem ingesting a particular JSON record, it skips that record.

Transfer data from Cassandra to PostgreSQL

I would like to know whether there is a solution to this.
Problem:
I have a Cassandra database for continuously saving large-scale data from other sources. Application data is saved in PostgreSQL. For functionality, I want to query all the data from PostgreSQL, so I would like to keep the Cassandra data consistently synchronized into the PostgreSQL database as it arrives in Cassandra.
Is this possible?
Please suggest.
I would like to keep the Cassandra data consistently synchronized into the PostgreSQL database as it arrives in Cassandra.
There are no special utilities for this. You need to create your own service that gathers data from Cassandra, processes it, and puts the results into PostgreSQL.
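A minimal sketch of such a service, assuming the cassandra-driver and psycopg2 packages and entirely hypothetical keyspace, table, and column names (a real service would also track its position instead of re-reading everything):

    # Hypothetical sketch: poll a Cassandra table and upsert its rows into PostgreSQL.
    from cassandra.cluster import Cluster
    import psycopg2

    cassandra = Cluster(["cassandra-host"]).connect("my_keyspace")
    pg = psycopg2.connect(host="postgres-host", dbname="mydb", user="app", password="secret")

    rows = cassandra.execute("SELECT id, payload, updated_at FROM events")

    with pg, pg.cursor() as cur:
        for row in rows:
            cur.execute(
                """
                INSERT INTO events (id, payload, updated_at)
                VALUES (%s, %s, %s)
                ON CONFLICT (id) DO UPDATE
                    SET payload = EXCLUDED.payload, updated_at = EXCLUDED.updated_at
                """,
                (row.id, row.payload, row.updated_at),
            )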
Sure. You can use Change Data Capture (CDC) to copy the data from Cassandra to PostgreSQL as and when there is a change in the Cassandra data. One option is to use Kafka Connect with the appropriate connectors.

How to make Superset display Druid data?

I have been trying to get Superset to display data from Druid, but have not been able to succeed.
In my Druid console I can clearly see a "wiki-edits" data source, but when I specified the Druid cluster and Druid data source in Superset, it did not pick up any of that data.
Has anyone been able to make this work?
Use the "Refresh Druid Metadata" option available in the Sources menu of Superset.
If you still cannot see the data source after that, make sure you have entered the correct coordinator host/port and broker host/port in the Druid Cluster source of Superset.
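As a quick sanity check outside Superset, you can ask the coordinator and broker directly which datasources they know about; the host names and ports below are placeholders:

    # Hypothetical sanity check: list the datasources known to the coordinator and
    # the datasources queryable through the broker that Superset talks to.
    import requests

    COORDINATOR = "http://druid-coordinator:8081"
    BROKER = "http://druid-broker:8082"

    print(requests.get(f"{COORDINATOR}/druid/coordinator/v1/datasources").json())
    print(requests.get(f"{BROKER}/druid/v2/datasources").json())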
Have you tried "Scan for new datasources" in the Sources menu?

What is the common practice to store user data and analyze it with Spark/Hadoop?

I'm new to Spark. I'm a web developer by background and not familiar with big data.
Let's say I have a portal website. User behavior and actions are stored in 5 sharded MongoDB clusters.
How do I analyze them with Spark?
Or can Spark get the data from any database directly (PostgreSQL/MongoDB/MySQL/...)?
I ask because most websites use a relational DB as the back-end database.
Should I export all the data in the website's databases into HBase?
I store all the user logs in PostgreSQL; is it practical to export the data into HBase or some other Spark-preferred database?
It seems copying the data to a new database would create a lot of duplicated data.
Does my big data setup need any framework besides Spark?
For analyzing the data in the website's databases,
I don't see the reason I would need HDFS, Mesos, ...
How can Spark workers access the data in PostgreSQL databases?
I only know how to read data from a text file,
and I have seen some code that loads data from hdfs://.
But I don't have an HDFS system now; should I set up HDFS for my purpose?
Spark is a distributed compute engine, so it expects to have files accessible from all nodes. Here are some choices you might consider:
There is a Spark-MongoDB connector; this post explains how to get it working.
Export the data out of MongoDB into Hadoop and then use Spark to process the files. For this, you need to have a Hadoop cluster running.
If you are on Amazon, you can put the files in the S3 store and access them from Spark (see the sketch below).
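Since the question also asks how Spark workers can reach PostgreSQL directly, here is a minimal PySpark sketch of the JDBC and S3 options; all connection details, table and bucket names are placeholders, and the PostgreSQL JDBC driver plus the hadoop-aws/S3 libraries must be available to the cluster:

    # Hypothetical sketch: read a PostgreSQL table over JDBC and raw JSON files from S3.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("user-log-analysis").getOrCreate()

    # Option A: let Spark pull the data straight from PostgreSQL over JDBC
    user_logs = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://pg-host:5432/portal")
        .option("dbtable", "user_logs")
        .option("user", "analytics")
        .option("password", "secret")
        .load()
    )

    # Option B: read files that were exported to S3
    events = spark.read.json("s3a://my-bucket/user-logs/")

    user_logs.groupBy("action").count().show()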

Mahout data model for an Amazon Redshift recommendation engine

How would I build a recommendation engine with Amazon Redshift as a data source? Is there any Mahout data model for Amazon Redshift or S3?
Mahout uses Hadoop to read data, except for a few supported NoSQL databases and JDBC databases. Hadoop in turn can use S3. You'd have to configure Hadoop to use the S3 filesystem, and then Mahout should work fine reading from and writing to S3.
Redshift is a data warehousing solution based on Postgres that supports JDBC/ODBC. Mahout 0.9 supports data models stored in JDBC-compliant stores, so, although I haven't done it, it should be supported.
The Mahout v1 recommenders run on Spark, and input and output are text by default. All I/O goes through Hadoop, so S3 data is fine for input, but the models created are also text and need to be indexed and queried with a search engine like Solr or Elasticsearch. You can fairly easily write a reader to get data from any other store (Redshift), but you might not want to save the models in a data warehouse, since they need to be indexed by Solr and should have super-fast, search-engine-style retrieval.
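As a rough illustration of the "write a reader" idea, the sketch below pulls interactions out of Redshift (which speaks the Postgres wire protocol, so psycopg2 works) and stages them as text on S3 for the Hadoop/Spark-based recommenders; all cluster, table, column, and bucket names are placeholders:

    # Hypothetical sketch: export user/item interactions from Redshift to CSV
    # and upload the file to S3 as input for Mahout.
    import csv
    import boto3
    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="warehouse", user="analytics", password="secret",
    )

    with conn.cursor() as cur, open("interactions.csv", "w", newline="") as f:
        cur.execute("SELECT user_id, item_id, rating FROM interactions")
        csv.writer(f).writerows(cur)   # userID,itemID,preference rows

    boto3.client("s3").upload_file("interactions.csv", "my-bucket", "mahout/interactions.csv")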