I have been trying to get Superset to display data from Druid, but without success.
In my Druid console I can clearly see a "wiki-edits" data source, but when I specified the Druid cluster and Druid data source in Superset, it did not pick up any of that data.
Has anyone been able to make this work?
Use the Refresh Druid Metadata option available in the Sources menu of Superset.
If you still cannot see the data source after that, make sure you have entered the correct coordinator host/port and broker host/port in the Druid Cluster source of Superset.
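If you are unsure whether those values are right, a quick sanity check (a rough sketch, assuming Druid's default ports of 8081 for the coordinator and 8082 for the broker, and Python's requests library) is to ask the coordinator directly which datasources it knows about:

    import requests

    # List the datasources the coordinator knows about; "wiki-edits" should appear here.
    # Adjust the host to match your cluster; 8081 is only the default coordinator port.
    datasources = requests.get("http://localhost:8081/druid/coordinator/v1/datasources").json()
    print(datasources)

If the datasource shows up here but not in Superset, the problem is on the Superset side; if it does not, the coordinator address you entered is likely wrong.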
Have you tried "Scan New Datasources" in the Sources menu?
Is there a simple way to ingest raw data into one Druid environment and then use the result stored in Druid deep storage to re-ingest it into a different Druid environment (a different Druid cluster)? In other words, can I ingest from one Druid cluster into another?
FROM: Raw Data --> data pipeline/Airflow --> Druid (environment 1)
TO: Raw Data --> Airflow --> Druid (environment 1) --> Druid (environment 2)
I am looking to achieve this because of the time it takes to ingest raw data into Druid. Instead of ingesting the raw data for each environment, I would like to ingest it once and copy the result into the other Druid environment.
Deep storage uses S3, so I can copy data from S3 (environment 1) to S3 (environment 2). However, the metadata needs to be updated as well, and this looks like a hacky way to achieve it.
I am also looking for best practices for this scenario, since I want to avoid duplicating data pipelines for each Druid environment.
Yes, this is possible. If the metadata is stored in, for example, MySQL, you can simply copy those records and insert them into your second environment.
All segment metadata is stored in the metadata store (MySQL in this case). This sounds complicated, but it is not. Just take a look at the druid_segments table and filter on your dataSource.
Copy over the records you want to "move", and make sure that the location (path) of the deep-storage file is accessible from your second environment. If needed, you can alter these paths in the "payload" field.
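As a rough sketch (assuming MySQL metadata stores reachable from one machine, the mysql-connector-python package, and placeholder hostnames, credentials, and datasource name), the copy could look like this:

    import mysql.connector

    # Metadata stores of the two environments; hosts and credentials are placeholders.
    src = mysql.connector.connect(host="druid-meta-env1", user="druid", password="secret", database="druid")
    dst = mysql.connector.connect(host="druid-meta-env2", user="druid", password="secret", database="druid")

    columns = "id, dataSource, created_date, `start`, `end`, partitioned, version, used, payload"

    read = src.cursor(dictionary=True)
    read.execute(
        "SELECT " + columns + " FROM druid_segments WHERE dataSource = %s AND used = 1",
        ("your_datasource",),
    )

    write = dst.cursor()
    for row in read.fetchall():
        # The payload holds the segment descriptor, including the deep-storage path;
        # rewrite it here if the S3 location differs between the two environments.
        write.execute(
            "INSERT INTO druid_segments (" + columns + ") VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)",
            (row["id"], row["dataSource"], row["created_date"], row["start"], row["end"],
             row["partitioned"], row["version"], row["used"], row["payload"]),
        )
    dst.commit()

The column list assumes the typical druid_segments schema; check your table with DESCRIBE druid_segments before running anything like this.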
See also this page with some useful tips:
https://support.imply.io/hc/en-us/articles/115004960053-Migrate-existing-Druid-Cluster-to-a-new-Imply-cluster
I am trying to create a Grafana dashboard for a large system. There are thousands of metadata variables which I need to store and access, e.g. SLAs for hundreds of applications. What is the best way to achieve this? My data source for logs and metrics is Elasticsearch.
Should I store the static data in an Elasticsearch index and query it along with the main data, or is it possible to store it in some other database and access it together with the main Elasticsearch data?
tl;dr: It is best to handle all metadata beforehand and only feed Grafana indices that are ready for display.
The only source of data in Grafana is the 'data source'. There is no way to pull in any other kind of metadata in Grafana, especially with Elasticsearch (ES) as a data source, which is fairly new to Grafana.
The best way to handle metadata is to keep it in an ES index, or to model your data together with the metadata using a transformation or ingest pipeline in ES. As suggested in the tl;dr, it is best to do all the correlation and transformation beforehand and let Grafana simply query indices to render graphs.
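For example, a minimal sketch of enriching documents at ingest time (assuming the official elasticsearch Python client, a hypothetical "app-metrics" index, and made-up SLA values) could look like this:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Static metadata, e.g. loaded from a config file or a dedicated SLA index.
    sla_by_app = {"billing": 99.9, "checkout": 99.95}

    # Denormalize the metadata into each document before indexing, so Grafana
    # only ever has to query a single, display-ready index.
    doc = {"app": "billing", "latency_ms": 120, "@timestamp": "2023-01-01T00:00:00Z"}
    doc["sla"] = sla_by_app.get(doc["app"])

    es.index(index="app-metrics", document=doc)

The same enrichment can also be done inside Elasticsearch with an ingest pipeline (for example an enrich processor) instead of in client code.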
However, if you need aggregations to be performed on the data, Grafana does support them; you can check the official documentation.
It's my first Druid experience.
I have got a local setup of Druid on my machine.
Now I'd like to run some query performance tests. My test data is a large local JSON file of 1.2 GB.
The idea was to load it into Druid and run the required SQL queries. The file is parsed and processed successfully (I'm using the Druid web UI to submit an ingestion task).
The problem I ran into is the datasource size. It doesn't make sense to me that 1.2 GB of raw JSON data results in a 35 MB datasource. Is there any limitation in a locally running Druid setup? I suspect the test data is only partially processed, but unfortunately I didn't find any relevant config to change. I would appreciate it if someone could shed light on this.
Thanks in advance
With Druid, 80-90 percent compression is expected. I have seen a 2 GB CSV file reduced to a 200 MB Druid datasource.
Can you query the count to make sure all the data was ingested? Also, please disable the approximate HyperLogLog algorithm to get an exact count: Druid SQL will switch to exact distinct counts if you set "useApproximateCountDistinct" to "false", either through the query context or through broker configuration (refer to http://druid.io/docs/latest/querying/sql.html).
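A quick way to run that check (a rough sketch, assuming the broker listens on the default port 8082 and a placeholder datasource name) is to post a Druid SQL query to the broker:

    import requests

    # Count all ingested rows; the context flag disables approximate distinct counts.
    query = {
        "query": 'SELECT COUNT(*) AS row_count FROM "your_datasource"',
        "context": {"useApproximateCountDistinct": False},
    }
    resp = requests.post("http://localhost:8082/druid/v2/sql", json=query)
    print(resp.json())

Compare the returned row count with the number of records in the original JSON file.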
You can also check the logs for exceptions and error messages. If Druid has a problem ingesting a particular JSON record, it skips that record.
I'm trying to make Debezium start reading the binlog from the end of the file directly, without replaying the existing data.
Could someone please help with this?
Based on the docs, it looks like you can use snapshot.mode=schema_only:
If you don’t need the topics to contain a consistent snapshot of the data but only need them to have the changes since the connector was started, you can use the schema_only option, where the connector only snapshots the schemas (not the data).
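For reference, a connector registration with that option (a rough sketch, assuming the Kafka Connect REST API on localhost:8083 and Debezium 1.x MySQL connector property names; all hosts, names, and credentials are placeholders) might look like this:

    import requests

    connector = {
        "name": "inventory-connector",
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": "mysql",
            "database.port": "3306",
            "database.user": "debezium",
            "database.password": "dbz",
            "database.server.id": "184054",
            "database.server.name": "dbserver1",
            # Snapshot only the schemas, then stream changes from the current binlog position.
            "snapshot.mode": "schema_only",
            "database.history.kafka.bootstrap.servers": "kafka:9092",
            "database.history.kafka.topic": "schema-changes.inventory",
        },
    }
    requests.post("http://localhost:8083/connectors", json=connector)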
We want to use Grafana to display measurement data. Our measurement setup produces a huge amount of data that is saved in files. We keep the files as-is and do post-processing on them directly with Spark (a "data lake" approach).
We now want to create some visualizations, and I thought of setting up Cassandra on the cluster running Spark and HDFS (where the files are stored). There will be a service (or Spark Streaming job) that dumps selected channels from the measurement data files to a Kafka topic, and another job that puts them into Cassandra. I chose this approach because we have other stream-processing jobs that do on-the-fly calculations as well.
My plan is to write a small REST service that exposes the data to Grafana's Simple JSON datasource so it can be pulled in and visualized (a rough sketch is included after this question). So far so good, but since the amount of data we are collecting is huge (sometimes about 300 MiB per minute), the Cassandra database should only hold the most recent few hours of data.
My question is: if someone looks at the data, finds something interesting, and creates a snapshot of a dashboard or panel (or a certain event occurs and a snapshot is taken automatically), and the original data is later deleted from Cassandra, can the snapshot still be viewed? Is the data saved with it, or does the snapshot only save metadata and query the data source anew?
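For context, the REST service I have in mind would be something like this minimal sketch (assuming Flask, the endpoint contract of the Simple JSON datasource, and a hypothetical fetch_from_cassandra() helper):

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/", methods=["GET"])
    def health():
        # The Simple JSON datasource calls this to test the connection.
        return "OK"

    @app.route("/query", methods=["POST"])
    def query():
        req = request.get_json()
        results = []
        for target in req["targets"]:
            # Datapoints are [value, timestamp-in-milliseconds] pairs.
            datapoints = fetch_from_cassandra(target["target"], req["range"])
            results.append({"target": target["target"], "datapoints": datapoints})
        return jsonify(results)

    def fetch_from_cassandra(channel, time_range):
        # Placeholder: query Cassandra for the requested channel and time range.
        return []

    if __name__ == "__main__":
        app.run(port=8080)

(The datasource also expects /search and /annotations endpoints, omitted here for brevity.)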
According to the Grafana docs:
Dashboard snapshot
A dashboard snapshot is an instant way to share an interactive dashboard publicly. When created, we strip sensitive data like queries (metric, template and annotation) and panel links, leaving only the visible metric data and series names embedded into your dashboard. Dashboard snapshots can be accessed by anyone who has the link and can reach the URL.
So the data is saved inside the snapshot and no longer depends on the original data.
As far as I understand, a local snapshot is stored in the Grafana database. At your data scale, using external storage (WebDAV, etc.) for snapshots may be a better option.