Using druid graphite emitter extension - apache-kafka

I'm trying out the graphite emitter plugin in druid to collect certain druid metrics in graphite during druid performance tests.
The intent is to then query these metrics using the REST API provided by graphite in order to characterize the performance of the deployment.
However, the numbers returned by graphite don't make sense. So, I wanted to check if I'm interpreting the results in the right manner.
Setup
The kafka indexing service is used to ingest data from kafka into druid.
I've enabled the graphite emitter and provided a whitelist of metrics to collect.
Then I pushed 5000 events to the kafka topic being indexed. Using kafka-related tools, I confirmed that the messages are indeed stored in the kafka logs.
Next, I retrieved the ingest.rows.output metric from graphite using the following call:
curl "http://Graphite_IP:Graphite_Port>/render/?target=druid.test.ingest.rows.output&format=csv"
Following are the results I got:
druid.test.ingest.rows.output,2017-02-22 01:11:00,0.0
druid.test.ingest.rows.output,2017-02-22 01:12:00,152.4
druid.test.ingest.rows.output,2017-02-22 01:13:00,97.0
druid.test.ingest.rows.output,2017-02-22 01:14:00,0.0
I'm not sure how these numbers should be interpreted:
Questions
What do the numbers 152.4 and 97.0 in the output indicate?
How can the 'number of rows' be a floating point value like 152.4?
How do these numbers relate to the '5000' messages I pushed to Kafka?
Thanks in advance,
Jithin

As per the Druid metrics page, this metric indicates the number of events after rollup.
The observed floating-point value is due to Graphite computing the average over the window of time it uses to summarize data.
So if those metrics are complete, it means your initial 5000 events were compressed (rolled up) to roughly 250 rows.

I figured out the issue after some experimentation. Since my Kafka topic has multiple partitions, Druid runs multiple tasks to index the Kafka data (one task per partition). Each of these tasks reports various metrics at regular intervals. For each metric, the number obtained from Graphite for each time interval is the average of the values reported by all the tasks for that metric in that interval. In my case above, had the aggregation function been sum (instead of average), the value obtained from Graphite would have been 5000.
However, I wasn't able to figure out whether the averaging is done by the graphite-emitter druid plugin or by graphite.
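If the averaging is happening on the Graphite side (at storage or render consolidation) rather than inside the emitter, re-querying with an explicit sum may recover the per-interval totals; if the emitter itself averages before sending, the per-task values are already gone. A minimal sketch in Python with requests, using the same placeholders and metric path as in the question and Graphite's summarize() render function:

import requests

GRAPHITE = "http://<Graphite_IP>:<Graphite_Port>"   # same placeholders as above

# summarize() re-buckets the series into 1-minute windows and applies "sum"
# to the datapoints in each window instead of Graphite's default averaging.
target = 'summarize(druid.test.ingest.rows.output,"1min","sum")'

resp = requests.get(f"{GRAPHITE}/render/", params={"target": target, "format": "csv"})
print(resp.text)

# Adding up the returned values over the test window and comparing the total
# against the 5000 events pushed to Kafka shows whether rows are being lost
# or merely averaged away.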

Related

Prediction/Estimation of missing intervals inside Apache Kafka process

The goal is to process raw readings (15-minute and 1-hour intervals) from external remote meters (assets) in real time.
The process is built from a simple Apache Kafka producer/consumer and multiple Spring Boot microservices that deduplicate messages, transform (map) readings into our system (replacing external codes with internal IDs and similar), and insert them into TimescaleDB (an extension of PostgreSQL).
Everything seems fine, but there is a requirement to perform real-time prediction/estimation of missing intervals.
A simple example for one meter with 15-minute readings:
On day 1 we get all readings. We process them and they are ingested into our DB.
On day 2 we are missing all readings, so the process is not even started for this meter.
On day 3 we again get all readings, but only for day 3. Now we need to detect that the whole of day 2 is missing, create empty readings for it, and then estimate them with some algorithm (which is not that important now).
My question: is there any way to do this without querying the existing database from one of the microservices to check whether something is missing?
Is it possible to check previous messages in Kafka topics and base the prediction/estimation on that (Kafka Streams? I don't really understand them yet), and is that even a smart thing to do, or is there another way to approach it?
Personal opinion disclaimer
It is not reasonably possible to check previous messages in Kafka Streams. If you are hellbent on doing it, you could probably try to seek messages and re-consume them, but Kafka will fight you every step of the way. The mental model is that you are transforming or aggregating data that arrives in real time. If you need to query something about previous data, you ought to have collected that information while that data was coming through.
What could work (rather well even) is to separate the prediction of missing data from the transformation.
Create two consumers for the stream.
Have one topology (or whatever it is that does your transformations already) transform the data and load it back into Kafka and from there to timescaledb.
Have one topology (or another microservice) that does what is needed to predict missing data. Your use case of backfilling a missing day could be handled by something like a count over daily windows (see the sketch after this list).
Make that trigger your backfilling either as part of that topology or as a subsequent microservice and load that data to timescaledb as well.
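Kafka Streams itself is Java-only, so purely to illustrate the daily-window count idea in that second topology, here is a rough Python sketch using a plain consumer (kafka-python). The topic name, message fields, the 96-readings-per-day expectation, and the trigger_backfill() hook are all assumptions, not part of the original setup:

import json
from collections import defaultdict
from datetime import datetime, timedelta

from kafka import KafkaConsumer

EXPECTED_PER_DAY = 96   # 15-minute readings -> 96 per meter per day (assumption)

consumer = KafkaConsumer(
    "meter-readings",                                    # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

# (meter_id, day) -> number of readings seen in that daily window
counts = defaultdict(int)
already_flagged = set()

def trigger_backfill(meter_id, day):
    # Hypothetical hook: create empty readings for `day`, estimate them with
    # your algorithm, and load them into TimescaleDB alongside the normal flow.
    print(f"backfill needed for meter {meter_id} on {day}")

for message in consumer:
    reading = message.value
    meter_id = reading["meter_id"]                       # assumed field name
    ts = datetime.fromisoformat(reading["timestamp"])    # assumed ISO timestamp
    counts[(meter_id, ts.date())] += 1

    # When data for day N arrives but day N-1 never reached the expected count,
    # the previous day is (partially) missing and can be backfilled. A real
    # service would also ignore days before the first reading it has ever seen.
    previous_day = ts.date() - timedelta(days=1)
    key = (meter_id, previous_day)
    if key not in already_flagged and counts[key] < EXPECTED_PER_DAY:
        already_flagged.add(key)
        trigger_backfill(meter_id, previous_day)

The same counting logic maps naturally onto a Kafka Streams windowed count if you stay in Java for this service.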
Are you already using Kafka Streams for the transformations? That would be a classic use case.
Recognizing missing data, not so much:
As far as I understand it, this does not require high throughput. Rather the opposite: you want to know when there is no data.
As far as I understand it, latency is not a (main) concern either.
Kafka Streams could be useful if you need to take automated action within seconds after data stops coming in. But even then, you could just write throughput metrics and trigger alerts in this case.
Other than that, it is a very stateful problem, and stream processing is at its best when you can treat every message separately or reduce them in a "standard" manner like sums or counts.
I got the impression that a delay of a few hours or a day is not that tragic and that the backfilling is currently done manually. In that case the cost of Kafka Streams would outweigh the benefits.

How to connect apache storm with grafana?

I am collecting data in JSON format that I process in real time with apache storm. I would now like to use grafana to be able to perform real time visualizations on this processed data. Is there a way to connect storm to grafana?
I haven't found much information on the topic; any help would be appreciated.
Grafana is a visualization tool that sits on top of a datastore such as Prometheus, ADX, etc. You first have to enable collection of these metrics into such a datastore.
Here is one such example: https://github.com/wizenoze/storm-metrics-reporter-prometheus
Once this is done, any metrics reported from the code (counters, gauges, JMX metrics) are saved in that datastore and can then be visualized in Grafana.

can graphite or grafana be used to monitor pyspark metrics?

In a PySpark project we call dataframe.foreachPartition(func), and inside that func we make some aiohttp calls to transfer data. What kind of monitoring tools can be used to track metrics like data rate, throughput, and elapsed time? Can we use statsd with Graphite or Grafana in this case (they're preferred if possible)? Thanks.
Here is my solution. I used PySpark accumulators to collect the metrics (number of HTTP calls, payload sent per call, etc.) in each partition; at the driver node I assign the accumulators' values to statsd gauges and send these metrics to the Graphite server, eventually visualizing them in a Grafana dashboard. So far it works well.
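For reference, a minimal sketch of that approach, assuming a statsd daemon in front of Graphite; the host name, metric prefix, and the placeholder transfer logic are illustrative assumptions rather than the original code:

from pyspark.sql import SparkSession
from statsd import StatsClient          # pip install statsd

spark = SparkSession.builder.appName("http-transfer-metrics").getOrCreate()
sc = spark.sparkContext

# Accumulators are updated on the executors and read back on the driver.
http_calls = sc.accumulator(0)
bytes_sent = sc.accumulator(0)

def transfer_partition(rows):
    for row in rows:
        payload = str(row)               # placeholder for the real serialization
        # ... aiohttp call to send `payload` would go here ...
        http_calls.add(1)
        bytes_sent.add(len(payload))

df = spark.range(1000)                   # stand-in for the real dataframe
df.foreachPartition(transfer_partition)

# Back on the driver: publish the totals as statsd gauges, which statsd
# forwards to Graphite for visualization in Grafana.
statsd_client = StatsClient(host="graphite-host", port=8125, prefix="pyspark.transfer")
statsd_client.gauge("http_calls", http_calls.value)
statsd_client.gauge("bytes_sent", bytes_sent.value)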

Apache NiFi: Oracle to MongoDB data transfer

I want to transfer data from oracle to MongoDB using apache nifi. Oracle has a total of 9 million records.
I have created nifi flow using QueryDatabaseTable and PutMongoRecord processors. This flow is working fine but has some performance issues.
After starting the NiFi flow, the number of records queued between SplitJson and PutMongoRecord keeps increasing.
Is there any way to slow down the rate at which the SplitJson processor puts records into the queue?
OR
Increase the rate of insertion in PutMongoRecord?
Right now 100k records are inserted in 30 minutes; how can I speed this process up?
@Vishal: the solution you are looking for is to increase the concurrency (concurrent tasks) of PutMongoRecord.
You can also experiment with the batch size in the processor's configuration tab.
You can also reduce the execution time of SplitJson. However, you should remember this processor is going to take 1 flowfile and make a LOT of flowfiles regardless of the timing.
How much you can increase concurrency is going to depend on how many nifi nodes you have, and how many CPU Cores each node has. Be experimental and methodical here. Move up in single increments (1-2-3-etc) and test your file in each increment. If you only have 1 node, you may not be able to tune the flow to your performance expectations. Tune the flow instead for stability and as fast as you can get it. Then consider scaling.
How much you can increase concurrency and batch size is also going to depend on the MongoDB data source and the total number of connections you can get from NiFi to Mongo.
In addition to Steven's answer, there are two properties on QueryDatabaseTable that you should experiment with:
Max Results Per Flowfile
Use Avro logical types
With the latter, you might be able to do a direct shift from Oracle to MongoDB because it'll convert Oracle date types into Avro ones, and those should in turn be converted directly into proper Mongo date types. Max Results Per Flowfile should also allow you to specify appropriate batching without having to use the extra processors.

Stackdriver throughput metric of Apache Beam streaming job

I have a streaming job implemented on top of Apache Beam, which reads messages from Apache Kafka, processes them and outputs them into BigTable.
I would like to get throughput metrics of ingress/egress inside this job i.e. how many msg/sec the job is reading and how many msg/sec it's writing.
Looking at the job graph visualization I can see that there is a throughput metric
(e.g. take a look at the exemplary picture below for demonstration).
However, looking at the documentation, it's not available in Stackdriver.
Is there any existing solution to get these metrics?
We are looking into publishing a throughput metric to Stackdriver, but one does not currently exist. The ElementCount (element_count in Stackdriver) metric is the only metric available to that UI or through Stackdriver that could be used to measure throughput. If that's displaying on the graph, it must be some computation over that metric. Unfortunately, the metric is exported as a Gauge metric to Stackdriver, so it can't be directly interpreted as a rate in Stackdriver.
A small secondary point: Dataflow doesn't actually export a metric measuring flow into and out of external sources. The ElementCount metric measures flow into inter-transform collections only. But as long as your read/write transforms are basically pass-throughs, the flow into/out of the adjacent collection should be sufficient.
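As a workaround until a real throughput metric exists, one option is to pull the raw samples yourself and turn consecutive values into a rate. A rough sketch with the Cloud Monitoring Python client; the project id is a placeholder and the exact metric type string (dataflow.googleapis.com/job/element_count) is an assumption that should be verified in Metrics Explorer first:

import time

from google.cloud import monitoring_v3   # pip install google-cloud-monitoring

project_id = "my-project"                 # placeholder
client = monitoring_v3.MetricServiceClient()

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 600}, "end_time": {"seconds": now}}
)

series_list = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": 'metric.type = "dataflow.googleapis.com/job/element_count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# If the gauge reports a cumulative element count, a throughput estimate is the
# difference between consecutive samples divided by the time between them.
for series in series_list:
    points = sorted(series.points, key=lambda p: p.interval.end_time)
    for prev, cur in zip(points, points[1:]):
        elapsed = (cur.interval.end_time - prev.interval.end_time).total_seconds()
        if elapsed > 0:
            rate = (cur.value.int64_value - prev.value.int64_value) / elapsed
            print(dict(series.metric.labels), f"{rate:.1f} elements/sec")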