Stackdriver throughput metric of Apache Beam streaming job

I have a streaming job implemented on top of Apache Beam, which reads messages from Apache Kafka, processes them, and writes them out to Bigtable.
I would like to get ingress/egress throughput metrics for this job, i.e. how many msg/sec the job is reading and how many msg/sec it's writing.
Looking at the job graph visualization, I can see that there is a throughput metric (see the example picture below for demonstration).
However, looking at the documentation, it's not available in Stackdriver.
Is there any existing solution to get these metrics?

We are looking into publishing a throughput metric to Stackdriver, but one does not currently exist. The ElementCount (element_count in Stackdriver) metric is the only metric available to that UI or through Stackdriver that could be used to measure throughput. If that's what is displayed on the graph, it must be some computation over that metric. Unfortunately, the metric is exported to Stackdriver as a Gauge, so it can't be directly interpreted as a rate there.
A small secondary point: Dataflow doesn't actually export a metric measuring flow into and out of external sources. The ElementCount metric measures flow into inter-transform collections only. But as long as your read/write transforms are basically pass-throughs, the flow into/out of the adjacent collection should be sufficient.
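If you need an approximate rate in the meantime, one option is to read element_count from the Monitoring API yourself and difference successive gauge samples. Below is a minimal sketch in Python, assuming the google-cloud-monitoring client library and the dataflow.googleapis.com/job/element_count metric type (verify both against your project):

```python
# Sketch: approximate elements/sec by differencing the element_count gauge.
# Assumes the `google-cloud-monitoring` package and the Dataflow metric type
# `dataflow.googleapis.com/job/element_count` (verify in your project).
import time
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # hypothetical project id

client = monitoring_v3.MetricServiceClient()
now = time.time()
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": int(now)},
        "start_time": {"seconds": int(now - 600)},  # last 10 minutes
    }
)

series = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type = "dataflow.googleapis.com/job/element_count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for ts in series:
    # element_count is reported per PCollection; points come back newest-first.
    points = list(ts.points)
    if len(points) < 2:
        continue
    newer, older = points[0], points[1]
    dv = newer.value.int64_value - older.value.int64_value
    dt = newer.interval.end_time.timestamp() - older.interval.end_time.timestamp()
    if dt > 0:
        print(ts.resource.labels.get("job_name"), f"{dv / dt:.1f} elements/sec")
```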

Related

Is there an HPA configuration that could autoscale based on previous CPU usage?

We currently have a GKE environment with several HPAs for different deployments. All of them work just fine out of the box, but sometimes our users still experience some delay during peak hours.
Usually this delay is the time it takes the new instances to start and become ready.
What I'd like is a way to have an HPA that could predict usage and scale eagerly before it is needed.
The simplest implementation I can think of is an HPA that takes the average usage of previous days and scales up or down in advance (say, 10 minutes earlier) based on the historic usage for the current time frame.
Is there anything like that in vanilla k8s or GKE? I was unable to find anything like that in GCP's docs.
If you want to scale your applications based on events/custom metrics, you can use KEDA (Kubernetes-based Event-Driven Autoscaler), which supports scaling based on GCP Stackdriver, Datadog, or Prometheus metrics (among many other scalers).
What you need to do is create some queries that fetch the CPU usage at CURRENT_TIMESTAMP - 23H50M (or the aggregated value for the last week), then define some thresholds to scale your application up/down.
If you have trouble doing this with your monitoring tool, you can create a custom metrics API that queries the monitoring API and aggregates the values (with the time shift) before sending them to the metrics-api scaler.
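As a rough illustration of the time-shifted query, here is a minimal sketch in Python using the Cloud Monitoring client library; the metric type kubernetes.io/container/cpu/limit_utilization and the project ID are assumptions to verify against your cluster:

```python
# Sketch: read CPU utilization from ~23h50m ago (this time yesterday, shifted
# 10 minutes ahead) to use as a "predicted" load signal for scaling.
# Assumes the `google-cloud-monitoring` package; the metric type
# `kubernetes.io/container/cpu/limit_utilization` is a GKE gauge metric
# (verify it exists and fits your workloads).
import time
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # hypothetical

SHIFT = 24 * 3600 - 600  # CURRENT_TIMESTAMP - 23H50M, as described above
now = time.time()
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": int(now - SHIFT)},
        "start_time": {"seconds": int(now - SHIFT - 300)},  # 5-minute window
    }
)

client = monitoring_v3.MetricServiceClient()
series = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        # In practice you would also filter on resource labels
        # (cluster, namespace, pod) to scope this to one deployment.
        "filter": 'metric.type = "kubernetes.io/container/cpu/limit_utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# Average the historical samples; expose the result through your custom
# metrics API so KEDA's metrics-api scaler can act on it.
values = [p.value.double_value for ts in series for p in ts.points]
predicted = sum(values) / len(values) if values else 0.0
print(f"predicted CPU utilization 10 minutes from now: {predicted:.3f}")
```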

Are time-related OpenTelemetry metrics an anti-pattern?

When setting up metrics and telemetry for my API, is it an anti-pattern to track something like "request latency" as a metric (possibly in addition to tracking it as a span)?
For example, say my API makes a request to another API in order to generate a response. If I want to track latency information such as:
My API's response latency
The latency for the request from my API to the upstream API
DB request latency
Etc.
That seems like a good candidate for a span, but I think it would also be helpful to have it as a metric.
Is it a bad practice to duplicate the OTEL data capture (as both a metric and a span)?
I can likely extract that information and avoid duplication, but it might be simpler to log it as a metric as well.
Thanks in advance for your help.
I would say traces and metrics each have their own use cases. Traces usually have a short retention period (AWS X-Ray: 30 days), and you can only generate metrics from traces for a short time window (AWS X-Ray: 24 hours). If you need a longer time period, those queries will be expensive (and slow). So I would say metrics stored in a time-series DB are the perfect fit for longer-period stats.
BTW: there is also the experimental Span Metrics Processor, which you can use to generate Prometheus metrics from spans directly with the OTEL Collector - no additional app instrumentation/code.
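For what it's worth, capturing the same latency as both a span and a histogram data point is straightforward with the OpenTelemetry Python API. A minimal sketch (the span name upstream_call, the metric name http.client.duration, and the peer.service attribute are illustrative choices; SDK and exporters are assumed to be configured elsewhere):

```python
# Sketch: record one upstream call as both a span (for traces) and a
# histogram point (for long-retention metrics). Assumes `opentelemetry-api`
# with an SDK and exporters configured elsewhere in the application.
import time

from opentelemetry import metrics, trace

tracer = trace.get_tracer("my-api")
meter = metrics.get_meter("my-api")

# Histogram for request latency; name and unit are illustrative choices.
latency_ms = meter.create_histogram(
    "http.client.duration", unit="ms", description="Upstream request latency"
)

def call_upstream():
    # Span: fine-grained, short-retention view of this particular call.
    with tracer.start_as_current_span("upstream_call") as span:
        span.set_attribute("peer.service", "upstream-api")
        start = time.perf_counter()
        # ... perform the actual HTTP request to the upstream API here ...
        elapsed_ms = (time.perf_counter() - start) * 1000
        # Metric: the same measurement duplicated as a histogram point with
        # low-cardinality attributes, for cheap long-retention queries.
        latency_ms.record(elapsed_ms, {"peer.service": "upstream-api"})
```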

Can Graphite or Grafana be used to monitor PySpark metrics?

In a PySpark project we have pyspark dataframe.foreachPartition(func), and in that func we make some aiohttp calls to transfer data. What kind of monitoring tools can be used to track metrics like data rate, throughput, and time elapsed? Can we use statsd with Graphite or Grafana in this case (they're preferred, if possible)? Thanks.
Here is my solution. I used PySpark's accumulators to collect the metrics (number of HTTP calls, payload sent per call, etc.) in each partition, then at the driver node assigned these accumulators' values to statsd gauge variables and sent the metrics to the Graphite server, eventually visualizing them in a Grafana dashboard. It has worked well so far.
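A minimal sketch of that approach, assuming the statsd Python client and a statsd daemon relaying to Graphite; host, port, and metric names are placeholders:

```python
# Sketch: count work inside foreachPartition with accumulators, then push
# the totals from the driver to Graphite via statsd. Assumes the `statsd`
# package; host/port/metric names are placeholders.
import statsd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transfer-job").getOrCreate()
sc = spark.sparkContext

# Accumulators are written on the executors and read back on the driver.
http_calls = sc.accumulator(0)
bytes_sent = sc.accumulator(0)

def transfer_partition(rows):
    # Executor side: do the aiohttp transfers and count the work done.
    for row in rows:
        payload = str(row)  # placeholder for the real serialization
        # ... perform the aiohttp call with `payload` here ...
        http_calls.add(1)
        bytes_sent.add(len(payload))

df = spark.range(100_000)  # placeholder DataFrame
df.foreachPartition(transfer_partition)

# Driver side: accumulator values are only reliable after the action completes.
client = statsd.StatsClient("graphite-host", 8125, prefix="transfer_job")
client.gauge("http_calls", http_calls.value)
client.gauge("bytes_sent", bytes_sent.value)
```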

Kafka: at what volume of data is it worth using?

I work on a log centralization project.
I'm working with ELK to collect/aggregate/store/visualize my data. I see that Kafka can be useful for large volumes of data, but I can't find information on the volume of data at which it becomes worthwhile to use.
10 GB of logs per day? Less? More?
Thanks for your help.
Let's approach this in two ways.
What volumes of data is Kafka suitable for? Kafka is used at large scale (Netflix, Uber, PayPal, Twitter, etc.) and small.
You can start with a cluster of three brokers handling a few MB if you want, and scale out from there as required. 10 GB of data a day would be perfectly reasonable in Kafka, but so would ten times less or ten times more.
What is Kafka suitable for? In the context of your question, Kafka serves as an event-driven integration point between systems. It can be a "dumb" pipeline, but since it persists data, that data can be reconsumed elsewhere. It also offers native stream-processing capabilities and integration with other systems.
If all you are doing is getting logs into Elasticsearch, then Kafka may be overkill. But if you want to use that log data elsewhere as well (e.g. HDFS, S3, etc.), or process it for patterns, or filter it on conditions to route it elsewhere, then Kafka is a sensible option to route it through. This talk explores some of these concepts.
In terms of ELK and Kafka specifically, Logstash and Beats can write to Kafka as an output, and there's a Kafka Connect connector for Elasticsearch.
Disclaimer: I work for Confluent.

Using druid graphite emitter extension

I'm trying out the graphite emitter extension in Druid to collect certain Druid metrics in Graphite during performance tests.
The intent is to then query these metrics using the REST API provided by Graphite in order to characterize the performance of the deployment.
However, the numbers returned by Graphite don't make sense, so I wanted to check whether I'm interpreting the results correctly.
Setup
The Kafka indexing service is used to ingest data from Kafka into Druid.
I've enabled the graphite emitter and provided a whitelist of metrics to collect.
Then I pushed 5000 events to the Kafka topic being indexed. Using Kafka-related tools, I confirmed that the messages are indeed stored in the Kafka logs.
Next, I retrieved the ingest.rows.output metric from graphite using the following call:
curl "http://Graphite_IP:Graphite_Port>/render/?target=druid.test.ingest.rows.output&format=csv"
Following are the results I got:
druid.test.ingest.rows.output,2017-02-22 01:11:00,0.0
druid.test.ingest.rows.output,2017-02-22 01:12:00,152.4
druid.test.ingest.rows.output,2017-02-22 01:13:00,97.0
druid.test.ingest.rows.output,2017-02-22 01:14:00,0.0
I'm not sure how these numbers should be interpreted:
Questions
What do the numbers 152.4 and 97.0 in the output indicate?
How can the 'number of rows' be a floating point value like 152.4?
How do these numbers relate to the 5000 messages I pushed to Kafka?
Thanks in advance,
Jithin
As per the Druid metrics page, it indicates the number of events after rollup.
The observed floating-point value is due to the Graphite server computing an average over the time window it uses to summarize data.
So if those metrics are complete, it means that your initial 5000 events were compressed to about 250 rows.
I figured out the issue after some experimentation. Since my Kafka topic has multiple partitions, Druid runs multiple tasks to index the Kafka data (one task per partition). Each of these tasks reports various metrics at regular intervals. For each metric, the number obtained from Graphite for each time interval is the average of the values reported by all the tasks for that metric in that interval. In my case, had the aggregation function been sum (instead of average), the value obtained from Graphite would have been 5000.
However, I wasn't able to figure out whether the averaging is done by the graphite-emitter Druid extension or by Graphite itself.
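One follow-up note: if the averaging turns out to happen on the Graphite side, you can override the consolidation function per query with Graphite's consolidateBy render function, e.g. requesting a sum instead of the default average. A sketch using the requests library (host/port placeholders as in the question):

```python
# Sketch: ask Graphite to consolidate datapoints with sum() instead of the
# default average, so per-task values add up rather than average out.
import requests

# GRAPHITE_IP / GRAPHITE_PORT are placeholders, as in the question.
resp = requests.get(
    "http://GRAPHITE_IP:GRAPHITE_PORT/render/",
    params={
        # consolidateBy is a standard Graphite render function.
        "target": 'consolidateBy(druid.test.ingest.rows.output, "sum")',
        "format": "csv",
        "from": "-15min",
    },
)
print(resp.text)
```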