Apache Beam on Google Dataflow: Collecting metrics from within the main method - apache-beam

I have a batch pipeline which pulls data from a Cassandra table and writes into Kafka. I would like to gather various statistics based on the Cassandra data, for example the total number of records in the Cassandra table, the number of records having a null value for a column, etc. I tried to leverage Beam metrics. Although they show the correct counts in the Google Cloud console after the pipeline has completed execution, I am unable to read them in the main program after the pipeline.run() method; it throws an unsupported operation exception. I am using Google Dataflow and bundle the pipeline as a Flex Template. Is there any way to get this to work?
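For reference, the question does not include code, but the usual Beam metrics pattern looks roughly like the Python sketch below; the counter namespace, the column name, and the dict-shaped rows standing in for the Cassandra read are assumptions for illustration, and the final metrics().query() call is the step that fails in this setup.

    import apache_beam as beam
    from apache_beam.metrics import Metrics
    from apache_beam.metrics.metric import MetricsFilter

    class CountStats(beam.DoFn):
        """Counts total rows and rows with a null column (names are placeholders)."""
        def __init__(self):
            self.total = Metrics.counter("cassandra_stats", "total_records")
            self.nulls = Metrics.counter("cassandra_stats", "null_some_column")

        def process(self, row):
            self.total.inc()
            if row.get("some_column") is None:   # placeholder column name
                self.nulls.inc()
            yield row

    pipeline = beam.Pipeline()                   # add your Dataflow options here
    (pipeline
     | beam.Create([{"some_column": None}, {"some_column": 1}])  # stand-in for the Cassandra read
     | beam.ParDo(CountStats()))

    result = pipeline.run()
    result.wait_until_finish()

    # This query works on DirectRunner; it is the call the question says fails.
    for counter in result.metrics().query(
            MetricsFilter().with_namespace("cassandra_stats"))["counters"]:
        print(counter.key.metric.name, counter.result)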

If you can get the job id, Dataflow offers a public API that can be used to query the metrics it tracks internally. Easier might be to get these from Stackdriver; see, e.g., Collecting Application Metrics From Google Cloud Dataflow
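For illustration, here is a rough Python sketch of querying that API with the google-api-python-client discovery client; the project, region, and job id are placeholders, and application-default credentials are assumed to be available.

    from googleapiclient.discovery import build

    PROJECT = "my-gcp-project"        # placeholder
    REGION = "us-central1"            # placeholder
    JOB_ID = "your-dataflow-job-id"   # placeholder

    # Uses the Dataflow v1b3 REST API: projects.locations.jobs.getMetrics
    dataflow = build("dataflow", "v1b3")
    response = (
        dataflow.projects()
        .locations()
        .jobs()
        .getMetrics(projectId=PROJECT, location=REGION, jobId=JOB_ID)
        .execute()
    )

    # User-defined Beam counters show up here alongside Dataflow's own metrics.
    for metric in response.get("metrics", []):
        print(metric.get("name", {}).get("name"), metric.get("scalar"))

If the main program only knows the job name rather than the id, projects.locations.jobs.list can be used to look the id up first.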

Related

How to keep track of Cassandra write successes while using Kafka in a cluster

When working in my cluster I have the constraint that my frontend cannot display a finished job until all of the job's different results have been added to Cassandra. These results are computed in their individual microservices and sent via Kafka to a Cassandra writer.
My question is whether there are any best practices for letting the frontend know when these writes have completed. Should I make another database entry for results, or is there some other smart way that would scale well?
Each job has about 100 different results written into it, and I have around 1000 jobs/day.
I used Cassandra for a UI backend in the past with Kafka, and we would store a status field in each DB record, which would periodically get updated through a slew of Kafka Streams processors (there were easily more than 1000 DB writes per day).
The UI itself was running some setInterval(refresh) JS function that would query the latest database state and then update the DOM accordingly.
Your other option is to push some WebSocket/SSE data into the UI from some other service that indicates "data is finished".
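Purely as an illustration of the status-field idea (the original setup used Kafka Streams), here is a rough Python sketch; the table, column, and topic names, plus the expected count of 100 results per job, are assumptions.

    import json
    from kafka import KafkaConsumer          # kafka-python
    from cassandra.cluster import Cluster    # cassandra-driver

    EXPECTED_RESULTS = 100                   # ~100 results per job, per the question

    session = Cluster(["cassandra-host"]).connect("jobs_ks")            # placeholders
    consumer = KafkaConsumer("job-results", bootstrap_servers="kafka:9092")

    for msg in consumer:
        result = json.loads(msg.value)
        # Write the individual result row.
        session.execute(
            "INSERT INTO job_results (job_id, result_id, payload) VALUES (%s, %s, %s)",
            (result["job_id"], result["result_id"], json.dumps(result["payload"])))
        # Flip the job's status once every expected result has landed.
        count = session.execute(
            "SELECT COUNT(*) FROM job_results WHERE job_id = %s",
            (result["job_id"],)).one()[0]
        if count >= EXPECTED_RESULTS:
            session.execute(
                "UPDATE job_status SET status = 'FINISHED' WHERE job_id = %s",
                (result["job_id"],))

The frontend then either polls job_status (the setInterval approach above) or gets notified over a WebSocket/SSE channel once the status flips.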

Reading from a MongoDB changeStream with unbounded PCollections in Apache Beam

I'm designing a new way for my company to stream data from multiple MongoDB databases, perform some arbitrary initial transformations, and sink them into BigQuery.
There are various requirements but the key ones are speed and ability to omit or redact certain fields before they reach the data warehouse.
We're using Dataflow to basically do this:
MongoDB -> Dataflow (Apache Beam, Python) -> BigQuery
We basically need to just wait on the collection.watch() call as the input, but from the docs and existing research it may not be possible.
At the moment the MongoDB connector is bounded, and there seems to be no readily available solution to read from a changeStream, or a collection, in an unbounded way.
Is it possible to read from a changeStream and have the pipeline wait until the task is killed, rather than running out of records?
In this instance I decided to go via Google Pub/Sub, which serves as the unbounded data source.
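A rough sketch of that shape: a small bridge process tails the change stream with pymongo and republishes each change to a Pub/Sub topic, and a streaming Beam pipeline consumes the topic. Connection strings, project/topic names, the redacted field names, and the BigQuery table are all placeholders.

    # Bridge process: tail the MongoDB change stream, republish to Pub/Sub.
    import bson.json_util
    from pymongo import MongoClient
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "mongo-changes")        # placeholders
    collection = MongoClient("mongodb://mongo-host:27017")["mydb"]["docs"]  # placeholders

    with collection.watch(full_document="updateLookup") as stream:
        for change in stream:
            # json_util handles BSON types (ObjectId, dates) that plain json cannot.
            publisher.publish(topic_path, bson.json_util.dumps(change).encode("utf-8"))

And on the Beam side, an unbounded read from Pub/Sub with the redaction step in the middle:

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)   # plus your usual Dataflow options
    with beam.Pipeline(options=options) as p:
        (p
         | "ReadChanges" >> beam.io.ReadFromPubSub(
               topic="projects/my-project/topics/mongo-changes")
         | "Parse" >> beam.Map(json.loads)
         | "Redact" >> beam.Map(lambda change: {
               k: v for k, v in (change.get("fullDocument") or {}).items()
               if k not in ("ssn", "secret")})   # example fields to drop
         | "WriteToBQ" >> beam.io.WriteToBigQuery(
               "my-project:my_dataset.my_table",
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))  # table assumed to exist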

How to expose a REST service from HDFS?

My project requires exposing a REST service from HDFS. Currently we are processing a huge amount of data on HDFS, and we use MR jobs to store all the data from HDFS into an Apache Impala database for our reporting needs.
At present we have a REST endpoint hitting the Impala database, but the problem is that the Impala database is not fully updated with the latest data from HDFS.
We run MR jobs periodically to update the Impala database, but since the MR jobs consume a lot of time, we are not able to perform real-time queries on HDFS.
Use case/scenario: Let me explain in detail. We have one application called "duct" built on top of Hadoop; this application processes a huge amount of data and creates individual archives (serialized Avro files) on HDFS for every run. We have another application (let's say its name is Avro-To-Impala) which takes these Avro archives as input, processes them using MR jobs, and populates a new schema on Impala for every "duct" run. This tool reads the Avro files and creates and populates the tables in the Impala schema. In order to expose the data outside (the REST endpoint) we are relying on the Impala database. So whenever we have output from "duct", we explicitly run the "Avro-To-Impala" tool to update the database. This processing takes a long time, and because of this the REST endpoint returns obsolete or old data to the consumers of the web service.
Can anyone suggest a solution for this kind of problem?
Many thanks

Using druid graphite emitter extension

I'm trying out the graphite emitter plugin in Druid to collect certain Druid metrics in Graphite during Druid performance tests.
The intent is to then query these metrics using the REST API provided by graphite in order to characterize the performance of the deployment.
However, the numbers returned by graphite don't make sense. So, I wanted to check if I'm interpreting the results in the right manner.
Setup
The Kafka indexing service is used to ingest data from Kafka into Druid.
I've enabled the graphite emitter and provided a whitelist of metrics to collect.
Then I pushed 5000 events to the Kafka topic being indexed. Using Kafka-related tools, I confirmed that the messages are indeed stored in the Kafka logs.
Next, I retrieved the ingest.rows.output metric from Graphite using the following call:
curl "http://<Graphite_IP>:<Graphite_Port>/render/?target=druid.test.ingest.rows.output&format=csv"
Following are the results I got:
druid.test.ingest.rows.output,2017-02-22 01:11:00,0.0
druid.test.ingest.rows.output,2017-02-22 01:12:00,152.4
druid.test.ingest.rows.output,2017-02-22 01:13:00,97.0
druid.test.ingest.rows.output,2017-02-22 01:14:00,0.0
I don't know how these numbers need to be interpreted:
Questions
What do the numbers 152.4 and 97.0 in the output indicate?
How can the 'number of rows' be a floating point value like 152.4?
How do these numbers relate to the 5000 messages I pushed to Kafka?
Thanks in advance,
Jithin
As per the Druid metrics page, it indicates the number of events after rollup.
The observed floating point value is due to the average being computed over the window of time that the Graphite server uses to summarize data.
So if those metrics are complete, it means that your initial 5000 rows were compressed to roughly 250 rows.
I figured out the issue after some experimentation. Since my Kafka topic has multiple partitions, Druid runs multiple tasks to index the Kafka data (one task per partition). Each of these tasks reports various metrics at regular intervals. For each metric, the number obtained from Graphite for each time interval is the average of the values reported by all the tasks for that metric in that interval. In my case above, had the aggregation function been sum (instead of average), the value obtained from Graphite would have been 5000.
However, I wasn't able to figure out whether the averaging is done by the graphite-emitter druid plugin or by graphite.
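As a side note, if you want a single total over a window rather than the per-minute points, Graphite's render API can re-summarize the stored values with a sum, which lines up with the roughly 250-row rollup figure above. A small Python sketch, with a placeholder host and port:

    import requests

    # summarize() re-buckets the series into 1-hour windows, summing the points.
    resp = requests.get(
        "http://graphite-host:80/render/",        # placeholder host:port
        params={
            "target": 'summarize(druid.test.ingest.rows.output, "1hour", "sum")',
            "from": "-1h",
            "format": "csv",
        })
    print(resp.text)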

Moving all documents in a collection from mongo to azure blob storage

I am trying to move all documents in a Mongo collection to Azure Blob Storage within a scheduled Azure WebJob, using C# and the Mongo 1.9.1 drivers.
I do not want to hold all 100000 documents in memory in the WebJob. Is there a better way, maybe a batched retrieval of documents from Mongo? Or is there a completely different approach that I can look into?
You could have one web job process responsible for queuing up each document individually. This web job would only need the unique identifier for each document so that it could push that as a message to an Azure Storage Queue. You can have this web job configured to be scheduled or manual depending on the need.
Then have another web job that migrates a single file. You can have this web job set up to be continuous so that as long as there are messages on the queue to be read, it will start processing them. By default a web job will parallelize 16 items off of a queue. This is configurable.
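The question is about C# WebJobs, but purely to illustrate the shape of this two-stage pattern, here is a Python sketch using pymongo and the Azure Storage SDKs; the connection string, queue, container, database, and collection names are all placeholders.

    from azure.storage.queue import QueueClient
    from azure.storage.blob import BlobServiceClient
    from bson import ObjectId
    from bson.json_util import dumps
    from pymongo import MongoClient

    CONN = "<azure-storage-connection-string>"                        # placeholder
    collection = MongoClient("mongodb://mongo-host")["mydb"]["docs"]  # placeholders

    # Stage 1 (scheduled job): enqueue one message per document id.
    queue = QueueClient.from_connection_string(CONN, "docs-to-migrate")
    for doc in collection.find({}, {"_id": 1}, batch_size=500):  # cursor streams in batches
        queue.send_message(str(doc["_id"]))

    # Stage 2 (continuous job): pull ids off the queue, copy each document to blob storage.
    blobs = BlobServiceClient.from_connection_string(CONN).get_container_client("mongo-export")
    for msg in queue.receive_messages():
        doc = collection.find_one({"_id": ObjectId(msg.content)})
        blobs.upload_blob(name=f"{msg.content}.json", data=dumps(doc))
        queue.delete_message(msg)

The batch_size hint also addresses the "batched retrieve" part of the question: the cursor pulls documents from Mongo in chunks rather than loading all 100000 at once.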