Is it possible to load the latest available datapoint and discard the rest in Druid?

Consider raw events (an alpha set in Druid parlance) of the form:
timestamp | compoundId | dimension 1 | dimension 2 | metric 1 | metric 2
Normally in Druid, data can be loaded on realtime nodes and historical nodes based on rules. These rules seem to be related to time ranges. E.g.:
load the last day of data on boxes A
load the last week (except last day) on boxes B
keep the rest in deep storage but don't load segments.
In contrast I want to support the use-case of:
load the last event for each given compoundId on boxes A, regardless of whether that last event happened today or yesterday.
Is this possible?
Alternatively, if the above is not possible, I figured a workaround might be to create a betaset (finest granularity level) as follows:
Given an alphaset with schema as defined above, create a betaset so that:
all events for a given compoundId are rolled up.
metric1 and metric2 are set to the metrics from the last occurring (largest timestamp) event.
Any advice much appreciated.

I believe the first and last aggregators are what you are looking for.
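For example, a query-time groupBy with doubleLast aggregators keeps, per compoundId, only the metric value with the largest __time. A minimal sketch (the dataSource name and interval are placeholders):

{
  "queryType": "groupBy",
  "dataSource": "alpha",
  "granularity": "all",
  "dimensions": ["compoundId"],
  "aggregations": [
    { "type": "doubleLast", "name": "lastMetric1", "fieldName": "metric1" },
    { "type": "doubleLast", "name": "lastMetric2", "fieldName": "metric2" }
  ],
  "intervals": ["2018-01-01/2019-01-01"]
}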

Graphite: keepLastValue for a given time period instead of number of points

I'm using Graphite + Grafana to monitor (by sampling) queue lengths in a test system at work. The data that gets uploaded to Graphite is grouped into different series/metrics by properties of the payloads in the queue. These properties can be somewhat arbitrary, at least to the point where they are not all known at the time when the data collection script is run.
For example, a property could be the project that the payload belongs to and this could be uploaded as a separate series/metric so that we can monitor the queues broken down by the different projects.
This has the consequence that Graphite sends a lot of null values for certain metrics if the queues in the test system did not contain any payloads with properties that would group it into that specific series/metric.
For example, if a certain project did not have any payloads in the queue at the time when the data collection was run.
In Grafana this is not so nice as the line graphs don't show up as connected lines and gauges will show either null or the last non-null value.
For line graphs I can just choose to connect null values in Grafana, but for gauges that's not possible.
I know about the keepLastValue function in Graphite; it includes a limit for how long to keep the value, which I like very much, as I only want to keep the last value until the next time data collection is run. Data collection runs periodically at known intervals.
The problem with keepLastValue is that it expects a number of points as this limit. I would rather give it a time period instead. In Grafana the relationship between time and data points is very dynamic, so it's not easy to hard-code a good limit for keepLastValue.
Thus, my question is: Is there a way to tell Graphite to keep the last value for a given time instead of a given number of points?
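For illustration, keepLastValue takes a point count, so the closest approximation today is to estimate one from the storage resolution (the series name and numbers here are made up):

keepLastValue(queues.projectX.length, 10)

With a 1-point-per-minute retention that roughly means "keep for 10 minutes", but the mapping breaks down once Graphite rolls the series up to a coarser resolution or the dashboard time range changes.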

Google Cloud: Metrics Explorer: "Aggregator" vs "Aligner" - What's the difference?

Trying to understand the difference between the two: Aggregator vs Aligner.
The docs were not helpful to me.
What I'm trying to achieve is to get the bytes of logs generated within a week per each namespace and container combination. For example, I want to see that container C in namespace N generated 10Gb of logs during the last 7 days.
This is how far I got:
Resource type = Kubernetes Container
Metric = Log bytes
Group by = namespace_name and container_name
Aggregator = sum(?) mean(?)
Minimum alignment period = 1(?) 7(?) days
Aligner = sum(?) mean(?)
I was confused with this until I realized that a single metric, such as kubernetes.io/container/cpu/core_usage_time is available in multiple different resources in my cluster.
So when you search for that metric, you'll get a whole lot of different resources that emit that metric. Aggregation is adding up all the data from those different resources WITH THAT SAME METRIC.
This all combines into one "time series" for that metric, an aggregation of all the individual time series from each of those different resources.
Now, alignment is the process of using that time series and putting all the data points through a function (over a period of time, known as the alignment period) which results in one single data point (per alignment period).
So aggregation combines the same metric across multiple resources, while alignment combines multiple data points in the same time series into one data point (per alignment period, which is why that field is required when using alignment).
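If it helps to see the two stages spelled out, here is a rough Monitoring Query Language sketch of the same idea (untested, using the CPU metric from the example above): the align line is the aligner (one data point per alignment period per time series), and the group_by line is the aggregator (summing across the resources that share a namespace and container name).

fetch k8s_container
| metric 'kubernetes.io/container/cpu/core_usage_time'
| align delta(1h)
| group_by [resource.namespace_name, resource.container_name], sum(val())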

Creating Datadog alerts for when the percentage difference between two custom metrics goes over a specified percentage threshold

My current situation is that I have two different data feeds (Feed A & Feed B) and I have created custom metrics for both feeds:
Metric of Order counts from Feed A
Metric of Order counts from Feed B
The next step is to create alert monitoring for the agreed-upon threshold of difference between the two metrics. Say we have agreed that it is acceptable for Order Counts from Feed A to be within ~5% of Order Counts from Feed B. How can I go about creating that threshold and comparison between the two metrics that I have already developed in Datadog?
I would like to send alerts to myself when the % difference between the two data feeds is > 5 % for a daily validation.
You might be able to get this if you...
Start creating a metric type monitor
To the far right of the metric definition, select "advanced"
Select "Add Query"
Input your metrics
In the field called "Express these queries as:", input (a-b)/b or some such
Trigger when the metric is above or equal to the threshold in total during the last 24 hours
Set Alert threshold >= 0.05
If you run into trouble as you set it up, you may want to reach out to support@datadoghq.com to get their assistance.
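As a rough illustration of that setup (the metric names are hypothetical placeholders for your two custom metrics):

a: sum:orders.feed_a.count{*}
b: sum:orders.feed_b.count{*}
Express these queries as: (a - b) / b
Alert threshold: >= 0.05

Note that (a - b) / b only catches Feed A being higher than Feed B; if the difference can go in either direction, you may need a second monitor with (b - a) / b as well.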

Streaming data Complex Event Processing over files and a rather long period

My challenge:
We receive files every day with about 200,000 records. We keep the files for approximately 1 year to support re-processing, etc.
For the sake of the discussion assume it is some sort of long lasting fulfilment process, with a provisioning-ID that correlates records.
we need to identify flexible patterns in these files, and trigger events
typical questions are:
if record A is followed by record B which is followed by record C, and all records occurred within 60 days, then trigger an event
if record D or record E was found, but record F did NOT follow within 30 days, then trigger an event
if both records D and record E were found (irrespective of the order), followed by ... within 24 hours, then trigger an event
some patterns require lookups in a DB/NoSQL store, or joins for additional information, either to select the record or to put into the event.
"Selecting a record" can be a simple "field-A equals", but can also be "field-A in []", "field-A match" or "func identify(field-A, field-B)"
"days" might also be "hours" or "in the previous month", hence more flexible than just "days". Usually we have some date/timestamp in the record. The maximum is currently "within 6 months" (cancel within setup phase).
The created events (preferably JSON) need to contain data from all records which were part of the selection process.
We need an approach that allows us to flexibly change (add, modify, delete) the patterns, optionally re-processing the input files.
Any thoughts on how to tackle the problem elegantly? Maybe some Python or Java framework, or do any of the public cloud solutions (AWS, GCP, Azure) address this problem space especially well?
Thanks a lot for your help.
After some discussions and reading, we'll first try Apache Flink with the FlinkCEP library. From the docs and blog entries it seems able to do the job. It also seems to be AWS's choice, running on their EMR cluster. We didn't find any managed service on GCP or Azure providing this functionality. Of course we can always deploy and manage it ourselves. Unfortunately we didn't find a Python framework.
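To give an idea of what the pattern definitions look like, here is a minimal FlinkCEP sketch (Java) for the first rule above: record A followed by record B followed by record C, all within 60 days. The Event POJO, its getType()/getProvisioningId() accessors and the events stream are hypothetical placeholders for the real record type.

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.windowing.time.Time;

// A -> B -> C, within 60 days
Pattern<Event, ?> pattern = Pattern.<Event>begin("a")
        .where(new SimpleCondition<Event>() {
            @Override public boolean filter(Event e) { return "A".equals(e.getType()); }
        })
        .followedBy("b")
        .where(new SimpleCondition<Event>() {
            @Override public boolean filter(Event e) { return "B".equals(e.getType()); }
        })
        .followedBy("c")
        .where(new SimpleCondition<Event>() {
            @Override public boolean filter(Event e) { return "C".equals(e.getType()); }
        })
        .within(Time.days(60));

// Key the stream by the correlation id so each provisioning-ID is matched independently,
// then turn completed matches into the outgoing (JSON) events.
PatternStream<Event> matches = CEP.pattern(events.keyBy(Event::getProvisioningId), pattern);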

Spark Structured streaming - dropDuplicates with watermark alternate solution

I am trying to deduplicate streaming data using the dropDuplicates function with a watermark. The problem I am facing currently is that I have two timestamps for a given record:
One is the eventtimestamp - the timestamp of the record's creation at the source.
The other is a transfer timestamp - the timestamp from an intermediate process that is responsible for streaming the data.
The duplicates are introduced during the intermediate stage, so for a given record and its duplicate, the eventtimestamp is the same but the transfer timestamp is different.
For the watermark, I'd like to use the transfertimestamp because I know the duplicates can't occur more than 3 minutes apart in transfer. But I can't use it within dropDuplicates because it won't catch the duplicates, as they have different transfer timestamps.
Here is an example,
Event 1:{ "EventString":"example1", "Eventtimestamp": "2018-11-29T10:00:00.00", "TransferTimestamp": "2018-11-29T10:05:00.00" }
Event 2 (duplicate): {"EventString":"example1", "Eventtimestamp": "2018-11-29T10:00:00.00", "TransferTimestamp": "2018-11-29T10:08:00.00"}
In this case, the duplicate was created during transfer, 3 minutes after the original event.
My code is like below,
streamDataset
  .withWatermark("transferTimestamp", "4 minutes")
  .dropDuplicates("eventstring", "transferTimestamp");
The above code won't drop the duplicates as transferTimestamp is unique for the event and its duplicate. But currently, this is the only way as Spark forces me to include the watermark column in the dropDuplicates function when watermark is set.
I would really like to see a dropDuplicates implementation like the one below, which would be valid for any at-least-once stream: I don't have to include the watermark field in dropDuplicates, and the watermark-based state eviction is still honored. But that is not the case currently.
streamDataset
  .withWatermark("transferTimestamp", "4 minutes")
  .dropDuplicates("eventstring");
I can't use the eventtimestamp, as it is not ordered and its time range varies drastically (delayed events and junk events).
If anyone has an alternate solution or ideas for deduping in such a scenario, please let me know.
For your use case, you can't use the dropDuplicates API directly. You have to implement an arbitrary stateful operation instead, using a Spark API like flatMapGroupsWithState.
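Here is a rough, untested sketch of that approach in Java, assuming a hypothetical Event bean (eventString, eventTimestamp, transferTimestamp): key by eventString, emit only the first record seen per key, and use an event-time timeout to evict the key's state once the watermark has moved on, which stands in for the watermark-based eviction of dropDuplicates.

import java.util.Collections;
import org.apache.spark.api.java.function.FlatMapGroupsWithStateFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.streaming.GroupStateTimeout;
import org.apache.spark.sql.streaming.OutputMode;

// Emit the first event per key, then drop everything else until the state times out.
FlatMapGroupsWithStateFunction<String, Event, Boolean, Event> dedup = (key, events, state) -> {
    if (state.hasTimedOut()) {        // watermark moved past the timeout: forget this key
        state.remove();
        return Collections.<Event>emptyIterator();
    }
    if (state.exists()) {             // key already seen within the horizon: duplicates, drop
        return Collections.<Event>emptyIterator();
    }
    state.update(Boolean.TRUE);
    // Keep the key's state for 4 more minutes of event time (watermark is on transferTimestamp).
    state.setTimeoutTimestamp(state.getCurrentWatermarkMs() + 4 * 60 * 1000);
    return events.hasNext()
            ? Collections.singletonList(events.next()).iterator()
            : Collections.<Event>emptyIterator();
};

Dataset<Event> deduped = streamDataset
    .withWatermark("transferTimestamp", "4 minutes")
    .as(Encoders.bean(Event.class))
    .groupByKey((MapFunction<Event, String>) Event::getEventString, Encoders.STRING())
    .flatMapGroupsWithState(
        dedup,
        OutputMode.Append(),
        Encoders.BOOLEAN(),
        Encoders.bean(Event.class),
        GroupStateTimeout.EventTimeTimeout());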