Comparing records from a DStream - Scala

I have a use case with Apache Spark where I'd like to analyse sensor streams. The stream consists of sensor data from a variety of sensors, but all of them push the same kind of data.
From this stream I'd like to know, for each sensor, how long a specific value stays below a certain threshold. A sensor submits a record every x seconds containing a timestamp and a value. I'd like to extract the intervals during which a sensor is below the threshold, to get the duration, start time, end time, and average value of each interval.
I'm not sure about the proper ('Sparkish') way to extract the duration, start time, and end time of every interval from all connected sensors.
The approach I currently use is a foreach loop with some state variables that uniquely mark each record if it is part of an interval from a specific sensor. Once the records have been marked, a map-reduce approach is used to extract the needed info. But I don't feel comfortable with the foreach loop because it does not fit the map-reduce model and therefore does not scale well when work is distributed among workers.
In more general terms, I'm facing the challenge of comparing individual records within an RDD and records from different DStreams.
Does anyone recognise such a (trivial) case and know a better/more elegant approach to tackle it?

I found out that the best way to do this is by using mapWithState(). This function offers an elegant and flexible way to maintain state across values from successive DStreams.
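For completeness, a minimal sketch of what that mapWithState() approach can look like. The record and interval types, the threshold, and sensorStream are all hypothetical; it assumes a DStream[(String, Reading)] keyed by sensor id:

```scala
import org.apache.spark.streaming.{State, StateSpec}

// Hypothetical record and result types.
case class Reading(timestamp: Long, value: Double)
case class OpenInterval(start: Long, sum: Double, count: Long) // state kept per sensor
case class ClosedInterval(sensor: String, start: Long, end: Long, avg: Double)

val threshold = 10.0 // assumed threshold

// Invoked once per (sensor, reading). While the value is below the
// threshold an OpenInterval accumulates in state; when the value rises
// back above it, the finished interval is emitted and the state cleared.
def trackIntervals(sensor: String,
                   reading: Option[Reading],
                   state: State[OpenInterval]): Option[ClosedInterval] =
  reading match {
    case Some(r) if r.value < threshold =>
      val cur = state.getOption().getOrElse(OpenInterval(r.timestamp, 0.0, 0L))
      state.update(OpenInterval(cur.start, cur.sum + r.value, cur.count + 1))
      None
    case Some(r) if state.exists() =>
      val iv = state.get()
      state.remove()
      Some(ClosedInterval(sensor, iv.start, r.timestamp, iv.sum / iv.count))
    case _ => None
  }

// sensorStream: DStream[(String, Reading)]
val intervals = sensorStream
  .mapWithState(StateSpec.function(trackIntervals _))
  .flatMap(_.toList) // keep only the closed intervals
```

If a sensor can go silent mid-interval, a timeout on the StateSpec (StateSpec.function(...).timeout(...)) can be used to flush open intervals after a period of inactivity.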

Related

How to construct a two level partition in kdb?

Is it possible to have a two-level partition scheme in kdb+? For example, one level based on trading dates and the other on stock symbols. Thanks!
If you use the standard partitioned-by-date-and-`p#-attribute-on-sym approach, then you effectively do have a two-level partition. It is very efficient to query/filter by date:
select from table where date=x
and also generally still very efficient to query/filter by sym
select from table where date within (x;y), sym=z
Particularly if you run secondary threads and have good disk performance.
So you should get good performance when querying a sym across all history, and also when querying all syms for a given date. This is also the simplest way to write daily data.
I don't believe there's any real way to have a true two-level partition without writing the data twice in two different ways.
If your queries span a large range of symbols, or involve extensive queries on each one that would benefit from parallel computation, you can segment by first-letter ranges: A-M, N-Z, or finer grained than that.
As long as you maintain the `p# attribute within and across segments (so that e.g. IBM only appears in one place), you can throw threads and disks at the problem.

When to use windowing against grouping by event time?

Kind of related, as allowing infinitely late data would also solve my problem.
I have a pipeline linked to a Pub/Sub topic, from which data can arrive very late (by days) as well as in near real time. I have to perform simple parsing and aggregation tasks on it, based on event time.
As each element holds a timestamp of the event time, my first idea when looking at the programming guide was to use windowing by setting the timestamp on event time.
However, as the processing can happen a long time after the event, I would have to allow very late data, which, based on the answer in the first link I posted, seems not to be such a good idea.
An alternative solution I have come up with would be to simply group the data based on event time, without marking it as a timestamp, and then calculate my aggregates based on that key. As the data for a given event time tends to arrive at around the same time in the pipeline, I could then just use windows of arbitrary length based on processing time, so that my data is ingested as soon as it is available.
This got me wondering: does that mean that windowing is useless as soon as your data comes a bit later than real time? Would it be possible to start a window when the first element is received and close it a bit afterwards?

Prometheus: aggregate table data from an offset, i.e. pull historical data from 2 weeks ago to the present

So I am constructing a table within Grafana, with Prometheus as the data source. Right now my queries are set to instant, so the table shows scrape data from the instant the query is made (in my case, data from the past 2 days).
However, I want to see data from the past 14 days. I know that you can adjust the time shift in Grafana, as well as use the offset <timerange> modifier to shift the moment the query is run, but these only adjust the query execution point.
Using a range vector such as go_info[10h] does indeed go back that far, but the scrapes are done at 15s intervals and as such produce duplicate data, in addition to producing results for a query done at that instant (and not at an offset time point), which I don't want.
I am wondering if there's a way to gather data from two weeks ago until today, essentially aggregating data from multiple offset time points.
I've tried writing multiple queries on my table to perform this, e.g.:
go_info offset 2d
go_info offset 3d
and so on..
However, this doesn't seem very efficient, and the values from each query end up in different columns (a problem I could probably alleviate by altering the query, but that doesn't solve the issue of query complexity).
Is there a more efficient, simpler way to do this? I understand that the latest version of Prometheus offers subqueries as a feature, but I am currently not able to upgrade Prometheus (at least not easily, given the way it's currently set up) and am also not sure it would solve my problem. If it is indeed the answer to my question, it'll be worth the upgrade; I just haven't had an environment to test it in.
Thanks for any help anyone can provide :)
Figured it out; it's not pretty, but I had to use offset <#>d for each query in a single metric.
e.g.:
something_metric offset 1d
something_metric offset 2d
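For reference, the subquery feature mentioned above (available since Prometheus 2.7) would cover the same 14-day span in a single expression; the metric name here is just the question's placeholder:

```promql
# one aggregated sample per day over the last 14 days
avg_over_time(something_metric[14d:1d])
```

The inner [14d:1d] range-with-resolution evaluates something_metric once per day over the trailing 14 days, so one query replaces the fourteen offset variants.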

How to ensure Spark does not read the same data twice from Cassandra

I'm learning Spark and Cassandra. My problem is as follows: I have a Cassandra table which records row data from a sensor.
CREATE TABLE statistics.sensor_row (
    name text,
    date timestamp,
    value int,
    PRIMARY KEY (name, date)
)
Now I want to aggregate these rows through a Spark batch job (i.e. daily).
So I could write
val rdd = sc.cassandraTable("statistics","sensor_row")
// and do map and reduce to get what I want, and perhaps write back to an aggregated table
But my problem is that I will be running this code periodically, and I need to make sure I don't read the same data twice.
One thing I can do is delete the rows I have read, which looks pretty ugly; the alternative is to use a filter:
sensorRowRDD.where("date >'2016-02-05 07:32:23+0000'")
The second one looks much nicer, but then I need to record when the job was last run and continue from there. However, because of the DataStax driver's data locality, each worker loads data only from its local Cassandra node, which means that instead of tracking a global date I would need to track a date for each Cassandra/Spark node. That still does not look very elegant.
Are there any better ways of doing this?
The DataFrame filters will be pushed down to Cassandra, so this is an efficient solution to the problem. But you are right to worry about the consistency issue.
One solution is to set not just a start date, but an end date also. When your job starts, it looks at the clock. It is 2016-02-05 12:00. Perhaps you have a few minutes delay in collecting late-arriving data, and the clocks are not absolutely precise either. You decide to use 10 minutes of delay and set your end time to 2016-02-05 11:50. You record this in a file/database. The end time of the previous run was 2016-02-04 11:48. So your filter is date > '2016-02-04 11:48' and date < '2016-02-05 11:50'.
Because the date ranges cover all time, you will only miss events that have been saved into a past range after the range has been processed. You can increase the delay from 10 minutes if this happens too often.
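A sketch of that bookkeeping, using the table from the question (readLastEndTime/saveEndTime are placeholders for whatever file or table you persist the watermark in, and the exact bound types may need adapting to your connector version):

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit
import com.datastax.spark.connector._ // provides sc.cassandraTable

// Placeholders: persist the end time of the previous run somewhere durable.
def readLastEndTime(): Instant = ???
def saveEndTime(t: Instant): Unit = ???

val slackMinutes = 10L // allowance for late data and clock skew
val start = readLastEndTime()
val end   = Instant.now().minus(slackMinutes, ChronoUnit.MINUTES)

// Both bounds are pushed down to Cassandra by the connector,
// so each run reads only its own half-open slice (start, end].
val batch = sc.cassandraTable("statistics", "sensor_row")
  .where("date > ? AND date <= ?", start, end)

// ... run the daily aggregation over `batch`, write the results,
// and only advance the watermark once that has succeeded ...
saveEndTime(end)
```

Advancing the watermark only after the aggregation has been written means a failed run simply reprocesses the same slice, rather than skipping it.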

Realtime querying/aggregating millions of records - Hadoop? HBase? Cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
- all datasets have the same keys
- 1 million keys (this may later be 10 or 20 million)
dataset columns:
- each dataset has the same columns
- 10 to 20 columns
- most columns are numerical values that we need to aggregate on (avg, stddev, and use R to calculate statistics)
- a few columns are "type_id" columns, since a particular query may want to include only certain type_ids
web application:
- users can choose which datasets they are interested in (anywhere from 15 to 1000)
- the application needs to present: the key, and the aggregated results (avg, stddev) of each column
updates of data:
- an entire dataset can be added, dropped, or replaced/updated
- it would be cool to be able to add columns, but if required we can just replace the entire dataset
- rows/keys are never added to a dataset, so we don't need a system with lots of fast writes
infrastructure:
- currently two machines with 24 cores each
- eventually, we want the ability to also run this on Amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently I have this data in a Postgres database, where each dataset is in its own partition.
- partitions are nice, since you can easily add/drop/replace them
- the database is nice for filtering based on type_id
- databases aren't easy for writing parallel queries
- databases are good for structured data, and my data is not structured
As a proof of concept I tried out Hadoop:
- created a tab-separated file per dataset for a particular type_id
- uploaded the files to HDFS
- map: retrieved a value/column for each key
- reduce: computed the average and standard deviation
From my crude proof of concept, I can see this will scale nicely, but I can also see that Hadoop/HDFS has latency; I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users within 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for it. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive and Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.
Check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2] - (E[X])^2)
This implies that you can get the stddev of the concatenated dataset AB by computing sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|).
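Packaged as code, that identity makes the statistic mergeable, which is exactly what allows precomputing per-dataset summaries and combining them at query time. A standalone sketch (the names are illustrative, not from any particular library):

```scala
// Mergeable summary: count, sum, and sum of squares are all additive,
// so summaries of two datasets combine into the summary of their union.
case class Stats(n: Long, sum: Double, sumSq: Double) {
  def merge(o: Stats): Stats = Stats(n + o.n, sum + o.sum, sumSq + o.sumSq)
  def mean: Double   = sum / n
  def stddev: Double = math.sqrt(sumSq / n - mean * mean) // population stddev
}

def summarize(xs: Seq[Double]): Stats =
  Stats(xs.size, xs.sum, xs.map(x => x * x).sum)

// The stddev of the concatenation A ++ B falls out of the merged summary:
// summarize(a).merge(summarize(b)).stddev == summarize(a ++ b).stddev
```

With summaries like this stored per dataset, aggregating "15 to 1000 chosen datasets" is a fold over tiny records rather than a scan over raw rows.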
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is a serious problem without an immediate good solution in the open source space. In the commercial space, MPP databases like Greenplum/Netezza should do.
Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory; that should give you an idea of what size of cluster you need.
If I understand you correctly, and you only need to aggregate on single columns at a time, you can store your data differently for better results. In HBase that would look something like:
- a table per data column in today's setup, and another single table for the filtering fields (type_ids)
- a row for each key in today's setup - you may want to think about how to incorporate your filter fields into the key for efficient filtering, otherwise you'd have to do a two-phase read
- a column for each dataset in today's setup (i.e. a few thousand columns)
HBase doesn't mind if you add new columns, and it is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant values, from which you can compute averages etc. quite easily.
You might want to use a plain old database for this. It doesn't sound like you have a transactional system, so you can probably use just one or two large tables. SQL has problems when you need to join over large data, but since your data set doesn't sound like it needs joins, you should be fine. You can set up indexes to find the data sets and then do the math either in SQL or in the application.