Complex Event Processing-Esper - complex-event-processing

I wonder if there is any information (e.g. a diagram) on how the components of the Esper source code collaborate to produce the query results. For example, when a select query is applied and the data is stored in an array, where does that happen in Esper's source code?

The data structures depend on the query. Let's say you have "select * from MyEvent": in that case there is no data structure that anything gets stored in. But if you instead have "select * from MyEvent.win:time(1 min)", there is a 1-minute window of events that one can iterate over using the iterator API, and the engine does keep 1 minute of events in a data structure. For the time window the data structure is probably closer to a list. There are many different queries possible, with all sorts of data windows, patterns, subqueries and more. These do not all use one data structure but different ones.
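To make the "closer to a list" idea concrete, here is a minimal sketch in Python of a time window that keeps only recent events. This is not Esper's implementation; the class name `TimeWindow` and its methods are made up for illustration, and timestamps are plain numbers in seconds.

```python
from collections import deque


class TimeWindow:
    """Hypothetical sketch: keeps only events from the last `span_seconds`."""

    def __init__(self, span_seconds):
        self.span_seconds = span_seconds
        self._events = deque()  # (timestamp, event) pairs, oldest first

    def add(self, timestamp, event):
        # Append the new event, then drop everything that has aged out.
        self._events.append((timestamp, event))
        self._expire(timestamp)

    def _expire(self, now):
        # Events are ordered by arrival, so expiry only touches the front.
        while self._events and self._events[0][0] <= now - self.span_seconds:
            self._events.popleft()

    def __iter__(self):
        # Iterate over the events currently held by the window.
        return (event for _, event in self._events)


window = TimeWindow(span_seconds=60)
window.add(0, "a")
window.add(30, "b")
window.add(61, "c")  # "a" is now more than 60 seconds old and is dropped
print(list(window))  # ['b', 'c']
```

A query without a data window would simply have no such structure behind it, which is the distinction made above.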

Related

Redshift : loading & storing JSON (IoT) data

My source is JSON with nested arrays & structures. (examples at end of post)
Large volume of new data streaming real-time (20m/day)
I have to decide how to store this data, considering.....
-- End users want to use 'traditional' SQL
-- Performance (ingestion & query)
-- Load on Cluster
As far as I can see, my options, are to make use of the SUPER data type, or just convert everything to traditional relational tables & types.
(Even if I store the full JSON as a super, I still have to serialize critical attributes into regular columns for the purposes of Distribution/Sort.)
Regardless, been trying to weigh up the pros & cons of super vs. 'traditional'.
(1) Store full JSON as SUPER type
-- Very easy to ingest data with low load on cluster
-- Maybe an additional load on cluster & performance impact to execute end user queries?
-- End users would have to learn PartiQL and deal with unnesting & serialization etc
(2) 'Traditional' Relational Tables & Types
(a) Load as super, but then use PartiQL to unnest, serialize and store in relational tables
-- Additional continuous load on cluster
-- Easy to implement (insert into)
-- Would result in some massive tables for the 'tag' nodes
(b) Use lambda to pre-unnest & serialize json, insert/copy directly into relational tables
-- Lambda would be invoked continuously
(3) Redshift Spectrum
If I am converting to relational structure (2b), could simply store in S3 and utilize Redshift Spectrum to query
-- No load on cluster to ingest/process incoming data
-- Cost/maintenance of lambda/other process to transform JSON
Questions:
Are the above understandings correct ?
Any other considerations not listed ?
Is there a 'standard' for this scenario ?
Any other guidance/wisdom welcome !
Background info
Schema:
events:array[ struct{
channels:array[
struct{
tags:array[struct{}]
}
tags:array[struct{}]
]
}
]
Schema Exploded view:
You will not want to leave things as a monolithic json. Any data that will be repeatedly queried in analytics will want to be its own column. The database work to expand the json at ingestion will be dwarfed by the work to repeatedly expand it for every query.
Any data that will be commonly used in a where clause, group by, partition, join condition etc. will likely need to be its own column. I'd expect that any data common to 90% of the json elements should go in its own columns. Json elements that are rare, unique, or of little analytic interest can be kept in super columns that hold just those subset parts of the json.
The data size increase will be less than you think; Redshift is good at compressing columns. The ingestion load is unlikely to be a major concern, but if it is, the Lambda approach is a reasonable way to extend the compute resources to address it. I really don't expect this, but if needed it can be folded easily into the existing ETL processes.
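As a rough illustration of what expanding the json at ingestion could look like (for example inside a Lambda), here is a small Python sketch. It assumes the schema from the question (events containing channels and tags, channels containing tags); the function name `flatten` and the `event_idx` / `channel_idx` key columns are invented for this example, and real code would use the actual identifiers present in the data.

```python
def flatten(payload):
    """Split one nested payload into flat rows, one list per target table."""
    events, channels, tags = [], [], []
    for e_idx, event in enumerate(payload.get("events", [])):
        events.append({"event_idx": e_idx})
        # Tags attached directly to the event have no channel.
        for tag in event.get("tags", []):
            tags.append({"event_idx": e_idx, "channel_idx": None, **tag})
        for c_idx, channel in enumerate(event.get("channels", [])):
            channels.append({"event_idx": e_idx, "channel_idx": c_idx})
            # Tags attached to a channel carry both parent keys.
            for tag in channel.get("tags", []):
                tags.append({"event_idx": e_idx, "channel_idx": c_idx, **tag})
    return {"events": events, "channels": channels, "tags": tags}


# Hypothetical sample payload; the tag fields "name" and "value" are assumptions.
payload = {
    "events": [
        {
            "tags": [{"name": "site", "value": "A"}],
            "channels": [
                {"tags": [{"name": "temp", "value": 21.5}]},
                {"tags": []},
            ],
        }
    ]
}

rows = flatten(payload)
print(len(rows["events"]), len(rows["channels"]), len(rows["tags"]))  # 1 2 2
```

Each list maps to one relational table, so the commonly queried attributes end up in ordinary columns, and whatever is left over could still be stored as a smaller super value.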
A hazard you will face is that users will only reference the json and not the extracted columns. Re-extracting the same data repeatedly costs. I'd consider NOT keeping the entire json in the main fact tables, only json pieces that represent the data not otherwise in columns. Keeping the original jsons in a separate table keyed with an identity column will allow joining if some need arises but the goal will be to not need to do this.
Spectrum does not look like a good fit for this use case. Spectrum does well when the compute elements in S3 can apply the first level where clauses and simple aggregations. I just don't see Spectrum working on this data so it will just send the entire json data to Redshift repeatedly. This will make things slow and tie up a ton of network bandwidth. Now storing the full original json with identity column in S3 and having all the expanded columns plus left-over json elements in native Redshift table does make sense. This way if some need to reference the full json arises it is just an external table reference away.

Pagination Options in KDB

I am looking to support a use case that returns kdb datasets back to users. The users connect to kdb using the Java API, run the query synchronously and retrieve the results.
However, issues are coming up when returning larger datasets and therefore I would like to return the data from kdb to the java process in pages/slices. Unfortunately users need to be able to run queries that return millions of rows and it would be easier to handle if they were passed back in slices of say 100,000 rows (Cassandra and other DBs do this sort of thing).
The potential approaches I have come up with are as follows:
Run the "where" part of the query on the database and return only the indices/date partitions (if applicable) of the data required. The java process would then use these indices to select the data required slice by slice . This approach would control memory usage on the kdb side as it would not have to load all HDB data required at once. However, overall this would increase the run time of the query as data would have to be searched/queried multiple times. This could work well for simple selects but complicated queries may need to go through an "onboarding" process which I want to avoid.
Store results of the query in a global variable in kdb which the java process can then query slice by slice. This simpler method could support any query but could potentially hit limits on the kdb side (memory/timeout) if too large a dataset is queried.
Other points to consider:
It should support users running queries on any type of process - gateway, hdb, rdb etc
It should support more than just simple selects e.g.
((1!select sym, price from trade where sym=`AAA) uj
1!select sym,price from order where sym=`AAA)
lj select avgBid:avg bid by sym from quote where sym=`AAA
The paging functionality should be removed from the end user
Does anyone have any views on whether there are any options available other than the ones listed above? Essentially I am looking for a select[m n] type approach that supports any query.
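For reference, the second approach (store the result once, then hand it out slice by slice) can be sketched in a language-neutral way. The Python below is only an illustration of the control flow, not kdb or Java API code; `ResultCache`, `fetch_all` and the handle scheme are all hypothetical names.

```python
import uuid


class ResultCache:
    """Hypothetical server-side store of finished query results."""

    def __init__(self):
        self._results = {}

    def store(self, rows):
        # Run the query once, keep the rows, and return a handle to the caller.
        handle = uuid.uuid4().hex
        self._results[handle] = rows
        return handle

    def page(self, handle, offset, size):
        # Return one slice of the stored result.
        return self._results[handle][offset:offset + size]

    def release(self, handle):
        # Free the memory once the client is done.
        self._results.pop(handle, None)


def fetch_all(cache, handle, page_size):
    """Client-side loop that hides the paging from the end user."""
    offset = 0
    try:
        while True:
            rows = cache.page(handle, offset, page_size)
            if not rows:
                break
            yield rows
            offset += len(rows)
    finally:
        cache.release(handle)


cache = ResultCache()
handle = cache.store(list(range(250)))  # stands in for a large query result
sizes = [len(page) for page in fetch_all(cache, handle, 100)]
print(sizes)  # [100, 100, 50]
```

The `release` step matters here: the memory limit mentioned above is hit sooner if stored results are never freed, so some expiry or explicit cleanup would be needed in any real version.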

OLAP Approach for Backend redshift connection

We have a system where we do some aggregations in Redshift based on some conditions. We aggregate this data with complex joins which usually takes about 10-15 minutes to complete. We then show this aggregated data on Tableau to generate our reports.
Lately, we are getting many change requests to add a new dimension (which usually requires a join with a new table) or to get data for some more specific filter. To handle these requests we have to change our queries every time for each of our subprocesses.
I have read a little about OLAP. I just want to know if it would be better for our use case, or whether there is a better way to design our system to handle such ad hoc requests without requiring a developer to change things every time.
Thanks for the suggestions in advance.
It would work, rather it should work. Efficiency is the key here. There are few things which you need to strictly monitor to make sure your system (Redshift + Tableau) remains up and running.
Prefer Extract over Live Connection (in Tableau)
A live connection would query the system every time someone changes the filter or refreshes the report. Since you said the dataset is large and the queries are complex, prefer creating an extract. This will make sure data is available upfront whenever someone accesses your dashboard. Do not forget to schedule the extract refresh, otherwise the data will be stale forever.
Write efficient queries
OLAP systems are expected to query a large dataset. Make sure you write efficient queries. It's always better to first get a small dataset and join them rather than bringing everything in the memory and then joining / using where clause to filter the result.
A query like (select foo from table1 where ... )a left join (select bar from table2 where) might be the key at times where you only take out small and relevant data and then join.
Do not query infinite data.
Since this is analytical and not transactional data, have an upper bound on the data that Tableau will refresh. Historical data has an importance, but not from the time of inception of your product. Analysing the data for the past 3, 6 or 9 months can be the key rather than querying the universal dataset.
Create aggregates and let Tableau query that table, not the raw tables
Suppose you're analysing user traits. Rather than querying a raw table that captures 100 records per user per day, design a table which has just one (or two) entries per user per day and introduce a column - count - which tells you the number of times the event has been triggered. By doing this, you'll be querying a significantly smaller dataset that is logically equivalent to what you were querying earlier.
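A tiny Python sketch of that rollup, just to show the shape of the aggregate table. The column names and sample rows are made up, and this version keeps one row per user, day and event name, which is slightly finer than one row per user per day.

```python
from collections import Counter

# Hypothetical raw rows: (user_id, day, event_name), one row per occurrence.
raw_events = [
    ("u1", "2024-01-01", "login"),
    ("u1", "2024-01-01", "login"),
    ("u1", "2024-01-01", "search"),
    ("u2", "2024-01-01", "login"),
    ("u1", "2024-01-02", "login"),
]


def rollup(events):
    """Collapse repeated occurrences into one row with a count column."""
    counts = Counter(events)
    return [
        {"user_id": user, "day": day, "event": name, "count": n}
        for (user, day, name), n in sorted(counts.items())
    ]


daily = rollup(raw_events)
for row in daily:
    print(row)
```

Summing the `count` column gives back the number of raw rows, so counts by user or day stay answerable from the smaller table.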
As mentioned by Mr Prashant Momaya,
"While dealing with extracts,your storage requires (size)^2 of space if your dashboard refers to a data of size - **size**"
Be very cautious with whatever design you implement and do not forget to consider the most important factor - scalability
This is a typical problem and we tackled it by writing SQL generators in Python. If the definition of the metric is the same (like count(*)) but you have varying dimensions and filters you can declare it as JSON and write a generator that will produce the SQL. Example with pageviews:
{
metric: "unique pageviews"
,definition: "count(distinct cookie_id)"
,source: "public.pageviews"
,tscol: "timestamp"
,dimensions: [
['day']
,['day','country']
]
}
can be relatively easily translated to 2 scripts - this:
drop table metrics_daily.pageviews;
create table metrics_daily.pageviews as
select
date_trunc('day',"timestamp") as date
,count(distinct cookie_id) as "unique_pageviews"
from public.pageviews
group by 1;
and this:
drop table metrics_daily.pageviews_by_country;
create table metrics_daily.pageviews_by_country as
select
date_trunc('day',"timestamp") as date
,country
,count(distinct cookie_id) as "unique_pageviews"
from public.pageviews
group by 1,2;
The complexity of a generator required to produce such SQL from such a config is quite low, but it increases exponentially as you need to add new joins etc. It's much better to keep your dimensions in the encoded form and just use a single wide table as the aggregation source, or produce views for every join you might need and use them as sources.
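A minimal sketch of such a generator, assuming the config shape shown above. The function name `build_sql`, the `TIME_GRAINS` set and the table naming rule (source table name plus `_by_<dimension>`) are my own choices for illustration, not the original author's code.

```python
CONFIG = {
    "metric": "unique pageviews",
    "definition": "count(distinct cookie_id)",
    "source": "public.pageviews",
    "tscol": "timestamp",
    "dimensions": [["day"], ["day", "country"]],
}

# Assumption: dimensions listed here are derived from the timestamp column.
TIME_GRAINS = {"day"}


def build_sql(config, dimensions, target_schema="metrics_daily"):
    """Return the drop/create script for one dimension combination."""
    base = config["source"].split(".")[-1]
    extra = [d for d in dimensions if d not in TIME_GRAINS]
    table = base + ("_by_" + "_".join(extra) if extra else "")

    columns = []
    for dim in dimensions:
        if dim in TIME_GRAINS:
            columns.append(
                "date_trunc('{}',\"{}\") as date".format(dim, config["tscol"])
            )
        else:
            columns.append(dim)
    alias = config["metric"].replace(" ", "_")
    columns.append('{} as "{}"'.format(config["definition"], alias))

    # Group by every dimension, referenced by position.
    group_by = ",".join(str(i) for i in range(1, len(dimensions) + 1))
    return "\n".join(
        [
            "drop table {}.{};".format(target_schema, table),
            "create table {}.{} as".format(target_schema, table),
            "select",
            "\n,".join(columns),
            "from {}".format(config["source"]),
            "group by {};".format(group_by),
        ]
    )


for dims in CONFIG["dimensions"]:
    print(build_sql(CONFIG, dims))
    print()
```

Joins are deliberately absent here. As soon as a dimension lives in another table, the generator needs to know the join path, which is where the complexity starts to climb.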

One big and wide table or many not so big for statistics data

I'm writing a very simple analytics system for my company. I have about 100 different event types that should be collected for each of tens of projects. We are not interested in cross-project analytic requests, but event types are similar across all projects. I use PostgreSQL as the primary storage for this system. Now I have to decide which architecture is preferable.
The first architecture is one very big table (in terms of row count) per project that contains data for all types of events. It will have about 20 or more columns, many of them nullable. Partitioning may be used to split this table by event type, but the table will still be wide.
The second architecture is a lot of tables (fairly big in terms of row count but not so wide), with one table per event type.
I am going to retrieve analytic data from these tables using different join queries (self joins in the case of the first architecture). Which one is preferable, and what are the pitfalls of each?
UPD. All events have about 10 common attributes. The remaining attributes vary from one event type to another.
In the past, I've had similar situations. With postgres you have a bunch of options.
Depending on how your data is input into the system (all at once/ a little at a time) and the volume of your data per project (hundreds of data points vs millions of data points) and the querying pattern (IE, querying after the data is all in, querying nightly, or reports running constantly throughout), there are many options. One other factor will be IF new project types (with new data point types) are likely to crop up.
First, in your "first architecture" the first question that comes up for me is: Are all the "data points" the same data type (or at least very similar). Are some text and others numeric? Are some numeric and others floats? If so, you're likely to run into issues with rolling up your data without either building a column or a table for every data type.
If all your data is the same datatype, then the first architecture you mentioned might work really well.
The second architecture you mentioned is OK especially if you don't predict having a bunch of new project types coming down the pike anytime soon, otherwise, you'll be constantly modifying the DB, which I prefer to avoid when unnecessary.
A third architecture that you didn't mention is to have a combination of 1 and 2. Basically have 1 table to hold the 10 common attributes and use either 1 or 2 to hold the additional attributes. This would have an advantage, especially if the additional data wasn't that frequently used, or was non-numeric.
Lastly, you could use one of PostgreSQLs "document store" type datatypes. You could store this information in arrays, hstores, or json. Now, this will be fairly inefficient if you're doing a ton of aggregate functions as you might be left calculating the aggregates outside of Pgsql, or at a minimum, running an inefficient query. You could store the 10 common fields in normal fields, and the additional ones as hstore or json.
I didn't ask you, but it'd be nice to know whether each event within a project has more than 1 data point (IE are you logging changes, or just updating data). If your overall table has less than 100,000 rows, it's likely just going to be best to focus on what's easier to maintain and program rather than performance, as small amounts of data are pretty quick regardless of how they're stored.

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns are numerical values for which we need to aggregate on (avg, stddev, and use R to calculate statistics)
a few columns are "type_id" columns, since in a particular query we may
want to only include certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since can easily add/drop/replace partitions
database is nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof-of-concept, I can see this will scale nicely, but I can see hadoop/hdfs has latency, and I've read that it's generally not used for real-time querying (even though I'm ok with returning results back to users in 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavy weight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2] - (E[X])^2)
This implies that you can get the stddev of AB (datasets A and B combined) by doing
sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] is (sum(A^2) + sum(B^2))/(|A|+|B|) and E[AB] is (sum(A) + sum(B))/(|A|+|B|).
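Here is a small Python sketch of that idea: keep three numbers per dataset (count, sum, sum of squares) and combine them at query time. The function names are made up for the example, and it computes the population standard deviation, matching the formula above.

```python
import math


def summarize(values):
    """Per-dataset partial sums: (count, sum of x, sum of x^2)."""
    return (len(values), sum(values), sum(v * v for v in values))


def merge(*summaries):
    """Combine any number of summaries by adding them component-wise."""
    return tuple(sum(parts) for parts in zip(*summaries))


def mean_and_stddev(summary):
    """Population mean and stddev from (count, sum, sum of squares)."""
    n, s, ss = summary
    mean = s / n
    variance = ss / n - mean * mean
    # Guard against tiny negative values caused by floating point rounding.
    return mean, math.sqrt(max(variance, 0.0))


a = [1, 2, 3]
b = [4, 5]
mean, std = mean_and_stddev(merge(summarize(a), summarize(b)))
print(mean, std)  # mean is 3.0, std is sqrt(2)
```

With summaries stored per key and per dataset, a query over 15 to 1000 datasets only has to add up small tuples instead of rescanning raw values. One caveat: the sum-of-squares form can lose precision when the mean is large relative to the spread, so it is worth checking against real data.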
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is a serious problem without an immediate good solution in the open source space. In the commercial space, MPP databases like greenplum/netezza should do.
Ideally you would need google's Dremel (the engine behind BigQuery). We are developing an open source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory - that should give an idea of what size of cluster you need.
If I understand you correctly and you only need to aggregate on single columns at a time
You can store your data differently for better results
in HBase that would look something like
table per data column in today's setup and another single table for the filtering fields (type_ids)
row for each key in today's setup - you may want to think how to incorporate your filter fields into the key for efficient filtering - otherwise you'd have to do a two phase read (
column for each table in today's setup (i.e. few thousands of columns)
HBase doesn't mind if you add new columns and is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant value which you can do avg. etc. quite easily
You might want to use a plain old database for this. It doesn't sound like you have a transactional system. As a result you can probably use just one or two large tables. SQL has problems when you need to join over large data. But since it doesn't sound like you need to join, you should be fine. You can have the indexes set up to find the data set and then do the math either in SQL or in the app.