Strategies for near real-time analytics - real-time

I have few different "dimensions", mainly hierarchical I'd like to run some metrics saved at individual levels in hierarchy. As you're navigating through this hierarchical structure the metrics update depending on your current selection.
In traditional OLAP systems you'd have some dimension tables with a fact table and you do an ETL to get data into your data warehouse and run the queries against that. I'd like to do this near real-time. Doing this real-time means ETLs have to be run in near real-time basis (probably data points cached in memory).
If the hierarchical structure is (X -> Y -> Z), then if I have 5 Xs, 2 Ys and 5 Zs as dimension then I need to run (5 * 2 * 5) => 50 queries to get fact table populated? If this hierarchy grows bigger then I can easily get into running millions of queries. I'm not sure if I'm thinking about this problem correctly, it would greatly help if someone with real time data analytics experience can share their experience.

Do not understand why you'd need millions of queries to update your facts data. A simple 'incremental-id' per table could say to the OLAP system which rows needs to be loaded each time the data needs to be refreshed.
Perhaps you can have a look to icCube that allows for near real-time analytics; as the cubes do not cache any aggregation, each time new data has been loaded, it is ready for query.

Related

Redshift Performance of Flat Tables Vs Dimension and Facts

I am trying to create dimensional model on a flat OLTP tables (not in 3NF).
There are people who are thinking dimensional model table is not required because most of the data for the report present single table. But that table contains more than what we need like 300 columns. Should I still separate flat table into dimensions and facts or just use the flat tables directly in the reports.
You've asked a generic question about database modelling for data warehouses, which is going to get you generic answers that may not apply to the database platform you're working with - if you want answers that you're going to be able to use then I'd suggest being more specific.
The question tags indicate you're using Amazon Redshift, and the answer for that database is different from traditional relational databases like SQL Server and Oracle.
Firstly you need to understand how Redshift differs from regular relational databases:
1) It is a Massively Parallel Processing (MPP) system, which consists of one or more nodes that the data is distributed across and each node typically does a portion of the work required to answer each query. There for the way data is distributed across the nodes becomes important, the aim is usually to have the data distributed in a fairly even manner so that each node does about equal amounts of work for each query.
2) Data is stored in a columnar format. This is completely different from the row-based format of SQL Server or Oracle. In a columnar database data is stored in a way that makes large aggregation type queries much more efficient. This type of storage partially negates the reason for dimension tables, because storing repeating data (attibutes) in rows is relatively efficient.
Redshift tables are typically distributed across the nodes using the values of one column (the distribution key). Alternatively they can be randomly but evenly distributed or Redshift can make a full copy of the data on each node (typically only done with very small tables).
So when deciding whether to create dimensions you need to think about whether this is actually going to bring much benefit. If there are columns in the data that regularly get updated then it will be better to put those in another, smaller table rather than update one large table. However if the data is largely append-only (unchanging) then there's no benefit in creating dimensions. Queries grouping and aggregating the data will be efficient over a single table.
JOINs can become very expensive on Redshift unless both tables are distributed on the same value (e.g. a user id) - if they aren't Redshift will have to physically copy data around the nodes to be able to run the query. So if you have to have dimensions, then you'll want to distribute the largest dimension table on the same key as the fact table (remembering that each table can only be distributed on one column), then any other dimensions may need to be distributed as ALL (copied to every node).
My advice would be to stick with a single table unless you have a pressing need to create dimensions (e.g. if there are columns being frequently updated).
When creating tables purely for reporting purposes (as is typical in a Data Warehouse), it is customary to create wide, flat tables with non-normalized data because:
It is easier to query
It avoids JOINs that can be confusing and error-prone for causal users
Queries run faster (especially for Data Warehouse systems that use columnar data storage)
This data format is great for reporting, but is not suitable for normal data storage for applications — a database being used for OLTP should use normalized tables.
Do not be worried about having a large number of columns — this is quite normal for a Data Warehouse. However, 300 columns does sound rather large and suggests that they aren't necessarily being used wisely. So, you might want to check whether they are required.
A great example of many columns is to have flags that make it easy to write WHERE clauses, such as WHERE customer_is_active rather than having to join to another table and figuring out whether they have used the service in the past 30 days. These columns would need to be recalculated daily, but are very convenient for querying data.
Bottom line: You should put ease of use above performance when using Data Warehousing. Then, figure out how to optimize access by using a Data Warehousing system such as Amazon Redshift that is designed to handle this type of data very efficiently.

Apache Drill for a Homogeneous Data Store

Just beginning to explore apache drill for as a data engine for a reporting app.
We're a PostGres shop as our transactional data is all in RDBMS.
Moving to any NoSQL (MongoDB) is a distant dream for us and there's no pressing need for us to spend money on that as of today.
Our data size is big (but still all in PostGres). We have a few tables spanning upto a few lower hundreds of millions (say 150M).
Performance is a key for us. We want our reports to be generated as fast as possible to the end user real time.
I have a basic question here for my use case:
If the time-cost of a native (direct) postgres query is say: P
By going through drill, I would imagine the cost is going to be: P + D, where D is the extra cost of Drill?
At the end of the day, if Postgres proves to be a bottleneck (say missing indices etc), then Drill can't help in making the situation better right, no matter how many ever Drill bits I horizontally add?
So, in what way using Drill for my use case help than optimise PostGres and querying it directly?
Apache Drill is usually being used to consolidate access and being able to join over different database systems, e.g. a PostgreSQL and a MongoDB.
Here my first question would be why change a working and proven database system which is in the newer versions is fully capable of handling JSON data? What is the main success factor which is being seen which opens the wish to move to MongoDB?
If you have only one database system, I'd concentrate in getting the most performance out of that. If using Apache Drill to consolidate different systems, you'd have to remember a few facts designing the drill layer:
You need Zookeeper nodes for Drill if you setup several drillbits
You need a few drillbit servers which do have compute power and big memory
You need to make sure to understand how Drill uses the underlying databases when queries are being sent: Drill tries to use the most power of the database systems to minimize any processing it needs to do (e.g. joins, like statemens happen in the database system). Because of that the underlying database infrastructure has to be powerful

olap and oltp queries in gremlin

In gremlin,
s = graph.traversal()
g = graph.traversal(computer())
i know the first one is for OLTP and second for OLAP. I know the difference between OLAP and OLTP at definition level.I have the following queries on this:
How does
the above queries differ in working?
Can I use second one ,using'g'
in queries in my application to get results(I know this 'g' one
gives gives results faster than first one )?
Difference between OLAP and OLTP with example ?
Thanks in advance.
From the user's perspective, in terms of results, there's no real difference between OLAP and OLTP. The Gremlin statements are the same save for configuration of the TraversalSource as you have shown with your use of withComputer() and other settings.
The difference is more in how the traversal is executed behind the scenes. OLAP-based traversals are meant to process the "entire graph" (i.e. all vertices/edges and perhaps more than once). Where OLTP based traversals are meant to process smaller bodies of data, typically starting with one or a handful of vertices and traversing from there. When you consider graphs in the scale of "billions of edges", it's easy to understand why an efficient mechanism like OLAP is needed to process such graphs.
You really shouldn't think of OLTP vs OLAP as "faster" vs "slower". It's probably better to think of it as it is described in the documentation:
OLTP: real-time, limited data accessed, random data access,
sequential processing, querying
OLAP: long running, entire data set
accessed, sequential data access, parallel processing, batch
processing
There's no reason why you can't use an OLAP traversal in your applications so long as your application is aware of the requirements of that traversal. If you have some SLA that says that REST requests must complete in under 0.5 seconds and you decide to use an OLAP traversal to get the answer, you will undoubtedly break your SLA. Assuming you execute the OLAP traversal job over Spark, it will take Spark 10-15 seconds just to get organized to run your job.
I'm not sure how to provide an example of OLAP and OLTP, except to talk about the use cases a little bit more, so it should be clear as to when to use one as opposed to the other. In any case, let's assume you have a graph with 10 billion edges. You would want your OLTP traversals to always start with some form of index lookup - like a traversal that shows the average age of the friends of the user "stephenm":
g.V().has('username','stephenm').out('knows').values('age').mean()
but what if I want to know the average age of every user in my database? In this case I don't have any index I can use to lookup a "small set of starting vertices" - I have to process all the many millions/billions of vertices in my graph. This is a perfect use case for OLAP:
g.V().hasLabel('user').values('age').mean()
OLAP is also great for understanding growth of your graph and for maintaining your graph. With billions of edges and a high data ingestion rate, not knowing that your graph is growing improperly is a death sentence. It's good to use OLAP to grab global statistics over all the data in the graph:
g.E().label().groupCount()
g.V().label().groupCount()
In the above examples, you get an edge/vertex label distribution. If you have an idea as to how your graph is growing, this can be a good indicator of whether or not your data ingestion process is working properly. On a billion edge graph, trying to execute even one of the traversals would take "forever" if it ever finished at all without error.

Performance improvement for fetching records from a Table of 10 million records in Postgres DB

I have a analytic table that contains 10 million records and for producing charts i have to fetch records from analytic table. several other tables are also joined to this table and data is fetched currently But it takes around 10 minutes even though i have indexed the joined column and i have used Materialized views in Postgres.But still performance is very low it takes 5 mins for executing the select query from Materialized view.
Please suggest me some technique to get the result within 5sec. I dont want to change the DB storage structure as so much of code changes has to be done to support it. I would like to know if there is some in built methods for query speed improvement.
Thanks in Advance
In general you can take care of this issue by creating a better data structure(Most engines do this to an extent for you with keys).
But if you were to create a sorting column of sorts. and create a tree like structure then you'd be left to a search rate of (N(log[N]) rather then what you may be facing right now. This will ensure you always have a huge speed up in your searches.
This is in regards to binary tree's, Red-Black trees and so on.
Another implementation for a speedup may be to make use of something allong the lines of REDIS, ie - a nice database caching layer.
For analytical reasons in the past I have also chosen to make use of technologies related to hadoop. Though this may be a larger migration in your case at this point.

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns are numerical values for which we need to aggregate on (avg, stddev, and use R to calculate statistics)
a few columns are "type_id" columns, since in a particular query we may
want to only include certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since can easily add/drop/replace partitions
database is nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof-of-concept, I can see this will scale nicely, but I can see hadoop/hdfs has latency I've read that that it's generally not used for real time querying (even though I'm ok with returning results back to users in 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavy weight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2]- (E[X])^2)
this implies that you can get the stddev of AB by doing
sqrt(E[AB^2]-(E[AB])^2). E[AB^2] is (sum(A^2) + sum(B^2))/(|A|+|B|)
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is serious problem without immidiate good solution in the open source space. In commercial space MPP databases like greenplum/netezza should do.
Ideally you would need google's Dremel (engine behind BigQuery). We are developing open source clone, but it will take some time...
Regardless of the engine used I think solution should include holding the whole dataset in memory - it should give an idea what size of cluster you need.
If I understand you correctly and you only need to aggregate on single columns at a time
You can store your data differently for better results
in HBase that would look something like
table per data column in today's setup and another single table for the filtering fields (type_ids)
row for each key in today's setup - you may want to think how to incorporate your filter fields into the key for efficient filtering - otherwise you'd have to do a two phase read (
column for each table in today's setup (i.e. few thousands of columns)
HBase doesn't mind if you add new columns and is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant value which you can do avg. etc. quite easily
You might want to use a plain old database for this. It doesn't sound like you have a transactional system. As a result you can probably use just one or two large tables. SQL has problems when you need to join over large data. But since your data set doesn't sound like you need to join, you should be fine. You can have the indexes setup to find the data set and the either do in SQL or in app math.