Schema for analytics table in Postgres - postgresql

We use Postgres for analytics (star schema).
Every few seconds we get reports on ~500 metrics types.
The simplest schema would be:
timestamp metric_type value
78930890 FOO 80.9
78930890 ZOO 20
Our DBA has came up with a suggestion to flatten all reports of the same 5 seconds to:
timestamp metric1 metric2 ... metric500
78930890 90.9 20 ...
Some developers push back on this saying this adds a huge complexity on development (batching data so it is written in one shot) and to maintainability (just looking at the table or adding fields is more complex).
Is the DBA model the standard practice in such systems or only a last resort once the original model is clearly not scalable enough?
EDIT: the end goal is to draw a line chart for the users. So queries will mostly be selecting a few metrics, folding them by hours, and selecting min/max/avg per hour (or any otehr time period).
EDIT: the DBA arguments are:
This is relevant from day 1 (see below) but even if was not this is something the system eventually will need to do and migrating from another schema will be a pain
Reducing the number of rows x500 times will allow more efficient indexes and memory (the table will contain hundreds of millions of rows before this optimization)
When selecting multiple metrics the suggested schema will allow one pass over the data instead of separate query for each metric (or some complex combinations of OR and GroupBY)
EDIT: 500 metrics is an "upper bound" but in practice most of the time only ~40 metrics are reported per 5 seconds (not the same 40 though)

The DBA's suggestion isn't totally unreasonable if the metrics are fairly fixed, and make sense to group together. A couple of problems you'll likely face, though:
Postgres has a limit of between 250 and 1,600 columns (depending on data type)
The table will be hard for developers to work with, especially if you often want to query for only a subset of the attributes
Adding new columns will be slow
Instead, you might want to consider using an HSTORE column:
CREATE TABLE metrics (
timestamp INTEGER,
values HSTORE
)
This will give you some flexibility in storing attributes, and allows for indices. For example, to index just one of the metrics:
CREATE INDEX metrics_metric3 ON metrics ((values->'metric3'))
One drawback of this is that values can only be text stringsā€¦ so if you need to do numeric comparisons, a JSON column might also be worth considering:
CREATE TABLE metrics (
timestamp INTEGER,
values JSON
)
CREATE INDEX metrics_metric3 ON metrics ((values->'metric3'))
The drawback here is that you'll need to use Postgres 9.3, which is still reasonably new.

Related

OLAP Approach for Backend redshift connection

We have a system where we do some aggregations in Redshift based on some conditions. We aggregate this data with complex joins which usually takes about 10-15 minutes to complete. We then show this aggregated data on Tableau to generate our reports.
Lately, we are getting many changes regarding adding a new dimension ( which usually requires join with a new table) or get data on some more specific filter. To entertain these requests we have to change our queries everytime for each of our subprocesses.
I went through OLAP a little bit. I just want to know if it would be better in our use case or is there any better way to design our system to entertain such adhoc requests which does not require developer to change things everytime.
Thanks for the suggestions in advance.
It would work, rather it should work. Efficiency is the key here. There are few things which you need to strictly monitor to make sure your system (Redshift + Tableau) remains up and running.
Prefer Extract over Live Connection (in Tableau)
Live connection would query the system everytime someone changes the filter or refreshes the report. Since you said the dataset is large and queries are complex, prefer creating an extract. This'll make sure data is available upfront whenever someone access your dashboard .Do not forget to schedule the extract refresh, other wise the data will be stale forever.
Write efficient queries
OLAP systems are expected to query a large dataset. Make sure you write efficient queries. It's always better to first get a small dataset and join them rather than bringing everything in the memory and then joining / using where clause to filter the result.
A query like (select foo from table1 where ... )a left join (select bar from table2 where) might be the key at times where you only take out small and relevant data and then join.
Do not query infinite data.
Since this is analytical and not transactional data, have an upper bound on the data that Tableau will refresh. Historical data has an importance, but not from the time of inception of your product. Analysing the data for the past 3, 6 or 9 months can be the key rather than querying the universal dataset.
Create aggregates and let Tableau query that table, not the raw tables
Suppose you're analysing user traits. Rather than querying a raw table that captures 100 records per user per day, design a table which has just one (or two) entries per user per day and introduce a column - count which'll tell you the number of times the event has been triggered. By doing this, you'll be querying sufficiently smaller dataset but will be logically equivalent to what you were doing earlier.
As mentioned by Mr Prashant Momaya,
"While dealing with extracts,your storage requires (size)^2 of space if your dashboard refers to a data of size - **size**"
Be very cautious with whatever design you implement and do not forget to consider the most important factor - scalability
This is a typical problem and we tackled it by writing SQL generators in Python. If the definition of the metric is the same (like count(*)) but you have varying dimensions and filters you can declare it as JSON and write a generator that will produce the SQL. Example with pageviews:
{
metric: "unique pageviews"
,definition: "count(distinct cookie_id)"
,source: "public.pageviews"
,tscol: "timestamp"
,dimensions: [
['day']
,['day','country']
}
can be relatively easy translated to 2 scripts - this:
drop table metrics_daily.pageviews;
create table metrics_daily.pageviews as
select
date_trunc('day',"timestamp") as date
,count(distinct cookie_id) as "unique_pageviews"
from public.pageviews
group by 1;
and this:
drop table metrics_daily.pageviews_by_country;
create table metrics_daily.pageviews_by_country as
select
date_trunc('day',"timestamp") as date
,country
,count(distinct cookie_id) as "unique_pageviews"
from public.pageviews
group by 1,2;
the amount of complexity of a generator required to produce such sql from such config is quite low but in increases exponentially as you need to add new joins etc. It's much better to keep your dimensions in the encoded form and just use a single wide table as aggregation source, or produce views for every join you might need and use them as sources.

Slow select from one billion rows GreenPlum DB

I've created the following table on GreenPlum:
CREATE TABLE data."CDR"
(
mcc text,
mnc text,
lac text,
cell text,
from_number text,
to_number text,
cdr_time timestamp without time zone
)
WITH (
OIDS = FALSE,appendonly=true, orientation=column,compresstype=quicklz, compresslevel=1
)
DISTRIBUTED BY (from_number);
I've loaded one billion rows to this table but every query works very slow.
I need to do queries on all fields (not only one),
What can I do to speed up my queries?
Using PARTITION? using indexes?
maybe using a different DB like Cassandra or Hadoop?
This highly depends on the actual queries you are doing and what your hardware setup looks like.
Since you are querying all the fields the selectivity gained by going columnar orientation is probably hurting you more than helping, as you needs to scan all the data anyway. I would remove columnar orientation.
Generally speaking indexes don't help in a Greenplum system. Usually the amount of hardware that is involved tends to make scanning the data directory faster than doing index lookups.
Partitioning could be a great help but there would need to be a better understanding of the data. You are probably accessing specific time intervals so creating a partitioning scheme around cdr_time could eliminate the scan of data not needed for the result. The last thing I would worry about is indexes.
Your distribution by from_number could have an impact on query speed. The system will hash the data based on from_number so if you are querying selectively on the from_number the data will only be returned by the node that has it and you won't be leveraging the parallel nature of the system and spreading the request across all of the nodes. Unless you are joining to other tables on from_number, which allows the joins to be collocated and performed within the node, I would change that to be distributed RANDOMLY.
On top of all of that there is the question of what the hardware is and if you have a proper amount of segments setup and resources to feed them. Essentially every segment is a database. Good hardware can handle multiple segments per node, but if you are doing this on a light hardware you need to find the sweet spot where number of segments matches what the underlying system can provide.
#Dor,
I have same type of data where CDR info is stored for a telecom company, and daily 10-12 millions rows inserted and also heavy queries running on those CDRs related tables, I was also facing the same issue last year, and i have created partitions on those tables on the CDR timings column.
As per My understanding GP creates physical tables for each partition whereas logical tables created in other RDBMS. After this I got better performance with all SELECTs on these tables. Also I think you should convert text datatype to Character Varying for all columns (if text is really not required) I felt DB operations on Text field is very slow(specially order by, group by)
index will help you depends on your queries in my case i have huge inserts so i didnt try yet
If you are selecting all the columns in select so no need of Column Oriented table
Regards

Postgres partitioning?

My software runs a cronjob every 30 minutes, which pulls data from Google Analytics / Social networks and inserts the results into a Postgres DB.
The data looks like this:
url text NOT NULL,
rangeStart timestamp NOT NULL,
rangeEnd timestamp NOT NULL,
createdAt timestamp DEFAULT now() NOT NULL,
...
(various integer columns)
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table. At this rate, the cronjob will generate about 480 000 records a day and about 14.5 million a month.
I think the solution would be using several tables, for example I could use a specific table to store data generated in a given month: stats_2015_09, stats_2015_10, stats_2015_11 etc.
I know Postgres supports table partitioning. However, I'm new to this concept, so I'm not sure what's the best way to do this. Do I need partitioning in this case, or should I just create these tables manually? Or maybe there is a better solution?
The data will be queried later in various ways, and those queries are expected to run fast.
EDIT:
If I end up with 12-14 tables, each storing 10-20 millions rows, Postgres should be still able to run select statements quickly, right? Inserts don't have to be super fast.
Partitioning is a good idea under various circumstances. Two that come to mind are:
Your queries have a WHERE clause that can be readily mapped onto one or a handful of partitions.
You want a speedy way to delete historical data (dropping a partition is faster than deleting records).
Without knowledge of the types of queries that you want to run, it is difficult to say if partitioning is a good idea.
I think I can say that splitting the data into different tables is a bad idea because it is a maintenance nightmare:
You can't have foreign key references into the table.
Queries spanning multiple tables are cumbersome, so simple questions are hard to answer.
Maintaining tables becomes a nightmare (adding/removing a column).
Permissions have to be carefully maintained, if you have users with different roles.
In any case, the place to start is with Postgres's documentation on partitioning, which is here. I should note that Postgres's implementation is a bit more awkward than in other databases, so you might want to review the documentation for MySQL or SQL Server to get an idea of what it is doing.
Firstly, I would like to challenge the premise of your question:
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table.
As far as I know, there is no fundamental reason why the database would not cope fine with a single table of many millions of rows. At the extreme, if you created a table with no indexes, and simply appended rows to it, Postgres could simply carry on writing these rows to disk until you ran out of storage space. (There may be other limits internally, I'm not sure; but if so, they're big.)
The problems only come when you try to do something with that data, and the exact problems - and therefore exact solutions - depend on what you do.
If you want to regularly delete all rows which were inserted more than a fixed timescale ago, you could partition the data on the createdAt column. The DELETE would then become a very efficient DROP TABLE, and all INSERTs would be routed through a trigger to the "current" partition (or could even by-pass it if your import script was aware of the partition naming scheme). SELECTs, however, would probably not be able to specify a range of createAt values in their WHERE clause, and would thus need to query all partitions and combine the results. The more partitions you keep around at a time, the less efficient this would be.
Alternatively, you might examine the workload on the table and see that all queries either already do, or easily can, explicitly state a rangeStart value. In that case, you could partition on rangeStart, and the query planner would be able to eliminate all but one or a few partitions when planning each SELECT query. INSERTs would need to be routed through a trigger to the appropriate table, and maintenance operations (such as deleting old data that is no longer needed) would be much less efficient.
Or perhaps you know that once rangeEnd becomes "too old" you will no longer need the data, and can get both benefits: partition by rangeEnd, ensure all your SELECT queries explicitly mention rangeEnd, and drop partitions containing data you are no longer interested in.
To borrow Linus Torvald's terminology from git, the "plumbing" for partitioning is built into Postgres in the form of table inheritance, as documented here, but there is little in the way of "porcelain" other than examples in the manual. However, there is a very good extension called pg_partman which provides functions for managing partition sets based on either IDs or date ranges; it's well worth reading through the documentation to understand the different modes of operation. In my case, none quite matched, but forking that extension was significantly easier than writing everything from scratch.
Remember that partitioning does not come free, and if there is no obvious candidate for a column to partition by based on the kind of considerations above, you may actually be better off leaving the data in one table, and considering other optimisation strategies. For instance, partial indexes (CREATE INDEX ... WHERE) might be able to handle the most commonly queried subset of rows; perhaps combined with "covering indexes", where Postgres can return the query results directly from the index without reference to the main table structure ("index-only scans").

Search Engine Database (Cassandra) & Best Practise

I'm currently storing rankings in MongoDB (+ nodejs as API) . It's now at 10 million records, so it's okay for now but the dataset will be growing drastically in the near future.
At this point I see two options:
MongoDB Sharding
Change Database
The queries performed on the database will not be text searches, but for example:
domain, keyword, language, start date, end date
keyword, language, start date, end date
A rank contains a:
1. domain
2. url
3. keyword
4. keyword language
5. position
6. date (unix)
Requirement is to be able to query and analyze the data without caching. For example get all data for domain x, between dates y, z and analyze the data.
I'm noticing a perfomance decrease lately and I'm looking into other databases. The one that seems to fit the job best is Cassandra, I did some testing and it looked promising, performance is good. Using Amazon EC2 + Cassandra seems a good solution, since it's easilly scalable.
Since I'm no expert on Cassandra I would like to know if Cassandra is the way to go. Secondly, what would be the best practice / database model.
Make a collection for (simplified):
domains (domain_id, name)
keywords (keyword_id, name, language)
rank (domain_id, keyword_id, position, url, unix)
Or put all in one row:
domain, keyword, language, position, url, unix
Any tips, insights would be greatly appreciated.
Cassandra relies heavily on query driven modelling. It's very restrictive in how you can query, but it is possible to fit an awful lot of requirements within those capabilities. For any large scale database, knowing your queries is important, but in terms of cassandra, it's almost vital.
Cassandra has the notion of primary keys. Each primary key consists of one or more keys (read columns). The first column (which may be a composite) is referred to as the partition key. Cassandra keeps all "rows" for a partition in the same place (on disk, in mem, etc.), and a partition is the unit of replication, etc.
Additional keys in the primary key are called clustering keys. Data within a partition are ordered according to successive clustering keys. For instance, if your primary key is (a, b, c, d) then data will be partitioned by hashing a, and within a partition, data will be ordered by b, c and d.
For efficient querying, you must hit one (or very few) partitions. So your query must have a partition key. This MUST be exact equality (no starts with, contains, etc.). Then you need to filter down to your targets. This can get interesting too:
Your query can specify exact equality conditions for successive clustering keys, and a range (or equality) for the last key in your query. So, in the previous example, this is allowed:
select * from tbl where a=a1 and b=b1 and c > c1;
This is not:
select * from tbl where a=a1 and b>20 and c=c1;
[You can use allow filtering for this]
or
select * from tbl where a=a1 and c > 20;
Once you understand the data storage model, this makes sense. One of the reason cassandra is so fast for queries is that it pin points data in a range and splats it out. If it needed to do pick and choose, it'd be slower. You can always grab data and filter client side.
You can also have secondary indexes on columns. These would allow you to filter on exact equality on non-key columns. Be warned, never use a query with a secondary index without specifying a partition key. You'll be doing a cluster query which will time out in real usage. (The exception is if you're using Spark and locality is being honoured, but that's a different thing altogether).
In general, it's good to limit partition sizes to less than a 100mb or at most a few hundred meg. Any larger, you'll have problems. Usually, a need for larger partitions suggests a bad data model.
Quite often, you'll need to denormalise data into multiple tables to satisfy all your queries in a fast manner. If your model allows you to query for all your needs with the fewest possible tables, that's a really good model. Often that might not be possible though, and denormalisation will be necessary. For your question, the answer to whether or not all of it goes in one row depends on whether you can still query it and keep partition sizes less than 100 meg or not if everything is in one row.
For OLTP, cassandra will be awesome IF you can build the data model that works the way Cassandra does. Quite often OLAP requirements won't be satisfied by this. The current tool of choice for OLAP with Cassandra data is the DataStax Spark connector + Apache Spark. It's quite simple to use, and is really powerful.
That's quite a brain dump. But it should give you some idea of the things you might need to learn if you intend to use Cassandra for a real world project. I'm not trying to put you off Cassandra or anything. It's an awesome data store. But you have to learn what it's doing to harness its power. It works very different to Mongo, and you should expect a mindshift when switching. It's most definitely NOT like switching from mysql to sql server.

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns are numerical values for which we need to aggregate on (avg, stddev, and use R to calculate statistics)
a few columns are "type_id" columns, since in a particular query we may
want to only include certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since can easily add/drop/replace partitions
database is nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof-of-concept, I can see this will scale nicely, but I can see hadoop/hdfs has latency I've read that that it's generally not used for real time querying (even though I'm ok with returning results back to users in 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavy weight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2]- (E[X])^2)
this implies that you can get the stddev of AB by doing
sqrt(E[AB^2]-(E[AB])^2). E[AB^2] is (sum(A^2) + sum(B^2))/(|A|+|B|)
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is serious problem without immidiate good solution in the open source space. In commercial space MPP databases like greenplum/netezza should do.
Ideally you would need google's Dremel (engine behind BigQuery). We are developing open source clone, but it will take some time...
Regardless of the engine used I think solution should include holding the whole dataset in memory - it should give an idea what size of cluster you need.
If I understand you correctly and you only need to aggregate on single columns at a time
You can store your data differently for better results
in HBase that would look something like
table per data column in today's setup and another single table for the filtering fields (type_ids)
row for each key in today's setup - you may want to think how to incorporate your filter fields into the key for efficient filtering - otherwise you'd have to do a two phase read (
column for each table in today's setup (i.e. few thousands of columns)
HBase doesn't mind if you add new columns and is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant value which you can do avg. etc. quite easily
You might want to use a plain old database for this. It doesn't sound like you have a transactional system. As a result you can probably use just one or two large tables. SQL has problems when you need to join over large data. But since your data set doesn't sound like you need to join, you should be fine. You can have the indexes setup to find the data set and the either do in SQL or in app math.