One big and wide table or many not so big for statistics data - postgresql

I'm writing simplest analytics system for my company. I have about 100 different event types that should be collected per tens of projects. We are not interested in cross-project analytic requests but events have similar types through all projects. I use PostgreSQL as primary storage for this system. Now I should decide which architecture is more preferable.
First architecture is one very big table (in terms of rows count) per project that contains data for all types of events. It will be about 20 or more columns many of them will be nullable. May be it will be used partitioning to split this table by event type but table still be so wide.
Second one architecture is a lot of tables (fairly big in terms of rows count but not so wide) with one table per event type.
I going to retrieve analytic data from this tables using different join queries (self join in case of first architecture). Which one is more preferable and where are pitfalls of them?
UPD. All events have about 10 common attributes. And remain attributes are varied from one event type to another.

In the past, I've had similar situations. With postgres you have a bunch of options.
Depending on how your data is input into the system (all at once/ a little at a time) and the volume of your data per project (hundreds of data points vs millions of data points) and the querying pattern (IE, querying after the data is all in, querying nightly, or reports running constantly throughout), there are many options. One other factor will be IF new project types (with new data point types) are likely to crop up.
First, in your "first architecture" the first question that comes up for me is: Are all the "data points" the same data type (or at least very similar). Are some text and others numeric? Are some numeric and others floats? If so, you're likely to run into issues with rolling up your data without either building a column or a table for every data type.
If all your data is the same datatype, then the first architecture you mentioned might work really well.
The second architecture you mentioned is OK especially if you don't predict having a bunch of new project types coming down the pike anytime soon, otherwise, you'll be constantly modifying the DB, which I prefer to avoid when unnecessary.
A third architecture that you didn't mention is to have a combination of 1 and 2. Basically have 1 table to hold the 10 common attributes and use either 1 or 2 to hold the additional attributes. This would have an advantage, especially if the additional data wasn't that frequently used, or was non-numeric.
Lastly, you could use one of PostgreSQLs "document store" type datatypes. You could store this information in arrays, hstores, or json. Now, this will be fairly inefficient if you're doing a ton of aggregate functions as you might be left calculating the aggregates outside of Pgsql, or at a minimum, running an inefficient query. You could store the 10 common fields in normal fields, and the additional ones as hstore or json.
I didn't ask you, but it'd be nice to know that if each event within a project had more than 1 data point (IE are you logging changes, or just updating data).If your overall table has less than 100,000 rows, it's likely just going to be best to focus on what's easier to maintain and program rather than performance, as small amounts of data are pretty quick regardless of how they're stored.

Related

Redshift Performance of Flat Tables Vs Dimension and Facts

I am trying to create dimensional model on a flat OLTP tables (not in 3NF).
There are people who are thinking dimensional model table is not required because most of the data for the report present single table. But that table contains more than what we need like 300 columns. Should I still separate flat table into dimensions and facts or just use the flat tables directly in the reports.
You've asked a generic question about database modelling for data warehouses, which is going to get you generic answers that may not apply to the database platform you're working with - if you want answers that you're going to be able to use then I'd suggest being more specific.
The question tags indicate you're using Amazon Redshift, and the answer for that database is different from traditional relational databases like SQL Server and Oracle.
Firstly you need to understand how Redshift differs from regular relational databases:
1) It is a Massively Parallel Processing (MPP) system, which consists of one or more nodes that the data is distributed across and each node typically does a portion of the work required to answer each query. There for the way data is distributed across the nodes becomes important, the aim is usually to have the data distributed in a fairly even manner so that each node does about equal amounts of work for each query.
2) Data is stored in a columnar format. This is completely different from the row-based format of SQL Server or Oracle. In a columnar database data is stored in a way that makes large aggregation type queries much more efficient. This type of storage partially negates the reason for dimension tables, because storing repeating data (attibutes) in rows is relatively efficient.
Redshift tables are typically distributed across the nodes using the values of one column (the distribution key). Alternatively they can be randomly but evenly distributed or Redshift can make a full copy of the data on each node (typically only done with very small tables).
So when deciding whether to create dimensions you need to think about whether this is actually going to bring much benefit. If there are columns in the data that regularly get updated then it will be better to put those in another, smaller table rather than update one large table. However if the data is largely append-only (unchanging) then there's no benefit in creating dimensions. Queries grouping and aggregating the data will be efficient over a single table.
JOINs can become very expensive on Redshift unless both tables are distributed on the same value (e.g. a user id) - if they aren't Redshift will have to physically copy data around the nodes to be able to run the query. So if you have to have dimensions, then you'll want to distribute the largest dimension table on the same key as the fact table (remembering that each table can only be distributed on one column), then any other dimensions may need to be distributed as ALL (copied to every node).
My advice would be to stick with a single table unless you have a pressing need to create dimensions (e.g. if there are columns being frequently updated).
When creating tables purely for reporting purposes (as is typical in a Data Warehouse), it is customary to create wide, flat tables with non-normalized data because:
It is easier to query
It avoids JOINs that can be confusing and error-prone for causal users
Queries run faster (especially for Data Warehouse systems that use columnar data storage)
This data format is great for reporting, but is not suitable for normal data storage for applications — a database being used for OLTP should use normalized tables.
Do not be worried about having a large number of columns — this is quite normal for a Data Warehouse. However, 300 columns does sound rather large and suggests that they aren't necessarily being used wisely. So, you might want to check whether they are required.
A great example of many columns is to have flags that make it easy to write WHERE clauses, such as WHERE customer_is_active rather than having to join to another table and figuring out whether they have used the service in the past 30 days. These columns would need to be recalculated daily, but are very convenient for querying data.
Bottom line: You should put ease of use above performance when using Data Warehousing. Then, figure out how to optimize access by using a Data Warehousing system such as Amazon Redshift that is designed to handle this type of data very efficiently.

postgres many tables vs one huge table

I am using postgresql db.
my application manages many objects of the same type.
for each object my application performs intense db writing - each object has a line inserted to db at least once every 30 seconds. I also need to retrieve the data by object id.
my question is how it's best to design the database? use one huge table for all the objects (slower inserts) or use table for each object (more complicated retrievals)?
Tables are meant to hold a huge number of objects of the same type. So, your second option, that is one table per object, doesn't seem to look right. But of course, more information is needed.
My tip: start with one table. If you run into problems - mainly performance - try to split it up. It's not that hard.
Logically, you should use one table.
However, so called "write amplification" problem exhibited by PostgreSQL seems to have been one of the main reasons why Uber switeched from PostgreSQL to MySQL. Quote:
"For tables with a large number of secondary indexes, these
superfluous steps can cause enormous inefficiencies. For instance, if
we have a table with a dozen indexes defined on it, an update to a
field that is only covered by a single index must be propagated into
all 12 indexes to reflect the ctid for the new row."
Whether this is a problem for your workload, only measurement can tell - I'd recommend starting with one table, measuring performance, and then switching to multi-table (or partitioning, or perhaps switching the DBMS altogether) only if the measurements justify it.
A single table is probably the best solution if you are certain that all objects will continue to have the same attributes.
INSERT does not get significantly slower as the table grows – it is the number of indexes that slows down data modification.
I'd rather be worried about data growth. Do you have a design for getting rid of old data? Big DELETEs can be painful; sometimes partitioning helps.

Is it good practice to have 2 or more tables with the same columns?

I'm creating a web-app that lets users search for restaurants and cafes. Since I currently have no data other than their type to differentiate the two, I have two options on storing the list of eateries.
Use a single table for both restaurants and cafes, and have an enum (text) column stating if an entry is a restaurant or cafe.
Create two separate tables, one for restaurants, and one for cafes.
I will never need to execute a query that collects data from both, so the only thing that matters to me I guess is performance. What would you suggest as the better option for PostgreSQL?
Typical database modeling would lend itself to a single table. The main reason is maintainability. If you have two tables with the same columns and your client decides they want to add a column, say hours of operation. You now have to write two sets of code for creating the column, reading the new column, updating the new column, etc. Also, what if your client wants you to start tracking bars, now you need a third table with a third set of code. It gets messy quick. It would be better to have two tables, a data table (say Establishment) with most of the columns (name, location, etc.) and then a second table that's a "type" table (say EstablishmentType) with a row for Restaurant, Cafe, Bar, etc. And of course a foreign key linking the two. This way you can have "X" types and only need to maintain a single set of code.
There are of course exceptions to this rule where you may want separate tables:
Performance due to a HUGE data set. (It depends on your server, but were talking at least hundreds of thousands of rows before it should matter in Postgres). If this is the reason I would suggest table inheritance to keep much of the proper maintainability while speeding up performance.
Cafes and Restaurants have two completely different sets of functionality in your website. If the entirety of your code is saying if Cafe, do this, if Restaurant, do that, then you already have two sets of code to maintain, with the added hassle of if logic in your code. If that's the case, two separate tables is a much cleaner and logical option.
In the end I chose to use 2 separate tables, as I really will never need to search for both at the same time, and this way I can expand a single table in the future if I need to add another data field specific to cafes, for example.

When to use composite types and arrays and when to normalize a database?

Is there any guideline on when to normalize a database or just use composite types and arrays?
When using arrays and composite types, I can use just a single table. I can also normalize the database and use a couple of tables and joins.
How do you decide which option is best?
Most of the time, stick to normalization. Among other things, keeping your database fairly well normalized helps with lock granularity. For example, if you have a "parent" object with two arrays in it, you cannot have transactions that are simultaneously adding/updating/modifying members of the arrays. If they're regular side tables, you can. (You can still SELECT ... FOR UPDATE the parent row before updating child objects if you want the serialized behaviour, though).
Updating an array to add/replace/delete a value is expensive, as PostgreSQL must rewrite the whole tuple the array is in as an MVCC update. (It has a few TOAST tricks up its sleeve that can help, but not tons). Ditto composite types embedded in rows.
Big wide rows full of arrays and composites mean slower table scans, meaning slower fetches for commonly used values.
IIRC you can't define a foreign key into a field of a composite type, so you'll find yourself working around that or giving up on referential integrity where it'd be good to have. Ditto arrays (there was work to get foreign keys to arrays to work but I don't think it ever got comitted).
Many client drivers (PgJDBC, psqlODBC, psycopg2, etc etc etc) have incomplete to nonexistent support for arrays and composites, so you'll often land up expanding them into tuples for client driver interaction anyway. Some things, like arrays of composite types, are really quite painful to work with.
Most ORMs, including common ones like Hibernate, totally suck at using anything beyond the most utterly simplistic lowest-common-denominator SQL features. Sooner or later, someone's going to want to point one of those at your data model, at which point much wailing and gnashing of teeth will ensue. OTOH, don't accomodate garbage ORMs to the point where you avoid using features that'll greatly improve the data model and solve real world problems - for example, if you have the choice of storing native hstore fields, or using an EAV schema, consider just using jstore (or better, in 9.4, json with hstore features).
(Perversely, this means that people who have the most "object oriented" programs often have the most purely relational databases because their tools suck).
Things like report generation tools will similarly struggle with composites and arrays, so you'll often land up creating views to present a normalized appearance for the DB anyway. Then ON INSERT OR UPDATE OR DELETE ... DO INSTEAD triggers on the views to enable writes. At which point it gets ugly.
Personally I recommend keeping composites for times when it's logical to model something as a "type". Consider, say, if your data model required you to track timestamps in their original time zone. There's no built-in type for this (no, that's not what "timestamp with time zone" does, despite the name, thanks SQL committee), so you might create a composite type that stored (timestamp without time zone, tzname) and use that consistently in your data model.
Similarly, I tend to use arrays in queries a lot, but not in the data model much. They're useful when you want to intentionally denormalize something for performance, but that's often done in a materialized view or similar. Even if it's a change to the main data model, it's the sort of thing you should be doing based on proper performance review, not just "optimizing" stuff you don't know is slow yet.

What NoSQL DB to use for sparse Time Series like data?

I'm planning a side project where I will be dealing with Time Series like data and would like to give one of those shiny new NoSQL DBs a try and am looking for a recommendation.
For a (growing) set of symbols I will have a list of (time,value) tuples (increasing over time).
Not all symbols will be updated; some symbols may be updated while others may not, and completely new symbols may be added.
The database should therefore allow:
Add Symbols with initial one-element (tuple) list. E.g. A: [(2012-04-14 10:23, 50)]
Update Symbols with a new tuple. (Append that tuple to the list of that symbol).
Read the data for a given symbol. (Ideally even let me specify the time frame for which the data should be returned)
The create and update operations should possibly be atomic. If reading multiple symbols at once is possible, that would be interesting.
Performance is not critical. Updates/Creates will happen roughly once every few hours.
I believe literally all the major NoSQL databases will support that requirement, especially if you don't actually have a large volume of data (which begs the question, why NoSQL?).
That said, I've had to recently design and work with a NoSQL database for time series data so can give some input on that design, which can then be extrapolated for all others.
Our chosen database was Cassandra, and our design was as follows:
A single keyspace for all 'symbols'
Each symbol was a new row
Each time entry was a new column for that relevant row
Each value (can be more than a single value) was the value part of the time entry
This lets you achieve everything you asked for, most notably to read the data for a single symbol, and using a range if necessary (column range calls). Although you said performance wasn't critical, it was for us and this was quite performant also - all data for any single symbol is by definition sorted (column name sort) and always stored on the same node (no cross node communication for simple queries). Finally, this design translates well to other NoSQL databases that have have dynamic columns.
Further to this, here's some information on using MongoDB (and capped collections if necessary) for a time series store: MongoDB as a Time Series Database
Finally, here's a discussion of SQL vs NoSQL for time series: https://dba.stackexchange.com/questions/7634/timeseries-sql-or-nosql
I can add to that discussion the following:
Learning curve for NoSQL will be higher, you don't get the added flexibility and functionality for free in terms of 'soft costs'. Who will be supporting this database operationally?
If you expect this functionality to grow in future (either as more fields to be added to each time entry, or much larger capacity in terms of number of symbols or size of symbol's time series), then definitely go with NoSQL. The flexibility benefit is huge, and the scalability you get (with the above design) on both the 'per symbol' and 'number of symbols' basis is almost unbounded (I say almost unbounded - maximum columns per row is in the billions, maximum rows per key space is unbounded I believe).
Have a look at opentsdb.org an opensource time series database which use hbase. They have been smart on how they store the TS. It is well documented here: http://opentsdb.net/misc/opentsdb-hbasecon.pdf