Performance improvement for fetching records from a Table of 10 million records in Postgres DB - postgresql

I have an analytics table that contains 10 million records, and to produce charts I have to fetch records from it. Several other tables are joined to this table. The data is fetched, but it currently takes around 10 minutes even though I have indexed the joined column. I have also used materialized views in Postgres, but performance is still very poor: the select query against the materialized view takes 5 minutes.
Please suggest a technique to get the result within 5 seconds. I don't want to change the DB storage structure, because too many code changes would be needed to support it. I would like to know whether there are any built-in methods for improving query speed.
Thanks in Advance

In general you can address this issue by creating a better data structure (most engines do this for you to an extent with keys).
If you add a column to sort on and build a tree-like structure (an index) over it, a lookup costs O(log N) rather than the O(N) scan you may be facing right now. That is usually a large speedup for searches.
This applies to binary trees, red-black trees and so on.
Another way to get a speedup may be to use something along the lines of Redis, i.e. a database caching layer.
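To make the difference concrete, here is a toy Python sketch comparing a full scan with a binary search over sorted keys. It only illustrates the O(N) versus O(log N) idea; it is not how Postgres implements its indexes (its default index type is a B-tree), and the function names are made up for this example.

```python
import bisect


def linear_find(keys, key):
    """O(N): look at every key until a match is found."""
    for i, candidate in enumerate(keys):
        if candidate == key:
            return i
    return -1


def sorted_find(sorted_keys, key):
    """O(log N): binary search, valid only if sorted_keys is sorted."""
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return i
    return -1
```

With 10 million sorted keys, `sorted_find` needs roughly 24 comparisons, while `linear_find` may need up to 10 million.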
For analytical reasons in the past I have also chosen to make use of technologies related to hadoop. Though this may be a larger migration in your case at this point.

Related

postgres many tables vs one huge table

I am using a PostgreSQL database.
My application manages many objects of the same type.
For each object my application performs intense DB writing: each object has a row inserted into the DB at least once every 30 seconds. I also need to retrieve the data by object id.
My question is how best to design the database: use one huge table for all the objects (slower inserts), or use a table for each object (more complicated retrievals)?
Tables are meant to hold a huge number of objects of the same type. So, your second option, that is one table per object, doesn't seem to look right. But of course, more information is needed.
My tip: start with one table. If you run into problems - mainly performance - try to split it up. It's not that hard.
Logically, you should use one table.
However, the so-called "write amplification" problem exhibited by PostgreSQL seems to have been one of the main reasons why Uber switched from PostgreSQL to MySQL. Quote:
"For tables with a large number of secondary indexes, these
superfluous steps can cause enormous inefficiencies. For instance, if
we have a table with a dozen indexes defined on it, an update to a
field that is only covered by a single index must be propagated into
all 12 indexes to reflect the ctid for the new row."
Whether this is a problem for your workload, only measurement can tell - I'd recommend starting with one table, measuring performance, and then switching to multi-table (or partitioning, or perhaps switching the DBMS altogether) only if the measurements justify it.
A single table is probably the best solution if you are certain that all objects will continue to have the same attributes.
INSERT does not get significantly slower as the table grows – it is the number of indexes that slows down data modification.
I'd rather be worried about data growth. Do you have a design for getting rid of old data? Big DELETEs can be painful; sometimes partitioning helps.

Getting Across to Someone Why Views Can't Be Used as Tables

For some reason I'm having a hard time getting across to some people that using a view in Postgres as you would use a table is a bad idea.
As some background, there are a number of tables containing completely static data that is updated every few months via a batch import into different tables by date - table_201603 or table_201607. A view has then been created called 'table' which clients then use which is just a 'SELECT * FROM' of the table. When an updated batch of data is put into a new table the view is then updated to point at the new table. This means an in-place rename of the table does not need to take place that might mean downtime. This is in a version of Postgres before 9.3 where materialized views came in, just to clarify. These tables generally have about 100 million rows in them.
This is understandably leading to some confusing results when people are querying these views with very inconsistent query times. Sometimes queries are taking seconds, other times 20 or 30 milliseconds.
Additional: This is geospatial data, so they're doing geospatial queries on a view.
I know what many of the pitfalls here are - views are created on-the-fly like a sub-query, you're very much at the whim of the query planner as to what predicates get brought down and how long results are cached as results aren't physically stored as tables - but can anyone see anything else and suggest a better way of doing this? I can imagine this would be a reasonably common scenario so it might help others.
Thanks,
In general, this reminds me of the use case for synonyms. However, there are no synonyms in Postgres, and the recommendation is to use views and/or separation by schema:
https://www.postgresql.org/message-id/kon2r2$mo6$1#ger.gmane.org

Cron job + new table vs materialized views for webapp analytics?

Hi
I am in the process of adding analytics to my SaaS app, and I'd love to hear other people's experiences doing this.
Currently I see two different approaches:
1. Do most of the data handling at the DB level, building and aggregating data into materialized views for a performance boost. This way the data will stay normalized.
2. Have different cron jobs/processes that run at different intervals (10 min, 1 hour etc.), query the database, and insert aggregate results into a new table. In this case, the metrics/analytics are denormalized.
Which approach makes the most sense, maybe something completely different?
On really big data, the cronjob or ETL is the only option. You read the data once, aggregate it and never go back. Querying aggregated data is then relatively cheap.
Views will go through tables. If you use "explain" for a view-based query, you might see the data is still being read from tables, possibly using indexes (if corresponding indexes exist). Querying terabytes of data this way is not viable.
The only problem with the cronjob/ETL approach is that it's a PITA to maintain. If you find a bug in the production environment, you are screwed. You might spend days or weeks fixing and recalculating aggregations. Simply said: you have to get it right the first time :)
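As a rough sketch of the cronjob approach, the Python below recomputes the aggregates for one day at a time, so a day can be re-run after a bug fix without touching the rest. It uses Python's built-in sqlite3 as a stand-in so that it runs without a server; against Postgres you would use a Postgres driver instead. The table names (`events`, `daily_event_counts`), columns, and function names are all hypothetical.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with hypothetical raw and rollup tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE events (created_at TEXT, event_type TEXT);
        CREATE TABLE daily_event_counts (
            day TEXT, event_type TEXT, n INTEGER,
            PRIMARY KEY (day, event_type)
        );
        """
    )
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [
            ("2024-01-01T10:00:00", "login"),
            ("2024-01-01T11:30:00", "login"),
            ("2024-01-02T09:15:00", "signup"),
        ],
    )
    conn.commit()
    return conn


def rollup_day(conn, day):
    """Recompute the aggregates for one day; re-running gives the same result."""
    with conn:  # both statements commit together or not at all
        conn.execute("DELETE FROM daily_event_counts WHERE day = ?", (day,))
        conn.execute(
            """
            INSERT INTO daily_event_counts (day, event_type, n)
            SELECT ?, event_type, COUNT(*)
            FROM events
            WHERE substr(created_at, 1, 10) = ?
            GROUP BY event_type
            """,
            (day, day),
        )
```

A scheduled job would call `rollup_day` for the period that just closed, and the charts would read only from the small rollup table.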

Moving the data from transaction table to history table to increase insert performance, postgres

I have 3 database tables, each contain 6 million rows and adding 3 million rows every year.
Following are the table information:
Table 1: 20 fields with an average of 50 characters in each field. It has 2 indexes, both on timestamp fields.
Table 2: 5 fields, 2 byte array field and 1 xml field
Table 3: 4 fields, 1 byte array field
Following is the usage:
Insert 15 to 20 records per second in each table.
A view is created by joining first 2 tables and the select is mostly based on the date field in the first table.
Right now, inserting one record into each of the three tables together takes about 100 milliseconds.
I'm planning to migrate from Postgres 8.4 to 9.2, and I would like to optimize insert performance as well. I'm also planning to create history tables and move the old records into those tables. I have the following questions in this regard:
Will creating history tables and moving older data into them help increase insert performance?
If it helps, how often do I need to move the old records into the history tables: daily, weekly, monthly, or yearly?
If I keep only one month of data (220,000 rows) instead of one year of data (3 million rows), will it help improve insert performance?
Thanks in advance,
Sudheer
I'm sure someone better informed than I will show up and provide a better answer, but my impression is that:
Insert performance is mostly a function of your indexing strategy and your hardware
Performance, in general, is better under 9.0+ than 8.4, and this may rub off on insert performance, but I'm not certain of that.
None of your ideas are going to directly affect insert performance
Now, that said, the cost of maintaining a small index is lower than a large one, so it may be that creating history tables and moving old data there will improve performance simply by reducing index pressure. But I would expect dropping one of your indexes to have a direct and greater effect. Perhaps you could have a history table with both indexes and just maintain one of them on the "today" table?
If I were in your shoes, I'd get a copy of production going on my machine running 8.4 with a similar configuration. Then upgrade to 9.2 and see if the insert performance changes. Then try out these ideas and benchmark them, see which ones improve the situation. It's absolutely essential that things be kept as similar to production as possible for this to yield useful information, but it will certainly be better information than any hypothetical answer you might get.
Now, 100ms seems pretty slow for inserting one row IMO. Better hardware would certainly improve this situation. The usual suggestion would be a big striped RAID array with a battery-backed cache. PostgreSQL 9.0 High Performance has more information on all of this.

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns are numerical values that we need to aggregate (avg, stddev, and use R to calculate statistics)
a few columns are "type_id" columns, since in a particular query we may
want to only include certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since can easily add/drop/replace partitions
database is nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof-of-concept, I can see this will scale nicely, but I can also see that hadoop/hdfs has latency, and I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users in 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavy weight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2]- (E[X])^2)
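This implies that you can get the stddev of AB (the values of A and B taken together) from per-set sums:
stddev(AB) = sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|) and E[AB] = (sum(A) + sum(B)) / (|A| + |B|)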
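In other words, each dataset only needs to keep three numbers per column: the count, the sum, and the sum of squares. Here is a minimal Python sketch of that idea, computing the population standard deviation; the names `Summary`, `summarize`, `merge` and `mean_and_stddev` are invented for illustration.

```python
import math
from typing import Iterable, NamedTuple, Tuple


class Summary(NamedTuple):
    n: int            # number of values
    total: float      # sum of the values
    total_sq: float   # sum of the squared values


def summarize(values: Iterable[float]) -> Summary:
    """Precompute the three sufficient statistics for one dataset column."""
    n, total, total_sq = 0, 0.0, 0.0
    for v in values:
        n += 1
        total += v
        total_sq += v * v
    return Summary(n, total, total_sq)


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two summaries without re-reading the underlying values."""
    return Summary(a.n + b.n, a.total + b.total, a.total_sq + b.total_sq)


def mean_and_stddev(s: Summary) -> Tuple[float, float]:
    """Population mean and stddev via sqrt(E[X^2] - (E[X])^2)."""
    mean = s.total / s.n
    # Clamp at zero: rounding can make the difference slightly negative.
    variance = max(s.total_sq / s.n - mean * mean, 0.0)
    return mean, math.sqrt(variance)
```

Merging the summaries of any subset of datasets is then a handful of additions per key. One caveat: this formula can lose precision when the mean is large relative to the spread, so a more numerically careful variant may be needed for such data.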
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is a serious problem without an immediate good solution in the open source space. In the commercial space, MPP databases like Greenplum/Netezza should do.
Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open source clone, but it will take some time...
Regardless of the engine used I think solution should include holding the whole dataset in memory - it should give an idea what size of cluster you need.
If I understand you correctly and you only need to aggregate on a single column at a time, you can store your data differently for better results.
In HBase that would look something like:
table per data column in today's setup and another single table for the filtering fields (type_ids)
row for each key in today's setup - you may want to think how to incorporate your filter fields into the key for efficient filtering - otherwise you'd have to do a two phase read (
column for each table in today's setup (i.e. few thousands of columns)
HBase doesn't mind if you add new columns and is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant values, on which you can compute avg etc. quite easily.
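The layout described above can be mimicked with plain Python dictionaries to show what reading a row buys you. This is only an illustration of the data shape, not HBase client code; the dataset ids, keys, and the `aggregate_row` helper are hypothetical.

```python
from statistics import mean, pstdev

# One "table" for a single data column: row per key, column per dataset.
# Missing entries are simply absent, mirroring HBase's sparse storage.
rows = {
    "key1": {"ds1": 1.0, "ds2": 3.0, "ds3": 5.0},
    "key2": {"ds1": 2.0, "ds3": 4.0},
}


def aggregate_row(row, chosen_datasets):
    """Average and population stddev over the datasets the user selected."""
    values = [row[d] for d in chosen_datasets if d in row]
    if not values:
        return None
    return mean(values), pstdev(values)
```

For example, `aggregate_row(rows["key1"], ["ds1", "ds2"])` reads one row and aggregates only the two selected datasets.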
You might want to use a plain old database for this. It doesn't sound like you have a transactional system, so you can probably use just one or two large tables. SQL has problems when you need to join over large data, but since it doesn't sound like you need joins, you should be fine. You can set up indexes to find the data set and then do the math either in SQL or in the app.