How much data can practically be stored in a Postgres table field? - postgresql

I have an application that stores data in a Postgres database that I am considering extending. This would entail storing larger amounts of data in fields of a table. How large can the data in a field reasonably be before one starts running into performance issues? 100kb? 1mb? 100mb? 500mb? Does it matter what type the data is stored as (other than the fact the binary data tends to be more compact)?

Up to 1 GB per field, 32 TB per relation.
Upper limits used to be defined on the "about" page of Postgres; they have since been moved to the manual page "PostgreSQL Limits".
But storing massive amounts of data in table columns is typically a bad idea. If you want to change anything in a 1 GB field, Postgres has to write a new row version, which is extremely inefficient. That's not the use case relational databases are optimized for.
Consider storing large objects in files or at least use binary large objects for this. See:
Storing long binary (raw data) strings
I would think twice before even storing megabytes of data into a single table field. You can do it. Doesn't mean you should.
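As a sketch of the two alternatives suggested above (all table and column names here are illustrative, not from the original application):

```sql
-- Option 1: keep only a reference in the table; the bytes live on the file system
CREATE TABLE document (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    filename  text NOT NULL,
    file_path text NOT NULL
);

-- Option 2: a large object; the table row stores only a small OID
CREATE TABLE document_lo (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    data oid
);
SELECT lo_import('/tmp/big_file.bin');  -- server-side import, returns the new OID
```

With large objects, the lo_* functions read and write the data in chunks, so modifying part of it does not force Postgres to rewrite a whole multi-megabyte row version.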

Related

Redshift Performance of Flat Tables Vs Dimension and Facts

I am trying to create a dimensional model on top of flat OLTP tables (not in 3NF).
Some people think the dimensional model tables are not required, because most of the data for the reports is present in a single table. But that table contains more columns than we need, around 300. Should I still separate the flat table into dimensions and facts, or just use the flat tables directly in the reports?
You've asked a generic question about database modelling for data warehouses, which is going to get you generic answers that may not apply to the database platform you're working with - if you want answers that you're going to be able to use then I'd suggest being more specific.
The question tags indicate you're using Amazon Redshift, and the answer for that database is different from traditional relational databases like SQL Server and Oracle.
Firstly you need to understand how Redshift differs from regular relational databases:
1) It is a Massively Parallel Processing (MPP) system, which consists of one or more nodes that the data is distributed across, and each node typically does a portion of the work required to answer each query. Therefore the way data is distributed across the nodes becomes important: the aim is usually to have the data distributed fairly evenly, so that each node does about an equal amount of work for each query.
2) Data is stored in a columnar format. This is completely different from the row-based format of SQL Server or Oracle. In a columnar database, data is stored in a way that makes large aggregation-type queries much more efficient. This type of storage partially negates the reason for dimension tables, because storing repeating data (attributes) in rows is relatively efficient.
Redshift tables are typically distributed across the nodes using the values of one column (the distribution key). Alternatively they can be randomly but evenly distributed or Redshift can make a full copy of the data on each node (typically only done with very small tables).
So when deciding whether to create dimensions you need to think about whether this is actually going to bring much benefit. If there are columns in the data that regularly get updated then it will be better to put those in another, smaller table rather than update one large table. However if the data is largely append-only (unchanging) then there's no benefit in creating dimensions. Queries grouping and aggregating the data will be efficient over a single table.
JOINs can become very expensive on Redshift unless both tables are distributed on the same value (e.g. a user id) - if they aren't Redshift will have to physically copy data around the nodes to be able to run the query. So if you have to have dimensions, then you'll want to distribute the largest dimension table on the same key as the fact table (remembering that each table can only be distributed on one column), then any other dimensions may need to be distributed as ALL (copied to every node).
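As a sketch of these distribution choices (table and column names are hypothetical):

```sql
-- Fact and largest dimension distributed on the same key: co-located joins
CREATE TABLE fact_events (
    user_id  bigint,
    event_ts timestamp,
    amount   decimal(12,2)
)
DISTSTYLE KEY
DISTKEY (user_id)
SORTKEY (event_ts);

CREATE TABLE dim_users (
    user_id bigint,
    country varchar(2)
)
DISTKEY (user_id);

-- Small dimension copied in full to every node
CREATE TABLE dim_codes (
    code  varchar(10),
    label varchar(100)
)
DISTSTYLE ALL;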
My advice would be to stick with a single table unless you have a pressing need to create dimensions (e.g. if there are columns being frequently updated).
When creating tables purely for reporting purposes (as is typical in a Data Warehouse), it is customary to create wide, flat tables with non-normalized data because:
It is easier to query
It avoids JOINs that can be confusing and error-prone for casual users
Queries run faster (especially for Data Warehouse systems that use columnar data storage)
This data format is great for reporting, but is not suitable for normal data storage for applications — a database being used for OLTP should use normalized tables.
Do not be worried about having a large number of columns — this is quite normal for a Data Warehouse. However, 300 columns does sound rather large and suggests that they aren't necessarily being used wisely. So, you might want to check whether they are required.
A great example of many columns is having flags that make it easy to write WHERE clauses, such as WHERE customer_is_active rather than having to join to another table and figure out whether the customer has used the service in the past 30 days. These columns need to be recalculated daily, but are very convenient for querying data.
Bottom line: You should put ease of use above performance when using Data Warehousing. Then, figure out how to optimize access by using a Data Warehousing system such as Amazon Redshift that is designed to handle this type of data very efficiently.

"Big" data in a JSONB column

I have a table with a metadata column (JSONB). Sometimes I run queries on this column. Example:
select * from "a" where metadata->'b'->'c' is not null
This column normally holds just small JSON objects, <1KB. But for some records (less than 0.5%), it can be >500KB, because some sub-sub-properties contain a lot of data.
Today I only have ~1000 records and everything works fine. But I will have more records soon, and I don't know whether having some big data (I'm not talking about "Big Data", of course!) will have a global impact on performance. Is 500KB "big" for Postgres, and is it "hard" to parse? Maybe my question is too vague; I can edit it if required. In other words:
Does having some (<0.5%) bigger entries in a JSONB column noticeably affect the overall performance of JSON queries?
Side note: assuming the "big" data is in metadata->'c'->'d', I don't run any queries against this particular property. Queries are always done on "small/common" properties. But the "big" properties still exist.
It is a theoretical question, so I hope a generic answer will satisfy.
If you need numbers, I recommend that you run some performance tests. It shouldn't be too hard to generate some large jsonb objects for testing.
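For example, one way to generate a large jsonb value in the question's table (assuming the "a" table from the question; roughly 700 KB here, but the sizes are only indicative):

```sql
INSERT INTO "a" (metadata)
SELECT jsonb_build_object(
    'b', jsonb_build_object('c', 1),
    'c', jsonb_build_object(
        'd', (SELECT jsonb_agg(md5(g::text)) FROM generate_series(1, 20000) AS g)
    )
);
```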
As long as the data are jsonb and not json, operations like metadata->'b'->'c' will be fast. Where you could lose time is when a large jsonb value is loaded and uncompressed from the TOAST table (“detoasted”).
You can avoid that problem by creating an index on the expression. Then an index scan does not have to detoast the jsonb and hence will be fast, no matter how big the jsonb is.
So I think that you will not encounter problems with performance as long as you can index for the queries.
You should pay attention to the total size of your database. If the expected size is very large, backups will become a challenge. It helps to design and plan how to discard old and unused data. That is an aspect commonly ignored during application design that tends to cause headaches years later.

Redshift and ultra wide tables

In an attempt to handle custom fields for specific objects in a multi-tenant dimensional DW, I created an ultra-wide denormalized dimension table (hundreds of columns, up to the hard-coded column limit) that Redshift is not liking too much ;).
user1|attr1|attr2...attr500
Even an innocent update query on a single column for a handful of records takes approximately 20 seconds. (Which is kind of surprising, as I would guess it shouldn't be such a problem on a columnar database.)
Any pointers on how to modify the design for better reporting, from a normalized source table (one user has multiple different attributes, one attribute per line) to a denormalized one (one row per user with generic columns, different for each of the tenants)?
Or has anyone tried transposing (pivoting) normalized records into a denormalized view (table) in Redshift? I am worried about performance.
It's probably important to think about how Redshift stores data and then implements updates on that data.
Each column is stored in its own sequence of 1MB blocks, and the content of those blocks is determined by the SORTKEY. So, however many rows of the sort key's values fit in 1MB determines how many (and which) values are in the corresponding 1MB block of every other column.
When you ask Redshift to UPDATE a row it actually writes a new version of the entire block for all columns that correspond to that row - not just the block(s) which change. If you have 1,600 columns that means updating a single row requires Redshift to write a minimum of 1,600MB of new data to disk.
This issue can be amplified if your update touches many rows that are not located together. I'd strongly suggest choosing a SORTKEY that corresponds closely to the range of data being updated to minimise the volume of writes.
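The pivot the question asks about can be sketched with conditional aggregation, assuming a normalized source table user_attributes(user_id, attr_name, attr_value) and a known, fixed attribute list (all names hypothetical):

```sql
SELECT
    user_id,
    MAX(CASE WHEN attr_name = 'attr1' THEN attr_value END) AS attr1,
    MAX(CASE WHEN attr_name = 'attr2' THEN attr_value END) AS attr2
    -- ...one CASE expression per generic column
FROM user_attributes
GROUP BY user_id;
```

Materializing this as a table periodically (rather than updating the wide table in place) avoids the write amplification described above.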

Why does MongoDB take up so much space?

I am trying to store records with a set of doubles and ints (around 15-20) in MongoDB. The records mostly (99.99%) have the same structure.
When I store the data in ROOT, which is a very structured data storage format, the file is around 2.5GB for 22.5 million records. For Mongo, however, the database size (from the command show dbs) is around 21GB, whereas the data size (from db.collection.stats()) is around 13GB.
This is a huge overhead (to clarify: 13GB vs 2.5GB; I'm not even talking about the 21GB), and I guess it is because it stores both keys and values. So the question is, why and how does Mongo not do a better job of making it smaller?
But the main question is, what is the performance impact in this? I have 4 indexes and they come out to be 3GB, so running the server on a single 8GB machine can become a problem if I double the amount of data and try to keep a large working set in memory.
Any guesses into if I should be using SQL or some other DB? or maybe just keep working with ROOT files if anyone has tried them?
Basically, this is Mongo preparing for the insertion of data. Mongo performs preallocation of storage for data to prevent (or minimize) fragmentation on disk. This preallocation is observed in the form of files that the mongod instance creates.
First it creates a 64MB file, then 128MB, doubling each time until it reaches files of 2GB (the maximum size of preallocated data files).
There are some more things that Mongo does that might account for more disk usage, things like journaling...
For much, much more info on how MongoDB uses storage space, you can take a look at the storage FAQ, in particular the section titled Why are the files in my data directory larger than the data in my database?
There are some things that you can do to minimize the space that is used, but these techniques (such as using the --smallfiles option) are usually only recommended for development and testing use - never for production.
Question: Should you use SQL or MongoDB?
Answer: It depends.
Better way to ask the question: Should you use a relational database or a document database?
Answer:
If your data is highly structured (every row has the same fields), or you rely heavily on foreign keys and you need strong transactional integrity on operations that use those related records... use a relational database.
If your records are heterogeneous (different fields per document) or have variable length fields (arrays) or have embedded documents (hierarchical)... use a document database.
My current software project uses both. Use the right tool for the job!

SQLite optimization -- two databases vs. one

I'm developing an iPhone app with two SQLite databases. I'm wondering if I should be using only one.
The first database has a dozen small tables (each having four or five INTEGER or TEXT columns). Most of the tables have only a few dozen rows, but one table will have several thousand rows of "Items." The tables are all related so they have many joins between them.
The second database contains only one table. It has a BLOB column with one row related to some of the "Items" from the first database (photos of the items). These BLOBs are only used in a small part of the program, so they are rarely attached or joined with the first database (rows can be fetched easily without joins).
The first database file size will be about a megabyte; the db with the BLOBs will be much larger -- about 50 megabytes. I separated the BLOBs because I wanted to be able to back up the smaller database without having to carry the BLOBs, too. I also thought separating the BLOBs might improve performance for the smaller tables, but I'm not sure.
What are the pros and cons of having two databases (SQLite databases in particular) to separate the BLOBs, versus having just one database with the very narrow tables mixed in with the table of BLOBs?
Even at 50 MB, that is a very small database size, and you're not going to see a performance difference due to the size of the database itself.
However, if you think performance is an issue, check your DB queries and make sure you're not SELECTing the BLOB columns in queries where you don't need that data. Returning more data from the database than you need (think of it as reading more from a hard drive than necessary) is where you're much more likely to see performance issues.
I don't see any pros to separating them, unless you are under a very tight restriction on the total size of the backup. It's just going to add complexity in weird places in your app.
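That said, if the two-file layout is kept, SQLite's ATTACH makes the occasional cross-database join straightforward; a sketch with hypothetical table names:

```sql
-- From the connection holding the small database:
ATTACH DATABASE 'photos.db' AS photos;

SELECT i.id, i.name, p.image
FROM items AS i
JOIN photos.item_photos AS p ON p.item_id = i.id
WHERE i.id = 42;

DETACH DATABASE photos;
```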