"Big" data in a JSONB column - postgresql

I have a table with a metadata column (JSONB). Sometimes I run queries on this column. Example:
select * from "a" where metadata->'b'->'c' is not null
This column almost always holds small JSON objects (<1KB). But for some records (less than 0.5%), it can be >500KB, because some sub-sub-properties contain a lot of data.
Today I only have ~1000 records and everything works fine. But I expect to have more records soon, and I don't know whether having some big data (I'm not talking about "Big Data", of course!) will have a global impact on performance. Is 500KB "big" for postgres and is it "hard" to parse? Maybe my question is too vague, I can edit if required. In other words:
Does having some (<0.5%) bigger entries in a JSONB column noticeably affect the global performance of JSON queries?
Side note: assuming the "big" data is in metadata->'c'->'d', I don't run any queries on this particular property. Queries are always done on "small/common" properties. But the "big" properties still exist.

It is a theoretical question, so I hope a generic answer will satisfy.
If you need numbers, I recommend that you run some performance tests. It shouldn't be too hard to generate some large jsonb objects for testing.
As long as the data are jsonb and not json, operations like metadata->'b'->'c' will be fast. Where you could lose time is when a large jsonb value is loaded and uncompressed from the TOAST table (“detoasted”).
You can avoid that problem by creating an index on the expression. Then an index scan does not have to detoast the jsonb and hence will be fast, no matter how big the jsonb is.
So I think that you will not encounter problems with performance as long as you can index for the queries.
You should pay attention to the total size of your database. If the expected size is very large, backups will become a challenge. It helps to design and plan how to discard old and unused data. That is an aspect commonly ignored during application design that tends to cause headaches years later.

Related

PostgreSQL varchar length performance impact

I have a table in PostgreSQL which is under heavy load (reads). It is practically core table of an application. One column is used as a discriminator - column used by application, that determines type of entity (class) that represents given row. It has to be exactly one varchar column. Currently I do store full class name in it, like: "bank_statement_transaction".
When application selects all bank statement transactions, query is built like ... WHERE Discriminator = 'bank_statement_transaction' . This brings more readability and clarity to data, structure and code.
Table contains currently 3M rows and counting, approximately 100k new rows monthly. Discriminator was indexed during some performance tunings. I don't have any performance issues right now.
I am working on a new feature that requires some little refactoring and yeah I had an idea to change full class name (bank_statement_transaction) to short unique codes (BST)
I replicated dbo and changed full class name to code. With 3M rows, performance gain is barely measurable, same or 1-2 milliseconds faster.
Can anyone share experience with the impact of VARCHAR length on INDEX size and performance? On a bigger data set? Is this change worth it?
If you index strings, the index will become larger if the strings are long. The fan-out will be less, so the index will become deeper.
With an index scan that searches for a few rows, this won't be noticeable: reading a few blocks more and running comparisons on longer strings may be lost in the noise for any but the simplest queries. Still, you'll be faster with smaller strings.
Maybe the most noticeable effect will be that a smaller index needs less RAM for caching, so the number of disk reads should go down.
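To get a feel for the fan-out argument above, here is a back-of-the-envelope sketch. It is only a rough model: the per-entry overhead is an assumed figure, and the function name estimated_levels is made up for illustration. It does not reproduce PostgreSQL's real page layout, and it ignores features such as index deduplication.

```python
# Rough model of B-tree depth. The per-entry overhead is an illustrative
# assumption, not PostgreSQL's exact on-disk layout.
PAGE_BYTES = 8192          # PostgreSQL's default block size
ENTRY_OVERHEAD_BYTES = 12  # assumed per-entry bookkeeping (hypothetical figure)


def estimated_levels(row_count: int, key_bytes: int) -> int:
    """Return how many B-tree levels are needed to address row_count keys."""
    fan_out = PAGE_BYTES // (key_bytes + ENTRY_OVERHEAD_BYTES)
    levels, capacity = 1, fan_out
    while capacity < row_count:
        capacity *= fan_out
        levels += 1
    return levels


rows = 3_000_000
long_key = len("bank_statement_transaction") + 1  # 26 characters + 1-byte header
short_key = len("BST") + 1                        # 3 characters + 1-byte header

print(estimated_levels(rows, long_key), estimated_levels(rows, short_key))
```

Under these assumptions both key lengths give the same depth at 3M rows, which fits the "barely measurable" result from the question. In this model the depths only diverge at far larger row counts.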

Optimizing aggregation function and ordering in PostgreSQL

I have the following table 'medicion' with the following fields:
id_variable [int] (PK),
id_departamento [int] (PK),
fecha [date] (PK),
valor [number].
So, I want to get the minimum, maximum and the average of valor grouping all that data by id_variable. So my query is:
SELECT AVG(valor), MIN(valor), MAX(valor)
FROM medicion
GROUP BY id_variable;
Knowing that by default PostgreSQL builds an index for the primary key
(id_departamento, id_variable, fecha)
how can I optimize this query? Should I create a new index on id_variable only, or is the default index usable for this query?
Thanks!
Since there is an avg(), and one needs all the values to compute an average, it's going to read the whole table. Unless you use a WHERE, but there is no WHERE, so I presume you want global statistics.
The only things an extra covering index brings are:
Not reading the entire table.
This could be beneficial if there was, say, 50 columns, or TEXTs which make the table file huge. In this case reading the whole table just to average a few int's would need to grind in tons of useless stuff from disk.
I mean, covering indexes are awesome when you want to snipe one column or two out of a huge table, and keep the small column set in cache. But this is not the case here, you only got small columns, so this reason is out.
...and of course slightly slower UPDATEs since the index needs to be updated. Also, the index needs to be cached, so it's going to use some RAM, etc.
Getting the rows pre-sorted for convenient aggregation.
This can matter here, mostly if it avoids a huge sort. However, if it only avoids a hash-aggregate, which is super fast anyway, it is not so useful.
Now, if you have relatively few distinct values of id_variable... say, enough to fit into a hash-aggregate, which can be a sizable amount, depends on your work_mem... then it'll be difficult to beat it...
If the table is not updated often, or is insert-only, and you need the statistics often, consider a materialized view (keep min/max/avg for each id_variable in a separate table, and keep them updated on each insert). Updating the mat-view takes time, so this is a tradeoff if you need the stats very often.
You could keep your stats in cache if you don't mind them being stale.
Or, if your table has tons of old data, you could partition it, and keep the min/max/sum/count for the old read-only partition, and only compute the stats on the new stuff.
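The idea of keeping min/max/sum/count for the old data and only computing fresh stats on the new rows can be sketched as below. This is an illustration in application code, not something PostgreSQL does for you. The class name PartialStats and the sample values are made up. The point is that sum and count are stored instead of the average, because averages of subsets cannot be combined directly.

```python
from dataclasses import dataclass


@dataclass
class PartialStats:
    """Mergeable summary of a set of values: enough to rebuild min/max/avg."""
    minimum: float
    maximum: float
    total: float
    count: int

    @classmethod
    def from_values(cls, values):
        values = list(values)
        return cls(min(values), max(values), sum(values), len(values))

    def merge(self, other):
        # min, max, sum and count all combine without revisiting the raw rows
        return PartialStats(
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
            self.total + other.total,
            self.count + other.count,
        )

    @property
    def average(self):
        return self.total / self.count


archived = PartialStats.from_values([4.0, 10.0, 7.0])  # computed once, then stored
recent = PartialStats.from_values([1.0, 8.0])          # recomputed on each request
combined = archived.merge(recent)
print(combined.minimum, combined.maximum, combined.average)
```

In a real setup there would be one such summary per id_variable, and the archived part would live in a side table.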

postgres many tables vs one huge table

I am using a PostgreSQL db.
My application manages many objects of the same type.
For each object my application performs intense db writing: each object has a line inserted into the db at least once every 30 seconds. I also need to retrieve the data by object id.
My question is how best to design the database. Should I use one huge table for all the objects (slower inserts) or a table for each object (more complicated retrievals)?
Tables are meant to hold a huge number of objects of the same type. So, your second option, that is one table per object, doesn't seem to look right. But of course, more information is needed.
My tip: start with one table. If you run into problems - mainly performance - try to split it up. It's not that hard.
Logically, you should use one table.
However, the so-called "write amplification" problem exhibited by PostgreSQL seems to have been one of the main reasons why Uber switched from PostgreSQL to MySQL. Quote:
"For tables with a large number of secondary indexes, these
superfluous steps can cause enormous inefficiencies. For instance, if
we have a table with a dozen indexes defined on it, an update to a
field that is only covered by a single index must be propagated into
all 12 indexes to reflect the ctid for the new row."
Whether this is a problem for your workload, only measurement can tell - I'd recommend starting with one table, measuring performance, and then switching to multi-table (or partitioning, or perhaps switching the DBMS altogether) only if the measurements justify it.
A single table is probably the best solution if you are certain that all objects will continue to have the same attributes.
INSERT does not get significantly slower as the table grows – it is the number of indexes that slows down data modification.
I'd rather be worried about data growth. Do you have a design for getting rid of old data? Big DELETEs can be painful; sometimes partitioning helps.

PostgreSQL using UUID vs Text as primary key

Our current PostgreSQL database is using GUIDs as primary keys and storing them as a Text field.
My initial reaction to this is that trying to perform any kind of minimal cartesian join would be a nightmare of indexing trying to find all the matching records. However, perhaps my limited understanding of database indexing is wrong here.
I'm thinking that we should be using UUID as these are stored as a binary representation of the GUID where a Text is not and the amount of indexing that you get on a Text column is minimal.
It would be a significant project to change these, and I'm wondering if it would be worth it?
When dealing with UUID numbers store them as data type uuid. Always. There is simply no good reason to even consider text as alternative. Input and output is done via text representation by default anyway. The cast is very cheap.
The data type text requires more space in RAM and on disk, is slower to process and more error prone. @khampson's answer provides most of the rationale. Oddly, he doesn't seem to arrive at the same conclusion.
This has all been asked and answered and discussed before. Related questions on dba.SE with detailed explanation:
Would index lookup be noticeably faster with char vs varchar when all values are 36 chars
What is the optimal data type for an MD5 field?
bigint?
Maybe you don't need UUIDs (GUIDs) at all. Consider bigint instead. It only occupies 8 bytes and is faster in every respect. Its range is often underestimated:
-9223372036854775808 to +9223372036854775807
That's 9.2 millions of millions of millions positive numbers. IOW, nine quintillion two hundred twenty-three quadrillion three hundred seventy-two trillion thirty-six something billion.
If you burn 1 million IDs per second (which is an insanely high number) you can keep doing so for 292471 years. And then another 292471 years for negative numbers. "Tens or hundreds of millions" is not even close.
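The 292471-year figure can be checked with a few lines of arithmetic. This sketch assumes 365-day years (leap days ignored), which is the convention that reproduces the number quoted above.

```python
BIGINT_MAX = 2**63 - 1                 # largest value of a signed 64-bit integer
IDS_PER_SECOND = 1_000_000
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 365-day years, ignoring leap days

# Whole years until the positive half of the bigint range is exhausted
years = BIGINT_MAX // (IDS_PER_SECOND * SECONDS_PER_YEAR)
print(years)
```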
UUID is really just for distributed systems and other special cases.
As @Kevin mentioned, the only way to know for sure with your exact data would be to compare and contrast both methods, but from what you've described, I don't see why this would be different from any other case where a string was either the primary key in a table or part of a unique index.
What can be said up front is that your indexes will probably be larger, since they have to store larger string values, and in theory the comparisons for the index will take a bit longer, but I wouldn't advocate premature optimization if to do so would be painful.
In my experience, I have seen very good performance on a unique index using md5sums on a table with billions of rows. I have found it tends to be other factors about a query which tend to result in performance issues. For example, when you end up needing to query over a very large swath of the table, say hundreds of thousands of rows, a sequential scan ends up being the better choice, so that's what the query planner chooses, and it can take much longer.
There are other mitigating strategies for that type of situation, such as chunking the query and then UNIONing the results (e.g. a manual simulation of the sort of thing that would be done in Hive or Impala in the Hadoop sphere).
Re: your concern about indexing of text, while I'm sure there are some cases where a dataset produces a key distribution such that it performs terribly, GUIDs, much like md5sums, sha1's, etc. should index quite well in general and not require sequential scans (unless, as I mentioned above, you query a huge swath of the table).
One of the big factors about how an index would perform is how many unique values there are. For that reason, a boolean index on a table with a large number of rows isn't likely to help, since it basically is going to end up having a huge number of row collisions for any of the values (true, false, and potentially NULL) in the index. A GUID index, on the other hand, is likely to have a huge number of values with no collision (in theory definitionally, since they are GUIDs).
Edit in response to comment from OP:
So are you saying that a UUID guid is the same thing as a Text guid as far as the indexing goes? Our entire table structure is using Text fields with a guid-like string, but I'm not sure Postgre recognizes it as a Guid. Just a string that happens to be unique.
Not literally the same, no. However, I am saying that they should have very similar performance for this particular case, and I don't see why optimizing up front is worth doing, especially given that you say to do so would be a very involved task.
You can always change things later if, in your specific environment, you run into performance problems. However, as I mentioned earlier, I think if you hit that scenario, there are other things that would likely yield better performance than changing the PK data types.
A UUID is a 128-bit data type (so, 16 bytes), whereas text has 1 or 4 bytes of overhead plus the actual length of the string. For a GUID, that would mean a minimum of 33 bytes, but could vary significantly depending on the encoding used.
So, with that in mind, certainly indexes of text-based UUIDs will be larger since the values are larger, and comparing two strings versus two numerical values is in theory less efficient, but is not something that's likely to make a huge difference in this case, at least not usual cases.
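The size difference is easy to see with Python's standard uuid module. This only measures the value itself. The 1-byte header that text adds for short strings, mentioned above, comes on top of the character counts.

```python
import uuid

value = uuid.uuid4()  # random example value

binary_bytes = len(value.bytes)     # the 128-bit form a native uuid column stores
hyphenated_chars = len(str(value))  # usual text form, with four hyphens
compact_chars = len(value.hex)      # same hex digits without the hyphens

print(binary_bytes, hyphenated_chars, compact_chars)
```

So a GUID stored as text needs at least 32 characters plus the header (the 33 bytes mentioned above), or 37 bytes in the common hyphenated form, against 16 bytes for uuid.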
I would not optimize up front when to do so would be a significant cost and is likely to never be needed. That bridge can be crossed if that time does come (although I would pursue other query optimizations first, as I mentioned above).
Regarding whether Postgres knows the string is a GUID, it definitely does not by default. As far as it's concerned, it's just a unique string. But that should be fine for most cases, e.g. matching rows and such. If you find yourself needing some behavior that specifically requires a GUID (for example, some non-equality based comparisons where a GUID comparison may differ from a purely lexical one), then you can always cast the string to a UUID, and Postgres will treat the value as such during that query.
e.g. for a text column foo, you can do foo::uuid to cast it to a uuid.
There's also a module available for generating uuids, uuid-ossp.

Postgres partitioning?

My software runs a cronjob every 30 minutes, which pulls data from Google Analytics / Social networks and inserts the results into a Postgres DB.
The data looks like this:
url text NOT NULL,
rangeStart timestamp NOT NULL,
rangeEnd timestamp NOT NULL,
createdAt timestamp DEFAULT now() NOT NULL,
...
(various integer columns)
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table. At this rate, the cronjob will generate about 480 000 records a day and about 14.5 million a month.
I think the solution would be using several tables, for example I could use a specific table to store data generated in a given month: stats_2015_09, stats_2015_10, stats_2015_11 etc.
I know Postgres supports table partitioning. However, I'm new to this concept, so I'm not sure what's the best way to do this. Do I need partitioning in this case, or should I just create these tables manually? Or maybe there is a better solution?
The data will be queried later in various ways, and those queries are expected to run fast.
EDIT:
If I end up with 12-14 tables, each storing 10-20 million rows, Postgres should still be able to run select statements quickly, right? Inserts don't have to be super fast.
Partitioning is a good idea under various circumstances. Two that come to mind are:
Your queries have a WHERE clause that can be readily mapped onto one or a handful of partitions.
You want a speedy way to delete historical data (dropping a partition is faster than deleting records).
Without knowledge of the types of queries that you want to run, it is difficult to say if partitioning is a good idea.
I think I can say that splitting the data into different tables is a bad idea because it is a maintenance nightmare:
You can't have foreign key references into the table.
Queries spanning multiple tables are cumbersome, so simple questions are hard to answer.
Maintaining tables becomes a nightmare (adding/removing a column).
Permissions have to be carefully maintained, if you have users with different roles.
In any case, the place to start is with Postgres's documentation on partitioning. I should note that Postgres's implementation is a bit more awkward than in other databases, so you might want to review the documentation for MySQL or SQL Server to get an idea of what it is doing.
Firstly, I would like to challenge the premise of your question:
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table.
As far as I know, there is no fundamental reason why the database would not cope fine with a single table of many millions of rows. At the extreme, if you created a table with no indexes, and simply appended rows to it, Postgres could simply carry on writing these rows to disk until you ran out of storage space. (There may be other limits internally, I'm not sure; but if so, they're big.)
The problems only come when you try to do something with that data, and the exact problems - and therefore exact solutions - depend on what you do.
If you want to regularly delete all rows which were inserted more than a fixed timescale ago, you could partition the data on the createdAt column. The DELETE would then become a very efficient DROP TABLE, and all INSERTs would be routed through a trigger to the "current" partition (or could even by-pass it if your import script was aware of the partition naming scheme). SELECTs, however, would probably not be able to specify a range of createdAt values in their WHERE clause, and would thus need to query all partitions and combine the results. The more partitions you keep around at a time, the less efficient this would be.
Alternatively, you might examine the workload on the table and see that all queries either already do, or easily can, explicitly state a rangeStart value. In that case, you could partition on rangeStart, and the query planner would be able to eliminate all but one or a few partitions when planning each SELECT query. INSERTs would need to be routed through a trigger to the appropriate table, and maintenance operations (such as deleting old data that is no longer needed) would be much less efficient.
Or perhaps you know that once rangeEnd becomes "too old" you will no longer need the data, and can get both benefits: partition by rangeEnd, ensure all your SELECT queries explicitly mention rangeEnd, and drop partitions containing data you are no longer interested in.
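As a sketch of the "import script aware of the partition naming scheme" idea, the helpers below map a timestamp to a monthly table name and list the partitions that have aged out. They assume the stats_YYYY_MM naming from the question. The function names partition_for and expired are made up for illustration, and the code only builds names; it does not talk to the database.

```python
from datetime import datetime


def partition_for(range_end: datetime, prefix: str = "stats") -> str:
    """Name of the monthly partition a row with this rangeEnd belongs to."""
    return f"{prefix}_{range_end:%Y_%m}"


def expired(partitions, cutoff: datetime):
    """Partitions holding only months before the cutoff month.

    Zero-padded names sort chronologically, so a string comparison is enough.
    """
    limit = partition_for(cutoff)
    return sorted(name for name in partitions if name < limit)


existing = ["stats_2015_11", "stats_2015_09", "stats_2015_10"]
print(partition_for(datetime(2015, 9, 30, 23, 30)))
print(expired(existing, datetime(2015, 11, 1)))
```

The import script would insert straight into the table named by partition_for, and a maintenance job could issue a DROP TABLE for each name returned by expired.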
To borrow Linus Torvalds's terminology from git, the "plumbing" for partitioning is built into Postgres in the form of table inheritance (see the Postgres documentation), but there is little in the way of "porcelain" other than examples in the manual. However, there is a very good extension called pg_partman which provides functions for managing partition sets based on either IDs or date ranges; it's well worth reading through the documentation to understand the different modes of operation. In my case, none quite matched, but forking that extension was significantly easier than writing everything from scratch.
Remember that partitioning does not come free, and if there is no obvious candidate for a column to partition by based on the kind of considerations above, you may actually be better off leaving the data in one table, and considering other optimisation strategies. For instance, partial indexes (CREATE INDEX ... WHERE) might be able to handle the most commonly queried subset of rows; perhaps combined with "covering indexes", where Postgres can return the query results directly from the index without reference to the main table structure ("index-only scans").