Here is my table structure
id
name varchar(150),
timestamp_one bigint,
timestamp_two bigint,
value double,
additional_list jsonb
I also have an index on these three fields:
name (varchar(150)), timestamp_one (bigint), additional_list (jsonb)
The database is quite fast with my queries and inserts. The problem I have is that it is growing a lot: with the rate of data I have, it goes up to 100GB within a day. My main concern here is storage. What can I improve here? Can PostgreSQL compress my data? Would it be worth creating another table for the name (varchar(150)) field (it is repeated across many rows) and storing a foreign key instead; would that save a lot of space? Any other ideas? Thanks.
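Roughly what I have in mind for the name lookup, as a sketch (all names here are illustrative):
-- lookup table that stores each distinct name once
CREATE TABLE names (
    id   int GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name varchar(150) NOT NULL UNIQUE
);
-- the main table would then carry a small integer instead of repeating the string:
-- name_id int NOT NULL REFERENCES names (id)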
I have this table in a PostgreSQL database with 6 million rows.
CREATE TABLE IF NOT EXISTS public.processed
(
id bigint NOT NULL DEFAULT nextval('processed_id_seq'::regclass),
created_at timestamp without time zone,
word character varying(200) COLLATE pg_catalog."default",
score double precision,
updated_at timestamp without time zone,
is_domain_available boolean,
CONSTRAINT processed_pkey PRIMARY KEY (id),
CONSTRAINT uk_tb03fca6mojpw7wogvaqvwprw UNIQUE (word)
)
I want to optimize it for performance, for example by adding an index on a column and by partitioning it.
Should I add an index only on the word column, or would it be better to index several columns?
What is the recommended way to partition this table?
Are there other recommended optimizations, for example adding compression?
First, there is no compression and there are no columnar indexes in PostgreSQL, unlike other RDBMSs that have those features (as an example, Microsoft SQL Server has 4 ways to compress data without needing to decompress it to read or seek, and can use columnstore indexes). For columnar indexes you have to go to the Fujitsu version of PostgreSQL, which costs a lot...
https://www.postgresql.fastware.com/in-memory-columnar-index-brochure
So the only ways you have to accelerate seeks on the "word" column are:
storing a hash of the word in an additional column, indexing it, and using that column for searches
effectively using partitioning with a balanced split, like Cutter-Sanborn tables
And finally, combining the two options.
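As a rough sketch of the first option (column and index names are illustrative; the generated column needs PostgreSQL 12 or later, otherwise a trigger or the application has to fill the hash):
ALTER TABLE public.processed
    ADD COLUMN word_hash text GENERATED ALWAYS AS (md5(word)) STORED;
CREATE INDEX processed_word_hash_idx ON public.processed (word_hash);
-- searches filter on the hash first, then re-check the word itself to rule out collisions:
SELECT * FROM public.processed
WHERE word_hash = md5('example') AND word = 'example';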
PostgreSQL v13
I am evaluating the use of the JSONB data type for a column in a table.
JSONB will be one of the columns in the table. The idea is to get flexibility: the information (keys) stored inside the JSON will not be the same every time, and we may add keys over time. The JSON object is expected to be under 50 KB and, once written, will not be changed.
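For reference, the kind of table I have in mind is roughly this (all names are illustrative):
CREATE TABLE events (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now(),
    payload    jsonb NOT NULL
);
-- optionally a GIN index, if the flexible keys ever need to be searched:
CREATE INDEX events_payload_idx ON events USING gin (payload);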
My concerns/questions:
This is an OLTP db and requires high read/write performance. Are there any performance issues with the JSONB data type?
Does having JSONB lead to more bloat in the table, so that we may suffer over time?
In general, please share your experience with JSONB for such a use case.
Thanks in advance!
I am writing an application backed by Postgres DB.
The application is like a logging system, the main table is like this
create table if not exists logs
(
user_id bigint not null,
log bytea not null,
timestamp timestamptz not null default (clock_timestamp() at time zone 'UTC')
);
One of the main queries is to fetch all logs for a certain user_id, ordered by timestamp desc. It would be nice if, under the hood, Postgres stored all rows for the same user_id in one page or in sequential pages, instead of scattering them here and there on the disk.
As I recall from textbooks, is this what is called an "index-sequential file"? How can I guide Postgres to do that?
The simple thing to do is to create a B-tree index to speed up the search:
CREATE INDEX logs_user_time_idx ON logs (user_id, timestamp);
That would speed up the query, but take extra space on the disk and slow down all INSERT operations on the table (the index has to be maintained). There is no free lunch!
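For example, a query like this (the user_id value is just a placeholder) can use that index both to find the rows and to return them already sorted, via a backward index scan:
-- the column is quoted only because it shares its name with the data type
SELECT log, "timestamp"
FROM logs
WHERE user_id = 42
ORDER BY "timestamp" DESC;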
I assume that you were talking about that when you mentioned "index-sequential files". But perhaps you meant what is called a clustered index or index-organized table, which essentially keeps the table itself in a certain order. That can speed up searches like that even more. However, PostgreSQL does not have that feature.
The best you can do to make disk access more efficient in PostgreSQL is to run the CLUSTER command, which rewrites the table in index order:
CLUSTER logs USING logs_user_time_idx;
But be warned:
That statement rewrites the whole table, so it could take a long time. During that time, the table is inaccessible.
Subsequent INSERTs won't maintain the order in the table, so it “rots” over time, and after a while you will have to CLUSTER the table again.
The title isn't very specific, so I'll elaborate.
I'm working on a database system in which users can add data to a Postgres database through a watered-down API.
So far, all the users' data is compiled into one table, structured similar to this:
CREATE TABLE UserData (
userId int NOT NULL,
dataId int NOT NULL PRIMARY KEY,
key varchar(255) NOT NULL,
data json NOT NULL
);
However, I am thinking that it may be more efficient (and give faster queries) to instead give each userId its own table:
CREATE TABLE UserData_{userId} (
dataId int NOT NULL PRIMARY KEY,
key varchar(255) NOT NULL,
data json NOT NULL
);
CREATE TABLE UserData_{anotherUserId} ();
etc...
I am worried that this will clog up the database, however.
What are the pros and cons for each? Under what load/speed requirements would each serve well? And which of these do you think would be better for a high-load, high-speed scenario?
What you are suggesting is essentially partitioning, so I suggest reading the docs about that. It's mainly advantageous when your operations each cover most of one partition (i.e. select all data for one user, or delete all data for one user).
Most use cases, however, are better served by having one properly indexed table. It's a much simpler structure, and can be very performant. If all of your queries are for a single user, then you'll want all of the indexes to start with the userId column, and postgres will use them to efficiently reach only the relevant rows. And if a day comes when you want to query data across multiple users, it will be much easier to do that.
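A minimal sketch of that approach, assuming the lookups filter on key as well (the index name is illustrative):
CREATE INDEX userdata_userid_key_idx ON UserData (userId, key);
-- a per-user query then only touches that user's rows:
SELECT data FROM UserData WHERE userId = 42 AND key = 'settings';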
I advise you not to take my word for it, though. Create both structures, generate fake data to fill them up, and see how they behave!
Consider:
You might end up with an unbounded number of tables if you have one per user. How many "users" do you expect?
The json data is unbounded and might grow as your solution/app grows. How will you handle missing keys/values?
With a table per user, the schema grows horizontally (more tables and columns), whereas you should always aim to grow vertically (more rows).
A better solution would be to hold your data in tables related to the user_id.
ie. a "keys" table which holds the key, date_added, active and foreign key (user_id)
This will also solve saving your data as a json which, in you example, will be difficult to maintain. Rather open that json up into a table where you can benefit from indexes and clustering.
If you reference your user_id in separate tables as a foreign key, you can partition or cluster these tables on that key to significantly increase speed and compensate for growth. Which means you have a single table for users (id, name, active, created_at, ...) and lots of tables linked to that user, eg.
subscriptions (id, user_id, ...), items (id, user_id, ...), things (id,user_id, ...)
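Purely as a sketch, that layout could look something like this (all names and defaults are illustrative):
CREATE TABLE users (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name       text NOT NULL,
    active     boolean NOT NULL DEFAULT true,
    created_at timestamptz NOT NULL DEFAULT now()
);
CREATE TABLE keys (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    user_id    bigint NOT NULL REFERENCES users (id),
    key        varchar(255) NOT NULL,
    date_added timestamptz NOT NULL DEFAULT now(),
    active     boolean NOT NULL DEFAULT true
);
-- index the foreign key so per-user lookups stay fast as the tables grow:
CREATE INDEX keys_user_id_idx ON keys (user_id);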
When I have the following table:
CREATE TABLE test
(
"id" integer NOT NULL,
"myval" text NOT NULL,
CONSTRAINT "test-id-pkey" PRIMARY KEY ("id")
)
When doing a lot of queries like the following:
UPDATE "test" set "myval" = "myval" || 'foobar' where "id" = 12345
Then the myval value of that row will get larger and larger over time.
What will PostgreSQL do? Where will it get the space from?
Can I avoid PostgreSQL needing more than one seek to read a particular myval value?
Will PostgreSQL do this automatically?
I know that normally I should try to normalize the data much more. But I need to read the value with one seek. myval will grow by about 20 bytes with each update (that adds data). Some rows will get 1-2 updates, some 1000.
Normally I would just insert a new row instead of updating. But then selecting gets slow.
So I came to the idea of denormalizing.
Change the FILLFACTOR of the table to leave room in each page for future updates. These can then be HOT updates (because the text field doesn't have an index), which makes the updates faster and lowers autovacuum overhead, since HOT updates use a micro-vacuum. The CREATE TABLE documentation has some information about FILLFACTOR.
ALTER TABLE test SET (fillfactor = 70);
-- rebuild the table so pages get free space according to the new fillfactor:
VACUUM FULL ANALYZE test;
-- start testing
The value 70 is not necessarily the perfect setting; it depends on your unique situation. Maybe you're fine with 90; it could also be 40 or something else.
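One way to check whether the setting works for your workload is to compare HOT updates to total updates after running a representative load (this uses the standard statistics view):
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'test';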
This is related to this question about TEXT in PostgreSQL, or at least the answer is similar. PostgreSQL stores large columns away from the main table storage:
Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values.
So you can expect a TEXT (or BYTEA or large VARCHAR) column to always be stored away from the main table and something like SELECT id, myval FROM test WHERE id = 12345 will take two seeks to pull both columns off the disk (and more seeks to resolve their locations).
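If you want to see how much of the table actually lives in TOAST, a rough check is the following (pg_table_size includes TOAST plus the free space and visibility maps, pg_relation_size only the main heap):
SELECT pg_size_pretty(pg_relation_size('test')) AS main_heap,
       pg_size_pretty(pg_table_size('test') - pg_relation_size('test')) AS toast_and_maps;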
If your UPDATEs really are causing your SELECTs to slow down then perhaps you need to review your vacuuming strategy.