Text length reduction while storing in database table - mongodb

I have a scenario in which a column of my table looks as follows:
1234124124124
2343243253253
2131324324324
4545645354356
0982349874598
1298349832595
5365240240324
0980979879832
0924320982438
....
and so on, for many more rows.
I want an efficient way to compress all these rows of that column and store them in a single row. I just want to reduce the space the data occupies, using text compression or some similar technique. When storing everything in the same row, separators can be used to differentiate the values.
How can this be achieved?
Thanks in advance.

You don't need to alter your schema to save disk space; MongoDB's WiredTiger storage engine compresses data on disk by default, so you get the benefit of reduced disk space while keeping a schema that lets you query effectively.
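As a rough illustration of why no schema change is needed: repetitive numeric rows like these compress very well with any general-purpose codec, which is essentially what WiredTiger does at the block level (it uses snappy or zstd rather than zlib). A minimal sketch, with made-up sample rows:

```python
import zlib

# Hypothetical sample resembling the rows in the question: 13-digit strings.
rows = ["1234124124124", "2343243253253", "2131324324324",
        "4545645354356", "0982349874598", "1298349832595"] * 1000

raw = "\n".join(rows).encode()        # one value per line, newline as separator
packed = zlib.compress(raw, 6)       # general-purpose codec, stand-in for snappy/zstd

print(len(raw), len(packed))         # compressed size is a small fraction of raw
```

The point is that the storage engine already gets this win for you, block by block, without you packing values into a single row by hand.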

Related

Postgresql - How do I find compressed fields in a table?

Postgresql uses different storage techniques for field values. If values become large (e.g. are long texts), Postgresql will eventually apply compression and/or TOAST the values.
How do I find out which fields in a table are compressed?
(Background: I have a database that stores tiny BLOBs in some columns and I want to find out how much of the data is compressed. If compression is hardly ever applied, I want to turn it off so Postgres won't waste CPU cycles trying.)
Starting in v14, there is the function pg_column_compression.
select pg_column_compression(colname),count(*) from tablename group by 1;

Will gzipping + storage as bytea save more disk space over storage as text?

I have a table containing 30 million rows, and one of its columns is currently a text column. The column is populated with random strings between 2 and 10 kB in size. I don't need to search the strings directly.
I'm considering gzipping the strings before saving them (typically reducing them 2x in size) and instead save them in a bytea column.
I have read that Postgresql does some compression of text columns by default, so I'm wondering: Will there be any actual disk space reduction as a product of the suggested change?
I'm running PostgreSQL 9.3.
PostgreSQL stores text columns that exceed about 2000 bytes in a TOAST table and compresses the data.
The compression is fast, but not very good, so you can have some savings if you use a different compression method. Since the stored values are not very large, the savings will probably be small.
If you want to go that way, you should disable compression on that already compressed column:
ALTER TABLE tab
ALTER bin_col SET STORAGE EXTERNAL;
I'd recommend that you go with PostgreSQL's standard compression and keep things simple, but the best thing would be for you to run a test and see if you get a benefit from using custom compression.
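Whether an external codec beats PostgreSQL's built-in TOAST compression is easy to benchmark outside the database before committing to a schema change. A minimal sketch with gzip on made-up sample data (real strings from the table would give more representative numbers):

```python
import gzip
import random
import string

# Hypothetical stand-in for one of the question's "random" 10 kB strings.
random.seed(0)
text = "".join(random.choice(string.ascii_lowercase) for _ in range(10_000))

raw = text.encode()
packed = gzip.compress(raw, compresslevel=9)

# Lowercase letters carry ~4.7 bits of entropy per byte, so even this
# worst-case "random" data shrinks; structured real data compresses further.
print(len(raw), len(packed))
```

Running this against a representative sample of your actual column contents tells you whether the extra application-side complexity is worth it before you migrate 30 million rows.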

Redshift and ultra wide tables

In an attempt to handle custom fields for specific objects in a multi-tenant dimensional DW, I created an ultra-wide denormalized dimension table (hundreds of columns, up to the hard-coded column limit) that Redshift is not liking too much ;).
user1|attr1|attr2...attr500
Even an innocent update query on a single column for a handful of records takes approximately 20 seconds. (Which is somewhat surprising, as I would have guessed this shouldn't be a problem for a columnar database.)
Any pointers on how to change the design for better reporting, going from the normalized source table (one user has multiple attributes; one attribute per row) to the denormalized one (one row per user with generic columns, different for each tenant)?
Or has anyone tried transposing (pivoting) normalized records into a denormalized view (table) in Redshift? I am worried about performance.
It's probably important to think about how Redshift stores data and then how it implements updates on that data.
Each column is stored in its own sequence of 1 MB blocks, and the contents of those blocks are determined by the SORTKEY. However many rows of the sort key's values fit in 1 MB determines how many (and which) values go in the corresponding 1 MB block of every other column.
When you ask Redshift to UPDATE a row, it actually writes a new version of the entire block for every column that corresponds to that row, not just the block(s) that change. If you have 1,600 columns, that means updating a single row requires Redshift to write a minimum of 1,600 MB of new data to disk.
This issue can be amplified if your update touches many rows that are not located together. I'd strongly suggest choosing a SORTKEY that corresponds closely to the range of data being updated to minimise the volume of writes.
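The arithmetic above can be sketched as a quick back-of-the-envelope helper (the function name and parameters are mine, not a Redshift API):

```python
BLOCK_SIZE_MB = 1  # Redshift stores each column in 1 MB blocks


def min_update_write_mb(n_columns: int, n_blocks_touched: int = 1) -> int:
    """Lower bound on data rewritten when an UPDATE touches rows that
    all fall within `n_blocks_touched` blocks per column."""
    return n_columns * n_blocks_touched * BLOCK_SIZE_MB


print(min_update_write_mb(1600))     # → 1600: one row in a 1,600-column table
print(min_update_write_mb(1600, 5))  # → 8000: scattered rows hit more blocks
```

This is why a SORTKEY that clusters the rows you update matters so much: it shrinks `n_blocks_touched`, which multiplies directly into the write volume.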

How much data can practically be stored in a Postgres table field?

I have an application that stores data in a Postgres database that I am considering extending. This would entail storing larger amounts of data in fields of a table. How large can the data in a field reasonably be before one starts running into performance issues? 100kb? 1mb? 100mb? 500mb? Does it matter what type the data is stored as (other than the fact the binary data tends to be more compact)?
Up to 1 GB per field, 32 TB per relation.
The upper limits used to be defined in the "about" page of Postgres; they have since been moved to the manual page "PostgreSQL Limits".
But storing massive amounts of data in table columns is typically a bad idea. If you want to change anything in a 1 GB field, Postgres has to write a new row version, which is extremely inefficient. That's not the use case relational databases are optimized for.
Consider storing large objects in files or at least use binary large objects for this. See:
Storing long binary (raw data) strings
I would think twice before even storing megabytes of data into a single table field. You can do it. Doesn't mean you should.

Postgres determine size of all blobs

Hi, I'm trying to find the total size of all blobs. I have always used this:
SELECT sum(pg_column_size(pg_largeobject)) lob_size FROM pg_largeobject
but now that my database has grown to ~40 GB, this takes several hours and puts too much load on the CPU.
Is there a more efficient way?
Some of the functions mentioned in Database Object Management Functions give an immediate result for the entire table.
I'd suggest pg_table_size(regclass) which is defined as:
Disk space used by the specified table, excluding indexes (but
including TOAST, free space map, and visibility map)
It differs from sum(pg_column_size(tablename)) FROM tablename because it counts entire pages, so that includes the padding between the rows, the dead rows (updated or deleted and not reused), and the row headers.