pg_largeobject huge, but no tables have OID column type - postgresql

postrgresql noob, PG 9.4.x, no access to application code, developers, anyone knowledgeable about it
User database CT has 427GB pg_largeobject (PGLOB) table, next largest table is 500ish MB.
Per this post (Does Postgresql use PGLOB internally?) a very reputable member said postgresql does not use PGLOB internally.
I have reviewed the schema of all user tables in the database, and none of them are of type OID (or lo) - which is the value used for PGLOB rows to tie the collection of blob chunks back to a referencing table row. I think this means I cannot use vacuumlo (vacuumlo) to delete orphaned PGLOB rows because that utility searches user objects for those two data types in user tables.
I HAVE identified a table with an integer field type that has int values that match LOID values in PGLOB. This seems to indicate that the developers somehow got their blobs into PGLOB using the integer value stored in a user table row.
QUESTION: Is that last statement possible?
A) If it is not, what could be adding all this data to PGLOB table?
B) If it is possible, is there a way I can programatically search ALL tables for integer values that might represent rows in PGLOB?
NEED: I DESPERATELY need to reduce the size of the PGLOB table, as we are running out of disk space. And no, we cannot add space to existing disk per admin. So I somehow need to determine if there are LOID values in PGLOB that do NOT exist in ANY user tables as integer-type fields and then run lo_unlink to remove the rows. This could get me more usable 8K pages in the table.
BTW, I have also run pg_freespace on PGLOB, and it identified that most of the pages in PGLOB did not contain enough space in which to insert another blob chunk.
THANKS FOR THE ASSISTANCE!

Not really an answer but thinking out loud:
As you found all large objects are stored in a single table. The oid field you refer to is something you add to a table so you can have a pointer to a particular LO oid in pg_largeobject. That being said there is nothing compelling you to store that info in a table, you can just create LO's in pg_largeobject. From the looks of it, and just a guess, the developers stored the oid's as integer's with the intent of doing integer::oid to get a particular LO back as needed. I would look at other information is stored in that table to see if helps determine what the LO's are for?
Also you might join the integer::oid values to the oid(loid) in pg_catalog to see if that table accounts for all of them?

I was able to do a detailed analysis of all user tables in the database, find all columns that contained numeric data with no decimals, and then do a a query from pg_largeobject with a NOT EXISTS clause for every table matching pglob.loid against the appropriate field(s) in the user tables.
I found 25794 LOIDs that could be DELETEd from the PGLOB table, totaling 3.4M rows.
select distinct loid
into OrphanedBLOBs
from pg_largeobject l
where NOT exists (select * from tbl1 cn where cn.noteid = l.loid)
and not exists (select * from tbl1 d where d.document = l.loid)
and not exists (select * from tbl1 d where d.reportid = l.loid)
I used that table to execute lo_unlink(loid) for each of the LOIDs.

Related

How to get the describe tables from the Redshift and ALTER it

I have create a redshift cluster and created a db inside.
My schema is new_schema
I have created 2 tables inside two tables inside table1, table2
My Question.
I want to list the datatypes of table1
I need to change the datatype of description which is inside the table1 which is of VARCHAR to TEXT
I have tried to list the datatypes of table1 with below query but nothing listing
SELECT * FROM PG_TABLE_DEF WHERE schemaname = 'new_schema';
A few possibilities as to why you are not seeing the expected results. Most likely is that new_schema isn't in your search_path. Pg_table_info only return info for tables in your search_path - see: https://docs.aws.amazon.com/redshift/latest/dg/r_PG_TABLE_DEF.html
Another possibility is that the tables have no data rows (no blocks assigned) and this can lead to incomplete info from some system tables.
Another possibility is that the tables were not committed by the creating session and being checked by a different session. Since you say that you are creating a new db this comes to mind.
Are the tables visible in svv_table_info?
Also the premise of changing varchar to text is a bit off. From https://docs.aws.amazon.com/redshift/latest/dg/r_Character_types.html#r_Character_types-text-and-bpchar-types
You can create an Amazon Redshift table with a TEXT column, but it is
converted to a VARCHAR(256) column that accepts variable-length values
with a maximum of 256 characters.
So it seems like the objective you are trying to achieve is a bit off.

Best performance method for getting records by large collection of IDs

I am writing a query with code to select all records from a table where a column value is contained in a CSV. I found a suggestion that the best way to do this was using ARRAY functionality in PostgresQL.
I have a table price_mapping and it has a primary key of id and a column customer_id of type bigint.
I want to return all records that have a customer ID in the array I will generate from csv.
I tried this:
select * from price_mapping
where ARRAY[customer_id] <# ARRAY[5,7,10]::bigint[]
(the 5,7,10 part would actually be a csv inserted by my app)
But I am not sure that is right. In application the array could contain 10's of thousands of IDs so want to make sure I am doing right with best performance method.
Is this the right way in PostgreSQL to retrieve large collection of records by pre-defined column value?
Thanks
Generally this is done with the SQL standard in operator.
select *
from price_mapping
where customer_id in (5,7,10)
I don't see any reason using ARRAY would be faster. It might be slower given it has to build arrays, though it might have been optimized.
In the past this was more optimal:
select *
from price_mapping
where customer_id = ANY(VALUES (5), (7), (10)
But new-ish versions of Postgres should optimize this for you.
Passing in tens of thousands of IDs might run up against a query size limit either in Postgres or your database driver, so you may wish to batch this a few thousand at a time.
As for the best performance, the answer is to not search for tens of thousands of IDs. Find something which relates them together, index that column, and search by that.
If your data is big enough, try this:
Read your CSV using a FDW (foreign data wrapper)
If you need this connection often, you might build a materialized view from it, holding only needed columns. Refresh this when new CSV is created.
Join your table against this foreign table or materialized viev.

PostgreSQL Latest Record w/o id nor date

I have a foreign table without id nor date.
If for example other users insert a number of records, is it possible in PostgreSQL to select the last record inserted?
*Note: My only access to that table is select only
SQL tables represent unordered sets and the result sets too. You cannot guarantee your data without specify ORDER BY.
And :
I have a foreign table without id nor date
There is no other way to workaround without this to specify what you need.
My only access to that table is select only
If you only get just Select privilege you should tell your DBA you cannot give the data with 100% guarantee if that is the last data inserted from that user.
Based on my knowledge PostgreSQL does not guarantee to preserve insertion order. Without a timestamp field or sequential primary key I do not think guaranteed fetching of the last row is possible.
You can try this
SELECT * FROM YOUR_TABLE WHERE CTID = (SELECT MAX(CTID) FROM YOUR_TABLE)
provided that the target table does not do update operations.

Number of tables in a tablespace in db2

I tried getting the number of tables of a particular tablespace and database from SYSIBM.SYSTABLES using the select query. This number is more than the number of tables for the same tablespace and database stored in the SYSIBM.SYSTABLESPACE table under the NTABLES column. Why is this so?
It could be the fact that systables stores entries for each table, view or alias, in fact a large number of objects that may not necessarily be included within a tablespace.
You could confirm this by only listing those where type = 'T' (or some other combination of the allowed values).
If you select count(*) from systables (for a given tablespace) and group it by type, you may find it reasonably easy to assign some of those types to the tablespace.

Query rows by time of creation?

I have a table that contains no date or time related fields. Still I want to query that table based on when records/rows were created. Is there a way to do this in PostgreSQL?
I prefer an answer about doing it in PostgreSQL directly. But if that's not possible, can hibernate do it for PostgreSQL?
Basically: no. There is no automatic timestamp for rows in PostgreSQL.
I usually add a column like this to my tables (ignoring time zones):
ALTER TABLE tbl ADD COLUMN log_in timestamp DEFAULT localtimestamp NOT NULL;
As long as you don't manipulate the values in that column, you got your creation timestamp. You can add a trigger and / or restrict write privileges to avoid tempering with the values.
Second class options
If you have a serial column, you could at least tell with some probability in what order rows were entered. That's not 100% reliable, because the values can be changed by hand, and applications can get values from the sequence and INSERT out of order.
If you created your table WITH (OIDS=TRUE), then the OID column could be some indication - unless your database is heavily written and / or very old, then you may have gone through OID wrap-around and later rows can have a smaller OID. That's one of the reasons, why this feature is hardly used any more.
The default depends on the setting of default_with_oids I quote the manual:
The parameter is off by default; in PostgreSQL 8.0 and earlier, it was
on by default.
If you have not updated your rows or went through a dump / restore cycle, or ran VACUUM FULL or CLUSTER or .. , a plain SELECT * FROM tbl returns all rows in the order they were entered. But this is very unreliable and implementation-dependent. PostgreSQL (like any RDBMS) does not guarantee any order without an ORDER BY clause.