Dump integer-only SQL table to binary file? - postgresql

I have a PostgreSQL table w/ a bunch of columns that are just different sizes of integer.
Table "public.place2022_final"
Column | Type | Collation | Nullable | Default
---------------+---------+-----------+----------+---------
toff | integer | | |
palette_index | bigint | | |
censorship | boolean | | |
row0 | integer | | |
col0 | integer | | |
row1 | integer | | |
col1 | integer | | |
uint_id | bigint | | |
seqno | bigint | | |
I can export it to a CSV, but for my purposes I really want the data to be small. Is there a way I can create a minimal dump to a binary file, w/ a format something like
<8 bytes for # of rows in table><4 bytes for row 1 toff><8 bytes for row 1 palette_index>...<do that for all fields, then repeat for all rows>.
I also know for a fact that all these bigints can be squashed down to 32-bit ints... so doing that "conversion" here would be nice too.
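One option that stays inside Postgres, assuming a psql client, is a binary COPY of a query that casts the bigint columns down to integer. This is only a sketch: PostgreSQL's binary COPY format is not the exact fixed-width layout described above, since it adds a file header plus a per-row field count and a per-field length word, so it is compact but not minimal.
\copy (SELECT toff, palette_index::int, censorship, row0, col0, row1, col1, uint_id::int, seqno::int FROM place2022_final) TO 'place2022_final.bin' WITH (FORMAT binary)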

Related

Postgres query taking long to execute

I am using libpq to connect to the Postgres server from C++ code. The Postgres server version is 12.10.
My table schema is defined below:
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
---------------------+----------+-----------+----------+------------+----------+--------------+-------------
event_id | bigint | | not null | | plain | |
event_sec | integer | | not null | | plain | |
event_usec | integer | | not null | | plain | |
event_op | smallint | | not null | | plain | |
rd | bigint | | not null | | plain | |
addr | bigint | | not null | | plain | |
masklen | bigint | | not null | | plain | |
path_id | bigint | | | | plain | |
attribs_tbl_last_id | bigint | | not null | | plain | |
attribs_tbl_next_id | bigint | | not null | | plain | |
bgp_id | bigint | | not null | | plain | |
last_lbl_stk | bytea | | not null | | extended | |
next_lbl_stk | bytea | | not null | | extended | |
last_state | smallint | | | | plain | |
next_state | smallint | | | | plain | |
pkey | integer | | not null | 1654449420 | plain | |
Partition key: LIST (pkey)
Indexes:
"event_pkey" PRIMARY KEY, btree (event_id, pkey)
"event_event_sec_event_usec_idx" btree (event_sec, event_usec)
Partitions: event_spl_1651768781 FOR VALUES IN (1651768781),
event_spl_1652029140 FOR VALUES IN (1652029140),
event_spl_1652633760 FOR VALUES IN (1652633760),
event_spl_1653372439 FOR VALUES IN (1653372439),
event_spl_1653786420 FOR VALUES IN (1653786420),
event_spl_1654449420 FOR VALUES IN (1654449420)
When I execute the following query, it takes 1-2 milliseconds to execute.
Time is provided as a parameter to the function executing this query; it contains epoch seconds and microseconds.
SELECT event_id FROM event WHERE (event_sec > time.seconds) OR ((event_sec=time.seconds) AND (event_usec>=time.useconds)) ORDER BY event_sec, event_usec LIMIT 1
This query is executed every 30 seconds on the same client connection (which is persistent for weeks). The process runs for weeks, but sometimes the same query starts taking more than 10 minutes.
If I restart the process, it re-creates the connection to the server and the execution time falls back to 1-2 milliseconds. The issue is intermittent: sometimes it triggers after a week of running, sometimes after 2-3 weeks.
We add a new partition to the table every Sunday and write new data into the new partition.
I don't know why the performance is inconsistent; there are many possibilities we can't distinguish between with the info provided. For example, does the plan change when the performance changes, or does the same plan just perform worse?
But your query is not written to take maximal advantage of the index. In my hands it can use the index for ordering, but then it still needs to read and individually skip over rows that fail the WHERE clause until it finds the first one which passes. And due to partitioning, I think it is even worse than that: it has to do this read-and-skip until it finds the first row which passes in each partition.
You could rewrite it to do a tuple comparison, which can use the index to determine both the order, and where to start:
SELECT event_id FROM event
WHERE (event_sec, event_usec) >= (:seconds, :useconds)
ORDER BY event_sec, event_usec LIMIT 1;
Now this might also degrade, or might not, or maybe will degrade but still be so fast that it doesn't matter.
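If it does degrade again, one way to tell whether the plan itself changed or the same plan just got slower is to capture the plan in both states, for instance with EXPLAIN. A sketch (the :seconds/:useconds placeholders stand in for the actual time values):
EXPLAIN (ANALYZE, BUFFERS)
SELECT event_id FROM event
WHERE (event_sec, event_usec) >= (:seconds, :useconds)
ORDER BY event_sec, event_usec LIMIT 1;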

Postgres Changing column from TEXT to INTEGER increases table size

I have a Postgres table with a schema like this:
Table "am.old_product"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-----------------+--------------------------+-----------+----------+---------+----------+--------------+-------------
p_config_sku | text | | | | extended | |
p_simple_sku | text | | | | extended | |
p_merchant_id | text | | | | extended | |
p_country | character varying(2) | | | | extended | |
p_discount_rate | numeric(10,2) | | | | main | |
p_black_price | numeric(10,2) | | | | main | |
p_red_price | numeric(10,2) | | | | main | |
p_received_at | timestamp with time zone | | | | plain | |
p_event_id | uuid | | | | plain | |
p_is_deleted | boolean | | | | plain | |
Indexes:
"product_p_simple_sku_p_country_p_merchant_id_idx" UNIQUE, btree (p_simple_sku, p_country, p_merchant_id)
"config_sku_country_idx" btree (p_config_sku, p_country)
We decided that it would be a better idea to remove the TEXT field merchant_id, move it to another table, and reference it in the product table using a foreign key. So the new schema looks like this:
Table "am.product"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-------------------+--------------------------+-----------+----------+---------+----------+--------------+-------------
p_config_sku | text | | not null | | extended | |
p_simple_sku | text | | not null | | extended | |
p_country | character varying(2) | | not null | | extended | |
p_discount_rate | numeric(10,2) | | | | main | |
p_black_price | numeric(10,2) | | | | main | |
p_red_price | numeric(10,2) | | | | main | |
p_received_at | timestamp with time zone | | not null | | plain | |
p_event_id | uuid | | not null | | plain | |
p_is_deleted | boolean | | | false | plain | |
p_merchant_id_new | integer | | not null | | plain | |
Indexes:
"new_product_p_simple_sku_p_country_p_merchant_id_new_idx" UNIQUE, btree (p_simple_sku, p_country, p_merchant_id_new)
"p_config_sku_country_idx" btree (p_config_sku, p_country)
Foreign-key constraints:
"fk_merchant_id" FOREIGN KEY (p_merchant_id_new) REFERENCES am.merchant(m_id)
Now this should make the product table size drop, right? We are using a 4-byte integer instead of TEXT. Well, not really: the two tables have exactly the same number of rows, but the new product table (with the integer field) is 34.3 GB, while the old table (with TEXT) is 19.7 GB.
Does anyone have an explanation for that?
At a wild guess, you have done this with various ALTER TABLE commands, forcing at least one rewrite of the entire table.
The unused space will gradually be re-used; for a more prompt change, try a CLUSTER or VACUUM FULL on the table.
Look at the VACUUM command.
A database file is an organized collection of tuples; a row can be made up of one or more tuples. When you added a new column, you added tuples to the table file. But when you dropped a column, the space occupied by the old tuples remains, because deleting it from the file would be a costly operation. They are dead tuples.
VACUUM FULL am.product;
This will unfortunately take an exclusive lock on the table, and you won't be able to query it while it runs.
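To see whether the rewrite actually reclaims the space, you could compare the table's total size (including indexes and TOAST) before and after; a sketch:
SELECT pg_size_pretty(pg_total_relation_size('am.product'));
VACUUM FULL am.product;
SELECT pg_size_pretty(pg_total_relation_size('am.product'));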

Redshift: tables info query not working via spark

I am trying to run this query from Spark code using Databricks:
select * from svv_table_info
but I am getting this error message:
Exception in thread "main" java.sql.SQLException: Amazon Invalid operation: Specified types or functions (one per INFO message) not supported on Redshift tables.;
Any idea why I am getting this?
That view returns table_id, which is of the Postgres system type OID.
psql=# \d+ svv_table_info
Column | Type | Modifiers | Storage | Description
---------------+---------------+-----------+----------+-------------
database | text | | extended |
schema | text | | extended |
table_id | oid | | plain |
table | text | | extended |
encoded | text | | extended |
diststyle | text | | extended |
sortkey1 | text | | extended |
max_varchar | integer | | plain |
sortkey1_enc | character(32) | | extended |
sortkey_num | integer | | plain |
size | bigint | | plain |
pct_used | numeric(10,4) | | main |
empty | bigint | | plain |
unsorted | numeric(5,2) | | main |
stats_off | numeric(5,2) | | main |
tbl_rows | numeric(38,0) | | main |
skew_sortkey1 | numeric(19,2) | | main |
skew_rows | numeric(19,2) | | main |
You can cast it to INTEGER and Spark should be able to handle it.
SELECT database,schema,table_id::INT
,"table",encoded,diststyle,sortkey1
,max_varchar,sortkey1_enc,sortkey_num
,size,pct_used,empty,unsorted,stats_off
,tbl_rows,skew_sortkey1,skew_rows
FROM svv_table_info;

PostgreSQL two groups segregated but not ordered only by zero price column

I need help with a bit of a crazy single-query goal, please, that I'm not sure whether GROUP BY or a sub-SELECT applies to.
The following query:
SELECT id_finish, description, inside_rate, outside_material, id_part, id_metal
FROM parts_finishing AS pf
LEFT JOIN parts_finishing_descriptions AS fd ON (pf.id_description=fd.id);
returns results like the following:
+-------------+-------------+------------------+--------------------------------+
| description | inside_rate | outside_material | id_part - id_finish - id_metal |
+-------------+-------------+------------------+--------------------------------+
| Nickle | 0 | 33.44 | 4444-44-44, 5555-55-55 |
+-------------+-------------+------------------+--------------------------------+
| Bend | 11.22 | 0 | 1111-11-11 |
+-------------+-------------+------------------+--------------------------------+
| Pack | 22.33 | 0 | 2222-22-22, 3333-33-33 |
+-------------+-------------+------------------+--------------------------------+
| Zinc | 0 | 44.55 | 6000-66-66 |
+-------------+-------------+------------------+--------------------------------+
I need the results returned in the fashion below, but there are catches:
I need to group by either the inside_rate column or the outside_material column, but ORDER BY the description column, not by price (inside_rate and outside_material are the prices). So we know a row belongs to one group if inside_rate is 0, or to the other group if outside_material is 0.
Within each group, I then need to ORDER BY the description column descending.
I need to return a list of parts (composed of three separate columns) for that inside/outside group / price for that finishing.
+-------------+-------------+------------------+--------------------------------+
| description | inside_rate | outside_material | id_part - id_finish - id_metal |
+-------------+-------------+------------------+--------------------------------+
| Bend | 11.22 | 0 | 1111-11-11 |
+-------------+-------------+------------------+--------------------------------+
| Pack | 22.33 | 0 | 2222-22-22, 3333-33-33 |
+-------------+-------------+------------------+--------------------------------+
| Nickle | 0 | 33.44 | 4444-44-44, 5555-55-55 |
+-------------+-------------+------------------+--------------------------------+
| Zinc | 0 | 44.55 | 6000-66-66 |
+-------------+-------------+------------------+--------------------------------+
The tables I'm working with and their data types:
Table "public.parts_finishing"
Column | Type | Modifiers
------------------+---------+-------------------------------------------------------------
id | bigint | not null default nextval('parts_finishing_id_seq'::regclass)
id_part | bigint |
id_finish | bigint |
id_metal | bigint |
id_description | bigint |
date | date |
inside_hours_k | numeric |
inside_rate | numeric |
outside_material | numeric |
sort | integer |
Indexes:
"parts_finishing_pkey" PRIMARY KEY, btree (id)
Table "public.parts_finishing_descriptions"
Column | Type | Modifiers
------------+---------+------------------------------------------------------------------
id not null | bigint | default nextval('parts_finishing_descriptions_id_seq'::regclass)
date | date |
description | text |
rate_hour | numeric |
type | text |
Indexes:
"parts_finishing_descriptions_pkey" PRIMARY KEY, btree (id)
The second table's first column is just id. (Why are we still dealing with a 1024 static width layout in 2015?)
I'd make an SQL Fiddle, though it refuses to load for me regardless of the browser.
Not entirely sure I understand your question. Might look like this:
SELECT fd.description, pf.inside_rate, pf.outside_material
, concat_ws(' - ', pf.id_part::text
, pf.id_finish::text
, pf.id_metal::text) AS id_part_finish_metal
FROM parts_finishing pf
LEFT JOIN parts_finishing_descriptions fd ON pf.id_description = fd.id
ORDER BY (pf.inside_rate = 0) -- 1. sorts group "inside_rate" first
, fd.description DESC NULLS LAST -- 2. possible NULL values last
;
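If each description row should also carry the comma-separated list of part triplets, as in your sample output, a grouped variant with string_agg() might look like this (a sketch, assuming the two prices are the same for every row of a given description):
SELECT fd.description, pf.inside_rate, pf.outside_material
     , string_agg(concat_ws(' - ', pf.id_part::text
                                  , pf.id_finish::text
                                  , pf.id_metal::text), ', ') AS id_part_finish_metal
FROM   parts_finishing pf
LEFT   JOIN parts_finishing_descriptions fd ON pf.id_description = fd.id
GROUP  BY fd.description, pf.inside_rate, pf.outside_material
ORDER  BY (pf.inside_rate = 0), fd.description DESC NULLS LAST;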

psycopg2 is mixing column names with values depending on the table selected

I query a whole Postgres table using
c.execute("select * from train_temp")
trans=np.array(c.fetchall())
and amid the expected data I got one row with the column names.
trans[-1,]
Out[63]:
array(['ACTION', 'RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2',
'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY',
'ROLE_CODE', None, None, None, None, None, None, None, None, None], dtype=object)
More puzzling is the fact that the number of rows returned matches the number of rows in the table:
trans.shape
Out[67]: (32770, 19)
select count(1) from train_temp ;
count
-------
32770
(1 row)
Here's the schema of the table
Table "public.train_temp"
Column | Type | Modifiers | Storage | Description
---------------------+------------------+-----------+----------+-------------
action | text | | extended |
resource | text | | extended |
mgr_id | text | | extended |
role_rollup_1 | text | | extended |
role_rollup_2 | text | | extended |
role_deptname | text | | extended |
role_title | text | | extended |
role_family_desc | text | | extended |
role_family | text | | extended |
role_code | text | | extended |
av_role_code | double precision | | plain |
av_role_family | double precision | | plain |
av_role_family_desc | double precision | | plain |
av_role_title | double precision | | plain |
av_role_deptname | double precision | | plain |
av_role_rollup_2 | double precision | | plain |
av_role_rollup_1 | double precision | | plain |
av_mgr_id | double precision | | plain |
av_resource | double precision | | plain |
Has OIDs: no
What's going on here? Note it does not happen with all tables; for the following table, for example, the process works fine:
Table "public.play"
Column | Type | Modifiers | Storage | Description
-----------+------------------+-----------+----------+-------------
row.names | text | | extended |
action | double precision | | plain |
color | text | | extended |
type | text | | extended |
Has OIDs: no
This last table is returned entirely as strings, while the problematic one respects the data types.
play[1,]
Out[73]:
array(['2', '0.0', 'blue', 'car'],
dtype='|S5')
trans[1,]
Out[74]:
array(['1', '0', '36', '117961', '118413', '119968', '118321', '117906',
'290919', '118322', 0.920412992041299, 0.942349726775956,
0.933439675174014, 0.920412992041299, 0.976, 0.964478764478764,
0.949222217031812, 0.909090909090909, 0.923076923076923], dtype=object)
Thanks for any insight.
Actually, I just wrote the headers myself when importing the CSV into Postgres. I should have used the header option in psql, such as:
\copy test from 'test.csv' with (delimiter ',' , format csv, header TRUE);
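Since the stray header row is already in the table, you can also just delete it in place; a sketch, assuming the header literals landed in the text columns exactly as shown above:
DELETE FROM train_temp WHERE action = 'ACTION';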