PostgreSQL table size larger than raw data text files

I have the following table in my PostgreSQL database:
CREATE TABLE values
(
dt timestamp,
series_id integer,
value real
);
CREATE INDEX idx_values_date ON public."values" USING btree (dt);
ALTER TABLE ONLY public."values" ADD CONSTRAINT values_series_id_fkey FOREIGN KEY (series_id) REFERENCES public.series(id) ON DELETE CASCADE;
I'm parsing some CSV files and extracting floats, which I insert into this table together with a timestamp and a series_id, which is a foreign key to another table.
The directory containing my raw data files amounts to about 28MB on my drive.
After feeding the data into my table I do a
SELECT pg_size_pretty( pg_total_relation_size('values') );
And find that the table now holds ~871,503 rows and is 98MB in size. Is this normal? I was expecting the table to be much smaller than the text files containing the raw data.
I'd like to mention that the PostgreSQL instance also has PostGIS installed, but I'm not using it in this particular schema. Furthermore, I'm running PostgreSQL from a Docker container.
Later edit ...
After doing some more research and running the following query:
SELECT *, pg_size_pretty(total_bytes) AS total
, pg_size_pretty(index_bytes) AS INDEX
, pg_size_pretty(toast_bytes) AS toast
, pg_size_pretty(table_bytes) AS TABLE
FROM (
SELECT *, total_bytes-index_bytes-COALESCE(toast_bytes,0) AS table_bytes FROM (
SELECT c.oid,nspname AS table_schema, relname AS TABLE_NAME
, c.reltuples AS row_estimate
, pg_total_relation_size(c.oid) AS total_bytes
, pg_indexes_size(c.oid) AS index_bytes
, pg_total_relation_size(reltoastrelid) AS toast_bytes
FROM pg_class c
LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE relkind = 'r'
) a
) a WHERE a.table_name = 'values';
I came up with the following results:
Index: 61MB
Table: 38MB
Can I somehow optimize the index? Maybe it's using some defaults that make it take up so much space?

When I populate a table with your structure with that number of rows, I get:
table: 37 MB
index: 24 MB
So either your index is bloated (you can drop and recreate it, or use REINDEX), or you have more indexes than you are admitting to.
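For example (a sketch using the names from your question; the catalog query lists every index on the table, in case there are more than you think):
SELECT indexrelid::regclass AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_index
WHERE indrelid = 'values'::regclass;

REINDEX INDEX idx_values_date;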
But perhaps the better answer is "Yes, relational databases have a lot of overhead; get used to it." If you try to investigate every difference between a database and a flat file, you will drive yourself crazy and accomplish very little.
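As a rough sanity check on the heap size (assuming a standard 64-bit build): your three columns hold 16 bytes of actual data per row (8 for the timestamp, 4 for the integer, 4 for the real), but each heap row also carries a tuple header of roughly 24 bytes plus a 4-byte line pointer, so about 44 bytes per row. 871,503 rows × 44 bytes ≈ 37 MB, which is almost exactly the table size we both measured; the rest of your 98 MB is index.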

Related

Postgres clustering using multi-column indexes

I have a table which includes a multi-column index defined as
CREATE INDEX tab_a_idx1 ON tab_a USING btree (device, fixtime)
The index was chosen deliberately because the majority of the queries run against this table include selection criteria like this
WHERE device = 'xyz' AND fixtime > 'sometime' AND fixtime <= 'someothertime' ORDER BY fixtime;
The table has been clustered on this index in an effort to improve performance.
CLUSTER tab_a USING tab_a_idx1;
Based on the comments and answers in a previous question I've used this query to list my clustered tables, the indexes they're clustered on, and the definitions of those indexes.
SELECT c.oid, c.relname as tablename, x.relname as indexname, z.indexdef
FROM pg_class c
JOIN pg_index i ON i.indrelid = c.oid
JOIN pg_class x ON i.indexrelid = x.oid
JOIN pg_indexes z ON x.relname = z.indexname
WHERE c.relkind = 'r' AND c.relhasindex AND i.indisclustered
And I've been using the pg_stats table to check the correlation of the indexed columns.
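Something along these lines (a sketch; the partition name comes from the update below):
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'tab_a_201704'
  AND attname IN ('device', 'fixtime');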
The quoted answer states that a correlation close to '1' is good, and that the lower the value gets, the more clustering is indicated.
Immediately after the table was clustered the correlation of the 1st field in the index (device) was low (0.008) and the 2nd one (fixtime) relatively high (0.994).
If these values are supposed to be close to '1' but aren't, does that mean that a table can't (or shouldn't) be clustered on a multi-column index?
There are several versions of tab_a (it's partitioned on fixtime), and I've noticed that the correlation values don't actually seem to vary much between the clustered and un-clustered versions of the table. Does this mean there's no point in clustering on this index?
Thanks
UPDATE - the parent table was created as follows....
CREATE TABLE tab_a
( device CHAR(6),
fixTime TIMESTAMP,
....lots more fields.....
)
PARTITION BY RANGE (fixTime);
The individual partitions were created like this
CREATE TABLE tab_a_201704 PARTITION OF tab_a FOR VALUES FROM ('2017-04-01') TO ('2017-05-01')
And the index used for the clustering like this....
CREATE INDEX tab_a_201704_idx2 ON tab_a_201704 (device, fixTime);
And the command to do the cluster....
CLUSTER tab_a_201704 USING tab_a_201704_idx2 ;

Human readable PostgreSQL 10 partition description

If I create a PostgreSQL 10 partition for a table like this:
CREATE TABLE measurement_y2006m01 PARTITION OF measurement
FOR VALUES FROM ('2006-01-01') TO ('2006-02-01');
How can I recreate the DDL from the pg_catalog tables and views? The pg_class table has a relpartbound column, but its content is in an internal unreadable format.
You can use pg_get_expr() to get a readable version of the partition definition:
select pg_get_expr(c.relpartbound, c.oid, true) as partition_expression
from pg_class c
where relname = 'measurement_y2006m01';
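If you want the boundary of every partition of the parent in one go, you can join against pg_inherits (a sketch, assuming the measurement parent from the question):
select c.relname as partition_name,
       pg_get_expr(c.relpartbound, c.oid, true) as partition_expression
from pg_class c
join pg_inherits i on i.inhrelid = c.oid
where i.inhparent = 'measurement'::regclass;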

Summarize repeated data in a Postgres table

I have a Postgres 9.1 table called ngram_sightings. Each row is a record of seeing an ngram in a document. An ngram can appear multiple times in a given document.
CREATE TABLE ngram_sightings
(
ngram VARCHAR,
doc_id INTEGER
);
I want to summarize this table into another table called ngram_counts.
CREATE TABLE ngram_counts
(
ngram VARCHAR PRIMARY KEY,
-- the number of unique doc_ids for a given ngram
doc_count INTEGER,
-- the count of a given ngram in ngram_sightings
corpus_count INTEGER
);
What is the best way to do this?
ngram_sightings is ~1 billion rows.
Should I create an index on ngram_sightings.ngram first?
Give this a shot!
INSERT INTO ngram_counts (ngram, doc_count, corpus_count)
SELECT
ngram
, count(distinct doc_id) AS doc_count
, count(*) AS corpus_count
FROM ngram_sightings
GROUP BY ngram;
-- EDIT --
Here is a longer version using some temporary tables. First, count how many documents each ngram is associated with. I'm using 'tf' for "term frequency" and 'df' for "doc frequency", since you are heading in the direction of tf-idf vectorization and you may as well use the standard terminology; it will help with the next few steps.
CREATE TEMPORARY TABLE ngram_df AS
SELECT
ngram
, count(distinct doc_id) AS df
FROM ngram_sightings
GROUP BY ngram;
Now you can create a table for the total count of each ngram.
CREATE TEMPORARY TABLE ngram_tf AS
SELECT
ngram
, count(*) AS tf
FROM ngram_sightings
GROUP BY ngram;
Then join the two on ngram.
CREATE TABLE ngram_tfidf AS
SELECT
tf.ngram
, tf.tf
, df.df
FROM ngram_tf tf
INNER JOIN ngram_df df ON tf.ngram = df.ngram;
At this point, I expect you will be looking up ngrams quite a bit, so it makes sense to index the last table on ngram. Keep me posted!
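Something like this (the index name is just illustrative):
CREATE INDEX ngram_tfidf_ngram_idx ON ngram_tfidf (ngram);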

Delete row by row number in postgresql

I am new to PostgreSQL and I used the following query to retrieve all the fields from the database.
SELECT student.*,row_number() OVER () as rnum FROM student;
I don't know how to delete a particular row by row number. Please give me some idea.
This is my table:
Column | Type
------------+------------------
name | text
rollno | integer
cgpa | double precision
department | text
branch | text
with a as
(
SELECT ctid, student.*, row_number() OVER () as rnum FROM student
)
delete from student where ctid in (select ctid from a where rnum = 1); -- the row_number you want to delete
Note that ctid has to be selected explicitly in the CTE; system columns are not included in student.*.
Quoted from PostgreSQL - System Columns
ctid :
The physical location of the row version within its table. Note
that although the ctid can be used to locate the row version very
quickly, a row's ctid will change each time it is updated or moved by
VACUUM FULL. Therefore ctid is useless as a long-term row identifier.
The OID, or even better a user-defined serial number, should be used
to identify logical rows.
Note: I strongly recommend that you use a unique field in the student table.
As per Craig's comment, I'll give another way to solve the OP's issue; it's a bit tricky.
First, create a unique column for the table student; for this, use the query below:
alter table student add column stu_uniq serial
This will produce stu_uniq with a corresponding unique value for each row, so that the OP can easily DELETE any row(s) using this stu_uniq.
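For example (the value 42 is purely illustrative):
DELETE FROM student WHERE stu_uniq = 42;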
I don't know whether this is a correct alternative for the problem, but it satisfied my need. My problem was that I had to delete a row without relying on any of its columns. I created the table WITH OIDS, and with the help of the oid I deleted the rows. (Note that WITH OIDS was removed in PostgreSQL 12, so this only works on older versions.)
CREATE TABLE Student(Name Text, RollNo Integer, Cgpa Float, Department Text, Branch Text) WITH OIDS;
DELETE FROM STUDENT WHERE oid=18789;
DELETE FROM STUDENT WHERE oid=18790;
Thanks to #WingedPanther for suggesting this idea.
You could try something like this.
create table t(id int, name varchar(10));
insert into t values(1,'a'),(2,'b'),(3,'c'),(4,'d');
with cte as
(
select ctid, ROW_NUMBER() over(order by id) as rn from t
)
delete from t where ctid in (select ctid from cte where rn = 1);
Note that PostgreSQL does not allow DELETE FROM a CTE directly (that is SQL Server syntax), so the DELETE targets the table and matches the rows by ctid.

Finding duplicates between two tables

I've got two SQL Server 2008 tables: one is an "Import" table containing new data and the other a "Destination" table with the live data. Both tables are similar but not identical (there are more columns in the Destination table, updated by a CRM system), but both tables have three "phone number" fields - Tel1, Tel2 and Tel3. I need to remove all records from the Import table where any of the phone numbers already exist in the destination table.
I've tried knocking together a simple query (just a SELECT to test with just now):
select t2.account_id
from ImportData t2, Destination t1
where
(t2.Tel1!='' AND (t2.Tel1 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel2!='' AND (t2.Tel2 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel3!='' AND (t2.Tel3 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
... but I'm aware this is almost certainly Not The Way To Do Things, especially as it's very slow. Can anyone point me in the right direction?
This query requires a little more information than what is given. If you want to write it in an efficient way, we need to know whether each load contains more duplicates or more new records. I assume that account_id is the primary key and has a clustered index.
I would use the temporary table approach: create a normalized table #tmp with an index on the phone number and account_id, like
SELECT account_id, Phone INTO #tmp
FROM
(SELECT account_id, Tel1, Tel2, Tel3
FROM Destination) p
UNPIVOT
(Phone FOR TelCol IN
(Tel1, Tel2, Tel3)
) AS unpvt;
Create a nonclustered index on this table, with the phone number as the first column and the account number as the second. You can't escape one full table scan, so I assume you can scan the import table (probably the smaller one). Then just join with this table and use the NOT EXISTS qualifier as explained. Then of course drop the temporary table after the processing.
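A sketch of those last steps; the index name is illustrative, blank numbers are skipped to mirror the checks in your query, and note that for deleting the duplicates directly the condition flips from NOT EXISTS (rows to keep) to EXISTS (rows to remove):
CREATE NONCLUSTERED INDEX IX_tmp_phone ON #tmp (Phone, account_id);

DELETE i
FROM ImportData AS i
WHERE EXISTS (
    SELECT 1 FROM #tmp AS t
    WHERE t.Phone <> ''
      AND t.Phone IN (i.Tel1, i.Tel2, i.Tel3)
);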
I am not sure about the performance of this query, but since I made the effort of writing it I will post it anyway...
;with aaa(tel)
as
(
select Tel1
from Destination
union
select Tel2
from Destination
union
select Tel3
from Destination
)
,bbb(tel, id)
as
(
select Tel1, account_id
from ImportData
union
select Tel2, account_id
from ImportData
union
select Tel3, account_id
from ImportData
)
select distinct b.id
from bbb b
where b.tel in
(
select a.tel
from aaa a
intersect
select b2.tel
from bbb b2
)
EXISTS will short-circuit the query and avoid a full traversal of the table the way a join would. You could also refactor the WHERE clause if this still doesn't perform the way you want.
SELECT *
FROM ImportData t2
WHERE NOT EXISTS (
select 1
from Destination t1
where (t2.Tel1!='' AND (t2.Tel1 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel2!='' AND (t2.Tel2 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel3!='' AND (t2.Tel3 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
)