Postgresql difference between \dt+ and count rows - postgresql

I'm currently running a DMS replication task (full load plus ongoing replication), and I would like to check the difference in row counts between clusters A and B, and how far B lags behind A.
The database engines are PostgreSQL 11 (A) and PostgreSQL 14 (B).
I ran \dt+ on A:
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+----------+-------+--------+-------------+-------+-------------
public | x | table | example | permanent | 73 GB |
(1 row)
and on B:
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
--------+----------+-------+--------+-------------+---------------+-------+-------------
public | x | table | example | permanent | heap | 14 GB |
I was surprised by the huge difference in size, so I ran a count on the reader instance of each cluster:
reader A:
=> Select count(*) FROM X;
count
----------
47830564
(1 row)
reader B:
=> Select count(*) FROM X;
count
----------
47830564
(1 row)
The counts match, but I don't understand why the difference in size is so big.
DMS is an AWS service that reinserts the rows in batches from A to B.
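One possible way to investigate the gap, offered as a sketch rather than a diagnosis: break the size down into heap, TOAST, and indexes, and look at the dead-tuple counters on both clusters. A table that has seen many updates and deletes on A can carry bloat that a freshly loaded copy on B does not have yet. The table name public.x is taken from the question; the size functions and the pg_stat_user_tables view are standard PostgreSQL.

```sql
-- Run on both A and B and compare the output.
SELECT pg_size_pretty(pg_relation_size('public.x'))       AS heap_only,
       pg_size_pretty(pg_table_size('public.x'))          AS table_incl_toast,
       pg_size_pretty(pg_indexes_size('public.x'))        AS indexes,
       pg_size_pretty(pg_total_relation_size('public.x')) AS total;

-- Estimated live vs. dead tuples and the last (auto)vacuum times.
SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE schemaname = 'public'
  AND relname = 'x';
```

If heap_only is far larger on A while the row counts match, bloat on A is a plausible explanation, but other factors such as differing fillfactor settings or TOAST compression could also contribute.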

Related

How to list all indexes of a table with their corresponding size in PostgreSQL?

I can view the total size of all indexes in a table with
SELECT pg_size_pretty (pg_indexes_size('table_name'));
and the size of a specific index with:
select pg_size_pretty(pg_relation_size('index_name'));
but I would like to retrieve a list with size information for each index of the table separately (a list of index sizes with the corresponding index name they belong to).
Use pg_indexes.
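-- Schema-qualify and quote the name so the cast to regclass also works
-- for indexes outside the search_path or with mixed-case names.
select indexname,
       pg_size_pretty(pg_relation_size(format('%I.%I', schemaname, indexname)::regclass)) as size
from pg_indexes
where tablename = 'my_table';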
You can use \di+ psql command:
postgres=> \di+ schema.*
List of relations
Schema | Name | Type | Owner | Table | Persistence | Size | Description
--------+--------+-------+-------+--------+-------------+--------+-------------
schema | index1 | index | owner | table1 | permanent | 139 MB |
schema | index2 | index | owner | table1 | permanent | 77 MB |
schema | index3 | index | owner | table1 | permanent | 73 MB |
schema | index4 | index | owner | table1 | permanent | 38 MB |
(4 rows)

What exactly is a wide column store?

Googling for a definition either returns results for a column oriented DB or gives very vague definitions.
My understanding is that wide column stores consist of column families which consist of rows and columns. Each row within said family is stored together on disk. This sounds like how row oriented databases store their data. Which brings me to my first question:
How are wide column stores different from a regular relational DB table? This is the way I see it:
* column family -> table
* column family column -> table column
* column family row -> table row
This image from Database Internals simply looks like two regular tables:
My guess as to what is different comes from the fact that "multi-dimensional map" is mentioned alongside wide column stores. So here is my second question:
Are wide column stores sorted from left to right? Meaning, in the above example, are the rows sorted first by Row Key, then by Timestamp, and finally by Qualifier?
Let's start with the definition of a wide column database.
Its architecture uses (a) persistent, sparse matrix, multi-dimensional
mapping (row-value, column-value, and timestamp) in a tabular format
meant for massive scalability (over and above the petabyte scale).
A relational database is designed to maintain the relationship between the entity and the columns that describe the entity. A good example is a Customer table. The columns hold values describing the Customer's name, address, and contact information. All of this information is the same for each and every customer.
A wide column database is one type of NoSQL database.
Maybe this is a better image of four wide column databases.
My understanding is that the first image at the top, the Column model, is what we called an entity/attribute/value table. It's an attribute/value table within a particular entity (column).
For Customer information, the first wide column database example might look like this.
Customer ID Attribute Value
----------- --------- ---------------
100001 name John Smith
100001 address 1 10 Victory Lane
100001 address 3 Pittsburgh, PA 15120
Yes, we could have modeled this for a relational database. The power of the attribute/value table comes with the more unusual attributes.
Customer ID Attribute Value
----------- --------- ---------------
100001 fav color blue
100001 fav shirt golf shirt
Any attribute that a marketer can dream up can be captured and stored in an attribute/value table. Different customers can have different attributes.
The Super Column model keeps the same information in a different format.
Customer ID: 100001
Attribute Value
--------- --------------
fav color blue
fav shirt golf shirt
You can have as many Super Column models as you have entities. They can be in separate NoSQL tables or put together as a Super Column family.
The Column Family and Super Column Family models simply give a row id to the first two models in the picture for quicker retrieval of information.
Most (if not all) wide-column stores are indeed row-oriented stores, in that all parts of a record are stored together. You can think of one as a 2-dimensional key-value store. The first part of the key is used to distribute the data across servers; the second part of the key lets you quickly find the data on the target server.
Features and behaviors differ between wide-column stores. Apache Cassandra, for example, lets you define how the data will be sorted. Take this table for example:
| id | country | timestamp | message |
|----+---------+------------+---------|
| 1 | US | 2020-10-01 | "a..." |
| 1 | JP | 2020-11-01 | "b..." |
| 1 | US | 2020-09-01 | "c..." |
| 2 | CA | 2020-10-01 | "d..." |
| 2 | CA | 2019-10-01 | "e..." |
| 2 | CA | 2020-11-01 | "f..." |
| 3 | GB | 2020-09-01 | "g..." |
| 3 | GB | 2020-09-02 | "h..." |
|----+---------+------------+---------|
If your partitioning key is (id) and your clustering key is (country, timestamp), the data will be stored like this:
[Key 1]
1:JP,2020-11-01,"b..." | 1:US,2020-09-01,"c..." | 1:US,2020-10-01,"a..."
[Key 2]
2:CA,2019-10-01,"e..." | 2:CA,2020-10-01,"d..." | 2:CA,2020-11-01,"f..."
[Key 3]
3:GB,2020-09-01,"g..." | 3:GB,2020-09-02,"h..."
Or in table form:
| id | country | timestamp | message |
|----+---------+------------+---------|
| 1 | JP | 2020-11-01 | "b..." |
| 1 | US | 2020-09-01 | "c..." |
| 1 | US | 2020-10-01 | "a..." |
| 2 | CA | 2019-10-01 | "e..." |
| 2 | CA | 2020-10-01 | "d..." |
| 2 | CA | 2020-11-01 | "f..." |
| 3 | GB | 2020-09-01 | "g..." |
| 3 | GB | 2020-09-02 | "h..." |
|----+---------+------------+---------|
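As an analogy only, the ordering above can be reproduced in plain PostgreSQL by sorting on the partitioning key and then on the clustering columns. This is not CQL and says nothing about how Cassandra stores data internally; the CTE name messages and the column name ts (standing in for timestamp) are assumptions made for the sketch. Note also that Cassandra places partitions by a hash of the partitioning key, so only the order within each id is meaningful.

```sql
-- Analogy in PostgreSQL syntax, not CQL.
WITH messages(id, country, ts, message) AS (
    VALUES (1, 'US', DATE '2020-10-01', 'a...'),
           (1, 'JP', DATE '2020-11-01', 'b...'),
           (1, 'US', DATE '2020-09-01', 'c...'),
           (2, 'CA', DATE '2020-10-01', 'd...'),
           (2, 'CA', DATE '2019-10-01', 'e...'),
           (2, 'CA', DATE '2020-11-01', 'f...'),
           (3, 'GB', DATE '2020-09-01', 'g...'),
           (3, 'GB', DATE '2020-09-02', 'h...')
)
SELECT id, country, ts, message
FROM messages
ORDER BY id,           -- partitioning key (grouping only)
         country, ts;  -- clustering key: sort order inside a partition
```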
If you change the primary key (composite of partitioning and clustering key) to (id, timestamp) WITH CLUSTERING ORDER BY (timestamp DESC) (id is the partitioning key, timestamp is the clustering key in descending order), the result would be:
[Key 1]
1:JP,2020-11-01,"b..." | 1:US,2020-10-01,"a..." | 1:US,2020-09-01,"c..."
[Key 2]
2:CA,2020-11-01,"f..." | 2:CA,2020-10-01,"d..." | 2:CA,2019-10-01,"e..."
[Key 3]
3:GB,2020-09-02,"h..." | 3:GB,2020-09-01,"g..."
Or in table form:
| id | country | timestamp | message |
|----+---------+------------+---------|
| 1 | JP | 2020-11-01 | "b..." |
| 1 | US | 2020-10-01 | "a..." |
| 1 | US | 2020-09-01 | "c..." |
| 2 | CA | 2020-11-01 | "f..." |
| 2 | CA | 2020-10-01 | "d..." |
| 2 | CA | 2019-10-01 | "e..." |
| 3 | GB | 2020-09-02 | "h..." |
| 3 | GB | 2020-09-01 | "g..." |
|----+---------+------------+---------|

PostgreSQL UPDATE JOIN ... How does it work?

I'm having trouble understanding UPDATE with some sort of JOIN in PostgreSQL.
I have the following table (names); sometimes a synonym is filled in the 3rd column:
id,name,synm,sd
41,xgvf
24,y4tg
32,zagr,xgvf
48,argv,bvre
53,bvre
I'd like to fill column 4 (sd) with the 'parent' id (the sd column is empty now):
id,name,synm,sd
41,xgvf
24,y4tg
32,zagr,xgvf,41
48,argv,bvre,53
53,bvre
I tried the following SQL statement (and many similar versions of it) ...
update names
set sd =
(select n2.id from names n1
inner join names n2
on
n1.synm = n2.name);
... I get the following error:
ERROR: more than one row returned by a subquery used as an expression
SQL state: 21000
I understand that my incorrect SQL tries to fill each sd row with all the ids found, but I can't work out how to fix it.
How do I fill the synonym id (sd) for the whole table? Perhaps with a WITH RECURSIVE-like statement?
You can simulate the join like this:
update names n1
set sd = n2.id
from names n2
where n2.name = n1.synm;
See the demo.
Results:
| id | name | synm | sd |
| --- | ---- | ---- | --- |
| 41 | xgvf | | |
| 24 | y4tg | | |
| 53 | bvre | | |
| 48 | argv | bvre | 53 |
| 32 | zagr | xgvf | 41 |
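For comparison, the original attempt failed because its subquery never referred to the row being updated, so it returned every matching id at once. A minimal sketch of a correlated version follows, assuming values in the name column are unique (otherwise the subquery can still return more than one row). The column types in CREATE TABLE are assumptions; the data is taken from the question.

```sql
-- Reproducible setup (column types are assumed).
CREATE TABLE names (
    id   integer PRIMARY KEY,
    name text NOT NULL,
    synm text,
    sd   integer
);

INSERT INTO names (id, name, synm) VALUES
    (41, 'xgvf', NULL),
    (24, 'y4tg', NULL),
    (32, 'zagr', 'xgvf'),
    (48, 'argv', 'bvre'),
    (53, 'bvre', NULL);

-- Correlated subquery: evaluated once per updated row, and it
-- references that row through names.synm.
UPDATE names
SET sd = (SELECT n2.id
          FROM names n2
          WHERE n2.name = names.synm);

SELECT id, name, synm, sd FROM names ORDER BY id;
```

Rows without a synonym get NULL from the subquery, which leaves sd empty for them, the same outcome as the UPDATE ... FROM form.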

Using postgres table description

In Postgres, listing relations with \dt+ adds the columns 'Size' and 'Description'. As the name suggests, can Description be used to store a description of the table? If yes, how?
# \dt
List of relations
Schema | Name | Type | Owner
--------+---------+-------+-------
drs | Records | table | rho
drs | Reports | table | rho
(2 rows)
# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+---------+-------+-------+------------+-------------
drs | Records | table | rho | 8192 bytes |
drs | Reports | table | rho | 0 bytes |
(2 rows)
This can be defined using the comment statement:
comment on table drs."Reports" is 'This table stores reports';
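As a short follow-up sketch, the stored comment can be read back with obj_description, and it can be removed by setting it to NULL. The table name drs."Reports" comes from the question; the output alias table_comment is an arbitrary choice.

```sql
-- Read the comment back (the same text that \dt+ shows under Description).
SELECT obj_description('drs."Reports"'::regclass, 'pg_class') AS table_comment;

-- Remove the comment again.
COMMENT ON TABLE drs."Reports" IS NULL;
```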

Postgresql: Combining similarity with tsvector

I have a database table containing more than 50 million records,
which I need to full-text search as fast as possible.
On a smaller table I just had an index on the text column and used the similarity function to get similar results. I was also able to sort by the result of similarity().
Now that my table is a lot bigger, I switched to tsvector. I created a column for the tsvector result and a trigger that updates the column before insert or update. With that I can search ultra fast (<100 ms).
The problem is that I would like to use a combination of both tsvector and similarity.
Example
My table contains the following data.
| MyColumn |
------------
| Apple |
| Orange |
| ... |
But if I search for "App" I don't get "Apple" back.
Any ideas on how to get a fast "like/similar" search with a "score/similarity" value?
From the documentation (https://www.postgresql.org/docs/current/static/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES):
Also, * can be attached to a lexeme to specify prefix matching.
Something like this?
postgres=# with c(v) as (values('Apple'),('App'),('application'),('apricote'))
select v,to_tsvector(v),to_tsvector(v) @@ to_tsquery('app:*') from c;
v | to_tsvector | ?column?
-------------+-------------+----------
Apple | 'appl':1 | t
App | 'app':1 | t
application | 'applic':1 | t
apricote | 'apricot':1 | f
(4 rows)
postgres=# with c(v) as (values('Apple'),('App'),('application'),('apricote'))
select v,to_tsvector(v),to_tsvector(v) @@ to_tsquery('ap:*') from c;
v | to_tsvector | ?column?
-------------+-------------+----------
Apple | 'appl':1 | t
App | 'app':1 | t
application | 'applic':1 | t
apricote | 'apricot':1 | t
(4 rows)
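To get a score as well, one option is to let the prefix tsquery do the fast filtering and then rank only the surviving rows with similarity() from the pg_trgm extension. This is a sketch under assumptions: the table my_table, the text column my_column, and the tsvector column tsv are hypothetical names standing in for the asker's schema, and the LIMIT is arbitrary.

```sql
-- similarity() comes from the pg_trgm extension.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT my_column,
       similarity(my_column, 'App') AS score
FROM my_table
WHERE tsv @@ to_tsquery('app:*')   -- prefix match, can use the tsvector index
ORDER BY score DESC
LIMIT 20;
```

Because similarity() is only computed for rows that already matched the prefix query, the sort should stay cheap as long as the prefix is selective; a very short prefix such as 'a:*' may match too many rows for this to be fast.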