This question may sound very rudimentary.
To my surprise, after several hours of searching and referring to many books, I could not figure out how to create (or make sure) that the table below, created in PostgreSQL, is stored in columnar format and is effective for analytical queries. We have 500 million records in this table, and we never update it.
The question is: how can I tell whether it is an analytical/columnar table vs. a transactional (row-oriented) table?
CREATE TABLE dwh.access_log
(
ts timestamp with time zone,
remote_address text,
remote_user text,
url text,
domain text,
status_code int,
body_size int,
http_referer text,
http_user_agent text
);
Thanks
Related
I have this table in a PostgreSQL database with 6 million rows.
CREATE TABLE IF NOT EXISTS public.processed
(
id bigint NOT NULL DEFAULT nextval('processed_id_seq'::regclass),
created_at timestamp without time zone,
word character varying(200) COLLATE pg_catalog."default",
score double precision,
updated_at timestamp without time zone,
is_domain_available boolean,
CONSTRAINT processed_pkey PRIMARY KEY (id),
CONSTRAINT uk_tb03fca6mojpw7wogvaqvwprw UNIQUE (word)
)
I want to optimize it for performance, for example by adding indexes and partitioning.
Should I add an index only on the word column, or would it be better to index several columns?
What is the recommended way to partition this table?
Are there other recommended optimizations, such as adding compression?
First, there is no compression and there are no columnar indexes in PostgreSQL, unlike other RDBMSs that have those features (as an example, Microsoft SQL Server has four ways to compress data without needing to decompress it to read or seek, and can use columnstore indexes). For columnar indexes you have to go to the Fujitsu PostgreSQL version, which costs a lot...
https://www.postgresql.fastware.com/in-memory-columnar-index-brochure
So the only ways you have to accelerate seeks on the "word" column are:
store a hash of the word in an additional column, index that column, and use it for searches
use partitioning with a balanced split, in the manner of Cutter-Sanborn tables
And finally, combine the two options (a sketch follows below).
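A minimal sketch of those two ideas against the processed table from the question (PostgreSQL 12+ assumed for the generated column; the word_hash column, index name, partition names and bounds are illustrative, not from the original posts):
-- Option 1 (hash column): store a hash of word, index it, and search through it.
ALTER TABLE public.processed
    ADD COLUMN word_hash text GENERATED ALWAYS AS (md5(word)) STORED;

CREATE INDEX idx_processed_word_hash ON public.processed (word_hash);

-- Equality searches go through the narrow hash index; re-check word to rule out collisions.
SELECT *
FROM public.processed
WHERE word_hash = md5('example')
  AND word = 'example';

-- Option 2 (partitioning): a range-partitioned version of the table with a
-- balanced alphabetical split on word (pick the bounds from the real data distribution).
CREATE TABLE public.processed_part (
    id                  bigint NOT NULL,
    created_at          timestamp without time zone,
    word                character varying(200) NOT NULL,
    score               double precision,
    updated_at          timestamp without time zone,
    is_domain_available boolean,
    PRIMARY KEY (id, word)   -- on a partitioned table the key must include the partition column
) PARTITION BY RANGE (word);

CREATE TABLE public.processed_part_a_m PARTITION OF public.processed_part
    FOR VALUES FROM (MINVALUE) TO ('n');
CREATE TABLE public.processed_part_n_z PARTITION OF public.processed_part
    FOR VALUES FROM ('n') TO (MAXVALUE);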
I have been doing quite a few tutorials, such as SQLBolt, where I have been trying to learn more and more about SQL.
I have asked some of my friends, and they recommended that I look at "JOIN" for my situation, even though I don't think it fits my purpose.
My idea is to store product information, namely title, image and URL, and I have come to the conclusion to use:
CREATE TABLE products (
id INTEGER PRIMARY KEY,
title TEXT,
image TEXT,
url TEXT UNIQUE,
added_date DATE
);
The reason url is UNIQUE is that we cannot have the same URL twice in the database; that would be a duplicate, which we do not want. But I still don't really understand why and how I should use JOIN in my situation.
So my question is: what would be the best way for me to store the product information? Which way would bring the most benefit and the best performance? If you are planning to use JOIN, I would gladly hear more about why in that case. (There could be a situation where over 4000 rows are inserted over time.)
I hope you all who are reading this will have a wonderful day! :)
The solution, using a separate stores table:
CREATE TABLE stores (
id SERIAL PRIMARY KEY,
store_name TEXT
-- add more fields if needed
);
CREATE TABLE products (
id SERIAL,
store_id INTEGER NOT NULL,
title TEXT,
image TEXT,
url TEXT UNIQUE,
added_date timestamp without time zone NOT NULL DEFAULT NOW(),
PRIMARY KEY(id, store_id)
);
ALTER TABLE products
ADD CONSTRAINT "FK_products_stores" FOREIGN KEY ("store_id")
REFERENCES stores (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE RESTRICT;
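With that foreign key in place, a JOIN is what lets you read products together with their store; for example (a sketch based on the two tables above, the aliases are illustrative):
-- List each product together with the name of the store it belongs to.
SELECT p.id,
       p.title,
       p.url,
       s.store_name
FROM products p
JOIN stores s ON s.id = p.store_id
ORDER BY p.added_date DESC;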
Here is my table structure
id
name varchar(150),
timestamp_one bigint,
timestamp_two bigint,
value double,
additional_list jsonb
I also have an index on these three fields:
name (varchar(150)), timestamp_one (bigint), additional_list (jsonb)
The database is quite fast with my queries and inserts. The problem I have is that it is growing a lot: at my data rate it grows by up to 100 GB within a day. My main concern here is storage. What can I improve here? Can PostgreSQL compress my data? Would it be worth creating another table for the name (varchar(150)) field (the same value can be repeated across many rows) and storing a foreign key instead; would that save a lot of space? Any other ideas? Thanks.
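For illustration, the normalization described in the question might look like the following (the names table, measurements table and column names are assumptions, not from the original post):
-- Move the repeated name values into a lookup table and reference them by id.
CREATE TABLE names (
    id   SERIAL PRIMARY KEY,
    name varchar(150) NOT NULL UNIQUE
);

CREATE TABLE measurements (
    id              BIGSERIAL PRIMARY KEY,
    name_id         integer NOT NULL REFERENCES names (id),
    timestamp_one   bigint,
    timestamp_two   bigint,
    value           double precision,
    additional_list jsonb
);
-- Each row now carries a 4-byte integer instead of a repeated varchar(150) value;
-- how much that saves depends on how long and how repetitive the names are.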
I imported a CSV file into an SQL table, but this CSV file will change on a regular basis. Is there a way to refresh the table based on the changes in the CSV file without removing the table, creating it again, and using the 'import' function in pgAdmin?
If possible, would such a solution also exist for the entire schema, consisting of tables based on imported CSV files?
Thank you in advance!
Edit to add: this assumes you have decent access to the Postgres server, so it is not a pure pgAdmin solution.
You can do this with an FDW (foreign data wrapper), specifically file_fdw.
https://www.postgresql.org/docs/9.5/file-fdw.html (or the page for your version).
For example, I have an FDW set up to look at the Postgres log file from within SQL rather than having to open an SSH session to the server.
The file appears as a table in the schema; each time you access it, the data is read fresh from the file.
The code I used is as follows; obviously the file needs to be on the database server's local filesystem.
-- file_fdw must be installed and a server object created first:
create extension if not exists file_fdw;
create server pglog foreign data wrapper file_fdw;

create foreign table pglog
(
log_time timestamp(3) with time zone,
user_name text,
database_name text,
process_id integer,
connection_from text,
session_id text,
session_line_num bigint,
command_tag text,
session_start_time timestamp with time zone,
virtual_transaction_id text,
transaction_id bigint,
error_severity text,
sql_state_code text,
message text,
detail text,
hint text,
internal_query text,
internal_query_pos integer,
context text,
query text,
query_pos integer,
location text,
application_name text
)
server pglog
options (filename '/var/db/postgres/data11/log/postgresql.csv', format 'csv');
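Once the foreign table exists, it can be queried like any other table, and each query re-reads the file; for example (an illustrative query, not from the original answer):
-- The most recent errors, read straight from the CSV log file.
SELECT log_time, error_severity, message
FROM pglog
WHERE error_severity IN ('ERROR', 'FATAL')
ORDER BY log_time DESC
LIMIT 20;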
I'm trying to determine whether or not PostgreSQL keeps internal (but accessible via a query) sequential record IDs and/or record creation dates.
In the past I have created a serial id field and a record creation date field, but I have been asked to see if Postgres already does that. I have not found any indication that it does, but I might be overlooking something.
I'm currently using Postgresql 9.5, but I would be interested in knowing if that data is kept in any version.
Any help is appreciated.
Thanks.
No is the short answer.
There is no automatic timestamp for rows in PostgreSQL.
You could create the table with a timestamp column that has a default.
create table foo (
foo_id serial not null unique
, created_timestamp timestamp not null
default current_timestamp
) without oids;
So
insert into foo values (1);
gives us a row with foo_id = 1 and created_timestamp filled in by the default, i.e. the time of the insert.
You could also have a modified_timestamp column, which you could
keep current with a trigger (typically a BEFORE UPDATE trigger that sets the column; a sketch follows below).
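A minimal sketch of that idea, assuming the foo table above (the column, function and trigger names are illustrative):
-- Add the column, then keep it current from a BEFORE UPDATE trigger.
alter table foo add column modified_timestamp timestamp;

create or replace function set_modified_timestamp() returns trigger as $$
begin
    new.modified_timestamp := current_timestamp;
    return new;
end;
$$ language plpgsql;

create trigger foo_set_modified
    before update on foo
    for each row
    execute procedure set_modified_timestamp();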
Hope this helps