create table car_park (
id bigint primary key,
car_number bytea,
the_date date,
the_time timestamp
);
This table receives thousands of rows daily to register cars entering and leaving the car park. It will be a huge table after several months. How can I rebuild or optimize my table to get fast query results?
Table Partitioning can be useful.
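As a rough sketch of what that could look like, assuming PostgreSQL 11 or later and declarative range partitioning on the_date (the monthly granularity and the partition names are illustrative choices, not something stated in the question):

```sql
-- Parent table, partitioned by the_date. On a partitioned table the
-- primary key must include the partition key, hence (id, the_date).
CREATE TABLE car_park (
    id         bigint NOT NULL,
    car_number bytea,
    the_date   date NOT NULL,
    the_time   timestamp,
    PRIMARY KEY (id, the_date)
) PARTITION BY RANGE (the_date);

-- One partition per month (hypothetical names); the upper bound is exclusive.
CREATE TABLE car_park_2024_01 PARTITION OF car_park
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE car_park_2024_02 PARTITION OF car_park
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- A query that filters on the_date only needs to scan the matching partition.
SELECT count(*)
FROM car_park
WHERE the_date >= '2024-01-15' AND the_date < '2024-01-16';
```

Partitioning mainly pays off when queries filter on the partition key and when old data is removed in bulk, since a whole partition can be detached or dropped instead of deleting rows.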
I have a table products, a table orders and a table orderProducts.
Products have a name as a PK (apple, banana, mango) and a price.
Orders have a created_at date and an id as a PK.
orderProducts connects orders and products, so they have a product_name and an order_id. Now I would like to show all orders for a given product that happened in the last 24 hours.
I use the following query:
SELECT
orders.id,
orders.created_at,
products.name,
products.price
FROM
orderProducts
JOIN products ON
products.name=orderProducts.product
JOIN orders ON
orders.id=orderProducts."order"
WHERE
products.name='banana'
AND
orders.created_at BETWEEN NOW() - INTERVAL '24 HOURS' AND NOW()
ORDER BY
orders.created_at
This works, but I would like to optimize this query with an index. This index would need to be ordered first by
the product name, so it can be filtered,
then by the created_at of the order in descending order, so it can select only the ones from the last 24 hours.
The problem is that, from what I have seen, an index can only be created on a single table, without the possibility of including another table's values in it. Since two individual indexes do not solve this problem either, I was wondering if there is an alternative way to optimize this particular query.
Here are the table scripts:
CREATE TABLE products
(
name text PRIMARY KEY,
price integer
);
CREATE TABLE orders
(
id SERIAL PRIMARY KEY,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE orderProducts
(
product text REFERENCES products(name),
"order" integer REFERENCES orders(id)
);
First of all, please do not put indexes everywhere: every extra index slows down inserts, updates and deletes.
As proposed by @Laurenz Albe: do not guess, check.
Other than that, note that you already know the product name, and the price is repeated on every row, so you could fetch it once in a separate query. Whether two queries are faster than a single one in your case is something you should measure.
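One way to check rather than guess is to look at the actual execution plan. A minimal sketch, using the query from the question (with the reserved word order quoted so that it parses):

```sql
-- ANALYZE really executes the query and reports actual row counts and timings;
-- BUFFERS adds how many pages were read from cache or disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT orders.id, orders.created_at, products.name, products.price
FROM orderProducts
JOIN products ON products.name = orderProducts.product
JOIN orders ON orders.id = orderProducts."order"
WHERE products.name = 'banana'
  AND orders.created_at BETWEEN NOW() - INTERVAL '24 HOURS' AND NOW()
ORDER BY orders.created_at;
```

Running this before and after creating an index shows whether the planner uses it at all and whether the total time improves on realistic data volumes.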
Please read the documentation on indexes. I would try this index:
create index orders_id_created_at on orders(created_at desc, id);
Normally id would go first, since it is unique, but here the planner should be able to use the index for both predicates: the created_at range in the WHERE clause and the id in the join. This is a guess, so verify it against the actual plan.
On orderProducts I would like to see indexes on both columns, although this query should need only one. In practice you go either from products to orders or the other way round; both paths are possible, which is why I mention indexing both columns. I would use two separate indexes:
create index orderproducts_product_id on orderproducts (product_id) include (order_id);
create index orderproducts_order_id on orderproducts (order_id) include (product_id);
This probably does not change much, but the idea is to let the query be answered from the index alone, without touching the table itself.
These rules are important in terms of performance:
An integer index is faster than a string index, so try to make primary keys integers, because joins between tables use the primary keys as well.
If your WHERE clauses always filter on the same two fields, create a single index covering both fields.
Foreign keys are not indexed automatically; you must create an index on foreign-key columns manually.
So the recommended table scripts are:
CREATE TABLE products
(
id serial primary key,
name text,
price integer
);
CREATE UNIQUE INDEX products_name_idx ON products USING btree (name);
CREATE TABLE orders
(
id SERIAL PRIMARY KEY,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX orders_created_at_idx ON orders USING btree (created_at);
CREATE TABLE orderProducts
(
product_id integer REFERENCES products(id),
order_id integer REFERENCES orders(id)
);
CREATE INDEX orderproducts_product_id_idx ON orderproducts USING btree (product_id, order_id);
---- OR ----
CREATE INDEX orderproducts_product_id ON orderproducts (product_id);
CREATE INDEX orderproducts_order_id ON orderproducts (order_id);
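With the integer keys above, the query from the question would have to join on the new columns. A sketch of how it might look, assuming the schema exactly as recommended here (the aliases p, op and o are just shorthand):

```sql
SELECT o.id, o.created_at, p.name, p.price
FROM products AS p
JOIN orderProducts AS op ON op.product_id = p.id   -- can use the (product_id, order_id) index
JOIN orders AS o ON o.id = op.order_id             -- primary key lookup on orders
WHERE p.name = 'banana'                            -- can use products_name_idx
  AND o.created_at BETWEEN NOW() - INTERVAL '24 HOURS' AND NOW()
ORDER BY o.created_at;
```

Which of these indexes the planner actually picks depends on the data distribution, so it is worth confirming with EXPLAIN.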
Good day everyone!
I have insert performance issues when inserting data into a table with a clustered columnstore index.
Table:
CREATE TABLE T
(
ID int NOT NULL,
cal int NOT NULL,
cod varchar(300) NULL
);
I am running a huge query in 100 batches. The number of rows inserted per batch is about 1.000.000.000.
Running the query and inserting into that table takes about 2h. Doing the same but inserting into a table without a clustered columnstore index takes 40 min.
Am I doing something wrong?
Any suggestions?
Thanks a lot
I would like to implement an append-only list in PostgreSQL. Basically, this is trivial: Create a table, and only ever INSERT into that table.
However, I would like to be able to read that list again, in the order it was created. How can I do this? Is a simple SELECT * FROM MyTable enough? If not, what do I sort by?
Rows in a relational database have no inherent sort order. The only way to get a guaranteed sort order is to use an ORDER BY.
You can either create an identity column that is incremented on every insert or a timestamp column that records the precise time a row was inserted (or do both).
e.g.
create table append_only
(
id bigint generated always as identity,
... other columns ...
created_at timestamp default clock_timestamp()
);
Then use that column in an ORDER BY. By having both, you can use the id column as a tie breaker when sorting by the timestamp in case two rows were inserted at exactly the same microsecond.
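For the table above, reading the list back in insertion order could then look like this (a small sketch; it assumes both columns were kept):

```sql
-- Sort by insertion time; id breaks ties between rows with the same timestamp.
SELECT *
FROM append_only
ORDER BY created_at, id;
```

Note that identity values are assigned when the row is inserted, not when the transaction commits, so with concurrent writers the order reflects insertion rather than commit order.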
You could create a column with the data type SERIAL (similar to AUTOINCREMENT/SEQUENCE):
CREATE TABLE myTable(id SERIAL, ...);
SELECT * FROM myTable ORDER BY id;
I have the following table.
CREATE TABLE public.ad
(
id integer NOT NULL DEFAULT nextval('ad_id_seq'::regclass),
uuid uuid NOT NULL DEFAULT uuid_generate_v4(),
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
cmdb_id integer,
platform character varying(100),
bidfloor numeric(15,6),
views integer NOT NULL DEFAULT 1,
year integer,
month integer,
day integer,
CONSTRAINT ad_pkey PRIMARY KEY (id),
CONSTRAINT ad_cmdb_id_foreign FOREIGN KEY (cmdb_id)
REFERENCES public.cmdb (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT ad_id_unique UNIQUE (uuid)
)
WITH (
OIDS=FALSE
);
Without going into too much detail, this table logs all the requests and impressions of advertisements on electronic screens throughout the country. This table is also used to generate reports and contains roughly 50 million records.
Currently, the reports are filtered on the created_at timestamp. You can imagine that with roughly 50 million records the query gets slow, even with an index on the created_at column. The reports are generated by selecting, in the UI of the system, the date range for which you want to request the data.
The year, month and day columns are new columns that I just added to make the reporting more efficient. Instead of indexing on the date, I want the system to index on a year, month and day, all separate values.
The newly added columns are still empty. I want to run a query that inserts a value where the created_at column is between two dates. For example:
INSERT INTO ad (year) VALUES (2016) WHERE created_at BETWEEN '2016-01-01 00:00:00' AND '2016-12-31 23:59:59';
This doesn't work, of course. I cannot seem to find anything on the internet where an INSERT statement makes use of a WHERE ... BETWEEN clause. I also tried using subqueries and the WITH clause to generate a series of years between 2012 and 2020 using generate_series. None of it worked.
You don't want to insert new rows; you want to update the existing rows in your table.
UPDATE table_name
SET column1=value1,column2=value2,...
WHERE column_name BETWEEN value1 AND value2;
Otherwise you'll have 100 million rows ;)
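Applied to the ad table from the question, that could look roughly like the following. This is a sketch: deriving the values with EXTRACT instead of hard-coding 2016 is my own choice, and the half-open date range is used so that no timestamp in the last second of the year is missed.

```sql
-- Fill the new columns for all rows created in 2016.
-- EXTRACT does not return an integer, hence the casts.
UPDATE ad
SET year  = EXTRACT(YEAR  FROM created_at)::integer,
    month = EXTRACT(MONTH FROM created_at)::integer,
    day   = EXTRACT(DAY   FROM created_at)::integer
WHERE created_at >= '2016-01-01'
  AND created_at <  '2017-01-01';
```

Because the values come from created_at itself, the same statement can be repeated for other years by changing only the date range, which also keeps each transaction smaller than one update over all 50 million rows.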
I want to store some encoded 'data' in Cassandra, versioned by timestamp. My tentative schema is:
CREATE TABLE items (
item_id varchar,
timestamp timestamp,
data blob,
PRIMARY KEY (item_id, timestamp)
);
I would like to be able to return the list of items, returning only the latest version (highest timestamp) for each item_id. Is this possible with this schema?
It is not possible to express such a query in a single CQL statement for this table, so the answer is no.
You can try creating another table, e.g. latest_items, and only storing the last update there, so the schema would be:
CREATE TABLE latest_items (
item_id varchar,
timestamp timestamp,
data blob,
PRIMARY KEY (item_id)
);
If your rows are inserted in timestamp order, the table will naturally contain only the latest row for each item, because an INSERT for an existing item_id overwrites the previous row. Then you can just run select * from latest_items limit 10000000;. This will of course be expensive, because you're fetching all rows, but given your requirement that you actually want all of them, there is no way to avoid it.
This second table involves duplicating your data, but this is a common theme with Cassandra. You can avoid duplicating the blob by storing it indirectly, i.e. as a path or URL or somesuch.