optimising a SQL query with multiple min & max ranges - postgresql

I'm having big problems optimising a SQL query that takes ages to run on a set of data with ~300,000 rows.
I'm running the query on a stat_records table with a decimal value column and a datetime recorded_at column.
I want to find out the MAX and MIN values in any of the following periods: all time, last year, last 6 months, last 3 months, last month, last 2 weeks.
The way I'm doing it right now is by running the following SQL query individually for every interval specified above:
SELECT MIN("stat_records"."value")
FROM "stat_records"
INNER JOIN "stats" ON "stats"."id" = "stat_records"."stat_id"
WHERE "stat_records"."object_id" = $1
AND "stats"."identifier" = $2
AND ("stat_records"."recorded_at" BETWEEN $3 AND $4)
[["object_id", 1],
["identifier", "usd"],
["recorded_at", "2018-10-15 20:10:58.418512"],
["recorded_at", "2018-12-15 20:11:59.351437"]]
The table definition is:
create_table "stat_records", force: :cascade do |t|
t.datetime "recorded_at"
t.decimal "value"
t.bigint "coin_id"
t.bigint "object_id"
t.index ["object_id"], name: "index_stat_records_on_object_id"
t.index ["recorded_at", "object_id", "stat_id"], name: "for_upsert", unique: true
t.index ["recorded_at", "stat_id"], name: "index_stat_records_on_recorded_at_and_stat_id", unique: true
t.index ["recorded_at"], name: "index_stat_records_on_recorded_at"
t.index ["stat_id"], name: "index_stat_records_on_stat_id"
t.index ["value"], name: "index_stat_records_on_value"
end
This approach, however, takes forever to complete. I have indexes on the stat_records table on both the value and recorded_at columns.
What am I missing here - what should I do to optimise this?
Perhaps there is some better approach where I could execute one query and let Postgres do the optimisations for me.

An index can only speed up queries that need smaller parts of a table (or sorting). So you can never expect an index to make the query over the whole time range faster.
Your solution could be materialized views. That way you can pre-aggregate the values and the resulting table is much smaller, so that queries will be faster. The disadvantage is that a materialized view needs to be refreshed regularly and contains slightly stale data in between.
An example:
CREATE MATERIALIZED VIEW stats_per_month AS
SELECT stat_records.object_id,
       stats.identifier,
       date_trunc('month', stat_records.recorded_at) AS recorded_month,
       min(stat_records.value) AS minval
FROM stat_records
   INNER JOIN stats ON stats.id = stat_records.stat_id
GROUP BY stat_records.object_id,
         stats.identifier,
         date_trunc('month', stat_records.recorded_at);
If you need month granularity for your query, you just query from the materialized view rather than from the original tables.
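For example, the minimum over the last six months for one object could be read from the view like this (a sketch based on the view above; a max(stat_records.value) column could be added to the view in the same way to cover the MAX side of the question, and the REFRESH shows how the pre-aggregated data is brought up to date):
SELECT min(minval) AS min_last_6_months
FROM stats_per_month
WHERE object_id = 1
  AND identifier = 'usd'
  AND recorded_month >= date_trunc('month', now() - interval '6 months');

-- run periodically (e.g. from a cron job) so the view picks up new stat_records rows
REFRESH MATERIALIZED VIEW stats_per_month;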
You could also use a hybrid solution and use the original query for small ranges, where stale data might hurt more. That should be fast with an index on recorded_at.

Postgres CTE exponentially saving time?

I would love a clear explanation of the below. I would have thought PG would have optimized the first query to be just as fast as the second query, which uses a CTE, since it's basically using a simple index to filter and join on two columns. Everything in the joins and filtering, except "l"."type", has an index. This is on PG 10.
The query below takes 20+ minutes.
SELECT
transactions.id::text AS id,
transactions.amount,
transactions.currency::text AS currency,
transactions.external_id::text AS external_id,
transactions.check_sender_balance,
transactions.created,
transactions.type::text AS type,
transactions.sequence,
transactions.legacy_id::text AS legacy_id,
transactions.reference_transaction::text AS reference_transaction,
a.user_id as user_id
FROM transactions
JOIN lines l ON transactions.id = l.transaction
JOIN accounts a ON l.account = a.id
WHERE l.type='DEBIT'
AND "sequence" > 357550718
AND user_id IN ('5bf4ceb45d27fd2985a000000')
But the following, which I suppose explicitly optimizes accounts via a CTE, finishes in ~2-4 minutes. I would have thought PG would have optimized the first query to match this kind of performance?
WITH "accts" AS (
SELECT "id", "user_id"
FROM "accounts" WHERE "user_id" IN ('5bf4ceb45d27fd2985a000000')
)
SELECT "transactions"."id"::TEXT AS "id",
"transactions"."amount",
"transactions"."currency"::TEXT AS "currency",
"transactions"."external_id"::TEXT AS "external_id",
"transactions"."check_sender_balance",
"transactions"."created",
"transactions"."type"::TEXT AS "type",
"transactions"."sequence",
"transactions"."legacy_id"::TEXT AS "legacy_id",
"transactions"."reference_transaction"::TEXT AS "reference_transaction",
a."user_id" AS "user_id"
FROM "transactions"
JOIN "lines" "l" ON "transactions"."id" = "l"."transaction"
JOIN "accts" "a" ON "a"."id" = "l"."account"
WHERE "l"."type" = 'DEBIT'
AND "sequence" > 357550718
You have a second predicate in your second query vs. your first: in the second one's CTE you are limiting it to only a specific user_id, and nowhere in your first query do you have that filter. If there is an index on the user_id field then it is probably helping your performance. You can run an explain plan on both queries separately by adding EXPLAIN to the beginning of them to see how the plans differ. This will help you figure out why there is a difference.
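For instance, with the first query (select list trimmed here for brevity; plain EXPLAIN only shows the estimated plan, while the ANALYZE and BUFFERS options actually execute the statement and report real timings and I/O):
EXPLAIN (ANALYZE, BUFFERS)
SELECT transactions.id, a.user_id
FROM transactions
JOIN lines l ON transactions.id = l.transaction
JOIN accounts a ON l.account = a.id
WHERE l.type = 'DEBIT'
  AND "sequence" > 357550718
  AND user_id IN ('5bf4ceb45d27fd2985a000000');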

How to config hibernate isolation level for postgres

I have a table ErrorCase in a postgres database. This table has one field case_id with datatype text. Its value is generated in the format yymmdd_xxxx, where yymmdd is the date the record is inserted into the DB and xxxx is the number of the record on that date.
For example, the 3rd error case on 2019/08/01 will have case_id = 190801_0003. On 08/04, if there is one more case, its case_id will be 190804_0001, and so on.
I am already using a trigger in the database to generate the value for this field:
DECLARE
    total integer;
BEGIN
    SELECT (COUNT(*) + 1) INTO total FROM public.ErrorCase WHERE create_at = current_date;
    IF (NEW.case_id is null) THEN
        NEW.case_id = to_char(current_timestamp, 'YYMMDD_') || trim(to_char(total, '0000'));
    END IF;
    RETURN NEW;
END
And in the Spring project, I configure the application properties for JPA/Hibernate:
datasource:
  type: com.zaxxer.hikari.HikariDataSource
  url: jdbc:postgresql://localhost:5432/table_name
  username: postgres
  password: postgres
  hikari:
    poolName: Hikari
    auto-commit: false
jpa:
  database-platform: io.github.jhipster.domain.util.FixedPostgreSQL82Dialect
  database: POSTGRESQL
  show-sql: true
  properties:
    hibernate.id.new_generator_mappings: true
    hibernate.connection.provider_disables_autocommit: true
    hibernate.cache.use_second_level_cache: true
    hibernate.cache.use_query_cache: false
    hibernate.generate_statistics: true
Currently, it generates the case_id correctly.
However, when many records are inserted at nearly the same time, it generates the same case_id for two records. I guess the reason is the isolation level: when the first transaction has not yet committed, the second transaction runs the SELECT query to build its case_id, so the result of that SELECT does not include the record from the first transaction (because it has not committed yet). Therefore, the second case_id ends up the same as the first one.
Please suggest a solution for this problem - which isolation level is good for this case?
"yymmdd is the date when the record insert to DB, xxxx is the number of record in that date" - no offense but that is a horrible design.
You should have two separate columns: one date column and one integer column. If you want to increment the counter during an insert, make that date column the primary key and use insert on conflict. You can get rid of that horribly inefficient trigger and, more importantly, this will be safe for concurrent modifications even with read committed.
Something like:
create table error_case
(
error_date date not null primary key,
counter integer not null default 1
);
Then use the following to insert rows:
insert into error_case (error_date)
values (date '2019-08-01')
on conflict (error_date) do update
set counter = counter + 1;
No trigger needed and safe for concurrent inserts.
If you really need a text column as a "case ID", create a view that returns that format:
create view v_error_case
as
select concat(to_char(error_date, 'yymmdd'), '_', to_char(counter, '0000')) as case_id,
... other columns
from error_case;
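If the application also needs the newly assigned case ID in the same round trip, the insert can return it directly (a sketch, not part of the answer above; it reuses the formatting from the view):
insert into error_case (error_date)
values (current_date)
on conflict (error_date) do update
   set counter = error_case.counter + 1
returning concat(to_char(error_date, 'yymmdd'), '_', trim(to_char(counter, '0000'))) as case_id;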

Select query became very very very slow in postgresql

I have one table that contains 133,072,194 records and I am trying to execute
SELECT COUNT(test)
FROM mytable
WHERE test = false
but it is taking Execution time: 128320.712 ms
I already have an index on the test column. Could you please let me know what I can optimize or change so that my query becomes faster?
Because of this, my other select queries are also not working.
If there are many rows where test is FALSE, you won't be able to get an exact result faster than with a sequential scan, which is slow for big tables.
If you have only few rows that satisfy the condition, you should create a partial index:
CREATE INDEX mytable_notest_ind ON mytable(id) WHERE NOT test;
(assuming that id is the primary key) and keep mytable autovacuumed often enough that you get an index only scan.
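With that index in place (and assuming the table is vacuumed recently enough for an index-only scan), the count itself can stay as it is; writing the predicate exactly as in the index definition makes it easiest for the planner to pick the partial index (a sketch):
SELECT count(*)
FROM mytable
WHERE NOT test;

-- check that the partial index is actually used
EXPLAIN SELECT count(*) FROM mytable WHERE NOT test;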
But usually exact results for queries like this are not required.
You could calculate an estimated count from the table statistics with a query like this:
SELECT t.reltuples
* (1 - t.nullfrac)
* mcv.freq AS count_false
FROM pg_stats AS s
CROSS JOIN LATERAL unnest(s.most_common_vals::text::boolean[],
s.most_common_freqs) AS mcv(val, freq)
JOIN pg_class AS t
ON s.tablename = t.relname
AND s.schemaname = t.relnamespace::regnamespace::text
WHERE s.tablename = 'mytable'
AND s.attname = 'test'
AND mcv.val = FALSE;
That would be very fast.
See my blog post for more considerations about the speed of SELECT count(*).

Select most reviewed courses starting from courses having at least 2 reviews

I'm using Flask-SQLAlchemy with PostgreSQL. I have the following two models:
class Course(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    course_name = db.Column(db.String(120))
    course_description = db.Column(db.Text)
    course_reviews = db.relationship('Review', backref='course', lazy='dynamic')

class Review(db.Model):
    __table_args__ = (db.UniqueConstraint('course_id', 'user_id'), {})
    id = db.Column(db.Integer, primary_key=True)
    review_date = db.Column(db.DateTime)  # default=db.func.now()
    review_comment = db.Column(db.Text)
    rating = db.Column(db.SmallInteger)
    course_id = db.Column(db.Integer, db.ForeignKey('course.id'))
    user_id = db.Column(db.Integer, db.ForeignKey('user.id'))
I want to select the courses that are most reviewed, starting with courses having at least two reviews. The following SQLAlchemy query worked fine with SQLite:
most_rated_courses = db.session.query(models.Review, func.count(models.Review.course_id)).\
    group_by(models.Review.course_id).\
    having(func.count(models.Review.course_id) > 1).\
    order_by(func.count(models.Review.course_id).desc()).all()
But when I switched to PostgreSQL in production it gives me the following error:
ProgrammingError: (ProgrammingError) column "review.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT review.id AS review_id, review.review_date AS review_...
^
'SELECT review.id AS review_id, review.review_date AS review_review_date, review.review_comment AS review_review_comment, review.rating AS review_rating, review.course_id AS review_course_id, review.user_id AS review_user_id, count(review.course_id) AS count_1 \nFROM review GROUP BY review.course_id \nHAVING count(review.course_id) > %(count_2)s ORDER BY count(review.course_id) DESC' {'count_2': 1}
I tried to fix the query by adding models.Review in the GROUP BY clause but it did not work:
most_rated_courses = db.session.query(models.Review, func.count(models.Review.course_id)).\
    group_by(models.Review.course_id).\
    having(func.count(models.Review.course_id) > 1).\
    order_by(func.count(models.Review.course_id).desc()).all()
Can anyone please help me with this issue? Thanks a lot.
SQLite and MySQL both have the behavior that they allow a query that has aggregates (like count()) without applying GROUP BY to all other columns - which in terms of standard SQL is invalid, because if more than one row is present in that aggregated group, it has to pick the first one it sees for return, which is essentially random.
So your query for Review basically returns to you the first "Review" row for each distinct course id - like for course id 3, if you had seven "Review" rows, it's just choosing an essentially random "Review" row within the group of "course_id=3". I gather the answer you really want, "Course", is available here because you can take that semi-randomly selected Review object and just call ".course" on it, giving you the correct Course, but this is a backwards way to go.
But once you get onto a proper database like PostgreSQL, you need to use correct SQL. The data you need from the "review" table is just the course_id and the count, nothing else, so query just for that (first assume we don't actually need to display the counts; that's in a minute):
most_rated_course_ids = session.query(
Review.course_id,
).\
group_by(Review.course_id).\
having(func.count(Review.course_id) > 1).\
order_by(func.count(Review.course_id).desc()).\
all()
but that's not your Course object - you want to take that list of ids and apply it to the course table. We first need to keep our list of course ids as a SQL construct, instead of loading the data - that is, turn it into a derived table by converting the query into a subquery (change the word .all() to .subquery()):
most_rated_course_id_subquery = session.query(
Review.course_id,
).\
group_by(Review.course_id).\
having(func.count(Review.course_id) > 1).\
order_by(func.count(Review.course_id).desc()).\
subquery()
one simple way to link that to Course is to use an IN:
courses = session.query(Course).filter(
Course.id.in_(most_rated_course_id_subquery)).all()
but that's essentially going to throw away the "ORDER BY" you're looking for and also doesn't give us any nice way of actually reporting on those counts along with the course results. We need to have that count along with our Course so that we can report it and also order by it. For this we use a JOIN from the "course" table to our derived table. SQLAlchemy is smart enough to know to join on the "course_id" foreign key if we just call join():
courses = session.query(Course).join(most_rated_course_id_subquery).all()
then to get at the count, we need to add that to the columns returned by our subquery along with a label so we can refer to it:
most_rated_course_id_subquery = session.query(
Review.course_id,
func.count(Review.course_id).label("count")
).\
group_by(Review.course_id).\
having(func.count(Review.course_id) > 1).\
subquery()
courses = session.query(
Course, most_rated_course_id_subquery.c.count
).join(
most_rated_course_id_subquery
).order_by(
most_rated_course_id_subquery.c.count.desc()
).all()
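For reference, the final query corresponds roughly to SQL of this shape (a sketch; the exact statement SQLAlchemy emits will differ in aliasing):
SELECT course.*, most_rated.count
FROM course
JOIN (SELECT review.course_id, count(review.course_id) AS count
      FROM review
      GROUP BY review.course_id
      HAVING count(review.course_id) > 1) AS most_rated
  ON course.id = most_rated.course_id
ORDER BY most_rated.count DESC;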
A great article I like to point out to people about GROUP BY and this kind of query is SQL GROUP BY techniques which points out the common need for the "select from A join to (subquery of B with aggregate/GROUP BY)" pattern.

PostgreSQL and pl/pgsql SYNTAX to update fields based on SELECT and FUNCTION (while loop, DISTINCT COUNT)

I have a large database to which I want to apply some logic in order to update new fields.
The primary key is id for the table harvard_assignees.
The LOGIC GOES LIKE THIS
Select all of the records based on id
For each record (WHILE), if (state is NOT NULL && country is NULL), update country_out = "US" ELSE update country_out=country
I see step 1 as a PostgreSQL query and step 2 as a function. Just trying to figure out the easiest way to implement natively with the exact syntax.
====
The second function is a little more interesting, requiring (I believe) DISTINCT:
Find all DISTINCT foreign_keys (a bivariate key of pat_type,patent)
Count Records that contain that value (e.g., n=3 records have fkey "D","388585")
Update those 3 records to identify percent as 1/n (e.g., UPDATE 3 records, set percent = 1/3)
For the first one:
UPDATE
harvard_assignees
SET
country_out = (CASE
WHEN (state is NOT NULL AND country is NULL) THEN 'US'
ELSE country
END);
At first it had condition "id = ..." but I removed that because I believe you actually want to update all records.
And for the second one:
UPDATE
example_table
SET
percent = (SELECT 1.0 / cnt
           FROM (SELECT count(*) AS cnt
                 FROM example_table AS x
                 WHERE x.fn_key_1 = example_table.fn_key_1
                   AND x.fn_key_2 = example_table.fn_key_2) AS tmp
           WHERE cnt > 0);
That one will be kind of slow, though.
I'm thinking of a solution based on window functions; you may want to explore those too.
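A sketch of that window-function variant, using the same hypothetical example_table and fn_key_1/fn_key_2 columns as above (ctid serves as the row identity only because no key was given for that table; with a real primary key, join on that instead):
UPDATE example_table AS e
SET percent = t.pct
FROM (SELECT ctid,  -- stand-in row id (hypothetical); prefer the table's real primary key
             1.0 / count(*) OVER (PARTITION BY fn_key_1, fn_key_2) AS pct
      FROM example_table) AS t
WHERE e.ctid = t.ctid;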