Improve performance for Postgres

I count the number of users this way. It takes 5 seconds to produce results, and I am looking for a better solution:
SELECT COUNT(*)
FROM (SELECT user_id
FROM slot_result_primary
WHERE session_timestamp BETWEEN 1590598800000 AND 1590685199999
GROUP BY user_id) AS foo

First of all you can simplify the query:
SELECT COUNT(DISTINCT user_id)
FROM slot_result_primary
WHERE session_timestamp BETWEEN 1590598800000 AND 1590685199999
Most importantly, make sure you have an index on session_timestamp.
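One caveat on the simplification, shown as a minimal sketch: the two forms only agree when user_id is never NULL, because GROUP BY puts NULLs into a group of their own while COUNT(DISTINCT ...) ignores them. The sketch uses Python's built-in sqlite3 as a stand-in for Postgres (both follow the same NULL rules here), and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE slot_result_primary (user_id INTEGER, session_timestamp INTEGER)"
)

# Invented sample rows; the bounds are the ones from the question.
LOW, HIGH = 1590598800000, 1590685199999
conn.executemany(
    "INSERT INTO slot_result_primary VALUES (?, ?)",
    [
        (1, LOW),          # lower bound is inclusive
        (1, LOW + 1000),   # same user again
        (2, HIGH),         # upper bound is inclusive
        (3, HIGH + 1),     # just outside the window
        (None, LOW + 5),   # NULL user_id inside the window
    ],
)

# Original form: count the groups produced by GROUP BY.
grouped = conn.execute(
    """
    SELECT COUNT(*)
    FROM (SELECT user_id
          FROM slot_result_primary
          WHERE session_timestamp BETWEEN ? AND ?
          GROUP BY user_id) AS foo
    """,
    (LOW, HIGH),
).fetchone()[0]

# Simplified form: COUNT(DISTINCT ...) skips NULL values.
distinct = conn.execute(
    """
    SELECT COUNT(DISTINCT user_id)
    FROM slot_result_primary
    WHERE session_timestamp BETWEEN ? AND ?
    """,
    (LOW, HIGH),
).fetchone()[0]

print(grouped, distinct)
```

If user_id is declared NOT NULL, the two queries return the same number and the simpler one is safe to use.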

Counting is a very heavy operation in Postgres. It should be avoided if possible.
It is very difficult to make it better, because for each row Postgres needs to go to the disk. You can indeed create a better index so that the rows to read are found faster, but even then the count time will keep growing linearly with the size of the data.
Your index should be:
CREATE INDEX session_timestamp_user_id_index ON slot_result_primary (session_timestamp, user_id)
for best results.
Still, an index will not fully solve your count problems. In a similar situation I faced two days ago (with a SELECT query running 3 s and a count running 1 s), dedicated indexes brought the SELECT down to 0.3 ms, but the best I could do with the count was 700 ms.
Here you can find a good article summarizing why count is difficult and the different ways to make it better:
https://www.citusdata.com/blog/2016/10/12/count-performance/

Related

Scalable approach to calculating balances over thousands of records in PostgreSQL

I'm facing the challenge of generating 'balance' values from thousands of entries in a PG table.
The rows in the table have many different columns, each useful in calculating that rows contribution to the balance. Each row/entry belongs to some profile. I need to calculate the balance value for some profile, from all entries belonging to that profile according to some set of rules. Complexity should be O(N) - N being the number of entries that belong to the profile.
The different approaches I took:
Fetching the rows and calculating balances on the backend. This doesn't scale well and degrades quickly, depending on the number of entries that belong to the profile. While fetching the entries is initially fast, once a profile has over 10,000 entries it becomes prohibitively slow.
I figured that a lot of time is being spent on transport; additionally, we don't really need the rows, only the balances. Since we already do the work of finding the entries, we can calculate the balance in the database and save time on backend calculations as well as on the transport of thousands of rows, which leads to the second approach:
The second approach was creating a PG query that iterates over the rows and calculates the balance. This has proven to be more scalable when there are many entries per profile. However, this approach puts a lot of load on the database, probably due to the complexity of the PG query. It's enough to run 3-4 of these queries concurrently to max out the database CPU.
The third approach is to create a PL/pgSQL function to loop over the relevant entries and return the rows, hoping to reduce the impact on the database. This is the next thing I want to try.
Main question is - what would be the most efficient way to achieve this while being 'database friendly'?
Additionally:
Whether you think these approaches are sane?
Am I missing another obvious solution?
Is it unlikely that a function looping over the same rows as the query would improve on the query's performance, or is it worth trying?
I realize I haven't provided a lot of concrete data, but I figured that since this is probably a common problem, maybe the issue can be understood from a general description.
EDIT:
To be a bit more specific, I'm dealing with the following data:
CREATE TABLE entries (
profileid bigint NOT NULL,
programid bigint NOT NULL,
ledgerid text NOT NULL, -- this provides further granularity, on top of 'programid'
startdate timestamptz,
enddate timestamptz,
amount numeric NOT NULL
)
What I want to get is the balances for a certain profileid, separated by (programid, ledgerid).
The desired form is:
RETURNS TABLE (
programid bigint,
ledgerid text,
currentbalance numeric,
pendingbalance numeric,
expiredbalance numeric,
spentbalance numeric
)
The four balance values are produced by applying arithmetic on certain entries. For example, negative amount would only add to spentbalance, expired balance is generated from entries that have a positive amount and the enddate is after now(), etc...
While I did manage to create a very large aggregate query with many calls to COALESCE(SUM(CASE WHEN ... amount), 0), I was wondering if I have anything to gain from porting that logic into a PL/pgSQL function. However, when trying to implement this function I realized I don't know how to iterate over one function and return another, different in columns and rows, function. Should I use a temp table for this? That seems like overkill, as this query is expected to execute tens of times every second...
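For reference, the COALESCE(SUM(CASE WHEN ...)) shape described above can be sketched as a single GROUP BY pass, which reads each entry of the profile once. This is only an illustration under loud assumptions: it runs on Python's built-in sqlite3 rather than Postgres, dates are stored as ISO text, and the four balance rules are placeholders guessed for the example (in particular, "expired" is taken here to mean a positive amount whose enddate has passed), not the actual rules from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE entries (
        profileid INTEGER NOT NULL,
        programid INTEGER NOT NULL,
        ledgerid  TEXT NOT NULL,
        startdate TEXT,             -- ISO date text (stand-in for timestamptz)
        enddate   TEXT,
        amount    NUMERIC NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 10, "a", "2020-01-01", "2020-12-31", 100),  # active
        (1, 10, "a", "2020-07-01", "2020-12-31", 50),   # not started yet
        (1, 10, "a", "2019-01-01", "2019-12-31", 30),   # already ended
        (1, 10, "a", None, None, -40),                  # spend
        (1, 20, "b", None, None, 5),                    # no dates: always active
        (2, 10, "a", None, None, 999),                  # another profile
    ],
)

# Placeholder rules (assumptions, not the real business logic):
#   current = positive, started, not yet ended
#   pending = positive, starts in the future
#   expired = positive, enddate has passed
#   spent   = negative amounts
BALANCES_SQL = """
SELECT programid, ledgerid,
       COALESCE(SUM(CASE WHEN amount > 0
                          AND (startdate IS NULL OR startdate <= :now)
                          AND (enddate IS NULL OR enddate > :now)
                         THEN amount END), 0) AS currentbalance,
       COALESCE(SUM(CASE WHEN amount > 0 AND startdate > :now
                         THEN amount END), 0) AS pendingbalance,
       COALESCE(SUM(CASE WHEN amount > 0 AND enddate <= :now
                         THEN amount END), 0) AS expiredbalance,
       COALESCE(SUM(CASE WHEN amount < 0
                         THEN amount END), 0) AS spentbalance
FROM entries
WHERE profileid = :profileid
GROUP BY programid, ledgerid
ORDER BY programid, ledgerid
"""

balances = conn.execute(
    BALANCES_SQL, {"now": "2020-06-01", "profileid": 1}
).fetchall()
for row in balances:
    print(row)
```

A PL/pgSQL loop over the same rows would still have to visit every entry, so it is not obvious that it would do less work than this one aggregate; measuring both against real data would be the way to find out.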

What kind of index should I use in postgresql for a column with 3 values

I have a table with 100 million+ records and 237 fields.
One of the fields is a varchar(1) field with three possible values (Y, N, I).
I need to find all of the records with N.
Right now I have a b-tree index built and the query below takes about 20 min to run.
Is there another index I can use to get better performance?
SELECT * FROM tableone WHERE export_value='N';
Assuming your values are roughly equally distributed (say at least 15% of each value) and roughly equally distributed throughout the table (some physically at the beginning, some in the middle, some at the end) then no.
If you think about it you'll see why. You'll have to look up tens of millions of disk blocks in the index and then fetch them from the disk one by one. By the time you have done that, it would have been quicker to just scan the whole table and pick out the values as they match. The planner knows this and would probably not use the index at all.
However, if you only have 17 rows with "N", or they are all very recently added to the table and so physically happen to be close to each other, then yes, an index can help.
If you only had a few rows with "N" you would have mentioned it, so we can ignore that one.
If, however, you mostly insert into this table, you might find a BRIN index helpful. That can let the planner see that, for example, the first 80% of your table doesn't have any blocks containing "N", so it just needs to look at the last bit.

Statistical query to loop through different date periods

I have a massive query log table in PostgreSQL. I have been asked to get statistical data from it, but the table is huge: it has about 170,000,000 rows.
I've been asked for statistics for the last 6 months, with a count of services for each day.
The issue is that since the table is so big, it will take forever to get this data.
Here's the current query I use:
SELECT ql.query_time::timestamp::date,count(ql.query_name),ql.query_name
FROM query_log ql
WHERE ql.query_time BETWEEN '2017-12-20 14:00:00.000'::timestamp AND '2018-06-20 14:00:00.000'::timestamp AND success=TRUE
GROUP BY ql.query_time::timestamp::date, ql.query_name;
Please suggest how to make this query faster and more efficient. I want to save the output to a CSV file.
I've been thinking of looping through each day for the past 6 months, but I don't know how to do it.
OH, ql.query_time is indexed.
Thx!
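The day-by-day loop mentioned above could be driven from a small script. Below is a minimal sketch of just the window generation, assuming the query would then be run once per window with ql.query_time >= lo AND ql.query_time < hi (a half-open range, which differs from BETWEEN only at the exact end instant). The function name day_windows is made up for this example, and the windows are cut at midnight so that each one maps to exactly one value of ql.query_time::date.

```python
from datetime import datetime, time, timedelta


def day_windows(start, end):
    """Yield half-open (lo, hi) windows covering [start, end), cut at midnight."""
    lo = start
    while lo < end:
        # Midnight at the start of the following calendar day.
        next_midnight = datetime.combine(lo.date() + timedelta(days=1), time.min)
        hi = min(next_midnight, end)
        yield lo, hi
        lo = hi


# Bounds taken from the query in the question.
start = datetime(2017, 12, 20, 14, 0)
end = datetime(2018, 6, 20, 14, 0)

windows = list(day_windows(start, end))
print(len(windows))   # one window per calendar date touched by the range
print(windows[0])     # partial first day, from 14:00 to midnight
print(windows[-1])    # partial last day, from midnight to 14:00
```

Each window only touches one day of the indexed query_time range, and the per-day results can be appended to the CSV as they arrive. Whether this is faster overall than one big query is not something the sketch can show; it mainly keeps each query small and makes the job restartable.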

Query distinct values from historical database

If I run this query on a large historical database without specifying a date, will KDB be smart enough to retrieve the status values from an index and not bring the database down?
select distinct status from trades
The only way kdb can possibly tell all the distinct statuses is by reading from every partition. Yes, this will take a lot of memory, but unless you want to maintain a cache of all distinct statuses yourself, there is nothing else you can do. As previously mentioned, an attribute will speed the query up, but the query time will still scale with the number of partitions.
To retrieve using an index, kdb provides the `g# attribute. distinct alone can take more time, depending on the size of your table (it will be a linear search without the `g# attribute).
Check this-> http://code.kx.com/q4m3/8_Tables/#88-attributes
Let's look at simple example:
q) a: 10000000#1 2 3 5
q) b:`g#a
q) \ts distinct a
68 134217888
q) \ts distinct b
0 288
The difference shows that the `g# attribute has a large effect on the time and space taken during the search. This is because the `g# attribute creates and maintains an index on the vector.

Pagination on large data sets? – Abort count(*) after a certain time

We use the following pagination technique here:
get count(*) of given filter
get first 25 records of given filter
-> render some pagination links on the page
This works pretty well as long as count(*) is reasonably fast. In our case the data size has grown to a point where a non-indexed query (although most stuff is covered by indices) takes more than a minute. So at this point the user waits for a mostly unimportant number (total records matching filter, number of pages). The first N records are often ready pretty fast.
Therefore I have two questions:
can I limit the count(*) to a certain number
or would it be possible to limit it by time? (no count() known after 20ms)
Or just in general: are there some easy ways to avoid that problem? We would like to keep the system as untouched as possible.
Database: Oracle 10g
Update
There are several scenarios
a) there's an index -> neither count(*) nor the actual select should be a problem
b) there's no index
count(*) is HUGE, and it takes ages to determine it -> rownum would help
count(*) is zero or very low; here a time limit would help. Or I could just skip the count(*) if the result set is already below the page limit.
You could use 'where rownum < x' to limit the number of rows to count. And if you need to show your user that there are more records, you could use x+1 in the count just to see whether there are more than x records.
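The x+1 idea can be sketched independently of the database: stop counting one row past the cap, and use that extra row only as a "there is more" flag. This is an illustration in plain Python over an iterator, not Oracle code; capped_count and page_label are names invented for the example. In Oracle the cap would sit in the counted subquery via rownum, as described above.

```python
from itertools import islice


def capped_count(rows, cap):
    """Consume at most cap + 1 rows; return (count up to cap, more_exist)."""
    seen = sum(1 for _ in islice(rows, cap + 1))
    return min(seen, cap), seen > cap


def page_label(rows, cap):
    """Text for the pagination widget: an exact number, or 'more than cap'."""
    count, more = capped_count(rows, cap)
    return f"more than {cap}" if more else str(count)


print(page_label(iter(range(1000)), 100))  # many matches
print(page_label(iter(range(7)), 100))     # few matches
```

The cost of the count is then bounded by the cap instead of by the total number of matching rows, at the price of showing "more than x" rather than an exact total.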