Sphinx, SUM(#weight)/#count - group-by

I have a very specific query I am currently doing with MySQL
My Table structure is :
id : INT, PRIMARY KEY, AUTO_INCREMENT
occupationID : INT
alias : VARCHAR(255)
Then, I do a
SELECT occupationID, (MATCH(alias) AGAINST ('Web Developer' IN NATURAL LANGUAGE MODE) / count(*)) as score FROM aliases group by occupationID order by score DESC LIMIT 0,2;
This query performs a search on every row, doing a full scan, and divides the matches by their number of occurrences. This way I get an average score across all rows, which gives me the accuracy I need.
This is very slow (20 sec) on a 50k-record table. (I am not surprised; MySQL fulltext is very slow...)
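For reference, the MATCH ... AGAINST query above presumes a FULLTEXT index on alias; a hypothetical definition would be:
ALTER TABLE aliases ADD FULLTEXT INDEX ft_alias (alias);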
With Sphinx, I was thinking to build an index with this query:
select id,occupationID,alias, (SELECT count(*) from aliases AS A WHERE B.occupationID=A.occupationID) as nb from aliases AS B
And then do a
$sphinx->setSelect("#id, sum(#weight)/nb as score");
$sphinx->setGroupBy("occupationID", GROUP_BY_ATTR, "score DESC");
and
$sphinx->query("Web Developer");
Am I doing this right?

Mostly. The only oddity I spot is that in the original MySQL you group by occupationID, but in building the index you join with id - meaning the count will be the number with the same id, not the same occupationID.
I'm also not certain that sum(#weight) works in Sphinx - in theory it should, but there are a few odd restrictions, so I can't be sure without testing.
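For reference, a minimal sketch of the index source this would need in sphinx.conf, assuming the pre-computed count is exposed as an integer attribute (the source name is made up, connection settings omitted):
source occupations
{
    type          = mysql
    # sql_host / sql_user / sql_pass / sql_db omitted
    # index-time query: pre-compute the per-occupation alias count as "nb"
    sql_query     = \
        SELECT id, occupationID, alias, \
               (SELECT COUNT(*) FROM aliases A WHERE A.occupationID = B.occupationID) AS nb \
        FROM aliases B
    # expose both as attributes so they can be grouped on and used in setSelect()
    sql_attr_uint = occupationID
    sql_attr_uint = nb
}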

Related

Postgres: Optimization of query with simple where clause

I have a table with the following columns:
ID (VARCHAR)
CUSTOMER_ID (VARCHAR)
STATUS (VARCHAR) (4 different statuses possible)
other columns that are not relevant here
I am trying to find all the rows with a given customer_id and a status matching either of two values.
The query looks like:
SELECT *
FROM my_table
WHERE customer_id = '12345678' AND status IN ('STATUS1', 'STATUS2');
The table contains about 1 million rows. I added two separate indexes, one on customer_id and one on status. The query still takes about 1 second to run.
The explain plan is:
1. Gather
2. Seq Scan on my_table
Filter: (((status)::text = ANY ('{SUBMITTED,CANCELLED}'::text[])) AND ((customer_id)::text = '12345678'::text))
I ran ANALYZE my_table after creating the indexes. What could I do to improve the performance of this quite simple query?
You need a compound (multi-column) index to help satisfy your query.
Guessing, it seems like the most selective column (the one with the most distinct values) is customer_id. status probably has only a few distinct values. So customer_id should go first in the index. Try this.
CREATE INDEX customer_id_status ON my_table (customer_id, status);
This creates a BTREE index. A useful mental model for such an index is an old-fashioned telephone book. It's sorted in order. You look up the first matching entry in the index, then scan it sequentially for the items you want.
You may also want to try running ANALYZE my_table; to update the statistics (about selectivity) used by the query planner to choose an appropriate index.
Pro tip: Avoid SELECT * if you can. Instead name the columns you want. This can help performance a lot.
Pro tip: Your question said some of your columns aren't relevant to query optimization. That's probably not true; index design is a weird art. SELECT * makes it less true.
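To confirm the planner actually uses the new index, a quick check along these lines (a sketch; the column list is just an example) should show an Index Scan or Bitmap Index Scan instead of the Seq Scan:
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, customer_id, status
FROM my_table
WHERE customer_id = '12345678'
  AND status IN ('STATUS1', 'STATUS2');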

Does the index column order matter on row insert in Postgresql?

I have a table similar to this one:
create table request_journal
(
id bigint,
request_body text,
request_date timestamp,
user_id bigint
);
It is used for request logging purposes, so frequent inserts in it are expected (2k+ rps).
I want to create composite index on columns request_date and user_id to speed up execution of select queries like this:
select *
from request_journal
where request_date between '2021-07-08 10:00:00' and '2021-07-08 16:00:00'
and user_id = 123
order by request_date desc;
I tested select queries with a (request_date desc, user_id) btree index and with a (user_id, request_date desc) btree index. With the request_date leading-column index, select queries execute about 10% faster, but in general the performance of either index is acceptable.
So my question is: does the index column order affect insertion time? I have not spotted any difference using EXPLAIN/EXPLAIN ANALYZE on the insert query. Which index will be more efficient to maintain under "high load"?
It is hard to believe your tests were done on any vaguely realistic data size.
At the rate you indicate, a 6 hour range would include over 43 million records. If the user_ids are spread evenly over 1e6 different values, I get the index leading with user_id to be a thousand times faster for that query than the one leading with request_date.
But anyway, for loading new data, assuming the new data is all from recent times, then the one with request_date should be faster as the part of the index needing maintenance while loading will be more concentrated in part of the index, and so better cached. But this would depend on how much RAM you have, what your disk system is like, and how many distinct user_ids you are loading data for.
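For reference, the two candidate indexes compared above would be created like this (the index names are made up):
CREATE INDEX request_journal_user_date ON request_journal (user_id, request_date DESC);
CREATE INDEX request_journal_date_user ON request_journal (request_date DESC, user_id);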

Postgresql 9.4 slow [duplicate]

I have table
create table big_table (
id serial primary key,
-- other columns here
vote int
);
This table is very big (approximately 70 million rows), and I need to run queries like:
SELECT * FROM big_table
ORDER BY vote [ASC|DESC], id [ASC|DESC]
OFFSET x LIMIT n -- I need this for pagination
As you may know, when x is a large number, queries like this are very slow.
For performance optimization I added indexes:
create index vote_order_asc on big_table (vote asc, id asc);
and
create index vote_order_desc on big_table (vote desc, id desc);
EXPLAIN shows that the above SELECT query uses these indexes, but it's very slow anyway with a large offset.
What can I do to optimize queries with OFFSET in big tables? Maybe PostgreSQL 9.5 or even newer versions have some features? I've searched but didn't find anything.
A large OFFSET is always going to be slow. Postgres has to order all rows and count the visible ones up to your offset. To skip all previous rows directly you could add an indexed row_number to the table (or create a MATERIALIZED VIEW including said row_number) and work with WHERE row_number > x instead of OFFSET x.
However, this approach is only sensible for read-only (or mostly) data. Implementing the same for table data that can change concurrently is more challenging. You need to start by defining desired behavior exactly.
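A minimal sketch of that idea for read-mostly data (the view name and cutoff values are made up):
CREATE MATERIALIZED VIEW big_table_ranked AS
SELECT row_number() OVER (ORDER BY vote, id) AS rn, *
FROM   big_table;

CREATE UNIQUE INDEX ON big_table_ranked (rn);

-- pagination becomes an indexed range scan instead of OFFSET:
SELECT * FROM big_table_ranked WHERE rn > 100000 ORDER BY rn LIMIT 20;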
I suggest a different approach for pagination:
SELECT *
FROM big_table
WHERE (vote, id) > (vote_x, id_x) -- ROW values
ORDER BY vote, id -- needs to be deterministic
LIMIT n;
Where vote_x and id_x are from the last row of the previous page (for both DESC and ASC). Or from the first if navigating backwards.
Comparing row values is supported by the index you already have - a feature that complies with the ISO SQL standard, but not every RDBMS supports it.
CREATE INDEX vote_order_asc ON big_table (vote, id);
Or for descending order:
SELECT *
FROM big_table
WHERE (vote, id) < (vote_x, id_x) -- ROW values
ORDER BY vote DESC, id DESC
LIMIT n;
Can use the same index.
I suggest you declare your columns NOT NULL or acquaint yourself with the NULLS FIRST|LAST construct:
PostgreSQL sort by datetime asc, null first?
Note two things in particular:
The ROW values in the WHERE clause cannot be replaced with separated member fields. WHERE (vote, id) > (vote_x, id_x) cannot be replaced with:
WHERE vote >= vote_x
AND id > id_x
That would rule out all rows with id <= id_x, while we only want to do that for the same vote and not for the next. The correct translation would be:
WHERE (vote = vote_x AND id > id_x) OR vote > vote_x
... which doesn't play along with indexes as nicely, and gets increasingly complicated for more columns.
Would be simple for a single column, obviously. That's the special case I mentioned at the outset.
The technique does not work for mixed directions in ORDER BY like:
ORDER BY vote ASC, id DESC
At least I can't think of a generic way to implement this as efficiently. If at least one of both columns is a numeric type, you could use a functional index with an inverted value on (vote, (id * -1)) - and use the same expression in ORDER BY:
ORDER BY vote ASC, (id * -1) ASC
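A sketch of that workaround (assuming id is numeric, as above; the index name is made up):
CREATE INDEX big_table_vote_inv_id ON big_table (vote, (id * -1));

SELECT *
FROM   big_table
WHERE  (vote, (id * -1)) > (vote_x, (id_x * -1))  -- ROW values, as before
ORDER  BY vote ASC, (id * -1) ASC
LIMIT  n;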
Related:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Improve performance for order by with columns from many tables
Note in particular the presentation by Markus Winand I linked to:
"Pagination done the PostgreSQL way"
Have you tried partitioning the table?
Ease of management, improved scalability and availability, and a reduction in blocking are common reasons to partition tables. Improving query performance is not a reason to employ partitioning, though it can be a beneficial side-effect in some cases. In terms of performance, it is important to ensure that your implementation plan includes a review of query performance. Confirm that your indexes continue to appropriately support your queries after the table is partitioned, and verify that queries using the clustered and nonclustered indexes benefit from partition elimination where applicable.
http://sqlperformance.com/2013/09/sql-indexes/partitioning-benefits

PostgreSQL Aggregate groups with limit performance

I'm a newbie in PostgreSQL. Is there a way to improve execution time of the following query:
SELECT s.id, s.name, s.url,
(SELECT array_agg(p.url)
FROM (
SELECT url
FROM pages
WHERE site_id = s.id ORDER BY created DESC LIMIT 5
) as p
) as last_pages
FROM sites s
I haven't found a way to put a LIMIT clause inside the aggregate call, the way an ORDER BY can be.
There are indexes on created (timestamp) and site_id (integer) in the pages table, but the foreign key from pages.site_id to sites.id is absent, unfortunately. The query is intended to return a list of sites with sublists of the 5 most recently created pages.
PostgreSQL version is 9.1.5.
You need to start by thinking like the database management system. You also need to think very carefully about what you are asking from the database.
Your fundamental problem is that you likely have a very large number of separate index scans happening here, when a single sequential scan may be quite a bit faster. Your current query gives the planner very little flexibility because the subqueries must be correlated.
A much better way to do this would be with a view (inline or not) and a window function:
SELECT s.id, s.name, s.url, array_agg(p.url)
FROM sites s
JOIN (select site_id, url,
row_number() OVER (partition by site_id order by created desc) as num
from pages) p on s.id = p.site_id
WHERE num <= 5
GROUP BY s.id, s.name, s.url;  -- required because array_agg is an aggregate
This will likely change a very large number of index scans to a single large sequential scan.
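(For completeness: on PostgreSQL 9.3 and later - not the 9.1.5 in the question - the same "top 5 per site" can be written with a LATERAL subquery, which keeps the per-site index scans; whether that beats one big sequential scan depends on how many sites you have.)
SELECT s.id, s.name, s.url, p.last_pages
FROM   sites s
LEFT   JOIN LATERAL (
   SELECT array_agg(url) AS last_pages
   FROM  (
      SELECT url
      FROM   pages
      WHERE  site_id = s.id
      ORDER  BY created DESC
      LIMIT  5
      ) sub
   ) p ON true;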

PostgreSQL DISTINCT problem: works locally but not on server

I've come across a vexing problem with a PostgreSQL query. This works in my local development environment:
SELECT distinct (user_id) user_id, created_at, is_goodday
FROM table
WHERE ((created_at >= '2011-07-01 00:00:00') AND user_id = 95
AND (created_at < '2011-08-01 00:00:00'))
ORDER BY user_id, created_at ASC;
...but gives the following error on my QA server (which is on Heroku):
PGError: ERROR: syntax error at or near "user_id"
LINE 1: SELECT distinct (user_id) user_id, created_at,
^
Why could this be?
Other possibly relevant info:
I have tried single-quoting and double-quoting the field names
It's a Rails 3 app, but I'm using this SQL raw, i.e. no ActiveRecord magic
My local version of Postgres is 9.0.4 on Mac, but I have no idea what version Heroku is using
As per your comment, the standard PostgreSQL version of that query would be:
SELECT user_id, created_at, is_goodday
FROM table
WHERE created_at >= '2011-07-01 00:00:00'
AND created_at < '2011-08-01 00:00:00'
AND user_id = 95
ORDER BY created_at DESC, id DESC
LIMIT 1
You don't need user_id in the ORDER BY because you have user_id = 95; you want created_at DESC in the ORDER BY to put the most recent created_at at the top; then you LIMIT 1 to slice off just the first row in the result set. GROUP BY can be used to enforce uniqueness or to group things for an aggregate function, but you don't need it for either of those here: you can get uniqueness through ORDER BY and LIMIT, and you can hide your aggregation inside the ORDER BY (i.e. you don't need MAX because ORDER BY does that for you).
Since you have user_id = 95 in your WHERE, you don't need user_id in the SELECT but you can leave it in if that makes it easier for you in Ruby-land.
It is possible that you could have multiple entries with the same created_at so I added an id DESC to the ORDER BY to force PostgreSQL to choose the one with the highest id. There's nothing wrong with being paranoid when they really are out to get you and bugs definitely are out to get you.
Also, you want DESC in your ORDER BY to get the highest values at the top, ASC puts the lowest values at the top. The more recent timestamps will be the higher ones.
In general, the GROUP BY and SELECT have to match up because:
When GROUP BY is present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions, since there would be more than one possible value to return for an ungrouped column.
But that doesn't matter here because you don't need a GROUP BY at all. I linked to the 8.3 version of the documentation to match the PostgreSQL version you're using.
There are probably various other ways to do this, but this one is probably as straightforward and clear as you're going to get.
Put quotes around the user_id value, like user_id = '95'. Your query should be:
SELECT distinct (user_id) as uid, created_at, is_goodday FROM table WHERE
((created_at >= '2011-07-01 00:00:00') AND user_id = '95' AND (created_at < '2011-08-01 00:00:00')) ORDER BY user_id, created_at ASC;
You're using DISTINCT ON (without writing the ON). Perhaps you should write the ON. Perhaps your postgres server dates from before the feature was implemented (which is pretty old by now).
If all else fails, you can always do that with some GROUP BY...
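For reference, if the intent is "one row per user_id", the explicit DISTINCT ON form would look like this (a sketch, keeping the placeholder table name from the question):
SELECT DISTINCT ON (user_id) user_id, created_at, is_goodday
FROM   table
WHERE  created_at >= '2011-07-01 00:00:00'
  AND  created_at <  '2011-08-01 00:00:00'
  AND  user_id = 95
ORDER  BY user_id, created_at DESC;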