Sphinx query takes too much time

I am building an index on a table with ~90,000,000 rows. Full-text search must be done on a varchar field called email. I also set parent_id as an attribute.
When I run queries searching for emails that match words with a small number of hits, they return immediately:
mysql> SELECT count(*) FROM users WHERE MATCH('diedsmiling');
+----------+
| count(*) |
+----------+
|       26 |
+----------+
1 row in set (0.00 sec)
mysql> show meta;
+---------------+-------------+
| Variable_name | Value       |
+---------------+-------------+
| total         | 1           |
| total_found   | 1           |
| time          | 0.000       |
| keyword[0]    | diedsmiling |
| docs[0]       | 26          |
| hits[0]       | 26          |
+---------------+-------------+
6 rows in set (0.00 sec)
Things get complicated when I search for emails that match words with a large number of hits:
mysql> SELECT count(*) FROM users WHERE MATCH('mail');
+----------+
| count(*) |
+----------+
| 33237994 |
+----------+
1 row in set (9.21 sec)
mysql> show meta;
+---------------+----------+
| Variable_name | Value    |
+---------------+----------+
| total         | 1        |
| total_found   | 1        |
| time          | 9.210    |
| keyword[0]    | mail     |
| docs[0]       | 33237994 |
| hits[0]       | 33253762 |
+---------------+----------+
6 rows in set (0.00 sec)
Filtering by the parent_id attribute doesn't help either:
mysql> SELECT count(*) FROM users WHERE MATCH('mail') AND parent_id = 62003;
+----------+
| count(*) |
+----------+
| 21404 |
+----------+
1 row in set (8.66 sec)
mysql> show meta;
+---------------+----------+
| Variable_name | Value    |
+---------------+----------+
| total         | 1        |
| total_found   | 1        |
| time          | 8.666    |
| keyword[0]    | mail     |
| docs[0]       | 33237994 |
| hits[0]       | 33253762 |
+---------------+----------+
Here is my Sphinx config:
source src1
{
    type          = mysql
    sql_host      = HOST
    sql_user      = USER
    sql_pass      = PASS
    sql_db        = DATABASE
    sql_port      = 3306 # optional, default is 3306
    sql_query     = \
        SELECT id, parent_id, email \
        FROM users
    sql_attr_uint = parent_id
}
index test1
{
    source = src1
    path   = /var/lib/sphinx/test1
}
The query that I need to run looks like:
SELECT * FROM users WHERE MATCH('mail') AND parent_id = 62003;
I need to get all emails that match a certain word and have a certain parent_id.
My questions are:
Is there a way to optimize the situation described above? Maybe there is a more convenient matching mode for this type of query? If I migrate to a server with SSD disks, will the performance gain be significant?

If you just need the count, you can do:
SELECT id FROM index WHERE MATCH(...) LIMIT 0 OPTION ranker=none; SHOW META;
and read the count from total_found.
That will be much more efficient than COUNT(*), which invokes a GROUP BY.
Or, for single words only, even CALL KEYWORDS('word', 'index', 1); which returns the per-keyword docs/hits statistics directly.
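A concrete sketch of that suggestion applied to the query from the question (assuming the index is named test1, as in the config above):
SELECT id FROM test1 WHERE MATCH('mail') AND parent_id = 62003 LIMIT 0 OPTION ranker=none;
SHOW META;
total_found in the SHOW META output then carries the match count, without Sphinx fetching or ranking any rows.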

Related

Flaky tests query

Given a (PostgreSQL) table of test results, we would like to find tests that are flaky: tests that fail and then pass on the same run:
+-----------+--------+-----------------+----------------------------+----------+
| result_id | run_id | scenario        | time                       | result   |
+-----------+--------+-----------------+----------------------------+----------+
| 12031     | 123    | #loginHappyFlow | 2020-12-22 12:23:20.077636 | Pass     |
| 12032     | 123    | #signUpSocial   | 2020-12-22 12:22:03.355052 | Fail     |
| 12033     | 123    | #signUpSocial   | 2020-12-22 12:19:19.812301 | Pass     |
+-----------+--------+-----------------+----------------------------+----------+
Not sure how to approach this, please advise. Thanks!
SELECT
    a.result_id
    ,a.run_id
    ,a.scenario
    ,a.time
    ,a.result
    ,b.quantity
FROM
    table a
    JOIN (
        SELECT
            run_id
            ,scenario
            ,COUNT(*) AS quantity
        FROM
            table
        GROUP BY 1, 2
    ) b ON (a.run_id = b.run_id AND a.scenario = b.scenario)
WHERE
    b.quantity > 1
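The join above flags every scenario that produced more than one result within a run. To require specifically that a scenario both failed and passed in the same run (ignoring the ordering by time), a minimal sketch, assuming the table is named test_results and result holds the literal strings 'Pass' and 'Fail':
SELECT run_id, scenario
FROM test_results
GROUP BY run_id, scenario
HAVING bool_or(result = 'Fail')  -- at least one failing result in the run
   AND bool_or(result = 'Pass'); -- and at least one passing result
bool_or is a standard PostgreSQL aggregate; every (run_id, scenario) pair returned had both outcomes within the same run.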

How to get non-aggregated measures?

I calculate my metrics with SQL and publish the resulting table to Tableau Server. Afterward, I use this data source to create charts and dashboards.
For one analysis, I already calculated the measures per day with SQL. When I use the resulting table in Tableau, it aggregates these measures with SUM by default. However, I don't want the SUM or AVG of the averages, or the SUM of the percentiles.
What I want is the result I would get if I didn't select the date dimension and didn't GROUP BY date in SQL, as shown below.
Here is the query:
SELECT
    -- date,
    COUNT(DISTINCT id) AS count_of_id,
    AVG(timediff_in_sec) AS avg_timediff,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY timediff_in_sec) AS percentile_25,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY timediff_in_sec) AS percentile_50
FROM
(
    --subquery
) AS t1
-- GROUP BY date
Here are the first rows of the resulting table:
+------------+--------------+-------------+---------------+---------------+
| date       | avg_timediff | count_of_id | percentile_25 | percentile_50 |
+------------+--------------+-------------+---------------+---------------+
| 10/06/2020 | 61,65186364  | 22          | 8,5765        | 13,3015       |
| 11/06/2020 | 127,2913333  | 3           | 15,6045       | 17,494        |
| 12/06/2020 | 306,0348214  | 28          | 12,2565       | 17,629        |
| 13/06/2020 | 13,2664      | 5           | 11,944        | 13,862        |
| 14/06/2020 | 16,728       | 7           | 14,021        | 17,187        |
| 15/06/2020 | 398,6424595  | 37          | 11,893        | 19,271        |
| 16/06/2020 | 293,6925152  | 33          | 12,527        | 17,134        |
| 17/06/2020 | 155,6554286  | 21          | 13,452        | 16,715        |
| 18/06/2020 | 383,8101429  | 7           | 266,048       | 493,722       |
+------------+--------------+-------------+---------------+---------------+
How can I achieve the desired output above?
Drag them all into the dimensions list; they will then be static dimensions. For your use case you could also just drag the Date field to Rows. Aggregating a single value, which is what you have for each date, returns the same value whatever the aggregation type.
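If you would rather solve it on the SQL side, one option (a hedged sketch, assuming a PostgreSQL-style database with GROUPING SETS support) is to compute the per-day rows and the overall row in a single pass; the overall row comes back with date IS NULL, so you can filter on that in Tableau:
SELECT
    date,
    COUNT(DISTINCT id) AS count_of_id,
    AVG(timediff_in_sec) AS avg_timediff,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY timediff_in_sec) AS percentile_25,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY timediff_in_sec) AS percentile_50
FROM (
    --subquery
) AS t1
GROUP BY GROUPING SETS ((date), ()); -- (date) = per-day rows, () = one overall row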

How can I `SUM()` in PostgreSQL based on a certain condition? Summing debits and credits in an accounting journal table

I have a database full of accounting journals. There is a table for the accounting journal itself (the journal's metadata) and a table for the accounting journal lines (one per account, with its debit or credit).
The data looks like this:
+----+--------------+-------+--------+
| ID | JOURNAL_NAME | DEBIT | CREDIT |
+----+--------------+-------+--------+
|  1 | INV/0001     |   100 |      0 |
|  2 | INV/0001     |     0 |    100 |
|  3 | INV/0002     |   200 |      0 |
|  4 | INV/0002     |     0 |    200 |
+----+--------------+-------+--------+
I want all journals with the same name to be summed into one row, with their debits and credits. So from the above table... I want a query that produces something like this:
+--------------+-------+--------+
| JOURNAL_NAME | DEBIT | CREDIT |
+--------------+-------+--------+
| INV/0001     |   100 |    100 |
| INV/0002     |   200 |    200 |
+--------------+-------+--------+
I have tried with:
SELECT DISTINCT ON (accounting_journal.id)
    accounting_journal.name,
    accounting_journal_line.debit,
    accounting_journal_line.credit
FROM accounting_journal_line
JOIN accounting_journal ON accounting_journal.id = accounting_journal_line.move_id
ORDER BY accounting_journal.id ASC
LIMIT 3;
With the above query, I get all the journals and journal lines. I just need the query to sum the debits and credits for every accounting_journal.name.
I have tried with SUM(), but it always gets stuck on the GROUP BY clause.
SELECT DISTINCT ON (accounting_journal.id)
    accounting_journal.name,
    accounting_journal.ref,
    accounting_journal_line.name,
    SUM(accounting_journal_line.debit),
    SUM(accounting_journal_line.credit)
FROM accounting_journal_line
JOIN accounting_journal ON accounting_journal.id = accounting_journal_line.move_id
ORDER BY accounting_journal.id ASC
LIMIT 3;
The error:
Error in query (7): ERROR: column "accounting_journal.name" must appear in the GROUP BY clause or be used in an aggregate function
LINE 2: accounting_journal.name,
I hope I can get assistance or a pointer on where to look here. Thanks!
When you use an aggregate function together with normal columns, you have to mention all the non-aggregated columns in the GROUP BY clause.
So try this:
SELECT
    accounting_journal.name,
    accounting_journal.ref,
    accounting_journal_line.name,
    SUM(accounting_journal_line.debit),
    SUM(accounting_journal_line.credit)
FROM accounting_journal_line
JOIN accounting_journal ON accounting_journal.id = accounting_journal_line.move_id
GROUP BY 1, 2, 3
ORDER BY accounting_journal.name ASC
LIMIT 3;
Your query has 3 non-aggregated columns, so you can reference them by position in the GROUP BY clause (GROUP BY 1, 2, 3). Note that DISTINCT ON (accounting_journal.id) has to go: accounting_journal.id is neither grouped nor aggregated, so PostgreSQL would reject it for the same reason as before.
You can use SUM() as a window function, which does not require GROUP BY. So:
select aj.id journal_id,
       aj.name journal_name,
       aj.ref journal_ref,
       ajl.name line_name,
       sum(ajl.debit) over (partition by aj.id) total_debit,
       sum(ajl.credit) over (partition by aj.id) total_credit
from accounting_journal_line ajl
join accounting_journal aj
  on aj.id = ajl.move_id
order by aj.id;
See fiddle for a working example.
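If the goal is exactly the summed-per-journal output from the question (one row per journal name), a plain GROUP BY is also enough; a minimal sketch using the question's table and column names:
SELECT aj.name AS journal_name,
       SUM(ajl.debit)  AS debit,
       SUM(ajl.credit) AS credit
FROM accounting_journal_line ajl
JOIN accounting_journal aj ON aj.id = ajl.move_id
GROUP BY aj.name
ORDER BY aj.name;
On the sample data this returns INV/0001 with 100/100 and INV/0002 with 200/200.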

Output of Show Meta in SphinxQL

I am trying to check whether my config has issues or I am just not understanding SHOW META correctly.
If I add a regex to the config:
regexp_filter = NY => New York
then if I do a SphinxQL search on 'NY':
SELECT * FROM index WHERE MATCH('NY')
and then SHOW META, it should show keyword[0]=new and keyword[1]=york, not ny. Is that correct?
And if it does not, does that mean my config is not working as intended?
it should show keyword[0]=new and keyword[1]=york, not ny. Is that correct?
This is correct. When you do MATCH('NY') and have the NY => New York regexp conversion, Sphinx first converts NY into New York and only then starts searching, i.e. it forgets about NY completely. The same happens when indexing: it first prepares the tokens, then indexes them, forgetting about the original text.
To demonstrate (this is Manticore, a fork of Sphinx, but in terms of how regexp_filter is processed and how it affects searching it works the same way as Sphinx):
mysql> create table t(f text) regexp_filter='NY=>New York';
Query OK, 0 rows affected (0.01 sec)
mysql> insert into t values(0, 'I love New York');
Query OK, 1 row affected (0.01 sec)
mysql> select * from t where match('NY');
+---------------------+-----------------+
| id                  | f               |
+---------------------+-----------------+
| 2810862456614682625 | I love New York |
+---------------------+-----------------+
1 row in set (0.01 sec)
mysql> show meta;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total         | 1     |
| total_found   | 1     |
| time          | 0.000 |
| keyword[0]    | new   |
| docs[0]       | 1     |
| hits[0]       | 1     |
| keyword[1]    | york  |
| docs[1]       | 1     |
| hits[1]       | 1     |
+---------------+-------+
9 rows in set (0.00 sec)
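To double-check tokenization without inserting any data, you can also ask the engine directly (a sketch against the demo table t above; CALL KEYWORDS exists in both Sphinx and Manticore and runs the text through the index's tokenization pipeline, which should include regexp_filter):
CALL KEYWORDS('NY', 't');
If the filter is active, the keywords listed should be new and york rather than ny.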

postgres LAG() using wrong previous value

Take the following data and queries:
create table if not exists my_example (
    a_group  varchar(1),
    the_date date,
    metric   numeric(4,3)
);
INSERT INTO my_example
VALUES ('1', '2018-12-14', 0.514),
       ('1', '2018-12-15', 0.532),
       ('2', '2018-12-15', 0.252),
       ('3', '2018-12-14', 0.562),
       ('3', '2018-12-15', 0.361);
select
    t1.the_date,
    t1.a_group,
    t1.metric AS current_metric,
    lag(t1.metric, 1) OVER (ORDER BY t1.a_group, t1.the_date) AS previous_metric
from
    my_example t1;
Which yields the following results:
+------------+---------+----------------+-----------------+
| the_date   | a_group | current_metric | previous_metric |
+------------+---------+----------------+-----------------+
| 2018-12-14 | 1       | 0.514          | NULL            |
| 2018-12-15 | 1       | 0.532          | 0.514           |
| 2018-12-15 | 2       | 0.252          | 0.532           |
| 2018-12-14 | 3       | 0.562          | 0.252           |
| 2018-12-15 | 3       | 0.361          | 0.562           |
+------------+---------+----------------+-----------------+
I expected the value of previous_metric for the lone a_group==2 row to be NULL. However, as you can see, the value is showing as 0.532, which is being picked up from the previous row. How can I modify this query to yield a value of NULL as I expected?
You need to use LAG with a partition on a_group, since you want the lag value computed within each group rather than across the whole result set:
SELECT
    t1.the_date,
    t1.a_group,
    t1.metric AS current_metric,
    LAG(t1.metric, 1) OVER (PARTITION BY t1.a_group ORDER BY t1.the_date) AS previous_metric
FROM my_example t1;
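With the partition in place, the window restarts for each a_group, so the first row of every group (including the lone a_group = 2 row) gets NULL. On the sample data above this yields (ordered by a_group, the_date for readability):
+------------+---------+----------------+-----------------+
| the_date   | a_group | current_metric | previous_metric |
+------------+---------+----------------+-----------------+
| 2018-12-14 | 1       | 0.514          | NULL            |
| 2018-12-15 | 1       | 0.532          | 0.514           |
| 2018-12-15 | 2       | 0.252          | NULL            |
| 2018-12-14 | 3       | 0.562          | NULL            |
| 2018-12-15 | 3       | 0.361          | 0.562           |
+------------+---------+----------------+-----------------+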