Sphinx search by path of a word - sphinx

I have a simple table:
id: primary
name: varchar fulltext index
here is my Sphinx config
https://justpaste.it/1okop
Indexer warns about docinfo
Sphinx 2.2.11-id64-release (95ae9a6)
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/etc/sphinxsearch/sphinx.conf'...
indexing index 'words'...
WARNING: Attribute count is 0: switching to none docinfo
collected 10000 docs, 0.1 MB
sorted 0.0 Mhits, 100.0% done
total 10000 docs, 79566 bytes
total 0.065 sec, 1210829 bytes/sec, 152179.20 docs/sec
total 3 reads, 0.000 sec, 94.6 kb/call avg, 0.0 msec/call avg
total 9 writes, 0.000 sec, 47.5 kb/call avg, 0.0 msec/call avg
But it's said here, in "Sphinx: WARNING: Attribute count is 0: switching to none docinfo", that this warning is nothing serious.
OK, so I start the service and search for part of a word:
SELECT *
FROM test_words
where match (name) AGAINST ('lema')
No rows.
The same for
SELECT *
FROM test_words
where match (name) AGAINST ('*lema*')
No rows.
And at the same time there are results for the query
SELECT *
FROM `test_words`
where position('lema' in name)>0
So as far as I can see, Sphinx is not searching by part of a word.
Why, and how do I fix it?
And - if I uncomment
min_infix_len = 3
infix_fields = name
I get
WARNING: index 'words': prefix_fields and infix_fields has no effect with dict=keywords, ignoring
And one more thing: SHOW ENGINES shows no Sphinx engine; is that normal now? The mysql service was restarted.
All SQL queries were run through Adminer, logged in to localhost:3312.

There are no functions like position() or syntax like match(name) against (...) in Sphinx. Also, according to your config your index is called 'words', while your queries run against 'test_words', which is the source table you build the index from. So it seems to me you're connecting not to Sphinx, but to MySQL.
If you're looking for closer integration between MySQL and Sphinx, try SphinxSE (http://sphinxsearch.com/docs/devel.html#sphinxse); it will then show up in SHOW ENGINES. If you don't want to deal with compiling MySQL with SphinxSE enabled, you might want to try Manticore Search (a fork of Sphinx), since it integrates with the FEDERATED engine (https://docs.manticoresearch.com/2.6.4/html/federated_storage_engine.html), which is compiled into MySQL by default; you just need to start mysqld with it enabled.
If you want to use Sphinx the traditional way, just make sure you connect to it, not to MySQL.
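For illustration, a minimal sketch of querying Sphinx directly over SphinxQL (assuming searchd has a SphinxQL listener; 9306 is the usual default port, but check the listen directive in your sphinx.conf):
-- connect with: mysql -h127.0.0.1 -P9306
-- note the index name 'words' from the config, not the source table name
SELECT * FROM words WHERE MATCH('lema');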

In Sphinx this is called extended query syntax; you specify the field with @field. Try this:
SELECT * FROM test_words
WHERE MATCH('@name lema')
http://sphinxsearch.com/docs/current.html#extended-syntax
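And to get the part-of-a-word matching from the original question: with dict=keywords, infix_fields has no effect (hence the warning), but setting min_infix_len alone should enable wildcard searching. A sketch, assuming min_infix_len = 3 is set and the index has been rebuilt:
SELECT * FROM test_words
WHERE MATCH('@name *lema*')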

Related

What is an updated index configuration of Google Firestore in Datastore mode?

Since Nov 08 2022, 16h UTC, we sometimes get the following DatastoreException with code: UNAVAILABLE, and message:
Query timed out. Please try either limiting the entities scanned, or run with an updated index configuration.
I want to get all Keys of a certain kind of entities. These are returned in batches together with a new cursor. When using the cursor to get the next batch, then the above stated error happens. I am expecting that the query does not time out so fast. (It might be that it takes up to a few seconds until I am requesting the next batch of Keys using the returned cursor, but this never used to be a problem in the past.)
There was no problem before the automatic upgrade to Firestore. Also, counting entities of a kind often results in the error DatastoreException: "The datastore operation timed out, or the data was temporarily unavailable."
I am wondering whether I have to make any changes on my side. Does anybody else encounter these problems with Firestore in Datastore mode?
What is meant by "an updated index configuration"?
Thanks
Stefan
I just wanted to follow up here since we were able to do detailed analysis and come up with a workaround. I wanted to record our findings here for posterity's sake.
The root of the problem is queries over large ranges of deleted keys. Given a schema like:
Kind: ExampleKind
Data:
Key                 lastUpdatedMillis
ExampleKind/1040    5
ExampleKind/1052    0
ExampleKind/1064    12
ExampleKind/1065    100
ExampleKind/1070    42
Datastore will automatically generate both an ASC and a DESC index on the lastUpdatedMillis field.
The lastUpdatedMillis ASC index table would then have the following logical entries:
Index Key    Entity Key
0            ExampleKind/1052
5            ExampleKind/1040
12           ExampleKind/1064
42           ExampleKind/1070
100          ExampleKind/1065
In the workload you've described, there was an operation that did the following:
1. SELECT * FROM ExampleKind WHERE lastUpdatedMillis <= nowMillis()
2. For every ExampleKind entity returned by the query, perform some operation which updates lastUpdatedMillis.
3. Some of the updates may fail, so the query from step 1 is repeated to catch any remaining entities.
When the operation completes, there are large key ranges in the index tables that are deleted, but in the storage system these rows still exist with special deletion markers. They are visible internally to queries, but are filtered out of the results:
Index Key    Entity Key
x            xxxx
x            xxxx
x            xxxx
42           ExampleKind/1070
...          and so on ...
x            xxxx
When we repeat the query over this data, if the number of deleted rows is very large (100_000 ... 1_000_000), the storage system may spend the entire operation looking for non-deleted data in this range. Eventually the Garbage Collection and Compaction mechanisms will remove the deleted rows and querying this key range becomes fast again.
A reasonable workaround is to reduce the amount of work the query has to do by restricting the time range of the lastUpdatedMillis field.
For example, instead of scanning the entire range of lastUpdatedMillis < now, we could break up the query into:
(now - 60 minutes) <= lastUpdatedMillis < now
(now - 120 minutes) <= lastUpdatedMillis < (now - 60 minutes)
(now - 180 minutes) <= lastUpdatedMillis < (now - 120 minutes)
This example uses 60 minute ranges, however the specific "chunk size" can be tuned to the shape of your data. These smaller queries will either succeed and find some results, or scan the entire key range and return 0 results, however in both scenarios they will complete within the RPC deadline.
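For concreteness, a sketch of one such chunk in SQL-style pseudo-code (nowMillis() is the pseudo-function from the steps above, and the 60-minute chunk size is just this example's choice):
-- one 60-minute chunk; shift both bounds back to walk further into the past
SELECT * FROM ExampleKind
WHERE lastUpdatedMillis >= nowMillis() - 60 * 60 * 1000
AND lastUpdatedMillis < nowMillis()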
Thank you again for reaching out about this!
A couple notes:
This deadlining query problem could occur with any kind of query over the index (projection, keys only, full entity, etc)
Despite what the error message says, no extra index is needed here, nor would one speed up the operation. Datastore's built-in ASC/DESC index over each field already exists and is serving this query.

where column in (single value) performance

I am writing dynamic SQL code, and it would be easier to use a generic where column in (<comma-separated values>) clause, even when the clause might have only one term (it will never have zero).
So, does this query:
select * from table where column in (value1)
have any different performance than
select * from table where column=value1
?
All my tests result in the same execution plans, but if there is some knowledge/documentation that sets it in stone, that would be helpful.
This might not hold true for each and every RDBMS, nor for each and every query with its specific circumstances.
The engine will translate WHERE id IN(1,2,3) to WHERE id=1 OR id=2 OR id=3.
So your two ways to articulate the predicate will (probably) lead to exactly the same interpretation.
As always: we should not really bother about the way the engine "thinks"; that was done pretty well by the developers :-) We tell it - through a statement - what we want to get, not how we want to get it.
Some more details here, especially the first part.
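If you want to check this on your own system, a quick sketch (MySQL syntax assumed; table and column names are placeholders):
-- compare the plans of the two equivalent predicates
EXPLAIN SELECT * FROM my_table WHERE my_column = 1;
EXPLAIN SELECT * FROM my_table WHERE my_column IN (1);
-- identical access type and key in both outputs confirm the engine treats them alike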
I think this will depend on the platform you are using (the optimizer of the given SQL engine).
I did a little test using MySQL Server and:
When I query select * from table where id = 1; I get 1 total, query took 0.0043 seconds.
When I query select * from table where id IN (1); I get 1 total, query took 0.0039 seconds.
I know this depends on the server and PC and whatnot, but the results are very close.
But you have to remember that it's IN with a subquery that can be non-sargable (not "search-argument-able") on some engines; IN with a plain list of constants is just as sargable as = and will use the index.
If you want to know which one is best for you, test them in your environment, because they both work well!

Improve speed by moving to NoSQL

Hello and thank you for reading my question!
Currently, we use PostgreSQL v.10 on 3 nodes through stolon (https://github.com/sorintlab/stolon)
We have 3 tables (I want to keep my question simple):
Invoice (150 000 000 records)
User (35 000 000 records)
User_Address (20 000 000 records)
The main query looks like this (the original query is large, uses a temporary table, and has a lot of WHERE conditions, but this sample shows my problem):
select
i.*
from invoice as i
inner join get_similar_name('Jon') as s on i.name ilike s.name
left join user_address as a on i.user_id = a.user_id
where
a.state = 'OH'
and
i.last_name = 'Smith'
and
i.date between '2016-01-01'::date and '2018-12-31'::date;
The function get_similar_name returns similar names (for example, get_similar_name('Jon') will return John, Jonny, Jonathan ... etc.), on average 200-1000 names. I must use the function :\
The query takes a long time to execute, around 30-120 seconds,
but if I exclude the function get_similar_name from the query, the execution time drops below 1 second.
I have already tuned PostgreSQL and the server performs pretty well. I have also created indexes, and my query doesn't use seq scans, etc.
Partitioning the tables isn't an option for us, because we have too many columns for that; we can't divide the table by just one column.
I am thinking about migrating my warehouse to MongoDB.
My questions are:
Am I right about moving to MongoDB?
Will performance increase if I move the warehouse from PostgreSQL to 20-40 nodes under MongoDB's control?
Is it possible to have the function get_similar_name (or a similar solution) in MongoDB? If yes, how?
Do you have good experience with full-text search in MongoDB?
Is it right to use MongoDB in production this way?
Can you please suggest a direction ("google vector") toward the right solution, in your opinion?
I don't know if moving to MongoDB will solve a text search problem, but Postgres has excellent features like tsvector full-text search and trigram similarity. Have you tried any of these?
https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/
https://www.postgresql.org/docs/9.6/pgtrgm.html
On my previous project, we used pg_trgm and were pretty happy with its performance.
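For reference, a minimal pg_trgm sketch against the invoice table from the question (the index name and query shape are my illustration, not your exact workload):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- a trigram GIN index makes ILIKE '%...%' and similarity searches indexable
CREATE INDEX invoice_name_trgm_idx ON invoice USING gin (name gin_trgm_ops);
-- a similarity search could then replace the get_similar_name() join entirely
SELECT * FROM invoice WHERE name % 'Jon' ORDER BY similarity(name, 'Jon') DESC;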

Export large dataset (JSON) from PostgreSQL

I have a postgres database with geospatial data and I want to export certain parts of those as a GeoJSON.
So I have a SQL-Command like the following:
SELECT jsonb_build_object ( 'some_test_data', jsonb_agg (ST_AsGeoJSON (ST_Transform (way, 4326))::jsonb)) as json
FROM (
SELECT way, name, highway
FROM planet_osm_line
LIMIT 10) result
and that basically works fine. I can also dump the result directly to a file like so:
psql -qtAX -d my-database -f my_cool_sql_command.sql > result.json
So my data is correct and usable, but when I remove the LIMIT 10 I get ERROR: total size of jsonb array elements exceeds the maximum of 268435455 bytes
I've read that it's not easy to remove this 256 MB limit of Postgres... but I guess there are other ways to get my result?
Try the approach of shrinking the output by simplifying the geometry with st_simplify(), and use json_build_object() instead of jsonb_build_object() (json should have a 1 GB limit [the same as the limit for text], not the 256 MB limit of the binary jsonb format):
SELECT json_build_object('some_test_data', json_agg (ST_AsGeoJSON (ST_Transform (way, 4326))::json)) as json
FROM (SELECT st_simplify(way,1) way, name, highway
FROM planet_osm_line) result
You can simplify by more than 1 meter, but with 1 meter you usually don't lose any important vertices in your geometry.
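If even the 1 GB text limit becomes a problem, a possible fallback (my suggestion, not part of the approach above) is to emit one GeoJSON geometry per row and assemble the surrounding object outside the database, so no single value has to hold the whole result:
-- each output row is one standalone GeoJSON geometry
SELECT ST_AsGeoJSON(ST_Transform(way, 4326))
FROM planet_osm_line;
-- wrap the rows into a collection with a script after the psql dump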

How to check the status of long running DB2 query?

I am running a DB2 query that unions two very large tables. I started the query 10 hours ago, and it doesn't seem to have finished yet.
However, when I check the status of the process using top, it shows the status 'S'. Does this mean that my query has stopped running? I couldn't find any error message.
How can I check what is happening to the query?
In DB2 for LUW 11.1 there is a text-based dsmtop utility that allows you to monitor the DB2 instance, down to individual executing statements, in real time. Its pre-11.1 equivalent is called db2top.
There is also a Web-based application, IBM Data Server Manager, which has a free edition with basic monitoring features.
Finally, you can query one of the supplied SQL monitor interfaces, for example, the SYSIBMADM.MON_CURRENT_SQL view:
SELECT session_auth_id,
application_handle,
elapsed_time_sec,
activity_state,
rows_read,
SUBSTR(stmt_text,1,200)
FROM sysibmadm.mon_current_sql
ORDER BY elapsed_time_sec DESC
FETCH FIRST 5 ROWS ONLY
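As a usage note: if your statement shows up in this view and rows_read keeps increasing between runs, the query is still making progress; the 'S' state in top only means the process is sleeping (for example, waiting on I/O), not that it has stopped.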
You can try this command as well:
db2 "SELECT agent_id,
Substr(appl_name, 1, 20) AS APPLNAME,
elapsed_time_min,
Substr(authid, 1, 10) AS AUTH_ID,
agent_id,
appl_status,
Substr(stmt_text, 1, 30) AS STATEMENT
FROM sysibmadm.long_running_sql
WHERE elapsed_time_min > 0
ORDER BY elapsed_time_min desc
FETCH first 5 ROWS only"