Sphinx morphology stem_en not working

I have a single-field Sphinx index with stemming set up as follows:
index main_sphinxalert
{
    # Options:
    type       = rt
    path       = /var/lib/sphinxsearch/data/main_sphinxalert
    morphology = stem_en

    # Fields:
    rt_field   = query
}
I then insert:
mysql> INSERT INTO main_sphinxalert VALUES(201,'tables');
Query OK, 1 row affected (0.00 sec)
And select:
mysql> SELECT * FROM main_sphinxalert WHERE MATCH('tables');
+------+--------+
| id   | weight |
+------+--------+
|  201 |   1709 |
+------+--------+
1 row in set (0.00 sec)
But can't select on a stem:
mysql> SELECT * FROM main_sphinxalert WHERE MATCH('table');
Empty set (0.00 sec)
Can anyone tell me what gives?

Turns out Sphinx will not read modifications to the config file for a real-time index after the index has been created: see here for info.
The way around this is to:
sudo service sphinxsearch stop
Then delete all index files (for me held in /var/lib/sphinxsearch/data/):
sudo rm main_sphinxalert.*
Then restart service:
sudo service sphinxsearch start
This means you lose your existing index, which will then have to be manually rebuilt from your database.
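After the restart the index is recreated from the modified config, and stemming can be verified by repeating the insert from the question and matching on the stem (a quick check, using the same example data as above):
mysql> INSERT INTO main_sphinxalert VALUES(201,'tables');
mysql> SELECT * FROM main_sphinxalert WHERE MATCH('table');
With stem_en active at index creation time, both 'tables' and 'table' reduce to the same stem, so the second query should now return row 201.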

What is the seq_name column in the ag_label table?

I'm working on a new feature involving labels for Apache AGE, and I'm looking into the tools I can work with.
In the psql interface, entering the command SELECT * FROM ag_catalog.ag_label; shows the following output:
       name        | graph  | id | kind |       relation        |        seq_name
-------------------+--------+----+------+-----------------------+-------------------------
 _ag_label_vertex  | 495486 |  1 | v    | test._ag_label_vertex | _ag_label_vertex_id_seq
 _ag_label_edge    | 495486 |  2 | e    | test._ag_label_edge   | _ag_label_edge_id_seq
 vtx_label         | 495486 |  3 | v    | test.vtx_label        | vtx_label_id_seq
 elabel            | 495486 |  4 | e    | test.elabel           | elabel_id_seq
I came across this and wasn't able to figure out what kind of data I can retrieve from it, what it is used for, or how it can help me.
Can you explain the seq_name column?
seq_name refers to sequences. Sequences are single-row tables that can be thought of as 'number generators': they start at some minimum integer value and increment as values are 'consumed'.
A sequence associated with a column can be used to assign values to it. For example, a sequence 'mytable_seq_id' associated with the column 'id' of a table 'mytable' might start at 1; as you add more entries to mytable, the 'id' column takes the values 2, 3, and so on.
Postgres docs on creating sequences:
https://www.postgresql.org/docs/current/sql-createsequence.html
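As a minimal sketch of that mechanism (generic Postgres, not AGE-specific; all names here are made up for the example):
CREATE SEQUENCE mytable_id_seq START 1;
CREATE TABLE mytable (
    id   integer NOT NULL DEFAULT nextval('mytable_id_seq'),
    name text
);
INSERT INTO mytable (name) VALUES ('first'), ('second');
SELECT * FROM mytable; -- the id column comes out as 1, 2, ...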
As for AGE, here's a comment taken directly from the 'graph_commands.c' source file. It describes how sequences are used to generate label ids.
static Oid create_schema_for_graph(const Name graph_name)
{
    char *graph_name_str = NameStr(*graph_name);
    CreateSchemaStmt *schema_stmt;
    CreateSeqStmt *seq_stmt;
    TypeName *integer;
    DefElem *data_type;
    DefElem *maxvalue;
    DefElem *cycle;
    Oid nsp_id;

    /*
     * This is the same with running the following SQL statement.
     *
     * CREATE SCHEMA `graph_name`
     *     CREATE SEQUENCE `LABEL_ID_SEQ_NAME`
     *         AS integer
     *         MAXVALUE `LABEL_ID_MAX`
     *         CYCLE
     *
     * The sequence will be used to assign a unique id to a label in the graph.
     *
     * schemaname doesn't have to be graph_name but the same name is used so
     * that users can find the backed schema for a graph only by its name.
     *
     * ProcessUtilityContext of this command is PROCESS_UTILITY_SUBCOMMAND
     * so the event trigger will not be fired.
     */
Note that sequences are used in other functions in the AGE internals as well, and the above function is just one example.
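In practice, the sequence named in seq_name lives in the graph's schema and can be inspected like any ordinary Postgres sequence (using the graph name 'test' from the output above; note that nextval actually consumes an id):
SELECT last_value FROM test.vtx_label_id_seq; -- current counter value
SELECT nextval('test.vtx_label_id_seq');      -- advances and returns the next value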

What is the simplest way to migrate data from MySQL to DB2

I need to migrate data from MySQL to DB2. Both DBs are up and running.
I tried mysqldump with --no-create-info --extended-insert=FALSE --complete-insert, and with a few changes to the output (e.g. changing ` to ") I get a satisfactory result, but sometimes I hit weird exceptions, like:
does not have an ending string delimiter. SQLSTATE=42603
Ideally I would want to have a routine that is as general as possible, but as an example here, let's say I have a DB2 table that looks like:
db2 => describe table "mytable"

                                Data type                     Column
Column name                     schema    Data type name      Length     Scale Nulls
------------------------------- --------- ------------------- ---------- ----- ------
id                              SYSIBM    BIGINT                       8     0 No
name                            SYSIBM    VARCHAR                    512     0 No

2 record(s) selected.
Its MySQL counterpart being
mysql> describe mytable;
+-------+--------------+------+-----+---------+----------------+
| Field | Type         | Null | Key | Default | Extra          |
+-------+--------------+------+-----+---------+----------------+
| id    | bigint(20)   | NO   | PRI | NULL    | auto_increment |
| name  | varchar(512) | NO   |     | NULL    |                |
+-------+--------------+------+-----+---------+----------------+
2 rows in set (0.01 sec)
Let's assume the DB2 and MySQL databases are called mydb.
Now, if I do:
# mysqldump options: do not output the CREATE TABLE statement,
# one INSERT statement per record, output table column names
mysqldump -uroot mydb mytable --no-create-info --extended-insert=FALSE --complete-insert |
sed -n -e '/^INSERT/p' | # only keep lines beginning with "INSERT"
sed 's/`/"/g' |          # replace ` with "
sed 's/;$//g' |          # remove `;` at end of insert query
sed "s/\\\'/''/g"        # replace `\'` with `''`, see http://stackoverflow.com/questions/2442205/how-does-one-escape-an-apostrophe-in-db2-sql and http://stackoverflow.com/questions/2369314/why-does-sed-require-3-backslashes-for-a-regular-backslash
I get:
INSERT INTO "mytable" ("id", "name") VALUES (1,'record 1')
INSERT INTO "mytable" ("id", "name") VALUES (2,'record 2')
INSERT INTO "mytable" ("id", "name") VALUES (3,'record 3')
INSERT INTO "mytable" ("id", "name") VALUES (4,'record 4')
INSERT INTO "mytable" ("id", "name") VALUES (5,'" "" '' '''' \"\" ')
This output can be used as a DB2 query and it works well.
Any ideas on how to solve this more efficiently/generally? Any other suggestions?
After having played around a bit, I came up with the following routine, which I believe to be fairly general, robust and scalable.
1. Run the following command:
# mysqldump options: do not output the CREATE TABLE statement,
# one INSERT statement per record, output table column names
mysqldump -uroot mydb mytable --no-create-info --extended-insert=FALSE --complete-insert |
sed -n -e '/^INSERT/p' | # only keep lines beginning with "INSERT"
sed 's/`/"/g' |          # replace ` with "
sed -e 's/\\"/"/g' |     # replace `\"` with `"` (mysql escapes double quotes)
sed "s/\\\'/''/g" > out.sql # replace `\'` with `''`, see http://stackoverflow.com/questions/2442205/how-does-one-escape-an-apostrophe-in-db2-sql and http://stackoverflow.com/questions/2369314/why-does-sed-require-3-backslashes-for-a-regular-backslash
Note: here, unlike in the question, the trailing ; is not removed, since db2's -t flag (step 3) uses ; as the statement terminator.
2. Upload the file to the DB2 server:
scp out.sql user@myserver:out.sql
3. Run the queries from the file (-t: treat ; as statement terminator, -v: echo each statement, -s: stop on error, -f: read from file):
db2 -tvsf /path/to/query/file/out.sql
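As a quick sanity check after the import, the row counts on both sides should match (table name as in the example above; adjust to yours):
mysql> SELECT COUNT(*) FROM mytable;
db2 => SELECT COUNT(*) FROM "mytable"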

Hyphen as blend_char doesn't seem to work

MariaDB> select id,name from t where type='B' and name='Foo-Bar';
+----------------+---------+
| item_source_id | name    |
+----------------+---------+
|        2000245 | Foo-Bar |
+----------------+---------+
1 row in set (0.00 sec)
index base_index { # Don't use this directly; it's for inheritance only.
    blend_chars = +, &, U+23, U+22, U+27, -, /
    blend_mode  = trim_none, trim_head, trim_tail, trim_both
}
source b_source : base_source {
    sql_query        = select id,name from t where type='B'
    sql_field_string = name
}
index b_index_lemma : base_index {
    source     = b_source
    path       = /path/b_index_lemma
    morphology = lemmatize_en_all
}
SphinxQL> select * from b_index_lemma where match('Foo-Bar');
Empty set (0.00 sec)
Other Sphinx queries return results, so the problem isn't e.g. that the index is empty. Yet the hyphenated form finds nothing, and I'd like it to. Am I misusing blend_chars together with blend_mode?

skipping non-plain index rt (sphinx 2.1.6)

Here is the question. Sphinx version 2.1.6. I am trying to use an rt (real-time) index, but when indexing, the following message is displayed in the console:
using config file 'sphinx.conf'...
skipping non-plain index 'rt'...
But when I connect to searchd and run the query mysql> desc rt, it displays:
+------------+--------+
| Field      | Type   |
+------------+--------+
| id         | bigint |
| id         | field  |
| first_name | field  |
| last_name  | field  |
+------------+--------+
Is this default data? It does not match my configuration. How do I work with the rt index?
sphinx.conf:
source database
{
    type                = mysql
    sql_host            = 127.0.0.1
    sql_user            = test
    sql_pass            = test
    sql_db              = community
    sql_port            = 3306
    mysql_connect_flags = 32 # enable compression
    sql_query_pre       = SET NAMES utf8
    sql_query_pre       = SET SESSION query_cache_type=OFF
}
source rt : database
{
    sql_query_range     = SELECT MIN(id),MAX(id) FROM mbt_accounts
    sql_query           = SELECT id AS 'accountId', first_name AS 'fname', last_name AS 'lname' FROM mbt_accounts WHERE id >= 0 AND id <= 1000
    sql_range_step      = 1000
    sql_ranged_throttle = 1000 # milliseconds
}
index rt
{
    source         = rt
    type           = rt
    path           = /etc/sphinxsearch/rtindex
    rt_mem_limit   = 700M
    rt_field       = accountId
    rt_field       = fname
    rt_field       = lname
    rt_attr_string = fname
    rt_attr_string = lname
    charset_type   = utf-8
    charset_table  = 0..9, A..Z->a..z, _, -, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451
}
searchd
{
    listen                    = localhost:9312 # port for API
    listen                    = localhost:9306:mysql41 # port for SphinxQL
    log                       = /var/log/sphinxsearch/searchd.log
    binlog_path               = /var/log/sphinxsearch/
    query_log                 = /var/log/sphinxsearch/query.log
    query_log_format          = sphinxql
    pid_file                  = /var/run/sphinxsearch/searchd.pid
    workers                   = threads
    max_matches               = 1000
    read_timeout              = 5
    client_timeout            = 300
    max_children              = 30
    max_packet_size           = 8M
    binlog_flush              = 2
    binlog_max_log_size       = 90M
    thread_stack              = 8M
    expansion_limit           = 500
    rt_flush_period           = 1800
    collation_server          = utf8_general_ci
    compat_sphinxql_magics    = 0
    prefork_rotation_throttle = 100
}
Thanks.
indexer only works with indexes that have a 'source' - i.e. plain disk indexes. indexer runs the queries defined in the source to fetch the data and build the index.
RT (real-time) indexes work very differently. indexer is not involved with RT indexes at all; they are handled entirely by searchd.
To add data to an RT index, you need to run SphinxQL commands (INSERT, UPDATE etc.) that actually add the data to the index.
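For example, with the index above, something like this over the SphinxQL port (9306 in this config; the values are made up for illustration):
mysql -h 127.0.0.1 -P 9306
mysql> INSERT INTO rt (id, accountId, fname, lname) VALUES (1, '100', 'John', 'Smith');
mysql> SELECT * FROM rt WHERE MATCH('smith');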
(DESCRIBE works because searchd knows the 'structure' of the index - you told it via the rt_field etc. - even if you never inserted any data.)
Ah, I think you are asking why the structure is different. That's probably because the index was created before you modified sphinx.conf. If you change the definition of an RT index, you need to 'destroy' the index to allow it to be recreated.
The simplest way is to shut down searchd, delete the index files, delete the binlog (it's no longer relevant) and then restart searchd.
searchd --stopwait
rm /etc/sphinxsearch/rtindex*
rm /path/to/binlog* # (you don't define a path, so it must be the default, which varies)
searchd             # (starts searchd again)

EXISTS(select 1 from t1) vs EXISTS(select * from t1) [duplicate]

I used to write my EXISTS checks like this:
IF EXISTS (SELECT * FROM TABLE WHERE Columns=@Filters)
BEGIN
UPDATE TABLE SET ColumnsX=ValuesX WHERE Columns=@Filters
END
One of the DBAs in a previous life told me that when I do an EXISTS clause, I should use SELECT 1 instead of SELECT *:
IF EXISTS (SELECT 1 FROM TABLE WHERE Columns=@Filters)
BEGIN
UPDATE TABLE SET ColumnsX=ValuesX WHERE Columns=@Filters
END
END
Does this really make a difference?
No, SQL Server is smart and knows it is being used for an EXISTS, and returns NO DATA to the system.
Quoth Microsoft:
http://technet.microsoft.com/en-us/library/ms189259.aspx?ppud=4
The select list of a subquery introduced by EXISTS almost always consists of an asterisk (*). There is no reason to list column names because you are just testing whether rows that meet the conditions specified in the subquery exist.
To check yourself, try running the following:
SELECT whatever
FROM yourtable
WHERE EXISTS( SELECT 1/0
FROM someothertable
WHERE a_valid_clause )
If it was actually doing something with the SELECT list, it would throw a div by zero error. It doesn't.
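For contrast, evaluating the same select list outside an EXISTS does raise the error:
SELECT 1/0
-- Msg 8134, Level 16: Divide by zero error encountered.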
EDIT: Note, the SQL Standard actually talks about this.
ANSI SQL 1992 Standard, pg 191 http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
3) Case:
a) If the <select list> "*" is simply contained in a <subquery> that is immediately contained in an <exists predicate>, then the <select list> is equivalent to a <value expression> that is an arbitrary <literal>.
The reason for this misconception is presumably the belief that the query will end up reading all columns. It is easy to see that this is not the case.
CREATE TABLE T
(
X INT PRIMARY KEY,
Y INT,
Z CHAR(8000)
)
CREATE NONCLUSTERED INDEX NarrowIndex ON T(Y)
IF EXISTS (SELECT * FROM T)
PRINT 'Y'
Gives this plan (execution plan image omitted): a scan of NarrowIndex under a semi join.
This shows that SQL Server was able to use the narrowest index available to check the result despite the fact that the index does not include all columns. The index access is under a semi join operator which means that it can stop scanning as soon as the first row is returned.
So it is clear the above belief is wrong.
However Conor Cunningham from the Query Optimiser team explains here that he typically uses SELECT 1 in this case as it can make a minor performance difference in the compilation of the query.
The QP will take and expand all *'s early in the pipeline and bind them to objects (in this case, the list of columns). It will then remove unneeded columns due to the nature of the query.
So for a simple EXISTS subquery like this:
SELECT col1 FROM MyTable WHERE EXISTS (SELECT * FROM Table2 WHERE MyTable.col1=Table2.col2)
The * will be expanded to some potentially big column list and then it will be determined that the semantics of the EXISTS does not require any of those columns, so basically all of them can be removed.
"SELECT 1" will avoid having to examine any unneeded metadata for that table during query compilation.
However, at runtime the two forms of the query will be identical and will have identical runtimes.
I tested four possible ways of expressing this query on an empty table with various numbers of columns: SELECT 1 vs SELECT * vs SELECT Primary_Key vs SELECT Other_Not_Null_Column.
I ran the queries in a loop using OPTION (RECOMPILE) and measured the average number of executions per second. Results below:
+-------------+----------+---------+---------+--------------+
| Num of Cols |    *     |    1    |   PK    | Not Null col |
+-------------+----------+---------+---------+--------------+
|           2 |   2043.5 | 2043.25 |  2073.5 |       2067.5 |
|           4 |  2038.75 | 2041.25 |  2067.5 |       2067.5 |
|           8 |  2015.75 |    2017 | 2059.75 |         2059 |
|          16 |  2005.75 | 2005.25 | 2025.25 |      2035.75 |
|          32 |  1963.25 | 1967.25 | 2001.25 |      1992.75 |
|          64 |     1903 |    1904 | 1936.25 |      1939.75 |
|         128 |  1778.75 | 1779.75 |    1799 |      1806.75 |
|         256 |  1530.75 |  1526.5 | 1542.75 |      1541.25 |
|         512 |     1195 | 1189.75 | 1203.75 |       1198.5 |
|        1024 |   694.75 |     697 |     699 |       699.25 |
+-------------+----------+---------+---------+--------------+
|       Total | 17169.25 |   17171 |   17408 |        17408 |
+-------------+----------+---------+---------+--------------+
As can be seen there is no consistent winner between SELECT 1 and SELECT * and the difference between the two approaches is negligible. The SELECT Not Null col and SELECT PK do appear slightly faster though.
All four of the queries degrade in performance as the number of columns in the table increases.
As the table is empty, this relationship seems explicable only by the amount of column metadata being loaded. For COUNT(1), it is easy to see from the output below that it gets rewritten to COUNT(*) at some point in the process.
SET SHOWPLAN_TEXT ON;
GO
SELECT COUNT(1)
FROM master..spt_values
Which gives the following plan
|--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1004],0)))
|--Stream Aggregate(DEFINE:([Expr1004]=Count(*)))
|--Index Scan(OBJECT:([master].[dbo].[spt_values].[ix2_spt_values_nu_nc]))
Attaching a debugger to the SQL Server process and randomly breaking whilst executing the below
DECLARE @V int
WHILE (1=1)
SELECT @V=1 WHERE EXISTS (SELECT 1 FROM ##T) OPTION(RECOMPILE)
I found that in the cases where the table has 1,024 columns, most of the time the call stack looks something like the below, indicating that a large proportion of the time is indeed spent loading column metadata even when SELECT 1 is used (for the case where the table has a single column, randomly breaking didn't hit this part of the call stack in 10 attempts):
sqlservr.exe!CMEDAccess::GetProxyBaseIntnl() - 0x1e2c79 bytes
sqlservr.exe!CMEDProxyRelation::GetColumn() + 0x57 bytes
sqlservr.exe!CAlgTableMetadata::LoadColumns() + 0x256 bytes
sqlservr.exe!CAlgTableMetadata::Bind() + 0x15c bytes
sqlservr.exe!CRelOp_Get::BindTree() + 0x98 bytes
sqlservr.exe!COptExpr::BindTree() + 0x58 bytes
sqlservr.exe!CRelOp_FromList::BindTree() + 0x5c bytes
sqlservr.exe!COptExpr::BindTree() + 0x58 bytes
sqlservr.exe!CRelOp_QuerySpec::BindTree() + 0xbe bytes
sqlservr.exe!COptExpr::BindTree() + 0x58 bytes
sqlservr.exe!CScaOp_Exists::BindScalarTree() + 0x72 bytes
... Lines omitted ...
msvcr80.dll!_threadstartex(void * ptd=0x0031d888) Line 326 + 0x5 bytes C
kernel32.dll!_BaseThreadStart@8() + 0x37 bytes
This manual profiling attempt is backed up by the VS 2012 code profiler which shows a very different selection of functions consuming the compilation time for the two cases (Top 15 Functions 1024 columns vs Top 15 Functions 1 column).
Both the SELECT 1 and SELECT * versions wind up checking column permissions and fail if the user is not granted access to all columns in the table.
An example I cribbed from a conversation on the heap
CREATE USER blat WITHOUT LOGIN;
GO
CREATE TABLE dbo.T
(
X INT PRIMARY KEY,
Y INT,
Z CHAR(8000)
)
GO
GRANT SELECT ON dbo.T TO blat;
DENY SELECT ON dbo.T(Z) TO blat;
GO
EXECUTE AS USER = 'blat';
GO
SELECT 1
WHERE EXISTS (SELECT 1
FROM T);
/* ↑↑↑↑
Fails unexpectedly with
The SELECT permission was denied on the column 'Z' of the
object 'T', database 'tempdb', schema 'dbo'.*/
GO
REVERT;
DROP USER blat
DROP TABLE T
So one might speculate that the minor apparent difference when using SELECT some_not_null_col is that it only winds up checking permissions on that specific column (though it still loads the metadata for all). However, this doesn't seem to fit the facts, as the percentage difference between the two approaches, if anything, gets smaller as the number of columns in the underlying table increases.
In any event I won't be rushing out and changing all my queries to this form as the difference is very minor and only apparent during query compilation. Removing the OPTION (RECOMPILE) so that subsequent executions can use a cached plan gave the following.
+-------------+-----------+------------+-----------+--------------+
| Num of Cols |     *     |     1      |    PK     | Not Null col |
+-------------+-----------+------------+-----------+--------------+
|           2 | 144933.25 |     145292 | 146029.25 |     143973.5 |
|           4 |    146084 |   146633.5 | 146018.75 |    146581.25 |
|           8 | 143145.25 |  144393.25 |  145723.5 |    144790.25 |
|          16 | 145191.75 |     145174 |  144755.5 |    146666.75 |
|          32 |    144624 |  145483.75 |    143531 |    145366.25 |
|          64 | 145459.25 |  146175.75 | 147174.25 |     146622.5 |
|         128 | 145625.75 |  143823.25 |    144132 |    144739.25 |
|         256 | 145380.75 |     147224 | 146203.25 |    147078.75 |
|         512 |    146045 |  145609.25 | 145149.25 |     144335.5 |
|        1024 |    148280 |     148076 | 145593.25 |    146534.75 |
+-------------+-----------+------------+-----------+--------------+
|       Total |   1454769 | 1457884.75 |   1454310 |   1456688.75 |
+-------------+-----------+------------+-----------+--------------+
The test script I used can be found here
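A minimal sketch of the kind of timing loop described above (the table name T and the 10-second window are illustrative assumptions, not the author's actual script):
DECLARE @V INT, @i INT = 0;
DECLARE @end DATETIME2 = DATEADD(SECOND, 10, SYSDATETIME());
WHILE SYSDATETIME() < @end
BEGIN
    -- swap SELECT * for SELECT 1 / the PK / a NOT NULL column to compare the four forms
    SELECT @V = 1 WHERE EXISTS (SELECT * FROM T) OPTION (RECOMPILE);
    SET @i += 1;
END;
SELECT @i / 10.0 AS executions_per_second;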
The best way to know is to performance-test both versions and check out the execution plan for each. Pick a table with lots of columns.
There is no difference in SQL Server, and it has never been a problem. The optimizer knows they are the same: if you look at the execution plans, you will see that they are identical.
Personally I find it very, very hard to believe that they don't optimize to the same query plan. But the only way to know in your particular situation is to test it. If you do, please report back!
No real difference, but there might be a very small performance hit. As a rule of thumb, you should not ask for more data than you need.