How to select columns in the SQL query of a Sphinx source

I'm using the Sphinx indexer to create a dictionary based on documents in my MySQL database, but I can't limit the source's SQL query to selected columns.
This is the command I use:
indexer --buildstops dict.txt 1000 --verbose --print-queries --dump-rows listing_rows --buildfreqs listing_core -c config/development.sphinx.conf
With the following source in development.sphinx.conf, no documents are found and dict.txt ends up empty:
source listing_source {
type = mysql
sql_host = mysql
sql_user = sharetribe
sql_pass = secret
sql_db = sharetribe_development
sql_query = SELECT title AS title, description AS description FROM listings;
}
Output:
Sphinx 2.2.10-id64-release (2c212e0)
Copyright (c) 2001-2015, Andrew Aksyonoff
Copyright (c) 2008-2015, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file 'config/development.sphinx.conf'...
WARNING: key 'max_matches' was permanently removed from Sphinx configuration. Refer to documentation for details.
WARNING: key 'charset_type' was permanently removed from Sphinx configuration. Refer to documentation for details.
WARNING: key 'enable_star' was permanently removed from Sphinx configuration. Refer to documentation for details.
WARNING: key 'charset_type' was permanently removed from Sphinx configuration. Refer to documentation for details.
WARNING: key 'enable_star' was permanently removed from Sphinx configuration. Refer to documentation for details.
indexing index 'listing_core'...
building stopwords list...
SQL-CONNECT: ok
SQL-QUERY: SELECT title AS title, description AS description FROM listings;: ok
total 0 docs, 0 bytes
total 0.008 sec, 0 bytes/sec, 0.00 docs/sec
total 0 reads, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
total 0 writes, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
When I change the sql_query to return all columns, the indexer finds the expected number of documents (2) and adds them to the dictionary.
source listing_source {
type = mysql
sql_host = mysql
sql_user = sharetribe
sql_pass = secret
sql_db = sharetribe_development
sql_query = SELECT * FROM listings;
}
Output:
Sphinx 2.2.10-id64-release (2c212e0)
Copyright (c) 2001-2015, Andrew Aksyonoff
Copyright (c) 2008-2015, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file 'config/development.sphinx.conf'...
WARNING: key 'max_matches' was permanently removed from Sphinx configuration. Refer to documentation for details.
WARNING: key 'charset_type' was permanently removed from Sphinx configuration. Refer to documentation for details.
WARNING: key 'enable_star' was permanently removed from Sphinx configuration. Refer to documentation for details.
WARNING: key 'charset_type' was permanently removed from Sphinx configuration. Refer to documentation for details.
WARNING: key 'enable_star' was permanently removed from Sphinx configuration. Refer to documentation for details.
indexing index 'listing_core'...
building stopwords list...
SQL-CONNECT: ok
SQL-QUERY: SELECT * FROM listings;: ok
total 2 docs, 485 bytes
total 0.008 sec, 56303 bytes/sec, 232.18 docs/sec
total 0 reads, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
total 0 writes, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
How can I limit the query to return selected columns only?

sql_query = SELECT title AS title, description AS description FROM listings;
This doesn't work because it doesn't include a document ID. Add an id column (as the first column!) and it should work. You also don't need 'AS' when the alias is the same as the column name. (SELECT * probably worked because an id-like column happened to be first.)
So just make sure to include an id, and then name the columns you want as fields:
sql_query = SELECT id, title, description FROM listings
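For reference, a corrected source block might look like this (a sketch based on the config from the question; only the sql_query line changes, and it assumes listings has an integer id column):
source listing_source {
type = mysql
sql_host = mysql
sql_user = sharetribe
sql_pass = secret
sql_db = sharetribe_development
# id must come first: Sphinx uses the first column as the document ID,
# and the remaining columns become full-text fields
sql_query = SELECT id, title, description FROM listings
}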

Unable to match sendmail "Connection rate limit exceeded" with fail2ban

I can't manage to find the error that prevents fail2ban from matching these lines:
Apr 19 20:17:12 localhost sm-mta[201892]: ruleset=check_relay, arg1=[12.345.7.789], arg2=12.345.7.789, relay=host.hostname.com [12.345.7.789] (may be forged), reject=421 4.3.2 Connection rate limit exceeded.
Apr 19 20:17:53 localhost sm-mta[201902]: 13JIHpTD201902: [12.345.7.789] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA-v4
Here is the associated fail2ban configuration:
[Definition]
_daemon = (?:(sm-(mta|acceptingconnections)|sendmail))
__prefix_line = %(known/__prefix_line)s(?:\w{14,20}: )?
prefregex = ^<F-MLFID>%(__prefix_line)s</F-MLFID><F-CONTENT>.+</F-CONTENT>$
cmnfailre = ^ruleset=check_relay, arg1=(?P<dom>\S+), arg2=(?:IPv6:<IP6>|<IP4>), relay=((?P=dom) )?\[(\d+\.){3}\d+\](?: \(may be forged\))?, reject=421 4\.3\.2 (Connection rate limit exceeded\.|Too many open connections\.)$
^(?:\S+ )?\[(?:IPv6:<IP6>|<IP4>)\](?: \(may be forged\))? did not issue (?:[A-Z]{4}[/ ]?)+during connection to (?:TLS)?M(?:TA|S[PA])(?:-\w+)?$
I am testing with fail2ban-regex test-mail.log /etc/fail2ban/filter.d/sendmail-reject.conf
Resulting in:
Results
=======
Failregex: 0 total
Ignoreregex: 0 total
Date template hits:
|- [# of hits] date format
| [5] {^LN-BEG}(?:DAY )?MON Day %k:Minute:Second(?:\.Microseconds)?(?: ExYear)?
`-
Lines: 5 lines, 0 ignored, 0 matched, 5 missed
[processed in 0.00 sec]
Any idea? Thanks!
The second message (did not issue MAIL/EXPN/VRFY/ETRN) can be matched if you set mode = aggressive for the sendmail-reject jail (after this fix, e.g. in v0.10.6 and 0.11.2).
There was indeed no rule matching the first message (rate limit exceeded) exactly, due to different handling of the arguments, but...
I have now fixed this in f0214b3 on GitHub.
Until that is released you can extend it yourself, either in the filter (copy & paste from the GitHub filter) or directly in the jail:
[sendmail-reject]
enabled = true
mode = aggressive
failregex = %(known/failregex)s
^ruleset=check_relay(?:, arg\d+=\S*)*, relay=(\S+ )?\[?<ADDR>\]?(?: \(may be forged\))?, reject=421 4\.3\.2 (Connection rate limit exceeded\.|Too many open connections\.)$
^(?:\S+ )?\[<ADDR>\](?: \(may be forged\))? did not issue \S+ during connection
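To verify the added patterns, you can re-run fail2ban-regex against the same log; 0.10+ versions should also accept the filter name together with its mode parameter (a sketch, reusing the test-mail.log file from the question):
fail2ban-regex test-mail.log 'sendmail-reject[mode=aggressive]'
Both sample lines above should then show up as matched instead of missed.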

PostgreSQL write amplification

I'm trying to find out how much stress PostgreSQL puts on disks, and the results so far are kind of discouraging. Please take a look at the methodology; apparently I'm missing something or calculating the numbers the wrong way.
Environment
PostgreSQL 9.6.0-1.pgdg16.04+1 is running inside a separate LXC container with Ubuntu 16.04.1 LTS (kernel version 4.4.0-38-generic, ext4 filesystem on top of an SSD) and has only one client connection, from which I run the tests.
I disabled autovacuum to prevent unnecessary writes.
The calculation of written bytes is done with the following command; I want to find the total number of bytes written by all PostgreSQL processes (including the WAL writer):
pgrep postgres | xargs -I {} cat /proc/{}/io | grep ^write_bytes | cut -d' ' -f2 | python -c "import sys; print sum(int(l) for l in sys.stdin)"
Tests
With the # sign I marked a database command; with → I marked the write_bytes sum after that command. The test case is simple: a table with just one int4 column filled with 10,000,000 values.
Before every test I run a set of commands to free disk space and prevent additional writes:
# DELETE FROM test_inserts;
# VACUUM FULL test_inserts;
# DROP TABLE test_inserts;
Test #1: Unlogged table
As the documentation states, changes to an UNLOGGED table are not written to the WAL, so it's a good starting point:
# CREATE UNLOGGED TABLE test_inserts (f1 INT);
→ 1526276096
# INSERT INTO test_inserts SELECT generate_series(1, 10000000);
→ 1902977024
The difference is 376700928 bytes (~359MB), which sort of makes sense (ten million 4-byte integers plus row, page and other overhead), but it still looks like a bit too much, almost 10x the actual data size.
Test #2: Unlogged table with primary key
# CREATE UNLOGGED TABLE test_inserts (f1 INT PRIMARY KEY);
→ 2379882496
# INSERT INTO test_inserts SELECT generate_series(1, 10000000);
→ 2967339008
The difference is 587456512 bytes (~560MB).
Test #3: regular table
# CREATE TABLE test_inserts (f1 INT);
→ 6460669952
# INSERT INTO test_inserts SELECT generate_series(1, 10000000);
→ 7603630080
Here the difference is already 1142960128 bytes (~1090MB).
Test #4: regular table with primary key
# CREATE TABLE test_inserts (f1 INT PRIMARY KEY);
→ 12740534272
# INSERT INTO test_inserts SELECT generate_series(1, 10000000);
→ 14895218688
Now the difference is 2154684416 bytes (~2054MB), and after about 30 seconds an additional 100MB was written.
For this test case I made a breakdown by processes:
Process | Bytes written
/usr/lib/postgresql/9.6/bin/postgres | 0
\_ postgres: 9.6/main: checkpointer process | 99270656
\_ postgres: 9.6/main: writer process | 39133184
\_ postgres: 9.6/main: wal writer process | 186474496
\_ postgres: 9.6/main: stats collector process | 0
\_ postgres: 9.6/main: postgres testdb [local] idle | 1844658176
Any ideas or suggestions on how to correctly measure the values I'm looking for? Maybe it's a kernel bug? Or does PostgreSQL really do that many writes?
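One way to cross-check the per-test numbers from inside the database, independently of the /proc accounting, is to ask PostgreSQL how much WAL a statement generated (a sketch, not part of the original tests; these are the 9.6 function names, renamed to pg_current_wal_lsn()/pg_wal_lsn_diff() in PostgreSQL 10):
-- in psql: remember the WAL position before the statement under test
SELECT pg_current_xlog_location() AS before_lsn \gset
INSERT INTO test_inserts SELECT generate_series(1, 10000000);
-- WAL bytes generated since the saved position
SELECT pg_xlog_location_diff(pg_current_xlog_location(), :'before_lsn') AS wal_bytes;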
Edit: To double-check what write_bytes means, I wrote a simple Python script which confirmed that this value really is the number of bytes written.
Edit 2: On PostgreSQL 9.5, test case #1 showed 362577920 bytes and test #4 showed 2141343744 bytes, so it's not about the PG version.
Edit 3: Richard Huxton mentioned the Database Page Layout article, and I'd like to elaborate: I agree with the storage cost, which includes 24 bytes of row header, 4 bytes of data and another 4 bytes for alignment (usually to 8 bytes), giving 32 bytes per row; with that number of rows it's about 320MB per table, which is roughly what I got in test #1.
I would assume that the primary key in that case is about the same size as the data, and that in test #4 both the data and the PK are written to the WAL. That gives something like 360MB x 4 = 1.4GB, which is less than the result I got.
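To see how the writes split between the heap and the primary-key index, the catalog size functions could be checked right after the insert (a sketch, not from the original question):
-- on-disk size of the heap, of all its indexes, and of everything combined (incl. TOAST)
SELECT pg_size_pretty(pg_relation_size('test_inserts'))       AS heap_size,
       pg_size_pretty(pg_indexes_size('test_inserts'))        AS index_size,
       pg_size_pretty(pg_total_relation_size('test_inserts')) AS total_size;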

sqlcmd not showing RESTORE database stats

The following command in a cmd window
sqlcmd -S. -Usa -Ppass -dmaster -Q "RESTORE DATABASE [MYDATABASE] FROM DISK = 'D:\SQL Server\MYDATABASE.BAK' WITH FILE = 1, NOUNLOAD, REPLACE, STATS = 10"
displays the following progress output:
10 percent processed.
20 percent processed.
30 percent processed.
40 percent processed.
50 percent processed.
60 percent processed.
70 percent processed.
80 percent processed.
90 percent processed.
100 percent processed.
Processed 32320 pages for database 'MYDATABASE', file 'MYDATABASE' on file 1.
Processed 7 pages for database 'MYDATABASE', file 'MYDATABASE_log' on file 1.
But it turns out that the progress is shown only after the entire restore has finished, which makes the stats useless during the process.
Any advice?
Here is the version of the sqlcmd tool:
Microsoft (R) SQL Server Command Line Tool
Version 12.0.2000.8 NT
Copyright (c) 2014 Microsoft. All rights reserved.
Update Dec-2016:
Just including the comment from the Microsoft Connect link shared in the comments:
SQLCMD was rewritten in SQL 2012 to use ODBC. Here is a small
regression error that appears to have sneaked in.
It's the same effect as reported when calling RAISERROR('Hello', 0, 1) WITH NOWAIT within a script.
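For reference, this is the kind of script where the effect shows up: with a well-behaved client each message appears as soon as it is raised, while an affected sqlcmd build prints them all only at the end (a minimal sketch, not from the original question):
RAISERROR('step 1 done', 0, 1) WITH NOWAIT;
WAITFOR DELAY '00:00:05';
RAISERROR('step 2 done', 0, 1) WITH NOWAIT;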
I believe you can look in the SQL Server logs to see the progress as it happens.
You can query percent_complete in sys.dm_exec_requests.
Use start to open a separate window and issue a select percent_complete from sys.dm_exec_requests where percent_complete > 0.
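A slightly fuller monitoring query might look like this (a sketch built on the same DMV; estimated_completion_time is reported in milliseconds):
SELECT session_id,
       command,
       percent_complete,
       estimated_completion_time / 60000.0 AS est_minutes_remaining
FROM sys.dm_exec_requests
WHERE command IN ('RESTORE DATABASE', 'BACKUP DATABASE');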

How to debug "Sugar CRM X Files May Only Be Used With A Sugar CRM Y Database."

Sometimes one gets a message like:
Sugar CRM 6.4.5 Files May Only Be Used With A Sugar CRM 6.4.5 Database.
I am wondering how Sugar determines what version of the database it is using. In the above case, I get the following output:
select * from config where name='sugar_version';
+----------+---------------+-------+
| category | name | value |
+----------+---------------+-------+
| info | sugar_version | 6.4.5 |
+----------+---------------+-------+
1 row in set (0.00 sec)
cat config.php |grep sugar_version
'sugar_version' => '6.4.5',
Given the above output, I am wondering how to debug the message "Sugar CRM 6.4.5 Files May Only Be Used With A Sugar CRM 6.4.5 Database.": Sugar seems to think the files are not of version 6.4.5, even though sugar_version is 6.4.5 in config.php. Where should I look next?
Two options for the issue:
Option 1: Upgrade your database to the latest version.
Option 2: Follow the steps below and change the SugarCRM config version.
mysql> select * from config where name ='sugar_version';
+----------+---------------+---------+----------+
| category | name | value | platform |
+----------+---------------+---------+----------+
| info | sugar_version | 7.7.0.0 | NULL |
+----------+---------------+---------+----------+
1 row in set (0.00 sec)
Update your SugarCRM version to the appropriate one:
mysql> update config set value='7.7.1.1' where name ='sugar_version';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
The above commands seem to be correct. Sugar seems to check that config.php and the config table in the database contain the same version. In my case I was making the mistake of using the wrong database -- so if you're like me and tend to have your databases mixed up, double check in config.php that 'dbconfig' is indeed pointing to the right database.
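A quick way to make that check is to print the dbconfig block from config.php and compare it with the database you have been querying (a sketch; it assumes the dbconfig array fits within the next few lines of the file):
grep -A 8 "'dbconfig'" config.php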

DB2 CLI result output

When running command-line queries in MySQL you can optionally use '\G' as the statement terminator, and instead of the result-set columns being listed horizontally across the screen, it will list each column vertically, with the corresponding data to the right. Is there a way to do the same or a similar thing with the DB2 command line utility?
Example regular MySQL result:
mysql> select * from tagmap limit 2;
+----+---------+--------+
| id | blog_id | tag_id |
+----+---------+--------+
| 16 | 8 | 1 |
| 17 | 8 | 4 |
+----+---------+--------+
Example alternate MySQL result:
mysql> select * from tagmap limit 2\G
*************************** 1. row ***************************
id: 16
blog_id: 8
tag_id: 1
*************************** 2. row ***************************
id: 17
blog_id: 8
tag_id: 4
2 rows in set (0.00 sec)
Obviously, this is much more useful when the columns are large strings, or when there are many columns in a result set, but this demonstrates the formatting better than I can probably explain it.
I don't think such an option is available with the DB2 command line client. See http://www.dbforums.com/showthread.php?t=708079 for some suggestions. For a more general set of information about the DB2 command line client you might check out the IBM DeveloperWorks article DB2's Command Line Processor and Scripting.
A little bit late, but I found this post when I was searching for an option to retrieve only the selected data.
So db2 -x <query> returns only the result. More options can be found here: https://www.ibm.com/docs/en/db2/11.1?topic=clp-options
Example:
[db2inst1@a21c-db2 db2]$ db2 -n select postschemaver from files.product
POSTSCHEMAVER
--------------------------------
147.3
1 record(s) selected.
[db2inst1@a21c-db2 db2]$ db2 -x select postschemaver from files.product
147.3
The DB2 command line utility always displays data in tabular format, i.e. rows horizontally and columns vertically. It does not offer anything like MySQL's \G statement terminator. But yes, you can store column-organized data in DB2 tables when DB2_WORKLOAD=ANALYTICS is set.
db2 => connect to coldb
Database Connection Information
Database server = DB2/LINUXX8664 10.5.5
SQL authorization ID = BIMALJHA
Local database alias = COLDB
db2 => create table testtable (c1 int, c2 varchar(10)) organize by column
DB20000I The SQL command completed successfully.
db2 => insert into testtable values (2, 'bimal'),(3, 'kumar')
DB20000I The SQL command completed successfully.
db2 => select * from testtable
C1 C2
----------- ----------
2 bimal
3 kumar
2 record(s) selected.
db2 => terminate
DB20000I The TERMINATE command completed successfully.