Sphinx n-gram & charset_table

I'm modifying the charset mapping for a Sphinx cluster and I've run into a bit of an oddity, one which the documentation does not cover. The previous author of the charset_table and ngram_chars definitions has put the CJK Unicode ranges in both the charset mapping and the n-gram list.
Is this necessary?
If not, what is the purpose of this duplication?

I am going to answer my own question after doing some extensive testing. As it turns out, charset_table and ngram_chars complement each other rather than one being a subset of the other.
Testing run
Docset
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
<sphinx:schema>
<sphinx:field name="foo"/>
</sphinx:schema>
<sphinx:document id="123">
<foo><![CDATA[ぇえぉおかがきぎく]]></foo>
</sphinx:document>
</sphinx:docset>
Just charset_table
using config file 'sphinx.conf'...
index 'i_blah': query 'ぇ ': returned 0 matches of 0 total in 0.000 sec
using config file 'sphinx.conf'...
index 'i_blah': query 'ぇえぉおかがきぎく ': returned 1 matches of 1 total in 0.000 sec
displaying matches:
1. document=123, weight=1500
words:
1. 'ぇえぉおかがきぎく': 1 documents, 1 hits
Just ngram_chars
using config file 'sphinx.conf'...
index 'i_blah': query 'ぇえぉおかがきぎく ': returned 1 matches of 1 total in 0.000 sec
displaying matches:
1. document=123, weight=9500
words:
1. 'ぇ': 1 documents, 1 hits
2. 'え': 1 documents, 1 hits
3. 'ぉ': 1 documents, 1 hits
4. 'お': 1 documents, 1 hits
5. 'か': 1 documents, 1 hits
6. 'が': 1 documents, 1 hits
7. 'き': 1 documents, 1 hits
8. 'ぎ': 1 documents, 1 hits
9. 'く': 1 documents, 1 hits
So, the presence of a character in charset_table does not in any way affect the indexing if the character is present in ngram_chars. They do not depend on one another.

I admit I have never used ngram_chars, but I think characters listed in ngram_chars do also need to be in charset_table.
charset_table defines all the characters that get indexed, and ngram_chars then defines which of those get segmented:
if a character is only in charset_table, it will be indexed as part of normal words
if it is only in ngram_chars, it has no effect.
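For reference, a minimal sketch of how the two directives typically sit side by side in sphinx.conf; the index name, source, path and Unicode ranges below are purely illustrative (the CJK ranges shown are nowhere near complete):

index i_blah
{
    source        = s_blah
    path          = /var/data/i_blah
    # plain word characters: indexed as whole keywords
    charset_table = 0..9, a..z, _, A..Z->a..z
    # CJK characters: split into 1-grams at index and search time
    ngram_len     = 1
    ngram_chars   = U+3040..U+309F, U+30A0..U+30FF, U+4E00..U+9FFF
}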


kdb q - efficiently count tables in flatfiles

I have a lot of tables stored in flat files (in a directory called basepath) and I want to check their number of rows. The best I can do right now is:
c:([] filename:system "ls ",basepath;
tablesize:count each get each hsym `$basepath,/:system "ls ",basepath)
which loads each table entirely into memory and then performs the count (that's quite slow). Is saving as splayed tables the only way to make this faster (because I would only load 1 column and count that) or is there a trick in q that I can use?
Thanks for the help
If you have basepath defined as a string of the path to the directory where all your flat tables are stored, then you can create a dictionary of the row counts as follows:
q)cnt:{count get hsym x}
q)filename:key hsym `$basepath
q)filename!cnt each filename
t| 2
g| 3
This is where I have flat tables t and g saved in my basepath directory. This stops you from having to use system commands, which are often less efficient.
The function cnt takes the path of each flat table (as a symbol) and returns the number of rows without keeping the table in memory afterwards.
The best solution, if you have control of the process of saving such files down, is to add an extra step of saving the meta information of the row count somewhere separate at the same time as saving the raw data. This would allow you to quickly access the table size from this file instead of reading the full table in each time.
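A minimal, untested sketch of that idea in q, assuming you control the save step; savetab and the rowcounts file name are made up for illustration, and dir is assumed to be a path string ending in "/":

savetab:{[dir;nm;t]
 (hsym `$dir,string nm) set t;               / save the flat table exactly as before
 cf:hsym `$dir,"rowcounts";                  / side file holding name -> row count
 counts:$[()~key cf;(0#`)!0#0N;get cf];      / load existing counts, or start with an empty dict
 cf set counts,enlist[nm]!enlist count t}    / upsert the new count and write it back

Looking up a size later is then just (get hsym `$dir,"rowcounts") nm, with no need to touch the table itself.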
However, note that to avoid pulling them into memory at all you would have to instead use read1 and look at the headers on the binary data. As you said it would be better to save as a splayed table and read in one column.
UPDATE: I would not recommend doing this and strongly suggest doing the above, but out of curiosity, after looking into using read1, here's an example of what a hacky solution might look like:
f:{
 b:read1(y;0;x);                                           / read the first x bytes of file y
 if[not 0x62630b~b[2 4 5];'`$"not a table"];               / flip (98h) of a dict (99h) with symbol (11h) keys
 cc:first first((),"i";(),4)1:b 7+til 4;                   / column count: little-endian 4-byte int
 if[null ce:first where cc=sums 0x0=11 _ b;:.z.s[x*2;y]];  / end of the null-terminated names; retry with a bigger chunk if not all present
 c:`$"\000" vs "c"$b[11+til ce];                           / split the names into symbols
 n:first first((),"i";(),4)1:b[(20+ce)+til 4];             / row count from the first column's header
 :`columns`rows!(c;n);
 }[2000]
The q binary file format isn't documented anywhere; the only way to figure it out is to save different things and see how the bytes change. It's also subject to change between versions - the above is written for 3.5 and is probably valid for 3.0-3.5 only, not the latest 3.6 release or anything 2.X.
The given code works in the following way:
reads a chunk from the front of the file
validates that it looks like a flat unkeyed table (flip[98] of a dict[99] with symbol[11] keys)
reads the count of symbols in the list of columns as a little endian 4 byte int
scans through the null terminated strings for that many zero bytes
if the columns are so numerous or verbose that we don't have them all in this chunk, it will double the size of the chunk and try again
turn the strings into symbols
using the offset we get from the end of the column list, skip a bit more of the header for the mixed list of columns
then read the count from the header of the first column
Hope this answers your question!
From experimenting with the binary files, it seems that the table count is saved as part of the binary file when you save down a flat file, taking up 4 bytes after the initial object type and column headings which will vary from table to table.
`:test set ([]a:1 2 3;b:4 5 6;c:7 8 9;aa:10 11 12;bb:13 14 15)
q)read1 `:test
0xff016200630b0005000000610062006300616100626200000005000000090003000000
(the interesting byte offsets here are 0, 7, 11 and 31)
bytes   | example                    | meaning
--------|----------------------------|------------------------------------------------------------
0 - 5   | 0xff016200630b             | object is a flat table
7 - 10  | 0x05000000                 | number of columns (5)
11 - 22 | 0x610062006300616100626200 | the ASCII bytes of each column name ("a", "b", "c", "aa", "bb"), each name followed by a one-byte null separator
23 - 30 | 0x0000050000000900         | 8 bytes that can be skipped
31 - 34 | 0x03000000                 | 4 bytes for the row count of the first column (3)
This should help you understand the function that Fiona posted.
The binary is saved down little-endian, meaning the least-significant byte comes first and the most-significant byte ends up right-most - doing this in decimal for the number 100 would give 001, with the 100s (most significant) on the right, then the 10s, and finally the 1s on the left. In the binary file, each group of 2 hex digits is a byte.
You can use 1: to read in the contents of a binary file, with additional arguments in the list specifying the offset - where to start reading from, and how many bytes to read. In our case we want to start at byte 31 and read in 4 bytes, specifying the output should be an integer and to cut the input into separate 4 byte chunks.
q)first first (enlist "i";enlist 4)1:(`:test;31;4)
3i
Converting the little-endian bytes into an integer gives us the row count. Since this only has to read in 4 bytes instead of the whole file it is a lot quicker.
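As a quick sanity check of the byte order, the same conversion applied to the literal row-count bytes from the example above gives the expected value:
q)0x0 sv reverse 0x03000000
3i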
For a table with 10000 rows and 2 columns there is not much difference:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10000;31;4)
0
q)\t count get `:test10000
0
For a table with 100m rows and 2 columns:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10m;31;4)
0
q)\t count get `:test10m
2023
If you have a splayed table instead, you can read in the number of elements in one of the columns from bytes 8-11 like so, assuming the column is a simple list:
q)first first (enlist "i";enlist 4)1:(`:a;8;4)
3i
You can read more about reading in from binary files here https://code.kx.com/q/ref/filenumbers/#1-binary-files
You can make what you currently have more efficient by using the following
counttables:{count each get each hsym `$basepath,/:system "ls ",basepath}
This will improve the speed of the count by not including the extra read of the data as well as the join which you are currently doing. You are correct, though, that if the tables were saved splayed you would only have to read in the one column, making it much more efficient.
If your tables are stored uncompressed there's probably something quite hacky you could do with a read1 on the headers within the file until you find the first column header.
But v hacky :-(
Are you responsible for saving these down? Can you keep a running state as you do?

Multilevel list skip numbering

I have created a multilevel list so that I can reference table rows in a Word document. Multilevel list thread
With the multilevel list I struggle to skip numbers. The table numbering is divided into two levels: NNN and NNNL, where N is a number and L is a letter.
Example:
1. Header 1
101
101a
2. Header 2
201a
The numbering below header 1 works fine, but header 2 does NOT work.
The numbering is used for clauses in a document, so in case a clause is divided into different conditions, the NNNL form is used. However, as it is now, an NNN entry is needed in order to create an NNNL.
I have tried using 'Set numbering value' - 'Continue from previous list' - 'Advance value (skip numbers)', following this solution suggestion, but following this guide still results in the addition of an NNN level before it (see below for an example).
2. Header 2
201 <- This is added
201a
Can't I skip a numbering value of a list level above the item I wish to change?
Edited:
Also, I face issues when there is a subheader. If the first clause after a subheader is divided into subclauses, I get
101
1.1 subheader
101a
What I want is this
101
1.1 subheader
102a
I have uploaded a Word document here which shows the issue.
Set both Level 2 and Level 3 to restart after Level 1 (in the dialog box described in the answer to the first post you link to). This way you can leave out Level 2 and Level 3 will still restart.

Is it possible to return only one instance of each ID in a view?

I am trying to work out how I would ensure I only get one instance of each user and their ID when I try to do an inner join on my source table.
The source data is a series of user names and IDs
userid username
1 alice
2 bob
3 charley
4 dave
5 robin
6 jon
7 lou
8 scott
I have had to write the algorithm in Python, to make sure each user's data is only matched with one other user's (so we can make sure each user's data is used once in each round)
We store the pairings, and how many rounds have been completed successfully after the tests, but I'd like to try and shortcut the process
I can get all the results, but I want to find a better way to remove each matched pair from the results so they can't be matched again.
select u.user_id, u.user, ur.user_id, ur.user
from userResults u inner join userResults ur
on u.user_id < ur.user_id
and (u.user_id, ur.user_id) not in
(select uid, uid2 from rounds)
where u.match <= ur.match and ((u.user_id) not in %s
and ur.user_id not in %s) limit 1;
I've tried making materialised views with a unique constraint, but it doesn't seem to affect it - I get each possible pairing once, rather than each user paired only once
I'm trying to work out how I only get 4 results, in the right order.
Every time I look at the underlying code, I can't help but think there's a better way to write it natively in SQL rather than having to iterate over results in python.
edit
assuming each user has been matched 0 or more times, you might have a situation where user_ids 1-4 have rounds set to 1, and matches set to 1, and the remaining 4 have rounds set to 1 and no matches.
I have a view which will return a default value of 0 and 0 for rounds and matches if they haven't yet played, and you can't assume all rounds entered have met with a match.
If the first 4 have all matched, and have generated rounds, user 1 and user 2 have already met and matched in a round, so they won't be matched again, so user 1 will match user 3 (or 4) and user 2 will match user 4 (or 3)
The issue I'm having is that when I remove the limit and iterate through manually, the first three matches I always get are 2,4 then 1,3 then 2,3 (rather than 5,7 or 6,8)
Adding the sample data and current rounds
table rounds
uid uid2
1 2
3 4
userresults view
user_id user rounds score
1 alice 1 0
2 bob 1 1
3 charley 1 1
4 dave 1 0
5 robin 0 0
6 jon 0 0
7 lou 0 0
8 scott 0 0
I'm currently getting results like:
2,4
2,3
1,3
1,4
4,6
...
These are all valid results, but I would like to limit them to a single instance of each ID in each column, so just the first match of each valid pairing.
I've created a new view to try and simplify things a bit, and populated it with dummy data and tried to generate matches
All these matches are valid, and I'm trying to add some form of filter or restriction to bring it back to sensible numbers.
777;"Bongo Steveson";779;"Drongo Mongoson"
777;"Bongo Steveson";782;"Cheesey McHamburger"
777;"Bongo Steveson";780;"Buns Mattburger"
779;"Drongo Mongoson";782;"Cheesey McHamburger"
779;"Drongo Mongoson";781;"Hamburgler Bunburger"
775;"Bob Jobsworth";777;"Bongo Steveson"
778;"Mongo Bongoson";779;"Drongo Mongoson"
775;"Bob Jobsworth";778;"Mongo Bongoson"
778;"Mongo Bongoson";781;"Hamburgler Bunburger"
775;"Bob Jobsworth";782;"Cheesey McHamburger"
775;"Bob Jobsworth";781;"Hamburgler Bunburger"
775;"Bob Jobsworth";780;"Buns Mattburger"
776;"Steve Bobson";777;"Bongo Steveson"
776;"Steve Bobson";779;"Drongo Mongoson"
776;"Steve Bobson";782;"Cheesey McHamburger"
776;"Steve Bobson";778;"Mongo Bongoson"
776;"Steve Bobson";781;"Hamburgler Bunburger"
780;"Buns Mattburger";782;"Cheesey McHamburger"
780;"Buns Mattburger";781;"Hamburgler Bunburger"
I still can't work out a sensible way to restrict these values, and it's driving me nuts.
I've implemented a solution in code but I'd really like to see if I can get this working in native Postgres.
At this point I'm monkeying around with a new test database schema, and this is my view - adding UNIQUE to the index generates an error, and I can't add a check constraint to a materialised view (grrrr)
You can try joining a subquery to ensure distinct records from the user table.
select * from any_table t1
inner join(
select distinct userid,username from source_table
) t2 on t1.userid=t2.userid;
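If the aim is literally one row per userid from the source table, Postgres' DISTINCT ON is another option. This is only a sketch against the table and column names used above (source_table, userid, username), not your actual schema:

select distinct on (userid) userid, username
from source_table
order by userid, username;  -- the order by decides which duplicate row is kept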

Process two space delimited text files into one by common column [duplicate]

This question already has answers here:
merge two files by key if exists in the first file / bash script [duplicate]
(2 answers)
Closed 9 years ago.
I have two text files that look like:
col1 primary col3 col4
blah 1 blah 4
1 2 5 6
...
and
colA primary colC colD
1 1 7 27
foo 2 11 13
I want to merge them into a single wider table, such as:
primary col1 col3 col4 colA colC colD
1 blah blah 4 a 7 27
2 1 5 6 foo 11 13
I'm pretty new to Perl, so I'm not sure what the best way is to do this.
Note that column order does not matter, and there are a couple million rows. Also my files are unfortunately not sorted.
My current plan unless there's an alternative:
For a given line in one of the files, scan the other file for the matching row and append them both as necessary into the new file. This sounds slow and cumbersome though.
Thanks!
Solution 1.
Read the smaller of the two files line by line, using a standard CPAN delimited-file parser like Text::CSV_XS to parse out columns.
Save each record (as arrayref of columns) in a hash, with your merge column being the hash key
When done, read the larger of the two files line by line, using a standard CPAN delimited-file parser like Text::CSV_XS to parse out columns.
For each record, find the join key field, find the matching record from your hash storing the data from file#1, merge the 2 records as needed, and print.
NOTE: This is pretty memory intensive as the entire smaller file will live in memory, but it won't require you to read one of the files a million times.
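A minimal sketch of Solution 1, assuming whitespace-delimited files with a header row and a join column literally named "primary"; the file names are placeholders, and it uses a plain split rather than Text::CSV_XS just to keep the example short:

#!/usr/bin/perl
use strict;
use warnings;

# Placeholders: put the smaller file in $small_file so only it is held in memory.
my ($small_file, $big_file) = ('file2.txt', 'file1.txt');

# 1) Read the smaller file into a hash keyed on the join column.
open my $fh, '<', $small_file or die "open $small_file: $!";
my @small_cols = split ' ', scalar <$fh>;
my ($small_key) = grep { $small_cols[$_] eq 'primary' } 0 .. $#small_cols;
my %by_key;
while (<$fh>) {
    my @f = split ' ';
    $by_key{ $f[$small_key] } = \@f;
}
close $fh;

# 2) Stream the larger file and print merged rows for matching keys.
open $fh, '<', $big_file or die "open $big_file: $!";
my @big_cols = split ' ', scalar <$fh>;
my ($big_key) = grep { $big_cols[$_] eq 'primary' } 0 .. $#big_cols;
print join(' ', 'primary',
           @big_cols[grep { $_ != $big_key } 0 .. $#big_cols],
           @small_cols[grep { $_ != $small_key } 0 .. $#small_cols]), "\n";
while (<$fh>) {
    my @f = split ' ';
    next unless exists $by_key{ $f[$big_key] };          # inner join: skip unmatched rows
    my @other = @{ $by_key{ $f[$big_key] } };
    print join(' ', $f[$big_key],
               @f[grep { $_ != $big_key } 0 .. $#f],
               @other[grep { $_ != $small_key } 0 .. $#other]), "\n";
}
close $fh;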
Solution 2.
Sort file1 (using Unix sort or some simple Perl code) into "file1.sorted"
Sort file2 (using Unix sort or some simple Perl code) into "file2.sorted"
Open both files for reading. Loop until both are fully read:
Read 1 line from each file into the buffer if the buffer for that file is empty (buffer being simply a variable containing the next record).
Compare indexes between 2 lines.
If index1 < index2, write the record for file1 into output (without merging) and empty buffer1. Repeat step 3
If index1 > index2, write the record for file2 into output (without merging) and empty buffer2. Repeat.
If index1 == index2, merge 2 records, write the merged record into output and empty out both buffers (assuming the join index column is unique. If not unique, this step is more complicated).
NOTE: this does NOT require you to keep an entire file in memory, aside from sorting the files (which CAN be done in a memory-constrained way if you need to).
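And a rough sketch of Solution 2's merge loop, assuming the two sorted files have no header line and are sorted on a numeric join key in the first column (file names and layout are again just placeholders):

#!/usr/bin/perl
use strict;
use warnings;

open my $f1, '<', 'file1.sorted' or die $!;
open my $f2, '<', 'file2.sorted' or die $!;

my ($buf1, $buf2) = (scalar <$f1>, scalar <$f2>);    # one-record lookahead buffers
while (defined $buf1 or defined $buf2) {
    my $k1 = defined $buf1 ? (split ' ', $buf1)[0] : undef;
    my $k2 = defined $buf2 ? (split ' ', $buf2)[0] : undef;
    if (defined $k1 and (!defined $k2 or $k1 < $k2)) {
        chomp $buf1; print "$buf1\n";                # record only in file1: pass through
        $buf1 = <$f1>;
    }
    elsif (defined $k2 and (!defined $k1 or $k2 < $k1)) {
        chomp $buf2; print "$buf2\n";                # record only in file2: pass through
        $buf2 = <$f2>;
    }
    else {                                           # keys match: merge and advance both
        my @a = split ' ', $buf1;
        my @b = split ' ', $buf2;
        print join(' ', @a, @b[1 .. $#b]), "\n";     # drop file2's duplicate key column
        $buf1 = <$f1>;
        $buf2 = <$f2>;
    }
}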

How to pick items from warehouse to minimise travel in TSQL?

I am looking at this problem from a TSQL point of view, however any advice would be appreciated.
Scenario
I have 2 sets of criteria which identify items in a warehouse to be selected.
Query 1 returns 100 items
Query 2 returns 100 items
I need to pick any 25 of the 100 items returned in query 1.
I need to pick any 25 of the 100 items returned in query 2.
- The items in query 1/2 will not be the same, ever.
Each item is stored in a segment of the warehouse.
A segment of the warehouse may contain numerous items.
I wish to select the 50 items (25 from each query) in a way as to reduce the number of segments I must visit to select the items.
Suggested Approach
My initial idea has been to combine the 2 result sets and produce a list of
Segment ID, NumberOfItemsRequiredInSegment
I would then select 25 items from each query, giving preference to those in segments with the highest NumberOfItemsRequiredInSegment. I know this would not be optimal, but it would be an easy-to-implement heuristic.
Questions
1) I suspect this is a standard combinatorial problem, but I don't recognise it... perhaps multiple knapsack? Does anyone recognise it?
2) Is there a better (easy-ish to implement) heuristic or solution - ideally in TSQL?
Many thanks.
This might also not be optimal but I think it would at least perform fairly well.
Calculate this set for query 1.
Segment ID, NumberOfItemsRequiredInSegment
Take the top 25, just by sorting by NumberOfItemsRequiredInSegment descending. Call this subset A.
Take the top 25 from query 2, by joining to A and sorting by "case when A.segmentID is not null then 1 else 0 end desc, NumberOfItemsRequiredInSegmentFromQuery2 desc".
Repeat this but take the top 25 from query 2 first, and return the better performing of the 2 sets.
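A rough T-SQL sketch of those steps; #q1 and #q2 stand in for the item lists produced by the two criteria queries, and ItemId/SegmentId are assumed column names (the "repeat with query 2 first and keep the better set" pass is left out for brevity):

-- Step 1: take 25 items from query 1, preferring the busiest segments.
SELECT TOP (25) q1.ItemId, q1.SegmentId
INTO #pick1
FROM #q1 q1
JOIN (SELECT SegmentId, COUNT(*) AS ItemsInSegment
      FROM #q1 GROUP BY SegmentId) s ON s.SegmentId = q1.SegmentId
ORDER BY s.ItemsInSegment DESC, q1.SegmentId;

-- Step 2: take 25 items from query 2, preferring segments already visited for query 1.
SELECT TOP (25) q2.ItemId, q2.SegmentId
INTO #pick2
FROM #q2 q2
JOIN (SELECT SegmentId, COUNT(*) AS ItemsInSegment
      FROM #q2 GROUP BY SegmentId) s ON s.SegmentId = q2.SegmentId
LEFT JOIN (SELECT DISTINCT SegmentId FROM #pick1) a ON a.SegmentId = q2.SegmentId
ORDER BY CASE WHEN a.SegmentId IS NOT NULL THEN 1 ELSE 0 END DESC,
         s.ItemsInSegment DESC, q2.SegmentId;

-- How many distinct segments this ordering of the two picks would visit.
SELECT COUNT(DISTINCT SegmentId) AS SegmentsVisited
FROM (SELECT SegmentId FROM #pick1
      UNION ALL
      SELECT SegmentId FROM #pick2) x;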
The one scenario where I think this fails would be if you got something like this.
Segment | Count Query 1 | Count Query 2
A       | 10            | 1
B       | 5             | 1
C       | 5             | 1
D       | 5             | 4
E       | 5             | 4
F       | 4             | 4
G       | 4             | 5
H       | 1             | 5
J       | 1             | 5
K       | 1             | 10
You need to make sure you choose A, D, E when choosing the best segments from query 1. To deal with this you'd almost still need to join to query 2, so you can get the count from there to use as a tie breaker.