make tsvector tokenize by space only - postgresql

I need to create a tsvector that does not split its content by hyphens but ideally only by whitespace.
select to_tsvector('simple','7073-03-001-01 7072-05-003-06')
produces
'-001':3 '-003':7 '-01':4 '-03':2 '-05':6 '-06':8 '7072':5 '7073':1
whereas I would rather have
'7072-05-003-06':2 '7073-03-001-01':1
Is this possible somehow?

There is a simple example parser called test_parser that seems to do what you want. It last appeared in the documentation in 9.4; since then it has been documented only in the source tree. These test extensions aren't always installed, so you might need to take special steps to get it (depending on how you installed PostgreSQL, what your OS is, and whether you are really using an EOL version).
CREATE EXTENSION test_parser;
CREATE TEXT SEARCH CONFIGURATION test (PARSER = testparser);
ALTER TEXT SEARCH CONFIGURATION test ADD MAPPING FOR word WITH simple;
SELECT * FROM to_tsvector('test', '7073-03-001-01 7072-05-003-06');
to_tsvector
---------------------------------------
'7072-05-003-06':2 '7073-03-001-01':1
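To sanity-check what a whitespace-only configuration should produce before setting up the parser, the expected lexemes and positions can be modelled on the client. This is only a rough sketch in Python, not how PostgreSQL computes a tsvector; the function name whitespace_lexemes is made up for illustration.

```python
from collections import defaultdict


def whitespace_lexemes(text):
    """Map each whitespace-delimited token to its 1-based positions."""
    positions = defaultdict(list)
    for pos, token in enumerate(text.split(), start=1):
        # The 'simple' dictionary lower-cases tokens; mimic that here.
        positions[token.lower()].append(pos)
    return dict(positions)


print(whitespace_lexemes("7073-03-001-01 7072-05-003-06"))
```

Hyphenated codes stay intact because the only separator is whitespace, which matches the to_tsvector output shown above.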

Full Text search with multiple synonyms in PostgreSQL

I am implementing full-text search with PostgreSQL. I am using the following type of query to search the document column.
FROM schema.table t0
WHERE t0.document @@ websearch_to_tsquery('error')
I am working on using FTS dictionaries to search for similar words. I came across the C:\Program Files\PostgreSQL\14\share\tsearch_data folder, where I defined a word and its synonyms in the xsyn_sample.rules file. The file content is shown below.
# Sample rules file for eXtended Synonym (xsyn) dictionary
# format is as follows:
#
# word synonym1 synonym2 ...
#
error fault issue mistake malfunctioning
I want to use this dictionary but don't know how. When I search for 'error', I want to get results for 'error', 'fault', 'issues', 'mistakes', etc., which have similar meanings. Kindly share if you have ever come across this implementation. A few things I am asking:
Is this xsyn_sample.rules file sufficient for this? If not, what other techniques can be used for this type of search?
How do I configure PostgreSQL 14 on my local system to use this dictionary instead of 'simple' or 'english'? I know how to use both of these with select plainto_tsquery('english','errors'); and select plainto_tsquery('simple','errors');. Similarly, I want to use my custom dictionary.
Is there a better source on using dictionaries in Postgres than https://www.postgresql.org/docs/current/textsearch-dictionaries.html ?
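Don't edit the example rules file; create your own file mysyn.rules and add the synonyms there. Then create a dictionary that uses the file (the xsyn_template template comes from the dict_xsyn extension, so run CREATE EXTENSION dict_xsyn first if it is not yet installed in the database):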
CREATE TEXT SEARCH DICTIONARY mysyn (TEMPLATE = xsyn_template, RULES = mysyn);
Then copy the English text search configuration and add your dictionary:
CREATE TEXT SEARCH CONFIGURATION myconf (COPY = english);
ALTER TEXT SEARCH CONFIGURATION myconf
ALTER MAPPING FOR word, asciiword WITH mysyn, english_stem;
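To get a feel for what the dictionary does with the rules file, here is a toy model of the lookup in Python. It is a sketch, not PostgreSQL's implementation: it models only the documented defaults MATCHORIG = true and MATCHSYNONYMS = false, the function names are hypothetical, and the lower-casing is an assumption of the sketch.

```python
def load_rules(lines):
    """Parse xsyn-style rules: 'word synonym1 synonym2 ...'; '#' lines are comments."""
    rules = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Lower-casing is an assumption of this sketch.
        word, *synonyms = line.lower().split()
        rules[word] = synonyms
    return rules


def lexize(rules, token, keeporig=True, keepsynonyms=True):
    """Return the lexemes for token, or None if the token is not recognised.

    Only the first word of each rule line is looked up, which models
    MATCHORIG = true and MATCHSYNONYMS = false.
    """
    token = token.lower()
    if token not in rules:
        return None  # an unrecognised token falls through to the next dictionary
    lexemes = [token] if keeporig else []
    if keepsynonyms:
        lexemes += rules[token]
    return lexemes


rules = load_rules(["# word synonym1 synonym2 ...",
                    "error fault issue mistake malfunctioning"])
print(lexize(rules, "error"))
print(lexize(rules, "fault"))
```

In this model 'fault' is not recognised, so it would be handed to english_stem. If lookups by synonym are needed as well, dict_xsyn has a MATCHSYNONYMS option for that.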

How to export full-text files with SQL?

Is there an easy way to import/export full-text fields as files?
One that solves the problem of "load as multiple lines": with SQL's COPY I can only turn a full file into a full table, not into a single text field, because each line from COPY becomes a row.
One that solves the save-back problem: saving the full XML file to the filesystem without changes to its binary representation (preserving the SHA1) and without other external procedures (such as Unix sed).
The main problem is export, hence the title of this page.
PS: the "proof of same file" in the round trip (import, export back, and compare with the original) can be obtained with sha1sum; see the examples below. So a natural requirement is also to check the SHA1 from SQL, to avoid exporting for simple check tasks.
All examples
Import a full text into a full table (not what I need) and test that it can be exported as the same text. PS: I need to import one file into one field and one row.
Transform a full table into one file (not what I need) and test that it can be exported as the same text. PS: I need one row (of one field) into one file.
Calculate the hash in SQL: the SHA1 of the field. It must be the same when compared; otherwise it is not a solution for me.
The following examples show each problem and an inelegant workaround.
1. Import
CREATE TABLE ttmp (x text);
COPY ttmp FROM '/tmp/test.xml' ( FORMAT text ); -- splits the file into one row per line
COPY (SELECT x FROM ttmp) TO '/tmp/test_back.xml' (format TEXT);
Checking that original and "back" have exactly the same content:
sha1sum /tmp/test*.*
570b13fb01d38e04ebf7ac1f73dfad0e1d02b027 /tmp/test_back.xml
570b13fb01d38e04ebf7ac1f73dfad0e1d02b027 /tmp/test.xml
PS: this seems perfect, but the problem here is the use of many rows. A real import solution imports a file into one row (and one field). A real export solution is a SQL function that produces test_back.xml from a single row (of a single field).
2. Transform full table into one file
Use it to store XML:
CREATE TABLE xtmp (x xml);
INSERT INTO xtmp (x)
SELECT array_to_string(array_agg(x),E'\n')::xml FROM ttmp
;
COPY (select x::text from xtmp) TO '/tmp/test_back2-bad.xml' ( FORMAT text );
... But it does not work: as we can check with sha1sum /tmp/test*.xml, it does not produce the same result for test_back2-bad.xml.
So also translate \n to chr(10) using an external tool (perl, sed or any other): perl -p -e 's/\\n/\n/g' /tmp/test_back2-bad.xml > /tmp/test_back2-good.xml
OK, now test_back2-good.xml has the same hash ("570b13fb..." in my example) as the original.
Using Perl is a workaround; how can I do it without it?
3. The SHA1 of the field
SELECT encode(digest(x::text::bytea, 'sha1'), 'hex') FROM xtmp;
Not solved: it is not the same hash as the original (the "570b13fb..." in my example). Perhaps ::text enforces an internal representation with \n symbols, so a solution would be a direct cast to bytea, but that is an invalid cast. The other workaround is not a solution either:
SELECT encode(digest( replace(x::text,'\n',E'\n')::bytea, 'sha1' ), 'hex')
FROM xtmp
... I also tried CREATE TABLE btmp (x bytea) and COPY btmp FROM '/tmp/test.xml' ( FORMAT binary ), but got an error ("unknown COPY file signature").
COPY isn't designed for this. It's meant to deal with table-structured data, so it can't work without some way of dividing rows and columns; there will always be some characters which COPY FROM interprets as separators, and for which COPY TO will insert some escape sequence if it finds one in your data. This isn't great if you're looking for a general file I/O facility.
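The escaping is what breaks the hash in the question's second example. A rough Python emulation of how the text format writes a single field makes this visible; it is a sketch covering only backslash, newline, carriage return and tab, and the function names and sample data are made up for illustration.

```python
import hashlib

# Backslash is handled first so the escapes added afterwards are not doubled.
_ESCAPES = [("\\", "\\\\"), ("\n", "\\n"), ("\r", "\\r"), ("\t", "\\t")]


def copy_text_escape(field):
    """Escape one field value roughly the way COPY ... (FORMAT text) writes it."""
    for raw, escaped in _ESCAPES:
        field = field.replace(raw, escaped)
    return field + "\n"  # every row is terminated by a real newline


def sha1_hex(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


file_content = "<root>\n  <item>a</item>\n</root>\n"
# Joining the per-line rows back together, as array_to_string(..., E'\n') does,
# yields the file's text without its final newline.
field = "\n".join(file_content.splitlines())

exported = copy_text_escape(field)          # one long line with literal \n in it
repaired = exported.replace("\\n", "\n")    # what the perl one-liner does

print(sha1_hex(exported) == sha1_hex(file_content))
print(sha1_hex(repaired) == sha1_hex(file_content))
```

In this toy data the single field also lacks the file's final newline, so hashing the field on its own gives a different digest from hashing the file. The naive repair step would also corrupt data that contains a literal backslash followed by n.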
In fact, database servers aren't designed for general file I/O. For one thing, anything which interacts directly with the server's file system will require a superuser role. If at all possible, you should just query the table as usual, and deal with the file I/O on the client side.
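A minimal sketch of the client-side approach, assuming Python and any DB-API 2.0 cursor (with PostgreSQL this would typically come from a driver such as psycopg2, which is an assumption here and not part of the question). The function name export_field and the UTF-8 encoding are also assumptions.

```python
import hashlib


def export_field(cursor, query, path):
    """Run a query returning one row with one text column and write it to path.

    Returns the SHA1 hex digest of the bytes written, so the result can be
    compared with sha1sum without reading the file back.
    """
    cursor.execute(query)
    row = cursor.fetchone()
    if row is None:
        raise LookupError("query returned no rows")
    data = row[0].encode("utf-8")  # assumes the file should be UTF-8
    with open(path, "wb") as fh:   # binary mode: no newline translation
        fh.write(data)
    return hashlib.sha1(data).hexdigest()
```

A call would look like export_field(cur, "SELECT x::text FROM xtmp", "/tmp/test_back.xml"). Since the value never passes through COPY, no escape sequences are inserted.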
That said, there are a few alternatives:
The built-in pg_read_file() function, and pg_file_write() from the adminpack module, provide the most direct interface to the file system, but they're both restricted to the cluster's data directory (and I wouldn't recommend storing random user-created files in there).
lo_import() and lo_export() are the only built-in functions I know of which deal directly with file I/O and which have unrestricted access to the server's file system (within the constraints imposed by the host OS), but the Large Object interface is not particularly user-friendly....
If you install the untrusted variant of a procedural language like Perl (plperlu) or Python (plpythonu), you can write wrapper functions for that language's native I/O routines.
There isn't much you can't accomplish via COPY TO PROGRAM if you're determined enough - for one, you could COPY (SELECT 1) TO PROGRAM 'mv <source_file> <target_file>' to work around the limitations of pg_file_write() - though this blurs the line between SQL and external tools somewhat (and whoever inherits your codebase will likely not be impressed...).
You can use PL/Python (plpythonu) with open(), f.write(), and f.close() inside a Postgres function to write to a file.
The language extension needs to be installed:
https://www.postgresql.org/docs/8.3/static/plpython.html
Working example from the mailing list.
https://www.postgresql.org/message-id/flat/20041106125209.55697.qmail%40web51806.mail.yahoo.com#20041106125209.55697.qmail#web51806.mail.yahoo.com
For example, with plpythonu:
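CREATE FUNCTION makefile(p_file text, p_content text) RETURNS text AS $$
  # Runs on the server with the OS permissions of the postgres user.
  # The with-block closes the file even if the write fails.
  with open(p_file, "w") as o:
      o.write(p_content)
  return "ok"
$$ LANGUAGE plpythonu;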
PS: for a safe implementation, see this example.
Preparing
There is a not-so-obvious procedure for using the PL/Python extension. Assuming an Ubuntu server:
In SQL, check SELECT version().
In a terminal, check the versions listed by sudo apt install postgresql-plpython.
Install the matching version, e.g. sudo apt install postgresql-plpython-9.6.
Back in SQL, run CREATE EXTENSION plpythonu.
Testing
/tmp works by default; to create or use another folder, e.g. /tmp/sandbox, use sudo chown postgres.postgres /tmp/sandbox.
Assume the tables from the question's examples. SQL script, repeating some lines:
DROP TABLE IF EXISTS ttmp;
DROP TABLE IF EXISTS xtmp;
CREATE TABLE ttmp (x text);
COPY ttmp FROM '/tmp/sandbox/original.xml' ( FORMAT text );
COPY (SELECT x FROM ttmp) TO '/tmp/sandbox/test1-good.xml' (format TEXT);
CREATE TABLE xtmp (x xml);
INSERT INTO xtmp (x)
SELECT array_to_string(array_agg(x),E'\n')::xml FROM ttmp
;
COPY (select x::text from xtmp)
TO '/tmp/sandbox/test2-bad.xml' ( FORMAT text );
SELECT makefile('/tmp/sandbox/test3-good.xml',x::text) FROM xtmp;
The sha1sum *.xml output for my original XML file:
4947.. original.xml
4947.. test1-good.xml
949f.. test2-bad.xml
4947.. test3-good.xml

Find my Postgres text search dictionaries

I created a thesaurus for full text search a few months back. I just recently added some entries, and (I think) I update it like this:
ALTER TEXT SEARCH CONFIGURATION english
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
WITH [my_thesaurus], english_stem;
However, I don't actually remember what my thesaurus was called. How can I figure this out?
You may find it in the output of:
SELECT dictname FROM pg_catalog.pg_ts_dict;
If you use the psql client, you can use the following command:
\dFd[+] [PATTERN]
list text search dictionaries
Basically, you can use \dFd+ to list all dictionaries along with their initialization options.

Netbeans SQL select column names with # in the

I have an odd problem with NetBeans (6.7.1). Using the built-in SQL editor, I cannot select any column defined with a # in its name. It appears that NetBeans is treating this as a comment and never passing it to the underlying connection. Is there a way to change this?
Thanks,
David
If you have any control over the column names, I suggest you remove the # symbols. NetBeans is not the only application that will choke on them.

Disabling the PostgreSQL 8.4 tsvector parser's `file` token type

I have some documents that contain sequences such as radio/tested that I would like to return hits in queries like
select * from doc
where to_tsvector('english',body) @@ to_tsquery('english','radio')
Unfortunately, the default parser takes radio/tested as a file token (despite being in a Windows environment), so it doesn't match the above query. When I run ts_debug on it, that's when I see that it's being recognized as a file, and the lexeme ends up being radio/tested rather than the two lexemes radio and test.
Is there any way to configure the parser not to look for file tokens? I tried
ALTER TEXT SEARCH CONFIGURATION public.english
DROP MAPPING FOR file;
...but it didn't change the output of ts_debug. If there's some way of disabling file, or at least having it recognize both file and all the words that it thinks make up the directory names along the way, or if there's a way to get it to treat slashes as hyphens or spaces (without the performance hit of regexp_replaceing them myself) that would be really helpful.
I think the only way to do what you want is to create your own parser :-( Copy wparser_def.c to a new file, remove from the parse tables (actionTPS_Base and the ones following it) the entries that relate to files (TPS_InFileFirst, TPS_InFileNext etc), and you should be set. I think the main difficulty is making the module conform to PostgreSQL's C idiom (PG_FUNCTION_INFO_V1 and so on). Have a look at contrib/test_parser/ for an example.