postgres text search - numhword matching and order - postgresql

The Postgres documentation here explains the numhword token type as one that matches "Hyphenated word, letters and digits". The example they give for this is postgres-beta1, and that matches nicely. However, something such as postgres-9-beta1 does not match, and I can't seem to find a default parser that will work with this. SQL below.
Is my best choice something that parses just on spaces? Is there such a default parser? (It seems test_parser doesn't ship with 9.5 anymore...)
I want to tokenize alphanumerics, hyphenated. Am I stuck with regular expressions for the time being, or is there a straightforward way to create a custom parser (without dropping down into C)?
CREATE TEXT SEARCH DICTIONARY simple_nostem_no_stop (TEMPLATE = pg_catalog.simple);
CREATE TEXT SEARCH CONFIGURATION test_id_search ( COPY = pg_catalog.simple );
alter text search configuration test_id_search
drop mapping for asciihword, asciiword, email, file, float, host, hword, hword_asciipart, hword_numpart, hword_part, int, numhword, numword, sfloat, uint, url, url_path, version, word ;
ALTER TEXT SEARCH CONFIGURATION test_id_search
ALTER MAPPING FOR numhword WITH simple_nostem_no_stop;
\dF+ test_id_search
Text search configuration "public.test_id_search"
Parser: "pg_catalog.default"
Token | Dictionaries
---------+-----------------------
numhword | simple_nostem_no_stop
/* This works as I hoped, per the docs: */
test_db=# select to_tsvector('test_id_search', ' postgresql-beta1 ') ;
to_tsvector
----------------------
'postgresql-beta1':1
(1 row)
/* This doesn't seem to work? */
test_db=# select to_tsvector('test_id_search', ' postgresql-9-beta1 ') ;
to_tsvector
-------------
(1 row)
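In the meantime, the regular-expression route I'm experimenting with is to split on whitespace and build the tsvector by hand. This is only a sketch (the helper name split_to_tsvector is made up, and tokens containing quotes or backslashes would need extra escaping):
CREATE OR REPLACE FUNCTION split_to_tsvector(doc text)
  RETURNS tsvector
  LANGUAGE sql IMMUTABLE AS
$$
  SELECT COALESCE(
           string_agg(quote_literal(tok) || ':' || ord, ' ' ORDER BY ord)::tsvector,
           ''::tsvector)
  FROM   regexp_split_to_table(trim(doc), '\s+') WITH ORDINALITY AS t(tok, ord)
  WHERE  tok <> '';
$$;

select split_to_tsvector(' postgresql-9-beta1 ');
-- should give: 'postgresql-9-beta1':1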

Related

How to remove multiple characters between 2 special characters in a column in SSIS expression

I want to remove the characters starting from '#' up to the ';' in a derived column expression in SSIS.
For example,
my input column values are,
and want the output as,
Note: Length after '#' is not fixed.
Already tried in SQL but want to do it via SSIS derived column expression.
First of all: please do not post pictures. We prefer copy-and-pastable sample data. And please try to provide a minimal, complete and reproducible example, best served as DDL, INSERT and code, as I do here for you.
And just to mention this: if you control the input, you should not mix information within one string... If this is needed, try to use a "real" text container like XML or JSON.
SQL Server is not meant for string manipulation. There is no RegEx or repeated/nested pattern matching, so we would have to use a recursive / procedural / looping approach. But, if performance is not that important, you might use an XML hack.
--DDL and INSERT
DECLARE @tbl TABLE(ID INT IDENTITY, YourString VARCHAR(1000));
INSERT INTO @tbl VALUES('Here is one without')
                      ,('One#some comment;in here')
                      ,('Two comments#some comment;in here#here is the second;and some more text');
--The query
SELECT t.ID
      ,t.YourString
      ,CAST(REPLACE(REPLACE((SELECT t.YourString AS [*] FOR XML PATH('')),'#','<!--'),';','--> ') AS XML) AS SeeTheIntermediateXML
      ,CAST(REPLACE(REPLACE((SELECT t.YourString AS [*] FOR XML PATH('')),'#','<!--'),';','--> ') AS XML).value('.','nvarchar(max)') AS CleanedValue
FROM @tbl t;
The result
+----+-------------------------------------------------------------------------+-----------------------------------------+
| ID | YourString                                                              | CleanedValue                            |
+----+-------------------------------------------------------------------------+-----------------------------------------+
| 1  | Here is one without                                                     | Here is one without                     |
+----+-------------------------------------------------------------------------+-----------------------------------------+
| 2  | One#some comment;in here                                                | One in here                             |
+----+-------------------------------------------------------------------------+-----------------------------------------+
| 3  | Two comments#some comment;in here#here is the second;and some more text | Two comments in here and some more text |
+----+-------------------------------------------------------------------------+-----------------------------------------+
The idea in short:
Using some string methods we can wrap your unwanted text in XML comments.
Look at this
Two comments<!--some comment--> in here<!--here is the second--> and some more text
Reading this XML with .value() the content will be returned without the comments.
Hint 1: Use '-->;' in your replacement to keep the semi-colon as delimiter.
Hint 2: If there might be a semi-colon ; somewhere else in your string, you would see the --> in the result. In this case you'd need a third REPLACE() against the resulting string.
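As a rough illustration of Hint 2 (the sample string below is made up): after .value() has stripped the comments, any leftover '--> ' can only have come from a stray semicolon, so one more REPLACE() turns it back into one:
DECLARE @s VARCHAR(100) = 'Stray; semicolon#drop this;but keep the rest';
SELECT REPLACE(CAST(REPLACE(REPLACE((SELECT @s AS [*] FOR XML PATH('')),'#','<!--'),';','--> ') AS XML).value('.','nvarchar(max)'), '--> ', '; ') AS CleanedValue;
-- returns: Stray; semicolon but keep the rest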

Can I disable dictionary in postgres ts_vector / ts_query full text search?

I need to do a text search on machine language. If I use any of the available text search dictionaries, the ts_vectors get mangled.
For example, 'move' becomes 'mov' and my search is failing.
Any idea how to index non-lingual words?
Thanks!
Have you tried the simple dictionary with an empty stop word file?
Create an empty stop word file $(pg_config --sharedir)/tsearch_data/empty.stop and run:
CREATE TEXT SEARCH DICTIONARY machine (
TEMPLATE = pg_catalog.simple,
STOPWORDS = empty
);
CREATE TEXT SEARCH CONFIGURATION machine (
PARSER = default
);
ALTER TEXT SEARCH CONFIGURATION machine
ADD MAPPING FOR asciiword, word, numword, asciihword, hword,
numhword, hword_asciipart, hword_part,
hword_numpart, email, protocol, url, host,
url_path, file, sfloat, float, int, uint,
version, tag, entity, blank
WITH machine;
Then you can get:
test=> SELECT * FROM ts_debug('machine', 'move');
   alias   |   description   | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+--------------+------------+---------
 asciiword | Word, all ASCII | move  | {machine}    | machine    | {move}
(1 row)
If you want this configuration by default (so you don't have to specify 'machine' all the time), change the parameter default_text_search_config appropriately.
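For example, for the test database from the prompt above:
-- for the current session only
SET default_text_search_config = 'machine';

-- or persistently for the database (takes effect in new sessions)
ALTER DATABASE test SET default_text_search_config = 'machine';

-- to_tsvector/to_tsquery without an explicit configuration now use 'machine'
SELECT to_tsvector('move') @@ to_tsquery('move');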

PostgreSQL full text search abbreviations

I created a PostgreSQL full text search using 'german'. How can I configure it so that when I search for "Bezirk", lines containing "Bez." are also a match? (And vice-versa)
@pozs is right. You need to use a synonym dictionary.
1 - In the directory $SHAREDIR/tsearch_data create the file german.syn with the following contents:
Bez Bezirk
2 - Execute the query:
CREATE TEXT SEARCH DICTIONARY german_syn (
template = synonym,
synonyms = german);
CREATE TEXT SEARCH CONFIGURATION german_syn(COPY='simple');
ALTER TEXT SEARCH CONFIGURATION german_syn
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH german_syn, german_stem;
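You can check the dictionary on its own with ts_lexize; given the german.syn file above, the abbreviation should map to the full word:
SELECT ts_lexize('german_syn', 'Bez');
-- should return {bezirk}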
Now you can test it. Execute queries:
test=# SELECT to_tsvector('german_syn', 'Bezirk') @@ to_tsquery('german_syn', 'Bezirk & Bez');
?column?
----------
t
(1 row)
test=# SELECT to_tsvector('german_syn', 'Bez Bez.') @@ to_tsquery('german_syn', 'Bezirk');
?column?
----------
t
(1 row)
Additional links:
PostgreSQL: A Full Text Search engine (expired)
Try using a wildcard in your search.
For example:
tableName.column LIKE 'Bez%'
The % matches any sequence of characters after 'Bez'.
The description is too vague to tell exactly what you are trying to achieve, but it looks like you need simple pattern matching, since you are looking for abbreviations (so there is no need for stemming as in full text search). I would go with pg_trgm for this purpose:
WITH t(word) AS ( VALUES
('Bez'),
('Bezi'),
('Bezir')
)
SELECT word, similarity(word, 'Bezirk') AS similarity
FROM t
WHERE word % 'Bezirk'
ORDER BY similarity DESC;
Result:
 word  | similarity
-------+------------
 Bezir |      0.625
 Bezi  |        0.5
 Bez   |      0.375
(3 rows)
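Note that the % operator and similarity() come from the pg_trgm extension; for a real table you would also want a trigram index. A rough sketch (the table and column names are made up):
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- hypothetical table holding the words/abbreviations to search
CREATE TABLE words (word text);

CREATE INDEX words_word_trgm_idx ON words USING gin (word gin_trgm_ops);

-- % uses the similarity threshold (0.3 by default)
SELECT word, similarity(word, 'Bezirk') AS similarity
FROM   words
WHERE  word % 'Bezirk'
ORDER  BY similarity DESC;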

Escaping Backslashes in Postgresql

I need to write a file to disk from Postgres that contains the character string of a backslash immediately followed by a forward slash, \/.
Code similar to this has not worked:
drop table if exists test;
create temporary table test (linetext text);
insert into test values ('\/\/foo foo foo\/bar\/bar');
copy (select linetext from test) to '/filepath/postproductionscript.sh';
The above code yields \\/\\/foo foo foo\\/bar\\/bar ... it inserts an extra backslash.
When you view the temp table, the string is correctly viewed as \/\/, so I am not sure where or when the text is changed into \\/\\/
I've tried doubling the \, variations of E before the string, and quote_literal() without luck.
I have not found a solution here: Postgres Manual.
Running Postgres 9.2, encoded UTF-8.
The problem is that COPY is not intended to write out plain-text files. It is intended to write out files that can be read back by COPY. And the semi-internal encoding that it uses does some backslash escaping.
For what you want to do, you need to write some custom code. Either use a normal client library to read the query results and write them to a file, or, if you want to do it in-server, use something like PL/Perl or PL/Python.
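For example, with psql as the client, unaligned tuples-only output writes the text verbatim, without any backslash escaping (a sketch; the database name is a placeholder, and the source table would have to be a regular table rather than a temporary one so that a separate session can see it):
psql -d mydb -At -c 'SELECT linetext FROM test' -o /filepath/postproductionscript.sh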
The \ escaping is only recognised if the string literal is prefixed with E; otherwise the standard_conforming_strings setting (or the like) is respected (ANSI SQL has a different way of string escaping, probably stemming from COBOL ;-).
drop table if exists test;
create temporary table test (linetext text);
insert into test values ( E'\/\/foo foo foo\/bar\/bar');
copy (select linetext from test) to '/tmp/postproductionscript.sh';
UPDATE: an ugly hack is to use .csv format and still use \t as the delimiter.
The #!/bin/sh as a shebang header line should be considered a feature.
-- without a header line
drop table if exists test;
create temporary table test (linetext text);
insert into test values ( '\/\/foo foo foo\/bar\/bar');
copy (select linetext AS "#linetext" from test) to '/tmp/postproductionscript_c.sh'
WITH CSV
DELIMITER E'\t'
;
-- with a shebang header line
drop table if exists test;
create temporary table test (linetext text);
insert into test values ( '\/\/foo foo foo\/bar\/bar');
copy (select linetext AS "#!/bin/sh" from test) to '/tmp/postproductionscript_h.sh'
WITH CSV
HEADER
DELIMITER E'\t'
;

ERROR: COPY delimiter must be a single one-byte character

I want to load the data from a flat file with delimiter "~,~" into a PostgreSQL table. I have tried it as below, but it looks like there is a restriction on the delimiter. If the COPY statement doesn't allow multiple characters for the delimiter, is there any alternative way to do this?
metadb=# \COPY public.CME_DATA_STAGE_TRANS FROM 'E:\Infor\Outbound_Marketing\7.2.1\EM\metadata\pgtrans.log' WITH DELIMITER AS '~,~'
ERROR: COPY delimiter must be a single one-byte character
\copy: ERROR: COPY delimiter must be a single one-byte character
If you are using Vertica, you could use E'\t' or U&'\0009':
To indicate a non-printing delimiter character (such as a tab),
specify the character in extended string syntax (E'...'). If your
database has StandardConformingStrings enabled, use a Unicode string
literal (U&'...'). For example, use either E'\t' or U&'\0009' to
specify tab as the delimiter.
Unfortunately there is no way to load a flat file with the multiple-character delimiter ~,~ in Postgres, unless you want to modify the source code (and recompile, of course) yourself in some (terrific) way:
/* Only single-byte delimiter strings are supported. */
if (strlen(cstate->delim) != 1)
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("COPY delimiter must be a single one-byte character")));
What you want is to preprocess your input file with some external tool; for example, sed might be the best companion on a GNU/Linux platform:
sed 's/~,~/\t/g' inputFile
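After that preprocessing step the file has a single-byte (tab) delimiter and can be loaded as usual; something like this should work (the preprocessed file name is made up):
sed 's/~,~/\t/g' pgtrans.log > pgtrans_tab.log

then, in psql:

\COPY public.CME_DATA_STAGE_TRANS FROM 'pgtrans_tab.log' WITH DELIMITER AS E'\t'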
The obvious thing to do is what all the other answers advised: edit the import file. I would do that, too.
However, as a proof of concept, here are two ways to accomplish this without additional tools.
1) General solution
CREATE OR REPLACE FUNCTION f_import_file(OUT my_count integer)
  RETURNS integer AS
$BODY$
DECLARE
   myfile   text;                         -- the file is read into this variable
   datafile text := '\path\to\file.txt';  -- !pg_read_file only accepts relative path in database dir!
BEGIN
   myfile := pg_read_file(datafile, 0, 100000000);  -- arbitrary 100 MB max.

   INSERT INTO public.my_tbl
   SELECT ('(' || regexp_split_to_table(replace(myfile, '~,~', ','), E'\n') || ')')::public.my_tbl;
   -- !depending on file format, you might need additional quotes to create a valid row format.

   GET DIAGNOSTICS my_count = ROW_COUNT;
END;
$BODY$
LANGUAGE plpgsql VOLATILE;
This uses a number of pretty advanced features. If anybody is actually interested and needs an explanation, leave a comment to this post and I will elaborate.
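To call it (assuming the target table public.my_tbl and the file path inside the function exist):
SELECT f_import_file() AS rows_inserted;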
2) Special case
If you can guarantee that '~' is only present in the delimiter '~,~', then you can go ahead with a plain COPY in this special case. Just treat the ',' in '~,~' as an additional column.
Say, your table looks like this:
CREATE TABLE foo (a int, b int, c int);
Then you can (in one transaction):
CREATE TEMP TABLE foo_tmp (
   a int, tmp1 "char"
 , b int, tmp2 "char"
 , c int
) ON COMMIT DROP;
COPY foo_tmp FROM '\path\to\file.txt' WITH DELIMITER AS '~';
ALTER TABLE foo_tmp DROP COLUMN tmp1;
ALTER TABLE foo_tmp DROP COLUMN tmp2;
INSERT INTO foo SELECT * FROM foo_tmp;
Not quite sure if you're looking for a postgresql solution or just a general one.
If it were me, I would open up a copy of vim (or gvim) and run the command :%s/~,~/~/g
That replaces all "~,~" with "~".
You can use a single-character delimiter: open Notepad, press Ctrl+H, and replace ~,~ with something that will not interfere, like |.