Postgres full text search: ignore URLs

I am trying to use PostgreSQL to implement a full-text search system.
I have run into a strange, or maybe intended, behavior.
When indexing or searching a column that contains file names with extensions (e.g. myimage.jpg), the system treats them as URLs and does not tokenize them properly.
I checked the documentation and can see via ts_debug that the file name is taken as the host part of a URL.
Could someone tell me how to make PostgreSQL's FTS treat all input as plain words?
Also, as a second question: how can one do contains, starts-with, and ends-with searches with it?
Update
I have now tried create text search configuration..., copied from pg_catalog.english, removed host, url, and url_path, and then specified that configuration for the ts_debug call. But still no go: myimage.jpg is still identified as a host.
Version
I am using version 9.4.

tl;dr: Look at pre-parsing your input and removing punctuation if you really only want words (and not emails, URLs, hosts, etc.).
After trying to figure this out myself: the issue is that you don't seem to be able to easily customise the parser. From my understanding, the parser runs first and generates tokens; those tokens are then matched against dictionaries.
By removing host, url, and url_path from the configuration, all you are doing is making sure those tokens never get looked up in a dictionary, so they produce no lexemes. That essentially means they don't exist as far as search is concerned, which is not what you want...
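For illustration, here is roughly what that dead end looks like in SQL (a sketch; the configuration name no_urls is arbitrary):

-- Copy the english config and drop the URL-related mappings. The parser
-- still emits a "host" token for myimage.jpg; the token just maps to no
-- dictionary, so the file name becomes unsearchable.
create text search configuration no_urls ( copy = pg_catalog.english );
alter text search configuration no_urls
    drop mapping for host, url, url_path;

select to_tsvector('no_urls', 'myimage.jpg');  -- returns an empty tsvector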
Ideally what you need to do is customise the parser so that it doesn't generate those tokens in the first place, or so that it also generates overlapping tokens (similar to how hyphenated words generate a token for the entire word as well as for its individual components). This doesn't seem to be possible at the moment without writing a custom parser.
The only solution then is to pre-parse the text to remove the full stop. Note that if you rely on other token types like version (e.g. 8.3.0) or email (e.g. name@domain.com), this will break those, so you may need to be a bit clever about how you remove characters; one way of being clever is sketched after the example below.
select ts_debug('english', replace('this-is-a-file.jpg', '.', ' '));
"(asciihword,"Hyphenated word, all ASCII",this-is-a-file,{english_stem},english_stem,{this-is-a-fil})"
"(hword_asciipart,"Hyphenated word part, all ASCII",this,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",is,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",a,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",file,{english_stem},english_stem,{file})"
"(blank,"Space symbols"," ",{},,)"
"(asciiword,"Word, all ASCII",jpg,{english_stem},english_stem,{jpg})"
In terms of your second question: are you talking about partial word matches? You get this a little with stemming when using a config like english: running becomes run, which will match whether you search for run or running. If you're talking about fuzzy matching it gets a little more complicated. I suggest reading this article: http://rachbelaid.com/postgres-full-text-search-is-good-enough/
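More concretely, starts-with is the only one of the three that FTS supports natively, via prefix matching; contains and ends-with are usually handled outside tsvectors with the pg_trgm extension (available on 9.4). A sketch, where the docs table and body column are placeholders:

-- Starts-with: prefix matching against lexemes.
select to_tsvector('english', 'running fast')
       @@ to_tsquery('english', 'run:*');  -- true

-- Contains / ends-with: tsvectors only match whole lexemes, so use a
-- trigram index over the raw text instead.
create extension if not exists pg_trgm;
create index docs_body_trgm_idx on docs using gin (body gin_trgm_ops);
select * from docs where body ilike '%radio%';  -- contains
select * from docs where body ilike '%.jpg';    -- ends-with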


VSCode: activeTextEditor encoding

Is there any way to get current document encoding (that is in the bottom bar) in my extension code?
Something like vscode.window.activeTextEditor.encoding
This does not appear to be possible.
Since it's nearly impossible to prove a negative, the rest of this answer documents what I explored.
The string "encoding" does not appear (in this sense) anywhere in the API docs nor in the index.d.ts file it is derived from. (With VSCode 1.37.1, current as of writing.)
I dug into the vscode sources to see if there might be a clever solution, but came up empty. The code that executes when the encoding is changed by the user is in editorStatus.ts, class ChangeEncodingAction. This makes its way to textFileEditorModel.ts, function updatePreferredEncoding, which sets preferredEncoding. That field controls what happens when the file is saved, and is used to populate the status indicator, but doesn't go anywhere else I can find.
Reading the status indicator itself does not appear possible since the API allows extensions to create new indicators with window.createStatusBarItem but not enumerate existing ones. And directly accessing the DOM is not possible.
I also came up empty searching through VSCode issues related to encoding, both open and closed, but only skimmed the most recent ~100 closed issue titles.
Alternatives
My main suggestion at this point would be to file an enhancement request on the VSCode GitHub.
It should also be possible to do something with reflection but of course it would be fragile.
Finally, the encoding controls how the document in memory (a sequence of characters) maps to a file on disk (a sequence of bytes). Depending on what you're trying to do, it might work to speculatively encode the document in several encodings and compare each to what is on disk (so long as the file is not dirty).
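To sketch that last idea (untested; iconv-lite is an assumed dependency, and the candidate list is arbitrary):

// Guess the active document's encoding by re-encoding its text in a few
// candidate encodings and comparing the bytes to the file on disk.
// Only meaningful when the document is not dirty.
import * as vscode from 'vscode';
import * as fs from 'fs';
import * as iconv from 'iconv-lite';  // assumed dependency

function guessEncoding(): string | undefined {
    const editor = vscode.window.activeTextEditor;
    if (!editor || editor.document.isDirty) {
        return undefined;
    }
    const text = editor.document.getText();
    const onDisk = fs.readFileSync(editor.document.uri.fsPath);
    const candidates = ['utf8', 'utf16le', 'latin1', 'win1252'];
    return candidates.find(enc => iconv.encode(text, enc).equals(onDisk));
}

Note that this cannot distinguish encodings that happen to produce identical bytes for the document's characters (e.g. a pure-ASCII file).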

Dataprep import dataset does not detect headers in first row automatically

I am importing a dataset from Google Cloud Storage (parameterized) into Dataprep. So far this has worked perfectly fine, and one of the features I liked is that it auto-detects that the first row in my (application/octet-stream) .csv file contains my headers.
However, today I tried to import a new dataset and it did not detect the headers, but auto-assigned column1, column2...
What has changed, and why is this the case? I have checked the auto-detect box and use UTF-8.
While the auto-detect option is usually pretty good, there are times it fails, for numerous reasons. I've specifically noticed this when the field names contain certain characters (e.g. commas, invisible characters like zero-width non-joiners, or null bytes), or when multiple different styles of newline delimiters are used within the same file.
Another case where I've seen this is when there were more columns of data than there were headers.
As you already hit on, you can use the following snippet to do mostly the same thing:
rename type: header method: filter sanitize: true
. . . or make separate recipe steps to convert the first row to header and then bulk-rename to your own liking.
More often than not, however, I've found that when auto-detect fails on a previously working file, it tends to be a sign of some sort of issue with the source file. I would look for mismatched data and misplaced commas in the output, and compare the header and some data rows against the original source in a plaintext editor.
When all else fails, you can try a CSV validator . . . but in my experience they tend to be incredibly opinionated when it comes to the formatting options of the file, so depending on the system generating the CSV, it could either miss errors or give false positives. I have had two experiences where auto-detect failed for no apparent reason on perfectly clean files, so it is also possible the detection process was simply skipped for some reason.
It should also be noted that if you have a structured file that was correctly detected but want to revert it, you can go to the dataset details, select the "..." (More) button, and choose "Remove structure..." (I'm hoping that one day they'll let you do the opposite when you want to add structure to a raw dataset or work around bugs like this!)
Best of luck!
Can be resolved as a transformation within a Flow:
rename type: header method: filter sanitize: true

Configuration Key Value Store

I'm in the planning stages of a script/app that I'm going to need to write soon. In short, I'm going to have a configuration file that stores multiple key value pairs for a system configuration. Various applications will talk to this file including python/shell/rc scripts.
One example case would be that when the system boots, it pulls the static IP to assign to itself from that file. This means it would be nice to quickly grab a key/value from this file in a shell/rc script (ifconfig `evalconffile main_interface` `evalconffile primary_ip` up), where evalconffile is the script that fetches the value when provided with a key.
I'm looking for suggestions on the best way to approach this. I've tossed around the idea of using a plain text file and Perl to retrieve the value. I've also tossed around the idea of using YAML for the configuration file, since there may end up being a use case where we need multiple values for a key, and for general expansion. I know YAML would make it accessible from Python and Perl, but I'm not sure what the best way to quickly access it from a shell/rc script would be.
Am I headed in the right direction?
One approach would be to simply use YAML as you wanted, and then when a shell/rc script wants a key/value pair, it would call a small Perl script (the evalconffile in your example) that parses the YAML on the shell script's behalf and prints out the value(s).
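The answer suggests Perl; since the question mentions Python too, here is the same idea sketched in Python with PyYAML (the config path is a placeholder):

#!/usr/bin/env python
# Hypothetical evalconffile: print the value(s) for a key from a YAML
# config so shell/rc scripts can capture it with backticks.
import sys
import yaml  # PyYAML, an assumed dependency

with open('/etc/sysconf.yaml') as f:  # placeholder path
    config = yaml.safe_load(f)

value = config[sys.argv[1]]
if isinstance(value, list):  # the "multiple values per key" case
    print(' '.join(str(v) for v in value))
else:
    print(value)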
SQLite would give you the greatest flexibility, since you don't seem to know the scope of what will be stored in there. There appears to be support for it in all the scripting languages you mentioned.
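A sketch of what that might look like (hypothetical schema and values); shell/rc scripts would then fetch a value through the sqlite3 command-line tool:

-- Hypothetical schema. A shell script can read a value with e.g.:
--   sqlite3 /etc/sysconf.db "select value from config where key = 'primary_ip';"
create table config (
    key   text primary key,
    value text not null
);
insert into config (key, value) values
    ('main_interface', 'em0'),
    ('primary_ip', '192.0.2.10');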

Idempotent PowerShell Word search/replace across documents with headers, change tracking, etc.

I've found one or two guides to doing a Word search and replace across multiple documents with PowerShell. They work well on simple documents. However, the script ignores text in headers and footers, and if "track changes" is enabled, it replaces text that has already been replaced, resulting in multiple copies of the new text if I run the script more than once on the same file.
Any clues as to how I can avoid these undesirable behaviors and make this script robust?
(reposted from Serverfault).
For replacing text in all parts of a Word document see:
Using a macro to replace text where ever it appears in a document
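Translating that approach to PowerShell, a rough and untested sketch (the file path and search terms are placeholders): walk every StoryRange so headers and footers are covered, and turn off change tracking first so re-runs stay idempotent.

# Untested sketch using the Word COM API.
$word = New-Object -ComObject Word.Application
$word.Visible = $false
$doc = $word.Documents.Open("C:\docs\example.docx")

$doc.TrackRevisions = $false   # avoid stacking tracked changes on re-runs
# $doc.AcceptAllRevisions()    # optionally flatten existing revisions first

foreach ($story in $doc.StoryRanges) {
    $range = $story
    while ($null -ne $range) {
        # Find.Execute(FindText, MatchCase, MatchWholeWord, MatchWildcards,
        #   MatchSoundsLike, MatchAllWordForms, Forward, Wrap, Format,
        #   ReplaceWith, Replace); 1 = wdFindContinue, 2 = wdReplaceAll
        $range.Find.Execute("oldtext", $false, $false, $false, $false,
            $false, $true, 1, $false, "newtext", 2) | Out-Null
        $range = $range.NextStoryRange  # e.g. per-section headers/footers
    }
}

$doc.Save()
$doc.Close()
$word.Quit()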

Disabling the PostgreSQL 8.4 tsvector parser's `file` token type

I have some documents that contain sequences such as radio/tested that I would like to have return hits in queries like
select * from doc
where to_tsvector('english',body) @@ to_tsquery('english','radio')
Unfortunately, the default parser takes radio/tested as a file token (despite this being a Windows environment), so it doesn't match the above query. When I run ts_debug on it, I can see that it's being recognized as a file, and the lexeme ends up being radio/tested rather than the two lexemes radio and test.
Is there any way to configure the parser not to look for file tokens? I tried
ALTER TEXT SEARCH CONFIGURATION public.english
DROP MAPPING FOR file;
...but it didn't change the output of ts_debug. If there's some way of disabling file, or at least having it recognize both the file token and all the words it thinks make up the directory names along the way, or if there's a way to get it to treat slashes as hyphens or spaces (without the performance hit of regexp_replace-ing them myself), that would be really helpful.
I think the only way to do what you want is to create your own parser :-( Copy wparser_def.c to a new file, remove from the parse tables (actionTPS_Base and the ones following it) the entries that relate to files (TPS_InFileFirst, TPS_InFileNext etc), and you should be set. I think the main difficulty is making the module conform to PostgreSQL's C idiom (PG_FUNCTION_INFO_V1 and so on). Have a look at contrib/test_parser/ for an example.
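If a custom parser is more than you want to take on, the regexp_replace route the question hoped to avoid does work, and an expression index keeps the per-query cost down. A sketch, using the doc table and body column from the question:

-- Turn slashes into spaces before tokenizing, so radio/tested yields
-- the lexemes radio and test; index the same expression used in queries.
create index doc_body_fts_idx on doc
    using gin (to_tsvector('english', regexp_replace(body, '/', ' ', 'g')));

select * from doc
where to_tsvector('english', regexp_replace(body, '/', ' ', 'g'))
      @@ to_tsquery('english','radio');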