Disabling the PostgreSQL 8.4 tsvector parser's `file` token type - postgresql

I have some documents that contain sequences such as radio/tested that I would like to return hits in queries like
select * from doc
where to_tsvector('english',body) ## to_tsvector('english','radio')
Unfortunately, the default parser takes radio/tested as a file token (despite being in a Windows environment), so it doesn't match the above query. When I run ts_debug on it, that's when I see that it's being recognized as a file, and the lexeme ends up being radio/tested rather than the two lexemes radio and test.
Is there any way to configure the parser not to look for file tokens? I tried
ALTER TEXT SEARCH CONFIGURATION public.english
DROP MAPPING FOR file;
...but it didn't change the output of ts_debug. If there's some way of disabling file, or at least having it recognize both file and all the words that it thinks make up the directory names along the way, or if there's a way to get it to treat slashes as hyphens or spaces (without the performance hit of regexp_replaceing them myself) that would be really helpful.

I think the only way to do what you want is to create your own parser :-( Copy wparser_def.c to a new file, remove from the parse tables (actionTPS_Base and the ones following it) the entries that relate to files (TPS_InFileFirst, TPS_InFileNext etc), and you should be set. I think the main difficulty is making the module conform to PostgreSQL's C idiom (PG_FUNCTION_INFO_V1 and so on). Have a look at contrib/test_parser/ for an example.

Related

make tsvector tokenize by space only

I need to create a tsvector that does not split its content by hyphens but ideally only by whitespace.
select to_tsvector('simple','7073-03-001-01 7072-05-003-06')
creates
'-001':3 '-003':7 '-01':4 '-03':2 '-05':6 '-06':8 '7072':5 '7073':1
where I rather want
'7072-05-003-06':2 '7073-03-001-01':1
is this possible somehow?
There is a simple example of a parser called test_parser which seems to do what you want. It was last in the documents in 9.4, after that it was moved to only be documented in the source tree. These test extensions aren't always installed, so you might need to take special steps (depending on how you installed PostgreSQL and what your OS is and whether you are really using an EOL version) to get it.
create extension test_parser ;
create text search configuration test ( parser = testparser);
ALTER TEXT SEARCH CONFIGURATION test ADD MAPPING FOR word WITH simple;
SELECT * FROM to_tsvector('test', '7073-03-001-01 7072-05-003-06');
to_tsvector
---------------------------------------
'7072-05-003-06':2 '7073-03-001-01':1

rename file when using DD name

In a 'C' language LP64 compiled program, which will run in Batch, TSO and z/OS UNIX, when opening a PDS(E) member using the following notation (recommended in order to allow file disposition to be used):-
hFile = fopen("DD:CONFIG(COPY)", "w");
fclose(hFile);
I am surprised to discover that the following does not appear to work:-
rename("DD:CONFIG(COPY)","DD:CONFIG(MAIN)");
Failing as it does with an errno of ENOENT (EDC5129I No such file or directory.)
The documentation for rename says:-
The rename() function renames memory files and DASD data sets. It also renames individual members of PDSs (and PDSEs)
If instead I do:-
rename("//'MYUSER.CONFIG(COPY)'","//'MYUSER.CONFIG(MAIN)'");
the rename() works.
Alternatively if I do:-
rename("//'MYUSER.CONFIG(COPY)'","DD:CONFIG(MAIN)");
if fails with an errno of EINVAL (EDC5121I Invalid argument.)
Why does it not accept the same file name notation that is used for fopen?
The reason this is important is because the rename() cannot succeed while the PDSE is being browsed by someone. Whereas, using the DD: notation allows an fopen() for write to succeed when the PDSE is being browsed because the DISP=SHR coded on the DD name in the JCL is adopted by the fopen().
So, I suppose the real question is - how can my program rename a PDSE member in a way that will succeed when the PDSE is also being browsed by someone?
The technique required to rename a dataset is different than the technique to rename a member inside a PDS/PDSE...I'd wager that the system rename() function you're calling is just getting this wrong. In z/OS, there are lots of combinations functions like "rename()" have to handle, and it's not unusual to find some that don't work as you expect.
Certainly it's worth a call to IBM Support to see if there's something else going on here...what you're trying to do seems like it should work, so I think there's something to be said for treating it like a bug or documentation error.
Beyond that, as you suggest, you can either use the form of rename that works, or you can replace the system's rename function with something that actually works properly.
One simple way would be to create the rename() as you show it:
rename("//'MYUSER.CONFIG(COPY)'","//'MYUSER.CONFIG(MAIN)'");
You can get the DSN for a DDNAME using the fldata() function, so it's not hard to create a rename like this on the fly given an open file handle. Beware that the form of rename may allocate the file you specify with DISP=OLD, and hence cause problems if some other task has the file allocated. Also, if this is supposed to be commercial quality code, as a customer, my eyebrows would go up if I found out you needed to launch some external program because you couldn't figure out how to rename a PDS/PDSE member - but that might just be me.
The other alternative is to write your own "rename()" function...unfortunately, it most likely would need to be assembler language if you want it to be efficient. As others suggest, you might spawn off a shell, REXX or TSO command, but of course, that means creating a new process, etc etc etc just to rename the PDS/PDSE member. Keep in mind also that some of these approaches might also have issues with trying to allocate the input file with DISP=OLD.
If that's too slow for your needs, the way to do what you want is to call a small assembler routine that invokes the system STOW service against your DDNAME to do your rename. The flow would be something like this:
You'd create a 16-byte area containing the old and new member names. They're 8 characters each and blank padded.
You'd need the address of an open DCB that describes the file you're looking at. You can get the DCB address from the FILE structure, I believe - or you could just open a second DCB to the DDNAME you have allocated.
You'd call the system STOW service with the parameters that tell it to rename a PDS/PDSE member:
STOW dcb,area_from_step1,C
In the STOW macro above, the "directory option" of "C" tells STOW that you want to rename an existing member. The area_from_step1 has the current and new member names - the system searches the directory for the current name and rewrites it with the new member name in place.
To be honest, what I describe above is exactly what the system runtime should be doing, but if it's not and IBM doesn't want to fix it, then you might prefer to do this sort of thing "by hand".
Not sure if this will work, but since you have the dataset already allocated, perhaps you could "call" (for some value of call) IEHPROGM from your program, constucting the proper SYSIN before making the call?
Here's a link to the IBM example for IEHPROGM (mind any break):
https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.1.0/com.ibm.zos.v2r1.idau100/u1354.htm
--Scott

Dataprep import dataset does not detect headers in first row automatically

I am importing a dataset from Google Cloud Storage (parameterized) into Dataprep. So far, this worked perfectly fine and one of the feature that I liked is that it auto detects that the first row in my (application/octet-stream) .csv file are my headers.
However, today I tried to import a new dataset and it did not detect the headers, but it auto assigned column1, column2...
What has changed and or why is this the case. I have checked the box auto-detect and use UTF-8:
While the auto-detect option is usually pretty good, there are times that it fails for numerous reasons. I've specifically noticed this when the field names contain certain characters (e.g. comma, invisible characters like zero-width-non-joiners, null bytes), or when multiple different styles of newline delimiters are used within the same file.
Another case I saw this is when there were more columns of data than there were headers.
As you already hit on, you can use the following snippet to do mostly the same thing:
rename type: header method: filter sanitize: true
. . . or make separate recipe steps to convert the first row to header and then bulk-rename to your own liking.
More often than not, however, I've found that when auto-detect fails on a previously working file, it tends to be a sign of some sort of issue with the source file. I would look for mismatched data, as well as misplaced commas within the output, as well as comparing the header and some data rows to the original source using a plaintext editor.
When all else fails, you can try a CSV validator . . . but in my experience they tend to be incredibly opinionated when it comes to the formatting options of the file—so depending on the system generating the CSV, it could either miss any errors or give false-positives. I have had two experiences where auto-detect fails for no apparent reason on perfectly clean files, so it is possible that process was just skipped for some reason.
It should also be noted that if you have a structured file that was correctly detected but want to revert it, you can go to the dataset details, select the "..." (More) button, and choose "Remove structure..." (I'm hoping that one day they'll let you do the opposite when you want to add structure to a raw dataset or work around bugs like this!)
Best of luck!
Can be resolved as a transformation within a Flow:
rename type: header method: filter sanitize: true

Postgres full text search ignore url

I am trying to use PostgreSQL to implement a full-text search system.
I encounter this strange or may be intended feature with that.
While trying to index or search for a column which contains names of files with extension (e.g. myimage.jpg), the system treats it as a url and does not properly tokenize.
I referred to the documentation and see that via ts_debug that the file name is taken as a host of a url.
Could some one tell how to take all inputs as normal word in the FTS of PostgreSQL.
Also, on a second request, how can one do a contains, startswith, and endswith searches with it?
Update
I have now tried the statement create text search configuration..., copied from pg_catalog.english and removed host,url, and url_path and then specified the configuration for the ts_debug method. But still no go., myimage.jpg is still identified as host.
Version
I use version 9.4
tl;dr Look at pre-parsing your input and removing punctuation if you really only want words (and not emails, urls, hosts, etc).
So after trying to figure this out myself the issue is that you don't seem to be able to easily customise the parser. From my understanding the parser runs first, which generates tokens. Those tokens are then matched to dictionaries.
By removing host, url, url_path from the configuration all you are doing is making it so that these tokens don't get looked up in a dictionary, resulting in no lexeme from these tokens. Which essentially means that they don't exist in terms of search. Which is not want you want...
Ideally what you need to do is customise the parser to not generate those tokens in the first place, or to also generate overlapping tokens (similar to how hyphenated words generate a token for the entire word as well as individual components) . This doesn't seem to be possible at the moment without writing a custom parser.
The only solution to this would be to pre-parse the text to remove the full stop. Note that if you rely on other types of tokens like version (e.g. 8.3.0) or email (e.g. name#domain.com) this will break those. So you may need to be a bit clever on how you remove characters.
select ts_debug('english', replace('this-is-a-file.jpg', '.', ' '));
"(asciihword,"Hyphenated word, all ASCII",this-is-a-file,{english_stem},english_stem,{this-is-a-fil})"
"(hword_asciipart,"Hyphenated word part, all ASCII",this,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",is,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",a,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",file,{english_stem},english_stem,{file})"
"(blank,"Space symbols"," ",{},,)"
"(asciiword,"Word, all ASCII",jpg,{english_stem},english_stem,{jpg})"
In terms of your second question. Are you talking about partial word matches? You get this a little bit with the stemming when using a config like english, so running becomes run which will match if you search for run or running. If you're talking about fuzzy matching it gets a little more complicated. I suggest reading this article http://rachbelaid.com/postgres-full-text-search-is-good-enough/

When and how might the operating system store a file under a different name than I gave it?

I found this statement under another SO question concerning Unicode and I'd like to ask for further elaboration of this rather surprising fact.
Code that believes once you successfully create a file by a given name, that when you run ls or readdir on its enclosing directory,
you'll actually find that file with the name you created it under is
buggy, broken, and wrong. Stop being surprised by this!
When does this happen and what to do about it?
The first example which comes to my mind: If you create a file under OSX that is named é (single U+00E9 codepoint), the OS will store it actually as U+0065 U+0301 (Unicode decomposition). The file will be still accessible under the original name, but listed as decomposed.
How to avoid: don't lookup your files manually unless you are sure their names are pure ASCII.
Second: On Windows, if you have a file called e, try creating (with overwriting enabled) a file called E, the OS will still list a file called e. If e didn't exists beforehand, a file called E would be created.
How to avoid: don't lookup your files manually unless you are sure their names are pure ASCII, and take case into account. Try using a consistent capitalisation style. I suggest going all lowercase.
Third: on Windows, if for example you have Windows 1250 as your system encoding, and you want to create a file named ê via the narrow, char-based API, a file called e will be created instead. This of course is easy to avoid, but this exact problem bit me once: WinRAR extracted files ê.png, è.png and e.png all into e.png, overwriting data. Similar problems can happen with other encoding mixups, too.
How to avoid: don't use API's that take the filename as a char* on Windows.