In DB2 SQL RegEx, how can a conditional replacement be done without CASE WHEN END..? - db2

I have a DB2 v7r3 SQL SELECT statement with three instances of REGEXP_SUBSTR(), all with the same regex pattern string, each of which extract one of three groups.
I'd like to change the first SUBSTR to REGEXP_REPLACE() to do a conditional replacement if there's no match, to insert a default value similarly to the ELSE section of a CASE...END. But I can't make it work. I could easily use a CASE, but it seems more compact & efficient to use RegEx.
For example, I have descriptions of food containers sizes, in various states of completeness:
12X125
6X350
1X1500
1500ML
1000
The last two don't have the 'nnX' part at the beginning, in which case '1X' is assumed and needs to be inserted.
This is my current working pattern string:
^(?:(\d{1,3})(?:X))?((?:\d{1,4})(?:\.\d{1,3})?)(L|ML|PK|Z|)$
The groups returned are: quantity, size, and unit.
But only the first group needs the conditional replacement:
(?:(\d{1,3})(?:X))?
This RexEgg webpage describes the (?=...) operator, and it seems to be what I need, but I'm not sure. It's in the list of operators for my version of DB2, but I can't make it work. Frankly, it's a bit deeper than my regex knowledge, and I can't even make it work in my favorite online regex tester, Regex101.
So...does anyone have any idea or suggestions..? Thanks.

Try this (replace "digits not followed by X_or_digit"):
with t(s) as (values
'12X125'
, '6X350'
, '1X1500'
, '1500'
, '1125'
)
select regexp_replace(s, '^([\d]+(?![X\d]))', '1X\1')
from t;

Related

Prefix/wildcard searches with 'websearch_to_tsquery' in PostgreSQL Full Text Search?

I'm currently using the websearch_to_tsquery function for full text search in PostgreSQL. It all works well except for the fact that I no longer seem to be able to do partial matches.
SELECT ts_headline('english', q.\"Content\", websearch_to_tsquery('english', {request.Text}), 'MaxFragments=3,MaxWords=25,MinWords=2') Highlight, *
FROM (
SELECT ts_rank_cd(f.\"SearchVector\", websearch_to_tsquery('english', {request.Text})) AS Rank, *
FROM public.\"FileExtracts\" f, websearch_to_tsquery('english', {request.Text}) as tsq
WHERE f.\"SearchVector\" ## tsq
ORDER BY rank DESC
) q
Searches for customer work but cust* and cust:* do not.
I've had a look through the documentation and a number of articles but I can't find a lot of info on it. I haven't worked with it before so hopefully it's just something simple that I'm doing wrong?
You can't do this with websearch_to_tsquery but you can do it with to_tsquery (because ts_query allows to add a :* wildcard) and add the websearch syntax yourself in in your backend.
For example in a node.js environment you could do smth. like this:
let trimmedSearch = req.query.search.trim()
let searchArray = trimmedSearch.split(/\s+/) //split on every whitespace and remove whitespace
let searchWithStar = searchArray.join(' & ' ) + ':*' //join word back together adds AND sign in between an star on last word
let escapedSearch = yourEscapeFunction(searchWithStar)
and than use it in your SQL
search_column ## to_tsquery('english', ${escapedSearch})
You need to write the tsquery directly if you want to use partial matching. plainto_tsquery doesn't pass through partial match notation either, so what were you doing before you switched to websearch_to_tsquery?
Anything that applies a stemmer is going to have hard time handling partial match. What is it supposed to do, take off the notation, stem the part, then add it back on again? Not do stemming on the whole string? Not do stemming on just the token containing the partial match indicator? And how would it even know partial match was intended, rather than just being another piece of punctuation?
To add something on top of the other good answers here, you can also compose your query with both websearch_to_tsquery and to_tsquery to have everything from both worlds:
select * from your_table where ts_vector_col ## to_tsquery('simple', websearch_to_tsquery('simple', 'partial query')::text || ':*')
Another solution I have come up with is to do the text transform as part of the query so building the tsquery looks like this
to_tsquery(concat(regexp_replace(trim(' all the search terms here '), '\W+', ':* & '), ':*'));
(trim) Removes leading/trailing whitespace
(regexp_replace) Splits the search string on non word chars and adds trailing wildcards to each term, then ANDs the terms (:* & )
(concat) Adds a trailing wildcard to the final term
(to_tsquery) Converts to a ts_query
You can test the string manipulation by running
SELECT concat(regexp_replace(trim(' all the search terms here '), '\W+', ':* & ', 'gm'), ':*')
the result should be
all:* & the:* & search:* & terms:* & here:*
So you have multi word partial matches e.g. searching spi ma would return results matching spider man

Can I write a PCRE conditional that only needs the no-match part?

I am trying to create a regular expression to determine if a string contains a number for an SQL statement. If the value is numeric, then I want to add 1 to it. If the number is not numeric, I want to return a 1. More or less. Here is the SQL:
SELECT
field,
CASE
WHEN regexp_like(field, '^ *\d*\.?\d* *$') THEN dec(field) + 1
ELSE 1
END nextnumber
FROM mytable
This actually works, and returns something like this:
INVALID 1
00000 1
00001E 1
00379 380
00013 14
99904 99905
But to push the envelope of understanding, what if I wanted to cover negative numbers, or those with a positive sign. The sign would have to immediately precede or follow the number, but not both, and I would not want to allow white space between the sign and the number.
I came up with a conditional expression with a capture group to capture the sign on the front of the number to determine if a sign was allowed on the end, but it seems a little awkward to handle given I don't really need a yes-pattern.
Here is the modified regex: ^ ([+-]?)*\d*\.?\d*(?(1) *|[+-]? *)$
This works at regex101.com, but in order for it to work I need to have something before the pipe, so I have to duplicate the next pattern in both the yes-pattern and the no-pattern.
All that background for this question: How can I avoid that duplication?
EDIT: DB2 for i uses International Components for Unicode to provide regular expression processing. It turns out that this library does not support conditionals like PRCE, so I changed the tags on this question. The answer given by Wiktor Stribiżew provides a working alternative to the conditional by using a negative lookahead.
You do not have to duplicate the end pattern, just move it outside the conditional:
^ *([+-])?\d*\.?\d*(?(1)|[+-]?) *$
See the regex demo. So, the yes-part is empty, and the no-part has an optional pattern.
You may also solve it with a mere negative lookahead:
^ *([+-](?!.*[-+]))?\d*\.?\d*[+-]? *$
See another regex demo. Here, ([+-](?!.*[-+]))? matches (optionally) a + or - that are not followed with any 0+ char followed with another + or -.

PostgreSQL regexp.replace all unwanted chars

I have registration codes in my PostgreSQL table which are written messy, like MU-321-AB, MU/321/AB, MU 321-AB and so forth...
I would need to clear all of this to get MU321AB.
For this I uses following expression:
SELECT DISTINCT regexp_replace(ccode, '([^A-Za-z0-9])', ''), ...
This expression work as expected in 'NET' but not in PostgreSQL where it 'clears' only first occurrence of unwanted character.
How would I modify regular expression which will replace all unwanted chars in string to get clear code with only letters and numbers?
Use the global flag, but without any capture groups:
SELECT DISTINCT regexp_replace(ccode, '[^A-Za-z0-9]', '', 'g'), ...
Note that the global flag is part of the standard regular expression parser, so .NET is not following the standard in this case. Also, since you do not want anything extracted from the string - you just want to replace some characters - you should not use capture groups ().

SQL functions to match last two parts of a URL

This is a follow up question to a post here Regex for matching last two parts of a URL
I was wondering if I could use built in sql funcitons to accomplish the same type of pattern match without using regular expressions. In particular I was thinking if there was a way to reverse the string say www.stackoverflow.com to com.stackoverflow.www and then apply concatenation to split('com.stackoverflow.www , '.', 1) || split('com.stackoverflow.www , '.', 2) I would be done but I am not sure if this is possible.
Here is the general problem description:
I am trying to figure out the best sql function simply match only the last two strings seperated by a . in a url.
For instance with www.stackoverflow.com I just want to match stackoverflow.com
The issue i have is some strings can have a large number of periods for instance
a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com
should also return only yimg.com
The set of URLS I am working with does not have any of the path information so one can assume the last part of the string is always .org or .com or something of that nature.
What sql functions will return stackoverflow.com when run against www.stackoverflow.com and will return yimg.com when run against a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com under the conditions stated above? I did not want to use regular expressions in the solution just sql string manipulation functions.
doesn't look like you can do it with 8.3 function set http://www.postgresql.org/docs/8.3/static/functions-string.html
there's no reverse - that comes in 9.1. with that you could do:
select reverse(split_part(reverse(data), '.', 2)) || '.'
|| reverse(split_part(reverse(data), '.', 1))
from example;
see http://sqlfiddle.com/#!1/25e43/2/0
you can declare your own reverse: http://a-kretschmer.de/diverses.shtml and then solve this problem.
but regex is just easier...

T-SQL syntax issue with "LTRIM(RTRIM())" not working correctly

What is wrong with this statement that it is still giving me spaces after the field. This makes me think that the syntax combining the WHEN statements is off. My boss wants them combined in one statement. What am I doing wrong?
Case WHEN LTRIM(RTRIM(cSHortName))= '' Then NULL
WHEN cShortname is NOT NULL THEN
REPLACE (cShortName,SUBSTRING,(cShortName,PATINDEX('%A-Za-z0-9""},1,) ''_
end AS SHORT_NAME
Judging from the code, it seems that you may be trying to strip spaces and non-alphanumeric characters from the beginning and ending of the string.
If so, would this work for you?
I think it provides the substring from the first alphanumeric occurrence to the last.
SELECT
SUBSTRING(
cShortName,
PATINDEX('%A-Za-z0-9',cShortName),
( LEN(cShortName)
-PATINDEX('%A-Za-z0-9',REVERSE(cShortName))
-PATINDEX('%A-Za-z0-9',cShortName)
)
) AS SHORTNAME
Replace TRIM with LTRIM.
You can also test LEN(cShortName) = 0
Ummm there seems to be some problems in this script, but try this.
Case
WHEN LTRIM(RTRIM(cSHortName))= '' Then NULL
WHEN cShortname is NOT NULL THEN REPLACE(cShortName, SUBSTRING(cShortName, PATINDEX('%A-Za-z0-9', 1) , ''), '')
end AS SHORT_NAME
Why do you think it is supposed not to give you spaces after the field?
Edit:
As far as I understand you are trying to remove any characters from the string that do not match this regular expression range [a-zA-Z0-9] (add any other characters that you want to preserve).
I see no clean way to do that in Microsoft SQL Server (you are using Microsoft SQL Server it seems) using the built-in functions. There are some examples on the web that use a temporary table and a while loop, but this is ugly. I would either return the strings as is and process them on the caller side, or write a function that does that using the CLR and invoke it from the select statement.
I hope this helps.