SQL functions to match last two parts of a URL - postgresql

This is a follow up question to a post here Regex for matching last two parts of a URL
I was wondering if I could use built in sql funcitons to accomplish the same type of pattern match without using regular expressions. In particular I was thinking if there was a way to reverse the string say www.stackoverflow.com to com.stackoverflow.www and then apply concatenation to split('com.stackoverflow.www , '.', 1) || split('com.stackoverflow.www , '.', 2) I would be done but I am not sure if this is possible.
Here is the general problem description:
I am trying to figure out the best sql function simply match only the last two strings seperated by a . in a url.
For instance with www.stackoverflow.com I just want to match stackoverflow.com
The issue i have is some strings can have a large number of periods for instance
a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com
should also return only yimg.com
The set of URLS I am working with does not have any of the path information so one can assume the last part of the string is always .org or .com or something of that nature.
What sql functions will return stackoverflow.com when run against www.stackoverflow.com and will return yimg.com when run against a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com under the conditions stated above? I did not want to use regular expressions in the solution just sql string manipulation functions.

doesn't look like you can do it with 8.3 function set http://www.postgresql.org/docs/8.3/static/functions-string.html
there's no reverse - that comes in 9.1. with that you could do:
select reverse(split_part(reverse(data), '.', 2)) || '.'
|| reverse(split_part(reverse(data), '.', 1))
from example;
see http://sqlfiddle.com/#!1/25e43/2/0
you can declare your own reverse: http://a-kretschmer.de/diverses.shtml and then solve this problem.
but regex is just easier...

Related

Prefix/wildcard searches with 'websearch_to_tsquery' in PostgreSQL Full Text Search?

I'm currently using the websearch_to_tsquery function for full text search in PostgreSQL. It all works well except for the fact that I no longer seem to be able to do partial matches.
SELECT ts_headline('english', q.\"Content\", websearch_to_tsquery('english', {request.Text}), 'MaxFragments=3,MaxWords=25,MinWords=2') Highlight, *
FROM (
SELECT ts_rank_cd(f.\"SearchVector\", websearch_to_tsquery('english', {request.Text})) AS Rank, *
FROM public.\"FileExtracts\" f, websearch_to_tsquery('english', {request.Text}) as tsq
WHERE f.\"SearchVector\" ## tsq
ORDER BY rank DESC
) q
Searches for customer work but cust* and cust:* do not.
I've had a look through the documentation and a number of articles but I can't find a lot of info on it. I haven't worked with it before so hopefully it's just something simple that I'm doing wrong?
You can't do this with websearch_to_tsquery but you can do it with to_tsquery (because ts_query allows to add a :* wildcard) and add the websearch syntax yourself in in your backend.
For example in a node.js environment you could do smth. like this:
let trimmedSearch = req.query.search.trim()
let searchArray = trimmedSearch.split(/\s+/) //split on every whitespace and remove whitespace
let searchWithStar = searchArray.join(' & ' ) + ':*' //join word back together adds AND sign in between an star on last word
let escapedSearch = yourEscapeFunction(searchWithStar)
and than use it in your SQL
search_column ## to_tsquery('english', ${escapedSearch})
You need to write the tsquery directly if you want to use partial matching. plainto_tsquery doesn't pass through partial match notation either, so what were you doing before you switched to websearch_to_tsquery?
Anything that applies a stemmer is going to have hard time handling partial match. What is it supposed to do, take off the notation, stem the part, then add it back on again? Not do stemming on the whole string? Not do stemming on just the token containing the partial match indicator? And how would it even know partial match was intended, rather than just being another piece of punctuation?
To add something on top of the other good answers here, you can also compose your query with both websearch_to_tsquery and to_tsquery to have everything from both worlds:
select * from your_table where ts_vector_col ## to_tsquery('simple', websearch_to_tsquery('simple', 'partial query')::text || ':*')
Another solution I have come up with is to do the text transform as part of the query so building the tsquery looks like this
to_tsquery(concat(regexp_replace(trim(' all the search terms here '), '\W+', ':* & '), ':*'));
(trim) Removes leading/trailing whitespace
(regexp_replace) Splits the search string on non word chars and adds trailing wildcards to each term, then ANDs the terms (:* & )
(concat) Adds a trailing wildcard to the final term
(to_tsquery) Converts to a ts_query
You can test the string manipulation by running
SELECT concat(regexp_replace(trim(' all the search terms here '), '\W+', ':* & ', 'gm'), ':*')
the result should be
all:* & the:* & search:* & terms:* & here:*
So you have multi word partial matches e.g. searching spi ma would return results matching spider man

In DB2 SQL RegEx, how can a conditional replacement be done without CASE WHEN END..?

I have a DB2 v7r3 SQL SELECT statement with three instances of REGEXP_SUBSTR(), all with the same regex pattern string, each of which extract one of three groups.
I'd like to change the first SUBSTR to REGEXP_REPLACE() to do a conditional replacement if there's no match, to insert a default value similarly to the ELSE section of a CASE...END. But I can't make it work. I could easily use a CASE, but it seems more compact & efficient to use RegEx.
For example, I have descriptions of food containers sizes, in various states of completeness:
12X125
6X350
1X1500
1500ML
1000
The last two don't have the 'nnX' part at the beginning, in which case '1X' is assumed and needs to be inserted.
This is my current working pattern string:
^(?:(\d{1,3})(?:X))?((?:\d{1,4})(?:\.\d{1,3})?)(L|ML|PK|Z|)$
The groups returned are: quantity, size, and unit.
But only the first group needs the conditional replacement:
(?:(\d{1,3})(?:X))?
This RexEgg webpage describes the (?=...) operator, and it seems to be what I need, but I'm not sure. It's in the list of operators for my version of DB2, but I can't make it work. Frankly, it's a bit deeper than my regex knowledge, and I can't even make it work in my favorite online regex tester, Regex101.
So...does anyone have any idea or suggestions..? Thanks.
Try this (replace "digits not followed by X_or_digit"):
with t(s) as (values
'12X125'
, '6X350'
, '1X1500'
, '1500'
, '1125'
)
select regexp_replace(s, '^([\d]+(?![X\d]))', '1X\1')
from t;

PHP: Using preg_replace to replace an unknown string between two known strings

I have $stringF. Contained within $stringF is the following (the string is all one line, not word-wrapped as below):
http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=
AFQjCNHWQk0M4bZi9xYO4OY4ZiDqYVt2SA&clid=
c3a7d30bb8a4878e06b80cf16b898331&cid=52779892300270&ei=
H4IAW6CbK5WGhQH7s5SQAg&url=https://abcnews.
go.com/Lifestyle/wireStory/latest-royal-wedding-thousands-streets-windsor-55280649
I want to locate that string and make it look like this:
https://abcnews.go.com/Lifestyle/wireStory/latest-royal-
wedding-thousands-streets-windsor-55280649
Basically I need to use preg_replace to find the following string:
http://news.google.com/news/url?sa= ***SOME UNKNOWN CONTENT*** &url=http
and replace it with the following string:
http
I'm a little rusty with my php, and even rustier with regular expressions, so I'm struggling to figure this one out. My code looks like this:
$stringG = preg_replace('http://news.google.com/news/url?sa=*&url=http','http',$stringH);
except I know I can't use wildcards and I know I need to specially deal with the special characters (colon, forward slash, question mark, and sign, etc). Hoping someone can help me out here.
Also of note is that my $stringF contains multiple instances of such strings, so I need the preg_replace to be not greedy - otherwise it will replace a huge chunk of my string unnecessarily.
PHP has tools for that, no need to use a regex. parse_url to get the components of an url (scheme, host, path, anchor, query, ...) and parse_str to get the keys/values of the query part.
$url = 'http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNHWQk0M4bZi9xYO4OY4ZiDqYVt2SA&clid=c3a7d30bb8a4878e06b80cf16b898331&ci=52779892300270&ei=H4IAW6CbK5WGhQH7s5SQAg&url=https://abcnews.go.com/Lifestyle/wireStory/latest-royal-wedding-thousands-streets-windsor-55280649';
parse_str(parse_url($url, PHP_URL_QUERY), $arr);
echo $arr['url'];

list trigger no system ending with "_BI"

I want to list the trigger no system ending with "_BI" in firebird database,
but no result with this
select * from rdb$triggers
where
rdb$trigger_source is not null
and (coalesce(rdb$system_flag,0) = 0)
and (rdb$trigger_source not starting with 'CHECK' )
and (rdb$trigger_name like '%BI')
but with this syntaxs it gives me a "_bi" and "_BI0U" and "_BI0U" ending result
and (rdb$trigger_name like '%BI%')
but with this syntaxs it gives me null result
and (rdb$trigger_name like '%#_BI')
thank you beforehand
The problem is that the Firebird system tables use CHAR(31) for object names, this means that they are padded with spaces up to the declared length. As a result, use of like '%BI') will not yield results, unless BI are the 30th and 31st character.
There are several solutions
For example you can trim the name before checking
trim(rdb$trigger_name) like '%BI'
or you can require that the name is followed by at least one space
rdb$trigger_name || ' ' like '%BI %'
On a related note, if you want to check if your trigger name ends in _BI, then you should also include the underscore in your condition. And as an underscore in like is a single character matcher, you need to escape it:
trim(rdb$trigger_name) like '%\_BI' escape '\'
Alternatively you could also try to use a regular expressions, as you won't need to trim or otherwise mangle the lefthand side of the expression:
rdb$trigger_name similar to '%\_BI[[:SPACE:]]*' escape '\'

T-SQL syntax issue with "LTRIM(RTRIM())" not working correctly

What is wrong with this statement that it is still giving me spaces after the field. This makes me think that the syntax combining the WHEN statements is off. My boss wants them combined in one statement. What am I doing wrong?
Case WHEN LTRIM(RTRIM(cSHortName))= '' Then NULL
WHEN cShortname is NOT NULL THEN
REPLACE (cShortName,SUBSTRING,(cShortName,PATINDEX('%A-Za-z0-9""},1,) ''_
end AS SHORT_NAME
Judging from the code, it seems that you may be trying to strip spaces and non-alphanumeric characters from the beginning and ending of the string.
If so, would this work for you?
I think it provides the substring from the first alphanumeric occurrence to the last.
SELECT
SUBSTRING(
cShortName,
PATINDEX('%A-Za-z0-9',cShortName),
( LEN(cShortName)
-PATINDEX('%A-Za-z0-9',REVERSE(cShortName))
-PATINDEX('%A-Za-z0-9',cShortName)
)
) AS SHORTNAME
Replace TRIM with LTRIM.
You can also test LEN(cShortName) = 0
Ummm there seems to be some problems in this script, but try this.
Case
WHEN LTRIM(RTRIM(cSHortName))= '' Then NULL
WHEN cShortname is NOT NULL THEN REPLACE(cShortName, SUBSTRING(cShortName, PATINDEX('%A-Za-z0-9', 1) , ''), '')
end AS SHORT_NAME
Why do you think it is supposed not to give you spaces after the field?
Edit:
As far as I understand you are trying to remove any characters from the string that do not match this regular expression range [a-zA-Z0-9] (add any other characters that you want to preserve).
I see no clean way to do that in Microsoft SQL Server (you are using Microsoft SQL Server it seems) using the built-in functions. There are some examples on the web that use a temporary table and a while loop, but this is ugly. I would either return the strings as is and process them on the caller side, or write a function that does that using the CLR and invoke it from the select statement.
I hope this helps.