Table:
create table tbl_prefix
(
col_pre varchar
);
Records:
insert into tbl_prefix values
('Mr.'),('Mrs.'),('Ms.'),('Dr.'),
('Jr.'),('Sr.'),('II'),('III'),
('IV'),('V'),('VI'),('VII'),
('VIII'),('I'),('IX'),('X'),
('Officer'),('Judge'),('Master');
Expected output:
col_pre
----------
Mr.
Mrs.
Ms.
Dr.
Jr.
Sr.
Officer
Judge
Master
Try:
select *
from tbl_prefix
where col_pre ~ '[^a-zA-Z]'
Getting:
col_pre
----------
Mr.
Mrs.
Ms.
Dr.
Jr.
Sr.
One approach here might be to match any prefix which is not a Roman numeral:
SELECT *
FROM tbl_prefix
WHERE col_pre !~ '^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$';
The regex pattern used here for Roman numerals was gratefully taken from this SO question:
How do you match only valid roman numerals with a regular expression?
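To sanity-check the pattern outside the database, here is a minimal sketch in Python. It assumes Python's `re` module treats this particular pattern the same way as PostgreSQL's `~` operator; the names `ROMAN`, `prefixes` and `kept` are made up for the illustration.

```python
import re

# Same pattern as in the query above; assumed to behave identically in Python's re.
ROMAN = re.compile(r'^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')

# The sample rows from tbl_prefix, in insertion order.
prefixes = ['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Jr.', 'Sr.', 'II', 'III',
            'IV', 'V', 'VI', 'VII', 'VIII', 'I', 'IX', 'X',
            'Officer', 'Judge', 'Master']

# Keep only the values that are NOT Roman numerals (the equivalent of !~).
kept = [p for p in prefixes if not ROMAN.match(p)]
print(kept)
```

Note that every part of the pattern is optional, so it also matches the empty string; an empty col_pre would therefore be filtered out together with the numerals.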
Related
I have a query that looks like this:
select regexp_replace('john (junior) jones','\([^)]*\)','','g');
regexp_replace
------------------
john  jones
As you can see, this query removes the values in brackets but it results in a double space remaining.
Is there an easy way around this?
So far I have this, which works to an extent:
select regexp_replace((regexp_replace('john (junior) jones','\([^)]*\)','','g')),'\s','');
regexp_replace
------------------
john jones
The above works but not when I pass through something like this:
select regexp_replace((regexp_replace('john (junior) jones (hughes) smith','\([^)]*\)','','g')),'\s','');
regexp_replace
---------------------
john jones  smith
SELECT regexp_replace(
'john (junior) jones (hughes) smith',
' *\([^)]*\) *',
' ',
'g'
);
regexp_replace
══════════════════
john jones smith
(1 row)
To explain the regular expression:
an arbitrary number of spaces, followed by an opening parenthesis ( *\()
an arbitrary number of characters that are not a closing parenthesis ([^)]*)
a closing parenthesis and arbitrarily many spaces (\) *)
That is replaced with a single space.
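The same replacement can be tried outside the database. This is only a sketch in Python, assuming `re.sub` handles this pattern like `regexp_replace` with the 'g' flag; the helper name `drop_parenthesized` is hypothetical.

```python
import re

def drop_parenthesized(text):
    """Replace each parenthesized group, plus surrounding spaces, with one space."""
    # Same pattern as in the SQL above: spaces, "(", non-")" characters, ")", spaces.
    collapsed = re.sub(r' *\([^)]*\) *', ' ', text)
    # A group at the very start or end leaves one stray space behind, so strip it.
    return collapsed.strip()

result = drop_parenthesized('john (junior) jones (hughes) smith')
print(result)
```

The final strip is an addition that the SQL version does not have; in PostgreSQL, wrapping the call in trim() would be the counterpart if the parentheses can appear at either end of the string.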
How can we find non-ASCII symbols in a string? (We are using DB2.)
We have tried the following select statement, but it is not working.
SELECT columnname
FROM tablename
WHERE columnname LIKE '%[' + CHAR(127) + '-' + CHAR(255) + ']%'
COLLATE Latin1_General_100_BIN2
I guess you meant to use the CHR() function instead of CHAR(), which is a data type.
If you are using a newer DB2 version that has the REGEXP functions, you can try the REGEXP_LIKE() function.
Here is an example from the SAMPLE database:
SELECT EMPNO, LASTNAME FROM EMPLOYEE WHERE REGEXP_LIKE(LASTNAME,'[E-H]')
EMPNO LASTNAME
------ ---------------
000010 HAAS
000020 THOMPSON
000050 GEYER
000060 STERN
000090 HENDERSON
000100 SPENSER
000110 LUCCHESSI
000120 O'CONNELL
000140 NICHOLLS
000170 YOSHIMURA
000180 SCOUTTEN
000190 WALKER
000210 JONES
000230 JEFFERSON
000250 SMITH
000260 JOHNSON
000270 PEREZ
000280 SCHNEIDER
000290 PARKER
000300 SMITH
000310 SETRIGHT
000320 MEHTA
000330 LEE
000340 GOUNOT
200010 HEMMINGER
200220 JOHN
200240 MONTEVERDE
200280 SCHWARTZ
200310 SPRINGER
200330 WONG
30 record(s) selected.
All the selected names contain at least one letter from E to H, as specified by the search pattern.
As I didn't have any rows containing characters in the range 127 to 255, I updated one of the rows, appending characters 169 and 174 to it.
Update employee set LASTNAME = ('LEE' || chr(169) || chr(174)) WHERE LASTNAME = 'LEE'
Then, using the REGEXP_LIKE function:
SELECT EMPNO, LASTNAME FROM EMPLOYEE WHERE REGEXP_LIKE(LASTNAME , '[' || CHR(127) || '-' || CHR(255) || ']')
EMPNO LASTNAME
------ ---------------
000330 LEE©®
1 record(s) selected.
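For comparison, the same idea can be sketched in Python. Note that this sketch swaps in a slightly different character class: instead of the range 127 to 255 used above, it flags anything outside the 7-bit ASCII range, which also catches characters beyond code point 255. The names `NON_ASCII`, `has_non_ascii` and the sample list are made up for the illustration.

```python
import re

# Any character outside 0x00-0x7F counts as non-ASCII in this sketch.
NON_ASCII = re.compile(r'[^\x00-\x7F]')

def has_non_ascii(text):
    """Return True if the text contains at least one non-ASCII character."""
    return NON_ASCII.search(text) is not None

# Hypothetical sample values; the second one mimics the updated LEE row.
names = ['HAAS', 'LEE' + chr(169) + chr(174), 'JONES']
flagged = [n for n in names if has_non_ascii(n)]
print(flagged)
```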
Regards
I'm using the %HASHMERGE macro found at http://www.sascommunity.org/mwiki/images/2/22/Hashmerge.sas and the following example datasets:
data working;
length IID TYPE $12;
input IID $ TYPE $;
datalines;
B 0
B 0
A 1
A 1
A 1
C 2
D 3
;
run;
data master;
length IID FIRST_NAME MIDDLE_NAME LAST_NAME SUFFIX_NAME $12;
input IID $ FIRST_NAME $ MIDDLE_NAME $ LAST_NAME $ SUFFIX_NAME;
datalines;
X John James Smith Sr
Z Sarah Marie Jones .
Y Tim William Miller Jr
C Nancy Lynn Brown .
B Carol Elizabeth Collins .
A Wayne Mark Rooney .
;
run;
On the working dataset, I'm trying to attach the _NAME variables from the master dataset using this hash merge. The output looks fine and is the desired output. However, in my real-life scenario the master dataset is too large to fit into a hash object, and the macro keeps loading it as the hash object. I'd ultimately like to flip these two datasets so that the working dataset is the hash object, but I cannot get the desired output when I flip the code. Below is the part of the macro that produces the desired output and needs to be adjusted, but I am unsure how to set this up:
data OUTPUT;
  if 0 then set MASTER (keep=IID FIRST_NAME MIDDLE_NAME LAST_NAME SUFFIX_NAME)
                WORKING (keep=IID);
  declare hash h_merge(dataset:"MASTER"); /* I want WORKING to be the hash object since it's smaller! */
  rc=h_merge.DefineKey("IID");
  rc=h_merge.DefineData("FIRST_NAME","MIDDLE_NAME","LAST_NAME","SUFFIX_NAME");
  rc=h_merge.DefineDone();
  do while(not eof);
    set WORKING (keep=IID) end=eof;
    call missing(FIRST_NAME,MIDDLE_NAME,LAST_NAME,SUFFIX_NAME);
    rc=h_merge.find();
    output;
  end;
  drop rc;
  stop;
run;
Desired output:
IID  FIRST_NAME  MIDDLE_NAME  LAST_NAME  SUFFIX_NAME
---------------------------------------------------
B    Carol       Elizabeth    Collins
B    Carol       Elizabeth    Collins
A    Wayne       Mark         Rooney
A    Wayne       Mark         Rooney
A    Wayne       Mark         Rooney
C    Nancy       Lynn         Brown
D
While it's feasible to do what you say, I doubt you'll get that from a non-purpose-built macro. That's because it's not the normal way to do that; typically you want to keep the main dataset in its form and put the relational dataset in the hash table. Usually the sizes are reversed of course - the relational table is usually smaller than the main table.
Personally, I would not use a hash for this particular case. I'd use a format (or three). It is just as fast as a hash and has fewer size issues (since it doesn't have to fit in memory), though it would eventually slow down (but not break!) due to size.
Format solution:
data working;
length IID TYPE $12;
input IID $ TYPE $;
datalines;
B 0
B 0
A 1
A 1
A 1
C 2
D 3
;
run;
data master;
length IID FIRST_NAME MIDDLE_NAME LAST_NAME SUFFIX_NAME $12;
input IID $ FIRST_NAME $ MIDDLE_NAME $ LAST_NAME $ SUFFIX_NAME;
datalines;
X John James Smith Sr
Z Sarah Marie Jones .
Y Tim William Miller Jr
C Nancy Lynn Brown .
B Carol Elizabeth Collins .
A Wayne Mark Rooney .
;
run;
data for_fmt;
set master;
retain type 'char';
length fmtname $32
label $255
start $255
;
start=iid;
*first;
label=first_name;
fmtname='$FIRSTNAMEF';
output;
*last;
label=last_name;
fmtname='$LASTNAMEF';
output;
*middle;
label=middle_name;
fmtname='$MIDNAMEF';
output;
*suffix;
label=suffix_name;
fmtname='$SUFFNAMEF';
output;
if _n_=1 then do;
start=' ';
label=' ';
hlo='o';
fmtname='$FIRSTNAMEF';
output;
fmtname='$LASTNAMEF';
output;
fmtname='$MIDNAMEF';
output;
fmtname='$SUFFNAMEF';
output;
end;
run;
proc sort data=for_fmt;
by fmtname start;
run;
proc format cntlin=for_fmt;
quit;
data want;
set working;
first_name = put(iid,$FIRSTNAMEF.);
last_name = put(iid,$LASTNAMEF.);
middle_name = put(iid,$MIDNAMEF.);
suffix_name = put(iid,$SUFFNAMEF.);
run;
That said...
If you do want to do this in a hash table, what you'd need to do is, for each row in MASTER, do a FIND in the working table, then if successful a REPLACE, then FIND_NEXT and REPLACE until that fails.
The problem? You're doing at least one find per master row, which you yourself noted is very large. If WORKING is 100k and MASTER is 100M, then you're doing 1000 finds for each match. That's very expensive, and probably means you're better off with some other solution.
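To make the cost of the flipped approach concrete, here is a language-neutral sketch of the logic in Python rather than SAS. It is only an illustration under stated assumptions: a plain dict stands in for the hash object, missing suffixes are represented as empty strings, and the names `attach_names`, `positions` and `out` are hypothetical.

```python
from collections import defaultdict

working = ['B', 'B', 'A', 'A', 'A', 'C', 'D']
master = [
    ('X', 'John', 'James', 'Smith', 'Sr'),
    ('Z', 'Sarah', 'Marie', 'Jones', ''),
    ('Y', 'Tim', 'William', 'Miller', 'Jr'),
    ('C', 'Nancy', 'Lynn', 'Brown', ''),
    ('B', 'Carol', 'Elizabeth', 'Collins', ''),
    ('A', 'Wayne', 'Mark', 'Rooney', ''),
]

def attach_names(working_ids, master_rows):
    # Start with every working row unmatched (all name fields blank).
    out = [[iid, '', '', '', ''] for iid in working_ids]
    # The small table is the one held in memory: IID -> positions of its rows.
    positions = defaultdict(list)
    for pos, iid in enumerate(working_ids):
        positions[iid].append(pos)
    # One lookup per MASTER row, whether or not it matches anything.
    for iid, *names in master_rows:
        for pos in positions.get(iid, ()):
            out[pos][1:] = names
    return out

result = attach_names(working, master)
for row in result:
    print(row)
```

The loop over master_rows is where the expense sits: every master row costs one lookup, even the rows that match nothing in the working table.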
I'm really at my wits' end with this problem, and I really hope someone can help me. I am using PostgreSQL 9.3. My database contains mostly German texts, but not only, so it's encoded in UTF-8. I want to set up a full-text search which supports the German language; nothing special so far.
But the search is behaving really strangely, and I can't find out what I am doing wrong.
So, given the following table as an example:
select * from test;
a
-------------
ein Baum
viele Bäume
Überleben
Tisch
Tische
Café
\d test
Tabelle »public.test«
Spalte | Typ | Attribute
--------+------+-----------
a | text |
sintext=# \d
Liste der Relationen
Schema | Name | Typ | Eigentümer
--------+---------------------+---------+------------
(...)
public | test | Tabelle | paf
Now, let's have a look at some text search examples:
select * from test where to_tsvector('german', a) @@ plainto_tsquery('Baum');
a
-------------
ein Baum
viele Bäume
select * from test where to_tsvector('german', a) @@ plainto_tsquery('Bäume');
--> No Hits
select * from test where to_tsvector('german', a) @@ plainto_tsquery('Überleben');
--> No Hits
select * from test where to_tsvector('german', a) @@ plainto_tsquery('Tisch');
a
--------
Tisch
Tische
Here Tische is the plural of Tisch (table) and Bäume is the plural of Baum (tree). So, obviously, umlauts do not work, while the text search itself performs well.
But what really confuses me is that a) non-German special characters are matching
select * from test where to_tsvector('german', a) @@ plainto_tsquery('Café');
a
------
Café
and b) if I don't use the German dictionary, there is no problem with umlauts (but of course no real text search either)
select * from test where to_tsvector(a) @@ plainto_tsquery('Bäume');
a
-------------
viele Bäume
So, if I use the German dictionary for text search, just the German special characters do not work? Seriously? What the hell is wrong here? I really can't figure it out, please help!
You're explicitly using the German text search configuration for the to_tsvector calls, but not for the to_tsquery or plainto_tsquery calls. Presumably your default configuration isn't set to german; check with SHOW default_text_search_config.
Compare:
regress=> select plainto_tsquery('simple', 'Bäume'),
plainto_tsquery('english','Bäume'),
plainto_tsquery('german', 'Bäume');
plainto_tsquery | plainto_tsquery | plainto_tsquery
-----------------+-----------------+-----------------
'bäume' | 'bäume' | 'baum'
(1 row)
The language setting affects word simplification and root extraction, so a vector from one language won't necessarily match a query from another:
regress=> SELECT to_tsvector('german', 'viele Bäume'), plainto_tsquery('Bäume'),
to_tsvector('german', 'viele Bäume') @@ plainto_tsquery('Bäume');
to_tsvector | plainto_tsquery | ?column?
-------------------+-----------------+----------
'baum':2 'viel':1 | 'bäume' | f
(1 row)
If you use a consistent language setting, all is well:
regress=> SELECT to_tsvector('german', 'viele Bäume'), plainto_tsquery('german', 'Bäume'),
to_tsvector('german', 'viele Bäume') @@ plainto_tsquery('german', 'Bäume');
to_tsvector | plainto_tsquery | ?column?
-------------------+-----------------+----------
'baum':2 'viel':1 | 'baum' | t
(1 row)
I have a function and I want to get a string between two strings where the first one is "Start" and the second one is the new line character.
I mean: From "Start blablabla \n" I only want "blablabla".
I've tried this, but it doesn't work:
select substring(test from 'Start(.+)\n') into vtest;
How can I match the newline character?
Thanks!
It needs to be double-escaped:
test=> select substring('foo
bar' from E'\\A(.*)\\r?\\n');
substring
-----------
foo
(1 row)
Alternative version:
select substring('foo
bar' from E'\\A.*(?=\\r?\\n)');
The $ symbol matches the string end:
select substring('Start123' from 'Start(.+)$');
substring
-----------
123
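As a cross-check outside the database, the "text between Start and the line break" extraction can be sketched in Python. This is an illustration only: Python's `re` differs from PostgreSQL's regex engine in some defaults (for instance, `.` does not match a newline in Python unless asked to), and the helper name `between_start_and_newline` is hypothetical.

```python
import re

def between_start_and_newline(text):
    """Return the text between 'Start' and the next line break, trimmed."""
    # \s* on both sides drops the blanks around the captured value;
    # \r?\n accepts both Unix and Windows line endings.
    match = re.search(r'Start\s*(.*?)\s*\r?\n', text)
    return match.group(1) if match else None

value = between_start_and_newline('Start blablabla \nnext line')
print(value)
```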