Searching for special characters in PostgreSQL

I am trying to find rows in a PostgreSQL table where a specific column contains special characters, excluding the following:
#^$.!\-#+'~_
Any help appreciated.

Hi, I think I figured it out. I found a solution that worked for me using POSIX regular expressions.
SELECT *
FROM TABLE_NAME
WHERE fieldName ~ '[^A-Za-z0-9#^\\$.!\-#+~_]'
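The regular expression matches any character that is not in A-Z, a-z or 0-9 and is also not one of the whitelisted characters # ^ \ $ . ! - + ~ _. Notice that in the regex I had to escape the backslash and the hyphen, because they have a special meaning inside a bracket expression. The single quote from your list is not included here; to allow it as well, add it to the bracket expression, doubled as '' because it sits inside an SQL string literal. Maybe start by evaluating my proposed regex online with a few examples, e.g. here: https://regex101.com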
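To try the idea outside the database, here is a minimal sketch in Python. It uses Python's re module, not PostgreSQL's regex engine, so treat it only as an illustration of the negated character class; the function name has_special_chars and the sample strings are made up for this example, and the single quote is included in the allowed set on the assumption that it should be whitelisted as in the question.

```python
import re

# Negated character class: matches any single character that is NOT a letter,
# a digit, or one of the allowed symbols  # ^ \ $ . ! - + ' ~ _
# Inside the class, the backslash and the hyphen are escaped.
DISALLOWED = re.compile(r"[^A-Za-z0-9#^\\$.!\-+'~_]")


def has_special_chars(value: str) -> bool:
    """Return True if value contains at least one non-whitelisted character."""
    return DISALLOWED.search(value) is not None


# Hypothetical sample values standing in for the column contents.
samples = ["user_name-01", "price$9.99!", "hello world", "a@b.com"]
for s in samples:
    print(repr(s), has_special_chars(s))
```

The space in "hello world" and the @ in "a@b.com" are outside the whitelist, so only those two samples are flagged.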

Related

Perl regex presumably removing non ASCII characters

I found a code with regex where it is claimed that it strips the text of any non-ASCII characters.
The code is written in Perl and the part of code that does it is:
$sentence =~ tr/\000-\011\013-\014\016-\037\041-\055\173-\377//d;
I want to understand how this regex works, and to do so I used regexr. I found out that \000, \011, \013, \014, \016, \037, \041, \055, \173 and \377 are individual characters such as NUL, TAB, VERTICAL TAB ... But I still do not get why the "-" symbols are used in the regex. Do they really mean "dash symbol" as shown in regexr, or something else? Is this regex really suited for deleting non-ASCII characters?
This isn't really a regex. The dash indicates a character range, like inside a regex character class [a-z].
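As written, the expression deletes more than non-ASCII characters. Besides the bytes \200-\377 it also removes the ASCII control characters (except line feed \012 and carriage return \015), the punctuation from ! through -, and the characters { | } ~ and DEL. To delete only non-ASCII bytes, the range \200-\377 would be enough; the full ASCII range is simply \000-\177.
To be explicit, the d flag says to delete every character found in the list between the first pair of slashes that has no counterpart in the replacement list, which is empty here. See further the documentation.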
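To see which characters the listed ranges actually remove, the behaviour can be emulated in Python. This is only a sketch: str.translate stands in for Perl's tr///d, it works on Unicode code points rather than bytes, and the names RANGES, DELETE_TABLE and tr_delete are invented for the illustration.

```python
# Octal ranges copied from the tr/// search list in the question.
RANGES = [
    (0o000, 0o011),  # NUL through TAB
    (0o013, 0o014),  # vertical tab, form feed
    (0o016, 0o037),  # remaining control characters
    (0o041, 0o055),  # ! through -
    (0o173, 0o377),  # { | } ~ DEL and everything from 0x80 to 0xFF
]

# Mapping a code point to None tells str.translate to delete it,
# which mirrors tr/...//d with an empty replacement list.
DELETE_TABLE = {cp: None for lo, hi in RANGES for cp in range(lo, hi + 1)}


def tr_delete(text: str) -> str:
    """Delete every character whose code point falls in one of RANGES."""
    return text.translate(DELETE_TABLE)


print(tr_delete("Hello, World!\t{x}\u00e9"))
```

In the sample string the comma, the exclamation mark, the tab, both braces and the é are deleted, while letters, the space, and the characters from . through z are kept.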

If there's even one non-English character in any entry in a column, I want to have 'TRUE' in another column

There are several kinds of entries in the column: English characters mixed with non-English characters, English characters with numbers/symbols, non-English characters with numbers/symbols, etc. If there's even one non-English character in any entry in the column, I want 'TRUE' in the adjacent column.
SELECT * 
FROM companies
WHERE name LIKE '%[a-z]%';
This code doesn't work.
You can achieve this using regular expressions. Here's a regular expression that will match any character other than the ASCII printable characters, tab (\t), new-line/line-feed (\n), and carriage return (\r).
SELECT
*,
name ~ '[^\t\n\r\x20-\x7E]' AS has_bad_chars
FROM companies
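has_bad_chars is then true for every row whose name contains at least one character outside that set. To allow only a narrower set, such as A-Z, a-z, 0-9, space, ., ;, :, ", ' and /, list just those characters inside the negated bracket expression instead.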
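The same bracket expression can be tried quickly in Python before running it against the table. This is a sketch using Python's re module rather than PostgreSQL; the sample names are hypothetical stand-ins for the values in companies.name.

```python
import re

# Same character class as the SQL above: anything that is not tab, line feed,
# carriage return, or printable ASCII (space through tilde).
NON_PRINTABLE_ASCII = re.compile(r"[^\t\n\r\x20-\x7E]")


def has_bad_chars(name: str) -> bool:
    """Return True if name contains a character outside the allowed set."""
    return NON_PRINTABLE_ASCII.search(name) is not None


# Hypothetical sample rows.
names = ["Acme Inc.", "Caf\u00e9 Noir", "Tab\tSeparated", "\u6771\u4eac\u96fb\u529b"]
flags = {name: has_bad_chars(name) for name in names}
for name, flag in flags.items():
    print(repr(name), flag)
```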
Assuming the adjacent column you mention is defined in the table as
has_non_english_char boolean, try:
update companies
set has_non_english_char = name ~ '[^A-Za-z0-9_.,/$]';
Alternatively look into character classes or include additional characters in the above. Note: The regular expression should include the 'English' characters you want to allow.
Perhaps this is not the preferred SO protocol, but I think another answer may be better than just expanding a previous one. If not, community, please forgive me.
Run:
with companies as
( select * from (values ('abc.com:;'), ('abc.com - "Something')) as c(name))
SELECT 'My RE',name,name ~ '[^A-Za-z0-9&_`!##$^&*()_+=\|\][{’\;:"<.,.}? -]' has_bad_chars, 'f' desired FROM companies
union
SELECT 'Your RE',name,name ~ '[^A-Za-z0-9&_`!##$^&*()_+=\|][{’;:""<.,.}?-]', 'f' from companies
order by 2,1 desc;
Let's examine the RE themselves and see the difference:
My RE   [^A-Za-z0-9&_`!##$^&*()_+=\|\][{’\;:""<.,.}? -]
Your RE [^A-Za-z0-9&_`!##$^&*()_+=\|][{’;:""<.,.}?-]
                                   ^^^
Notice the difference at the indicated position. You have '|]' while I have '|\]'; I also have a space later on.
See the regular expression reference from @cpburnz earlier, in particular section "9.7.3.2. Bracket Expressions".
Your RE breaks down to '[^...][...]', which splits the RE into 2 distinct sets of characters, telling the RE engine to find 'any character not in the first bracket expression' followed immediately by 'any character that is in the second bracket expression'.
The difference is that I escaped the right bracket ], thus removing its special meaning in the RE and making it just another character. This is the nature of REs: the exact individual characters can make all the difference. If you are going to do much of this type of stuff, study REs intently.
Good luck finding the exact RE you need. It's out there; you just need to work with it until you find it.
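The effect of the unescaped ] can be reproduced with a much shorter pattern. The sketch below uses Python's re module rather than PostgreSQL, and both patterns are simplified stand-ins for the two REs above, not copies of them.

```python
import re

# Unescaped ]: the first bracket expression ends early, so this means
# "a character not in a-z or |" followed by "one of [ { ;".
MISSING_ESCAPE = re.compile(r"[^a-z|][{;]")

# Escaped ]: one single negated bracket expression containing
# a-z | ] [ { ; as allowed characters.
ESCAPED = re.compile(r"[^a-z|\][{;]")

text = "abc!"
print(MISSING_ESCAPE.search(text))  # no match: "!" is not followed by [ { or ;
print(ESCAPED.search(text))         # matches the "!"
```

With the escape missing, a lone disallowed character such as the "!" goes unnoticed, because the pattern now requires two characters in a row.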

Sed's regex to eliminate a very specific string

Disclaimer:
I have found several examples on this site that address questions/problems similar to mine, though I was unfortunately not able to figure out the modifications that would need to be introduced to fit my needs.
The "Problem":
I have a list of servers (VMs) that have their UUID embedded as part of the name. I need to get rid of that in order to obtain the "pure/clean" server name. Now, the problem is precisely that: I need to get rid of the UUID (which has a very specific and constant format, more details on this below) and ONLY that, nothing else.
The UUID - as you might already know or have noticed - has a specific and constant format which consists of the following parts:
It starts with a dash (-).
Which is followed by a subset of 8 alphanumeric characters (letters are always lowercase).
Which is followed by a dash (-).
Which is followed by a subset of 4 alphanumeric characters (letters are always lowercase).
Which is followed by a dash (-).
Which is followed by a subset of 4 alphanumeric characters (letters are always lowercase).
Which is followed by a dash (-).
Which is followed by a subset of 4 alphanumeric characters (letters are always lowercase).
Which is followed by a dash (-).
Which is followed by a subset of 12 alphanumeric characters (letters are always lowercase).
Samples of results achieved using "my" """"code"""":
In this case the result is the expected one:
echo PRODSERVER0022-872151c8-1a75-43fb-9b63-e77652931d3f | sed 's/-[a-z0-9]*//g'
PRODSERVER0022
In this case the result is the expected one too:
echo PRODSERVER0022-872151c8-1a75-43fb-9b63-e77652931d3f_OLD | sed 's/-[a-z0-9]*//g'
PRODSERVER0022_OLD
Expected result: PRODSERVER0022-OLD
echo PRODSERVER0022-872151c8-1a75-43fb-9b63-e77652931d3f-OLD | sed 's/-[a-z0-9]*//g'
PRODSERVER0022
Expected result: PRODSERVER00-22
echo PRODSERVER00-22-872151c8-1a75-43fb-9b63-e77652931d3f-old | sed 's/-[a-z0-9]*//g'
PRODSERVER00
I know that, within the sed universe, a . means "any character", while a * means "any number of the preceding character". However, what I would need in this case, as I see it at least, is a way to tell sed to do the replacement only if this specific sequence is present (8 alphanumeric characters [any, but specifically 8, not more, not less], followed by a dash, then followed by 4 alphanumeric characters [any, but specifically 4, not more, not less], etc.). So, the question would be: Is there a regex construction (or a combination of several of them [through piping, I guess], if it has to be the case) that can achieve the expected results in this case?
Note that: Even though servers may have additional dashes (-) as part of their names, the resulting sub-strings will never consist of 8 characters, nor of 4. They might, however, end up having 12 characters. Even though such a sub-string would initially match the last sub-string of the UUID, it will not be at the end of the string, so we have that to discriminate between these two 12-char sub-strings (and it will not be a problem at all if there is indeed a regex combination that can get rid of the UUID as a whole).
Try this to match the UUID.
-[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}
Embed it in the sed command line in the usual way. As Benjamin W. has said, we need to use extended regular expressions (the -E option), so that the {8}, {4} and {12} repetition counts work without backslashes.
sed -E 's/-[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}//g'
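The same pattern can be checked against the sample names from the question. The following is a sketch in Python, using re.sub in place of sed; the function name strip_uuid is made up for the illustration.

```python
import re

# Same pattern as in the sed command: a dash followed by the
# 8-4-4-4-12 groups of lowercase hexadecimal characters.
UUID_PART = re.compile(
    r"-[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}"
)


def strip_uuid(name: str) -> str:
    """Remove every embedded UUID (with its leading dash) from name."""
    return UUID_PART.sub("", name)


uuid = "872151c8-1a75-43fb-9b63-e77652931d3f"
for name in (
    f"PRODSERVER0022-{uuid}",
    f"PRODSERVER0022-{uuid}_OLD",
    f"PRODSERVER0022-{uuid}-OLD",
    f"PRODSERVER00-22-{uuid}",
):
    print(strip_uuid(name))
```

Because each group has a fixed length, the extra dashes in names such as PRODSERVER00-22 and the trailing -OLD are left alone.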

How to ignore leading special characters in sphinx search?

For example, we want the search term "Test" to match the words "#Test" and "(Test)" in the results.
It does this by default, unless you have added these characters to the charset_table in the config.
Did you try putting the escape character (backslash) just before the special characters?
Also check the EscapeString function available in the Sphinx API.

ack-grep: chars escaping

My goal is to find all "<?=" occurrences with ack. How can I do that?
ack "<?="
It doesn't work. Please tell me how I can fix the escaping here.
Since ack uses Perl regular expressions, your problem stems from the fact that in Perl regex syntax ? is a special character meaning "the preceding item is optional". So what you are grepping for is = preceded by an optional <.
So you need to escape the ? if it is just meant to be a regular character.
To escape, there are two approaches - either <\?= or <[?]=; some people find the second form of escaping (putting a special character into a character class) more readable than backslash-escape.
UPDATE As Josh Kelley graciously added in the comment, a third form of escaping is to use the \Q operator which escapes all the following special characters till \E is encountered, as follows: \Q<?=\E
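The difference between the unescaped and the escaped pattern is easy to see in a small experiment. The sketch below uses Python's re module, whose syntax for ? matches Perl's in this respect; re.escape plays roughly the role of \Q...\E here, and the sample lines are invented.

```python
import re

needle = "<?="

# Unescaped: "?" makes the "<" optional, so the pattern also matches a bare "=".
unescaped = re.compile(needle)

# Escaped: every regex metacharacter in the needle is treated literally.
escaped = re.compile(re.escape(needle))

# Hypothetical sample lines.
lines = ["<?= $title ?>", "$a = 1;"]
for line in lines:
    print(repr(line), bool(unescaped.search(line)), bool(escaped.search(line)))
```

Only the escaped pattern rejects the plain assignment line, which is the behaviour the question is after.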
Rather than trying to remember which characters have to be escaped, you can use -Q to quote everything that needs to be quoted.
ack -Q "<?="
This is the best solution if you want to search for plain text (that is, if you do not need a regular expression).
ack "<\?="
? is a regex operator, so it needs escaping