Inserting hyphens into length limited String using regex - swift

Within a Swift project I have some regex which at present ensures that an input can only be 10 characters long:
"^[\\da-zA-Z]{10,10}$"
I need to tweak this slightly, so that the string which this is working on will have the below format:
#####-####
i.e., inserting a hyphen after the fifth character.
So far I have tried combining what I have with some other regex, however this is incorrect and I can't figure out what I need to do differently to make this work:
"^[\\da-zA-Z]{10,10}$(.{5}),$1-$2"

If you have a string of 10 characters and you want to replace the sixth character with a hyphen, you could use 2 capturing groups.
Capture the first 5 characters in the first group, then match the sixth character (the one you want to replace) without capturing it, and capture the last 4 in the second group.
^([\\da-zA-Z]{5})[\\da-zA-Z]([\\da-zA-Z]{4})$
In the replacement use $1-$2, which gives 10 characters in total, as in your desired pattern #####-####.
Note that {10,10} can be written as {10}.
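To make the two-group replacement concrete, here is a minimal sketch in Python rather than Swift. The function name hyphenate and the sample inputs are made up for illustration. Python writes the backreferences as \1-\2, where the Swift template uses $1-$2.

```python
import re

# Same pattern as above: 5 captured chars, one uncaptured char, 4 captured chars.
PATTERN = re.compile(r"^([\da-zA-Z]{5})[\da-zA-Z]([\da-zA-Z]{4})$")

def hyphenate(code: str) -> str:
    """Replace the sixth character of a 10-character alphanumeric string with '-'.

    Strings that do not match the pattern are returned unchanged.
    (hyphenate is a hypothetical helper name, not part of any library.)
    """
    return PATTERN.sub(r"\1-\2", code)

print(hyphenate("ABCDE6FGHI"))
print(hyphenate("short"))
```

If the hyphen should be inserted rather than replace a character, the middle token would be dropped and the quantifiers adjusted to the actual input length; that variant is not shown here.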

Related

How to replace a character using sed with different lengths in preceding string

I have a file in which I want to replace the "_" string with "-" in cases where it makes up a part of my gene name. Examples of the gene names and my intended output are:
aa1c1_123 -> aa1c1-123
aa1c2_456 -> aa1c2-456
aa1c10_789 -> aa1c10-789
In essence, the first four characters are fixed, followed by 1 or 2 characters depending on the chromosome, an underscore, and then the remainder of the gene ID, which can vary in length and characters. Importantly, this gene information column also contains other strings with underscores (e.g. "gene_id", "transcript_id", "five_prime_utr"), so using sed -i.bak 's/_/-/g' file.gtf
can't be done.
Perhaps not the most elegant way, but this should work:
sed -i.bak 's/\([0-9a-z]\{4\}[0-9][0-9]\?\)_/\1-/g' file.gtf
i.e. capture a group (referenced by \1 in the substitution) of 4 characters consisting of lower case letters and digits followed by exactly one digit and perhaps another digit, which is followed by an underscore; if found, replace it by the group's content and a dash. This should exclude your other occurrences consisting of only characters and an underscore.
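The same substitution can be sketched in Python to check it against the examples from the question. The function name fix_gene_ids and the sample attribute line are assumptions for illustration; the pattern is a direct translation of the sed expression.

```python
import re

# Four chars of [0-9a-z], one digit, an optional second digit, then the underscore.
GENE_UNDERSCORE = re.compile(r"([0-9a-z]{4}[0-9][0-9]?)_")

def fix_gene_ids(line: str) -> str:
    """Replace '_' with '-' only where it follows the gene-name prefix.

    fix_gene_ids is a hypothetical helper name used for this sketch.
    """
    return GENE_UNDERSCORE.sub(r"\1-", line)

# Hypothetical attribute column resembling the ones described in the question.
sample = 'gene_id "aa1c10_789"; transcript_id "aa1c2_456"; five_prime_utr'
print(fix_gene_ids(sample))
```

Keys such as gene_id stay untouched because the character directly before their underscore is a letter, not a digit.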

PATINDEX does not recognize dot and comma

I have a column that should contain phone numbers but it contains whatever the user wanted. I need to create an update to remove all the characters after an invalid character.
To do this I am using a pattern such as PATINDEX('%[^0-9+-/()" "]%', [MobilNr]) and it seemed to work until I had some numbers such as +1235, 36446, where to my surprise the result is 0 instead of 6. Also, if the number contains . it returns 0.
Does PATINDEX ignore dot (".") and comma (",")? Are there other characters that PATINDEX will ignore?
It's not that PATINDEX ignores the comma and the dot, it's your pattern that created this problem.
With PATINDEX, the hyphen char (-) has a special meaning - it's in fact an operator that denotes an inclusive range - like 0-9 denotes all digits between 0 and 9 - so when you do +-/ it means all the chars between + and / (inclusive, of course). The comma and dot chars are within this range, that's why you get this result.
Fixing the pattern is easy: move the hyphen to the end of the bracket expression, where it is treated as a literal character rather than a range operator:
SELECT PATINDEX('%[^0-9/()" "+-]%', '+1235, 36446') -- Result: 6
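The range effect is not specific to T-SQL. As a rough analogue, the sketch below reproduces it with Python's re module, whose character classes treat +-/ as a range in the same way. The helper first_invalid_position is a made-up name that mimics PATINDEX's 1-based result (0 when nothing matches). It assumes plain ASCII ordering; SQL Server ranges follow the column's collation, so treat this as an illustration only.

```python
import re

def first_invalid_position(char_class: str, text: str) -> int:
    """Return the 1-based index of the first match, or 0 if there is none.

    Hypothetical helper imitating PATINDEX's return convention.
    """
    match = re.search(char_class, text)
    return match.start() + 1 if match else 0

# Characters covered by the accidental range '+' through '/'.
print([chr(c) for c in range(ord("+"), ord("/") + 1)])

broken = r'[^0-9+-/()" ]'   # '+-/' is a range that includes ',' and '.'
fixed = r'[^0-9/()" +-]'    # trailing '-' is a literal hyphen

print(first_invalid_position(broken, "+1235, 36446"))
print(first_invalid_position(fixed, "+1235, 36446"))
```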

create a generic regex for a string in perl

I have tried to create regex for the below:
STRING sou_u02_mlpv0747_CCF_ASB001_LU_FW_ALERT|/opt/app/medvhs/mvs/applications/cm_vm5/fwhome/UnifiedLogging|UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv|FATAL|red|1h||fw_alert
REGEX----> /^[^#]\w+\|[^\|]+\|\w+\|\w+\|\w*\|\w*\|([^\|]+|)\|\w*$/
I am unable to figure out the mistake here.
I created the above by adapting another regex, which works fine and is given below together with a line it matches:
/^[^#]\w+\|[^\|]+\|([^\|]+|)\|[rm]\|(in|out|old|new|arch|missing)\|\w+\|([0-9-,]+|)\|\w*\|\w*$/
sou_u02_mlpv0747_CCF_ASB001_LU_ODR|/opt/app/medvhs/mvs/applications/cm_vm5/components/CCF_ASB001_LU/SPOOL/ODR||r|out|30m|0400-1959|30m|gprs_in_stag
Can someone please help me? Any leads would be highly appreciated.
Let's start from a brief look at your source text (the first that you included).
It is composed of "sections" separated with | char.
This char (|) must be matched by \|. Remember about the preceding
backslash, otherwise, a "bare" | would mean the alternative separator
(you used it in one place).
And now take a look at each section (between |):
Some of them contain only a sequence of word chars (and can be matched
by \w+).
Other sections, however, contain also other chars, e.g. slashes,
backslash, braces and dots, so each such section is actually a sequence
of chars other than "|" and must be matched by [^|]+ (here,
between [ and ], the vertical bar may be unescaped).
Now let's write each section and its "type":
sou_u02_..._FW_ALERT - word chars.
/opt/app/.../UnifiedLogging - other chars (because of slashes).
UL_\d{8}_..._Primary.log.csv - other chars (because of \d{8}
and dots).
FATAL|red|1h - 3 sections composed of word chars.
An empty section, between 2 consecutive | chars.
fw_alert - word chars.
And now, how to match these groups, and the separating |:
Point 1: \w+\| - word chars and (escaped) vertical bar.
Points 2 and 3 (together): (?:[^|]+\|){2} - a non-capturing
group - (?:...), containing a sequence of "other" chars - [^|]+
and a vertical bar - \|, occurring 2 times {2}.
Point 4 (three "word char" groups): (?:\w+\|){3} - similar to
the previous point.
Point 5: Just as in your solution - ([^|]+|)\|, a capturing group -
(...), with 2 alternatives ...|.... The first alternative is
[^|]+ (a sequence of "other" chars), and the second alternative
is empty. After the capturing group there is \| to match the vertical
bar.
Point 6: \w+ - word chars. This time no \|, as this is the last
section.
The regex assembled so far must be:
prepended with a ^ (start of string) and
appended with a $ (end of string).
So the whole regex, matching your source text can be:
^\w+\|(?:[^|]+\|){2}(?:\w+\|){3}([^|]+|)\|\w+$
Actually, the only capturing group can be written another way,
as ([^|]*) - without alternatives, but with * as the
repetition count, allowing also empty content.
Your choice, which variant to apply.
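As a quick check, the assembled regex and the regex from the question can both be run against the sample line. The sketch below uses Python's re module rather than Perl; the syntax used here is the same in both. The variable names are chosen for illustration.

```python
import re

# Sample line from the question (raw string keeps the literal backslash in \d{8}).
LINE = (
    r"sou_u02_mlpv0747_CCF_ASB001_LU_FW_ALERT"
    r"|/opt/app/medvhs/mvs/applications/cm_vm5/fwhome/UnifiedLogging"
    r"|UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv"
    r"|FATAL|red|1h||fw_alert"
)

# Regex assembled in this answer.
ASSEMBLED = re.compile(r"^\w+\|(?:[^|]+\|){2}(?:\w+\|){3}([^|]+|)\|\w+$")

# Regex from the question: its third field is \w+, which cannot match
# the backslash, braces and dots in the file-name field.
QUESTION = re.compile(r"^[^#]\w+\|[^\|]+\|\w+\|\w+\|\w*\|\w*\|([^\|]+|)\|\w*$")

match = ASSEMBLED.match(LINE)
print(match is not None, repr(match.group(1)))
print(QUESTION.match(LINE))
```

The capturing group comes back as an empty string because the seventh section of the sample line is empty.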
The third field
UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv
contains a backslash (\), braces ({ }) and dots (.). None of these can be matched by \w.
Note also that there is no need to escape a pipe | inside a character class: [^|]+ is fine.

Text file search for match strings regex

I am trying to understand how regex works and what are the possibilities of working with it.
So I have a txt file and I am trying to search for 8-character-long strings containing numbers. For now I use a quite simple option:
clear
Get-ChildItem random.txt | Select-String -Pattern [0-9][a-z] | foreach {$_.line}
It sort of works but I am trying to find a better option. ATM it takes too long to read through the left out text since it writes entire lines and it does not filter them by length.
You can use a lookahead to assert that a string contains at least 1 digit, then specify the length of the match and finally anchor it with ^ (start of string) and $ (end of string) if the string is on a line of its own, or \b (word boundary) if it's part of an HTML document as your comments seem to suggest:
Get-ChildItem C:\files\ |Select-String -Pattern '^(?=.*\d)\w{8}$'
Get-ChildItem C:\files\ |Select-String -Pattern '\b(?=\w*\d)\w{8}\b'
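The word-boundary variant can be tried out in Python to see which tokens it picks. This is a sketch: the function name and sample text are made up, and the lookahead is written as \w*\d so that the digit has to fall inside the 8-character word itself rather than anywhere later on the line.

```python
import re

# \b ... \b : whole words only; (?=\w*\d) : the word contains at least one digit;
# \w{8} : exactly eight word characters (letters, digits or underscore).
TOKEN = re.compile(r"\b(?=\w*\d)\w{8}\b")

def find_tokens(text: str) -> list[str]:
    """Return every 8-character word in text that contains a digit.

    find_tokens is a hypothetical helper name for this sketch.
    """
    return TOKEN.findall(text)

sample = "id ab12cd34 and password longerthan8 x1234567 abcdefgh"
print(find_tokens(sample))
```

Returning only the matched tokens instead of whole lines also addresses the complaint about having to read through entire lines of output.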
The pattern [0-9][a-z] matches a digit followed by a letter. If you want to match a sequence of 8 characters use .{8}. The dot in regular expressions matches any character except newlines. A number in curly brackets matches the preceding expression the given number of times.
If you want to match non-whitespace characters, use \S instead of the dot. If you want to match only digits and lowercase letters, use [0-9a-z] (a character class) instead of the dot.
For a more thorough introduction please go find a tutorial. The subject is way too complex to be covered by a single answer on SO.
What you're currently searching for is a single digit (0-9) followed by a single lowercase letter (a-z).
This, for example, will match any run of 8 word characters (letters, digits, or underscore), including 8 characters inside a longer word:
\w{8}
I often forget what some regex classes are, so I use this as a point of reference, and it may be useful to you as a learning tool: http://regexr.com/
It can also validate what you're typing inline via a text field so you can see if what you're doing works or not.
If you need more of a tutorial than a reference, I found this extremely useful when I learned: regexone.com

Search a pattern in the first 1000 characters of a string

I want to display the first 1000 characters of a string (literals are replaced by a special symbol). I am using the pcre library to replace the literals. After replacing each literal I check the length of the string, and if it is > 1000 I stop matching and display the string.
My problem is this: suppose I am sending a string 1 GB long and there is no literal in it; pcre will then scan the entire string. I want to search for the pattern only in the first 1000 characters. Is there any way to do this?
Just cut a 1000-char head off your string and run the substitution on that, not on the whole text.
If you get fewer than 1000 chars after substitution, cut the next 1000-char head, run the substitution on it, and concatenate the two results. Repeat in a loop until you have a 1000-char string or reach the end of the whole text.
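The loop described above might look roughly like this in Python. Everything named here is an assumption for the sketch: the function masked_prefix, the LITERAL pattern (single-quoted strings) and the "?" replacement symbol stand in for whatever the real pcre code uses. Note one limitation of cutting at fixed offsets: a literal that straddles a chunk boundary will not be matched as a whole.

```python
import re

# Assumed definition of a "literal": a single-quoted string without embedded quotes.
LITERAL = re.compile(r"'[^']*'")

def masked_prefix(text: str, pattern: re.Pattern, symbol: str, limit: int = 1000) -> str:
    """Return at most `limit` chars of `text` with matches replaced by `symbol`.

    Only as many `limit`-sized chunks as needed are scanned, so a huge input
    without any match costs a single chunk's worth of work.
    """
    pieces = []
    produced = 0
    pos = 0
    while produced < limit and pos < len(text):
        chunk = text[pos:pos + limit]         # cut the next head
        pos += len(chunk)
        piece = pattern.sub(symbol, chunk)    # substitute inside the chunk only
        pieces.append(piece)
        produced += len(piece)
    return "".join(pieces)[:limit]

print(masked_prefix("select 'abc' from t", LITERAL, "?"))
print(len(masked_prefix("x" * 5_000_000, LITERAL, "?")))
```

If boundary-straddling literals matter, one option is to cut each chunk at a position known not to be inside a literal, which depends on the literal syntax and is left out here.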