Adding 2nd search to regex pattern? - iphone

// LINE 1
<td align="left" nowrap><font face="courier, monospace" size="-1"> (2002 GC1)</font></td>
// LINE 2
<td align="left" nowrap><font face="courier, monospace" size="-1"> 99942 Cocoon</font></td>
I have created a simple regular expression to scrape a little data I need from the HTML lines above. The expression works well and puts the data I need in two groups.
Regular Expression Pattern = ([0-9]+) ([A-Za-z0-9]+)
LINE1: Group1 = 2002, Group2 = GC1
LINE2: Group1 = 99942, Group2 = Cocoon
Having run this through my data I have now noticed that there is a new type of HTML line that has an extra number at the start that I should get.
// LINE 3
<td align="left" nowrap><font face="courier, monospace" size="-1">162421 (2000 CG70)</font></td>
LINE3: Group1 = 2000, Group2 = CG70
What I am trying to do is alter my pattern to additionally capture 162421. This matches the same pattern ([0-9]+), but being new to regular expressions I am unsure how to add this possibility to my pattern. Each time I try, I either break my already working search or overwrite part of the result.
NOTE: I am using this with: NSRegularExpression on iOS.

You will have to add a capture group for the early digits in the string. In the example, these digits are followed by one or more "&nbsp;" entities (shown as spaces here) and a "(", and all of this is optional for the regex to match.
(?:([0-9]+)(?: )+\()?([0-9]+) ([A-Za-z0-9]+)
// ^                 ^        ^ capture groups
The trickiest part comes with capture ranges.
Now that you have one more capture group, you will always get 4 ranges when querying the NSTextCheckingResult object (the range at index 0 is the entire match; the others are the capture ranges).
But sometimes only the last two capture ranges will be valid.
To be sure, compare the location member of the first capture range against NSNotFound. If the location is not NSNotFound, the range is valid and the early digits were matched and captured; otherwise they were not.
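The optional-group idea can be tried out with a minimal sketch in Python's re module rather than NSRegularExpression (a swap made only so the example is easy to run). In Python an optional group that did not participate comes back as None, which plays roughly the role that an NSNotFound location plays in NSTextCheckingResult. The name parse_designation is hypothetical, and the pattern assumes the separators are plain spaces, as in the sample lines shown in the question.

```python
import re

# Same pattern as in the answer: an optional leading number followed by
# one or more spaces and "(", then the two groups that already worked.
PATTERN = re.compile(r"(?:([0-9]+)(?: )+\()?([0-9]+) ([A-Za-z0-9]+)")


def parse_designation(html_line):
    """Return (leading_number_or_None, number, name), or None if no match.

    Hypothetical helper for illustration only.
    """
    match = PATTERN.search(html_line)
    if match is None:
        return None
    # group(1) is None when the optional leading number is absent.
    return match.group(1), match.group(2), match.group(3)


if __name__ == "__main__":
    prefix = '<td align="left" nowrap><font face="courier, monospace" size="-1">'
    for cell in (" (2002 GC1)", " 99942 Cocoon", "162421 (2000 CG70)"):
        print(parse_designation(prefix + cell + "</font></td>"))
```

Checking the first element for None before using it mirrors the NSNotFound test described above.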

How about:
([0-9]+) ([A-Za-z0-9]*)
Btw. I use this site to test regular expressions, very useful.

Related

Regular expression: retrieve one or more numbers after certain text

I'm trying to parse HTML code and extract some data from it using regular expressions. The website that provides the data has no API and I want to show this data in an iOS app built using Swift. The HTML looks like this:
$(document).ready(function() {
var years = ['2020','2021','2022'];
var currentView = 0;
var amounts = [1269.2358,1456.557,1546.8768];
var balances = [3484626,3683646,3683070];
rest of the html code
What I'm trying to extract is the years, amounts and balances.
So I would like to have an array with the years in it, [2020,2021,2022], and the same for amounts and balances. In this example there are 3 years, but there could be more or fewer. I'm able to extract all the numbers, but then I'm unable to link them to the years, amounts or balances. See this example: https://regex101.com/r/WMwUji/1, using this pattern (\d|\[(\d|,\s*)*])
Any help would be really appreciated.
Firstly, I think there are some errors in your expression. To capture a whole number you have to use \d+ (which matches 1 or more consecutive digits, e.g. 2020). If you need to include . as a decimal separator, the expression would then look like \d+\.\d+.
In addition, using a non-capturing group (?:) and a non-greedy match .*?, the regular expression that gives the desired result for years is
(?:year.*?|',')(\d+)
This can also be modified for the amount field which would look like this:
(?:amounts.*?|,)(\d+\.\d+)
You can try it here: https://regex101.com/r/QLcFQN/1
Edited: in the previous version my proposed regex was non-functional and only captured the last match.
You can continue with this regex:
^var (years \= (?'year'.*)|balances \= (?'balances'.*)|amounts \= (?'amounts'.*));$
It searches for lines with either years, balances or amounts entries and names the matches accordingly. Each named group captures everything between the = and the closing semicolon, brackets included.
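To show how the line-by-line matching could be linked back to the variable names, here is a minimal sketch in Python rather than Swift (chosen only so it is easy to run; Python spells named groups as (?P<name>...)). The names ARRAY_RE, NUMBER_RE and extract_arrays are hypothetical, and the sketch assumes each array sits on its own line exactly as in the sample.

```python
import re

# One line per array: capture the variable name and the bracket contents.
ARRAY_RE = re.compile(
    r"^var (?P<name>years|amounts|balances) = \[(?P<body>.*)\];$",
    re.MULTILINE,
)
# An integer or a decimal number; quotes around the years are simply skipped.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def extract_arrays(source):
    """Map each array name to its list of numbers (hypothetical helper)."""
    arrays = {}
    for match in ARRAY_RE.finditer(source):
        numbers = NUMBER_RE.findall(match.group("body"))
        arrays[match.group("name")] = [
            float(n) if "." in n else int(n) for n in numbers
        ]
    return arrays


SAMPLE = """$(document).ready(function() {
var years = ['2020','2021','2022'];
var currentView = 0;
var amounts = [1269.2358,1456.557,1546.8768];
var balances = [3484626,3683646,3683070];
"""

if __name__ == "__main__":
    print(extract_arrays(SAMPLE))
```

Splitting the work into two steps (find the named line, then pull the numbers out of it) keeps each value tied to its variable, which is the part the single flat pattern in the question could not do.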

How to identify a character in a string?

I am trying to write PowerShell code to extract a specific part of a filename, for multiple files, where that part can vary in length.
An example of a filename
20190902091031_202401192_50760_54206_6401.pdf
$Variable = $Filename.Substring(15,9)
Results:
202401192 (this is what I am after)
However in some instances the filename will be like below
20190902091031_20240119_50760_54206_6401.pdf
$Variable = $Filename.Substring(15,9)
Results:
20240119_ (this is NOT what I am after)
I am trying to find a code to identify the 9th character,
IF the 9th character = "_"
THEN Set
$Variable = $Filename.Substring(15,8)
Results:
20240119
All credit to TheMadTechnician who beat me to the punch with this answer.
To expand on the technique a bit, use the split method or operator to split a string every time a certain character shows up. Your data is separated by the underscore character, so is a perfect example of using this technique. By using either of the following:
$FileName.Split('_')
$FileName -split '_'
You can turn your long string into an array of shorter strings, each containing one of the parts of your original string. Since you want the 2nd one, you use the array descriptor [1] (0 is 1st) and you're done.
Good luck
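The same split-and-index idea, sketched in Python instead of PowerShell purely for illustration (the function name second_field is hypothetical):

```python
def second_field(filename, separator="_"):
    """Return the second separator-delimited part of filename.

    Index 1 is the second element because indexing starts at 0.
    """
    return filename.split(separator)[1]


if __name__ == "__main__":
    print(second_field("20190902091031_202401192_50760_54206_6401.pdf"))
    print(second_field("20190902091031_20240119_50760_54206_6401.pdf"))
```

Because the split happens on the separator rather than at fixed offsets, it does not matter whether the second field is 8 or 9 characters long.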

Can I write a PCRE conditional that only needs the no-match part?

I am trying to create a regular expression to determine if a string contains a number for an SQL statement. If the value is numeric, then I want to add 1 to it. If the value is not numeric, I want to return a 1. More or less. Here is the SQL:
SELECT
field,
CASE
WHEN regexp_like(field, '^ *\d*\.?\d* *$') THEN dec(field) + 1
ELSE 1
END nextnumber
FROM mytable
This actually works, and returns something like this:
INVALID    1
00000      1
00001E     1
00379      380
00013      14
99904      99905
But to push the envelope of understanding, what if I wanted to cover negative numbers, or those with a positive sign? The sign would have to immediately precede or follow the number, but not both, and I would not want to allow white space between the sign and the number.
I came up with a conditional expression with a capture group to capture the sign on the front of the number to determine if a sign was allowed on the end, but it seems a little awkward to handle given I don't really need a yes-pattern.
Here is the modified regex: ^ ([+-]?)*\d*\.?\d*(?(1) *|[+-]? *)$
This works at regex101.com, but in order for it to work I need to have something before the pipe, so I have to duplicate the next pattern in both the yes-pattern and the no-pattern.
All that background for this question: How can I avoid that duplication?
EDIT: DB2 for i uses International Components for Unicode to provide regular expression processing. It turns out that this library does not support conditionals like PCRE, so I changed the tags on this question. The answer given by Wiktor Stribiżew provides a working alternative to the conditional by using a negative lookahead.
You do not have to duplicate the end pattern, just move it outside the conditional:
^ *([+-])?\d*\.?\d*(?(1)|[+-]?) *$
See the regex demo. So, the yes-part is empty, and the no-part has an optional pattern.
You may also solve it with a mere negative lookahead:
^ *([+-](?!.*[-+]))?\d*\.?\d*[+-]? *$
See another regex demo. Here, ([+-](?!.*[-+]))? optionally matches a + or - that is not followed, after any zero or more characters, by another + or -.
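The lookahead variant is easy to check against a few inputs. The sketch below uses Python's re module rather than ICU or PCRE (an assumption made for convenience; the pattern uses only syntax that all three share), and is_signed_number is a hypothetical name.

```python
import re

# Leading sign is accepted only if no other sign appears later in the string;
# a trailing sign is then allowed only when there was no leading one.
SIGNED_NUMBER = re.compile(r"^ *([+-](?!.*[-+]))?\d*\.?\d*[+-]? *$")


def is_signed_number(value):
    """True if value matches the lookahead-based pattern from the answer."""
    return SIGNED_NUMBER.match(value) is not None


if __name__ == "__main__":
    for text in ("00379", "-5", "5-", "+12.5", "-5-", "- 5", "INVALID", "00001E"):
        print(repr(text), is_signed_number(text))
```

Note that, like the original pattern in the question, this still accepts an empty string or a lone sign, because every numeric part is optional; tightening that would be a separate change.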

Partial String Replacement using PowerShell

Problem
I am working on a script that has a user provide a specific IP address, and I want to mask this IP in some fashion so that it isn't stored in the logs. My problem is that I can easily do this when I know what the first three values of the IP typically are; however, I want to avoid storing or hard-coding those values in the code if at all possible. I also want to be able to replace the values even if the first three are unknown to me.
Examples:
10.11.12.50 would display as XX.XX.XX.50
10.12.11.23 would also display as XX.XX.XX.23
I have looked up partial string replacements, but none of the questions or problems that I found came close to doing this. I have tried doing things like:
# This ended up replacing all of the numbers
$tempString = $str -replace '[0-9]', 'X'
I know that I am partway there, but I am aiming to replace only the first 3 sets of digits, so basically every digit that is before a '.', and I haven't been able to achieve this.
Question
Is what I'm trying to do possible to achieve with PowerShell? Is there a best practice way of achieving this?
Here's an example of how you can accomplish this:
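# Emit each line with the first three digit groups masked
# (assigning the result back to $_ would produce no output)
Get-Content 'File.txt' |
ForEach-Object { $_ -replace '\d{1,3}\.\d{1,3}\.\d{1,3}', 'xx.xx.xx' }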
This example matches a digit 1-3 times, a literal period, and continues that pattern so it'll capture anything from 0-999.0-999.0-999 and replace with xx.xx.xx
TheIncorrigible1's helpful answer is an exact way of solving the problem (replacement only happens if 3 consecutive .-separated groups of 1-3 digits are matched.)
A looser, but shorter solution that replaces everything but the last .-prefixed digit group:
PS> '10.11.12.50' -replace '.+(?=\.\d+$)', 'XX.XX.XX'
XX.XX.XX.50
(?=\.\d+$) is a (positive) lookahead assertion ((?=...)) that matches the enclosed subexpression (a literal . followed by 1 or more digits (\d) at the end of the string ($)), but doesn't consume it as part of the overall match.
The net effect is that only what .+ matched - everything before the lookahead assertion's match - is replaced with 'XX.XX.XX'.
Applied to the above example input string, 10.11.12.50:
(?=\.\d+$) matches the .-prefixed digit group at the end, .50.
.+ matches everything before .50, which is 10.11.12.
Since the (?=...) part isn't consumed, it is not included in what is replaced, so only the substring 10.11.12 is replaced, namely with XX.XX.XX, yielding XX.XX.XX.50 as a result.
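For comparison, both approaches can be sketched in Python's re module (not PowerShell; the regex syntax used here is the same in both). The function names mask_loose and mask_strict are hypothetical.

```python
import re


def mask_loose(ip):
    """Replace everything before the last dot-prefixed digit group."""
    # The lookahead checks for ".<digits>" at the end without consuming it.
    return re.sub(r".+(?=\.\d+$)", "XX.XX.XX", ip)


def mask_strict(ip):
    """Replace the first three dot-separated groups of 1-3 digits."""
    return re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}", "XX.XX.XX", ip, count=1)


if __name__ == "__main__":
    for address in ("10.11.12.50", "10.12.11.23"):
        print(mask_loose(address), mask_strict(address))
```

The loose version will also mask non-numeric text in front of the final group, while the strict version leaves input untouched unless three digit groups are actually present, so which one fits depends on how much the input can be trusted.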

matlab regexprep

How can I use MATLAB's regexprep for multiple expressions and replacements?
file='http:xxx/sys/tags/Rel/total';
I want to replace 'sys' with 'sys1' and 'total' with 'total1'. For a single expression and replacement it works like this:
strrep(file,'sys', 'sys1')
and I want to have something like
strrep(file,'sys','sys1','total','total1')
I know this doesn't work for strrep.
Why not just issue the command twice?
file = 'http:xxx/sys/tags/Rel/total';
file = strrep(file,'sys','sys1');
file = strrep(file,'total','total1')
To solve it you need regex substitution functionality. Try to find something in MATLAB's regular expressions similar to this in PHP:
$string = 'http:xxx/sys/tags/Rel/total';
preg_replace('/http:(.*?)\//', 'http:${1}1/', $string);
${1} refers to the 1st match group, that is, whatever is in parentheses: (.*?).
http:(.*?)\/ - match pattern
http:${1}1/ - replacement pattern; the second 1 is the character you wish to append (the first 1 is the group number)
http:xxx/sys/tags/Rel/total - input string
The secret is that whatever is matched by (.*?) (whether xxx or yyyy or 1234) will be inserted in place of ${1} in the replacement pattern, and the result then replaces the matched text in the input string. You are welcome to look at more examples of substitution functionality in PHP.
As documented in the help page for regexprep, you can specify pairs of patterns and replacements like this:
file='http:xxx/sys/tags/Rel/total';
regexprep(file, {'sys' 'total'}, {'sys1' 'total1'})
ans =
http:xxx/sys1/tags/Rel/total1
It is even possible to use tokens, should you be able to define a match pattern for everything you want to replace:
regexprep(file, '/([st][yo][^/$]*)', '/$11')
ans =
http:xxx/sys1/tags/Rel/total1
However, care must be taken with the first approach under certain circumstances, because MATLAB applies the pairs one after another. If the first pattern matches and its replacement contains something that a later pattern matches, that text will be replaced again by the later replacement, even though it did not match the later pattern in the original string.
Example:
regexprep('This\is{not}LaTeX.', {'\\' '([{}])'}, {'\\textbackslash{}' '\\$1'})
ans =
This\textbackslash\{\}is\{not\}LaTeX.
=> This\{}is{not}LaTeX.
and
regexprep('This\is{not}LaTeX.', {'([{}])' '\\'}, {'\\$1' '\\textbackslash{}'})
ans =
This\textbackslash{}is\textbackslash{}{not\textbackslash{}}LaTeX.
=> This\is\not\LaTeX.
Both results are unintended, and there seems to be no way around this with consecutive replacements instead of simultaneous ones.
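For contrast, a simultaneous (single-pass) replacement can be sketched outside MATLAB. The Python example below is only an illustration of the idea, not a regexprep feature: it combines all literal search strings into one alternation and looks up the replacement for whichever one matched, so replaced text is never scanned again. The name replace_simultaneously is hypothetical, and the sketch assumes a non-empty mapping of literal strings.

```python
import re


def replace_simultaneously(text, mapping):
    """Replace every key of mapping with its value in a single pass.

    Keys are treated as literal strings; longer keys are tried first so
    that a key which is a prefix of another does not shadow it.
    """
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # A function replacement is inserted literally, with no escape processing.
    return pattern.sub(lambda match: mapping[match.group(0)], text)


LATEX_ESCAPES = {"\\": r"\textbackslash{}", "{": r"\{", "}": r"\}"}

if __name__ == "__main__":
    print(replace_simultaneously("http:xxx/sys/tags/Rel/total",
                                 {"sys": "sys1", "total": "total1"}))
    print(replace_simultaneously(r"This\is{not}LaTeX.", LATEX_ESCAPES))
```

Because the braces introduced by \textbackslash{} are part of a replacement and not of the input, they are left alone, which is the behaviour the consecutive approach could not provide.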