Finding a string between two ranges of strings or end of string - swift

I am trying to extract whatever is between two strings. The first string is a known string, the second string could be from a list of strings.
For example,
We have the start string and the end strings. We want to get the text between these.
start = "start"
end = ["then", "stop", "other"]
Criteria
test = "start a task then do something else"
result = "a task"
test = "start a task stop doing something else"
result = "a task"
test = "start a task then stop"
result = "a task"
test = "start a task"
result = "a task"
I have looked at using a regex, and I got one which works for between two strings, I just cannot create one which words with a option of strings:
(?<=start\s).*(?=\sthen)
I have tried using this:
(?<=start\s).*(?=\sthen|\sstop|\sother)
but this will include 'then, stop or other' in the match like so:
"start a task then stop" will return "a task then"
I have also tried to do a 'match any character except the end list" in the capture group like so: (?<=start\s)((?!then|stop|other).*)(?=\sthen|\sstop|\sother) but this has the same effect as the one above.
I am using swift, so I am also wondering whether this can be achieved by finding the substring between two strings.
Thanks for any help!

You may use
(?<=start\s).*?(?=\s+(?:then|stop|other)|$)
See the regex demo. To search for whole words, add \b word boundary in proper places:
(?<=\bstart\s).*?(?=\s+(?:then|stop|other)\b|$)
See another regex demo
Details
(?<=start\s) - a positive lookbehind that matches a location immediately preceded with start string and a whitespace
.*? - any 0+ chars other than line break chars, as few as possible
(?=\s+(?:then|stop|other)|$) - a position in the string that is immediately followed with
\s+ - 1+ whitespaces
(?:then|stop|other) - one of the words
|$ - or end of string.

Related

How to match exact string in perl

I am trying to parse all the files and verify if any of the file content has strings TESTDIR or TEST_DIR
Files contents might look something like:-
TESTDIR = foo
include $(TESTDIR)/chop.mk
...
TEST_DIR := goldimage
MAKE_TESTDIR = var_make
NEW_TEST_DIR = tesing_var
Actually I am only interested in TESTDIR ,$(TESTDIR),TEST_DIR but in my case last two lines should be ignored. I am new to perl , Can anyone help me out with re-rex.
/\bTEST_?DIR\b/
\b means a "word boundary", i.e. the place between a word character and a non-word character. "Word" here has the Perl meaning: it contains characters, numbers, and underscores.
_? means "nothing or an underscore"
Look at "characterset".
Only (space) surrounding allowed:
/^(.* )?TEST_?DIR /
^ beginning of the line
(.* )? There may be some content .* but if, its must be followed by a space
at the and says that a whitespace must be there. Otherwise use ( .*)?$ at the end.
One of a given characterset is allowed:
Should the be other characters then a space be possible you can use a character class []:
/^(.*[ \t(])?TEST_?DIR[) :=]/
(.*[ \t(])? in front of TEST_?DIR may be a (space) or a \t (tab) or ( or nothing if the line starts with itself.
afterwards there must be one of (space) or : or = or ). Followd by anything (to "anything" belongs the "=" of ":=" ...).
One of a given group is allowed:
So you need groups within () each possible group in there devided by a |:
/^(.*( |\t))?TEST_?DIR( | := | = )/
In this case, at the beginning is no change to [ \t] because each group holds only one character and \t.
At the end, there must be (single space) or := (':=' surrounded by spaces) or = ('=' surrounded by spaces), following by anything...
You can use any combination...
/^(.*[ \t(])?TEST_?DIR([) =:]| :=| =|)/
Test it on Debuggex.com. (Use 'PCRE')

How can I obtain only word without All Punctuation Marks when I read text file?

The text file abc.txt is an arbitrary article that has been scraped from the web. For example, it is as follows:
His name is "Donald" and he likes burger. On December 11, he married.
I want to extract only words in lower case and numbers except for all kinds of periods and quotes in the above article. In the case of the above example:
{his, name, is, Donald, and, he, likes, burger, on, December, 11, he, married}
My code is as follows:
filename = 'abc.txt';
fileID = fopen(filename,'r');
C = textscan(fileID,'%s','delimiter',{',','.',':',';','"','''});
fclose(fileID);
Cstr = C{:};
Cstr = Cstr(~cellfun('isempty',Cstr));
Is there any simple code to extract only alphabet words and numbers except all symbols?
Two steps are necessary as you want to convert certain words to lowercase.
regexprep converts words, which are either at the start of the string or follow a full stop and whitespace, to lower case.
In the regexprep function, we use the following pattern:
(?<=^|\. )([A-Z])
to indicate that:
(?<=^|\. ) We want to assert that before the word of interest either the start of string (^), or (|) a full stop (.) followed by whitespace are found. This type of construct is called a lookbehind.
([A-Z]) This part of the expression matches and captures (stores the match) a upper case letter (A-Z).
The ${lower($0)} component in the regex is called a dynamic expression, and replaces the contents of the captured group (([A-Z])) to lower case. This syntax is specific to the MATLAB language.
You can check the behaviour of the above expression here.
Once the lower case conversions have occurred, regexp finds all occurrences of one or more digits, lower case and upper case letters.
The pattern [a-zA-Z0-9]+ matches lower case letters, upper case letters and digits.
You can check the behavior of this regex here.
text = fileread('abc.txt')
data = {regexp(regexprep(text,'(?<=^|\. )([A-Z])','${lower($0)}'),'[a-zA-Z0-9]+','match')'}
>>data{1}
13×1 cell array
{'his' }
{'name' }
{'is' }
{'Donald' }
{'and' }
{'he' }
{'likes' }
{'burger' }
{'on' }
{'December'}
{'11' }
{'he' }
{'married' }

Using regexp_replace how do t replace a string with an exception

How do I replace all occurrences of ' sub.*' with the exception of ' substation.*'?
regexp_replace("CleanString",' sub.*',' ', 'ig')
I have tried using various combinations of groupings () but still not getting it.
Using postgres regexp_replace()
A regular expression normally matches only things that are there, not things that are not there - you cannot simply put an "if-then-else" in there.
However, Postgres's regex support, the manual page for which is here includes "lookahead" and "lookbehind" expressions.
In your case, you want a *negative lookahead":
(?!re) negative lookahead matches at any point where no substring matching re begins (AREs only)
It's important to note the phrase "at any point" - lookarounds are "zero width", so (?!station) doesn't mean "something other than station", it means "a position in the string where station isn't coming next".
You can therefore construct your query like this:
' sub(?!station).*'
That will match any of "sub", "foo sub", " subbar", or "foo subbar", but not any of "substation", "foo substation", " substationbar", or "foo substationbar". Since the (?!station) is zero-width, and the next token is .*, it's fine for nothing to come after " sub".
If you want there to be something after the "sub", you could instead write:
' sub(?!station).+'
The .+ means "at least one of something", so it will still match " subbar" and "foo subbar", but will no longer match " sub" or "foo sub".

Getting IndexOutOfBounds Exception while search for a subtring

I have a string like
var word = "banana"
and a sentence like var sent = "the monkey is holding a banana which is yellow"
sent1 = "banana!!"
I want to search banana in sent and then write to a file in the following way:
the monkey is holding a
banana
which is yellow
I'm doing it in the following way:
var before = sent.substring(0, sent.indexOf(word))
var after = sent.substring(sent.indexOf(word) + word.length)
println(before)
println(after)
This works fine but when I do the same for sent1, then it gives me IndexOutOfBoundsException. I think it is because there is nothing before banana in sent1. How to deal with this?
You can split based on the word and you will get an array with everything before and after the word.
val search = sent.split(word)
search: Array[String] = Array("the monkey is holding a ", " which is yellow")
This works in the "banana!!!" case:
"banana!!".split(word)
res5: Array[String] = Array("", !!)
Now you can write the three lines to a file in your favorite way:
println(search(0))
println(word)
println(search(1))
What if you had more than one occurrence of the word? .split understands regular expressions, so you could improve the previous solution with something like this:
string
.replaceAll("\\s+(?=banana)|(?<=banana)\\s+")
.foreach(println)
\\s means a whitespace character
(?=<word>) means "followed by <word>"
(?<=<word>) means "preceded by <word>"
So, this would split your string into pieces, using any spaces either preceded or followed by the "banana", and not the word itself. The actual word ends up in the list, just like the other parts of the string, so you don't need to print it out explicitly
This regex trick is called "positive look-around" ( ?= is look-ahead, ?<= is look-behind) in case you are wondering.

Autohotkey: Splitting a concatenated string into string and number

I am using an input box to request a string from the user that has the form "sometext5". I would like to separate this via regexp into a variable for the string component and a variable for the number. The number then shall be used in a loop.
The following just returns "0", even when I enter a string in the form "itemize5"
!n::
InputBox, UserEnv, Environment, Please enter an environment!, , 240, 120
If ErrorLevel
return
Else
FoundPos := RegExMatch(%UserEnv%, "\d+$")
MsgBox %FoundPos%
retur
n
FoundPos, as its name implies, contains the position of the leftmost occurrence of the needle. It does not contain anything you specifically want to match with your regex.
When passing variable contents to a function, don't enclose the variable names in percent signs (like %UserEnv%).
Your regex \d+$ will only match numbers at the end of the string, not the text before it.
A possible solution:
myText := "sometext55"
if( RegExMatch(myText, "(.*?)(\d+)$", splitted) ) {
msgbox, Text: %splitted1%`nNumber: %splitted2%
}
As described in the docs, splitted will be set to a pseudo-array (splitted1, splitted2 ...), with each element containing the matched subpattern of your regex (the stuff that is in between round brackets).