Regex to match lines between two expressions - iphone

I am sorting some data and want to 'cut' out some rubbish between two bits of useful information.
Eg:
Useful one
rubbish
rubbish //rubbish here is covered by [.*], but the number of lines can be any number 1 or above
rubbish
useful two
I have successfully matched the useful parts of my information, I just need to know how to match the rubbish stuff. The pattern is as follows: useful, new line (no content), new line (no content), rubbish, new line (no content), new line (no content), useful.
The important part of this is that the rubbish section can vary in number of lines, but always has at least one line. Im not sure if i described this very well, any help is appreciated.

The best way I know of doing this is to do this
(exp1)(.+?)(exp2)
and replace or use in code the two groups
$1 $3
where $x is the group place holder
comment me for more specific syntax

your regexp (rubbish\s+)(rubbish\s+)(rubbish)

Try a pattern like (useful\n\n\n(.*)\n\n\nuseful\n)+, capturing rubbish into parenthesis. Improving and applying this pattern depends on your needs and your code.

Related

Multiple regex in one command

Disclaimer: I have no engineering background whatsoever - please don't hold it against me ;)
What I'm trying to do:
Scan a bunch of text strings and find the ones that
are more than one word
contain title case (at least one capitalized word after the first one)
but exclude specific proper nouns that don't get checked for title case
and disregard any parameters in curly brackets
Example: Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street.
Expectation: Flag the string for title capitalization because of Man and Street, not because of Today, {FIDO} or {Fifi}
Example: Don't post that video on TikTok.
Expectation: No flag because TikTok is a proper noun
I have bits and pieces, none of them error-free from what https://www.regextester.com/ keeps telling me so I'm really hoping for help from this community.
What I've tried (in piece meal but not all together):
(?=([A-Z][a-z]+\s+[A-Z][a-z]+))
^(?!(WordA|WordB)$)
^((?!{*}))
I think your problem is not really solvable solely with regex...
My recommendation would be splitting the input via [\s\W]+ (e.g. with python's re.split, if you really need strings with more than one word, you can check the length of the result), filtering each resulting word if the first character is uppercase (e.g with python's string.isupper) and finally filtering against a dictionary.
[\s\W]+ matches all whitespace and non-word characters, yielding words...
The reasoning behind this different approach: compiling all "proper nouns" in a regex is kinda impossible, using "isupper" also works with non-latin letters (e.g. when your strings are unicode, [A-Z] won't be sufficient to detect uppercase). Filtering utilizing a dictionary is a way more forward approach and much easier to maintain (I would recommend using set or other data type suited for fast lookups.
Maybe if you can define your use case more clearer we can work out a pure regex solution...

Is it correct to use Word Joiner (U+2060) in the same word?

In Bangla, Hosonto (U+09CD) is used to create a ligature, which joins adjacent letters. For example ক্ক is created using ক + ্ + ক. But sometimes we need a non-joining Hosonto (ক্‌ক). To make it possible, traditionally we use a Zero-width non-joiner (‌‌‌‌‌U+200C‌).
The problem with ‌‌‌‌‌ZWNJ is that, when the line is too long and line wrapping occurs, the word is broken into two lines. To keep the word as a whole, I need a character, something like “Zero-width non-breaking non-joiner”. But I don’t see such character in Unicode. So I think, Word Joiner (U+2060) is the best option.
To me, Word Joiner sounds like “joins two words”. But in my case, I need to join two parts of a single word. So, the question is, is it correct to use Word Joiner here?
U+200C ZERO WIDTH NON-JOINER has no effect on line breaking. Its absence or presence does not change where line wrapping can occur. If inserting a ZWNJ within a word causes that word to be broken across lines, then whatever application you are using to view your text does not implement the standard correctly.
ZWNJ is the only correct character for your purposes. More than that, using U+2060 WORD JOINER could in fact lead to inconsistent results. Much like ZWNJ does not affect line breaks, WJ is not supposed to affect joining behaviour (it is defined as “transparent” in that regard). While the standard doesn’t explicitly mention cases like this to the best of my knowledge, one could reasonably argue that inserting a WJ between the two letters in your example should not change the way they are displayed.

Removing spaces from a string using Powershell

I have an issue where extracting data from database it sometimes (quite often) adds spaces in between strings of texts that should not be there.
What I'm trying to do is create a small script that will look at these strings and remove the spaces.
The problem is that the spaces can be in any position in the string, and the string is a variable that changes.
Example:
"StaffID": "0000 25" <- The space in the number should not be there.
Is there a way to have the script look at this particular line, and if it finds spaces, to remove them.
Or:"DateOfBirth": "23-10-199 0" <-It would also need to look at these spaces and remove them.
The problem is that the same data also has lines such as:
"Address": " 91 Broad street" <- The spaces should be here obviously.
I've tried using TRIM, but that only removes spaces from start/end.
Worth mentioning that the data extracted is in json format and is then imported using API into the new system.
You should think about the logic of what you want to do, and whether or not it's programmatically possible to determine if you can teach your script where it is or is not appropriate to put spaces. As it is, this is one of the biggest problems facing AI research right now, so unfortunately you're probably going to have to do this by hand.
If it were me, I'd specify the kind of data format that I expect from each column, and try my best to attempt to parse those strings. For example, if you know that StaffID doesn't contain spaces, you can have a rule that just deletes them:
$staffid = $staffid.replace("\s+",'')
There are some more complicated things that you can do with forced formatting (.replace) that have already been covered in this answer, but again, that requires some expectation of exactly what data is going to come out of what column.
You might want to look more closely at where those spaces are coming from, rather than process the output like this. Is the retrieval script doing it? Maybe you can optimize the database that you're drawing from?

Sed for partial replacement?

Imagine I have a file that has the following type of line:
FIXED_DATA1 VARIABLE_DATA FIXED_DATA2
I want to change the fixed data and leave the variable data as is. For various reasons, using two sed operations to replace the fixed data will not work. For instance, the fixed fields might be double-quotes, and the line has other areas containing them, thus really the regex is written to match a pattern in the variable data and the fixed data.
If I'm bent on using sed, is there a way to change both fixed data fields at once while leaving the variable field unchanged?
Thanks.
You need to partition the line into the three pieces, replace the outer two and leave the middle alone:
sed 's/^FIX1 \(.*\) FIX2$/New \1 End/'
You can make the beginning and end matches more complex as needed.

Removing a trailing Space from Regex Matched group

I'm using regular expression lib icucore via RegKit on the iPhone to
replace a pattern in a large string.
The Pattern i'm looking for looks some thing like this
| hello world (P1)|
I'm matching this pattern with the following regular expression
\|((\w*|.| )+)\((\w\d+)\)\|
This transforms the input string into 3 groups when a match is found, of which group 1(string) and group 3(string in parentheses) are of interest to me.
I'm converting these formated strings into html links so the above would be transformed into
Hello world
My problem is the trailing space in the third group. Which when the link is highlighted and underlined, results with the line extending beyond the printed characters.
While i know i could extract all the matches and process them manually, using the search and replace feature of the icu lib is a much cleaner solution, and i would rather not do that as a result.
Many thanks as always
Would the following work as an alternate regular expression?
\|((\w*|.| )+)\s+\((\w\d+)\)\| Where inserting the extra \s+ pulls the space outside the 1st grouping.
Though, given your example & regex, I'm not sure why you don't just do:
\|(.+)\s+\((\w\d+)\)\|
Which will have the same effect. However, both your original regex and my simpler one would both fail, however on:
| hello world (P1)| and on the same line | howdy world (P1)|
where it would roll it up into 1 match.
\|\s*([\w ,.-]+)\s+\((\w\d+)\)\|
will put the trailing space(s) outside the capturing group. This will of course only work if there always is a space. Can you guarantee that?
If not, use
\|\s*([\w ,.-]+(?<!\s))\s*\((\w\d+)\)\|
This uses a lookbehind assertion to make sure the capturing group ends in a non-space character.