Searching for a word with a particular character in it in Perl - perl

I am trying to match a word that starts with any capital letter and ends with a zero in Perl.
For example
ABC0
XYZ0
EIU0
QW0
What I have tried -
$abc =~ /^[A-Z].+0$/
But I am not getting proper output for this. Can anybody help me please?

The ^ anchors the match at the start of the string, the $ at the end. .+ matches as many non-newline characters as possible. Therefore
"ABC0 XYZ0 EIU0 QW0" =~ /^[A-Z].+0$/
matches the whole string.
The \b assertion matches at word edges: everywhere a word character and a non-word character are adjacent. The \w character class holds only word characters, the \S character class all non-space characters. Either of these is better than the dot (.).
So you may want to use /\b[A-Z]\w*0\b/.
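A minimal sketch of the word-boundary approach, assuming the middle of each word consists of word characters (\w, lowercase w). The sample line is made up for illustration; "abc0" and "QW1" are included only as cases that should not match.

```perl
use strict;
use warnings;

# Hypothetical sample line; "abc0" and "QW1" are deliberate non-matches.
my $line = "ABC0 XYZ0 abc0 EIU0 QW0 QW1";

# \b[A-Z]  a capital letter at the start of a word
# \w*      any further word characters
# 0\b      a zero that ends the word
my @words = $line =~ /\b[A-Z]\w*0\b/g;

print join(",", @words), "\n";    # prints ABC0,XYZ0,EIU0,QW0
```

In list context with /g and no capture groups, the match returns every matched substring, which is why no loop is needed here.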

This might work :
$abc =~ /\b[A-Z].*0\b/
\b matches word boundaries.

Related

Regex expression for detecting 2 consecutive words when first word starts with #

I wanted to know the regex that detects names starting with #. For example, in the sentence "Hi #Steve Rogers, how are you?", I want to extract #Steve Rogers using a regex. I tried using Pattern.compile("#\\s*(\\w+)").matcher(text), but only "#Steve" gets detected. What else should I use?
Thanks
Try (#[\w\s]+)
It will only capture word and spaces after the #
See example at https://regex101.com/r/4Pv9bu/1
If you don't want to match a # sign that is followed only by a space (like "# "), and there can be more than a single word after it:
(?<!\S)#\w+(?:\h+\w+)?
Explanation
(?<!\S) Assert a whitespace boundary to the left
# Match literally
\w+ Match 1+ word characters
(?:\h+\w+)? Optionally match 1+ horizontal whitespace chars and 1+ word chars
In Java
String regex = "(?<!\\S)#\\w+(?:\\h+\\w+)?";
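For comparison, here is a small sketch of the same pattern in Perl rather than Java. It assumes Perl 5.10 or later, where \h (horizontal whitespace) is available; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "Hi #Steve Rogers, how are you?";

# (?<!\S)      no non-whitespace character directly to the left
# #\w+         the # sign and the first word
# (?:\h+\w+)?  optionally, horizontal whitespace and a second word
my ($name) = $text =~ /((?<!\S)#\w+(?:\h+\w+)?)/;

print "$name\n";    # prints #Steve Rogers
```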

Need Regular expression - perl

I am looking for a regex for the expression below:
my $text = "1170 KB/s (244475 bytes in 2.204s)"; # I want to retrieve the last '2.204' from this string.
$text =~ m/\d+[^\d*](\d+)/; #Regexp
my $num = $1;
print " $num ";
Output:
204
But I need 2.204 as output, please correct me.
Can any one help me out?
The regex is doing exactly what you asked it to: It is matching digits \d+, followed by one non-digit or star [^\d*], followed by digits \d+. The only thing that matches that in your string is 204.
If you want a quick fix, you can just move the parentheses:
m/(\d+[^\d*]\d+)/
This would (with the above input) match what you want. A more exact way to put it would be:
m/(\d+\.\d+)/
Of course this will match any floating-point number, so if the string can contain more of those, that's not a good idea. You can shore it up by using an anchor, like so:
m/(\d+\.\d+)s\)/
Where s\) forces the match to occur at only that place. Further strictures:
m/\(\d+\D+(\d+\.\d+)s\)/
You might also want to account for the possibility of your target number not being a float:
m/\(\d+\D+(\d+\.?\d*)s\)/
By using ? and * we allow for those parts not to match at all. This is not recommended to do unless you are using anchors. You can also replace everything in the capture group with [\d.]+.
If you are not fond of matching the parentheses, you can match the text:
m/bytes in ([\d.]+)s/
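A short sketch that runs the original pattern next to two of the alternatives above on the string from the question, to show what each one captures. The variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "1170 KB/s (244475 bytes in 2.204s)";

# Original attempt: the capture group only covers the digits after the dot.
my ($original) = $text =~ m/\d+[^\d*](\d+)/;

# Anchored on the surrounding parentheses; also allows a number without a fraction.
my ($anchored) = $text =~ m/\(\d+\D+(\d+\.?\d*)s\)/;

# Anchored on the literal text instead of the parentheses.
my ($by_text) = $text =~ m/bytes in ([\d.]+)s/;

print "$original $anchored $by_text\n";    # prints 204 2.204 2.204
```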
I'd go with the second marker as indicator where you are in the string:
my ($num) = ($text =~ /(\d+\.\d+)s/);
with explanations:
/( # start of matching group
\d+ # first digits
\. # a literal '.', take \D if you want non-numbers
\d+ # second digits
)/x # close the matching group and the regex
You had the matching groups wrong. Also, [^\d] is a bit excessive: generally you can negate the backslashed special classes (\d, \h, \s and \w) with their respective uppercase letters.
Try this regex:
$text =~ m/\d+[^\d]*(\d+\.?\d*)s/;
That should match 1+ digits, a decimal point if there is one, 0 or more decimal places, and make sure it's followed by a "s".

How do I replace all occurrences of certain characters with their predecessors?

$s = "bla..bla";
$s =~ s/([^%])\./$1/g;
I think it should replace every occurrence of . that is not after % with the character that precedes it.
But $s is then bla.bla, when it should be blabla. Where is the problem? I know I can use quantifiers, but I need to do it this way.
When a global regular expression is searching a string it will not find overlapping matches.
The first match in your string will be a., which is replaced with a. When the regex engine resumes searching it starts at the next . so it sees .bla as the rest of the string, and your regex requires a character to match before the . so it cannot match again.
Instead, use a negative lookbehind to perform the assertion that the previous character is not %:
$s =~ s/(?<!%)\.//g;
Note that if you use a positive lookbehind like (?<=[^%]), you will not replace the . if it is the first character in the string.
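The difference between the three substitutions can be checked with a small sketch. The helper names below are made up for illustration, and the extra inputs ('50%.off' and '.bla') are hypothetical test strings, not from the question.

```perl
use strict;
use warnings;

# Original approach: consumes the preceding character, so adjacent dots are missed.
sub strip_capture { my ($s) = @_; $s =~ s/([^%])\./$1/g; return $s; }

# Negative lookbehind: the dot must not be preceded by %.
sub strip_neg_lookbehind { my ($s) = @_; $s =~ s/(?<!%)\.//g; return $s; }

# Positive lookbehind: the dot must be preceded by some character other than %.
sub strip_pos_lookbehind { my ($s) = @_; $s =~ s/(?<=[^%])\.//g; return $s; }

for my $in ('bla..bla', '50%.off', '.bla') {
    printf "%-9s capture=%-8s neg=%-8s pos=%s\n", $in,
        strip_capture($in), strip_neg_lookbehind($in), strip_pos_lookbehind($in);
}
```

Only the leading-dot input separates the two lookbehind forms: the negative lookbehind removes the dot, the positive one leaves it in place.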
The problem is that even with the /g flag, each substitution starts looking where the previous one left off. You're trying to replace a. with a and then a. with a, but the second replacement doesn't happen because the a has already been "swallowed" by the previous replacement.
One fix is to use a zero-width lookbehind assertion:
$s =~ s/(?<=[^%])\.//g;
which will remove any . that is not the first character in the string, and that is not preceded by %.
But you might actually want this:
$s =~ s/(?<!%)\.//g;
which will remove any . that is not preceded by %, even if it is the first character in the string.
Much simpler than look-behinds is to use:
$s =~ s/([^%])\.+/$1/g;
This replaces any string of one or more dots after a character other than % by nothing.

How to get a perfect match for a regexp pattern in Perl?

I've to match a regular-expression, stored in a variable:
#!/bin/env perl
use warnings;
use strict;
my $expr = qr/\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)/sx;
my $str = "abcd[3] xyzg[4:0]";
if ($str =~ m/$expr/) {
print "\n%%%%%%%%% $`-----$&-----$'\n";
}
else {
print "\n********* NOT MATCHED\n";
}
But I'm getting this output (note the value of $&):
%%%%%%%%% -----abcd[3] xyzg-----[4:0]
But I expected that it would not enter the if clause.
What is intended is:
if $str = "abcd xyzg" => %%%%%%%%% -----abcd xyzg----- (CORRECT)
if $str = "abcd[2] xyzg" => %%%%%%%%% -----abcd[2] xyzg----- (CORRECT)
if $str = "abcd[2] xyzg[3]" => %%%%%%%%% -----abcd[2] xyzg[3]----- (CORRECT)
if $str = "abcd[2:0] xyzg[3]" => ********* NOT MATCHED (CORRECT)
if $str = "abcd[2:0] xyzg[3:0]" => ********* NOT MATCHED (CORRECT)
if $str = "abcd[2] xyzg[3:0]" => ********* NOT MATCHED (CORRECT/INTENDED)
but output is %%%%%%%%% -----abcd[2] xyzg-----[3:0] (WRONG)
Or, better to say, this is not intended.
In this case, I expect it to go to the else block.
I also don't understand why $& holds only a portion of the string (abcd[2] xyzg), with $' holding [3:0].
It should match the full string, not a part of it as above. If it can't, it shouldn't enter the if clause.
Can anyone please help me to change my $expr pattern, so that I can have what is intended?
By default, Perl regexes only look for a matching substring of the given string. In order to force comparison against the entire string, you need to indicate that the regex begins at the beginning of the string and ends at the end by using ^ and $:
my $expr = qr/^\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)$/;
(Also, there's no reason to have the /x modifier, as your regex doesn't include any literal whitespace or # characters, and there's no reason for the /s modifier, as you're not using ..)
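As a quick check, the anchored pattern can be run against the six inputs listed in the question. This is only a sketch; the %matched hash and the loop are illustrative scaffolding.

```perl
use strict;
use warnings;

# Anchored version of the pattern: the whole string must fit.
my $expr = qr/^\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)$/;

my @cases = (
    'abcd xyzg', 'abcd[2] xyzg', 'abcd[2] xyzg[3]',
    'abcd[2:0] xyzg[3]', 'abcd[2:0] xyzg[3:0]', 'abcd[2] xyzg[3:0]',
);

my %matched;
for my $str (@cases) {
    $matched{$str} = ($str =~ $expr) ? 1 : 0;
    printf "%-22s %s\n", $str, $matched{$str} ? "matched" : "NOT MATCHED";
}
```

With the anchors in place, the first three inputs match and the last three do not, which is the behaviour the question asks for.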
EDIT: If you don't want the regex to match against the entire string, but you want it to reject anything in which the matching portion is followed by something like "[0:0]", the simplest way would be to use lookahead:
my $expr = qr/^\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\]|(?=[^[\w])|$ ))/x;
This will match anything that takes the following form:
beginning of the string (which your example in the comments seems to imply you want)
zero or more whitespace characters
one or more word characters
optional: [, one or more digits, ]
one or more whitespace characters
one or more word characters
one of the following, in descending order of preference:
[, one or more digits, ]
an empty string followed by (but not including!) a character that is neither [ nor a word character (The exclusion of word characters is to keep the regex engine from succeeding on "a[0] bc[1:2]" by only matching "a[0] b".)
end of string (A space is needed after the $ to keep it from merging with the following ) to form the name of a special variable, and this entails the reintroduction of the /x option.)
Do you have any more unstated requirements that need to be satisfied?
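A hedged sketch of the lookahead idea follows. It is a variant of the pattern above rather than a verbatim copy: the [ inside the character class is escaped, \z stands in for the end-of-string alternative, and the inner groups are non-capturing so that $2 is the second token. The helper second_token and the sample inputs are hypothetical.

```perl
use strict;
use warnings;

# Variant of the lookahead pattern: after the second word, accept [digits],
# or a following character that is neither [ nor a word character, or end of string.
my $expr = qr/^\s*(\w+(?:\[\d+\])?)\s+(\w+(?:\[\d+\]|(?=[^\[\w])|\z))/;

# Hypothetical helper: returns the second token, or undef when there is no match.
sub second_token {
    my ($str) = @_;
    return $str =~ $expr ? $2 : undef;
}

for my $str ('abcd xyzg', 'abcd[2] xyzg, and more', 'abcd[2] xyzg[3] tail',
             'abcd[2] xyzg[3:0]', 'a[0] bc[1:2]') {
    my $tok = second_token($str);
    printf "%-24s %s\n", $str, defined $tok ? $tok : 'NOT MATCHED';
}
```

Unlike the fully anchored pattern, this one tolerates trailing text after the second token, but still rejects a second token followed by something like [3:0].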
The short answer is your regexp is wrong.
We can't fix it for you without you explaining what you need exactly, and the community is not going to write a regexp exactly for your purpose because that's just too localized a question that only helps you this one time.
You need to ask something more general about regexps that we can explain to you, that will help you fix your regexp, and help others fix theirs.
Here's my general answer when you're having trouble testing your regexp: use a regexp tool, such as RegexBuddy.
So I'm going to give a specific answer about what you're overlooking here:
Let's make this example smaller:
Your pattern is a(bc+d)?. It will match abcd, abccd, etc. It will not match bcd or bzd. In the case of abzd it will match, but only the a, because the whole group bc+d is optional. Similarly, it will match abcbcd as just a, dropping the whole optional group that couldn't be matched (it fails at the second b).
Regexps will match as much of the string as they can and return a true match when they can match something and have satisfied the entire pattern. If you make something optional, they will leave it out when they have to, including it only when it's present and matches.
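The small example can be run directly. In this sketch the whole match is wrapped in a capture group so that the matched portion can be printed; the %got hash is illustrative scaffolding.

```perl
use strict;
use warnings;

my %got;
for my $s (qw(abcd abccd abzd abcbcd bcd)) {
    # Outer group captures whatever a(bc+d)? actually matched.
    $got{$s} = $s =~ /(a(?:bc+d)?)/ ? $1 : undef;
    printf "%-7s %s\n", $s, defined $got{$s} ? "matched '$got{$s}'" : "no match";
}
```

For abzd and abcbcd the match succeeds but covers only the a, because the optional group is dropped; bcd contains no a, so it does not match at all.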
Here's what you tried:
qr/\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)/sx
First, s and x aren't needed modifiers here.
Second, this regex can match:
Any or no whitespace followed by
a word of at least one word character (letter, digit or underscore) followed by
optionally a grouped square bracketed number with at least one digit (eg [0] or [9999]) followed by
at least one white space followed by
a word of at least one word character followed by
optionally a square bracketed number with at least one digit.
Clearly when you ask it to match abcd[0] xyzg[0:4] the colon ends the \d+ pattern but doesn't satisfy the \] so it backtracks the whole group, and then happily finds the group was optional. So by not matching the last optional group, your pattern has matched successfully.

What do these various pieces of syntax mean?

I'm trying to figure out the syntax of both the sed command and perl script:
sed 's/^EOR:$//' INPUTFILE |
perl -00 -ne '/
TAGA01:\s+(.*?)\n
.*
TAGCC08:\s+(.*?)\n
# and so on
/xs && print "$1 $2\n"'
Why is there a circumflex ^ in the sed command? The third slash / will replace all instances of EOR: with a blank line, correct?
I understand some of the Perl script. Looking at perlrun, -00 will slurp the stream in paragraph mode and -n starts a while <> loop.
Why is there the first slash / next to the apostrophe? The command searches for TAGXXXX:, but I am not sure what \s+(.*?) does. Does that put whatever is after the tag into a variable? How about the .* in between the tag searches? What does /xs do? What do the $1 and $2 refer to in the print line?
This was tough to find online, and if someone could kick me in the right direction, I'd appreciate it.
The circumflex ^ is regex for "start of line", and $ is regex for "end of line"; so sed will only blank out lines which consist of exactly "EOR:" and nothing else, leaving an empty line in their place.
The Perl script is basically perl -00 -ne '/(re)g(ex)/ && print "re ex\n"' with a big ole regex instead of the simple placeholder I put here. In particular, the /x modifier allows you to split the regex over several lines. So the first / is the start of the regex and the final / is the end of the regex and the lines in between form the regex together.
The /s modifier changes how Perl interprets . in a regex; normally it will match any character except newline, but with this option, it includes newlines as well. This means that .* can match multiple lines.
\s matches a single whitespace character; \s+ matches as many whitespace characters as possible, but there has to be at least one.
(.*?) matches an arbitrary length of string; the dot matches any character, the asterisk says zero or more of any character, and the question mark modifies the asterisk repetition operator to match as short a string as possible instead of as long a string as possible. The parentheses cause the matched text to be captured; the captures are available as $1, $2, etc., as many as there are capture groups; the numbers correspond to the order of the opening parentheses (so if you apply (a(b)) to the string "ab", $1 will be "ab" and $2 will be "b").
Finally, \n matches a literal newline. So the (.*?) non-greedy match will match up to the first newline, i.e. the tail of the line on which the TAGsomething was found. (I imagine these are gene sequences, not "tags"?)
It doesn't really make sense to run sed separately; Perl would be quite capable of removing the EOR: lines before attempting to match the regex.
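A sketch of doing both steps in Perl alone, under some assumptions: the input data below is invented for illustration, an in-memory filehandle stands in for INPUTFILE, and setting $/ to the empty string stands in for the -00 switch.

```perl
use strict;
use warnings;

# Hypothetical input: two records, each terminated by an "EOR:" line.
my $data = join "\n",
    "TAGA01: first value", "OTHER: ignored", "TAGCC08: second value", "EOR:",
    "TAGA01: third", "TAGCC08: fourth", "EOR:", "";

# Same job as the sed command: blank out lines that are exactly "EOR:".
$data =~ s/^EOR:$//mg;

my @out;
open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
{
    local $/ = "";    # paragraph mode, the effect of -00
    while (my $record = <$fh>) {
        if ($record =~ /
                TAGA01:\s+(.*?)\n     # first capture: rest of the TAGA01 line
                .*                    # anything in between, newlines included
                TAGCC08:\s+(.*?)\n    # second capture: rest of the TAGCC08 line
            /xs) {
            push @out, "$1 $2";
        }
    }
}
close $fh;

print "$_\n" for @out;
```

The blank lines left behind by the substitution are what paragraph mode uses as record separators, so each EOR-terminated block is matched on its own.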
Let's see...
Yes, sed will empty the lines with EOR:
The first / in the Perl script starts a regexp pattern. Concretely, it is searching for a pattern in the form shown below.
The regex ends with "xs": x lets the pattern be spread over several lines (whitespace and comments inside it are ignored), and s lets . match newlines, so the regex can match across multiple lines of the input.
The script will also print the strings found after the tags. $1 and $2 hold the text captured by the first and second pairs of parentheses, respectively.
The form is this one:
TAGA01:<spaces><string1>
<whatever here>
TAGCC08:<spaces><string2>
In this case, $1 is <string1> and $2 is <string2>.