How can I preserve the uppercase/lower case of a string in search using perl? - perl

I want to search for "Frequencies" (its first letter in uppercase) in my text files. And my code will print to the output file some columns including "Frequencies". But there are also occurrences of "frequencies" (its first letter in lowercase) in the text files. I am using this part $search_word = qr/Frequencies/; in the code. How can I make the first letter of the word "Frequencies" upper case in the $search_word = qr/Frequencies/; part to eliminate the occurrences of "frequencies" in the search?

In Perl, you have ucfirst to capitalize the first letter. For example:
$a = "freQuEncY";
$a = ucfirst(lc($a)); # $a <-- "Frequency";

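Why don't you use a regex match to check? Matching is case-sensitive unless you add the /i modifier, so "frequencies" will not match:
if ($string_to_be_searched =~ /Frequencies/) {
    print $string_to_be_searched;    # or whatever processing you need
}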
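A minimal sketch of that point, assuming a few made-up sample lines in place of the real text files (the contents of @lines are hypothetical). It keeps only the lines where the word appears with a capital F:

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the contents of the text files.
my @lines = (
    "Frequencies --  100.5  200.7  300.9",
    "Low frequencies --  10.1  20.2",
    "Harmonic Frequencies (cm**-1)",
);

# No /i modifier, so the match is case-sensitive.
my $search_word = qr/Frequencies/;

# Keep only the lines that contain the capitalized form.
my @hits = grep { $_ =~ $search_word } @lines;
print "$_\n" for @hits;
```

The line with lowercase "frequencies" is dropped without any extra work on the pattern.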

Try this one:
if ( $$test_string[$i] =~ /\b(?i)f(?-i)requencies/ ) {
my $captured = ucfirst($&);
# process $captured
}
Explanation:
The regex matches case-insensitively for the first letter of the word frequencies only. (?i) turns on case-insensitive matching from the position where it occurs to the end of the pattern, or until it is revoked by (?-i). This works for other flags too; see perldoc perlre.
$& contains the full match
\b denotes a word boundary (perhaps you don't need that but your problem description suggests you do).
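To see the scoped modifier in action, here is a small sketch that runs the same pattern against a few spellings (the word list is made up for illustration):

```perl
use strict;
use warnings;

# (?i) ... (?-i) limits case-insensitivity to the first letter only.
my $re = qr/\b(?i)f(?-i)requencies/;

my @words = ("Frequencies", "frequencies", "FREQUENCIES", "fRequencies");

my %result;
for my $word (@words) {
    $result{$word} = $word =~ $re ? "match" : "no match";
    print "$word: $result{$word}\n";
}
```

Only the first two spellings match, because everything after the first letter is still compared case-sensitively.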

Related

How to insert a colon between word and number

I want to insert a colon between word and number then add a new line after a number.
For example:
"cat 11052000 cow_and_owner_ 01011999 12031981 dog 22032011";
my expected output:
cat:11052000
cow_and_owner_:01011999 12031981
dog:22032011
My attempt :
$Bday=~ /^([a-z]||\_)/:/^([0-9])/
print "\n";
#!/usr/bin/perl
use warnings;
use strict;
my $str = "cat 11052000 cow_and_owner_ 01011999 12031981 dog 22032011";
$str =~ s/\s*([a-z_]+)\s+(\d+(?: \d+)*)/$1:$2\n/g;
print $str;
produces your desired output from your sample input.
Edit: Note the use of the s/// operator for regular expression substitution. One of the many problems with your code is that you're not using it (assuming your intent is to modify the string in place rather than extract pieces from it for further processing).
One more variant -
> cat test_perl.pl
#!/usr/bin/perl
use strict;
use warnings;
while ( "cat 11052000 cow_and_owner_ 01011999 12031981 dog 22032011" =~ m/([a-z_]+)\s+(\d+(?: \d+)*)/g )
{
print "$1:$2\n";
}
> test_perl.pl
cat:11052000
cow_and_owner_:01011999 12031981
dog:22032011
>
The original code $Bday=~ /^([a-z]||\_)/:/^([0-9])/ doesn't make much sense. Apart from missing a semicolon and having too many delimiters (matching patterns are of the format /.../ or m/.../ and replacing ones s/.../.../), it could never match anything.
([a-z]||\_) would match:
one lowercase ASCII letter (a through z);
an empty string (the space between the two |s); or
one underscore (escape with a backslash is superfluous).
To get it (or the corresponding subexpression for numbers) to match a sequence of one or more of the characters, you need to follow it with a +.
^([0-9]) would fail to match unless it was at the beginning of the string. There it would match a single digit.
My solution (taking into account the later comments by the OP about having input such as cat[1] or dog3):
use strict;
use warnings;
my $bday = "cat 11052000 cow_and_owner_ 01011999 12031981 dog 22032011 cat[1] 01012018 dog3 02012018";
# capture groups:
# $1------------------------\ $2-------------\
$bday =~ s/([A-Za-z][A-Za-z0-9_\[\]]*)\h+(\d+(?:\h+\d+)*)(?!\S)\s*/$1:$2\n/g;
print $bday;
will print out:
cat:11052000
cow_and_owner_:01011999 12031981
dog:22032011
cat[1]:01012018
dog3:02012018
Breakdown:
[A-Za-z]: Begin with a letter.
[A-Za-z0-9_\[\]]*: Follow with zero or more letters, numbers, underscores and square brackets.
\h+: Separate with one or more horizontal whitespace.
\d+(?:\h+\d+)*: One sequence of digits (\d+) followed by zero or more sequences of horizontal whitespace and digits.
(?!\S): Can't be followed by non-whitespace.
\s*: Consume following whitespace (including line feeds; this allows the input to be separated on multiple lines, as long as a single entry is not spread on multiple lines. To get that, replace all the \h+ with \s+.).
The replace pattern will repeat (the /g modifier) sequentially in the source string as long as it matches, placing each heading-date record on its own line and then proceeding with the rest of the string.
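The same substitution can be written with the /x modifier so that each piece of the breakdown sits next to its own comment. This is only a sketch of the pattern above on a shortened, made-up input string:

```perl
use strict;
use warnings;

my $bday = "cat 11052000 cow_and_owner_ 01011999 12031981 dog3 02012018";

# Same pattern as above, spread out with /x; whitespace in the pattern is ignored.
$bday =~ s{
    ( [A-Za-z] [A-Za-z0-9_\[\]]* )   # group 1: heading, starts with a letter
    \h+                              # separator: horizontal whitespace
    ( \d+ (?: \h+ \d+ )* )           # group 2: one or more runs of digits
    (?! \S )                         # must not run into non-whitespace
    \s*                              # swallow trailing whitespace
}{$1:$2\n}gx;

print $bday;
```

With /x the comments live inside the regex itself, which makes later changes to a single piece easier to review.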
Note that if your headers (dog etc.) might contain non-ASCII letters, use \pL or \p{XPosixAlpha} instead of [A-Za-z]:
$bday =~ s/(\pL[\pL0-9_\[\]]*)\h+(\d+(?:\h+\d+)*)(?!\S)\s*/$1:$2\n/g;

index argument contains . perl

If a string contains . representing any character, index doesn't match on it. What to do so that it takes . as any character?
For ex,
index($str, $substr)
if $substr contains . anywhere, index will always return -1
thanks
carol
That is not possible. The documentation says:
The index function searches for one string within another, but without
the wildcard-like behavior of a full regular-expression pattern match.
...
Keywords you can use for further searching are:
perl regular expression wildcard
Update:
If you just want to know whether your string matches, a regular expression could look like this:
my $string = "Hello World!";
if( $string =~ /ll. Worl/ )
{
print "Ahoi! Position: ".($-[0])."\n";
}
Here the . matches any single character.
$-[0] is the offset into the string of the beginning of the entire
match.
-- http://perldoc.perl.org/perlvar.html
If you want a pattern that matches an arbitrary number of arbitrary characters, you could use something like...
...
if( $string =~ /ll.*orl/ )
{
...
See perlvar for further information about special Perl variables. You will find the variable @LAST_MATCH_START and some explanation of $-[0] there. There are several more variables that can help you find submatches and gather other interesting information about your matches.
From perldoc -f index, you can see index() doesn't have any regex syntax:
index STR,SUBSTR
The index function searches for one string within another, but without the wildcard-like behavior of a full regular-
expression pattern match. It returns the position of the first occurrence of SUBSTR in STR at or after POSITION. If
POSITION is omitted, starts searching from the beginning of the string. POSITION before the beginning of the string or after
its end is treated as if it were the beginning or the end, respectively. POSITION and the return value are based at 0 (or
whatever you've set the $[ variable to--but don't do that). If the substring is not found, "index" returns one less than the
base, ordinarily "-1"
A simple test:
$ perl -e 'print index("1234567asdfghj.","j.")'
13
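Use a regex instead. After a successful match, $-[0] holds the offset where the match starts (pos($str) after a /g match would give the offset where it ends):
if ($str =~ /$substr/) {
    $index = $-[0];
}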
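A short side-by-side sketch of the two approaches, reusing the "Hello World!" string from the earlier answer (the variable names $literal_pos and $regex_pos are just for illustration):

```perl
use strict;
use warnings;

my $str    = "Hello World!";
my $substr = "ll. Worl";   # the dot is meant as a wildcard here

# index() treats the dot literally, so nothing is found.
my $literal_pos = index($str, $substr);

# A regex treats the dot as "any character"; $-[0] is where the match starts.
my $regex_pos = $str =~ /$substr/ ? $-[0] : -1;

print "index: $literal_pos, regex: $regex_pos\n";   # index: -1, regex: 2
```

If the substring should be taken literally inside a regex after all, wrapping it as \Q$substr\E makes the regex behave like index again.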

Quantifier follows nothing in regex

My requirement is to print the files having 'xyz' text in their file names using perl.
I tried below and got the following error
Quantifier follows nothing in regex marked by <-- HERE in m/* <-- HERE xyz.xlsx$/;
use strict;
use warnings;
my @files = qw(file_xyz.xlsx,file.xlsx);
my @my_files = grep { /*xyz.xlsx$/ } @files;
for my $file (@my_files) {
print "The output $file \n";
}
Problem is coming when I add * in grep regular expression.
How can I possibly achieve this?
The * is a metacharacter, called a quantifier. It means "repeat the previous character or character class zero or more times". In your case, it follows nothing, and is therefore a syntax error. What you are probably trying to do is match anything, which is .* (a wildcard followed by a quantifier). However, a regex match already behaves this way unless it is anchored, so all you need is:
my @my_files = grep { /xyz/ } @files;
You could keep your end-of-string anchor xlsx$, but since you have a limited list of file names, that hardly seems necessary. Note, though, that you have used qw() incorrectly: its items are separated by whitespace, not commas:
my @files = qw(file_xyz.xlsx file.xlsx);
However, if you should have a larger set of file names, such as one read from a directory, you can place a wildcard string in the middle:
my @my_files = grep { /xyz.*\.xlsx$/i } @files;
Note the use of the /i modifier to match case insensitively. Also note that you must escape . because it is another meta character.
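Putting the pieces together, a minimal runnable sketch might look like this. The file names in @files are invented for the example:

```perl
use strict;
use warnings;

# Hypothetical file names for illustration; qw() items are whitespace-separated.
my @files = qw(file_xyz.xlsx file.xlsx XYZ_report.XLSX xyz_notes.txt);

# "xyz" anywhere in the name, then a literal ".xlsx" at the end, any case.
my @my_files = grep { /xyz.*\.xlsx$/i } @files;

print "The output $_\n" for @my_files;
```

Only file_xyz.xlsx and XYZ_report.XLSX pass the filter; the .txt file fails the anchored extension check.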

Perl - partial pattern matching in a sequence of letters

I am trying to find a pattern using perl. But I am only interested with the beginning and the end of the pattern. To be more specific I have a sequence of letters and I would like to see if the following pattern exists. There are 23 characters. And I'm only interested in the beginning and the end of the sequence.
For example I would like to extract anything that starts with ab and ends with zt. There is always
So it can be
abaaaaaaaaaaaaaaaaaaazt
So that it detects this match
but not
abaaaaaaaaaaaaaaaaaaazz
So far I tried
if ($line =~ /ab[*]zt/) {
print "found pattern ";
}
thanks
* is a quantifier and meta character. Inside a character class bracket [ .. ] it just means a literal asterisk. You are probably thinking of .* which is a wildcard followed by the quantifier.
Matching entire string, e.g. "abaazt".
/^ab.*zt$/
Note the anchors ^ and $, and the wildcard character . followed by the zero or more * quantifier.
Match substrings inside another string, e.g. "a b abaazt c d"
/\bab\S*zt\b/
Using word boundary \b to denote beginning and end instead of anchors. You can also be more specific:
/(?<!\S)ab\S*zt(?!\S)/
Using a double negation to assert that no non-whitespace characters follow or precede the target text.
It is also possible to use the substr function
if (substr($string, 0, 2) eq "ab" and substr($string, -2) eq "zt")
You mention that the string is 23 characters, and if that is a fixed length, you can get even more specific, for example
/^ab.{19}zt$/
Which matches exactly 19 wildcards. The syntax for the {} quantifier is {min,max} (no space after the comma), and a max left blank means no upper limit, i.e. {1,} is the same as + and {0,} is the same as *, meaning one/zero or more matches (respectively).
Just a * by itself won't match anything (except a literal *); if you want to match anything, you need to use .*.
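The difference between the open-ended and the fixed-length pattern shows up on a string with the right ends but the wrong length. A small sketch, with test strings built for the purpose:

```perl
use strict;
use warnings;

my $good = "ab" . ("a" x 19) . "zt";   # 23 characters, right ends
my $bad  = "ab" . ("a" x 19) . "zz";   # 23 characters, wrong ending
my $long = "ab" . ("a" x 25) . "zt";   # right ends, wrong length

for my $s ($good, $bad, $long) {
    my $loose = $s =~ /^ab.*zt$/    ? "yes" : "no";
    my $fixed = $s =~ /^ab.{19}zt$/ ? "yes" : "no";
    printf "%-30s loose=%s fixed=%s\n", $s, $loose, $fixed;
}
```

Only the 23-character string ending in zt passes both checks; the longer one passes the .* version alone.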
if ($line =~ /^ab.*zt$/) {
print "found pattern ";
}
If you really want to capture the match, wrap the whole pattern in a capture group:
if (my ($string) = $line =~ /^(ab.*zt)$/) {
print "found pattern $string";
}

Perl Regex to match words with more than 2 characters

I am new to Perl and working on a regex to match only words with 3 or more letters. Here is the program I am trying. I tried adding \w{3,} since it should match 3 or more characters, but it still matches words with fewer than 3 characters. For example, if I give "This is a Pattern", I want my $field to match only "This" and "Pattern" and skip "is" and "a".
#!/usr/bin/perl
while (<STDIN>) {
foreach my $reg_part (split(/\s+/, $_)) {
if ($reg_part =~ /([^\w\#\.]*)?([\w{3,}\#\(\)\+\$\.]+)(?::(.+))?/) {
print "reg_part = $reg_part \n";
my ($mod, $field, $pat) = ($1, $2, $3);
print "#$mod#$field#$pat#$negate#\n";
}
}
}
exit(0);
What am I missing?
You have
[\w{3,}...]+
which is the same as
[{},3\w...]+
I think you want
(?:\w{3,}|[\$\#()+.])+
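A quick way to convince yourself of this is to compare the two forms directly. The sketch below uses simplified patterns ($class and $quant are names chosen for the example, not taken from the question):

```perl
use strict;
use warnings;

# Inside [...] the text "{3,}" is just four literal characters.
my $class = qr/^[\w{3,}]+$/;
# Outside a class, {3,} is a quantifier: three or more word characters.
my $quant = qr/^\w{3,}$/;

for my $word (qw(This is a Pattern)) {
    printf "%-8s class=%d quantifier=%d\n",
        $word, ($word =~ $class ? 1 : 0), ($word =~ $quant ? 1 : 0);
}
```

The character-class version accepts every word, including "is" and "a", while the quantifier version accepts only "This" and "Pattern".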
Break your regular expression up.
You know you want three word characters, so specify :-
# Match three word characters.
\w{3}
After that, you don't really care if the word has more characters, but you won't block it either.
# Match 0 or more word characters
\w*
Finally, you want to ensure that you have boundaries to catch the end of words. So, putting it all together. To match a word with at least three word characters, possibly more, use:-
# Word boundaries at start and end
\b\w{3}\w*\b
Note - \w matches alphanumerics plus underscore - if it's just alpha you need:-
# Alpha only
\b[A-Za-z]{3}[A-Za-z]*\b
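Applied to the sample sentence from the question, a minimal sketch could collect the qualifying words in list context (here \w{3,} is used as shorthand for \w{3}\w*):

```perl
use strict;
use warnings;

my $text = "This is a Pattern";

# In list context, a /g match returns every match of the whole pattern.
my @long_words = $text =~ /\b\w{3,}\b/g;

print join(", ", @long_words), "\n";   # This, Pattern
```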