Perl, Split string by specific pattern - perl

I found how to split a string by whitespaces, but that only takes into an account a single character. In my case, I have comments pasted into a file that includes newlines and whitespaces. I have them separated by this string: [|]
So I need to split my $string into an array for example, where $string =
This is a comment.
This is a newline.
This is the end[|]This is second comment.
This is second newline.
[|]Last comment
Gets split into $array[0], $array[1], and $array[2] which include the newlines and whitespaces. Separated by [|]
Every example I find on the web uses a single character, such as space or newline, to split strings. In my case I have to use a more specific identifier, which is why I selected [|] but having troubles splitting it by this.
I have tried to limit it to parse by a single '|' character with this code:
my #words = split /|/, $string;
foreach my $thisline (#words) {
print "This line = '" . $thisline . "'\n";
But this seems to split the entire string, character-by-character into #words.

[, |, and ] are all special characters in regular expressions -- | is used to separate options, and […] are used to specify character sets. Using an unquoted | makes the expression match the empty string (more specifically: the empty string or the empty string), causing it to match and split on every character boundary. These characters must be escaped to use them literally in an expression:
my #words = split /\[\|\]/, $string;
Since all the lines makes this visually confusing, you should probably use m{} quotes instead of //, and \Q…\E to quote a range of characters instead of a separate backslash for each one. (This is functionally identical, it's just a little easier to read.)
my #words = split m{\Q[|]\E}, $string;

Related

How to insert a colon between word and number

I want to insert a colon between word and number then add a new line after a number.
For example:
"cat 11052000 cow_and_owner_ 01011999 12031981 dog 22032011";
my expected output:
cat:11052000
cow_and_owner_:01011999 12031981
dog:22032011
My attempt :
$Bday=~ /^([a-z]||\_)/:/^([0-9])/
print "\n";
#!/usr/bin/perl
use warnings;
use strict;
my $str = "cat 11052000 cow_and_owner_ 01011999 12031981 dog 22032011";
$str =~ s/\s*([a-z_]+)((?: \d+)+)/$1:$2\n/g;
print $str;
produces your desired output from your sample input.
Edit: Note the use of the s operator for regular expression substitution. One of the many problems with your code is that you're not using that (IF your intent is to modify the string in place and not extract bits from it for further processing)
One more variant -
> cat test_perl.pl
#!/usr/bin/perl
use strict;
use warnings;
while ( "cat 11052000 cow_and_owner_ 01011999 12031981 dog 22032011" =~ m/([a-z_]+)\s+([0-9 ]+)/g )
{
print "$1:$2\n";
}
> test_perl.pl
cat:11052000
cow_and_owner_:01011999 12031981
dog:22032011
>
The original code $Bday=~ /^([a-z]||\_)/:/^([0-9])/ doesn't make much sense. Apart from missing a semicolon and having too many delimiters (matching patterns are of the format /.../ or m/.../ and replacing ones s/.../.../), it could never match anything.
([a-z]||\_) would match:
one lowercase ASCII letter (a through z);
an empty string (the space between the two |s; or
one underscore (escape with a backslash is superfluous).
To get it (or the corresponding subexpression for numbers) to match a sequence of one
or more of the characters, you need to follow it with a +.
^([0-9]) would fail to match unless it was at the beginning of the string. There it would match a single digit.
My solution (taking into account the later comments by the OP about having input such as cat[1] or dog3):
use strict;
use warnings;
my $bday = "cat 11052000 cow_and_owner_ 01011999 12031981 dog 22032011 cat[1] 01012018 dog3 02012018";
# capture groups:
# $1------------------------\ $2-------------\
$bday =~ s/([A-Za-z][A-Za-z0-9_\[\]]*)\h+(\d+(?:\h+\d+)*)(?!\S)\s*/$1:$2\n/g;
print $bday;
will print out:
cat:11052000
cow_and_owner_:01011999 12031981
dog:22032011
cat[1]:01012018
dog3:02012018
Breakdown:
[A-Za-z]: Begin with a letter.
[A-Za-z0-9_\[\]]*: Follow with zero or more letters, numbers, underscores and square brackets.
\h+: Separate with one or more horizontal whitespace.
\d+(?:\h+\d+)*: One sequence of digits (\d+) followed by zero or more sequences of horizontal whitespace and digits.
(?!\S): Can't be followed by non-whitespace.
\s*: Consume following whitespace (including line feeds; this allows the input to be separated on multiple lines, as long as a single entry is not spread on multiple lines. To get that, replace all the \h+ with \s+.).
The replace pattern will repeat (the /g modifier) sequentially in the source string as long as it matches, placing each heading-date record on its own line and then proceeding with the rest of the string.
Note that if your headers (dog etc.) might contain non-ASCII letters, use \pL or \p{XPosixAlpha} instead of [A-Za-z]:
$bday =~ s/\pL[\pL0-9_\[\]]*)\h+(\d+(?:\h+\d+)*)(?!\S)\s*/$1:$2\n/g;

Add a new line in a variable using perl

I am trying to add a new line in a variable after certain number of words. For example: If we have a variable:
$x = "This a variable, start a new line here, This is a new line.";
If I print the above variable
print $x;
I should get the below output:
This is a variable,
start a new line here,
This is a new line.
How can I achieve this in Perl from the variable itself?
I do not agree to the formula "after certain number of words".
Note that the first target line has 4 words, whereas remaining 2 have
5 words each.
Actually you need to replace each comma and following sequence of
spaces (if any) with a comma and \n.
So the intuitive way to do it is:
$x =~ s/,\s*/,\n/g;
The simplest way is to split the string on comma followed by a space and then
join the word groups with a comma followed by a newline.
my $x = "This a variable, start a new line here, This is a new line.";
print join(",\n", split /, /, $x) . "\n";
output
This a variable,
start a new line here,
This is a new line.
For solving the general, how do I reformat this string with line breaks after n-columns? problem, use the Text::Wrap library (as suggested by #ikegami):
use Text::Wrap;
my $x = "The quick brown fox jumped over the lazy dog.";
$Text::Wrap::columns = 15;
# wrap() needs an array of words
my #words = split /\s+/, $x;
# Initial tab, subsequent tab values set to '' (think indent amount)
print wrap('', '', #words) . "\n";
output
The quick
brown fox
jumped over
the lazy dog.
You probably want to use regular expressions. You can do this:
$x =~ s/^(\S+\s+){3}\K/\n/;
Or if this is about the commas and not the spaces:
$x =~ s/^([^,]+,+){2}\s*\K/\n/;
(in this case I also remove any potential space that would be after the comma)
You can also configure separately how many words or comma you want, by putting this in a variable:
my $nbwords = 7; # add a line after the 7th word
$x =~ s/^(\S+\s+){$nbwords}\K/\n/;
Now, that would keep the last space so you may want to do this:
my $nbwords = 7; # add a line after the 7th word
$nbwords--; # becomes 6 because there is another word after that we match as well
$x =~ s/^(\S+\s+){$nbwords}\S+\K\s+/\n/;
You should probably learn to use Regexps but just to explain the above:
\s is any space character (like space, tab, line feed, etc)
\S (uppercase) is any character except a space character
+ means any number of characters of that type described with what is before. So \s+ means any number of consecutive space characters.
{123} means 123 times that type of character ...
{3,80} means 3 to 80 times. So + is equivalent to {1,} (one to unlimited)
\K means that whatever is before will not be replaced, only what is after.

Replace returns with spaces and commas with semicolons?

I want to be able to be able to replace all of the line returns (\n's) in a single string (not an entire file, just one string in the program) with spaces and all commas in the same string with semicolons.
Here is my code:
$str =~ s/"\n"/" "/g;
$str =~ s/","/";"/g;
This will do it. You don't need to use quotations around them.
$str =~ s/\n/ /g;
$str =~ s/,/;/g;
Explanation of modifier options for the Substitution Operator (s///)
e Forces Perl to evaluate the replacement pattern as an expression.
g Replaces all occurrences of the pattern in the string.
i Ignores the case of characters in the string.
m Treats the string as multiple lines.
o Compiles the pattern only once.
s Treats the string as a single line.
x Lets you use extended regular expressions.
You don't need to quote in your search and replace, only to represent a space in your first example (or you could just do / / too).
$str =~ s/\n/" "/g;
$str =~ s/,/;/g;
I'd use tr:
$str =~ tr/\n,/ ;/;

What's happening in this Perl foreach loop?

I have this Perl code:
foreach (#tmp_cycledef)
{
chomp;
my ($cycle_code, $close_day, $first_date) = split(/\|/, $_,3);
$cycle_code =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
$close_day =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
$first_date =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
#print "$cycle_code, $close_day, $first_date\n";
$cycledef{$cycle_code} = [ $close_day, split(/-/,$first_date) ];
}
The value of tmp_cycledef comes from output of an SQL query:
select cycle_code,cycle_close_day,to_char(cycle_first_date,'YYYY-MM-DD')
from cycle_definition d
order by cycle_code;
What exactly is happening inside the for loop?
Huh, I'm surprised no one fixed it for you :)
It looks like the person who wrote this was trying to trim leading and trailing whitespace from each field. It's a really odd way to do that, and for some reason he was overly concerned with interior whitespace in each field despite his anchors.
I think that should be the same as trimming the whitespace around the delimiter in the split:
foreach (#tmp_cycledef)
{
s/^\s+//; s/$//; #leading and trailing whitespace on the whole string
my ($cycle_code, $close_day, $first_date) = split(/\s*\|\s*/, $_, 3);
$cycledef{$cycle_code} = [ $close_day, split(/-/,$first_date) ];
}
The key to thinking about split is considering which parts of the string you want to throw away, not just what separates the fields that you want.
For regex part, s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/ do stripping of leading and trailing whitespaces
Each row in #tmp_cycledef is composed of a string formatted following "cycle_code | close_day | first_date".
my ($cycle_code, $close_day, $first_date) = split(/\|/, $_,3);
Split the string into three parts. The following regular expressions are used to strip leading and trailing whitespaces.
The last instruction of the loop creates an entry in the dictionary $cycledef indexed by $cycle_code. The entry is formated is formatted using the following scheme:
[ $close_day, YYYY, MM, DD ]
where $first_date = "YYYY-MM-DD".
#tmp_cycledef: The output of the sql query is stored in this array
foreach (#tmp_cycledef) : For every element in this array.
chomp : remove the \n char from the end of every element.
my ($cycle_code, $close_day, $first_date) = split(/\|/, $_,3);
split the elements into 3 parts and assign the variable to each of the splited element. parts of split are "split(/PATTERN/,EXPR,LIMIT)"
$cycle_code =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
$close_day =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
$first_date =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
This regex part is sripping of leading and trailing whitespaces from each variable.
my god, it's been such a long time since I've read perl... but I'll give it a shot.
you grab a record from #tmp_cycledef, and chomp off the newline at the end, and split it up into the three variables: then, like S.Mark said, each substitution regex strips off the leading and trailing whitespace for each of the three variable. Finally, the values get pushed into a hash as a list, with some debugging code commented out right above it.
hth
Your query gives a set of rows that
are stored in the array
#tmp_cycledef.
We iterate over each row in the
result using: foreach
(#tmp_cycledef).
The result rows might have trailing
newline char, we get rid of them
using chomp.
Next we split the row (which is not
in $_) on the pipe and assign the
first 3 pieces to $cycle_code,
$close_day and $first_date
respectively.
The split pieces might have leading
and trailing white spaces, the next 3
lines are to remove the leading and
trailing white space in the 3
variables.
Finally we make an entry into the
hash %cycledef. The key use is
$cycle_code and the value is an
array whose first element is
$close_day and rest of the elements
are pieces got after splitting
$first_date on hyphen.

How can I insert text into a string in Perl?

If I had:
$foo= "12."bar bar bar"|three";
how would I insert in the text ".." after the text 12. in the variable?
Perl allows you to choose your own quote delimiters. If you find you need to use a double quote inside of an interpolating string (i.e. "") or single quote inside of a non-interpolating string (i.e. '') you can use a quote operator to specify a different character to act as the delimiter for the string. Delimiters come in two forms: bracketed and unbracketed. Bracketed delimiters have different beginning and ending characters: [], {}, (), [], and <>. All other characters* are available as unbracketed delimiters.
So your example could be written as
$foo = qq(12."bar bar bar"|three);
Inserting text after "12." can be done many ways (TIMTOWDI). A common solution is to use a substitution to match the text you want to replace.
$foo =~ s/^(12[.])/$1../;
the ^ means match at the start of the sting, the () means capture this text to the variable $1, the 12 just matches the string "12", and the [] mean match any one of the characters inside the brackets. The brackets are being used because . has special meaning in regexes in general, but not inside a character class (the []). Another option to the character class is to escape the special meaning of . with \, but many people find that to be uglier than the character class.
$foo =~ s/^(12\.)/$1../;
Another way to insert text into a string is to assign the value to a call to substr. This highlights one of Perl's fairly unique features: many of its functions can act as lvalues. That is they can be treated like variables.
substr($foo, 3, 0) = "..";
If you did not already know where "12." exists in the string you could use index to find where it starts, length to find out how long "12." is, and then use that information with substr.
Here is a fully functional Perl script that contains the code above.
#!/usr/bin/perl
use strict;
use warnings;
my $foo = my $bar = qq(12."bar bar bar"|three);
$foo =~ s/(12[.])/$1../;
my $i = index($bar, "12.") + length "12.";
substr($bar, $i, 0) = "..";
print "foo is $foo\nbar is $bar\n";
* all characters except whitespace characters (space, tab, carriage return, line feed, vertical tab, and formfeed) that is
If you want to use double quotes in a string in Perl you have two main options:
$foo = "12.\"bar bar bar\"|three";
or:
$foo = '12."bar bar bar"|three';
The first option escapes the quotes inside the string with backslash.
The second option uses single quotes. This means the double quotes are treated as part of the string. However, in single quotes everything is literal so $var or #array isn't treated as a variable. For example:
$myvar = 123;
$mystring = '"$myvar"';
print $mystring;
> "$myvar"
But:
$myvar = 123;
$mystring = "\"$myvar\"";
print $mystring;
> "123"
There are also a large number of other Quote-like Operators you could use instead.
$foo = "12.\"bar bar bar\"|three";
$foo =~s/12\./12\.\.\./;
print $foo; # results in 12...\"bar bar bar\"|three"