Splitting on whitespace character and strip empty fields - perl

($red, $tapinfo) = split(/:/, $line);
@fields = split(/\s+/, $tapinfo);
In the array @fields, I see that whitespace ends up as an element too. I want to eliminate it so that @fields contains only the non-space fields. What could be going wrong?

I assume you are talking about leading whitespace remaining, so that @fields looks something like:
$VAR1 = [
'', # empty field
'foo',
'bar'
];
This is because you are using /\s+/ for your split when you should be using the default ' ' (a single blank space character). This default behaviour will strip leading whitespace before splitting the string. In other words, you should do:
@fields = split(' ', $tapinfo);
This is documented in perldoc -f split:
As another special case, "split" emulates the default behavior
of the command line tool awk when the PATTERN is either omitted
or a *literal string* composed of a single space character (such
as ' ' or "\x20", but not e.g. "/ /"). In this case, any leading
whitespace in EXPR is removed before splitting occurs, and the
PATTERN is instead treated as if it were "/\s+/"; in particular,
this means that *any* contiguous whitespace (not just a single
space character) is used as a separator. However, this special
treatment can be avoided by specifying the pattern "/ /" instead
of the string " ", thereby allowing only a single space
character to be a separator.

What split does by default is the same as
my @list = $string =~ /\S+/g;
i.e. it finds all the contiguous substrings of non-whitespace characters.
You could use the regex, but to get the default behaviour from split, pass a single literal space character as the first parameter, not a regex. The documentation says this:
As another special case, split emulates the default behavior of the command line tool awk when the PATTERN is either omitted or a literal string composed of a single space character (such as ' ' or "\x20" , but not e.g. / / ). In this case, any leading whitespace in EXPR is removed before splitting occurs, and the PATTERN is instead treated as if it were /\s+/ ; in particular, this means that any contiguous whitespace (not just a single space character) is used as a separator.
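As a quick check of the equivalence claimed above, the following sketch runs both forms on the same text. The sample string and the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample text with leading, trailing, and mixed whitespace.
my $string = "  alpha \t beta\ngamma  ";

my @by_split = split(' ', $string);     # awk-style split
my @by_match = $string =~ /\S+/g;       # all runs of non-whitespace

print join(',', @by_split), "\n";
print join(',', @by_match), "\n";
```

Both lines print the same three fields, because the awk-style split skips leading whitespace and drops the trailing empty field.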

Related

split string with "."

I am trying to split a string with "." but getting nothing in the array. File name is "Head-First-Java-2nd-edition.pdf" After splitting I want to extract extension, but don't know why it is giving blank array.
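my @fileInfo = split(/./, $filename);
&logMsg("Array is: @fileInfo");
The split gives an empty list because you are splitting on the wildcard . (a period). An unescaped period is a metacharacter that matches any character, so every character of the file name is treated as a separator, every field is empty, and split drops trailing empty fields. To split on a literal period, you need to escape it:
my @fileInfo = split(/\./, $filename);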
Also, the syntax for calling a subroutine is NAME(LIST). Using the & prefix has a certain hidden feature, in that it circumvents prototypes. Read more in perldoc perlsub.
. in a regular expression means any character except \n. To split on a literal ., you need to escape it:
split /\./, $filename;
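To make the difference concrete, here is a small sketch using the file name from the question. Taking the last element as the extension is an assumption about what the asker wants, and the variable names are hypothetical.

```perl
use strict;
use warnings;

my $filename = "Head-First-Java-2nd-edition.pdf";

my @any_char = split(/./,  $filename);   # every character is a separator
my @literal  = split(/\./, $filename);   # only a literal period is a separator

# Assumption: the extension is whatever follows the last period.
my $extension = $literal[-1];

print "unescaped dot: ", scalar(@any_char), " fields\n";
print "escaped dot:   ", scalar(@literal),  " fields\n";
print "extension:     $extension\n";
```

Note that for a name without any period, $literal[-1] would be the whole name, so a real script would want to check the number of fields first.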

How to search for a string that contains no whitespace in perl

my $string3 = "anima ls";
my $t3 = $string3 =~ /[^\s]+/;
print "$t3\n";
I wanted to write a regex that matches only a string containing no whitespace. The above code reports a match even if the string contains a space.
The regex [^\s]+ searches for at least one character that is not whitespace. It is better written as \S+, though. A regex that matches any string that does not contain a whitespace character is rather
/^\S+$/
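The difference between the unanchored and the anchored pattern can be sketched as follows. The helper name has_no_whitespace is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if the whole string is free of whitespace.
sub has_no_whitespace {
    my ($text) = @_;
    return $text =~ /^\S+$/ ? 1 : 0;
}

for my $candidate ("anima ls", "animals") {
    # The unanchored pattern only needs one non-whitespace run somewhere.
    my $partial = $candidate =~ /\S+/ ? 1 : 0;
    printf "%-10s unanchored=%d anchored=%d\n",
        "'$candidate'", $partial, has_no_whitespace($candidate);
}
```

One caveat: $ also matches just before a trailing newline, so "animals\n" passes the anchored test as well. Using \z instead of $ is stricter if that matters.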

meaning of the following regular expressions written in perl

Here is a piece of code
while($l=~/(\\\s*)$/) {
statements;
}
$l contains a line of text taken from a file; in effect, this code goes through the lines of the file.
Questions:
I don't clearly understand what the condition in the while is doing. I think it tries to match a \ followed by any amount of whitespace at the end of the line, and that the loop should stop whenever a line ends with \ and possibly some whitespace. I am not sure of it.
I came across the statement $a =~ s/^(.*$)/$1/ . I understand that ^ forces matching at the beginning of the string, but (.*$) would mean match all the characters up to the end of the string. Does it mean that the statement is trying to find out whether some group of characters at the end is the same as a group of characters at the beginning of the text?
It is interesting to note that this statement:
while ( $l =~ /(\\\s*)$/ ) {
Is an infinite loop unless $l is altered inside the loop so that the regex no longer matches. As has already been mentioned by others, this is what it matches:
( ... ) a capture group, captures string to $1 (that's the number one, not lower case L)
\\ matches a literal backslash
\s* matches 0 or more whitespace characters.
$ matches end of line with optional newline.
Since you do not have the /g modifier, this regex will not iterate through matches, it will simply check if there is a match, resetting the regex each iteration, thereby causing an endless loop.
The statement
$a =~ s/^(.*$)/$1/
Looks rather pointless. It captures a string of characters up until end of string, then replaces it with itself. The captured text is stored in $1 and is simply replaced. The only marginally useful thing about this regex is that:
It matches up until newline \n, and nothing further, which may be of some use to a parser. A period . matches any character except newline, unless the /s modifier is present on the regex.
It captures the line in $1 for future use. However, a simple /^(.*$)/ would do the same.
1. the while
Usually while (regex) is used with the /g modifier, otherwise, if it matches, you get an infinite loop (unless you exit the loop, like using last).
statements would be executed continuously in an infinite loop.
In your case, adding the g
while($l=~/(\\\s*)$/g)
will make the while loop run only once, because of the $ anchor: whatever matches up to the end of the string can match only once, since nothing comes after it.
2. $a =~ s/^(.*$)/$1/
This is a substitution. If the pattern ^.*$ matches (and it will, since ^.*$ matches almost anything), the match is replaced with $1, i.e. what is inside the (), i.e. itself, since the match runs from the first character to the end of the string:
^ means beginning of string
(.*) means all chars
$ end of string
so that will replace $a with itself - probably not what you want.
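Both points can be checked with a short sketch. The sample strings are made up, the variable names other than $l are hypothetical, and the iteration guard is only there as a safety net for experimenting.

```perl
use strict;
use warnings;

# Hypothetical line ending in a backslash followed by two spaces.
my $l = "first part \\  ";

my $iterations = 0;
while ($l =~ /(\\\s*)$/g) {
    $iterations++;
    last if $iterations > 10;   # safety guard, not reached with /g
}
print "loop body ran $iterations time(s)\n";

# The substitution replaces the matched line with itself.
my $before = "unchanged line";
my $after  = $before;
$after =~ s/^(.*$)/$1/;
print $after eq $before ? "substitution changed nothing\n" : "substitution changed the text\n";
```

With /g in scalar context, the second attempt starts where the first match ended, finds no further backslash, and the loop ends after a single pass.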
it matches a literal backslash followed by 0 or more spaces followed by the end of the line.
it executes statements for all the lines in that text file that contain a \, followed by zero or more spaces ( \s* ), at the end of the line ($).
It matches lines that end with a backslash character, ignoring any trailing whitespace characters.
Ending a line with a backslash is used in some languages and data files to indicate that the line is being continued on the next line. So I suspect this is part of a parser that merges these continuation lines.
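If the code really is part of such a parser, the merging step might look roughly like the sketch below. This is a guess at the intent, not the original author's code: the input lines, the variable names, and the decision to simply concatenate the pieces are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical input: a line ending in a backslash continues on the next line.
my @raw_lines = (
    "CFLAGS = -O2 \\  ",
    "         -Wall",
    "TARGET = demo",
);

my @merged;
my $pending = '';
for my $line (@raw_lines) {
    if ($line =~ /\\\s*$/) {
        # Strip the backslash and trailing whitespace from a copy, then keep accumulating.
        (my $head = $line) =~ s/\\\s*$//;
        $pending .= $head;
    }
    else {
        # No continuation marker: the logical line is complete.
        push @merged, $pending . $line;
        $pending = '';
    }
}
# Keep whatever is left if the input ended on a continuation line.
push @merged, $pending if length $pending;

print "$_\n" for @merged;
```

The test uses the same pattern as the question, minus the capture group, since the matched text itself is not needed here.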
If you enter a regular expression at RegExr and hover your mouse over the pieces, it displays the meaning of each piece in a tooltip.
(\\\s*)$ this regex means --- a \ followed by zero or more number of white space characters which is followed by end of the line. Since you have your regex in (...), you can extract what you matched using $1, if you need.
http://rubular.com/r/dtHtEPh5DX
EDIT -- based on your update
$a =~ s/^(.$)/$1/ --- this is a search and replace. The regex matches a line that contains exactly one character (the dot, see http://www.regular-expressions.info/dot.html, matches any character except a newline). Because of the (...), the matched character is captured in $1, and the substitution then replaces the match with that same character, leaving $a unchanged.
EDIT -- you changed your regex so here is the updated answer
$a =~ s/^(.*$)/$1/ -- same as above except now it matches zero or more characters (except new-line)

Split by dot using Perl

I use the split function by two ways. First way (string argument to split):
my $string = "chr1.txt";
my @array1 = split(".", $string);
print $array1[0];
I get this error:
Use of uninitialized value in print
When I do split by the second way (regular expression argument to split), I don't get any errors.
my @array1 = split(/\./, $string); print $array1[0];
My first way of splitting is not working only for dot.
What is the reason behind this?
"\." is just ., careful with escape sequences.
If you want a backslash and a dot in a double-quoted string, you need "\\.". Or use single quotes: '\.'
If you just want to parse files and get their suffixes, better use the fileparse() method from File::Basename.
Additional details to the information provided by Mat:
In split "\.", ... the first parameter to split is first interpreted as a double-quoted string before being passed to the regex engine. As Mat said, inside a double-quoted string, a \ is the escape character, meaning "take the next character literally", e.g. for things like putting double quotes inside a double-quoted string: "\""
So your split gets passed "." as the pattern. A single dot means "split on any character". As you know, the split pattern itself is not part of the results. So you have several empty strings as the result.
But why is the first element undefined instead of empty? The answer lies in the documentation for split: if you don't impose a limit on the number of elements returned by split (its third argument) then it will silently remove empty results from the end of the list. As all items are empty the list is empty, hence the first element doesn't exist and is undefined.
You can see the difference with this particular snippet:
my @p1 = split "\.", "thing";
my @p2 = split "\.", "thing", -1;
print scalar(@p1), ' ', scalar(@p2), "\n";
It outputs 0 6.
The "proper" way to deal with this, however, is what @soulSurfer2010 said in his post.
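The quoting differences and the fileparse() suggestion from above can be tried out together. This is a sketch under assumptions: the suffix pattern qr/\.[^.]*/ is one common choice rather than the only one, and the variable names are hypothetical.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

my $string = "chr1.txt";

my @double_quoted = split("\.", $string);   # pattern becomes ".", i.e. any character
my @single_quoted = split('\.', $string);   # pattern stays "\.", a literal dot
my @regex_literal = split(/\./, $string);   # same as the single-quoted form

print scalar(@double_quoted), ' ', scalar(@single_quoted), ' ', scalar(@regex_literal), "\n";

# fileparse returns (name, directory, suffix); the suffix pattern is our choice.
my ($name, $dir, $suffix) = fileparse($string, qr/\.[^.]*/);
print "name=$name suffix=$suffix\n";
```

Unlike the split approach, fileparse() keeps the leading period in the suffix and also copes with directory components in the path.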

How does Perl split work with string exactly?

Quoted from perldoc -f split:
As a special case, specifying a PATTERN of space (' ' ) will split on
white space just as split with no arguments does. Thus, split(' ') can
be used to emulate awk's default behavior, whereas split(/ /) will
give you as many initial null fields (empty string) as there are
leading spaces.
The above is all that's mentioned about how split deals with a string delimiter, but what is the general case? Are empty leading fields always deleted for string delimiters?
No, only when the delimiter is a string that is a single space. In any other case, the delimiter is interpreted as a regex pattern.
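A few string delimiters side by side illustrate the answer. The inputs and array names are invented for this sketch; the '|' case is included because it shows most clearly that a string delimiter is compiled as a regex.

```perl
use strict;
use warnings;

my @space_string = split(' ',  "  a b");   # special case: leading blanks are skipped
my @space_regex  = split(/ /,  "  a b");   # two leading empty fields are kept
my @comma_string = split(',',  ",a,b");    # leading empty field is kept
my @pipe_string  = split('|',  "a|b");     # '|' is regex alternation, matches the empty string
my @pipe_escaped = split(/\|/, "a|b");     # escaped: splits on the literal pipe

for my $result (\@space_string, \@space_regex, \@comma_string, \@pipe_string, \@pipe_escaped) {
    print scalar(@$result), ": [", join('][', @$result), "]\n";
}
```

Only the single-space string drops the leading empty fields. The unescaped '|' splits the input into individual characters, which is a common surprise when a delimiter happens to be a regex metacharacter.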