What does the Perl split function return when there is no value between tokens? - perl

I'm trying to split a string using the split function but there isn't always a value between tokens.
Ex: ABC,123,,,,,,XYZ
I don't want to skip the multiple tokens though. These values are in specific positions in the string. However, when I do a split, and then try to step through my resulting array, I get "Use of uninitialized value" warnings.
I've tried comparing the value using $splitvalues[x] eq "" and I've tried using defined($splitvalues[x]) , but I can't for the life of me figure out how to identify what the split function is putting in to my array when there is no value between tokens.
Here's the snippet of my code (now with more crunchy goodness):
my #matrixDetail = ();
#some other processing happens here that is based on matching data from the
##oldDetail array with the first field of the #matrixLine array. If it does
#match, then I do the split
if($IHaveAMatch)
{
#matrixDetail = split(',', $matrixLine[1]);
}
else
{
#matrixDetail = ('','','','','','','');
}
my $newDetailString =
(($matrixDetail[0] eq '') ? $oldDetail[0] : $matrixDetail[0])
. (($matrixDetail[1] eq '') ? $oldDetail[1] : $matrixDetail[1])
.
.
.
. (($matrixDetail[6] eq '') ? $oldDetail[6] : $matrixDetail[6]);
because this is just snippets, I've left some of the other logic out, but the if statement is inside a sub that technically returns the #matrixDetail array back. If I don't find a match in my matrix and set the array equal to the array of empty strings manually, then I get no warnings. It's only when the split populates the #matrixDetail.
Also, I should mention, I've been writing code for nearly 15 years, but only very recently have I needed to work with Perl. The logic in my script is sound (or at least, it works), I'm just being anal about cleaning up my warnings and trying to figure out this little nuance.

#!perl
use warnings;
use strict;
use Data::Dumper;
my $str = "ABC,123,,,,,,XYZ";
my #elems = split ',', $str;
print Dumper \#elems;
This gives:
$VAR1 = [
'ABC',
'123',
'',
'',
'',
'',
'',
'XYZ'
];
It puts in an empty string.
Edit: Note that the documentation for split() states that "by default, empty leading fields are preserved, and empty trailing ones are deleted." Thus, if your string is ABC,123,,,,,,XYZ,,,, then your returned list will be the same as the above example, but if your string is ,,,,ABC,123, then you will have a list with three empty strings in elements 0, 1, and 2 (in addition to 'ABC' and '123').
Edit 2: Try dumping out the #matrixDetail and #oldDetail arrays. It's likely that one of those isn't the length that you think it is. You might also consider checking the number of elements in those two lists before trying to use them to make sure you have as many elements as you're expecting.

I suggest to use Text::CSV from CPAN. It is a ready made solution which already covers all the weird edge cases of parsing CSV formatted files.

delims with nothing between them give empty strings when split. Empty strings evaluate as false in boolean context.
If you know that your "details" input will never contain "0" (or other scalar that evaluates to false), this should work:
my #matrixDetail = split(',', $matrixLine[1]);
die if #matrixDetail > #oldDetail;
my $newDetailString = "";
for my $i (0..$#oldDetail) {
$newDetailString .= $matrixDetail[$i] || $oldDetail[$i]; # thanks canSpice
}
say $newDetailString;
(there are probably other scalars besides empty string and zero that evaluate to false but I couldn't name them off the top of my head.)
TMTOWTDI:
$matrixDetail[$_] ||= $oldDetail[$_] for 0..$#oldDetail;
my $newDetailString = join("", #matrixDetail);
edit: for loops now go from 0 to $#oldDetail instead of $#matrixDetail since trailing ",,," are not returned by split.
edit2: if you can't be sure that real input won't evaluate as false, you could always just test the length of your split elements. This is safer, definitely, though perhaps less elegant ^_^

Empty fields in the middle will be ''. Empty fields on the end will be omitted, unless you specify a third parameter to split large enough (or -1 for all).

Related

why perl doesn't recognize i want to create an arrayRef with "split"?

I want to split a scalar by whitespaces and save the result in an ArrayReference.
use strict;
use warnings;
use Data::Dumper;
my $name = 'hans georg mustermann';
my $array = split ' ', $name;
print Dumper($array); #$VAR1 = 3;
So it seems $array is now a scalar with the size resulted by the split operation.
When i change the code to my $array = [split ' ', $name]; the variable $array is now a ArrayReference and contains all 3 strings.
I just don't understand this behavior. Would be really great if someone could explain it to me or post a good documentation about these things, as i don't know how to search for this topic.
Thank you in advance
What you see here is called "context". The documentation about this is rather scattered. You also want to take a look at this tutorial about "scalar vs list context" https://perlmaven.com/scalar-and-list-context-in-perl
If you assign the result of split (or any subroutine calls) to an array, it's list context:
my #arr = split ' ', $name;
#=> #arr = ('hans', 'georg', 'mustermann');
What your example code shows is assigning them to a scalar -- and therefore it's under "scalar context".
Since, naturally, multiple things cannot fit into one position, some sort of summarization needs to be done. In the case of split function, perl5 has defined that the number of elements in the result of split shall be the best.
Check the documentation of the split function: https://perldoc.pl/functions/split -- which actually defines the behaviour under scalar context as well as list context.
Also take a glance at the documentation of all builtin functions at https://perldoc.pl/functions -- you'll find the behaviour definition under "list context" and "scalar context" for most of them -- although many of them are not returning "the size of lists" but rather something else.
That's called context.
The partial expression split ' ', $name evaluates to a list. The partial expression $array = LIST coerces the list to a scalar value, namely counting the number of elements in the list. That's the default behaviour of lists in scalar context.
You should write #array = LIST instead, using an array variable, not a scalar variable, in order to preserve the list values.
If you read the documentation for split(), you'll find the bit that explains what the function returns.
Splits the string EXPR into a list of strings and returns the list in list context, or the size of the list in scalar context.
You're calling the function in scalar context (because you're assigning the result of the call to a scalar variable) so you're getting the size of the list.
If you want to get the list, then you need to store it either in a list of variables:
my ($forename, $middlename, $surname) = split ' ', $name;
Or (more usually) in an array:
my #name_parts = split ' ', $name;
But actually, you say that you want an array reference. You can do that by calling split() inside an anonymous array constructor ([ ... ]) and assigning the result of that call to a scalar variable.
my $name_parts = [ split ' ', $name ];

Perl: If two elements match print elements, else iterate until match and then print

I'm new to Perl and I'm trying to iterate over two elements of an array with multiple indices in each element and look for a match. If element2 matches element1, I want to print both and move to the next position in element1 and continue the loop looking for the next match. If I don't have a match, loop until I get a match. Here is what I have:
#array = split(',',$row);
foreach $element1(#array[1])
{
foreach $element2(#array[2])
{
if($element1 == $element2)
{
print "1 = $element1 : 2 = $element2 \n";
}
}
}
I'm not getting the the matched output. I've tried multiple iterations with different syntactical changes.
I can get both elements when I do this:
foreach $element1(#array[1])
{
foreach $element2(#array[2])
{
print "1 = $element1 : 2 = $element2 \n";
}
}
I thought I might not be dereferencing correctly. Any guidance or suggestions would be appreciated. Thanks.
There are a number of issues with your script. Briefly:
You should always use strict and warnings.
Array indices start at 0, not 1.
You get an element of an array with $array[0], not #array[0]. This is a common frustration for new Perl programmers. The thing to remember is that the sigil (the symbol preceding a variable name) indicates the type of value being passed (e.g. $scalar, #array, or %hash) to the left-hand side of the expression, not the type of datastructure being accessed on the right-hand side.
As #sp-asic pointed out in the comments on the OP, string comparisons are performed with eq, not ==.
References to datastructures are stored in scalars, and you dereference by prepending the sigil of the original datastructure. If $foo is a reference to an array, #$foo gets you the original array.
You apparently want to break out of your inner loops when you find a match, but you'll want to make it clear (for people who look at this code in the future, which may include yourself) which loop you're breaking out of.
Most critically, #array will be an array of strings after you split another string (the row) on commas, so it's not clear why you expect to be able to treat the strings in the first and second position as arrays that you can loop through. I have a few guesses about what you're actually trying to do, and what your inputs and expected outputs actually look like, but I'll wait for you to provide some additional information and leave the information above as general guidance in the meantime, along with a lightly-reworked version of your code below.
use strict;
use warnings;
my #array = split(',', $row);
foreach my $element1 (#$array[0]) {
foreach my $element2 (#$array[1]) {
if ($element1 eq $element2) {
print "1 = $element1 : 2 = $element2\n";
last;
}
}
}

Retain quotes on CSV fields that were quoted in the input

I have a CSV file such that a few of the fields are quoted regardless of whether they need to be. What I wish to do is load this file, modify a few of the values, and produce the modified CSV with the quoted fields intact.
I'm currently using Perl's Text::CSV package to attempt to solve this problem, but have ran into a bit of a roadblock. The following is a small test script to demonstrate the problem:
use Text::CSV;
my $csv = Text::CSV->new ({'binary' => 1, 'allow_loose_quotes' => 1, 'keep_meta_info' => 1});
my $line = q^hello,"world"^;
print qq^input: $line\n^;
$csv->parse($line);
my #flds = $csv->fields();
$csv->combine(#flds);
print 'output: ', $csv->string(), "\n";
produces:
input: hello,"world"
output: hello,world
According to Text::CSV's documentation, an is_quoted() function exists to test if a field had been quoted in the input, but if I use this to add surrounding quotes to a field, I get unexpected results:
my $csv = Text::CSV->new ({'binary' => 1, 'allow_loose_quotes' => 1, 'keep_meta_info' => 1});
my $line = q^hello,"world"^;
print qq^input: $line\n^;
$csv->parse($line);
my #flds = $csv->fields();
for my $idx (0..$#flds) {
if ($csv->is_quoted($idx)) {
$flds[$idx] = qq^"$flds[$idx]"^;
}
}
$csv->combine(#flds);
print 'output: ', $csv->string(), "\n";
Producing:
input: hello,"world"
output: hello,"""world"""
where I believe the quotes I've added before the combine() are being seen as part of the field, and so are being escaped with a second double quote as combine() is processing.
What would be the best way to ensure quoted fields are left intact from input to output? I'm not certain the application will accept always_quote'ed fields... Is there some combination of Text::CSV object attributes that will allow for keeping quotes intact? Or perhaps am I left with adjusting the record post-combine?
It's a shame but it appears that while keep_meta_info gives you access to the metadata there's no option to tell Text::CSV to reapply the is_quoted state on output.
Depending on how complex your record is you could just reassemble it yourself. But then you'd have to cope with changes to string fields that were previously safely unquoted but after your processing now require quotes. That will depend on the types of changes you introduce, i.e. whether or not you ever expect that a previously "safe" string value will become unsafe. If the answer is "never" (i.e. 0.00000% chance), then you should just do the reassembly yourself and document what you've done.
Post-processing would require that you CSV-parse the string to handle the possibility of commas and other unsafe characters inside strings, so that may not be an option.
Or, you could dive into the code for Text::CSV and implement the desired functionality. I.e. allow the user to force quoting of a specific field on output. I played around with it, and it looks like part of the required mechanism might be in place but unfortunately all I have access to is the XS version, which delegates to native code, so I can't delve deeper at this time. This is as far as I got:
Original combine method. Note the setting of _FFLAGS to undef.
sub combine
{
my $self = shift;
my $str = "";
$self->{_FIELDS} = \#_;
$self->{_FFLAGS} = undef;
$self->{_STATUS} = (#_ > 0) && $self->Combine (\$str, \#_, 0);
$self->{_STRING} = \$str;
$self->{_STATUS};
} # combine
My attempt. I guessed that the second argument to Combine might be the flags, but since the (lowercase) combine API is based on receiving an array and not an arrayref, there's no way to pass two arrays in. I changed it to expect two arrayrefs and tried passing the second to Combine but that failed with "Can't call method "print" on unblessed reference".
sub combine2
{
my $self = shift;
my $str = "";
my $f = shift;
my $g = shift;
$self->{_FIELDS} = $f;
$self->{_FFLAGS} = $g;
$self->{_STATUS} = (#$f > 0) && $self->Combine (\$str, $f, $g);
$self->{_STRING} = \$str;
$self->{_STATUS};
} # combine

PERL -- Regex incl all hash keys (sorted) + deleting empty fields from $_ in file read

I'm working on a program and I have a couple of questions, hope you can help:
First I need to access a file and retrieve specific information according to an index that is obtained from a previous step, in which the indexes to retrieve are found and store in a hash.
I've been looking for a way to include all array elements in a regex that I can use in the file search, but I havenĀ“t been able to make it work. Eventually i've found a way that works:
my #atoms = ();
my $natoms=0;
foreach my $atomi (keys %{$atome}){
push (#atoms,$atomi);
$natoms++;
}
#atoms = sort {$b cmp $a} #atoms;
and then I use it as a regex this way:
while (<IN_LIG>){
if (!$natoms) {last;}
......
if ($_ =~ m/^\s*$atoms[$natoms-1]\s+/){
$natoms--;
.....
}
Is there any way to create a regex expression that would include all hash keys? They are numeric and must be sorted. The keys refer to the line index in IN_LIG, whose content is something like this:
8 C5 9.9153 2.3814 -8.6988 C.ar 1 MLK -0.1500
The key is to be found in column 0 (8). I have added ^ and \s+ to make sure it refers only to the first column.
My second problem is that sometimes input files are not always identical and they make contain white spaces before the index, so when I create an array from $_ I get column0 = " " instead of column0=8
I don't understand why this "empty column" is not eliminated on the split command and I'm having some trouble to remove it. This is what I have done:
#info = split (/[\s]+/,$_);
if ($info[0] eq " ") {splice (#info, 0,1);} # also tried $info[0] =~ m/\s+/
and when I print the array #info I get this:
Array:
Array: 8
Array: C5
Array: 9.9153
Array: 2.3814
.....
How can I get rid of the empty column?
Many thanks for your help
Merche
There is a special form of split where it will remove both leading and trailing spaces. It looks like this, try it:
my $line = ' begins with spaces and ends with spaces ';
my #tokens = split ' ', $line;
# This prints |begins:with:spaces:and:ends:with:spaces|
print "|", join(':', #tokens), "|\n";
See the documentation for split at http://p3rl.org/split (or with perldoc split)
Also, the first part of your program might be simpler as:
my #atoms = sort {$b cmp $a} keys %$atome;
my $natoms = #atoms;
But, what is your ultimate goal with the atoms? If you simply want to verify that the atoms you're given are indeed in the file, then you don't need to sort them, nor to count them:
my #atoms = keys %$atome;
while (<IN_LIG>){
# The atom ID on this line
my ($atom_id) = split ' ';
# Is this atom ID in the array of atom IDs that we are looking for
if (grep { /$atom_id/ } #atoms) {
# This line of the file has an atom that was in the array: $atom_id
}
}
Lets warm up by refining and correcting some of your code:
# If these are all numbers, do a numerical sort: <=> not cmp
my #atoms = ( sort { $b <=> $a } keys %{$atome} );
my $natoms = scalar #atoms;
No need to loop through the keys, you can insert them into the array right away. You can also sort them right away, and if they are numbers, the sort must be numerical, otherwise you will get a sort like: 1, 11, 111, 2, 22, 222, ...
$natoms can be assigned directly by the count of values in #atoms.
while(<IN_LIG>) {
last unless $natoms;
my $key = (split)[0]; # split splits on whitespace and $_ by default
$natoms-- if ($key == $atoms[$natoms - 1]);
}
I'm not quite sure what you are doing here, and if it is the best way, but this code should work, whereas your regex would not. Inside a regex, [] are meta characters. Split by default splits $_ on whitespace, so you need not be explicit about that. This split will also definitely remove all whitespace. Your empty field is most likely an empty string, '', and not a space ' '.
The best way to compare two numbers is not by a regex, but with the equality operator ==.
Your empty field should be gone by splitting on whitespace. The default for split is split ' '.
Also, if you are not already doing it, you should use:
use strict;
use warnings;
It will save you a lot of headaches.
for your second question you could use this line:
#info = $_ =~ m{^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)}xms;
in order to capture 9 items from each line (assuming they do not contain whitespace).
The first question I do not understand.
Update: I would read alle the lines of the file and use them in a hash with $info[0] as the key and [#info[1..8]] as the value. Then you can lookup the entries by your index.
my %details;
while (<IN_LIG>) {
#info = $_ =~ m{^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)}xms;
$details{ $info[0] } = [ #info[1..$#info] ];
}
Later you can lookup details for the indices you are interested in and process as needed. This assumes the index is unique (has the property of keys).
thanks for all your replies. I tried the split form with ' ' and it saved me several lines of code. thanks!
As for the regex, I found something that could make all keys as part of the string expression with join and quotemeta, but I couldn't make it work. Nevertheless I found an alternative that works, but I liked the join/quotemeta solution better
The atom indexes are obtained from a text file according to some energy threshold. Later, in the IN_LIG loop, I need to access the molecule file to obtain more information about the atoms selected, thus I use the atom "index" in the molecule to identify which lines of the file I have to read and process. This is a subroutine to which I send a hash with the atom index and some other information.
I tried this for the regex:
my $strings = join "|" map quotemeta,
sort { $hash->{$b} <=> $hash->{$a}} keys %($hash);
but I did something wrong cos it wouldn't take all keys

Indexing directly into returned array in Perl

This question has been asked about PHP both here and here, and I have the same question for Perl. Given a function that returns a list, is there any way (or what is the best way) to immediately index into it without using a temporary variable?
For example:
my $comma_separated = "a,b,c";
my $a = split (/,/, $comma_separated)[0]; #not valid syntax
I see why the syntax in the second line is invalid, so I'm wondering if there's a way to get the same effect without first assigning the return value to a list and indexing from that.
Just use parentheses to define your list and then index it to pull your desired element(s):
my $a = (split /,/, $comma_separated)[0];
Just like you can do this:
($a, $b, $c) = #array;
You can do this:
my($a) = split /,/, $comma_separated;
my $a on the LHS (left hand side) is treated as scalar context. my($a) is list context. Its a single element list so it gets just the first element returned from split.
It has the added benefit of auto-limiting the split, so there's no wasted work if $comma_separated is large.