Reading CSV with Perl produces distorted lines - perl

I am reading a CSV file using Perl 5.26.1 with lines that look like this:
B1_10,202337840166,R08C02,202337840166_R08C02.gtc
I'm reading this data into a hash that has the last element as a key, and the first as a value.
I read the file line by line (snippet only):
while (<$csv>) {
if (/^Sample/) { next }
say "-----start----\noriginal = $_";
chomp;
my #line = split /,/;
my $name = $line[0];
my $vcf = $line[3];
say "1st element = $name";
say "4th element = $vcf";
$vcf2dir{$vcf} = $name;
say "\$vcf2dir{$vcf} = '$name'";
say '-----end------';
}
which produces the following output:
-----start----
original = B1_10,202337840166,R08C02,202337840166_R08C02.gtc
1st element = B1_10
4th element = 202337840166_R08C02.gtc
} = 'B1_10'2337840166_R08C02.gtc
-----end-------
but it should look like
-----start----
original = B1_10,202337840166,R08C02,202337840166_R08C02.gtc
1st element = B1_10
4th element = 202337840166_R08C02.gtc
$vcf2dir{202337840166_R08C02.gtc} = 'B1_10'
-----end-------
and it shows strangely with the data printer package:
use DDP;
p %vcf2dir;
produces
{
' "B1_10"840166_R08C02.gtc
}
in other words, the last string is being cut up for some reason.
I have tried removing non-ascii characters with $_ =~ s/[[:^ascii:]]//g; but this still produces the same error.
I have no idea why Perl is ripping these strings apart :(

while (<$csv>) {
...
chomp;
My guess is that the input file has as line end \r\n (windows style) while you are executing the code in a UNIX like environment (Linux, Mac...) where the line end is \n. This means that $INPUT_RECORD_SEPARATOR is also \n and that chomp only removes the \n and leaves the \r. This left \r causes such strange output.
To fix this either fix the line endings in your input file, set $INPUT_RECORD_SEPARATOR to the expected separator or just do s{\r?\n\z}{} instead of chomp to handle both \r\n and \n line endings.

I ran your snippet against your line and it worked as expected
But I have had behavior like what you show because a spurious Control-M's in my data.
Try filtering for control-M's
after your chomp replace all control-M's with the command below
s/\cM//g;

Related

Perl: Find a match, remove the same lines, and to get the last field

Being a Perl newbie, please pardon me for asking this basic question.
I have a text file #server1 that shows a bunch of sentences (white space is the field separator) on many lines in the file.
I needed to match lines with my keyword, remove the same lines, and extract only the last field, so I have tried with:
my #allmatchedlines;
open(output1, "ssh user1#server1 cat /tmp/myfile.txt |");
while(<output1>) {
chomp;
#allmatchedlines = $_ if /mysearch/;
}
close(output1);
my #uniqmatchedline = split(/ /, #allmatchedlines);
my $lastfield = $uniqmatchedline[-1]\n";
print "$lastfield\n";
and it gives me the output showing:
1
I don't know why it's giving me just "1".
Could someone please explain why I'm getting "1" and how I can get the last field of the matched line correctly?
Thank you!
my #uniqmatchedline = split(/ /, #allmatchedlines);
You're getting "1" because split takes a scalar, not an array. An array in scalar context returns the number of elements.
You need to split on each individual line. Something like this:
my #uniqmatchedline = map { split(/ /, $_) } #allmatchedlines;
There are two issues with your code:
split is expecting a scalar value (string) to split on; if you are passing an array, it will convert the array to scalar (which is just the array length)
You did not have a way to remove same lines
To address these, the following code should work (not tested as no data):
my #allmatchedlines;
open(output1, "ssh user1#server1 cat /tmp/myfile.txt |");
while(<output1>) {
chomp;
#allmatchedlines = $_ if /mysearch/;
}
close(output1);
my %existing;
my #uniqmatchedline = grep !$existing{$_}++, #allmatchedlines; #this will return the unique lines
my #lastfields = map { ((split / /, $_)[-1]) . "\n" } #uniqmatchedline ; #this maps the last field in each line into an array
print for #lastfields;
Apart from two errors in the code, I find the statement "remove the same lines and extract only the last field" unclear. Once duplicate matching lines are removed, there may still be multiple distinct sentences with the pattern.
Until a clarification comes, here is code that picks the last field from the last such sentence.
use warnings 'all';
use strict;
use List::MoreUtils qw(uniq)
my $file = '/tmp/myfile.txt';
my $cmd = "ssh user1\#server1 cat $file";
open my $fh, '-|', $cmd // die "Error opening $cmd: $!"; # /
while (<$fh>) {
chomp;
push #allmatchedlines, $_ if /mysearch/;
}
close(output1);
my #unique_matched_lines = uniq #allmatchedlines;
my $lastfield = ( split ' ', $unique_matched_lines[-1] )[-1];
print $lastfield, "\n";
I changed to the three-argument open, with error checking. Recall that open for a process involves a fork and returns pid, so an "error" doesn't at all relate to what happened with the command itself. See open. (The # / merely turns off wrong syntax highlighting.) Also note that # under "..." indicates an array and thus need be escaped.
The (default) pattern ' ' used in split splits on any amount of whitespace. The regex / / turns off this behavior and splits on a single space. You most likely want to use ' '.
For more comments please see the original post below.
The statement #allmatchedlines = $_ if /mysearch/; on every iteration assigns to the array, overwriting whatever has been in it. So you end up with only the last line that matched mysearch. You want push #allmatchedlines, $_ ... to get all those lines.
Also, as shown in the answer by Justin Schell, split needs a scalar so it is taking the length of #allmatchedlines – which is 1 as explained above. You should have
my #words_in_matched_lines = map { split } #allmatchedlines;
When all this is straightened out, you'll have words in the array #uniqmatchedline and if that is the intention then its name is misleading.
To get unique elements of the array you can use the module List::MoreUtils
use List::MoreUtils qw(uniq);
my #unique_elems = uniq #whole_array;

Regular expression to print a string from a command outpout

I have written a function that uses regex and prints the required string from a command output.
The script works as expected. But it's does not support a dynamic output. currently, I use regex for "icmp" and "ok" and print the values. Now, type , destination and return code could change. There is a high chance that command doesn't return an output at all. How do I handle such scenarios ?
sub check_summary{
my ($self) = #_;
my $type = 0;
my $return_type = 0;
my $ipsla = $self->{'ssh_obj'}->exec('show ip sla');
foreach my $line( $ipsla) {
if ( $line =~ m/(icmp)/ ) {
$type = $1;
}
if ( $line =~ m/(OK)/ ) {
$return_type = $1;
}
}
INFO ($type,$return_type);
}
command Ouptut :
PSLAs Latest Operation Summary
Codes: * active, ^ inactive, ~ pending
ID Type Destination Stats Return Last
(ms) Code Run
-----------------------------------------------------------------------
*1 icmp 192.168.25.14 RTT=1 OK 1 second ago
Updated to some clarifications -- we need only the last line
As if often the case, you don't need a regex to parse the output as shown. You have space-separated fields and can just split the line and pick the elements you need.
We are told that the line of interest is the last line of the command output. Then we don't need the loop but can take the last element of the array with lines. It is still unclear how $ipsla contains the output -- as a multi-line string or perhaps as an arrayref. Since it is output of a command I'll treat it as a multi-line string, akin to what qx returns. Then, instead of the foreach loop
my #lines = split '\n', $ipsla; # if $ipsla is a multi-line string
# my #lines = #$ipsla; # if $ipsla is an arrayref
pop #lines while $line[-1] !~ /\S/; # remove possible empty lines at end
my ($type, $return_type) = (split ' ', $lines[-1])[1,4];
Here are some comments on the code. Let me know if more is needed.
We can see in the shown output that the fields up to what we need have no spaces. So we can split the last line on white space, by split ' ', $lines[-1], and take the 2nd and 5th element (indices 1 and 4), by ( ... )[1,4]. These are our two needed values and we assign them.
Just in case the output ends with empty lines we first remove them, by doing pop #lines as long as the last line has no non-space characters, while $lines[-1] !~ /\S/. That is the same as
while ( $lines[-1] !~ /\S/ ) { pop #lines }
Original version, edited for clarifications. It is also a valid way to do what is needed.
I assume that data starts after the line with only dashes. Set a flag once that line is reached, process the line(s) if the flag is set. Given the rest of your code, the loop
my $data_start;
foreach (#lines)
{
if (not $data_start) {
$data_start = 1 if /^\s* -+ \s*$/x; # only dashes and optional spaces
}
else {
my ($type, $return_type) = (split)[1,4];
print "type: $type, return code: $return_type\n";
}
}
This is a sketch until clarifications come. It also assumes that there are more lines than one.
I'm not sure of all possibilities of output from that command so my regular expression may need tweaking.
I assume the goal is to get the values of all columns in variables. I opted to store values in a hash using the column names as the hash keys. I printed the results for debugging / demonstration purposes.
use strict;
use warnings;
sub check_summary {
my ($self) = #_;
my %results = map { ($_,undef) } qw(Code ID Type Destination Stats Return_Code Last_Run); # Put results in hash, use column names for keys, set values to undef.
my $ipsla = $self->{ssh_obj}->exec('show ip sla');
foreach my $line (#$ipsla) {
chomp $line; # Remove newlines from last field
if($line =~ /^([*^~])([0-9]+)\s+([a-z]+)\s+([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)\s+([[:alnum:]=]+)\s+([A-Z]+)\s+([^\s].*)$/) {
$results{Code} = $1; # Code prefixing ID
$results{ID} = $2;
$results{Type} = $3;
$results{Destination} = $4;
$results{Stats} = $5;
$results{Return_Code} = $6;
$results{Last_Run} = $7;
}
}
# Testing
use Data::Dumper;
print Dumper(\%results);
}
# Demonstrate
check_summary();
# Commented for testing
#INFO ($type,$return_type);
Worked on the submitted test line.
EDIT:
Regular expressions allow you to specify patterns instead of the exact text you are attempting to match. This is powerful but complicated at times. You need to read the Perl Regular Expression documentation to really learn them.
Perl regular expressions also allow you to capture the matched text. This can be done multiple times in a single pattern which is how we were able to capture all the columns with one expression. The matches go into numbered variables...
$1
$2

Perl chomp doesn't remove the newline

I want to read a string from a the first line in a file, then repeat it n repetitions in the console, where n is specified as the second line in the file.
Simple I think?
#!/usr/bin/perl
open(INPUT, "input.txt");
chomp($text = <INPUT>);
chomp($repetitions = <INPUT>);
print $text x $repetitions;
Where input.txt is as follows
Hello
3
I expected the output to be
HelloHelloHello
But words are new line separated despite that chomp is used.
Hello
Hello
Hello
You may try it on the following Perl fiddle CompileOnline
The strange thing is that if the code is as follows:
#!/usr/bin/perl
open(INPUT, "input.txt");
chomp($text = <INPUT>);
print $text x 3;
It will work fine and displays
HelloHelloHello
Am I misunderstanding something, or is it a problem with the online compiler?
You have issues with line endings; chomp removes trailing char/string of $/ from $text and that can vary depending on platform. You can however choose to remove from string any trailing white space using regex,
open(my $INPUT, "<", "input.txt");
my $text = <$INPUT>;
my $repetitions = <$INPUT>;
s/\s+\z// for $text, $repetitions;
print $text x $repetitions;
I'm using an online Perl editor/compiler as mentioned in the initial post http://compileonline.com/execute_perl_online.php
The reason for your output is that string Hello\rHello\rHello\r is differently interpreted in html (\r like line break), while in console \r returns cursor to the beginning of the current line.

Removing lines containing a string from a file w/ perl

I'm trying to take a file INPUT and, if a line in that file contains a string, replace the line with something else (the entire line, including line breaks), or nothing at all (remove the line like it wasn't there). Writing all this to a new file .
Here's that section of code...
while(<INPUT>){
if ($_ =~ / <openTag>/){
chomp;
print OUTPUT "Some_Replacement_String";
} elsif ($_ =~ / <\/closeTag>/) {
chomp;
print OUTPUT ""; #remove the line
} else {
chomp;
print OUTPUT "$_\r\n"; #print the original line
}
}
while(<INPUT>) should read one line at a time (if my understanding is correct) and store each line in the special variable $_
However, when I run the above code I get only the very first if statement condition returned Some_Replacement_String, and only once. (1 line, out of a file with 1.3m, and expecting 600,000 replacements). This obviously isn't the behavior I expect. If I do something like while(<INPUT>){print OUTPUT $_;) I get a copy of the entire file, every line, so I know the entire file is being read (expected behavior).
What I'm trying to do is get a line, test it, do something with it, and move on to the next one.
If it helps with troubleshooting at all, if I use print $.; anywhere in that while statement (or after it), I get 1 returned. I expected this to be the "Current line number for the last filehandle accessed.". So by the time my while statement loops through the entire file, it should be equal to the number of lines in the file, not 1.
I've tried a few other variations of this code, but I think this is the closest I've come. I assume there's a good reason I'm not getting the behavior I expect, can anyone tell me what it is?
The problem you are describing indicates that your input file only contains one line. This may be because of a great many different things, such as:
You have changed the input record separator $/
Your input file does not contain the correct line endings
You are running your script with -0777 switch
Some notes on your code:
if ($_ =~ / <openTag>/){
chomp;
print OUTPUT "Some_Replacement_String";
No need to chomp a line you are not using.
} elsif ($_ =~ / <\/closeTag>/) {
chomp;
print OUTPUT "";
This is quite redundant. You don't need to print an empty string (ever, really), and chomp a value you're not using.
} else {
chomp;
print OUTPUT "$_\r\n"; #print the original line
No need to remove newlines and then put them back. Also, normally you would use \n as your line ending, even on windows.
And, since you are chomping in every if-else clause, you might as well move that outside the entire if-block.
chomp;
if (....) {
But since you are never relying on line endings not being there, why bother using chomp at all?
When using the $_ variable, you can abbreviate some commands, such as you are doing with chomp. For example, a lone regex will be applied to $_:
} elsif (/ <\/closeTag>/) { # works splendidly
When, like above, you have a regex that contains slashes, you can choose another delimiter for your regex, so that you do not need to escape the slashes:
} elsif (m# </closeTag>#) {
But then you need to use the full notation of the m// operator, with the m in front.
So, in short
while(<INPUT>){
if (/ <openTag>/){
print OUTPUT "Some_Replacement_String";
} elsif (m# </closeTag>#) {
# do nothing
} else {
print OUTPUT $_; # print the original line
}
}
And of course, the last two can be combined into one, with some negation logic:
} elsif (not m# </closeTag>#) {
print OUTPUT $_;
}

Trying to understand Perl split() output

I have a few lines of text that I'm trying to use Perl's split function to convert into an array. The problem is that I'm getting some unusual extra characters in the output, specifically the following string "\cM" (without the quotes). This string appears where there were line breaks in the original text; however, (I believe) those line breaks were removed in the text that I'm trying to split. Does anybody know what's going on with this phenomenon? I posted an example below. Thanks.
Here's the original plain text that I'm trying to split. I'm loading it from a file, in case that matters:
10b2obo12b2o2b$6b3obob3o8bob3o2b$2bobo10bo3b2obo4bo2b$2o4b2o5bo3b4obo
3b2o2b$2bob2o2bo4b3obo5b4obob$8bo4bo13b3o$2bob2o2bo4b3obo5b4obob$2o4b
2o5bo3b4obo3b2o2b$2bobo10bo3b2obo4bo2b$6b3obob3o8bob3o2b$10b2obo12b2o!
Here is my Perl code that is supposed to do the splitting:
while(<$FH>) {
chomp;
$string .= $_;
last if m/!$/;
}
#rows = split(qr/\$/, $string);
print; # a dummy line to provide a breakpoint for the debugger
This what the debugger outputs when it gets to the "print" line. The issue I'm trying to deal with appears in lines 3, 7, and 10:
DB<10> p $string
2o5bo3b4obo3b2o2b$2bobo10bo3b2obo4bo2b$6b3obob3o8bob3o2b$10b2obo12b2o!
DB<11> x #rows
0 '10b2obo12b2o2b'
1 '6b3obob3o8bob3o2b'
2 '2bobo10bo3b2obo4bo2b'
3 "2o4b2o5bo3b4obo\cM3b2o2b"
4 '2bob2o2bo4b3obo5b4obob'
5 '8bo4bo13b3o'
6 '2bob2o2bo4b3obo5b4obob'
7 "2o4b\cM2o5bo3b4obo3b2o2b"
8 '2bobo10bo3b2obo4bo2b'
9 '6b3obob3o8bob3o2b'
10 "10b2obo12b2o!\cM"
You know, changing the file input separator would make this code a lot simpler.
$/ = '$';
my #rows = <$FH>;
chomp #rows;
print "#rows";
The debugger is probably using \cM to represent Ctrl-M which is also known as a carriage return (and sometimes \r or ^M). Text files from Windows use a CR-LF (carriage return, line feed) pair to represent the end of a line. If you read such a file on a Unix system, your chomp will strip off the Unix EOL (a single line feed) but leave the CR as is and you end up with stray CRs in your file.
For a file like you have you can just strip out all the trailing whitespace instead of using chomp:
while(defined(my $line = <$FH>)) {
$line =~ s/\s+$//;
$string .= $line;
last if($line =~ /!$/);
}
You don't say which OS you're on.
Check out binmode and what it has to say about \cM, and that their position coincides with the line endings of your input file:
http://perldoc.perl.org/functions/binmode.html