Perl - How to remove newlines stored in a Perl hash?

I am stuck on the following problem. When I use Data::Dumper to print a Perl hash that I imported from a spreadsheet, some of the keys and values print across multiple lines with a carriage return symbol (^M), because the source cell had a line break in it. I have tried the following on my strings. It removes the carriage return symbol ^M, but the newlines persist.
$title = $sheet->{Cells}[0][$col]{Val};
$num = $sheet->{Cells}[$row+1][$col]{Val};
$title =~ s/\r//g;
$num =~ s/\r//g;
$title2 = chomp($title);
$num2 = chomp($num);
The output I get when I print is this:
$VAR1 = {
'' => {
'JOHN
has a car' => {
'SUV' => A red coloured
2022 suv from
japan
I expect to get
$VAR1 = {
'' => {
'JOHN' => {
'SUV' => A red coloured 2022 suv from japan

You could use \R instead of \r, which should handle any likely combination of carriage return and newline. It was introduced in Perl 5.10:
\R matches a generic newline; that is, anything considered a linebreak sequence by Unicode. This includes all characters matched by \v (vertical whitespace), and the multi character sequence "\x0D\x0A" (carriage return followed by a line feed, sometimes called the network newline; it's the end of line sequence used in Microsoft text files opened in binary mode). \R is equivalent to (?>\x0D\x0A|\v).
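As a minimal sketch of that suggestion (the helper name flatten_cell is made up for illustration, and replacing each line break with a single space is an assumption based on the expected output in the question):

```perl
use strict;
use warnings;

# Hypothetical helper: turn every line break inside a cell value into one space.
sub flatten_cell {
    my ($text) = @_;
    return $text unless defined $text;
    $text =~ s/\s*\R\s*/ /g;     # \R covers "\r\n", "\n", "\r" and other vertical whitespace
    $text =~ s/^\s+|\s+$//g;     # trim whatever is left at either end
    return $text;
}

print flatten_cell("A red coloured\r\n2022 suv from\r\njapan"), "\n";
```

Note also that chomp returns the number of characters it removed, not the cleaned string, so `$title2 = chomp($title)` stores a count in $title2. The cleaned text stays in $title.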

Related

Can somebody explain this obfuscated perl regexp script?

This code is taken from the HackBack DIY guide to rob banks by Phineas Fisher. It outputs a long text (The Sixth Declaration of the Lacandon Jungle). Where does it fetch it? I don't see any alphanumeric characters at all. What is going on here? And what does the -r switch do? It seems undocumented.
perl -Mre=eval <<\EOF
''
=~(
'(?'
.'{'.(
'`'|'%'
).("\["^
'-').('`'|
'!').("\`"|
',').'"(\\$'
.':=`'.(('`')|
'#').('['^'.').
('['^')').("\`"|
',').('{'^'[').'-'.('['^'(').('{'^'[').('`'|'(').('['^'/').('['^'/').(
'['^'+').('['^'(').'://'.('`'|'%').('`'|'.').('`'|',').('`'|'!').("\`"|
'#').('`'|'%').('['^'!').('`'|'!').('['^'+').('`'|'!').('['^"\/").(
'`'|')').('['^'(').('['^'/').('`'|'!').'.'.('`'|'%').('['^'!')
.('`'|',').('`'|'.').'.'.('`'|'/').('['^')').('`'|"\'").
'.'.('`'|'-').('['^'#').'/'.('['^'(').('`'|('$')).(
'['^'(').('`'|',').'-'.('`'|'%').('['^('(')).
'/`)=~'.('['^'(').'|</'.('['^'+').'>|\\'
.'\\'.('`'|'.').'|'.('`'|"'").';'.
'\\$:=~'.('['^'(').'/<.*?>//'
.('`'|"'").';'.('['^'+').('['^
')').('`'|')').('`'|'.').(('[')^
'/').('{'^'[').'\\$:=~/('.(('{')^
'(').('`'^'%').('{'^'#').('{'^'/')
.('`'^'!').'.*?'.('`'^'-').('`'|'%')
.('['^'#').("\`"| ')').('`'|'#').(
'`'|'!').('`'| '.').('`'|'/')
.'..)/'.('[' ^'(').'"})')
;$:="\."^ '~';$~='@'
|'(';$^= ')'^'[';
$/='`' |'.';
$,= '('
EOF
The basic idea of the code you posted is that each alphanumeric character has been replaced by a bitwise operation between two non-alphanumeric characters. For instance,
'`'|'%'
(the 5th line of the "star" in your code) is a bitwise OR between a backtick and a percent sign, whose code points are 96 and 37 respectively. Their OR is 101, which is the code point of the letter "e". The following lines all print the same thing:
say '`' | '%' ;
say chr( ord('`' | '%') );
say chr( ord('`') | ord('%') );
say chr( 96 | 37 );
say chr( 101 );
say "e"
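The same idea works for XOR (^), which the obfuscated code also uses. Here is a small sketch that decodes the first four operator pairs of the pattern. The helper name decode_pair and the array layout are made up for illustration; the character pairs themselves are copied from the code in the question.

```perl
use strict;
use warnings;

# Each entry is (left operand, operator, right operand), taken from the
# first lines of the obfuscated pattern.
my @pairs = (
    [ '`', '|', '%' ],
    [ '[', '^', '-' ],
    [ '`', '|', '!' ],
    [ '`', '|', ',' ],
);

# Hypothetical helper: apply the string-wise bitwise operator to two characters.
sub decode_pair {
    my ($left, $op, $right) = @_;
    return $op eq '|' ? $left | $right : $left ^ $right;
}

my $word = join '', map { decode_pair(@$_) } @pairs;
print "$word\n";    # prints "eval"
```

Both operands are strings, so Perl applies the operator to the characters' code points rather than treating them as numbers.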
Your code starts with (ignoring whitespace, which doesn't matter):
'' =~ (
The corresponding closing bracket is 28 lines later:
^'(').'"})')
(Search the page for this pattern to see it; I used my editor's matching-parenthesis highlighting to find it.)
We can assign everything between the opening and closing parentheses to a variable that we can then print:
$x = '(?'
.'{'.(
'`'|'%'
).("\["^
'-').('`'|
'!').("\`"|
',').'"(\\$'
.':=`'.(('`')|
'#').('['^'.').
('['^')').("\`"|
',').('{'^'[').'-'.('['^'(').('{'^'[').('`'|'(').('['^'/').('['^'/').(
'['^'+').('['^'(').'://'.('`'|'%').('`'|'.').('`'|',').('`'|'!').("\`"|
'#').('`'|'%').('['^'!').('`'|'!').('['^'+').('`'|'!').('['^"\/").(
'`'|')').('['^'(').('['^'/').('`'|'!').'.'.('`'|'%').('['^'!')
.('`'|',').('`'|'.').'.'.('`'|'/').('['^')').('`'|"\'").
'.'.('`'|'-').('['^'#').'/'.('['^'(').('`'|('$')).(
'['^'(').('`'|',').'-'.('`'|'%').('['^('(')).
'/`)=~'.('['^'(').'|</'.('['^'+').'>|\\'
.'\\'.('`'|'.').'|'.('`'|"'").';'.
'\\$:=~'.('['^'(').'/<.*?>//'
.('`'|"'").';'.('['^'+').('['^
')').('`'|')').('`'|'.').(('[')^
'/').('{'^'[').'\\$:=~/('.(('{')^
'(').('`'^'%').('{'^'#').('{'^'/')
.('`'^'!').'.*?'.('`'^'-').('`'|'%')
.('['^'#').("\`"| ')').('`'|'#').(
'`'|'!').('`'| '.').('`'|'/')
.'..)/'.('[' ^'(').'"})';
print $x;
This will print:
(?{eval"(\$:=`curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)=~s|</p>|\\n|g;\$:=~s/<.*?>//g;print \$:=~/(SEXTA.*?Mexicano..)/s"})
The remainder of the code is a handful of assignments to some variables, probably there only to complete the shape. The end of the star is:
$:="\."^'~';
$~='@'|'(';
$^=')'^'[';
$/='`'|'.';
$,='(';
Which just assigns simple one-character strings to some variables.
Back to the main code:
(?{eval"(\$:=`curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)=~s|</p>|\\n|g;\$:=~s/<.*?>//g;print \$:=~/(SEXTA.*?Mexicano..)/s"})
This code is inside a regex which is matched against an empty string (remember that we started with '' =~ (...)). (?{...}) inside a regex runs the code in the braces. With some whitespace added, and the code taken out of the eval string, this gives us:
# fetch a URL from the web using curl _quietly_ (-s)
($: = `curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)
# replace end of paragraphs with newlines in the HTML fetched
=~ s|</p>|\n|g;
# Remove all HTML tags
$: =~ s/<.*?>//g;
# Print everything between SEXTA and Mexicano (+2 chars)
print $: =~ /(SEXTA.*?Mexicano..)/s
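To see the (?{...}) mechanism in isolation, here is a minimal sketch with made-up counter variables. It also shows why the original command line loads the re pragma: -Mre=eval is the command-line form of `use re 'eval'`, which Perl requires before it will run code blocks in a pattern that is built from a string at run time, as the obfuscated pattern is.

```perl
use strict;
use warnings;

our $literal_runs = 0;
our $runtime_runs = 0;

# A (?{ ... }) block in a literal pattern is compiled along with the program.
'' =~ /(?{ $literal_runs++ })/;

# A pattern assembled in a string at run time may only contain code blocks
# when "use re 'eval'" is in effect.
my $pattern = '(?{ $main::runtime_runs++ })';
{
    use re 'eval';
    '' =~ /$pattern/;
}

print "literal: $literal_runs, runtime: $runtime_runs\n";
```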
You can automate this deobfuscation process with B::Deparse. Running
perl -MO=Deparse yourcode.pl
will produce something like:
'' =~ m[(?{eval"(\$:=`curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)=~s|</p>|\\n|g;\$:=~s/<.*?>//g;print \$:=~/(SEXTA.*?Mexicano..)/s"})];
$: = 'P';
$~ = 'h';
$^ = 'r';
$/ = 'n';
$, = '(';

Reading CSV with Perl produces distorted lines

I am reading a CSV file using Perl 5.26.1 with lines that look like this:
B1_10,202337840166,R08C02,202337840166_R08C02.gtc
I'm reading this data into a hash that has the last element as a key, and the first as a value.
I read the file line by line (snippet only):
while (<$csv>) {
if (/^Sample/) { next }
say "-----start----\noriginal = $_";
chomp;
my @line = split /,/;
my $name = $line[0];
my $vcf = $line[3];
say "1st element = $name";
say "4th element = $vcf";
$vcf2dir{$vcf} = $name;
say "\$vcf2dir{$vcf} = '$name'";
say '-----end------';
}
which produces the following output:
-----start----
original = B1_10,202337840166,R08C02,202337840166_R08C02.gtc
1st element = B1_10
4th element = 202337840166_R08C02.gtc
} = 'B1_10'2337840166_R08C02.gtc
-----end-------
but it should look like
-----start----
original = B1_10,202337840166,R08C02,202337840166_R08C02.gtc
1st element = B1_10
4th element = 202337840166_R08C02.gtc
$vcf2dir{202337840166_R08C02.gtc} = 'B1_10'
-----end-------
and it also displays strangely with the Data::Printer (DDP) package:
use DDP;
p %vcf2dir;
produces
{
' "B1_10"840166_R08C02.gtc
}
in other words, the last string is being cut up for some reason.
I have tried removing non-ascii characters with $_ =~ s/[[:^ascii:]]//g; but this still produces the same error.
I have no idea why Perl is ripping these strings apart :(
while (<$csv>) {
...
chomp;
My guess is that the input file uses \r\n (Windows-style) line endings while you are running the code in a Unix-like environment (Linux, macOS, ...) where the line ending is \n. This means that $INPUT_RECORD_SEPARATOR ($/) is also \n, so chomp removes only the \n and leaves the \r. When the string is printed, the leftover \r moves the cursor back to the start of the line, which causes the strange output.
To fix this, either fix the line endings in your input file, set $INPUT_RECORD_SEPARATOR to the expected separator, or use s{\r?\n\z}{} instead of chomp to handle both \r\n and \n line endings.
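A minimal sketch of the last option, using the sample line from the question with a Windows line ending appended. The helper name parse_line is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: strip "\r\n" or "\n" from the end of a record,
# then split the record on commas.
sub parse_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return split /,/, $line;
}

my @fields  = parse_line("B1_10,202337840166,R08C02,202337840166_R08C02.gtc\r\n");
my %vcf2dir = ($fields[3] => $fields[0]);

print "\$vcf2dir{$fields[3]} = '$vcf2dir{$fields[3]}'\n";
```

Because the \r is gone before the key is stored, the hash key no longer carries an invisible trailing character.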
I ran your snippet against your line and it worked as expected.
However, I have seen behavior like what you show when there were spurious Control-M characters in my data.
Try filtering them out:
after your chomp, remove all Control-M characters with the command below
s/\cM//g;

Issues parsing a CSV file in perl using Text::CSV

I'm trying to use Text::CSV to parse this CSV file. Here is how I am doing it:
open my $fh, '<', 'test.csv' or die "can't open csv";
my $csv = Text::CSV_XS->new ({ sep_char => "\t", binary => 1 , eol=> "\n"});
$csv->column_names($csv->getline($fh));
while(my $row = $csv->getline_hr($fh)) {
# use row
}
Because the file has 169,252 rows (not counting the headers line), I expect the loop to run that many times. However, it only runs 8 times and gives me 8 rows. I'm not sure what's happening, because the CSV just seems like a normal CSV file with \n as the line separator and \t as the field separator. If I loop through the file like this:
while(my $line = <$fh>) {
my $fields = $csv->parse($line);
}
Then the loop goes through all rows.
Text::CSV_XS is silently failing with an error. If you put the following after your while loop:
my ($cde, $str, $pos) = $csv->error_diag ();
print "$cde, $str, $pos\n";
You can see if there were errors parsing the file and you get the output:
2034, EIF - Loose unescaped quote, 336
Which means the column:
GT New Coupe 5.0L CD Wheels: 18" x 8" Magnetic Painted/Machined 6 Speakers
has loose, unescaped quote characters (the " marks after 18 and 8 are not escaped).
The Text::CSV perldoc states:
allow_loose_quotes
By default, parsing fields that have quote_char characters inside an unquoted field, like
1,foo "bar" baz,42
would result in a parse error. Though it is still bad practice to allow this format, we cannot help there are some vendors that make their applications spit out lines styled like this.
If you change your arguments to the creation of Text::CSV_XS to:
my $csv = Text::CSV_XS->new ({ sep_char => "\t", binary => 1,
eol=> "\n", allow_loose_quotes => 1 });
The problem goes away, well until row 105265, when Error 2023 rears its head:
2023, EIQ - QUO character not allowed, 406
Details of this error in the perldoc:
2023 "EIQ - QUO character not allowed"
Sequences like "foo "bar" baz",qu and 2023,",2008-04-05,"Foo, Bar",\n will cause this error.
Setting your quote character empty (setting quote_char => '' on your call to Text::CSV_XS->new()) does seem to work around this and allow processing of the whole file. However I would take time to check if this is a sane option with the CSV data.
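The following sketch shows one way to notice such a silent stop, assuming Text::CSV_XS is installed. The in-memory sample data is made up; the loop mirrors the one in the question. getline_hr returns false both at the end of the file and on a parse error, so the eof method is used to tell the two apart before reading error_diag.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Made-up tab-separated sample; the second data row has bare quotes
# inside an unquoted field.
my $data = "id\tdesc\n1\tplain\n2\tWheels: 18\" x 8\"\n3\tnever reached\n";
open my $fh, '<', \$data or die "can't open in-memory file: $!";

my $csv = Text::CSV_XS->new({ sep_char => "\t", binary => 1 });
$csv->column_names($csv->getline($fh));

my $rows = 0;
$rows++ while $csv->getline_hr($fh);

# If the parser did not reach the end of the input, it stopped on an error.
my $failed = !$csv->eof;
if ($failed) {
    my ($code, $message, $position) = $csv->error_diag;
    print "stopped after $rows rows: $code, $message, $position\n";
}
```

Passing auto_diag => 1 to the constructor is another option if you would rather have the module report parse errors itself.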
TL;DR The long and short is that your CSV is not in the greatest format, and you will have to work around it.

Perl Text processing on a variable before its usage

I wrote a Perl script which outputs a list containing entries like the one below:
$var = ' whatever'
$var contains: a single quote, a space, the word whatever, and a single quote.
This is actually the key of a hash and I want to pull the value for it, but because of the single quotes and the space in between, I am not able to look up the value.
So I want to strip $var as below:
$var = whatever
meaning remove the single quote, the space and the trailing single quote,
so that I can use $var as a hash key to pull the respective value.
Could you guide me to a Perl one-liner for this?
Thanks.
There are several ways to do it, but beware: modifying the keys of a hash can have unwanted results, for example:
use strict;
use warnings;
use Data::Dumper;
my $src = {
"a a" => 1,
" a a " => 2,
"' a a '" => 3,
};
print "src: ", Dumper($src);
my $trg;
@$trg{ map { s/^[\s']*(.*?)[\s']*$/$1/; $_ } keys %$src } = values %$src;
print "copy: ", Dumper($trg);
will produce:
src: $VAR1 = {
' a a ' => 2,
'\' a a \'' => 3,
'a a' => 1
};
copy: $VAR1 = {
'a a' => 1
};
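One way to avoid losing entries silently is to record collisions while copying. This is only a sketch: the helper clean_key and the %collisions bookkeeping are made up for illustration, and keeping the first value seen is an arbitrary policy.

```perl
use strict;
use warnings;

# Hypothetical helper: trim surrounding whitespace and single quotes from a key.
sub clean_key {
    my ($key) = @_;
    $key =~ s/^[\s']+|[\s']+$//g;
    return $key;
}

my %src = ("a a" => 1, " a a " => 2, "' a a '" => 3);
my (%trg, %collisions);

for my $key (sort keys %src) {
    my $clean = clean_key($key);
    # Remember every original key that maps onto an already used clean key.
    push @{ $collisions{$clean} }, $key if exists $trg{$clean};
    $trg{$clean} //= $src{$key};    # keep the first value instead of overwriting it
}

for my $clean (sort keys %collisions) {
    warn "key '$clean' also produced by: @{ $collisions{$clean} }\n";
}
```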
Any regex can be explained with the YAPE::Regex::Explain module (from CPAN). For the above regex:
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new( qr(^[\s']*(.*?)[\s']*$) )->explain;
will produce:
The regular expression:
(?-imsx:^[\s']*(.*?)[\s']*$)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
[\s']* any character of: whitespace (\n, \r, \t,
\f, and " "), ''' (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
[\s']* any character of: whitespace (\n, \r, \t,
\f, and " "), ''' (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
In short, s/^[\s']*(.*?)[\s']*$/$1/; means:
at the beginning of the string, match whitespace or apostrophes as many times as possible,
then match anything (as little as possible),
at the end of the string, match whitespace or apostrophes as many times as possible,
and keep only the "anything" part.
#!/usr/bin/perl
$string = "' my string'";
print $string . "\n";
$string =~ s/'//g;
$string =~ s/^ //g;
print $string;
Output
' my string'
my string
$var =~ tr/ '//d;
see: tr operator
or, by regex
$var =~ s/(?:^['\s]+)|'//g;
The latter will keep the spaces in the middle of the word, the former removes all spaces and single quotes.
A short test:
...
$var = q{' what ever'};
$var =~ s/
(?: # find the following group
^ # at string begin, followed by
['\s]+ # space or single quote, one or more
) # close group
| # OR
' # single quotes in the whole string
//gx ; # replace by nothing, use formatted regex (x)
print "|$var|\n";
...
prints:
|what ever|
as expected.
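For comparison, here is a small sketch that runs both variants on copies of the same input, so the difference between them is visible side by side. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $var = q{' what ever'};

# tr with /d deletes every space and every single quote.
(my $with_tr = $var) =~ tr/ '//d;

# The regex removes the leading run of quotes/whitespace and any other single quote.
(my $with_s = $var) =~ s/(?:^['\s]+)|'//g;

print "|$with_tr|\n";    # prints |whatever|
print "|$with_s|\n";     # prints |what ever|
```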

Trying to understand Perl split() output

I have a few lines of text that I'm trying to use Perl's split function to convert into an array. The problem is that I'm getting some unusual extra characters in the output, specifically the following string "\cM" (without the quotes). This string appears where there were line breaks in the original text; however, (I believe) those line breaks were removed in the text that I'm trying to split. Does anybody know what's going on with this phenomenon? I posted an example below. Thanks.
Here's the original plain text that I'm trying to split. I'm loading it from a file, in case that matters:
10b2obo12b2o2b$6b3obob3o8bob3o2b$2bobo10bo3b2obo4bo2b$2o4b2o5bo3b4obo
3b2o2b$2bob2o2bo4b3obo5b4obob$8bo4bo13b3o$2bob2o2bo4b3obo5b4obob$2o4b
2o5bo3b4obo3b2o2b$2bobo10bo3b2obo4bo2b$6b3obob3o8bob3o2b$10b2obo12b2o!
Here is my Perl code that is supposed to do the splitting:
while(<$FH>) {
chomp;
$string .= $_;
last if m/!$/;
}
@rows = split(qr/\$/, $string);
print; # a dummy line to provide a breakpoint for the debugger
This is what the debugger outputs when it gets to the "print" line. The issue I'm trying to deal with appears in lines 3, 7, and 10:
DB<10> p $string
2o5bo3b4obo3b2o2b$2bobo10bo3b2obo4bo2b$6b3obob3o8bob3o2b$10b2obo12b2o!
DB<11> x @rows
0 '10b2obo12b2o2b'
1 '6b3obob3o8bob3o2b'
2 '2bobo10bo3b2obo4bo2b'
3 "2o4b2o5bo3b4obo\cM3b2o2b"
4 '2bob2o2bo4b3obo5b4obob'
5 '8bo4bo13b3o'
6 '2bob2o2bo4b3obo5b4obob'
7 "2o4b\cM2o5bo3b4obo3b2o2b"
8 '2bobo10bo3b2obo4bo2b'
9 '6b3obob3o8bob3o2b'
10 "10b2obo12b2o!\cM"
You know, changing the file input separator would make this code a lot simpler.
$/ = '$';
my @rows = <$FH>;
chomp @rows;
s/\R//g for @rows; # drop the line breaks (LF or CRLF) that remain inside each row
print "@rows";
The debugger is probably using \cM to represent Ctrl-M which is also known as a carriage return (and sometimes \r or ^M). Text files from Windows use a CR-LF (carriage return, line feed) pair to represent the end of a line. If you read such a file on a Unix system, your chomp will strip off the Unix EOL (a single line feed) but leave the CR as is and you end up with stray CRs in your file.
For a file like yours, you can just strip out all the trailing whitespace instead of using chomp:
while(defined(my $line = <$FH>)) {
$line =~ s/\s+$//;
$string .= $line;
last if($line =~ /!$/);
}
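Here is a self-contained sketch of that loop, run against a made-up two-line sample with Windows (CRLF) line endings held in an in-memory file, followed by the split from the question:

```perl
use strict;
use warnings;

# Made-up sample: one pattern wrapped over two lines, each ending in "\r\n".
my $data = "3o\$2bo\$b\r\no!\r\n";
open my $FH, '<', \$data or die "can't open in-memory file: $!";

my $string = '';
while (defined(my $line = <$FH>)) {
    $line =~ s/\s+$//;      # removes "\r\n" as well as a bare "\n"
    $string .= $line;
    last if $line =~ /!$/;
}

my @rows = split /\$/, $string;
print join('|', @rows), "\n";    # prints 3o|2bo|bo!
```

Because \s matches both \r and \n, no stray carriage returns end up inside the joined string.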
You don't say which OS you're on.
Check out binmode and what it has to say about \cM, and note that the positions of the \cM characters coincide with the line endings of your input file:
http://perldoc.perl.org/functions/binmode.html