Split a string based on ASCII value - perl

I need to parse a delimited file (generated by a mainframe job and FTPed over to Windows), but I have a few queries about using split on the delimiter.
As per the documentation, the fields are separated by '1D'. But when I open the file in Notepad++ (the Encoding menu shows 'Encode in ANSI'), the separator looks to me like a 'vertical broken bar'. Q. Not sure what '1D' is?
open my $handle, '<', 'sample.txt';
chomp(my @lines = <$handle>);
close $handle;
my @a = unpack("C*", $lines[0]);
print Dumper \@a;
# $VAR1 = [65,166,66,166,67,166];
From the Dumper output, we see Perl considers the character code of the vertical broken bar to be 166.
As per link1, 166 is indeed the vertical broken bar, whereas as per link2, 166 is the feminine ordinal indicator. Q. Any suggestion as to why the difference?
my $str = $lines[0];
print Dumper $str;
# $VAR1 = 'AªBªCª';
We can see that the output contains the 'feminine ordinal indicator', not the 'vertical broken bar'. Q. Not sure why Perl reads a 'bar' but then starts treating it as something else.
# I copied the vertical broken bar from notepad++ for use below
my @b = split(/¦/, $lines[0]);
print Dumper \@b;
# $VAR1 = [ 'AªBªCª' ];
Since Perl has started treating the bar as something else, as expected, there is no split here. I thought of splitting by giving the character code 166 directly, but split() doesn't seem to accept a character code as an argument. Q. Any workaround to pass a character code to split()?
# I copied the vertical broken bar from notepad++ and created A¦B¦C
my @c = split(/¦/, 'A¦B¦C');
print Dumper \@c;
#$VAR1 = [ 'A','B','C']; # works as expected, added here just for completion
Any pointers will be a great help!
Update:
my @a = map {ord $_} split //, $lines[0]; print Dumper \@a;
# $VAR1 = [ 65,166,66,166,67,166];

When you receive an input file from an unknown source, the most important thing you need to know about it is "what character encoding does it use?" Without that information, any processing that you do on the file is based on guesswork.
The problem isn't helped by people who talk about "extended ASCII" as though it's a meaningful term. ASCII only contains 128 characters. There are many definitions of what the next 128 character codes represent, and many of them are contradictory.
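For instance, a quick check with Encode shows how the same byte 0xA6 comes out under two common but incompatible 8-bit code pages; cp437 and cp1252 are chosen here purely as illustrations of the kind of tables those two links may have been describing:
use Encode qw(decode);
use charnames ();
for my $cp ('cp1252', 'cp437') {
    my $ch = decode($cp, "\xA6");
    printf "%-7s U+%04X %s\n", $cp, ord $ch, charnames::viacode(ord $ch);
}
# cp1252  U+00A6 BROKEN BAR
# cp437   U+00AA FEMININE ORDINAL INDICATOR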
It seems that you have a solution to your problem. Splitting on '¦' (copied from Notepad++) does what you want. So I suggest you do that. If you want to use the actual character code, then you can convert 166 to hexadecimal (0xA6) and use that:
split /\xA6/, ... ;
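If you would rather keep the decimal value 166 from the documentation, an equivalent sketch is to build the delimiter with chr and quote it in the pattern:
my $delim = chr 166;                       # byte 0xA6
my @fields = split /\Q$delim\E/, $lines[0];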

You should always decode your inputs and encode your outputs.
my $acp;
BEGIN {
require Win32;
$acp = "cp".Win32::GetACP();
}
use open ':std', ":encoding($acp)";
Now, @lines will contain strings of Unicode code points. As such, you can now use the following:
use utf8; # Source code is encoded using UTF-8.
my @b = split(/¦/, $lines[0]);
Alternatively, every one of the following will also work now:
my @b = split(/\N{BROKEN BAR}/, $lines[0]);
my @b = split(/\N{U+00A6}/, $lines[0]);
my @b = split(/\x{A6}/, $lines[0]);
my @b = split(/\xA6/, $lines[0]);
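Putting it together, a minimal sketch of the whole approach might look like this (it assumes the file really is in the ANSI code page reported by Win32::GetACP, e.g. cp1252, and that sample.txt is the file in question):
#!/usr/bin/perl
use strict;
use warnings;

my $acp;
BEGIN {
    require Win32;
    $acp = "cp" . Win32::GetACP();
}
use open ':std', ":encoding($acp)";

open my $handle, '<', 'sample.txt' or die "sample.txt: $!";
chomp(my @lines = <$handle>);
close $handle;

# Once the input is decoded, the delimiter is simply U+00A6 BROKEN BAR.
my @fields = split /\x{A6}/, $lines[0];
print join(", ", @fields), "\n";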

Related

Data::Dumper wraps second word's output

I'm experiencing a rather odd problem while using Data::Dumper to try and check on my importing of a large list of data into a hash.
My Data looks like this in another file.
##Product ID => Market for product
ABC => Euro
XYZ => USA
PQR => India
Then in my script, I'm trying to read in my list of data into a hash like so:
open(CONFIG_DAT_H, "<", $config_data);
while(my $line = <CONFIG_DAT_H>) {
if($line !~ /^\#/) {
chomp($line);
my @words = split(/\s*\=\>\s/, $line);
%product_names->{$words[0]} = $words[1];
}
}
close(CONFIG_DAT_H);
print Dumper (%product_names);
My parsing is working for the most part, in that I can find all of my data in the hash, but when I print it using Data::Dumper it doesn't print properly. This is my output.
$VAR1 = 'ABC';
';AR2 = 'Euro
$VAR3 = 'XYZ';
';AR4 = 'USA
$VAR5 = 'PQR';
';AR6 = 'India
Does anybody know why the Dumper is printing the '; characters over the first two letters on my second column of data?
There is one unclear thing in the code: is *product_names a hash or a hashref?
If it is a hash, you should use the $product_names{key} syntax, not %product_names->{key}, and you need to pass a reference to Data::Dumper, so Dumper(\%product_names).
If it is a hashref then it should be labelled with a correct sigil, so $product_names->{key} and Dumper($product_names}.
As noted by mob, if your input has line endings other than \n, it needs to be cleaned up more explicitly, say with s/\s*$// per the comment. See the answer by ikegami.
I'd also like to add that the loop can be simplified by losing the if branch:
open my $config_dat_h, "<", $config_data or die "Can't open $config_data: $!";
while (my $line = <$config_dat_h>)
{
next if $line =~ /^\#/; # or /^\s*\#/ to account for possible spaces
# ...
}
I have changed to the lexical filehandle, the recommended practice with many advantages. I have also added a check for open, which should always be in place.
Hmm... this appears wrong to me, even if you're using Perl 6:
%product_names->{$words[0]} = $words[1];
I don't know Perl 6 very well, but in Perl 5 the access should be like below, assuming that %product_names exists and is declared:
$product_names{...} = ... ;
If you could post the full code, I can help solve this problem.
The file uses CR LF as line endings. This would become evident by adding the following to your code:
local $Data::Dumper::Useqq = 1;
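With Useqq in effect the stray carriage returns stop overprinting the line and show up literally, something like this (illustrative):
$VAR1 = "ABC";
$VAR2 = "Euro\r";
which makes the real problem obvious.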
You could convert the file to use unix line endings (seeing as you are on a unix system). This can be achieved using the dos2unix utility.
dos2unix config.dat
Alternatively, replace
chomp($line);
with the more flexible
$line =~ s/\s+\z//;
Note: %product_names->{$words[0]} makes no sense. It happens to do what you want in old versions of Perl, but it rightfully throws an error in newer versions. $product_names{$words[0]} is the proper syntax for accessing the value of an element of a hash.
Tip: You should be using print Dumper(\%product_names); instead of print Dumper(%product_names);.
Tip: You might also find local $Data::Dumper::Sortkeys = 1; useful. Data::Dumper has such bad defaults :(
Tip: Using split(/\s*=>\s*/, $line, 2) instead of split(/\s*=>\s*/, $line) would permit the value to contain =>.
Tip: You shouldn't use global variable without reason. Use open(my $CONFIG_DAT_H, ...) instead of open(CONFIG_DAT_H, ...), and replace other instances of CONFIG_DAT_H with $CONFIG_DAT_H.
Tip: Using next if $line =~ /^#/; would avoid a lot of indenting.
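Putting those tips together, a rewritten version of the loop might look like this (a sketch; the config.dat path is only an illustration):
use strict;
use warnings;
use Data::Dumper;

local $Data::Dumper::Sortkeys = 1;

my $config_data = 'config.dat';   # hypothetical path; use your real file name
my %product_names;

open(my $CONFIG_DAT_H, "<", $config_data) or die "Can't open $config_data: $!";
while (my $line = <$CONFIG_DAT_H>) {
    next if $line =~ /^#/;            # skip comment lines
    $line =~ s/\s+\z//;               # handles both \n and \r\n endings
    my ($id, $market) = split(/\s*=>\s*/, $line, 2);
    $product_names{$id} = $market;
}
close($CONFIG_DAT_H);

print Dumper(\%product_names);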

Perl: Replace consecutive spaces in this given scenario?

An excerpt of a big binary file ($data) looks like this:
\n1ax943021C xxx\t2447\t5
\n1ax951605B yyy\t10400\t6
\n1ax919275 G2L zzz\t6845\t6
The first 25 characters contain an article number, padded with spaces. How can I convert the spaces between the article number and the next column into a \x09? Note that there can be one or more spaces between different parts of the article number itself.
I tried a workaround, but that overwrites the article number with the literal text ".{25}xxx":
$data =~ s/\n.{25}/\n.{25}xxx/g
Anyone able to help?
Thanks so much!
Gary
You can use unpack for fixed width data:
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Useqq=1;
print Dumper $_ for map join("\t", unpack("A25A*")), <DATA>;
__DATA__
1ax943021C xxx 2447 5
1ax951605B yyy 10400 6
1ax919275 G2L zzz 6845 6
Output:
$VAR1 = "1ax943021C\txxx\t2447\t5";
$VAR1 = "1ax951605B\tyyy\t10400\t6";
$VAR1 = "1ax919275 G2L\tzzz\t6845\t6";
Note that Data::Dumper's Useqq option prints whitespace characters in their escaped form.
Basically what I do here is take each line, unpack it using two space-padded text fields (which removes the excess padding), join those fields back together with a tab, and print them. Note also that this preserves the spaces inside the strings themselves (such as in "1ax919275 G2L").
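Applied to the $data scalar from the question rather than a filehandle, the same idea might look roughly like this (a sketch; it assumes each record sits on its own line and the first 25 columns hold the article number):
$data = join "\n",
        map  { join "\t", unpack "A25A*", $_ }   # strip the padding, re-join with a tab
        grep { length }                          # skip any empty pieces
        split /\n/, $data;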
I interpret the question as there being a 25 character wide field that should have its trailing spaces stripped and then delimited by a tab character before the next field. Spaces within the article number should otherwise be preserved (like "1ax919275 G2L").
The following construct should do the trick:
$data =~ s/^(.{25})/{$t=$1;$t=~s! *$!\t!;$t}/emg;
That matches 25 characters from the beginning of each line in the data, then evaluates an expression for each article number by stripping its trailing spaces and appending a tab character.
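On Perl 5.14 or later the same substitution can be written with a non-destructive inner substitution (/r), which avoids the temporary variable; a sketch:
$data =~ s/^(.{25})/ $1 =~ s! *$!\t!r /emg;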
Have a try with:
$data =~ s/ +/\t/g;
Not sure exactly what you want - this will match the two columns and print them out, with all the original spaces. Let me know the desired output and I will fix it for you...
#!/usr/bin/perl -w
use strict;
my @file = ('\n1ax943021C xxx\t2447\t5', '\n1ax951605B yyy\t10400\t6',
'\n1ax919275 G2L zzz\t6845\t6');
foreach (@file) {
my ($match1, $match2) = ($_ =~ /(\\n.{25})(.*)/);
print "$match1'[insertsomethinghere]'$match2\n";
}
Output:
\n1ax943021C '[insertsomethinghere]'xxx\t2447\t5
\n1ax951605B '[insertsomethinghere]'yyy\t10400\t6
\n1ax919275 G2L '[insertsomethinghere]'zzz\t6845\t6

Comparing two non-ascii strings in perl

I am unable to compare two non-ASCII strings, although both strings appear the same on the console. Below is what I tried. Please let me know what code is missing here so that the two variables compare as equal.
if($lineContent[7] ne $name) {
# Control comes to here
print "###### Values MIS-MATCHED\n";
} else {
print "###### Values MATCHED\n";
}
$lineContent[7] is from a CSV file
$name is from an XML file
When PuTTY's console is set to the default character set:
CSV Val: ENB69-åºå°å±
XML Val: ENB69-åºå°å±
When PuTTY's console is set to UTF-8:
CSV Val: ENB69-基地局
XML Val: ENB69-基地局
#!/usr/bin/perl
use warnings;
use strict;
use Encode;
binmode STDOUT, ":encoding(utf8)";
open F1, "<:utf8", "$ARGV[0]" or die "$!";
open F2, "<", "$ARGV[0]" or die "$!";
my $a1 = <F1>;
chomp $a1;
my $a2 = <F2>;
chomp $a2;
if ($a1 eq $a2) {
print "$a1=$a2 is true\n";
} else {
print "$a1=$a2 is false\n";
}
my $b = decode("utf-8", $a2);
if ($a1 eq $b) {
print "$a1=$b is true\n";
} else {
print "$a1=$b is false\n";
}
I wrote the test program listed above and created a text file with one line: 基地局.
When you run the program with this text file, you get one false and one true.
I don't know what's in your program, but I guess the CSV file is read as plain text without any parsing or encode/decode steps, whereas the XML file is parsed by some library, so the internal encoding of the two string variables is different, possibly including some leading bytes of encoding notation.
Simply put, you can try to encode or decode one of the two string variables, and see if they match.
By the way, this is my first answer here, hope it can be a little bit helpful to you ;-)
From your dump results it's obvious: the first variable stores 9 characters, which are the UTF-8 byte sequence that constructs 基地局, in its internal structure. The second variable represents 3 characters in its internal structure. They have the same byte stream, and are equal in a byte-stream view but not in a character-based comparison.
Using decode/encode can solve your problem, for example as sketched below.
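A rough sketch, assuming the CSV bytes are UTF-8 encoded (use the actual encoding name if it is something else):
use Encode qw(decode);

my $csv_val = decode('UTF-8', $lineContent[7]);   # bytes -> characters
if ($csv_val ne $name) {
    print "###### Values MIS-MATCHED\n";
} else {
    print "###### Values MATCHED\n";
}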
Your inputs:
"ENB13-\345\237\272\345\234\260\345\261\200"
"ENB13-\x{57fa}\x{5730}\x{5c40}"
As you can see, these are clearly not the same. Specifically, the first is the UTF-8 encoding of the other. Always decode inputs. Always encode outputs.
use strict;
use warnings;
use utf8; # Source code is saved as UTF-8
use open ':std', ':encoding(UTF-8)'; # Terminal expects UTF-8
my $name = "ENB69-基地局";
while (my $line = <STDIN>) {
chomp($line);
my @lineContent = split /\t/, $line;
print($lineContent[7] eq $name ? 1 : 0, "\n"); # 1
}
Personally, I would be a little more careful if you know that you are comparing Unicode strings. Unicode::Collate is the module for the job.
Of course you should also read tchrist's now-famous SO post on the topic of enabling unicode in Perl, https://stackoverflow.com/a/6163129/468327, but utf8::all does an admirable job of turning on proper unicode support. Note that better unicode handling was added to the Perl core in version 5.14 so I require that here as well.
Finally here is a quick script that does the comparison, of course you would populate the variables by reading the files as needed:
#!/usr/bin/env perl
use v5.14;
use strict;
use warnings;
use utf8::all;
use Unicode::Collate;
my $collator = Unicode::Collate->new;
my $csv = "ENB69-基地局";
my $xml = "ENB69-基地局";
say $collator->eq($csv, $xml) ? "equal" : "unequal";
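One reason the collator is worth the extra step: it treats canonically equivalent sequences as equal, which a plain eq does not. An illustrative sketch:
use Unicode::Collate;
my $collator = Unicode::Collate->new;

my $composed   = "\x{E9}";     # "é" as a single code point
my $decomposed = "e\x{301}";   # "e" followed by U+0301 COMBINING ACUTE ACCENT

print $composed eq $decomposed              ? "eq: equal\n"      : "eq: unequal\n";       # unequal
print $collator->eq($composed, $decomposed) ? "Collate: equal\n" : "Collate: unequal\n";  # equal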

how to put a file into an array and save it in perl

Hello everyone, I'm a beginner in Perl and I'm facing some problems: I want to put my strings, starting from AA up to \, into an array and save it. There are about 2000-3000 such records in a txt file, each starting with the same initials, i.e., AA through to \. I'm doing it this way; please correct me if I'm wrong.
Input File
AA c0001
BB afsfjgfjgjgjflffbg
CC table
DD hhhfsegsksgk
EB jksgksjs
\
AA e0002
BB rejwkghewhgsejkhrj
CC chair
DD egrhjrhojohkhkhrkfs
VB rkgjehkrkhkh;r
\
Source code
$flag = 0
while ($line = <ifh>)
{
if ( $line = m//\/g)
{
$flag = 1;
}
while ( $flag != 0)
{
for ($i = 0; $i <= 10000; $i++)
{ # Missing brace added by editor
$array[$i] = $line;
} # Missing brace added by editor
}
} # Missing close brace added by editor; position guessed!
print $ofh, $line;
close $ofh;
Welcome to StackOverflow.
There are multiple issues with your code. First, please post compilable Perl; I had to add three braces to give it the remotest chance of compiling, and I had to guess where one of them went (and there's a moderate chance it should be on the other side of the print statement from where I put it).
Next, experts have:
use warnings;
use strict;
at the top of their scripts because they know they will miss things if they don't. As a learner, it is crucial for you to do the same; it will prevent you making errors.
With those in place, you have to declare your variables as you use them.
Next, remember to indent your code. Doing so makes it easier to comprehend. Perl can be incomprehensible enough at the best of times; don't make it any harder than it has to be. (You can decide where you like braces - that is open to discussion, though it is simpler to choose a style you like and stick with it, ignoring any discussion because the discussion will probably be fruitless.)
Is the EB vs VB in the data significant? It is hard to guess.
It is also not clear exactly what you are after. It might be that you're after an array of entries, one for each block in the file (where the blocks end at the line containing just a backslash), and where each entry in the array is a hash keyed by the first two letters (or first word) on the line, with the remainder of the line being the value. This is a modestly complex structure, and probably beyond what you're expected to use at this stage in your learning of Perl.
You have the line while ($line = <ifh>). This is not invalid in Perl if you opened the file the old fashioned way, but it is not the way you should be learning. You don't show how the output file handle is opened, but you do use the modern notation when trying to print to it. However, there's a bug there, too:
print $ofh, $line; # Print two values to standard output
print $ofh $line; # Print one value to $ofh
You need to look hard at your code, and think about the looping logic. I'm sure what you have is not what you need. However, I'm not sure what it is that you do need.
Simpler solution
From the comments:
I want to flag each record starting from AA to \ as record 0 till record n and want to save it in a new file with all the record numbers.
Then you probably just need:
#!/usr/bin/env perl
use strict;
use warnings;
my $recnum = 0;
while (<>)
{
chomp;
if (m/^\\$/)
{
print "$_\n";
$recnum++;
}
else
{
print "$recnum $_\n";
}
}
This reads from the files specified on the command line (or standard input if there are none), and writes the tagged output to standard output. It prefixes each line except the 'end of record' marker lines with the record number and a space. Choose your output format and file handling to suit your needs. You might argue that the chomp is counter-productive; you can certainly code the program without it.
Overly complex solution
Developed in the absence of clear direction from the questioner.
Here is one possible way to read the data, but it uses moderately advanced Perl (hash references, etc). The Data::Dumper module is also useful for printing out Perl data structures (see: perldoc Data::Dumper).
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my @data;
my $hashref = { };
my $nrecs = 0;
while (<>)
{
chomp;
if (m/^\\$/)
{
# End of group - save to data array and start new hash
$data[$nrecs++] = $hashref;
$hashref = { };
}
else
{
m/^([A-Z]+)\s+(.*)$/;
$hashref->{$1} = $2;
}
}
foreach my $i (0..$nrecs-1)
{
print "Record $i:\n";
foreach my $key (sort keys %{ $data[$i] })
{
print " $key = $data[$i]->{$key}\n";
}
}
print Data::Dumper->Dump([ \@data ], [ '@data' ]);
Sample output for example input:
Record 0:
AA = c0001
BB = afsfjgfjgjgjflffbg
CC = table
DD = hhhfsegsksgk
EB = jksgksjs
Record 1:
AA = e0002
BB = rejwkghewhgsejkhrj
CC = chair
DD = egrhjrhojohkhkhrkfs
VB = rkgjehkrkhkh;r
$@data = [
{
'EB' => 'jksgksjs',
'CC' => 'table',
'AA' => 'c0001',
'BB' => 'afsfjgfjgjgjflffbg',
'DD' => 'hhhfsegsksgk'
},
{
'CC' => 'chair',
'AA' => 'e0002',
'VB' => 'rkgjehkrkhkh;r',
'BB' => 'rejwkghewhgsejkhrj',
'DD' => 'egrhjrhojohkhkhrkfs'
}
];
Note that this data structure is not optimized for searching except by record number. If you need to search the data in some other way, then you need to organize it differently. (And don't hand this code in as your answer without understanding it all - it is subtle. It also does no error checking; beware faulty data.)
It can't be right. I can see two main issues with your while-loop.
Once you enter the following loop
while ( $flag != 0)
{
...
}
you'll never break out, because you do not reset the flag when you find a break line. You'll have to parse your input and exit the loop when necessary.
And second, you never read any new input within this loop, so you process the same $line over and over again.
You should not put the loop inside your code like that; instead you can use the following pattern (pseudo-code; a Perl sketch follows):
if flag != 0
append item to array
else
save array to file
start with new array
end
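In Perl that pattern might look roughly like this (a sketch; the file names are only illustrations):
use strict;
use warnings;

open my $ifh, '<', 'input.txt'  or die "input.txt: $!";
open my $ofh, '>', 'output.txt' or die "output.txt: $!";

my @record;
while (my $line = <$ifh>) {
    chomp $line;
    if ($line eq '\\') {                          # end-of-record marker
        print {$ofh} "$_\n" for @record, '\\';    # save the record plus its marker
        @record = ();                             # start a new record
    }
    else {
        push @record, $line;
    }
}
close $ofh;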
I believe what you want is to split the file's content at \, though it's not entirely clear.
To achieve this you can slurp the file into a variable by setting the input record separator, then split the content.
To find out about Perl's special variables related to filehandles, read perlvar.
#!perl
use strict;
use warnings;
my $content;
{
open my $fh, '<', 'test.txt';
local $/; # slurp mode
$content = <$fh>;
close $fh;
}
my @blocks = split /\\/, $content;
Make sure to localize modifications of Perl's special variables to not interfere with different parts of your program.
If you want to keep the separator you could set $/ to \ directly and skip split.
#!perl
use strict;
use warnings;
my @blocks;
{
open my $fh, '<', 'test.txt';
local $/ = '\\'; # separate at \
@blocks = <$fh>;
close $fh;
}
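Note that with $/ set to '\\', each element of @blocks still ends with the backslash; a chomp placed inside the same block (while $/ is still '\\') would strip it again if you decide you do not want it after all:
chomp(@blocks);   # removes the trailing backslash while $/ is still '\\'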
Here's a way to read your data into an array. As I said in a comment, "saving" this data to a file is pointless unless you change it, because if I were to print the @data array below to a file, it would look exactly like the input file.
So, you need to tell us what it is you want to accomplish before we can give you an answer about how to do it.
This script follows these rules (exactly):
Find a line that begins with "AA", and save that into $line.
Concatenate every new line from the file into $line.
When you find a line that begins with a backslash \, stop concatenating lines and save $line into @data.
Then, find the next line that begins with "AA" and start the loop over.
These matching regexes are pretty loose, as they will match AAARGH and \bonkers as well. If you need them stricter, you can try /^\\$/ and /^AA$/, but then you need to watch out for whitespace at the beginning and end of line. So perhaps /^\s*\\\s*$/ and /^\s*AA\s*$/ instead.
The code:
use warnings;
use strict;
my $line="";
my @data;
while (<DATA>) {
if (/^AA/) {
$line = $_;
while (<DATA>) {
$line .= $_;
last if /^\\/;
}
}
push @data, $line;
}
use Data::Dumper;
print Dumper \@data;
__DATA__
AA c0001
BB afsfjgfjgjgjflffbg
CC table
DD hhhfsegsksgk
EB jksgksjs
\
AA e0002
BB rejwkghewhgsejkhrj
CC chair
DD egrhjrhojohkhkhrkfs
VB rkgjehkrkhkh;r
\

Perl Text::CSV_XS Encoding Issues

I'm having issues with Unicode characters in Perl. When I receive data in from the web, I often get characters like “ or €. The first one is a quotation mark and the second is the Euro symbol.
Now I can easily substitute in the correct values in Perl and print to the screen the corrected words, but when I try to output to a .CSV file all the substitutions I have done are for nothing and I get garbage in my .CSV file. (The quotes work, guessing since it's such a general character). Also Numéro will give Numéro. The examples are endless.
I wrote a small program to try and figure this issue out, but am not sure what the problem is. I read on another stack overflow thread that you can import the .CSV in Excel and choose UTF8 encoding, this option does not pop up for me though. I'm wondering if I can just encode it into whatever Excel's native character set is (UTF16BE???), or if there is another solution. I have tried many variations on this short program, and let me say again that its just for testing out Unicode problems, not a part of a legit program. Thanks.
use strict;
use warnings;
require Text::CSV_XS;
use Encode qw/encode decode/;
my $text = 'Numéro Numéro Numéro Orkos Capital SAS (√¢¬Ä¬úOrkos√¢¬Ä¬ù) 325M√¢¬Ç¬¨ in 40 companies headquartered';
print("$text\n\n\n");
$text =~ s/“|”/"/sig;
$text =~ s/’s/'s/sig;
$text =~ s/√¢¬Ç¬¨/€/sig;
$text =~ s/√¢¬Ñ¬¢/®/sig;
$text =~ s/ / /sig;
print("$text\n\n\n");
my $CSV = Text::CSV_XS->new ({ binary => 1, eol => "\n" }) or die "Cannot use CSV: ".Text::CSV->error_diag();
open my $OUTPUT, ">:encoding(utf8)", "unicode.csv" or die "unicode.csv: $!";
my @row = ($text);
$CSV->print($OUTPUT, \@row);
$OUTPUT->autoflush(1);
I've also tried these two lines to no avail:
$text = decode("Guess", $text);
$text = encode("UTF-16BE", $text);
First, your strings are encoded in MacRoman. When you interpret them as byte sequences the second results in C3 A2 C2 82 C2 AC. This looks like UTF-8, and the decoded form is E2 82 AC. This again looks like UTF-8, and when you decode it you get €. So what you need to do is:
$step1 = decode("MacRoman", $text);
$step2 = decode("UTF-8", $step1);
$step3 = decode("UTF-8", $step2);
Don't ask me on which mysterious ways this encoding has been created in the first place. Your first character decodes as U+201C, which is indeed the LEFT DOUBLE QUOTATION MARK.
Note: If you are on a Mac, the first decoding step may be unnecessary since the encoding is only in the "presentation layer" (when you copied the Perl source into the HTML form and your browser did the encoding-translation for you) and not in the data itself.
So I figured out the answer; the comment from Roland Illig helped me get there (thanks again!). Decoding more than once causes the "wide character" errors, and therefore should not be done.
The key here is decoding the UTF-8 text and then encoding it as MacRoman. To send the .CSV files to my Windows friends I have to save them as .XLSX first so that the encoding doesn't get all screwy again.
$text =~ s/“|”/"/sig;
$text =~ s/’s/'s/sig;
$text =~ s/√¢¬Ç¬¨/€/sig;
$text =~ s/√¢¬Ñ¬¢/®/sig;
$text =~ s/ / /sig;
$text = decode("UTF-8", $text);
print("$text\n\n\n");
my $CSV = Text::CSV_XS->new ({ binary => 1, eol => "\n" }) or die "Cannot use CSV: ".Text::CSV->error_diag();
open my $OUTPUT, ">:encoding(MacRoman)", "unicode.csv" or die "unicode.csv: $!";
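The snippet presumably continues the same way as the test program above, e.g.:
my @row = ($text);
$CSV->print($OUTPUT, \@row);
close $OUTPUT;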