Perl Text Parsing - fixed delimeted structure is changing - perl

Perl Experts - My attempt to solve my problem is turning into a lot of code, which in PERL seems like I'm approaching this in-correctly. Here is my problem:
I have a block of text (example below) which can have variable amount of whitespace between the column data. I was using a simple split, but the problem now is that the column "code" now contains spaces in the data (I only accounted for that in the last column). What seems to be constant (although I don't have access to, or control of the source structure) is that there is a minimum of 3 spaces between columns (maybe more, but never less).
So, I'd like to say my column delimiter token is "3 spaces" and then trim the data within each to have my actual columnar data.
COL0 COL1 COL2 COL3 COL4 COL5
- 4 0.2 1 416489 463455 554
1 0.9 1 E1
0 3 1.4 14 E97-TEST 1
- 1 97.5 396 PASS Good
I'm just trying to get the values into 6 variables.
NOTE: COL0 may not have a value. COL4 may contain space in data. COL5 may contain no value, or data with space. All fixed formatting is done with spaces (no tabs or other special characters). To clarify -- the columns are NOT consistently sized. One file might have COL4 as 13 characters, another have COL4 with 21 characters wide. Or not strict as another SO member stated.

You'll need to figure out where the columns are. As a really quite disgusting hack, you can read the whole file in and then string-or the lines together:
my #file = <file>;
chomp #file;
my $t = "";
$t |= $_ foreach(#file);
$t will then contain space characters in columns only where there were always space characters in that column; other columns will contain binary junk. Now split it with a zero-width match that matches the non-space:
my #cols = split /(?=[^ ]+)/, $t;
We actually want the widths of the columns to generate an unpack() format:
#cols = map length, #cols;
my $format = join '', map "A$_", #cols;
Now process the file! :
foreach my $line (#file) {
my($field, $field2, ...) = unpack $format, $line;
your code here...
}
(This code has only been lightly tested.)

If you're dealing with strict columnar data like this, unpack is probably what you want:
#!perl
use strict;
use warnings;
use 5.010;
use Data::Dumper;
my $data = <<EOD;
COL0 COL1 COL2 COL3 COL4 COL5
- 4 0.2 1 416489 463455 554
1 0.9 1 E1
0 3 1.4 14 E97-TEST 1
- 1 97.5 396 PASS Good
EOD
my #lines = split '\n', $data;
for my $line ( #lines ) {
my #values = unpack("a5 A7 A7 A7 A13 A*", $line);
print Dumper \#values;
}
This appears to dump out your values into the #values array as you wish, but they'll have leading spaces that you'll have to trim off.

I would use two passes: in the first, find those character columns that have a space in each line; then, split or unpack with those indices. Whitespace trimming is done afterwards.
Your example:
COL0 COL1 COL2 COL3 COL4 COL5
- 4 0.2 1 416489 463455 554
1 0.9 1 E1
0 3 1.4 14 E97-TEST 1
- 1 97.5 396 PASS Good
000011100001110000111000011100000000001110000000000
The 1s in the last line show which columns are all spaces.

I know CanSpice already answered (possibly a much better solution), but you can set the input delimiter using "$/". This must be done in a local scope (probably a sub) as it is a global variable, or you may see side effects. Ex:
local $/ = " ";
$input = <DATAIN>; # assuming DATAIN is the file-handler
You can trim whitespace using a nice little regex. See Wikipedia for an example.

Related

Using perl to split file in flat text

I have a flat file that are created with offsets e.g. row 1: char 1 - 3 = ID, 4-19 = user name, 20 - 40 = last name, etc...
What's the best way to go about creating a perl script to read this? and is there any way to make it flexible based on different offset groups? Thank you!
If the positions/lengths are in terms of Unicode Code Points:
# Use :encoding(UTF-8) on the file handle.
my #fields = unpack('A3 A16 A21', $decoded_line);
If the positions/lengths are in terms of bytes:
use Encode qw( decode );
sub trim_end(_) { $_[0] =~ s/\s+\z//r }
# Use :raw on the file handle.
my #fields =
map trim_end(decode("UTF-8", $_)),
unpack('a3 a16 a21', $encoded_line);
In both cases, trailing whitespace is trimmed.

Shortest Perl solution for outputing 4 random words

I have this one-line Unix shell script
for i in 1 2 3 4; do sed "$(tr -dc '0-9' < /dev/urandom | fold -w 5 |
awk '$0>=35&&$0<=65570' | head -1)q;d" "$0"; done | perl -p00e
's/\n(?!\Z)/ /g'
The script has 65K words in it, one per line, from line 35 to 65570. The code and the data are in the same file.
This script outputs 4 space-separated random words from this list with a newline at the end. For example
first fourth third second
How can I make this one-liner much shorter with Perl, keeping the
tr -dc '0-9' < /dev/urandom
part?
Keeping it is important since it provides Cryptographically Secure Pseudo-Random Numbers (CSPRNs) for all Unix OSs. Of course, if Perl can get numbers from /dev/urandom then the tr can be replaced with Perl too, but the numbers from urandom need to stay.
For convenience, I shared the base script with 65K words
65kwords.txt
or
65kwords.txt
Please use only core modules. It would be used for generating "human memorable passwords".
Later, the (hashing) iteration count, where we would use this to store the passwords would be extremely high, so brute-force would be very slow, even with many many GPUs/FPGAs.
You mention needing a CSPRN, which makes this a non trivial exercise - if you need cryptographic randomness, then using built in stuff (like rand) is not a good choice, as the implementation is highly variable across platforms.
But you've got Rand::Urandom which looks like it does the trick:
By default it uses the getentropy() (only available in > Linux 3.17) and falls back to /dev/arandom then /dev/urandom.
#!/usr/bin/env perl
use strict;
use warnings;
use Rand::Urandom;
chomp ( my #words = <DATA> );
print $words[rand #words], " " for 1..4;
print "\n";
__DATA__
yarn
yard
wound
worst
worry
work
word
wool
wolf
wish
wise
wipe
winter
wing
wind
wife
whole
wheat
water
watch
walk
wake
voice
Failing that though - you can just read bytes from /dev/urandom directly:
#!/usr/bin/env perl
use strict;
use warnings;
my #number_of_words = 4;
chomp ( my #words = <DATA> );
open ( my $urandom, '<:raw', '/dev/urandom' ) or die $!;
my $bytes;
read ( $urandom, $bytes, 2 * $number_of_words ); #2 bytes 0 - 65535
#for testing
#unpack 'n' is n An unsigned short (16-bit)
# unpack 'n*' in a list context returns a list of these.
foreach my $value ( unpack ( "n*", $bytes ) ) {
print $value,"\n";
}
#actually print the words.
#note - this assumes that you have the right number in your list.
# you could add a % #words to the map, e.g. $words[$_ % #words]
#but that will mean wrapping occurs, and will alter the frequency distribution.
#a more robust solution would be to fetch additional bytes if the 'slot' is
#empty.
print join " ", ( map { $words[$_] } unpack ( "n*", $bytes )),"\n";
__DATA__
yarn
yard
wound
worst
#etc.
Note - the above relies on the fact that your wordlist is the same size as two bytes (16 bits) - if this assumption isn't true, you'll need to deal with 'missed' words. A crude approach would be to take a modulo, but that would mean some wrapping and therefore not quite truly even distribution of word picks. Otherwise you can bit-mask and reroll, as indicated below:
On a related point though - have you considered not using a wordlist, and instead using consonant-vowel-consonant groupings?
E.g.:
#!/usr/bin/env perl
use strict;
use warnings;
#uses /dev/urandom to fetch bytes.
#generates consonant-vowel-consonant groupings.
#each are 11.22 bits of entropy, meaning a 4-group is 45 bits.
#( 20 * 6 * 20 = 2400, which is 11.22 bits of entropy log2 2400
#log2(2400 ^ 4) = 44.91
#but because it's generated 'true random' it's a know entropy string.
my $num = 4;
my $format = "CVC";
my %letters = (
V => [qw ( a e i o u y )],
C => [ grep { not /[aeiouy]/ } "a" .. "z" ], );
my %bitmask_for;
foreach my $type ( keys %letters ) {
#find the next power of 2 for the number of 'letters' in the set.
#So - for the '20' letter group, that's 31. (0x1F)
#And for the 6 letter group that's 7. (0x07)
$bitmask_for{$type} = ( 2 << log ( #{$letters{$type}} ) / log 2 ) - 1 ;
}
open( my $urandom, '<:raw', '/dev/urandom' ) or die $!;
for ( 1 .. $num ) {
for my $type ( split //, $format ) {
my $value;
while ( not defined $value or $value >= #{ $letters{$type} } ) {
my $byte;
read( $urandom, $byte, 1 );
#byte is 0-255. Our key space is 20 or 6.
#So rather than modulo, which would lead to an uneven distribution,
#we just bitmask and discard and 'too high'.
$value = (unpack "C", $byte ) & $bitmask_for{$type};
}
print $letters{$type}[$value];
}
print " ";
}
print "\n";
close($urandom);
This generates 3 character CVC symbols, with a known entropy level (11.22 per 'group') for making reasonably robust passwords. (45 bits as opposed to the 64 bits of your original, although obviously you can add extra 'groups' to gain 11.22 bits per time).
This answer is not cryptographically safe!
I would do this completely in Perl. No need for a one-liner. Just grab your word-list and put it into a Perl program.
use strict;
use warnings;
my #words = qw(
first
second
third
fourth
);
print join( q{ }, map { $words[int rand #words] } 1 .. 4 ), "\n";
This grabs four random words from the list and outputs them.
rand #words evaluates #words in scalar context, which gives the number of elements, and creates a random floating point value between 0 and smaller than that number. int cuts off the decimals. This is used as the index to grab an element out of #words. We repeat this four times with the map statement, where the 1 .. 4 is the same as passing a list of (1, 2, 3, 4) into map as an argument. This argument is ignored, but instead our random word is picked. map returns a list, which we join on one space. Finally we print the resulting string, and a newline.
The word list is created with the quoted words qw() operator, which returns a list of quoted words. It's shorthand so you don't need to type all the quotes ' and commas ,.
If you'd want to have the word list at the bottom you could either put the qw() in a sub and call it at the top, or use a __DATA__ section and read from it like a filehandle.
The particular method using tr and fold on /dev/urandom is a lot less efficient than it could be, so let's fix it up a little bit, while keeping the /dev/urandom part.
Assuming that available memory is enough to contain your script (including wordlist):
chomp(#words = <DATA>);
open urandom, "/dev/urandom" or die;
read urandom, $randbytes, 4 * 2 or die;
print join(" ", map $words[$_], unpack "S*", $randbytes), "\n";
__DATA__
word
list
goes
here
This goes for brevity and simplicity without outright obfuscation — of course you could make it shorter by removing whitespace and such, but there's no reason to. It's self-contained and will work with several decades of perls (yes, those bareword filehandles are deliberate :-P)
It still expects exactly 65536 entries in the wordlist, because that way we don't have to worry about introducing bias to the random number choice using a modulus operator. A slightly more ambitious approach might be to read 48 bytes from urandom for each word, turning it into a floating-point value between 0 and 1 (portable to most systems) and multiplying it by the size of the word list, allowing for a word list of any reasonable size.
A lot of nonsense is talked about password strength, and I think you're overestimating the worth of several of your requirements here
I don't understand your preoccupation with making your code "much shorter with perl". (Why did you pick Perl?) Savings here can only really be useful to make the script quicker to read and compile, but they will be dwarfed by the half megabyte of data following the code which must also be read
In this context, the usefulness to a hacker of a poor random number generator depends on prior knowledge of the construction of the password together with the passwords that have been most recently generated. With a sample of only 65,000 words, even the worst random number generator will show insignificant correlation between successive passwords
In general, a password is more secure if it is longer, regardless of its contents. Forming a long password out of a sequence of English words is purely a way of making the sequence more memorable
"Of course later, the (hashing) iteration count ... would be extreme high, so brute-force [hacking?] would be very slow"
This doesn't follow at all. Cracking algorithms won't try to guess the four words you've chosen: they will see only a thirty-character (or so) string consisting only of lower-case letters and spaces, and whose origin is insignificant. It will be no more or less crackable than any other password of the same length with the same character set
I suggest that you should rethink your requirements and so make things easier for yourself. I don't find it hard to think of four English words, and don't need a program to do it for me. Hint: pilchard is a good one: they never guess that!
If you still insist, then I would write something like this in Perl. I've used only the first 18 lines of your data for
use strict;
use warnings 'all';
use List::Util 'shuffle';
my #s = map /\S+/g, ( shuffle( <DATA> ) )[ 0 .. 3 ];
print "#s\n";
__DATA__
yarn
yard
wound
worst
worry
work
word
wool
wolf
wish
wise
wipe
winter
wing
wind
wife
whole
wheat
output
wind wise winter yarn
You could use Data::Random::rand_words()
perl -MData::Random -E 'say join $/, Data::Random::rand_words(size => 4)'

Perl: Replace consecutive spaces in this given scenario?

an excerpt of a big binary file ($data) looks like this:
\n1ax943021C xxx\t2447\t5
\n1ax951605B yyy\t10400\t6
\n1ax919275 G2L zzz\t6845\t6
The first 25 characters contain an article number, filled with spaces. How can I convert all spaces between the article numbers and the next column into a \x09 ? Note the one or more spaces between different parts of the article number.
I tried a workaround, but that overwrites the article number with ".{25}xxx»"
$data =~ s/\n.{25}/\n.{25}xxx/g
Anyone able to help?
Thanks so much!
Gary
You can use unpack for fixed width data:
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Useqq=1;
print Dumper $_ for map join("\t", unpack("A25A*")), <DATA>;
__DATA__
1ax943021C xxx 2447 5
1ax951605B yyy 10400 6
1ax919275 G2L zzz 6845 6
Output:
$VAR1 = "1ax943021C\txxx\t2447\t5";
$VAR1 = "1ax951605B\tyyy\t10400\t6";
$VAR1 = "1ax919275 G2L\tzzz\t6845\t6";
Note that Data::Dumper's Useqq option prints whitecharacters in their escaped form.
Basically what I do here is take each line, unpack it, using 2 strings of space padded text (which removes all excess space), join those strings back together with tab and print them. Note also that this preserves the space inside the last string.
I interpret the question as there being a 25 character wide field that should have its trailing spaces stripped and then delimited by a tab character before the next field. Spaces within the article number should otherwise be preserved (like "1ax919275 G2L").
The following construct should do the trick:
$data =~ s/^(.{25})/{$t=$1;$t=~s! *$!\t!;$t}/emg;
That matches 25 characters from the beginning of each line in the data, then evaluates an expression for each article number by stripping its trailing spaces and appending a tab character.
Have a try with:
$data =~ s/ +/\t/g;
Not sure exactly what you what - this will match the two columns and print them out - with all the original spaces. Let me know the desired output and I will fix it for you...
#!/usr/bin/perl -w
use strict;
my #file = ('\n1ax943021C xxx\t2447\t5', '\n1ax951605B yyy\t10400\t6',
'\n1ax919275 G2L zzz\t6845\t6');
foreach (#file) {
my ($match1, $match2) = ($_ =~ /(\\n.{25})(.*)/);
print "$match1'[insertsomethinghere]'$match2\n";
}
Output:
\n1ax943021C '[insertsomethinghere]'xxx\t2447\t5
\n1ax951605B '[insertsomethinghere]'yyy\t10400\t6
\n1ax919275 G2L '[insertsomethinghere]'zzz\t6845\t6

perl add contents of a column of a file

Column A | Column B | Column C | Column D
35627799100 8 8 2
35627788000 60 34 45
35627799200 10 21 21
35627780000 60 5 8
Basically I have a file as shown above and would like to add the contents of Column B i.e 8+60+10+60. To be frank I'm not sure if need to remove the first line being text and if I can use the split function and put it in a hash something along the lines:
my %hash = map {split/\s+/,$_,4} <$file>;
Thanks in advance for the help.
If you just want to sum up the second column, a hash is overkill. You can do something like this and calculate the sum directly in the map.
my $sum;
$sum += (split /\s+/, $_)[1] while <$file>;
Edit: If you have header rows or other rows with non-numeric values in column 2, then as the comments below indicate, you will run into problems. You can avoid this by trading split for a regular expression, like so:
my $sum = 0;
while (<STDIN>)
{
$sum += $1 if $_ =~ /^\S+\s+(\d+)/;
}
If it's possible that column 1 has no text (ie. the line starts with a single blank and the first non-blank represents the second column), then change the first part of the pattern from ^\S+ to ^\S*.
This is an example based on your data:
use strict;
use warnings;
my $sum_column_b = 0;
<DATA>; #drop header
while( my $line = <DATA>) {
$line =~ m/\s+(\d+)/; #regexpr to catch second column values
$sum_column_b += $1;
}
print $sum_column_b, "\n"; #<-- prints: 138
__DATA__
Column A | Column B | Column C | Column D
35627799100 8 8 2
35627788000 60 34 45
35627799200 10 21 21
35627780000 60 5 8

Displaying formatted table

I'm trying to display output in table format but having some trouble.
I have a few variables that contain strings separated by spaces for example:
$var1 = "1 1003 33 40 9948";
$var2 = "2";
I want the table to look like this:
I want to display the table such that it looks like this:
Column1 Column2
======= =======
1 2
1003
33
40
9948
I want the contents of var1 to wrap down for each value.
I'm able to display the header no problem. I've been trying to use perl's format:
^|||||~~^|||||~~^|||||~~^|||||~~^|||||~~
$var1 $var2 $var3 $var4 $var4
.
...but it's not working well.
The number's don't line up correct and I've tried to padding so that I force them to wrap but they line up unevenly. The contents of the table are being displayed through a foreach loop:
Column1 Column2
======= =======
1 2
1003
33
40
9948
I hope I'm clear on what I'm trying to do!
In general, I haven't found Perl’s patterns to be terribly reliable. They’re really useful for a few things, but generally require more work than it seems like they should.
The previous answer shows a more standard right-justification (which makes sense for numbers), but you seem to want a centered justification. What I don't understand, however, is why you’re using ^||||| instead of #|||||. The former will attempt to wrap text on multiple lines if necessary; that’s not really what you want here. You should probably be using #||||| instead.
Furthermore, there’s no need to use
^|||||~~^|||||~~^|||||~~^|||||~~^|||||~~
because the first ~~ will allow the pattern to be repeated indefinitely.
That said, I think you would get a lot more control over the process by generating the text yourself rather than relying on Perl’s formats.
sub centerText {
my ($text, $width) = #_;
my $pad = ( length($text) < $width )
? ' ' x int(($width - length($text)) / 2)
: '';
return "$pad$text";
}
Then you’d just use:
printf "%-7s %-7s\n", 'Column1', 'Column2';
printf "%-7s %-7s\n", '=======', '=======';
printf "%-7s %-7s\n", centerText($val1[1], 7), centerText($val2[1], 7);
or somesuch.
That make any sense?