Cannot split line and save to variable using whitespace in Perl

I'm having some trouble with parsing a file.
Two lines in the file contain the word 'Mapped', and I would like to extract the number from each of those two lines.
This is my code:
my %cellHash = ();
my $mapped = 0;
my $alnPairs = 0;
my @mappedReads = ();
while (<ALIGN_SUMMARY>) {
chomp($_);
if (/Mapped/) {
print "\n$_\n";
$mapped = (split / /, $_)[2];
push(@mappedReads, $mapped);
}
if (/Aligned pairs/) {
print "\n$_\n";
$alnPairs = (split / /, $_)[4];
}
}
%{ $cellHash{$cellDir} } = (
'MappedR1' => $mappedReads[0] ,
'MappedR2' => $mappedReads[1] ,
'AlnPairs' => $alnPairs ,
);
foreach my $cellName ( keys %cellHash){
print OUTPUT $cellName,
"\t", ${ $cellHash{$cellName} }{"LibSize"},
"\t", ${ $cellHash{$cellName} }{"MappedR1"},
"\t", ${ $cellHash{$cellName} }{"MappedR2"},
"\t", ${ $cellHash{$cellName} }{"AlnPairs"},
"\n";
}
But the OUTPUT file only has the 'AlignedPairs' column and never anything in MappedR1 or MappedR2.
What am I doing wrong? Thanks!

When I look at the file, it looks like there is more than a single space. Here is an example of what I mean and what I did to extract the number.
my $test = "blah : 123455";
my @test_ary = split(/ /, $test);
print scalar(@test_ary) . "\n"; # Prints the size of the array
$number = $1 if $test =~ m/([0-9]+)/;
print "$number\n"; # Prints the extracted number
Output of run:
Size of array: 8
The extracted number: 123455
Hope this helps.

First off, paste in your actual input and output, not an image, if you want anyone to actually test something for you.
Second, you're not splitting on whitespace, you're splitting on a single literal space. Use the special case of
split ' ', $_;
to split on arbitrary length whitespace, discarding leading and trailing whitespace.
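The difference between the two forms can be seen in a small sketch. The sample line below is hypothetical (the real layout of the summary file is an assumption), and `fields` is just an illustrative helper name:

```perl
use strict;
use warnings;

# Return the fields of a line split the awk-style way: runs of whitespace
# separate fields and leading whitespace is ignored.
sub fields { return split ' ', $_[0] }

# Hypothetical summary line; the real file layout is an assumption.
my $line = "   Mapped   :   123455";

my @by_space = split / /, $line;   # every single space is a separator
my @by_ws    = fields($line);      # special case: split ' '

printf "split / /  gives %d fields\n", scalar @by_space;
printf "split ' '  gives %d fields\n", scalar @by_ws;
print  "third field: $by_ws[2]\n";
```

With the single-space pattern most of the fields are empty strings, so a fixed index such as [2] rarely lands on the number.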

Related

Creating a hashmap using perl split function

I am attempting to create a hashmap from a text file. The way the text file is set up is as follows.
(integer)<-- varying white space --> (string value)
. . .
. . .
. . .
(integer)<-- varying white space --> (string value)
eg:
5 this is a test
23 this is another test
123 this is the final test
What I want to do is assign the key to the integer, and then the entire string following to the value. I was trying something along the lines of
%myHashMap;
while(my $info = <$fh>){
chomp($info);
my ($int, $string) = split/ /,$info;
$myHashMap{$int} = $string;
}
This doesn't work though because I have spaces in the string. Is there a way to clear the initial white space, grab the integer, assign it to $int, then clear white space till you get to the string, then take the remainder of the text on that line and place it in my $string value?
You could replace
split / /, $info # Fields are separated by a space.
with
split / +/, $info # Fields are separated by spaces.
or the more general
split /\s+/, $info # Fields are separated by whitespace.
but you'd still face the problem of the leading spaces. To ignore those, use
split ' ', $info
This special case splits on whitespace, ignoring leading whitespace.
Don't forget to tell Perl that you expect at most two fields!
$ perl -E'say "[$_]" for split(" ", " 1 abc def ghi", 2)'
[1]
[abc def ghi]
The other option would be to use the following:
$info =~ /^\s*(\S+)\s+(\S.*)/
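As a rough sketch of how that regex could be used to fill the hash, assuming lines that do not match should simply be skipped (`parse_line` is a name invented for this example):

```perl
use strict;
use warnings;

# Parse one "integer, whitespace, rest of line" record.
# Returns (key, value), or an empty list if the line does not match.
sub parse_line {
    my ($info) = @_;
    return $info =~ /^\s*(\S+)\s+(\S.*)/ ? ($1, $2) : ();
}

my %myHashMap;
for my $info ("  5   this is a test", "23 this is another test", "oops") {
    # A list assignment returns the number of values assigned, so an
    # empty result skips the line.
    my ($int, $string) = parse_line($info) or next;
    $myHashMap{$int} = $string;
}
print "$_ => $myHashMap{$_}\n" for sort { $a <=> $b } keys %myHashMap;
```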
You just need to split each line of text on whitespace into two fields.
This example program assumes that the input file is passed as a parameter on the command line. I have used Data::Dump only to show the resulting hash structure.
use strict;
use warnings 'all';
my %data;
while ( <> ) {
s/\s*\z//;
my ($key, $val) = split ' ', $_, 2;
next unless defined $val; # Ensure that there were two fields
$data{$key} = $val;
}
use Data::Dump;
dd \%data;
output
{
5 => "this is a test",
23 => "this is another test",
123 => "this is the final test",
}
First, strip the leading whitespace:
$info =~ s/^\s+//g;
Second, there are two or more spaces between the integer and the string, so split on a pattern that requires at least two spaces:
split / {2,}/, $info;
The complete code is:
use strict;
use warnings;
my %myHashMap;
while(my $info = <$fh>){
chomp($info);
$info =~ s/^\s+//g;
my ($int, $string) = split / {2,}/, $info;
$myHashMap{$int} = $string;
}

Efficient way to read columns in a file using Perl

I have an input file like so, separated by newline characters.
AAA
BBB
BBA
What would be the most efficient way to count the columns (vertically), first with first, second with second etc etc.
Sample OUTPUT:
ABB
ABB
ABA
I have been using the following, but am unable to figure out how to remove the scalar context from it. Any hints are appreciated:
while (<@seq_prot>){
chomp;
my @sequence = map substr (@seq_prot, 1, 1), $start .. $end;
@sequence = split;
}
My idea was to use the substring to get the first letter of the input (A in this case), and it would cycle for all the other letters (The second A and B). Then I would increment the cycle number + 1 so as to get the next line, until I reached the end. Of course I can't seem to get the first part going, so any help is greatly appreciated, am stumped on this one.
Basically, you're trying to transpose an array.
This can be done easily using Array::Transpose
use warnings;
use strict;
use Array::Transpose;
die "Usage: $0 filename\n" if @ARGV != 1;
for (transpose([map {chomp; [split //]} <>])) {
print join("", map {$_ // " "} @$_), "\n"
}
For an input file:
ABCDEFGHIJKLMNOPQRS
12345678901234
abcdefghijklmnopq
ZYX
Will output:
A1aZ
B2bY
C3cX
D4d
E5e
F6f
G7g
H8h
I9i
J0j
K1k
L2l
M3m
N4n
O o
P p
Q q
R
S
You'll have to read in the file once for each column, or store the information and go through the data structure later.
I was originally thinking in terms of arrays of arrays, but I don't want to get into References.
I'm going to make the assumption that each line is the same length. Makes it simpler that way. We can use split to split your line into individual letters:
my $line = "ABC";
my @split_line = split //, $line;
This will give us:
$split_line[0] = "A";
$split_line[1] = "B";
$split_line[2] = "C";
What if we now took each letter and appended it to the matching element of @vertical_array?
my @vertical_array;
for my $index ( 0..$#split_line ) {
$vertical_array[$index] .= $split_line[$index];
}
Now let's do this with the next line:
$line = "123";
@split_line = split //, $line;
for my $index ( 0..$#split_line ) {
$vertical_array[$index] .= $split_line[$index];
}
This will give us:
$vertical_array[0] = "A1";
$vertical_array[1] = "B2";
$vertical_array[2] = "C3";
As you can see, I'm building @vertical_array with each iteration:
use strict;
use warnings;
use autodie;
use feature qw(say);
my @vertical_array;
while ( my $line = <DATA> ) {
chomp $line;
my @split_line = split //, $line;
for my $index ( 0..$#split_line ) {
$vertical_array[$index] .= $split_line[$index];
}
}
#
# Print out your vertical lines
#
for my $line ( @vertical_array ) {
say $line;
}
__DATA__
ABC
123
XYZ
BOY
FOO
BAR
This prints out:
A1XBFB
B2YOOA
C3ZYOR
If I had used references, I could probably have built an array of arrays and then flipped it. That's probably more efficient, but more complex. However, that may be better at handling lines of different lengths.
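For comparison, here is a minimal sketch of the array-of-arrays variant mentioned above. Padding short lines with a space is a choice made for this example, not something the answer specifies, and `transpose_lines` is an invented name:

```perl
use strict;
use warnings;

# Transpose a list of strings column by column using an array of arrays.
# Lines shorter than the longest one are padded with a space (an assumption).
sub transpose_lines {
    my @rows = map { [ split //, $_ ] } @_;    # one array reference per line
    my $width = 0;
    for my $row (@rows) {
        $width = @$row if @$row > $width;      # longest line sets the column count
    }
    my @columns;
    for my $i ( 0 .. $width - 1 ) {
        push @columns, join '', map { $_->[$i] // ' ' } @rows;
    }
    return @columns;
}

print "$_\n" for transpose_lines(qw(ABC 123 XY));
```

Because every row is stored before any column is built, lines of different lengths are handled without special cases, at the cost of holding the whole input in memory.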

How to skip splitting for some part of the line

Say I have a line lead=george wife=jane "his boy"=elroy. I want to split on spaces, but the "his boy" part should not be split; it should be treated as one field.
A normal split also breaks up "his boy", taking "his" as one part and "boy" as another. How can I avoid this?
I first tried
split " ", $_
and then found that this works:
use strict; use warnings;
my $string = q(hi my name is 'john doe');
my @parts = $string =~ /'.*?'|\S+/g;
print map { "$_\n" } @parts;
But it does not look good. Is there something simpler using split itself?
You could use Text::ParseWords for this:
use Text::ParseWords;
$list = "lead=george wife=jane \"his boy\"=elroy";
@words = quotewords('\s+', 0, $list);
$i = 0;
foreach (@words) {
print "$i: <$_>\n";
$i++;
}
Output:
0: <lead=george>
1: <wife=jane>
2: <his boy=elroy>
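Building on the quotewords call above, the words can be turned into a hash with one more split. This is only a sketch; `parse_pairs` is a hypothetical helper, and it assumes every word contains an = sign:

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Split a line into key=value pairs, treating a quoted key as one unit.
sub parse_pairs {
    my ($line) = @_;
    my %pairs;
    for my $word ( quotewords('\s+', 0, $line) ) {
        # Limit of 2 so that a value containing '=' stays intact.
        my ($key, $value) = split /=/, $word, 2;
        $pairs{$key} = $value;
    }
    return %pairs;
}

my %pairs = parse_pairs('lead=george wife=jane "his boy"=elroy');
print "$_ => $pairs{$_}\n" for sort keys %pairs;
```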
sub split_space {
my ( $text ) = @_;
while (
$text =~ m/
( # group ($1)
\"([^\"]+)\" # first try find something in quotes ($2)
|
(\S+?) # else minimal non-whitespace run ($3)
)
=
(\S+) # then maximum non-whitespace run ($4)
/xg
) {
my $key = defined($2) ? $2 : $3;
my $value = $4;
print( "key=$key; value=$value\n" );
}
}
split_space( 'lead=george wife=jane "his boy"=elroy' );
Outputs:
key=lead; value=george
key=wife; value=jane
key=his boy; value=elroy
PP posted a good solution, but just to show that there is another way to do it, here is mine:
my $string = q~lead=george wife=jane "his boy"=elroy~;
my @split = split / (?=")/,$string;
my @split2;
foreach my $sp (@split) {
if ($sp !~ /"/) {
push @split2, $_ foreach split / /, $sp;
} else {
push @split2,$sp;
}
}
use Data::Dumper;
print Dumper @split2;
Output:
$VAR1 = 'lead=george';
$VAR2 = 'wife=jane';
$VAR3 = '"his boy"=elroy';
I use a lookahead here to first split off the parts whose keys are inside quotes " ". After that, I loop through the resulting array and split all the other parts, which are normal key=value pairs.
You can get the required result using a single regexp that extracts the keys and the values and puts the result into a hash.
(\w+|"[\w ]+") will match both a single-word and a multi-word key.
The regexp captures only the key and the value, so the result of the match operation is a list with the following content: key #1, value #1, key #2, value #2, etc.
The hash is automatically populated with the appropriate keys and values when the match result is assigned to it.
Here is the code:
my $str = 'lead=george wife=jane "hello boy"=bye hello=world';
my %hash = ($str =~ m/(?:(\w+|"[\w ]+")=(\w+)(?:\s|$))/g);
## outputs the hash content
foreach $key (keys %hash) {
print "$key => $hash{$key}\n";
}
and here is the output of this script
lead => george
wife => jane
hello => world
"hello boy" => bye

perl sort question

I have some huge log files I need to sort. All entries have a 32 bit hex number which is the sort key I want to use.
some entries are one liners like
bla bla bla 0x97860afa bla bla
others are a bit more complex, start with the same type of line above and expand to a block of lines marked by curly brackets like the example below. In this case the entire block has to move to the position defined by the hex nbr. Block example-
bla bla bla 0x97860afc bla bla
bla bla {
blabla
bla bla {
bla
}
}
I can probably figure it out but maybe there is a simple perl or awk solution that will save me 1/2 day.
Transferring comments from OP:
Indentation can be space or tab, I can enhance that on any proposed solution, I think that Brian summarizes well: Specifically, do you want to sort "items" which are defined as a chunk of text that starts with a line containing a "0xNNNNNNNN", and contains everything up to (but not including) the next line which contains a "0xNNNNNNNN" (where the N's change, of course). No lines interspersed.
Something like this might work (Not tested):
my $line;
my $lastkey;
my %data;
while($line = <>) {
chomp $line;
if ($line =~ /\b(0x\p{AHex}{8})\b/) {
# Begin a new entry
my $unique_key = $1 . $.; # cred to Brian Gerard for uniqueness
$data{$unique_key} = $line;
$lastkey = $unique_key;
} else {
# Continue an old entry
$data{$lastkey} .= $line;
}
}
# Keys are strings such as "0x97860afa12", so compare them as strings
print $data{$_}, "\n" for (sort { $a cmp $b } keys %data);
The problem is that you said "huge" log files, so storing the file in memory will probably be inefficient. However, if you want to sort it, I suspect you're going to need to do that.
If storing in memory is not an option, you can always just print the data to a file instead, with a format that will allow you to sort it by some other means.
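For files that do fit in memory, the grouping described above can be sketched as follows. This variant keeps the newlines of each block and sorts on the numeric value of the key, with arrival order as a tie-breaker; those choices, and the name `sort_records`, are assumptions of this example rather than part of the answer:

```perl
use strict;
use warnings;

# Group lines into records: a record starts at a line containing 0xNNNNNNNN
# and runs up to the next such line. Returns the text of the records in
# sorted order. Lines before the first key are dropped.
sub sort_records {
    my @lines = @_;
    my @records;    # each element: [ numeric key, arrival order, text ]
    for my $line (@lines) {
        if ( $line =~ /\b0x([0-9A-Fa-f]{8})\b/ ) {
            push @records, [ hex($1), scalar(@records), $line ];
        }
        elsif (@records) {
            $records[-1][2] .= $line;    # continuation line, newline kept
        }
    }
    return map { $_->[2] }
           sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @records;
}

my @input = (
    "bla 0x97860afc bla\n", "bla {\n", "  blabla\n", "}\n",
    "bla 0x97860afa bla\n",
);
print sort_records(@input);
```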
For Huge data files, I'd recommend Sort::External.
It doesn't look like you need to parse the brackets, if the indentation does the job. Then you have to do it on "breaks", or when the indentation level 0, then you process the last record gathered, so you always look ahead one line.
So:
sub to_sort_form {
my $buffer = $_[0];
my ( $id ) = $buffer =~ m/(0x\p{AHex}{8})/; # grab the first candidate
$_[0] = ''; # reset the caller's buffer (a statement after return would never run)
return "$id-:-$buffer";
}
sub to_source {
my $val = shift;
my ( $record ) = $val =~ m/-:-(.*)/;
$record =~ s/\$--\^/\n/g;
return $record;
}
my $sortex = Sort::External->new(
mem_threshold => 1024**2 * 16 # default: 1024**2 * 8 (8 MiB)
, cache_size => 100_000 # default: undef (disabled)
, sortsub => sub { $Sort::External::a cmp $Sort::External::b }
, working_dir => $temp_directory # default: see below
);
my $id;
my $buffer = <>;
chomp $buffer;
$buffer .= '$--^'; # mark the end of the first line like every later line
while ( <> ) {
my ( $indent ) = m/^(\s*)\S/;
unless ( length $indent ) {
$sortex->feed( to_sort_form( $buffer ));
}
chomp;
$buffer .= $_ . '$--^';
}
$sortex->feed( to_sort_form( $buffer ));
$sortex->finish;
while ( defined( $_ = $sortex->fetch ) ) {
print to_source( $_ );
}
Assumptions:
The string '$--^' does not appear in the data on its own.
That you're not alarmed about two 8-hex-digit strings in one record.
If the files are not too big for memory, I would go with TLP's solution. If they are, you can modify it just a bit and print to a file as he suggests. Add this before the while (all untested, ymmv, caveat programmer, etc):
my $currentInFile = "";
my $currentOutFileHandle = "";
And change the body of the while from the current if-else to
if ($currentInFile ne $ARGV) { # $ARGV holds the name of the file <> is currently reading
if (fileno($currentOutFileHandle)) {
if (!close($currentOutFileHandle)) {
# whatever you want to do if you can't close the previous output file
}
}
my $newOutFile = $ARGV . ".tagged";
if (!open($currentOutFileHandle, ">", $newOutFile)) {
# whatever you want to do if you can't open a new output file for writing
}
$currentInFile = $ARGV;
}
if (...conditional from TLP...) {
# add more zeroes if the files really are that large :)
$lastkey = $1 . " " . sprintf("%0.10d", $.);
}
if (fileno($currentOutFileHandle)) {
print $currentOutFileHandle $lastkey . "\t" . $line . "\n"; # $line was chomped, so restore the newline
}
else {
# whatever you want to do if $currentOutFileHandle's gone screwy
}
Now you'll have a foo.log.tagged for each foo.log you fed it; the .tagged file contains exactly the contents of the original, but with "0xNNNNNNNN LLLLLLLLLL\t" (LLLLLLLLLL -> zero-padded line number) prepended to each line. sort(1) actually does a pretty good job of handling large data, though you'll want to look at the --temporary-directory argument if you think it will overflow /tmp with its temp files while chewing through the stuff you feed it. Something like this should get you started:
sort --output=/my/new/really.big.file --temporary-directory=/scratch/dir/on/roomy/partition *.tagged
Then trim away the tags if desired:
perl -pi -e 's/^[^\t]+\t//' /my/new/really.big.file
FWIW, I padded the line numbers to keep from having to worry about such things as line 10 sorting before line 2 if their hex keys were identical - since the hex numbers are the primary sort criterion, we can't just sort numerically.
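The effect of the padding is easy to check with a string sort. In this sketch the helper names `tag_plain` and `tag_padded` are made up for illustration; the width of 10 digits is the one used above:

```perl
use strict;
use warnings;

# Tag with the raw line number, and with a zero-padded one.
sub tag_plain  { my ($key, $n) = @_; return "$key $n" }
sub tag_padded { my ($key, $n) = @_; return sprintf '%s %010d', $key, $n }

my (@plain, @padded);
for my $n (2, 10) {
    push @plain,  tag_plain('0x97860afa', $n);
    push @padded, tag_padded('0x97860afa', $n);
}
@plain  = sort @plain;     # string sort: line 10 comes before line 2
@padded = sort @padded;    # string sort: line 2 comes before line 10

print "$_\n" for @plain, @padded;
```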
One way (untested)
perl -wne'BEGIN{ $key = " " x 10 }' \
-e '$key = $1 if /(0x[0-9a-f]{8})/;' \
-e 'printf "%s%.10d%s", $key, $., $_' \
inputfile \
| LC_ALL=C sort \
| perl -wpe'substr($_,0,20,"")'
The solution from TLP worked nicely with some minor tweaks. Collapsing each entry onto one line before sorting was a good idea; next I have to add a post-parsing step to restore the code blocks that got collapsed, but that is easy. Below is the final tested version. Thank you all, Stack Overflow is awesome.
#!/usr/bin/perl -w
my $line;
my $lastkey;
my %data;
while($line = <>) {
chomp $line;
if ($line =~ /\b(0x\p{AHex}{8})\b/) {
# Begin a new entry
#my $unique_key = $1 . $.; # cred to Brian Gerard for uniqueness
my $unique_key = hex($1);
$data{$unique_key} = $line;
$lastkey = $unique_key;
} else {
# Continue an old entry
$data{$lastkey} .= $line;
}
}
print $data{$_}, "\n" for (sort { $a <=> $b } keys %data);

How can I combine Perl's split command with white space trimming?

Repost from Perlmonks for a coworker:
I wrote a Perl script to separate long lists of email names separated by semicolons. What I would like to do is combine the split with the trimming of white space so I don't need two arrays. Is there a way to trim while loading the first array? Output is a sorted list of names.
Thanks.
#!/pw/prod/svr4/bin/perl
use warnings;
use strict;
my $file_data =
'Builder, Bob ;Stein, Franklin MSW; Boop, Elizabeth PHD Cc: Bear, Izzy';
my @email_list;
$file_data =~ s/CC:/;/ig;
$file_data =~ s/PHD//ig;
$file_data =~ s/MSW//ig;
my @tmp_data = split( /;/, $file_data );
foreach my $entry (@tmp_data) {
$entry =~ s/^[ \t]+|[ \t]+$//g;
push( @email_list, $entry );
}
foreach my $name ( sort(@email_list) ) {
print "$name \n";
}
You don't have to do both operations in one go using the same function. Sometimes performing the actions separately can be more clear. That is, split first, then strip the whitespace off of each element (and then sort the result):
@email_list =
sort(
map {
s/\s*(\S+)\s*/$1/; $_
}
split ';', $file_data
);
EDIT: Stripping more than one part of a string at the same time can lead to pitfalls, e.g. Sinan's point below about leaving trailing spaces in the "Elizabeth" portion. I coded that snippet with the assumption that the name would not have internal whitespace, which is actually quite wrong and would have stood out as incorrect if I had consciously noticed it. The code is much improved (and also more readable) below:
@email_list =
sort(
map {
s/^\s+//; # strip leading spaces
s/\s+$//; # strip trailing spaces
$_ # return the modified string
}
split ';', $file_data
);
If you don't need to trim the first and final element, this will do the trick:
@email_list = split /\s*;\s*/, $file_data;
If you do need to trim the first and final element, trim $file_data first, then repeat as above. :-P
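Put together, the "trim first, then split" idea might look like the sketch below. The sample string and the helper name `split_trimmed` are made up for illustration:

```perl
use strict;
use warnings;

# Trim the whole string once, then let the split pattern absorb the
# whitespace around each semicolon.
sub split_trimmed {
    my ($data) = @_;
    $data =~ s/^\s+|\s+$//g;
    return split /\s*;\s*/, $data;
}

my $file_data = '  Builder, Bob ;Stein, Franklin;   Boop, Elizabeth ; Bear, Izzy  ';
my @email_list = split_trimmed($file_data);
print "[$_]\n" for sort @email_list;
```

The result is stored in an array before sorting because `sort NAME LIST` would treat a bare function name as the comparison routine.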
Well, you can do what Chris suggested, but it doesn't handle leading and trailing spaces in $file_data.
You can add handling of these like this:
$file_data =~ s/\A\s+|\s+\z//g;
Also, please note that using 2nd array was not necessary. Check this:
my $file_data = 'Builder, Bob ;Stein, Franklin MSW; Boop, Elizabeth PHD Cc: Bear, Izzy';
my @email_list;
$file_data =~ s/CC:/;/ig;
$file_data =~ s/PHD//ig;
$file_data =~ s/MSW//ig;
my @tmp_data = split( /;/, $file_data );
foreach my $entry (@tmp_data) {
$entry =~ s/^[ \t]+|[ \t]+$//g;
}
foreach my $name ( sort(@tmp_data) ) {
print "$name \n";
}
my @email_list = map { s/^[ \t]+|[ \t]+$//g; $_ } split /;/, $file_data;
or the more elegant:
use Algorithm::Loops "Filter";
my @email_list = Filter { s/^[ \t]+|[ \t]+$//g } split /;/, $file_data;
See How do I strip blank space from the beginning/end of a string? in the FAQ.
@email_list = sort map {
s/^\s+//; s/\s+$//; $_
} split ';', $file_data;
Now, note also that a for loop aliases each element of an array, so
@email_list = sort split ';', $file_data;
for (@email_list) {
s/^\s+//;
s/\s+$//;
}
would also work.
My turn:
my @fields = grep { $_ } split m/\s*(?:;|^|$)\s*/, $record;
This also trims the first and last elements. If grep is overkill for getting rid of the first element:
my ( undef, @fields ) = split m/\s*(?:;|^|$)\s*/, $record;
works if you know that there is a space, but that's not likely, so
my @fields = split m/\s*(?:;|^|$)\s*/, $record;
shift @fields unless $fields[0];
is the most sure way to do it.
Barring some minor syntax error, this should do the whole job for you. Oh, list operations, how beautiful you are!
print join (" \n", sort map { s/^[ \t]+|[ \t]+$//g; $_ } split (/;/, $file_data));