Read chunks of data in Perl - perl

What is a good way in Perl to split a line into pieces of varying length when there is no delimiter I can use? My data is organized by column length, so the first variable is in positions 1-4, the second variable is in positions 5-15, etc. There are many variables, each with a different length.
Put another way, is there some way to use the split function based on the position in the string, not a matched expression?
Thanks.

Yes there is. The unpack function is well-suited to dealing with fixed-width records.
Example
my $record = "1234ABCDEFGHIJK";
my @fields = unpack 'A4A11', $record; # 1st field is 4 chars long, 2nd is 11
print "@fields"; # Prints '1234 ABCDEFGHIJK'
The first argument is the template, which tells unpack where the fields begin and end. The second argument tells it which string to unpack.
unpack can also be told to skip character positions in a string with x, which skips one byte forward (in pack, the same code writes a null byte). The template 'A4x2A9' could be used to ignore the "AB" in the example above.
See perldoc -f pack and perldoc perlpacktut for in-depth details and examples.
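As a minimal sketch of the skipping template described above, using the same sample record (the variable names $id and $rest are made up for illustration):

```perl
use strict;
use warnings;

# Same 15-character sample record as in the example above.
my $record = "1234ABCDEFGHIJK";

# A4 = 4-character field, x2 = skip 2 bytes ("AB"), A9 = 9-character field.
my ($id, $rest) = unpack 'A4x2A9', $record;

print "$id $rest\n";   # 1234 CDEFGHIJK
```

Note that A strips trailing whitespace from each field when unpacking; use a instead if the padding has to be preserved.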

Instead of using split, try the old-school substr method:
my $first = substr($input, 0, 4);
my $second = substr($input, 4, 11); # offset 4, length 11 (positions 5-15)
# etc...
(I like the unpack method too, but substr is easier to write without consulting the documentation, if you're only parsing out a few fields.)

You could use the substr() function to extract data by offset:
$first = substr($line, 0, 4);
$second = substr($line, 4, 11);
Another option is to use a regular expression:
($first, $second) = ($line =~ /(.{4})(.{11})/);
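Since the question gives 1-based, inclusive positions (1-4, 5-15) while substr wants a 0-based offset and a length, it may help to keep the layout in a table and do the conversion in one place. A rough sketch, where the @layout structure and the field names are assumptions for illustration:

```perl
use strict;
use warnings;

# Hypothetical layout: [name, first position, last position], 1-based and inclusive.
my @layout = (
    [ first  => 1, 4  ],
    [ second => 5, 15 ],
);

my $line = "1234ABCDEFGHIJK";
my %field;
for my $spec (@layout) {
    my ($name, $start, $end) = @$spec;
    # offset = start - 1, length = end - start + 1
    $field{$name} = substr $line, $start - 1, $end - $start + 1;
}

print "$field{first} $field{second}\n";   # 1234 ABCDEFGHIJK
```

With many fields, adding a column then means adding one row to the table rather than another hand-computed substr call.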

Related

Replace binary form 0->1 and 1->0 value - perl

In my script I am dealing with a binary value, and at one point I need to replace 0 with 1 and 1 with 0.
Example:
input digit = 10101001
output digit = 01010110
I tried $string =~ s/1/0/; and the reverse function, but neither gives me the correct output.
Can someone help me out?
Use tr:
my $str = '10101001';
$str =~ tr/01/10/;
print "$str\n";
Outputs:
01010110
If your input string has only those two possibilities 0 and 1, then you can use substitution in a multi-stage approach:
$str =~ s/1/x/g; # all 1's to x
$str =~ s/0/1/g; # all 0's to 1
$str =~ s/x/0/g; # all x's to 0
This is not a bad option for languages that only provide substitutions, but Perl also has an atomic translation feature:
$str =~ tr/01/10/;
which will work just as well (better, really, since it's less code and probably fewer passes over the data).
You could also go mathy on this and use the bitwise XOR operator ^...
my $input = '10101001';
my $binval = oct( '0b'.$input );
my $result = $binval ^ 0b11111111;
printf "%08b\n", $result;
...which will also give you 01010110.
This of course has the downside of being dependent on the length of the bit input string. The given solution only works for 8-bit values. It wouldn't be hard to generalize for any number of bits, though.
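One way to generalize the XOR approach is to build the mask from the length of the input string. This is only a sketch: flip_bits is a made-up name, and it assumes the string is shorter than Perl's native integer width (so under 64 bits on a typical 64-bit build).

```perl
use strict;
use warnings;

# Hypothetical helper: flip every bit of a binary string.
# Assumes the string has fewer bits than the native integer width.
sub flip_bits {
    my ($bits) = @_;
    my $width = length $bits;
    my $mask  = (1 << $width) - 1;     # $width one-bits, e.g. 0b11111111 for 8
    my $value = oct("0b$bits");
    # Zero-pad to the original width so leading zeros survive.
    return sprintf "%0${width}b", $value ^ $mask;
}

print flip_bits('10101001'), "\n";   # 01010110
print flip_bits('1100'), "\n";       # 0011
```

For longer strings, tr/01/10/ has no such limit, since it works on characters instead of integers.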
To incorporate Lưu Vĩnh Phúc's comment - you can also use the bitwise NOT operator ~. Again, the implementation is dependent on the number of bits as you need to truncate the result:
my $input = '10101001';
my $binval = oct( '0b'.$input );
print substr( sprintf ( '%b', ~$binval ), -8 )."\n";

Perl use variable as a file name

How can I create a file whose name contains variables with underscores between them? I need to create a file with a name like $variable1_$variable2_$variable3.txt
@values=split(/\./, $line);
my $fpga_name=$values[0];
my $block_name=$values[1];
my $mem_name=$values[2];
my $memfilename="mem_init/$fpga_name_$block_name_$mem_name.txt";
open(WRITE_MEM_FILE, ">>memfilename");
print WRITE_MEM_FILE "$line \n";
You can simply wrap all of the variables in curly braces:
my $memfilename="mem_init/${fpga_name}_${block_name}_${mem_name}.txt";
Keep in mind you need a $ before memfilename in your open statement, otherwise you will just get the literal string:
open(WRITE_MEM_FILE, ">>$memfilename");
The question is whether you need the intermediate array, and the three extra variables. If not, you can write the whole thing as:
my $memfilename = sprintf(
'%s_%s_%s.txt',
split(/[.]/, $line, 3), # whether you want 3 here depends on your input
);
If you do need the three intermediate variables, you can still skip the creation of the @values array and write something more legible than interpolating three variables into a string:
my ($fpga_name, $block_name, $mem_name) = split /[.]/, $line, 3;
my $memfilename = sprintf '%s_%s_%s.txt', $fpga_name, $block_name, $mem_name;
Using sprintf yields code that is much more readable than interpolating three variables, the braces, the underscores, the sigils etc.
Alternatively, you could also use:
my $memfilename = sprintf '%s.txt', join('_', split /[.]/, $line, 3);
Again, whether you want the third argument to split depends on your input.
Finally, if you find yourself doing this in more than one place, it would help to put it in a function
sub line_to_memfilename {
my $line = shift;
# ...
return $memfilename;
}
so if the format ever changes, you only need to make the change in one place.
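For illustration, the elided body could be filled in along these lines. This is a sketch that assumes each line holds three dot-separated fields and that the file belongs under the mem_init/ directory from the question; the sample input is made up.

```perl
use strict;
use warnings;

# Turn a line such as "fpga.block.mem" into "mem_init/fpga_block_mem.txt".
# Assumes three dot-separated fields; mem_init/ is the directory from the question.
sub line_to_memfilename {
    my $line = shift;
    chomp $line;
    my ($fpga_name, $block_name, $mem_name) = split /[.]/, $line, 3;
    return sprintf 'mem_init/%s_%s_%s.txt', $fpga_name, $block_name, $mem_name;
}

# Made-up sample input.
print line_to_memfilename("fpga0.blockA.mem3\n"), "\n";   # mem_init/fpga0_blockA_mem3.txt
```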
Indicate where the variable names begin & end by writing ${varname}:
my $memfilename="mem_init/${fpga_name}_${block_name}_${mem_name}.txt";

convert ASCII code to number

I have one file that contains 12 columns (a BAM file); the eleventh column contains ASCII codes. A single file holds more than one ASCII code. I would like to convert them to numbers.
I think it would be something like this code:
perl -e '$a=ord("ALL_ASCII_CODES_FROM-FILE"); print "$a\t"'
And I would like to write a for loop that reads all the ASCII codes in the eleventh column, converts them to numbers, and adds the results up into one number.
You need to split the string into individual characters, loop over every character, and call ord in the loop.
my @codes = map ord, split //, $str;
say join '.', map { sprintf("%02X", $_) } @codes;
Conveniently, unpack 'C*' does all of that.
my @codes = unpack 'C*', $str;
say join '.', map { sprintf("%02X", $_) } @codes;
If you do intend to print it out in hex, you can use the v modifier in a printf.
say sprintf("%v02X", $str);
The natural tool to convert a string of characters into a list of the corresponding ASCII codes would be unpack:
my @codes = unpack "C*", $string;
In particular, assuming that you're parsing the QUAL column of a SAM file (or, more generally, any FASTQ-style quality string), I believe the correct conversion would be:
my @qual = map {$_ - 33} unpack "C*", $string;
Ps. From your mention of "columns", I'm assuming you're actually parsing a SAM file, not a BAM file. If I'm reading the spec correctly, the BAM format doesn't seem to use the +33 offset for quality values, so if you are parsing BAM files, you'd simply use the first example above for that.
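To get from there to the single number the question asks for, something like the following might do. It is a sketch under several assumptions: the input is a tab-separated SAM line, QUAL is the eleventh column, "one number" means the sum of the per-base qualities, and both qual_sum and the sample line are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: sum the Phred qualities in the QUAL column of one SAM line.
# Assumes tab-separated columns with QUAL in column 11 and the +33 offset.
sub qual_sum {
    my ($sam_line) = @_;
    chomp $sam_line;
    my @columns = split /\t/, $sam_line;
    my $total = 0;
    $total += $_ - 33 for unpack 'C*', $columns[10];
    return $total;
}

# Made-up alignment line; QUAL is "II5!" (40 + 40 + 20 + 0).
my $sam_line = join "\t", qw(r1 0 chr1 1 60 4M * 0 0 ACGT II5!);
print qual_sum($sam_line), "\n";   # 100
```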

Parsing a log file using perl

I have a log file where some of the entries look like this:
YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC
and I'm trying to get it into a CSV format:
Date,Time,v1,v2,v3,v4,v5
YY/MM/DD,HH:MM:SS:MMM,XXX,YYY,ZZZ,AAA AND BBB,CCC
I'd like to do this in Perl - speaking personally, I could probably do it far quicker in other languages but I'd really like to expand my horizons a bit.
So far I can get as far as reading the file in and picking out only the lines which meet my criteria, but I can't seem to get the next stage done. I'll need to split up the input line, but so far I just can't work out how to do this. I've looked at s/// and m// but they don't really give me what I want. If anyone can advise me how this can be done or give me pointers I'd much appreciate it.
Important points:
The values in the second part of the line are always in the same order, so mapping / re-organising is not necessarily a problem.
Some of the fields have free text which is not quoted :( but as the labels all start v<number>= I'm hoping parsing this should still be a possibility.
Since there is no one delimiter, you'll need to try this a few different ways:
First, split on ' ', then take the first three values:
my @array = split / /, $line;
my ($date, $time, $constant) = splice @array, 0, 3;
Join the rest of the fields together again, and re-split on v\d+= to get the values:
my $rest = join ' ', @array;
# $rest should now be "v1=XXX v2=YYY ..."
my @values = split /\s*v\d+=/, $rest;
shift @values; # since the first element in @values will be empty
print join ',', $date, $time, @values;
Edit: Here's another approach that may be easier to follow, and is slightly more efficient. This takes advantage of the fact that your constant text occurs between the date/time and the value list.
# assume that CONSTANT is your constant text
my ($datetime, $valuelist) = split /\s*CONSTANT\s*/, $line;
my ($date, $time) = split / /, $datetime;
my @values = split /\s*v\d+=/, $valuelist;
shift @values;
print join(',', $date, $time, @values), "\n"; # newline outside join, so no trailing comma
What have you tried with regular expressions and how has it failed? A regex with m// works fine for me:
#!/usr/bin/env perl
use strict;
use warnings;
print "Date,Time,v1,v2,v3,v4,v5\n";
while (my $line = <DATA>) {
my @matched = $line =~ m{^([^ ]+) ([^ ]+).*v1=(.*) v2=(.*) v3=(.*) v4=(.*) v5=(.*)};
print join(',', @matched), "\n";
}
__DATA__
YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC
Two caveats:
1) v1 cannot contain the substring " v2=", v2 cannot contain " v3=", etc., but, with such a loose format, that's something that would likely cause problems for a human attempting to parse it, too.
2) This code assumes that there will always be v1 through v5. If there are fewer than five vN fields, the line will fail to match. If there are more, all additional fields will be appended to v5 (including their vN tags).
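If the second caveat is a concern, one possible variation is to collect the vN fields one at a time with a global match instead of hard-coding all five. This is a sketch, not a drop-in replacement: the %value hash and the fixed 1 .. 5 column range are assumptions, and the first caveat still applies.

```perl
use strict;
use warnings;

# Sample line in the format from the question.
my $line = 'YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC';

my ($date, $time) = split ' ', $line;

# Capture each label number and its text, stopping before the next " vN=" or end of line.
my %value;
while ($line =~ /\bv(\d+)=(.*?)(?=\s+v\d+=|\s*$)/g) {
    $value{$1} = $2;
}

# Emit a fixed set of columns; a missing field becomes an empty column.
my @row = ($date, $time, map { $value{$_} // '' } 1 .. 5);
print join(',', @row), "\n";   # YY/MM/DD,HH:MM:SS:MMM,XXX,YYY,ZZZ,AAA AND BBB,CCC
```

Lines with fewer than five fields then still produce a row, and extra fields beyond v5 are ignored rather than glued onto v5.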
If the log is fixed-width, you are better off using unpack; its performance benefit shows once the log grows very large.

How do I get the length of a string in Perl?

What is the Perl equivalent of strlen()?
length($string)
perldoc -f length
length EXPR
length
Returns the length in characters of the value of EXPR. If EXPR is
omitted, returns length of $_. Note that this cannot be used on an
entire array or hash to find out how many elements these have. For
that, use "scalar @array" and "scalar keys %hash" respectively.
Note the characters: if the EXPR is in Unicode, you will get the
number of characters, not the number of bytes. To get the length in
bytes, use "do { use bytes; length(EXPR) }", see bytes.
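To make the character/byte distinction concrete, here is a small sketch. Note that it uses the core Encode module rather than the bytes pragma mentioned in the quoted documentation, because that makes the encoding being measured (UTF-8 here, chosen as an assumption) explicit.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "caf\x{e9}";                      # "café": 4 characters
my $chars  = length $string;                   # 4
my $bytes  = length encode('UTF-8', $string);  # 5, since é takes two bytes in UTF-8

print "$chars characters, $bytes bytes\n";     # 4 characters, 5 bytes
```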
Although 'length()' is the correct answer that should be used in any sane code, Abigail's length horror should be mentioned, if only for the sake of Perl lore.
Basically, the trick consists of using the return value of the catch-all transliteration operator:
print "foo" =~ y===c; # prints 3
y///c replaces all characters with themselves (thanks to the complement option 'c'), and returns the number of characters replaced (so, effectively, the length of the string).
length($string)
The length() function:
$string ='String Name';
$size=length($string);
You shouldn't use this, since length($string) is simpler and more readable, but I came across some of these while looking through code and was confused, so in case anyone else does, these also get the length of a string:
my $length = map $_, $str =~ /(.)/gs;
my $length = () = $str =~ /(.)/gs;
my $length = split '', $str;
The first two work by using the global flag to match each character in the string, then using the returned list of matches in scalar context to get the number of characters. The third works similarly, by splitting on each character instead of regex-matching and using the resulting list in scalar context.