Cluster the pattern by positions? - perl

I have an input file as follows.
ggaaaa (973026 to 973032) ctggag (1849680 to 1849686) = 6
ggaaaa (973056 to 973062) ctggag (1849706 to 1849712) = 6
ggaaaa (97322 to 97328) ctggag (184962 to 184968) = 6
cctgtggataacctgtgga (1849554 to 1849572) tccacaggttatccacagg (1849615 to 1849633) = 19
ggcccccccggagtt (470079 to 470093) aactccgggggggcc (1849574 to 1849588) = 15
ctggag (18497062 to 18497068) ggaaaa (9730562 to 9730568) = 6
The first string is the pattern, with its position in brackets. The second string is the repeat, with its position in brackets. The pattern on the first three lines and the repeat on the sixth line are the same, but the positions differ. I want to group these into a cluster, so that my output looks like this:
ggaaaa==>(973026 to 973032)(973056 to 973062)(97322 to 97328)(9730562 to 9730568) ctggag==>(1849680 to 1849686)(1849706 to 1849712)(184962 to 184968)(18497062 to 18497068) 6 8
cctgtggataacctgtgga==>(1849554 to 1849572) tccacaggttatccacagg==>(1849615 to 1849633) 19 2
ggcccccccggagtt==>(470079 to 470093) aactccgggggggcc==>(1849574 to 1849588) 15 2
So the first string is the pattern, followed by its positions, and the second string is the repeat, followed by its positions. Then, after a TAB separator, comes the length of the string, and after another TAB the total number of pattern and repeat positions.
I tried the code below. With a small file I get output, but with a large file I get none.
my $file = $_[0];
my %hashA;
my %hashB;
my @sorted;
my $i = 1;
my $j=$k=0;
my $tmplen = $len = 0;
my @sorted = `sort -nk10 $file`;
push(@sorted,"***");
open (FLWR,">$file") or die "File can't open $!";
$lengt = $sorted[0];
print FLWR $lengt;
my $linelen = @sorted;
while($i < $linelen)
{
($seqs,$len) = split(/\=/,$sorted[$i]);
$len =~s/\s+//g;
my($first,$second,$third,$fourth) = split(/\s+(?!to|\d+)/,$seqs);
if($len != $tmplen || $sorted[$i] eq "***")
{
if($tmplen != 0)
{
foreach $Alev2 (sort keys %{$hashA{$tmplen}})
{
foreach $Alev3 (sort keys %{$hashA{$tmplen}{$Alev2}})
{
foreach $Blev2 (sort keys %{$hashB{$tmplen}})
{
foreach $Blev3 (sort keys %{$hashB{$tmplen}{$Blev2}})
{
if($Alev3 eq $Blev3 && $Alev2 != $Blev2)
{
($Akey) = keys (%{$hashA{$tmplen}{$Blev2}});
($Akey1) = keys (%{$hashA{$tmplen}{$Blev2}{$Akey}});
foreach $Blev4 (sort keys %{$hashB{$tmplen}{$Blev2}{$Blev3}})
{
$hashA{$tmplen}{$Alev2}{$Alev3}{$Blev4}++;
$hashB{$tmplen}{$Alev2}{$Akey}{$Akey1}++ ;
}
delete($hashB{$tmplen}{$Blev2});
delete($hashA{$tmplen}{$Blev2});
}
}
}
}
}
}
$tmplen = $len;
}
if($first ne $dump_first)
{
$dump_first = $first;
$j++;
}
$hashA{$tmplen}{$j}{$dump_first}{$second}++;
$hashB{$tmplen}{$j}{$third}{$fourth}++;
$i++;
}
foreach $s1(sort keys %hashA)
{
foreach $s2 (sort keys %{$hashA{$s1}})
{
my $seq_concat = "";
my $a_concat = "";
my $b_concat = "";
my $a_inc = 0;
my $b_inc = 0;
foreach $s3 (sort keys %{$hashA{$s1}{$s2}})
{
next if($s3 eq "");
$Aseq_concat = "$s3==>";
foreach $s4 (sort keys %{$hashA{$s1}{$s2}{$s3}})
{
$a_inc++;
$a_concat .= $s4;
}
}
$s3 = "";
foreach $s3 (sort keys %{$hashB{$s1}{$s2}})
{
next if($s3 eq "");
$Bseq_concat = "$s3==>";
foreach $s4 (sort keys %{$hashB{$s1}{$s2}{$s3}})
{
$b_inc++;
$b_concat .= $s4;
}
}
next if($b_concat eq "");
$Bseq_concat = uc($Bseq_concat);
$b_concat = uc($b_concat);
$Aseq_concat = uc($Aseq_concat);
$a_concat = uc($a_concat);
if($a_inc > $b_inc)
{
print FLWR $Bseq_concat.$b_concat,"\t",$Aseq_concat,$a_concat;
}
else
{
print FLWR $Aseq_concat.$a_concat,"\t",$Bseq_concat,$b_concat;
}
print FLWR "\t$s1\t";
print FLWR $a_inc+$b_inc;
print FLWR "\n";
}
}

I'd prefer to suggest some updates to your code rather than hand out a completely re-worked solution - but I simply got lost in the level of nesting you have above.
Just a few key points;
The best attack with this is identifying that the ranges are the same format - therefore use a regex with 'global', /g option in a while loop.
Having decided to center the solution on a regex, it's best to use 'extended mode' and lots of whitespace and comments.
The core data structure is a hash of hashes. The first level is keyed on the pattern string, the inner one keyed on the range specification string.
Your specification implies that all patterns on a given line of the data will be of the same length - the code I give below relies on this fact.
I added some checks to the given pattern length and the start and end positions - if the data is rock solid, maybe you don't need them?
Your requested output format added complexity to the code - if there were one report line for each pattern, the code would be much simpler. As it is, a pattern appears at a certain line number in the report if it first appears on the corresponding line in the data. Recording the line on which a pattern first appears added significantly to the code.
The code is written unix filter style - data is supplied via STDIN and printed on STDOUT. Errors and warnings go to STDERR.
So, given those points;
use v5.12;
use warnings;
my $pattern_regex = qr/
(\w+) # capture actual pattern to $1
\s* # optional whitespace after pattern
( # capture the following to $2
\( # literal open bracket
(\d+) # capture start pos to $3
\s* to \s*
(\d+) # capture end pos to $4
\) # literal close bracket
) # close capture whole range spec to $2
\s* # gobble up any whitespace between patterns
/x ; # close regex - extended mode
my %ranges ;
my @first_seen_on_line ;
while (<>) {
chomp ;
my ($patterns_str, $given_length) = split /\s*=\s*/ ;
die "No length on line $." unless defined $given_length ;
# Repeatedly look for patterns and range-specifications
while ( $patterns_str =~ /$pattern_regex/g ) {
my ($pattern, $range_spec, $start_pos, $end_pos) = ($1,$2,$3,$4);
warn "Incorrect length for '$pattern' on line $.\n"
unless length($pattern) == $end_pos - $start_pos + 1
&& length($pattern) == $given_length ;
# Is this the first time we've seen this pattern?
if ( defined $ranges{ $pattern } ) {
# No its not - add this range to the hash of ranges for this pattern
$ranges{ $pattern }{ $range_spec }++ ;
}
else {
# Yes it is - record the fact that it was on this line and
# initialize the hash of ranges for this pattern
push @{ $first_seen_on_line[ $. ] }, $pattern ;
$ranges{ $pattern } = { $range_spec => 1 } ;
}
}
}
for my $line (1 .. $#first_seen_on_line) {
# Might not be anything to do for this line
next unless defined $first_seen_on_line[$line] ;
# Get the patterns that first appeared on this line
my @patterns = @{ $first_seen_on_line[$line] } ;
my ($pat_length , $range_count) ;
for my $pat (@patterns) {
# Get all the ranges for this pattern and print them
my @ranges = keys %{ $ranges{ $pat } };
print $pat, '==>', @ranges, "\t" ;
$range_count += @ranges ;
$pat_length = length($pat) ;
}
print $pat_length, "\t";
print $range_count, "\n" ;
# print length $pat, "\t", scalar @ranges, "\n" ;
}
Ran on the data above, it produces;
ggaaaa==>(973026 to 973032)(973056 to 973062)(97322 to 97328)(9730562 to 9730568) ctggag==>(1849680 to 1849686)(1849706 to 1849712)(184962 to 184968)(18497062 to 18497068) 6 8
cctgtggataacctgtgga==>(1849554 to 1849572) tccacaggttatccacagg==>(1849615 to 1849633) 19 2
ggcccccccggagtt==>(470079 to 470093) aactccgggggggcc==>(1849574 to 1849588) 15 2
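The hash-of-hashes idea from the key points can be tried on its own. The following is a minimal sketch, not the answer's code: `collect_ranges` is a hypothetical helper name, and the regex is a simplified version of the one above without the length checks or the first-seen-line bookkeeping.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): collect "pattern (start to end)"
# pairs into a hash of hashes keyed by pattern, then by range text.
sub collect_ranges {
    my @lines = @_;
    my %ranges;
    for my $line (@lines) {
        # /g in a while loop walks through every pattern/range pair on the line
        while ( $line =~ /(\w+)\s*(\(\d+\s*to\s*\d+\))/g ) {
            $ranges{$1}{$2}++;
        }
    }
    return \%ranges;
}

my $ranges = collect_ranges(
    'ggaaaa (973026 to 973032) ctggag (1849680 to 1849686) = 6',
    'ctggag (18497062 to 18497068) ggaaaa (9730562 to 9730568) = 6',
);

# One line per pattern, with all of its ranges
for my $pattern ( sort keys %$ranges ) {
    print "$pattern: ", join( ' ', sort keys %{ $ranges->{$pattern} } ), "\n";
}
```

Because the inner hash is keyed on the range text, a range that appears twice for the same pattern is stored only once.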


Finding longest match between 2 files from pattern

I am having trouble working with two files in this program. I am trying to access the contents of files $Q and $s.
print "Input the K value \n";
$k = <>;
chomp $k;
print "Input T\n";
$t = <>;
chomp $t;
%Qkmer = ();
$i = 1;
$query=' ';
while ($line=<IN>) {
chomp($line);
if ($line=~ m/^>/ ) {
next;
}
$query=$query.$line;
$line=~ s/(^|\n)[\n\s]*/$1/g;
while (length($line) >= $k) {
$line =~ m/(.{$k})/;
if (! defined $Qkmer{$1}) {# key not yet defined: record only its first match
$Qkmer{$1} = $i;
}
$i++;
$line = substr($line, 1, length($line) -1);
}
}
open(MYDATA, '<', "data.txt");
while ($line=<MYDATA>) {
chomp($line);
%Skmer = (); # This initializes the hash called Skmer.
$j = 1;
if ($line=~ m/^>/ ) { #if the line starts with >
next; #start on next line #separated characters
}
$line=~ s/^\s+|\s+$//g ; #remove all spaces from file
while (length($line) >= $k) {
$line =~ m/(.{$k})/;#match any k characters and only k characters in dna
$Skmer{$1} = $j; #set the key position to $j and increase for each new key
$j++;
$line = substr($line, 1, length($line) -1); #this removes the first character in the current string
}
###(56)###for($Skmerkey(keys %Skmer)){
$i=$Skmer{$Skmerkey};
if(defined $Qkmer($Skmerkey)){
$j=$Qkmer($Skmerkey);
}
$S1=$line;
$S2=$query;
@arrayS1= split(//, $S1);
@array2= split(//, $S2);
$l=0;
while($arrayS1[$i-$l] eq $arrayS2[$j-$l]){
$l++;
}
$start=$i-$l;
$m=0;
while ($arrayS1[$i+$k+$m] eq $arrayS2[$j+$k+$m]) {
$m++;
}
$length=$l+$k+$m;
$match= substr($S1, $start, $length);
if($length>$t){
$longest=length($match);
print "Longest: $match of length $longest \n";
}
}
}###(83)###
The input files contain only strings of letters. For example:
File 1:
ahhtsagnchjgstffhjyfcsghnvzfhg
File2:
ggujvfbgfgkjfcijjjffcvvafcsghnvzfhgvugxckugcbhfcgh
ghnvzfhgvugxckHhfgjgcfujvftjbvdtkhvddgjcdgjxdjkfrh
ajdbvciyqdanvkjghnvzfhgvugxc
Starting from a match of a word of length $k from file 1 in file 2, I check to the left and to the right of that match in file 2 for further matching characters. The final output is the longest match between File 1 and File 2 based on $k. Now I ge
With this code, I get a syntax error and I am not sure why, because it looks correct to me:
syntax error at testk.pl line 56, near "$Skmerkey("
syntax error at testk.pl line 83, near "}"
Thank you.
use strict; # <--- Always use this
use warnings; # <--- and this
use Data::Dumper;
my $k=3;
open(my $IN, '<', "File2") or die "Cannot open File2: $!"; # use a lexical $IN instead of the bareword IN
my $line=0; # line number
my %kmer; # hash of arrays of all $k-letter "words" line/position
my @Q; # rows of Q-file
while(<$IN>) {
chomp;
next if /^>/;
s/^\s+|\s+$//g;
next if !$_;
my $pos=0;
push @Q, $_; # store source row
for(/(?=(.{$k}))/g) { # Capture $k letters: sliding window advancing one character at a time
push @{$kmer{$_}}, [$line,$pos]; # store row number and position of "word"
$pos++;
}
$line++;
}
open($IN, '<', "File1") or die "Cannot open File1: $!";
$line=0;
while(<$IN>) { # Read S-file
chomp;
next if /^>/;
s/^\s+|\s+$//g;
next if !$_;
my $pos=0;
my $len=length($_); # length of row of S-file
my $s=$_; # Current row of S-file
my @ignore=(); # array storing information about match tails
for(/(?=(.{$k}))/g) {
next if ! $kmer{$_}; # "word" not found, try the next one
for(@{$kmer{$_}}) { # $kmer{word} contains array of lines/positions in Q
my($qline, $qpos)=@{$_};
# print "test $qline:$qpos ";
if( grep {$_->[0]==$qline && $_->[1]==$qpos } @ignore ) {
# this line/position already tested and included in found matching
# print "Ignore match tail $qline:$qpos\n";
next;
}
my $j=$k; # $k letters same, test after this point
my $qlen=length($Q[$qline]);
$j++ while( $pos+$j<$len && $qpos+$j<$qlen &&
substr($s,$pos+$j,1) eq substr($Q[$qline],$qpos+$j,1) );
print "MATCH FOUND: S-file line $line pos $pos, Q-file line $qline pos $qpos: ",
substr($s,$pos,$j),"\n";
push @ignore, [$qline, $qpos, $j]; # store positions and length of match
}
} continue { # The continue block runs on every iteration, including after "next"
$pos++;
@ignore=grep { # recalculate/filter position and length of all match tails
++$_->[1]; # increment position
(--$_->[2]) # decrement length
>= $k # and filter out lengths < $k
} @ignore;
# print Dumper(\@ignore);
}
$line++;
}
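The sliding window used in both loops above relies on a zero-width lookahead, so that consecutive matches can overlap. Here is a small sketch of just that step; `kmers` is a hypothetical helper name and the input string is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: return every overlapping $k-letter substring of $str
# together with its starting offset. The lookahead consumes no characters,
# so the regex engine advances one position per match.
sub kmers {
    my ($str, $k) = @_;
    my @found;
    while ( $str =~ /(?=(.{$k}))/g ) {
        push @found, [ $1, $-[0] ];   # $-[0] is the offset where the match began
    }
    return @found;
}

printf "%s at %d\n", @$_ for kmers('abcde', 3);
```

Without the lookahead, `/(.{3})/g` would return only non-overlapping chunks and miss most of the k-mers.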

Perl - longest common prefix of 2 or more strings?

How can I create a Perl subroutine that takes an array of strings and finds the longest common prefix shared by two or more of its elements?
I have this code:
sub longest_common_prefix {
$prefix = shift;
for (@_) {
chop $prefix while (! /^\Q$prefix\E/);
}
return $prefix;
}
But it only works if you are looking for the longest common prefix of all strings.
For example, if I pass an array with the following strings:
aaaBGFB
aaaJJJJ
jjfkBBB
aaaHGHG
I want it to return aaa as the answer.
Thanks!
I'd use a modified trie.
Normally, one could use the following to add to a trie:
sub add {
my $p = \shift;
my $s = shift;
$p = \( $$p->{$_} ) for split(//, $s);
$$p->{''} = 1;
}
But we need two modifications:
All prefixes of a string must be added when adding a string. For example, adding abc should also add a and ab to the trie.
When adding to the trie, we want to return the length of the previously existing part of the path taken.
So we need:
sub add {
my $p = \shift;
my $s = shift;
my $cp_len = 0;
for (split(//, $s)) {
$p = \( $$p->{$_} );
++$cp_len if $$p->{''}; # this node was already marked by an earlier string
$$p->{''} = 1;
}
return $cp_len;
}
Combine (an optimized version of) this with an algorithm to find the longest strings in a list and with an algorithm to remove duplicate strings from a list to get the following solution:
use strict;
use warnings;
use feature qw( say );
sub add {
my $p = \shift;
my $s = shift;
my $cp_len = 0;
for (split(//, $s)) {
++$cp_len if exists($$p->{$_});
$p = \( $$p->{$_} );
}
return $cp_len;
}
my $t;
my $lcp_len = 0; # lcp = longest common prefix
my %lcps;
while (<>) {
chomp;
my $cp_len = add($t, $_)
or next;
if ($cp_len >= $lcp_len) {
if ($cp_len > $lcp_len) {
$lcp_len = $cp_len;
%lcps = ();
}
$lcps{ substr($_, 0, $cp_len) } = 1;
}
}
my @lcps = sort keys %lcps;
if (@lcps) {
say "Longest common prefix(es): @lcps";
} else {
say "No common prefix";
}
Data:
abc
abc
abcd
abcde
hijklx
hijkly
mnopqx
mnopqy
Output:
Longest common prefix(es): hijkl mnopq
The time taken by the above is proportional to the number of input characters.
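To see what the modified `add` reports, here is a sketch of the same idea written with an explicit hash reference for the trie root instead of a reference to a reference. `add_word` is a hypothetical name, and this is an illustration rather than the answer's exact code.

```perl
use strict;
use warnings;

# Hypothetical helper: insert $word into the trie rooted at $node and return
# how many leading characters of $word were already on a path in the trie.
sub add_word {
    my ($node, $word) = @_;
    my $shared = 0;
    for my $ch ( split //, $word ) {
        ++$shared if exists $node->{$ch};   # this edge existed before the call
        $node = $node->{$ch} //= {};        # descend, creating the child if needed
    }
    return $shared;
}

my %trie;
print add_word(\%trie, 'hijklx'), "\n";   # nothing shared yet
print add_word(\%trie, 'hijkly'), "\n";   # shares "hijkl" with the first word
print add_word(\%trie, 'hix'),    "\n";   # shares "hi"
```

Each character is visited once per inserted string, which is where the linear running time comes from.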
One way would be to store the information in a hash. In this example, the hash key is the length of each prefix and the value is the actual prefix found.
Note that this method overwrites a key and value if a same-length prefix exists, so you'll always get the last prefix found of the longest length (sort() takes care of finding the longest one).
The regex says "find the first character in the string and capture it, and use that char found in a second capture, and capture as many as there are". This string is then join()ed into a scalar and put into the hash.
use warnings;
use strict;
my %prefixes;
while (<DATA>){
my $prefix = join '', /^(.)(\1+)/;
$prefixes{length $prefix} = $prefix;
}
my $longest = (sort {$b <=> $a} keys %prefixes)[0];
print "$prefixes{$longest}\n";
__DATA__
aaBGFB
aaaJJJJ
jjfkBBB
aaaHGHG
Output:
aaa
You can keep a hash of an array of words keyed by the first character. By definition, if you have words starting with the same letter, those words share at least a one character common prefix of that one letter. Then reduce to the single longest prefix by stepping through the hash by character:
use strict; use warnings;
sub lcp {
(join("\0", @_) =~ /^ ([^\0]*) [^\0]* (?:\0 \1 [^\0]*)* $/sx)[0];
}
my %HoA;
my $longest='';
while (my $line=<DATA>){
$line =~ s/^\s+|\s+$//g ;
push @{ $HoA{substr $line, 0, 1} }, $line if $line=~/^[a-zA-Z]/;
}
for my $key ( sort (keys %HoA )) {
if (scalar @{ $HoA{$key} } > 1){
my $lon=lcp(@{ $HoA{$key} });
my $s = join ', ', map { qq/"$_"/ } @{ $HoA{$key} };
print "lcp: \"$lon\" for ($s)\n";
if (length($lon) > length($longest)) {
$longest=$lon;
}
}
else{
print "$key: no common prefix\n";
}
}
print "\nlongest common prefix is \"$longest\"\n";
__DATA__
aardvark
aaaBGFB
aaaJJJJ
jjfkBBB
aaaHGHG
interspecies
interstellar
interstate
Prints:
lcp: "aa" for ("aardvark", "aaaBGFB", "aaaJJJJ", "aaaHGHG")
lcp: "inters" for ("interspecies", "interstellar", "interstate")
j: no common prefix
longest common prefix is "inters"
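The regex inside `lcp` does most of the work, so it may help to see it in isolation. This sketch wraps the same pattern in a hypothetical `common_prefix` function; it assumes none of the strings contains a NUL character, since NUL is used as the joiner.

```perl
use strict;
use warnings;

# Join the strings with "\0"; the capture grabs a prefix of the first string,
# and the backreference \1 forces every following string to start with it.
# Greedy matching plus backtracking yields the longest prefix that works.
sub common_prefix {
    my $joined = join "\0", @_;
    my ($prefix) = $joined =~ /^ ([^\0]*) [^\0]* (?:\0 \1 [^\0]*)* $/sx;
    return $prefix;
}

print common_prefix(qw(interspecies interstellar interstate)), "\n";
print '[', common_prefix(qw(abc xyz)), "]\n";   # empty prefix when nothing is shared
```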

Handling Nested Delimiters in perl

use strict;
use warnings;
my %result_hash = ();
my %final_hash = ();
Compare_results();
foreach my $key (sort keys %result_hash ){
print "$key \n";
print "$result_hash{$key} \n";
}
sub Compare_results
{
while ( <DATA> )
{
my($instance,$values) = split /\:/, $_;
$result_hash{$instance} = $values;
}
}
__DATA__
1:7802315095\d\d,7802315098\d\d;7802025001\d\d,7802025002\d\d,7802025003\d\ d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
2:7802315095\d\d,7802025002\d\d,7802025003\d\d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
Output
1
7802315095\d\d,7802315098\d\d;7802025001\d\d,7802025002\d\d,7802025003\d\d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
2
7802315095\d\d,7802025002\d\d,7802025003\d\d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
I am trying to fetch the value of each key and then split the comma-separated value from the result hash. If I find a semicolon in any value, I want to store the left and right values under separate hash keys.
Something like below
1. # split the value of result_hash{$key} again by , and see whether any chunk is separated by ;
2. # every chunk without ; and every value to the left of ; should be stored in
@{$final_hash{"eto"}} = ['7802315095\d\d','7802315098\d\d','7802025002\d\d','7802025003\d\d','7802025004\d\d','7802025005\d\d','7802025006\d\d','7802025007\d\d'] ;
3. # anything found on the right side of ; has to be stored in
@{$final_hash{"pro"}} = ['7802025001\d\d'] ;
Is there a way that I can handle everything in the subroutine? Can I make the code simpler?
Update :
I tried splitting the string in a single shot, but it's just picking the values with a semicolon and ignoring everything else.
foreach my $key (sort keys %result_hash ){
# print "$key \n";
# print "$result_hash{$key} \n";
my ($o,$t) = split(/,|;/, $result_hash{$key});
print "Left : $o \n";
print "Left : $t \n";
#push @{$final_hash{"eto"}}, $o;
#push @{$final_hash{"pro"}} ,$t;
}
}
My updated code after help
sub Compare_results
{
open my $fh, '<', 'Data_File.txt' or die $!;
# split by colon and further split by , and ; if any (done in insert_array)
my %result_hash = map { chomp; split ':', $_ } <$fh> ;
foreach ( sort { $a <=> $b } (keys %result_hash) )
{
($_ < 21)
? insert_array($result_hash{$_}, "west")
: insert_array($result_hash{$_}, "east");
}
}
sub insert_array()
{
my ($val,$key) = @_;
foreach my $field (split ',', $val)
{
$field =~ s/^\s+|\s+$//g; # / turn off editor coloring
if ($field !~ /;/) {
push @{ $file_data{"pto"}{$key} }, $field ;
}
else {
my ($left, $right) = split ';', $field;
push @{$file_data{"pto"}{$key}}, $left if($left ne '') ;
push @{$file_data{"ero"}{$key}}, $right if($right ne '') ;
}
}
}
Thanks
Update Added a two-pass regex, at the end
Just proceed systematically, analyze the string step by step. The fact that you need consecutive splits and a particular separation rule makes it unwieldy to do in one shot. Better have a clear method than a monster statement.
use warnings 'all';
use strict;
use feature 'say';
my (%result_hash, %final_hash);
Compare_results();
say "$_ => $result_hash{$_}" for sort keys %result_hash;
say '---';
say "$_ => [ @{$final_hash{$_}} ]" for sort keys %final_hash;
sub Compare_results
{
%result_hash = map { chomp; split ':', $_ } <DATA>;
my (@eto, @pro);
foreach my $val (values %result_hash)
{
foreach my $field (split ',', $val)
{
if ($field !~ /;/) { push @eto, $field }
else {
my ($left, $right) = split ';', $field;
push @eto, $left;
push @pro, $right;
}
}
}
$final_hash{eto} = \@eto;
$final_hash{pro} = \@pro;
return 1; # but add checks above
}
There are some inefficiencies here, and no error checking, but the method is straightforward. If your input is anything but smallish, please change the above to process line by line, which you clearly know how to do. It prints
1 => ... (what you have in the question)
---
eto => [ 7802315095\d\d 7802315098\d\d 7802025002\d\d 7802025003\d\ d ...
pro => [ 7802025001\d\d ]
Note that your data does have one loose \d\ d.
We don't need to build the whole hash %result_hash for this; we only need the part of the line after :. I left the hash in since it is declared global, so you may want to have it around. If it isn't in fact needed on its own, this simplifies to
sub Compare_results {
my (@eto, @pro);
while (<DATA>) {
my ($val) = /:(.*)/;
foreach my $field (split ',', $val)
# ... same
}
# assign to %final_hash, return from sub
}
Thanks to ikegami for comments.
Just for the curiosity's sake, here it is in two passes with regex
sub compare_rx {
my @data = map { (split ':', $_)[1] } <DATA>;
$final_hash{eto} = [ map { /([^,;]+)/g } @data ];
$final_hash{pro} = [ map { /;([^,;]+)/g } @data ];
return 1;
}
This picks all characters which are not , or ;, using the negated character class, [^,;]. So that is up to the first either of them, left to right. It does this globally, /g, so it keeps going through the string, collecting all fields that are "left of" , or ;. Then it cheats a bit, picking all [^,;] that are right of ;. The map is used to do this for all lines of data.
If %result_hash is needed, build it instead of @data, then pull the values from it with my @values = values %result_hash and feed the map with @values.
Or, broken line by line (again, you can build %result_hash instead of taking $data directly)
my (@eto, @pro);
while (<DATA>) {
my ($data) = /:(.*)/;
push @eto, $data =~ /([^,;]+)/g;
push @pro, $data =~ /;([^,;]+)/g;
}
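A quick way to check what each pattern collects is to run it on one short string. In the sketch below the sample value is made up, and the anchored pattern `(?:^|,)` is an adjustment for illustration that is not taken from the answer: a plain `/([^,;]+)/g` returns every field, including the one after the semicolon, while the anchored form keeps only fields that start the string or follow a comma.

```perl
use strict;
use warnings;

my $value = 'a,b;c,d';   # made-up sample: "c" sits to the right of a semicolon

my @all = $value =~ /([^,;]+)/g;           # every field, including "c"
my @eto = $value =~ /(?:^|,)([^,;]+)/g;    # fields at the start or after a comma
my @pro = $value =~ /;([^,;]+)/g;          # fields that follow a semicolon

print "all: @all\n";
print "eto: @eto\n";
print "pro: @pro\n";
```

Whether the field after the semicolon should also appear in the first list depends on the requirement, so it is worth testing against a line like this before using either form.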

How can I extract/parse tabular data from a text file in Perl?

I am looking for something like HTML::TableExtract, just not for HTML input, but for plain text input that contains "tables" formatted with indentation and spacing.
Data could look like this:
Here is some header text.
Column One Column Two Column Three
a b
a b c
Some more text
Another Table Another Column
abdbdbdb aaaa
I'm not aware of any packaged solution, but something not very flexible is fairly simple to do, assuming you can make two passes over the file. (The following is a partially Perlish pseudocode example.)
Assumption: data may contain spaces and is NOT quoted ala CSV if there's a space - if this is not the case, just use Text::CSV(_XS).
Assumption: no tabs used for formatting.
The logic defines a "column separator" to be any consecutive set of vertical rows populated 100% with spaces.
If by accident every row has a space which is part of the data at offset M characters, the logic will consider offset M to be a column separator, since it can't know any better. The ONLY way it can know better is if you require column separation to be at least X spaces where X>1 - see the second code fragment for that.
Sample code:
my $INFER_FROM_N_LINES = 10; # Infer columns from this # of lines
# 0 means from entire file
my $lines_scanned = 0;
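my @non_spaces = (); # $non_spaces[$i] is true if any scanned line has a non-space at offset $i
# First pass - find which character columns in the file have all spaces and which don't
open(my $fh, '<', $file) or die "Cannot open $file: $!"; # $file: path to the input, assumed to be set earlier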
while (<$fh>) {
last if $INFER_FROM_N_LINES && $lines_scanned++ == $INFER_FROM_N_LINES;
chomp;
my $line = $_;
my #chars = split(//, $line);
for (my $i = 0; $i < @chars; $i++) { # Probably can be done prettier via map?
$non_spaces[$i] = 1 if $chars[$i] ne " ";
}
}
close $fh or die;
# Find columns, defined as consecutive "non-spaces" slices.
my (@starts, @ends); # Index at which columns start and end
my $state = " "; # Not inside a column
for (my $i = 0; $i < @non_spaces; $i++) {
next if $state eq " " && !$non_spaces[$i];
next if $state eq "c" && $non_spaces[$i];
if ($state eq " ") { # && $non_spaces[$i] of course => start column
$state = "c";
push @starts, $i;
} else { # meaning $state eq "c" && !$non_spaces[$i] => end column
$state = " ";
push @ends, $i-1;
}
}
if ($state eq "c") { # Last char is NOT a space - produce the last column end
push @ends, $#non_spaces;
}
# Now split lines
open($fh, '<', $file) or die "Cannot open $file: $!"; # second pass over the same (assumed) $file
my @rows = ();
while (<$fh>) {
my @columns = ();
push @rows, \@columns;
chomp;
my $line = $_;
for (my $col_num = 0; $col_num < @starts; $col_num++) {
$columns[$col_num] = substr($_, $starts[$col_num], $ends[$col_num]-$starts[$col_num]+1);
}
}
close $fh or die;
Now, if you require column separation to be at least X spaces where X>1, it's also doable but the parser of column locations needs to be a bit more complex :
# Find columns, defined as consecutive "non-spaces" slices separated by at least 3 spaces.
my $min_col_separator_is_X_spaces = 3;
my (@starts, @ends); # Index at which columns start and end
my $state = "S"; # inside a separator
NEXT_CHAR: for (my $i = 0; $i < @non_spaces; $i++) {
if ($state eq "S") { # done with last column, inside a separator
if ($non_spaces[$i]) { # start a new column
$state = "c";
push @starts, $i;
}
next;
}
if ($state eq "c") { # Processing a column
if (!$non_spaces[$i]) { # First space after non-space
# Could be beginning of separator? check the rest of the next X chars!
for (my $j = $i+1; $j < @non_spaces
&& $j < $i+$min_col_separator_is_X_spaces; $j++) {
if ($non_spaces[$j]) {
$i = $j; # Still inside the column; no need to re-scan
next NEXT_CHAR; # OUTER loop
}
}
# If we reach here, next X chars are spaces! Column ended!
push @ends, $i-1;
$state = "S";
$i = $i + $min_col_separator_is_X_spaces - 1; # the loop's $i++ lands just past the separator
}
next;
}
}
push @ends, $#non_spaces if $state eq "c"; # Last column runs to the final character
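For comparison, the simpler rule described earlier in this answer (a column separator is any character offset that is blank on every line) can be written as a short self-contained sketch. `infer_columns` is a hypothetical helper name, the sample table is made up, and the same assumption applies that no tabs are used for formatting.

```perl
use strict;
use warnings;

# Hypothetical helper: given text lines, return [start, end] offsets of the
# columns, treating any offset that is blank on every line as a separator.
sub infer_columns {
    my @lines = @_;
    my @used;    # $used[$i] is true if any line has a non-space at offset $i
    for my $line (@lines) {
        while ( $line =~ /\S/g ) { $used[ $-[0] ] = 1 }
    }
    my ( @spans, $start );
    for my $i ( 0 .. $#used + 1 ) {    # one step past the end closes the last column
        if ( $used[$i] ) { $start //= $i }
        elsif ( defined $start ) { push @spans, [ $start, $i - 1 ]; undef $start }
    }
    return @spans;
}

my @table = ( 'aa  b   ccc', 'a   bb  c' );
my @spans = infer_columns(@table);
for my $line (@table) {
    # Cut each line at the inferred offsets; short lines simply yield short cells
    my @cells = map { substr( $line, $_->[0], $_->[1] - $_->[0] + 1 ) } @spans;
    print join( '|', @cells ), "\n";
}
```

The caveat from the text still holds: if every line happens to have a space at the same offset inside a data value, that offset is treated as a separator.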
Here's a very quick solution, commented with an overview. (My apologies for the length.) Basically, if a "word" appears after the start of column header n, then it ends up in column n, unless most of its body trails into column n + 1, in which case it ends up there instead. Tidying this up, extending it to support multiple different tables, etc. are left as an exercise. You could also use something other than the left offset of the column header as the boundary mark, such as the centre, or some value determined by the column number.
#!/usr/bin/perl
use warnings;
use strict;
# Just plug your headers in here...
my @headers = ('Column One', 'Column Two', 'Column Three');
# ...and get your results as an array of arrays of strings.
my @result = ();
my $all_headers = '(' . (join ').*(', @headers) . ')';
my $found = 0;
my @header_positions;
my $line = '';
my $row = 0;
push @result, [] for (1 .. @headers);
# Get lines from file until a line matching the headers is found.
while (defined($line = <DATA>)) {
# Get the positions of each header within that line.
if ($line =~ /$all_headers/) {
@header_positions = @-[1 .. @headers];
$found = 1;
last;
}
}
$found or die "Table not found! :<\n";
# For each subsequent nonblank line:
while (defined($line = <DATA>)) {
last if $line =~ /^$/;
push @{$_}, "" for (@result);
++$row;
# For each word in line:
while ($line =~ /(\S+)/g) {
my $word = $1;
my $position = $-[1];
my $length = $+[1] - $position;
my $column = -1;
# Get column in which word starts.
while ($column < $#headers &&
$position >= $header_positions[$column + 1]) {
++$column;
}
# If word is not fully within that column,
# and more of it is in the next one, put it in the next one.
if (!($column == $#headers ||
$position + $length < $header_positions[$column + 1]) &&
$header_positions[$column + 1] - $position <
$position + $length - $header_positions[$column + 1]) {
my $element = \$result[$column + 1]->[$row];
$$element .= " $word";
# Otherwise, put it in the one it started in.
} else {
my $element = \$result[$column]->[$row];
$$element .= " $word";
}
}
}
# Output! Eight-column tabs work best for this demonstration. :P
foreach my $i (0 .. $#headers) {
print $headers[$i] . ": ";
foreach my $c (@{$result[$i]}) {
print "$c\t";
}
print "\n";
}
__DATA__
This line ought to be ignored.
Column One Column Two Column Three
These lines are part of the tabular data to be processed.
The data are split based on how much words overlap columns.
This line ought to be ignored also.
Sample output:
Column One: These lines are The data are split
Column Two: part of the tabular based on how
Column Three: data to be processed. much words overlap columns.
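The solution above gets the header offsets from Perl's `@-` array, which holds the start offset of each capture group after a successful match. A minimal sketch of that step alone follows; `header_offsets` is a hypothetical helper name, and `quotemeta` is added here as a precaution in case a header contains regex metacharacters.

```perl
use strict;
use warnings;

# Hypothetical helper: return the starting offset of each header within $line,
# or an empty list if the headers do not all appear in order.
sub header_offsets {
    my ( $line, @headers ) = @_;
    my $re = '(' . join( ').*(', map { quotemeta } @headers ) . ')';
    return unless $line =~ /$re/;
    return @-[ 1 .. @headers ];    # $-[N] is where capture group N started
}

my @offsets = header_offsets( 'Column One    Column Two', 'Column One', 'Column Two' );
print "@offsets\n";
```

The matching `@+` array holds the end offsets, which is how the word length is computed in the main loop.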

How can I skip some block content while reading in Perl

I want to skip any block whose starting line contains "MaterializeU4()", using the subroutine read_block() below, but my attempt failed.
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Otherwise return an array ref containing the lines in the block.
sub read_block {
my $fh = shift;
my @lines;
my $block_started = 0;
while( my $line = <$fh> ) {
# how to correct my code below? I don't need the 2nd block content.
$block_started++ if ( ($line =~ /^(status)/) && (index($line, "MaterializeU4") != 0) ) ;
if( $block_started ) {
last if $line =~ /^\s*$/;
push @lines, $line;
}
}
return \@lines if @lines;
return;
}
Data as below:
__DATA__
status DynTest = <dynamic 100>
vid = 10002
name = "DynTest"
units = ""
status VIDNAME9000 = <U4 MaterializeU4()>
vid = 9000
name = "VIDNAME9000"
units = "degC"
status DynTest = <U1 100>
vid = 100
name = "Hello"
units = ""
Output:
<StatusVariables>
<SVID logicalName="DynTest" type="L" value="100" vid="10002" name="DynTest" units=""></SVID>
<SVID logicalName="DynTest" type="L" value="100" vid="100" name="Hello" units=""></SVID>
</StatusVariables>
[Updated]
I printed the value of index($line, "MaterializeU4"); it outputs 25.
Then I updated the code as below:
$block_started++ if ( ($line =~ /^(status)/) && (index($line, "MaterializeU4") != 25) ) ;
Now it works.
Any comments are welcome about my practice.
Perl already has an operator to keep track of blocks. It's called the "flip-flop" operator:
Try this out:
while ( <DATA> ) {
next if /\Q<U4 MaterializeU4()>\E/../^\s*$/;
push @lines, $_;
}
The value of /\Q<U4 MaterializeU4()>\E/../^\s*$/ will be true when it sees a line that matches the starting regex and it will stop being true after it sees a line matching the second expression.
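Here is a self-contained sketch of the flip-flop in action. The input lines are made up and held in an array instead of being read from `DATA`; an empty string stands in for the blank line that ends each block.

```perl
use strict;
use warnings;

# Made-up sample lines; the empty string ends each block.
my @input = (
    "status A = <dynamic 100>",        "vid = 1", "",
    "status B = <U4 MaterializeU4()>", "vid = 2", "",
    "status C = <U1 100>",             "vid = 3", "",
);

my @kept;
for (@input) {
    # True from the MaterializeU4 line through the next blank line, inclusive.
    next if /\Q<U4 MaterializeU4()>\E/ .. /^\s*$/;
    push @kept, $_;
}

print "$_\n" for @kept;
```

Note that the blank line closing the skipped block is skipped as well, because the range is still true on the line that matches the second expression.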
First, using a regex instead of index is probably better, since you can tune it to the exact format of the status string if you decide to be stricter than just "substring exists".
I would suggest as one solution adding a second flag to skip the block contents if it's a MaterializeU4 block, as follows:
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Empty return for skippable (MaterializeU4) block!!!
# Otherwise return an array ref containing the lines in the block.
sub read_block {
my $fh = shift;
my @lines = ();
my $block_started = 0;
my $block_ignore = 0;
while (my $line = <$fh> ) {
if ($line =~ /^status/) {
$block_started = 1;
# Test separately: a lazy .*? followed by an optional group would capture "" here
$block_ignore = 1 if $line =~ /MaterializeU4/;
}
last if $line =~ /^\s*$/ && $block_started;
push @lines, $line unless $block_ignore;
}
return \@lines if @lines || $block_started;
return;
}
Here's a slightly modified sample I tested using codepad.org:
Code:
use Data::Dumper;
my @all_lines = (
"s 1" ,"b 1" ,""
, "s MaterializeU4" ,"b 2" ,""
, "s 3" ,"b 3" ,""
);
while (@all_lines) {
my $block = read_block();
print Data::Dumper->Dump([$block]);
}
exit 0;
sub read_block {
my @lines = ();
my $block_started = 0;
my $block_ignore = 0;
while (my $line = shift @all_lines) {
if ($line =~ /^s .*?((MaterializeU4)?)/) {
$block_started = 1;
$block_ignore = 1 if $1;
}
last if $line =~ /^\s*$/ && $block_started;
push @lines, $line unless $block_ignore;
}
return \@lines if @lines || $block_started;
return;
}
Output:
$VAR1 = [
's 1',
'b 1'
];
$VAR1 = [];
$VAR1 = [
's 3',
'b 3'
];
On successful match of a substring, index returns the position of the substring, which could be any value >= 0. On "failure", index returns -1.
The way you are using index
index($line, "MaterializeU4") != 0
will be true for all lines except for a line that begins with the string "MaterializeU4".
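The three possible outcomes of `index` can be checked directly. This sketch uses the status line from the question's data, assuming single spaces between the tokens as shown there.

```perl
use strict;
use warnings;

my $line = 'status VIDNAME9000 = <U4 MaterializeU4()>';

print index( $line, 'MaterializeU4' ), "\n";   # position of the substring
print index( $line, 'status' ),        "\n";   # 0: found at the very start
print index( $line, 'missing' ),       "\n";   # -1: not found

# A containment test therefore compares against -1, not 0.
print "contains MaterializeU4\n" if index( $line, 'MaterializeU4' ) != -1;
```

Comparing against a fixed position such as 25 only works while the text before the substring keeps exactly the same length.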
It looks like you already know a little bit about Perl regular expressions. Why not use one in this case, too?
++$block_started if $line =~ /status/ && $line !~ /MaterializeU4/;
Another issue I see is that you set $block_started to begin capturing lines, but you never set it to zero at the end of the "block", say, when $line is empty. I'm not sure if that's what you wanted to do.