Next line array if a condition is achieved

Next line array if a condition is achieved - perl

I have a text in a file F1 each sentence in line, and another file contain the part of speech(POS) of each word in the text for example:
F1 contains:
he lives in paris\n
he jokes
F2 contains:
he pro\n
lives verb\n
in prep\n
paris adv_pl\n
he pro\n
jokes verb\n
I would like to parse each sentence of F1 and extract the POS of each word. I arrived to extract the POS of the first sentence, but the program can't parse the second line. This is the code:
open( FILE, $filename ) || die "Problème d'ouverture du ficher en entrée";
open( FILEOUT, ">$filenameout" ) || die "Problème d'ouverture";
open( F, "/home/ahmed/Bureau/test/corpus.txt" ) || die " Pb pour ouvrir";
open( F2, "/home/ahmed/Bureau/test/corp.txt" ) || die " Pb pour ouvrir";
my $z;
my $y = 0;
my $l;
my $li;
my $pos;
while ( $ligne = <F> ) {
while ( $li = <F2> ) { # F2 POS
chomp($li);
# prem contain the first word of F2 in each line,
# deux contain the POS of this word
( $prem, $deux ) = ( $li =~ m/^\W*(\w+)\W+(\w+)/ );
print "premier: $prem\n";
chomp($ligne);
#val = split( / /, $ligne ); # corpus de texte
$l = #val;
while ( $y < $l ) { # $l length of sentence
$z = $val[$y];
print "z : $z\n";
if ( $z eq $prem ) {
print "true\n";
$pos .= "POSw" . $y . "=" . $deux . " ";
++$y;
} else {
last;
}
}
}
print FILEOUT "$pos\n";
$pos = "";
}
The result I had in the terminal:
premier: he
z : he
true
premier : lives
z : lives
true
premier : in
z : in
true
premier : paris
z : paris
true
premier : he
premier : jokes
The first sentence has 4 words, when it pass 4, we must go to the next line in the text, I can't arrive to solve it.

There are some issues in your script.
You must always use strict; use warnings; to show the most common syntax and/or typing errors, unused variables, etc.
You should always use the three-parameter open and no global FILEHANDLEs (see opentut).
You should use some sensible names for your filehandles, not FH, FH1, etc. but $fh_sentences and $fh_grammar (or other meaningful names).
So far for the general part. Now let's get more specific:
Your outer loop (F) reads the sentences one by one. The next loop (F2) reads the grammatical types but it does so only once for the first sentence. When the F2 file is read, subsequent calls to <F2> will always return undef because the file was already read. You have to reset the filepointer to the beginning of the file after each sentence or -- even better -- read the file F2 in advance and store its contents in a hash.
Iterating over an array of words in a sentence is easier with foreach my $word (#words). No need to do the housekeeping of index variables (like $y) yourself.
chomping and splitting the sentences should be moved outside the F2 loop because $ligne doesn't change in the loop and only burns CPU cycles.
Putting this together I end up with this:
use strict;
use warnings;
# Read the grammar file, F2, into a hash:
my %grammar;
open( my $fh_grammar, '<', 'F2' ) or die "Pb pour ouvrir F2: $!\n";
while( my $ligne = <$fh_grammar> ) {
my ($prem, $deux) = ( $ligne =~ m/^\W*(\w+)\W+(\w+)/ );
$grammar{$prem} = $deux;
}
close($fh_grammar);
# The hash is now:
# %grammar = (
# 'he' => 'pro',
# 'lives => 'verb',
# 'in' => 'prep',
# 'paris' => 'adv_pl'
# 'jokes' => 'verb'
# );
# Read the sentences from F1 and check the grammar:
open( my $fh_sentences, '<', 'F1' ) or die "Pb pour ouvrir F1: $!\n";
while( my $ligne = <$fh_sentences> ) {
my #words = split(/\s+/, $ligne );
foreach my $word (#words) {
print "z: $word\n";
if ( exists $grammar{$word} ) {
print "true; $grammar{$word}\n";
}
}
print "\n";
}
close($fh_sentences);
Output:
z: he
true; pro
z: lives
true; verb
z: in
true; prep
z: paris
true; adv_pl
z: he
true; pro
z: jokes
true; verb

You can solve the above problem in different way like :
First read the POS file and put it in hash
Code :
my $filename = "F2";
open FH2, $filename or die "Error";
my %POS_hash;
while(<FH2>)
{
chomp($_);
my #arr = split(/ /, $_); # you can change the split function
$POS_hash{$arr[0]} = $arr[1];
}
Now read your file and replace it with the POS
my $filename1 = "F1";
open FH1, $filename1 or die "Error";
while(<FH1>)
{
chomp($_);
my #arr = split(/ /, $_); # you can change the split function
foreach my $val (#arr)
{
if(exists $POS_hash{$val})
{
print "$POS_hash{$val}\t";
}
}
print "\n";
}
I believe this is a better way for your problem. Hope this will solve your problem.

Related

how to compare two arrays string

I have two files as input, a file containing a list of words StopWordsList.txt, I want to remove from StopWordsList.txt the words that are in StopWordsList.txt, here is my code:
my $FichierResulat = '/home/lenovo/Bureau/MesTravaux/LeskAlgo/OriginalLeskResult';
open( my $FhResultat, '>:utf8', $FichierResulat );
open( my $fh1, "<:utf8", '/home/lenovo/Bureau/MesTravaux/LeskAlgo/DemoLesk/StopWordsList.txt' )
or die "Failed to open file: $!\n"; #file contains stop words
open( my $fh2, "<:utf8", '/home/lenovo/Bureau/MesTravaux/LeskAlgo/text1.txt' ) #file contains text
or die "Failed to open file: $!\n";
my #tabStopWords = <$fh1>;
my #tab_contexte;
my #words;
while ( <$fh2> ) {
chomp;
next if m/^$/;
my $context = $_;
#words = split( / /, $_ );
}
#compare: remove from #words the words existing in #tabStopWords
my %temp;
#temp{#tabStopWords} = 0 .. $#tabStopWords;
for my $val ( #words ) {
if ( exists $temp{$val} ) {
print "$val est présent dans tab1 à la position $temp{$val}.\n";
}
else {
print "$val n'est pas dans tab1.\n";
push #tab_sans_SW, $val;
}
}
foreach my $value ( #tab_sans_SW ) {
print $FhResultat "$value\n";
}
but in the result file i have all the words existing in #words without removing the word that exist in #tabStopWords..
I hope tha can you help me.
my sotpwords file :
ال
الآن
التي
الذي
الذين
اللاتي
اللائي
اللتان
اللتين
my texte file :
ومواصلات بما فيه من بريد ونور ومياه وصناعات وعلوم ومعارف وحينما يركب احدنا قطارا فإنه يركب في نفس الوقت على حرية جاهزة اعدها له آلاف العمال والمخترعين والمهندسين في

There are a couple of problems
You don't chomp the contents of #tabStopWords, so each entry has a newline at the end
You overwrite the contents of #words each time around the while loop with #words = split(/ /, $_) instead of adding to it
This program will do what you want. I have added use autodie to avoid having to check the result of every open, and I have removed a couple of unused variables. Local variable names are better written using just lower-case letters and underscores, especially for readers whose first language isn't English
I've used split on both files to reduce them both to individual words. Because split also removes newline characters there is no need for chomp
use strict;
use warnings 'all';
use autodie;
use constant FICHIER_STOP_WORD => '/home/lenovo/Bureau/MesTravaux/LeskAlgo/DemoLesk/StopWordsList.txt';
use constant FICHIER_TEXTE => '/home/lenovo/Bureau/MesTravaux/LeskAlgo/text1.txt';
use constant FICHIER_RESULAT => '/home/lenovo/Bureau/MesTravaux/LeskAlgo/OriginalLeskResult';
my #tab_stop_words = do {
open my $fh1, "<:utf8", FICHIER_STOP_WORD;
map { split } <$fh1>;
};
my #words = do {
open my $fh1, "<:utf8", FICHIER_TEXTE;
map { split } <$fh1>;
};
my %words = map { $words[$_] => $_ } 0 .. $#words;
open my $fh_resultat, '>:utf8', FICHIER_RESULAT;
for my $word ( #words ) {
my $position = $words{$word};
if ( defined $position ) {
print "$word est présent dans tab1 à la position $position.\n";
}
else {
print "$word n'est pas dans tab1.\n";
print $fh_resultat "$word\n";
}
}

This problem would be easier to solve if you showed us the format of your two input files. But as you don't, this will be guesswork.
I guess that your file of stopwords contains a single word on each line. In that case, each element in #tabStopWords and, therefore, each key in %temp will have newline at the end of them. This makes it extremely unlikely that any of the words in your source file will match these keys.
You probably want to add:
chomp #tabStopWords;
to your code.

We can get the difference using smart match operator (~~),
my(#words_arr) = ("is","a");
my(#input_arr) = ("This","is","a","example","code");
my (#diff) = grep { not $_ ~~ #words_arr} #input_arr;

Nested Loop running very slowly

I'm trying to run a program to check each line of one file against each line of a second file to see if some of the elements match. Each file is around 200k lines.
What I've got so far looks like this;
#!/usr/bin/perl
#gffgenefind.pl
use strict;
use warnings;
die "SNP gff\n" unless #ARGV == 4;
open( my $snp, "<", $ARGV[0] ) or die "Can't open $:";
open( my $gff, "<", $ARGV[1] ) or die "can't open $:";
open( my $outg, ">", $ARGV[2] );
open( my $outs, ">", $ARGV[3] );
my $scaffold;
my $site;
my #snplines = <$snp>;
my #gfflines = <$gff>;
foreach my $snpline (#snplines) {
my #arr = split( /\t/, $snpline );
$scaffold = $arr[0];
$site = $arr[1];
foreach my $line (#gfflines) {
my #arr1 = split( /\t/, $line );
if ( $arr1[3] <= $site and $site <= $arr1[4] and $arr1[0] eq $scaffold ) {
print $outg "$line";
print $outs "$snpline";
}
}
}
File 1 (snp) looks like this scaffold_100 10689 A C A 0 0 0 0 0 0
File 2 (gff) looks like this scaffold_1 phytozomev10 gene 750912 765975 . - . ID=Carubv10008059m.g.v1.0;Name=Carubv10008059m.g
Essentially, I'm looking to see if the first values match and if the second value from snp is within the range defined on the second file (in this case 750912 to 765975)
I've seen that nested loops are to be avoided, and was wondering if there's an alternative way for me to look through this data.
Thanks!

Firstly - lose the foreach loop. That reads your whole file into memory, when you probably don't need to.
Try instead:
while ( my $snpline = <$snp> ) {
because it reads line by line.
Generally - mixing array indicies and named variables is also bad style.
The core problem will most likely be though because each line of your first file, you're cycling all of the second file.
Edit: Note - because 'scaffold' isn't unique, amended accordingly
This seems like a good place to use a hash. E.g.
my %sites;
while ( <$snp> ) {
my ( $scaffold, $site ) = split ( /\t/ );
$sites{$site}{$scaffold}++
}
while ( <$gff> ) {
my ( $name, $tmp1, $tmp2, $range_start, $range_end ) = split ( /\t/ );
if ( $sites{$name} ) {
foreach my $scaffold ( keys %{ $sites{$name} ) {
if ( $scaffold > $range_start
and $scaffold < $range_end ) {
#do stuff with it;
print;
}
}
}
}
Hopefully you get the gist, even if it isn't specifically what you're after?

Try this Python snippet:
#!/usr/bin/env python
import sys
import contextlib
if len(sys.argv) !=5:
raise Exception('SNP gff')
snp, gff, outg, outs = sys.argv[1:]
gff_dict = {}
with open(gff) as gff_handler:
for line in gff_handler:
fields=line.split()
try:
gff_dict[fields[0]].append(fields[1:])
except KeyError:
gff_dict[fields[0]] = [fields[1:]]
with contextlib.nested(open(snp),
open(outs, 'w'),
open(outg, 'w')) as (snp_handler,
outs_handler,
outg_handler):
for line_snp in snp_handler:
fields=line_snp.split()
key = fields[0]
if key in gff_dict:
for ele in gff_dict[key]:
if ele[2] <= fields[1] <= ele[3]:
outs_handler.write(line_snp)
outg_handler.write("{0}\t{1}\n".format(key,"\t".join(ele)))

match columns on different lines and sum

I have a csv with about 160,000 lines, it looks like this:
chr1,160,161,3,0.333333333333333,+
chr1,161,162,4,0.5,-
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,18,0.5,+
chr2,511,512,6,0.333333333333333,-
I would like to pair lines where column 1 is the same, column 3 matches column 2 and where column 6 is a '+' while on the other line it is a '-'. If this is true I would like to sum column 4 and column 5.
My desired out put would be
chr1,160,161,7,0.833333333333333,+
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,24,0.833333333333333,-
the best solution I can think of is to duplicate the file and then match columns between the file and it's duplicate with perl:
#!/usr/bin/perl
use strict;
use warnings;
open my $firstfile, '<', $ARGV[0] or die "$!";
open my $secondfile, '<', $ARGV[1] or die "$!";
my ($chr_a, $chr_b,$start,$end,$begin,$finish, $sum_a, $sum_b, $total_a,
$total_b,$sign_a,$sign_b);
while (<$firstfile>) {
my #col = split /,/;
$chr_a = $col[0];
$start = $col[1];
$end = $col[2];
$sum_a = $col[3];
$total_a = $col[4];
$sign_a = $col[5];
seek($secondfile,0,0);
while (<$secondfile>) {
my #seccol = split /,/;
$chr_b = $seccol[0];
$begin = $seccol[1];
$finish = $seccol[2];
$sum_b = $seccol[3];
$total_b = $seccol[4];
$sign_b = $seccol[5];
print join ("\t", $col[0], $col[1], $col[2], $col[3]+=$seccol[3],
$col[4]+=$seccol[4], $col[5]),
"\n" if ($chr_a eq $chr_b and $end==$begin and $sign_a ne $sign_b);
}
}
And that works fine, but ideally I'd like to be able to do this within the file itself without having to duplicate it, because I have many files and so I would like to run a script over all of them that is less time-consuming.
Thanks.

In the absence of a response to my comment, this program will do as you ask with the data you provide.
use strict;
use warnings;
my #last;
while (<DATA>) {
s/\s+\z//;
my #line = split /,/;
if (#last
and $last[0] eq $line[0]
and $last[2] eq $line[1]
and $last[5] eq '+' and $line[5] eq '-') {
$last[3] += $line[3];
$last[4] += $line[4];
print join(',', #last), "\n";
#last = ()
}
else {
print join(',', #last), "\n" if #last;
#last = #line;
}
}
print join(',', #last), "\n" if #last;
__DATA__
chr1,160,161,3,0.333333333333333,+
chr1,161,162,4,0.5,-
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,18,0.5,+
chr2,511,512,6,0.333333333333333,-
output
chr1,160,161,7,0.833333333333333,+
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,24,0.833333333333333,+

Find mismatch on 2nd column between 2 text files

I have these 2 text files and I would like to find any mismatch on 2nd column between files. The mismatch to be identified is based on type of F ,P and N regardless which lines they occur. I have 1F, 3P in first file while 2P,1N and 1F in second file. When do comparison, both files should have equal occurrence of type 1F, 3P and 1N.
Text1:
f0x11 F
f0x34 P
drx99
dex67 P
edx43 P
sdx33
Text2:
1 P
2 N
4
5 F
6
7 P
Expected Output:
Text 1 has missing type of N
Text 2 has missing type of P
What I have tried so far does not produce desired output.
code:
use strict;
my %ref_data;
my %ref_data2;
open my $fh, '<', 'Text1' or die "Could not open file to read:$!";
while (<$fh>) {
chomp;
my ($res, $type) = split;
if (defined $type){
$ref_data{$type} = "$type";
}
}
our ($data,$data2);
open $fh, '<', 'Text2' or die "Could not open file to read:$!";
while (<$fh>) {
chomp;
my ($res, $type) = split;
if (defined $type){
$ref_data2{$type}= "$type";
$data2= $ref_data2{$type};
$data = $ref_data{$type};
print "File 2 has missing type of $type\n" unless $data;
}
}
foreach ($data){
print "File 1 has missing type of $_\n" if $data ne $data2;
}

You appear to want to keep track of how many times the values in Column 2 occur within each file -- for example, in a comment you write, "I have 1F, 3P in first file while 2P,1N and 1Fin second file". If that's the case, you need a better data structure.
Specifically, one that counts occurrences of the values in Column 2, and you need those counts to be tracked separately for each file. That suggests a hash-of-hashes.
use strict;
use warnings;
# Example usage:
# perl YOUR_SCRIPT.pl a.txt b.txt
my #files = #ARGV;
# Count the values in Column 2, organizing the tallies like this:
# $tallies{COL_2}{FILE_NAME} = N
my %tallies;
while (<>) {
my #cols = split;
$tallies{$cols[1]}{$ARGV} ++ if #cols > 1;
}
# Print discrepancies.
for my $c2 (keys %tallies) {
my #t = map { $tallies{$c2}{$_} || 0 } #files;
next if $t[0] == $t[1];
print "$c2: $files[0] has $t[0]; $files[1] has $t[1]\n";
}
Example output:
N: a.txt has 0; b.txt has 1
P: a.txt has 3; b.txt has 2
Also worth noting: this code does not have to open any files explicitly, and file names are not hard-coded in the program. Instead, we pass input file names as command-line arguments, get those arguments via #ARGV, process lines in those files via <>, and know which file we're currently processing via $ARGV.

I've refactored your code where you seem to be duplicating the same behavior.
The output isn't to spec, but should be clear enough for you to understand and finish up yourself.
I added a close $fh; and use warnings; as well
#!/usr/bin/perl
use strict;
use warnings;
#run
my %max; # hash of combined data
my $file_data_1 = parse_file_into_hash("text1", \%max);
my $file_data_2 = parse_file_into_hash("text2", \%max);
diff_hashes(\%max, $file_data_1, $file_data_2);
# diff_hashes($max, $h1, $h2)
#
# diffs 2 hash refs against a combined $max hash and prints results
sub diff_hashes {
my ($max, $h1, $h2) = #_;
# TODO - do all the comparisios and some error checking (if keys exist etc...) here
for my $key (keys %$max) {
print "max/combined: $key = $max->{$key}\n";
my $h1_print = exists $h1->{$key} ? $h1->{$key} : "0";
my $h2_print = exists $h2->{$key} ? $h2->{$key} : "0";
print "h1: $key = $h1_print\n";
print "h2: $key = $h2_print\n";
}
}
# parse_file_into_hash($file, $max)
#
# $max is a hash reference (passed by reference) so you can count occurences over
# multiple files...
# returns reference of hash ( $line_number => $data_value )
sub parse_file_into_hash {
my ($file, $max) = #_;
my %ref_data;
open my $fh, '<', $file or die "Could not open file to read:$!";
while (my $line = <$fh>) {
chomp $line;
my ($res, $type) = split /\s+/, $line;
if ($type) {
$ref_data{$type}++;
if (!exists $max->{$type} || $ref_data{$type} > $max->{$type}) {
$max->{$type} = $ref_data{$type};
}
}
}
close $fh;
return \%ref_data;
}
Output ran against your example files:
$ ./example.pl
max/combined: F = 1
h1: F = 1
h2: F = 1
max/combined: N = 1
h1: N = 0
h2: N = 1
max/combined: P = 3
h1: P = 3
h2: P = 2

Parsing unsorted data from large fixed width text

I am mostly a Matlab user and a Perl n00b. This is my first Perl script.
I have a large fixed width data file that I would like to process into a binary file with a table of contents. My issue is that the data files are pretty large and the data parameters are sorted by time. Which makes it difficult (at least for me) to parse into Matlab. So seeing how Matlab is not that good at parsing text I thought I would try Perl. I wrote the following code which works ... at least on my small test file. However it is painfully slow when I tried it on an actual large data file. It was pieced together which lots of examples for various tasks from the web / Perl documentation.
Here is a small sample of the data file. Note: Real file has about 2000 parameter and is 1-2GB. Parameters can be text, doubles, or unsigned integers.
Param 1 filter = ALL_VALUES
Param 2 filter = ALL_VALUES
Param 3 filter = ALL_VALUES
Time Name Ty Value
---------- ---------------------- --- ------------
1.1 Param 1 UI 5
2.23 Param 3 TXT Some Text 1
3.2 Param 1 UI 10
4.5 Param 2 D 2.1234
5.3 Param 1 UI 15
6.121 Param 2 D 3.1234
7.56 Param 3 TXT Some Text 2
The basic logic of my script is to:
Read until the ---- line to build list of parameters to extract (always has "filter =").
Use the --- line to determine field widths. It is broken by spaces.
For each parameter build time and data array (while nested inside of foreach param)
In continue block write time and data to binary file. Then record name, type, and offsets in text table of contents file (used to read the file later into Matlab).
Here is my script:
#!/usr/bin/perl
$lineArg1 = #ARGV[0];
open(INFILE, $lineArg1);
open BINOUT, '>:raw', $lineArg1.".bin";
open TOCOUT, '>', $lineArg1.".toc";
my $line;
my $data_start_pos;
my #param_name;
my #template;
while ($line = <INFILE>) {
chomp $line;
if ($line =~ s/\s+filter = ALL_VALUES//) {
$line = =~ s/^\s+//;
$line =~ s/\s+$//;
push #param_name, $line;
}
elsif ($line =~ /^------/) {
#template = map {'A'.length} $line =~ /(\S+\s*)/g;
$template[-1] = 'A*';
$data_start_pos = tell INFILE;
last; #Reached start of data exit loop
}
}
my $template = "#template";
my #lineData;
my #param_data;
my #param_time;
my $data_type;
foreach $current_param (#param_name) {
#param_time = ();
#param_data = ();
seek(INFILE,$data_start_pos,0); #Jump to data start
while ($line = <INFILE>) {
if($line =~ /$current_param/) {
chomp($line);
#lineData = unpack $template, $line;
push #param_time, #lineData[0];
push #param_data, #lineData[3];
}
} # END WHILE <INFILE>
} #END FOR EACH NAME
continue {
$data_type = #lineData[2];
print TOCOUT $current_param.",".$data_type.",".tell(BINOUT).","; #Write name,type,offset to start time
print BINOUT pack('d*', #param_time); #Write TimeStamps
print TOCOUT tell(BINOUT).","; #offset to end of time/data start
if ($data_type eq "TXT") {
print BINOUT pack 'A*', join("\n",#param_data);
}
elsif ($data_type eq "D") {
print BINOUT pack('d*', #param_data);
}
elsif ($data_type eq "UI") {
print BINOUT pack('L*', #param_data);
}
print TOCOUT tell(BINOUT).","."\n"; #Write memory loc to end data
}
close(INFILE);
close(BINOUT);
close(TOCOUT);
So my questions to you good people of the web are as follows:
What am I obviously screwing up? Syntax, declaring variables when I don't need to, etc.
This is probably slow (guessing) because of the nested loops and searching the line by line over and over again. Is there a better way to restructure the loops to extract multiple lines at once?
Any other speed improvement tips you can give?
Edit: I modified the example text file to illustrate non-integer time stamps and Param Names may contain spaces.

First, you should always have 'use strict;' and 'use warnings;' pragmas in your script.
It seems like you need a simple array (#param_name) for reference, so loading those values would be straight forward as you have it. (again, adding the above pragmas would start showing you errors, including the $line = =~ s/^\s+//; line!)
I suggest you read this, to understand how you can load your data file into a
Hash of Hashes. Once you've designed the hash, you simply read and load the file data contents, and then iterate through the contents of the hash.
For example, using time as the key for the hash
%HoH = (
1 => {
name => "Param1",
ty => "UI",
value => "5",
},
2 => {
name => "Param3",
ty => "TXT",
value => "Some Text 1",
},
3 => {
name => "Param1",
ty => "UI",
value => "10",
},
);
Make sure you close the INFILE after reading in the contents, before you start processing.
So in the end, you iterate over the hash, and reference the array (instead of the file contents) for your output writes - I would imagine it would be much faster to do this.
Let me know if you need more info.
Note: if you go this route, include Data:Dumper - a significant help to printing and understanding the data in your hash!

It seems to me that embedded spaces can only occur in the last field. That makes using split ' ' feasible for this problem.
I am assuming you are not interested in the header. In addition, I am assuming you want a vector for each parameter and are not interested in timestamps.
To use data file names specified on the command line or piped through standard input, replace <DATA> with <>.
#!/usr/bin/env perl
use strict; use warnings;
my %data;
$_ = <DATA> until /^-+/; # skip header
while (my $line = <DATA>) {
$line =~ s/\s+\z//;
last unless $line =~ /\S/;
my (undef, $param, undef, $value) = split ' ', $line, 4;
push #{ $data{ $param } }, $value;
}
use Data::Dumper;
print Dumper \%data;
__DATA__
Param1 filter = ALL_VALUES
Param2 filter = ALL_VALUES
Param3 filter = ALL_VALUES
Time Name Ty Value
---------- ---------------------- --- ------------
1 Param1 UI 5
2 Param3 TXT Some Text 1
3 Param1 UI 10
4 Param2 D 2.1234
5 Param1 UI 15
6 Param2 D 3.1234
7 Param3 TXT Some Text 2
Output:
$VAR1 = {
'Param2' => [
'2.1234',
'3.1234'
],
'Param1' => [
'5',
'10',
'15'
],
'Param3' => [
'Some Text 1',
'Some Text 2'
]
};

First off, this piece of code causes the input file to be read once for every param. Which is quite in-efficient.
foreach $current_param (#param_name) {
...
seek(INFILE,$data_start_pos,0); #Jump to data start
while ($line = <INFILE>) { ... }
...
}
Also there is very rarely a reason to use a continue block. This is more style / readability, then a real problem.
Now on to make it more performant.
I packed the sections individually, so that I could process a line exactly once. To prevent it from using up tons of RAM, I used File::Temp to store the data until I was ready for it. Then I used File::Copy to append those sections into the binary file.
This is a quick implementation. If I were to add much more to it, I would split it up more than it is now.
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp 'tempfile';
use File::Copy 'copy';
use autodie qw':default copy';
use 5.10.1;
my $input_filename = shift #ARGV;
open my $input, '<', $input_filename;
my #param_names;
my $template = ''; # stop uninitialized warning
my #field_names;
my $field_name_line;
while( <$input> ){
chomp;
next if /^\s*$/;
if( my ($param) = /^\s*(.+?)\s+filter = ALL_VALUES\s*$/ ){
push #param_names, $param;
}elsif( /^[\s-]+$/ ){
my #fields = split /(\s+)/;
my $pos = 0;
for my $field (#fields){
my $length = length $field;
if( substr($field, 0, 1) eq '-' ){
$template .= "\#${pos}A$length ";
}
$pos += $length;
}
last;
}else{
$field_name_line = $_;
}
}
#field_names = unpack $template, $field_name_line;
for( #field_names ){
s(^\s+){};
$_ = lc $_;
$_ = 'type' if substr('type', 0, length $_) eq $_;
}
my %temp_files;
for my $param ( #param_names ){
for(qw'time data'){
my $fh = tempfile 'temp_XXXX', UNLINK => 1;
binmode $fh, ':raw';
$temp_files{$param}{$_} = $fh;
}
}
my %convert = (
TXT => sub{ pack 'A*', join "\n", #_ },
D => sub{ pack 'd*', #_ },
UI => sub{ pack 'L*', #_ },
);
sub print_time{
my($param,$time) = #_;
my $fh = $temp_files{$param}{time};
print {$fh} $convert{D}->($time);
}
sub print_data{
my($param,$format,$data) = #_;
my $fh = $temp_files{$param}{data};
print {$fh} $convert{$format}->($data);
}
my %data_type;
while( my $line = <$input> ){
next if $line =~ /^\s*$/;
my %fields;
#fields{#field_names} = unpack $template, $line;
print_time( #fields{(qw'name time')} );
print_data( #fields{(qw'name type value')} );
$data_type{$fields{name}} //= $fields{type};
}
close $input;
open my $bin, '>:raw', $input_filename.".bin";
open my $toc, '>', $input_filename.".toc";
for my $param( #param_names ){
my $data_fh = $temp_files{$param}{data};
my $time_fh = $temp_files{$param}{time};
seek $data_fh, 0, 0;
seek $time_fh, 0, 0;
my #toc_line = ( $param, $data_type{$param}, 0+sysseek($bin, 0, 1) );
copy( $time_fh, $bin, 8*1024 );
close $time_fh;
push #toc_line, sysseek($bin, 0, 1);
copy( $data_fh, $bin, 8*1024 );
close $data_fh;
push #toc_line, sysseek($bin, 0, 1);
say {$toc} join ',', #toc_line, '';
}
close $bin;
close $toc;

I modified my code to build a Hash as suggested. I have not incorporate the output to binary yet due to time limitations. Plus I need to figure out how to reference the hash to get the data out and pack it into binary. I don't think that part should be to difficult ... hopefully
On an actual data file (~350MB & 2.0 Million lines) the following code takes approximately 3 minutes to build the hash. CPU usage was 100% on 1 of my cores (nill on the other 3) and Perl memory usage topped out at around 325MB ... until it dumped millions of lines to the prompt. However the print Dump will be replaced with a binary pack.
Please let me know if I am making any rookie mistakes.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $lineArg1 = $ARGV[0];
open(INFILE, $lineArg1);
my $line;
my #param_names;
my #template;
while ($line = <INFILE>) {
chomp $line; #Remove New Line
if ($line =~ s/\s+filter = ALL_VALUES//) { #Find parameters and build a list
push #param_names, trim($line);
}
elsif ($line =~ /^----/) {
#template = map {'A'.length} $line =~ /(\S+\s*)/g; #Make template for unpack
$template[-1] = 'A*';
my $data_start_pos = tell INFILE;
last; #Reached start of data exit loop
}
}
my $size = $#param_names+1;
my #getType = ((1) x $size);
my $template = "#template";
my #lineData;
my %dataHash;
my $lineCount = 0;
while ($line = <INFILE>) {
if ($lineCount % 100000 == 0){
print "On Line: ".$lineCount."\n";
}
if ($line =~ /^\d/) {
chomp($line);
#lineData = unpack $template, $line;
my ($inHeader, $headerIndex) = findStr($lineData[1], #param_names);
if ($inHeader) {
push #{$dataHash{$lineData[1]}{time} }, $lineData[0];
push #{$dataHash{$lineData[1]}{data} }, $lineData[3];
if ($getType[$headerIndex]){ # Things that only need written once
$dataHash{$lineData[1]}{type} = $lineData[2];
$getType[$headerIndex] = 0;
}
}
}
$lineCount ++;
} # END WHILE <INFILE>
close(INFILE);
print Dumper \%dataHash;
#WRITE BINARY FILE and TOC FILE
my %convert = (TXT=>sub{pack 'A*', join "\n", #_}, D=>sub{pack 'd*', #_}, UI=>sub{pack 'L*', #_});
open my $binfile, '>:raw', $lineArg1.'.bin';
open my $tocfile, '>', $lineArg1.'.toc';
for my $param (#param_names){
my $data = $dataHash{$param};
my #toc_line = ($param, $data->{type}, tell $binfile );
print {$binfile} $convert{D}->(#{$data->{time}});
push #toc_line, tell $binfile;
print {$binfile} $convert{$data->{type}}->(#{$data->{data}});
push #toc_line, tell $binfile;
print {$tocfile} join(',',#toc_line,''),"\n";
}
sub trim { #Trim leading and trailing white space
my (#strings) = #_;
foreach my $string (#strings) {
$string =~ s/^\s+//;
$string =~ s/\s+$//;
chomp ($string);
}
return wantarray ? #strings : $strings[0];
} # END SUB
sub findStr { #Return TRUE if string is contained in array.
my $searchStr = shift;
my $i = 0;
foreach ( #_ ) {
if ($_ eq $searchStr){
return (1,$i);
}
$i ++;
}
return (0,-1);
} # END SUB
The output is as follows:
$VAR1 = {
'Param 1' => {
'time' => [
'1.1',
'3.2',
'5.3'
],
'type' => 'UI',
'data' => [
'5',
'10',
'15'
]
},
'Param 2' => {
'time' => [
'4.5',
'6.121'
],
'type' => 'D',
'data' => [
'2.1234',
'3.1234'
]
},
'Param 3' => {
'time' => [
'2.23',
'7.56'
],
'type' => 'TXT',
'data' => [
'Some Text 1',
'Some Text 2'
]
}
};
Here is the output TOC File:
Param 1,UI,0,24,36,
Param 2,D,36,52,68,
Param 3,TXT,68,84,107,
Thanks everyone for their help so far! This is an excellent resource!
EDIT: Added Binary & TOC file writing code.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Next line array if a condition is achieved - perl

Related

how to compare two arrays string

Nested Loop running very slowly

match columns on different lines and sum

Find mismatch on 2nd column between 2 text files

Parsing unsorted data from large fixed width text

Categories

Resources