I've got the following function inside a Perl script:
sub fileSize {
    my $file = shift;
    my $opt  = shift;
    open (FILE, $file) or die "Could not open file $file: $!";
    $/ = ">";
    my $junk = <FILE>;
    my $g_size = 0;
    while ( my $rec = <FILE> ) {
        chomp $rec;
        my ($name, @seqLines) = split /\n/, $rec;
        my $sec = join('', @seqLines);
        $g_size += length($sec);
        if ( $opt == 1 ) {
            open TMP, ">>", "tmp" or die "Could not open tmp: $!\n";
            print TMP "$name\t", length($sec), "\n";
        }
    }
    if ( $opt == 0 ) {
        PrintLog( "file_size: $g_size", 0 );
    }
    else {
        print TMP "file_size: $g_size\n";
        close TMP;
    }
    $/ = "\n";
    close FILE;
}
Input file format:
>one
AAAAA
>two
BBB
>three
C
I have several input files with that format. The lines beginning with ">" are the same in every file, but the other lines can be of different lengths. The output of the function with only one file is:
one 5
two 3
three 1
I want to execute the function in a loop with this for each file:
foreach my $file ( @refs ) {
    fileSize( $file, 1 );
}
When running the next iteration, let's say with this file:
>one
AAAAABB
>two
BBBVFVF
>three
CS
I'd like to obtain this output:
one 5 7
two 3 7
three 1 2
How can I modify the function, or the script, to get this? As can be seen, my function appends the text to the file.
Thanks!
I've left out your options and the file I/O operations and have concentrated on showing a way to do this with an array of arrays from the command line. I hope it helps. I'll leave wiring it up to your own script and subroutines mostly up to you :-)
Running this one liner against your first data file:
perl -lne '$name = s/>//r if /^>/ ;
           push @strings , [$name, length $_] if !/^>/ ;
           END { print "@{$_}" for @strings }' datafile1.txt
gives this output:
one 5
two 3
three 1
Substituting the second version or instance of the data file (i.e. where record one contains AAAAABB) gives the expected results as well.
one 7
two 7
three 2
In your script above, you save to an output file in this format. So, to append columns to each row in your output file, we can just munge each of your data files in the same way (with any luck this might mean things can be converted into a function that will work in a foreach loop). If we save the transformed data to be output into an array of arrays (AoA), then we can just push the length values we get for each data file string onto the corresponding anonymous array element and then print out the array. Voilà! Now let's hope it works ;-)
You might want to install Data::Printer which can be used from the command line as -MDDP to visualize data structures.
First - run the above script and redirect the output to a file with > /tmp/output.txt
Next - try this longish one-liner that uses DDP and p to show the structure of the array we create:
perl -MDDP -lne 'BEGIN{ local @ARGV=shift;
                 @tmp = map { [split] } <>; p @tmp }
                 $name = s/>//r if /^>/ ;
                 push @out , [ $name, length $_ ] if !/^>/ ;
                 END{ p @out ; }' /tmp/output.txt datafile2.txt
In the BEGIN block we local-ize @ARGV ; shift off the first file (our version of your TMP file) - {local @ARGV=shift} is almost a perl idiom for handling multiple input files; we then split it inside an anonymous array constructor ([]) and map { } that into the @tmp array which we display with DDP's p() function. Once we are out of the BEGIN block, the implicit while (<>){ ... } that we get with perl's -n command line switch takes over and reads in the remaining file from @ARGV ; we process lines starting with > - stripping the leading character and assigning the string that follows to the $name variable; the while continues and we push $name and the length of any line that does not start with > (if !/^>/) wrapped as elements of an anonymous array [] into the @out array which we display with p() as well (in the END{} block so it doesn't print inside our implicit while() loop). Phew!!
See the AoA that results as a gist on GitHub.
Finally - building on that, and now we have munged things nicely - we can change a few things in our END{...} block (add a nested for loop to push things around) and put this all together to produce the output we want.
This one liner:
perl -MDDP -lne 'BEGIN{ local @ARGV=shift; @tmp = map {[split]} <>; }
     $name = s/>//r if /^>/ ; push @out, [ $name, length $_ ] if !/^>/ ;
     END{ foreach $row (0..$#tmp) { push @{ $tmp[$row] }, $out[$row][-1] } ;
     print "@$_" for @tmp }' output.txt datafile2.txt
produces:
one 5 7
two 3 7
three 1 2
We'll have to convert that into a script :-)
The script consists of three rather wordy subroutines that read the log file, parse the data file, and merge them. We run them in order. The first one checks whether there is an existing log; if not, it creates one and then exits, skipping any further parsing/merging steps.
You should be able to wrap them in a loop of some kind that feeds files to the subroutines from an array instead of fetching them from STDIN (a rough sketch of such a loop follows the script). One caution - I'm using IO::All because it's fun and easy!
use 5.14.0;
use IO::All;

my @file = io(shift)->slurp;
my $log  = "output.txt";

&readlog;
&parsedatafile;
&mergetolog;

####### subs #######

sub readlog {
    if (! -R $log) {
        print "creating first log entry\n";
        my @newlog = &parsedatafile;
        open(my $fh, '>', $log) or die "I CAN HAZ WHA????";
        print $fh "@$_\n" for @newlog;
        exit;
    }
    else {
        map { [split] } io($log)->slurp;
    }
}

sub parsedatafile {
    my (@out, $name);
    for (@file) {
        chomp;
        $name = s/>//r if /^>/;
        push @out, [$name, length $_] if !/^>/;
    }
    @out;
}

sub mergetolog {
    my @tmp  = readlog;
    my @data = parsedatafile;
    foreach my $row (0 .. $#tmp) {
        push @{ $tmp[$row] }, $data[$row][-1];
    }
    open(my $fh, '>', $log) or die "Foobar!!!";
    print $fh "@$_\n" for @tmp;
}
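To wire this into the kind of loop mentioned above, here is one rough, untested sketch. It assumes @file is declared empty at the top (my @file;) instead of being slurped from shift, that the three bare sub calls are replaced by this loop, that readlog()'s exit is changed to a return so the loop can continue, and that the data file names are passed on the command line:
for my $datafile (@ARGV) {
    @file = io($datafile)->slurp;   # refresh the lines parsedatafile() works on
    if ( ! -R $log ) {
        readlog();                  # first data file: create the log
    }
    else {
        mergetolog();               # later files: append a column of lengths
    }
}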
The subroutines do all the work here - you can likely find ways to shorten; combine; improve them. Is this a useful approach for you?
I hope this explanation is clear and useful to someone - corrections and comments welcome. Probably the same thing could be done with in-place editing (i.e. with perl -pi -e '...'), which is left as an exercise for those who follow ...
You need to open the output file itself: first in read mode, then in write mode.
I have written a script that does what you are asking. What really matters is the part that appends new data to old data. Adapt that to your fileSize function.
So you have the output file, output.txt
Of the form,
one 5
two 3
three 1
And an array of input files, input1.txt, input2.txt, etc., saved in the @inputfiles variable.
Of the form,
>one
AAAAA
>two
BBB
>three
C
>four
DAS
and
>one
AAAAABB
>two
BBBVFVF
>three
CS
Respectively.
After running the following perl script,
# First read the previous output file.
open OUT, '<', "output.txt" or die $!;
my @outlines;
while (my $line = <OUT>) {
    chomp $line;
    push @outlines, $line;
}
close OUT;
my $outsize = scalar @outlines;

# Suppose you have your array of input file names already prepared
my @inputfiles = ("input1.txt", "input2.txt");

foreach my $file (@inputfiles) {
    open IN, '<', $file or die $!;
    my $counter = 1;    # Used to compare against output size
    while (my $line = <IN>) {
        chomp $line;
        $line =~ m/^>(.*)$/;
        my $name = $1;
        my $sequence = <IN>;
        chomp $sequence;
        my $seqsize = length($sequence);
        # Here is where I append a column to the output data.
        if ($counter <= $outsize) {
            $outlines[$counter - 1] .= " $seqsize";
        } else {
            $outlines[$counter - 1] = "$name $seqsize";
        }
        $counter++;
    }
    close IN;
}

# Now rewrite the results to output.txt
open OUT, '>', "output.txt" or die $!;
foreach (@outlines) {
    print OUT "$_\n";
}
close OUT;
You generate the output,
one 5 5 7
two 3 3 7
three 1 1 2
four 3
I am writing a Perl script that takes two files as input: the first is a tab-separated table with an identifier of interest in the second column; the second is a list of identifiers that match the second column of the first file.
THE GOAL is to print only those lines of the table which contain an identifier in the second column, and to print each line only once. I have written three versions of this program and get different numbers of lines printed from each.
Version 1:
# TAB-SEPARATED TABLE FILE
open (FILE, $file);
while (<FILE>) {
    my $line = $_;
    chomp $line;
    # ARRAY CONTAINING EACH IDENTIFIER AS A SEPARATE ELEMENT
    foreach (@refs) {
        my $ref = $_;
        chomp $ref;
        if ( $line =~ $ref ) { print "$line\n"; next; }
    }
}
Version 2:
# ARRAY CONTAINING EVERY LINE OF THE TAB-SEPARATED TABLE AS A SEPARATE ELEMENT
foreach (@doc) {
    my $full = $_;
    # IF BLOCK FOR PRINTING THE HEADER BUT NOT COMPARING IT TO THE ARRAY BELOW
    if ( $counter == 0 ) {
        print "$full\n";
        $counter++;
        next;
    }
    # EXTRACT IDENTIFIER FROM LINE
    my @cells = split('\t', $full);
    my $gene = $cells[1];
    foreach (@refs) {
        my $text = $_;
        if ( $gene =~ $text && $counter == 1 ) { # COMPARE IDENTIFIER
            print "$full\n";
            next;
        }
    }
    $counter--;
}
Version 3:
# LIST OF IDENTIFIERS
foreach (@refs) {
    my $ref = $_;
    # LIST OF EACH ROW OF THE TABLE
    foreach (@doc) {
        my $line = $_;
        my @cells = split('\t', $line);
        my $gene = $cells[1];
        if ( $gene =~ $ref ) { print "$line\n"; next; }
    }
}
Each of these approaches gives me different output and I do not understand why. I also do not know whether I can trust any of them to give the right output. The right output should not contain any duplicate lines, but more than one row might match a given identifier from the list.
Sample Input File:
Position Symbol Name REF ALT
chr1:887801 NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) A G
chr1:888639 NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) T C
chr1:888659 NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) T C
chr1:897325 KLHL17 kelch-like 17 (Drosophila) G C
chr1:909238 PLEKHN1 pleckstrin homology domain containing, family N member 1 G C
chr1:982994 AGRN agrin T C
chr1:1254841 CPSF3L cleavage and polyadenylation specific factor 3-like C G
chr1:3301721 PRDM16 PR domain containing 16 C T
chr1:3328358 PRDM16 PR domain containing 16 T C
List is pulled from a file that looks like this:
A1BG
A2M
A2ML1
AAK1
ABCA12
ABCA13
ABCA2
ABCA4
ABCC2
It's put into an array using this code:
open (REF, $ref_file);
while (<REF>) {
    my $line = $_;
    chomp $line;
    push(@refs, $line);
}
close REF;
Whenever you hear "I need to look up something", think hashes.
What you can do is create a hash that contains the elements you want to pull out of file #1. Then, use a second hash to track whether or not you printed it before:
#!/usr/bin/env perl
use warnings;
use strict;
use feature qw(say);
use autodie;   # This way, I don't have to check my open for failures

use constant {
    TABLE_FILE  => "file1.txt",
    LOOKUP_FILE => "file2.txt",
};

open my $lookup_fh, "<", LOOKUP_FILE;
my %lookup_table;
while ( my $symbol = <$lookup_fh> ) {
    chomp $symbol;
    $lookup_table{$symbol} = 1;
}
close $lookup_fh;

open my $table_file, "<", TABLE_FILE;
my %is_printed;
while ( my $line = <$table_file> ) {
    chomp $line;
    my @line_array = split /\s+/, $line;
    my $symbol = $line_array[1];
    if ( exists $lookup_table{$symbol} and not exists $is_printed{$symbol} ) {
        say $line;
        $is_printed{$symbol} = 1;
    }
}
Two loops, but much more efficient. In yours, if you had 100 items in the first file and 1,000 items in the second file, you would have to loop 100 * 1,000, or 100,000, times. In this, you only loop the total number of lines in both files.
I use the three-argument form of the open command, which allows you to handle files with names that start with | or <, etc. Also, I use variables for my file handles, which makes it easier to pass a file handle to a subroutine if so desired.
I use use autodie; which handles issues such as the file failing to open. In your program, execution would just continue on its merry way. If you don't want to use autodie, you need to do this:
open my $fh, "<", $my_file or die qq(Couldn't open "$my_file" for reading: $!);
I use two hashes. The first is %lookup_table, which stores the Symbols you want to print. When I go through the table file, I can simply check whether $lookup_table{$symbol} exists. If it doesn't, I don't print the line; if it does, I print it.
The second hash %is_printed keeps track of Symbols I've already printed. If $is_printed{$symbol} exists, I know I've already printed that line.
Even though you said the table is tab separated, I use /\s+/ as the split regular expression. This will catch a tab, but it will also cope if someone used two tabs (to keep things looking nice) or accidentally typed a space before that tab.
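For example (my own illustration, not from the question's data):
my $line   = "chr1:887801\t\tNOC2L\tA\tG";   # note the accidental double tab
my @by_tab = split /\t/,  $line;   # ('chr1:887801', '', 'NOC2L', 'A', 'G') - an empty field sneaks in
my @by_ws  = split /\s+/, $line;   # ('chr1:887801', 'NOC2L', 'A', 'G')     - the whitespace run collapses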
I'm pretty sure this should work:
$ awk '
NR == FNR {Identifiers[$1]; next}
$2 in Identifiers {
$1 = ""; $0 = $0; if(!Printed[$0]++) {print}
}' identifiers_file.txt data_file.txt
Given identifiers_file.txt such as this (to which I added NOC2L since there were no matching identifiers in your sample):
A1BG
A2M
A2ML1
AAK1
ABCA12
ABCA13
ABCA2
ABCA4
ABCC2
NOC2L
then your output will be:
$ awk '
NR == FNR {Identifiers[$1]; next}
$2 in Identifiers {
$1 = ""; $0 = $0; if(!Printed[$0]++) {print}
}' idents.txt data.txt
NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) A G
NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) T C
If that's correct and you want a Perl version, you can just:
$ echo 'NR == FNR {Identifiers[$1]; next} $2 in Identifiers { $1 = ""; $0 = $0; if(!Printed[$0]++) {print} }' \
| a2p
I suggest you mix your first and second versions and add a hash to them.
The first version is good because it parses your data file line by line in a clear way.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;

open (REF, $ARGV[0]);
my %refs;
while (<REF>) {
    my $line = $_;
    chomp $line;
    $refs{$line} = 0;
}
close REF;

# for head printing
$refs{'Symbol'} = 0;

open (FILE, $ARGV[1]);
while (<FILE>) {
    my $line = $_;
    my @cells = split('\t', $line);
    my $gene = $cells[1];
    #print $line, "\n" if exists $refs{$gene};
    if (exists $refs{$gene} and $refs{$gene} == 0) {
        $refs{$gene}++;
        print $line;
    }
}
close FILE;
Another question for everyone. To reiterate, I am very new to Perl and I apologize in advance for making silly mistakes.
I am trying to calculate the GC content of different lengths of DNA sequence. The file is in this format:
>gene 1
DNA sequence of specific gene
>gene 2
DNA sequence of specific gene
...etc...
This is a small piece of the file
>env
ATGCTTCTCATCTCAAACCCGCGCCACCTGGGGCACCCGATGAGTCCTGGGAA
I have established the counters and can read each line of DNA sequence, but at the moment it is doing a running summation of the total across all lines. I want it to read each sequence, print the counts after that sequence is read, then move on to the next one, so that there are individual base counts for each sequence.
This is what I have so far.
#!/usr/bin/perl
# necessary code to open and read a new file and create a new one.
use strict;
my $infile = "Lab1_seq.fasta";
open INFILE, $infile or die "$infile: $!";
my $outfile = "Lab1_seq_output.txt";
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!";

# establishing the initial counts for each base
my $G = 0;
my $C = 0;
my $A = 0;
my $T = 0;

# initial loop created to read through each line
while ( my $line = <INFILE> ) {
    chomp $line;
    # reads file until the ">" character is encountered and prints the line
    if ($line =~ /^>/){
        print OUTFILE "Gene: $line\n";
    }
    # otherwise count the content of the next line.
    # my percent counts seem to be incorrect due to my Total length counts skewing the following line.
    # I am currently unsure how to fix that
    elsif ($line =~ /^[A-Z]/){
        my @array = split //, $line;
        my $array = (@array);
        # reset the counts of each variable
        $G = ();
        $C = ();
        $A = ();
        $T = ();
        foreach $array (@array){
            # if statements assess which base is present and make a running total of the bases.
            if ($array eq 'G'){
                ++$G;
            }
            elsif ( $array eq 'C' ) {
                ++$C; }
            elsif ( $array eq 'A' ) {
                ++$A; }
            elsif ( $array eq 'T' ) {
                ++$T; }
        }
        # all is printed to the outfile
        print OUTFILE "G:$G\n";
        print OUTFILE "C:$C\n";
        print OUTFILE "A:$A\n";
        print OUTFILE "T:$T\n";
        print OUTFILE "Total length:_", ($A+=$C+=$G+=$T), "_base pairs\n";
        print OUTFILE "GC content is(percent):_", (($G+=$C)/($A+=$C+=$G+=$T)*100),"_%\n";
    }
}
# close the outfile and the infile
close OUTFILE;
close INFILE;
Again, I feel like I am on the right path; I am just missing some basic foundations. Any help would be greatly appreciated.
The final problem is in the final counts printed out. My percent values are wrong. I feel like the total is being calculated and then that new value is incorporated into the counts that follow.
Several things:
1. Use a hash instead of declaring each element.
2. An assignment such as $G = (0); does work, but it is not the right way to assign a scalar: the parentheses just make a one-element list, which in scalar context still yields 0. The correct way is $G = 0 (see the small illustration after the snippet below).
my %seen;
$seen{$_}++ for map { /^>(\S+)/ ? $1 : () } <INFILE>;
foreach my $gene (keys %seen) {
    print "$gene: $seen{$gene}\n";
}
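A tiny illustration of point 2 (my own example, not from the original post):
my @bases   = ('G', 'C', 'A');
my $g       = (0);       # the parens change nothing here: $g is just the scalar 0
my $count   = @bases;    # scalar context on an array gives the element count, 3
my ($first) = @bases;    # list assignment takes the first element, 'G'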
Just reset the counters when a new gene is found. Also, I'd use hashes for the counting:
use strict; use warnings;

my %counts;

while (<>) {
    if (/^>/) {
        # print counts for the prev gene if there are counts:
        print_counts(\%counts) if keys %counts;
        %counts = ();  # reset the counts
        print $_;      # print the Fasta header
    } else {
        chomp;
        $counts{$_}++ for split //;
    }
}
print_counts(\%counts) if keys %counts;  # print counts for last gene

sub print_counts {
    my ($counts) = @_;
    print "$_:=", ($counts->{$_} || 0), "\n" for qw/A C G T/;
}
Usage: $ perl count-bases.pl input.fasta.
Example output:
> gene 1
A:=3
C:=1
G:=5
T:=5
> gene 2
A:=1
C:=5
G:=0
T:=13
Style comments:
When opening a file, always use lexical filehandles (normal variables). Also, you should do a three-arg open. I'd also recommend the autodie pragma for automatic error handling (since perl v5.10.1).
use autodie;
open my $in, "<", $infile;
open my $out, ">", $outfile;
Note that I don't open files in my above script because I use the special ARGV filehandle for input, and print to STDOUT. The output can be redirected on the shell, like
$ perl count-bases.pl input.fasta >counts.txt
Declaring scalar variables with their values in parens like my $G = (0) is weird, but works fine. I think this is more confusing than helpful. → my $G = 0.
Your indentation is a bit weird. It is very unusual and visually confusing to put closing braces on the same line as another statement, like
...
elsif ( $array eq 'C' ) {
++$C; }
I prefer cuddling elsif:
...
} elsif ($base eq 'C') {
$C++;
}
This statement my $array = (@array); puts the length of the array into $array. What for? Tip: You can declare variables right inside foreach loops, like for my $base (@array) { ... }.
I am fairly new to Perl scripting and need some help. Below is my query:
I have a file which has contents like below:
AA ABC 0 0
line1
line2
...
AA XYZ 1 1
line..
line..
AA GHI 2 2
line..
line...
Now I would like to get all the lines between those lines which start with the string/pattern "AA" and write them to files ABC.txt, XYZ.txt, GHI.txt, respectively, including the "AA" line itself. For example, ABC.txt should look like
AA ABC 0 0
line1
line2...
and XYZ.txt should look like
AA XYZ 1 1
line..
line..
I hope I am clear in this question; any help regarding this is much appreciated.
Thanks,
Sandy
I presume you're asking for an algorithm since you didn't specify what you needed help with.
Declare a file handle for use for output.
While you haven't reached the end of the input file,
    Read a line.
    If it's a header line,
        Parse it.
        Determine the file name.
        (Re)open the output file.
    Print the line to the output file handle.
Lest you be tempted to use one of the poor solutions that have been posted since I posted the above, here's the code:
my $fh;
while (<>) {
    if (my ($fn) = /^AA\s+(\S+)/) {
        $fn .= '.txt';
        open($fh, '>', $fn)
            or die("Can't create file \"$fn\": $!\n");
    }
    print $fh $_;
}
Possible improvements, both of which are easy to add (a sketch incorporating them follows):
Check for duplicate headers (testing -e $fn is one way).
Check for data before the first header (testing !$fh is one way).
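A quick, untested sketch with both checks bolted on (the file handling is otherwise unchanged from the code above):
my $fh;
while (<>) {
    if (my ($fn) = /^AA\s+(\S+)/) {
        $fn .= '.txt';
        die("Duplicate header for \"$fn\" at line $.\n") if -e $fn;   # duplicate-header check
        open($fh, '>', $fn)
            or die("Can't create file \"$fn\": $!\n");
    }
    die("Data before the first header at line $.\n") if !$fh;         # data-before-header check
    print $fh $_;
}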
You just need to keep one file open at a time... When a line matches XYZ, then you open your XYZ.txt file and output the line. You keep that file open (let's just say it's the handle CURRENT_FILE) and output each successive line to it until you match a new header line. Then you close the current file and open another one.
My Perl is extremely rusty, so I don't think I can provide code that compiles, but essentially it's something close to this.
my $current_name = "";
foreach my $line (<INPUT>)
{
    my ($name) = $line =~ /^AA (\w+)/;
    if ( defined $name and $name ne $current_name ) {
        close(CURRENT_FILE) if $current_name ne "";
        open(CURRENT_FILE, ">>", "$name.txt") || die "Argh\n";
        $current_name = $name;
    }
    next if $current_name eq "";
    print CURRENT_FILE $line;
}
close(CURRENT_FILE) if $current_name ne "";
What do you think about this one?
1: Get contents from the file (maybe using File::Slurp's read_file) and save to a scalar.
use File::Slurp qw(read_file write_file);
my $contents = read_file($filename);
2: Have a regex pattern matching similar to this:
my @file_rows = ($contents =~ /(AA\s[A-Z]{3}\s+\d+\s+\w*)/g);
3: If column 2 values are always unique throughout the file:
foreach my $file_row (@file_rows) {
    my @values = split(' ', $file_row, 3);
    write_file($values[1] . ".txt", $file_row);
}
3: Otherwise: Split the row values. Store them to a hash using the second column as the key. Write data to output files using the hash.
my %hash;
foreach my $file_row (@file_rows) {
    my @values = split(' ', $file_row, 3);
    if (defined $hash{$values[1]}) {
        $hash{$values[1]} .= $file_row;
    } else {
        $hash{$values[1]} = $file_row;
    }
}
foreach my $key (keys %hash) {
    write_file($key . '.txt', $hash{$key});
}
Here's an option that looks for the pattern matching the start of each record. When found, it loops through the data file's lines and builds a record until it finds the same pattern again or eof, then that record is written to a file. It does not check to see if the file already exists before writing to it, so it will replace ABC.txt if it already exists:
use strict;
use warnings;

my $dataFile    = 'data.txt';
my $nextLine    = '';
my $recordRegex = qr/^AA\s+(\S+)\s+\d+\s+\d+/;

open my $inFH, '<', $dataFile or die $!;

RECORD: while ( my $line = <$inFH> ) {
    my $record = $nextLine . $line;
    if ( $record =~ $recordRegex ) {
        my $fileName = $1 . '.txt';
        while ( $nextLine = <$inFH> ) {
            if ( $nextLine =~ $recordRegex or eof $inFH ) {
                $record .= $nextLine if eof $inFH;
                open my $outFH, '>', $fileName or die $!;
                print $outFH $record;
                close $outFH;
                next RECORD;
            }
            $record .= $nextLine;
        }
    }
}
close $inFH;
Hope this helps!
Edit: This code replaces the original that was problematic. Thank you, amon, for reviewing the original code.
File I want to parse:
input Pattern;
input SDF;
input ABC
input Pattern;
output Pattern;
output XYZ;
In Perl, the usual operation is to scan line by line.
I want to check that if
the current line has "output Pattern;" and a previous line (or all previous lines) has "input Pattern;",
then change all of the previous matching lines to "input Pattern 2;" and the current line to "output Pattern2;".
It is complicated; I hope I have explained it properly.
Is it possible in Perl to scan and change previous lines after they have been read?
Thanks
If this is your data:
my $sfile =
'input Pattern;
input SDF;
input ABC
input Pattern;
output Pattern;
output XYZ;' ;
then, the following snippet will read the whole file and change text accordingly:
open my $fh, '<', \$sfile or die $!;
local $/ = undef; # set file input mode to 'slurp'
my $content = <$fh>;
close $fh;
$content =~ s{ ( # open capture group
input \s+ (Pattern); # find occurence of input pattern
.+? # skip some text
output \s+ \2 # find same for output
) # close capture group
}
{ # replace by evaluated expression
do{ # within a do block
local $_=$1; # get whole match to $_
s/($2)/$1 2/g; # substitute Pattern by Pattern 2
$_ # return substituted text
} # close do block
}esgx;
Then, you may close your file and check the string:
print $content;
=>
input Pattern 2;
input SDF;
input ABC
input Pattern 2;
output Pattern 2;
output XYZ;
You may even include a counter $n which will be incremented after each successful match (by the code assertion (?{ ... })):
our $n = 1;
$content =~ s{ ( # open capture group
input \s+ (Pattern); # find occurence of input pattern
.+? # skip some text
output \s+ \2 # find same for output
) # close capture group
(?{ $n++ }) # ! update match count
}
{ # replace by evaluated expression
do{ # within a do block
local $_=$1; # get whole match to $_
s/($2)/$1 $n/g; # substitute Pattern by Pattern and count
$_ # return substituted text
} # close do block
}esgx;
The substitution will now start with input Pattern 2; and increment with each subsequent match.
I think this will do what you need, but try it on a 'scratch' file (a copy of the original) first, since it actually changes the file:
use Modern::Perl;

open my $fh_in, '<', 'parseThis.txt' or die $!;
my @fileLines = <$fh_in>;
close $fh_in;

for ( my $i = 1 ; $i < scalar @fileLines ; $i++ ) {
    next
      if $fileLines[$i] !~ /output Pattern;/
      and $fileLines[ $i - 1 ] !~ /input Pattern;/;
    $fileLines[$i] =~ s/output Pattern;/output Pattern2;/g;
    $fileLines[$_] =~ s/input Pattern;/input Pattern 2;/g for 0 .. $i - 1;
}

open my $fh_out, '>', 'parseThis.txt' or die $!;
print $fh_out @fileLines;
close $fh_out;
Results:
input Pattern 2;
input SDF;
input ABC;
input Pattern 2;
output Pattern2;
output XYZ;
Hope this helps!
#!/usr/bin/env perl
$in1 = 'input Pattern';
$in2 = 'input Pattern2';
$out1 = 'output Pattern';
$out2 = 'output Pattern2';
undef $/;
$_ = <DATA>;
if (/^$in1\b.*?^$out1\b/gms) {
s/(^$in1\b)(?=.*?^$out1\b)/$in2/gms;
s/^$out1\b/$out2/gms;
}
print;
__DATA__
input Pattern;
input SDF;
input ABC;
input Pattern;
output Pattern;
output XYZ;
Will there be additional "Input Pattern1" lines following an occurrence of "Output Pattern1"?
Are there going to be multiple patterns to search for, or is it just "if we find Output Pattern1 then perform the replacement"?
Will the output pattern occur multiple times, or just once?
I would perform this task in two (or more) passes:
Pass 1 - read the file, looking for the matching output lines, and store the line numbers in memory.
Pass 2 - read the file again and, based on the stored line numbers, perform the replacement on the appropriate input lines.
So, in semi-Perlish, untested pseudocode:
my @matches = ();
open $fh, '<', $inputfile;
while (<$fh>) {
    if (/Pattern1/) {
        push @matches, $.;
    }
}
close $fh;

open $fh, '<', $inputfile;
while (<$fh>) {
    if ($. <= $matches[-1]) {
        s/Input Pattern1/Input Pattern2/;
        print;
    }
    else {
        pop @matches;
        last unless @matches;
    }
}
close $fh;
You run this like:
$ replace_pattern.pl input_file > output_file
You'll need to adjust it a little to meet your exact needs, but that should get you close.
You cannot go back and change lines you have already read in Perl. What you can do is open the file the first time in read mode, find out which line has the pattern (say the 5th line), close it after gulping the entire file into an array, open it again in write mode, modify the contents of the array up to the 5th line, dump that array into the file, and close it. Something like this (assuming each file will have at most one output pattern):
my @arr;
my @files = ();
while (<>) {
    if ($. == 1) {               # first line of a new file ($. resets because we close $ARGV at eof)
        $curindex = undef;
        @lines = ();
        push @files, $ARGV;
    }
    push @lines, $_;
    if (/output pattern/) { $curindex = $. }
    if (eof) {
        push @arr, [ [@lines], $curindex ];   # store a copy so the next file's reset doesn't clobber it
        close $ARGV;
    }
}

for $file (@files) {
    open file, "> $file";
    @currentfiledetails  = @{ $arr[$currentfilenumber++] };
    @currentcontents     = @{ $currentfiledetails[0] };
    $currentoutputmarker = $currentfiledetails[1];
    if ($currentoutputmarker) {
        for (0 .. $currentoutputmarker - 2) {
            $currentcontents[$_] =~ s/input pattern/input pattern2/g;
        }
        $currentcontents[$currentoutputmarker - 1] =~
            s/output pattern/output pattern2/g;
    }
    print file $_ for @currentcontents;
    close file;
}
Edit: solution added.
Hi, I currently have some working albeit slow code.
It merges 2 CSV files line by line using a primary key.
For example, if file 1 has the line:
"one,two,,four,42"
and file 2 has this line;
"one,,three,,42"
where, 0-indexed, $position = 4 holds the primary key = 42;
then the sub: merge_file($file1,$file2,$outputfile,$position);
will output a file with the line:
"one,two,three,four,42";
Every primary key is unique in each file, and a key might exist in one file but not in the other (and vice versa)
There are about 1 million lines in each file.
Going through every line in the first file, I am using a hash to store the primary key, and storing the line number as the value. The line number corresponds to an array[line num] which stores every line in the first file.
Then I go through every line in the second file, check if the primary key is in the hash, and if it is, get the line from the file1 array, add the columns I need from the first array to the second array, and concatenate the result onto the end of the output string. Then I delete the hash entry, and at the very end dump the entire thing to file. (I am using an SSD so I want to minimise file writes.)
It is probably best explained with code:
sub merge_file2{
    my ($file1,$file2,$out,$position) = ($_[0],$_[1],$_[2],$_[3]);
    print "merging: \n$file1 and \n$file2, to: \n$out\n";

    my $OUTSTRING = undef;
    my %line_for;
    my @file1array;

    open FILE1, "<$file1";
    print "$file1 opened\n";
    while (<FILE1>){
        chomp;
        $line_for{read_csv_string($_,$position)} = $.;  # reads csv key at current position
        $file1array[$.] = $_;                           # store line in file1array
    }
    close FILE1;

    print "$file2 opened - merging..\n";
    open FILE2, "<", $file2;
    my @from1to2 = qw( 2 4 8 17 18 19);  # which columns from file 1 to be added into cols. of file 2
    while (<FILE2>){
        print "$.\n" if ($. % 1000) == 0;
        chomp;
        my @array2 = split /,/, $_;                                         # split 2nd csv line by commas
        my @array1 = split /,/, $file1array[$line_for{$array2[$position]}]; # look up line in 1st file via hash at pos of key
        #my @output = &merge_string(\@array1,\@array2);  # merge 2 csv strings (old fn.)
        foreach (@from1to2){
            $array2[$_] = $array1[$_];
        }
        my $outstring = join ",", @array2;
        $OUTSTRING .= $outstring."\n";
        delete $line_for{$array2[$position]};
    }
    close FILE2;

    print "adding rest of lines\n";
    foreach my $key (sort { $a <=> $b } keys %line_for){
        $OUTSTRING .= $file1array[$line_for{$key}]."\n";
    }

    print "writing file $out\n\n\n";
    write_line($out,$OUTSTRING);
}
The first while loop is fine and takes less than a minute, but the second while loop takes about an hour to run, and I am wondering if I have taken the right approach. I think a lot of speedup is possible? :) Thanks in advance.
Solution:
sub merge_file3{
    my ($file1,$file2,$out,$position,$hsize) = ($_[0],$_[1],$_[2],$_[3],$_[4]);
    print "merging: \n$file1 and \n$file2, to: \n$out\n";
    my $OUTSTRING = undef;
    my $header;

    my (@file1,@file2);
    open FILE1, "<$file1" or die;
    while (<FILE1>){
        if ($.==1){
            $header = $_;
            next;
        }
        print "$.\n" if ($.%100000) == 0;
        chomp;
        push @file1, [split ',', $_];
    }
    close FILE1;

    open FILE2, "<$file2" or die;
    while (<FILE2>){
        next if $.==1;
        print "$.\n" if ($.%100000) == 0;
        chomp;
        push @file2, [split ',', $_];
    }
    close FILE2;

    print "sorting files\n";
    my @sortedf1 = sort {$a->[$position] <=> $b->[$position]} @file1;
    my @sortedf2 = sort {$a->[$position] <=> $b->[$position]} @file2;
    print "sorted\n";
    @file1 = ();
    @file2 = ();
    #foreach my $line (@file1){ print "\t[ @$line ],\n"; }

    my ($i,$j) = (0,0);
    while ($i < $#sortedf1 and $j < $#sortedf2){
        my $key1 = $sortedf1[$i][$position];
        my $key2 = $sortedf2[$j][$position];
        if ($key1 eq $key2){
            foreach (0..$hsize){ # header size.
                $sortedf2[$j][$_] = $sortedf1[$i][$_] if $sortedf1[$i][$_] ne undef;
            }
            $i++;
            $j++;
        }
        elsif ( $key1 < $key2){
            push(@sortedf2, [ @{$sortedf1[$i]} ]);
            $i++;
        }
        elsif ( $key1 > $key2){
            $j++;
        }
    }
    #foreach my $line (@sortedf2){ print "\t[ @$line ],\n"; }

    print "outputting to file\n";
    open OUT, ">$out";
    print OUT $header;
    foreach (@sortedf2){
        print OUT (join ",", @{$_})."\n";
    }
    close OUT;
}
Thanks everyone, the solution is posted above. It now takes about 1 minute to merge the whole thing! :)
Two techniques come to mind.
1. Read the data from the CSV files into two tables in a DBMS (SQLite would work just fine), then use the DB to do a join and write the data back out to CSV. The database will use indexes to optimize the join. (A rough sketch of this follows below.)
2. First, sort each file by primary key (using perl or unix sort), then do a linear scan over each file in parallel (read a record from each file; if the keys are equal then output a joined row and advance both files; if the keys are unequal then advance the file with the lesser key and try again). This step is O(n + m) time instead of O(n * m), and O(1) memory.
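For the first technique, here is a rough, untested sketch. The table and column names, the five-column layout with the key last, and the naive comma splitting are all my own assumptions for illustration; real data would call for Text::CSV, and a second query (or a UNION) would be needed to pick up keys present only in the second file:
#!/usr/bin/env perl
use strict;
use warnings;
use DBI;   # with DBD::SQLite installed

my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "", { RaiseError => 1 });
$dbh->do("CREATE TABLE $_ (c0 TEXT, c1 TEXT, c2 TEXT, c3 TEXT, pk TEXT PRIMARY KEY)")
    for qw(t1 t2);

# naive loader: one INSERT per line, splitting on commas
for my $pair ( [ 'file1.csv', 't1' ], [ 'file2.csv', 't2' ] ) {
    my ($file, $table) = @$pair;
    my $ins = $dbh->prepare("INSERT INTO $table VALUES (?,?,?,?,?)");
    open my $fh, '<', $file or die "open $file: $!";
    while (<$fh>) {
        chomp;
        $ins->execute( split /,/, $_, -1 );
    }
    close $fh;
}

# rows keyed in t1, preferring t1's non-empty fields over t2's
my $sth = $dbh->prepare(q{
    SELECT COALESCE(NULLIF(t1.c0,''), t2.c0),
           COALESCE(NULLIF(t1.c1,''), t2.c1),
           COALESCE(NULLIF(t1.c2,''), t2.c2),
           COALESCE(NULLIF(t1.c3,''), t2.c3),
           t1.pk
      FROM t1 LEFT JOIN t2 ON t1.pk = t2.pk
});
$sth->execute;
while ( my $row = $sth->fetchrow_arrayref ) {
    print join( ',', map { defined $_ ? $_ : '' } @$row ), "\n";
}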
What's killing the performance is this code, which is concatenating millions of times.
$OUTSTRING .= $outstring."\n";
....
foreach my $key (sort { $a <=> $b } keys %line_for){
    $OUTSTRING .= $file1array[$line_for{$key}]."\n";
}
If you want to write to the output file only once, accumulate your results in an array, and then print them at the very end, using join. Or, even better perhaps, include the newlines in the results and write the array directly.
To see how concatenation does not scale when crunching big data, experiment with this demo script. When you run it in concat mode, things start slowing down considerably after a couple hundred thousand concatenations -- I gave up and killed the script. By contrast, simply printing an array of a million lines took less than a minute on my machine.
# Usage: perl demo.pl 50 999999 concat|join|direct
use strict;
use warnings;

my ($line_len, $n_lines, $method) = @ARGV;
my @data = map { '_' x $line_len . "\n" } 1 .. $n_lines;

open my $fh, '>', 'output.txt' or die $!;

if ($method eq 'concat'){      # Dog slow. Gets slower as @data gets big.
    my $outstring;
    for my $i (0 .. $#data){
        print STDERR $i, "\n" if $i % 1000 == 0;
        $outstring .= $data[$i];
    }
    print $fh $outstring;
}
elsif ($method eq 'join'){     # Fast
    print $fh join('', @data);
}
else {                         # Fast
    print $fh @data;
}
If you want a merge you should really merge. First of all you have to sort your data by key and then merge! You will beat even MySQL in performance; I have a lot of experience with it.
You can write something along those lines:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV_XS;
use autodie;
use constant KEYPOS => 4;

die "Insufficient number of parameters" if @ARGV < 2;

my $csv = Text::CSV_XS->new( { eol => $/ } );
my $sortpos = KEYPOS + 1;
open my $file1, "sort -n -k$sortpos -t, $ARGV[0] |";
open my $file2, "sort -n -k$sortpos -t, $ARGV[1] |";

my $row1 = $csv->getline($file1);
my $row2 = $csv->getline($file2);
while ( $row1 and $row2 ) {
    my $row;
    if ( $row1->[KEYPOS] == $row2->[KEYPOS] ) {    # merge rows
        $row  = [ map { $row1->[$_] || $row2->[$_] } 0 .. $#$row1 ];
        $row1 = $csv->getline($file1);
        $row2 = $csv->getline($file2);
    }
    elsif ( $row1->[KEYPOS] < $row2->[KEYPOS] ) {
        $row  = $row1;
        $row1 = $csv->getline($file1);
    }
    else {
        $row  = $row2;
        $row2 = $csv->getline($file2);
    }
    $csv->print( *STDOUT, $row );
}

# flush possible tail
while ( $row1 ) {
    $csv->print( *STDOUT, $row1 );
    $row1 = $csv->getline($file1);
}
while ( $row2 ) {
    $csv->print( *STDOUT, $row2 );
    $row2 = $csv->getline($file2);
}

close $file1;
close $file2;
Redirect output to file and measure.
If you like more sanity around sort arguments you can replace file opening part with
(open my $file1, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[0]);
(open my $file2, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[1]);
I can't see anything that strikes me as obviously slow, but I would make these changes:
First, I'd eliminate the @file1array variable. You don't need it; just store the line itself in the hash:
while (<FILE1>){
    chomp;
    $line_for{read_csv_string($_,$position)} = $_;
}
Secondly, although this shouldn't really make much of a difference with perl, I wouldn't add to $OUTSTRING all the time. Instead, keep an array of output lines and push onto it each time. If for some reason you still need to call write_line with a massive string, you can always use join('', @OUTLINES) at the end.
If write_line doesn't use syswrite or something low-level like that, but rather uses print or other stdio-based calls, then you aren't saving any disk writes by building up the output file in memory. Therefore, you might as well not build your output up in memory at all, and instead just write it out as you create it. Of course if you are using syswrite, forget this.
Since nothing is obviously slow, try throwing Devel::SmallProf at your code. I've found that to be the best perl profiler for producing those "Oh! That's the slow line!" insights.
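If memory serves, you run it through the debugger hook, e.g. perl -d:SmallProf your_merge_script.pl (the script name here is a placeholder), and it leaves per-line timings in a smallprof.out file in the current directory.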
Assuming lines of around 20 bytes, each of your files would amount to about 20 MB, which isn't too big.
Since you are using a hash, your time complexity doesn't seem to be a problem.
In your second loop you are printing to the console for each line; this bit is slow, so removing it should help a lot.
You can also avoid the delete in the second loop.
Reading multiple lines at a time should also help, but not by much I think; there is always going to be read-ahead behind the scenes.
I'd store each record in a hash whose keys are the primary keys. A given primary key's value is a reference to an array of CSV values, where undef represents an unknown value.
use 5.10.0;  # for // ("defined-or")
use Carp;
use Text::CSV;

sub merge_csv {
    my ($path, $record) = @_;

    open my $fh, "<", $path or croak "$0: open $path: $!";

    my $csv = Text::CSV->new;
    local $_;
    while (<$fh>) {
        if ($csv->parse($_)) {
            my @f = map length($_) ? $_ : undef, $csv->fields;
            next unless @f >= 1;

            my $primary = pop @f;
            if ($record->{$primary}) {
                $record->{$primary}[$_] //= $f[$_]
                    for 0 .. $#{ $record->{$primary} };
            }
            else {
                $record->{$primary} = \@f;
            }
        }
        else {
            warn "$0: $path:$.: parse failed; skipping...\n";
            next;
        }
    }
}
Your main program will resemble
my %rec;
merge_csv $_, \%rec for qw/ file1 file2 /;
The Data::Dumper module shows that the resulting hash given the simple inputs from your question is
$VAR1 = {
'42' => [
'one',
'two',
'three',
'four'
]
};