I'm incredibly new to Perl, and never have been a phenomenal programmer. I have some successful BVA routines for controlling microprocessor functions, but never anything embedded, or multi-facted. Anyway, my question today is about a boggle I cannot get over when trying to figure out how to remove duplicate lines of text from a text file I created.
The file could have several of the same lines of txt in it, not sequentially placed, which is problematic as I'm practically comparing the file to itself, line by line. So, if the first and third lines are the same, I'll write the first line to a new file, not the third. But when I compare the third line, I'll write it again since the first line is "forgotten" by my current code. I'm sure there's a simple way to do this, but I have issue making things simple in code. Here's the code:
my $searchString = pseudo variable "ideally an iterative search through the source file";
my $file2 = "/tmp/cutdown.txt";
my $file3 = "/tmp/output.txt";
my $count = "0";
open (FILE, $file2) || die "Can't open cutdown.txt \n";
open (FILE2, ">$file3") || die "Can't open output.txt \n";
while (<FILE>) {
print "$_";
print "$searchString\n";
if (($_ =~ /$searchString/) and ($count == "0")) {
++ $count;
print FILE2 $_;
} else {
print "This isn't working\n";
}
}
close (FILE);
close (FILE2);
Excuse the way filehandles and scalars do not match. It is a work in progress... :)
The secret of checking for uniqueness, is to store the lines you have seen in a hash and only print lines that don't exist in the hash.
Updating your code slightly to use more modern practices (three-arg open(), lexical filehandles) we get this:
my $file2 = "/tmp/cutdown.txt";
my $file3 = "/tmp/output.txt";
open my $in_fh, '<', $file2 or die "Can't open cutdown.txt: $!\n";
open my $out_fh, '>', $file3 or die "Can't open output.txt: $!\n";
my %seen;
while (<$in_fh>) {
print $out_fh unless $seen{$_}++;
}
But I would write this as a Unix filter. Read from STDIN and write to STDOUT. That way, your program is more flexible. The whole code becomes:
#!/usr/bin/perl
use strict;
use warnings;
my %seen;
while (<>) {
print unless $seen{$_}++;
}
Assuming this is in a file called my_filter, you would call it as:
$ ./my_filter < /tmp/cutdown.txt > /tmp/output.txt
Update: But this doesn't use your $searchString variable. It's not clear to me what that's for.
If your file is not very large, you can store each line readed from the input file as a key in a hash variable. And then, print the hash keys (ordered). Something like that:
my %lines = ();
my $order = 1;
open my $fhi, "<", $file2 or die "Cannot open file: $!";
while( my $line = <$fhi> ) {
$lines {$line} = $order++;
}
close $fhi;
open my $fho, ">", $file3 or die "Cannot open file: $!";
#Sort the keys, only if needed
my #ordered_lines = sort { $lines{$a} <=> $lines{$b} } keys(%lines);
for my $key( #ordered_lines ) {
print $fho $key;
}
close $fho;
You need two things to do that:
a hash to keep track of all the lines you have seen
a loop reading the input file
This is a simple implementation, called with an input filename and an output filename.
use strict;
use warnings;
open my $fh_in, '<', $ARGV[0] or die "Could not open file '$ARGV[0]': $!";
open my $fh_out, '<', $ARGV[1] or die "Could not open file '$ARGV[1]': $!";
my %seen;
while (my $line = <$fh_in>) {
# check if we have already seen this line
if (not $seen{$line}) {
print $fh_out $line;
}
# remember this line
$seen{$line}++;
}
To test it, I've included it with the DATA handle as well.
use strict;
use warnings;
my %seen;
while (my $line = <DATA>) {
# check if we have already seen this line
if (not $seen{$line}) {
print $line;
}
# remember this line
$seen{$line}++;
}
__DATA__
foo
bar
asdf
foo
foo
asdfg
hello world
This will print
foo
bar
asdf
asdfg
hello world
Keep in mind that the memory consumption will grow with the file size. It should be fine as long as the text file is smaller than your RAM. Perl's hash memory consumption grows a faster than linear, but your data structure is very flat.
Related
Can't call method print on an undefined value in line 40 line 2.
Here is the code. I use FileHandle to settle files:
#!/usr/bin/perl
use strict;
use warnings;
use FileHandle;
die unless (#ARGV ==4|| #ARGV ==5);
my #input =();
$input[0]=$ARGV[3];
$input[1]=$ARGV[4] if ($#ARGV==4);
chomp #input;
$input[0] =~ /([^\/]+)$/;
my $out = "$1.insert";
my $lane= "$1";
my %fh=();
open (Info,">$ARGV[1]") || die "$!";
open (AA,"<$ARGV[0]") || die "$!";
while(<AA>){
chomp;
my #inf=split;
my $iden=$inf[0];
my $outputfile="$ARGV[2]/$iden";
$fh{$iden}=FileHandle->new(">$outputfile");
}
close AA;
foreach my $input (#input) {
open (IN, "<$input" ) or die "$!" ;
my #path=split (/\//,$input);
print Info "#$path[-1]\n";
while (<IN>) {
my $line1 = $_;
my ($id1,$iden1) = (split "\t", $line1)[6,7];
my $line2 = <IN> ;
my ($id2,$iden2) = (split "\t", $line2)[6,7];
if ($id1 eq '+' && $id2 eq '-') {
my #inf=split(/\t/,$line1);
$fh{$iden1}->print($line1);
$fh{$iden2}->print($line2);
}
}
close IN;
}
I’ve tried multiple variations of this, but none of them seem to work. Any ideas?
Please remember that the primary worth of a Stack Overflow post is not to fix your particular problem, but to help the thousands of others who may be stuck in the same way. With that in mind, "I fixed it, thanks, bye" is more than a little selfish
As I said in my comment, using open directly on a hash element is much preferable to involving FileHandle. Perl will autovivify the hash element and create a file handle for you, and most people at all familiar with Perl will thank you for not making them read up again on the FileHandle documentation
I rewrote your code like this, which is much more Perlish and relies less on "magic numbers" to access #ARGV. You should really assign #ARGV to a list of named scalars, or - better still - use Getopt::Long so that they are named anyway
You should open your file handles as late as possible, and close the output handles early. This is effected most easily by using lexical file handles and limiting their scope to a block. Perl will implicitly close lexical handles for you when they go out of scope
There is no need to chomp the contents of #ARGVunless you could be be called under strange and errant circumstances, in which case you need to do a hell of a lot more to verify the input
You never use the result of $input[0] =~ /([^\/]+)$/ or the variables $out and $lane, so I removed them
#!/usr/bin/perl
use strict;
use warnings 'all';
# $ARGV[0] -- input file
# $ARGV[1] -- output log file
# $ARGV[2] -- directory for outputs per ident
# $ARGV[3] -- 1, $input[0]
# $ARGV[4] -- 2, $input[1] or undef
die "Fix the parameters" unless #ARGV == 4 or #ARGV == 5;
my #input = #ARGV[3,4];
my %fh;
{
open my $fh, '<', $ARGV[0] or die $!;
while ( <$fh> ) {
my $id = ( split )[0];
my $outputfile = "$ARGV[2]/$id";
open $fh{$id}, '>', $outputfile or die qq{Unable to open "$outputfile" for output: $!};
}
}
open my $log_fh, '>', $ARGV[1] or die qq{Unable to open "$ARGV[1]" for output: $!};
for my $input ( #input ) {
next unless $input; # skip unspecified parameters
my #path = split qr|/|, $input; # Really should be done by File::Spec
print $log_fh "#$path[-1]\n"; # Or File::Basename
open my $fh, '<', $input or die qq{Unable to open "$input" for input: $!};
while ( my $line0 = <$fh> ) {
chomp $line0;
my $line1 = <$fh>;
chomp $line1;
my ($id0, $iden0) = (split /\t/, $line0)[6,7];
my ($id1, $iden1) = (split /\t/, $line1)[6,7];
if ( $id0 eq '+' and $id1 eq '-' ) {
$fh{$_} or die qq{No output file for "$_"} for $iden0, $iden1;
print { $fh{$iden0} } $line0;
print { $fh{$iden1} } $line1;
}
}
}
while ( my ($iden, $fh) = each %fh ) {
close $fh or die qq{Unable to close file handle for "$iden": $!};
}
You don't have any error handling on this line:
$fh{$iden}=FileHandle->new(">$outputfile");
It's possible that opening a filehandle is silently failing, and only producing an error when you try to print to it. For example, if you have specified an invalid filename.
Also, you never check if $iden1 and $iden2 are names of open filehandles that actually exist. It's possible one of them does not exist.
In particular, you aren't removing a newline from $line1, so if $iden1 and $iden2 happen to be the last values on the line, this will be included in the name you are trying to use, and it will fail.
In your first while loop, you set up a hash of filehandles that you will write to later. The keys in this hash are the "iden" strings from the first file passed to the program.
Later, you parse another file and use the "iden" values in that file to choose which filehandle to write data to. But one (or more) of the "iden" values in the second file is missing from the first file. So that filehandle can't be found in the %fh hash. Because you don't check for that, you get `undef back from the hash and you can't print to an undefined filehandle.
To fix it, put a check before trying to use one of the filehandles from the %fh hash.
die "Unknown fh identifier '$iden1'" unless exists $fh{$iden1};
die "Unknown fh identifier '$iden2'" unless exists $fh{$iden2};
$fh{$iden1}->print($line1);
$fh{$iden2}->print($line2);
I dont know what exactly is wrong but everytime I execute this script i keep getting "No such file or directory at ./reprioritize line 35, line 1".
here is my script that is having an issue:
my $newresult = "home/user/newresults_percengtage_vs_pn";
sub pushval
{
my #fields = #_;
open OUTFILE, ">$newresult/fixedhomdata_030716-031316.csv" or die $!; #line 35
while(<OUTFILE>)
{
if($fields[5] >= 13)
{
print OUTFILE "$fields[0]", "$fields[1]","$fields[2]","$fields[3]","$fields[4]","$fields[5]", "0";
}
elsif($fields[5] < 13 && $fields[5] > 1)
{
print OUTFILE "$fields[0]", "$fields[1]","$fields[2]","$fields[3]","$fields[4]","$fields[5]", "1";
}
elsif($fields[5] <= 1)
{
print OUTFILE "$fields[0]", "$fields[1]","$fields[2]","$fields[3]","$fields[4]","$fields[5]", "2";
}
}
close (OUTFILE);
You may want to have a look at Perl's tutorial on opening files.
I simplify it a bit. There are basically three modes: open for reading, open for writing, and open for appending.
Reading
Opening for reading is indicated by either a < preceeding the filename or on its own, as a separate parameter to the open() call (preferred), i.e.:
my $fh = undef;
my $filename = 'fixedhomdata_030716-031316.csv';
open($fh, "<$filename") or die $!; # bad
open($fh, '<', $filename) or die $!; # good
while( my $line = <$fh> ) { # read one line from filehandle $fh
...
}
close($fh);
When you open the file this way, it must exist, else you get your error (No such file or directory at ...).
Writing
Opening for writing is indicated by a >, i.e.:
open($fh, ">$filename") or die $!; # bad
open($fh, '>', $filename) or die $!; # good
print $fh "some text\n"; # write to filehandle $fh
print $fh "more text\n"; # write to filehandle $fh
...
close($fh);
When you open the file this way, it is truncated (cleared) and overwritten if it existed. If it did not exist, it will get created.
Appending
Opening for appending is indicated by a >>, i.e.:
open($fh, ">>$filename") or die $!; # bad
open($fh, '>>', $filename) or die $!; # good
print $fh "some text\n"; # append to filehandle $fh
print $fh "more text\n"; # append to filehandle $fh
...
close($fh);
When you open the file this way and it existed, then the new lines will be appended to the file, i.e. nothing is lost. If the file did not
exist, it will be created (as if only > had been given).
Your error message doesn't match your code. You opened the file for writing (>) but got doesn't exist, which indicates that you actually opened it for reading.
This might have happened because you use OUTPUT as a filehandle instead of a scoped variable, e.g. $fh. OUTPUT is a global filehandle, i.e. if you open a file this way, then all of your code (no matter which function in) can use OUTPUT. Don't do that. From the docs:
An older style is to use a bareword as the filehandle, as
open(FH, "<", "input.txt")
or die "cannot open < input.txt: $!";
Then you can use FH as the filehandle, in close FH and and so on.
Note that it's a global variable, so this form is not recommended
in new code.
To summarize:
use scoped variables as filehandles ($fh instead of OUTPUT)
open your file in the right mode (> vs. <)
always use three-argument open (open($fh, $mode, $filename) vs. open($fh, "$mode$filename")
The comments explain that your two issues with the snippet are
The missing leading '/' in the $newresult declaration
You are treating your filehandle as both a read and a write.
The first is easy to fix. The second is not as easy to fix properly with knowing the rest of the script. I am making an assumption that pushval is called once per record in a Array of Arrays(?). This snippet below should get the result you want, but there is likely a better way of doing it.
my $newresult = "/home/user/newresults_percengtage_vs_pn";
sub pushval{
my #fields = #_;
open OUTFILE, ">>$newresult/fixedhomdata_030716-031316.csv" or die $!; #line 35
print OUTFILE "$fields[0]", "$fields[1]","$fields[2]","$fields[3]","$fields[4]","$fields[5]"
if($fields[5] >= 13) {
print OUTFILE "0\n";
} elsif($fields[5] < 13 && $fields[5] > 1) {
print OUTFILE "1\n";
} elsif($fields[5] <= 1) {
print OUTFILE "2\n";
}
close (OUTFILE);
I am trying to both learn perl and use it in my research. I need to do a simple task which is counting the number of sequences and their lengths in a file such as follow:
>sequence1
ATCGATCGATCG
>sequence2
AAAATTTT
>sequence3
CCCCGGGG
The output should look like this:
sequence1 12
sequence2 8
sequence3 8
Total number of sequences = 3
This is the code I have written which is very crude and simple:
#!/usr/bin/perl
use strict;
use warnings;
my ($input, $output) = #ARGV;
open(INFILE, '<', $input) or die "Can't open $input, $!\n"; # Open a file for reading.
open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.
while (<INFILE>) {
chomp;
if (/^>/)
{
my $number_of_sequences++;
}else{
my length = length ($input);
}
}
print length, number_of_sequences;
close (INFILE);
I'd be grateful if you could give me some hints, for example, in the else block, when I use the length function, I am not sure what argument I should pass into it.
Thanks in advance
You're printing out just the last length, not each sequence length, and you want to catch the sequence names as you go:
#!/usr/bin/perl
use strict;
use warnings;
my ($input, $output) = #ARGV;
my ($lastSeq, $number_of_sequences) = ('', 0);
open(INFILE, '<', $input) or die "Can't open $input, $!\n"; # Open a file for reading.
# You never use OUTFILE
# open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.
while (<INFILE>) {
chomp;
if (/^>(.+)/)
{
$lastSeq = $1;
$number_of_sequences++;
}
else
{
my $length = length($_);
print "$lastSeq $length\n";
}
}
print "Total number of sequences = $number_of_sequences\n";
close (INFILE);
Since you have indicated that you want feedback on your program, here goes:
my ($input, $output) = #ARGV;
open(INFILE, '<', $input) or die "Can't open $input, $!\n"; # Open a file for reading.
open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.
Personally, I think when dealing with a simple input/output file relation, it is best to just use the diamond operator and standard output. That means that you read from the special file handle <>, commonly referred to as "the diamond operator", and you print to STDOUT, which is the default output. If you want to save the output in a file, just use shell redirection:
perl program.pl input.txt > output.txt
In this part:
my $number_of_sequences++;
you are creating a new variable. This variable will go out of scope as soon as you leave the block { .... }, in this case: the if-block.
In this part:
my length = length ($input);
you forgot the $ sigil. You are also using length on the file name, not the line you read. If you want to read a line from your input, you must use the file handle:
my $length = length(<INFILE>);
Although this will also include the newline in the length.
Here you have forgotten the sigils again:
print length, number_of_sequences;
And of course, this will not create the expected output. It will print something like sequence112.
Recommendations:
Use a while (<>) loop to read your input. This is the idiomatic method to use.
You do not need to keep a count of your input lines, there is a line count variable: $.. Though keep in mind that it will also count "bad" lines, like blank lines or headers. Using your own variable will allow you to account for such things.
Remember to chomp the line before finding out its length. Or use an alternative method that only counts the characters you want: my $length = ( <> =~ tr/ATCG// ) This will read a line, count the letters ATGC, return the count and discard the read line.
Summary:
use strict;
use warnings; # always use these two pragmas
my $count;
while (<>) {
next unless /^>/; # ignore non-header lines
$count++; # increment counter
chomp;
my $length = (<> =~ tr/ATCG//); # get length of next line
s/^>(\S+)/$1 $length\n/; # remove > and insert length
} continue {
print; # print to STDOUT
}
print "Total number is sequences = $count\n";
Note the use of continue here, which will allow us to skip a line that we do not want to process, but that will still get printed.
And as I said above, you can redirect this to a file if you want.
For starters, you need to change your inner loop to this:
...
chomp;
if (/^>/)
{
$number_of_sequences++;
$sequence_name = $_;
}else{
print "$sequence_name ", length($input), "\n";
}
...
Note the following:
The my declaration has been removed from $number_of_sequences
The sequence name is captured in the variable $sequence_name. It is used later when the next line is read.
To make the script run under strict mode, you can add my declarations for $number_of_sequences and $sequence_name outside of the loop:
my $sequence_name;
my $number_of_sequences = 0;
while (<INFILE>) {
...(as above)...
}
print "Total number of sequences: $number_of_sequences\n";
The my keyword declares a new lexically scoped variable - i.e. a variable which only exists within a certain block of code, and every time that block of code is entered, a new version of that variable is created. Since you want to have the value of $sequence_name carry over from one loop iteration to the next you need to place the my outside of the loop.
#!/usr/bin/perl
use strict;
use warnings;
my ($file, $line, $length, $tag, $count);
$file = $ARGV[0];
open (FILE, "$file") or print"can't open file $file\n";
while (<FILE>){
$line=$_;
chomp $line;
if ($line=~/^>/){
$tag = $line;
}
else{
$length = length ($line);
$count=1;
}
if ($count==1){
print "$tag\t$length\n";
$count=0
}
}
close FILE;
I am trying to read a newline-delimited file into an array in Perl. I do NOT want the newlines to be part of the array, because the elements are filenames to read later. That is, each element should be "foo" and not "foo\n". I have done this successfully in the past using the methods advocated in Stack Overflow question Read a file into an array using Perl and Newline Delimited Input.
My code is:
open(IN, "< test") or die ("Couldn't open");
#arr = <IN>;
print("$arr[0] $arr[1]")
And my file 'test' is:
a
b
c
d
e
My expected output would be:
a b
My actual output is:
a
b
I really don't see what I'm doing wrong. How do I read these files into arrays?
Here is how I generically read from files.
open (my $in, "<", "test") or die $!;
my #arr;
while (my $line = <$in>) {
chomp $line;
push #arr, $line;
}
close ($in);
chomp will remove newlines from the line read. You should also use the three-argument version of open.
Put the file path in its own variable so that it can be easily
changed.
Use the 3-argument open.
Test all opens, prints, and closes for success, and if not, print the error and the file name.
Try:
#!/usr/bin/env perl
use strict;
use warnings;
# --------------------------------------
use charnames qw( :full :short );
use English qw( -no_match_vars ); # Avoids regex performance penalty
# conditional compile DEBUGging statements
# See http://lookatperl.blogspot.ca/2013/07/a-look-at-conditional-compiling-of.html
use constant DEBUG => $ENV{DEBUG};
# --------------------------------------
# put file path in a variable so it can be easily changed
my $file = 'test';
open my $in_fh, '<', $file or die "could not open $file: $OS_ERROR\n";
chomp( my #arr = <$in_fh> );
close $in_fh or die "could not close $file: $OS_ERROR\n";
print "#arr[ 0 .. 1 ]\n";
A less verbose option is to use File::Slurp::read_file
my $array_ref = read_file 'test', chomp => 1, array_ref => 1;
if, and only if, you need to save the list of file names anyway.
Otherwise,
my $filename = 'test';
open (my $fh, "<", $filename) or die "Cannot open '$filename': $!";
while (my $next_file = <$fh>) {
chomp $next_file;
do_something($next_file);
}
close ($fh);
would save memory by not having to keep the list of files around.
Also, you might be better off using $next_file =~ s/\s+\z// rather than chomp unless your use case really requires allowing trailing whitespace in file names.
I'm not much of a programmer and pretty new to Perl. I'm currently writing a little script to read from a CSV, do some evaluation on different fields and then print to another file if certain criteria are met. I thought I was pretty much done but then I got this new message:
"Usage: Text::CSV_Xs::getline(self, io) at date_compare.pl line 51, line 3."
I've been trying to find something that tells me what this message means but I'm lost. I know this is something simple. My code is below. Please excuse my ignorance.
#! /usr/local/ActivePerl-5.12/bin/perl
#this script will check to files, throw them into arrays and compare them
#to find entries in one array which meet specified criteria
#$field_file is the name of the file that contains the ablation date first,
#then the list of compared dates, then the low and high end date criteria, each
#value should end with a \n.
#$unfiltered_file is the name of the raw CSV with all the data
#$output_file is the name of the file the program will write to
use strict;
use 5.012;
use Text::CSV_XS;
use IO::HANDLE qw/getline/;
use Date::Calc qw/Decode_Date_US2 Delta_Days/;
my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ }) or
die "Cannot use CSV: ".Text::CSV->error_diag ();
my ($field_file,
$unfiltered_file,
$output_file,
#field_list,
$hash_keys,
%compare,
$check,
$i);
#Decode_Date_US2 scans a string and tries to parse any date within.
#($year,$month,$day)=Decode_Date_US2($string[,$language])
#Delta_Days returns the difference in days between the two given dates.
#$Dd = Delta_Days($year1,$month1,$day1, $year2,$month2,$day2);
sub days{
Delta_Days(Decode_Days_US2(#compare{$field_list[0]}),
Decode_Days_US2(#compare{$field_list[$i]}));
}
sub printout{
$csv->print(<OUTPUTF>, $check) or die "$output_file:".$csv->error_diag();
}
print "\nEnter the check list file name: ";
chomp ($field_file = <STDIN>);
open FIELDF, "<", $field_file or die "$field_file: $!";
chomp (#field_list=<$field_file>);
close FIELDF or die "$field_file: $!";
print "\nEnter the raw CSV file name: ";
chomp ($unfiltered_file = <STDIN>);
print "\nEnter the output file name : ";
chomp ($output_file = <STDIN>);
open OUTPUTF, ">>", $output_file or die "$output_file: $!";
open RAWF, "<", $unfiltered_file or die "$unfiltered_file: $!";
if ($hash_keys = $csv->getline(<RAWF>)){
$check = $hash_keys;
&printout();
}else{die "\$hash_keys: ".$csv->error_diag();}
while ($check = $csv->getline (<RAWF>)){
#compare{#$hash_keys}=#$check;
TEST: for ($i=1, $i==(#field_list-3), $i++){
if (&days()>=$field_list[-2] && &days()<=$field_list[-1]){
last TEST if (&printout());
}
}
Usage: Text::CSV_Xs::getline(self, io) at date_compare.pl line 51, line 3.
getline apparently expects a filehandle/IO::Handle, and you're passing it a scalar (containing a line read from the filehandle).
This means that on your line:
if ($hash_keys = $csv->getline(<RAWF>)){
You should be using:
if ($hash_keys = $csv->getline( \*RAWF )){
instead.
(But you should really be using lexical filehandles, as in:)
open FIELDF, "<", $field_file or die "$field_file: $!";
chomp (#field_list=<$field_file>); # Not sure how you expect this to work
close FIELDF or die "$field_file: $!";
would become:
open my $fieldf, "<", $field_file or die "$field_file: $!";
chomp (#field_list=<$fieldf>);
close $fieldf or die "$field_file: $!";