perl combine specific columns from multiple files

I'd like to create a perl script that combines columns from multiple files. I have to respect a series of criteria (folder/file structure). I'll try to represent what I have and what I want. I have two folders with a bunch of files. The files inside each folder have the same names.
Folder1: File1, File2, File3, ...
Folder2: File1, File2, File3, ...
Folder1:File1 content looks like this (tab delimited):
aaaaa 233
bbbbb 34
ccccc 853
...
All the other files look like this one, except the numerical values are different. I want to create a single file (a report) that will look like this:
aaaaa value_Folder1:File1 value_Folder2:File1 value_Folder1:File2 value_Folder2:File2 ...
...
It would be nice to have the file name on top of the columns from which the values are coming from (just the file name, the folder is not important).
I have some code evolving, but it's not doing what I want right now! I tried to make it work via loops, but I feel that it might not be the solution... One other problem is that I don't know how to add columns to my report file. In the following code, I just append the values at the end of the file. Even if it's not super nice, here's my code:
#!/usr/bin/perl -w
use strict;
use warnings;
my $outputfile = "/home/duceppemo/Desktop/count.all.txt";
my $queryDir = "/home/duceppemo/Desktop/query_count/";
my $hitDir = "/home/duceppemo/Desktop/hit_count/";
opendir (DIR, "$queryDir") or die "Error opening $queryDir: $!"; #Open the directory containing the files with sequences to look for
my @queryFileNames = readdir (DIR);
opendir (DIR, "$hitDir") or die "Error opening $hitDir: $!"; #Open the directory containing the files with sequences to look for
my @hitFileNames = readdir (DIR);
my $index = 0;
$index ++ until $queryFileNames[$index] eq ".";
splice(@queryFileNames, $index, 1);
$index = 0;
$index ++ until $queryFileNames[$index] eq "..";
splice(@queryFileNames, $index, 1);
$index = 0;
$index ++ until $hitFileNames[$index] eq ".";
splice(@hitFileNames, $index, 1);
$index = 0;
$index ++ until $hitFileNames[$index] eq "..";
splice(@hitFileNames, $index, 1);
#counter for query file number opened
my $i = 0;
foreach my $queryFile (@queryFileNames) #adjust the file name according to the subdirectory
{
$i += 1; #keep track of the file number opened
$queryFile = $queryDir . $queryFile;
open (QUERY, "$queryFile") or die "Error opening $queryFile: $!";
my @query = <QUERY>; #Put the query sequences from the count file into an array
close (QUERY);
my $line = 0;
open (RESULT, ">>$outputfile") or die "Error opening $outputfile: $!";
foreach my $lineQuery (@query) #look into the query file
{
my @columns = split(/\s+/, $lineQuery); #Split each line into a new array, when it meets a whitespace character (including tab)
if ($i == 1)
{
#open (RESULT, ">>$outputfile") or die "Error opening $outputfile: $!";
print RESULT "$columns[0]\t";
print RESULT "$columns[1]\n";
#close (RESULT);
$line += 1;
}
else
{
open (RESULT, ">>$outputfile") or die "Error opening $outputfile: $!";
print RESULT "$columns[1]\n";
close (RESULT);
$line += 1;
}
}
$line = 0;
}
close (RESULT);
closedir (DIR);
P.S. Any other advice on code optimisation will be gratefully accepted!

The main problem is that you don't seem to understand what a FILEHANDLE is. You should research this.
A filehandle is a sort of reference to an open file, and since everything is a file, it can be a command or a directory.
When you call opendir(DIR, ...), "DIR" is not a keyword but a filehandle that can have any name. That means your 2 opendir() calls use the same filehandle, which does not make sense.
It should be more like:
opendir(QDIR, $queryDir) or die "Error opening $queryDir: $!";
my @queryFileNames = readdir(QDIR);
opendir(HDIR, $hitDir) or die "Error opening $hitDir: $!";
my @hitFileNames = readdir(HDIR);
Also, since you should always close every open filehandle, you must call close() at the same level and make sure close() will be called.
e.g. opening the filehandle RESULT inside a loop and closing it only after that loop does not make sense... How many times will you open it without closing it?
You probably need to open it before the loop, and you don't have to open it twice with the same filehandle...
In general you want to avoid open/close in loops. You simply open before and close after.
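To make that concrete, here is a minimal sketch (the file name is a placeholder) that opens the output file once with a lexical filehandle before the loop, writes inside the loop, and closes once afterwards:

```perl
use strict;
use warnings;

my $outputfile = 'out.txt';    # placeholder path

# Open once, before the loop, using a lexical filehandle.
open my $result_fh, '>', $outputfile
    or die "Error opening $outputfile: $!";

for my $value (1 .. 3) {
    print {$result_fh} "$value\n";    # reuse the same handle inside the loop
}

# Close once, after the loop, at the same level as the open.
close $result_fh
    or die "Error closing $outputfile: $!";
```

A lexical filehandle also closes itself automatically when `$result_fh` goes out of scope, which avoids the dangling-bareword-handle problem entirely.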

This code does pretty much what I want:
#!/usr/bin/perl
use strict;
use warnings;
#my $queryDir = "ARGV[0]";
my $queryDir = "C:/Users/Marco/Desktop/query_count/";
opendir (DIR1, "$queryDir") or die "Error opening $queryDir: $!"; #Open the directory containing the files with sequences to look for
my @queryFileName = readdir (DIR1);
#my $hitDir = "ARGV[1]";
my $hitDir = "C:/Users/Marco/Desktop/hit_count/";
opendir (DIR2, "$hitDir") or die "Error opening $hitDir: $!"; #Open the directory containing the files with sequences to look for
my @hitFileName = readdir (DIR2);
my $index = 0;
$index ++ until $queryFileName[$index] eq ".";
splice(@queryFileName, $index, 1);
$index = 0;
$index ++ until $queryFileName[$index] eq "..";
splice(@queryFileName, $index, 1);
$index = 0;
$index ++ until $hitFileName[$index] eq ".";
splice(@hitFileName, $index, 1);
$index = 0;
$index ++ until $hitFileName[$index] eq "..";
splice(@hitFileName, $index, 1);
foreach my $queryFile (@queryFileName) #adjust the queryFileName according to the subdirectory
{
$queryFile = "$queryDir" . $queryFile;
}
foreach my $hitFile (@hitFileName) #adjust the hitFileName according to the subdirectory
{
$hitFile = "$hitDir" . $hitFile;
}
my $outputfile = "C:/Users/Marco/Desktop/out.txt";
my %hash;
foreach my $queryFile (@queryFileName)
{
my $i = 0;
open (QUERY, "$queryFile") or die "Error opening $queryFile: $!";
while (<QUERY>)
{
chomp;
my $val = (split /\t/)[1];
$i++;
$hash{$i}{$queryFile} = $val;
}
close (QUERY);
}
foreach my $hitFile (@hitFileName)
{
my $i = 0;
open (HIT, "$hitFile") or die "Error opening $hitFile: $!";
while (<HIT>)
{
chomp;
my $val = (split /\t/)[1];
$i++;
$hash{$i}{$hitFile} = $val;
}
close (HIT);
}
open (RESULT, ">>$outputfile") or die "Error opening $outputfile: $!";
foreach my $qfile (@queryFileName)
{
print RESULT "\t$qfile";
}
foreach my $hfile (@hitFileName)
{
print RESULT "\t$hfile";
}
print RESULT "\n";
foreach my $id (sort keys %hash)
{
print RESULT "$id\t";
print RESULT "$hash{$id}{$_}\t" foreach (@queryFileName, @hitFileName);
print RESULT "\n";
}
close (RESULT);
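One caveat about the report loop above: sort keys %hash compares the line numbers as strings, so with more than nine lines "10" would print before "2". A quick illustration of the difference, using made-up keys:

```perl
use strict;
use warnings;

# Hypothetical line-number keys, like the %hash in the script above would hold.
my %hash = map { $_ => "row$_" } 1 .. 12;

my @lexical = sort keys %hash;               # string comparison
my @numeric = sort { $a <=> $b } keys %hash; # numeric comparison

print "lexical: @lexical\n";
print "numeric: @numeric\n";
```

Replacing sort keys %hash with sort { $a <=> $b } keys %hash keeps the report rows in their original line order.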

Related

Loop through file with similar names

How can I loop through files with similar names? This script works only for the first line of the first file, and I don't understand why. Is there a simpler way to do it?
The script is meant to read the files and write to another file all the lines that contain no numbers.
use Data::Dumper;
use utf8;
#read OUT_AM3.txt, OUT_MOV3.txt, OUT_TA3.txt
opendir (DIR, '.') or die "Couldn't open directory, $!";
my @files = readdir(DIR);
closedir DIR;
$out = "Res.txt";
open (O, ">>", $out);
binmode(O, "utf8");
@eti = ("AM3","TA3","MOV3");
for ($i = 0; $i < @eti; $i++){
foreach $fh(@files){
open($fh, "<", "OUT_$eti[$i].txt");
binmode($fh, "utf8");
while(defined($l = <$fh>)){
if (!grep /\-?\d\.\d+/, $l){
print O $l;
}
}
}
}
You don't need
for ($i = 0; $i < @eti; $i++)
as it will loop three times over all files found in directory.
Also, when looping over @files it is expected to use array elements,
foreach my $file (@files) {
-f $file or next;
open(my $fh, "<", $file) or die $!;
# ..
}
"But the result is not very important. I would like to know how to open each file in the directory without always writing the name of the file."
If I understand you correctly, you might want this?
my @files = grep /OUT_.*/, glob("*");
print $_ . "\n" foreach @files;
Or if you don't mind using module. There is File::Find::Rule
use File::Find::Rule;
my $rule = File::Find::Rule->new;
$rule->file;
$rule->name( 'OUT_*' );
my @files = $rule->in( "." );
Both will give a list of file names in @files.

Reading and comparing lines in Perl

I am having trouble with getting my perl script to work. The issue might be related to the reading of the Extract file line by line within the while loop, any help would be appreciated. There are two files
Bad file that contains a list of bad IDs (100s of IDs)
2
3
Extract file that contains delimited data with the ID in field 1 (millions of rows)
1|data|data|data
2|data|data|data
2|data|data|data
2|data|data|data
3|data|data|data
4|data|data|data
5|data|data|data
I am trying to remove all the rows from the large extract file where the IDs match. There can be multiple rows where the ID matches. The extract is sorted.
#use strict;
#use warnning;
$SourceFile = $ARGV[0];
$ToRemove = $ARGV[1];
$FieldNum = $ARGV[2];
$NewFile = $ARGV[3];
$LargeRecords = $ARGV[4];
open(INFILE, $SourceFile) or die "Can't open source file: $SourceFile \n";
open(REMOVE, $ToRemove) or die "Can't open toRemove file: $ToRemove \n";
open(OutGood, "> $NewFile") or die "Can't open good output file \n";
open(OutLarge, "> $LargeRecords") or die "Can't open Large Records output file \n";
#Read in the list of bad IDs into array
@array = <REMOVE>;
#Loop through each bad record
foreach (@array)
{
$badID = $_;
#read the extract line by line
while(<INFILE>)
{
#take the line and split it into
@fields = split /\|/, $_;
my $extractID = $fields[$FieldNum];
#print "Here's what we got: $badID and $extractID\n";
while($extractID == $badID)
{
#Write out bad large records
print OutLarge join '|', @fields;
#Get the next line in the extract file
@fields = split /\|/, <INFILE>;
my $extractID = $fields[$FieldNum];
$found = 1; #true
#print " We got a match!!";
#remove item after it has been found
my $input_remove = $badID;
@array = grep {!/$input_remove/} @array;
}
print OutGood join '|', @fields;
}
}
Try this:
$ perl -F'|' -nae 'BEGIN {while(<>){chomp; $bad{$_}++;last if eof;}} print unless $bad{$F[0]};' bad good
First, you are lucky: The number of bad IDs is small. That means, you can read the list of bad IDs once, stick them in a hash table without running into any difficulty with memory usage. Once you have them in a hash, you just read the big data file line by line, skipping output for bad IDs.
#!/usr/bin/env perl
use strict;
use warnings;
# hardwired for convenience
my $bad_id_file = 'bad.txt';
my $data_file = 'data.txt';
my $bad_ids = read_bad_ids($bad_id_file);
remove_data_with_bad_ids($data_file, $bad_ids);
sub remove_data_with_bad_ids {
my $file = shift;
my $bad = shift;
open my $in, '<', $file
or die "Cannot open '$file': $!";
while (my $line = <$in>) {
if (my ($id) = extract_id(\$line)) {
exists $bad->{ $id } or print $line;
}
}
close $in
or die "Cannot close '$file': $!";
return;
}
sub read_bad_ids {
my $file = shift;
open my $in, '<', $file
or die "Cannot open '$file': $!";
my %bad;
while (my $line = <$in>) {
if (my ($id) = extract_id(\$line)) {
$bad{ $id } = undef;
}
}
close $in
or die "Cannot close '$file': $!";
return \%bad;
}
sub extract_id {
my $string_ref = shift;
if (my ($id) = ($$string_ref =~ m{\A ([0-9]+) }x)) {
return $id;
}
return;
}
I'd use a hash as follows:
use warnings;
use strict;
my @bad = qw(2 3);
my %bad;
$bad{$_} = 1 foreach @bad;
my @file = qw (1|data|data|data 2|data|data|data 2|data|data|data 2|data|data|data 3|data|data|data 4|data|data|data 5|data|data|data);
my %hash;
foreach (@file){
my @split = split(/\|/);
$hash{$split[0]} = $_;
}
foreach (sort keys %hash){
print "$hash{$_}\n" unless exists $bad{$_};
}
Which gives:
1|data|data|data
4|data|data|data
5|data|data|data

how to count the number of specific characters through each line from file?

I'm trying to count the number of 'N's in a FASTA file which is:
>Header
AGGTTGGNNNTNNGNNTNGN
>Header2
AGNNNNNNNGNNGNNGNNGN
so in the end I want to get the count of the number of 'N's. Each header is a read, so I want to make a histogram, and at the end output something like this:
# of N's # of Reads
0 300
1 240
etc...
so there are 300 sequences or reads that have 0 'N's
use strict;
use warnings;
my $file = shift;
my $output_file = shift;
my $line;
my $sequence;
my $length;
my $char_N_count = 0;
my @array;
my $count = 0;
if (!defined ($output_file)) {
die "USAGE: Input FASTA file\n";
}
open (IFH, "$file") or die "Cannot open input file$!\n";
open (OFH, ">$output_file") or die "Cannot open output file $!\n";
while($line = <IFH>) {
chomp $line;
next if $line =~ /^>/;
$sequence = $line;
@array = split ('', $sequence);
foreach my $element (@array) {
if ($element eq 'N') {
$char_N_count++;
}
}
print "$char_N_count\n";
}
Try this. I changed a few things like using scalar file handles. There are many ways to do this in Perl, so some people will have other ideas. In this case I used an array which may have gaps in it - another option is to store results in a hash and key by the count.
Edit: Just realised I'm not using $output_file, because I have no idea what you want to do with it :) Just change the 'print' at the end to 'print $out_fh' if your intent is to write to it.
use strict;
use warnings;
my $file = shift;
my $output_file = shift;
if (!defined ($output_file)) {
die "USAGE: $0 <input_file> <output_file>\n";
}
open (my $in_fh, '<', $file) or die "Cannot open input file '$file': $!\n";
open (my $out_fh, '>', $output_file) or die "Cannot open output file '$output_file': $!\n";
my @results = ();
while (my $line = <$in_fh>) {
next if $line =~ /^>/;
my $num_n = ($line =~ tr/N//);
$results[$num_n]++;
}
print "# of N's\t# of Reads\n";
for (my $i = 0; $i < scalar(@results) ; $i++) {
unless (defined($results[$i])) {
$results[$i] = 0;
# another option is to 'next' if you don't want to show the zero totals
}
print "$i\t\t$results[$i]\n";
}
close($in_fh);
close($out_fh);
exit;
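The hash alternative mentioned above, keying by the N count so there are no array gaps to fill in, might look like this sketch; the two sample sequences are taken from the question, and the input file is replaced by an inline array for brevity:

```perl
use strict;
use warnings;

# Sample FASTA lines stand in for reading the input file.
my @lines = ('>Header', 'AGGTTGGNNNTNNGNNTNGN', '>Header2', 'AGNNNNNNNGNNGNNGNNGN');

my %histogram;    # N count => number of reads
for my $line (@lines) {
    next if $line =~ /^>/;           # skip FASTA headers
    my $num_n = ($line =~ tr/N//);   # tr/// in scalar context counts characters
    $histogram{$num_n}++;
}

# Only counts that actually occurred are present; sort numerically for output.
print "# of N's\t# of Reads\n";
for my $n (sort { $a <=> $b } keys %histogram) {
    print "$n\t\t$histogram{$n}\n";
}
```

With a hash you trade the gap-filling loop for a numeric sort of the keys; which reads better is largely a matter of taste.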

I can't output properly

I'm trying to print a character from a file each time I get a char as input.
My problem is that it prints the whole line. I know it's a logic problem, I just can't figure out how to fix it.
use Term::ReadKey;
$inputFile = "input.txt";
open IN, $inputFile or die "I can't open the file :$ \n";
ReadMode("cbreak");
while (<IN>) {
$line = <IN>;
$char = ReadKey();
foreach $i (split //, $line) {
print "$i" if ($char == 0);
}
}
Move the ReadKey call into the foreach loop.
use strictures;
use autodie qw(:all);
use Term::ReadKey qw(ReadKey ReadMode);
my $inputFile = 'input.txt';
open my $in, '<', $inputFile;
ReadMode('cbreak');
while (my $line = <$in>) {
foreach my $i (split //, $line) {
my $char = ReadKey;
print $i;
}
}
END { ReadMode('restore') }
Your original code has 3 problems:
You only read the character once (outside the for loop)
You read 1 line from the input file when testing while (<IN>) { (LOSING that line!) and then another in $line = <IN>; - therefore, your logic only reads the even-numbered lines
print "$i" prints 1 line with no newline, therefore, you don't see characters separated
My script reads all the files in a directory, puts them in a list, and chooses a random file from the given list.
After that, each time it gets an input char from the user, it prints a char from the file.
#!C:\perl\perl\bin\perl
use Term::ReadKey qw(ReadKey ReadMode);
use autodie qw(:all);
use IO::Handle qw();
use Fatal qw( open );
STDOUT->autoflush(1);
my $directory = "codes"; #directory's name
opendir (DIR, $directory) or die "I can't open the directory $directory :$ \n"; #open the dir
my @allFiles; #array of all the files
while (my $file = readdir(DIR)) { #read each file from the directory
next if ($file =~ m/^\./); #exclude it if it starts with '.'
push(@allFiles, $file); #add file to the array
}
closedir(DIR); #close the input directory
my $filesNr = scalar(grep {defined $_} @allFiles); #get the size of the files array
my $randomNr = int(rand($filesNr)); #generate a random number in the given range (size of array)
$file = $allFiles[$randomNr]; #get the file at given index
open IN, $file or die "I can't open the file :$ \n"; #read the given file
ReadMode('cbreak'); #don't print the user's input
while (my $line = <IN>) { #read each line from file
foreach my $i (split //, $line) { #split the line in characters (including \n & \t)
print "$i" if ReadKey(); #if keys are pressed, print the inexed char
}
}
END {
ReadMode('restore') #deactivate 'cbreak' read mode
}

perl increasing the counter number every time the script runs

I have a script that compares 2 files and prints out the matching lines. I want to add logic that helps me identify how long these devices have been matched. Currently I start the counter at 1, and I want to increase that number every time the script runs and finds a match.
Example.
inputfile:-########################
retiredDevice.txt
Alpha
Beta
Gamma
Delta
prodDevice.txt
first
second
third
forth
Gamma
Delta
output file :-#######################
final_result.txt
1 Delta
1 Gamma
my objective is to add a counter stamp to each matching line to identify how long "Delta" and "Gamma" have matched. The script runs every week, adding 1 each time, so when I audit final_result.txt the result should look like
Delta 4
Gamma 3
the result tells me Delta has matched for the last 4 weeks and Gamma for the last 3 weeks.
#! /usr/local/bin/perl
my $ndays = 1;
my $f1 = "/opt/retiredDevice.txt ";
my $f2 = "prodDevice.txt";
my $outfile = "/opt/final_result.txt";
my %results = ();
open FILE1, "$f1" or die "Could not open file: $! \n";
while(my $line = <FILE1>){ $results{$line}=1;
}
close(FILE1);
open FILE2, "$f2" or die "Could not open file: $! \n";
while(my $line =<FILE2>) {
$results{$line}++;
}
close(FILE2);
open (OUTFILE, ">$outfile") or die "Cannot open $outfile for writing \n";
foreach my $line (keys %results) {
my $x = $ndays;
$x++;
print OUTFILE "$x : ", $line if $results{$line} != 1;
}
close OUTFILE;
Thanks in advance for any help!
Based on your earlier question and comments, perhaps this might work.
use strict;
use warnings;
use autodie;
my $logfile = 'int.txt';
my $f1 = shift || "/opt/test.txt";
my $f2 = shift || "/opt/test1.txt";
my %results;
open my $file1, '<', $f1;
while (my $line = <$file1>) {
chomp $line;
$results{$line} = 1;
}
open my $file2, '<', $f2;
while (my $line = <$file2>) {
chomp $line;
$results{$line}++;
}
{ ############ added part
my %c;
for (keys %results) {
$c{$_} = $results{$_} if $results{$_} > 1;
}
%results = %c;
} ############ end added part
my (%log, $log);
if ( -e $logfile ) {
open $log, '<', $logfile;
while (<$log>) {
my ($num, $key) = split;
$log{$key} = $num;
}
}
open $log, '>', $logfile or die $!;
for my $key (keys %results) {
my $old = ( $log{$key} || 0 ); # keep old count, or 0 otherwise
my $new = ( $results{$key} ? 1 : 0 ); # 1 if it exists, 0 otherwise
print $log $old + $new, " $key\n";
}
Perform this computation in two steps.
Each time you run the comparison between retired and prod, produce an output file that you save with a unique file name, e.g. result-XXX where XXX denotes when you ran the comparison.
Then write a script which iterates over all of the result-XXX files and produces a summary.
I would name the files result-YYYY-MM-DD where YYYY-MM-DD is the date that the comparison was created. Then it will be relatively easy to iterate over a subset of the files (e.g. ones for a certain month).
Or store the data in a relational database.
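A sketch of that second step, assuming each result-YYYY-MM-DD file holds one matched device name per line (the file layout is an assumption here, not something given in the question):

```perl
use strict;
use warnings;

# Count, for each device, how many weekly result files it appears in.
sub tally_weeks {
    my @files = @_;
    my %weeks_matched;
    for my $file (@files) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (my $device = <$fh>) {
            chomp $device;
            $weeks_matched{$device}++ if length $device;
        }
        close $fh;
    }
    return \%weeks_matched;
}

my $tally = tally_weeks(glob 'result-*');
print "$_ $tally->{$_}\n" for sort keys %$tally;
```

A device that matched in 4 of the weekly files comes out as "Delta 4", which is the audit view the question asks for, and restricting the glob pattern (e.g. result-2013-01-*) summarizes just one month.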