I am new to Perl and need your help building some logic.
I have, let's say, 10 files in a directory, each containing data like the lines below. The number of lines per file depends on the number-of-users setting; for example, with 4 users, the server prints 4 lines.
1405075666889,4044,SOA_breade,200,OK,Thread Group 1-1,text,true,623,4044
1405075666889,4041,SOA_breade,200,OK,Thread Group 1-1,text,true,623,4041
1405075666889,4043,SOA_breade,200,OK,Thread Group 1-1,text,true,623,4043
1405075666889,4045,SOA_breade,200,OK,Thread Group 1-1,text,true,623,4044
I want some logic that creates a single file in an output directory; that file should contain 10 lines (one per input file) under the header
Min_Value, Max_Value, Avg_Value, User1, User2, User3......User4
with the values taken from the second column of each input line, like this:
Min_Value, Max_Value, Avg_Value, User1, User2, User3......User4
4.041,4.045,4.044,4.041,4.043,4.045
...
10th file data
Here is my code. It works, but I can't figure out how to print User1, User2, ... in sequence along with their corresponding values.
my @soaTime;
my @soaminTime;
my @soamaxTime;
my @soaavgTime;
my $soadir = $Json_Time;
foreach my $inputfile (glob("$soadir/*Overview*.txt")) {
open(INFILE, $inputfile) or die("Could not open file.");
foreach my $line (<INFILE>) {
my @values = split(',', $line); # parse the file
my $time_ms = $values[1]/1000;
push (@soaTime, $time_ms);
}
my $min = min @soaTime;
push (@soaminTime, $min);
print $soaminTime[0];
my $max = max @soaTime;
push (@soamaxTime, $max);
sub mean { return @_ ? sum(@_) / @_ : 0 };
#print mean(@soaTime);
push (@soaavgTime, mean());
close(INFILE);
}
my $outputfile = $report_path."abc.txt";
open (OUTFILE, ">$outputfile");
print OUTFILE ("Min_Value,Max_Value,User1,User2,User3,User4"."\n"); # Printing the data
for (my $count = 0; $count <= $#soaTC; $count++) {
print OUTFILE("$soaminTime[0],$soamaxTime[0],$soaTime[0],$soaTime[1],$soaTime[2],$soaTime[3]"."\n" ); # Printing the data
}
close(OUTFILE);
Please help.
Is this what you want?
use strict;
use List::Util qw( min max sum );
my $Json_Time="./test";
my $report_path="./out/";
my @soaTime;
my @soaminTime;
my @soamaxTime;
my @soaavgTime;
my @users;
my $maxusers = -1; # start below any real index so a single-user file still sets @users
my $soadir = $Json_Time;

sub mean { return @_ ? sum(@_) / @_ : 0 }

foreach my $inputfile (glob("$soadir/*Overview*.txt")) {
open(INFILE, $inputfile) or die("Could not open file.");
my $i=0;
my @m_users;
my @m_soaTime;
foreach my $line (<INFILE>) {
my @values = split(',', $line); # parse the file
my $time_ms = $values[1]/1000;
push (@m_soaTime, $time_ms);
$i++;
push(@m_users, "User".$i);
}
push(@soaTime,\@m_soaTime);
if ($maxusers<$#m_users) {
@users=@m_users;
$maxusers=$#m_users;
}
my $min = min(@m_soaTime);
push (@soaminTime, $min);
my $max = max(@m_soaTime);
push (@soamaxTime, $max);
push (@soaavgTime, mean(@m_soaTime)); # average of this file's times
close(INFILE);
}
my $outputfile = $report_path."abc.txt";
open (OUTFILE, ">$outputfile");
print OUTFILE "Min_Value,Max_Value,Avg_Value,".join(',',@users)."\n"; # Printing the data
for (my $count = 0; $count <= $#soaavgTime; $count++) {
print OUTFILE $soaminTime[$count].","
.$soamaxTime[$count].","
.$soaavgTime[$count].","
.join(',',@{$soaTime[$count]})
."\n"; # Printing the data
}
close(OUTFILE);
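With the hard-coded paths above, this reads every *Overview*.txt file under ./test and writes ./out/abc.txt: one header line, then one row per input file with that file's min, max, and average followed by the per-user times (the second column of each input line, divided by 1000).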
I have a problem with my code. I have 1 GB of records which I have to sort by date and time. The records look like:
TYP_journal article|KEY_1926000001|AED_|TIT_A Late Eighteenth-Century Purist|TPA_|GLO_Pronouncements of George Campbell and his contemporaries which time has set aside.|AUT_Bryan, W. F.|AUS_|AFF_|RES_|IED_|TOC_|FJN_Studies in Philology|ISN_0039-3738|ESN_|PLA_Chapel Hill, NC|URL_|DAT_1926|VOL_23|ISS_|EXT_358-370|CPP_|FSN_|ISN_|PLA_|SNO_|PUB_|IBZ_|PLA_|PYR_|PAG_|DAN_|DGI_|DGY_|OFP_|OFU_|FSS_|PDF_|LIB_|INO_|FAU_|INH_|IUR_|INU_|CDT_9/15/2003 3:12:28 PM|MDT_5/16/2017 9:18:40 AM|
I sort these records using the MDT_ field (e.g. MDT_5/16/2017 9:18:40 AM).
I used the technique below:
First I filter the file on whether records have MDT_ or not (creating one file with MDT_ and one without).
For the MDT_ data:
open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" or die "could not open file: $!";
my @Dt_ModifiedDate = grep { $_ =~ /MDT_([0-9]+)\/([0-9]+)\/([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([A-Z]+)/i} <read_file>;
my $doc_MD = IO::File->new(">$current_ou/output/$file_name_with_out_ext.ModifiedDate");
$doc_MD->binmode(':utf8');
print $doc_MD @Dt_ModifiedDate;
$doc_MD->close;
close (read_file);
For the un-MDT_ data:
open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" or die "could not open file: $!";
my @un_ModifiedDate = grep { $_ !~ /MDT_([0-9]+)\/([0-9]+)\/([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([A-Z]+)/} <read_file>;
my $doc_UMD = IO::File->new(">$current_ou/output/$file_name_with_out_ext.unModifiedDate");
$doc_UMD->binmode(':utf8');
print $doc_UMD @un_ModifiedDate;
$doc_UMD->close;
close (read_file);
From the file containing MDT_, I collect all the dates and times, sort them, and then remove duplicates.
@modi_date = map $_->[0],
sort { uc($a->[1]) cmp uc($b->[1]) } map { [ $_, toISO8601($_) ] } @modi_date;
@modi_date = reverse (@modi_date);
@modi_date = uniq (@modi_date);
Then, according to the sorted dates and times, I grep all matching records from the MDT_ file and finally create the final file.
my $doc1 = IO::File->new(">$current_ou/output/$file_name_with_out_ext.sorted_data");
$doc1->binmode(':utf8');
foreach my $changes (@modi_date)
{
chomp($changes);
$Count_pro++;
@ab = grep (/$changes/, @all_data_with_time);
print $doc1 ("@ab\n");
$progress_bar->update($Count_pro);
}
$doc1->close;
But this process takes a long time. Is there any way to do it faster?
As you pointed out, doing everything in memory is not an option on your machine. However, I do not see why you first sort the dates and then grep all records carrying each date, instead of sorting those records on the date directly.
I also suspect that going through the original file line by line, rather than in one huge map-sort-map, might save some memory, but I'll leave that up to you to try; it would save you creating the intermediate files and then re-parsing them.
I would suggest doing steps 2 + 3 in one go:
Skip building @modi_date (somewhere not visible to us :/).
use DateTime::Format::Strptime;

my $mdt_fn = 'with_mdt.txt'; # <- whatever name you gave that file?
open ( my $fh, '< :encoding(UTF-8)', $mdt_fn )
or die "could not open file '$mdt_fn' to read: $!";
my $dt_parser = DateTime::Format::Strptime->new(
pattern => '%m/%d/%Y %r',
on_error => 'croak', # complain loudly if a date cannot be parsed
);
# get all records from file. To ensure we only need to parse the line once,
# store the datetime in a hashref.
my @records;
while ( my $line = <$fh> ){
chomp $line; # we add the newline back when writing
push @records, {
dt => _dt_from_record($line),
record => $line,
};
}
# If you wanted to cmp rather than doing datetime comparison,
# adapt _dt_from_record and use 'cmp' instead of '<=>'
@records = sort{ $a->{dt} <=> $b->{dt} }@records;
open ( my $out_fh, '> :encoding(UTF-8)', 'sorted.txt') or
die "could not open file to write to: $!";
# Or reverse first if you want latest to oldest
print $out_fh $_->{record}."\n" for @records;
close $out_fh;
# I prefer using DateTime for this.
# Using a parser will alert me if some date was set, but cannot be parsed.
# If you want to spare yourself some additional time,
# why not store the parsed date in the file. However, I doubt this takes long.
sub _dt_from_record {
my $record = shift;
$record =~ /MDT_([^\|]+)/
or die "no MDT_ field in record: $record";
return $dt_parser->parse_datetime($1);
}
Finally I got it done.
The complete code is:
use warnings;
use strict;
use 5.010;
use Cwd;
binmode STDOUT, ":utf8";
use Date::Simple ('date', 'today');
use Time::Simple;
use Encode;
use Time::Piece;
use Win32::Console::ANSI;
use Term::ANSIScreen qw/:color /;
use File::Copy;
use IO::File;
BEGIN {our $start_run = time();
my $Start = localtime;
print colored ['bold green'], ("\nstart time :- $Start\n");
}
## variables
my $current_dir = getcwd();
my $current_in = $ARGV[0];
my $current_ou = $ARGV[1];
my @un_ext_file;
my @un_ext_file1;
my $current_data =today();
my $time = Time::Simple->new();
my $hour = $time->hours;
my $minute = $time->minutes;
my $second = $time->seconds;
my $current_time = "$hour"."-"."$minute"."-"."$second";
my $ren_folder = "output_"."$current_data"."_"."$current_time";
##check for output name DIR
opendir(DIR1, $current_ou);
my @current_ou_folder = readdir(DIR1);
closedir(DIR1);
foreach my $entry (@current_ou_folder)
{
if ($entry eq "output")
{
move "$current_ou/output" , "$current_ou/$ren_folder";
mkdir "$current_ou/output";
}
else
{
mkdir "$current_ou/output";
}
}
opendir(DIR, $current_in);
my @files_and_folder = readdir(DIR);
closedir(DIR);
foreach my $entry (@files_and_folder)
{
next if $entry eq '.' or $entry eq '..';
next if -d $entry;
push(@un_ext_file1, $entry);
}
##### check duplicate file name
my %seen;
my @file_test;
foreach my $file_name (@un_ext_file1)
{
if ($file_name =~ /(.*)\.([a-z]+)$/)
{
push (@file_test, $1);
}
else
{
push (@file_test, $file_name);
}
}
foreach my $string (@file_test)
{
next unless $seen{$string}++;
print "'$string' is duplicated.\n";
}
## collect all files from the array
foreach my $file_name (@un_ext_file1)
{
my $REC_counter=0;
if ($file_name =~ /(.*)\.([a-z]+)$/) #####work for all extension
{
my $file_name_with_out_ext = $1;
my @modi_date_not_found;
eval{
##### read source file
##### First sort the file date-wise (old dates appear first, new dates appear last)
##### To get modifiedDate from the file
open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" or die "could not open file: $!";
my @Dt_ModifiedDate = grep { $_ =~ /MDT_([0-9]+)\/([0-9]+)\/([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([A-Z]+)/i} <read_file>;
my $doc_MD = IO::File->new(">$current_ou/output/$file_name_with_out_ext.ModifiedDate");
$doc_MD->binmode(':utf8');
print $doc_MD @Dt_ModifiedDate;
$doc_MD->close;
close (read_file);
@Dt_ModifiedDate = (); ##### free after use
print colored ['bold green'], ("\n\tAll ModifiedDate data Filtered\n\n");
##### To get un-modifiedDate from the file
open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" or die "could not open file: $!";
my @un_ModifiedDate = grep { $_ !~ /MDT_([0-9]+)\/([0-9]+)\/([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([A-Z]+)/} <read_file>;
my $doc_UMD = IO::File->new(">$current_ou/output/$file_name_with_out_ext.unModifiedDate");
$doc_UMD->binmode(':utf8');
print $doc_UMD @un_ModifiedDate;
$doc_UMD->close;
close (read_file);
@un_ModifiedDate = (); ##### free after use
print colored ['bold green'], ("\n\tAll unModifiedDate data Filtered\n\n\n\n");
##### Read ModifiedDate
open read_file_ModifiedDate, '<:encoding(UTF-8)', "$current_ou/output/$file_name_with_out_ext.ModifiedDate" or die "could not open file: $!";
my @all_ModifiedDate = <read_file_ModifiedDate>;
close(read_file_ModifiedDate);
##### write ModifiedDate data into the sorted_data file after sorting.
my $doc1 = IO::File->new(">$current_ou/output/$file_name_with_out_ext.sorted_data");
$doc1->binmode(':utf8');
print $doc1 sort { (toISO8601($a)) cmp (toISO8601($b)) } @all_ModifiedDate;
$doc1->close;
##### Read sorted_data and do in reverse order and then read unModifiedDate data and write in final file.
open read_file_ModifiedDate, '<:encoding(UTF-8)', "$current_ou/output/$file_name_with_out_ext.sorted_data" or die "could not open file: $!";
my @all_sorted_data = <read_file_ModifiedDate>;
close(read_file_ModifiedDate);
@all_sorted_data = reverse (@all_sorted_data);
open read_file_ModifiedDate, '<:encoding(UTF-8)', "$current_ou/output/$file_name_with_out_ext.unModifiedDate" or die "could not open file: $!";
my @all_unModifiedDate = <read_file_ModifiedDate>;
close(read_file_ModifiedDate);
my $doc_final = IO::File->new(">$current_ou/output/$file_name_with_out_ext.txt"); # use the saved name; $1 from the earlier match may have been clobbered
$doc_final->binmode(':utf8');
print $doc_final @all_sorted_data;
print $doc_final @all_unModifiedDate;
$doc_final->close;
unlink("$current_ou/output/$file_name_with_out_ext.ModifiedDate");
unlink("$current_ou/output/$file_name_with_out_ext.sorted_data");
unlink("$current_ou/output/$file_name_with_out_ext.unModifiedDate");
}
}
}
#####Process Complete.
say "\n\n---------------------------------------------";
print colored ['bold green'], ("\tProcess Completed\n");
say "---------------------------------------------\n";
get_time();
sub toISO8601
{
my $record = shift;
$record =~ /MDT_([^\|]+)/;
return(Time::Piece->strptime($1, '%m/%d/%Y %I:%M:%S %p')->datetime);
}
sub get_time
{
my $end_run = time();
my $run_time = $end_run - our $start_run;
#my $days = int($sec/(24*60*60));
my $hours = ($run_time/(60*60))%24;
my $mins =($run_time/60)%60;
my $secs = $run_time%60;
print "\nJob took";
print colored ['bold green'], (" $hours:$mins:$secs ");
print "to complete this process\n";
my $End = localtime;
print colored ['bold green'], ("\nEnd time :- $End\n");
}
The whole process now completes within about 20 minutes.
Special thanks to @bytepusher.
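For reference, the script takes the input directory as $ARGV[0] and the output directory as $ARGV[1], so it is run as, for example, perl sort_records.pl ./in ./out (the script name here is hypothetical); an existing output folder under the output directory is renamed with a date-time suffix before a fresh one is created.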
I want to calculate the overlap count (#) and percentage (%) for a series of ranges distributed across four different files, each keyed by a specific identifier (id) such as NP_111111.4. The initial list of ids is taken from file1.txt (the starting file), and whenever an id matches an id in another file, the overlaps are calculated. Suppose my files are like this:
file1.txt
NP_111111.4: 1-9 12-20 30-41
YP_222222.2: 3-30 40-80
file2.txt
NP_111111.4: 1-6, 13-22, 31-35, 36-52
NP_414690.4: 360-367, 749-755
YP_222222.2: 19-24, 22-40
file3.txt
NP_418214.2: 1-133, 135-187, 195-272
YP_222222.2: 1-10
file4.txt
NP_418119.2
YP_222222.2 GO:0016878, GO:0051108
NP_111111.4 GO:0005887
From these input files, I want to create a .csv or Excel output with separate columns, with a header like:
id overlap_file1_file2(#) overlap_file1_file2(%) overlap_file1_file3(#) overlap_file1_file3(%) overlap_file1_file2_file3(#) overlap_file1_file2_file3(%) Go_Terms(File4)
I am learning Perl and found the modules strictures and Number::Range for this kind of range comparison. I calculate the overlap count and percentage of two ranges as:
#!/usr/bin/perl
use strictures;
use Number::Range;
my $seq1 = Number::Range->new(8..356); #Start and stop for file1.txt
my $seq2 = Number::Range->new(156..267); #Start and stop for file2.txt
my $overlap = 0;
my $sseq1 = $seq1->size;
my $percent = (($seq2->size * 100) / $seq1->size);
foreach my $int ($seq2->range) {
if ( $seq1->inrange($int) ) {
$overlap++;
}
else {
next;
}
}
print "Total size= $sseq1 Number overlapped= $overlap Percentage overlap= $percent \n";
But I could not find a way to match the ids of file1.txt with the other files, extract the specific information, and print it to an output CSV file.
Please help. Thanks for your consideration.
This is a fragile solution in that it can only check 3 files for overlaps; if more files are involved, the code would need to be restructured (a sketch of one possible generalization follows the output below). It uses Set::IntSpan to calculate the overlaps (and the percentages of overlap).
#!/usr/bin/perl
use strict;
use warnings;
use Set::IntSpan;
use autodie;
my $file1 = 'file1';
my @files = qw/file2 file3/;
my %data;
my %ids;
open my $fh1, '<', $file1;
while (<$fh1>) {
chomp;
my ($id, $list) = split /:\s/;
$ids{$id}++;
$data{$file1}{$id} = Set::IntSpan->new(split ' ', $list);
}
close $fh1;
for my $file (@files) {
open my $fh, '<', $file;
while (<$fh>) {
chomp;
my ($id, $list) = split /:\s/;
next unless exists $ids{$id};
$data{$file}{$id} = Set::IntSpan->new(split /,\s/, $list);
}
close $fh;
}
my %go_terms;
open my $go, '<', 'file4';
while (<$go>) {
chomp;
my ($id, $terms) = split ' ', $_, 2;
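# tr/,//dr deletes the commas and returns the modified copy (the r flag), so the GO terms don't break the CSV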
$go_terms{$id} = $terms =~ tr/,//dr;
}
close $go;
my %output;
for my $file (@files) {
for my $id (keys %ids) {
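# Set::IntSpan overloads '*' as set intersection, so this is the overlap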
my $count = ($data{$file1}{$id} * $data{$file}{$id})->size;
my $percent = sprintf "%.0f", 100 * $count / $data{$file1}{$id}->size;
$output{$id}{$file} = [$count, $percent];
}
}
for my $id (keys %ids) {
my $count = ($data{$file1}{$id} * $data{$files[0]}{$id} * $data{$files[1]}{$id})->size;
my $percent = sprintf "%.0f", 100 * $count / $data{$file1}{$id}->size;
$output{$id}{all_files} = [$count, $percent];
}
# output saved as f2.csv
print join(",", qw/ID f1f2_overlap f1f2_%overlap
f1f3_overlap f1f3_%overlap
f1f2f3_overlap f1f2f3_%overlap Go_terms/), "\n";
for my $id (keys %output) {
print "$id,";
for my $file (@files, 'all_files') {
my $aref = $output{$id}{$file};
print join(",", @$aref), ",";
}
print +($go_terms{$id} // ''), "\n";
}
The Excel sheet looks like this.
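As noted at the top, this answer is hard-wired to three files. Purely as a hedged sketch (not part of the original answer), the per-id intersection could be generalized to any number of files along these lines, assuming the same %data layout in which $data{$file}{$id} is a Set::IntSpan object:

use List::Util qw(reduce);

sub overlap_stats {
    my ($data, $base_file, $id, @other_files) = @_;
    # skip ids that are missing from any requested file
    return if grep { !$data->{$_}{$id} } ($base_file, @other_files);
    my $common = reduce { $a * $b }            # '*' is Set::IntSpan intersection
                 map { $data->{$_}{$id} } ($base_file, @other_files);
    my $count   = $common->size;
    my $percent = sprintf "%.0f", 100 * $count / $data->{$base_file}{$id}->size;
    return ($count, $percent);
}

Calling overlap_stats(\%data, 'file1', $id, 'file2') and overlap_stats(\%data, 'file1', $id, 'file2', 'file3') would then reproduce the pairwise and three-way columns.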
I have a CSV file with three columns, in order: Mb_size, tax_id, and parent_id. There is a relationship between tax_id and parent_id. For example, in the line near the end of the CSV where the Mb size is 22.2220658537, the tax id is 5820 and the parent id is 5819; moving up the file, 5819 appears in the tax_id column. A parent id can repeat, but each tax id is unique in its column.
Starting at the end, where Mb_size has values, I need to work up to the top, calculating the average every time a parent_id appears as a tax_id. When that happens, the parent id next to that tax id becomes the new starting point for moving up.
Below is the sample input :
Mb_size,tax_id,parent_id
,1,1
,131567,1
,2759,131567
,5819,2759
,147429,2759
22.2220658537,5820,5819
184.801317,4557,147429
748.66869,4575,147429
555.55,1234,5819
Below is the sample output:
Mb_size,tax_id,parent_id
377.810518214,1,1
377.810518214,131567,1
377.810518214,2759,131567
288.886032927,5819,2759
466.7350035,147429,2759
22.2220658537,5820,5819
184.801317,4557,147429
748.66869,4575,147429
555.55,1234,5819
The code so far:
use strict;
use warnings;
no warnings 'numeric';
open taxa_fh, '<', "$ARGV[0]" or die qq{Failed to open "$ARGV[0]" for input: $!\n};
open match_fh, ">$ARGV[0]_sized.csv" or die qq{Failed to open for output: $!\n};
my %data;
while ( my $line = <taxa_fh> ) {
chomp( $line );
my @fields = split( /,/, $line );
my $Mb_size = $fields[0];
my $tax_id = $fields[1];
my $parent_id = $fields[2];
$data{$parent_id}{sum} += $Mb_size;
$data{$parent_id}{count}++;
}
for my $parent_id ( sort keys %data ) {
my $avg = $data{$parent_id}{sum} / $data{$parent_id}{count};
print match_fh "$parent_id, $avg \n";
}
close taxa_fh;
close match_fh;
The code I have so far is from an earlier helpful answer. I edited the question to make it clearer. I can't get the calculation to continue upward, nor the output to include the original lines from below.
I tried a foreach over tax_id, but it didn't work: it moves up but doesn't do the calculation. Any suggestions?
You need to build a data structure carefully from the bottom up first. I am using hashes for that.
Here, for every parent_id as a key, I build a hash in which I save the averages, tax_id, sum, and count associated with it.
As there can be multiple tax_ids associated with a single parent_id, we need to store their averages separately.
Once it becomes a tree-like structure, it is trivial to print it out according to our requirements.
As these are hashes, order is not preserved. To maintain order you can use arrays instead (see the sketch after the code below).
One way to do it is like this:
#!/usr/bin/perl
use strict;
use warnings;
open my $fh, '<', 'tax' or die "unable to open file:$!\n";
my %data;
my @lines;
chomp(my $header=<$fh>); #slurp header
while(<$fh>){
chomp;
my @fields=split(/,/);
if($fields[0]){
##actually field0 is avg so storing it as avg here
$data{$fields[2]}{$fields[1]}{avg}=$fields[0];
$data{$fields[2]}{sum}+=$fields[0];
$data{$fields[2]}{count}++;
}
else{
push(@lines,[split(/,/)]);
}
}
close($fh);
@lines=reverse @lines;
foreach my $lines(@lines){
if(exists $data{$lines->[1]}){
$data{$lines->[2]}{$lines->[1]}{avg}=($data{($lines->[1])}{sum})/($data{($lines->[1])}{count});
$data{$lines->[2]}{sum}+=$data{$lines->[2]}{$lines->[1]}{avg};
$data{$lines->[2]}{count}++;
}
else{
print "Sorry No Such Entry ",$lines->[2]," present\n";
}
}
print "$header\n";
foreach my $tax_id(keys %data){
foreach my $parent_id(keys %{ $data{$tax_id} } ){
if(ref ($data{$tax_id}{$parent_id}) eq 'HASH'){
print $data{$tax_id}{$parent_id}->{'avg'}.",".$tax_id.",".$parent_id."\n";
}
}
}
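As mentioned above, hashes do not preserve order. A minimal, hypothetical sketch of keeping the file order with a side array (it assumes the same %data structure built in the loop above):

my @order;    # (tax_id, parent_id) pairs in the order they appear in the file
# inside the while(<$fh>) loop above, record each line's ids:
#     push @order, [ $fields[1], $fields[2] ];
# then print in file order instead of iterating 'keys %data':
for my $pair (@order) {
    my ($tax_id, $parent_id) = @$pair;
    next unless ref $data{$parent_id}{$tax_id} eq 'HASH';
    print "$data{$parent_id}{$tax_id}{avg},$tax_id,$parent_id\n";
}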
Here is another similar solution, based on your work:
use strict;
use warnings;
open taxa_fh, '<', "$ARGV[0]" or die qq{Failed to open "$ARGV[0]" for input: $!\n};
open match_fh, ">$ARGV[0]_sized.csv" or die qq{Failed to open for output: $!\n};
my %node_data;
my %parent;
my @node_order;
my $header;
while ( my $line = <taxa_fh> ) {
chomp( $line );
if (1 == $.) {
$header = $line;
next; # Skip header
}
my @fields = split( /,/, $line );
my $Mb_size = $fields[0] || 0; # To avoid uninitialized warning
my $tax_id = $fields[1];
my $parent_id = $fields[2];
$parent{$tax_id} = $parent_id;
push @node_order, $tax_id;
$node_data{$tax_id} = $Mb_size;
}
# Add the node value for all parents in the tree
my %totals;
for my $tax_id ( sort keys %parent ) {
my $parent = $parent{$tax_id};
my $done = 0;
while( ! $done ) {
if ($node_data{$tax_id} > 0) {
$totals{$parent}->{sum} += $node_data{$tax_id};
$totals{$parent}->{count}++;
}
$done++ if ($parent{$parent} == $parent);
$parent = $parent{$parent};
}
}
print match_fh "$header\n";
for my $id ( @node_order ) {
my $avg;
if ( exists $totals{$id} ) {
# Parent Node
$avg = $totals{$id}->{sum} / $totals{$id}->{count};
} else {
# Leaf Node
$avg = $node_data{$id};
}
print match_fh "$avg, $id, " . $parent{$id} . "\n";
}
close taxa_fh;
close match_fh;
Output:
Mb_size,tax_id,parent_id
377.810518213425, 1, 1
377.810518213425, 131567, 1
377.810518213425, 2759, 131567
288.88603292685, 5819, 2759
466.7350035, 147429, 2759
22.2220658537, 5820, 5819
184.801317, 4557, 147429
748.66869, 4575, 147429
555.55, 1234, 5819
I have two files with two columns each:
FILE1
A B
1 #
2 #
3 !
4 %
5 %
FILE 2
A B
3 #
4 !
2 &
1 %
5 ^
The Perl script must compare column A in both files, and only if they are equal should column B of FILE 2 be printed.
So far I have the following code, but all I get is an infinite loop printing # from column B.
use strict;
use warnings;
use 5.010;
print "enter site:"."\n";
chomp(my $s = <>);
print "enter protein:"."\n";
chomp(my $p = <>);
open( FILE, "< $s" ) or die;
open( OUT, "> PSP.txt" ) or die;
open( FILE2, "< $p" ) or die;
my @firstcol;
my @secondcol;
my @thirdcol;
while ( <FILE> )
{
next if $. <2;
chomp;
my @cols = split;
push @firstcol, $cols[0];
push @secondcol, $cols[1]."\t"."\t".$cols[3]."\t"."\t"."\t"."N\/A"."\n";
}
my @firstcol2;
my @secondcol2;
my @thirdcol2;
while ( <FILE2> )
{
next if $. <2;
my @cols2 = split(/\t/, $_);
push @firstcol2, $cols2[0];
push @secondcol2, $cols2[4]."\n";
}
my $size = @firstcol;
my $size2 = @firstcol2;
for (my $i = 0; $i <= @firstcol ; $i++) {
for (my $j = 0; $j <= @firstcol2; $j++) {
if ( $firstcol[$i] eq $firstcol2[$j] )
{
print $secondcol2[$i];
}
}
}
my (@first, @second);
open my $first_fh, '<', 'file1.txt' or die $!;    # assuming these file names
open my $second_fh, '<', 'file2.txt' or die $!;
while(<$first_fh>){
chomp;
push @first, split / /, $_;    # push the key/value pair, not split's return count
}
while(<$second_fh>){
chomp;
push @second, split / /, $_;
}
my %first = @first;
my %second = @second;
Build a hash from each file, %first for the first and %second for the second, with the first column as key and the second column as value. Then:
for(keys %first)
{
print $second{$_} if exists $second{$_}
}
I couldn't check it as I am on mobile; hope that gives you an idea.
I assume that column A is ordered and that you actually want to compare the first entry in File 1 to the first entry in File 2, and so on.
If that's true, you have a nested loop that you don't need. Simplify your last loop as such:
for my $i (0..$#firstcol) {
if ( $firstcol[$i] eq $firstcol2[$i] )
{
print $secondcol2[$i];
}
}
Also, if you're at all concerned about the files being of different length, then you can adjust the loop:
use List::Util qw(min);
for my $i (0..min($#firstcol, $#firstcol2)) {
Additional Note: You aren't chomping your data in the second file loop while ( <FILE2> ). That might introduce a bug later.
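A minimal sketch of that loop with the chomp added (same variables as in the question):

while ( <FILE2> )
{
    next if $. <2;
    chomp;                      # strip the newline before splitting
    my @cols2 = split(/\t/, $_);
    push @firstcol2, $cols2[0];
    push @secondcol2, $cols2[4]."\n";
}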
If your files are called file1.txt and file2.txt, the next snippet:
use Modern::Perl;
use Path::Class;
my $files;
@{$files->{$_}} = map { [split /\s+/] } grep { !/^\s*$/ } file("file$_.txt")->slurp for (1..2);
for my $line1 (@{$files->{1}}) {
my $line2 = shift @{$files->{2}};
say $line2->[1] if ($line1->[0] eq $line2->[0]);
}
prints:
B
^
(column 1 matches only on the header line, A, and on line 5)
Without the CPAN modules, the following produces the same result:
use strict;
use warnings;
my $files;
@{$files->{$_}} = map { [split /\s+/] } grep { !/^\s*$/ } do { local(@ARGV)="file$_.txt";<> } for (1..2);
for my $line1 (@{$files->{1}}) {
my $line2 = shift @{$files->{2}};
print $line2->[1],"\n" if ($line1->[0] eq $line2->[0]);
}
#!/usr/bin/perl
use strict;
use Data::Dumper;
use warnings;
my @mdsum;
open (IN1,"$ARGV[0]") || die "couldn't open";
open (MYFILE, '>>md5sum-problem.txt');
open (IN2, "mdsumfile.txt");
my %knomexl=();
my %knomemdsum = ();
my @arrfile ;
my $tempkey ;
my $tempval ;
my @values ;
my $val;
my $i;
my @newarra;
my $testxl ;
my $testmdsum;
while(<IN1>){
next if /barcode/;
@arrfile = split('\t', $_);
$knomexl{$arrfile[0]} = $arrfile[2];
}
while(<IN2>){
chomp $_;
@newarra = split(/ {1,}/, $_);
$tempval = $newarra[0];
$tempkey = $newarra[1];
$tempkey=~ s/\t*$//g;
$tempval=~ s/\s*$//g;
$tempkey=~s/.tar.gz//g;
$knomemdsum{$tempkey} = $tempval;
}
@values = keys %knomexl;
foreach $i(@values){
$testxl = $knomexl{$values[$i]};
print $testxl."\n";
$testmdsum = $knomemdsum{$values[$i]};
print $testmdsum."\n";
if ( $testxl ne $testmdsum ) {
if ($testxl ne ""){
print MYFILE "Files having md5sum issue $i\n";
}
}
}
close (MYFILE);
I have two files, both having file names and md5sum values, and I need to check which files' md5sum values do not match. I understand that in some cases a value and its counterpart will not both be there, and I want those cases only. Any workaround for this code, please? The code is pretty simple, but I don't know why it's not working! :( :(
@values = keys %knomexl;
foreach $i(@values){
#print Dumper $knomexl{$values[$i]};
$testxl = $knomexl{$i};
print $testxl."\n";
$testmdsum = $knomemdsum{$i};
print $testmdsum."\n";
$i is an element of @values because of the foreach, not an index, so you shouldn't use $values[$i].
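A hedged sketch of the corrected comparison loop (same hash names as above), also guarding against names missing from the md5sum file:

foreach my $i (keys %knomexl){
    $testxl = $knomexl{$i};
    $testmdsum = $knomemdsum{$i};
    if (!defined $testmdsum){
        print MYFILE "File missing from md5sum list $i\n";
    }
    elsif ($testxl ne $testmdsum){
        print MYFILE "Files having md5sum issue $i\n";
    }
}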