Appending values to Hash if key is same in Perl - perl

Problem is to read a file with value at every new line. Content of file looks like
3ssdwyeim3,3ssdwyeic9,2017-03-16,09:10:35.372,0.476,EndInbound
3ssdwyeim3,3ssdwyfyyn,2017-03-16,09:10:35.369,0.421,EndOutbound
3ssdwyfxc0,3ssdwyfxfi,2017-03-16,09:10:35.456,0.509,EndInbound
3ssdwyfxc0,3ssdwyhg0v,2017-03-16,09:10:35.453,0.436,EndOutbound
With the string before first comma being the Key and string in between last and second last comma the Value
i.e. for the first line 3ssdwyeim3 becomes the key and 0.476 Value.
Now as we are looping over each line if the key exists we have to concatenate the values separated by comma.
Hence for the next new line as key already exists key remains 3ssdwyeim3 but the value is updated to 0.476,0.421.
Finally we have to print the keys and values in a file.
I have written a code to achieve the same, which is as follows.
sub findbreakdown {
my ( $out ) = #_;
my %timeLogger;
open READ, "out.txt" or die "Cannot open out.txt for read :$!";
open OUTBD, ">$out\_breakdown.csv" or die "Cannot open $out\_breakdown.csv for write :$!";
while ( <READ> ) {
if ( /(.*),.*,.*,.*,(.*),.*/ ) {
$btxnId = $1;
$time = $2;
if ( !$timeLogger{$btxnId} ) {
$timeLogger{$btxnId} = $time;
}
else {
$previousValue = $timeLogger{$btxnId};
$newValue = join ",", $previousValue, $time;
$timeLogger{$btxnId} = $newValue;
}
}
foreach ( sort keys %timeLogger ) {
print OUTBD "$_ ,$timeLogger{$_}\n";
}
}
close OUTBD;
close READ;
}
However Something is going wrong and its printing like this
3ssdwyeim3,0.476
3ssdwyeim3,0.476,0.421
3ssdwyeim3,0.476,0.421
3ssdwyfxc0,0.509
3ssdwyeim3,0.476,0.421
3ssdwyfxc0,0.509,0.436
3ssdwyeim3,0.476,0.421
3ssdwyfxc0,0.509,0.436
Whereas expected is:
3ssdwyeim3,0.476,0.421
3ssdwyfxc0,0.509,0.436

Your program is behaving correctly, but you are printing the current state of the entire hash after you process each line.
Therefore you are printing hash keys before they have the complete set of values, and you have many duplicated lines.
If you move the foreach loop that prints to the end of your program (or simply use the debugger to inspect the variables) you will find that the final state of the hash is exactly what you expect.
Edit: I previously thought the problem was the below, but it's because I misread the sample data in your question.
This regular expression is not ideal:
if (/(.*),.*,.*,.*,(.*),.*/) {
The .* is greedy and will match as much as possible (including some content with commas). So if any line contains more than six comma-separated items, more than one item will be included in the first matching group. This may not be a problem in your actual data, but it's not an ideal way to write the code. The expression is more ambiguous than necessary.
It would be better written like this:
if (/^([^,]*),[^,]*,[^,]*,[^,]*,([^,]*),[^,]*$/) {
Which would only match lines with exactly six items.
Or consider using split on the input line, which would be a cleaner solution.

This is much simpler than you have made it. You can just split each line into fields and use push to add the value to the list corresponding to the key
I trust you can modify this to read from an external file instead of the DATA file handle?
use strict;
use warnings 'all';
my %data;
while ( <DATA> ) {
my #fields = split /,/;
push #{ $data{$fields[0]} }, $fields[-2];
}
for my $key ( sort keys %data ) {
print join(',', $key, #{ $data{$key} }), "\n";
}
__DATA__
3ssdwyeim3,3ssdwyeic9,2017-03-16,09:10:35.372,0.476,EndInbound
3ssdwyeim3,3ssdwyfyyn,2017-03-16,09:10:35.369,0.421,EndOutbound
3ssdwyfxc0,3ssdwyfxfi,2017-03-16,09:10:35.456,0.509,EndInbound
3ssdwyfxc0,3ssdwyhg0v,2017-03-16,09:10:35.453,0.436,EndOutbound
output
3ssdwyeim3,0.476,0.421
3ssdwyfxc0,0.509,0.436

Related

Regular expression to print a string from a command outpout

I have written a function that uses regex and prints the required string from a command output.
The script works as expected. But it's does not support a dynamic output. currently, I use regex for "icmp" and "ok" and print the values. Now, type , destination and return code could change. There is a high chance that command doesn't return an output at all. How do I handle such scenarios ?
sub check_summary{
my ($self) = #_;
my $type = 0;
my $return_type = 0;
my $ipsla = $self->{'ssh_obj'}->exec('show ip sla');
foreach my $line( $ipsla) {
if ( $line =~ m/(icmp)/ ) {
$type = $1;
}
if ( $line =~ m/(OK)/ ) {
$return_type = $1;
}
}
INFO ($type,$return_type);
}
command Ouptut :
PSLAs Latest Operation Summary
Codes: * active, ^ inactive, ~ pending
ID Type Destination Stats Return Last
(ms) Code Run
-----------------------------------------------------------------------
*1 icmp 192.168.25.14 RTT=1 OK 1 second ago
Updated to some clarifications -- we need only the last line
As if often the case, you don't need a regex to parse the output as shown. You have space-separated fields and can just split the line and pick the elements you need.
We are told that the line of interest is the last line of the command output. Then we don't need the loop but can take the last element of the array with lines. It is still unclear how $ipsla contains the output -- as a multi-line string or perhaps as an arrayref. Since it is output of a command I'll treat it as a multi-line string, akin to what qx returns. Then, instead of the foreach loop
my #lines = split '\n', $ipsla; # if $ipsla is a multi-line string
# my #lines = #$ipsla; # if $ipsla is an arrayref
pop #lines while $line[-1] !~ /\S/; # remove possible empty lines at end
my ($type, $return_type) = (split ' ', $lines[-1])[1,4];
Here are some comments on the code. Let me know if more is needed.
We can see in the shown output that the fields up to what we need have no spaces. So we can split the last line on white space, by split ' ', $lines[-1], and take the 2nd and 5th element (indices 1 and 4), by ( ... )[1,4]. These are our two needed values and we assign them.
Just in case the output ends with empty lines we first remove them, by doing pop #lines as long as the last line has no non-space characters, while $lines[-1] !~ /\S/. That is the same as
while ( $lines[-1] !~ /\S/ ) { pop #lines }
Original version, edited for clarifications. It is also a valid way to do what is needed.
I assume that data starts after the line with only dashes. Set a flag once that line is reached, process the line(s) if the flag is set. Given the rest of your code, the loop
my $data_start;
foreach (#lines)
{
if (not $data_start) {
$data_start = 1 if /^\s* -+ \s*$/x; # only dashes and optional spaces
}
else {
my ($type, $return_type) = (split)[1,4];
print "type: $type, return code: $return_type\n";
}
}
This is a sketch until clarifications come. It also assumes that there are more lines than one.
I'm not sure of all possibilities of output from that command so my regular expression may need tweaking.
I assume the goal is to get the values of all columns in variables. I opted to store values in a hash using the column names as the hash keys. I printed the results for debugging / demonstration purposes.
use strict;
use warnings;
sub check_summary {
my ($self) = #_;
my %results = map { ($_,undef) } qw(Code ID Type Destination Stats Return_Code Last_Run); # Put results in hash, use column names for keys, set values to undef.
my $ipsla = $self->{ssh_obj}->exec('show ip sla');
foreach my $line (#$ipsla) {
chomp $line; # Remove newlines from last field
if($line =~ /^([*^~])([0-9]+)\s+([a-z]+)\s+([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)\s+([[:alnum:]=]+)\s+([A-Z]+)\s+([^\s].*)$/) {
$results{Code} = $1; # Code prefixing ID
$results{ID} = $2;
$results{Type} = $3;
$results{Destination} = $4;
$results{Stats} = $5;
$results{Return_Code} = $6;
$results{Last_Run} = $7;
}
}
# Testing
use Data::Dumper;
print Dumper(\%results);
}
# Demonstrate
check_summary();
# Commented for testing
#INFO ($type,$return_type);
Worked on the submitted test line.
EDIT:
Regular expressions allow you to specify patterns instead of the exact text you are attempting to match. This is powerful but complicated at times. You need to read the Perl Regular Expression documentation to really learn them.
Perl regular expressions also allow you to capture the matched text. This can be done multiple times in a single pattern which is how we were able to capture all the columns with one expression. The matches go into numbered variables...
$1
$2

loop through sql result with duplicate values and show only line with some string with perl

I am selecting from sql three values, hostname, place, message, assigning them to hashref and then in loop printing into file. Before I print it I have to clean the message as it contains in DB plenty of mess like html tags etc
while (my $result = $sth_get_message->fetchrow_hashref) {
my $message = $result->{MESSAGE};
$message =~ s/<[^>]*>//g;
$message =~ s/\n/ /g;
$message =~ s/ |"/ /g;
$message =~ s/\r//g;
my $message_out;
if (index($message, 'removed') != -1) {$message_out = $message;} else {$message_out= "";}
my $pc = $result->{PC};
my $place = $result->{PLACE};
print OUT "$pc|$message_out|$place\n";
}
the result is like this
pc1||place
pc2|message with word removed|place
pc3||place
pc3|message with word removed|place
pc4||place
pc4||place
pc4|message with word removed|place
some of them have more messages, but only one is relevant, so rest is cleaned but I have duplicates of pc's. What I want to get is to have unique pc list and if it has no message ok, show blank, if it has message with removed, show only this line
pc1||place
pc2|message with word removed|place
pc3|message with word removed|place
pc4|message with word removed|place
as far as I understood perl hashes this should be possible if I put pc as a key and then work with values but I have no clue how
thanks for help
When you have a list of values that you wish to find the unique values for you can populate a hash along these lines:
my $hashref = {}; ## initialise an empty hashref .. nb I prefer to work with refs but you could use has hash type variable
foreach my $row ( $rows )
{
$hashref->{$row} = 1;
}
now you have a hasref that is indexed by unique keys so you can do the following
print join(",", keys %$hashref );
or
foreach my $unique_key ( sort keys %$hashref )
{
print "$unique key in sorted order\n";
}
In your case you could use the string with the fields concatenated or you could just use the values that you wish to make unique and then assign something containing the entire row detail you are interested in place of the 1 value used above as a placeholder. You are then able to retrieve the full row detail with something like $hashref->{$unique_key}

Nested Loops in Perl - Failing logic

I'm trying to work out a bigger problem, but have simplified the issue for readability, ultimately the logic below is the reason the extended program is failing.
I am using Perl to search for a short sequence of letters within a larger sequence (protein sequences), and if it is not found, then I'd like to do some calculations. I don't know whether I'm going crazy, but I can't work out why this logic is failing.
sub calculateEpitopeMutations {
my #mutationArray;
my #epitopeArray;
my $count;
my $localEpitope;
open( EPITOPESIN2, $ARGV[5] ) or die "Unable to open file $ARGV[5]\n";
while ( my $line = <EPITOPESIN2> ) {
chomp $line;
push #epitopeArray, $line;
}
while ( my ( $key, $value ) = each our %sequencesForCalculation ) {
foreach ( #epitopeArray ) {
$localEpitope = $_;
if ( $value =~ /($localEpitope)/g ) {
print "$key\n$localEpitope\nexactly the same\n\n";
next;
}
else {
#This is where I'd like to do the further calculations
print "$key\n$localEpitope\nthere is a difference\n\n";
next;
}
}
}
}
$ARGV[5] is the name of a text file containing a list of 9-character sequences, exactly like the following
RVSENIQRF
SFQVDCFLW
The idea is to put these into array #epitopeArray and iterate through these, and compare them with all (currently just one) $value sequences in the hash %sequencesForCalculation.
%sequencesForCalculation is a hash, where $value is a long sequence of characters, like this
MDSNTMSSFQVDCFLWHIRKRFADNGLGDAPFLDRLRRDQKSLKGRGNTLGLDIETATLVGKQIVEWILKEESSETLRMTIASVPTSRYLSDMTLEEMSRDWFMLMPRQKKIGPLCVRLDQAVMEKNIVLKANFSVIFNRLETLILLRAFTEEEAIVGEISPLPSLPGHTYEDVKNAVGVLIGGLEWNGNTVRVSENIQRFAWRNCDENGRPSLPPEQK
Currently, the small 9-character long sequence $localEpitope is contained in the longer sequence $value so when I iterate through the program, I should get this printed every time.
($key contains a header of information about the protein sequences, but is irrelevant so I have shortened it to just the variable name.)
$key
RVSENIQRF
Exactly the same
$key
SFQVDCFLW
Exactly the same
$key
But instead I'm getting this
$key
RVSENIQRF
exactly the same
$key
SFQVDCFLW
there is a difference
$key
Any ideas? Please let me know if anything further is required.
Update
TL;DR: You should change $value =~ /($localEpitope)/g to $value =~ /$localEpitope/
Okay now that we know the real circumstances, the problem (as melpomene points out in his comment) is that you have the /g modifier on your pattern match. There's no reason for that; you don't want check how many times the substring appears, you just want to know whether it's there at all
The problem is that variables subjected to a /g pattern search keep a state that says where the last search ended. So you're searching for $epitopeArray[0] in the longer string and finding it, and then searching for $epitopeArray[1] from where the previous search terminated. The first substring appears after the second one, so only the first is found
For more information on this behaviour, take a look at the pos function which returns the current value of this state. For instance pos($value) will return the character offset where the next m//g will start its search
This short program demonstrates the problem. With the /g modifier only BBB is found. Remove it and both are found
use strict;
use warnings;
use 5.010;
my $long_s = 'xxxAAAxxxBBBxxx';
for my $substr ( qw/ BBB AAA / ) {
if ( $long_s =~ /$substr/g ) {
say "$substr okay";
}
else {
say "$substr nope";
}
}
output
BBB okay
AAA nope
Original
You say
Currently, the small 9-character long sequence ($localEpitope) IS contained in the longer sequence ($value), and so when I iterate through the program, I should get the following printed everytime
So $localEpitope is a substring of $value and you're saying that
$value =~ /($localEpitope)/g
evaluates to true
That is correct behaviour. $value =~ /$localEpitope/ will check whether $localEpitope can be found anywhere in $value
Unfortunately it's not clear enough from what you've written to suggest a solution

i want to merge multiple csv files by specific condition using perl

i have multiple csv files, i want to merge all those files.....
i am showing some of my sample csv files below...
M1DL1_Interpro_sum.csv
IPR017690,Outer membrane, omp85 target,821
IPR014729,Rossmann,327
IPR013785,Aldolase,304
IPR015421,Pyridoxal,224
IPR003594,ATPase,179
IPR000531,TonB receptor,150
IPR018248,EF-hand,10
M1DL2_Interpro_sum.csv
IPR017690,Outer membrane, omp85 target,728
IPR013785,Aldolase,300
IPR014729,Rossmann,261
IPR015421,Pyridoxal,189
IPR011991,Winged,113
IPR000873,AMP-dependent synthetase/ligase,111
M1DL3_Interpro_sum.csv
IPR017690,Outer membrane,905
IPR013785,Aldolase,367
IPR014729,Rossmann,338
IPR015421,Pyridoxal,271
IPR003594,ATPase,158
IPR018248,EF-hand,3
now to merge these files i have tried the following code
#ARGV = <merge_csvfiles/*.csv>;
print #ARGV[0],"\n";
open(PAGE,">outfile.csv") || die"Can't open outfile.csv\n";
while($i<scalar(#ARGV))
{
open(FILE,#ARGV[$i]) || die"Can't open ...#ARGV[$i]...\n";
$data.=join("",<FILE>);
close FILE;
print"file completed...",$i+1,"\n";
$i++;
}
#data=split("\n",$data);
#data2=#data;
print scalar(#data);
for($i=0;$i<scalar(#data);$i++)
{
#id1=split(",",#data[$i]);
$id_1=#id1[0];
#data[$j]=~s/\n//;
if(#data[$i] ne "")
{
print PAGE "\n#data[$i],";
for($j=$i+1;$j<scalar(#data2);$j++)
{
#id2=split(",",#data2[$j]);
$id_2=#id2[0];
if($id_1 eq $id_2)
{
#data[$j]=~s/\n//;
print PAGE "#data2[$j],";
#data2[$j]="";
#data[$j]="";
print "match found at ",$i+1," and ",$j+1,"\n";
}
}
}
print $i+1,"\n";
}
merge_csvfiles is a folder which contains all the files
output of above code is
IPR017690,Outer membrane,821,IPR017690,Outer membrane ,728,IPR017690,Outer membrane,905
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR003594,ATPase,179,IPR003594,ATPase,158
IPR000531,TonB receptor,150
IPR018248,EF-hand,10,IPR018248,EF-hand,3
IPR011991,Winged,113
IPR000873,AMP-dependent synthetase/ligase
but i want the output in following format....
IPR017690,Outer membrane,821,IPR017690,Outer membrane ,728,IPR017690,Outer membrane,905
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR003594,ATPase,179,0,0,0,IPR003594,ATPase,158
IPR000531,TonB receptor,150,0,0,0,0,0,0
IPR018248,EF-hand,10,0,0,0,IPR018248,EF-hand,3
0,0,0,IPR011991,Winged,113,0,0,0
0,0,0,IPR000873,AMP-dependent synthetase/ligase,111,0,0,0
Has anybody got any idea how can i do this?
Thank you for the help
As mentioned in Miguel Prz's comment, you haven't explained how you want the merge to be performed, but, judging by the "desired output" sample, it appears that what you want is to concatenate lines with matching IDs from all three input files into a single line in the output file, with "0,0,0" taking the place of any lines which don't appear in a given file.
So, then:
#!/usr/bin/env perl
use strict;
use warnings;
my #input_files = glob 'merge_csvfiles/*.csv';
my %data;
for my $i (0 .. $#input_files) {
open my $infh, '<', $input_files[$i]
or die "Failed to open $input_files[$i]: $!";
while (<$infh>) {
chomp;
my $id = (split ',', $_, 2)[0];
$data{$id}[$i] = $_;
}
print "Input file read: $input_files[$i]\n";
}
open my $outfh, '>', 'outfile.csv' or die "Failed to open outfile.csv: $!";
for my $id (sort keys %data) {
my #merge_data;
for my $i (0 .. $#input_files) {
push #merge_data, $data{$id}[$i] || '0,0,0';
}
print $outfh join(',', #merge_data) . "\n";
}
The first loop collects all the lines from each file into a hash of arrays. The hash keys are the IDs, so the lines for that ID from all files are kept together, and the value for each key is (a reference to) an array of the line associated with that ID in each file; using an array for this allows us to keep track of values which are missing as well as those which are present.
The second loop then takes the keys of that hash (in alphabetical order) and, for each one, creates a temporary array of the values associated with that ID, substituting "0,0,0" for missing values, joins them into a single string, and prints that to the output file.
The results, in outfile.csv, are:
IPR000531,TonB receptor,150,0,0,0,0,0,0
0,0,0,IPR000873,AMP-dependent synthetase/ligase,111,0,0,0
IPR003594,ATPase,179,0,0,0,IPR003594,ATPase,158
0,0,0,IPR011991,Winged,113,0,0,0
IPR013785,Aldolase,304,IPR013785,Aldolase,300,IPR013785,Aldolase,367
IPR014729,Rossmann,327,IPR014729,Rossmann,261,IPR014729,Rossmann,338
IPR015421,Pyridoxal,224,IPR015421,Pyridoxal,189,IPR015421,Pyridoxal,271
IPR017690,Outer membrane, omp85 target,821,IPR017690,Outer membrane, omp85 target,728,IPR017690,Outer membrane,905
IPR018248,EF-hand,10,0,0,0,IPR018248,EF-hand,3
Edit: Added explanations requested by OP in comments
can u expalain me the working of my $id = (split ',', $_, 2)[0]; and $# in this program
my $id = (split ',', $_, 2)[0]; gets the text prior to the first comma in the last line of text that was read:
Because I didn't specify what variable to put the data in, while (<$infh>) reads it into the default variable $_.
split ',', $_, 2 splits up the value of $_ into a list of comma-separated fields. The 2 at the end tells it to only produce at most 2 fields; the code will work fine without the 2, but, since I only need the first field, splitting into more parts isn't necessary.
Putting (...)[0] around the split command turns the returned list of fields into an (anonymous) array and returns the first element of that array. It's the same as if I'd written my #fields = split ',', $_, 2; my $id = $fields[0];, but shorter and without the extra variable.
$#array returns the highest-numbered index in the array #array, so for my $i (0 .. $#array) just means "loop over the indexes for all elements in #array". (Note that, if I hadn't needed the value of the index counter, I would have instead looped over the array's data directly, by using for my $filename (#input_files), but it would have been less convenient to keep track of the missing values if I'd done it that way.)

Opening a text file as a hash and searching within that hash

I have an assignment to write a Perl file to open a text file of IP addresses and their hostnames, separated by a new line, and load it into a hash. I'm then supposed to ask for user input as to what the user would like to search for within the file. If a result is found, the program should print the value and key, and ask for input again until the user doesn't input anything. I'm not even close to the end, but need a bit of guidance. I've cobbed together some code from here and through using some Google-Fu.
Here's my work in progress:
#!/usr/bin/perl
print "Welcome to the text searcher! Please enter a filename: ";
$filename = <>;
my %texthash = ();
open DNSTEXT, "$filename"
or die! "Insert a valid name! ";
while (<DNSTEXT>) {
chomp;
my ($key, $value) = split("\n");
$texthash{$key} .= exists $texthash{$key}
? ",$value"
: $value;
}
print $texthash{$weather.com}
#print "What would you like to search for within this file? "
#$query = <>
#if(exists $text{$query}) {
As is probably glaringly obvious, I'm quite lost. I'm not sure if I'm inserting the file into the hash correctly, or how to even print a value to debug.
The problem here is we don't know what the input file looks like. Assuming that the input file somehow looks like:
key1,value1
key2,value2
key3,value3
(or other similar manner, in this case the key and value pair are separated by a comma), you could do this:
my %text_hash;
# the my $line in the while() means that for every line it reads,
# store it in $line
while( my $line = <DNSTEXT>) {
chomp $line;
# depending on what separates the key and value, you could replace the q{,}
# with q{<whatever is between the key and value>}
my ( $key, $value ) = split q{,},$line;
$text_hash{$key} = $value;
}
But yeah, please tell us what the content of the file looks like.