Counting using Hashes in Perl - perl

I am trying to determine whether a certain ID is present in my hash and, if it is, store its count in the hash. This is what I have:
#!/usr/bin/perl
open (INFILE, "parsedveronii.txt")
or die "cannot find infile";
while ($file=<INFILE>){
@array=split "\t", $file;
$hash{$array[1]}= ""; #the keys in my hash are subject IDS
}
open (INFILE1, "uniqueveroniiproteins.txt")
or die "cannot find infile";
while ($file1=<INFILE>){
@array = split "\n", $file1; # $array[0] also contains subject IDs
if (exists ($hash{$array[0]})){ #if in my hash exists $array[0], keep count of it
$count++;
}
$hash{$array[1]{$count}}=$hash{$array[1]{$count}} +1;#put the count in hash
}
use Data::Dumper;
print Dumper (\%hash);
For some reason it's not executing the count. Any ideas? Any help is appreciated.

Always include use strict; and use warnings; at the top of each and EVERY script.
Your machinations in the second file loop seem a little contrived. If you're just trying to count the subject IDs, that can be done much more simply.
The following is a cleaned up version of your code, doing what I interpret as your intention.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
open my $fh, '<', "parsedveronii.txt";
my %count;
while (my $line = <$fh>){
chomp $line;
my @array = split "\t", $line;
$count{$array[1]} = 0; #the keys in my hash are subject IDS
}
open $fh, '<', "uniqueveroniiproteins.txt";
while (my $line = <$fh>){
chomp $line;
my @array = split "\t", $line; # $array[0] also contains subject IDs
if (exists $count{$array[0]}) { #if in my hash exists $array[0], keep count of it
$count{$array[0]}++;
}
}
use Data::Dumper;
print Dumper (\%count);
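The same two-pass counting idea can be tried without the input files. The following is a minimal sketch on in-memory data; the sub name count_known_ids and the sample IDs are made up for illustration and are not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each known ID appears in a second list.
# $known and $seen are array references of subject IDs (illustrative data only).
sub count_known_ids {
    my ($known, $seen) = @_;

    # First pass: register every known ID with a count of zero.
    my %count = map { $_ => 0 } @$known;

    # Second pass: increment only the IDs that were registered.
    for my $id (@$seen) {
        $count{$id}++ if exists $count{$id};
    }
    return \%count;
}

my $count = count_known_ids(
    [qw(P1 P2 P3)],       # IDs from the first file
    [qw(P1 P3 P3 P9)],    # IDs from the second file
);
print "$_ => $count->{$_}\n" for sort keys %$count;
```

IDs that appear only in the second list (P9 here) are ignored, and IDs that never appear there keep a count of zero.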

Related

how to assign data into hash from an input file

I am new to perl.
Inside my input file is :
james1
84012345
aaron5
2332111 42332
2345112 18238
wayne[2]
3505554
Question: I am not sure what is the correct way to get the input and set the name as key and number as values. example "james" is key and "84012345" is the value.
This is my code:
#!/usr/bin/perl -w
use strict;
use warnings;
use Data::Dumper;
my $input= $ARGV[0];
my %hash;
open my $data , '<', $input or die " cannot open file : $_\n";
my @names = split ' ', $data;
my @values = split ' ', $data;
@hash{@names} = @values;
print Dumper \%hash;
I'mma go over your code real quick:
#!/usr/bin/perl -w
-w is not recommended. You should use warnings; instead (which you're already doing, so just remove -w).
use strict;
use warnings;
Very good.
use Data::Dumper;
my $input= $ARGV[0];
OK.
my %hash;
Don't declare variables before you need them. Declare them in the smallest scope possible, usually right before their first use.
open my $data , '<', $input or die " cannot open file : $_\n";
You have a spurious space at the beginning of your error message and $_ is unset at this point. You should include $input (the name of the file that failed to open) and $! (the error reason) instead.
my @names = split ' ', $data;
my @values = split ' ', $data;
Well, this doesn't make sense. $data is a filehandle, not a string. Even if it were a string, this code would assign the same list to both @names and @values.
@hash{@names} = @values;
print Dumper \%hash;
My version (untested):
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
@ARGV == 1
or die "Usage: $0 FILE\n";
my $file = $ARGV[0];
my %hash;
{
open my $fh, '<', $file or die "$0: can't open $file: $!\n";
local $/ = '';
while (my $paragraph = readline $fh) {
my @words = split ' ', $paragraph;
my $key = shift @words;
$hash{$key} = \@words;
}
}
print Dumper \%hash;
The idea is to set $/ (the input record separator) to "" for the duration of the input loop, which makes readline return whole paragraphs, not lines.
The first (whitespace separated) word of each paragraph is taken to be the key; the remaining words are the values.
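To see paragraph mode in isolation, here is a small sketch that reads from an in-memory string instead of a file. The sub name read_paragraphs is hypothetical, and the blank lines between records in the sample text are an assumption about the input layout.

```perl
use strict;
use warnings;

# Hypothetical helper: read blank-line-separated records from a filehandle
# and return a reference to a hash of key => [values].
sub read_paragraphs {
    my ($fh) = @_;
    my %hash;

    local $/ = '';    # paragraph mode: one or more blank lines end a record
    while (my $paragraph = readline $fh) {
        my ($key, @values) = split ' ', $paragraph;
        next unless defined $key;    # skip records with no words at all
        $hash{$key} = \@values;
    }
    return \%hash;
}

# Illustrative input; the blank lines between records are an assumption.
my $text = "james1\n84012345\n\naaron5\n2332111 42332\n2345112 18238\n\nwayne[2]\n3505554\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $records = read_paragraphs($fh);
close $fh;

print "$_: @{ $records->{$_} }\n" for sort keys %$records;
```

Because $/ is set with local inside the sub, the normal line-by-line behaviour is restored as soon as the sub returns.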
You have opened a file with open() and attached the file handle to $data. The regular way of reading data from a file is to loop over each line, like so:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $input = $ARGV[0];
my %hash;
open my $data, '<', $input or die "Cannot open $input: $!\n";
while (my $line = <$data>) {
chomp $line; # Removes the trailing newline (\n)
if (length $line) { # Skips empty lines
my ($key, $value) = split ' ', $line;
$hash{$key} = $value;
}
}
print Dumper \%hash;
OK, +1 for using strict and warnings.
First, take a look at the $/ variable for controlling how a file is broken into records when it's read in.
$data is a file handle; you need to extract the data from the file. If it's not too big you can load it all into an array; if it's a large file you can loop over one record at a time. See the <> operator in perlop.
Looking at your code, it appears that you want to end up with the following data structure from your input file:
%hash = (
'james1' => [
84012345,
],
'aaron5' => [
2332111,
42332,
2345112,
18238,
],
'wayne[2]' => [
3505554,
],
);
See perldsc on how to do that.
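As a rough illustration of the hash-of-arrays pattern from perldsc, the sketch below builds that structure line by line. The sub name build_hash_of_arrays is hypothetical, and the rule used to tell names from numbers (a line starting with a letter is a new key) is an assumption that the question does not state.

```perl
use strict;
use warnings;

# Hypothetical helper: build a hash of arrays from lines of text.
# Assumption (not stated in the question): a line starting with a letter
# is a new key, and every other non-empty line holds values for that key.
sub build_hash_of_arrays {
    my (@lines) = @_;
    my (%hash, $current);

    for my $line (@lines) {
        chomp $line;
        next unless $line =~ /\S/;       # skip blank lines
        if ($line =~ /^[A-Za-z]/) {
            $current = $line;
            $hash{$current} = [];        # start an empty list for this key
        }
        elsif (defined $current) {
            # Append every whitespace-separated value to the current key.
            push @{ $hash{$current} }, split ' ', $line;
        }
    }
    return \%hash;
}

my $data = build_hash_of_arrays(
    "james1\n", "84012345\n",
    "aaron5\n", "2332111 42332\n", "2345112 18238\n",
    "wayne[2]\n", "3505554\n",
);
print "$_ => [@{ $data->{$_} }]\n" for sort keys %$data;
```

The key step is push @{ $hash{$key} }, ..., which treats the hash value as an array reference and appends to it.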
All the documentation can be read using the perldoc command which comes with Perl. Running perldoc on its own will give you some tips on how to use it and running perldoc perldoc will give you possibly far more info than you need at the moment.

Get the header lines of protein sequences that start with specific amino acid in FASTA

Hi guys, I have been trying to use Perl to print only the headers (the entire >gi line) of protein sequences that start with "MAD" or "MAN" (the first 3 aa) from a FASTA file, but I couldn't figure out which part went wrong.
Thanks in advance!
#!usr/bin/perl
use strict;
my $in_file = $ARGV[0];
open( my $FH_IN, "<", $in_file ); ###open to fileholder
my @lines = <$FH_IN>;
chomp @lines;
my $index = 0;
foreach my $line (@lines) {
$index++;
if ( substr( $line, 0, 3 ) eq "MAD" or substr( $line, 0, 3 ) eq "MAN" ) {
print "@lines [$index-1]\n\n";
} else {
next;
}
}
This is a short part of the FASTA file, the header of the first seq is what I am looking for
>gi|16128078|ref|NP_414627.1| UDP-N-acetylmuramoyl-L-alanyl-D-glutamate:meso-diaminopimelate ligase [Escherichia coli str. K-12 substr. MG1655]
MADRNLRDLLAPWVPDAPSRALREMTLDSRVAAAGDLFVAVVGHQADGRRYIPQAIAQGVAAIIAEAKDE
ATDGEIREMHGVPVIYLSQLNERLSALAGRFYHEPSDNLRLVGVTGTNGKTTTTQLLAQWSQLLGEISAV
MGTVGNGLLGKVIPTENTTGSAVDVQHELAGLVDQGATFCAMEVSSHGLVQHRVAALKFAASVFTNLSRD
HLDYHGDMEHYEAAKWLLYSEHHCGQAIINADDEVGRRWLAKLPDAVAVSMEDHINPNCHGRWLKATEVN
Your print statement is buggy. Should probably be:
print "$lines[$index-1]\n\n";
However, it's typically better to just process a file line by line unless there is a specific reason you need to slurp the entire thing:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $file = shift;
# open my $fh, "<", $file; # use this to read the file named on the command line
my $fh = \*DATA; # here: read the sample data embedded below
while (<$fh>) {
print if /^>/ && <$fh> =~ /^MA[DN]/;
}
__DATA__
>gi|16128078|ref|NP_414627.1| UDP-N-acetylmuramoyl-L-alanyl-D-glutamate:meso-diaminopimelate ligase [Escherichia coli str. K-12 substr. MG1655]
MADRNLRDLLAPWVPDAPSRALREMTLDSRVAAAGDLFVAVVGHQADGRRYIPQAIAQGVAAIIAEAKDE
ATDGEIREMHGVPVIYLSQLNERLSALAGRFYHEPSDNLRLVGVTGTNGKTTTTQLLAQWSQLLGEISAV
MGTVGNGLLGKVIPTENTTGSAVDVQHELAGLVDQGATFCAMEVSSHGLVQHRVAALKFAASVFTNLSRD
HLDYHGDMEHYEAAKWLLYSEHHCGQAIINADDEVGRRWLAKLPDAVAVSMEDHINPNCHGRWLKATEVN
Outputs:
>gi|16128078|ref|NP_414627.1| UDP-N-acetylmuramoyl-L-alanyl-D-glutamate:meso-diaminopimelate ligase [Escherichia coli str. K-12 substr. MG1655]
Since you want to know how to improve your code, here is a commented version of your program with some suggestions on how you could change it.
#!/usr/bin/perl
use strict;
You should also add the use warnings pragma, which enables warnings (as you might expect).
my $in_file = $ARGV[0];
It's a good idea to check that $ARGV[0] is defined, and to give an appropriate error message if it isn't, e.g.
my $in_file = $ARGV[0] or die "Please supply the name of the FASTA file to process";
If $ARGV[0] is not defined, Perl executes the die statement.
open( my $FH_IN, "<", $in_file ); # open to fileholder
You should check that the script is able to open the input file; you can use a similar structure to the previous statement, by adding a die statement:
open( my $FH_IN, "<", $in_file ) or die "Could not open $in_file: $!";
The special variable $! holds the error message as to why the file could not be opened (e.g. it doesn't exist, no permission to read it, etc.).
my @lines = <$FH_IN>;
chomp @lines;
my $index = 0;
foreach my $line (@lines) {
$index++;
if ( substr( $line, 0, 3 ) eq "MAD" or substr( $line, 0, 3 ) eq "MAN" ) {
print "@lines [$index-1]\n\n";
This is the problem point in the script. Firstly, the correct way to access an item in the array is using $lines[$index-1]. Secondly, the first item in an array is at index 0, so line 1 of the file will be at position 0 in @lines, line 4 at position 3, etc. Because you've already incremented the index, you're printing the line after the header line. The problem can easily be fixed by incrementing $index at the end of the loop.
}
else {
next;
}
Using next isn't really necessary here because there is no code following the else statement, so there's nothing to gain from telling Perl to skip the rest of the loop.
The fixed code would look like this:
#!/usr/bin/perl
use warnings;
use strict;
my $in_file = $ARGV[0] or die "Please supply the name of the FASTA file to be processed";
open( my $FH_IN, "<", $in_file ) or die "Could not open $in_file: $!";
my @lines = <$FH_IN>;
chomp @lines;
my $index = 0;
foreach my $line (@lines) {
if ( substr( $line, 0, 3 ) eq "MAD" or substr( $line, 0, 3 ) eq "MAN" ) {
print "$lines[$index-1]\n\n";
}
$index++;
}
I hope that is helpful and clear!
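One limitation of matching every line against "MAD" or "MAN" is that a later line of a multi-line sequence could also start with those letters. The sketch below is one possible way around that, not the approach from either answer: it remembers the most recent header and tests only the first sequence line after it. The sub name headers_with_prefix and the shortened sample records are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the headers of FASTA records whose sequence
# begins with one of the given prefixes. Only the first sequence line after
# each header is tested, so later lines that happen to start with "MAD" or
# "MAN" are not reported by mistake.
sub headers_with_prefix {
    my ($fh, @prefixes) = @_;
    my $pattern = join '|', map { quotemeta($_) } @prefixes;
    my (@headers, $header);

    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>/) {
            $header = $line;             # remember the header, wait for sequence
        }
        elsif (defined $header) {
            push @headers, $header if $line =~ /^(?:$pattern)/;
            undef $header;               # ignore the rest of this record
        }
    }
    return @headers;
}

# Illustrative records (shortened, made-up sequences).
my $fasta = ">seq1 example protein\nMADRNLRDLL\nMANXXXXXXX\n>seq2 another protein\nMKTAYIAKQR\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
print "$_\n" for headers_with_prefix($fh, 'MAD', 'MAN');
close $fh;
```

With the sample data only the seq1 header is printed, and it is printed once even though its second sequence line starts with "MAN".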

Extract data from file

I have data like
"scott
E -45 COLLEGE LANE
BENGALI MARKET
xyz -785698."
"Tomm
D.No: 4318/3,Ansari Road, Dariya Gunj,
xbc - 289235."
I wrote one Perl program to extract names i.e;
open(my$Fh, '<', 'printable address.txt') or die "!S";
open(my$F, '>', 'names.csv') or die "!S";
while (my@line =<$Fh> ) {
for(my$i =0;$i<=13655;$i++){
if ($line[$i]=~/^"/) {
print $F $line[$i];
}
}
}
It works fine and extracts the names exactly. Now my aim is to extract the addresses, like
BENGALI MARKET
xyz -785698."
D.No: 4318/3,Ansari Road, Dariya Gunj,
xbc - 289235."
into a CSV file. Please tell me how to do this.
There are a lot of flaws in your original code. We should address those before suggesting any enhancements:
Always have use strict; and use warnings; at the top of every script.
Your or die "!S" statements are broken. The error code is actually in $!. However, you can skip the need to do that by just having use autodie;
Give your filehandles more meaningful names. $Fh and $F say nothing about what those are for. At minimum label them as $infh and $outfh.
The while (my @line = <$Fh>) { is flawed, as it can just be reduced to my @line = <$Fh>;. Because you're calling readline in list context, it slurps the entire file on the first iteration, and the loop exits on the next. Instead, assign to a scalar, and you don't even need the inner for loop.
If you wanted to slurp your entire file into @line, your use of for(my$i =0;$i<=13655;$i++){ is also flawed. You should iterate to the last index of @line, which is $#line.
if ($line[$i]=~/^"/) { is also flawed as you leave the quote character " at the beginning of your names that you're trying to match. Instead add a capture group to pull the name.
With the suggested changes, the code reduces to:
use strict;
use warnings;
use autodie;
open my $infh, '<', 'printable address.txt';
open my $outfh, '>', 'names.csv';
while (my $line = <$infh>) {
if ($line =~ /^"(.*)/) {
print $outfh "$1\n";
}
}
Now if you also want to isolate the address, you can use a similar method as you did with the name. I'm going to assume that you might want to build the whole address in a variable so you can do something more complicated with it than throwing them blindly at a file. However, mirroring the file setup for now:
use strict;
use warnings;
use autodie;
open my $infh, '<', 'printable address.txt';
open my $namefh, '>', 'names.csv';
open my $addressfh, '>', 'address.dat';
my $address = '';
while (my $line = <$infh>) {
if ($line =~ /^"(.*)/) {
print $namefh "$1\n";
} elsif ($line =~ /(.*)"$/) {
$address .= $1;
print $addressfh "$address\n";
$address = '';
} else {
$address .= $line;
}
}
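If the goal is to work with each name and address together rather than write them to separate files, the same three-way match can collect records in memory. This is only a sketch under the same assumptions about the input (a record starts on a line beginning with a double quote and ends on a line ending with one); the sub name parse_records is hypothetical, and joining address lines with a space is an arbitrary choice.

```perl
use strict;
use warnings;

# Hypothetical helper: split quoted multi-line records into name/address pairs.
# Assumes each record starts with a line beginning with '"' (the name) and
# ends with a line ending in '"'; everything in between is the address.
sub parse_records {
    my ($fh) = @_;
    my (@records, $name, @address);

    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^"(.*)/) {
            ($name, @address) = ($1);    # new record: keep name, reset address
        }
        elsif ($line =~ /(.*)"$/) {
            push @address, $1;           # last address line, quote stripped
            push @records, { name => $name, address => join(' ', @address) };
        }
        else {
            push @address, $line;        # intermediate address line
        }
    }
    return @records;
}

my $text = qq{"scott\nE -45 COLLEGE LANE\nBENGALI MARKET\nxyz -785698."\n"Tomm\nD.No: 4318/3,Ansari Road, Dariya Gunj,\nxbc - 289235."\n};
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
for my $rec (parse_records($fh)) {
    print "$rec->{name}: $rec->{address}\n";
}
close $fh;
```

Each element of the returned list is a hash reference, so later code can sort, filter, or print the records in whatever format is needed.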
Ultimately, no matter what you want to use your data for, your best solution is probably to output it to a real CSV file using Text::CSV. That way it can be imported into a spreadsheet or some other system very easily, and you won't have to parse it again.
use strict;
use warnings;
use autodie;
use Text::CSV;
my $csv = Text::CSV->new ( { binary => 1, eol => "\n" } )
or die "Cannot use CSV: ".Text::CSV->error_diag ();
open my $infh, '<', 'printable address.txt';
open my $outfh, '>', 'address.csv';
my @data;
while (my $line = <$infh>) {
# Name Field
if ($line =~ /^"(.*)/) {
@data = ($1, '');
# End of Address
} elsif ($line =~ /(.*)"$/) {
$data[1] .= $1;
$csv->print($outfh, \@data);
# Address lines
} else {
$data[1] .= $line;
}
}

in perl can I sort my output of a foreach loop according to keywords in the original string?

I am working on a problem, and iterating through an array. I am new to perl, so sorry if this is something very obvious I am not seeing.
I want to sort the output according to a keyword in the original string. As I have two foreach loops that give me something like this:
[blup]
[ich]
[du]
[er]
[sie]
[es]
something something something
somethingelse something else something else
I want to sort it like that though according to a keyword in the original string where the substrings have been extracted from:
[blup blup]
[ich]
something something something
[er]
[sie]
[es]
something else something else something else
Thank you for your help!
This is my code:
#!/usr/bin/perl
# perl -d ./perl_debugger.pl
use strict;
use warnings;
use Data::Dumper qw(Dumper);
use File::Slurp;
my @a_linesorig;
my @solution;
my $line;
my $str;
my $grab;
my $s;
my $rs;
my $capture;
open(my $fh, "<", "output.txt")
or die "cannot open < output.txt: $!";
$line = read_file('output.txt');
$line = read_file('output.txt');
@a_linesorig = split( /\*/, $line);
@solution = split( /\bsolution\b/, $line);
close $fh
or die "can't close file: $!";
my $filename = 'neu.txt';
open(my $fh1, '>', $filename)
or die "can't open file: $!";
foreach $str (@a_linesorig) {
if ($str =~ (/\[(.*?)\]/)) {
print ($fh1 "content bracket: $1\n\n");
}
}
foreach $str (@a_linesorig) {
if ($str =~ /\brewrites\b([^\|]+)((\bcpu\b))*/g) {
print ($fh1 "decision: $&\n\n");
}
}
close $fh1
or die "can't close file: $!";
As you calculate your results, you could store them in a hash; at the end you can iterate through the hash by its sorted keys.
This is unchecked pseudo-Perl, but the concept is:
Define a hash:
my %hash;
When you store an entry you would do
$hash{$key_you_want_to_sort_by} = $thing_you_want_to_print;
Then, when you are done, you could loop through your keys:
for my $key (sort keys %hash) {
print $hash{$key};
}
At a high level, I would say that you should employ a hash whose keys are the words in the original string and whose values are the ordering that you want to preserve.
Then, after you've processed your input, you loop over the keys of the hash, sorted by your ordering, and print your results for each word inside the loop.
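A minimal sketch of that idea follows. The sub name sort_by_keyword_order, the pairing of each output line with a keyword, and the sample keywords are all assumptions made for illustration; how the keyword is extracted from the original string is left open.

```perl
use strict;
use warnings;

# Hypothetical helper: sort output lines by the position of their keyword
# in a preferred ordering.
# $order is an array ref of keywords in the desired order;
# $items is an array ref of [keyword, text] pairs.
sub sort_by_keyword_order {
    my ($order, $items) = @_;

    # Map each keyword to its rank: first keyword => 0, second => 1, ...
    my %rank;
    @rank{@$order} = (0 .. $#$order);

    # Items whose keyword is not in the ordering sort last.
    my $last = scalar @$order;
    return map  { $_->[1] }
           sort { ($rank{ $a->[0] } // $last) <=> ($rank{ $b->[0] } // $last) }
           @$items;
}

my @sorted = sort_by_keyword_order(
    [qw(ich er)],
    [ [ er => '[er] something else' ], [ ich => '[ich] something' ] ],
);
print "$_\n" for @sorted;
```

The defined-or operator (//) needs Perl 5.10 or later; on older versions an explicit exists check would be required instead.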

Dynamic array of hashes in Perl

I have a CSV file like this:
name,email,salary
a,b@b.com,1000
d,e@e.com,2000
Now, I need to transform this to an array of hash-maps in Perl, so when I do something like:
table[1]{"email"}
it returns e@e.com.
The code I wrote is :
open(DATA, "<$file") or die "Cannot open the file\n";
my @table;
#fetch header line
$line = <DATA>;
my @header = split(',',$line);
#fetch data tuples
while($line = <DATA>)
{
my %map;
my @row = split(',',$line);
for($index = 0; $index <= $#header; $index++)
{
$map{"$header[$index]"} = $row[$index];
}
push(@table, %map);
}
close(DATA);
But I am not getting desired results.. Can u help?? Thanks in advance...
This line
push(@table, %map)
should be
push(@table, \%map)
You want @table to be a list of hash references; your code adds each key and value in %map to the list as a separate element.
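The difference is easy to see on a tiny example. This is a sketch with made-up data; the variable names @flat and @table are only for illustration.

```perl
use strict;
use warnings;

my %map = (name => 'a', email => 'b@b.com');

my @flat;
push @flat, %map;       # flattens the hash: two keys plus two values

my @table;
push @table, \%map;     # stores one reference to the hash

print scalar(@flat), "\n";        # 4 elements
print scalar(@table), "\n";       # 1 element
print $table[0]{email}, "\n";     # b@b.com
```

In the question's loop, %map is declared with my inside the while block, so each iteration creates a fresh hash and every pushed reference points to its own row.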
There is no need to reinvent the wheel here. You can do this with the Text::CSV module.
#!/usr/bin/perl
use strict;
use warnings;
use v5.16;
use Text::CSV;
my $csv = Text::CSV->new;
open my $fh, "<:encoding(utf8)", "data.csv" or die "data.csv: $!";
$csv->column_names( $csv->getline ($fh) );
while (my $row = $csv->getline_hr ($fh)) {
say $row->{email};
}
Something like this perhaps:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my @table;
chomp(my $header = <DATA>);
my @cols = split /,/, $header; # Should really use a real CSV parser here
while (<DATA>) {
chomp;
my %rec;
@rec{@cols} = split /,/;
push @table, \%rec;
}
say $table[1]{email};
__END__
name,email,salary
a,b@b.com,1000
d,e@e.com,2000