Grouping data by key - Perl

I need help with the following problem. I have a file with the following data:
21997|||70049,,20170428154818,20170527235959|||
21997|||70070,,20170428154739,20170527235959|||
21998|||70049,,20170428154818,20170527235959|||
21998|||70070,,20170428154739,20170527235959|||
21998|||70071,,20170428154739,20170527235959|||
I need to merge the lines that share the same first field, as follows:
21997|||70049,,20170502172844,20170531235959; 70070,,20170502172844,20170531235959|||
21998|||70049,,20170502172844,20170531235959; 70070,,20170502172844,20170531235959; 70071,,20170502172844|||
Can someone help me please?

use strict;
use warnings;

my $unified_output = '';
my %out;
open(my $fh, '<', './raw-file.txt') or die "Can't open ./raw-file.txt: $!";
my @file = <$fh>;
close $fh;
for (@file) {
    chomp;
    next if /^\s*$/;                  # skip blank lines
    my @line = split /\|\|\|/;        # $line[0] is the key, $line[1] the data
    next unless defined $line[1];
    $out{ $line[0] } .= "$line[1]; ";
}
for (sort keys %out) {
    $out{$_} =~ s/; $//;              # drop the trailing separator
    $unified_output .= "$_|||$out{$_}|||\n";
}
print $unified_output;
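If the keys should come out in the order in which they first appear in the file, rather than in hash or sorted order, a separate array can record that order. The following is only a sketch: group_records is a hypothetical helper name, and the sample lines are copied from the input shown in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: merge "key|||value|||" records that share a key,
# keeping the keys in the order in which they first appear.
sub group_records {
    my @lines = @_;
    my (%values_for, @order);
    for my $line (@lines) {
        chomp(my $copy = $line);
        next if $copy =~ /^\s*$/;                    # skip blank lines
        my ($key, $value) = split /\|\|\|/, $copy;
        next unless defined $value;                  # skip malformed lines
        push @order, $key unless exists $values_for{$key};
        push @{ $values_for{$key} }, $value;
    }
    return map { "$_|||" . join('; ', @{ $values_for{$_} }) . "|||" } @order;
}

# Sample input taken from the question.
my @sample = (
    "21997|||70049,,20170428154818,20170527235959|||\n",
    "21997|||70070,,20170428154739,20170527235959|||\n",
    "21998|||70049,,20170428154818,20170527235959|||\n",
    "21998|||70070,,20170428154739,20170527235959|||\n",
    "21998|||70071,,20170428154739,20170527235959|||\n",
);
print "$_\n" for group_records(@sample);
```

Collecting the values in an array and joining them at the end also avoids having to strip a trailing separator afterwards.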

Related

Nested if statements: Swapping headers and sequences in fasta files

I am opening a directory and processing each file. A sample file looks like this when opened:
>AAAAA
TTTTTTTTTTTAAAAATTTTTTTTTT
>BBBBB
TTTTTTTTTTTTTTTTTTBBBBBTTT
>CCCCC
TTTTTTTTTTTTTTTTCCCCCTTTTT
For the above sample file, I am trying to make them look like this:
>TAAAAAT
AAAAA
>TBBBBBT
BBBBB
>TCCCCCT
CCCCC
I need to find the "header" in the sequence on the next line, take one flanking character on either side of the match, and then swap the two lines. I want to print each file's worth of contents to a separate output file.
Here is my code so far. It runs without errors, but doesn't generate any output. My guess is this is probably related to the nested if statements. I have never worked with those before.
#!/usr/bin/perl
use strict;
use warnings;
my ($directory) = @ARGV;
my $dir = "$directory";
my @ArrayofFiles = glob "$dir/*";
my $count = 0;
open(OUT, ">", "/path/to/output_$count.txt") or die $!;
foreach my $file(@ArrayofFiles){
open(my $fastas, $file) or die $!;
while (my $line = <$fastas>){
$count++;
if ($line =~ m/(^>)([a-z]{5})/i){
my $header = $2;
if ($line !~ /^>/){
my $sequence .= $line;
if ($sequence =~ m/(([a-z]{1})($header)([a-z]{1}))/i){
my $matchplusflanks = $1;
print OUT ">", $matchplusflanks, "\n", $header, "\n";
}
}
}
}
}
How can I fix this code? Thanks.
Try this
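foreach my $file (@ArrayofFiles)
{
    open my $fh, "<", $file or die "error opening $file: $!\n";
    while (my $head = <$fh>)
    {
        chomp $head;
        $head =~ s/>//;
        my $next_line = <$fh>;
        last unless defined $next_line;    # header without a sequence line
        my ($extract) = $next_line =~ m/(.\Q$head\E.)/;
        next unless defined $extract;      # header not found with a flank on each side
        print ">$extract\n$head\n";
    }
    close $fh;
}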
There are several mistakes in your code but the main problem is:
if ($line =~ m/(^>)([a-z]{5})/i) {
my $header = $2;
if ($line !~ /^>/) {
# here you write to the output file
Because the same line can't start and not start with > at the same time, your output files are never written. The second if statement always fails and its block is never executed.
open(OUT, ">", "/path/to/output_$count.txt") or die $!; and $count++ are misplaced. Since you want to produce an output file (with a new name) for each input file, you need to put them in the foreach block, not outside or in the while loop.
Example:
#!/usr/bin/perl
use strict;
use warnings;
my ($dir) = @ARGV;
my @files = glob "$dir/*";
my $count;
my $format = ">%s\n%s\n";
foreach my $file (@files) {
    open my $fhi, '<', $file
        or die "Can't open file '$file': $!";
    $count++;
    my $output_path = "/path/to/output_$count.txt";
    open my $fho, '>', $output_path
        or die "Can't open file '$output_path': $!";
    my ($header, $seq);
    while (<$fhi>) {
        chomp;
        if (/^>([a-z]{5})/i) {
            # a new header: flush the previous record first
            if ($seq) { printf $fho $format, $seq =~ /([a-z]${header}[a-z])/i, $header; }
            ($header, $seq) = ($1, '');
        } else { $seq .= $_; }
    }
    # flush the last record of the file
    if ($seq) { printf $fho $format, $seq =~ /([a-z]${header}[a-z])/i, $header; }
    # the handles are lexical to this loop, so close them inside it
    close $fhi;
    close $fho;
}
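The core step in both answers, finding the header inside the sequence with one flanking letter on each side, can be isolated and checked against the sample data from the question. This is only a sketch: header_with_flanks is a hypothetical helper name, and it returns undef when the header has no letter on both sides, a case the question does not cover.

```perl
use strict;
use warnings;

# Hypothetical helper: return the header plus one flanking letter on each
# side as found in the sequence, or undef if there is no such match.
sub header_with_flanks {
    my ($header, $sequence) = @_;
    my ($match) = $sequence =~ /([a-z]\Q$header\E[a-z])/i;
    return $match;
}

# Sample records taken from the question.
my %records = (
    AAAAA => 'TTTTTTTTTTTAAAAATTTTTTTTTT',
    BBBBB => 'TTTTTTTTTTTTTTTTTTBBBBBTTT',
    CCCCC => 'TTTTTTTTTTTTTTTTCCCCCTTTTT',
);
for my $header (sort keys %records) {
    my $match = header_with_flanks($header, $records{$header});
    print ">$match\n$header\n" if defined $match;
}
```

Checking for undef before printing avoids the situation where a failed match silently shifts the arguments of a printf call.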

Is there a better way of writing this code to avoid redundancy?

I have a segment of code in a program, which accepts GNU style input from a pipe (which is a list of file names). If STDIN does not contain data, I need to accept input from a predetermined text file containing file names.
I find myself needing to write redundant code. Is it possible to simplify this bit of code to avoid redundancy?
sub downloadlinkgen {
my $fh;
print "Printing links\n";
if ($getfilelist==1) {
open $fh, '<', "fuzzyfile" or die $!;
while (<$fh>) {
chomp ($_);
(my $fname,my $path, my $suffix) = fileparse($_);
my ($name, $ext) = $fname =~ /(.*)\.(.*)/;
my $newfile=$path.$name.".$ext";
$newfile =~ s/\s/%20/g;
$newfile =~ s/\/root/http:\/\/myip/;
print $newfile."\n";
}
} else {
while (<>) {
chomp ($_);
(my $fname,my $path, my $suffix) = fileparse($_);
my ($name, $ext) = $fname =~ /(.*)\.(.*)/;
my $newfile=$path.$name.".$ext";
$newfile =~ s/\s/%20/g;
$newfile =~ s/\/root/http:\/\/myip/;
print $newfile."\n";
}
}
}
Yes, just make the default ARGV filehandle open the file:
sub downloadlinkgen {
@ARGV = ('fuzzyfile') if $getfilelist == 1;
print "Printing links\n";
while (<>) {
chomp ($_);
(my $fname,my $path, my $suffix) = fileparse($_);
my ($name, $ext) = $fname =~ /(.*)\.(.*)/;
my $newfile=$path.$name.".$ext";
$newfile =~ s/\s/%20/g;
$newfile =~ s/\/root/http:\/\/myip/;
print $newfile."\n";
}
}
sub downloadlinkgen {
# default file handle
my $fh = \*ARGV;
print "Printing links\n";
if ($getfilelist==1) {
open $fh, '<', "fuzzyfile" or die $!;
}
while (<$fh>) {
chomp ($_);
(my $fname,my $path, my $suffix) = fileparse($_);
my ($name, $ext) = $fname =~ /(.*)\.(.*)/;
my $newfile=$path.$name.".$ext";
$newfile =~ s/\s/%20/g;
$newfile =~ s/\/root/http:\/\/myip/;
print $newfile."\n";
}
}
From perldoc -f readline
Reads from the filehandle whose typeglob is contained in EXPR (or from *ARGV if EXPR is not provided)
So \*ARGV is a reference to the filehandle that <> reads from, and you can use $fh in both cases.
Even if you don't know about ARGV, you could do something simple like this:
sub downloadlinkgen {
my $fh;
print "Printing links\n";
if ($getfilelist==1) {
open $fh, '<', "fuzzyfile" or die $!;
while (<$fh>) {
process_line($_);
}
} else {
while (<>) {
process_line($_);
}
}
}
sub process_line {
my $line = shift;
chomp ($line);
(my $fname,my $path, my $suffix) = fileparse($line);
my ($name, $ext) = $fname =~ /(.*)\.(.*)/;
my $newfile=$path.$name.".$ext";
$newfile =~ s/\s/%20/g;
$newfile =~ s/\/root/http:\/\/myip/;
print $newfile."\n";
}
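The two ideas above can also be combined: pick the filehandle once, and keep the per-line work in its own sub. This is only a sketch. link_for and print_links are hypothetical names, the optional file-name argument stands in for the $getfilelist flag, and the sample path is made up for illustration.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper holding the per-line transformation from the question.
sub link_for {
    my ($line) = @_;
    chomp $line;
    my ($fname, $path) = fileparse($line);
    my $link = $path . $fname;
    $link =~ s/\s/%20/g;               # encode whitespace
    $link =~ s{/root}{http://myip};    # map the local root to the URL prefix
    return $link;
}

# Hypothetical wrapper: read from the named list file if one is given,
# otherwise from whatever <> would read (files in @ARGV, or STDIN).
sub print_links {
    my ($list_file) = @_;
    my $fh;
    if (defined $list_file) {
        open $fh, '<', $list_file or die "Can't open '$list_file': $!";
    } else {
        $fh = \*ARGV;
    }
    while (my $line = <$fh>) {
        print link_for($line), "\n";
    }
}

# Made-up sample path, just to show the transformation.
print link_for("/root/my dir/file.txt\n"), "\n";
```

With this layout, print_links('fuzzyfile') reads the predetermined list and print_links() reads from the pipe, and there is only one loop to maintain.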

Failed to open GLOB error

I've noticed that when you drag and drop a file into the OS X Terminal and any part of the path contains a space (for example in a folder name), the space is replaced with a backslash followed by a space (\ ).
This then leads to an error in opening files in my script:
use strict;
use warnings;
use File::Basename;
my $in;
my $filename = $ARGV[0];
unless ($filename){
print "\n\nPlease drag and drop a FASTA/FA or plain-text file containing your sequence into the prompt window and hit ENTER. Alternatively, manually specify the file-pathway:\n";
$filename = <STDIN>;
chomp $filename;
}
open($in, $filename) or die "Failed to open $in: $!";
my $DNA = read_fasta($in);
my $len = length($DNA);
print "\nFASTA/Sequence Length is: $len bp \n";
print "\nPlease enter restriction sites (degeneracy characters permitted) separated by a single space: ";
my $sites=<STDIN>;
chomp $sites;
my @pats = split ' ', $sites;
for (@pats) {
s/K/[GT]/g;
s/M/[AC]/g;
s/Y/[CT]/g;
s/S/[CG]/g;
s/W/[AT]/g;
s/B/[CGT]/g;
s/V/[ACG]/g;
s/H/[ACT]/g;
s/D/[AGT]/g;
s/X/[AGCT]/g;
s/R/[AG]/g;
s/N/[AGCT]/g;
}
for (@pats) {
my $m = () = $DNA =~ /$_/gi;
print "\nTotal DNA matches to $_ are: $m \n";
}
my $DIR = dirname($filename);
my $name = basename($filename);
(my $extrem = $name) =~ s/\.[^.]+$//;
open my $out, ">$DIR/$extrem $sites.txt";
my $pat=join("|",@pats);
my @cutarr = split(/$pat/, $DNA);
for (@cutarr) {
my $len = length($_);
print $out "$len \n";
}
print "\nYour results are located at: $DIR/$extrem $sites.txt\n\n";
close($out);
close($in);
#Subfunction - Reading formatted FASTA/FA files
sub read_fasta {
my ($fh) = @_;
my $sequence;
while (<$fh>) {
unless (/^>/) {
chomp;
$sequence .= $_;
}
}
return $sequence;
}
It will open files if the pathway contains no spaces. Is there a better way I can go about opening the file to avoid this occurring?
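Try removing the backslashes from your file name. Printing $filename rather than $in in the error message also avoids the confusing GLOB(0x...) text, which is just the stringified filehandle:
$filename =~ tr|\\||d;
open(my $in, '<', $filename) or die "Failed to open $filename: $!";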
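Deleting every backslash would also damage a path that legitimately contains one. A slightly more careful variant treats each backslash as escaping the character that follows it. This is only a sketch: unescape_dropped_path is a hypothetical helper name, the sample path is made up, and the trimming step assumes the dropped path may arrive with surrounding whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: undo the backslash escaping that the terminal adds to
# a dragged-and-dropped path, e.g. "My\ Folder" becomes "My Folder".
sub unescape_dropped_path {
    my ($path) = @_;
    $path =~ s/^\s+|\s+$//g;    # drop surrounding whitespace, including the newline
    $path =~ s/\\(.)/$1/g;      # a backslash plus any character becomes that character
    return $path;
}

# Made-up example of a dropped path with an escaped space and a trailing space.
my $raw = '/Users/me/My\ Folder/seq.fa ';
my $filename = unescape_dropped_path($raw);
print "$filename\n";
```

The cleaned name can then be passed to a three-argument open, so that no character in the path is interpreted as part of the open mode.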

Perl <STDIN> not matching contents in an array

I have a file which consists of three names: daniel, elaine and victoria. If I search for daniel I get "you are not on the list". Could someone kindly point out where my mistake is? Thank you.
#!/usr/bin/perl
#open file
open(FILE, "names") or die("Unable to open file");
# read file into an array
@data = <FILE>;
# close file
close(FILE);
print "Enter name\n";
$entry = <STDIN>;
chomp $entry;
if (grep {$_ eq $entry} @data)
{
print "You are on the list $entry";
}
else
{
print "You are not on the list";
}
You also need to chomp the data read from the file (chomp removes the trailing newline from each string):
chomp @data;
if (grep {$_ eq $entry} @data) {
print "You are on the list $entry";
} else {
print "You are not on the list";
}
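Change this:
if (grep {$_ eq $entry} @data)
to this:
if (grep {$_ =~ m/^\Q$entry\E\b/i} @data)
Remove the i if you specifically want the match to be case-sensitive. The \Q...\E keeps any regex metacharacters in the typed name from being interpreted as a pattern.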
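If the goal is an exact but case-insensitive match, both fixes can be combined without a regex: strip the line endings, then compare lower-cased strings. This is only a sketch; is_on_list is a hypothetical helper name, and the sample names are the three from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $entry equals one of the names, ignoring case
# and any trailing newline or carriage return left over from reading a file.
sub is_on_list {
    my ($entry, @names) = @_;
    s/[\r\n]+\z// for $entry, @names;    # these are copies, so the caller's data is untouched
    return scalar grep { lc($_) eq lc($entry) } @names;
}

# The names from the question, as they would look when read with <FILE>.
my @data = ("daniel\n", "elaine\n", "victoria\n");
print is_on_list("daniel\n", @data)
    ? "You are on the list\n"
    : "You are not on the list\n";
```

Unlike the regex version, this does not accept an entry that is only a prefix of a name on the list.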

Why I am not getting "success" with this program?

I wrote the following program expecting it to print "Success", but it never does.
my $fileName = 'myfile.txt';
print $fileName,"\n";
if (open MYFILE, "<", $fileName) {
my $Data;
{
local $/ = undef;
$Data = <MYFILE>;
}
my @values = split('\n', $Data);
chomp(@values);
if($values[2] eq '9999999999') {
print "Success"."\n";
}
}
The content of myfile.txt is
160002
something
9999999999
700021
Try splitting on \s*[\r\n]+ instead:
my $fileName = 'myfile.txt';
print $fileName,"\n";
if (open MYFILE, "<", $fileName) {
my $Data;
{
local $/ = undef;
$Data = <MYFILE>;
}
my @values = split(/\s*[\r\n]+/, $Data);
if($values[2] eq '9999999999') {
print "Success";
}
}
If myfile.txt contains carriage returns (CR, \r), the original comparison will not work as expected, because each value then still ends in \r after the split on \n.
Another possible cause is trailing spaces before the line feed (LF, \n).
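The difference is easy to reproduce with an in-memory string instead of a file. This is only a sketch: clean_lines is a hypothetical helper name, and the string assumes the file was saved with CRLF line endings and has one trailing space on the second line.

```perl
use strict;
use warnings;

# Hypothetical helper: split file contents into lines, discarding trailing
# whitespace (spaces, tabs, CR) together with the line terminator.
sub clean_lines {
    my ($data) = @_;
    return split /\s*[\r\n]+/, $data;
}

# Assumed file contents: CRLF line endings and a trailing space after "something".
my $crlf_data = "160002\r\nsomething \r\n9999999999\r\n700021\r\n";

my @naive = split /\n/, $crlf_data;    # every element keeps its trailing \r
my @clean = clean_lines($crlf_data);

print "naive: ", ($naive[2] eq '9999999999' ? "match" : "no match"), "\n";
print "clean: ", ($clean[2] eq '9999999999' ? "match" : "no match"), "\n";
```

The naive split reports no match because the third element is "9999999999\r", while the cleaned split reports a match.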
You don't need to read an entire file into an array to check one line. Open the file, skip the lines you don't care about, then play with the line you do care about. When you've done what you need to do, stop reading the file. This way, only one line is ever in memory:
my $fileName = 'myfile.txt';
open MYFILE, "<", $fileName or die "$fileName: $!";
while( <MYFILE> ) {
next if $. < 3; # $. is the line number
last if $. > 3;
chomp;
print "Success\n" if $_ eq '9999999999';
}
close MYFILE;
my $fileName = 'myfile.txt';
open MYFILE, "<", $fileName or die "$fileName: $!";
my $cnt = 0;
while ( my $rec = <MYFILE> ) {
    for ($rec) { chomp; s/\r//; s/^\s+//; s/\s+$//; }   # remove line endings and surrounding whitespace
    $cnt++;
    # if the record consists only of "9"s, print "Success" and leave the loop
    if ( $rec =~ /^9+$/ ) { print "Success\n"; last; }
}
close MYFILE;
# Or you can write: if ($cnt==3 and $rec =~ /^9{10}$/) { print "Success\n"; last; }
# If record 3 consists of exactly ten "9"s, print "Success" and leave the loop.