Perl script to subset multiple DNA sequences - perl

I have a FASTA file of ~500 DNA sequences, each of which has a target position for a Single-Nucleotide Polymorphism (SNP) of interest that is known to me.
For each entry in the file, I have a separate tab-delimited text file that has on each line
The FASTA sequence name
The start position
The end position
The SNP position
The sequences and the positions in the text file are in the same order.
The dummy FASTA file is:
>AOS-94_S25_L002_R1_001_trimmed_contig_767
GACACACACTGATTGTTAGTGGTGTACAGACATTGCTTCAAACTGCA
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
TAGGTTTTCTTTCCCATGTCCCCTGAATAACATGGGATTCCCTGTGACTGTGGGGACCCCTGAGAGCCTGGT
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
GATAAGGAGCTCACAGCAACCCACATGAGTTGTCC
and the dummy position file is
AOS-94_S25_L002_R1_001_trimmed_contig_767 5 15 10
AOS-94_S25_L002_R1_001_trimmed_contig_2199 8 19 11
AOS-94_S25_L002_R1_001_trimmed_contig_2585 4 20 18
This is the script I have written and tried
use warnings;
use strict;
# Read in the complete FASTA file:
print "What is the name of the fasta contig file?\n";
my $fasta = <STDIN>;
chomp $fasta;
# Read in file of contig name, start pos, stop pos, SNP pos in tab delimited text:
print "Name of text file with contig name and SNP position info? \n";
my $text = <STDIN>;
chomp $text;
# Output file
print "What are we calling the output? \n";
my $out= <STDIN>;
chomp $out;
local $/ = "\n>"; #Read by fasta record
my $seq1 = ();
open(FASTA,$fasta) || die "\n Unable to open the file!\n";
open(POS,$text) || die "\n Unable to open the file! \n";
my @fields = <POS>;
while (my $seq = <FASTA>){
chomp $seq;
my @seq = split(/\n/,$seq);
if($seq[0] =~ /^>/){
$seq1 = $seq[0];
}elsif($seq[0] =~ /[^>]/){ #matches any character except the >
$seq1 = ">".$seq[0];
}
for my $pos (@fields){
chomp $pos;
my @field = split(/\t/,$pos);
open(OUTFILE,">>$out");
print OUTFILE "$seq1";
my $subseq = substr $seq[1], $field[1] -1, $field[2] - $field[1];
print OUTFILE "$subseq\n";
}
}
close FASTA;
close POS;
close OUTFILE;
This is what I get out, which is what I sort of want:
>AOS-94_S25_L002_R1_001_trimmed_contig_767
CACACTGATT
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
TTTTCTTTCC
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
AGGAGCTCAC
However, I need to also print out the SNP position (column 4) after the sequence name, e.g.,
>AOS-94_S25_L002_R1_001_trimmed_contig_767
pos=10
CACACTGATT
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
pos=11
TTTTCTTTCC
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
pos=18
AGGAGCTCAC
I tried inserting print OUTFILE "pos= $field[3]\n"; after print OUTFILE "$seq1"; and I get the following:
>AOS-94_S25_L002_R1_001_trimmed_contig_767
10
AOS-94_S25_L002_R1_001_trimmed_contig_2199
CACACTGATT
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
10
AOS-94_S25_L002_R1_001_trimmed_contig_2199
TTTTCTTTCC
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
10
AOS-94_S25_L002_R1_001_trimmed_contig_2199
AGGAGCTCAC
Obviously I have messed up my loops, and probably some chomp commands.
For instance, when I print "$seq1" to a file, why does it not need a "\n" included in the printed string? There must already be a hard return in the string?
I know I am missing some basics of how this is structured, but I so far can't figure out how to fix my mistakes. Can anyone provide any suggestions?
Update
Perl code reformatted for legibility
use warnings;
use strict;
# Read in the complete FASTA file:
print "What is the name of the fasta contig file?\n";
my $fasta = <STDIN>;
chomp $fasta;
# Read in file of contig name, start pos, stop pos, SNP pos in tab delimited text:
print "Name of text file with contig name and SNP position info? \n";
my $text = <STDIN>;
chomp $text;
#Output file
print "What are we calling the output? \n";
my $out = <STDIN>;
chomp $out;
local $/ = "\n>"; # Read by FASTA record
my $seq1 = ();
open( FASTA, $fasta ) || die "\n Unable to open the file!\n";
open( POS, $text ) || die "\n Unable to open the file! \n";
my @fields = <POS>;
while ( my $seq = <FASTA> ) {
chomp $seq;
my @seq = split( /\n/, $seq );
if ( $seq[0] =~ /^>/ ) {
$seq1 = $seq[0];
}
elsif ( $seq[0] =~ /[^>]/ ) { # matches any character except the >
$seq1 = ">" . $seq[0];
}
for my $pos ( @fields ) {
chomp $pos;
my @field = split( /\t/, $pos );
open( OUTFILE, ">>$out" );
print OUTFILE "$seq1";
my $subseq = substr $seq[1], $field[1] - 1, $field[2] - $field[1];
print OUTFILE "$subseq\n";
}
}
close FASTA;
close POS;
close OUTFILE;

There are many problems with your code
Your comments don't correspond to the code. For instance, you have Read in the complete FASTA file when the code just accepts the file name from STDIN and trims it. It is usually best to write clean code with well-chosen identifiers; that way the program explains itself
You are using the two-parameter form of open and global file handles. You also don't have the reason for failure in the die string, and you have a newline at the end, which will prevent Perl from giving you the source file name and line number where the error occurred
Something like
open( FASTA, $fasta ) || die "\n Unable to open the file!\n"
should be
open my $fasta_fh, '<', $fasta_file or die qq{Unable to open "$fasta_file" for input: $!}
and
open( OUTFILE, ">>$out" );
should be
open my $out_fh, '>>', $output_file or die qq{Unable to open "$output_file" for appending: $!}
You should avoid putting quotes around variable names.
print OUTFILE "$seq1"
should be
print OUTFILE $seq1
You set the input record separator to "\n>". That means that every time you call <FASTA> Perl will read up to the next occurrence of that string. It also means that chomp will remove exactly that string from the end of the line, if it is there
The biggest problem is that you never reset $/ before reading from POS. Remember that its setting affects every readline (or <>) and every chomp. And because your $text file probably contains no > characters at the start of a line, you will read the entire file in one go
That is why you are seeing newlines in your output without asking for them. You have read the whole file, together with all embedded newlines, and chomp is useless here because you have modified the string that it removes
local is named that way for a reason. It alters the value temporarily and locally to the current scope. But your "current scope" is the entirety of the rest of the file and you are reading both files with the modified terminator
Use some braces { ... } to limit the scope of the local modification. Note that the IO::Handle method call
$fasta_fh->input_record_separator("\n>")
is not an alternative here: the IO::Handle documentation lists input_record_separator among the methods that are not supported on a per-filehandle basis, so it would change the global $/ just as a plain assignment does. Localising $/ inside a block is the reliable way to confine the change
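The brace-scoped approach can be sketched as follows. This is only an illustration under stated assumptions: it reads small in-memory strings instead of the real files, and read_fasta_records and read_lines are hypothetical helper names that do not appear in the original program.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records with $/ changed only inside a block.
sub read_fasta_records {
    my ($fh) = @_;
    my @records;
    {
        local $/ = "\n>";    # in effect only until the closing brace
        while ( my $record = <$fh> ) {
            chomp $record;          # strips a trailing "\n>" if present
            $record =~ s/^>//;      # the first record still has its leading ">"
            my ( $header, @seq_lines ) = split /\n/, $record;
            push @records, [ $header, join( '', @seq_lines ) ];
        }
    }
    return @records;
}

# Hypothetical helper: read ordinary newline-terminated lines.
sub read_lines {
    my ($fh) = @_;
    my @lines = <$fh>;       # $/ is back to "\n" here
    chomp @lines;
    return @lines;
}

# Made-up test data standing in for the FASTA and position files.
my $fasta_text = ">seq1\nACGT\nTTAA\n>seq2\nGGCC\n";
my $pos_text   = "seq1\t1\t3\t2\nseq2\t2\t4\t3\n";

open my $fasta_fh, '<', \$fasta_text or die "Unable to open in-memory FASTA: $!";
open my $pos_fh,   '<', \$pos_text   or die "Unable to open in-memory positions: $!";

my @records = read_fasta_records($fasta_fh);
my @lines   = read_lines($pos_fh);

printf "%d FASTA records, %d position lines\n", scalar @records, scalar @lines;
```

Because the local assignment sits inside its own block, the position data is read line by line as usual, which is the behaviour the original script was missing.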
Here's an amended version of your program which also addresses some poor choices of identifier as well as a few other things. Please note that this code is untested. I am working on the train at present and can only check what I'm writing mentally
Note that things like while ( <$fasta_fh> ) and for ( @pos_records ) use the default variable $_ when no loop variable is specified. Likewise, operators like chomp and split will apply to $_ when the corresponding parameter is missing. That way there is never any need to mention any variable explicitly, and it leads to more concise and readable code. $_ is equivalent to the word "it" in the English language
I encourage you to understand what the things you're writing actually do. It is becoming common practice to copy code from one part of the internet and offer it to some kind souls elsewhere to get it working for you. That isn't "learning to program", and you will not understand anything unless you study the language and put your mind to it
And please take more care with laying out your code. I hope you can see that the edit I made to your question, and the code in my solution, is more comfortable to read than the program that you posted? While you're welcome to make your own job as awkward as you like, it's unfair and impolite to offer a mess like that to a world of total strangers whom you're asking for free programming help. A nice middle line is to alter your editor to use an indent of four spaces when the tab key is pressed. Never use tab characters in source code!
use strict;
use warnings 'all';
print "Name of the FASTA contig file: ";
chomp( my $fasta_file = <STDIN> );
print "Name file with SNP position info: ";
chomp( my $pos_file = <STDIN> );
print "Name of the output file: ";
chomp( my $out_file = <STDIN> );
open my $out_fh, '>', $out_file or die qq{Unable to open "$out_file" for output: $!};
my @pos_records = do {
open my $pos_fh, '<', $pos_file or die qq{Unable to open "$pos_file" for input: $!};
<$pos_fh>;
};
chomp @pos_records; # Remove all newlines
{
open my $fasta_fh, '<', $fasta_file or die qq{Unable to open "$fasta_file" for input: $!};
local $/ = "\n>"; # Reading FASTA format now
while ( <$fasta_fh> ) {
chomp; # Remove "\n>" from the end
my ( $header, $seq ) = split /\n/; # Separate the two lines
$header =~ s/^>?/>/; # Replace any chomped >
for ( @pos_records ) {
my ( $name, $beg, $end, $pos ) = split /\t/;
my $subseq = substr $seq, $beg-1, $end-$beg;
print $out_fh "$header\n";
print $out_fh "pos=$pos\n";
print $out_fh "$subseq\n";
}
}
} # local $/ expires here
close $out_fh or die $!;

Okay, with a couple very minor edits, your code worked perfectly. This is the solution that worked for me:
#!/usr/bin/perl
use strict;
use warnings;
print "Name of the FASTA contig file: ";
chomp( my $fasta_file = <STDIN> );
print "Name file with SNP position info: ";
chomp( my $pos_file = <STDIN> );
print "Name of the output file: ";
chomp( my $out_file = <STDIN> );
open my $out_fh, '>', $out_file or die qq{Unable to open "$out_file" for output: $!};
my @pos_records = do {
open my $pos_, '<' , $pos_file or die qq{Unable to open "$pos_file" for input: $!};
<$pos_>;
};
chomp @pos_records; #remove all newlines
{
open my $fasta_fh, '<', $fasta_file or die qq{Unable to open "$fasta_file" for input: $!};
local $/ = "\n>"; #Reading FASTA format now
for ( <$fasta_fh> ) {
chomp; #Remove "\n>" from the end
my ( $header, $seq) = split /\n/; #separate the two lines
$header = ">$header" unless $header =~ /^>/; # Replace any chomped >
for ( @pos_records ) {
my ($name,$beg,$end,$pos) = split /\t/;
my $subseq = substr $seq, $beg-1, $end-$beg;
my $final_SNP = $end - $pos;
if($header =~ /$name/){
print $out_fh "$header\n";
print $out_fh "pos=$final_SNP\n";
print $out_fh "$subseq\n";
}
}
}
} #local expires here
close $out_fh or die $!;
The only substantive thing I changed was the addition of an if statement. Without that, each FASTA sequence was being written three times, once with each of the three SNP positions. I also slightly changed how I notate the SNP position, which after excising the sequence was actually $end - $pos and not just $pos.
Again, I can't thank you enough, as it is obvious you spent a fair bit of time helping me. For what it's worth, I sincerely appreciate it. Your solution will serve as a template for my future efforts (which will likely be similar manipulations of FASTA files), and your explanations have helped me to better understand things like what local does in a way that my pea brain can comprehend.
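One possible refinement of the if statement described above is to index the position records by contig name, so each FASTA header is looked up exactly instead of being regex-matched against every record (a regex match on a name such as contig_2 would also match contig_2199). This is a hedged sketch, not part of either posted program: index_positions and format_entry are hypothetical names, and it prints the SNP position from column 4 unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: index tab-delimited position records by contig name.
sub index_positions {
    my @pos_records = @_;
    my %by_name;
    for my $record (@pos_records) {
        my ( $name, $beg, $end, $pos ) = split /\t/, $record;
        $by_name{$name} = { beg => $beg, end => $end, pos => $pos };
    }
    return \%by_name;
}

# Hypothetical helper: build the three output lines for one FASTA record.
# Returns an empty string when the contig has no position record.
sub format_entry {
    my ( $by_name, $header, $seq ) = @_;
    ( my $name = $header ) =~ s/^>//;
    my $info = $by_name->{$name} or return '';
    my $subseq = substr $seq, $info->{beg} - 1, $info->{end} - $info->{beg};
    return ">$name\npos=$info->{pos}\n$subseq\n";
}

# First record of the dummy data from the question.
my $by_name = index_positions(
    "AOS-94_S25_L002_R1_001_trimmed_contig_767\t5\t15\t10"
);
print format_entry(
    $by_name,
    '>AOS-94_S25_L002_R1_001_trimmed_contig_767',
    'GACACACACTGATTGTTAGTGGTGTACAGACATTGCTTCAAACTGCA'
);
```

With a hash, each sequence costs one lookup rather than a pass over all ~500 position records, and the two files no longer need to be in the same order.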

Related

print hashes with values from different files

I want to create output file that has values from file 1 and file 2.
The line from file 1:
chr1 Cufflinks exon 708356 708487 1000 - .
gene_id "CUFF.3"; transcript_id "CUFF.3.1"; exon_number "5"; FPKM
"3.1300591420"; frac "1.000000"; conf_lo "2.502470"; conf_hi
"3.757648"; cov "7.589085"; chr1Cufflinks exon 708356
708487 . - . gene_id "XLOC_001284"; transcript_id
"TCONS_00007667"; exon_number "7"; gene_name "LOC100288069"; oId
"CUFF.15.2"; nearest_ref "NR_033908"; class_code "j"; tss_id
"TSS2981";
The line from file 2:
CUFF.48557
chr4:160253850-160259462:160259621-160260265:160260507-160262715
The second column from this file is unique id (uniq_id).
I want to get output file in the following format:
transcript_id(CUFF_id) uniq_id gene_id(XLOC_ID) FPKM
My script takes the XLOC_ID and FPKM values from the first file and prints them together with the two columns from the second file.
#!/usr/bin/perl -w
use strict;
my $v_merge_gtf = shift @ARGV or die $!;
my $unique_gtf = shift @ARGV or die $!;
my %fpkm_hash;
my %xloc_hash;
open (FILE, "$v_merge_gtf") or die $!;
while (<FILE>) {
my $line = $_;
chomp $line;
if ($line =~ /[a-z]/) {
my @array = split("\t", $line);
if ($array[2] eq 'exon') {
my $id = $array[8];
if ($id =~ /transcript_id \"(CUFF\S+)/) {
$id = $1;
$id =~ s/\"//g;
$id =~ s/;//;
}
my $fpkm = $array[8];
if ($fpkm =~ /FPKM \"(\S+)/) {
$fpkm = $1;
$fpkm =~ s/\"//g;
$fpkm =~ s/;//;
}
my $xloc = $array[17];
if ($xloc =~ /gene_id \"(XLOC\S+)/) {
$xloc = $1;
$xloc =~ s/\"//g;
$xloc =~ s/;//;
}
$fpkm_hash{$id} = $fpkm;
$xloc_hash{$id} = $xloc;
}
}
}
close FILE;
open (FILE, "$unique_gtf") or die $!;
while (<FILE>) {
my $line = $_;
chomp $line;
if ($line =~ /[a-z]/) {
my @array = split("\t", $line);
my $id = $array[0];
my $uniq = $array[1];
print $id . "\t" . $uniq . "\t" . $xloc_hash{$id} . "\t" . $fpkm_hash{$id} . "\n";
}
}
close FILE;
I initialized hashes outside of the files, but I get the following error for each CUFF values:
CUFF.24093
chr17:3533641-3539345:3527526-3533498:3526786-3527341:3524707-3526632
Use of uninitialized value in concatenation (.) or string at ex_1.pl
line 55, line 9343.
Use of uninitialized value in concatenation (.) or string at ex_1.pl
line 55, line 9343.
How can I fix this issue?
Thank you!
I think the warning message is because the $id key, (CUFF.24093), you get on line 9343 of the second file isn't contained in the hashes you created in the first file.
Is it possible that an ID in the second file isn't contained in the first file? That seems to be the case here.
If so, and you just want to skip over this unknown ID, you could add a line to your program like:
my $id = $array[0];
my $uniq = $array[1];
next unless exists $fpkm_hash{$id}; # add this line
print $id . "\t" . $uniq . "\t" . $xloc_hash{$id} . "\t" . $fpkm_hash{$id} . "\n";
This will bypass the following print statement and go back to the top of the while loop and read in the next line and continue processing.
It depends on what action you want to take if you encounter an unknown ID.
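If skipping unknown IDs is not what you want, another option is to print a placeholder for the missing values. The sketch below is only an illustration: format_row is a hypothetical helper, 'NA' is an arbitrary placeholder, and the defined-or operator // needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: format one output row, using 'NA' for IDs that
# were never seen in the first file instead of printing undef.
sub format_row {
    my ( $id, $uniq, $xloc_hash, $fpkm_hash ) = @_;
    my $xloc = $xloc_hash->{$id} // 'NA';
    my $fpkm = $fpkm_hash->{$id} // 'NA';
    return join( "\t", $id, $uniq, $xloc, $fpkm ) . "\n";
}

# Made-up sample data; only CUFF.3 is known.
my %xloc_hash = ( 'CUFF.3' => 'XLOC_001284' );
my %fpkm_hash = ( 'CUFF.3' => '3.1300591420' );

print format_row( 'CUFF.3',     'chr1:1-2', \%xloc_hash, \%fpkm_hash );
print format_row( 'CUFF.24093', 'chr17:3-4', \%xloc_hash, \%fpkm_hash );
```

This keeps one output row per line of the second file, which can make it easier to spot which IDs were missing.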
Update: I thought I might make some observations/improvements to your code.
my $v_merge_gtf = shift @ARGV or die $!;
my $unique_gtf = shift @ARGV or die $!;
The error variable $! serves no purpose here (this is a fact I only recently discovered even after 14 years using Perl). $! is only set for system calls (where you are involving the operating system). The most common are open and close for files, and opendir and closedir for directories. If an error occurs in opening/closing a file or a directory, $! will contain the error message. (See in my included code how I handled this: I created a message, $usage, to be printed if the shift didn't succeed.)
Instead of using 2 hashes to store the information, I used 1 hash, %data. The advantage is that it will use less memory (because it's only storing 1 set of keys instead of 2), though you could use the 2 if you like.
I used the recommended 3 argument (filehandle, mode, filename) form for opening the files. The 2 argument approach you used is outdated and less safe (for reasons I won't go into detail here). Also, the lexical filehandles I used, my $mrg and my $unique, are the newer way to create filehandles (instead of using FILE for your 2 opens).
You can directly assign to $line in your while loop, like while (my $line = <FILE>), instead of the way you did it. In my sample program, I didn't assign to $line, but instead relied on the default variable $_. (It simplifies the 2 following statements, next unless /\S/; my @array = split /\t/;). I didn't chomp for the first file because you're only parsing inside the string and aren't using anything from the end of the string. chomp is necessary for the second while loop because the second variable, my $uniq = ..., would have a newline at its end if it wasn't removed by chomp.
I didn't know what you meant by this statement, if ($line =~ /[a-z]/). I am assuming you wanted to check for empty lines and only process lines with non-space data. That's why I wrote next unless /\S/; instead. (It says to skip the following statements, go to the top of the while loop and read the next record.)
Your first while loop worked because you had no errors in your input file. If there had been errors, the way you wrote the code could have been a problem.
The statement my $id = $array[8]; gives $id a value that would have been wrongly used if the following if statement had been false. (The same goes for the 2 other variables you want to capture, $fpkm and $xloc.) You can see in my code example how I handled this.
In my code, I died if the match didn't succeed. You might not want to die but say match or next to try the next line of data. It depends on how you would want to handle a failed match.
And in this line, $array[8] =~ /gene_id "(CUFF\S+)";/, note that I put the "; following the captured data, so there is no need to remove it from the captured data (as you did in your substitutions).
Well, I know this is a long comment on your code, but I hope you get some good ideas about why I recommended the changes given.
or die "Could not find ID in $v_merge_gtf (line# $.)";
$. is the line number of the file being read.
#!/usr/bin/perl
use warnings;
use strict;
my $usage = "USAGE: perl $0 merge_gtf_file unique_gtf_file\n";
my $v_merge_gtf = shift @ARGV or die $usage;
my $unique_gtf = shift @ARGV or die $usage;
my %data;
open my $mrg, '<', $v_merge_gtf or die $!;
while (<$mrg>) {
next unless /\S/;
my @array = split /\t/;
if ($array[2] eq 'exon') {
$array[8] =~ /gene_id "(CUFF\S+)";/
or die "Could not find ID in $v_merge_gtf (line# $.)";
my $id = $1;
$array[8] =~ /FPKM "(\S+)";/
or die "Could not find FPKM in $v_merge_gtf (line# $.)";
my $fpkm = $1;
$array[17] =~ /gene_id "(XLOC\S+)";/
or die "Could not find XLOC in $v_merge_gtf (line# $.)";
my $xloc = $1;
$data{$id}{fpkm} = $fpkm;
$data{$id}{xloc} = $xloc;
}
}
close $mrg or die $!;
open my $unique, '<', $unique_gtf or die $!;
while (<$unique>) {
next unless /\S/;
chomp;
my ($id, $uniq) = split /\t/;
print join("\t", $id, $uniq, $data{$id}{xloc}, $data{$id}{fpkm}), "\n";
}
close $unique or die $!;

How to store file content sentence by sentence in an array

I want to open a file, and store its content in an array and make changes to each sentence one at a time and then print the output of the file.
I have something like this:
open (FILE , $file);
my @lines = split('.' , <FILE>)
close FILE;
for (@lines) {
s/word/replace/g;
}
open (FILE, ">$file");
print FILE @lines;
close FILE;
For some reason, perl doesn't like this and won't output any content into the new file. It seems to not like me splitting up the array. Can someone give me an explanation why perl does this and a possible fix? Thanks!
split needs a regexp. Change split('.' , <FILE>) to split(/\./ , <FILE>)
Change my @lines = split('.' , <FILE>) to my @lines = split('\.' , <FILE>)
An unescaped . in a regex matches any single character, so you need to escape it to split on a full stop.
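A small sketch of the difference, using a made-up string rather than the asker's file. split treats its first argument as a pattern even when it is written as a quoted string, so an unescaped '.' makes every character a separator:

```perl
use strict;
use warnings;

my $text = "One. Two. Three.";

# Unescaped: every character is a separator, so all fields are empty,
# and split drops trailing empty fields, leaving an empty list.
my @unescaped = split '.', $text;

# Escaped: split only on literal full stops.
my @escaped = split /\./, $text;

printf "unescaped: %d fields, escaped: %d fields\n",
    scalar @unescaped, scalar @escaped;
```

This is why the original script wrote nothing to the output file: the array it printed was empty.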
#!/usr/local/bin/perl
use strict;
use warnings;
my $filename = "somefile.txt";
my $contents = do { local(@ARGV, $/) = $filename; <> };
my @lines = split '\.', $contents;
foreach(@lines){
# @lines is an array which contains one sentence at each index.
}
What I found is that the second line of your script is missing a semicolon (;), which is a syntax error. Also, your script cannot handle the content of the entire file: it will process only one line. Please find a modified version of your script below. If you need any clarification, please let me know.
my $file='test.txt';#input file name
open (FILE , $file);
# my @lines = split('\.' ,<FILE>); this will not process the entire content of the file.
my @lines;
while(<FILE>) {
s/word/replace/g;
push(@lines,$_);
}
close FILE;
open (FILE, ">$file");
print FILE @lines;
close FILE;
You have lots of problems in your code.
my @lines = split('.' , <FILE>) will just read the first line and split it.
split('.' should be split(/\./
my @lines = split('.' , <FILE>) no semicolon terminator.
print FILE @lines; - you have lost all your full stops!
Finally I have to wonder why you are bothered about 'sentences' at all when you are just replacing one word. If you really want to read one sentence at a time (presumably to do some kind of sentence based processing) then you need to change the input record separator variable $/. For example:
#!/usr/bin/perl
use strict;
use warnings;
my $file = "data.txt";
open (FILE , $file);
my @buffer;
$/ = '.'; # Change the Input Separator to read one sentence at a time.
# Be careful though, it won't work for questions ending in ?
while ( my $sentence = <FILE> ) {
$sentence =~ s/word/replace/g;
push @buffer, $sentence;
}
close FILE;
.. saving to file is left for you to solve.
However if you just want to change the strings you can read the whole file in one gulp by setting $/ to undef. Eg:
#!/usr/bin/perl
use strict;
use warnings;
my $file = "data.txt";
open (FILE , $file);
$/ = undef; # Slurp mode!
my $buffer = <FILE>;
close FILE;
$buffer =~ s/word/replace/g;
open (FILE, ">$file");
print FILE $buffer;
close FILE;
If you are really looking to process sentences and you want to get questions then you probably want to slurp the whole file and then split it, but use a capture in your regex so that you don't lose the punctuation. Eg:
#!/usr/bin/perl
use strict;
use warnings;
my $file = "data.txt";
open (FILE , $file);
$/ = undef; # slurp!
my $buffer = <FILE>;
close FILE;
open (FILE, ">$file" . '.new'); # Don't want to overwrite my input.
foreach my $sentence (split(/([.?]+)/, $buffer)) # split uses () to capture punctuation.
{
$sentence =~ s/word/replace/g;
print FILE $sentence;
}
close FILE;
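The capture trick can also be tried on a plain string without touching any files. A minimal sketch, assuming a made-up sample sentence; replace_in_sentences is a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: apply a substitution sentence by sentence while
# keeping the sentence-ending punctuation.
sub replace_in_sentences {
    my ($text) = @_;
    # Because the pattern has capturing parentheses, split returns the
    # separators (the punctuation) as list elements too.
    my @pieces = split /([.?]+)/, $text;
    s/word/replace/g for @pieces;    # edits each element in place
    return join '', @pieces;
}

print replace_in_sentences("A word here. Another word? Done."), "\n";
```

Joining the pieces back together restores the original punctuation, which is lost when splitting without the capture group.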

Perl find and replace multiple(huge) strings in one shot

Based on a mapping file, I need to search for a string and, if it is found, append the replacement string to the end of the line.
I'm traversing the mapping file line by line and appending the strings with the Perl one-liner below.
Issues:
1. Huge number of find & replace entries: the mapping file has a huge number of entries (~7000), and the Perl one-liner takes ~1 second for each entry, which boils down to ~1 hour to complete the entire replacement.
2. Not a simple find and replace: if the string is found, the replacement string is appended to the end of the line.
If there is no efficient way to process this, I would even consider replacing rather than appending.
I'm on a Windows 7 64-bit environment and I'm using ActivePerl. No *nix support.
File Samples
Map.csv
findStr1,RplStr1
findStr2,RplStr2
findStr3,RplStr3
.....
findStr7000,RplStr7000
input.csv
col1,col2,col3,findStr1,....col-N
col1,col2,col3,findStr2,....col-N
col1,col2,col3,FIND-STR-NOT-EXIST,....col-N
output.csv (Expected Output)
col1,col2,col3,findStr1,....col-N,**RplStr1**
col1,col2,col3,findStr2,....col-N,**RplStr2**
col1,col2,col3,FIND-STR-NOT-EXIST,....col-N
Perl Code Snippet
One-Liner
perl -pe '/findStr/ && s/$/RplStr/' file.csv
open( INFILE, $MarketMapFile ) or die "Error occured: $!";
my @data = <INFILE>;
my $cnt=1;
foreach $line (@data) {
eval {
# Remove end of line character.
$line =~ s/\n//g;
my ( $eNodeBID, $MarketName ) = split( ',', $line );
my $exeCmd = 'perl -i.bak -p -e "/'.$eNodeBID.'\(M\)/ && s/$/,'.$MarketName.'/;" '.$CSVFile;
print "\n $cnt Repelacing $eNodeBID with $MarketName and cmd is $exeCmd";
system($exeCmd);
$cnt++;
}
}
close(INFILE);
To do this in a single pass through your input CSV, it's easiest to store your mapping in a hash. 7000 entries is not particularly huge, but if you're worried about storing all of that in memory you can use Tie::File::AsHash.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
use Tie::File::AsHash;
tie my %replace, 'Tie::File::AsHash', 'map.csv', split => ',' or die $!;
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, eol => $/ })
or die Text::CSV->error_diag;
open my $in_fh, '<', 'input.csv' or die $!;
open my $out_fh, '>', 'output.csv' or die $!;
while (my $row = $csv->getline($in_fh)) {
push @$row, $replace{$row->[3]};
$csv->print($out_fh, $row);
}
untie %replace;
close $in_fh;
close $out_fh;
map.csv
foo,bar
apple,orange
pony,unicorn
input.csv
field1,field2,field3,pony,field5,field6
field1,field2,field3,banana,field5,field6
field1,field2,field3,apple,field5,field6
output.csv
field1,field2,field3,pony,field5,field6,unicorn
field1,field2,field3,banana,field5,field6,
field1,field2,field3,apple,field5,field6,orange
I don't recommend screwing up your CSV format by only appending fields to matching lines, so I add an empty field if a match isn't found.
To use a regular hash instead of Tie::File::AsHash, simply replace the tie statement with
open my $map_fh, '<', 'map.csv' or die $!;
my %replace = map { chomp; split /,/ } <$map_fh>;
close $map_fh;
This is untested code that you may still need to polish:
use strict;
use warnings;
my ( $mapfile, $inputfile ) = ( 'Map.csv', 'input.csv' );
# load the search and replace strings into memory
open( my $mapfh, '<', $mapfile ) or die qq{Unable to open "$mapfile": $!};
my %maplines;
while ( my $mapline = <$mapfh> ) {
chomp $mapline; # remove the newline from the replace string
my ( $findstr, $replstr ) = split( /,/, $mapline );
$maplines{$findstr} = $replstr;
}
close $mapfh;
open( my $ifh, '<', $inputfile ) or die qq{Unable to open "$inputfile": $!};
while ( my $inputline = <$ifh> ) { # read an input line
chomp $inputline; # remove the newline
my @input = split( /,/, $inputline ); # split it into a list
if ( defined $input[3] && exists $maplines{ $input[3] } ) { # does this line match?
push @input, $maplines{ $input[3] }; # add the replace str to the end
}
print join( ',', @input ), "\n"; # or print to an output file
}
close($ifh);

how to extract substrings by knowing the coordinates

I am terribly sorry for bothering you with my problem in several questions, but I need to solve it...
I want to extract several substrings from a file which contains a string, using another file that holds the beginning and the end of each substring that I want to extract.
The first file is like:
>scaffold30 24194
CTTAGCAGCAGCAGCAGCAGTGACTGAAGGAACTGAGAAAAAGAGCGAGCTGAAAGGAAGCATAGCCATTTGGGAGTGCCAGAGAGTTGGGAGG GAGGGAGGGCAGAGATGGAAGAAGAAAGGCAGAAATACAGGGAGATTGAGGATCACCAGGGAG.........
.................
(the string must be everything in the file except the first line), and the coordinates file is like:
44801988 44802104
44846151 44846312
45620133 45620274
45640443 45640543
45688249 45688358
45729531 45729658
45843362 45843490
46066894 46066996
46176337 46176464
.....................
my script is this:
my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];
#finds subsequences: fasta files
open INFILE1, $chrom or die "Could not open $chrom: $!";
my $count = 0;
while(<INFILE1>) {
if ($_ !~ m/^>/) {
local $/ = undef;
my $var = <INFILE1>;
open INFILE, $coords_file or die "Could not open $coords_file: $!";
my @cline = <INFILE>;
foreach my $cline (@cline) {
print "$cline\n";
my @data = split('\t', $cline);
my $start = $data[0];
my $end = $data[1];
my $offset = $end - $start;
$count++;
my $sub = substr ($var, $start, $offset);
print ">conserved $count\n";
print "$sub\n";
}
close INFILE;
}
}
When I run it, it looks like it does only one iteration, and it prints the start of the first file.
It seems like the foreach loop doesn't work.
Also, substr doesn't seem to work.
When I put in an exit to print the cline and check the loop, it prints all the lines of the file with the coordinates.
I am sorry if I become annoying, but I must finish it and I am a little bit desperate...
Thank you again.
This line
local $/ = undef;
changes $/ for the entire enclosing block, which includes the section where you read in your second file. $/ is the input record separator, which essentially defines what a "line" is (it is a newline by default, see perldoc perlvar for details). When you read from a filehandle using <>, $/ is used to determine where to stop reading. For example, the following program relies on the default line-splitting behavior, and so only reads until the first newline:
use feature 'say';
my $foo = <DATA>;
say $foo;
# Output:
# 1
__DATA__
1
2
3
Whereas this program reads all the way to EOF:
use feature 'say';
local $/;
my $foo = <DATA>;
say $foo;
# Output:
# 1
# 2
# 3
__DATA__
1
2
3
This means your @cline array gets only one element, which is a string containing the text of your entire coordinates file. You can see this using Data::Dumper:
use Data::Dumper;
print Dumper(\@cline);
Which in your case will output something like:
$VAR1 = [
'44801988 44802104
44846151 44846312
45620133 45620274
45640443 45640543
45688249 45688358
45729531 45729658
45843362 45843490
46066894 46066996
46176337 46176464
'
];
Notice how your array (technically an arrayref in this case), delineated by [ and ], contains only a single element, which is a string (delineated by single quotes) that contains newlines.
Let's walk through the relevant sections of your code:
while(<INFILE1>) {
if ($_ !~ m/^>/) {
# Enable localized slurp mode. Stays in effect until we leave the 'if'
local $/ = undef;
# Read the rest of INFILE1 into $var (from current line to EOF)
my $var = <INFILE1>;
open INFILE, $coords_file or die "Could not open $coords_file: $!";
# In list context, return each block until the $/ character as a
# separate list element. Since $/ is still undef, this will read
# everything until EOF into our first list element, resulting in
# a one-element array
my @cline = <INFILE>;
# Since @cline only has one element, the loop only has one iteration
foreach my $cline (@cline) {
As a side note, your code could be cleaned up a bit. The names you chose for your filehandles leave something to be desired, and you should probably use lexical filehandles anyway (and the three-argument form of open):
open my $chromosome_fh, "<", $ARGV[0] or die $!;
open my $coordinates_fh, "<", $ARGV[1] or die $!;
Also, you do not need to nest your loops in this case, it just makes your code more convoluted. First read the relevant parts of your chromosome file into a variable (named something more meaningful than var):
# Get rid of the `local $/` statement, we don't need it
my $chromosome;
while (<$chromosome_fh>) {
next if /^>/;
$chromosome .= $_;
}
Then read in your coordinates file:
my @cline = <$coordinates_fh>;
Or if you only need to use the contents of the coordinates file once, process each line as you go using a while loop:
while (<$coordinates_fh>) {
# Do something for each line here
}
As 'ThisSuitIsBlackNot' suggested, your code could be cleaned up a little. Here is a possible solution that may be what you want.
#!/usr/bin/perl
use strict;
use warnings;
my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];
#finds subsequences: fasta files
open INFILE1, $chrom or die "Could not open $chrom: $!";
my $fasta;
<INFILE1>; # get rid of the first line - '>scaffold30 24194'
while(<INFILE1>) {
chomp;
$fasta .= $_;
}
close INFILE1 or die "Could not close '$chrom'. $!";
open INFILE, $coords_file or die "Could not open $coords_file: $!";
my $count = 0;
while(<INFILE>) {
my ($start, $end) = split;
# Or, should this be: my $offset = $end - ($start - 1);
# That would include the start fasta
my $offset = $end - $start;
$count++;
my $sub = substr ($fasta, $start, $offset);
print ">conserved $count\n";
print "$sub\n";
}
close INFILE or die "Could not close '$coords_file'. $!";
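Whether the length should be $end - $start or $end - ($start - 1), as the comment in the code above asks, depends on the coordinate convention of the coordinates file, which the question does not state. The sketch below assumes 1-based coordinates with both ends included; that convention and the helper name extract_inclusive are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: extract positions $start..$end from $seq, assuming
# 1-based coordinates with both ends included. substr itself is 0-based,
# hence the "- 1" on the offset and the "+ 1" on the length.
sub extract_inclusive {
    my ( $seq, $start, $end ) = @_;
    return substr $seq, $start - 1, $end - $start + 1;
}

# Positions 3 to 5 of a made-up ten-letter string.
print extract_inclusive( 'ABCDEFGHIJ', 3, 5 ), "\n";    # prints CDE
```

If the coordinates are 0-based or end-exclusive instead, the offset and length need adjusting accordingly, so it is worth checking one known region by hand.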

Perl - adding new line and tab characters after a fixed number of characters in a file?

I have a Perl question. I have a file, each line of which contains a different number of As, Ts, Gs and Cs.
The file looks like below
ATCGCTGASTGATGCTG
GCCTAGCCCTTAGC
GTTCCATGCCCATAGCCAAATAAA
I would like to add a line number for each line,
then insert a \n every 6 characters, and then, on each of the new rows created, put an
empty space every 3 characters.
Example of the output should be
Line NO 1
ATC GCT
GAS TGA
TGC TG
Line NO 2
GCC TAG
CCC TTA
GC
I have come up with the code below:
my $count = 0;
my $line;
my $row;
my $split;
open(F, "Data.txt") or die "Can't read file: $!";
open (FH, " > UpDatedData.txt") or die "Can't write new file: $!";
while (my $line = <F>) {
$count ++ ;
$row = join ("\n", ( $line =~ /.{1,6}/gs));
$split = join ("\t", ( $row =~ /.{3}/gs ));
print FH "Line NO\t$count\n$split\n";
}
close F;
close FH;
However, it gives the following output:
Line NO 1
ATC GCT
GA STG A
T GCT G
Line NO 2
GCC TAG
CC CTT A
G C
This must have something to do with the \n being counted as a character in this line of code
$split = join ("\t", ( $row =~ /.{3}/gs ));
Any one got any idea how to get around this problem?
Any help would be greatly appreciated.
Thanks in advance
Sinead
This should solve your problem:
use strict;
use warnings;
while (<DATA>) {
s/(.{3})(.{0,3})?/$1 $2 /g;
s/(.{7}) /$1\n/g;
printf "Line NO %d\n%s\n", $., $_;
}
__DATA__
ATCGCTGASTGATGCTG
GCCTAGCCCTTAGC
GTTCCATGCCCATAGCCAAATAAA
This is a one-liner:
perl -plwe 's/(.{3})(.{0,3})/$1 $2\n/g' data.txt
The regex looks for 3 characters (does not match newline), followed by 0-3 characters and captures both of those, then inserts a space between them and newline after.
To keep track of the line numbers, you can add
s/^/Line NO $.\n/;
Which will enumerate based on input line number. If you prefer, you can keep a simple counter, such as ++$i.
-l option will handle newlines for you.
You can also do it in two stages, like so:
perl -plwe's/.{6}\K/\n/g; s/^.{3}\K/ /gm;'
Using the \K (keep) escape sequence here to keep the matched part of the string, and then simply inserting a newline after 6 characters, and then a space 3 characters after "line beginnings", which with the /m modifier also includes newlines.
So, in short:
perl -plwe 's/.{6}\K/\n/g; s/^.{3}\K/ /gm; s/^/Line NO $.\n/;' data.txt
perl -plwe 's/(.{3})(.{0,3})/$1 $2\n/g; s/^/Line NO $.\n/;' data.txt
Another solution. Note that it uses lexical filehandles and three argument form of open.
#!/usr/bin/perl
use warnings;
use strict;
open my $IN, '<', 'Data.txt' or die "Can't read file: $!";
open my $OUT, '>', 'UpDatedData.txt' or die "Can't write new file: $!";
my $count = 0;
while (my $line = <$IN>) {
chomp $line;
$line =~ s/(...)(...)/$1 $2\n/g; # Create pairs of triples
$line =~ s/(\S\S\S)(\S{1,2})$/$1 $2\n/; # A triple plus something at the end.
$line .= "\n" if $line !~ /\n$/; # A triple or less at the end.
$count++;
print $OUT "Line NO\t$count\n$line\n";
}
close $OUT;
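The problem described in the question, where the inserted \n is counted as a character when the second pattern runs, can also be avoided by splitting each line into rows first and only then spacing each row on its own. A minimal sketch along those lines; format_line is a hypothetical helper name, and it uses a space between the triples as in the expected output:

```perl
use strict;
use warnings;

# Hypothetical helper: format one input line as numbered rows of up to
# six characters, with a space after the first three characters of a row.
sub format_line {
    my ( $count, $line ) = @_;
    chomp $line;
    my @rows = $line =~ /.{1,6}/g;     # chunks of up to six characters
    # Space after the first triple, but only when more characters follow.
    s/^(.{3})(?=.)/$1 / for @rows;
    return join "\n", "Line NO $count", @rows, '';
}

my $count = 0;
for my $line ( "ATCGCTGASTGATGCTG\n", "GCCTAGCCCTTAGC\n" ) {
    print format_line( ++$count, $line );
}
```

Since each row is handled separately, no newline is ever present when the space is inserted, so the triples stay aligned.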