Extracting information from a .gtf to a new text file using Perl

I have the .gtf file below, and I need to extract only 4 fields (chromosome, start/stop codons, and transcript ID):
1 Cufflinks transcript 11869 14412 1000 + . gene_id "CUFF.1"; transcript_id "CUFF.1.2"; FPKM "0.3750000000"; frac "0.000000"; conf_lo "0.375000"; conf_hi "0.375000"; cov "1.470346"; full_read_support "yes";
1 Cufflinks transcript 11869 14412 444 + . gene_id "CUFF.1"; transcript_id "CUFF.1.3"; FPKM "0.1666666667"; frac "0.000000"; conf_lo "0.166667"; conf_hi "0.166667"; cov "0.653487"; full_read_support "yes";
2 Cufflinks transcript 11869 14412 333 + . gene_id "CUFF.1"; transcript_id "CUFF.1.4"; FPKM "0.1250000000"; frac "0.000000"; conf_lo "0.125000"; conf_hi "0.125000"; cov "0.490115"; full_read_support "yes";
My question is: how does a script know which file to work on?
Do you use:
(1) my $file = 'transcripts_selected.gtf'
(2) Also, can this approach be used to extract the selected data:
say $data->{"chromosome_number"}->{"start_codon"}->{"stop_codon"}->{"transcript_id"};
or should a
BioSeq->new(-chromosome_number, -start_codon, ...) style method be used?
(3) Finally, this script is taken from the BioPerl HOWTO:
my $seq_in  = Bio::SeqIO->new( -file => "<$infile",  -format => $infileformat );
my $seq_out = Bio::SeqIO->new( -file => ">$outfile", -format => $outfileformat );
while (my $inseq = $seq_in->next_seq) {
    $seq_out->write_seq($inseq);
}
Where it says $infile/$outfile, should the name of the .gtf file go into $infile and the name of the new file with the selected data replace $outfile?

The easiest way to specify the filenames is to write something like:
my $infile = shift;
my $outfile = shift;
above the code block from the HOWTO, then type:
perl ScriptName transcripts_selected.gtf OutFileName
at the command line.
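For the actual extraction, BioPerl is not strictly needed. Here is a minimal sketch that just splits the tab-separated GTF columns; the script and output filenames are examples only, and it assumes the transcript ID always appears as transcript_id "..." in the ninth field:

#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl extract_gtf.pl transcripts_selected.gtf selected_columns.txt
# (extract_gtf.pl and selected_columns.txt are example names only)
my $infile  = shift or die "Usage: $0 <in.gtf> <out.txt>\n";
my $outfile = shift or die "Usage: $0 <in.gtf> <out.txt>\n";

open my $in,  '<', $infile  or die "Cannot open $infile: $!";
open my $out, '>', $outfile or die "Cannot open $outfile: $!";

while (my $line = <$in>) {
    chomp $line;
    next if $line =~ /^#/;              # skip comment lines
    my @fields = split /\t/, $line;     # GTF columns are tab-separated
    next if @fields < 9;                # skip malformed lines
    my ($chrom, $start, $stop, $attributes) = @fields[0, 3, 4, 8];
    # pull the transcript_id out of the attribute string
    my ($transcript_id) = $attributes =~ /transcript_id "([^"]+)"/;
    $transcript_id //= 'NA';            # in case a line has no transcript_id
    print {$out} join("\t", $chrom, $start, $stop, $transcript_id), "\n";
}

close $in;
close $out;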

Add new hash keys and then print in a new file

Previously, I posted a question asking how to use a regex to match specific sequence identifiers (IDs).
Now I'm looking for recommendations on how to print the data I'm after.
If you want to see the complete file, here's a GitHub link.
The script takes two files. The first file looks something like this (this is only part of the file):
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 2 2 0.0804934 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 4 4 0.0925522 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 13 13 0.0250116 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 23 23 0.565981 . .
...
This file tells me when there is a value >= 0.5; this information is in the sixth column. When that happens, my script takes the first column (an ID, to match against the second file) and the fourth column (the position of a letter in the second file).
Here is my second file (again, only a part):
>AGY29650.2|NA spike protein
MTYSVFPLMCLLTFIGANAKIVTLPGNDA...EEYDLEPHKIHVH*
As I said, the script uses the ID from the first file to find the matching ID in the second file, and when they are the same it looks up the position (fourth column) in the sequence data.
Here is an example: in file one, the fourth row has a positive value (>= 0.5) and the position in the fourth column is 23.
The script then looks up position 23 in the sequence of the second file; here position 23 is the letter T:
MTYSVFPLMCLLTFIGANAKIV T LP
When the script finds that letter, it takes the 2 letters to the right and the 2 letters to the left of the position of interest:
IVTLP
In a previous post, thanks to the help of some people on Stack Overflow, I was able to solve the problem of the IDs differing between the two files (e.g. AGY29650_2_NA in file one vs. AGY29650.2 in file two).
Now I'm looking for help to produce the output I need and complete the script.
The script is incomplete because I couldn't find a way to print the output of interest: the 5 letters in the second file (the letter at the position given in file one, plus the 2 letters to its right and the 2 to its left).
I have thousands of files like these two, so any idea you can recommend to complete the script is welcome.
Here is the script:
use strict;
use warnings;
use Bio::SeqIO;

my $file = $ARGV[0];
my $in   = $ARGV[1];
my %fastadata = ();
my @array_residues = ();
my $seqio_obj = Bio::SeqIO->new( -file   => $in,
                                 -format => "fasta" );
while (my $seq_obj = $seqio_obj->next_seq) {
    my $dd = $seq_obj->id;
    my $ss = $seq_obj->seq;
    ### my $ee = $seq_obj->desc;
    $fastadata{$dd} = "$ss";
}

my $thres = 0.5;    ### select values in the sixth column that are >= 0.5

# Open file
open(F, $file) or die;    ### open the file or stop the analysis
while (my $one = <F>) {   ### read a line from F
    $one =~ s/\n//g;
    $one =~ s/\r//g;
    my @cols = split(/\s+/, $one);       ### split columns
    next unless (scalar(@cols) == 7);    ### the line must have 7 columns to be processed
    my $val = $cols[5];

    if ($val >= 0.5) {
        my $position = $cols[3];
        my $id_list  = $cols[0];
        $id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
        if (exists($fastadata{$id_list})) {
            my $new_seq     = $fastadata{$id_list};
            my $subresidues = substr($new_seq, $position - 3, 6);
        }
    }
}
close F;
I'm thinking of adding a push to collect the new data and then printing it to a new file.
My expected output is the letter at the position of a positive value (>= 0.5), in this case T (position 23), plus the 2 letters to its right and the 2 to its left.
With the example data on GitHub (link above), the expected output is:
IVTLP
Any recommendation or help is welcome.
Thanks!
The main problem seems to be that each line has 8 columns, not 7 as assumed in the script. Another small issue is that the extracted substring should have 5 characters, not 6. Here is a modified version of the loop that works for me:
open(F, $file) or die;    ### open the file or stop the analysis
while (my $one = <F>) {   ### read a line from F
    chomp $one;
    my @cols = split(/\s+/, $one);     ### split columns
    next unless scalar(@cols) == 8;    ### the line must have 8 columns to be processed
    my $val = $cols[5];
    if ($val >= 0.5) {
        my $position = $cols[3];
        my $id_list  = $cols[0];
        $id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
        if (exists($fastadata{$id_list})) {
            my $new_seq     = $fastadata{$id_list};
            my $subresidues = substr($new_seq, $position - 3, 5);
            print $subresidues, "\n";
        }
    }
}
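To collect the matches with push and write them to a new file, as the question suggests, one minimal sketch (the filename subresidues.txt is only an example) is:

use strict;
use warnings;

# Collect the 5-letter windows in an array as they are found,
# then write them all to a new file after the loop.
my @subresidues;

# In the real script this push would sit inside the inner if-block,
# right where $subresidues is computed:
push @subresidues, 'IVTLP';    # placeholder value for illustration

# After the while loop:
open(my $out, '>', 'subresidues.txt') or die "Cannot write subresidues.txt: $!";
print {$out} "$_\n" for @subresidues;
close $out;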

Add dipeptide frequency in this Perl script based on sequence length

I have a Perl script to get the dipeptide counts (there are 400 combinations, for example AA, AC, AD, AE, ...) from sequences in FASTA format, but I would like to add the frequency based on the sequence length. I have an input with multiple sequences (myfile.fasta).
I tried to do it, but I got the wrong results. I am not very familiar with Perl.
My script:
use strict;
use warnings;
use Bio::SeqIO;

my @amino = qw/A C D E F G H I K L M N P Q R S T V W Y/;
my @comb  = ();
foreach my $a (@amino) {
    foreach my $b (@amino) {
        push(@comb, $a . $b);
    }
}
my $in = Bio::SeqIO->new(-file => "myfile.fasta", '-format' => 'Fasta');
while (my $seq = $in->next_seq) {
    my @dipeps   = ($seq->seq() =~ /(?=(.{2}))/g);
    my %di_count = ();
    $di_count{$_}++ for @dipeps;
    print $seq->id();
    map { exists $di_count{$_} ? print " ", $di_count{$_} : print " ", 0 } sort @comb;
    print "\n";
}
I tried:
map{exists $di_count{$_}?print " ",$di_count{$_}:print " ",0}sort @comb/length;
map{exists $di_count{$_}?print " ",$di_count{$_}:print " ",0/length}sort @comb;
I also tried to define the length, such as:
my $seq_len = length($seq);
Also, I do not want to hard-code the input file in the script; I would like to call it like "perl script.pl input.fasta > result.txt". For that, should I use:
open (S, "$ARGV[0]") || die "cannot open FASTA file to read: $!";
This is pretty ugly code (should be rewritten entirely), but I think you want:
my $length = @dipeps;
map{exists $di_count{$_}?print " ",$di_count{$_}/$length:print " ",0}sort @comb;
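If the whole thing is rewritten, a sketch that reads the FASTA file from the command line (perl script.pl input.fasta > result.txt) and prints frequencies could look like the following; it assumes "frequency" means count divided by the number of dipeptide windows (sequence length minus 1):

use strict;
use warnings;
use Bio::SeqIO;

# Sketch of a cleaner version: takes the FASTA file from the command line
# and prints frequencies instead of raw counts. "Frequency" is assumed to
# mean count / number of dipeptide windows; adjust if you want count / length.
my $fasta = shift or die "Usage: $0 <input.fasta>\n";

my @amino = qw/A C D E F G H I K L M N P Q R S T V W Y/;
my @comb;
for my $a (@amino) {
    for my $b (@amino) {
        push @comb, $a . $b;
    }
}

my $in = Bio::SeqIO->new(-file => $fasta, -format => 'Fasta');
while (my $seq = $in->next_seq) {
    my @dipeps = ($seq->seq() =~ /(?=(.{2}))/g);   # overlapping dipeptides
    my %di_count;
    $di_count{$_}++ for @dipeps;
    my $windows = @dipeps || 1;                    # avoid division by zero
    print join(" ", $seq->id(),
               map { ($di_count{$_} // 0) / $windows } sort @comb), "\n";
}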

How to find number of numerical data for each and every line in a file

Please help me to count the numerical data in each line of a file,
and also to find the line length. The code has to be written in Perl.
For example if I have a line such as:
Input: I was born on 24th october,1994.
Output: 2
You could do something like this:
perl -ne 'BEGIN{my $x} $x += () = /[0-9]+/g; END{print($x . "\n")}' file
-n: causes Perl to assume the following loop around your program, which makes it iterate over filename arguments somewhat like sed -n or awk:
LINE:
while (<>) {
... # your program goes here
}
-e: may be used to enter one line of program;
() makes /[0-9]+/g be evaluated in list context (i.e. () = /[0-9]+/g returns an array containing the sequences of one or more digits found in the current line), while $x += makes the result be evaluated again in scalar context (i.e. $x += () = /[0-9]+/g adds the number of digit sequences found in the current line to $x). END{print($x . "\n")} prints $x after the whole file has been processed.
% cat file
string 123 string 1 string string string
456 string
% perl -ne 'BEGIN{my $x} $x += () = /[0-9]+/g; END{print($x . "\n")}' file
3
%
I'd do something like this:
#!/usr/bin/perl
use warnings;
use strict;

my $file = 'num.txt';
open my $fh, '<', $file or die "Failed to open $file: $!\n";
while (my $line = <$fh>) {
    chomp $line;
    my @num = $line =~ /([0-9.]+)/g;
    print "On this line --- " . scalar(@num) . "\n";
}
close($fh);
The input file I tested --
This should say 1
Line 2 should say 2
I want this line to say 5 so I have added 4 other numbers like 0.02 -1 and 5.23
The output as tested ----
On this line --- 1
On this line --- 2
On this line --- 5
Using the regex match ([0-9.]+) will match ANY number and include any decimals (I guess really you could use just ([0-9]+) since you are only counting them and not using the actual number represented).
Hope it helps.
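Since the question also asks for the line length, a small variation of the loop above can print both; this sketch assumes "line length" means the number of characters after chomp and reads the file from the command line:

#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl count_nums.pl file
while (my $line = <>) {
    chomp $line;
    my @num = $line =~ /[0-9]+(?:\.[0-9]+)?/g;    # integers or decimals
    printf "numbers: %d, length: %d\n", scalar(@num), length($line);
}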

How to extract specific columns from different files and output in one file?

I have 12 files in a directory; each file has 4 columns. The first column is a gene name and the other 3 are count columns. I want to extract columns 1 and 4 from each file (12 files in total) and paste them into one output file. Since the first column is the same in every file, the output should contain it only once, followed by the 4th column of each file. I do not want to use R here; I am a big fan of awk. So I was trying something like the command below, but it did not work.
My input files look like this:
Input file 1
ZYG11B 8267 16.5021 2743.51
ZYG11A 4396 0.28755 25.4208
ZXDA 5329 2.08348 223.281
ZWINT 1976 41.7037 1523.34
ZSCAN5B 1751 0.0375582 1.32254
ZSCAN30 4471 4.71253 407.923
ZSCAN23 3286 0.347228 22.9457
ZSCAN20 4343 3.89701 340.361
ZSCAN2 3872 3.13983 159.604
ZSCAN16-AS1 2311 1.1994 50.9903
Input file 2
ZYG11B 8267 18.2739 2994.35
ZYG11A 4396 0.227859 19.854
ZXDA 5329 2.44019 257.746
ZWINT 1976 8.80185 312.072
ZSCAN5B 1751 0 0
ZSCAN30 4471 9.13324 768.278
ZSCAN23 3286 1.03543 67.4392
ZSCAN20 4343 3.70209 318.683
ZSCAN2 3872 5.46773 307.038
ZSCAN16-AS1 2311 3.18739 133.556
Input file 3
ZYG11B 8267 20.7202 3593.85
ZYG11A 4396 0.323899 29.8735
ZXDA 5329 1.26338 141.254
ZWINT 1976 56.6215 2156.05
ZSCAN5B 1751 0.0364084 1.33754
ZSCAN30 4471 6.61786 596.161
ZSCAN23 3286 0.79125 54.5507
ZSCAN20 4343 3.9199 357.177
ZSCAN2 3872 5.89459 267.58
ZSCAN16-AS1 2311 2.43055 107.803
Desired output from above
ZYG11B 2743.51 2994.35 3593.85
ZYG11A 25.4208 19.854 29.8735
ZXDA 223.281 257.746 141.254
ZWINT 1523.34 312.072 2156.05
ZSCAN5B 1.32254 0 1.33754
ZSCAN30 407.923 768.278 596.161
ZSCAN23 22.9457 67.4392 54.5507
ZSCAN20 340.361 318.683 357.177
ZSCAN2 159.604 307.038 267.58
ZSCAN16-AS1 50.9903 133.556 107.803
As you can see above, the output keeps the first column (which is the same in every file) only once, and the rest of the output is the 4th column of each file. I have only shown 3 files here. It should work for all the files in the directory at once, since they all follow the same naming convention: file1_quant.genes.sf, file2_quant.genes.sf, file3_quant.genes.sf.
Every file has the same first column but different counts in the other columns. My idea is to create one output file that has the 1st column once followed by the 4th column from all the files.
awk '{print $1,$2,$4}' *_quant.genes.sf > genes.estreads
Any heads up?
If I understand you correctly, what you're looking for is one line per key, collated from multiple files.
The tool you need for this job is an associative array. I think awk can do it, but I'm not 100% sure. I'd probably tackle it in Perl though:
#!/usr/bin/perl
use strict;
use warnings;

# an associative array, or hash as perl calls it
my %data;

# iterate the input files (sort might be irrelevant here)
foreach my $file ( sort glob("*_quant.genes.sf") ) {
    # open the file for reading.
    open( my $input, '<', $file ) or die $!;
    # iterate line by line.
    while (<$input>) {
        # extract the data - splitting on any whitespace.
        my ( $key, @values ) = split;
        # add 'column 4' to the hash (of arrays)
        push( @{ $data{$key} }, $values[2] );
    }
    close($input);
}

# start output
open( my $output, '>', 'genes.estreads' ) or die;
# sort, because hashes are explicitly unordered.
foreach my $key ( sort keys %data ) {
    # print the key and all the elements collected.
    print {$output} join( "\t", $key, @{ $data{$key} } ), "\n";
}
close($output);
With the data as specified above, this produces:
ZSCAN16-AS1 50.9903 133.556 107.803
ZSCAN2 159.604 307.038 267.58
ZSCAN20 340.361 318.683 357.177
ZSCAN23 22.9457 67.4392 54.5507
ZSCAN30 407.923 768.278 596.161
ZSCAN5B 1.32254 0 1.33754
ZWINT 1523.34 312.072 2156.05
ZXDA 223.281 257.746 141.254
ZYG11A 25.4208 19.854 29.8735
ZYG11B 2743.51 2994.35 3593.85
The following is how you do it in awk:
awk 'BEGIN{FS = " "};{print $1, $4}' *|awk 'BEGIN{FS = " "};{temp = x[$1];x[$1] = temp " " $2;};END {for(xx in x) print xx,x[xx]}'
As cryptic as it looks, I am just using associative arrays.
Here is the solution broken down:
Just print the key and the value, one per line.
print $1, $2
Store the data in an associative array, keep updating
temp = x[$1];x[$1] = temp " " $2;}
Display it:
for(xx in x) print xx,x[xx]
Sample run:
[cloudera@quickstart test]$ cat f1
A k1
B k2
[cloudera@quickstart test]$ cat f2
A k3
B k4
C k1
[cloudera@quickstart test]$ awk 'BEGIN{FS = " "};{print $1, $2}' *|awk 'BEGIN{FS = " "};{temp = x[$1];x[$1] = temp " " $2;};END {for(xx in x) print xx,x[xx]}'
A k1 k3
B k2 k4
C k1
As a side note, the approach should be reminiscent of the Map Reduce paradigm.
awk '{E[$1]=E[$1] "\t" $4}END{for(K in E)print K E[K]}' *_quant.genes.sf > genes.estreads
The order is the order of appearance when reading the files (so normally based on the first file read).
If the first column is the same in all the files, you can use paste:
paste <(tabify f1 | cut -f1,4) \
<(tabify f2 | cut -f4) \
<(tabify f3 | cut -f4)
Where tabify changes consecutive spaces to tabs:
sed 's/ \+/\t/g' "$@"
and f1, f2, f3 are the input files' names.
Here's another way to do it in Perl:
perl -lane '$data{$F[0]} .= " $F[3]"; END { print "$_ $data{$_}" for keys %data }' input_file_1 input_file_2 input_file_3
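A minor variation of the same one-liner, if you want the keys sorted and tab-separated (the filenames here just follow the naming convention from the question):

perl -lane '$data{$F[0]} .= "\t$F[3]"; END { print "$_$data{$_}" for sort keys %data }' file1_quant.genes.sf file2_quant.genes.sf file3_quant.genes.sf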
Here's another way of doing it with awk; it supports multiple files.
awk 'FNR==1{f++}{a[f,FNR]=$1}{b[f,FNR]=$4}END { for(x=1;x<=FNR;x++){printf("%s ",a[1,x]);for(y=0;y<=ARGC;y++)printf("%s ",b[y,x]);print ""}}' input1.txt input2.txt input3.txt
That line of code gives the following output:
ZYG11B 2743.51 2994.35 3593.85
ZYG11A 25.4208 19.854 29.8735
ZXDA 223.281 257.746 141.254
ZWINT 1523.34 312.072 2156.05
ZSCAN5B 1.32254 0 1.33754
ZSCAN30 407.923 768.278 596.161
ZSCAN23 22.9457 67.4392 54.5507
ZSCAN20 340.361 318.683 357.177
ZSCAN2 159.604 307.038 267.58
ZSCAN16-AS1 50.9903 133.556 107.803

Perl: Prompt appears to cause perl code to exit without giving correct output

December 10, 2014
Can someone kindly help me resolve this issue, where the character '>' causes the Perl program to exit prematurely when run against a remote Windows server?
The actual output is:
K:\ Volume in drive K is DataDisk
Volume Serial Number is E8BD-C593
Directory of K:\
04/15/2011 05:25 AM <DIR
The expected output is:
K:\>dir
Volume in drive K is DataDisk
Volume Serial Number is E8BD-C593
Directory of K:\
12/08/2014 11:18 PM <DIR> ftpvol
04/15/2011 05:25 AM <DIR> Images
1 File(s) 0 bytes
16 Dir(s) 246,180,012,032 bytes free
Here is the script:
#!/usr/bin/perl
use Net::Telnet ();
my $node = $ARGV[0];
my $ipAddress = $ARGV[1];
my $username = $ARGV[2];
my $password = $ARGV[3];
my $mmlCommand0 = "hostname&prcstate -l";
my $filedate = `date +%Y%m%d`; #date in format YYYYMMDD
chomp($filedate); #deletes newline character at end
my $numArgs = $#ARGV + 1;
if ($numArgs == 4) {
    my $telnet = new Net::Telnet( Host=>$ipAddress, Port=>23, Timeout=>20, Errmode=>'die', Prompt=>'/>/');
    $telnet->open() or die "hai $telnet->errmsg ";
    $telnet->waitfor('/login name:/');
    $telnet->print($username);
    $telnet->waitfor('/password:/');
    $telnet->print($password);
    $telnet->waitfor('/Windows NT Domain:/');
    $telnet->print("");
    $telnet->waitfor('/>/');
    ## get printouts
    #print $telnet->cmd($mmlCommand0);
    print $telnet->cmd("K:");
    print $telnet->cmd("dir");
}
else {
    print "\n!!! Correct syntax is: command <node> <IP address> \nExample: \n\n";
}
print "\n\n";
exit(0);
The script does not execute if I remove the prompt or try to set another prompt.
However, I think the error is that the character '>' is always interpreted as the prompt.
my $telnet = new Net::Telnet( Host=>$ipAddress, Port=>23, Timeout=>20, Errmode=>'die');
$telnet->prompt('/$/');
Thanks in advance!
December 11, 2014
A "reply" button would be nice to have instead of having to edit the original post...
I am not quite following what Mr Llama has suggested. As I understand it, if I am using the functions print() and waitfor(), the prompt should NOT be used. In that case I removed the prompt, but the code still does not work. Could you kindly post a working code sample that retrieves the characters '<' and '>' in the printout and does not treat either as a DOS prompt?
The Net::Telnet documentation says that you only need to use the prompt attribute if you're not using print() and waitfor() for communication (it's meant to be used with login()).
In your case, the prompt value is being removed from the response. Try setting the prompt value to something that will never occur, and that should fix your issue. Do be careful what value you select, as the prompt value will be treated as a regular expression.
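A rough sketch of that suggestion (the never-matching pattern and the waitfor() regex below are assumptions for illustration, not values taken from Net::Telnet itself):

use strict;
use warnings;
use Net::Telnet ();

# As in the original script, the connection details come from @ARGV.
my ($node, $ipAddress, $username, $password) = @ARGV;

my $telnet = Net::Telnet->new(
    Host    => $ipAddress,
    Port    => 23,
    Timeout => 20,
    Errmode => 'die',
    # An arbitrary string that cannot appear in normal output, so a '>'
    # inside the directory listing is never mistaken for the prompt.
    Prompt  => '/__NEVER_MATCHES__/',
);

# ... login with waitfor()/print() as in the original script ...

# With a never-matching prompt, read command output with waitfor() against
# a pattern you control instead of relying on cmd(); the 'bytes free' line
# is the last line of a DOS 'dir' listing.
$telnet->print('dir');
my ($listing) = $telnet->waitfor('/bytes free/');
print $listing;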