Extracting regions from a range file into formatted output in Perl

I have an input file and a list file like this:
input.txt file:
>gi|NP_415931.4
MTEQQKLTFTALQQRLDSLMLRDRLRFSRRLHGVKKVKNPDAQQAIFQEMAKEIDQAAGKVLLREAARPEITYPD
>gi|NP_418770.2
MMNKSNFEFLKGVNDFTYAIACAAENNYPDDPNTTLIKMRMFGEATAKHLGLL
>gi|YP_026226.4
MRKFTLNIFTLSLGLAVMPMVEAAPTAQQQLLEQVRLGEATHREDLVQQSLYRLELIDPNNPDVVAARFRSLLRQGDIDGAQKQ
list.txt file:
NP_415931.4: 1-5, 6-8
YP_026226.4: 3-7, 9-9, 10, 12-15
Now I want a CSV-formatted output.csv (with an added header row) like this (for the above inputs):
ID,Regions,Length,Sequences
NP_415931.4,1-5,5,MTEQQ
,6-8,3,KLT
YP_026226.4,3-7,5,KFTLN
,9-9,1,F
,10,1,T
,12-15,4,SLGL
That is, it first matches the IDs in the list file against the headers in the input file, takes the sequences of the matched entries, and then writes the requested regions in the format above.
How can I generate the above output.csv file from those inputs?
Thanks

Here is an approach. To summarize: We have a master database file input.txt with all defined sequences. Our job is to extract certain information from this database and write it to a CSV file. The information about what to extract is given in file list.txt.
use feature qw(say);
use strict;
use warnings;
my $input_fn = 'input.txt';
open ( my $fh1, '<', $input_fn ) or die "Could not open file '$input_fn': $!";
my %seqs;
while( my $line = <$fh1> ) {
my ($id ) = $line =~ /gi\|(.*)$/;
chomp( my $seq = <$fh1> );
$seqs{$id} = $seq;
}
close $fh1;
say join ',', qw(ID Regions Length Sequences);
my $list_fn = 'list.txt';
open ( my $fh2, '<', $list_fn ) or die "Could not open file '$list_fn': $!";
while( my $line = <$fh2> ) {
chomp $line;
my ( $id, @regions ) = split /[:,]\s?/, $line;
for my $i (0..$#regions) {
my $region = $regions[$i];
my $start = my $end = $region;
if ( $region =~ /(\d+)-(\d+)/ ) {
$start = $1;
$end = $2;
}
my $name = ($i == 0) ? $id : "";
my $seq = substr( $seqs{$id}, $start - 1, $end - $start + 1);
say join ',', $name, $region, length( $seq ), $seq;
}
}
close $fh2;
Output:
ID,Regions,Length,Sequences
NP_415931.4,1-5,5,MTEQQ
,6-8,3,KLT
YP_026226.4,3-7,5,KFTLN
,9-9,1,F
,10,1,T
,12-15,4,SLGL
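The answer above assumes that every region in list.txt is well-formed and lies inside the sequence. As a minimal sketch of how the region handling could be isolated and checked, the helper below (the name `extract_region` is hypothetical, not part of the original answer) parses either a `start-end` range or a single position and dies on anything else:

```perl
use strict;
use warnings;

# Hypothetical helper: return the 1-based, inclusive region of $seq.
# A region is either "start-end" or a single position such as "10".
sub extract_region {
    my ( $seq, $region ) = @_;
    my ( $start, $end ) = $region =~ /^(\d+)(?:-(\d+))?$/
        or die "Malformed region '$region'";
    $end = $start unless defined $end;
    die "Region '$region' is outside the sequence"
        if $start < 1 || $end < $start || $end > length($seq);
    return substr( $seq, $start - 1, $end - $start + 1 );
}

# Sequence of NP_415931.4 from input.txt above.
my $seq = 'MTEQQKLTFTALQQRLDSLMLRDRLRFSRRLHGVKKVKNPDAQQAIFQEMAKEIDQAAGKVLLREAARPEITYPD';

# Build "region,length,sequence" rows for two of the regions from list.txt.
my @rows;
for my $region ( '1-5', '6-8' ) {
    my $part = extract_region( $seq, $region );
    push @rows, join ',', $region, length($part), $part;
}
print "$_\n" for @rows;
```

Whether an out-of-range region should be fatal or merely skipped is a design choice; dying is used here only to make the problem visible.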

Related

How to check whether one file's value contains in another text file? (perl script)

I would like to check whether the values in one file are contained in another file. If a value is present, the script should report that there is an existing bin for it; if not, it should report that there is no existing bin. The problem is that I am not sure how to check all the values at once.
The first text file, DID1, contains:
L84A:D:O:M:
L84C:B:E:D:
The second text file, DID, contains:
L84A:B:E:Q:X:F:i:M:Y:
L84C:B:E:Q:X:F:i:M:Y:
L83A:B:E:Q:X:F:i:M:Y:
If the first four characters match, all the values on that line need to be checked.
For example, L84A has the value M in both the first and the second text file, so it should print that there is an existing M bin.
Below is my code:
use strict;
use warnings;
my $filename = 'DID.txt';
my $filename1 = 'DID1.txt';
my $count = 0;
open( FILE2, "<$filename1" )
or die("Could not open log file. $!\n");
while (<FILE2>) {
my ($number) = $_;
chomp($number);
my @values1 = split( ':', $number );
open( FILE, "<$filename" )
or die("Could not open log file. $!\n");
while (<FILE>) {
my ($line) = $_;
chomp($line);
my @values = split( ':', $line );
foreach my $val (@values) {
if ( $val =~ /$values1[0]/ ) {
$count++;
if ( $values[$count] =~ /$values1[$count]/ ) {
print
"Yes ,There is an existing bin & DID\n @values1\n";
}
else {
print "No, There is an existing bin & DID\n";
}
}
}
}
}
I cannot check all the values. Please give me any advice, since this is my first time learning Perl. Thanks a lot :)
Based on my understanding, I wrote this code:
use strict;
use warnings;
#use ReadWrite;
use Array::Utils qw(:all);
use vars qw($my1file $myfile1cnt $my2file $myfile2cnt @output);
$my1file = "did1.txt"; $my2file = "did2.txt";
We are going to read both first and second files (DID1 and DID2).
readFileinString($my1file, \$myfile1cnt); readFileinString($my2file, \$myfile2cnt);
In first file, as per the OP's request the first four characters should be matched with second file and then if they matched we need to check rest of the characters in the first file with the second one.
while($myfile1cnt=~m/^((\w){4})\:([^\n]+)$/mig)
{
print "<LineStart>";
my $lineChk = $1; my $full_Line = $3; #print ": $full_Line\n";
my @First_values = split /\:/, $full_Line; #print join "\n", @First_values;
If the first four characters match, then:
if($myfile2cnt=~m/^$lineChk\:([^\n]+)$/m)
{
Store the rest of the matched line and split it on colons to get the values to compare against the first file's values.
my $FullLine = $1; my @second_values = split /:/, $FullLine;
Then check each value from the first file against the values on the matching line of the second file:
foreach my $sngletter(@First_values)
{
If a value occurs in both files, it is printed.
if( grep {$_ eq "$sngletter"} @second_values)
{
print "Matched: $sngletter\t";
}
}
}
else { print "Not Matched..."; }
This just marks the end of the line.
print "<LineEnd>\n"
}
#------------------>Reading a file
sub readFileinString
#------------------>
{
my $File = shift;
my $string = shift;
use File::Basename;
my $filenames = basename($File);
open(FILE1, "<$File") or die "\nFailed Reading File: [$File]\n\tReason: $!";
read(FILE1, $$string, -s $File, 0);
close(FILE1);
}
Read the search patterns and the data into hashes (the first field is the key), then walk through the data and select only the fields that are included in the pattern for that key.
use strict;
use warnings;
use feature 'say';
my $input1 = 'DID1.txt'; # look for key,pattern(array)
my $input2 = 'DID.txt'; # data - key,elements(array)
my $pattern;
my $data;
my %result;
$pattern = file2hash($input1); # read pattern into hash
$data = file2hash($input2); # read data into hash
while( my($k,$v) = each %{$data} ) { # walk through data
next unless defined $pattern->{$k}; # skip those which is not in pattern hash
my $find = join '|', @{ $pattern->{$k} }; # form search pattern for grep
my @found = grep {/$find/} @{ $v }; # extract only those of interest
$result{$k} = \@found; # store in result hash
}
while( my($k,$v) = each %result ) { # walk through result hash
say "$k has " . join ':', @{ $v }; # output final result
}
sub file2hash {
my $filename = shift;
my %hash;
my $fh;
open $fh, '<', $filename
or die "Couldn't open $filename";
while(<$fh>) {
chomp;
next if /^\s*$/; # skip empty lines
my($key,@data) = split ':';
$hash{$key} = \@data;
}
close $fh;
return \%hash;
}
Output
L84C has B:E
L84A has M
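One caveat about the answer above: `grep {/$find/}` is an unanchored regex match, so a pattern field `M` would also match a data field such as `MX`, and any regex metacharacters in the fields would be interpreted. A minimal sketch of an exact-match alternative using a lookup hash follows; the helper name `common_fields` is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: keep the elements of @$data that also occur in
# @$wanted, comparing whole fields by string equality instead of a regex.
sub common_fields {
    my ( $wanted, $data ) = @_;
    my %is_wanted = map { $_ => 1 } @$wanted;
    return grep { $is_wanted{$_} } @$data;
}

# Fields of the L84A lines from DID1.txt and DID.txt above.
my @found = common_fields( [qw(D O M)], [qw(B E Q X F i M Y)] );
print "L84A has ", join( ':', @found ), "\n";
```

The order of the result follows the data line, as in the original answer.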

Duplicate values in column

I have an original file which has the following columns:
02-May-2018,AAPL,Sell,0.25,1000
02-May-2018,C,Sell,0.25,2000
02-May-2018,JPM,Sell,0.25,3000
02-May-2018,WFC,Sell,0.25,5000
02-May-2018,AAPL,Sell,0.25,7000
02-May-2018,GOOG,Sell,0.25,8000
02-May-2018,GOOG,Sell,0.25,9000
02-May-2018,C,Sell,0.25,2000
02-May-2018,AAPL,Sell,0.25,3000
I am trying to print the original line if the value in the second column appears more than 2 times. For example, since AAPL appears more than 2 times, the desired result is:
02-May-2018,AAPL,Sell,0.25,1000
02-May-2018,AAPL,Sell,0.25,7000
02-May-2018,AAPL,Sell,0.25,3000
So far, I have written the following, which prints results multiple times, which is wrong. Can you please help me see what I am doing wrong?
open (FILE, "<$TMPFILE") or die "Could not open $TMPFILE";
open (OUT, ">$TMPFILE1") or die "Could not open $TMPFILE1";
%count = ();
@symbol = ();
while ($line = <FILE>)
{
chomp $line;
(@data) = split(/,/,$line);
$count{$data[1]}++;
@keys = sort {$count{$a} cmp $count{$b}} keys %count;
for my $key (@keys)
{
if ( $count{$key} > 2 )
{
print "$line\n";
}
}
}
I'd do it something like this - store lines you've seen in a 'buffer' and print them out again if the condition is hit (before continuing to print as you go):
#!/usr/bin/env perl
use strict;
use warnings;
my %buffer;
my %count_of;
while ( my $line = <> ) {
my ( $date, $ticker, @values ) = split /,/, $line;
#increment the count
$count_of{$ticker}++;
if ( $count_of{$ticker} < 3 ) {
#count limit not hit, so stash the current line in the buffer.
$buffer{$ticker} .= $line;
next;
}
#print the buffer if the count has been hit
if ( $count_of{$ticker} == 3 ) {
print $buffer{$ticker};
}
#only gets to here once the limit is hit, so just print normally.
print $line;
}
With your input data, this outputs:
02-May-2018,AAPL,Sell,0.25,1000
02-May-2018,AAPL,Sell,0.25,7000
02-May-2018,AAPL,Sell,0.25,3000
Simple answer:
push @{ $lines{(split",")[1]} }, $_ while <>;
print @{ $lines{$_} } for grep @{ $lines{$_} } > 2, sort keys %lines;
perl program.pl inputfile > outputfile
You need to read the input file twice (or buffer its lines in memory), because you don't know the final counts until you reach the end of the file:
use strict;
use warnings 'all';
my ($TMPFILE, $TMPFILE1) = qw/ infile outfile /;
my %counts;
{
open my $fh, '<', $TMPFILE or die "Could not open $TMPFILE: $!";
while ( <$fh> ) {
my @fields = split /,/;
++$counts{$fields[1]};
}
}
open my $fh, '<', $TMPFILE or die "Could not open $TMPFILE: $!";
open my $out_fh, '>', $TMPFILE1 or die "Could not open $TMPFILE1: $!";
while ( <$fh> ) {
my @fields = split /,/;
print $out_fh $_ if $counts{$fields[1]} > 2;
}
output
02-May-2018,AAPL,Sell,0.25,1000
02-May-2018,AAPL,Sell,0.25,7000
02-May-2018,AAPL,Sell,0.25,3000
This should work:
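use strict;
use warnings;
# placeholder file names; substitute your own
my ( $TMPFILE, $TMPFILE1 ) = qw/ infile outfile /;
open (FILE, "<$TMPFILE") or die "Could not open $TMPFILE";
open (OUT, ">$TMPFILE1") or die "Could not open $TMPFILE1";
my %data;
# group the lines by the value in the second column
while ( my $line = <FILE> ) {
chomp $line;
my @line = split /,/, $line;
push(@{$data{$line[1]}}, $line);
}
# print every group that has more than 2 lines
foreach my $key (keys %data) {
if(@{$data{$key}} > 2) {
print "$_\n" foreach @{$data{$key}};
}
}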

Correct use of Perl "exists"

I have two files. The first two columns in both are chromosome loci and genotypes, for instance chr1:1736464585 and T/G.
I have put the first two columns into a hash. I want to check whether the hash key (the chromosome locus) exists in the second file.
I have written this Perl program and have tried many variations but I'm not sure if I'm using exists correctly: it gives the error exists is not an HASH or ARRAY element or a subroutine.
#!/usr/bin/perl
use strict;
use warnings;
my $output = "annotated.txt";
open( O, ">>$output" );
my $filename = "datatest.txt";
my $filename2 = "MP2.txt";
chomp $filename;
chomp $filename2;
my %hash1 = ();
open( FN1, $filename ) or die "Can't open $filename: $!";
my @lines = <FN1>;
foreach my $line (@lines) {
my @split = split /\t/, $line;
if ( $line =~ /^chr/ ) {
my ( $key, $value ) = ( $split[0], $split[1] );
$hash1{$key} = $value;
}
}
my $DATA;
open( $DATA, $filename2 ) or die $!;
my @lines2 = <$DATA>;
foreach my $line2 (@lines2) {
my @split2 = split /\t/, $line2;
if ( $line2 =~ /^chr/ ) {
if ( exists %hash1{$key} ) {
print "$line2\n";
}
}
}
The syntax of the following line is incorrect:
if (exists %hash1{$key}) { ... }
This should be:
if (exists $hash1{$key}) { ... }
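Note also that `$key` is declared with `my` inside the first loop, so it is not in scope in the second loop; under `use strict` the corrected line would still fail to compile until the key is taken from the current line (for example `$split2[0]`). A minimal sketch of the lookup on in-memory data, where the second locus is made up for illustration:

```perl
use strict;
use warnings;

# Loci from the first file, keyed by chromosome locus (sample value only).
my %hash1 = ( 'chr1:1736464585' => 'T/G' );

# Lines standing in for the second file; the chr2 locus is invented.
my @lines2 = ( "chr1:1736464585\tT/G\n", "chr2:42\tA/C\n" );

my @matched;
for my $line2 (@lines2) {
    next unless $line2 =~ /^chr/;
    chomp $line2;
    my ($locus) = split /\t/, $line2;    # the first column is the hash key
    # exists takes a single element, hence the $ sigil on $hash1{...}
    push @matched, $line2 if exists $hash1{$locus};
}
print "$_\n" for @matched;
```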

How do I compare two texts in perl

I have 2 files:
1. a.txt
2. b.txt
a.txt:
UP_00292229 191
Xa_09833888 199
b.txt
UP_00292229 191
Xa_09833888 188
I want to compare these 2 files by the first column.
result:
UP_00292229 is same
Xa_09833888 is not same
How can I do it in perl?
How can I input 2 files at the same times?
How can I check the file format is xxxxx dddd (there is a space between xxxxx dddd)?
This code compares the first column of both files: for each key in a.txt it prints "is same" if that key also appears in b.txt, and "is not same" otherwise. It does not look at the second column.
use strict;
use warnings;
my %seen;
open ( my $file2,"<", "b.txt" ) or die $!;
while ( my $line = <$file2> )
{
chomp ( $line );
my ($column3, $column4) = split ' ', $line;
$seen{$column3}++;
}
close ( $file2 );
open ( my $file1, "<", "a.txt" ) or die $!;
while ( my $line1 = <$file1> )
{
chomp $line1;
my ($column1, $column2) = split ' ', $line1;
print $column1, " ", $seen{$column1} ? "is same" : "is not same", "\n";
}
close ( $file1 );
Try this:
use warnings;
use strict;
open my $handle_file1,"<","file1" or die "Could not open file1: $!";
open my $handle_file2,"<","file2" or die "Could not open file2: $!";
my @ar = <$handle_file1>;
my @br = <$handle_file2>;
for my $i(0..$#ar){
if($ar[$i] eq $br[$i]){
chomp $ar[$i];
print "$ar[$i] is same\n";
}
else{
chomp $ar[$i];
print "$ar[$i] is not same\n";
}
}
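Neither answer above covers the format check, and the first one tests only whether the key is present, not whether its value is the same. A minimal sketch that does both, under the assumption that "xxxxx dddd" means a non-space token, exactly one space, and a run of digits; the helper name `parse_pairs` is hypothetical, and the lines are given in memory (already chomped) rather than read from a.txt and b.txt:

```perl
use strict;
use warnings;

# Hypothetical helper: parse "key value" lines into a hash reference and die
# on any line that is not a token, one space, and digits.
sub parse_pairs {
    my @lines = @_;
    my %value_of;
    for my $line (@lines) {
        my ( $key, $value ) = $line =~ /^(\S+) (\d+)$/
            or die "Unexpected format: '$line'";
        $value_of{$key} = $value;
    }
    return \%value_of;
}

my @a_lines = ( 'UP_00292229 191', 'Xa_09833888 199' );
my @b_lines = ( 'UP_00292229 191', 'Xa_09833888 188' );

my $first  = parse_pairs(@a_lines);
my $second = parse_pairs(@b_lines);

# Report in the order of a.txt; a key missing from b.txt counts as "not same".
my @report;
for my $line (@a_lines) {
    my ($key) = split ' ', $line;
    my $same = exists $second->{$key} && $second->{$key} == $first->{$key};
    push @report, "$key " . ( $same ? 'is same' : 'is not same' );
}
print "$_\n" for @report;
```

Both files are opened one after the other with their own lexical filehandles, as in the first answer; there is no need to read them "at the same time".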

Perl regex to capture strings between anchor words

I am still working on cleaning up Oracle files, having to replace strings in files where the Oracle schema name is prepended to the function/procedure/package name within the file, as well as when the function/procedure/package name is double-quoted. Once the definition is corrected, I write the correction back to the file, along with the rest of the actual code.
I have code written to replace simple declarations (no input/output parameters). Now I am trying to get my regex to operate on declarations that do have parameters. (Note: this post is a continuation of an earlier question.) Some examples of what I'm trying to clean up:
Replace:
CREATE OR REPLACE FUNCTION "TRON2000"."DC_F_DUMP_CSV_MMA" (
p_trailing_separator IN BOOLEAN DEFAULT FALSE,
p_max_linesize IN NUMBER DEFAULT 32000,
p_mode IN VARCHAR2 DEFAULT 'w'
)
RETURN NUMBER
IS
to
CREATE OR REPLACE FUNCTION DC_F_DUMP_CSV_MMA (
p_trailing_separator IN BOOLEAN DEFAULT FALSE,
p_max_linesize IN NUMBER DEFAULT 32000,
p_mode IN VARCHAR2 DEFAULT 'w'
)
RETURN NUMBER
IS
I have been trying to use the following regex to separate the declaration, for later reconstruction after I've cleaned out the schema name and fixed the name of the function/procedure/package so that it is not double-quoted. I am struggling with getting each part into a buffer. Here's my latest attempt to grab all the middle input/output into its own buffer:
\b(CREATE\sOR\sREPLACE\s(PACKAGE|PACKAGE\sBODY|PROCEDURE|FUNCTION))(?:\W+\w+){1,100}?\W+(RETURN)\s*(\W+\w+)\s(AS|IS)\b
Any / all help is GREATLY appreciated!
This is the script that I'm using right now to evaluate / write the corrected files:
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Data::Dumper;
# utility to clean strings
sub trim($) {
my $string = shift;
$string = "" if !defined($string);
$string =~ s/^\s+//;
$string =~ s/\s+$//;
# aggressive removal of blank lines
$string =~ s/\n+/\n/g;
return $string;
}
sub cleanup_packages {
my $file = shift;
my $tmp = $file . ".tmp";
my $package_name;
open( OLD, "< $file" ) or die "open $file: $!";
open( NEW, "> $tmp" ) or die "open $tmp: $!";
while ( my $line = <OLD> ) {
# look for the first line of the file to contain a CREATE OR REPLACE STATEMENT
if ( $line =~
m/^(CREATE\sOR\sREPLACE)\s*(PACKAGE|PACKAGE\sBODY)?\s(.+)\s(AS|IS)?/i
)
{
# look ahead to next line, in case the AS/IS is next
my $nextline = <OLD>;
# from the above IF clause, the package name is in buffer 3
$package_name = $3;
# if the package name and the AS/IS is on the same line, and
# the package name is quoted/prepended by the TRON2000 schema name
if ( $package_name =~ m/"TRON2000"\."(\w+)"(\s*|\S*)(AS|IS)/i ) {
# grab just the name and the AS/IS parts
$package_name =~ s/"TRON2000"\."(\w+)"(\s*|\S*)(AS|IS)/$1 $2/i;
trim($package_name);
}
elsif ( ( $package_name =~ m/"TRON2000"\."(\w+)"/i )
&& ( $nextline =~ m/(AS|IS)/ ) )
{
# if the AS/IS was on the next line from the name, put them together on one line
$package_name =~ s/"TRON2000"\."(\w+)"(\s*|\S*)/$1/i;
$package_name = trim($package_name) . ' ' . trim($nextline);
trim($package_name); # remove trailing carriage return
}
# now put the line back together
$line =~
s/^(CREATE\sOR\sREPLACE)\s*(PACKAGE|PACKAGE\sBODY|FUNCTION|PROCEDURE)?\s(.+)\s(AS|IS)?/$1 $2 $package_name/ig;
# and print it to the file
print NEW "$line\n";
}
else {
# just a normal line - print it to the temp file
print NEW $line or die "print $tmp: $!";
}
}
# close up the files
close(OLD) or die "close $file: $!";
close(NEW) or die "close $tmp: $!";
# rename the temp file as the original file name
unlink($file) or die "unlink $file: $!";
rename( $tmp, $file ) or die "can't rename $tmp to $file: $!";
}
# find and clean up oracle files
sub eachFile {
my $ext;
my $filename = $_;
my $fullpath = $File::Find::name;
if ( -f $filename ) {
($ext) = $filename =~ /(\.[^.]+)$/;
}
else {
# ignore non files
return;
}
if ( $ext =~ /(\.spp|\.sps|\.spb|\.sf|\.sp)/i ) {
print "package: $filename\n";
cleanup_packages($fullpath);
}
else {
print "$filename not specified for processing!\n";
}
}
MAIN:
{
my ( @files, $file );
my $dir = 'C:/1_atest';
# grab all the files for cleanup
find( \&eachFile, "$dir/" );
#open and evaluate each
foreach $file (@files)
{
# skip . and ..
next if ( $file =~ /^\.$/ );
next if ( $file =~ /^\.\.$/ );
cleanup_file($file);
};
}
Assuming the entire content of a file is stored as a scalar in a variable, the following should do the trick.
$Str = '
CREATE OR REPLACE FUNCTION "TRON2000"."DC_F_DUMP_CSV_MMA" (
p_trailing_separator IN BOOLEAN DEFAULT FALSE,
p_max_linesize IN NUMBER DEFAULT 32000,
p_mode IN VARCHAR2 DEFAULT w
)
RETURN NUMBER
IS
CREATE OR REPLACE FUNCTION "TRON2000"."DC_F_DUMP_CSV_MMA" (
p_trailing_separator IN BOOLEAN DEFAULT FALSE,
p_max_linesize IN NUMBER DEFAULT 32000,
p_mode IN VARCHAR2 DEFAULT w
)
RETURN NUMBER
IS
';
$Str =~ s#^(create\s+(?:or\s+replace\s+)?\w+\s+)"[^"]+"\."([^"]+)"#$1$2#mig;
print $Str;
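The substitution above matches a single word between CREATE OR REPLACE and the quoted name, so it would not match `PACKAGE BODY`. As a sketch only, a variant that lists the object types from the question explicitly could look like this; the sub name `strip_schema` and the package name `DC_PKG` are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: drop the quoted schema prefix and the quotes around
# the object name on every CREATE [OR REPLACE] line of $sql.
sub strip_schema {
    my ($sql) = @_;
    $sql =~ s/^(CREATE\s+(?:OR\s+REPLACE\s+)?(?:PACKAGE\s+BODY|PACKAGE|PROCEDURE|FUNCTION)\s+)"[^"]+"\."([^"]+)"/$1$2/mig;
    return $sql;
}

my $sql = <<'END_SQL';
CREATE OR REPLACE FUNCTION "TRON2000"."DC_F_DUMP_CSV_MMA" (
p_mode IN VARCHAR2 DEFAULT 'w'
)
RETURN NUMBER
IS
CREATE OR REPLACE PACKAGE BODY "TRON2000"."DC_PKG" AS
END_SQL

my $cleaned = strip_schema($sql);
print $cleaned;
```

`PACKAGE\s+BODY` is listed before `PACKAGE` so that the longer alternative is tried first. Everything after the name (parameters, RETURN, AS/IS) is left untouched, so the rest of the declaration does not need to be captured at all.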