adding new line to an output file - perl

I'm writing a script that compares a field in each line of two files and writes the lines with equal values to a new file. However, the new file contains only the last line; the earlier lines are deleted. I have googled the problem but still have not found a way out. Sorry for my English.
Thank you very much in advance.
Here is roughly what my script looks like:
for ($tmp1 = 1; $tmp1 <= $cnt1; $tmp1++) {
    $line1 = `head -$tmp1 file1 | tail -1`;
    @str1 = split(/\s/, $line1);
    for ($tmp2 = 1; $tmp2 <= $cnt2; $tmp2++) {
        $line2 = `head -$tmp2 file2 | tail -1`;
        @str2 = split(/\s/, $line2);
        open(OUT, ">out");
        if ($str1[3] eq $str2[3]) {
            print OUT "$line1";
        }
    }
}

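You should open the file before the loop starts, or use open(OUT, ">>out"); to open it in append mode. Opening with ">" truncates the file every time, so reopening it inside the loop throws away everything written on earlier iterations.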
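A minimal sketch of the first suggestion, assuming the comparison is on the fourth whitespace-separated field as in the question. The helper name write_matches is made up for illustration. It reads the files with Perl instead of shelling out to head and tail, and it opens the output handle once, before the loops.

```perl
use strict;
use warnings;

# Hypothetical helper: write each line of $file1 whose 4th field equals the
# 4th field of a line in $file2. Like the original loop, a line is written
# once per matching line in $file2.
sub write_matches {
    my ($file1, $file2, $out_file) = @_;

    open my $in2, '<', $file2 or die "Cannot open $file2: $!";
    my @lines2 = <$in2>;
    close $in2;

    # Open the output ONCE, before the loops: '>' truncates on every open.
    open my $out, '>', $out_file or die "Cannot open $out_file: $!";
    open my $in1, '<', $file1 or die "Cannot open $file1: $!";
    while (my $line1 = <$in1>) {
        my @str1 = split /\s/, $line1;
        next unless defined $str1[3];
        for my $line2 (@lines2) {
            my @str2 = split /\s/, $line2;
            next unless defined $str2[3];
            print {$out} $line1 if $str1[3] eq $str2[3];
        }
    }
    close $in1;
    close $out or die "Cannot close $out_file: $!";
}
```

With the file names from the question, the call would be write_matches('file1', 'file2', 'out').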

Related

Add new hash keys and then print in a new file

Previously, I posted a question asking how to use a regex to match a specific sequence identifier (ID).
Now I'm looking for recommendations on how to print the data that I am looking for.
If you want to see the complete file, here's a GitHub link.
The script takes two files as input. The first file looks like this (only part of the file is shown):
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 2 2 0.0804934 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 4 4 0.0925522 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 13 13 0.0250116 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 23 23 0.565981 . .
...
The sixth column of this file holds a score, and I am interested in rows where it is >= 0.5. For those rows my script takes the first column (an ID, to be matched against the second file) and the fourth column (the position of a letter in the second file).
Here is my second file (only a part):
>AGY29650.2|NA spike protein
MTYSVFPLMCLLTFIGANAKIVTLPGNDA...EEYDLEPHKIHVH*
As I said, the script takes the ID from the first file and matches it against the IDs in the second file; when they are the same, it looks up the position (fourth column) in the sequence data.
For example, in file one the fourth row has a positive value (>= 0.5) and the position in its fourth column is 23.
The script then looks up position 23 in the sequence data of the second file; here position 23 is the letter T:
MTYSVFPLMCLLTFIGANAKIV T LP
Once the script has found the letter, it should take the 2 letters to the right and the 2 letters to the left of the position of interest:
IVTLP
In the previous post, thanks to the help of some people on Stack Overflow, I solved the problem caused by the IDs differing between the files (AGY29650_2_NA in file one versus AGY29650.2 in file two).
Now I am looking for help to obtain the output that I need to complete the script.
The script is incomplete because I couldn't find a way to print the output of interest: the 5 letters from the second file, that is, the letter at the position given in file one plus 2 letters to its right and 2 to its left.
I have thousands of file pairs like these two, so any idea for completing the script is welcome.
Here is the script:
use strict;
use warnings;
use Bio::SeqIO;

my $file = $ARGV[0];
my $in = $ARGV[1];
my %fastadata = ();
my @array_residues = ();
my $seqio_obj = Bio::SeqIO->new(-file => $in,
                                -format => "fasta" );
while (my $seq_obj = $seqio_obj->next_seq ) {
    my $dd = $seq_obj->id;
    my $ss = $seq_obj->seq;
    ###my $ee = $seq_obj->desc;
    $fastadata{$dd} = "$ss";
}

my $thres = 0.5; ### Selection of values in column N°5 with the following condition: >=0.5

# Open file
open (F, $file) or die; ### open the file or end the analyze
while(my $one = <F>) {### readline => F
    $one =~ s/\n//g;
    $one =~ s/\r//g;
    my @cols = split(/\s+/, $one); ### split columns
    next unless (scalar (@cols) == 7); ### the line must have 7 columns to add to the array
    my $val = $cols[5];

    if ($val >= 0.5) {
        my $position = $cols[3];
        my $id_list = $cols[0];
        $id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
        if (exists($fastadata{$id_list})) {
            my $new_seq = $fastadata{$id_list};
            my $subresidues = substr($new_seq, $position -3, 6);
        }
    }
}
close F;
I'm thinking of adding a push call to collect the new data and then printing it to a new file.
My expected output is the letter at the position of a positive value (>= 0.5), in this case T (position 23), together with the 2 letters to its right and the 2 letters to its left.
With the example data on GitHub (link above), the expected output is:
IVTLP
Any recommendation or help is welcome.
Thanks!
The main problem seems to be that each line has 8 columns, not 7 as the script assumes. Another small issue is that the extracted substring should have 5 characters, not 6. Here is a modified version of the loop that works for me:
open (F, $file) or die; ### open the file or end the analyze
while(my $one = <F>) {### readline => F
    chomp $one;
    my @cols = split(/\s+/, $one); ### split columns
    next unless (scalar @cols) == 8; ### the line must have 8 columns to add to the array
    my $val = $cols[5];
    if ($val >= 0.5) {
        my $position = $cols[3];
        my $id_list = $cols[0];
        $id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
        if (exists($fastadata{$id_list})) {
            my $new_seq = $fastadata{$id_list};
            my $subresidues = substr($new_seq, $position -3, 5);
            print $subresidues, "\n";
        }
    }
}
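The same extraction can be written with an explicit flank width and bounds checks, which makes the offset arithmetic easier to follow. This is only a sketch: the helper names convert_id and flanking_window are hypothetical, and clipping the window at the ends of the sequence is an assumption rather than something the question asks for.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a table ID such as AGY29650_2_NA into the
# FASTA-style ID AGY29650.2|NA (same substitution as in the script).
sub convert_id {
    my ($id) = @_;
    $id =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
    return $id;
}

# Hypothetical helper: return the residue at 1-based $position plus $flank
# residues on each side, clipped at the sequence ends (an assumption).
sub flanking_window {
    my ($seq, $position, $flank) = @_;
    return '' if $position < 1 || $position > length $seq;
    my $start = $position - 1 - $flank;    # 0-based offset of the window
    $start = 0 if $start < 0;
    my $end = $position - 1 + $flank;      # 0-based offset of the last residue
    $end = length($seq) - 1 if $end > length($seq) - 1;
    return substr($seq, $start, $end - $start + 1);
}

my $seq = 'MTYSVFPLMCLLTFIGANAKIVTLPGNDA';
print convert_id('AGY29650_2_NA'), "\n";     # AGY29650.2|NA
print flanking_window($seq, 23, 2), "\n";    # IVTLP
```

To collect the results in a file instead of printing to STDOUT, one option is to open an output handle once before the while loop and print each window to that handle.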

Opening a file inside a subroutine for read/write in Perl

I am trying to open a file inside a subroutine in order to substitute some lines in the file. Since that was not working, I tried the simpler step of printing a line instead of substituting it, for debugging purposes. The subroutine code follows.
sub replace {
    while (<INPUT_FILE>){
        my $cell = $_[0];
        our $rpl;
        if ($_=~ /^TASK\|VALUE = (.*)/ ) {
            my $task = $1;
            chomp $task;
            $rpl = $cell . '_' . $task . '_bunch_rpl';
            print "000: $rpl\n";
        }
        elsif ($_=~ /^(.*)\|VALUE = (.*)/ ) {
            my $line = $_;
            chomp $line;
            my $ip_var = $1;
            my $ip_val = $2;
            chomp $ip_var;
            chomp $ip_val;
            my $look= $ip_var."|VALUE";
            open(REPLAY_FILE, "+<$rpl") || die "\ncannot open $rpl\n";
            while (my $rpl_sub = <REPLAY_FILE>) {
                if ($rpl_sub =~ /^$line/) {
                    print "\n 111: $ip_val";
                }
            }
            close REPLAY_FILE;
        }
        elsif ($_=~ /^\s*$/) {
            print "\n";
            return ;
        }
    }
}
The code prints the following as of now.
000: lfr_task62_bunch_rpl
111: 2.0.9.0
111: INLINE
111: POWER
000: aaa_task14_bunch_rpl
Expected output is:
000: lfr_task62_bunch_rpl
111: 2.0.9.0
111: INLINE
111: POWER
000: aaa_task14_bunch_rpl
111: 0.45
111: NO
The input sample is:
TASK_CELL_NAME|VALUE = lfr
TASK|VALUE = task62
TASK_VERSION|VALUE = 2.0.9.0
CHIP_PKG_TYPE|VALUE = INLINE
JUNK_LINE = JUNK
JUNK_LINE = JUNK
FULL_ESD|VALUE = POWER
TASK_CELL_NAME|VALUE = aaa
TASK|VALUE = task14
CUSTOM_CELL_DENSITY|VALUE = 0.45
CUSTOM_CELL_SS|VALUE = NO
Can someone tell me what mistake I am making here?
UPDATE: Main code below
my @cell_names;
open(INPUT_FILE, "<$ip_file") || die "\n!!!ERROR OPENING INPUT FILE. EXITING SCRIPT!!!\n";
while (<INPUT_FILE>) {
    if ($_=~ /(.*) =\n/ ) {
        $mw -> messageBox(-message=> "\nFormat not correct on line $. of input file. Exiting script\n");
        exit;
    }
    elsif ($_=~ /(.*) =\s+\n/ ) {
        $mw -> messageBox(-message=> "\nFormat not correct on line $. of input file. Exiting script\n");
        exit;
    }
    elsif ($_=~ /(.*) = \s+(.*)/ ) {
        $mw -> messageBox(-message=> "\nFormat not correct on line $. of input file. Exiting script\n");
        exit;
    }
    elsif ($_=~ /^TASK_CELL_NAME\|VALUE = (.*)/ ) {
        my $cell_name = $1;
        chomp $cell_name;
        unless(grep( /^$cell_name $/, @cell_names )) {
            push @cell_names, "$cell_name ";
            #$count++;
            #print "\nCELL NAME: $cell_name\n";
            replace($cell_name);
        }
    }
}
close INPUT_FILE;
Update: lfr_task62_bunch_rpl before running code:
# Select fund
FUND|VALUE = mmi
# Select bank
BANK|VALUE = citi
# Select cell name
TASK_CELL_NAME|VALUE = lfr
# Select task
TASK|VALUE = task62
# Select task version
TASK_VERSION|VALUE = 1.0.9.0
# Select fund type
FULL_ESD|VALUE = MUTUAL
# Select customer premium
CUSTOM_CELL_SS|VALUE = YES
# Select customer brand density
CUSTOM_CELL_DENSITY|VALUE = 0.76
# Select card chip
CHIP_PKG_TYPE|VALUE|VALUE = OUTLINE
Expected lfr_task62_bunch_rpl after running code:
# Select fund
FUND|VALUE = mmi
# Select bank
BANK|VALUE = citi
# Select cell name
TASK_CELL_NAME|VALUE = lfr
# Select task
TASK|VALUE = task62
# Select task version
TASK_VERSION|VALUE = 2.0.9.0
# Select fund type
FULL_ESD|VALUE = POWER
# Select customer premium
CUSTOM_CELL_SS|VALUE = YES
# Select customer brand density
CUSTOM_CELL_DENSITY|VALUE = 0.76
# Select card chip
CHIP_PKG_TYPE|VALUE|VALUE = INLINE
It's not really clear what this code is supposed to do. But I can immediately see a few problems with the logic. Let's step through a few iterations of the loop, using your sample data file.
The first time, the line of data read in is:
TASK_CELL_NAME|VALUE = lfr
So that matches on your second regex match. You set a few variables and then (because $ip_var is equal to "TASK_CELL_NAME") you skip to the else clause and close a filehandle that isn't open.
Next time round, we read:
TASK|VALUE = task62
That matches your first regex match. The variable $rpl_file is set to "XXX_lfr_bunch_rpl" (where 'XXX' is the parameter passed to the subroutine - obviously, I don't know what that is). You print a "000" line with that value and open the file with that name in r/w mode.
Third time round, we get this data:
TASK_VERSION|VALUE = 2.0.9.0
This matches your second regex and because $ip_var isn't equal to "TASK_CELL_NAME" we go into the if clause. This reads from your open filehandle and prints a "111" line. But this generates a warning if you have use warnings switched on, as the line includes the value of $rpl_file, which is currently undefined. It was set the last time around the loop, but because the variable is declared inside the loop, it has now lost its value. We then close the filehandle.
The fourth iteration will be the last one that's really interesting. We get this data:
CHIP_PKG_TYPE|VALUE = INLINE
This also matches the second regex, so we do much the same as in the third iteration. The difference here is that when we try to read from the filehandle, we get a warning because that filehandle is closed. Oh, and then we close it again for good measure :-)
As I said at the start, I can't really work out what we're trying to do here. But I can see that the logic is very strange. You really need to go back to the drawing board and think through your logic again.
Update:
With the updated version of your code, I'm still seeing problems.
On the first iteration, the data is:
TASK_CELL_NAME|VALUE = lfr
So this matches your second regex. That goes into the piece of code that opens the other file and tries to read from it. But it expects to find the filename in $rpl and that variable hasn't been given a value yet. So the open() fails and the program dies.
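One way to sidestep this ordering problem is to separate parsing from file updates: first group the input lines into one record per TASK_CELL_NAME block, then apply each record to its replay file. The sketch below covers only the in-memory part and rests on assumptions read off the samples in the question: a record starts at a TASK_CELL_NAME line, the replay file is named <cell>_<task>_bunch_rpl, and a replay line is updated when its key (the text before the first |VALUE) appears in the record. The names parse_records and apply_record are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: group "KEY|VALUE = value" lines into one record per
# TASK_CELL_NAME block. Lines without |VALUE (e.g. JUNK_LINE) are skipped.
sub parse_records {
    my @lines = @_;
    my @records;
    for my $line (@lines) {
        next unless $line =~ /^([^|\s]+)\|VALUE = (.*?)\s*$/;
        my ($key, $value) = ($1, $2);
        if ($key eq 'TASK_CELL_NAME') {
            push @records, { cell => $value, values => {} };
        }
        elsif (@records) {
            $records[-1]{task} = $value if $key eq 'TASK';
            $records[-1]{values}{$key} = $value;
        }
    }
    return @records;
}

# Hypothetical helper: return copies of the replay-file lines with the
# values from one record substituted in; other lines are left untouched.
sub apply_record {
    my ($record, @rpl_lines) = @_;
    for my $line (@rpl_lines) {
        next unless $line =~ /^([^|\s]+)((?:\|VALUE)+ = )/;
        my ($key, $prefix) = ($1, $1 . $2);
        next unless exists $record->{values}{$key};
        (my $eol = $line) =~ s/^[^\r\n]*//;    # keep the original line ending
        $line = $prefix . $record->{values}{$key} . $eol;
    }
    return @rpl_lines;
}

my @input = (
    "TASK_CELL_NAME|VALUE = lfr\n",
    "TASK|VALUE = task62\n",
    "TASK_VERSION|VALUE = 2.0.9.0\n",
    "JUNK_LINE = JUNK\n",
    "FULL_ESD|VALUE = POWER\n",
);
my ($record) = parse_records(@input);
my $rpl_name = "$record->{cell}_$record->{task}_bunch_rpl";
print "$rpl_name\n";
print apply_record($record, "# Select task version\n",
                            "TASK_VERSION|VALUE = 1.0.9.0\n");
```

Because the replay file name is known only after the TASK line has been read, building complete records first means the file is never opened before its name exists. Writing the result back would then be a matter of reading the replay file into an array, calling apply_record, and rewriting the file.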

Loop recursive through two files in perl

Hi guys I have an issue to solve,
I have 2 files.
File A
col1,col2, value_total_to_put
201843,12345,30
File B
col1,col2,col3, value_inputted, missing_value, value_max
201843,12345,447,4,0,4
201843,12345,448,0,0,4
201843,12345,449,0,0,2
201843,12345,450,4,0,4
201843,12345,451,2,0,2
201843,12345,455,4,0,4
201843,12345,457,0,0,4
201843,12345,899,10,0,10
201843,12345,334,0,1,1
201843,12345,364,0,1,1
201843,12345,71,0,2,2
201843,12345,260,0,2,2
201843,12345,321,0,2,2
201843,12345,328,0,2,2
201843,12345,371,0,2,2
201843,12345,385,0,2,2
201843,12345,426,0,2,2
201843,12345,444,0,2,2
201843,12345,31,4,6,10
201843,12345,360,2,87,99
201843,12345,373,4,95,99
201843,12345,472,4,95,99
201843,12345,475,4,95,99
201843,12345,430,0,99,99
201843,12345,453,0,99,99
201843,12345,463,0,99,99
201843,12345,482,0,99,99
201843,12345,484,0,99,99
My keys are col1 and col2 in both files. I am doing it the way shown below, but my loop is wrong: when I reach the end of File B, the loop stops.
What I want is to match File A and File B on $col1 and $col2 and, while value_total_to_put is > 0, subtract 1 from it on each pass and add 1 to value_inputted in File B, as long as value_inputted is less than value_max. To take a unit from File A, missing_value should be > 0.
For the result, I want to print each row when value_inputted equals value_max, or otherwise the last value it reached when value_total_to_put hit 0.
while ( <FA> ){
    chomp;
    my($col1,$col2, $value_total_to_put) = split ",";
    push @A, [$col1,$col2, $value_total_to_put];
}
my @B;
while ( <FB> ){
    chomp;
    my($col1,$col2,$col3, $value_inputted, $missing_value, $value_max) = split ",";
    push @B, [$col1,$col2,$col3, $value_inputted, $missing_value, $value_max];
}
foreach my $line (@A){
    my $idxl = @$line[0].",".@$line[1];
    my $value_total_to_put = @$line[2];
    while ($value_total_to_put > 0){
        foreach my $row ( @B ){
            if ( $idxr eq $idxl ){
                my $idxr = @$row[0].",".@$row[1];
                my $value_inputted = @$row[3];
                my $value_max = @$row[5];
                my $missing_value = @$row[4];
                if ( ($value_inputted eq 0) and ($missing_value eq 0)){
                    #do_nothing
                } elsif($value_inputted == $value_max){
                    #do_nothing
                    print join(",", $idxr, @$row[2],"Value_inputted: ".$value_inputted, "Missing_value: ".$missing_value, "Value_max:".$value_max, "Total: ".$value_total_to_put)."\n";
                }else{
                    $value_inputted++;
                    $missing_value--;
                    $value_total_to_put--;
                }
            }
        }
        last if $value_total_to_put > 0;
    }
}
The third (output) file should look like this:
201843,12345,447,4,0,4
201843,12345,450,4,0,4
201843,12345,451,2,0,2
201843,12345,455,4,0,4
201843,12345,899,10,0,10
201843,12345,334,1,0,1
201843,12345,364,1,0,1
201843,12345,71,2,0,2
201843,12345,260,2,0,2
201843,12345,321,2,0,2
201843,12345,328,2,0,2
201843,12345,371,2,0,2
201843,12345,385,2,0,2
201843,12345,426,2,0,2
201843,12345,444,2,0,2
201843,12345,31,10,0,10
201843,12345,360,3,86,99
201843,12345,373,5,94,99
201843,12345,472,5,94,99
201843,12345,475,5,94,99
201843,12345,430,1,98,99
201843,12345,453,1,98,99
As explained by @Dave Cross, your code is quite hard to read (and is missing use strict and use warnings), and your explanation of what you are trying to achieve is quite unclear...
One thing that caught my eye however, is that you start a loop with this statement...
while ($value_total_to_put > 0){
... but at the very end of that same block you do :
last if $value_total_to_put > 0;
}
This will effectively cause Perl to exit the loop after the first iteration, no matter the value of the $value_total_to_put variable. This is probably not what you want. Hence, as far as I understand, you should start your investigation by removing that last statement.
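Removing that last statement fixes the early exit, but the while loop then needs another way to stop when no row can accept more, otherwise it would spin forever. A minimal sketch of such a guard follows. The round-robin fill rule and the name distribute are assumptions for illustration; the sketch is not claimed to reproduce the expected third file.

```perl
use strict;
use warnings;

# Hypothetical helper: hand out $total one unit at a time across rows of
# [value_inputted, missing_value, value_max], updating the rows in place.
# Returns whatever is left of $total.
sub distribute {
    my ($total, $rows) = @_;
    while ($total > 0) {
        my $changed = 0;
        for my $row (@$rows) {
            last if $total == 0;
            my ($inputted, $missing, $max) = @$row;
            next if $missing <= 0 || $inputted >= $max;
            $row->[0]++;          # value_inputted
            $row->[1]--;          # missing_value
            $total--;
            $changed = 1;
        }
        last unless $changed;     # nothing could be placed: stop, do not spin
    }
    return $total;
}

my @rows = ([0, 1, 1], [4, 6, 10], [2, 0, 2]);
my $left = distribute(5, \@rows);
print join(',', @$_), "\n" for @rows;
print "left: $left\n";
```

Note that the rows are updated through the array reference ($row->[0]++), so the changes persist after the loop; the code in the question increments local copies, which are discarded on every pass.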

Join broken lines with perl/awk

I have a huge file containing SQL statements that are broken across lines, like:
PP3697HB ####0
<<<<<<Record has been deleted as per PP3697HB>>>>>>
FROM sys.xtab_ref rc,sys.xtab_sys f,sys.domp ur WHE
RE rc.milf = ur.milf AND rc.molf = f.molf AND ur.dept = 'SWIT'AND ur
.department = 'IND' AND share = '2' AND ur.status = 'DONE' AND f.s
tatus = 'TRUE' AND rc.OPERATOR = '=' AND rc.VALUE = '261366'AND rc.r
unet IN (SELECT milf FROM sys.domp WHERE change = 'OVO' A
ND IND = 75);
I need all these broken lines to be recombined into a single line.
The line should look like:
PP3697HB ####0<<<<<<Record has been deleted as per PP3697HB>>>>>>FROM sys.xtab_ref rc,sys.xtab_sys f,sys.domp ur WHERE rc.milf = ur.milf AND rc.molf = f.molf AND ur.dept = 'SWIT'AND ur.department = 'IND' AND share = '2' AND ur.status = 'DONE' AND f.status = 'TRUE' AND rc.OPERATOR = '=' AND rc.VALUE = '261366'AND rc.runet IN (SELECT milf FROM sys.domp WHERE change = 'OVO' AND IND = 75);
How can I achieve this in Perl or awk?
We can assume that a statement starts on a line matching ^PP(.*) and ends on a line matching (.*);$
Let me know if the problem is hard to understand and I will try to explain again.
Try this one-liner:
awk '!/;$/{printf "%s",$0}/;$/{print}' file
Using tr to delete the newlines and sed to put each SQL statement back on its own line (the \n in the replacement needs GNU sed):
tr -d '\n' < file | sed 's/;/;\n/g'
Try this solution in Perl:
#!/usr/bin/perl -w
use strict;
use warnings;
use Data::Dumper;
## The raw string
my $str = "
PP3697HB ####0
<<<<<<Record has been deleted as per PP3697HB>>>>>>
FROM sys.xtab_ref rc,sys.xtab_sys f,sys.domp ur WHE
RE rc.milf = ur.milf AND rc.molf = f.molf AND ur.dept = 'SWIT'AND ur
.department = 'IND' AND share = '2' AND ur.status = 'DONE' AND f.s
tatus = 'TRUE' AND rc.OPERATOR = '=' AND rc.VALUE = '261366'AND rc.r
unet IN (SELECT milf FROM sys.domp WHERE change = 'OVO' A
ND IND = 75);
";
## Split the given string as per new line.
my @lines = split(/\n/, $str);
## Join every element of the formed array using blank.
$str = join("", @lines);
print $str;
Perl solution:
perl -ne 'chomp $last unless /^PP/; print $last; $last = $_ }{ print $last' FILE.SQL
Assuming there's other lines that are not split up like this, and that only the specified lines require re-joining:
awk '
/^PP/ {insql=1}
/;$/ {insql=0}
insql {printf "%s", $0; next}
{print}
' file
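The same state-machine idea as in the last awk script can be sketched as a Perl subroutine. The name join_statements is hypothetical. It assumes, as stated in the question, that a statement starts on a line beginning with PP and ends on a line ending with a semicolon, and that lines outside a statement should pass through unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: join the physical lines of each broken statement.
sub join_statements {
    my @lines = @_;
    my (@out, $buffer);
    for my $line (@lines) {
        $line =~ s/\r?\n\z//;                 # drop the line ending
        if (defined $buffer) {
            $buffer .= $line;                 # continuation of a statement
        }
        elsif ($line =~ /^PP/) {
            $buffer = $line;                  # a new statement starts here
        }
        else {
            push @out, $line;                 # ordinary line, keep as is
            next;
        }
        if ($buffer =~ /;$/) {                # statement is complete
            push @out, $buffer;
            undef $buffer;
        }
    }
    push @out, $buffer if defined $buffer;    # unterminated tail, if any
    return @out;
}

print "$_\n" for join_statements(
    "PP1 ####0\n", "FROM t WHE\n", "RE a = 1;\n", "plain line\n",
);
```

This version takes the lines as a list, which means the whole file is held in memory. For a really huge file the same buffer logic can be applied inside a while (<$fh>) loop, printing each completed statement as soon as it ends.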

Renaming names in a file using another file without using loops

I have two files:
one.txt looks like this:
>ENST001
(((....)))
(((...)))
>ENST002
(((((((.......))))))
((((...)))
There are about 10000 more ENST entries.
two.txt looks like this:
>ENST001 110
>ENST002 59
and so on for the rest of all ENSTs
I would like to replace each ENST header in one.txt with the combination of the two fields from two.txt, so that the result looks like this:
>ENST001_110
(((....)))
(((...)))
>ENST002_59
(((((((.......))))))
((((...)))
I wrote a MATLAB script to do this, but since it loops over all lines of two.txt for every header, it takes about 6 hours to finish. I think awk, sed, grep, or even Perl could produce the result in a few minutes. This is what I did in MATLAB:
frf = fopen('one.txt', 'r');
frp = fopen('two.txt', 'r');
fw = fopen('result.txt', 'w');
while feof(frf) == 0
line = fgetl(frf);
first_char = line(1);
if strcmp(first_char, '>') == 1 % if the line in one.txt start by > it is the ID
id_fold = strrep(line, '>', ''); % Remove the > symbol
frewind(frp) % Rewind two.txt file after each loop
while feof(frp) == 0
raw = fgetl(frp);
scan = textscan(raw, '%s%s');
id_pos = scan{1}{1};
pos = scan{2}{1};
if strcmp(id_fold, id_pos) == 1 % if both ids are the same
id_new = ['>', id_fold, '_', pos];
fprintf(fw, '%s\n', id_new);
end
end
else
fprintf(fw, '%s\n', line); % if the line doesn't start by > print it to results
end
end
One way using awk. The FNR == NR block processes the first file given in the arguments and saves each number under its ID. The second condition applies to the second file: when the first field matches a key in the array, the line is rewritten with the number appended.
awk '
FNR == NR {
data[ $1 ] = $2;
next
}
FNR < NR && data[ $1 ] {
$0 = $1 "_" data[ $1 ]
}
{ print }
' two.txt one.txt
Output:
>ENST001_110
(((....)))
(((...)))
>ENST002_59
(((((((.......))))))
((((...)))
With sed, you can first run over two.txt to generate the sed commands that perform the replacements, and then apply those commands to one.txt:
First way
sed "$(sed -n '/>ENST/{s=.*\(ENST[0-9]\+\)\s\+\([0-9]\+\).*=s/\1/\1_\2/;=;p}' two.txt)" one.txt
Second way
If the files are huge, the first way fails with an error about too many arguments. The following alternative avoids that error. Execute the three commands one by one:
sed -n '1i#!/bin/sed -f
/>ENST/{s=.*\(ENST[0-9]\+\)\s\+\([0-9]\+\).*=s/\1/\1_\2/;=;p}' two.txt > script.sed
chmod +x script.sed
./script.sed one.txt
The first command generates a sed script that modifies one.txt as you want. chmod makes the new script executable, and the last command runs it. Each file is read only once, and there are no explicit loops.
Note that the first command spans two lines but is still one command. If you delete the newline character, the script breaks; this is because of the i command in sed. See the sed man page for details.
This Perl solution sends the modified one.txt file to STDOUT.
use strict;
use warnings;
open my $f2, '<', 'two.txt' or die $!;
my %ids;
while (<$f2>) {
    $ids{$1} = "$1_$2" if /^>(\S+)\s+(\d+)/;
}
open my $f1, '<', 'one.txt' or die $!;
while (<$f1>) {
    # Rewrite only headers with a known ID, and leave the line ending in place
    s/^>(\S+)/>$ids{$1}/ if /^>(\S+)/ && exists $ids{$1};
    print;
}
Turn the problem on its head. In perl I would do something like this:
#!/usr/bin/perl
use strict;
use warnings;

open(FH1, "one.txt") or die "Cannot open one.txt: $!";
open(FH2, "two.txt") or die "Cannot open two.txt: $!";
open(RESULT, ">result.txt") or die "Cannot open result.txt: $!";
my %data;
while (my $line = <FH2>)
{
    chomp($line);
    # Delete leading angle bracket
    $line =~ s/^>//;
    # split enst and pos
    my ($enst, $pos) = split(/\s+/, $line);
    # Skip blank or malformed lines
    next unless defined $pos;
    # Store POS with ENST as key
    $data{$enst} = $pos;
}
close(FH2);
while (my $line = <FH1>)
{
    # Check line for a known ENST
    if ($line =~ m/^>(ENST\d+)/ && exists $data{$1})
    {
        my $enst = $1;
        # Get pos for ENST
        my $pos = $data{$enst};
        # make new line (double quotes so that \n is a real newline)
        $line = '>' . $enst . '_' . $pos . "\n";
    }
    print RESULT $line;
}
close(FH1);
close(RESULT);
This might work for you (GNU sed):
sed -n '/^$/!s|^\(\S*\)\s*\(\S*\).*|s/^\1.*/\1_\2/|p' two.txt | sed -f - one.txt
Try this MATLAB solution (no loops):
%# read files as cell array of lines
fid = fopen('one.txt','rt');
C = textscan(fid, '%s', 'Delimiter','\n');
C1 = C{1};
fclose(fid);
fid = fopen('two.txt','rt');
C = textscan(fid, '%s', 'Delimiter','\n');
C2 = C{1};
fclose(fid);
%# use regexp to extract ENST numbers from both files
num = regexp(C1, '>ENST(\d+)', 'tokens', 'once');
idx1 = find(~cellfun(@isempty, num)); %# location of >ENST line
val1 = str2double([num{:}]); %# ENST numbers
num = regexp(C2, '>ENST(\d+)', 'tokens', 'once');
idx2 = find(~cellfun(@isempty, num));
val2 = str2double([num{:}]);
%# construct new header lines from file2
C2(idx2) = regexprep(C2(idx2), ' +','_');
%# replace headers lines in file1 with the new headers
[tf,loc] = ismember(val2,val1);
C1( idx1(loc(tf)) ) = C2( idx2(tf) );
%# write result
fid = fopen('three.txt','wt');
fprintf(fid, '%s\n',C1{:});
fclose(fid);