adding new line to an output file - perl
I'm writing a script that compares two fields taken from two lines and writes the lines with equal values to a new file. However, the new file contains only the last line; the earlier lines are deleted. I have googled my problem but still haven't found a way out. Sorry for my English.
Thank you very much in advance.
Here is roughly what my script looks like:
for ($tmp1 = 1 ; $tmp1 <= $cnt1 ; $tmp1++) {
$line1 = `head -$tmp1 file1 | tail -1`;
@str1 = split(/\s/, $line1);
for ($tmp2 = 1 ; $tmp2 <= $cnt2 ; $tmp2++) {
$line2 = `head -$tmp2 file2 | tail -1`;
@str2 = split(/\s/, $line2);
open(OUT, ">out");
if ($str1[3] eq $str2[3]) {
print OUT "$line1";
}
}
}
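You should open the file once, before the loop starts, or use open(OUT, ">>out"); to open it in append mode. Opening with ">" truncates the file every time, so each pass through the loop throws away what was written before.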
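The advice above can be sketched as follows. This is only an illustration under a few assumptions: both files are small enough to read into memory, fields are separated by whitespace, and each matching line of file1 should be written once (the original nested loop would write it once per match). The helper names matching_lines and slurp_lines are made up for this example.

```perl
use strict;
use warnings;

# Return every line of @$lines1 whose 4th whitespace-separated field
# equals the 4th field of at least one line in @$lines2.
sub matching_lines {
    my ($lines1, $lines2) = @_;
    my %seen;
    for my $line (@$lines2) {
        my @f = split ' ', $line;
        $seen{ $f[3] } = 1 if defined $f[3];
    }
    return grep {
        my @f = split ' ', $_;
        defined $f[3] && $seen{ $f[3] };
    } @$lines1;
}

# Hypothetical helper: read a whole file into a list of lines.
sub slurp_lines {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!";
    my @lines = <$in>;
    close $in;
    return @lines;
}

# Only touch the disk when both input files are present.
if (-e 'file1' && -e 'file2') {
    my @lines1 = slurp_lines('file1');
    my @lines2 = slurp_lines('file2');
    # The output file is opened exactly once, outside any loop.
    open my $out, '>', 'out' or die "Cannot open out: $!";
    print {$out} $_ for matching_lines(\@lines1, \@lines2);
    close $out;
}
```

Reading each file once also avoids running head and tail for every pair of lines, which is what makes the original loop slow on larger inputs.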
Related
Add new hash keys and then print in a new file
Previously, I posted a question about using a regex to match specific sequence identifiers (IDs). Now I'm looking for recommendations on how to print the data I'm looking for. If you want to see the complete files, here's a GitHub link.

The script takes two files. The first file looks like this (only a part of the file):

    AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 2 2 0.0804934 . .
    AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 4 4 0.0925522 . .
    AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 13 13 0.0250116 . .
    AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 23 23 0.565981 . .
    ...

The sixth column of this file tells me when there is a value >= 0.5. When that happens, my script takes the first column (an ID, to match against the second file) and the fourth column (the position of a letter in the second file). Here is my second file (only a part):

    >AGY29650.2|NA spike protein
    MTYSVFPLMCLLTFIGANAKIVTLPGNDA...EEYDLEPHKIHVH*

As I said, the script takes the ID from the first file, matches it with the ID in the second file, and then looks up the position (fourth column) in the sequence. For example, in file one the fourth row has a positive value (>= 0.5) and the position in the fourth column is 23. The script then looks at position 23 in the sequence of the second file, which is the letter T:

    MTYSVFPLMCLLTFIGANAKIV T LP

Once it has found that letter, it should take the 2 letters to the right and the 2 letters to the left of the position of interest:

    IVTLP

In the previous post, thanks to the help of some people on Stack, I solved the problem caused by the IDs differing between the files (AGY29650_2_NA in file one versus AGY29650.2 in file two). Now I'm looking for help to obtain the output I need to complete the script.

The script is incomplete because I couldn't find a way to print the output of interest, in this case the 5 letters from the second file (the letter at the position given in file one, plus 2 letters to the right and 2 to the left). I have thousands of files like files one and two, so I need some help to complete the script with any idea that you recommend. Here is the script:

    use strict;
    use warnings;
    use Bio::SeqIO;

    my $file = $ARGV[0];
    my $in = $ARGV[1];
    my %fastadata = ();
    my @array_residues = ();

    my $seqio_obj = Bio::SeqIO->new(-file => $in, -format => "fasta" );
    while (my $seq_obj = $seqio_obj->next_seq ) {
        my $dd = $seq_obj->id;
        my $ss = $seq_obj->seq;
        ###my $ee = $seq_obj->desc;
        $fastadata{$dd} = "$ss";
    }

    my $thres = 0.5; ### Selection of values in column N°5 with the following condition: >=0.5

    # Open file
    open (F, $file) or die; ### open the file or end the analyze
    while(my $one = <F>) {### readline => F
        $one =~ s/\n//g;
        $one =~ s/\r//g;
        my @cols = split(/\s+/, $one); ### split columns
        next unless (scalar (@cols) == 7); ### the line must have 7 columns to add to the array
        my $val = $cols[5];
        if ($val >= 0.5) {
            my $position = $cols[3];
            my $id_list = $cols[0];
            $id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
            if (exists($fastadata{$id_list})) {
                my $new_seq = $fastadata{$id_list};
                my $subresidues = substr($new_seq, $position -3, 6);
            }
        }
    }
    close F;

I'm thinking of adding a push call to collect the new data and then printing it to a new file. My expected output is the letter at the position of a positive value (>= 0.5), in this case T (position 23), together with the 2 letters to its right and the 2 letters to its left. With the example data on GitHub (link above), the expected output is:

    IVTLP

Any recommendation or help is welcome. Thanks!
The main problem seems to be that each line has 8 columns, not 7 as the script assumes. Another small issue is that the extracted substring should have 5 characters, not 6. Here is a modified version of the loop that works for me:

    open (F, $file) or die; ### open the file or end the analyze
    while(my $one = <F>) {### readline => F
        chomp $one;
        my @cols = split(/\s+/, $one); ### split columns
        next unless (scalar @cols) == 8; ### the line must have 8 columns to add to the array
        my $val = $cols[5];
        if ($val >= 0.5) {
            my $position = $cols[3];
            my $id_list = $cols[0];
            $id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
            if (exists($fastadata{$id_list})) {
                my $new_seq = $fastadata{$id_list};
                my $subresidues = substr($new_seq, $position -3, 5);
                print $subresidues, "\n";
            }
        }
    }
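The question also asks about collecting the windows with push and printing them afterwards. A minimal sketch of that idea, assuming 1-based positions as in the netOGlyc output; the helper name window_around and the clipping at the sequence ends are additions for illustration, not part of the script above:

```perl
use strict;
use warnings;

# Hypothetical helper: return the residue at 1-based $pos plus $flank
# residues on each side, clipped at the ends of the sequence.
sub window_around {
    my ($seq, $pos, $flank) = @_;
    my $start = $pos - 1 - $flank;           # 0-based index of the left edge
    $start = 0 if $start < 0;
    my $end = $pos + $flank;                 # 0-based index one past the right edge
    $end = length($seq) if $end > length($seq);
    return substr($seq, $start, $end - $start);
}

my $seq = 'MTYSVFPLMCLLTFIGANAKIVTLPGNDA';   # start of the example sequence
my @windows;
push @windows, window_around($seq, $_, 2) for (23);

# STDOUT is used here; a handle from open(my $out, '>', $path) works the same way.
my $out = \*STDOUT;
print {$out} "$_\n" for @windows;            # prints IVTLP
```

Clipping matters for positions such as 1 or 2, where `$position - 3` would be negative and substr would count from the end of the string instead.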
Opening a file inside a subroutine for read/write in Perl
I am trying to open a file inside a subroutine to substitute some lines in the file. Since that was not working, I tried the simpler step of printing a line instead of substituting it, for debugging purposes. The subroutine code follows.

    sub replace {
        while (<INPUT_FILE>){
            my $cell = $_[0];
            our $rpl;
            if ($_=~ /^TASK\|VALUE = (.*)/ ) {
                my $task = $1;
                chomp $task;
                $rpl = $cell . '_' . $task . '_bunch_rpl';
                print "000: $rpl\n";
            }
            elsif ($_=~ /^(.*)\|VALUE = (.*)/ ) {
                my $line = $_;
                chomp $line;
                my $ip_var = $1;
                my $ip_val = $2;
                chomp $ip_var;
                chomp $ip_val;
                my $look= $ip_var."|VALUE";
                open(REPLAY_FILE, "+<$rpl") || die "\ncannot open $rpl\n";
                while (my $rpl_sub = <REPLAY_FILE>) {
                    if ($rpl_sub =~ /^$line/) {
                        print "\n 111: $ip_val";
                    }
                }
                close REPLAY_FILE;
            }
            elsif ($_=~ /^\s*$/) {
                print "\n";
                return ;
            }
        }
    }

The code currently prints the following:

    000: lfr_task62_bunch_rpl
    111: 2.0.9.0
    111: INLINE
    111: POWER
    000: aaa_task14_bunch_rpl

The expected output is:

    000: lfr_task62_bunch_rpl
    111: 2.0.9.0
    111: INLINE
    111: POWER
    000: aaa_task14_bunch_rpl
    111: 0.45
    111: NO

The input sample is:

    TASK_CELL_NAME|VALUE = lfr
    TASK|VALUE = task62
    TASK_VERSION|VALUE = 2.0.9.0
    CHIP_PKG_TYPE|VALUE = INLINE
    JUNK_LINE = JUNK
    JUNK_LINE = JUNK
    FULL_ESD|VALUE = POWER
    TASK_CELL_NAME|VALUE = aaa
    TASK|VALUE = task14
    CUSTOM_CELL_DENSITY|VALUE = 0.45
    CUSTOM_CELL_SS|VALUE = NO

Can someone tell me what mistake I am making here?

UPDATE: Main code below

    my @cell_names;
    open(INPUT_FILE, "<$ip_file") || die "\n!!!ERROR OPENING INPUT FILE. EXITING SCRIPT!!!\n";
    while (<INPUT_FILE>) {
        if ($_=~ /(.*) =\n/ ) {
            $mw -> messageBox(-message=> "\nFormat not correct on line $. of input file. Exiting script\n");
            exit;
        }
        elsif ($_=~ /(.*) =\s+\n/ ) {
            $mw -> messageBox(-message=> "\nFormat not correct on line $. of input file. Exiting script\n");
            exit;
        }
        elsif ($_=~ /(.*) = \s+(.*)/ ) {
            $mw -> messageBox(-message=> "\nFormat not correct on line $. of input file. Exiting script\n");
            exit;
        }
        elsif ($_=~ /^TASK_CELL_NAME\|VALUE = (.*)/ ) {
            my $cell_name = $1;
            chomp $cell_name;
            unless(grep( /^$cell_name $/, @cell_names )) {
                push @cell_names, "$cell_name ";
                #$count++;
                #print "\nCELL NAME: $cell_name\n";
                replace($cell_name);
            }
        }
    }
    close INPUT_FILE;

Update: lfr_task62_bunch_rpl before running the code:

    # Select fund
    FUND|VALUE = mmi
    # Select bank
    BANK|VALUE = citi
    # Select cell name
    TASK_CELL_NAME|VALUE = lfr
    # Select task
    TASK|VALUE = task62
    # Select task version
    TASK_VERSION|VALUE = 1.0.9.0
    # Select fund type
    FULL_ESD|VALUE = MUTUAL
    # Select customer premium
    CUSTOM_CELL_SS|VALUE = YES
    # Select customer brand density
    CUSTOM_CELL_DENSITY|VALUE = 0.76
    # Select card chip
    CHIP_PKG_TYPE|VALUE|VALUE = OUTLINE

Expected lfr_task62_bunch_rpl after running the code:

    # Select fund
    FUND|VALUE = mmi
    # Select bank
    BANK|VALUE = citi
    # Select cell name
    TASK_CELL_NAME|VALUE = lfr
    # Select task
    TASK|VALUE = task62
    # Select task version
    TASK_VERSION|VALUE = 2.0.9.0
    # Select fund type
    FULL_ESD|VALUE = POWER
    # Select customer premium
    CUSTOM_CELL_SS|VALUE = YES
    # Select customer brand density
    CUSTOM_CELL_DENSITY|VALUE = 0.76
    # Select card chip
    CHIP_PKG_TYPE|VALUE|VALUE = INLINE
It's not really clear what this code is supposed to do, but I can immediately see a few problems with the logic. Let's step through a few iterations of the loop, using your sample data file.

The first time, the line of data read in is:

    TASK_CELL_NAME|VALUE = lfr

That matches your second regex. You set a few variables and then (because $ip_var is equal to "TASK_CELL_NAME") you skip to the else clause and close a filehandle that isn't open.

Next time round, we read:

    TASK|VALUE = task62

That matches your first regex. The variable $rpl_file is set to "XXX_lfr_bunch_rpl" (where 'XXX' is the parameter passed to the subroutine; obviously, I don't know what that is). You print a "000" line with that value and open the file with that name in r/w mode.

Third time round, we get this data:

    TASK_VERSION|VALUE = 2.0.9.0

This matches your second regex, and because $ip_var isn't equal to "TASK_CELL_NAME" we go into the if clause. This reads from your open filehandle and prints a "111" line. But this generates a warning if you have use warnings switched on, as the line includes the value of $rpl_file, which is currently undefined. It was set the last time around the loop, but because the variable is declared inside the loop, it has now lost its value. We then close the filehandle.

The fourth iteration is the last one that's really interesting. We get this data:

    CHIP_PKG_TYPE|VALUE = INLINE

This also matches the second regex, so we do much the same as in the third iteration. The difference is that when we try to read from the filehandle, we get a warning because that filehandle is closed. Oh, and then we close it again for good measure :-)

As I said at the start, I can't really work out what we're trying to do here. But I can see that the logic is very strange. You really need to go back to the drawing board and think through your logic again.

Update: With the updated version of your code, I'm still seeing problems. On the first iteration, the data is:

    TASK_CELL_NAME|VALUE = lfr

So this matches your second regex. That goes into the piece of code that opens the other file and tries to read from it. But it expects to find the filename in $rpl, and that variable hasn't been given a value yet. So the open() fails and the program dies.
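The point about a variable losing its value between iterations can be illustrated with a small sketch. It is not a fix for the posted program: it only shows a replay-file name that is declared before the loop, so that it survives from the TASK line to the settings that follow, and a guard for settings that arrive before any TASK line. The helper name settings_by_replay_file is made up, and no files are opened.

```perl
use strict;
use warnings;

# Hypothetical helper: walk "NAME|VALUE = value" lines and pair every
# setting with the replay-file name built from the most recent TASK line.
sub settings_by_replay_file {
    my ($cell, @lines) = @_;
    my $rpl;                                  # declared outside the loop, so it persists
    my @pairs;
    for my $line (@lines) {
        if ($line =~ /^TASK\|VALUE = (\S+)/) {
            $rpl = $cell . '_' . $1 . '_bunch_rpl';
        }
        elsif ($line =~ /^(\w+)\|VALUE = (.*\S)/) {
            next unless defined $rpl;         # no TASK line seen yet: nothing to open
            push @pairs, [$rpl, $1, $2];
        }
    }
    return @pairs;
}

my @sample = (
    "TASK_CELL_NAME|VALUE = lfr\n",
    "TASK|VALUE = task62\n",
    "TASK_VERSION|VALUE = 2.0.9.0\n",
    "JUNK_LINE = JUNK\n",
);
for my $pair (settings_by_replay_file('lfr', @sample)) {
    print join(' ', @$pair), "\n";            # lfr_task62_bunch_rpl TASK_VERSION 2.0.9.0
}
```

With the name available for the whole block, the replay file could be opened once per TASK line rather than once per setting.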
Loop recursive through two files in perl
Hi guys, I have an issue to solve. I have 2 files.

File A (col1,col2,value_total_to_put):

    201843,12345,30

File B (col1,col2,col3,value_inputted,missing_value,value_max):

    201843,12345,447,4,0,4
    201843,12345,448,0,0,4
    201843,12345,449,0,0,2
    201843,12345,450,4,0,4
    201843,12345,451,2,0,2
    201843,12345,455,4,0,4
    201843,12345,457,0,0,4
    201843,12345,899,10,0,10
    201843,12345,334,0,1,1
    201843,12345,364,0,1,1
    201843,12345,71,0,2,2
    201843,12345,260,0,2,2
    201843,12345,321,0,2,2
    201843,12345,328,0,2,2
    201843,12345,371,0,2,2
    201843,12345,385,0,2,2
    201843,12345,426,0,2,2
    201843,12345,444,0,2,2
    201843,12345,31,4,6,10
    201843,12345,360,2,87,99
    201843,12345,373,4,95,99
    201843,12345,472,4,95,99
    201843,12345,475,4,95,99
    201843,12345,430,0,99,99
    201843,12345,453,0,99,99
    201843,12345,463,0,99,99
    201843,12345,482,0,99,99
    201843,12345,484,0,99,99

My keys are col1 and col2 in both files. I am doing it the way shown below, and my loop is wrong because it stops when I reach the end of File B. What I want is to match File A and File B on col1 and col2 and, while value_total_to_put is > 0, withdraw 1 from it on each pass and add it to value_inputted in File B whenever value_inputted is less than value_max. To withdraw from File A, missing_value must be > 0. For the result, I want to print each row once value_inputted equals value_max, or, failing that, its last value when value_total_to_put reaches 0.

    while ( <FA> ){
        chomp;
        my($col1,$col2, $value_total_to_put) = split ",";
        push @A, [$col1,$col2, $value_total_to_put];
    }
    my @B;
    while ( <FB> ){
        chomp;
        my($col1,$col2,$col3, $value_inputted, $missing_value, $value_max) = split ",";
        push @B, [$col1,$col2,$col3, $value_inputted, $missing_value, $value_max];
    }
    foreach my $line (@A){
        my $idxl = @$line[0].",".@$line[1];
        my $value_total_to_put = @$line[2];
        while ($value_total_to_put > 0){
            foreach my $row ( @B ){
                if ( $idxr eq $idxl ){
                    my $idxr = @$row[0].",".@$row[1];
                    my $value_inputted = @$row[3];
                    my $value_max = @$row[5];
                    my $missing_value = @$row[4];
                    if ( ($value_inputted eq 0) and ($missing_value eq 0)){
                        #do_nothing
                    }
                    elsif($value_inputted == $value_max){
                        #do_nothing
                        print join(",", $idxr, @$row[2],"Value_inputted: ".$value_inputted, "Missing_value: ".$missing_value, "Value_max:".$value_max, "Total: ".$value_total_to_put)."\n";
                    }else{
                        $value_inputted++;
                        $missing_value--;
                        $value_total_to_put--;
                    }
                }
            }
            last if $value_total_to_put > 0;
        }
    }

The third file should look like this:

    201843,12345,447,4,0,4
    201843,12345,450,4,0,4
    201843,12345,451,2,0,2
    201843,12345,455,4,0,4
    201843,12345,899,10,0,10
    201843,12345,334,1,0,1
    201843,12345,364,1,0,1
    201843,12345,71,2,0,2
    201843,12345,260,2,0,2
    201843,12345,321,2,0,2
    201843,12345,328,2,0,2
    201843,12345,371,2,0,2
    201843,12345,385,2,0,2
    201843,12345,426,2,0,2
    201843,12345,444,2,0,2
    201843,12345,31,10,0,10
    201843,12345,360,3,86,99
    201843,12345,373,5,94,99
    201843,12345,472,5,94,99
    201843,12345,475,5,94,99
    201843,12345,430,1,98,99
    201843,12345,453,1,98,99
As explained by @Dave Cross, your code is quite hard to read (and lacks use strict and use warnings), and your explanation of what you are trying to achieve is quite unclear. One thing that caught my eye, however, is that you start a loop with this statement:

    while ($value_total_to_put > 0){

but at the very end of that same block you do:

    last if $value_total_to_put > 0;
    }

This effectively causes Perl to exit the loop after the first iteration, no matter what the value of $value_total_to_put is: if it is still positive, last fires, and if it is not, the while condition fails anyway. This is probably not what you want. Hence, as far as I understand, you should start your investigation by removing that last statement.
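Simply deleting the last statement could leave the while loop spinning forever once no row can take another unit. One possible shape for the loop, sketched under the assumption that a row may receive a unit only while value_inputted < value_max and missing_value > 0; the helper name distribute is invented, and the sketch does not attempt to reproduce the exact allocation order of the expected third file:

```perl
use strict;
use warnings;

# Hypothetical helper: hand out $total units, one per eligible row per pass.
# Each row is [value_inputted, missing_value, value_max] and is updated in place.
# Returns whatever could not be placed.
sub distribute {
    my ($total, @rows) = @_;
    while ($total > 0) {
        my $progress = 0;
        for my $row (@rows) {
            last if $total == 0;
            next unless $row->[0] < $row->[2] && $row->[1] > 0;
            $row->[0]++;               # value_inputted
            $row->[1]--;               # missing_value
            $total--;
            $progress = 1;
        }
        last unless $progress;         # a full pass placed nothing: stop
    }
    return $total;
}

my @rows = ([0, 1, 1], [0, 2, 2], [4, 0, 4]);
my $left = distribute(5, @rows);
print join(',', @$_), "\n" for @rows;  # 1,0,1 then 2,0,2 then 4,0,4
print "left over: $left\n";            # left over: 2
```

The progress flag plays the role that the misplaced last was presumably meant to play: it ends the outer loop only when another pass would change nothing.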
Join broken lines with perl/awk
I have a huge file with broken SQL statements like:

    PP3697HB ####0
    <<<<<<Record has been deleted as per PP3697HB>>>>>>
    FROM sys.xtab_ref rc,sys.xtab_sys f,sys.domp ur WHE
    RE rc.milf = ur.milf AND rc.molf = f.molf AND ur.dept = 'SWIT'AND ur
    .department = 'IND' AND share = '2' AND ur.status = 'DONE' AND f.s
    tatus = 'TRUE' AND rc.OPERATOR = '=' AND rc.VALUE = '261366'AND rc.r
    unet IN (SELECT milf FROM sys.domp WHERE change = 'OVO' A
    ND IND = 75);

I need all these broken lines to be recombined into a single line. The line should look like:

    PP3697HB ####0<<<<<<Record has been deleted as per PP3697HB>>>>>>FROM sys.xtab_ref rc,sys.xtab_sys f,sys.domp ur WHERE rc.milf = ur.milf AND rc.molf = f.molf AND ur.dept = 'SWIT'AND ur.department = 'IND' AND share = '2' AND ur.status = 'DONE' AND f.status = 'TRUE' AND rc.OPERATOR = '=' AND rc.VALUE = '261366'AND rc.runet IN (SELECT milf FROM sys.domp WHERE change = 'OVO' AND IND = 75);

How can I achieve this in perl/awk? We can say that the start of the line must match ^PP(.*) and the end of the SQL statement must match (.*);$ Let me know if you have difficulty understanding the problem and I will try to explain again.
try this one-liner: awk '!/;$/{printf "%s",$0}/;$/{print}' file
Using tr to delete the newlines and sed (GNU) to put each SQL statement back on its own line: tr -d '\n' < file | sed 's/;/;\n/g'
Try this solution in Perl:

    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use Data::Dumper;

    ## The raw string
    my $str = "
    PP3697HB ####0
    <<<<<<Record has been deleted as per PP3697HB>>>>>>
    FROM sys.xtab_ref rc,sys.xtab_sys f,sys.domp ur WHE
    RE rc.milf = ur.milf AND rc.molf = f.molf AND ur.dept = 'SWIT'AND ur
    .department = 'IND' AND share = '2' AND ur.status = 'DONE' AND f.s
    tatus = 'TRUE' AND rc.OPERATOR = '=' AND rc.VALUE = '261366'AND rc.r
    unet IN (SELECT milf FROM sys.domp WHERE change = 'OVO' A
    ND IND = 75);
    ";

    ## Split the given string on newlines.
    my @lines = split(/\n/, $str);

    ## Join every element of the resulting array with an empty string.
    $str = join("", @lines);
    print $str;
Perl solution: perl -ne 'chomp $last unless /^PP/; print $last; $last = $_ }{ print $last' FILE.SQL
Assuming there's other lines that are not split up like this, and that only the specified lines require re-joining: awk ' /^PP/ {insql=1} /;$/ {insql=0} insql {printf "%s", $0; next} {print} ' file
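For comparison, the same join-until-semicolon idea can be written as a small Perl function. This is a sketch that assumes every statement ends with ';' at the end of a physical line and that no other line does; the helper name join_statements is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: glue physical lines together until one ends in ';'.
sub join_statements {
    my @lines = @_;                       # work on a copy of the input
    my @statements;
    my $buffer = '';
    for my $line (@lines) {
        $line =~ s/\r?\n\z//;             # drop the line ending (LF or CRLF)
        $buffer .= $line;
        if ($buffer =~ /;\s*\z/) {        # statement complete
            push @statements, $buffer;
            $buffer = '';
        }
    }
    push @statements, $buffer if length $buffer;   # trailing fragment, if any
    return @statements;
}

my @broken = ("PP1 ur WHE\n", "RE a = 1 A\n", "ND b = 2;\n", "PP2 x = 3;\n");
print "$_\n" for join_statements(@broken);
# PP1 ur WHERE a = 1 AND b = 2;
# PP2 x = 3;
```

To run it over a file, the lines can be supplied with something like join_statements(<$fh>) on an opened handle.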
Renaming names in a file using another file without using loops
I have two files. The first (one.txt) looks like this:

    >ENST001
    (((....))) (((...)))
    >ENST002
    (((((((.......)))))) ((((...)))

and there are about 10000 more ENST entries. The second (two.txt) looks like this:

    >ENST001 110
    >ENST002 59

and so on for all the other ENSTs. I would like to replace each ENST header in one.txt with the combination of the two fields from two.txt, so that the result looks like this:

    >ENST001_110
    (((....))) (((...)))
    >ENST002_59
    (((((((.......)))))) ((((...)))

I wrote a MATLAB script to do this, but since it loops over all the lines of two.txt for every header, it takes about 6 hours to finish. I think that with awk, sed, grep, or even perl we could get the result in a few minutes. This is what I did in MATLAB:

    frf = fopen('one.txt', 'r');
    frp = fopen('two.txt', 'r');
    fw = fopen('result.txt', 'w');
    while feof(frf) == 0
        line = fgetl(frf);
        first_char = line(1);
        if strcmp(first_char, '>') == 1 % if the line in one.txt starts with > it is the ID
            id_fold = strrep(line, '>', ''); % Remove the > symbol
            frewind(frp) % Rewind two.txt file after each loop
            while feof(frp) == 0
                raw = fgetl(frp);
                scan = textscan(raw, '%s%s');
                id_pos = scan{1}{1};
                pos = scan{2}{1};
                if strcmp(id_fold, id_pos) == 1 % if both ids are the same
                    id_new = ['>', id_fold, '_', pos];
                    fprintf(fw, '%s\n', id_new);
                end
            end
        else
            fprintf(fw, '%s\n', line); % if the line doesn't start with > print it to results
        end
    end
One way using awk. The FNR == NR block processes the first file given on the command line and saves each number, keyed by its ID. The second condition applies to the second file: when the first field matches a key in the array, the line is rewritten with the number appended.

    awk '
        FNR == NR { data[ $1 ] = $2; next }
        FNR < NR && data[ $1 ] { $0 = $1 "_" data[ $1 ] }
        { print }
    ' two.txt one.txt

Output:

    >ENST001_110
    (((....))) (((...)))
    >ENST002_59
    (((((((.......)))))) ((((...)))
With sed, you can first run over two.txt alone to generate the sed commands that do the replacement you want, and then run those commands on one.txt.

First way:

    sed "$(sed -n '/>ENST/{s=.*\(ENST[0-9]\+\)\s\+\([0-9]\+\).*=s/\1/\1_\2/;=;p}' two.txt)" one.txt

Second way: if the files are huge, the first way fails with an "Argument list too long" error. To get around that, run these three commands one by one:

    sed -n '1i#!/bin/sed -f
    />ENST/{s=.*\(ENST[0-9]\+\)\s\+\([0-9]\+\).*=s/\1/\1_\2/;=;p}' two.txt > script.sed
    chmod +x script.sed
    ./script.sed one.txt

The first command builds the sed script that modifies one.txt as you want, chmod makes that script executable, and the last command runs it. Each file is read only once, and there are no loops. Note that the first command spans two lines but is still a single command. If you delete the newline character, the script breaks; this is because of the i command in sed. See the sed man page for details.
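This Perl solution sends the modified one.txt file to STDOUT. Headers that have no entry in two.txt are printed unchanged.

    use strict;
    use warnings;

    open my $f2, '<', 'two.txt' or die $!;
    my %ids;
    while (<$f2>) {
        $ids{$1} = "$1_$2" if /^>(\S+)\s+(\d+)/;
    }

    open my $f1, '<', 'one.txt' or die $!;
    while (<$f1>) {
        # Rebuild the header with its newline; \s*$ in a substitution would swallow it
        $_ = ">$ids{$1}\n" if /^>(\S+)\s*$/ and exists $ids{$1};
        print;
    }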
Turn the problem on its head: read two.txt into a hash first, then make a single pass over one.txt. In perl I would do something like this:

    #!/usr/bin/perl
    use strict;
    use warnings;

    open(FH1, '<', "one.txt") or die "Cannot open one.txt: $!";
    open(FH2, '<', "two.txt") or die "Cannot open two.txt: $!";
    open(RESULT, '>', "result.txt") or die "Cannot open result.txt: $!";

    my %data;
    while (my $line = <FH2>) {
        chomp($line);
        # Delete leading angle bracket
        $line =~ s/^>//;
        # split enst and pos
        my ($enst, $pos) = split(/\s+/, $line);
        # Store POS with ENST as key
        $data{$enst} = $pos;
    }
    close(FH2);

    while (my $line = <FH1>) {
        # Check line for ENST
        if ($line =~ m/^>(ENST\d+)/ and exists $data{$1}) {
            my $enst = $1;
            # Get pos for ENST
            my $pos = $data{$enst};
            # make new line (double quotes so that \n is a real newline)
            $line = '>' . $enst . '_' . $pos . "\n";
        }
        print RESULT $line;
    }
    close(FH1);
    close(RESULT);
This might work for you (GNU sed): sed -n '/^$/!s|^\(\S*\)\s*\(\S*\).*|s/^\1.*/\1_\2/|p' two.txt | sed -f - one.txt
Try this MATLAB solution (no loops):

    %# read files as cell array of lines
    fid = fopen('one.txt','rt');
    C = textscan(fid, '%s', 'Delimiter','\n');
    C1 = C{1};
    fclose(fid);
    fid = fopen('two.txt','rt');
    C = textscan(fid, '%s', 'Delimiter','\n');
    C2 = C{1};
    fclose(fid);

    %# use regexp to extract ENST numbers from both files
    num = regexp(C1, '>ENST(\d+)', 'tokens', 'once');
    idx1 = find(~cellfun(@isempty, num));   %# location of >ENST line
    val1 = str2double([num{:}]);            %# ENST numbers
    num = regexp(C2, '>ENST(\d+)', 'tokens', 'once');
    idx2 = find(~cellfun(@isempty, num));
    val2 = str2double([num{:}]);

    %# construct new header lines from file2
    C2(idx2) = regexprep(C2(idx2), ' +','_');

    %# replace headers lines in file1 with the new headers
    [tf,loc] = ismember(val2,val1);
    C1( idx1(loc(tf)) ) = C2( idx2(tf) );

    %# write result
    fid = fopen('three.txt','wt');
    fprintf(fid, '%s\n',C1{:});
    fclose(fid);