How can I put N's in my final string - perl

I'm a beginner in Perl, so I'm having some problems writing a script.
I want a script that appends the letter N a certain number of times, based on a length that I checked previously. The Ns have to go at the end of a string inside a .txt file. These strings begin with a > and look like this:
A1_23ABR2014_53_CC07.P10R_E07_009.ab1
attgccttttgctagcttatagaataataattcatataaacaaaaaatat
tttatattatttaaaaataaataaaccaaataaagtcattgttgatccaa
ttgaacaaatcatattccatccatttaaagcgtctggataatcaggaata
cgtctaggcattacattaaatccaagaaaatgcataggtaagaatgttaa
I have already written this, but I don't know what to do next.
if $qend > $sendi{
my $leg1 = $qendi - $sendi;
open(my #final, '>>', 'contiggeral.fasta') or die;
while (N < $leg1) {
do N++ in #nomecontig
}
Thanks and sorry for my bad english.

The condition of a non-modifier if must be enclosed in parentheses. Variables must start with a sigil (N has none). There is no in operator in Perl. To pad a string with Ns up to a given length, use the repetition operator x:
my $string = 'abc';
my $final_length = 20;
$string .= 'N' x ($final_length - length $string);
print $string, "\n";
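The same idea can be wrapped in a small helper. This is only a minimal sketch: the name pad_with_n is hypothetical, and returning strings that are already long enough unchanged is an assumption, not something the question specifies.

```perl
use strict;
use warnings;

# Hypothetical helper: pad $seq on the right with 'N' up to $target_length.
sub pad_with_n {
    my ($seq, $target_length) = @_;
    my $missing = $target_length - length $seq;
    # Assumption: a sequence that is already long enough is left as it is.
    $missing = 0 if $missing < 0;
    return $seq . ('N' x $missing);
}

print pad_with_n('attgcc', 10), "\n";
```

Clamping the count at zero avoids the warning that newer Perl versions emit for a negative repeat count.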

Remove duplicate lines on file by substring - preserve order (PERL)

I'm trying to write a Perl script to deal with some 3+ GB text files that are structured like this:
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212123x534534534534xx4545454x232323xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
I want to perform two operations:
Count the number of delimiters per line and compare it to a fixed number (e.g. 5); lines that exceed that number should be written to file.control.
Remove duplicates from the file by substr($line, 0, 7), i.e. the first 7 digits, while preserving order. I want that output written to file.output.
I have coded this as a simple shell script (just bash), but it took too long to run. The same script calling Perl one-liners was quicker, but I'm interested in a way to do this purely in Perl.
The code I have so far is:
open $file_hndl_ot_control, '>', $FILE_OT_CONTROL;
open $file_hndl_ot_out, '>', $FILE_OT_OUTPUT;
# INPUT.
open $file_hndl_in, '<', $FILE_IN;
while ($line_in = <$file_hndl_in>)
{
# Calculate n. of delimiters
my $delim_cur_line = $line_in =~ y/"$delimiter"//;
# print "$commas \n"
if ( $delim_cur_line != $delim_amnt_per_line )
{
print {$file_hndl_ot_control} "$line_in";
}
# Remove duplicates by substr(0,7) maintain order
my substr_in = substr $line_in, 0, 11;
print if not $lines{$substr_in}++;
}
I want the file.output file to look like this:
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
and the file.control file to look like this
(assuming the delimiter control number is 6):
4352342xx23232xxx345545x45454x23232xxx
Could someone assist me? Thank you.
Edit: here is the code I tried:
my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;
open(my $fh1, ">>", "outputcontrol.txt");
open(my $fh2, ">>", "outputoutput.txt");
while ( <> ) {
my $count = ($_ =~ y/x//);
print "$count \n";
# print $_;
if ( $count != $delim_amnt_per_line )
{
print fh1 $_;
}
my ($prefix) = substr $_, 0, 7;
next if $seen{$prefix}++;
print fh2;
}
I don't know if I'm supposed to post new code here, but I tried the above, based on your example. What baffles me (I'm still very new to Perl) is that it doesn't write to either filehandle, yet when I redirected from the command line just as you said, it worked perfectly. The problem is that I need to write to two different files.
It looks like entries with the same seven-character prefix may appear anywhere in the file, so it's necessary to use a hash to keep track of which ones have already been encountered. With a 3GB text file this may result in your perl process running out of memory, in which case a different approach is necessary. Please give this a try and see if it comes in under the bar.
The tr/// operator (the same as y///) doesn't accept variables for its character list, so I've used eval to create a subroutine delimiters() that will count the number of occurrences of $delimiter in $_.
It's usually easiest to pass the input file as a parameter on the command line, and redirect the output as necessary. That way you can run your program on different files without editing the source, and that's how I've written this program. You should run it as
$ perl filter.pl my_input.file > my_output.file
use strict;
use warnings 'all';
my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;
eval "sub delimiters { tr/$delimiter// }";
while ( <> ) {
next if delimiters() == $delim_amnt_per_line;
my ($prefix) = substr $_, 0, 7;
next if $seen{$prefix}++;
print;
}
output
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
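For the follow-up about writing to two files: in the attempted code, print fh1 $_ and print fh2 name bareword filehandles that were never opened, not the lexical $fh1 and $fh2, so nothing reaches the files. Below is a minimal sketch of one way to write both outputs in a single pass. The sub name split_lines is hypothetical, the != comparison mirrors the posted code rather than the "exceed" wording, and the delimiter is counted with a regex match instead of tr///.

```perl
use strict;
use warnings;

# Hypothetical helper: reads lines from $in, writes lines whose delimiter
# count differs from $expected to $control, and writes the first line seen
# for each 7-character prefix to $out.
sub split_lines {
    my ($in, $control, $out, $delimiter, $expected) = @_;
    my %seen;
    while ( my $line = <$in> ) {
        # Count matches in list context; unlike tr///, a regex can
        # interpolate the delimiter variable.
        my $count = () = $line =~ /\Q$delimiter\E/g;
        print {$control} $line if $count != $expected;

        my $prefix = substr $line, 0, 7;
        print {$out} $line unless $seen{$prefix}++;
    }
    return;
}

# Usage, with the file names from the attempt above:
# open my $in,      '<', $ARGV[0]            or die "Cannot open $ARGV[0]: $!";
# open my $control, '>', 'outputcontrol.txt' or die "Cannot open outputcontrol.txt: $!";
# open my $out,     '>', 'outputoutput.txt'  or die "Cannot open outputoutput.txt: $!";
# split_lines($in, $control, $out, 'x', 6);
```

The %seen hash still has to hold every distinct prefix, so the memory caveat above applies to this version as well.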

How to remove non-numeric characters from a numeric string?

I have the following data. I would like to print the last column with the non-numeric characters removed from the string. Kindly help me.
N THR K 149A
CA THR K 149A
C THR K 149A
O THR K 149A
CB THR K 149A
OG1 THR K 149A
CG2 THR K 149A
N SER K 149B
CA SER K 149B
C SER K 149B
O SER K 149B
CB SER K 149B
To solve the above problem, I tried the following program.
#!/usr/bin/perl -w
open(F1, "$ARGV[0]") or die;
chomp(@arr=<F1>);
close F1;
for($i=0;$i<=$#arr;$i++)
{
@pdb=split(/\h/,$arr[$i]);
if($pdb[3] =~ /[A-Z]/*$);{
$pdb[3] =~ s/\D//g;
print "$pdb[1] $pdb[2] $pdb[3]\n";
}
}
OK, unless this is a typo, this is what is wrong with your code:
if($pdb[3] =~ /[A-Z]/*$);{
In this code, you have placed the slash / in the middle of your regex, and also placed a semi-colon there which does not belong anywhere on that line. Also, you are using * as the quantifier, which will not work as intended, because it will allow a match on the empty string (zero matches), which will match all strings. The correct line is:
if($pdb[3] =~ /[A-Z]+$/) {
However, this entire line is incorrect, when taken in context:
if($pdb[3] =~ /[A-Z]*$/) {
$pdb[3] =~ s/\D//g;
Here you only remove non-digits if upper case letters are found. Besides the fact that you are checking for two different things, you do not need to check before substituting, because a substitution will not do anything if it does not match. So... something like this:
if ($foo =~ /A/) {
$foo =~ s/A//g;
is completely redundant, because s/A//g will not do anything unless there is already an A in the string.
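A quick way to see this is to look at what the substitution returns. The snippet below is a minimal sketch with a hypothetical helper name, strip_non_digits, and made-up sample values.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the cleaned string and the number of
# characters removed. s///g returns the number of substitutions made, or
# a false value when nothing matched, leaving the string untouched.
sub strip_non_digits {
    my ($text) = @_;
    my $removed = ($text =~ s/\D//g) || 0;
    return ($text, $removed);
}

for my $sample ('149A', '149') {
    my ($clean, $removed) = strip_non_digits($sample);
    print "$sample -> $clean ($removed removed)\n";
}
```

No guard is needed around the substitution: on a string without non-digits it simply makes zero changes.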
Also, a few more things you should know:
Always use
use strict;
use warnings;
As it will help you prevent a lot of simple mistakes.
Use three argument open, with lexical file handle, and check the return value including the error:
open my $fh, "<", $file or die "Cannot open $file: $!";
You do not need to quote variables, such as with "$ARGV[0]". You leave out the quotes: $ARGV[0].
You are using a C-style for loop. Using a Perl-style loop is preferred, in my opinion:
for my $i (0 .. $#arr)
But you should not be using array indexes unless you need the indexes themselves, so the better loop is:
for my $line (@arr)
But again, as a general rule, it is better to read a file line-by-line than slurping it into an array. For this purpose you would use a while loop, which iterates over the file handle instead of exhausting it all at once:
while (<$fh>) {
# process line $_
}
Using /\h/ as the field delimiter for split is wrong, unless you intended that consecutive whitespace indicates empty fields. The default split is ' ', which splits on multiple whitespace /\s+/, and also strips leading whitespace. With CSV data, it is possibly correct to split on single delimiters, but in that case you should use the specific delimiter, and not a character class like \h.
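The difference is easy to see on a line with runs of spaces. This is a small sketch; the helper names fields_h and fields_default and the sample spacing are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers: split one line in the two ways discussed above.
sub fields_h       { my ($line) = @_; return split /\h/, $line; }
sub fields_default { my ($line) = @_; return split ' ', $line; }

my $line = "CA   THR K 149A";    # three spaces after CA (made-up spacing)

my @h = fields_h($line);          # each single space delimits a field
my @d = fields_default($line);    # runs of whitespace count once

print scalar(@h), " fields with /\\h/\n";
print scalar(@d), " fields with ' '\n";
```

With /\h/ the extra spaces produce empty fields, so the residue number is no longer at index 3.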
Like I said before, using the * quantifier in a regex match is horribly wrong. You might notice that a regex such as /[A-Z]*/ matches anything if you try it out: perl -lnwe 'print /[A-Z]*/ ? "match!" : "no match";' That is because it is allowed to match the empty string, and all strings match the empty string.
And like I also said, you do not need to check before you substitute. At least not for the same thing. So, when simplified, your code becomes:
open my $fh, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!";
while (<$fh>) { # short for while ($_ = <$fh>)
chomp; # short for chomp($_)
my @fields = split; # short for split(' ', $_)
$fields[3] =~ s/\D//g;
print "@fields[1,2,3]\n"; # quoting an array inserts spaces between elements
}
Note that I used an array slice, which selects only the elements at the indicated indexes. You can also write this as:
print join(" ", $fields[1], $fields[2], $fields[3]), "\n";
You might note also that this is possible to achieve using a one-liner:
perl -anlwe '$F[3] =~ s/\D//g; print "@F[1,2,3]"'
The -a switch autosplits the line on whitespace, storing the fields in @F. The -l switch chomps the line and adds newline to print. And the -n switch reads input from STDIN or argument files, whichever is supplied.
Try this
perl -ne 'print "$1\n" if m/(\d+)\D$/' datafile

Concatenating strings from a multidimensional array overwrites the target string in Perl

I've built a two-dimensional array of string values. There are always 12 columns, but the number of rows varies. Now I'd like to build a string from each row, but when I run the following code:
$outstring = "";
for ($i=0; $i < $ctrLASTROW + 1; $i++) {
for ($k=0; $k < 12; $k++){
$datastring = $DATATABLE[$i][$k];
$outstring .= $datastring;
}
}
$outstring takes the first value. Then, on the second and subsequent passes of the inner loop, the value in $outstring gets overlaid. For example, the first value is "DATE"; when the next value, "ABC", is appended, the result is "ABCE" rather than the hoped-for "DATEABC". The "E" is the fourth character of DATE. I figure I'm missing some scalar/list issue, but I've tried who knows how many variations to no avail. When I first started I tried the concatenation directly from @DATATABLE. Same problem. Only quicker.
When you have a problem such as two strings DATE and ABC being concatenated, and the end result is ABCE, or one of the strings overwriting the other, a likely scenario is that you have a file from another OS, with the line endings \r\n, which are chomped, resulting in the string DATE\rABC when concatenated, which then becomes ABCE when printed.
In other words:
my $foo = "DATE\r\n";
my $bar = "ABC\r\n"; # \r\n line endings from file
chomp($foo, $bar); # removes \n but leaves \r
print $foo . $bar; # prints ABCE
To confirm, use
use Data::Dumper;
$Data::Dumper::Useqq = 1;
print Dumper $DATATABLE[$i][$k]; # prints e.g. $VAR1 = "DATE\r";
To resolve, instead of chomp use a regex such as:
$foo =~ s/[\r\n]+\z//;
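Applied to the example above, the substitution can be wrapped in a small helper and run on each value before concatenating. This is a minimal sketch; the name strip_eol is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: remove any trailing mix of \r and \n from a string.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/[\r\n]+\z//;
    return $line;
}

my @cells = map { strip_eol($_) } ("DATE\r\n", "ABC\r\n");
print join('', @cells), "\n";
```

Because the carriage returns are gone, the joined string no longer rewinds the cursor when printed.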

grep in perl without array

Suppose I have one variable to which I have assigned the entire text of a file:
$var = `cat file_name`
Suppose the word 'mine' appears on the 17th line of the file (the location is not known in advance; this is just an example), and I want to check whether the pattern 'word' occurs within the N (e.g. 10) lines after the word 'mine'. How can I do that with a regular expression, without using an array?
Example:
$var = "I am good in perl\n but would like to know about the \n grep command in details";
I want to search for a particular pattern in specific lines only (lines 2 to 3). How can I do it without using an array?
There is a valid case for not using arrays here - when files are prohibitively large.
This is a pretty specific requirement. Rather than beat around the bush to find that Perl idiom, I'd prescribe a subroutine:
sub n_lines_apart {
my ( $file, $n, $first_pattern, $second_pattern ) = @_;
open my $fh, '<', $file or die $!;
my $lines_apart;
while (<$fh>) {
# Match against $_; a bare qr// object is always true and never tests the line.
$lines_apart++ if /$first_pattern/ .. /$second_pattern/;
}
return $lines_apart && $lines_apart <= $n+1;
}
Caveat
The sub above is not designed to handle multiple matches in a single file. Let that be an exercise for the reader.
You can do this with a regular expression match like this:
my $var = `cat $filename`;
while ( $var =~ /(foo)/g ) { # capture the match so that $1 is set
print $1, "\n";
print "match occurred at position ", pos($var), " in the string.\n";
}
This will print out all the matches of the string 'foo' from your string, similar to grep but not using an array (or list). The /$regexp/g syntax makes the regular expression iteratively match against the string from left to right.
I'd recommend reading perlrequick for a tutorial on matching with regular expressions.
Try this:
perl -ne '$m=$. if !$m && /first-pattern/;
print if $m && ($.-$m >= 2 && $.-$m <= 3) && /second-pattern/'
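When the whole file is already in one scalar, as in the question, a single regex can also capture the N lines that follow the first matching line, with no array involved. The following is a minimal sketch, not a definitive solution: the sub name found_within is hypothetical, both patterns are assumed to be regular expressions that should match within one line, and only the first line matching the first pattern is considered.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $second matches within the $n lines that
# follow the first line matching $first. $text holds the whole file.
sub found_within {
    my ($text, $first, $second, $n) = @_;
    # Find the first line containing $first, skip to the end of that line,
    # then capture up to $n following lines as the search window.
    return 0 unless $text =~ /^.*?(?:$first).*\n(?<window>(?:.*\n?){0,$n})/m;
    my $window = $+{window};
    return $window =~ /$second/ ? 1 : 0;
}

my $var = "I am good in perl\n but would like to know about the \n grep command in details";
print found_within($var, 'perl', 'grep', 2) ? "found\n" : "not found\n";
```

The named capture keeps the window retrievable even if the first pattern contains capture groups of its own.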

Parsing files that use synonyms

If I had a text file with the following:
Today (is|will be) a (great|good|nice) day.
Is there a simple way I can generate a random output like:
Today is a great day.
Today will be a nice day.
Using Perl or UNIX utils?
Closures are fun:
#!/usr/bin/perl
use strict;
use warnings;
my @gens = map { make_generator($_, qr~\|~) } (
'Today (is|will be) a (great|good|nice) day.',
'The returns this (month|quarter|year) will be (1%|5%|10%).',
'Must escape %% signs here, but not here (%|#).'
);
for ( 1 .. 5 ) {
print $_->(), "\n" for @gens;
}
sub make_generator {
my ($tmpl, $sep) = @_;
my @lists;
while ( $tmpl =~ s{\( ( [^)]+ ) \)}{%s}x ) {
push @lists, [ split $sep, $1 ];
}
return sub {
sprintf $tmpl, map { $_->[rand @$_] } @lists
};
}
Output:
C:\Temp> h
Today will be a great day.
The returns this month will be 1%.
Must escape % signs here, but not here #.
Today will be a great day.
The returns this year will be 5%.
Must escape % signs here, but not here #.
Today will be a good day.
The returns this quarter will be 10%.
Must escape % signs here, but not here %.
Today is a good day.
The returns this month will be 1%.
Must escape % signs here, but not here %.
Today is a great day.
The returns this quarter will be 5%.
Must escape % signs here, but not here #.
Code:
#!/usr/bin/perl
use strict;
use warnings;
my $template = 'Today (is|will be) a (great|good|nice) day.';
for (1..10) {
print pick_one($template), "\n";
}
exit;
sub pick_one {
my ($template) = @_;
$template =~ s{\(([^)]+)\)}{get_random_part($1)}ge;
return $template;
}
sub get_random_part {
my $string = shift;
my @parts = split /\|/, $string;
return $parts[rand @parts];
}
Logic:
Define template of output (my $template = ...)
Enter loop to print random output many times (for ...)
Call pick_one to do the work
Find all "(...)" substrings, and replace them with random part ($template =~ s...)
Print generated string
Getting random part is simple:
receive extracted substring (my $string = shift)
split it using | character (my @parts = ...)
return random part (return $parts[...)
That's basically all. Instead of using function you could put the same logic in s{}{}, but it would be a bit less readable:
$template =~ s{\( ( [^)]+ ) \)}
{ my @parts = split /\|/, $1;
$parts[rand @parts];
}gex;
Sounds like you may be looking for Regexp::Genex. From the module's synopsis:
#!/usr/bin/perl -l
use Regexp::Genex qw(:all);
$regex = shift || "a(b|c)d{2,4}?";
print "Trying: $regex";
print for strings($regex);
# abdd
# abddd
# abdddd
# acdd
# acddd
# acdddd
Use a regex to match each parenthetical (and the text inside it).
Use a string split operation (pipe delimiter) on the text inside of the matched parenthetical to get each of the options.
Pick one randomly.
Return it as the replacement for that capture.
Smells like a recursive algorithm
Edit: misread and thought you wanted all possibilities
#!/usr/bin/python
import re, random
def expand(line, all):
    # Find the first parenthesized group, e.g. "(is|will be)".
    result = re.search(r'\([^\)]+\)', line)
    if result:
        variants = result.group(0)[1:-1].split("|")
        # Recurse once per alternative to enumerate every combination.
        for v in variants:
            expand(line[:result.start()] + v + line[result.end():], all)
    else:
        all.append(line)
    return all
line = "Today (is|will be) a (great|good|nice) day."
all = expand(line, [])
# choose a random possibility at the end:
print(random.choice(all))
A similar construct that produces a single random line:
def expand_rnd(line):
    result = re.search(r'\([^\)]+\)', line)
    if result:
        variants = result.group(0)[1:-1].split("|")
        # Pick one alternative, then keep expanding the rest of the line.
        choice = random.choice(variants)
        return expand_rnd(
            line[:result.start()] + choice + line[result.end():])
    else:
        return line
This will, however, fail on nested constructs.
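One way to cope with nesting, sketched here in Perl since that is what the question asks for, is to resolve the innermost groups first and repeat until no parentheses are left. This is only an illustration under stated assumptions: the name expand_nested is hypothetical, unbalanced parentheses are left in place, and nested alternatives are not chosen with uniform probability.

```perl
use strict;
use warnings;

# Hypothetical helper: repeatedly replace an innermost group (one with no
# parentheses inside it) with one of its alternatives, chosen at random.
sub expand_nested {
    my ($text) = @_;
    1 while $text =~ s{\(([^()]*)\)}{
        # A limit of -1 keeps trailing empty alternatives such as "(a|)".
        my @parts = split /\|/, $1, -1;
        @parts ? $parts[rand @parts] : '';
    }e;
    return $text;
}

print expand_nested('Today (is|(will|shall) be) a (great|good|nice) day.'), "\n";
```

Each substitution removes one pair of parentheses, so the loop terminates once only plain text remains.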