making one single line - perl

keyword harry /
sally/
tally/
want that whenever the string matches with keyword it should also look for "/" character.This signifies continuation of line
Then I want output as
keyword harry sally tally
==========================
My current code:
#!/usr/bin/perl
open (file2, "trial.txt");
$keyword_1 = keyword;
foreach $line1 (<file2>) {
s/^\s+|\s+$//g;
if ($line1 =~ $keyword_1) {
$line2 =~ (s/$\//g, $line1) ;
print " $line2 " ;
}
}

If the ===== lines in your question are supposed to be in the output, then use
#! /usr/bin/env perl
use strict;
use warnings;
*ARGV = *DATA; # for demo only; delete
sub print_line {
my($line) = #_;
$line =~ s/\n$//; # / fix Stack Overflow highlighting
print $line, "\n",
"=" x (length($line) + 1), "\n";
}
my $line = "";
while (<>) {
$line .= $line =~ /^$|[ \t]$/ ? $_ : " $_";
if ($line !~ s!/\n$!!) { # / ditto
print_line $line;
$line = "";
}
}
print_line $line if length $line;
__DATA__
keyword jim-bob
keyword harry /
sally/
tally/
Output:
keyword jim-bob
================
keyword harry sally tally
==========================

You did not specify what to do with the lines that do not contain the keyword. You might use this code as an inspiration, though:
#!/usr/bin/perl
use warnings;
use strict;
my $on_keyword_line;
while (<>) {
if (/keyword/ or $on_keyword_line) {
chomp;
if (m{/$}) {
chop;
$on_keyword_line = 1;
} else {
$on_keyword_line = 0;
}
print;
} else {
$on_keyword_line = 0;
print "\n";
}
}

A redo is useful when dealing with concatenating continuation lines.
my $line;
while ( defined( $line = <DATA> )) {
chomp $line;
if ( $line =~ s{/\s*$}{ } ) {
$line .= <DATA>;
redo unless eof(DATA);
}
$line =~ s{/}{};
print "$line\n";
}
__DATA__
keyword harry /
sally/
tally/
and
done!!!
$ ./test.pl
keyword harry sally tally and
done!!!

I think you need to simply concatenate all lines that end in a slash, regardless of the keyword.
I suggest this code.
Updated to account for the OP's comment that continuation lines are terminated by backslashes.
while (<>) {
s|\\\s*\z||;
print;
}

Related

Perl script grep

The script is printing the amount of input lines, I want it to print the amount of input lines that are present in another file
#!/usr/bin/perl -w
open("file", "text.txt");
#todd = <file>;
close "file";
while(<>){
if( grep( /^$_$/, #todd)){
#if( grep #todd, /^$_$/){
print $_;
}
print "\n";
}
if for example file contains
1
3
4
5
7
and the input file that will be read from contains
1
2
3
4
5
6
7
8
9
I would want it to print 1,3,4,5 and 7
but 1-9 are being printed instead
UPDATE******
This is my code now and I am getting this error
readline() on closed filehandle todd at ./may6test.pl line 3.
#!/usr/bin/perl -w
open("todd", "<text.txt");
#files = <todd>; #file looking into
close "todd";
while( my $line = <> ){
chomp $line;
if ( grep( /^$line$/, #files) ) {
print $_;
}
print "\n";
}
which makes no sense to me because I have this other script that is basically doing the same thing
#!/usr/bin/perl -w
open("file", "<text2.txt"); #
#file = <file>; #file looking into
close "file"; #
while(<>){
$temp = $_;
$temp =~ tr/|/\t/; #puts tab between name and id
my ($name, $number1, $number2) = split("\t", $temp);
if ( grep( /^$number1$/, #file) ) {
print $_;
}
}
print "\n";
OK, the problem here is - grep sets $_ too. So grep { $_ } #array will always give you every element in the array.
At a basic level - you need to:
while ( my $line = <> ) {
chomp $line;
if ( grep { /^$line$/ } #todd ) {
#do something
}
}
But I'd suggest instead that you might want to consider building a hash of your lines instead:
open( my $input, '<', "text.txt" ) or die $!;
my %in_todd = map { $_ => 1 } <$input>;
close $input;
while (<>) {
print if $in_todd{$_};
}
Note - you might want to watch for trailing linefeeds.

Split a line on every 16th comma

I am using perl to extract "Yes," or "No," from a large CSV, and output to a file using this code
open my $fin, "leads.csv";
my $str;
for (<$fin>) {
if (/^\s*\d+\.\s*(\w+)/) {
$str .= $1 . ",";
}
}
open (MYFILE, '>>data.txt');
print MYFILE $str;
close (MYFILE);
This is working correctly, and outputting data like this http://pastebin.com/r7Lwwz8p, however I need to break
to a new line after the 16th element so it looks like this on output: http://pastebin.com/xC8Lyk5R
Any tips/tricks greatly appreciated!
The following splits a line by commas, and then regroups them by 16 elements:
use strict;
use warnings;
while (my $line = <DATA>) {
chomp $line;
my #fields = split ',', $line;
while (my #data = splice #fields, 0, 16) {
print join(',', #data), "\n";
}
}
__DATA__
LineA,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineB,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineC,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineD,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineE,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineF,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineG,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,LineH,2,3,4,5,6,7,8,9,10,11,12
Outputs:
LineA,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineB,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineC,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineD,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineE,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineF,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineG,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
LineH,2,3,4,5,6,7,8,9,10,11,12
Use a variable to count the number of yes/no matches that you find, and then use the mod (%) operator to insert a newline into the string.
#!/usr/bin/perl
use strict;
use warnings;
open my $fin, "leads.csv";
my $str;
my $count = 0;
for (<$fin>) {
if (/^\s*\d+\.\s*(\w+)/) {
$str .= $1 . ",";
$count++;
}
$str .= "\n" unless ($count % 16);
}
open (MYFILE, '>>data.txt');
print MYFILE $str;
close (MYFILE);

How can I choose particular lines from a file with Perl

I have a file which I want to take all the lines which starts with CDS and a line below.
This lines are like:
CDS 297300..298235
/gene="ENSBTAG00000035659"
I found this in your site:
open(FH,'FILE');
while ($line = <FH>) {
if ($line =~ /Pattern/) {
print "$line";
print scalar <FH>;
}
}
and it works great when the CDS is only a line.
Sometimes in my file is like
CDS join(complement(416559..416614),complement(416381..416392),
complement(415781..416087))
/gene="ENSBTAG00000047603"
or with more lines in the CDS.
How can I take only the CDS lines and the next line of the ID???
please i need your help!
Thank you in advance.
Assuming the "next line" always contains /gene=, one can use the flip-flop operator.
while (<>) {
print if m{^CDS} ... m{/gene=};
}
Otherwise, you need to parse the CDS line. It might be sufficient to count parens.
my $depth = 0;
my $print_next = 0;
while (<>) {
if (/^CDS/) {
print;
$depth = tr/(// - tr/)//;
$print_next = 1;
}
elsif ($depth) {
print;
$depth += tr/(// - tr/)//;
}
elsif ($print_next) {
print;
$print_next = 0;
}
}
You need to break the input into outdented paragraphs. Outdented paragraphs start a non-space character in their first line and start with space characters for the rest.
Try:
#!/usr/bin/env perl
use strict;
use warnings;
# --------------------------------------
my $input_file = shift #ARGV;
my $para = undef; # holds partial paragraphs
open my $in_fh, '<', $input_file or die "could not open $input_file: $!\n";
while( my $line = <$in_fh> ){
# paragraphs are outdented, that is, start with a non-space character
if( $line =~ m{ \A \S }msx ){
# don't do if very first line of file
if( defined $para ){
# If paragraph starts with CDS
if( $para =~ m{ \A CDS \b }msx ){
process_CDS( $para );
}
# delete the old paragraph
$para = undef;
}
}
# add the line to the paragraph,
$para .= $line;
}
close $in_fh or die "could not close $input_file: $!\n";
# the last paragraph is not handle inside the loop, so do it now
if( defined $para ){
# If paragraph starts with CDS
if( $para =~ m{ \A CDS \b }msx ){
process_CDS( $para );
}
}

basic regex and string manipulation for DNA analysis using perl

I am new to perl and would like to do what I think is some basic string manipulation to DNA sequences stored in an rtf file.
Essentially, my file reads (file is in FASTA format):
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGG
TAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGC
GCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAG
GGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCC
ACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAG
GCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGT
What I would like to do is read into my file and print the header (header is >LM1) then match the following DNA sequence GTGCCAGCAGCCGC and then print the preceding DNA sequence.
So my output would look like this:
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTAC
I have written the following program:
#!/usr/bin/perl
use strict; use warnings;
open(FASTA, "<seq_V3_V6_130227.rtf") or die "The file could not be found.\n";
while(<FASTA>) {
chomp($_);
if ($_ =~ m/^>/ ) {
my $header = $_;
print "$header\n";
}
my $dna = <FASTA>;
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
print "$dna";
}
}
close(FASTA);
The problem is that my program reads the file line by line and the output I am receiving is the following:
>LM1
GACGGTATCTAACCAGAAAGCCACGGCTAACTAC
Basically I don't know how to assign the entire DNA sequence to my $dna variable and ultimately don't know how to avoid reading the DNA sequence line by line. Also I am getting this warning:
Use of uninitialized value $dna in pattern match (m//) at stacked.pl line 14, line 1113.
If anyone could give me some help with writing better code or point me in the correct direction it would be much appreciated.
Using the pos function:
use strict;
use warnings;
my $dna = "";
my $seq = "GTGCCAGCAGCCGC";
while (<DATA>) {
if (/^>/) {
print;
} else {
if (/^[AGCT]/) {
$dna .= $_;
}
}
}
if ($dna =~ /$seq/g) {
print substr($dna, 0, pos($dna) - length($seq)), "\n";
}
__DATA__
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGG
TAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGC
GCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAG
GGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCC
ACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAG
GCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGT
You can process a file with multiple entries like so:
while (<DATA>) {
if (/^>/) {
if ($dna =~ /$seq/g) {
print substr($dna, 0, pos($dna) - length($seq)), "\n";
$dna = "";
}
print;
} elsif (/^[AGCT]/) {
$dna .= $_;
}
}
if ($dna && $dna =~ /$seq/g) {
print substr($dna, 0, pos($dna) - length($seq)), "\n";
}
Your while statement reads until the end of file. That means at every loop iteration, $_ is the next line in <FASTA>. So $dna = <FASTA> isn't doing what you think it is. It is reading more than you probably want it to.
while(<FASTA>) { #Reads a line here
chomp($_);
if ($_ =~ m/^>/ ) {
my $header = $_;
print "$header\n";
}
$dna = <FASTA> # reads another line here - Causes skips over every other line
}
Now, you need to read the sequence into your $dna. You can update your while loop with an else statement. So if its a head line, print it, else, we add it to $dna.
while(<FASTA>) {
chomp($_);
if ($_ =~ m/^>/ ) {
# It is a header line, so print it
my $header = $_;
print "$header\n";
} else {
# if it is not a header line, add to your dna sequence.
$dna .= $_;
}
}
After the loop, you can do your regex.
Note: This solution assumes there is only 1 sequence in the fasta file. If you have more than one, your $dna variable will have all the sequences as one.
Edit: Adding simple a way to handle multiple sequences
my $dna = "";
while(<FASTA>) {
chomp($_);
if ($_ =~ m/^>/ ) {
# Does $dna match the regex?
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
print "$1\n";
}
# Reset the sequence
$dna = "";
# It is a header line, so print it
my $header = $_;
print "$header\n";
} else {
# if it is not a header line, add to your dna sequence.
$dna .= $_;
}
}
# Check the last sequence
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
print "$1\n";
}
I came up with a solution using BioSeqIO (and the trunc method from BioSeq from the BioPerl distribution. I also used index to find the subsequence rather than using a regular expression.
This solution does not print out the id, (line begins with >), if the subsequence was not found or if the subsequence begins at the first postion, (and thus no preceding characters).
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $in = Bio::SeqIO->new( -file => "fasta_junk.fasta" ,
-format => 'fasta');
my $out = Bio::SeqIO->new( -file => '>test.dat',
-format => 'fasta');
my $lookup = 'GTGCCAGCAGCCGC';
while ( my $seq = $in->next_seq() ) {
my $pos = index $seq->seq, $lookup;
# if $pos != -1, ($lookup not found),
# or $pos != 0, (found $lookup at first position, thus
# no preceding characters).
if ($pos > 0) {
my $trunc = $seq->trunc(1,$pos);
$out->write_seq($trunc);
}
}
__END__
*** fasta_junk.fasta
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGG
TAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGC
GCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAG
GGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCC
ACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAG
GCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGT
*** contents of test.dat
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAAAGTACTGTCC
GTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTTGACGGTATCTAACCAGAAAG
CCACGGCTAACTAC
read the whole file into memory then look for the regexp
while(<FASTA>) {
chomp($_);
if ($_ =~ m/^>/ ) {
my $header = $_;
print "$header\n";
} else {
$dna .= $_;
}
}
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
print $1;
}

Cleanest Perl parser for Makefile-like continuation lines

A perl script I'm writing needs to parse a file that has continuation lines like a Makefile. i.e. lines that begin with whitespace are part of the previous line.
I wrote the code below but don't feel like it is very clean or perl-ish (heck, it doesn't even use "redo"!)
There are many edge cases: EOF at odd places, single-line files, files that start or end with a blank line (or non-blank line, or continuation line), empty files. All my test cases (and code) are here: http://whatexit.org/tal/flatten.tar
Can you write cleaner, perl-ish, code that passes all my tests?
#!/usr/bin/perl -w
use strict;
sub process_file_with_continuations {
my $processref = shift #_;
my $nextline;
my $line = <ARGV>;
$line = '' unless defined $line;
chomp $line;
while (defined($nextline = <ARGV>)) {
chomp $nextline;
next if $nextline =~ /^\s*#/; # skip comments
$nextline =~ s/\s+$//g; # remove trailing whitespace
if (eof()) { # Handle EOF
$nextline =~ s/^\s+/ /;
if ($nextline =~ /^\s+/) { # indented line
&$processref($line . $nextline);
}
else {
&$processref($line);
&$processref($nextline) if $nextline ne '';
}
$line = '';
}
elsif ($nextline eq '') { # blank line
&$processref($line);
$line = '';
}
elsif ($nextline =~ /^\s+/) { # indented line
$nextline =~ s/^\s+/ /;
$line .= $nextline;
}
else { # non-indented line
&$processref($line) unless $line eq '';
$line = $nextline;
}
}
&$processref($line) unless $line eq '';
}
sub process_one_line {
my $line = shift #_;
print "$line\n";
}
process_file_with_continuations \&process_one_line;
How about slurping the whole file into memory and processing it using regular expressions. Much more 'perlish'. This passes your tests and is much smaller and neater:
#!/usr/bin/perl
use strict;
use warnings;
$/ = undef; # we want no input record separator.
my $file = <>; # slurp whole file
$file =~ s/^\n//; # Remove newline at start of file
$file =~ s/\s+\n/\n/g; # Remove trailing whitespace.
$file =~ s/\n\s*#[^\n]+//g; # Remove comments.
$file =~ s/\n\s+/ /g; # Merge continuations
# Done
print $file;
If you don't mind loading the entire file in memory, then the code below passes the tests.
It stores the lines in an array, adding each line either to the previous one (continuation) or at the end of the array (other).
#!/usr/bin/perl
use strict;
use warnings;
my #out;
while( <>)
{ chomp;
s{#.*}{}; # suppress comments
next unless( m{\S}); # skip blank lines
if( s{^\s+}{ }) # does the line start with spaces?
{ $out[-1] .= $_; } # yes, continuation, add to last line
else
{ push #out, $_; } # no, add as new line
}
$, = "\n"; # set output field separator
$\ = "\n"; # set output record separator
print #out;