How to remove new line characters until each line has a specific number of instances of a specific character? - perl

I have a real mess of a pipe-delimited file, which I need to load to a database. The file has 35 fields, and thus 34 pipes. One of the fields is comprised of HTML code which, for some records, includes multiple line breaks. Unfortunately there's no patter as to where the line breaks fall.
The solution I've come up with is to count the number of pipes in each line and until that number reaches 34, remove the new line character from that line. I'm not incredibly well-versed in Perl, but I think I'm close to achieving what I'm looking to do. Any suggestions?
#!/usr/local/bin/perl
use strict;
open (FILE, 'test.txt');
while (<FILE>) {
chomp;
my $line = $_;
#remove null characters that are included in file
$line =~ tr/\x00//;
#count number of pipes
my $count = ($line =~ tr/|//);
#each line should have 34 pipes
while ($count < 34) {
#remove new lines until line has 34 pipes
$line =~ tr/\r\n//;
$count = ($line =~ tr/|//);
print "$line\n";
}
}

This should work I guess.
#!/usr/bin/perl
use strict;
open (FILE, 'test.txt');
my $num_pipes = 0, my $line_num = 0;
my $tmp = "";
while (<FILE>)
{
$line_num++;
chomp;
my $line = $_;
$line =~ tr/\x00//; #remove null characters that are included in file
$num_pipes += ($line =~ tr/|//); #count number of pipes
if ($num_pipes == 34 && length($tmp))
{
$tmp .= $line;
print "$tmp\n";
# Reset values.
$tmp = "";
$num_pipes = 0;
}
elsif ($num_pipes == 34 && length($tmp) == 0)
{
print "$line\n";
$num_pipes = 0;
}
elsif ($num_pipes < 34)
{
$tmp .= $line;
}
elsif ($num_pipes > 34)
{
print STDERR "Error before line $line_num. Too many pipes ($num_pipes)\n";
$num_pipes = 0;
$tmp = "";
}
}

Twiddle with $/, the input record separator?
while (!eof(FILE)) {
# assemble a row of data: 35 pipe separated fields, possibly over many lines
my #fields = ();
{
# read 34 fields from FILE:
local $/ = '|';
for (1..34) {
push #fields, scalar <FILE>;
}
} # $/ is set back to original value ("\n") at the end of this block
push #fields, scalar <FILE>; # read last field, which ends with newline
my $line = join '|', #fields;
... now you can process $line, and you already have the #fields ......
}

Related

truncate all lines in a file while preserving whole words

I'm trying to shorten each line of a file to 96 characters while preserving whole words. If a line is under or equal to 96 chars, I want to do nothing with that line. If it over 96 chars, I want it cut it down to the closest amount less than 96 while preserving whole words. When I run this code, I get a blank file.
use Text::Autoformat;
use strict;
use warnings;
#open the file
my $filename = $ARGV[0]; # store the 1st argument into the variable
open my $file, '<', $filename;
open my $fileout, '>>', $filename.96;
my #file = <$file>; #each line of the file into an array
while (my $line = <$file>) {
chomp $line;
foreach (#file) {
#######
sub truncate($$) {
my ( $line, $max ) = #_;
# always do nothing if already short enough
( length( $line ) <= $max ) and return $line;
# forced to chop a word anyway
if ( $line =~ /\s/ ) {
return substr( $line, 0, $max );
}
# otherwise truncate on word boundary
$line =~ s/\S+$// and return $line;
die; # unreachable
}
#######
my $truncated = &truncate($line,96);
print $fileout "$truncated\n";
}
}
close($file);
close($fileout);
You have no output because you have no input.
1. my #file = <$file>; #each line of the file into an array
2. while (my $line = <$file>) { ...
The <$file> operation line 1 is in list context "consumes" all the input and loads it into #file. The <$file> operation in line 2 has no more input to read, so the while loop does not execute.
You either want to stream from the filehandle
# don't call #file = <$file>
while (my $line = <$file>) {
chomp $line;
my $truncated = &truncate($line, 96);
...
}
Or read from the array of file contents
my #file = <$file>;
foreach my $line (#file) {
chomp $line;
my $truncated = &truncate($line, 96);
...
}
If the input is large, the former format has the advantage of just loading a single line into memory at a time.

how can I choose particular lines from a file by using a second file

I am trying to filter a file (trying to find the best blast hits) by using a second file.
The file that I want to filter looks like this:
conserved1 chr22 100.00 92 0 0 1 92 19679676 19679767 2e-44 182
.....................
The second file (which is the first input in my script) that I’m using is like this:
conserved1 92
conserved2 76
.....................
(the first column is the name of my ‘item’ which is exactly with the first column of the previous file, and the second column is the size).
I stored the second file in a hash in order to connect the first file with the sizes of the conserved elements and to filter only the lines which the size (4th column) be the 70% of the size (from the 2nd file).
I wrote this script for that purpose, it works but it prints each chosen line more than once.
How can I fix this?
my $size_file = $ARGV[0];
my $alignment_file = $ARGV[1];
open my $con_info, $size_file or die "Could not open $size_file: $!";
my %hash;
while (<$con_info>)
{
chomp;
my ($key, $val) = split /\t/;
$hash{$key} .= exists $hash{$key} ? "$val" : $val;
}
#print "# %hash\n", Dump \%hash;
#print %hash;
#print "#{[%hash]}";
close $con_info;
open my $al_info, $alignment_file or die "Could not open $alignment_file: $!";
while (my $line = <$al_info>) {
chomp;
my#data = split('\t', $line);
my $con_name = $data[0];
my $evalue = $data[10];
my $percent = $data[2];
my $length = $data[3];
# print $con_name. "\n";
foreach my $key (keys %hash) {
if ($key == $con_name) {
#print "key: $key, value: $hash{$key}\n";
if ($evalue <= 1e-4 && $length >= 0.70 * $hash{$key}) {
print $line;
}
}
}
}
The output should be the first file (the file which is at the first code box) but with less lines, the lines that passes through the last if condition.
Thank you very- very much for your help!!!
if ($key == $con_name)
should be
if ($key eq $con_name)
as this should be string comparison.
And you don't really need foreach loop, just to pick one particular key:
while (my $line = <$al_info>) {
chomp($line);
my #data = split('\t', $line);
# my $con_name = $data[0];
# my $percent = $data[2];
# my $length = $data[3];
# my $evalue = $data[10];
my ($con_name, $percent, $length, $evalue) = #data[0,2,3,10];
# print $con_name. "\n";
if ($evalue <= 1e-4 && $length >= 0.70 * $hash{$con_name}) {
print $line;
}
}

basic regex and string manipulation for DNA analysis using perl

I am new to perl and would like to do what I think is some basic string manipulation to DNA sequences stored in an rtf file.
Essentially, my file reads (file is in FASTA format):
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGG
TAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGC
GCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAG
GGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCC
ACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAG
GCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGT
What I would like to do is read into my file and print the header (header is >LM1) then match the following DNA sequence GTGCCAGCAGCCGC and then print the preceding DNA sequence.
So my output would look like this:
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTAC
I have written the following program:
#!/usr/bin/perl
use strict; use warnings;
open(FASTA, "<seq_V3_V6_130227.rtf") or die "The file could not be found.\n";
while(<FASTA>) {
chomp($_);
if ($_ =~ m/^>/ ) {
my $header = $_;
print "$header\n";
}
my $dna = <FASTA>;
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
print "$dna";
}
}
close(FASTA);
The problem is that my program reads the file line by line and the output I am receiving is the following:
>LM1
GACGGTATCTAACCAGAAAGCCACGGCTAACTAC
Basically I don't know how to assign the entire DNA sequence to my $dna variable and ultimately don't know how to avoid reading the DNA sequence line by line. Also I am getting this warning:
Use of uninitialized value $dna in pattern match (m//) at stacked.pl line 14, line 1113.
If anyone could give me some help with writing better code or point me in the correct direction it would be much appreciated.
Using the pos function:
use strict;
use warnings;
my $dna = "";
my $seq = "GTGCCAGCAGCCGC";
while (<DATA>) {
if (/^>/) {
print;
} else {
if (/^[AGCT]/) {
$dna .= $_;
}
}
}
if ($dna =~ /$seq/g) {
print substr($dna, 0, pos($dna) - length($seq)), "\n";
}
__DATA__
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGG
TAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGC
GCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAG
GGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCC
ACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAG
GCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGT
You can process a file with multiple entries like so:
while (<DATA>) {
if (/^>/) {
if ($dna =~ /$seq/g) {
print substr($dna, 0, pos($dna) - length($seq)), "\n";
$dna = "";
}
print;
} elsif (/^[AGCT]/) {
$dna .= $_;
}
}
if ($dna && $dna =~ /$seq/g) {
print substr($dna, 0, pos($dna) - length($seq)), "\n";
}
Your while statement reads until the end of file. That means at every loop iteration, $_ is the next line in <FASTA>. So $dna = <FASTA> isn't doing what you think it is. It is reading more than you probably want it to.
while(<FASTA>) { #Reads a line here
chomp($_);
if ($_ =~ m/^>/ ) {
my $header = $_;
print "$header\n";
}
$dna = <FASTA> # reads another line here - Causes skips over every other line
}
Now, you need to read the sequence into your $dna. You can update your while loop with an else statement. So if its a head line, print it, else, we add it to $dna.
while(<FASTA>) {
chomp($_);
if ($_ =~ m/^>/ ) {
# It is a header line, so print it
my $header = $_;
print "$header\n";
} else {
# if it is not a header line, add to your dna sequence.
$dna .= $_;
}
}
After the loop, you can do your regex.
Note: This solution assumes there is only 1 sequence in the fasta file. If you have more than one, your $dna variable will have all the sequences as one.
Edit: Adding simple a way to handle multiple sequences
my $dna = "";
while(<FASTA>) {
chomp($_);
if ($_ =~ m/^>/ ) {
# Does $dna match the regex?
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
print "$1\n";
}
# Reset the sequence
$dna = "";
# It is a header line, so print it
my $header = $_;
print "$header\n";
} else {
# if it is not a header line, add to your dna sequence.
$dna .= $_;
}
}
# Check the last sequence
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
print "$1\n";
}
I came up with a solution using BioSeqIO (and the trunc method from BioSeq from the BioPerl distribution. I also used index to find the subsequence rather than using a regular expression.
This solution does not print out the id, (line begins with >), if the subsequence was not found or if the subsequence begins at the first postion, (and thus no preceding characters).
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $in = Bio::SeqIO->new( -file => "fasta_junk.fasta" ,
-format => 'fasta');
my $out = Bio::SeqIO->new( -file => '>test.dat',
-format => 'fasta');
my $lookup = 'GTGCCAGCAGCCGC';
while ( my $seq = $in->next_seq() ) {
my $pos = index $seq->seq, $lookup;
# if $pos != -1, ($lookup not found),
# or $pos != 0, (found $lookup at first position, thus
# no preceding characters).
if ($pos > 0) {
my $trunc = $seq->trunc(1,$pos);
$out->write_seq($trunc);
}
}
__END__
*** fasta_junk.fasta
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGG
TAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGC
GCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAG
GGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCC
ACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAG
GCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGT
*** contents of test.dat
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAAAGTACTGTCC
GTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTTGACGGTATCTAACCAGAAAG
CCACGGCTAACTAC
read the whole file into memory then look for the regexp
while(<FASTA>) {
chomp($_);
if ($_ =~ m/^>/ ) {
my $header = $_;
print "$header\n";
} else {
$dna .= $_;
}
}
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
print $1;
}

Manipulating files according to indexes by perl

I am working on some genome data and I have 2 files ->
File1
A1 1 10
A1 15 20
A2 2 11
A2 13 16
File2
>A1
CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA
AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAAT
>A2
GTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAAAATTTCCACCA
AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGC
CAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAAT
In file 1, 2nd and 3rd column represents the indexes in File2. So I want that, if character in column1 of file1 matches with character followed by symbol (>) in file2 , then from next line of that file2 give back the substring according to indexes in col2 and col3 of file1. (sorry, I know its complicated) Here is the desire output ->
Output
>A1#1:10
CTATTATTTA
>A1#15:20
ACCTA
>A2#2:11
TCTGCACAGC
>A2#13:16
GCTT
I know if I have only 1 string I can take out sub-string very easily ->
#ARGV or die "No input file specified";
open $first, '<',$ARGV[0] or die "Unable to open input file: $!";
$string="GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT";
while (<$first>)
{
#cols = split /\s+/;
$co=$cols[1]-1;
$length=$cols[2]-$co;
$fragment = substr $string, $co, $length;
print ">",$cols[0],"#",$cols[1],":",$cols[2],"\n",$fragment,"\n";
}
but here my problem is when should I input my second file and how should I match the character in col1 (of file1) with character in file2 (followed by > symbol) and then how to get substring?
I wasnt sure if they were all one continuous line or separate lines.
I set it up as continuous for now.
Basically, read the 2nd file as master.
Then you can process as many index files as you need.
You can use hash of arrays to help with the indexing.
push #{$index{$key}}, [$start,$stop];
use strict;
my $master_file = "dna_master.txt";
if ($#ARGV) {
print "Usage: $0 [filename(s)]\n";
exit 1;
}
my %Data = read_master($master_file);
foreach my $index_file (#ARGV) {
my %Index = read_index($index_file);
foreach my $key (sort keys %Index) {
foreach my $i (#{$Index{$key}}) {
my ($start,$stop) = #$i;
print ">$key#$start:$stop\n";
my $pos = $start - 1;
my $count = $stop - $start + 1;
print substr($Data{$key},$pos,$count)."\n";
}
}
}
sub read_file {
my $file = shift;
my #lines;
open(FILE, $file) or die "Error: cannot open $file\n$!";
while(<FILE>){
chomp; #remove newline
s/(^\s+|\s+$)//g; # strip lead/trail whitespace
next if /^$/; # skip blanks
push #lines, $_;
}
close FILE;
return #lines;
}
sub read_index {
my $file = shift;
my #lines = read_file($file);
my %index;
foreach (#lines) {
my ($key,$start,$stop) = split /\s+/;
push #{$index{$key}}, [$start,$stop];
}
return %index;
}
sub read_master {
my $file = shift;
my %master;
my $key;
my #lines = read_file($file);
foreach (#lines) {
if ( m{^>(\w+)} ) { $key = $1 }
else { $master{$key} .= $_ }
}
return %master;
}
Load File2 in a Hash, with A1, A2... as keys, and the DNA sequence as value. This way you can get the DNA sequence easily.
This 2nd update turns the master file into a hash of arrays as well.
This treats each row in the 2nd file as individual sequences.
use strict;
my $master_file = "dna_master.txt";
if ($#ARGV) {
print "Usage: $0 [filename(s)]\n";
exit 1;
}
my %Data = read_master($master_file);
foreach my $index_file (#ARGV) {
my %Index = read_index($index_file);
foreach my $key (sort keys %Index) {
foreach my $i (#{$Index{$key}}) {
my ($start,$stop) = #$i;
print ">$key#$start:$stop\n";
my $pos = $start - 1;
my $count = $stop - $start + 1;
foreach my $seq (#{$Data{$key}}) {
print substr($seq,$pos,$count)."\n";
}
}
}
}
sub read_file {
my $file = shift;
my #lines;
open(FILE, $file) or die "Error: cannot open $file\n$!";
while(<FILE>){
chomp; #remove newline
s/(^\s+|\s+$)//g; # strip lead/trail whitespace
next if /^$/; # skip blanks
push #lines, $_;
}
close FILE;
return #lines;
}
sub read_index {
my $file = shift;
my #lines = read_file($file);
my %index;
foreach (#lines) {
my ($key,$start,$stop) = split /\s+/;
push #{$index{$key}}, [$start,$stop];
}
return %index;
}
sub read_master {
my $file = shift;
my %master;
my $key;
my #lines = read_file($file);
foreach (#lines) {
if ( m{^>(\w+)} ) { $key = $1 }
else { push #{ $master{$key} }, $_ }
}
return %master;
}
Output:
>A1#1:10
CTATTATTTA
AAGTGTGTTA
>A1#15:20
ACCTAC
ATTAAT
>A2#2:11
TCTGCACAGC
ACCCCCCCCT
AAACCCCAAA
>A2#13:16
GCTT
CCCC
ACAA

Cleanest Perl parser for Makefile-like continuation lines

A perl script I'm writing needs to parse a file that has continuation lines like a Makefile. i.e. lines that begin with whitespace are part of the previous line.
I wrote the code below but don't feel like it is very clean or perl-ish (heck, it doesn't even use "redo"!)
There are many edge cases: EOF at odd places, single-line files, files that start or end with a blank line (or non-blank line, or continuation line), empty files. All my test cases (and code) are here: http://whatexit.org/tal/flatten.tar
Can you write cleaner, perl-ish, code that passes all my tests?
#!/usr/bin/perl -w
use strict;
sub process_file_with_continuations {
my $processref = shift #_;
my $nextline;
my $line = <ARGV>;
$line = '' unless defined $line;
chomp $line;
while (defined($nextline = <ARGV>)) {
chomp $nextline;
next if $nextline =~ /^\s*#/; # skip comments
$nextline =~ s/\s+$//g; # remove trailing whitespace
if (eof()) { # Handle EOF
$nextline =~ s/^\s+/ /;
if ($nextline =~ /^\s+/) { # indented line
&$processref($line . $nextline);
}
else {
&$processref($line);
&$processref($nextline) if $nextline ne '';
}
$line = '';
}
elsif ($nextline eq '') { # blank line
&$processref($line);
$line = '';
}
elsif ($nextline =~ /^\s+/) { # indented line
$nextline =~ s/^\s+/ /;
$line .= $nextline;
}
else { # non-indented line
&$processref($line) unless $line eq '';
$line = $nextline;
}
}
&$processref($line) unless $line eq '';
}
sub process_one_line {
my $line = shift #_;
print "$line\n";
}
process_file_with_continuations \&process_one_line;
How about slurping the whole file into memory and processing it using regular expressions. Much more 'perlish'. This passes your tests and is much smaller and neater:
#!/usr/bin/perl
use strict;
use warnings;
$/ = undef; # we want no input record separator.
my $file = <>; # slurp whole file
$file =~ s/^\n//; # Remove newline at start of file
$file =~ s/\s+\n/\n/g; # Remove trailing whitespace.
$file =~ s/\n\s*#[^\n]+//g; # Remove comments.
$file =~ s/\n\s+/ /g; # Merge continuations
# Done
print $file;
If you don't mind loading the entire file in memory, then the code below passes the tests.
It stores the lines in an array, adding each line either to the previous one (continuation) or at the end of the array (other).
#!/usr/bin/perl
use strict;
use warnings;
my #out;
while( <>)
{ chomp;
s{#.*}{}; # suppress comments
next unless( m{\S}); # skip blank lines
if( s{^\s+}{ }) # does the line start with spaces?
{ $out[-1] .= $_; } # yes, continuation, add to last line
else
{ push #out, $_; } # no, add as new line
}
$, = "\n"; # set output field separator
$\ = "\n"; # set output record separator
print #out;