perl to remove lines from file - perl

I have file that looks like:
ATOM 2517 O VAL 160 8.337 12.679 -2.487
ATOM 2518 OXT VAL 160 7.646 12.461 -0.386
TER
ATOM 2519 N VAL 161 -14.431 5.789 -25.371
ATOM 2520 H1 VAL 161 -15.336 5.698 -25.811
ATOM 2521 H2 VAL 161 -13.416 10.529 17.708
ATOM 2522 H3 VAL 161 -14.363 9.436 18.498
ATOM 2523 CA VAL 161 4.400 9.233 16.454
ATOM 2524 HA VAL 161 3.390 9.170 16.047
I have to remove "TER", the line before "TER" and 3 lines after the line just after TER and make file continuous like this:
ATOM 2517 O VAL 160 8.337 12.679 -2.487
ATOM 2519 N VAL 161 -14.431 5.789 -25.371
ATOM 2523 CA VAL 161 4.400 9.233 16.454
ATOM 2524 HA VAL 161 3.390 9.170 16.047

A simple line-by-line script.
Usage: perl script.pl -i.bak fileglob
E.g. perl script.pl -i.bak File*MINvac.pdb
This will alter the original file, and save a backup of each file with the extension .bak. Note that if TER lines appear too close to the end of the file, it will cause warnings. On the other hand, so will the other solutions presented.
If you do not wish to save backups (use caution, since changes are irreversible!), use -i instead.
Code:
#!/usr/bin/perl
use v5.10;
use strict;
use warnings;
my $prev;
while (<>) {
if (/^TER/) {
print scalar <>; # print next line
<> for 1 .. 3; # skip 3 lines
$prev = undef; # remove previous line
} else {
print $prev if defined $prev;
$prev = $_;
}
if (eof) { # New file next iteration?
print $prev;
$prev = undef;
}
}

I realized I was supposed to write it in Perl, but now I've already written it in Python. I'm posting it anyway as it may prove to be useful, don't see any harm in that.
#!/usr/bin/python2.7
import sys
import glob
import os
try:
dir = sys.argv[1]
except IndexError:
print "Usage: "+sys.argv[0]+" dir"
print "Example: "+sys.argv[0]+" /home/user/dir/"
sys.exit(1)
for file in glob.glob(os.path.join(dir, 'File*_*MINvac.pdb')):
fin = open(file, "r")
content = fin.readlines()
fin.close()
for i in range(0, len(content)):
try:
if "TER" in content[i]:
del content[i]
del content[i-1]
del content[i:i+3]
except IndexError:
break
fout = open(file, "w")
fout.writelines(content)
fout.close()
Edit: Added support for multiple files, like the OP wanted.

So, for each set of 6 consecutive lines, you want to discard all but the third line if the second line is a TER?
TIMTOWTDI, but this should work:
my #queue;
while (<>) {
push #queue, $_;
#queue = $queue[2] if #queue == 6 and $queue[1] =~ /^TER$/;
print shift #queue if #queue == 6;
}
print #queue; # assume no TERs in last 4 lines

use strict;
use warnings;
use Tie::File;
my #array;
tie #array, 'Tie::File', 'myFile.txt' or die "Unable to tie file";
my %unwanted = map { $_ => 1 } # Hashify ...
map { $_-1, $_, $_+2 .. $_+4 } # ... the five lines ...
grep { $array[$_] =~ /^TER/ } # ... around 'TER' ...
0 .. $#array ; # ... in the file
# Remove the unwanted lines
#array = map { $array[$_] } grep { ! $unwanted{$_} } 0 .. $#array;
untie #array; # The end

Related

compare columns from 2 files & print matching and non-matching rows in same order as in file1 & print YES/NO at end of matching and non-matching rows

file1
3 14573 ab712 A T
8 12099 ab002 G A
9 12874 ab790 A C
3 19879 ab734 G T
file2
3 14573 ab712 A T
9 12874 ab790 A C
output
3 14573 ab712 A T YES
8 12099 ab002 G A NO
9 12874 ab790 A C YES
3 19879 ab734 G T NO
I tried perl foreach loop on file1 & 2
output generated is as follows-
3 14573 ab712 A T YES
8 12099 ab002 G A NO
9 12874 ab790 A C NO
3 19879 ab734 G T NO
4 34565 ab992 C G NO
9 12874 ab790 A C YES
3 14573 ab712 A T NO
8 12099 ab002 G A NO
9 12874 ab790 A C NO
3 19879 ab734 G T NO
4 34565 ab992 C G NO
Script I tried
foreach $arr1 (#arr1) {
chomp $arr1;
($chr1, $pos1, $id1, $ref1, $alt1) = split(/\t/, $arr1);
foreach $arr2 (#arr2) {
chomp $arr2;
($chr2, $pos2, $id2, $ref2, $alt2) = split(/\s/, $arr2);
{
if (($pos1 eq $pos2 ) && ($chr1 eq $chr2 )) {
print "$chr1\t$pos1\t$ref1\t$alt1\tYES\n";
} else {
print "$chr1\t$pos1\t$ref1\t$alt1\tNO\n"
}
}
}
}
Your code is rather complex, so I'm afraid I don't have time to understand it and correct whatever you're doing wrong.
I do, however, have time to present my solution (with comments):
#!/usr/bin/perl
# Always use these
use strict;
use warnings;
# Open file2...
open my $fh2, '<', 'file2' or die $!;
# ... and use its contents to construct a hash.
# The key of the hash is the line of data from the
# file (without the newline) and the value is the
# number 1.
# We can therefore use this hash to work out if a
# given line from file1 exists in file2.
my %file2 = map { chomp; $_ => 1 } <$fh2>;
# Open file1...
open my $fh1, '<', 'file1' or die $!;
# ... and process it a line at a time
while (<$fh1>) {
# Remove the newline
chomp;
# Print the line
print;
# Find out if the line exists in file2
# and print 'YES' or 'NO' as appropriate.
print $file2{$_} ? ' YES' : ' NO';
# Print a newline.
print "\n";
}
Update: Here's a version that only matches on the first two fields of the input data (that shouldn't matter given the sample input, but your code implies that's what you want to match on).
#!/usr/bin/perl
# Always use these
use strict;
use warnings;
# Open file2...
open my $fh2, '<', 'file2' or die $!;
# ... and use its contents to construct a hash.
# The key of the hash is the first two fields from
# the line of data from the file and the value is the
# number 1.
# We can therefore use this hash to work out if a
# given line from file1 exists in file2.
my %file2 = map { join(' ', (split)[0,1]) => 1 } <$fh2>;
# Open file1...
open my $fh1, '<', 'file1' or die $!;
# ... and process it a line at a time
while (<$fh1>) {
# Remove the newline
chomp;
# Print the line
print;
# Find out if the line exists in file2
# and print 'YES' or 'NO' as appropriate.
print $file2{join ' ', (split)[0,1]} ? ' YES' : ' NO';
# Print a newline.
print "\n";
}
You could read file2 into a hashmap and use it to lookup the entries in file1.
Example:
#!/usr/bin/perl
use strict;
use warnings;
use Path::Tiny;
my #file1 = path("file1")->lines;
chomp #file1;
my %file2 = map {chomp; $_ => 1} path("file2")->lines;
for my $line (#file1) {
print "$line " . (defined($file2{$line}) ? 'YES' : 'NO') . "\n";
}
If only the first and second column should be involved in the comparison:
#!/usr/bin/perl
use strict;
use warnings;
use Path::Tiny;
my #file1 = path("file1")->lines;
chomp #file1;
my %file2 = map {my #f = split; $f[0].' '.$f[1] => 1} path("file2")->lines;
for my $line (#file1) {
my #f=split/\s+/,$line;
print "$line " . (defined($file2{$f[0].' '.$f[1]}) ? 'YES' : 'NO') . "\n";
}
Output in both cases:
3 14573 ab712 A T YES
8 12099 ab002 G A NO
9 12874 ab790 A C YES
3 19879 ab734 G T NO

Merge Fasta and Qual files with different headers order in FASTQ

I am trying to merge a fasta file and a qual file in a new fastq file having in mind the case that the two files might be provided with different order in their sequence IDs. To do that, I tried the first step of my script to be the sorting of the sequences which works perfectly when I test it as a separate script. The same with the rest, when I run separately the part where it combines the files in a fastq, it runs perfectly. But now that I am trying to combine the two methods in one script it doesn't work and I don't know what else to do! I would appreciate it if you can help me.
Here is my script as far. It creates the new fastq file but the content is messed up and not what I want. I run it from terminal like this:
$ perl script.pl reads.fasta reads.qual > reads.fq
Script :
#!/usr/bin/env perl
use strict;
use warnings;
die ("Usage: script.pl reads.fasta reads.qual > reads.fq") unless (scalar #ARGV) == 2;
open FASTA, $ARGV[0] or die "cannot open fasta: $!\n";
open QUAL, $ARGV[1] or die "cannot open qual: $!\n";
my $offset = 33;
my $count = 0;
local($/) = "\n>";
my %id2seq = ();
my $id = '';
my %idq2seq = ();
my $idq = '';
my (#sort_q, #sort_f);
while(<FASTA>){
chomp;
if($_ =~ /^>(.+)/){
$id = $1;
}else{
$id2seq{$id} .= $_;
}
}
for $id (sort keys %id2seq)
{
#sort_f = "$id\n$id2seq{$id}\n\n";
print #sort_f;
}
while(<QUAL>){
chomp;
if($_ =~ /^>(.+)/){
$idq = $1;
}else{
$idq2seq{$idq} .= $_;
}
}
for $idq (sort keys %idq2seq)
{
#sort_q = "$idq\n$idq2seq{$idq}\n\n";
print "#sort_q";
}
while (my #sort_f) {
chomp #sort_f;
my ($fid, #seq) = split "\n", #sort_f;
my $seq = join "", #seq; $seq =~ s/\s//g;
my $sortq = #sort_q;
chomp my #sortq;
my ($qid, #qual) = split "\n", #sortq;
#qual = split /\s+/, (join( " ", #qual));
# convert score to character code:
my #qual2 = map {chr($_+$offset)} #qual;
my $quals = join "", #qual2; `enter code here`
die "missmatch of fasta and qual: '$fid' ne '$qid'" if $fid ne $qid;
$fid =~ s/^\>//;
print STDOUT (join( "\n", "#".$fid, $seq, "+$fid", $quals), "\n");
$count++;
}
close FASTA;
close QUAL;
print STDERR "wrote $count entries\n";
Thank you in advance
It's been a while since I have used perl, but I would approach this using a hash of key/value pairs for both the fasta and quality input. Then write out all the pairs by looping over the fasta hash and pulling out the corresponding quality string.
I have written something in python that will do what you need, you can see it in action here:
It assumes that your input looks like this:
reads.fasta
>fa_0
GCAGCCTGGGACCCCTGTTGT
>fa_1
CCCACAAATCGCAGACACTGGTCGG
reads.qual
>fa_0
59 37 38 51 56 55 60 44 43 42 56 65 60 68 52 67 43 72 59 65 69
>fa_1
36 37 47 72 34 53 67 41 70 67 66 51 47 41 73 58 75 36 61 48 70 55 46 42 42
output
#fa_0
GCAGCCTGGGACCCCTGTTGT
+
;%&387<,+*8A<D4C+H;AE
#fa_1
CCCACAAATCGCAGACACTGGTCGG
+
$%/H"5C)FCB3/)I:K$=0F7.**
#fa_2
TCGTACAGCAGCCATTTTCATAACCGAACATGACTC
+
C?&93A#:?#F,2:'KF*20CC:I7F9J.,:E8&?F
import sys
# Check there are enough arguments
if len(sys.argv) < 3:
print('Usage: {s} reads.fasta reads.qual > reads.fq'.format(s=sys.argv[0]), file=sys.stderr)
sys.exit(1)
# Initalise dictionaries for reads and qualities
read_dict = dict()
qual_dict = dict()
fa_input = sys.argv[1]
qual_input = sys.argv[2]
# Read in fasta input
with open(fa_input, 'r') as fa:
for line in fa:
line = line.strip()
if line[0] == '>':
read_dict[line[1:]] = next(fa).strip()
else:
next(fa)
# Read in quality input
with open(qual_input, 'r') as qual:
for line in qual:
line = line.strip()
if line[0] == '>':
qual_dict[line[1:]] = next(qual).strip()
else:
next(qual)
count = 0
# Iterate over all fasta reads
for key, seq in read_dict.items():
# Check if read header is in the qualities data
if key in qual_dict.keys():
# There's both sequence and quality data so write stdout
read_str = '#{header}\n{seq}\n+\n{qual}'.format(
header=key,
seq=seq,
qual=''.join([chr(int(x)) for x in qual_dict[key].split(' ')]))
print(read_str, file=sys.stdout)
count += 1
else: # not found
# Write error to stderr
print('Error: {k} not found in qual file'.format(k=key), file=sys.stderr)
# Print count to stderr
print('{c} reads written'.format(c=count), file=sys.stderr)
If you need to use an offset for the quality score edit
qual=''.join([chr(int(x)) for x in qual_dict[key].split(' ')])) to
qual=''.join([chr(int(x) + offset) for x in qual_dict[key].split(' ')])) and define an offset variable before this.

Calculating distances in PDB file

With reference to the question Calculating the distance between atomic coordinates, where the input is
ATOM 920 CA GLN A 203 39.292 -13.354 17.416 1.00 55.76 C
ATOM 929 CA HIS A 204 38.546 -15.963 14.792 1.00 29.53 C
ATOM 939 CA ASN A 205 39.443 -17.018 11.206 1.00 54.49 C
ATOM 947 CA GLU A 206 41.454 -13.901 10.155 1.00 26.32 C
ATOM 956 CA VAL A 207 43.664 -14.041 13.279 1.00 40.65 C
.
.
.
ATOM 963 CA GLU A 208 45.403 -17.443 13.188 1.00 40.25 C
there is an answer reported as
use strict;
use warnings;
my #line;
while (<>) {
push #line, $_; # add line to buffer
next if #line < 2; # skip unless buffer is full
print proc(#line), "\n"; # process and print
shift #line; # remove used line
}
sub proc {
my #a = split ' ', shift; # line 1
my #b = split ' ', shift; # line 2
my $x = ($a[6]-$b[6]); # calculate the diffs
my $y = ($a[7]-$b[7]);
my $z = ($a[8]-$b[8]);
my $dist = sprintf "%.1f", # format the number
sqrt($x**2+$y**2+$z**2); # do the calculation
return "$a[3]-$b[3]\t$dist"; # return the string for printing
}
The output of above code is the distance between the first CA to the second one and second to third and so on...
How to modify this code to find the distance between first CA to rest of the CAs (2, 3, ..) and from second CA to rest of the CAs (3, 4, ..) and so on and printing only those which is less then 5 Angstrom?
I found that push #line, $_; statement should be altered to increase the array size but not clear how to do that.
To get the pairs, read the file into an array, #data_array. Then loop over the entries.
Update: Added file opening and load #data_array.
open my $fh, '<', 'atom_file.pdb' or die $!;
my #data_array = <$fh>;
close $fh or die $!;
for my $i (0 .. $#data_array) {
for my $j ($i+1 .. $#data_array) {
process(#data_array[$i,$j]);
}
}
May be try this:
use strict;
use warnings;
my #alllines = ();
while(<DATA>) { push(#alllines, $_); }
#Each Current line
for(my $i=0; $i<=$#alllines+1; $i++)
{
#Each Next line
for(my $j=$i+1; $j<=$#alllines; $j++)
{
if($alllines[$i])
{
#Split the line into tab delimits
my ($line1_tb_1,$line1_tb_2,$line1_tb_3) = split /\t/, $alllines[$i];
print "Main_Line: $line1_tb_1\t$line1_tb_2\t$line1_tb_3";
if($alllines[$j])
{
#Split the line into tab delimits
my ($line_nxt_tb1,$line_nxt_tb2,$line_nxt_tb3) = split /\t/, $alllines[$j];
print "Next_Line: $line_nxt_tb1\t$line_nxt_tb2\t$line_nxt_tb3";
#Do it your coding/regex here
}
}
#system 'pause'; Testing Purpose!!!
}
}
__DATA__
tab1 123 456
tab2 789 012
tab3 345 678
tab4 901 234
tab5 567 890
I hope this will help you.

Perl: matching data in two files

I would like to match and print data from two files (File1.txt and File2.txt). Currently, I'm trying to match the first letter of the second column in File1 to the first letter of the third column in File2.txt.
File1.txt
1 H 35
1 C 22
1 H 20
File2.txt
A 1 HB2 MET 1
A 2 CA MET 1
A 3 HA MET 1
OUTPUT
1 MET HB2 35
1 MET CA 22
1 MET HA 20
Here is my script, I've tried following this submission: In Perl, mapping between a reference file and a series of files
#!/usr/bin/perl
use strict;
use warnings;
my %data;
open (SHIFTS,"file1.txt") or die;
open (PDB, "file2.txt") or die;
while (my $line = <PDB>) {
chomp $line;
my #fields = split(/\t/,$line);
$data{$fields[4]} = $fields[2];
}
close PDB;
while (my $line = <SHIFTS>) {
chomp($line);
my #columns = split(/\t/,$line);
my $value = ($columns[1] =~ m/^.*?([A-Za-z])/ );
}
print "$columns[0]\t$fields[3]\t$value\t$data{$value}\n";
close SHIFTS;
exit;
Here's one way using split() hackery:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $f1 = 'file1.txt';
my $f2 = 'file2.txt';
my #pdb;
open my $pdb_file, '<', $f2
or die "Can't open the PDB file $f2: $!";
while (my $line = <$pdb_file>){
chomp $line;
push #pdb, $line;
}
close $pdb_file;
open my $shifts_file, '<', $f1
or die "Can't open the SHIFTS file $f1: $!";
while (my $line = <$shifts_file>){
chomp $line;
my $pdb_line = shift #pdb;
# - inner split: get the third element from the $pdb_line
# - outer split: get the first element (character) from the
# result of the inner split
my $criteria = (split('', (split('\s+', $pdb_line))[2]))[0];
# - compare the 2nd element of the file1.txt line against
# the above split() operations
if ((split('\s+', $line))[1] eq $criteria){
print "$pdb_line\n";
}
else {
print "**** >$pdb_line< doesn't match >$line<\n";
}
}
Files:
file1.txt (note I changed line two to ensure a non-match worked):
1 H 35
1 A 22
1 H 20
file2.txt:
A 1 HB2 MET 1
A 2 CA MET 1
A 3 HA MET 1
Output:
./app.pl
A 1 HB2 MET 1
****>A 2 CA MET 1< doesn't match >1 A 22<
A 3 HA MET 1

Select data in a text file from a separate list - perl or unix

I have a massive tab delimited file like this:
contig04733 contig00012 77 contig00546 contig01344 12
contig08943 contig00001 14 contig00765 contig03125 88 etc
And I have a separate tab delimited file with only a subset of these contig pairs like this:
contig04733 contig00012 contig08943 contig00001 etc
I want to extract into a new file the lines in the first file that correspond with the ones listed in the second. In this particular dataset I think which way round each pair should be the same in both files. But would also like to know if say:
file1 contig08943 contig00001 14
but in file2 its
contig00001 contig08943
and I still want that combination, is it possible to script something for this too?
My code is below.
use strict;
use warnings;
#open the contig pairs list
open (PAIRS, "$ARGV[0]") or die "Error opening the input file with contig pairs";
#hash to store contig IDs - I think?!
my $pairs;
#read through the pairs list and read into memory?
while(<PAIRS>){
chomp $_; #get rid of ending whitepace
$pairs->{$_} = 1;
}
close(PAIRS);
#open data file
open(DATA, "$ARGV[1]") or die "Error opening the sequence pairs file\n";
while(<DATA>){
chomp $_;
my ($contigs, $identity) = split("\t", $_);
if (defined $pairs->{$contigs}) {
print STDOUT "$_\n";
}
}
close(DATA);
Piece together the code below without the running commentary to get a working program. We start off with typical front matter that instructs perl to give you helpful warnings if you make common mistakes.
#! /usr/bin/env perl
use strict;
use warnings;
Showing the user when necessary how to properly invoke your program is always a nice touch.
die "Usage: $0 master subset\n" unless #ARGV == 2;
With read_subset, the program reads the second file named on the command line. Because your question states that you do not care about the order, e.g., that
contig00001 contig08943
is equivalent to
contig08943 contig00001
the code increments both $subset{$p1}{$p2} and $subset{$p2}{$p1}.
sub read_subset {
my($path) = #_;
my %subset;
open my $fh, "<", $path or die "$0: open $path: $!";
while (<$fh>) {
chomp;
my($p1,$p2) = split /\t/;
++$subset{$p1}{$p2};
++$subset{$p2}{$p1};
}
%subset;
}
Using a hash to mark occurrences that your program has observed is highly frequent in Perl programs. In fact, many of the examples in the Perl FAQ use hashes named %seen, as in “I have seen this.”
By removing the second command-line argument with pop, that leaves only the master file which lets the program easily read all lines of input using while (<>) { ... }. With %subset populated, the code splits each line into fields and skips any line not marked seen. Everything that passes this filter is printed on the standard output.
my %subset = read_subset pop #ARGV;
while (<>) {
my($f1,$f2) = split /\t/;
next unless $subset{$f1}{$f2};
print;
}
For example:
$ cat file1
contig04733 contig00012 77
contig00546 contig01344 12
contig08943 contig00001 14
contig00765 contig03125 88
$ cat file2
contig04733 contig00012
contig00001 contig08943
$ perl extract-subset file1 file2
contig04733 contig00012 77
contig08943 contig00001 14
To create a new output that contains the selected subset, redirect the standard output as in
$ perl extract-subset file1 file2 >my-subset
Try this one using a hash of hashes based on the two keys (after a split)
use strict;
use warnings;
#open the contig pairs list
open (PAIRS, "$ARGV[0]") or die "Error opening the input file with contig pairs";
#hash to store contig IDs - I think?!
#my $pairs;
#read through the pairs list and read into memory?
my %all_configs;
while(<PAIRS>){
chomp $_; #get rid of ending whitepace
my #parts = split("\t", $_); #split into ['contig04733', 'contig00012', 77]
#store the entire row as a hash of hashes
$all_configs{$parts[0]}{$parts[1]} = $_;
#$pairs->{$_} = 1;
}
close(PAIRS);
#open data file
open(DATA, "$ARGV[1]") or die "Error opening the sequence pairs file\n";
while(<DATA>){
chomp $_;
my ($contigs, $identity) = split("\t", $_);
#see if we find a row, one way, or the other
my $found_row = $all_configs{$contigs}{$identity}
|| $all_configs{$identity}{$contigs};
#if found, the split, and handle each part
if ($found_row) {
my #parts = split("\t", $found_row);
#same sequence as in first file
my $modified_row = $parts[0]."\t".$parts[1].(sqrt($parts[2]/100));
#if you want to keep the same sequence as found in second file
my $modified_row = $contigs."\t".$identity.(sqrt($parts[2]/100));
print STDOUT $found_row."\n"; #or
print STDOUT $modified_row."\n";
}
}
close(DATA);