I'm a beginner in Perl and I'm trying to compare two files. One contains a list of IDs; the other has strings that contain IDs plus more text. I want to copy the lines with matching IDs to a third file, but instead of the correct strings I only get a number. What have I done wrong?
use strict;
use warnings;
open ( IDS , "<id.txt");
my @ids = <IDS>;
chomp(@ids);
close IDS;
my $id = @ids;
open ( META , "<meta.txt");
my @metas = <META>;
chomp(@metas);
my $meta = @metas;
open ( OUT1, ">>", "outtest.txt");
foreach $id (@metas){
print OUT1 "$meta"."\n";
}
close OUT1;
close META;
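You get a number because my $meta = @metas; evaluates the array in scalar context, which yields its element count, and the loop then prints that count once per line. Try using a hash to get the output: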
use strict;
use warnings;
open ( META , "<meta.txt");
my %idsValues = (); #Create one new HASH Variable
while(<META>)
{
my $line = $_;
if($line=~m{<id>(\d+)</id>\s*<string>([^<>]*)</string>})
{
$idsValues{$1} = $2; #Store the values and text into the HASH Variable
}
}
close(META); #Close the opened file
my @Values;
open ( IDS , "<id.txt");
while(<IDS>)
{
my $line = $_;
if($line=~m/<id>(\d+)<\/id>/i)
{
#Check whether the ID is present in the hash and, if so, push its line into the array.
push(@Values, "IDS: $1\tVALUES: $idsValues{$1}") if(defined $idsValues{$1} );
}
}
close(IDS); #Close the opened file
open ( OUT1, ">>", "outtest.txt");
print OUT1 join "\n", @Values; #Join with newlines and print the output lines to the output file.
close OUT1; #Close the opened file
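The answer above assumes the IDs are wrapped in <id> tags. If the files are plain text instead, the same hash lookup still applies. The following is only a sketch under a stated assumption: each ID appears in a line as its own whitespace-separated token. The helper name matching_lines and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that contain at least one known ID
# as a whitespace-separated token (an assumption about the file format).
sub matching_lines {
    my ($ids, $lines) = @_;
    my %is_id = map { $_ => 1 } @$ids;    # ID => 1 lookup table
    my @hits;
    LINE: for my $line (@$lines) {
        for my $token (split ' ', $line) {
            if ($is_id{$token}) {
                push @hits, $line;        # keep the whole line, not a count
                next LINE;
            }
        }
    }
    return @hits;
}

# Made-up sample data standing in for id.txt and meta.txt.
my @ids   = ('101', '205');
my @metas = ('101 first record', '150 second record', 'note for 205 here');
print "$_\n" for matching_lines(\@ids, \@metas);
```

With the sample data this prints the first and third lines. Reading the two files into @ids and @metas with chomp, as in the question, would slot in where the sample lists are.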
Related
I have two files. The first has a sequence of words on each line
bus do car
car tree
The second file is an XML file
<title>i have a car. i take bus..</title>
I want to search the text file for each word in the XML file. If it is found, I want to insert all lines from the text file where it appears, with any spaces replaced by x.
The result file would be
<title>i have a car busxdoxcar carxtree. i take bus busxdoxcar..</title>
I tried this:
use strict;
use warnings;
use autodie;
my $QueryFile = "query.txt";
my $SequenceFile = "Seq_2_terms_150.txt";
my %hashlist;
open NewQueryFile ,">./NewQuery.txt"
or die "Cannot create NewQuery.txt";
open(my $fh,$SequenceFile)
or die "$SequenceFile : $!";
while ( <$fh> ) {
chop;
s/^\s+|\s+$//g;
my $h = \%hashlist;
foreach ( split('\s+', $_) ) {
$h->{$_} //= {};
$h = $h->{$_};
}
$h->{'#'} = 1;
}
close $fh;
open(my $fd, $QueryFile)
or die "$QueryFile : $!";
for my $xml (<$fd>) {
foreach my $line (split(/\n/, $xml)) {
my @words = split(/\s/, $line);
if $words = @hashlist[$_] {
print NewQueryFile join ('x',$words) ;
}
}
}
close NewQueryFile ;
close($fd);
I have put together a quick script to indicate how one might go about this.
I have not bothered with xml, because that may well have left me in a bad mood.
My advice would be: do use named variables. Whatever you save by not doing so is lost as your code gets confusing and then buggy.
#!/usr/bin/env perl
use strict;
use warnings;
# Notes:
# - more than one space or tab in a row are mangled: They become one space only
# - the query file is not checked for containing actual words to match on,
# it is assumed to be suitable
# - I have made no attempt to parse xml. You should use a parser for that.
# Search Stack Overflow or Google or CPAN or all of those for examples.
# - The replace_xml_text function can be used on the text supplied by the
# parser to get the desired output
# - a feeble attempt is made to deal with punctuation in replace_xml_text
# - This code is not really tested
my %query_words;
my $query_fn = 'query.txt';
open (my $fh, "<",$query_fn) or die "could not open file '$query_fn'";
# build list of words from query file
while ( <$fh> ){
chomp;
# Words mentioned in line.
my @words = split(/\s+/,$_);
# Words joined by 'x'. Seems a strange choice *shrug*.
# This is used to replace words with later.
my $line = join("x",@words);
# Storing in arrayref, this seems easier to me
# than concatenating repeatedly and getting the spaces right.
for my $word ( @words ){
push @{$query_words{$word}}, $line;
}
}
# Expects the text to replace.
# Returns the text to replace it with.
sub replace_xml_text {
my $original_text = shift;
my @words;
for my $word ( split(/\s+/,$original_text) ){
my $punctuation = '';
# Remove punctuation before matching,
# but do preserve it.
if ( $word =~ s/(\s*[,.]\s*)$// ){
$punctuation = $1;
}
if ( my $additions = $query_words{$word} ){
$word = join(" ",$word,@$additions);
}
# Put punctuation back.
$word .= $punctuation;
# Done replacing in this word, next
push @words,$word;
}
return join(" ",@words);
}
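To see the idea end to end, here is a small self-contained sketch that runs on the sample data from the question. It is not the script above: the helpers build_index and expand_text are hypothetical names, the sequence lines are passed in as a list instead of being read from a file, and runs of trailing commas or periods are stripped and restored as a group.

```perl
use strict;
use warnings;

# Hypothetical helper: map each word to the 'x'-joined lines it appears in.
sub build_index {
    my @lines = @_;
    my %index;
    for my $line (@lines) {
        my @words = split ' ', $line;
        next unless @words;
        my $joined = join 'x', @words;
        push @{ $index{$_} }, $joined for @words;
    }
    return \%index;
}

# Hypothetical helper: append the indexed lines after each known word,
# keeping any trailing commas or periods in place.
sub expand_text {
    my ($index, $text) = @_;
    my @out;
    for my $word (split ' ', $text) {
        my $punct = $word =~ s/([,.]+)$// ? $1 : '';
        $word = join ' ', $word, @{ $index->{$word} } if $index->{$word};
        push @out, $word . $punct;
    }
    return join ' ', @out;
}

my $index = build_index('bus do car', 'car tree');
print expand_text($index, 'i have a car. i take bus..'), "\n";
# prints: i have a car busxdoxcar carxtree. i take bus busxdoxcar..
```

In a real run, the text handed to expand_text would come from an XML parser, as the notes in the script recommend.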
I'm new to Perl. I'm trying to read a large comma-separated file, split each line, and grab only some columns. I got this far with some help from the internet, but I'm struggling to change the code so that it starts reading at a specific line and continues through to the end of the file.
What I need is to open the file, start reading at line 12, split on ',', grab columns 0, 2, 10 and 11, and concatenate those columns with '\t'.
Here is my code:
#!/usr/bin/perl
my $filename = 'file_to_read.csv';
open(FILER, $filename) or die "Could not read $filename.";
open(FILEW, ">$filename.txt") || die "couldn't create the file\n";
while(<FILER>) {
chomp;
my @fields = split(',', $_);
print FILEW "$fields[0]\t$fields[3]\t$fields[10]\t$fields[11]\n";
}
close FILER;
close FILEW;
here is the file example:
[Header]
GSGT Version: X
Processing Date:12/01/2010 7:20 PM
Content:
Num SNPs:
Total SNPs:
Num Samples:
Total Samples:
Sample:
[Data]
SNP Name,Chromosome,Pos,GC Score,Theta,R,X,Y,X Raw,Y Raw,B Allele Freq,Log R Ratio,Allele1 - TOP,Allele2 - TOP
1:10001102-G-T,1,10001102,0.4159,0.007,0.477,0.472,0.005,6281,126,0.0000,-0.2581,A,A
1:100011159-T-G,1,100011159,0.4259,0.972,0.859,0.036,0.822,807,3648,0.9942,-0.0304,C,C
1:10002775-GA,1,10002775,0.4234,0.977,1.271,0.043,1.228,809,5140,0.9892,0.0111,G,G
Rather than skipping until a specific line number, which may vary from file to file, it is best to keep track of the current section of the file marked by [Header], [Data] etc.
This solution keeps a state variable $section, which is updated to the current section name every time a [Section] label is encountered in the file. Every line in the Data section is split, and the selected fields are printed.
A similar thing could be done with the column headers, using names instead of numbers to select the fields to be output, but I have chosen to keep the complexity down instead.
use strict;
use warnings 'all';
use feature 'say';
my $filename = 'file_to_read.csv';
open my $fh, '<', $filename or die qq{Unable to open "$filename" for input: $!};
my $section = "";
while ( <$fh> ) {
next unless /\S/; # Skip empty lines
if ( $section eq 'Data' ) { # Skip unless we're in the [Data] section
chomp;
my @fields = split /,/;
say join ',', @fields[0,3,10,11];
}
elsif ( /\[(\w+)\]/ ) {
$section = $1;
}
}
output
SNP Name,GC Score,B Allele Freq,Log R Ratio
1:10001102-G-T,0.4159,0.0000,-0.2581
1:100011159-T-G,0.4259,0.9942,-0.0304
1:10002775-GA,0.4234,0.9892,0.0111
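The column-name variant mentioned in this answer might look roughly like the sketch below. It is an illustration, not the answer's code: the helper select_columns is a hypothetical name, it takes the lines as an array reference instead of reading the file itself, it assumes the first line after [Data] is the header row, and it joins the fields with tabs as the question asked.

```perl
use strict;
use warnings;

# Hypothetical helper: return the [Data] rows of a report, restricted to the
# named columns and joined with tabs. The header row is returned first.
sub select_columns {
    my ($lines, @wanted) = @_;
    my $section = '';
    my (%pos, @out);
    for my $line (@$lines) {
        (my $row = $line) =~ s/\r?\n\z//;    # strip the line ending
        next unless $row =~ /\S/;            # skip empty lines
        if ($row =~ /^\[(\w+)\]/) {          # a [Section] label
            $section = $1;
            %pos = ();                       # forget any earlier header row
            next;
        }
        next unless $section eq 'Data';
        my @fields = split /,/, $row;
        if (!%pos) {                         # first [Data] line: column names
            @pos{@fields} = 0 .. $#fields;   # name => column index
            for my $name (@wanted) {
                die "No column named '$name'\n" unless exists $pos{$name};
            }
            push @out, join "\t", @wanted;
            next;
        }
        push @out, join "\t", @fields[ @pos{@wanted} ];
    }
    return @out;
}

my @report = (
    "[Header]\n",
    "GSGT Version: X\n",
    "[Data]\n",
    "SNP Name,Chromosome,Pos,GC Score,Theta,R,X,Y,X Raw,Y Raw,B Allele Freq,Log R Ratio,Allele1 - TOP,Allele2 - TOP\n",
    "1:10001102-G-T,1,10001102,0.4159,0.007,0.477,0.472,0.005,6281,126,0.0000,-0.2581,A,A\n",
);
print "$_\n"
    for select_columns(\@report, 'SNP Name', 'GC Score', 'B Allele Freq', 'Log R Ratio');
```

Selecting by name means the script keeps working if a column is added or moved, at the cost of failing loudly when a header is renamed.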
Use a variable to count the lines processed, such as my $line_count = 0;
and increment it at the beginning of the while loop with $line_count++;
then skip every line before line 12, i.e. next if $line_count < 12;
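Perl already keeps such a counter in the built-in variable $., which holds the line number of the filehandle most recently read. A minimal sketch of the same skip using it follows; the helper name lines_from is hypothetical, and the demonstration reads from an in-memory file so that it is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of a filehandle from line $first on.
sub lines_from {
    my ($fh, $first) = @_;
    my @kept;
    while (my $line = <$fh>) {
        next if $. < $first;    # $. is the current line number of $fh
        chomp $line;
        push @kept, $line;
    }
    return @kept;
}

# Demonstration on an in-memory file of 14 numbered lines.
my $text = join '', map "line $_\n", 1 .. 14;
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
print "$_\n" for lines_from($fh, 12);    # prints line 12, line 13, line 14
close $fh;
```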
I wrote a Perl program which takes an Excel sheet (converted to a text file by changing the extension from .xls to .txt) and a sequence file as its input. The Excel sheet contains the start point and the end point of an area in the sequence file (along with 70 flanking values on either side of the match area) that needs to be cut out and extracted into a third output file. There are about 300 values. The program reads in the start point and the end point of the sequence that needs to be cut each time, but it repeatedly tells me that the value is outside the length of the input file when it clearly isn't. I just can't seem to get this fixed.
This is the program
use strict;
use warnings;
my $blast;
my $i;
my $idline;
my $sequence;
print "Enter Your BLAST result file name:\t";
chomp( $blast = <STDIN> ); # BLAST result file name
print "\n";
my $database;
print "Enter Your Gene list file name:\t";
chomp( $database = <STDIN> ); # sequence file
print "\n";
open IN, "$blast" or die "Can not open file $blast: $!";
my @ids = ();
my @seq_start = ();
my @seq_end = ();
while (<IN>) {
#spliting the result file based on each tab
my @feilds = split( "\t", $_ );
push( @ids, $feilds[0] ); #copying the name of sequence
#coping the 6th tab value of the result which is the start point of from where a value should be cut.
push( @seq_start, $feilds[6] );
#coping the 7th tab value of the result file which is the end point of a value should be cut.
push( @seq_end, $feilds[7] );
}
close IN;
open OUT, ">Result.fasta" or die "Can not open file $database: $!";
for ( $i = 0; $i <= $#ids; $i++ ) {
($sequence) = &block( $ids[$i] );
( $idline, $sequence ) = split( "\n", $sequence );
#extracting the sequence from the start point to the end point
my $seqlen = $seq_end[$i] - $seq_start[$i] - 1;
my $Nucleotides = substr( $sequence, $seq_start[$i], $seqlen ); #storing the extracted substring into $sequence
$Nucleotides =~ s/(.{1,60})/$1\n/gs;
print OUT "$idline\n";
print OUT "$Nucleotides\n";
}
print "\nExtraction Completed...";
sub block {
#block for id storage which is the first tab in the Blast output file.
my $id1 = shift;
print "$id1\n";
my $start = ();
open IN3, "$database" or die "Can not open file $database: $!";
my $blockseq = "";
while (<IN3>) {
if ( ( $_ =~ /^>/ ) && ($start) ) {
last;
}
if ( ( $_ !~ /^>/ ) && ($start) ) {
chomp;
$blockseq .= $_;
}
if (/^>$id1/) {
my $start = $. - 1;
my $blockseq .= $_;
}
}
close IN3;
return ($blockseq);
}
BLAST RESULT FILE: http://www.fileswap.com/dl/Ws7ehftejp/
SEQUENCE FILE: http://www.fileswap.com/dl/lPwuGh2oKM/
Error
substr outside of string at Nucleotide_Extractor.pl line 39.
Use of uninitialized value $Nucleotides in substitution (s///) at
Nucleotide_Extractor.pl line 41.
Use of uninitialized value $Nucleotides in concatenation (.) or string
at Nucleotide_Extractor.pl line 44.
Any help is very much appreciated and queries are always invited
There were several problems with the existing code, and I ended up rewriting the script while fixing the errors. Your implementation isn't very efficient as it opens, reads, and closes the sequence file for every ID in your Excel sheet. A better approach would be either to read and store the data from the sequence file, or, if memory is limited, go through each entry in the sequence file and pick out the corresponding data from the Excel file. You would also be better off using hashes, instead of arrays; hashes store data in key -- value pairs, so it is MUCH easier to find what you're looking for. I have also used references throughout, as they make it easy to pass data into and out of subroutines.
If you are not familiar with perl data structures, check out perlfaq4 and perldsc, and perlreftut has information on using references.
The main problem with your existing code was that the subroutine to get the sequence from the fasta file was not returning anything. It is a good idea to put plenty of debugging statements in your code to ensure that it is doing what you think it is doing. I've left in my debugging statements but commented them out. I've also copiously commented the code that I changed.
#!/usr/bin/perl
use strict;
use warnings;
# enables 'say', which prints out your text and adds a newline
use feature ':5.10';
# a very useful module for dumping out data structures
use Data::Dumper;
#my $blast = 'infesmall.txt';
print "Enter Your BLAST result file name:\t";
chomp(my $blast = <STDIN>); # BLAST result file name
print "\n";
#my $database = 'infe.fasta';
print "Enter Your Gene list file name:\t";
chomp(my $database = <STDIN>); # sequence file
print "\n";
open IN,"$blast" or die "Can not open file $blast: $!";
# instead of using three arrays, let's use a hash reference!
# for each ID, we want to store the start and the end point. To do that,
# we'll use a hash of hashes. The start and end information will be in one
# hash reference:
# { start => $fields[6], end => $fields[7] }
# and we will use that hashref as the value in another hash, where the key is
# the ID, $fields[0]. This means we can access the start or end data using
# code like this:
# $info->{$id}{start}
# $info->{$id}{end}
my $info;
while(<IN>){
#splitting the result file based on each tab
my @fields = split("\t",$_);
# add the data to our $info hashref with the ID as the key:
$info->{ $fields[0] } = { start => $fields[6], end => $fields[7] };
}
close IN;
#say "info: " . Dumper($info);
# now read the sequence info from the fasta file
my $sequence = read_sequences($database);
#say "data from read_sequences:\n" . Dumper($sequence);
my $out = 'result.fasta';
open(OUT, ">" . $out) or die "Can not open file $out: $!";
foreach my $id (keys %$info) {
# check whether the sequence exists
if ($sequence->{$id}) {
#extracting the sequence from the start point to the end point
my $seqlen = $info->{$id}{end} - $info->{$id}{start} - 1;
#say "seqlen: $seqlen; stored seq length: " . length($sequence->{$id}{seq}) . "; start: " . $info->{$id}{start} . "; end: " . $info->{$id}{end};
#storing the extracted substring into $sequence
my $nucleotides = substr($sequence->{$id}{seq}, $info->{$id}{start}, $seqlen);
$nucleotides =~ s/(.{1,60})/$1\n/gs;
#say "nucleotides: $nucleotides";
print OUT $sequence->{$id}{header} . "\n";
print OUT "$nucleotides\n";
}
}
print "\nExtraction Completed...";
sub read_sequences {
# fasta file
my $fasta_file = shift;
open IN3, "$fasta_file" or die "Can not open file $fasta_file: $!";
# initialise two variables. We will store our sequence data in $fasta
# and use $id to track the current sequence ID
# the $fasta hash will look like this:
# $fasta = {
# 'gi|7212472|ref|NC_002387.2' => {
# header => '>gi|7212472|ref|NC_002387.2| Phytophthora...',
# seq => 'ATAAAATAATATGAATAAATTAAAACCAAGAAATAAAATATGTT...',
# }
#}
my ($fasta, $id);
while(<IN3>){
chomp;
if (/^>/) {
if (/^>(\S+) /){
# the header line with the sequence info.
$id = $1;
# save the data to the $fasta hash, keyed by seq ID
# we're going to build up an entry as we go along
# set the header to the current line
$fasta->{ $id }{ header } = $_;
}
else {
# no ID found! Erk. Emit an error and undef $id.
warn "Formatting error: $_";
undef $id;
}
}
## ensure we're getting sequence lines...
elsif (/^[ATGC]/) {
# if $id is not defined, there's something weird going on, so
# don't save the sequence. In a correctly-formatted file, this
# should not be an issue.
if ($id) {
# if $id is set, add the line to the sequence.
$fasta->{ $id }{ seq } .= $_;
}
}
}
close IN3;
return $fasta;
}
For this script, I am pulling a csv file that includes what needs to be found and what the replacement is. Those values, $pattern1 and $replacement1 are then inserted into a find & replace function. Ideally this will take the csv key file & do an inplace replacement of the raw data file.
use English;
use strict;
use warnings;
sub inplace_sanitize {
my ( $datafile, $pattern1, $replacement1 ) = @_;
local @ARGV = ( $datafile ),
my $INPLACE_EDIT = '.back';
while ( <> ) {
s/\Q$pattern1/$replacement1/g;
#print;
}
}
sub main
{
# Select Key for Find & Replace
my $filename = 'stmre_fr_key.csv';
open(INPUT, $filename) or die "Cannot open $filename";
# Read the header line.
my $line = <INPUT>;
# Read the lines one by one.
while($line = <INPUT>)
{
chomp($line);
#Split & Assign
my ($replacement1, $pattern1) = split(',', $line);
# Select Data File
my $datafile = 'rawdata.csv';
#Find & Replace Data File
&inplace_sanitize( $datafile, $pattern1, $replacement1 );
}
}
close(INPUT);
main();
So this is not working: it doesn't perform the replacement. Without the inplace_sanitize call, it prints out $replacement1 and $pattern1 correctly. inplace_sanitize works by itself if you define $replacement1 = 'replace'; and $pattern1 = 'find';. But together it doesn't work. Any ideas?
Samples:
$replacement1 = '7306e005';
$pattern1 = 'leighs_satcon011016001_00753b94';
stmre_fr_key.csv:
find,replace
leighs_satcon011016001_00753b94,7306e005
leighs_satcon011016001_00753b95,7306e006
.
.
.
Your use of my $INPLACE_EDIT is the problem. You want to affect the global variable:
local $INPLACE_EDIT = '.back';
This is the same way you're treating @ARGV.
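As a rough, self-contained illustration of that fix, the sketch below localizes both globals in one statement and runs on a throwaway file. It uses the short name $^I for $INPLACE_EDIT so that use English is not needed; the helper name inplace_replace and the temporary file are assumptions made for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: replace every literal $pattern in $file with
# $replacement, keeping the original as "$file.back".
sub inplace_replace {
    my ($file, $pattern, $replacement) = @_;
    # local changes the real globals, and only until this sub returns.
    local ($^I, @ARGV) = ('.back', $file);
    while (<>) {
        s/\Q$pattern\E/$replacement/g;
        print;    # under $^I, print writes to the new version of the file
    }
    return;
}

# Demonstration on a temporary stand-in for rawdata.csv.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/rawdata.csv";
open my $out, '>', $file or die "Cannot write $file: $!";
print {$out} "leighs_satcon011016001_00753b94,42\n";
close $out;

inplace_replace($file, 'leighs_satcon011016001_00753b94', '7306e005');

open my $in, '<', $file or die "Cannot read $file: $!";
print <$in>;    # prints: 7306e005,42
close $in;
```

Note that the print inside the loop matters: with in-place editing, whatever is printed becomes the new file, so a loop that never prints leaves the file empty. Calling the helper once per key line, as the question does, also overwrites the .back copy on every call.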
This is frustrating.
I have two text files that each contain one phone number per line.
I need to read the first line from file1, and search file2 for a match.
If there is no match, write the line value to an output file.
I've been trying this but I know it's wrong.
$file1 = 'pokus1.txt';
$file2 = 'pokus2.txt';
open (F1, $file1) || die ("Could not open $file1!");
open (F2, $file2) || die ("Could not open $file2!");
open (OUTFILE, '>>output\output_x1.txt');
@f1data = <F1>;
@f2data = <F2>;
while (@f1data){
@grp = grep {/$f1data/} @f2data;
print OUTFILE "$grp";
}
close (F1);
close (F2);
close (OUTFILE);
I hope someone can help?
Thanks
Brent
With bash:
Lines in file2 that match nothing in file1:
grep -vf file1 file2 > file3
Lines in file2 that match something in file1:
grep -f file1 file2 > file4
A customary solution where you process one file saving its data as keys of a hash and later process the other looking if that key exists:
#!/usr/bin/env perl
use warnings;
use strict;
my (%phone);
open my $fh1, '<', shift or die;
open my $fh2, '<', shift or die;
##open my $ofh, '>>', shift or die;
while ( <$fh2> ) {
chomp;
$phone{ $_ } = 1;
}
while ( <$fh1> ) {
chomp;
next if exists $phone{ $_ };
##printf $ofh qq|%s\n|, $_;
printf qq|%s\n|, $_;
}
exit 0;
Run it like:
perl script.pl file1 file2 > outfile
Whenever you get an "is a piece of data from one group also in another group?" type of question (and they come up quite a bit), you should think in terms of hashes.
A hash is a keyed lookup. Let's say you create a hash keyed on say... I don't know... phone numbers taken from file #1. If you read a line in file #2, you can easily see if it's in file #1 by simply looking at the hash. Fast, efficient.
use strict; #ALWAYS ALWAYS ALWAYS
use warnings; #ALWAYS ALWAYS ALWAYS
use autodie; #Will end the program if files you try to open don't exist
# Constants are a great way of storing data that is ...uh... constant
use constant {
FILE_1 => "a1.txt",
FILE_2 => "a2.txt",
};
my %phone_hash;
open my $phone_num1_fh, "<", FILE_1;
#Let's build our phone number hash
while ( my $phone_num = <$phone_num1_fh> ) {
chomp $phone_num;
$phone_hash{ $phone_num } = 1; #Doesn't really matter, but best not a zero value
}
close $phone_num1_fh;
#Now that we have our phone hash, let's see if it's in file #2
open my $phone_num2_fh, "<", FILE_2;
while ( my $phone_num = <$phone_num2_fh> ) {
chomp $phone_num;
if ( exists $phone_hash{ $phone_num } ) {
print "$phone_num is in file #1 and file #2\n";
}
else {
print "$phone_num is only in file #2\n";
}
}
See how nicely that works. The only issue is that there may be phone numbers in file #1 that aren't in file #2. You could solve this by simply creating a second hash for all the phone numbers in file #2.
Let's do this one more time with two hashes:
my %phone_hash1;
my %phone_hash2;
open my $phone_num1_fh, "<", FILE_1;
while ( my $phone_num = <$phone_num1_fh> ) {
chomp $phone_num;
$phone_hash1{ $phone_num } = 1;
}
close $phone_num1_fh;
open my $phone_num2_fh, "<", FILE_2;
while ( my $phone_num = <$phone_num2_fh> ) {
chomp $phone_num;
$phone_hash2{ $phone_num } = 1;
}
close $phone_num2_fh;
Now, we'll use keys to list the keys and go through them. I'm going to create an %in_common hash that records the phone numbers found in both hashes:
my %in_common;
for my $phone ( keys %phone_hash1 ) {
if ( $phone_hash2{$phone} ) {
$in_common{$phone} = 1; #Phone numbers in common between the two lists
}
}
Now, I have three hashes %phone_hash1, %phone_hash2, and %in_common.
for my $phone ( sort keys %phone_hash1 ) {
if ( not $in_common{$phone} ) {
print "Phone number $phone is only in the first file\n";
}
}
for my $phone ( sort keys %phone_hash2 ) {
if ( not $in_common{$phone} ) {
print "Phone number $phone is only in " . FILE_2 . "\n";
}
}
for my $phone ( sort keys %in_common ) {
print "Phone number $phone is in both files\n";
}
Note that in this example I didn't use exists to check whether the key is in the hash. That is, I simply wrote if ( $phone_hash2{$phone} ) instead of if ( exists $phone_hash2{$phone} ). The exists form checks whether the key is present in the hash, even if its value is undefined, a null string, or numerically zero.
The plain form is true only as long as the value is not zero, a null string, or undefined. Since I purposefully set every value in the hash to 1, I can use this form. It's a good habit to use exists, because there will be situations where a valid value could be a null string or zero. However, some people like the way the code reads without exists when that is possible.
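The difference is easiest to see side by side. This small sketch prints, for a few made-up keys, what exists, defined and a plain truth test each report; the helper name describe and the sample hash are for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: report the three tests for one key as a string.
sub describe {
    my ($hash, $key) = @_;
    return sprintf 'exists=%d defined=%d true=%d',
        (exists  $hash->{$key} ? 1 : 0),
        (defined $hash->{$key} ? 1 : 0),
        ($hash->{$key}         ? 1 : 0);
}

# Made-up sample values; the key 'missing' is deliberately absent.
my %h = ( one => 1, zero => 0, empty => '', nothing => undef );

for my $key (qw(one zero empty nothing missing)) {
    printf "%-8s %s\n", $key, describe(\%h, $key);
}
# one      exists=1 defined=1 true=1
# zero     exists=1 defined=1 true=0
# empty    exists=1 defined=1 true=0
# nothing  exists=1 defined=0 true=0
# missing  exists=0 defined=0 true=0
```

Only exists separates a key that is present with a false value from a key that is absent.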