match first columns in two files - perl

I have two files of unequal sizes. First file has two columns and second has only one column. I want to match the column in second file to the first column in first file and if they match, print the whole line from the first file. Pretty simple but I am stuck. Here's what I did after opening and storing the contents of both the files in arrays
foreach(@q) #second file
{
$line=$_;
foreach(@gs) #first file
{
$line1=$_;
if ( $line1=~ /$line/ )
{
print $line1;
}
}
}
This doesn't give any output.

I suspect you might be getting tripped up by line endings for one or both of your files. Regardless, it's not necessary to slurp both your files, just the 2nd one. And a regex is most likely overkill, a simple equality check is sufficient, and more likely what you intend.
The following is probably what you intend:
use strict;
use warnings;
use autodie;
my $file1 = 'foo.txt';
my $file2 = 'bar.txt';
open my $fh2, '<', $file2;
my @keys = <$fh2>;
chomp(@keys);
open my $fh1, '<', $file1;
while (my $line = <$fh1>) {
my @fields = split ' ', $line;
next unless @fields; # skip blank lines
if (grep {$fields[0] eq $_} @keys) {
print $line;
}
}
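If the key file is large, the grep above rescans every key for each line. A minimal sketch of the same equality check using a hash lookup is shown below; the helper name match_first_column and the in-memory sample data are assumptions made for illustration, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the rows whose first whitespace-separated
# field is exactly equal to one of the keys.
sub match_first_column {
    my ($keys, $rows) = @_;
    my %wanted = map { $_ => 1 } @$keys;    # hash gives constant-time lookups
    my @matched;
    for my $row (@$rows) {
        my ($first) = split ' ', $row;      # first column only
        next unless defined $first;         # skip blank lines
        push @matched, $row if $wanted{$first};
    }
    return @matched;
}

my @keys = ('b', 'd');                              # stands in for the one-column file
my @rows = ("a 1\n", "b 2\n", "c 3\n", "d 4\n");    # stands in for the two-column file
print match_first_column(\@keys, \@rows);           # prints the "b 2" and "d 4" lines
```

Because the comparison is an exact hash lookup, a key such as b does not match a row that starts with bb, which a regex match like the one in the question would.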

use strict;
use warnings;
my $file2 = 'foo.txt';
my $file1 = 'bar.txt';
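my @line1;
open(my $ff, '<', $file2) or die "Cannot open $file2: $!";
while(<$ff>)
{
chomp; # strip the newline so the print below does not double it
unshift(@line1,$_);
}
close($ff);
open(my $fh, '<', $file1) or die "Cannot open $file1: $!";
while(<$fh>)
{
my $se=$_;
chomp($se);
foreach my $data (@line1)
{
# \Q...\E keeps any regex metacharacters in the key literal
if($data=~m/^\s*\Q$se\E\s*\t/is)
{
print $data."\n";
}
}
}
close($fh);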
Try This....

Related

How to check whether one file's value contains in another text file? (perl script)

I would like to check whether the values in one file also appear in another file. If a value is present, the script should report that there is an existing bin for it; if not, it should report that there is no existing bin. The problem is that I am not sure how to check all the values at once.
The first text file (DID1) contains:
L84A:D:O:M:
L84C:B:E:D:
The second text file (DID) contains:
L84A:B:E:Q:X:F:i:M:Y:
L84C:B:E:Q:X:F:i:M:Y:
L83A:B:E:Q:X:F:i:M:Y:
If the first four characters match, all the values on that line need to be checked.
For example, L84A appears in both files and both lines contain M, so the script should print that there is an existing M bin.
Below is my code:
use strict;
use warnings;
my $filename = 'DID.txt';
my $filename1 = 'DID1.txt';
my $count = 0;
open( FILE2, "<$filename1" )
or die("Could not open log file. $!\n");
while (<FILE2>) {
my ($number) = $_;
chomp($number);
my @values1 = split( ':', $number );
open( FILE, "<$filename" )
or die("Could not open log file. $!\n");
while (<FILE>) {
my ($line) = $_;
chomp($line);
my @values = split( ':', $line );
foreach my $val (@values) {
if ( $val =~ /$values1[0]/ ) {
$count++;
if ( $values[$count] =~ /$values1[$count]/ ) {
print
"Yes ,There is an existing bin & DID\n @values1\n";
}
else {
print "No, There is an existing bin & DID\n";
}
}
}
}
}
I cannot check all the values. Please give me any advice you can, since this is my first time learning Perl. Thanks a lot :)
Based on my understanding, I wrote this code:
use strict;
use warnings;
#use ReadWrite;
use Array::Utils qw(:all);
use vars qw($my1file $myfile1cnt $my2file $myfile2cnt @output);
$my1file = "did1.txt"; $my2file = "did2.txt";
We read both the first and second files (DID1 and DID2) into strings.
readFileinString($my1file, \$myfile1cnt); readFileinString($my2file, \$myfile2cnt);
As the OP requested, the first four characters of each line in the first file are matched against the second file; if they match, the remaining values on that line are checked against the matching line in the second file.
while($myfile1cnt=~m/^((\w){4})\:([^\n]+)$/mig)
{
print "<LineStart>";
my $lineChk = $1; my $full_Line = $3; #print ": $full_Line\n";
my @First_values = split /\:/, $full_Line; #print join "\n", @First_values;
If the first four characters matched, then:
if($myfile2cnt=~m/^$lineChk\:([^\n]+)$/m)
{
Store the rest of the matching line from the second file and split it on colons to get the values to compare against the first file's values.
my $FullLine = $1; my @second_values = split /:/, $FullLine;
Then look up each value from the first file in the matched line from the second file:
foreach my $sngletter(@First_values)
{
If a value appears in both files, it is printed.
if( grep {$_ eq "$sngletter"} @second_values)
{
print "Matched: $sngletter\t";
}
}
}
else { print "Not Matched..."; }
This just marks the end of the line in the output.
print "<LineEnd>\n"
}
#------------------>Reading a file
sub readFileinString
#------------------>
{
my $File = shift;
my $string = shift;
use File::Basename;
my $filenames = basename($File);
open(FILE1, "<$File") or die "\nFailed Reading File: [$File]\n\tReason: $!";
read(FILE1, $$string, -s $File, 0);
close(FILE1);
}
Read the search patterns and the data into hashes (the first field is the key), then walk through the data and select only the fields that are included in the pattern for that key.
use strict;
use warnings;
use feature 'say';
my $input1 = 'DID1.txt'; # look for key,pattern(array)
my $input2 = 'DID.txt'; # data - key,elements(array)
my $pattern;
my $data;
my %result;
$pattern = file2hash($input1); # read pattern into hash
$data = file2hash($input2); # read data into hash
while( my($k,$v) = each %{$data} ) { # walk through data
next unless defined $pattern->{$k}; # skip those which is not in pattern hash
my $find = join '|', @{ $pattern->{$k} }; # form search pattern for grep
my @found = grep {/$find/} @{ $v }; # extract only those of interest
$result{$k} = \@found; # store in result hash
}
while( my($k,$v) = each %result ) { # walk through result hash
say "$k has " . join ':', @{ $v }; # output final result
}
sub file2hash {
my $filename = shift;
my %hash;
my $fh;
open $fh, '<', $filename
or die "Couldn't open $filename";
while(<$fh>) {
chomp;
next if /^\s*$/; # skip empty lines
my($key,@data) = split ':';
$hash{$key} = \@data;
}
close $fh;
return \%hash;
}
Output
L84C has B:E
L84A has M
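Note that grep {/$find/} treats each bin name as a regex and also matches substrings. That is harmless for the single-letter bins in the sample, but if bin names could be longer or contain regex metacharacters, an exact comparison may be safer. A minimal sketch of that variant follows; the helper name common_bins is hypothetical, and it assumes the same colon-separated layout as the sample files.

```perl
use strict;
use warnings;

# Hypothetical helper: given two array refs of bin names, return the
# entries of @$data that also occur, as exact strings, in @$pattern.
sub common_bins {
    my ($pattern, $data) = @_;
    my %want = map { $_ => 1 } @$pattern;
    return grep { $want{$_} } @$data;
}

my @pattern = split /:/, 'D:O:M:';              # fields after the key in DID1.txt
my @data    = split /:/, 'B:E:Q:X:F:i:M:Y:';    # fields after the key in DID.txt
print join(':', common_bins(\@pattern, \@data)), "\n";    # prints "M"
```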

Merge two files based on the starting of the line

I want to merge two files into one using perl. Below are the sample files.
***FILE 1***
XDC123
XDC456
XDC678
BB987
BB654
*** FILE 2 ***
XDC876
XDC234
XDC789
BB456
BB678
And I want the merged file to look like:
***MERGED FILE***
XDC123
XDC456
XDC678
XDC876
XDC234
XDC789
BB987
BB654
BB456
BB678
For the above functionality I have written the below perl script snippet:
#!/usr/bin/env perl;
use strict;
use warnings;
my $file1 = 'C:/File1';
my $file2 = 'C:/File2';
my $file3 = 'C:/File3';
open( FILEONE, '<$file1' );
open( FILETWO, '<$file2' );
open( FILETHREE, '>$file3' );
while (<FILEONE>) {
if (/^XDC/) {
print FILETHREE;
}
if (/^BB/) {
last;
}
}
while (<FILETWO>) {
if (/^XDC/) {
print FILETHREE;
}
if (/^BB/) {
last;
}
}
while (<FILEONE>) {
if (/^BB/) {
print FILETHREE;
}
}
while (<FILETWO>) {
if (/^BB/) {
print FILETHREE;
}
}
close($file1);
close($file2);
close($file3);
But the merged file that is generated from the above code looks like:
***FILE 3***
XDC123
XDC456
XDC678
XDC876
XDC234
XDC789
BB654
BB678
The first line that starts from BB is missed out from both the files. Any help on this will be appreciated. Thank you.
The problem is that a filehandle only reads forward: once a line has been read, it is gone unless you 'rewind' the handle or save the line.
So your while ( <FILEONE> ) { loop consumes (and discards) the first line that matches m/^BB/. The last exits the while loop, but only after that line has already been read.
However that's assuming you get your open statements right, because:
open( FILEONE, '>$file1' );
Actually empties it, it doesn't read from it. So I am assuming you've transposed your code, and introduced new errors whilst doing so.
As a style point - you should really use 3 argument open, with lexical filehandles.
So instead:
#!/usr/bin/env perl
use strict;
use warnings;
my $file1 = 'C:/File1';
my $file2 = 'C:/File2';
my $file3 = 'C:/File3';
my @lines;
# Slurp every line of both input files into one list
foreach my $file ( $file1, $file2 ) {
open( my $input, '<', $file ) or die $!;
push( @lines, <$input> );
close($input);
}
open( my $output, '>', $file3 ) or die $!;
print {$output} sort @lines;
close($output);
(Although as noted in the comments - if that's all you want to do, the unix sort utility is probably sufficient).
However, if you need to preserve the original order of the lines within each group while sorting the groups by their alphabetic prefix, you need a slightly different data structure:
#!/usr/bin/env perl
use strict;
use warnings;
my $file1 = 'C:/File1';
my $file2 = 'C:/File2';
my $file3 = 'C:/File3';
my %lines; # prefix => array of lines, in input order
foreach my $file ( $file1, $file2 ) {
open( my $input, '<', $file ) or die $!;
while ( my $line = <$input> ) {
my ( $key ) = $line =~ m/^(\D+)/; # leading non-digits, e.g. XDC or BB
next unless defined $key; # skip lines that start with a digit
push @{$lines{$key}}, $line;
}
close($input);
}
open( my $output, '>', $file3 ) or die $!;
foreach my $key ( sort keys %lines ) {
print {$output} @{$lines{$key}};
}
close($output);
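One thing to watch with sort keys is that it puts BB before XDC, whereas the merged file in the question lists the XDC lines first. If the groups should appear in the order their prefixes are first seen, a second array can record that order. This is only a sketch under that assumption; group_by_prefix is a hypothetical helper, and the arrays stand in for the two input files.

```perl
use strict;
use warnings;

# Hypothetical helper: group lines by their leading non-digit prefix,
# keeping prefixes in order of first appearance and lines in input order.
# Lines that start with a digit have no prefix and are skipped.
sub group_by_prefix {
    my @lines = @_;
    my (%group, @order);
    for my $line (@lines) {
        my ($key) = $line =~ m/^(\D+)/;
        next unless defined $key;
        push @order, $key unless exists $group{$key};   # remember first sighting
        push @{ $group{$key} }, $line;
    }
    return map { @{ $group{$_} } } @order;
}

my @file1 = ("XDC123\n", "XDC456\n", "BB987\n");    # stands in for File1
my @file2 = ("XDC876\n", "BB456\n");                # stands in for File2
print group_by_prefix(@file1, @file2);
# prints XDC123, XDC456, XDC876, BB987, BB456 (one per line)
```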

Extract data from file

I have data like
"scott
E -45 COLLEGE LANE
BENGALI MARKET
xyz -785698."
"Tomm
D.No: 4318/3,Ansari Road, Dariya Gunj,
xbc - 289235."
I wrote one Perl program to extract names i.e;
open(my$Fh, '<', 'printable address.txt') or die "!S";
open(my$F, '>', 'names.csv') or die "!S";
while (my @line = <$Fh>) {
for(my$i =0;$i<=13655;$i++){
if ($line[$i]=~/^"/) {
print $F $line[$i];
}
}
}
It works fine and extracts the names exactly. Now my aim is to extract the addresses, like
BENGALI MARKET
xyz -785698."
D.No: 4318/3,Ansari Road, Dariya Gunj,
xbc - 289235."
into a CSV file. How can I do this? Please tell me.
There are a lot of flaws in your original code. I should address those before suggesting any enhancements:
Always have use strict; and use warnings; at the top of every script.
Your or die "!S" statements are broken. The error code is actually in $!. However, you can skip the need to do that by just having use autodie;
Give your filehandles more meaningful names. $Fh and $F say nothing about what those are for. At minimum label them as $infh and $outfh.
The while (my @line = <$Fh>) { is flawed as that can just be reduced to my @line = <$Fh>;. Because you're calling readline in list context, it will slurp the entire file on the first iteration, and the loop will exit on the next one. Instead, assign it to a scalar, and you don't even need the inner for loop.
If you wanted to slurp your entire file into @line, your use of for(my$i =0;$i<=13655;$i++){ is also flawed. You should iterate to the last index of @line, which is $#line.
if ($line[$i]=~/^"/) { is also flawed as you leave the quote character " at the beginning of your names that you're trying to match. Instead add a capture group to pull the name.
With the suggested changes, the code reduces to:
use strict;
use warnings;
use autodie;
open my $infh, '<', 'printable address.txt';
open my $outfh, '>', 'names.csv';
while (my $line = <$infh>) {
if ($line =~ /^"(.*)/) {
print $outfh "$1\n";
}
}
Now if you also want to isolate the address, you can use a similar method as you did with the name. I'm going to assume that you might want to build the whole address in a variable so you can do something more complicated with it than throwing them blindly at a file. However, mirroring the file setup for now:
use strict;
use warnings;
use autodie;
open my $infh, '<', 'printable address.txt';
open my $namefh, '>', 'names.csv';
open my $addressfh, '>', 'address.dat';
my $address = '';
while (my $line = <$infh>) {
if ($line =~ /^"(.*)/) {
print $namefh "$1\n";
} elsif ($line =~ /(.*)"$/) {
$address .= $1;
print $addressfh "$address\n";
$address = '';
} else {
$address .= $line;
}
}
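To illustrate the idea of building each address in a variable before deciding what to do with it, here is a rough sketch that collects every record into a hash reference. The helper read_records, the choice to join address lines with a single space, and the in-memory sample are all assumptions for illustration. It also assumes every record opens with a quote on the name line and closes with a quote on the last address line, as in the sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: read quoted multi-line records from a filehandle
# and return a list of { name => ..., address => ... } hash refs.
sub read_records {
    my ($fh) = @_;
    my (@records, $current);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^"(.*)/) {                        # opening quote: name line
            $current = { name => $1, address => [] };
        }
        elsif ($current && $line =~ /(.*)"\s*$/) {      # closing quote: last address line
            push @{ $current->{address} }, $1;
            $current->{address} = join ' ', @{ $current->{address} };
            push @records, $current;
            undef $current;
        }
        elsif ($current) {                              # any other line inside a record
            push @{ $current->{address} }, $line;
        }
    }
    return @records;
}

# In-memory sample standing in for 'printable address.txt'
my $sample = qq{"scott\nE -45 COLLEGE LANE\nBENGALI MARKET\nxyz -785698."\n};
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
for my $rec (read_records($in)) {
    print "$rec->{name}: $rec->{address}\n";
}
```

With the records held as data, the name and the flattened address can be written out, filtered, or reformatted in whatever way is needed later.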
Ultimately, no matter what you want to use your data for, your best solution is probably to output it to a real CSV file using Text::CSV. That way it can be imported into a spreadsheet or some other system very easily, and you won't have to parse it again.
use strict;
use warnings;
use autodie;
use Text::CSV;
my $csv = Text::CSV->new ( { binary => 1, eol => "\n" } )
or die "Cannot use CSV: ".Text::CSV->error_diag ();
open my $infh, '<', 'printable address.txt';
open my $outfh, '>', 'address.csv';
my @data;
while (my $line = <$infh>) {
# Name Field
if ($line =~ /^"(.*)/) {
@data = ($1, '');
# End of Address
} elsif ($line =~ /(.*)"$/) {
$data[1] .= $1;
$csv->print($outfh, \@data);
# Address lines
} else {
$data[1] .= $line;
}
}

Perl - empty rows while writing CSV from Excel

I want to convert excel-files to csv-files with Perl. For convenience I like to use the module File::Slurp for read/write operations. I need it in a subfunction.
When printing to the screen, the program generates the desired output, but the generated CSV files unfortunately contain just one row of semicolons; the fields are empty.
Here is the code:
#!/usr/bin/perl
use File::Copy;
use v5.14;
use Cwd;
use File::Slurp;
use Spreadsheet::ParseExcel;
sub xls2csv {
my $currentPath = getcwd();
my @files = <$currentPath/stage0/*.xls>;
for my $sourcename (@files) {
print "Now working on $sourcename\n";
my $outFile = $sourcename;
$outFile =~ s/xls/csv/g;
print "Output CSV-File: ".$outFile."\n";
my $source_excel = new Spreadsheet::ParseExcel;
my $source_book = $source_excel->Parse($sourcename)
or die "Could not open source Excel file $sourcename: $!";
foreach my $source_sheet_number ( 0 .. $source_book->{SheetCount} - 1 )
{
my $source_sheet = $source_book->{Worksheet}[$source_sheet_number];
next unless defined $source_sheet->{MaxRow};
next unless $source_sheet->{MinRow} <= $source_sheet->{MaxRow};
next unless defined $source_sheet->{MaxCol};
next unless $source_sheet->{MinCol} <= $source_sheet->{MaxCol};
foreach my $row_index (
$source_sheet->{MinRow} .. $source_sheet->{MaxRow} )
{
foreach my $col_index (
$source_sheet->{MinCol} .. $source_sheet->{MaxCol} )
{
my $source_cell =
$source_sheet->{Cells}[$row_index][$col_index];
if ($source_cell) {
print $source_cell->Value, ";"; # correct output!
write_file( $outFile, { binmode => ':utf8' }, $source_cell->Value, ";" ); # only one row of semicolons with empty fields!
}
}
print "\n";
}
}
}
}
xls2csv();
I know it has something to do with the parameter passing in the write_file function, but couldn't manage to fix it.
Has anybody an idea?
Thank you very much in advance.
write_file will overwrite the file unless the append => 1 option is given. So this:
write_file( $outFile, { binmode => ':utf8' }, $source_cell->Value, ";" );
Will write a new file for each new cell value. It does however not match your description of "only one row of semi-colons of empty fields", as it should only be one semi-colon, and one value.
I am doubtful towards this sentiment from you: "For convenience I like to use the module File::Slurp". While the print statement works as it should, using File::Slurp does not. So how is that convenient?
What you should do, if you still want to use write_file is to gather all the lines to print, and then print them all at once at the end of the loop. E.g.:
$line .= $source_cell->Value . ";"; # use concatenation to build the line
...
push @out, "$line\n"; # store in array
...
write_file(...., \@out); # print the array
Another simple option would be to use join, or to use the Text::CSV module.
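As a rough illustration of the join approach with a single open and close per output file, consider the sketch below. The helper write_rows is hypothetical, the nested array stands in for the values read from $source_cell->Value, and a temporary file is used so the example does not touch real data. Unlike the original loop, it emits an empty field for an undefined cell instead of skipping it, and it does not add a trailing semicolon.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write all rows to $path with a single open/close.
# Each row is an array ref of cell values; undefined cells become empty fields.
sub write_rows {
    my ($path, @rows) = @_;
    open my $fh, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
    for my $row (@rows) {
        print {$fh} join(';', map { defined $_ ? $_ : '' } @$row), "\n";
    }
    close $fh or die "Cannot close $path: $!";
    return;
}

# Stand-in for the values collected from $source_cell->Value
my @rows = ( [ 'id', 'name' ], [ 1, undef ], [ 2, 'bar' ] );

# A temporary file keeps the sketch from touching real data
my ($tmp_fh, $outFile) = tempfile(SUFFIX => '.csv', UNLINK => 1);
close $tmp_fh;
write_rows($outFile, @rows);
```

Plain join does no quoting, so a cell that itself contains a semicolon or a newline would still need Text::CSV or similar.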
Well, in this particular case, File::Slurp was indeed complicating this for me. I just wanted to avoid to repeat myself, which I did in the following clumsy working solution:
#!/usr/bin/perl
use warnings;
use strict;
use File::Copy;
use v5.14;
use Cwd;
use File::Basename;
use File::Slurp;
use Tie::File;
use Spreadsheet::ParseExcel;
use open qw/:std :utf8/;
# ... other functions
sub xls2csv {
my $currentPath = getcwd();
my @files = <$currentPath/stage0/*.xls>;
my $fh;
for my $sourcename (@files) {
say "Now working on $sourcename";
my $outFile = $sourcename;
$outFile =~ s/xls/csv/gi;
if ( -e $outFile ) {
unlink($outFile) or die "Error: $!";
print "Old $outFile deleted.";
}
my $source_excel = new Spreadsheet::ParseExcel;
my $source_book = $source_excel->Parse($sourcename)
or die "Could not open source Excel file $sourcename: $!";
foreach my $source_sheet_number ( 0 .. $source_book->{SheetCount} - 1 )
{
my $source_sheet = $source_book->{Worksheet}[$source_sheet_number];
next unless defined $source_sheet->{MaxRow};
next unless $source_sheet->{MinRow} <= $source_sheet->{MaxRow};
next unless defined $source_sheet->{MaxCol};
next unless $source_sheet->{MinCol} <= $source_sheet->{MaxCol};
foreach my $row_index (
$source_sheet->{MinRow} .. $source_sheet->{MaxRow} )
{
foreach my $col_index (
$source_sheet->{MinCol} .. $source_sheet->{MaxCol} )
{
my $source_cell =
$source_sheet->{Cells}[$row_index][$col_index];
if ($source_cell) {
print $source_cell->Value, ";";
open( $fh, '>>', $outFile ) or die "Error: $!";
print $fh $source_cell->Value, ";";
close $fh;
}
}
print "\n";
open( $fh, '>>', $outFile ) or die "Error: $!";
print $fh "\n";
close $fh;
}
}
}
}
xls2csv();
I'm actually NOT happy with it, since I'm opening and closing the files so often (I have many files with many lines). That's not very clever in terms of performance.
Currently I still don't know how to use split or Text::CSV in this case, in order to put everything into an array and to open, write and close each file only once.
Thank you for your answer TLP.

How to search and replace using hash with Perl

I'm new to Perl and I'm afraid I am stuck and wanted to ask if someone might be able to help me.
I have a file with two columns (tab separated) of oldname and newname.
I would like to use the oldname as key and newname as value and store it as a hash.
Then I would like to open a different file (gff file) and replace all the oldnames in there with the newnames and write it to another file.
I have given it my best try but am getting a lot of errors.
If you could let me know what I am doing wrong, I would greatly appreciate it.
Here are how the two files look:
oldname newname(SFXXXX) file:
genemark-scaffold00013-abinit-gene-0.18 SF130001
augustus-scaffold00013-abinit-gene-1.24 SF130002
genemark-scaffold00013-abinit-gene-1.65 SF130003
file to search and replace in (an example of one of the lines):
scaffold00013 maker gene 258253 258759 . - . ID=maker-scaffold00013-augustus-gene-2.187;Name=maker-scaffold00013-augustus-gene-2.187;
Here is my attempt:
#!/usr/local/bin/perl
use warnings;
use strict;
my $hashfile = $ARGV[0];
my $gfffile = $ARGV[1];
my %names;
my $oldname;
my $newname;
if (!defined $hashfile) {
die "Usage: $0 hash_file gff_file\n";
}
if (!defined $gfffile) {
die "Usage: $0 hash_file gff_file\n";
}
###save hashfile with two columns, oldname and newname, into a hash with oldname as key and newname as value.
open(HFILE, $hashfile) or die "Cannot open $hashfile\n";
while (my $line = <HFILE>) {
chomp($line);
my ($oldname, $newname) = split /\t/;
$names{$oldname} = $newname;
}
close HFILE;
###open gff file and replace all oldnames with newnames from %names.
open(GFILE, $gfffile) or die "Cannot open $gfffile\n";
while (my $line2 = <GFILE>) {
chomp($line2);
eval "$line2 =~ s/$oldname/$names{oldname}/g";
open(OUT, ">SFrenamed.gff") or die "Cannot open SFrenamed.gff: $!";
print OUT "$line2\n";
close OUT;
}
close GFILE;
Thank you!
Your main problem is that you aren't splitting the $line variable. split /\t/ splits $_ by default, and you haven't put anything in there.
This program builds the hash, and then constructs a regex from all the keys by sorting them in descending order of length and joining them with the | regex alternation operator. The sorting is necessary so that the longest of all possible choices is selected if there are any alternatives.
Every occurrence of the regex is replaced by the corresponding new name in each line of the input file, and the output written to the new file.
use strict;
use warnings;
die "Usage: $0 hash_file gff_file\n" if @ARGV < 2;
my ($hashfile, $gfffile) = @ARGV;
open(my $hfile, '<', $hashfile) or die "Cannot open $hashfile: $!";
my %names;
while (my $line = <$hfile>) {
chomp($line);
my ($oldname, $newname) = split /\t/, $line;
$names{$oldname} = $newname;
}
close $hfile;
my $regex = join '|', sort { length $b <=> length $a } keys %names;
$regex = qr/$regex/;
open(my $gfile, '<', $gfffile) or die "Cannot open $gfffile: $!";
open(my $out, '>', 'SFrenamed.gff') or die "Cannot open SFrenamed.gff: $!";
while (my $line = <$gfile>) {
chomp($line);
$line =~ s/($regex)/$names{$1}/g;
print $out $line, "\n";
}
close $out;
close $gfile;
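One caveat when building the alternation from raw keys: the old names in the sample contain dots, and an unescaped dot in a regex matches any character. A possible refinement, shown here only as a sketch, is to pass each key through quotemeta first. The helper name names_regex and the second Name= value in the sample line are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: build one regex that matches any key literally,
# trying longer keys first so a shorter key cannot shadow a longer one.
sub names_regex {
    my ($names) = @_;
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } keys %$names;
    return qr/$alternation/;
}

my %names = (
    'genemark-scaffold00013-abinit-gene-0.18' => 'SF130001',
    'augustus-scaffold00013-abinit-gene-1.24' => 'SF130002',
);
my $regex = names_regex(\%names);

# The second name differs only where the dot is, so it must not be replaced
my $line = 'ID=genemark-scaffold00013-abinit-gene-0.18;Name=genemark-scaffold00013-abinit-gene-0x18;';
$line =~ s/($regex)/$names{$1}/g;
print "$line\n";    # only the first name is replaced; "0x18" is left alone
```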
Why are you using an eval? And $oldname is going to be undefined in the second while loop, because in the first while loop you redeclare both variables in that loop's scope (even if you used the outer variables, they would only hold the very last value that you processed, which wouldn't be helpful).
Take out the my $oldname and my $newname at the top of your script; they are useless.
Take out the entire eval line. You need to repeat the regex for each thing you want to replace. Try something like:
$line2 =~ s/$_/$names{$_}/g for keys %names;
Also see Borodin's answer. He made one big regex instead of a loop, and caught your lack of the second argument to split.