Filter common values between text files - perl

I'm a beginner at perl and I'm trying to filter a large text file with 1 column of ID names, each a few characters long and unique, e.g.:
Aghm
Tbc2
Popc
Ltr1
Iubr
Osv5
and filter this list against a second text file with some of the same ID names, e.g.:
Popc
Iubr
Trv7
Ybd8
I only want to find the common ID names and print into a new text file. In the example above I want to generate the list:
Popc
Iubr
How can I do it using perl script?

To point you in the right direction: what you are describing is a Perl filter.
Try opening the first file, looping over it with the diamond operator (that is, <>) and writing the selected lines to the output file.
The Perl Cookbook is worth getting hold of; its chapter 7 deals with cases like this.

With the ID file ids.txt and the filter file filter_ids.txt, this writes the desired result to filtered_ids.txt:
#!/usr/bin/perl
use strict;
use warnings;
open my $rh, '<', 'filter_ids.txt' or die "$!\n";
my %filter = map {$_ => 1} <$rh>;
open $rh, '<', 'ids.txt' or die "$!\n";
open my $wh, '>', 'filtered_ids.txt' or die "$!\n";
map {print $wh $_} grep $filter{$_}, <$rh>;
close $wh;
Personally I'd rather do this with grep:
grep -f filter_ids.txt ids.txt > filtered_ids.txt
Result in either case:
flesk@flesk:~$ more filtered_ids.txt
Popc
Iubr
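One caveat with the script above: it compares whole lines with their trailing newline still attached, so an ID on a final line that lacks a newline would not match. A minimal sketch of the same hash-lookup idea with the line endings stripped first is shown here; the helper name common_ids and the in-memory sample lists are assumptions made for illustration, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: return the IDs from $ids that also appear in $filter,
# keeping the order of $ids. Both arguments are array references of lines.
sub common_ids {
    my ($ids, $filter) = @_;
    my %wanted;
    for my $line (@$filter) {
        (my $id = $line) =~ s/\s+\z//;   # strip newline and trailing blanks
        $wanted{$id} = 1 if length $id;
    }
    my @common;
    for my $line (@$ids) {
        (my $id = $line) =~ s/\s+\z//;
        push @common, $id if $wanted{$id};
    }
    return @common;
}

# Sample data from the question; the last line of each list has no newline.
my @ids    = ("Aghm\n", "Tbc2\n", "Popc\n", "Ltr1\n", "Iubr\n", "Osv5");
my @filter = ("Popc\n", "Iubr\n", "Trv7\n", "Ybd8");
print "$_\n" for common_ids(\@ids, \@filter);
```

Reading the two lists from ids.txt and filter_ids.txt instead only requires passing [<$rh>] for each opened handle.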

Related

How to match and find common based on substring from two files?

I have two files. File1 contains list of email addresses. File2 contains list of domains.
I want to filter out all the email addresses after matching exact domain using Perl script.
I am using below code, but I don't get correct result.
#!/usr/bin/perl
#use strict;
#use warnings;
use feature 'say';
my $file1 = "/home/user/domain_file" or die " FIle not found\n";
my $file2 = "/home/user/email_address_file" or die " FIle not found\n";
my $match = open(MATCH, ">matching_domain") || die;
open(my $data1, '<', $file1) or die "Could not open '$file1' $!\n";
my @wrd = <$data1>;
chomp @wrd;
# loop on the file to be searched
open(my $data2, '<', $file2) or die "Could not open '$file2' $!\n";
while(my $line = <$data2>) {
chomp $line;
foreach (@wrd) {
if($line =~ /\@$_$/) {
print MATCH "$line\n";
}
}
}
File1
abc@1gmail.com.au
abc@gmail.com
abc@gmail.com1
abc@2outlook.com2
abc@outlook.com1
abc@yahoo.com
abc@yahooo1.com
abc@yahooo.com
File2
yahoo.com
gmail.com
Expected output
abc@gmail.com
abc@yahoo.com
First off, since you seem to be on *nix, you might want to check out grep -f, which takes its search patterns from a given file. I'm no expert in grep, but combining the "patterns from file" and "match whole words" options should make this fairly easy.
Second: your Perl code can be improved, but it works as expected, provided the emails and domains are in the files indicated by your code. It may be that you have mixed the files up.
If I run your code, fixing only the paths, and keeping the domains in file1, it does create the file matching_domain and it contains your expected output:
abc@gmail.com
abc@yahoo.com
So I don't know what you think your problem is (because you did not say). Maybe you were expecting it to print output to the terminal. Either way, it does work, but there are things to fix.
#use strict;
#use warnings;
It is a huge mistake to remove these two, probably the biggest mistake you can make when coding Perl. Disabling them does not remove your errors, it just hides them, and you will spend ten times as long fixing bugs. Uncommenting these two lines is the first thing you should do.
use feature 'say';
You never use this. You could for example replace print MATCH "$line\n" with say MATCH $line, which is slightly more concise.
my $file1 = "/home/user/domain_file" or die " FIle not found\n";
my $file2 = "/home/user/email_address_file" or die " FIle not found\n";
This is very incorrect. You are placing a condition on the creation of a variable. If the condition fails, does the variable exist? Don't do this. I assume this is meant to check whether the file exists, but that is not what it does. To check if a file exists, you can use -e, documented under perldoc -f -X (the various file tests).
Furthermore, a statement in the form of a string, "/home/user..." is TRUE ("truthy"), as far as Perl conditions are concerned. It is only false if it is "0" (zero), "" (empty) or undef (undefined). So your or clause will never be executed. E.g. "foo" or die will never die.
Lastly, this test is quite meaningless, as you will be testing this in your open statement later on anyway. If the file does not exist, the open will fail and your program will die.
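The truthiness rule and the -e file test described above can be checked with a small sketch. The helper name is_true and the use of a temporary file are assumptions made for illustration only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: 1 if Perl treats the value as true, 0 otherwise.
sub is_true {
    my ($value) = @_;
    return $value ? 1 : 0;
}

# A file name is a non-empty string, so it is true whether or not the
# file exists; an "or die" after it can never fire.
print "file name string: ", is_true("/home/user/domain_file"), "\n";

# The three false cases named above: "0", the empty string and undef.
print "false cases: ", is_true("0"), is_true(""), is_true(undef), "\n";

# The -e file test is what actually checks for existence.
my ($fh, $tmp) = tempfile();
close $fh;
print "exists before unlink: ", (-e $tmp ? "yes" : "no"), "\n";
unlink $tmp or die "Cannot remove '$tmp': $!";
print "exists after unlink: ", (-e $tmp ? "yes" : "no"), "\n";
```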
my $match = open(MATCH, ">matching_domain") || die;
This is also very incorrect. First off, you never use the $match variable. Secondly, I bet it does not contain what you think it does. (it contains a boolean which states whether open was successful or not, see perldoc -f open) Thirdly, again, don't put conditions on my declarations of variables, it is a bad idea.
What this statement really means is that $match will contain either the return value of the open, or the return value of die. This should probably be simply:
open my $match, ">", "matching_domain" or die "Cannot open 'matching_domain': $!";
Also, use the three argument open with explicit open MODE, and use lexical file handles, like you have done elsewhere.
And one more thing on top of all the stuff I've already badgered you with: I don't recommend hard coding output files for small programs like this. If you want to redirect the output, use shell redirection: perl foo.pl > output.txt. I think this is what has prompted you to think something is wrong with your code: You don't see the output.
Other than that, your code is fine, as near as I can tell. You may want to chomp the lines from the domain file, but it should not matter. Also remember that indentation is a good thing, and it helps you read your code. I mentioned this in a comment, but it was removed for some reason. It is important though.
Good luck!
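For comparison, here is a sketch of the same matching done with a hash lookup on the part after the last "@" rather than a regex tried against every domain. This is a swapped-in technique, not the asker's method; the helper name emails_in_domains is hypothetical, and the sample data is taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the addresses whose domain (the text after
# the last "@") is exactly one of the listed domains.
sub emails_in_domains {
    my ($emails, $domains) = @_;
    my %is_wanted = map { $_ => 1 } @$domains;
    return grep {
        my ($domain) = /\@([^\@]+)\z/;
        defined $domain && $is_wanted{$domain};
    } @$emails;
}

my @domains = qw(yahoo.com gmail.com);
my @emails  = qw(
    abc@1gmail.com.au abc@gmail.com   abc@gmail.com1 abc@2outlook.com2
    abc@outlook.com1  abc@yahoo.com   abc@yahooo1.com abc@yahooo.com
);
print "$_\n" for emails_in_domains(\@emails, \@domains);
```

Because the domain is compared as a whole string, the dots in "gmail.com" are not treated as regex wildcards, and the cost per address no longer grows with the number of domains.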
This assumes that the lines labeled File1 are in the file pointed to by $file1 and the lines labeled File2 are in the file pointed to by $file2.
You have your variables swapped. You want to match what is in $line against $_, not the other way around:
# loop on the file to be searched
open( my $data2, '<', $file2 ) or die "Could not open '$file2' $!\n";
while ( my $line = <$data2> ) {
chomp $line;
foreach (@wrd) {
if (/\@$line$/) {
print MATCH "$_\n";
}
}
}
You should un-comment the warnings and strict lines:
use strict;
use warnings;
Enabling warnings shows you that the or die checks in the file name assignment statements do not work the way you intended. Just use:
my $file1 = "/home/user/domain_file";
my $file2 = "/home/user/email_address_file";
You are already doing the checks where they belong (on open).

Using chop in grep expression

My Perl script searches a directory of file names, using grep to output only file names without the numbers 2-9 in their names. That means, as intended, that file names ending with the number "1" will also be returned. However, I want to use the chop function to output these file names without the "1", but can't figure out how. Perhaps the grep and chop functions can be combined in one line of code to achieve this? Please advise. Thanks.
Here's my Perl script:
#!/usr/bin/perl
use strict;
use warnings;
my $dir = '/Users/jdm/Desktop/xampp/htdocs/cnc/images/plants';
opendir(DIR, $dir);
@files = grep (/^[^2-9]*\.png\z/,readdir(DIR));
foreach $file (@files) {
print "$file\n";
}
Here's the output:
Ilex_verticillata.png
Asarum_canadense1.png
Ageratina_altissima.png
Lonicera_maackii.png
Chelone_obliqua1.png
Here's my desired output with the number "1" removed from the end of file names:
Ilex_verticillata.png
Asarum_canadense.png
Ageratina_altissima.png
Lonicera_maackii.png
Chelone_obliqua.png
The number 1 to remove is at the end of the name before the extension; this is different from filtering on numbers (2-9) altogether and I wouldn't try to fit it into one operation.
Instead, once you have your filtered list (no 2-9 in names), clip off that 1. Since all names of interest end in .png, you can simply use a regex
$filename =~ s/1\.png\z/.png/;
and if there is no 1 right before .png the string is unchanged. If it were possible to have other extensions involved then you should use a module to break up the filename.
To incorporate this, you can pass grep's output through a map
opendir my $dfh, $dir or die "Can't open $dir: $!";
my @files =
map { s/1\.png\z/.png/r }
grep { /^[^2-9]*\.png\z/ }
readdir $dfh;
where I've also introduced a lexical directory filehandle instead of a glob, and added a check on whether opendir worked. The /r modifier on the substitution in map is needed so that the string is returned (changed or unchanged if regex didn't match), and not changed in place, as needed here.
This passes over the list of filenames twice, though, while one can use a straight loop. In principle that may impact performance; however, here all operations are done on each element of a list so a difference in performance is minimal.
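The "straight loop" alternative mentioned above could look like the following sketch. The helper name clean_png_names is hypothetical, and Rosa_rugosa2.png is a made-up name added only to show a file being filtered out.

```perl
use strict;
use warnings;

# Hypothetical helper: keep .png names without the digits 2-9 and drop a
# "1" that sits directly before the extension, in one pass over the list.
sub clean_png_names {
    my @cleaned;
    for my $name (@_) {
        next unless $name =~ /^[^2-9]*\.png\z/;
        (my $copy = $name) =~ s/1\.png\z/.png/;   # copy first, then edit the copy
        push @cleaned, $copy;
    }
    return @cleaned;
}

print "$_\n" for clean_png_names(
    'Ilex_verticillata.png', 'Asarum_canadense1.png', 'Rosa_rugosa2.png', '.', '..',
);
```

In the real script the argument list would be readdir $dfh instead of the literal names.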
You could use the following:
s/1//g for @files;
It's also possible to integrate a solution into your chain using map.
my @files =
map s/1//rg,
grep /^[^2-9]*\.png\z/,
readdir(DIR);

Perl copying specific lines of VECT File

I want to copy lines 7-12 of files, like this example .vect file, into another .vect file in the same directory.
I want each line, to be copied twice, and the two copies of each line to be pasted consecutively in the new file.
This is the code I have used so far, and would like to continue using these methods/packages in Perl.
use strict;
use warnings;
use feature qw(say);
# This method works for reading a single file
my $dir = "D:\\Downloads";
my $readfile = $dir ."\\2290-00002.vect";
my $writefile = $dir . "\\file2.vect";
#open a file to read
open(DATA1, "<". $readfile) or die "Can't open '$readfile': $!";;
# Open a file to write
open(DATA2, ">" . $writefile) or die "Can't open '$writefile': $!";;
# Copy data from one file to another.
while ( <DATA1> ) {
print DATA2 $_;
}
close( DATA1 );
close( DATA2 );
What would be a simple way to do this using the same opening and closing file syntax I have used above?
Just modify the print line to
print DATA2 $_, $_ if 7 .. 12;
See Range Operators in "perlop - Perl operators and precedence" for details.
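A self-contained sketch of the range operator at work is given here, reading from an in-memory string instead of a real .vect file. The twenty numbered lines are made-up sample data, and the variable names are assumptions. When the bounds are held in variables rather than written as literal numbers, the comparison against $. has to be spelled out, because only constant operands are compared with the current line number implicitly.

```perl
use strict;
use warnings;

# Twenty numbered lines stand in for the .vect file (made-up sample data).
my $text = join '', map { "line $_\n" } 1 .. 20;
open my $in, '<', \$text or die "Cannot read in-memory text: $!";

my ($first, $last) = (7, 12);
my @copied;
while (my $line = <$in>) {
    # Equivalent to "if 7 .. 12" with literal numbers; the range operator in
    # this boolean context turns on at line $first and off after line $last.
    push @copied, $line, $line if $. == $first .. $. == $last;
}
close $in;

print @copied;
```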
It's worth remembering the Tie::File module, which maps a file line by line to a Perl array and allows you to manipulate text files using simple array operations. It can be slow when working with large amounts of data, but it is ideal for the majority of applications involving regular text files.
Copying a range of lines from one file to another becomes a simple matter of copying an array slice. Remember that the file starts with line one in array element zero, so lines 7 to 12 are at indexes 6..11.
This is the Perl code to do what you ask
use strict;
use warnings;
use Tie::File;
chdir 'D:\Downloads' or die $!;
tie my @infile, 'Tie::File', '2290-00002.vect' or die $!;
tie my @outfile, 'Tie::File', 'file2.vect' or die $!;
@outfile = map { $_, $_ } @infile[6..11];
Nothing else is required. Isn't that neat?

How can I create a new output file for each subfolder under a main folder using perl?

I have 100 subfolders in a main folder, all with different names. Each subfolder contains a .txt file with 10 columns. I want to create a new .txt file for each subfolder, placed in that same subfolder, so that each subfolder ends up with 2 .txt files (old and new). I am trying to select the lines starting with "ATOM" and columns 2, 6, 7 and 8 from each .txt file. My code is below. It doesn't work correctly: it doesn't create a new .txt file. How can I fix this?
#!/usr/bin/perl
$search_text = "ATOM";
@files = <*/*.txt>;
foreach $file (@files) {
print $file . "\n";
open(DATA, $file);
open(OUT_FILE, ">$file a.txt");
while ($line = <DATA>)
{
@fields = split /\s+/, $line;
if ($line =~ m/$search_text/)
{
print OUT_FILE "$fields[2]\t$fields[6]\t$fields[7]\t$fields[8]\n";
}
}
}
close(OUT_FILE);
To put the output file a.txt into the same directory as the input file, you need to extract the directory name from the input file name, and prepend it to the output file name (a.txt). There are a couple of ways you can do that; probably the simplest is to use dirname() from the standard module File::Basename:
use File::Basename;
my $dir = dirname($file);
open(OUT_FILE, ">", "$dir/a.txt") or die "Failed to open $dir/a.txt: $!";
or you could use File::Spec directly:
use File::Spec;
my ($volume, $dir) = File::Spec->splitpath($file);
my $outname = File::Spec->catpath($volume, $dir, 'a.txt');
open(OUT_FILE, ">", $outname) or die "Failed to open $outname: $!";
or you could just use a regexp substitution:
my $outname = ( $file =~ s![^/]+$!a.txt!r );
open(OUT_FILE, ">", $outname) or die "Failed to open $outname: $!";
P.S. In any case, I'd recommend adopting several good habits that will help you write better Perl scripts:
Always start your scripts with use strict; and use warnings;. Fix any errors and warnings they produce. In particular, declare all your local variables with my to make them lexically scoped.
Check the return value of functions like open(), and abort the script if they fail. (I've done this in my examples above.)
Use the three-argument form of open(), as I also did in my examples above. It's a lot less likely to break if your filenames contain funny characters.
Consider using lexically scoped file handles (open my $out_file, ...) instead of global file handles (open OUT_FILE, ...). I didn't do that in my code snippets above, because I wanted to keep them compatible with the rest of your code, but it would be good practice.
If you're pre-declaring a regular expression, like your $search_text, use qr// instead of a plain string, like this:
my $search_text = qr/ATOM/;
It's slightly more efficient, and the quoting rules for special characters are much saner.
For printing multiple columns from an array, consider using join() and a list slice, as in:
print OUT_FILE join("\t", @fields[2,6,7,8]), "\n";
Finally, if I were you, I'd reconsider my file naming scheme: the output file name a.txt matches your input file name glob *.txt, so your script will likely break if you run it twice in a row.
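Put together, the habits above might look like the following sketch. The helper name extract_atoms, the output name atoms.out (chosen so that it does not match *.txt) and the ten-column sample lines are assumptions made for illustration; the column indexes are taken unchanged from the question's code.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Temp qw(tempdir);

my $search = qr/^ATOM/;   # only lines that start with ATOM

# Hypothetical helper: write the selected columns of every ATOM line of
# $in_path to a file named $out_name in the same directory; return its path.
sub extract_atoms {
    my ($in_path, $out_name) = @_;
    my $out_path = dirname($in_path) . "/$out_name";
    open my $in,  '<', $in_path  or die "Cannot open '$in_path': $!";
    open my $out, '>', $out_path or die "Cannot open '$out_path': $!";
    while (my $line = <$in>) {
        next unless $line =~ $search;
        my @fields = split /\s+/, $line;
        next if @fields < 9;   # too short to have an element at index 8
        print {$out} join("\t", @fields[2, 6, 7, 8]), "\n";
    }
    close $in;
    close $out or die "Cannot close '$out_path': $!";
    return $out_path;
}

# Demo on a throwaway directory with made-up sample data.
my $root = tempdir(CLEANUP => 1);
mkdir "$root/sub1" or die "Cannot create '$root/sub1': $!";
open my $sample, '>', "$root/sub1/data.txt" or die "Cannot open sample: $!";
print {$sample} "ATOM c1 c2 c3 c4 c5 c6 c7 c8 c9\n",
                "OTHER c1 c2 c3 c4 c5 c6 c7 c8 c9\n";
close $sample or die "Cannot close sample: $!";

print "wrote ", extract_atoms("$root/sub1/data.txt", 'atoms.out'), "\n";
```

For the real folder layout, the helper would be called once per file returned by glob '*/*.txt'.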

Why isn't $ARGV[0] initialized by the file name I pass this perl one liner?

I have a Perl one-liner that is supposed to export each individual line of an XML file to its own separate file, named after the original file plus the number of the line it came from.
For instance, if the XML file is called "foo.xml" and it has 100 lines in it, then I want to have a hundred files called "foo_1.xml", "foo_2.xml", "foo_3.xml", etc.
I thought that the name of the file that I pass to the one-liner would be available via ARGV, but it's not. I'm getting an "uninitialized value in $ARGV[0]" error when I run this:
perl -nwe 'open (my $FH, ">", "$ARGV[0]_$.\.xml"); print $FH $_;close $FH;' foo.xml
What am I overlooking?
When using the magic <> filehandle (which you're doing implicitly with the -n option), Perl shifts the filenames out of @ARGV as it opens them. (This is mentioned in perlop.) You need to use the plain scalar $ARGV, which contains the filename currently being read from:
perl -nwe 'open (my $FH, ">", "${ARGV}_$.\.xml"); print $FH $_;close $FH;' foo.xml
(The braces are necessary because $ARGV_ is a legal name for a variable.)
cjm has the correct answer. However, it will create files such as foo.xml_1.xml. You asked for foo_1.xml, etc.
perl -nwe '
my ($file) = $ARGV =~ /(\w+)/;
open my $fh, ">", $file . "_$..xml" or die $!;
print $fh $_;
' foo.xml
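Note that /(\w+)/ captures only the first run of word characters, which is fine for foo.xml but would cut a name short if it contained a directory part or a hyphen. A sketch of a more general way to build the output name is shown here; the helper name numbered_name is hypothetical, and inside the one-liner it would be called as numbered_name($ARGV, $.).

```perl
use strict;
use warnings;

# Hypothetical helper: insert "_<line number>" before the final extension
# of $path, or append it if the file name has no extension.
sub numbered_name {
    my ($path, $line_no) = @_;
    my ($stem, $ext) = $path =~ m{\A(.*?)(\.[^./]*)?\z}s;
    $ext = '' unless defined $ext;
    return "${stem}_$line_no$ext";
}

print numbered_name('foo.xml', 1), "\n";
print numbered_name('some.dir/foo.xml', 100), "\n";
```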