Search for occurrences of contents of a file in another file - perl

I want to search the contents of files in a directory for words present in files in another directory. Is there a better way to do it than the following? (By better I mean better in terms of memory usage.)
More specifically:
folder 1 has several files, each file has several lines of text.
folder 2 has several files, each file has several words, each on its line.
What I want to do is count the number of occurrences of each word in each file in folder 2 in each line of each file of folder 1.
I hope that wasn't too confusing.
open my $output, '>>', 'D:/output.txt' or die "Can't open output: $!";
my @files = <"folder1/*">;
my @categories = <"folder2/*">;
foreach my $file (@files) {
    open my $fileh, '<', $file or die "Can't open file $file: $!";
    foreach my $line (<$fileh>) {
        foreach my $categoryName (@categories) {
            open my $categoryFile, '<', $categoryName or die "Can't open file $categoryName: $!";
            foreach my $word (<$categoryFile>) {
                # search using regex
            }
            # print to output
        }
    }
}

One obvious improvement is to open all the category files first in a separate loop and cache the words in them into a hash of arrays (hash key being the filename), or just one big array if you don't care which search word came from which file.
This will avoid having to re-read the search files for every line in every $file - AND help get rid of duplicate search words in the bargain.
use File::Slurp;

open my $output, '>>', 'D:/output.txt' or die "Can't open output: $!";
my %categories = ();
my @files = <"folder1/*">;
my @categories = <"folder2/*">;

foreach my $categoryName (@categories) {
    my @lines = read_file($categoryName);
    foreach my $category (@lines) {
        chomp($category);
        $categories{$category} = 0;
    }
}
# add in some code to uniquify @categories

foreach my $file (@files) {
    open my $fileh, '<', $file or die "Can't open file $file: $!";
    foreach my $line (<$fileh>) {
        foreach my $category (keys %categories) {
            # count
        }
    }
    # output
}
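If you do care which file each search word came from, the hash-of-arrays variant mentioned above might look like this (a minimal sketch; read_file is from File::Slurp as before):

my %words_by_file;
foreach my $categoryName (@categories) {
    my @words = read_file($categoryName);
    chomp @words;
    $words_by_file{$categoryName} = \@words;  # filename => reference to its list of words
}
# later: foreach my $word (@{ $words_by_file{$categoryName} }) { ... }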
Also, if these are real "words" - meaning a category of "cat" needs to match "cat dog" but not "mcat" - I would count the word usage by splitting instead of a regex:
foreach my $line (<$fileh>) {
    my @words = split(/\s+/, $line);
    foreach my $word (@words) {
        $categories{$word}++ if exists $categories{$word};
    }
}
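Putting the pieces together, here is a minimal end-to-end sketch of the counting and output steps (the per-file reset of the counters and the output format are my assumptions, since the original leaves them as # count and # output placeholders):

foreach my $file (@files) {
    open my $fileh, '<', $file or die "Can't open file $file: $!";
    my %counts = map { $_ => 0 } keys %categories;  # fresh counters per file
    while (my $line = <$fileh>) {
        foreach my $word (split /\s+/, $line) {
            $counts{$word}++ if exists $counts{$word};
        }
    }
    print {$output} "$file: $_ => $counts{$_}\n" for sort keys %counts;
}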


Data driven perl script

I want to list the files and folders in a directory. Here is the list of files in this directory:
Output1.sv
Output2.sv
Folder1
Folder2
file_a
file_b
file_c.sv
But I don't want some of them to be listed. The files to exclude are listed in input.txt as below. Note: some of them are files and some are folders.
NOT_INCLUDED=file_a
NOT_INCLUDED=file_b
NOT_INCLUDED=file_c.sv
Here is the code.
#!/usr/intel/perl
use strict;
use warnings;

my $input_file = "INPUT.txt";

open ( OUTPUT, ">OUTPUT.txt" );
file_in_directory();
close OUTPUT;

sub file_in_directory {
    my $path = "experiment/";
    my @unsort_output;
    my @not_included;

    open ( INFILE, "<", $input_file);
    while (<INFILE>){
        if ( $_ =~ /NOT_INCLUDED/){
            my @file = $_;
            foreach my $file (@file) {
                $file =~ s/NOT_INCLUDED=//;
                push @not_included, $file;
            }
        }
    }
    close INFILE;

    opendir ( DIR, $path ) || die "Error in opening dir $path\n";
    while ( my $filelist = readdir (DIR) ) {
        chomp $filelist;
        next if ( $filelist =~ m/\.list$/ );
        next if ( $filelist =~ m/\.swp$/ );
        next if ( $filelist =~ s/\.//g);
        foreach $_ (@not_included){
            chomp $_;
            my $not_included = "$_";
            if ( $filelist eq $not_included ){
                next;
            }
        push @unsort_output, $filelist;
    }
    closedir(DIR);
    my @output = sort @unsort_output;
    print OUTPUT @output;
}
The output that I want is a list of all the files in that directory except those listed as 'NOT_INCLUDED' in input.txt:
Output1.sv
Output2.sv
Folder1
Folder2
But the output that I get still includes those unwanted files.
This part of the code makes no sense:
while ( my $filelist = readdir (DIR) ) {
    ...
    foreach $_ (@not_included){
        chomp $_;
        my $not_included = "$_";
        if ( $filelist eq $not_included ){
            next;
        } # (1)
    push @unsort_output, $filelist; # (2)
}
This code contains three opening braces ({) but only two closing braces (}). If you try to run your code as-is, it fails with a syntax error.
The push line (marked (2)) is part of the foreach loop, but indented as if it were outside. Either it should be indented more (to line up with (1)), or you need to add a } before it. Neither alternative makes much sense:
If push is outside of the foreach loop, then the next statement (and the whole foreach loop) has no effect. It could just be deleted.
If push is inside the foreach loop, then every directory entry ($filelist) will be pushed multiple times, once for each line in @not_included (except for the names listed somewhere in @not_included; those will be pushed one time less).
There are several other problems. For example:
$filelist =~ s/\.//g removes all dots from the file name, transforming e.g. file_c.sv into file_csv. That means it will never match NOT_INCLUDED=file_c.sv in your input file.
Worse, the next if s/// part means the loop skips all files whose names contain dots, such as Output1.sv or Output2.sv.
Results are printed without separators, so you'll get something like
Folder1Folder1Folder1Folder2Folder2Folder2file_afile_afile_bfile_b in OUTPUT.txt.
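A one-line fix is to add a newline per entry when printing, for example:

print OUTPUT map { "$_\n" } @output;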
Global variables are used for no reason, e.g. INFILE and DIR.
Here is how I would structure the code:
#!/usr/intel/perl
use strict;
use warnings;

my $input_file = 'INPUT.txt';

my %is_blacklisted;
{
    open my $fh, '<', $input_file or die "$0: $input_file: $!\n";
    while (my $line = readline $fh) {
        chomp $line;
        if ($line =~ s!\ANOT_INCLUDED=!!) {
            $is_blacklisted{$line} = 1;
        }
    }
}

my $path = 'experiment';

my @results;
{
    opendir my $dh, $path or die "$0: $path: $!\n";
    while (my $entry = readdir $dh) {
        next
            if $entry eq '.' || $entry eq '..'
            || $entry =~ /\.list\z/
            || $entry =~ /\.swp\z/
            || $is_blacklisted{$entry};
        push @results, $entry;
    }
}

@results = sort @results;

my $output_file = 'OUTPUT.txt';
{
    open my $fh, '>', $output_file or die "$0: $output_file: $!\n";
    for my $result (@results) {
        print $fh "$result\n";
    }
}
The contents of INPUT.txt (more specifically, the parts after NOT_INCLUDED=) are read into a hash (%is_blacklisted). This allows easy lookup of entries.
Then we process the directory entries. We skip over . and .. (I assume you don't want those) as well as all files ending with *.list or *.swp (that was in your original code). We also skip any file that is blacklisted, i.e. that was specified as excluded in INPUT.txt. The remaining entries are collected in @results.
We sort our results and write them to OUTPUT.txt, one entry per line.
Without deviating too much from your code, here is a solution; please see the comments:
#!/usr/intel/perl
use strict;
use warnings;

my $input_file = "INPUT.txt";

open ( OUTPUT, ">OUTPUT.txt" );
file_in_directory();
close OUTPUT;

sub file_in_directory {
    my $path = "experiment/";
    my @unsort_output;
    my %not_included; # a hash instead of an array, for cleaner and faster lookups

    open ( INFILE, "<", $input_file);
    while (my $file = <INFILE>) {
        if ($file =~ /NOT_INCLUDED/) {
            chomp $file; # drop the newline, or the hash keys will never match readdir entries
            $file =~ s/NOT_INCLUDED=//;
            $not_included{$file}++; # create a quick hash map of (filename => 1, filename2 => 1)
        }
    }
    close INFILE;

    opendir ( DIR, $path ) || die "Error in opening dir $path\n";
    while ( my $filelist = readdir (DIR) ) {
        next if $filelist =~ /^\.\.?$/xms; # discard the . and .. entries
        next if ( $filelist =~ m/\.list$/ );
        next if ( $filelist =~ m/\.swp$/ );
        # note: the old "next if ($filelist =~ s/\.//g)" is gone; it skipped every name containing a dot
        if (defined $not_included{$filelist}) {
            next;
        }
        else {
            push @unsort_output, $filelist;
        }
    }
    closedir(DIR); # earlier the closedir was inside the while loop, which is wrong

    my @output = sort @unsort_output;
    print OUTPUT join "\n", @output;
}

Perl: How to search for an indefinite list of keywords in a list of files in a folder

Can anyone help me with a Perl script for the problem below?
File1.txt -> with keywords to search
Hello_
World!
+Bye
Temp-
File2 (which can be of any extension), in which the keywords are to be searched, then File3, File4, ....
I want to search for all the keywords from File1 in File2, and if they are found, print the keyword found along with the file and the line number in which this particular keyword was found.
I want to keep the number of keywords and files indefinite - they can be added and modified.
open(MYINPUTFILE, "<expressions.txt");
# open for input
my(@lines) = <MYINPUTFILE>;
#print @lines;

my @files = grep ( -f, <*main_log>, <*Project>);

$n = 0;
$l = 0;
#foreach my $file (@files) {
foreach my $line (@lines) {
    my @f = grep /$line/, @files;
    print "@f\n";
}
#}
}
Issue: I tried to execute the above code but it does not print anything on my command prompt. I am using Windows 7.
This answer is based on your posted code:
use strict; # always use these
use warnings;

open( my $kw, '<', 'expressions.txt') or die $!;
my @keywords = <$kw>;
chomp(@keywords); # remove newlines at the end of keywords

# get list of files in current directory
my @files = grep { -f } (<*main_log>,<*Project>);

# loop over each file to search keywords in
foreach my $file (@files) {
    open(my $fh, '<', $file) or die $!;
    my @content = <$fh>;
    close($fh);
    foreach my $kw (@keywords) {
        my $l = 0; # line counter, reset for each keyword pass
        my $search = quotemeta($kw); # otherwise keyword is used as regex, not literally
        foreach (@content) { # go through every line for this keyword
            $l++;
            printf 'Found keyword %s in file %s, line %d:%s'.$/, $kw, $file, $l, $_
                if /$search/;
        }
    }
}
Regarding the questions in the comments below:
The innermost loop just counts line numbers ($l++) and prints the finds in case of an occurrence - the if /$search/ is still part of the statement above it. It could also be written as
if ( /$search/ ) {
    printf ...
}
The printf is used to format the output. You could also have done this by simply using print and concatenating all the needed variables. I just prefer it this way.
This assumes that you want a list of found lines per keyword for every file. You have to switch the order and logic for @keywords and @content to get the output ordered by line.
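A minimal sketch of that line-ordered variant (same @keywords and @files as above):

foreach my $file (@files) {
    open(my $fh, '<', $file) or die $!;
    while (my $line = <$fh>) {          # outer loop over lines now
        foreach my $kw (@keywords) {    # inner loop over keywords
            printf 'Found keyword %s in file %s, line %d:%s'.$/, $kw, $file, $., $line
                if $line =~ /\Q$kw\E/;  # $. holds the current input line number
        }
    }
    close($fh);
}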
For additional functionality regarding comments in the keyword file, you would have to postprocess the content to separate the search terms from the comments - possibly in a hash with the search term as key and the comment as value. Then you could use only the hash keys for the search (see the innermost loop) and print the comment, if one exists, as an additional line.
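For example, assuming a hypothetical keyword-file layout of search term, a tab, then an optional comment, the postprocessing could look like this sketch:

my %comment_of; # search term => comment ('' if none)
foreach my $line (@keywords) {
    my ($term, $comment) = split /\t/, $line, 2;
    next unless defined $term && length $term;
    $comment_of{$term} = $comment // '';
}
# search with: foreach my $kw (keys %comment_of) { ... }
# and print $comment_of{$kw} as an extra line when it is non-empty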

Perl - Read multiple files and read line by line of the text file

I am trying to read multiple .txt files in a folder. Each file should be read line by line; however, I failed to read multiple .txt files using glob. Any advice on my code?
my %data;
@FILES = glob("*.txt");

$EmailMsg .= "EG. Folder(week) = Folder(CW01) --CW01 = Week 1 -- Number is week\n ";
$EmailMsg .= "=======================================================================================================\n";

# Try to Loop multiple files here
foreach my $file (@FILES) {
    local $/ = undef;
    open my $fh, '<', $file;
    $data{$file} = <$fh>;
    # Read the file one line at a time.
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
        my ($name, $date, $week) = split /\:/, $line;
        if ($name eq "NoneFolder") {
            $EmailMsg .= "Folder ($week) - No Folder created on the FTP! Failed to open folder!\n";
        }
        if ($name eq "EmptyFiles") {
            $EmailMsg .= "Folder ($week) - No Files insides the folder! Failed download files!\n";
        }
    }
}
$EmailMsg .= "=======================================================================================================\n";
$EmailMsg .= "Please note that if you receive this email means that the script is running fine just that no folder is created or no files inside the folder for the week on the FTP.\n";
# close the file.
#close <$fh>;
Current output:
EG. Folder(week) = Folder(CW01) --CW01 = Week 1 -- Number is week
=======================================================================================================
=======================================================================================================
Please note that if you receive this email means that the script is running fine just that no folder is created or no files inside the folder for the week on the FTP.
It failed to get any .txt files.
You are trying to read each file twice: firstly into the hash %data and then again line by line.
Once you have reached end of file, you have to either reopen the file or use seek to move the read pointer back to the beginning.
You also need to set $/ back to its original value, otherwise your loop will read the entire file instead of one line at a time.
It's not clear whether you really need the second copy of the file data in the hash, but you can avoid having to reset $/ by putting the change within a block, like this
open my $fh, '<', $file;
$data{$file} = do {
    local $/ = undef;
    <$fh>;
};
and then reset the file pointer to the start again before the while loop.
seek $fh, 0, 0;
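Put together, a minimal corrected sketch of the loop (assuming you still want the slurped copy in %data):

foreach my $file (@FILES) {
    open my $fh, '<', $file or die "Can't open $file: $!";
    $data{$file} = do { local $/; <$fh> };  # slurp whole file; $/ is restored after the block
    seek $fh, 0, 0;                         # rewind for the line-by-line pass
    while (my $line = <$fh>) {              # $/ is back to "\n" here
        chomp $line;
        # ... process $line as before ...
    }
    close $fh;
}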
#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';

my @files = ('Read a file.pl', 'Read a single text file.pl', 'Read only one file.pl',
             'Read the file using while.pl', 'Reading the file.pl');

foreach my $i (@files) {
    open(FH, "<$i");
    while (my $row = <FH>) {
        chomp $row;
        print "$row\n";
    }
}
The file globbing works for me. You might want to specify scope for your @FILES variable and check that there actually are files matching the path you have specified:
#!/bin/env perl
use strict;
use warnings;

## glob on all files in home directory
## see: http://perldoc.perl.org/File/Glob.html
use File::Glob ':globally';

my @configs = <~myname/project/etc/*.cfg>;
foreach my $fn (@configs) {
    print "file $fn\n";
}
Your code:
my %data;
# here are some .c files,
my @FILES = glob("../*.c");
foreach my $fn (@FILES) {
    print "file $fn\n";
}
exit;
This way catches more garbage for about the same amount of code.
my $PATH = shift @ARGV ;
chomp $PATH ;
opendir(TXTFILE,$PATH) || die ("failed to opendir: $PATH") ;
my @file = readdir TXTFILE ;
closedir(TXTFILE) ;

foreach (@file) {
    next unless ($_ =~ /\.txt$/i) ;        # Only get .txt files
    $PATH =~ s/\/$//g ; $PATH =~ s/$/\// ; # Uniform trailing slash
    my $thisfile = $PATH . $_ ;            # now a fully qualified filename
    unless (open(THISFILE,$thisfile)) {    # Notify on busted files.
        warn ("$thisfile failed to open") ;
        next ;
    }
    while (my $line = <THISFILE>) { # read into a lexical; a bare while(<THISFILE>) would clobber the foreach's $_
        # etc. etc.
    }
    close(THISFILE) ;
}
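As an aside, the manual trailing-slash fixup can be replaced with the core File::Spec module, which joins path components portably (a sketch):

use File::Spec;
my $thisfile = File::Spec->catfile($PATH, $_);  # e.g. "dir" + "a.txt" => "dir/a.txt"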

How to check if a file is within a directory

I have a text file with a list of individual mnemonics (1000+) in it and a directory that also has page files in it. I want to see how many pages a given mnemonic is on.
Below is my code so far:
use strict;
use warnings;
use File::Find ();

my $mnemonics = "path\\path\\mnemonics.txt";
my $pages = "path\\path\\pages\\";

open (INPUT_FILE, $mnemonics) or die "Cannot open file $mnemonics\n";
my @mnemonic_list = <INPUT_FILE>;
close (INPUT_FILE);

opendir (DH, $pages);
my @pages_dir = readdir DH;

foreach my $mnemonic (@mnemonic_list) {
    foreach my $page (@pages_dir) {
        if (-e $mnemonic) {
            print "$mnemonic is in the following page: $page";
        } else {
            print "File does not exist \n";
        }
    }
}
Basically, where I know that a name exists in a page, it isn't showing me the correct output. I'm getting a lot of "File does not exist" when I know the file is there.
Also, instead of (-e) I tried using:
if ($name =~ $page)
and that didn't work either.
Please help!
Assuming that you want to search a directory full of text files and print the names of the files that contain any of the words in mnemonics.txt, try this:
use strict; use warnings;

my $mnemonics = "path/mnemonics.txt";
my $pages = "path/pages/";

open (INPUT_FILE, $mnemonics) or die "Cannot open file $mnemonics\n";
chomp(my @mnemonic_list = <INPUT_FILE>);
close (INPUT_FILE);

local($/, *FILE); # set "slurp" mode

for my $filename (<$pages*>) {
    next if -d "$filename"; # ignore subdirectories
    open FILE, "$filename";
    binmode(FILE);
    $filename =~ s/.+\///; # remove path from filename for output
    my $contents = <FILE>; # "slurp" file contents
    for my $mnemonic (@mnemonic_list) {
        if ($contents =~ /$mnemonic/i) {
            print "'$mnemonic' found in file $filename\n";
        }
    }
    close FILE;
}
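One caveat: each mnemonic is interpolated into the pattern as a regex, so metacharacters can misfire and substring matches count as hits. If the mnemonics should match literally and only as whole words, the test is safer written as (a sketch):

if ($contents =~ /\b\Q$mnemonic\E\b/i) {
    print "'$mnemonic' found in file $filename\n";
}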

Perl program help on opendir and readdir

So I have a program that I want to use to clean some text files. The program asks the user to enter the full pathway of a directory containing these text files. From there I want to read the files in the directory, print them to a new file (that is specified by the user), and then clean them in the way I need. I have already written the script to clean the text files.
I ask the user for the directory to use:
chomp ($user_supplied_directory = <STDIN>);
opendir (DIR, $user_supplied_directory);
Then I need to read the directory.
my @dir = readdir DIR;
foreach (@dir) {
Now I am lost.
Any help please?
I'm not certain what you want, so I made some assumptions:
When you say clean the text file, you meant delete the text file
The names of the files you want to write into are formed by a pattern.
So, if I'm right, try something like this:
chomp ($user_supplied_directory = <STDIN>);
opendir (DIR, $user_supplied_directory);
my @dir = readdir DIR;

foreach (@dir) {
    next if (($_ eq '.') || ($_ eq '..'));

    # Reads the content of the original file
    open FILE, $_;
    my $contents = do { local $/; <FILE> }; # slurp the whole file, not just the first line
    close FILE;

    # Here you supply the new filename
    my $new_filename = $_ . ".new";

    # Writes the content to the new file
    open FILE, '>'.$new_filename;
    print FILE $contents;
    close FILE;

    # Deletes the old file
    unlink $_;
}
I would suggest that you switch to File::Find. It can be a bit of a challenge in the beginning but it is powerful and cross-platform.
But, to answer your question, try something like:
my @files = readdir DIR;
foreach my $file (@files) {
    foo("$user_supplied_directory/$file");
}
where "foo" is whatever you need to do to the files. A few notes might help:
using "#dir" as the array of files was a bit misleading
the folder name needs to be prepended to the file name to get the right file
it might be convenient to use grep to throw out unwanted files and subfolders, especially ".."
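For that last point, a one-liner that drops the dot entries while reading the directory (a small sketch):

my @files = grep { $_ ne '.' && $_ ne '..' } readdir DIR;

And here is a minimal sketch of the File::Find approach suggested at the top of this answer; the callback body is just an assumption about what the cleaning involves:

use File::Find;

# Visit every file below the user-supplied directory, recursively.
find(sub {
    return unless -f;                        # skip directories and specials
    print "processing $File::Find::name\n";  # $File::Find::name is the full path
    # ... clean the file here ...
}, $user_supplied_directory);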
I wrote something today that used readdir. Maybe you can learn something from it. This is just a part of a (somewhat) larger program:
our @Perls = ();

{
    my $perl_rx = qr { ^ perl [\d.] + $ }x;
    for my $dir (split(/:/, $ENV{PATH})) {
        ### scanning: $dir
        my $relative = ($dir =~ m{^/});
        my $dirpath = $relative ? $dir : "$cwd/$dir";
        unless (chdir($dirpath)) {
            warn "can't cd to $dirpath: $!\n";
            next;
        }
        opendir(my $dot, ".") || next;
        while ($_ = readdir($dot)) {
            next unless /$perl_rx/o;
            ### considering: $_
            next unless -f;
            next unless -x _;
            ### saving: $_
            push @Perls, "$dir/$_";
        }
    }
}

{
    my $two_dots = qr{ [.] .* [.] }x;
    if (grep /$two_dots/, @Perls) {
        @Perls = grep /$two_dots/, @Perls;
    }
}

{
    my (%seen, $dev, $ino);
    @Perls = grep {
        ($dev, $ino) = stat $_;
        ! $seen{$dev, $ino}++;
    } @Perls;
}
The crux is push(@Perls, "$dir/$_"): filenames read by readdir are basenames only; they are not full pathnames.
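A tiny illustration of that point (the directory is hypothetical):

opendir my $dh, '/usr/bin' or die $!;
while (my $name = readdir $dh) {
    # $name is e.g. "perl5.36", not "/usr/bin/perl5.36"
    my $full = "/usr/bin/$name";    # prepend the directory yourself
    print "$full\n" if -f $full && -x _;
}
closedir $dh;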
You can do the following, which allows the user to supply their own directory or, if no directory is specified by the user, it defaults to a designated location.
The example shows the use of opendir and readdir, stores all files in the directory in the @files array, and stores only the files that end with '.txt' in the @keys array. The while loop ensures that the full path to the files is stored in the arrays.
This assumes that your "text files" end with the ".txt" suffix. I hope that helps, as I'm not quite sure what's meant by "cleaning the files".
use feature ':5.24';
use File::Copy;

my $dir = shift || "/some/default/directory";
opendir(my $dh, $dir) || die "Can't open $dir: $!";

my (@files, @keys);
while ( readdir $dh ) {
    push( @files, "$dir/$_");
}

# store ".txt" files in new array
foreach my $file ( @files ) {
    push( @keys, $file ) if $file =~ /(\S+\.txt\z)/g;
}

# Move files to new location, even if it's across different devices
for ( @keys ) {
    move( $_, "/some/other/directory/" ) or die "Couldn't move files: $!\n";
}
See the perldoc of File::Copy for more info.