How to match and find common based on substring from two files? - perl

I have two files. File1 contains list of email addresses. File2 contains list of domains.
I want to filter out all the email addresses after matching exact domain using Perl script.
I am using below code, but I don't get correct result.
#!/usr/bin/perl
#use strict;
#use warnings;
use feature 'say';
my $file1 = "/home/user/domain_file" or die " FIle not found\n";
my $file2 = "/home/user/email_address_file" or die " FIle not found\n";
my $match = open(MATCH, ">matching_domain") || die;
open(my $data1, '<', $file1) or die "Could not open '$file1' $!\n";
my #wrd = <$data1>;
chomp #wrd;
# loop on the fiile to be searched
open(my $data2, '<', $file2) or die "Could not open '$file2' $!\n";
while(my $line = <$data2>) {
chomp $line;
foreach (#wrd) {
if($line =~ /\#$_$/) {
print MATCH "$line\n";
}
}
}
File1
abc#1gmail.com.au
abc#gmail.com
abc#gmail.com1
abc#2outlook.com2
abc#outlook.com1
abc#yahoo.com
abc#yahooo1.com
abc#yahooo.com
File2
yahoo.com
gmail.com
Expected output
abc#gmail.com
abc#yahoo.com

First off, since you seem to be on *nix, you might want to check out grep -f, which can take search patterns from a given file. I'm no expert in grep, but I would try the file and "match whole words" and this should be fairly easy.
Second: Your Perl code can be improved, but it works as expected. If you put the emails and domains in the files as indicated by your code. It may be that you have mixed the files up.
If I run your code, fixing only the paths, and keeping the domains in file1, it does create the file matching_domain and it contains your expected output:
abc#gmail.com
abc#yahoo.com
So I don't know what you think your problem is (because you did not say). Maybe you were expecting it to print output to the terminal. Either way, it does work, but there are things to fix.
#use strict;
#use warnings;
It is a huge mistake to remove these two. Biggest mistake you will ever do while coding Perl. It will not remove your errors, just hide them. You will spend 10 times as much time bug fixing. Uncomment this as your first thing you do to fix this.
use feature 'say';
You never use this. You could for example replace print MATCH "$line\n" with say MATCH $line, which is slightly more concise.
my $file1 = "/home/user/domain_file" or die " FIle not found\n";
my $file2 = "/home/user/email_address_file" or die " FIle not found\n";
This is very incorrect. You are placing a condition on the creation of a variable. If the condition fails, does the variable exist? Don't do this. I assume this is to check if the file exists, but that is not what this does. To check if a file exists, you can use -e, documented as perldoc "-X" (various file tests).
Furthermore, a statement in the form of a string, "/home/user..." is TRUE ("truthy"), as far as Perl conditions are concerned. It is only false if it is "0" (zero), "" (empty) or undef (undefined). So your or clause will never be executed. E.g. "foo" or die will never die.
Lastly, this test is quite meaningless, as you will be testing this in your open statement later on anyway. If the file does not exist, the open will fail and your program will die.
my $match = open(MATCH, ">matching_domain") || die;
This is also very incorrect. First off, you never use the $match variable. Secondly, I bet it does not contain what you think it does. (it contains a boolean which states whether open was successful or not, see perldoc -f open) Thirdly, again, don't put conditions on my declarations of variables, it is a bad idea.
What this statement really means is that $match will contain either the return value of the open, or the return value of die. This should probably be simply:
open my $match, ">", "matching_domain" or die "Cannot open '$match': $!;
Also, use the three argument open with explicit open MODE, and use lexical file handles, like you have done elsewhere.
And one more thing on top of all the stuff I've already badgered you with: I don't recommend hard coding output files for small programs like this. If you want to redirect the output, use shell redirection: perl foo.pl > output.txt. I think this is what has prompted you to think something is wrong with your code: You don't see the output.
Other than that, your code is fine, as near as I can tell. You may want to chomp the lines from the domain file, but it should not matter. Also remember that indentation is a good thing, and it helps you read your code. I mentioned this in a comment, but it was removed for some reason. It is important though.
Good luck!

This assumes that the lines labeled File1 are in the file pointed to by $file1 and the lines labeled File2 are in the file pointed to by $file2.
You have your variables swapped. You want to match what is in $line against $_, not the other way around:
# loop on the file to be searched
open( my $data2, '<', $file2 ) or die "Could not open '$file2' $!\n";
while ( my $line = <$data2> ) {
chomp $line;
foreach (#wrd) {
if (/\#$line$/) {
print MATCH "$_\n";
}
}
}
You should un-comment the warnings and strict lines:
use strict;
use warnings;
warnings shows you that the or die checks are not really working the way you intended in the file name assignment statements. Just use :
my $file1 = "/home/user/domain_file";
my $file2 = "/home/user/email_address_file";
You are already doing the checks where they belong (on open).

Related

Check whether a field from a line of text line matches a value

I have been using the following Perl code to extract text from multiple text files. It works fine.
Example of a couple of lines in one of the input files:
Fa0/19 CUTExyz notconnect 129 half 100 10/100BaseTX
Fa0/22 xyz MLS notconnect 1293 half 10 10/100BaseTX
What I need is to match the numbers in each line exactly (i.e. 129 is not matched by 1293) and print the corresponding lines.
It would also be nice to match a range of numbers leaving specific numbers out i.e. match 2 through 10 but not 11 the 12 through 20
#!/perl/bin/perl
use warnings;
my #files = <c:/perl64/files/*>;
foreach $file ( #files ) {
open( FILE, "$file" );
while ( $line = <FILE> ) {
print "$file $line" if $line =~ /123/n;
}
close FILE;
}
Thank you for the suggestions, but can it can be done using the code structure above?
I suggest that you take a look at perldoc perlre.
You need to anchor your regex pattern. The easiest way is probably using \b which is a zero-width boundary between alphanumerics and non-alphanumerics.
#!/perl/bin/perl
use warnings;
use strict;
foreach my $file ( glob "c:/perl64/files/*" ) {
open( my $input, '<', $file ) or die $!;
while (<$input>) {
print "$file $_" if m/\b123\b/;
}
close $input;
}
Note - you should use three-argument open with lexical file handles as above, because it is better practice.
I've also removed the n pattern modifier, as it appears redundant.
Following your edit though, to give us some source data. I'd suggest the solution is not to use a regex - your source data looks space delimited. (Maybe those are tabs?).
So I'd suggest you're better off using split and selecting the field you want, and testing it numerically, because you mention matching ranges. This is not a good fit for regexes because they don't understand the numeric content.
Instead:
while ( <$input> ) {
print if (split)[-4] == 129;
}
Note - I use -4 in the split, which indexes from the end of the list.
This is because column 3 contains spaces, so splitting on whitespace is going to produce the wrong result unless we count down from the end of the array. Using a negative index we get the right field each time.
If your data is tab separated then you could use chomp and split /\t/. Or potentially split on /\s{2,}/ to split on 2-or-more spaces
But by selecting the field, you can do numeric tests on it, like
if $fields[-4] > 100 and $fields[-4] < 200
etc.
I hope you don't get the answers you're asking for, which discard best practice because of your unfamiliarity with Perl. It is inappropriate to ask how to write an ugly solution because proper Perl is beyond your reach
As has been said repeatedly on this site, if you don't know how to do a job then you should hire someone who does know and pay them for their work. No other profession that I know has the expectation of getting quality work done for free
Here's a few notes on your code. Wherever you have learned your techniques, you have been looking at a very outdated resource
Do you really have a root directory perl, so that your compiler is /perl/bin/perl? That's very unusual, and there is no need to use a shebang line in Windows
You must always add use strict and use warnings 'all' at the top of every Perl program you write, and declare all of your variables using my as close as possible to their first point of use. For some reason you do this with #files but not with $file
It is better to replace <c:/perl64/files/*> with glob 'C:/perl64/files/*'. Otherwise the code is less clear because Perl overloads the <> operator
Don't put variable names inside double quotes. It is unnecessary at best, and may cause bugs. So "$file" should be $file
Always use the three-parameter version of open, so that the second parameter is the open mode
Don't use global file handles. And always test whether the file has been opened correctly, dying with a message including $!—the reason for the failure—if the open fails
open( FILE, "$file" )
should be something like
open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!}
Don't rely on regex patterns for everything. In this case it looks like split would be a better option, or perhaps unpack if your records have fixed-width fields. In my solution below I have used split on "more than one space", but if your real data is different from what you have shown (tab-delimited?) then this is not going to work
Note that Fa0/129 will also be matched by your current approach
This Perl program filters your data, printing lines where the fourth field $lines[3] (delineated by more than one whitespace character) is numerically equal to 129
The output shown is produced when the input is the single file splitn.txt, containing the data shown in your question
use strict;
use warnings 'all';
for my $file ( glob 'C:/perl64/files/*' ) {
open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};
while ( my $line = <$fh> ) {
chomp;
my #fields = split /\s\s+/, $line;
print "$file $line" if $fields[3] == 129;
}
}
output
splitn.txt Fa0/19 CUTExyz notconnect 129 half 100 10/100BaseTX
Your question is unclear. When you say:
What I need is to match numbers in the on each line exactly
That could mean a couple of things. It could mean that each line contains nothing but a single number which you want to match. In that case, using == is probably better than using a regular expression. Or it could mean that you have lots of text on a line and you only want to match complete numbers. In that case you should use \b (the "word boundary" anchor) - /\b123\b/.
If you're clearer in your questions (perhaps by giving us sample input) then people won't have to guess at your meaning.
A few more points on your code:
Always include both use strict and use warnings.
Always check the return value from open() and take appropriate action on failure.
Use lexical filehandles and 3-arg version of open().
No need to quote $file in your open() call.
Using $_ can simplify your code.
/n on the match operator has no effect unless your regex contains parentheses.
Putting that all together (and assuming my second interpretation of your question is correct), your code could look like this:
#!/perl/bin/perl
use strict;
use warnings;
my #files = <c:/perl64/files/*>;
foreach my $file (#files) {
open my $file_h, '<', $file
or die "Can't open $file: $!";
while (<$file_h>) {
print "$file $_\n" if /\b123\b/;
}
# No need to close $file_h as it is closed
# automatically when the variable goes out
# of scope.
}

perl: canot open file within a loop

I am trying to read in a bunch of similar files and process them one by one. Here is the code I have. But somehow the perl script doesn't read in the files correctly. I'm not sure how to fix it. The files are definitely readable and writable by me.
#!/usr/bin/perl
use strict;
use warnings;
my #olap_f = `ls /full_dir_to_file/*txt`;
foreach my $file (#olap_f){
my %traits_h;
open(IN,'<',$file) || die "cannot open $file";
while(<IN>){
chomp;
my #array = split /\t/;
my $trait = $array[4];
$traits_h{$trait} ++;
}
close IN;
}
When I run it, the error message (something like below) showed up:
cannot open /full_dir_to_file/a.txt
You have newlines at the end of each filename:
my #olap_f = `ls ~dir_to_file/*txt`;
chomp #olap_f; # Remove newlines
Better yet, use glob to avoid launching a new process (and having to trim newlines):
my #olap_f = glob "~dir_to_file/*txt";
Also, use $! to find out why a file couldn't be opened:
open(IN,'<',$file) || die "cannot open $file: $!";
This would have told you
cannot open /full_dir_to_file/a.txt
: No such file or directory
which might have made you recognize the unwanted newline.
I'll add a quick plug for IO::All here. It's important to know what's going on under the hood but it's convenient sometimes to be able to do:
use IO::All;
my #olap_f = io->dir('/full_dir_to_file/')->glob('*txt');
In this case it's not shorter than #cjm's use of glob but IO::All does have a few other convenient methods for working with files as well.

File truncated, when opened in Perl

Im new to perl, so sorry if this is obvious, but i looked up how to open a file, and use the flags, but for the life of me they dont seem to work right I narrowed it down to these lines of code.
if ($flag eq "T"){
open xFile, ">" , "$lUsername\\$openFile";
}
else
{
open xFile, ">>", "$lUsername\\$openFile";
}
Both of these methods seem to delete the contents of my file. I also checked if the flag is formatted correctly and it is, i know for a fact ive gone down both conditions.
EDIT: codepaste of a larger portion of my code http://codepaste.net/n52sma
New to Perl? I hope you're using use strict and use warnings.
As other's have stated, you should be using a test to make sure your file is open. However, that's not really the problem here. In fact, I used your code, and it seems to work fine for me. Maybe you should try printing some debugging messages to see if this is doing what you think it's doing:
use strict;
use warnings;
use autodie; #Will stop your program if the "open" doesn't work.
my $lUsername = "ABaker";
my $openFile = "somefile.txt";
if ($flag eq "T") {
print qq(DEBUG: Flag = "$flag": Deleting file "$lUsername/$openFile");
open xFile, ">" , "$lUsername/$openFile";
}
else {
print qq(DEBUG: Flag = "$flag": Appending file "$lUsername/$openFile");
open xFile, ">>", "$lUsername/$openFile";
}
You want to use strict and warnings in order to make sure you're not having issues with variable names. The use strict forces you to declare your variables first. For example, are you setting $Flag, but then using $flag? Maybe $flag is set the first time through, but you're setting $Flag the second time through.
Anyway, the DEBUG: statements will give you a better idea of what your error could be.
By the way, in Perl, you're checking if $flag is set to T and not t. If you want to test against both t and T, test whether uc $flag eq 'T' and not just $flag eq 'T'.
#Ukemi
I reformated to comply with use strict, i also made print statements to make sure i was trunctating when i want to, and not when i dont. It still is deleting the file. Although now sometimes its simply not writing, im going to give a larger portion of my code in a link, id really appreciate it if you gave it a once over.
Are you seeing it say Truncating, but the file is empty? Are you sure the file already existed? There's a reason why I put the flag and everything in my debug statements. The more you print, the more you know. Try the following section of code:
$file = "lUsername/$openFile" #Use forward slashes vs. back slashes.
if ($flag eq "T") {
print qq(Flag = "$flag". Truncating file "$file"\n);
open $File , '>', $file
or die qq(Unable to open file "$file" for writing: $!\n);
}
else {
print qq(Flag = "$flag". Appending to file "$file"\n);
if (not -e $file) {
print qq(File "$file" does not exist. Will create it\n");
}
open $File , '>>', $file
or die qq(Unable to open file "$file" for appending: $!\n);
}
Note I'm printing out the flag and the name of the file in quotes. This will allow me to see if there are any hidden characters in my file name.
I'm using the qq(...) method to quote strings, so I can use the quotation marks in my print statements.
Also note I'm checking for the existence of the file when I truncate. This way, I make sure the file actually exists.
This should point out any possible errors in your logic. The other thing you can do is to stop your program when you finish writing out the file and verify that the file was written out as expected.
print "Write to file now:\n";
my $writeToFile = <>;
printf $File "$writeToFile";
close $File;
print "DEBUG: Temporary stop. Examine file\n";
<STDIN>; #DEBUG:
Now, if you see it saying it's appending to the file, and the file exists, and you still see the file being overwritten, we'll know the problem lies in your actual open xFile, ">>" $file statement.
You should use the three-argument-version of open, lexical filehandles and check wether there might have been an error:
# Writing to file (clobbering it if it exists)
open my $file , '>', $filename
or die "Unable to write to file '$filename': $!";
# Appending to file
open my $file , '>>', $filename
or die "Unable to append to file '$filename': $!";
>> does not clobber or truncate. Either you ended up in the "then" clause when you expected to be in the "else" clause, or the problem is elsewhere.
To check what $flag contains:
use Data::Dumper;
local $Data::Dumper::Useqq = 1;
print(Dumper($flag));
For your reference I have mentioned some basic file handling techniques below.
open FILE, "filename.txt" or die $!;
The command above will associate the FILE filehandle with the file filename.txt. You can use the filehandle to read from the file. If the file doesn't exist - or you cannot read it for any other reason - then the script will die with the appropriate error message stored in the $! variable.
open FILEHANDLE, MODE, EXPR
The available modes are the following:
read < #this mode will read the file
write > # this mode will create the new file. If the file already exists it will truncate and overwrite.
append >> #this will append the contents if the file already exists,else it will create new one.
if you have confusion on this, you can use the module called File::Slurp;
I have mentioned the sample codes using File::Slurp module.
use strict;
use File::Slurp;
my $read_mode=read_file("test.txt"); #to read file contents
write_file("test2.txt",$read_mode); #to write file
my #all_files=read_dir("/home/desktop",keep_dot_dot=>0); #read a dir
write_file("test2.txt",{append=>1},"#all_files"); #Append mode

How to open/join more than one file (depending on user input) and then use 2 files simultaneously

EDIT: Sorry for the misunderstanding, I have edited a few things, to hopefully actually request what I want.
I was wondering if there was a way to open/join two or more files to run the rest of the program on.
For example, my directory has these files:
taggedchpt1_1.txt, parsedchpt1_1.txt, taggedchpt1_2.txt, parsedchpt1_2.txt etc...
The program must call a tagged and parsed simultaneously. I want to run the program on both of chpt1_1 and chpt1_2, preferably joined together in one .txt file, unless it would be very slow to do so. For instance run what would be accomplished having two files:
taggedchpt1_1_and_chpt1_2 and parsedchpt1_1_and_chpt1_2
Can this be done through Perl? Or should I just combine the text files myself(or automate that process, making chpt1.txt which would include chpt1_1, chpt1_2, chpt1_3 etc...)
#!/usr/bin/perl
use strict;
use warnings FATAL => "all";
print "Please type in the chapter and section NUMBERS in the form chp#_sec#:\n"; ##So the user inputs 31_3, for example
chomp (my $chapter_and_section = "chpt".<>);
print "Please type in the search word:\n";
chomp (my $search_key = <>);
open(my $tag_corpus, '<', "tagged${chapter_and_section}.txt") or die $!;
open(my $parse_corpus, '<', "parsed${chapter_and_section}.txt") or die $!;
For the rest of the program to work, I need to be able to have:
my #sentences = <$tag_corpus>; ##right now this is one file, I want to make it more
my #typeddependencies = <$parse_corpus>; ##same as above
EDIT2: Really sorry about the misunderstanding. In the program, after the steps shown, I do 2 for loops. Reading through the lines of the tagged and parsed.
What I want is to accomplish this with more files from the same directory, without having to re-input the next files. (ie. I can run taggedchpt31_1.txt and parsedchpt31_1.txt...... I want to run taggedchpt31 and parsedchpt31 - which includes ~chpt31_1, ~chpt31_2, etc...)
Ultimately, it would be best if I joined all the tagged files and all the parsed files that have a common chapter (in the end still requiring only two files I want to run) but not have to save the joined file to the directory... Now that I put it into words, I think I should just save files that include all the sections.
Sorry and Thanks for all your time! Look at FMc's breakdown of my question for more help.
You could iterate over the file names, opening and reading each one in turn. Or you could produce an iterator that knows how to read lines from sequence of files.
sub files_reader {
# Takes a list of file names and returns a closure that
# will yield lines from those files.
my #handles = map { open(my $h, '<', $_) or die $!; $h } #_;
return sub {
shift #handles while #handles and eof $handles[0];
return unless #handles;
return readline $handles[0];
}
}
my $reader = files_reader('foo.txt', 'bar.txt', 'quux.txt');
while (my $line = $reader->()) {
print $line;
}
Or you could use Perl's built-in iterator that can do the same thing:
local #ARGV = ('foo.txt', 'bar.txt', 'quux.txt');
while (my $line = <>) {
print $line;
}
Edit in response to follow-up questions:
Perhaps it would help to break your problem down into smaller sub-tasks. As I understand it, you have three steps.
Step 1 is to get some input from the user -- perhaps a directory name, or maybe a couple of file name patterns (taggedchpt and parsedchpt).
Step 2 is for the program to find all of the relevant file names. For this task, glob() or readdir()might be useful. There are many questions on StackOverflow related to such issues. You'll end up with two lists of file names, one for the tagged files and one for the parsed files.
Step 3 is to process the lines across all of the files in each of the two sets. Most of the answers you have received, including mine, will help you with this step.
No one has mentioned the #ARGV hack yet? Ok, here it is.
{
local #ARGV = ('taggedchpt1_1.txt', 'parsedchpt1_1.txt', 'taggedchpt1_2.txt',
'parsedchpt1_2.txt');
while (<ARGV>) {
s/THIS/THAT/;
print FH $_;
}
}
ARGV is a special filehandle that iterates through all the filenames in #ARGV, closing a file and opening the next one as necessary. Normally #ARGV contains the command-line arguments that you passed to perl, but you can set it to anything you want.
You're almost there... this is a bit more efficient than discrete opens on each file...
#!/usr/bin/perl
use strict;
use warnings FATAL => "all";
print "Please type in the chapter and section NUMBERS in the for chp#_sec#:\n";
chomp (my $chapter_and_section = "chpt".<>);
print "Please type in the search word:\n";
chomp (my $search_key = <>);
open(FH, '>output.txt') or die $!; # Open an output file for writing
foreach ("tagged${chapter_and_section}.txt", "parsed${chapter_and_section}.txt") {
open FILE, "<$_" or die $!; # Read a filename (from the array)
foreach (<FILE>) {
$_ =~ s/THIS/THAT/g; # Regex replace each line in the open file (use
# whatever you like instead of "THIS" &
# "THAT"
print FH $_; # Write to the output file
}
}

How can I create a new file using a variable value as the name in Perl?

Eg:
$variable = "10000";
for($i=0; $i<3;$i++)
{
$variable++;
$file = $variable."."."txt";
open output,'>$file' or die "Can't open the output file!";
}
This doesn't work. Please suggest a new way.
Everyone here has it right, you are using single quotes in your call to open. Single quotes do not interpolate variables into the quoted string. Double quotes do.
my $foo = 'cat';
print 'Why does the dog chase the $foo?'; # prints: Why does the dog chase the $foo?
print "Why does the dog chase the $foo?"; # prints: Why does the dog chase the cat?
So far, so good. But, the others have neglected to give you some important advice about open.
The open function has evolved over the years, as has the way that Perl works with filehandles. In the old days, open was always called with the mode and the file name combined in the second argument. The first argument was always a global filehandle.
Experience showed that this was a bad idea. Combining the mode and the filename in one argument created security problems. Using global variables, well, is using global variables.
Since Perl 5.6.0 you can use a 3 argument form of open that is much more secure, and you can store your filehandle in a lexically scoped scalar.
open my $fh, '>', $file or die "Can't open $file - $!\n";
print $fh "Goes into the file\n";
There are many nice things about lexical filehandles, but one excellent property is that they are automatically closed when their refcount drops to 0 and they are destroyed. There is no need to explicitly close them.
Something else worth noting is that it is considered by most of the Perl community that it is a good idea to always use the strict and warnings pragmas. Using them helps catch many bugs early in the development process and can be a huge time saver.
use strict;
use warnings;
for my $base ( 10_001..10_003 ) {
my $file = "$base.txt";
print "file: $file\n";
open my $fh,'>', $file or die "Can't open the output file: $!";
# Do stuff with handle.
}
I simplified your code a bit too. I used the range operator to generate your base numbers for the file names. Since we are working with numbers and not strings, I was able to use the _, as the thousands separator to improve readability without impacting the final result. Finally, I used an idiomatic perl for loop instead of the C style for you had.
I hope you find this helpful.
use double quotes: ">$file". single quotes will not interpolate your variable.
$variable = "10000";
for($i=0; $i<3;$i++)
{
$variable++;
$file = $variable."."."txt";
print "file: $file\n";
open $output,">$file" or die "Can't open the output file!";
close($output);
}
The problem is that you're using single quotes for the second argument to open, and single-quoted strings do not interpolate variables mentioned in them. Perl interpreted your code as though you wanted to open a file that really had a dollar sign for the first character of its name. (Check your disk; you should see an empty file named $file there.)
You can avoid the issue by using the three-argument version of open:
open output, '>', $file
Then the file-name argument can't accidentally interfere with the open-mode argument, and there's no unnecessary variable interpolation or concatenation.
$variable = "10000";
for($i=0; $i<3;$i++)
{
$variable++;
$file = $variable . 'txt';
open output,'>$file' or die "Can't open the output file!";
}
this works
1.txt
2.txt and so on ..
Use a file handle:
my $file = "whatevernameyouwant";
open (MYFILE, ">>$file");
print MYFILE "Bob\n";
close (MYFILE);
print '$file' yields $file, whereas print "$file" yields whatevernameyouwant.
You almost have it right, but there are a couple of issues.
1 - You need to use double quotes around the file you're opening.
open output,">$file" or die[...]
2 - Minor niggles, you don't close the files afterwards.
I'd rewrite your code something like this:
#!/usr/bin/perl
$variable = "1000";
for($i=0; $i<3;$i++) {
$variable++;
$file = $variable."."."txt";
open output,">$file" or die "Can't open the output file!";
}