I have a log file like below
ID: COM-1234
Program: Swimming
Name: John Doe
Description: Joined on July 1st
------------------------------ID: COM-2345
Program: Swimming
Name: Brock sen
Description: Joined on July 1st
------------------------------ID: COM-9876
Program: Swimming
Name: johny boy
Description: Joined on July 1st
------------------------------ID: COM-9090
Program: Running
Name: justin kim
Description: Good Record
------------------------------
and I want to group it based on the Program (Swimming, Running, etc.) and get a display like this:
PROGRAM: Swimming
==>ID
COM-1234
COM-2345
COM-9876
PROGRAM: Running
==>ID
COM-9090
I'm very new to Perl and I wrote the below piece (incomplete).
#!/usr/bin/perl
use Data::Dumper;

$/ = "%%%%";
open( AFILE, "D:\\mine\\out.txt" );
while (<AFILE>)
{
    my @temp = split( /-{20,}/, $_ );
}
close(AFILE);
my %hash;    # not sure how to build this
print Dumper( \%hash );
I have read in Perl tutorials that a hash holds unique keys, each of which can point to multiple values, but I'm not sure how to make use of that here.
I'm able to read a file and store it into a hash, but I'm unsure how to process it into the format shown above. Any help is really appreciated. Thanks.
I always prefer to write programs like this so they read from STDIN as that makes them more flexible.
I'd do it like this:
#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

# Read one "chunk" of data at a time
local $/ = '------------------------------';

# A hash to store our results.
my %program;

# While there's data on STDIN...
while (<>) {
    # Remove the end of record marker
    chomp;

    # Skip any empty records
    # (i.e. ones with no printable characters)
    next unless /\S/;

    # Extract the information that we want with a regex
    my ($id, $program) = /ID: (.+?)\n.*Program: (.+?)\n/s;

    # Build a hash of arrays containing our data
    push @{$program{$program}}, $id;
}

# Now we have all the data we need, so let's display it.
# Keys in %program are the program names
foreach my $p (keys %program) {
    say "PROGRAM: $p\n==>ID";

    # $program{$p} is a reference to an array of IDs
    say "\t$_" for @{$program{$p}};
    say '';
}
Assuming this is in a program called programs.pl and the input data is in programs.txt, then you'd run it like this:
C:/> programs.pl < programs.txt
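Note that the read loop above uses the diamond operator (<>), which reads from any file names given on the command line and only falls back to STDIN when there are none. So the same script can also be run with the file name as an argument:
C:/> programs.pl programs.txt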
Always put use warnings; and use strict; at the top of the program, and always use the three-argument form of open:
open my $fh, "<", "D:\\mine\\out.txt" or die "Can't open file: $!";
my %hash;
while (<$fh>) {
    if (/ID/) {
        my $nxt = <$fh>;
        s/.*?ID: //g;
        $hash{"$nxt==>ID \n"} .= " $_";
    }
}
print %hash;
Output
Program: Running
==>ID
COM-9090
Program: Swimming
==>ID
COM-1234
COM-2345
COM-9876
In your input file, the Program line comes right after the ID line, so I read the next line with my $nxt = <$fh>; the program name is then stored in the $nxt variable.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my %hash = ();
open my $IN, "<", "your file name here" or die "Error: $!\n";
while (<$IN>) {
    if ($_ =~ m/^\s*-*ID:\s*COM/) {
        (my $id) = ($_ =~ m/\s*ID:\s*(.*)/);
        my $prog_name = <$IN>;
        chomp $prog_name;
        $prog_name =~ s/Program/PROGRAM/;
        $hash{$prog_name} = [] unless $hash{$prog_name};
        push @{$hash{$prog_name}}, $id;
    }
}
close $IN;
print Dumper(\%hash);
Output will be:
$VAR1 = {
          'PROGRAM: Running' => [
                                  'COM-9090'
                                ],
          'PROGRAM: Swimming' => [
                                   'COM-1234',
                                   'COM-2345',
                                   'COM-9876'
                                 ]
        };
Let's look at these two lines:
$hash{$prog_name} = [] unless $hash{$prog_name};
push @{$hash{$prog_name}}, $id;
The first line initializes the value to an empty array reference if that key does not yet hold one. The second line pushes the ID onto the end of that array (and it does so regardless of whether the first line ran).
Moreover, the first line is not mandatory. Perl knows what you mean if you just write push @{$hash{$prog_name}}, $id; thanks to autovivification it interprets this as "go to the value of this key" and creates the array reference if it wasn't already there. The value is then treated as an array and $id is pushed onto the list.
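Here is a small, self-contained demonstration of that autovivification behaviour, using the same key and IDs as above:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my %hash;
# No explicit initialisation: the first push autovivifies an
# array reference under the 'PROGRAM: Swimming' key.
push @{ $hash{'PROGRAM: Swimming'} }, 'COM-1234';
push @{ $hash{'PROGRAM: Swimming'} }, 'COM-2345';
print Dumper(\%hash);
# prints a structure like: { 'PROGRAM: Swimming' => [ 'COM-1234', 'COM-2345' ] }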
Related
Friends, I need help. Following is my input text file:
Andrew UK
Cindy China
Rupa India
Gordon Australia
Peter New Zealand
I want to convert the above into a hash and write it back to a file when the records exist in a directory. I have tried the following (it does not work).
#!/usr/perl/5.14.1/bin/perl
use strict;
use warnings;
use Data::Dumper;

my %hash = ();
my $file = ".../input_and_output.txt";
my $people;
my $country;

open (my $fh, "<", $file) or die "Can't open the file $file: ";
my $line;
while (my $line =<$fh>) {
    my ($people) = split("", $line);
    $hash{$people} = 1;
}
foreach my $people (sort keys %hash) {
    my @country = $people;
    foreach my $c (@country) {
        my $c_folder = `country/test1_testdata/17.26.6/$c/`;
        if (-d $cad_root){
            print "Exit\n";
        } else {
            print "NA\n";
        }
    }
}
This is the primary problem:
my ($people) = split("", $line);
You are splitting using an empty string, and you are assigning the return value to a single variable (which will just end up with the first character of each line).
Instead, you should split on ' ' (a single space character which is a special pattern):
As another special case, ... when the PATTERN is either omitted or a string composed of a single space character (such as ' ' or "\x20" , but not e.g. / /). In this case, any leading whitespace in EXPR is removed before splitting occurs, and the PATTERN is instead treated as if it were /\s+/; in particular, this means that any contiguous whitespace (not just a single space character) is used as a separator.
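For example, here is a quick sketch of the difference between ' ' and a literal / / pattern, using one of the names from the sample data:
use strict;
use warnings;

my $line = "  Peter   New Zealand";
my @a = split ' ', $line;   # special case: leading whitespace dropped, runs of
                            # whitespace collapsed -> ('Peter', 'New', 'Zealand')
my @b = split / /, $line;   # literal single space -> ('', '', 'Peter', '', '', 'New', 'Zealand')
print scalar(@a), " fields vs ", scalar(@b), " fields\n";   # 3 fields vs 7 fields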
Limit the number of fields returned to ensure the integrity of country names with spaces:
#!/usr/bin/env perl

use strict;
use warnings;

my @people;

while (my $line = <DATA>) {
    $line =~ /\S/ or next;
    $line =~ s/\s+\z//;
    push @people, [ split ' ', $line, 2 ];
}

use YAML::XS;
print Dump \@people;
__DATA__
Andrew UK
Cindy China
Rupa India
Gordon Australia
Peter New Zealand
The entries are added to an array so that 1) the input order is preserved, and 2) two people with the same name but from different countries do not result in one entry being lost.
If the order is not important, you could just use a hash keyed on country names with people's names in an array reference for each entry. For now, I am going to assume order matters (it would help us help you if you put more effort into formulating a clear question).
One option is to now go through the list of person-country pairs, and print all those pairs for which the directory country/test1_testdata/17.26.6/$c/ exists (incidentally, in your code you have
my $c_folder = `country/test1_testdata/17.26.6/$c/`;
That will try to execute a program called country/test1_testdata/17.26.6/$c/ and save its output in $c_folder if it produces any. The moral of the story: in programming, precision matters. Just because ` looks like ', that doesn't mean you can use one to mean the other.)
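What was presumably intended is an ordinary double-quoted string followed by a -d test, roughly like this (the path comes straight from the question, so treat it as illustrative):
my $c_folder = "country/test1_testdata/17.26.6/$c/";   # plain interpolated string, no command execution
if (-d $c_folder) {
    print "Exit\n";
}
else {
    print "NA\n";
}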
Given that your question is focused on hashes, I use an array of references to anonymous hashes to store the list of people-country pairs in the code below. I cache the result of the lookup to reduce the number of times you need to hit the disk.
#!/usr/bin/env perl

use strict;
use warnings;

@ARGV == 2 ? run( @ARGV )
           : die_usage()
;

sub run {
    my $people_data_file = shift;
    my $country_files_location = shift;

    open my $in, '<', $people_data_file
        or die "Failed to open '$people_data_file': $!";

    my @people;
    my %countries;

    while (my $line = <$in>) {
        next unless $line =~ /\S/; # ignore lines consisting of blanks
        $line =~ s/\s+\z//;        # remove all trailing whitespace

        my ($name, $country) = split ' ', $line, 2;
        push @people, { name => $name, country => $country };
        $countries{ $country } = undef;
    }

    # At this point, @people has a list of person-country pairs.
    # We are going to use %countries to reduce the number of
    # times we need to check the existence of a given directory,
    # assuming that the directory tree is stable while this program
    # is running.

    PEOPLE:
    for my $person ( @people ) {
        my $country = $person->{country};
        if ($countries{ $country }) {
            print join("\t", $person->{name}, $country), "\n";
        }
        elsif (-d "$country_files_location/$country/") {
            $countries{ $country } = 1;
            redo PEOPLE;
        }
    }
}

sub die_usage {
    die "Need data file name and country files location\n";
}
Now, there are a bazillion variations on this, which is why it is important for you to formulate a clear and concise question, so that the people trying to help can answer your specific questions instead of each coming up with his/her own solution to the problem as they see it. For example, one could also do this:
#!/usr/bin/env perl

use strict;
use warnings;

@ARGV == 2 ? run( @ARGV )
           : die_usage()
;

sub run {
    my $people_data_file = shift;
    my $country_files_location = shift;

    open my $in, '<', $people_data_file
        or die "Failed to open '$people_data_file': $!";

    my %countries;

    while (my $line = <$in>) {
        next unless $line =~ /\S/; # ignore lines consisting of blanks
        $line =~ s/\s+\z//;        # remove all trailing whitespace

        my ($name, $country) = split ' ', $line, 2;
        push @{ $countries{$country} }, $name;
    }

    for my $country (keys %countries) {
        -d "$country_files_location/$country"
            or delete $countries{ $country };
    }

    # At this point, %countries maps each country for which
    # we have a data file to a list of people. We can then
    # print those quite simply so long as we don't care about
    # replicating the original order of lines from the original
    # data file. People's names will still be sorted in order
    # of appearance in the original data file for each country.

    while (my ($country, $people) = each %countries) {
        for my $person ( @$people ) {
            print join("\t", $person, $country), "\n";
        }
    }
}

sub die_usage {
    die "Need data file name and country files location\n";
}
If what you want is a counter of names in a hash, then I've got you, buddy!
I won't attempt the rest of the code, because you are checking a folder of records that I don't have access to, so I can't troubleshoot anything more than this.
I see one of your problems. Look at this:
#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say'; # Really like using say instead of print because there is no need for a newline.

my $file = 'input_file.txt';
my $fh;            # A filehandle.
my %hash;
my $people;
my $country;
my $line;

unless (open($fh, '<', $file)) { die "Could not open file $file because $!" }

while ($line = <$fh>)
{
    ($people, $country) = split(/\s{2,}/, $line); # splitting on at least two spaces
    say "$people \t $country";  # Just printing out the columns of the file, i.e. people and country.
    $hash{$people}++;           # Just counting all the people in the hash,
                                # to see how many unique names there are (e.g. is there more than one Cindy?).
}

say "\nNow I'm just sorting the hash of people by names.";
foreach (sort { $a cmp $b } keys %hash)
{
    say "$_ => $hash{$_}"; # Based on your file, the counter is at 1 because nobody has the same name.
}
Here is the output. As you can see, I fixed the problem by splitting on at least two whitespace characters, so the country names don't get cut off.
Andrew UK
Cindy China
Rupa India
Gordon Australia
Peter New Zealand
Andrew United States
Now I'm just sorting the hash of people by names.
Andrew => 2
Cindy => 1
Gordon => 1
Peter => 1
Rupa => 1
I added another Andrew to the file; this Andrew is from the United States, as you can see. I see one of your problems. Look at this:
my ($people) = split("", $line);
You are splitting on characters as there is no space between those quotes.
If you look at this change now, you are splitting on at least one space.
my ($people) = split(" ", $line);
I'm new to Perl, so please bear with me in my ignorance. What I'm trying to do is read a file (already using the File::Slurp module) and create variables from the data in the file. Currently I have this setup:
use File::Slurp;
my @targets = read_file("targetfile.txt");
print @targets;
Within that target file, I have the following bits of data:
id: 123456789
name: anytownusa
1.2.3.4/32
5.6.7.8/32
The first line is an ID, the second line is a name, and all successive lines will be IP addresses (maximum length of a few hundred).
So my goal is to read that file and create variables that look something like this:
$var1="123456789";
$var2="anytownusa";
$var3="1.2.3.4/32,5.6.7.8/32,etc,etc,etc,etc,etc";
** Note that all the IP addresses end up grouped together into a single variable, separated by commas.
File::Slurp will read the complete file data in one go. This might cause an issue if the file size is very big. Let me show you a simple approach to this problem.
Read file line by line using while loop
Check line number using $. and assign line data to respective variable
Store ips in an array and at the end print them using join
Note: If you have to alter the line data then use search and replace in the respective conditional block before assigning the line data to the variable.
Code:
#!/usr/bin/perl
use strict;
use warnings;

my ($id, $name, @ips);

while (<DATA>) {
    chomp;
    if ($. == 1) {
        $id = $_;
    }
    elsif ($. == 2) {
        $name = $_;
    }
    else {
        push @ips, $_;
    }
}

print "$id\n";
print "$name\n";
print join ",", @ips;
__DATA__
id: 123456789
name: anytownusa
1.2.3.4/32
5.6.7.8/32
As it has been noted, there is no reason to "slurp" the whole file into a variable. If nothing else, it only makes the processing harder.
Also, why not store the named labels in a hash? In this example:
my %identity = (id => 123456789, name => 'anytownusa');
The code below picks up the key names from the file, they aren't hard-coded.
Then
use warnings;
use strict;
use feature 'say';

my (@ips, %identity);

my $file = 'targetfile.txt';
open my $fh, '<', $file or die "Can't open $file: $!";

while (<$fh>)
{
    next if not /\S/;
    chomp;
    my ($m1, $m2) = split /:/;
    if ($m1 and $m2) { $identity{$m1} = $m2; }
    else             { push @ips, $m1; }
}

say "$_: $identity{$_}" for keys %identity;
say join ',', @ips;
If the line doesn't have a :, the split returns the whole line, which is the IP; it is stored in an array for later processing. Otherwise it returns the named pair, for 'id' and 'name'.
We first skip blank lines with next if not /\S/;, so the line must have some non-space characters and the else suffices, as there is always something in $m1. We also need to remove the newline with chomp.
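To make that concrete, here is what the split returns for each kind of line in the sample file (a small sketch):
my ($m1, $m2) = split /:/, 'name: anytownusa';   # ('name', ' anytownusa') - both defined,
                                                 # note the leading space kept in the value
my ($p1, $p2) = split /:/, '1.2.3.4/32';         # ('1.2.3.4/32', undef) - no colon,
                                                 # so the whole line lands in $p1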
Read the file into variables directly:
use Modern::Perl;

my ($id, $name, @ips) = (<DATA>, <DATA>, <DATA>);
chomp ($id, $name, @ips);
say $id;
say $name;
$" = ',';
say "@ips";
__DATA__
id: 123456789
name: anytownusa
1.2.3.4/32
5.6.7.8/32
Output:
id: 123456789
name: anytownusa
1.2.3.4/32,5.6.7.8/32
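The one non-obvious part of that snippet is the special variable $" (the list separator), which controls how an array is joined when interpolated into a double-quoted string; it defaults to a single space. A minimal illustration:
my @ips = ('1.2.3.4/32', '5.6.7.8/32');
{
    local $" = ',';      # change the list separator for this block only
    print "@ips\n";      # 1.2.3.4/32,5.6.7.8/32
}
print "@ips\n";          # back to the default: 1.2.3.4/32 5.6.7.8/32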
I am a beginner programmer who has been given a week-long assignment to build a complex program, but I am having a difficult time starting off. I have been given a set of data, and the goal is to separate it into two separate arrays by the second column, based on whether the letter is M or F.
This is the code I have thus far:
#!/usr/local/bin/perl
open (FILE, "ssbn1898.txt");
$x=<FILE>;
split/[,]/$x;
@array1=$y;
if @array1[2]="M";
print @array2;
else;
print @array3;
close (FILE);
How do I fix this? Please try to use the simplest terms possible; I started coding last week!
Thank You
First off - you split on comma, so I'm going to assume your data looks something like this:
one,M
two,F
three,M
four,M
five,F
six,M
There are a few problems with your code:
turn on strict and warnings. They warn you about possible problems with your code
open is better off written as open ( my $input, "<", $filename ) or die $!;
You only actually read one line from <FILE> - because if you assign it to a scalar $x it only reads one line.
you don't actually insert your value into either array.
So to do what you're basically trying to do:
#!/usr/local/bin/perl
use strict;
use warnings;

#define your arrays.
my @M_array;
my @F_array;

#open your file.
open (my $input, "<", 'ssbn1898.txt') or die $!;

#read file one line at a time - this sets the implicit variable $_ each loop,
#which is what we use for the split.
while ( <$input> ) {
    #remove linefeeds
    chomp;
    #capture values from either side of the comma.
    my ( $name, $id ) = split ( /,/ );
    #test if id is M. We _assume_ that if it's not, it must be F.
    if ( $id eq "M" ) {
        #insert it into our list.
        push ( @M_array, $name );
    }
    else {
        push ( @F_array, $name );
    }
}
close ( $input );

#print the results
print "M: @M_array\n";
print "F: @F_array\n";
You could probably do this more concisely - I'd suggest perhaps looking at hashes next, because then you can associate key-value pairs.
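For instance, the same loop with a single hash of arrays keyed on the second column might look roughly like this (an untested sketch, same assumed file layout):
my %by_gender;
open (my $input, "<", 'ssbn1898.txt') or die $!;
while ( <$input> ) {
    chomp;
    my ( $name, $id ) = split ( /,/ );
    # the key is whatever is in the second column ('M' or 'F')
    push @{ $by_gender{$id} }, $name;
}
close ( $input );
print "M: @{ $by_gender{'M'} }\n" if $by_gender{'M'};
print "F: @{ $by_gender{'F'} }\n" if $by_gender{'F'};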
There's a part function in List::MoreUtils that does exactly what you want.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

use List::MoreUtils 'part';

my ($f, $m) = part { (split /,/)[1] eq 'M' } <DATA>;

say "M: @$m";
say "F: @$f";
__END__
one,M,foo
two,F,bar
three,M,baz
four,M,foo
five,F,bar
six,M,baz
The output is:
M: one,M,foo
three,M,baz
four,M,foo
six,M,baz
F: two,F,bar
five,F,bar
#!/usr/bin/perl -w
use strict;
use Data::Dumper;

my @boys=();
my @girls=();
my $fname="ssbn1898.txt"; # I keep stuff like this in a scalar

open (FIN,"< $fname")
    or die "$fname:$!";

while ( my $line=<FIN> ) {
    chomp $line;
    my @f=split(",",$line);
    push @boys,$f[0]  if $f[1]=~ m/[mM]/;
    push @girls,$f[0] if $f[1]=~ m/[fF]/;
}

print Dumper(\@boys);
print Dumper(\@girls);

exit 0;

# Caveats:
# Code is not tested but should work and definitely shows the concepts
#
In fact the same thing...
#!/usr/bin/perl
use strict;
my (@m,@f);
while(<>){
    push (@m,$1) if(/(.*),M/);
    push (@f,$1) if(/(.*),F/);
}
print "M=@m\nF=@f\n";
Or a "perl -n" (=for all lines do) variant:
#!/usr/bin/perl -n
push (@m,$1) if(/(.*),M/);
push (@f,$1) if(/(.*),F/);
END { print "M=@m\nF=@f\n"; }
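If that -n version is saved as, say, bygender.pl (a made-up name), you'd run it with the data file as an argument; the -n switch on the shebang line wraps the code in an implicit while (<>) { ... } loop:
perl bygender.pl ssbn1898.txt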
I've got a small programme that basically processes lists of blast hits and checks whether there's overlap between the blast results, by iterating over the blast results (as hash keys) through hashes containing each blast list.
This involves processing each blast input file from @ARGV in the same way. Depending on what I'm trying to achieve, I might want to compare 2, 3 or 4 blast lists for gene overlap. I want to know how I can write the basic processing block as a subroutine that I can call for however many arguments @ARGV holds.
For example, the below works fine if I input 2 blast lists:
#!/usr/bin/perl -w
use strict;
use File::Slurp;
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;

if ($#ARGV != 1){
    die "Usage: intersect.pl <de gene list 1> <de gene list 2>\n"
}

my $input1 = $ARGV[0];
open my $blast1, '<', $input1 or die $!;
my $results1 = 0;
my (@blast1ID, @blast1_info, @split);

while (<$blast1>) {
    chomp;
    @split = split('\t');
    push @blast1_info, $split[0];
    push @blast1ID, $split[2];
    $results1++;
}
print "$results1 blast hits in $input1\n";

my %blast1;
push @{ $blast1{$blast1ID[$_]} }, [ $blast1_info[$_] ] for 0 .. $#blast1ID;
#print Dumper (\%blast1);

my $input2 = $ARGV[1];
open my $blast2, '<', $input2 or die $!;
my $results2 = 0;
my (@blast2ID, @blast2_info);

while (<$blast2>) {
    chomp;
    @split = split('\t');
    push @blast2_info, $split[0];
    push @blast2ID, $split[2];
    $results2++;
}

my %blast2;
push @{ $blast2{$blast2ID[$_]} }, [ $blast2_info[$_] ] for 0 .. $#blast2ID;
#print Dumper (\%blast2);
print "$results2 blast hits in $input2\n";
But I would like to be able to adjust it to cater for 3 or 4 blast list inputs. I imagine a subroutine would work best for this, invoked once for each input, and it might look something like this:
sub process {
    my $input$i = $ARGV[$i-1];
    open my $blast$i, '<', $input[$i] or die $!;
    my $results$i = 0;
    my (@blast$iID, @blast$i_info, @split);
    while (<$blast$i>) {
        chomp;
        @split = split('\t');
        push @blast$i_info, $split[0];
        push @blast$iID, $split[2];
        $results$i++;
    }
    print "$results$i blast hits in $input$i\n";
    print Dumper (\@blast$i_info);
    print Dumper (\@blast$iID);
}

# Call sub 'process' for every ARGV...
&process for 0 .. $#ARGV;
UPDATE:
I've removed the hash part from the last snippet.
The resultant data structure should be:
4 blast hits in <$input$i>
$VAR1 = [
'TCONS_00001332(XLOC_000827),_4.60257:9.53943,_Change:1.05146,_p:0.03605,_q:0.998852',
'TCONS_00001348(XLOC_000833),_0.569771:6.50403,_Change:3.51288,_p:0.0331,_q:0.998852',
'TCONS_00001355(XLOC_000837),_10.8634:24.3785,_Change:1.16613,_p:0.001,_q:0.998852',
'TCONS_00002204(XLOC_001374),_0.316322:5.32111,_Change:4.07226,_p:0.00485,_q:0.998852',
];
$VAR1 = [
'gi|50418055|gb|BC078036.1|_Xenopus_laevis_cDNA_clone_MGC:82763_IMAGE:5156829,_complete_cds',
'gi|283799550|emb|FN550108.1|_Xenopus_(Silurana)_tropicalis_mRNA_for_alpha-2,3-sialyltransferase_ST3Gal_V_(st3gal5_gene)',
'gi|147903202|ref|NM_001097651.1|_Xenopus_laevis_forkhead_box_I4,_gene_1_(foxi4.1),_mRNA',
'gi|2598062|emb|AJ001730.1|_Xenopus_laevis_mRNA_for_Xsox17-alpha_protein',
];
And the input:
TCONS_00001332(XLOC_000827),_4.60257:9.53943,_Change:1.05146,_p:0.03605,_q:0.998852 0.0 gi|50418055|gb|BC078036.1|_Xenopus_laevis_cDNA_clone_MGC:82763_IMAGE:5156829,_complete_cds
TCONS_00001348(XLOC_000833),_0.569771:6.50403,_Change:3.51288,_p:0.0331,_q:0.998852 0.0 gi|283799550|emb|FN550108.1|_Xenopus_(Silurana)_tropicalis_mRNA_for_alpha-2,3-sialyltransferase_ST3Gal_V_(st3gal5_gene)
TCONS_00001355(XLOC_000837),_10.8634:24.3785,_Change:1.16613,_p:0.001,_q:0.998852 0.0 gi|147903202|ref|NM_001097651.1|_Xenopus_laevis_forkhead_box_I4,_gene_1_(foxi4.1),_mRNA
TCONS_00002204(XLOC_001374),_0.316322:5.32111,_Change:4.07226,_p:0.00485,_q:0.998852 0.0 gi|2598062|emb|AJ001730.1|_Xenopus_laevis_mRNA_for_Xsox17-alpha_protein
You can't inject a variable value in the middle of a variable name. (Well, you can, but you shouldn't. And even then you can't use array indexing in the middle of the name.)
These names aren't valid:
@blast[$i]_info
@blast[$i]_ID
You need to move the index to the end:
@blast_info[$i]
@blast_ID[$i]
That said, I'd get rid of the arrays completely and use a hash instead.
Your second code snippet doesn't show a call to your subroutine. Unless it's explicitly called, it will never run and your program will do nothing. I'd modify the process sub to take a single argument and call it for each element of @ARGV, e.g.
process($_) foreach @ARGV;
Here's how I'd write your program:
use strict;
use warnings;
use Data::Dumper;

my @blast;
push @blast, process($_) foreach @ARGV;

print Dumper(\@blast);

sub process {
    my $file = shift;

    open my $fh, '<', $file or die "Can't read file '$file' [$!]\n";

    my %data;
    while (<$fh>) {
        chomp;
        my ($id, undef, $info) = split '\t';
        $data{$id} = $info;
    }

    return \%data;
}
It isn't quite clear what your resulting data structure should look like. (I took my best guess.) I recommend reading perlreftut to gain a better basic understanding of references and using them to build data structures in Perl.
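Since the original goal was to check for overlap between the blast lists, once @blast holds one hash reference per input file you could, for example, intersect the IDs of the first two files with something like this (a rough sketch building on the structure above):
# IDs present in both of the first two input files (assumes at least two files were given)
my @shared_ids = grep { exists $blast[1]{$_} } keys %{ $blast[0] };
for my $id (@shared_ids) {
    print "$id\t$blast[0]{$id}\t$blast[1]{$id}\n";
}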
I need to compare a big file (2 GB, containing 22 million lines) with another file. It takes too long to process when using Tie::File, so I have done it with a plain 'while' loop, but the problem remains. See my code below...
use strict;
use Tie::File;
# use warnings;

my @arr;
# tie @arr, 'Tie::File', 'title_Nov19.txt';
# open(IT,"<title_Nov19.txt");
# my @arr=<IT>;
# close(IT);

open(RE,">>res.txt");
open(IN,"<input.txt");

while(my $data=<IN>){
    chomp($data);
    print "$data\n";
    my $occ=0;
    open(IT,"<title_Nov19.txt");
    while(my $line2=<IT>){
        my $line=$line2;
        chomp($line);
        if($line=~m/\b$data\b/is){
            $occ++;
        }
    }
    print RE "$data\t$occ\n";
}
close(IT);
close(IN);
close(RE);
So please help me reduce the processing time...
Lots of things are wrong with this.
Aside from the usual (the commented-out use warnings, use of 2-argument open(), not checking the open() result, use of global filehandles), the specific problem in your case is that you are opening, reading and closing the second file once for every single line of the first. This is going to be very slow.
I suggest you open the file title_Nov19.txt once, read all the lines into an array or hash or something, then close it; then you can open the first file, input.txt, walk along that once, and compare against things in the array so you don't have to reopen the second file all the time.
Further, I suggest you read some basic articles on style etc., as your question is likely to gain more attention if it's actually written to vaguely modern standards.
I tried to build a small example script with a better structure, but I have to say, man, your problem description is really very unclear. It's important not to read the whole comparison file each time, as @LeoNerd explained in his answer. Then I use a hash to keep track of the match count:
#!/usr/bin/env perl

use strict;
use warnings;

# cache all lines of the comparison file
open my $comp_file, '<', 'input.txt' or die "input.txt: $!\n";
chomp (my @comparison = <$comp_file>);
close $comp_file;

# prepare comparison
open my $input, '<', 'title_Nov19.txt' or die "title_Nov19.txt: $!\n";
my %count = ();

# compare each line
while (my $title = <$input>) {
    chomp $title;

    # iterate comparison strings
    foreach my $comp (@comparison) {
        $count{$comp}++ if $title =~ /\b$comp\b/i;
    }
}

# done
close $input;

# output (in the order of input.txt)
open my $output, '>>', 'res.txt' or die "res.txt: $!\n";
foreach my $comp (@comparison) {
    print $output "$comp\t$count{$comp}\n";
}
close $output;
Just to get you started... If someone wants to further work on this: these were my test files:
title_Nov19.txt
This is the foo title
Wow, we have bar too
Nothing special here but foo
OMG, the last title! And Foo again!
input.txt
foo
bar
And the result of the program was written to res.txt:
foo 3
bar 1
Here's another option using memowe's (thank you) data:
use strict;
use warnings;
use File::Slurp qw/read_file write_file/;
my %count;
my $regex = join '|', map { chomp; $_ = "\Q$_\E" } read_file 'input.txt';
for ( read_file 'title_Nov19.txt' ) {
my %seen;
!$seen{ lc $1 }++ and $count{ lc $1 }++ while /\b($regex)\b/ig;
}
write_file 'res.txt', map "$_\t$count{$_}\n",
sort { $count{$b} <=> $count{$a} } keys %count;
Numerically-sorted output to res.txt:
foo 3
bar 1
An alternation regex which quotes metacharacters (\Q$_\E) is built and used, so only one pass over the large file's lines is needed. The hash %seen is used to ensure that the input words are counted only once per line.
Hope this helps!
Try this:
grep -i -c -w -f input.txt title_Nov19.txt > res.txt