Attach frequent words in text file - perl

I have two files. The first contains frequent
word sequences
extracted from a text file
a.txt :
big pizza
eat big pizza
...
the text file is
b.txt :
i eat big pizza .my big pizza ...
My problem is to add bbb between words from each sequence that exist in file a.txt and write a new file .
so the result will be
i eatbbbbigbbbpizza.my bigbbbpizza...
below is my script. It adds bbb only between pairs 9f words. How can I correct this?
use strict;
use warnings;
use autodie;
my ($f1, $f2) = ('a.txt', 'b.txt');
open( my $fh, $f1 );
my #seq;
foreach ( <$fh> ) {
chomp;
s/^\s+|\s+$//g;
push #seq, $_;
}
close $fh;
open($fh, $f2);
foreach (<$fh> ) {
foreach my $r (#seq) {
my $t = $r =~ s/ /bbb/r;
if (/$r/) {
s/$r/$t/g;
}
}
print ;
}
close $fh;

All that is wrong is your line
my $t = $r =~ s/ /bbb/r;
This substitution runs just once, and so replaces only the first space with bbb
You need to use a global substitution instead. And while we're changing this line it's best to also replace the space with \h+, which matches any amount of "horizontal space", including both tabs and spaces
my $t = $r =~ s/\h+/bbb/gr;
As it stands your code will find and replacing substrings of other entries in #seq if they appear earlier in the array. In this case, that means big pizza will be found first and converted to bigbbbpizza, after which
eat big pizza can no longer be found. You need to first sort your array in descending order of length so that longer phrases are found before shorter ones
#seq = sort { length($b) <=> length($a) } #seq;
Then your program will work a little better

Here is the modified code.
use strict;
use warnings;
use autodie;
my ($f1, $f2) = ('a.txt', 'b.txt');
open(my $fh, $f1);
my #seq;
foreach (<$fh> )
{
chomp;
s/^\s+|\s+$//g;
push #seq, $_;
}
close $fh;
#seq = sort bylen #seq; # need to sort #seq by length.
open($fh, $f2);
foreach (<$fh> ) {
foreach my $r (#seq) {
my $t = $r =~ s/ /bbb/gr;
s/$r/$t/g; # you may need to take care of cases of extra spaces
}
print;
}
close $fh;
exit 0;
sub bylen {
length($b) <=> length($a);
}

Related

Parsing string in multiline data with positive lookbehind

I am trying to parse data like:
header1
-------
var1 0
var2 5
var3 9
var6 1
header2
-------
var1 -3
var3 5
var5 0
Now I want to get e.g. var3 for header2. Whats the best way to do this?
So far I was parsing my files line-by-line via
open(FILE,"< $file");
while (my $line = <FILE>){
# do stuff
}
but I guess it's not possible to handle multiline parsing properly.
Now I am thinking to parse the file at once but wasn't successful so far...
my #Input;
open(FILE,"< $file");
while (<FILE>){ #Input = <FILE>; }
if (#Input =~ /header2/){
#...
}
The easier way to handle this is "paragraph mode".
local $/ = "";
while (<>) {
my ($header, $body) =~ /^([^\n]*)\n-+\n(.*)/s
or die("Bad data");
my #data = map [ split ], split /\n/, $body;
# ... Do something with $header and #data ...
}
The same can be achieved without messing with $/ as follows:
my #buf;
while (1) {
my $line = <>;
$line =~ s/\s+\z// if !defined($line);
if (!length($line)) {
if (#buf) {
my $header = shift(#buf);
shift(#buf);
my #data = map [ split ], splice(#buf);
# ... Do something with $header and #data ...
}
last if !defined($line);
next;
}
push #buf, $line;
}
(In fact, the second snippet includes a couple of small improvements over the first.)
Quick comments on your attempt:
The while loop is useless because #Input = <FILE> places the remaining lines of the file in #Input.
#Input =~ /header2/ matches header2 against the stringification of the array, which is the stringification of the number of elements in #Input. If you want to check of an element of #Input contains header2, will you will need to loop over the elements of #Inputs and check them individually.
while (<FILE>){ #Input = <FILE>; }
This doesn't make much sense. "While you can read a record from FILE, read all of the data on FILE into #Input". I think what you actually want is just:
my #Input = <FILE>;
if (#Input =~ /header2/){
This is quite strange too. The binding operator (=~) expects scalar operands, so it evaluates both operands in scalar context. That means #Input will be evaluated as the number of elements in #Input. That's an integer and will never match "header2".
A couple of approaches. Firstly a regex approach.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my $file = 'file';
open my $fh, '<', $file or die $!;
my $data = join '', <$fh>;
if ($data =~ /header2.+var3 (.+?)\n/s) {
say $1;
} else {
say 'Not found';
}
The key to this is the /s on the m// operator. Without it, the two dots in the regex won't match newlines.
The other approach is more of a line by line parser.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my $file = 'file';
open my $fh, '<', $file or die $!;
my $section = '';
while (<$fh>) {
chomp;
# if the line all word characters,
# then we've got a section header.
if ($_ !~ /\W/) {
$section = $_;
next;
}
my ($key, $val) = split;
if ($section eq 'header2' and $key eq 'var3') {
say $val;
last;
}
}
We read the file a line at a time and make a note of the section headers. For data lines, we split on whitespace and check to see if we're in the right section and have the right key.
In both cases, I've switched to using a more standard approach (lexical filehandles, 3-arg open(), or die $!) for opening the file.

how to count the specific word inputted in STDIN inside the text (PERL)

how to count the specific word inputted in STDIN inside the text (PERL)
my output just count all the owrds found inside the text . but i need the specific word i inputted inside the STDIN
open my($file), '<','C:\Users\yukari\Desktop\hi.txt' or die "not exist";
print "Search the word:";
$word = <STDIN>;
print "\n";
while ( my $line = <$file> ) {
chomp($line);
# print $line;
foreach $word (split(' ', $line)) {
$count{$word}++;
}
}
foreach $word (sort keys %count) {
print "$word: $count{$word}\n";
}
I believe you want to get a word from the user and count the number of occurrences of that word in the entire text file.
You can try something like this:
use strict;
use warnings;
open(WRITE,'>','log.txt') or die "Unable to open the file";
my $string = <<END;
foo baz bar
baz bar bar foo
foo foo bar bar baz baz
baz baz baz
END
print WRITE $string;
close WRITE;
open(READ, '<','log.txt') or die "unable to open the file";
my $search = <STDIN>;
chomp $search;
my $count = 0;
while ( my $line = <READ> ) {
chomp($line);
my #words = split(' ',$line);
foreach my $word(#words){
$count++ if($word eq $search);
}
}
close READ;
print "Search string: $search, Count: $count","\n";
You have a problem here. You are using the variable $word for three different things.
You are using it as the word that you are searching for:
$word = <STDIN>;
You are using it to store each word on a line read from your file:
foreach $word (split(' ', $line)) {
And you are using it to contain the keys you are reading from your hash at the end.
foreach $word (sort keys %count) {
In particular, the second use is interfering with the first. When you are reading data from your file, you have no way of knowing what word you are looking for.
If you are looking for a single word, there is no need for a hash to store the counts. I'd write your code like this:
# ALWAYS INCLUDE THESE
use strict;
use warnings;
use feature 'say';
# Renamed your variable, it's a file handle, not a file.
# Also, include $! in error message so we know what really
# went wrong.
open my $fh, '<', 'C:\Users\yukari\Desktop\hi.txt'
or die "Can't open file: $!";
print "Search the word:";
my $search_word = <STDIN>;
print "\n";
# Scalar variable to store the count
my $count;
# Remove the $line variable and use Perl's default variable $_
# instead. This makes the code *far* cleaner.
while ( <$file> ) {
chomp;
# By default, split splits $_ on whitespace
foreach $word (split) {
# Skip words we don't care about
next if $word ne $search_word;
# Increment the counter
$count++;
}
}
say "$search_word appeared $word times";

perl: make script fast to use big file

My problem is how to make my script fast (I use big files)
I have the script above it add "bbb" between words if the words exist in an other file that contain sequences of words
for exemple file2.txt : i eat big pizza .my big pizza ...
file1.txt (sequences):
eat big pizza
big pizza
the result Newfile
i eatbbbbigbbbpizza.my bigbbbpizza ...
my script:
use strict;
use warnings;
use autodie;
open Newfile ,">./newfile.txt" or die "Cannot create Newfile.txt";
my %replacement;
my ($f1, $f2) = ('file1.txt', 'file2.txt');
open(my $fh, $f1);
my #seq;
foreach (<$fh> )
{
chomp;
s/^\s+|\s+$//g;
push #seq, $_;
}
close $fh;
#seq = sort bylen #seq;
open($fh, $f2);
foreach (<$fh> ) {
foreach my $r (#seq) {
my $t = $r;
$t =~ s/\h+/bbb/g;
s/$r/$t/g;
}
print Newfile ;
}
close $fh;
close Newfile ;
exit 0;
sub bylen {
length($b) <=> length($a);
}
Instead of an array
my #seq;
define your words as a hash.
my %seq;
Instead of pushing the words
push #seq, $_;
store the words in the hash. Precalculate the replacement and move it out of the loop.
my $t = $_;
$t =~ s/\h+/bbb/g;
$seq{$_} = $t;
Precalculate the words in front of the outer loop:
my #seq = keys %seq;
And use hash look-ups to find the replacement in the inner loop:
my $t = $seq{$r};
This might be a bit faster, but do not expect too much.
In most cases it is better to reduce the problem by preparing the input in a way, which makes the solution easier. For example grep -f is much faster than your Perl loops. Use grep to find the lines, which need a replacement, and do the replacement with Perl or Sed.
Another way is to parallel the job. You can divide your input in n parts and run n processes on n CPUs in parallel. See the GNU parallel tutorial.
What about a regexp like this (beware that this approach can cause security concerns) ?
use strict;
use warnings;
open (my $Newfile, '>', 'newfile.txt') or die "Cannot create Newfile.txt: $!";
my ($f1, $f2) = qw(file1.txt file2.txt);
open (my $fh, $f1) or die "Can't open $f1 for reading: $!";
my #seq = map {split ' ', $_ } <$fh>;
close $fh;
# an improvement would be to use an hash to avoid dupplicates
my $regexp = '(' . join('|', #seq) . ')';
open($fh, $f2) or die "Can't open $f2 for reading: $!";
foreach my $line (<$fh> ) {
$line =~ s/$regexp/$1bbb/g;
print $Newfile $line;
}
close $fh;
close $Newfile ;
exit 0;

comparing hash element to table element perl

I have a program that compares each line of two files, each line contains one word, if simply read the two files and stock the data into table, and compare the element of the two tables,
the first file contain:
straight
work
week
belief time
saturday
wagon
australia
sunday
french
...
and the second file contain
firepower
malaise
bryson
wagon
dalglish
french
...
this will take a long time to compare file, so I propose another solution, but this doesn't work
#!/usr/bin/perl
use strict;
use warnings;
open( FIC, $ARGV[0] );
open( FICC, $ARGV[1] );
print "choose the name of the file\n";
chomp( my $fic2 = <STDIN> );
open( FIC2, ">$fic2" );
my $i=0;
my $j=0;
my #b=();
my %stops;
while (<FIC>) #read each line into $_
{
# Remove newline from $_
chomp;
$_ =~ s/\s+$//;
$stops{$_} = $i; # add the line to
$i++;
}
close FIC;
while (<FICC>) {
my $ligne = $_;
$ligne =~ s/\s+$//;
$b[$i] = lc($ligne);
# $b contain the data
$i++;
}
foreach my $che (#b) {
chomp($che);
print FIC2 $che;
print FIC2 " ";
print FIC2 $stops{"$che"}; print FIC2 "\n";
#this returns nothing
}
The problem is inthis command $stop{"$che"}; in the case that the elment don't exist in the hash %stop, it return an integer and an error
Use of initalized value in print c:/ats2/hash.pl line 44, line 185B2
Does this what you want?
join <(sort file1) <(sort file2) >result
Works in bash.

How can I iterate through nested arrays?

I have created an array as follows
while (defined ($line = `<STDIN>`))
{
chomp ($line);
push #stack,($line);
}
each line has two numbers.
15 6
2 8
how do iterate over each item in each line?
i.e. I want to print
15
6
2
8
I understand it's something like
foreach (#{stack}) (#stack){
print "?????
}
This is where I am stuck.
See the perldsc documentation. That's the Perl Data Structures Cookbook, which has examples for dealing with arrays of arrays. From what you're doing though, it doesn't look like you need an array of arrays.
For your problem of taking two numbers per line and outputting one number per line, just turn the whitespace into newlines:
while( <> ) {
s/\s+/\n/; # turn all whitespace runs into newlines
print; # it's ready to print
}
With Perl 5.10, you can use the new \h character class that matches only horizontal whitespace:
while( <> ) {
s/\h+/\n/; # turn all horizontal whitespace runs into newlines
print; # it's ready to print
}
As a Perl one-liner, that's just:
% perl -pe 's/\h+/\n/' file.txt
#!/usr/bin/perl
use strict;
use warnings;
while ( my $data = <DATA> ) {
my #values = split ' ', $data;
print $_, "\n" for #values;
}
__DATA__
15 6
2 8
Output:
C:\Temp> h
15
6
2
8
Alternatively, if you want to store each line in #stack and print out later:
my #stack = map { [ split ] } grep { chomp; length } <DATA>;
The line above slurps everything coming from the DATA filehandle into a list of lines (because <DATA> happens in list context). The grep chomps each line and filters by length after chomping (to avoid getting any trailing empty lines in the data file -- you can avoid it if there are none). The map then splits each line along spaces, and then creates an anonymous array reference for each line. Finally, such array references are stored in each element of #stack. You might want to use Data::Dumper to look at #stack to understand what's going on.
print join("\n", #$_), "\n" for #stack;
Now, we look over each entry in stack, dereferencing each array in turn, then joining the elements of each array with newlines to print one element per line.
Output:
C:\Temp> h
15
6
2
8
The long way of writing essentially the same thing (with less memory consumption) would be:
my #stack;
while ( my $line = <DATA> ) {
last unless $line =~ /\S/;
my #values = split ' ', $line;
push #stack, \#values;
}
for my $ref ( #stack ) {
print join("\n", #$ref), "\n";
}
Finally, if you wanted do something other than printing all values, say, sum all the numbers, you should store one value per element of #stack:
use List::Util qw( sum );
my #stack;
while ( my $line = <DATA> ) {
last unless $line =~ /\S/;
my #values = split ' ', $line;
push #stack, #values;
}
printf "The sum is %d\n", sum #stack;
#!/usr/bin/perl
while ($line = <STDIN>) {
chomp ($line);
push #stack, $line;
}
# prints each line
foreach $line (#stack) {
print "$line\n";
}
# splits each line into items using ' ' as separator
# and prints the items
foreach $line (#stack) {
#items = split / /, $line;
foreach $item (#items) {
print $item . "\n";
}
}
I use 'for' for "C" style loops, and 'foreach' for iterating over lists.
#!/usr/bin/perl
use strict;
use warnings;
open IN, "< read.txt" or
die "Can't read in 'read.txt'!";
my $content = join '', <IN>;
while ($content =~ m`(\d+)`g) {
print "$1\n";
}