The following code copies the contents of readfile to writefile. Instead of copying up to the end of the file, I want to copy only up to some keyword.
use strict;
use warnings;
use File::Slurp;
my @lines = read_file('readfile.txt');
while ( my $line = shift @lines) {
    next unless ($line =~ m/END OF HEADER/);
    last; # here suggest some other logic
}
append_file('writefile.txt', @lines);
next will continue to the next iteration of the loop, effectively skipping the rest of the statements in the loop for that iteration (in this case, the last).
last will immediately exit the loop, which sounds like what you want. So you should be able to simply put the conditional statement on the last.
Also, I'm not sure why you want to read the entire file into memory to iterate over its lines? Why not just use a regular while(<>)? And I would recommend avoiding File::Slurp, it has some long-standing issues.
You don't show any example input with expected output, and your description is unclear - you said "i want to copy upto some keyword" but in your code you use shift, which removes items from the beginning of the array.
Do you want to remove the lines before or after and including or not including "END OF HEADER"?
This code will copy over only the header:
use warnings;
use strict;
my $infile = 'readfile.txt';
my $outfile = 'writefile.txt';
open my $ifh, '<', $infile or die "$infile: $!";
open my $ofh, '>', $outfile or die "$outfile: $!";
while (<$ifh>) {
last if /END OF HEADER/;
print $ofh $_;
}
close $ifh;
close $ofh;
Whereas if you want to copy everything after the header, you could replace the while above with:
while (<$ifh>) {
last if /END OF HEADER/;
}
while (<$ifh>) {
print $ofh $_;
}
The first loop does nothing until it sees END OF HEADER, then breaks out and moves on to the second loop, which prints the lines after the header.
data.txt:
fsffs
sfsfsf
sfSDFF
END OF HEADER
{ dsgs xdgfxdg zFZ }
dgdbg
vfraeer
Code:
use strict;
use warnings;
use 5.020;
use autodie;
use Data::Dumper;
my $infile = 'data.txt';
my $header_file = 'header.txt';
my $after_header_file = 'after_header.txt';
open my $DATA, '<', $infile;
open my $HEADER, '>', $header_file;
open my $AFTER_HEADER, '>', $after_header_file;
{
local $/ = "END OF HEADER";
my $header = <$DATA>;
say {$HEADER} $header;
my $rest = <$DATA>;
say {$AFTER_HEADER} $rest;
}
close $DATA;
close $HEADER;
close $AFTER_HEADER;
say "Created files: $header_file, $after_header_file";
Output:
$ perl 1.pl
Created files: header.txt, after_header.txt
$ cat header.txt
fsffs
sfsfsf
sfSDFF
END OF HEADER
$ cat after_header.txt
{ dsgs xdgfxdg zFZ }
dgdbg
vfraeer
$/ specifies the input record separator, which by default is a newline. Therefore, when you read from a file:
while (my $x = <$INFILE>) {
}
each value of $x is a sequence of characters up to and including the input record separator, i.e. a newline, which is what we normally think of as a line of text in a file. Often, we chomp off the newline/input_record_separator at the end of the text:
while (my $x = <$INFILE>) {
chomp $x;
say "$x is a dog";
}
But, you can set the input record separator to anything you want, like your "END OF HEADER" text. That means a line will be all the text up to and including the input record separator, which in this case is "END OF HEADER". For example, a line will be: "abc\ndef\nghi\nEND OF HEADER". Furthermore, chomp() will now remove "END OF HEADER" from the end of its argument, so you could chomp your line if you don't want the "END OF HEADER" marker in the output file.
If perl cannot find the input record separator, then perl keeps reading the file until perl hits the end of the file, then perl returns all the text that was read.
You can use those operations to your advantage when you want to seek to some specific text in a file.
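As a minimal sketch of the behaviour described above, the following reads from an in-memory string rather than a real file. The sub name read_records and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read every record from $text using $separator as the
# input record separator, chomping the separator off each record.
sub read_records {
    my ($text, $separator) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = $separator;
    my @records;
    while (my $record = <$fh>) {
        chomp $record;            # removes $separator, not "\n"
        push @records, $record;
    }
    close $fh;
    return @records;
}

my @records = read_records("abc\ndef\nEND OF HEADER\nghi\n", "END OF HEADER");
print scalar(@records), " records\n";
```

The second record is everything after the marker up to the end of the file, because no further separator is found.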
Declaring a variable as local makes the variable magical: when the closing brace of the surrounding block is encountered, perl sets the variable back to the value it had just before the opening brace of the surrounding block:
#Here, by default $/ = "\n", but some code out here could have
#also set $/ to something else
{
local $/ = "END OF HEADER";
} # $/ gets set back to whatever value it had before this block
When you change one of perl's predefined global variables, it's considered good practice to only change the variable for as long as you need to use the variable, then change the variable back to what it was.
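A small sketch of that restore-on-exit behaviour, assuming nothing else has changed $/ beforehand. The sub name separator_states is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: capture $/ before, inside, and after a block that
# localizes it, so the three values can be compared.
sub separator_states {
    my $before = $/;
    my $inside;
    {
        local $/ = "END OF HEADER";
        $inside = $/;
    }   # $/ is restored here
    my $after = $/;
    return ($before, $inside, $after);
}

my @states = separator_states();
print "restored\n" if $states[0] eq $states[2];
```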
If you want to target just the text between the braces, you can do:
data.txt:
fsffs
sfsfsf
sfSDFF
END OF HEADER { dsgs xdgfxdg zFZ }
dgdbg
vfraeer
Code snippet:
...
...
{
local $/ = 'END OF HEADER {';
my $pre_brace = <$DATA>;
$/ = '}';
my $target_text = <$DATA>;
chomp $target_text; #Removes closing brace
say "->$target_text<-";
}
--output:--
-> dsgs xdgfxdg zFZ <-
I am currently working on code that changes certain words to Shakespearean words. I have to extract the sentences that contain those words and print them to another file. I also had to remove .START from the beginning of each file.
First I split the text of each file on spaces, so now I have the words. Next, I looked each word up in a hash. The hash keys and values come from a tab-delimited file structured as OldEng/ModernEng (lc_Shakespeare_lexicon.txt). Right now, I'm trying to figure out how to find the exact position of each modern English word that is found, change it to the Shakespearean one, then find the sentences with the changed words and print them to a different file. Most of the code is finished except for this last part. Here is my code so far:
#!/usr/bin/perl -w
use diagnostics;
use strict;
#Declare variables
my $counter=();
my %hash=();
my $conv1=();
my $conv2=();
my $ssph=();
my @text=();
my $key=();
my $value=();
my $conversion=();
my @rmv=();
my $splits=();
my $words=();
my @word=();
my $vals=();
my $existingdir='/home/nelly/Desktop';
my @file='Sentences.txt';
my $eng_words=();
my $results=();
my $storage=();
#Open file to tab delimited words
open (FILE,"<", "lc_shakespeare_lexicon.txt") or die "could not open lc_shakespeare_lexicon.txt\n";
#split words by tabs
while (<FILE>){
chomp($_);
($value, $key)= (split(/\t/), $_);
$hash{$value}=$key;
}
#open directory to Shakespearean files
my $dir="/home/nelly/Desktop/input";
opendir(DIR,$dir) or die "can't opendir Shakespeare_input.tar.gz";
#Use grep to get WSJ file and store into an array
my @array= grep {/WSJ/} readdir(DIR);
#store file in a scalar
foreach my $file(@array){
    #open files inside of input
    open (DATA,"<", "/home/nelly/Desktop/input/$file") or die "could not open $file\n";
    #loop through each file
    while (<DATA>){
        @text=$_;
        chomp(@text);
        #Remove .START
        @rmv=grep(!/.START/, @text);
        foreach $splits(@rmv){
            #split data into separate words
            @word=(split(/ /, $splits));
            #Loop through each word and replace with Shakespearean word that exists
            $counter=0;
            foreach $words(@word){
                if (exists $hash{$words}){
                    $eng_words= $hash{$words};
                    $results=$counter;
                    print "$counter\n";
                    $counter++;
                    #create a new directory and store sentences with Shakespearean words in new file called "Sentences.txt"
                    mkdir $existingdir unless -d $existingdir;
                    open my $FILE, ">>", "$existingdir/@file", or die "Can't open $existingdir/conversion.txt'\n";
                    #print $FILE "@words\n";
                    close ($FILE);
                }
            }
        }
    }
}
close (FILE);
close (DIR);
Natural language processing is very hard to get right except in trivial cases. For instance, it is difficult to define exactly what is meant by a word or a sentence, and it is awkward to distinguish between a single quote and an apostrophe when they are both represented using the U+0027 "apostrophe" character '.
Without any example data it is difficult to write a reliable solution, but the program below should be reasonably close.
Please note the following:
use warnings is preferable to -w on the shebang line
A program should contain as few comments as possible as long as it is comprehensible. Too many comments just make the program bigger and harder to grasp without adding any new information. The choice of identifiers should make the code mostly self documenting
I believe use diagnostics to be unnecessary. Most messages are fairly self-explanatory, and diagnostics can produce large amounts of unnecessary output
Because you are opening multiple files it is more concise to use autodie which will avoid the need to explicitly test every open call for success
It is much better to use lexical file handles, such as open my $fh ... instead of global ones, like open FH .... For one thing a lexical file handle will be implicitly closed when it goes out of scope, which helps to tidy up the program a lot by making explicit close calls unnecessary
I have removed all of the variable declarations from the top of the program except those that are non-empty. This approach is considered to be best practice as it aids debugging and assists the writing of clean code
The program lower-cases the original word using lc before checking to see if there is a matching entry in the hash. If a translation is found, then the new word is capitalised using ucfirst if the original word started with a capital letter
I have written a regular expression that will take the next sentence from the beginning of the string $content. But this is one of the things that I can't get right without sample data, and there may well be problems, for instance, with sentences that end with a closing quotation mark or a closing parenthesis
use strict;
use warnings;
use autodie;
my $lexicon = 'lc_shakespeare_lexicon.txt';
my $dir = '/home/nelly/Desktop/input';
my $existing_dir = '/home/nelly/Desktop';
my $sentences = 'Sentences.txt';
my %lexicon = do {
    open my ($fh), '<', $lexicon;
    local $/;
    reverse(<$fh> =~ /[^\t\n\r]+/g);
};
my @files = do {
    opendir my ($dh), $dir;
    grep /WSJ/, readdir $dh;
};
for my $file (@files) {
    my $contents = do {
        open my $fh, '<', "$dir/$file";
        join '', grep { not /\A\.START/ } <$fh>;
    };
    # Change any CR or LF to a space, and reduce multiple spaces to single spaces
    $contents =~ tr/\r\n/ /;
    $contents =~ s/ {2,}/ /g;
    # Find and process each sentence
    while ( $contents =~ / \s* (.+?[.?!]) (?= \s+ [A-Z] | \s* \z ) /gx ) {
        my $sentence = $1;
        my @words = split ' ', $sentence;
        my $changed;
        for my $word (@words) {
            my $eng_word = $lexicon{lc $word};
            if ($eng_word) {
                # Keep the capitalisation of the original word
                $eng_word = ucfirst $eng_word if $word =~ /\A[A-Z]/;
                $word = $eng_word;
                ++$changed;
            }
        }
        if ($changed) {
            mkdir $existing_dir unless -d $existing_dir;
            open my $out_fh, '>>', "$existing_dir/$sentences";
            print $out_fh "@words\n";
        }
    }
}
I am working on a Perl script and need some help with it. The requirement is that I have to find a label and, once the label is found, replace a word in the line immediately following the label. For example, if the label is ABC:
ABC:
string to be replaced
some other lines
ABC:
string to be replaced
some other lines
ABC:
string to be replaced
I want to write a script to match the label (ABC) and once the label is found, replace a word in the next line immediately following the label.
Here is my attempt:
open(my $fh, "<", "file1.txt") or die "cannot open file:$!";
while (my $line = <$fh>))
{
next if ($line =~ /ABC/) {
$line =~ s/original_string/replaced_string/;
}
else {
$msg = "pattern not found \n ";
print "$msg";
}
}
Is this correct..? Any help will be greatly appreciated.
The following one-liner will do what you need:
perl -pe '++$x and next if /ABC:/; $x-- and s/old/new/ if $x' inFile > outFile
The code sets a flag and gets the next line if the label is found. If the flag is set, it's unset and the substitution is executed.
Hope this helps!
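For readers who find the one-liner dense, here is a rough expansion of the same flag idea as an ordinary sub. The name replace_after_label is hypothetical, and old/new are the same placeholders as in the one-liner. Unlike the one-liner's counter, this sketch uses a plain on/off flag.

```perl
use strict;
use warnings;

# Hypothetical expansion of the one-liner: takes a list of lines and returns
# a copy in which s/old/new/ was applied only to the line right after a label.
sub replace_after_label {
    my @lines = @_;
    my $flag = 0;
    for my $line (@lines) {       # $line aliases the elements of @lines
        if ($line =~ /ABC:/) {
            $flag = 1;            # remember that the next line is the target
            next;
        }
        if ($flag) {
            $flag = 0;
            $line =~ s/old/new/;
        }
    }
    return @lines;
}

print replace_after_label("ABC:\n", "old text\n", "old again\n");
```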
You're doing this in your loop:
next if ($line =~ /ABC/);
So, you're reading the file, and if a line contains ABC anywhere in it, you skip that line. For every other line, however, you do the replacement. In the end, you're replacing the string on all the other lines and printing that out, and you're not printing out your labels.
Here's what you said:
I have to read the file until I find a line with the label:
Once the label is found
I have to read the next line and replace the word in a line immediately following the label.
So:
You want to read through a file line-by-line.
If a line matches the label
read the next line
replace the text on the line
Print out the line
Following these directions:
use strict;
use warnings; # Hope you're using strict and warnings
use autodie; # Program automatically dies on failed opens. No need to check
use feature qw(say); # Allows you to use say instead of print
open my $fh, "<", "file1.txt"; # Removed parentheses. It's the latest style
while (my $line = <$fh>) {
    chomp $line;                   # Always do a chomp after a read.
    if ( $line eq "ABC:" ) {       # Use 'eq' to ensure an exact match for your label
        say "$line";               # Print out the current line
        $line = <$fh>;             # Read the next line
        chomp $line;               # Remove its newline too, since say adds one
        $line =~ s/old/new/;       # Replace that word
    }
    say "$line";                   # Print the line
}
close $fh; # Might as well do it right
Note that when I use say, I don't have to put the \n on the end of the line. Also, by doing my chomp after my read, I can easily match the label without worrying about the \n on the end.
This is done exactly as you said it should be done, but there are a couple of issues. The first is that when we do $line = <$fh>, there's no guarantee we are really reading a line. What if the file ends right there?
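One way to guard against the file ending right after a label is to check the second read with defined. This is only a sketch: the sub name replace_after_label_fh and the in-memory sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: reads lines from an open filehandle and returns them,
# with s/old/new/ applied to the line after each $label line.
sub replace_after_label_fh {
    my ($fh, $label) = @_;
    my @out;
    while (my $line = <$fh>) {
        chomp $line;
        push @out, $line;
        next unless $line eq $label;
        my $next = <$fh>;
        last unless defined $next;    # the file ended right after the label
        chomp $next;
        $next =~ s/old/new/;
        push @out, $next;
    }
    return @out;
}

my $text = "ABC:\nold text\nABC:\n";  # sample data ending on a label
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
print "$_\n" for replace_after_label_fh($fh, 'ABC:');
close $fh;
```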
Also, it's bad practice to read a file in multiple places. It makes it harder to maintain the program. To get around this issue, we'll use a flag variable. This allows us to know if the line before was a tag or not:
use strict;
use warnings; # Hope you're using strict and warnings
use autodie; # Program automatically dies on failed opens. No need to check
use feature qw(say); # Allows you to use say instead of print
open my $fh, "<", "file1.txt"; # Removed parentheses. It's the latest style
my $tag_found = 0; # Flag isn't set
while (my $line = <$fh>) {
    chomp $line;                   # Always do a chomp after a read.
    if ( $tag_found ) {            # The previous line was the label
        $line =~ s/old/new/;       # Replace that word
        $tag_found = 0;            # Reset our flag variable
    }
    if ( $line eq "ABC:" ) {       # Use 'eq' to ensure an exact match for your label
        $tag_found = 1;            # We found the tag!
    }
    say "$line";                   # Print the line
}
close $fh; # Might as well do it right
Of course, I would prefer to eliminate mysterious values. For example, the tag should be a variable or constant. Same with the string you're searching for and the string you're replacing.
You mentioned this was a word, so your regular expression replacement should probably look like this:
$line =~ s/\b$old_word\b/$new_word/;
The \b marks word boundaries. This way, if you're supposed to replace the word cat with dog, you don't get tripped up on a line that says:
The Jeopardy category is "Say what".
You don't want to change category to dogegory.
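A short sketch of the word-boundary replacement, using the cat/dog example. The sub name replace_word is hypothetical, and the \Q...\E quoting is an addition of mine so that a word containing regex metacharacters is matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the first whole-word occurrence of $old_word.
sub replace_word {
    my ($line, $old_word, $new_word) = @_;
    $line =~ s/\b\Q$old_word\E\b/$new_word/;
    return $line;
}

# "category" is left alone; the standalone word "cat" is replaced.
print replace_word('The Jeopardy category is "Say what".', 'cat', 'dog'), "\n";
print replace_word('The cat sat on the mat.', 'cat', 'dog'), "\n";
```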
Your problem is that reading in a file does not work like that. You're doing it line by line, so when your regex tests true, the line you want to change isn't there yet. You can try adding a boolean variable to check if the last line was a label.
#!/usr/bin/perl
use strict;
use warnings;
my $found;
my $replacement = "Hello";
while(my $line = <>){
    if($found){
        # The line before this one was a label: replace the text, keep the newline
        $line =~ s/^.*$/$replacement/;
        $found = 0;
    }
    elsif($line =~ /ABC/){
        $found = 1;
    }
    print $line;
}
Or you could use File::Slurp and read the whole file into one string:
use File::Slurp;
$x = read_file( "file.txt" );
$x =~ s/^(ABC:\s*$ [\n\r]{1,2}^.*?)to\sbe/$1to was/mgx;
print $x;
using /m to make the ^ and $ match embedded begin/end of lines
x is to allow the space after the $ - there is probably a better way
Yields:
ABC:
string to was replaced
some other lines
ABC:
string to was replaced
some other lines
ABC:
string to was replaced
Also, relying on perl's in-place editing:
use File::Slurp qw(read_file write_file);
use strict;
use warnings;
my $file = 'fakefile1.txt';
# Initialize Fake data
write_file($file, <DATA>);
# Enclosed is the actual code that you're looking for.
# Everything else is just for testing:
{
local @ARGV = $file;
local $^I = '.bac';
while (<>) {
print;
if (/ABC/ && !eof) {
$_ = <>;
s/.*/replaced string/;
print;
}
}
unlink "$file$^I";
}
# Compare new file.
print read_file($file);
1;
__DATA__
ABC:
string to be replaced
some other lines
ABC:
string to be replaced
some other lines
ABC:
string to be replaced
ABC:
outputs
ABC:
replaced string
some other lines
ABC:
replaced string
some other lines
ABC:
replaced string
ABC:
I need to exit the loop after finding the first match and go on to the next search in the loop.
use strict;
use warnings;
my %iptv;
sub trim($) {
    my $string = shift;
    $string =~ s/\r\n//g;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}
my @files=</tests/*>;
open IN, "/20131105.csv";
LINE: while (<IN>) {
chomp;
my #result = split(/;/,$_);
my $result1 = trim($_);
$result[1] = trim($result[1]);
$iptv{$result[1]} = $result1;
}
close IN;
foreach my $file (@files) {
open FILE, "$file";
while (<FILE>) {
chomp;
my ($mac, $date) = split(/;/,$_);
my #date1 = split(/\s/, $date);
print "$iptv{$mac};$date1[0]\n" if defined $iptv{$mac};
last LINE if (defined $iptv{$mac});
}
close FILE;
}
I tried to use last, but it finds the first match and ends the program. Where do I have to put last?
Let's take a look at the documentation:
$ perldoc -f last
last LABEL
last The "last" command is like the "break" statement in C (as used
in loops); it immediately exits the loop in question. If the
LABEL is omitted, the command refers to the innermost enclosing
loop. The "continue" block, if any, is not executed:
LINE: while (<STDIN>) {
last LINE if /^$/; # exit when done with header
#...
}
"last" cannot be used to exit a block that returns a value such
as "eval {}", "sub {}" or "do {}", and should not be used to
exit a grep() or map() operation.
Note that a block by itself is semantically identical to a loop
that executes once. Thus "last" can be used to effect an early
exit out of such a block.
See also "continue" for an illustration of how "last", "next",
and "redo" work.
We can clearly read here about how to use last. If a label is omitted, it breaks out of the innermost loop. So only in the case where we do not want this do we use a label. You want this, so you do not want a label.
Some notes on your code:
Check the return value of open, and use three arguments with a lexical file handle.
open my $fh, "<", $file or die "Cannot open $file: $!";
This also has the benefit that when the lexical variable $fh goes out of scope, the file handle is closed.
When you split on \s you split on a single whitespace. Most often, this is not what you want. If you for example have a date such as
$str = "Jan  1 2013" # (note the two consecutive spaces)
...this will split into the list "Jan", "", "1", "2013" (note the empty field). This is only what you want if empty fields are relevant, such as with csv-like data. The default behaviour of split uses ' ' (a space character), which acts like /\s+/, except that it also strips leading whitespace.
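The difference is easy to see side by side. A minimal sketch, with a date string that has two spaces after the month:

```perl
use strict;
use warnings;

my $str = "Jan  1 2013";            # two spaces after "Jan"

my @by_single = split /\s/, $str;   # every single whitespace character separates
my @by_run    = split ' ', $str;    # a run of whitespace counts as one separator

print scalar(@by_single), " fields: ", join('|', @by_single), "\n";
print scalar(@by_run),    " fields: ", join('|', @by_run),    "\n";
```

The first split yields four fields, one of them empty; the second yields three.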
Note that the last two statements inside this loop can be merged. Also, the temporary array @date1 is not needed. Your code then looks like:
open my $fh, "<", $file or die "Cannot open $file: $!";
while (<$fh>) {
chomp;
my ($mac, $date) = split /;/, $_;
($date) = split ' ', $date;
if (defined $iptv{$mac}) {
print "$iptv{$mac};$date\n" ;
last;
}
}
foreach my $file (@files) {
    open FILE, "$file";
    LINE: while (<FILE>) {
        chomp;
        my ($mac, $date) = split(/;/,$_);
        my @date1 = split(/\s/, $date);
        print "$iptv{$mac};$date1[0]\n" if defined $iptv{$mac};
        last LINE if (defined $iptv{$mac});
    }
    close FILE;
}
This should make sure that you only exit the inner loop.
I suppose it would work just as well if you got rid of the LINE label after the last altogether, but I would suggest always using a label with last, to be certain that it does not do something unexpected if you later add another inner loop and forget that the last inside it is expected to leave a loop farther out.
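To illustrate labelled last in nested loops without touching real files, here is a sketch in which each inner array stands in for the lines of one file. The sub name first_match_per_file and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the first line equal to $wanted from each "file".
sub first_match_per_file {
    my ($wanted, @files) = @_;
    my @found;
    FILE: for my $lines (@files) {
        LINE: for my $line (@$lines) {
            if ($line eq $wanted) {
                push @found, $line;
                last LINE;        # leave only the inner loop; FILE continues
            }
        }
    }
    return @found;
}

my @found = first_match_per_file('x', ['a', 'x', 'x'], ['x', 'b'], ['c']);
print scalar(@found), " matches\n";
```

Writing last FILE instead would stop after the first file that contains a match, which is the behaviour the question describes.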
I am trying to print the array, but the output contains only the last line of the array. The partial code is as follows.
open OUT, "> /myFile.txt"
or die "Couldn't open output file: $!";
foreach (@result) {
    print OUT;
}
The output is
List Z
which is the last line, but when I do print "@result" the output is
List A
List B
List C so on...
I am a little confused about why the results are different for the same array.
Working on a hunch, I tried adding \r to the end of your input lines, and sure enough, it creates the illusion that only the last line of your input is printed to the file. Here's the code to test it:
use strict;
use warnings;
my @result = map "$_\r", 'A' .. 'Z';
open (OUT, "> myFile.txt") or die("Couldn't open output file: $!");
foreach (@result) {
    print OUT ;
}
What you have probably done is performed chomp on lines from a file from a different operating system (DOS, Windows), which does not strip the \r line endings. Hence, when the lines are printed, the lines overwrite each other.
If this is what is wrong, the solution is to use the dos2unix tool to fix your files, or to use:
s/\s+\z//;
to strip your newlines.
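A small sketch of that substitution applied to lines with mixed endings. The sub name strip_line_ending and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: strip any trailing whitespace, including "\r\n", from a line.
sub strip_line_ending {
    my ($line) = @_;
    $line =~ s/\s+\z//;
    return $line;
}

my @raw   = ("List A\r\n", "List B\r\n", "List C\n");
my @clean = map { strip_line_ending($_) } @raw;
print "$_\n" for @clean;
```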
You may inspect your input by using the Data::Dumper module, using the option Useqq, e.g.:
use Data::Dumper;
$Data::Dumper::Useqq = 1;
print Dumper \@result;
If these whitespace characters are in your output, they will then be visible.
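The problem may be here:
open OUT, "> /myFile.txt"
If this open runs more than once (for example, once per pass of an enclosing loop), it should be
open OUT, ">>", "/myFile.txt"
">" truncates the file every time it is opened, so each new open discards what was written before and only the last write survives. Opening the file once before the foreach (@result) loop, as in the code shown, does not truncate it between prints.
If the open is repeated, what you are intending to do is append to the file (">>").
">>" appends, ">" overwrites.
Also take note of how I broke ">> /myFile.txt" into ">>", "/myFile.txt".
This is both more secure and more robust for less specific applications of open.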
Foreign line terminators from any platform can easily be fixed by clearing whitespace from the end of the line and adding a newline back when printing it.
Like this:
open my $out, '>', '/myFile.txt' or die "Couldn't open output file: $!";
foreach (@result) {
    s/\s+$//;
    print $out "$_\n";
}
or
foreach my $line (@result) {
    $line =~ s/\s+$//;
    print $out "$line\n";
}
Earlier I was working on a loop within a loop, and if a match was made it would replace the entire string from the second loop's file. Now I have a slightly different situation. I'm trying to replace a substring from the first loop with a string from the second loop. Both are semicolon-delimited CSV files. What I'm trying to replace are special characters: from the numerical code to the character itself. The first file looks like:
1;2;bla&#322;blabla &#261;bla;7;8
3;4;bl&#261;blabla;9;10
2;3;blablabla&#261;a&#322;8;9
and the second file has the numerical code and the corresponding character:
&#260;;Ą
&#261;;ą
&#478;;Ǟ
&#193;;Á
&#225;;á
&#194;;Â
&#322;;ł
The first semicolon in the second file belongs to the numerical code of the corresponding character and should not be used to split the file. The result should be:
1;2;blałblabla ąbla;7;8
3;4;bląblabla;9;10
2;3;blablablaąał;8;9
This is the code I have. How can I fix this?
use strict;
use warnings;
my $inputfile1 = shift || die "input/output!\n";
my $inputfile2 = shift || die "input/output!\n";
my $outputfile = shift || die "output!\n";
open my $INFILE1, '<', $inputfile1 or die "Used/Not found :$!\n";
open my $INFILE2, '<', $inputfile2 or die "Used/Not found :$!\n";
open my $OUTFILE, '>', $outputfile or die "Used/Not found :$!\n";
my $infile2_pos = tell $INFILE2;
while (<$INFILE1>) {
s/"//g;
my #elements = split /;/, $_;
seek $INFILE2, $infile2_pos, 0;
while (<$INFILE2>) {
s/"//g;
my #loopelements = split /;/, $_;
#### The problem part ####
if (($elements[2] =~ /\&\#\d{3}\;/g) and (($elements[2]) eq ($loopelements[0]))){
$elements[2] =~ s/(\&\#\d{3}\;)/$loopelements[1]/g;
print "$2. elements[2]\n";
}
#### End problem part #####
}
my $output_line = join(";", @elements);
print $OUTFILE $output_line;
#print "\n"
}
close $INFILE1;
close $INFILE2;
close $OUTFILE;
exit 0;
Assuming your character codes are standard Unicode entities, you are better off using HTML::Entities to decode them.
This program processes the data you show in your first file and ignores the second file completely. The output seems to be what you want.
use strict;
use warnings;
use HTML::Entities 'decode_entities';
binmode STDOUT, ":utf8";
while (<DATA>) {
print decode_entities($_);
}
__DATA__
1;2;bla&#322;blabla &#261;bla;7;8
3;4;bl&#261;blabla;9;10
2;3;blablabla&#261;a&#322;8;9
output
1;2;blałblabla ąbla;7;8
3;4;bląblabla;9;10
2;3;blablablaąał8;9
You split your @elements at every occurrence of ;, which is then removed. You will not find it in your data, so the semicolon in your regexp can never match and no substitutions are done.
Anyway, using seek is somewhat disturbing for me. As you have a reasonable number of replacement codes (<5000), you might consider putting them into a hash:
my %subst;
while(<$INFILE2>){
    # Skip lines that are not of the form "code;;character"
    next unless /^&#(\d{3});;(.*)$/;
    $subst{$1} = $2;
}
Then we can do:
while(<$INFILE1>){
    # The # must be escaped: under /x a bare # starts a comment
    s| &\# (\d{3}) | $subst{$1} // "&#$1" |egx;
    # (don't try to concat undef
    # when no substitution for our code is defined)
    print $OUTFILE $_;
}
We do not have to split the files or view them as CSV data if replacement should occur everywhere in INFILE1. My solution should speed things up a bit (parsing INFILE2 only once). Here I assumed your input data is correct and the number codes are not terminated by a semicolon but by length. You might want to remove that from your regexes (i.e. m/&#\d{3}/).
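As a self-contained sketch of the hash-based substitution, the following uses a small hard-coded table in place of the second file. The sub name decode_codes and the table contents are hypothetical. Unlike the code above, it assumes each code in the data ends with a semicolon, as standard numeric entities do.

```perl
use strict;
use warnings;

# Hypothetical table standing in for the second file (code => character).
my %subst = (
    322 => "\x{142}",   # U+0142, l with stroke
    261 => "\x{105}",   # U+0105, a with ogonek
);

# Replace each numeric code with its character; codes without a table entry
# are left unchanged. The trailing semicolon is treated as part of the code.
sub decode_codes {
    my ($line) = @_;
    $line =~ s/&\#(\d{3});/ exists $subst{$1} ? $subst{$1} : "&#$1;" /ge;
    return $line;
}

binmode STDOUT, ':encoding(UTF-8)';
print decode_codes("1;2;bla&#322;blabla &#261;bla;7;8"), "\n";
```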
If you have trouble with character encodings, you might want to open your files with :utf8 and/or use Encode or similar.