Search and Replace using Perl - perl

I have some tags with values like below,
<section>
<title id="ABC0123">is The human nervous system?</title>
<para>A tag is a keyword or label that categorizes your question with other, similar questions</para>
<section>
<title id="DEF0123">Terms for anatomical directions in the nervous system</title>
<para>A tag is a keyword or label that categorizes your question with other, similar questions</para>
</section>
<section>
<title id="ABC4356">Anatomical terms: is referring to directions</title>
.
.
.
The output I need is like below,
<section>
<title id="ABC0123">Is the Human Nervous System?</title>
<para>A tag is a keyword or label that categorizes your question with other, similar questions</para>
</section>
<section>
<title id="DEF0123">Terms for Anatomical Directions in the Nervous System</title>
<para>A tag is a keyword or label that categorizes your question with other, similar questions</para>
<section>
<title id="ABC4356">Anatomical Terms: Is Referring to Directions</title>
.
.
How could I do this using Perl? All prepositions and articles should stay in lower case, with two exceptions.
If a word from the @lowercase list (for example, is) is the first word of the <title>, it should be capitalized. Likewise, any @lowercase word that comes directly after a colon in the <title> should be capitalized.

Probably something like this then:
#!/usr/bin/env perl
use strict;
use warnings;
my $lines = qq#
<title>The human nervous system</title>
<title>Terms for anatomical directions in the nervous system</title>
<title>Anatomical terms referring to directions</title>
#;
foreach my $line ( split(/\n/, $lines ) ) {
$line =~ s|</?title>||g;
if ( $line =~ /\w+/ ) { # Skip if blank
print "<title>" . ucfirst(
join(" ",
map{ !/^(in|the|on|or|to|for)$/i ? ucfirst($_) : lc($_); }
split(/\s/, $line )
)
) ."<\/title>\n";
}
}
Or loop over your file however you prefer. Either way, you will have to filter out the words you don't want capitalized, as shown above.

New answer to match the updated question (sample input and desired output changed since the original question). Updated again on Mar 9, 2014, per the op's request to always uppercase the first word in a title tag.
#!/usr/bin/perl
use strict;
use warnings;
# Add your articles and prepositions here!!!
my @lowercase = qw(a an at for in is the to);
# Use a hash since lookup is easier later.
my %lowercase;
# Populate the hash with keys and values from @lowercase.
# Values could have been anything, but it needs to match the number of keys, so this is easiest.
@lowercase{@lowercase} = @lowercase;
open(F, "foo.txt") or die $!;
while(<F>) {
if (m/^<title/i) {
chomp;
my @words;
my $line = $_;
# Save the opening <title> tags
my $titleTag = $line;
$titleTag =~ s/^(<[^>]*>).*/$1/;
# Remove any tags in <brackets>
$line =~ s/<[^>]*>//g;
# Uppercase the first letter in every word, except for those in a certain list.
my $first = 1;
foreach my $word (split(/\s/, $line)) {
if ($first) {
$first = 0;
push(@words, ucfirst($word));
next;
}
if ($first || exists $lowercase{$word}) { push(@words, "$word") }
else { push(@words, ucfirst($word)) }
}
print $titleTag . join(" ", @words) . "</title>\n";
}
else {
print $_;
}
}
close(F)
This code makes two assumptions:
Each <title>...</title> is on a single line. It never wraps to more than one line in the file.
The opening <title> tag is at the beginning of the line. This can easily be changed in the code if desired.
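The question also asks for a word to be capitalized when it directly follows a colon, and the sample input contains "The" where the output wants "the". A minimal sketch of one way to cover both, assuming the same word list as above; title_case is a hypothetical helper name, and it works on the title text only, with the tags already stripped:

```perl
use strict;
use warnings;

# Words kept in lower case unless they start a title or follow a colon.
my %small = map { $_ => 1 } qw(a an at for in is the to);

# title_case() is a hypothetical helper, not part of the answers above.
sub title_case {
    my ($text) = @_;
    my $force = 1;    # true when the next word must be capitalized
    my @out;
    for my $word (split ' ', $text) {
        (my $bare = lc $word) =~ s/[^a-z]//g;    # drop punctuation for the lookup
        if (!$force && $small{$bare}) {
            push @out, lc $word;
        }
        else {
            push @out, ucfirst $word;
        }
        $force = $word =~ /:$/ ? 1 : 0;    # a trailing colon forces the next word
    }
    return join ' ', @out;
}

print title_case('is The human nervous system?'), "\n";
print title_case('Anatomical terms: is referring to directions'), "\n";
```

For the two sample titles this prints "Is the Human Nervous System?" and "Anatomical Terms: Is Referring to Directions". The lookup is done on a lowercased copy, so the list stays case-insensitive.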

Related

Perl collecting xml snippets from log with specific contents

I have a script copied from another Stack Overflow question, but it seems to replace the content of the variable. Could someone point me to the error? If I remove the if check for ">OK<", it prints the whole XML to a file; if I put the check back, it only prints the line containing ">OK<". Why is the $xml variable modified by the =~?
# Example usage:
# perl script.pl data.xml RootTag > RootTag.xml
use strict;
use warnings;
my $tag = pop;
while (<>){
if ( s/.*(<$tag>)/$1/ .. s/(<(\/)$tag>).*/$1/ ) {
my $xml = $_;
if ($xml =~ m/>OK</) {
print "$xml";
}
}
}
An example of a input file could be
reioirioree
brebreberbre
rbebrbebre
<test>
<id>1</id>
<status>OK</status>
</test>
bbrtbtrbt
rtbtrb
<test>
<id>2</id>
<status>KO</status>
</test>
brtoibjtrbi
bebbetreb
<test>
<id>3</id>
<status>OK</status>
</test>
dfbreberbreb
berbrebre
In this case, if we use "test" as the parameter, I would like the following output:
<test>
<id>1</id>
<status>OK</status>
</test>
<test>
<id>3</id>
<status>OK</status>
</test>
The objective is to capture the whole tag when it contains a specific pattern (>OK<).
Here is a step-by-step way which spells out details. I keep your program interface.
use strict;
use warnings;
my $tag = pop;
my ($inside_tag, $found, @buff);
while (<>)
{
if (s/.*(<$tag>)/$1/) {
$inside_tag = 1;
}
elsif (s|(</$tag>).*|$1|) { #/
$inside_tag = 0;
if ($found) {
print @buff, $_;
$found = 0;
}
@buff = ();
}
next unless $inside_tag;
push @buff, $_;
$found = 1 if />OK</;
}
On the opening tag we set the flag that we are inside the tag. On the closing tag we unset it, and if the marker has been $found we print the buffer (and unset marker's flag). We clear the buffer here.
Then we skip the iteration if outside of the tag. Otherwise, add the line to the buffer and test for the marker on that line.
A glitch with using the range operator in this problem is that we must know when we are on the closing-tag line, and would like to know the opening line as well. That needs further tests, and the flip-flop isn't so clean any more. We can use the sequence number that the .. operator returns; perlop describes it as follows:
The value returned is either the empty string for false, or a sequence number (beginning with 1) for true. The sequence number is reset for each range encountered. The final sequence number in a range has the string "E0" appended to it, which doesn't affect its numeric value, but gives you something to search for if you want to exclude the endpoint. You can exclude the beginning point by waiting for the sequence number to be greater than 1.
It would go something like
if (my $seq = /BEG/ .. /END/)
{
if ($seq == 1) { # first line of range
# ...
}
elsif ($seq =~ /E0$/) { # last line of range
# ...
}
else { ... } # inside
}
and I don't see that this is clearer or better than keeping the state manually.
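For comparison, here is a sketch of how the sequence number could be applied to this problem. blocks_with_marker is a hypothetical helper name, and unlike the script above it does not strip text that shares a line with the opening or closing tag:

```perl
use strict;
use warnings;

# blocks_with_marker() is a hypothetical helper: it returns every <$tag> block
# from @lines that contains $marker on at least one of its lines.
sub blocks_with_marker {
    my ($tag, $marker, @lines) = @_;
    my (@buff, @out);
    for my $line (@lines) {
        # In scalar context .. is the flip-flop and yields a sequence number.
        my $seq = ($line =~ /<\Q$tag\E>/) .. ($line =~ m{</\Q$tag\E>});
        next unless $seq;                 # outside any block
        push @buff, $line;
        if ($seq =~ /E0$/) {              # closing-tag line
            push @out, join('', @buff) if grep { /\Q$marker\E/ } @buff;
            @buff = ();
        }
    }
    return @out;
}

my @sample = map { "$_\n" } qw(noise <test> <id>1</id> <status>OK</status> </test>
                               junk <test> <id>2</id> <status>KO</status> </test>);
print blocks_with_marker('test', '>OK<', @sample);
```

Only the first block is printed, because the second one has status KO. The buffer is still needed, since the marker may appear on any line before the closing tag.\n";
my @none = blocks_with_marker('test', '>MISSING<', @sample);
die "expected no blocks" unless @none == 0;
</test>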

count number of times string repeated in file perl

I am new to Perl, by the way. I have a Perl script that needs to count the number of times a string appears in the file. The script gets the word from the file itself.
I need it to grab the first word in the file and then search the rest of the file to see if it is repeated anywhere else. If it is repeated I need it to return the amount of times it was repeated. If it was not repeated, it can return 0. I need it to then get the next word in the file and check this again.
I will grab the first word from the file, search the file for repeats of that word, grab the second word from the file, search the file for repeats of that word, grab the third word from the file, search the file for repeats of that word.
So far I have a while loop that is grabbing each word I need, but I do not know how to get it to search for repeats without resetting the position of my current line. So how do I do this? Any ideas or suggestions are greatly appreciated! Thanks in advance!
while (<theFile>) {
my $line1 = $_;
my $startHere = rindex($line1, ",");
my $theName = substr($line1, $startHere + 1, length($line1) - $startHere);
#print "the name: ".$theName."\n";
}
Use a hash table:
my %wordCount = ();
while(my $line = <theFile>)
{
chomp($line);
my @words = split(' ', $line);
foreach my $word(@words)
{
$wordCount{$word} += 1;
}
}
# output
foreach my $key(keys %wordCount)
{
print "Word: $key Repeat_Count: " . ($wordCount{$key} - 1) . "\n";
}
The $wordCount{$key} - 1 in the output accounts for the first time a word was seen; words that only appear once in the file will have a count of 0.
Unless this is actually homework and/or you have to achieve the results in the specific manner you describe, this is going to be FAR more efficient.
Edit: From your comment below:
Each word i am searching for is not "the first word" it is a certain word on the line. Basically i have a csv file and i am skipping to the third value and searching for repeats of it.
I would still use this approach. What you would want to do is:
split on , since this is a CSV file
Pull out the 3rd word in the array on each line and store the words you are interested in in their own hash table
At the end, iterate through the "search word" hash table, and pull out the counts from the wordcount table
So:
my @words = split(',', $line);
$searchTable{$words[2]} = 1;
...
foreach my $key(keys %searchTable)
{
print "Word: $key Repeat_Count: " . ($wordCount{$key} - 1) . "\n";
}
You'll have to adjust this according to whatever rules you have for counting words that repeat in the third column. You could just remove them from @words before the loop that inserts into your wordCount hash.
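Putting those steps together, a self-contained sketch might look like the following. It assumes that only the third column is of interest and that a simple split on commas is enough (no quoted fields); third_column_repeats is a hypothetical helper name:

```perl
use strict;
use warnings;

# third_column_repeats() is a hypothetical helper: for each value found in the
# third CSV column it returns how many times that value was seen again.
sub third_column_repeats {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        chomp(my $copy = $line);
        my @fields = split /,/, $copy;
        next unless defined $fields[2];    # skip lines with fewer than 3 fields
        $count{ $fields[2] }++;
    }
    return map { $_ => $count{$_} - 1 } keys %count;
}

my %repeats = third_column_repeats("a,b,x\n", "c,d,y\n", "e,f,x\n");
print "$_: $repeats{$_}\n" for sort keys %repeats;
```

This prints "x: 1" and "y: 0". The file is read only once, so there is no need to reset the read position for each word.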
my $word = <theFile>;
chomp($word); # assuming the word is on a line by itself.
my $wordcount = 0;
foreach my $line (<theFile>) {
$line =~ s/$word/$wordcount++/eg;
}
print $wordcount."\n";
Look up the regex flag 'e' for more on what this does. I didn't test the code, but something like it should work. For clarification, the 'e' flag evaluates the second part of the regex (the substitution) as code before replacing, but it's more than that, so with that flag you should be able to make this work.
Now that I understand what you are asking for, the above solution won't work. What you can do is use sysread to read the entire file into a buffer and run the same substitution after that, but you will have to get the first word off manually, or you can just decrement after the fact. This is because the sysread filehandle and the regular filehandle are handled differently, so try this:
my $word = <theFile>;
chomp($word); # assuming the word is on a line by itself.
my $wordcount = 0;
my $srline = '';
# some arbitrary very long length, longer than the file
# Looping is also possible.
sysread(theFile,$srline,10000000);
$srline =~ s/$word/$wordcount++/eg;
$wordcount--; # I think that the first word will still be in here, causing issues, you should test.
print $wordcount."\n";
Now, given that I read your comment responding to your question, I don't think that your current algorithm is optimal, and you probably want a hash storing up all of the counts for words in a file. This would probably be best done using something like the following:
my %counts = ();
foreach my $line (<theFile>) {
$line =~ s/(\w+)/$counts{$1}++/eg;
}
# now %counts contains key-value pair words for everything in the file.
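The /e modifier used in the substitutions above makes Perl evaluate the replacement as code, so a counter can be incremented once per match. A small sketch with made-up sample text; note that the substitution rewrites the string it is applied to, which is why a copy is used here:

```perl
use strict;
use warnings;

my $text  = "cat dog cat bird cat";
my $count = 0;

# Work on a copy, because s///e rewrites the string it is applied to.
(my $copy = $text) =~ s/\bcat\b/++$count/eg;

print "$count\n";    # 3
print "$copy\n";     # 1 dog 2 bird 3

# Counting without modifying anything: collect the matches in list context.
my $matches = () = $text =~ /\bcat\b/g;
print "$matches\n";  # 3
```

The second form is often simpler when only the number of matches is needed.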
To find count of all words present in the file you can do something like:
#!/usr/bin/perl
use strict;
use warnings;
my %count_of;
while (my $line = <>) { #read from file or STDIN
foreach my $word (split /\s+/, $line) {
$count_of{$word}++;
}
}
print "All words and their counts: \n";
for my $word (sort keys %count_of) {
print "'$word': $count_of{$word}\n";
}
__END__

Can't make sense out of this Perl code

This snippet basically reads a file line by line, which looks something like:
Album=In Between Dreams
Interpret=Jack Johnson
Titel=Better Together
Titel=Never Know
Titel=Banana Pancakes
Album=Pictures
Interpret=Katie Melua
Titel=Mary Pickford
Titel=It's All in My Head
Titel=If the Lights Go Out
Album=All the Lost Souls
Interpret=James Blunt
Titel=1973
Titel=One of the Brightest Stars
So it somehow connects the "Interpreter" with an album and this album with a list of titles. But what I don't quite get is how:
while ($line = <IN>) {
chomp $line;
if ($line =~ /=/) {
($name, $wert) = split(/=/, $line);
}
else {
next;
}
if ($name eq "Album") {
$album = $wert;
}
if ($name eq "Interpret") {
$interpret = $wert;
$cd{$interpret}{album} = $album; # assigns an album to an interpreter?
$titelnummer = 0;
}
if ($name eq "Titel") {
$cd{$interpret}{titel}[$titelnummer++] = $wert; # assigns titles to an interpreter - WTF? how can this work?
}
}
The while loop keeps running and putting the current line into $line as long as there are new lines in the file handle <IN>. chomp removes the newline at the end of every row.
split splits the line into two parts on the equal sign (/=/ is a regular expression) and puts the first part in $name and the second part in $wert.
%cd is a hash that contains references to other hashes. The first "level" is the name of interpreter.
(Please ask more specific questions if you still do not understand.)
%cd is a hash of hashes.
$cd{$interpret}{album} contains the album for the interpreter.
$cd{$interpret}{titel} contains a reference to an array of titles, which is filled incrementally in the last if.
Perl is a very concise language.
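The part that usually surprises people is that the nested hashes and the array are never created explicitly. Perl creates them on first use (autovivification). A reduced sketch using four lines from the sample file; it uses push instead of the $titelnummer counter, which is a substitution on my part and not what the original code does:

```perl
use strict;
use warnings;

my (%cd, $album, $interpret);
for my $line ('Album=Pictures', 'Interpret=Katie Melua',
              'Titel=Mary Pickford', "Titel=It's All in My Head") {
    my ($name, $wert) = split /=/, $line, 2;
    if    ($name eq 'Album')     { $album = $wert }
    elsif ($name eq 'Interpret') { $interpret = $wert; $cd{$interpret}{album} = $album }
    elsif ($name eq 'Titel')     { push @{ $cd{$interpret}{titel} }, $wert }
}

# Walk the structure: artist -> album, artist -> list of titles.
for my $artist (sort keys %cd) {
    print "$artist: $cd{$artist}{album}\n";
    print "  $_\n" for @{ $cd{$artist}{titel} };
}
```

The first assignment to $cd{$interpret}{album} creates the inner hash, and the first push creates the array behind {titel}.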
The best way to figure out what's going on is to inspect the data structure. After the while loop, temporarily insert this code:
use Data::Dumper;
print '%cd ', Dumper \%cd;
exit;
This may have a large output if the input is large.

How can i detect symbols using regular expression in perl?

How can I use a regular expression to check whether a word starts or ends with a symbol character, and how can I process the text within the symbols?
Example:
(text) or te-xt, or tex't. or text?
change it to
(<t>text</t>) or <t>te-xt</t>, or <t>tex't</t>. or <t>text</t>?
help me out?
Thanks
I assume that "word" means alphanumeric characters from your example? If you have a list of permitted characters which constitute a valid word, then this is enough:
my $string = "x1 .text1; 'text2 \"text3;\"";
$string =~ s/([a-zA-Z0-9]+)/<t>$1<\/t>/g;
# Add more to character class [a-zA-Z0-9] if needed
print "$string\n";
# OUTPUT: <t>x1</t> .<t>text1</t>; '<t>text2</t> "<t>text3</t>;"
UPDATE
Based on your example, you seem to want to DELETE dashes and apostrophes. If you want to delete them globally (that is, whether they are inside a word or not), do this before the first regex:
$string =~ s/['-]//g;
I am using DVK's approach here, but with a slight modification. The difference is that her/his code would also put the tags around all words that don't contain/are next to a symbol, which (according to the example given in the question) is not desired.
#!/usr/bin/perl
use strict;
use warnings;
sub modify {
my $input = shift;
my $text_char = 'a-zA-Z0-9\-\''; # characters that are considered text
# if there is no symbol, don't change anything
if ($input =~ /^[a-zA-Z0-9]+$/) {
return $input;
}
else {
$input =~ s/([$text_char]+)/<t>$1<\/t>/g;
return $input;
}
}
my $initial_string = "(text) or te-xt, or tex't. or text?";
my $expected_string = "(<t>text</t>) or <t>te-xt</t>, or <t>tex't</t>. or <t>text</t>?";
# version BEFORE edit 1:
#my @aux;
# take the initial string apart and process it one word at a time
#my @string_list = split/\s+/, $initial_string;
#
#foreach my $string (@string_list) {
# $string = modify($string);
# push @aux, $string;
#}
#
# put the string together again
#my $final_string = join(' ', @aux);
# ************ EDIT 1 version ************
my $final_string = join ' ', map { modify($_) } split/\s+/, $initial_string;
if ($final_string eq $expected_string) {
print "it worked\n";
}
This strikes me as a somewhat long-winded way of doing it, but it seemed quicker than drawing up a more sophisticated regex...
EDIT 1: I have incorporated the changes suggested by DVK (using map instead of foreach). Now the syntax highlighting is looking even worse than before; I hope it doesn't obscure anything...
This reads standard input, processes it, and prints the result to standard output.
while (<>) {
s {
( [a-zA-Z]+ ) # word
(?= [,.)?] ) # a symbol
}
{<t>$1</t>}gx ;
print ;
}
You might need to change the first group to match your concept of a word.
I have used the x modifier to allow the regex to be spread over more than one line.
If the input is in a Perl variable, try
$string =~ s{
( [a-zA-Z]+ ) # word
(?= [,.)?] ) # a symbol
}
{<t>$1</t>}gx ;
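With a word defined as letters only, te-xt and tex't from the question get tagged only partially. A sketch of a wider word pattern, assuming a word may contain single dashes or apostrophes between alphanumeric runs; tag_words is a hypothetical helper name, and like the version above it only looks at the symbol that follows the word:

```perl
use strict;
use warnings;

# tag_words() is a hypothetical helper name.
sub tag_words {
    my ($string) = @_;
    $string =~ s{
        ( [a-zA-Z0-9]+ (?: ['-] [a-zA-Z0-9]+ )* )   # word, dash or apostrophe allowed inside
        (?= [,.)?] )                                # followed by a symbol
    }{<t>$1</t>}gx;
    return $string;
}

print tag_words("(text) or te-xt, or tex't. or text?"), "\n";
```

For the sample string this prints (<t>text</t>) or <t>te-xt</t>, or <t>tex't</t>. or <t>text</t>? and leaves the plain word "or" untouched.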

perl + numeration word or parameter in file

I need help numbering text in a file.
I have a Linux machine, and I need to write the script in Perl.
I have a file named file_db.txt.
The file contains parameters such as name, ParameterFromBook, NumberPage, BOOK_From_library, price, etc.
Each parameter is assigned a value, as in name=elephant.
My question is how to do the following in Perl:
I want to append a number to each parameter name (the part before the "=") that is repeated in the file, increasing the number by 1 for each further occurrence of that parameter until EOF.
lidia
For example
file_db.txt before numbering
parameter=1
name=one
parameter=2
name=two
file_db.txt after parameters numbering
parameter1=1
name1=one
parameter2=2
name2=two
other examples
Example1 before
name=elephant
ParameterFromBook=234
name=star.world
ParameterFromBook=200
name=home_room1
ParameterFromBook=264
Example1 after parameters numbering
name1=elephant
ParameterFromBook1=234
name2=star.world
ParameterFromBook2=200
name3=home_room1
ParameterFromBook3=264
Example2 before
file_db.txt before numbering
lines_and_words=1
list_of_books=3442
lines_and_words=13
list_of_books=344224
lines_and_words=120
list_of_books=341
Example2 after
file_db.txt after parameters numbering
lines_and_words1=1
list_of_books1=3442
lines_and_words2=13
list_of_books2=344224
lines_and_words3=120
list_of_books3=341
It can be condensed to a one line perl script pretty easily, though I don't particularly recommend it if you want readability:
#!/usr/bin/perl
s/(.*)=/$k{$1}++;"$1$k{$1}="/e and print while <>;
This version reads from a specified file, rather than using the command line:
#!/usr/bin/perl
open IN, "/tmp/file";
s/(.*)=/$k{$1}++;"$1$k{$1}="/e and print while <IN>;
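For readability, the same idea can be spelled out as below. This is a sketch rather than an exact equivalent: number_line is a hypothetical helper name, it splits at the first "=" (the one-liner's greedy (.*) splits at the last one), and it passes lines without "=" through unchanged, whereas the one-liner drops them.

```perl
use strict;
use warnings;

my %seen;    # how many times each parameter name has appeared so far

# number_line() is a hypothetical helper name.
sub number_line {
    my ($line) = @_;
    if ($line =~ /^([^=]+)=(.*)$/s) {
        my ($key, $value) = ($1, $2);
        my $n = ++$seen{$key};
        return "$key$n=$value";
    }
    return $line;    # no "=", pass the line through unchanged
}

print number_line($_) for map { "$_\n" }
    qw(name=elephant ParameterFromBook=234 name=star.world ParameterFromBook=200);
```

The output is name1=elephant, ParameterFromBook1=234, name2=star.world, ParameterFromBook2=200, one per line.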
The way I look at it, you probably want to number blocks and not just occurrences. So you probably want the number on each of the keys to be at least as great as the earliest repeating key.
my $in = \*::DATA;
my $out = \*::STDOUT;
my %occur;
my $num = 0;
while ( <$in> ) {
if ( my ( $pre, $key, $data ) = m/^(\s*)(\w+)=(.*)/ ) {
$num++ if $num < ++$occur{$key};
print { $out } "$pre$key$num=$data\n";
}
else {
$num++;
print;
}
}
__DATA__
name=elephant
ParameterFromBook=234
name=star.world
ParameterFromBook=200
name=home_room1
ParameterFromBook=264
However, if you just want to give each key its own occurrence count, this is enough:
my %occur;
while ( <$in> ) {
my ( $pre, $key, $data ) = m/^(\s*)(\w+)=(.*)/;
$occur{$key}++;
print { $out } "$pre$key$occur{$key}=$data\n";
}
Roughly like this:
use strict;
use warnings;
open(my $fh, '<', "file") or die "Cannot open file: $!";
my @lines = <$fh>;
close($fh);
my %tags; # how many times each name has been seen so far
foreach my $line (@lines)
{
chomp($line);
my ($name, $value) = split(/=/, $line, 2);
$tags{$name}++;
print "$name$tags{$name}=$value\n";
}
This looks like a CS101 assignment. Is it really good to ask for complete solutions instead of asking specific technical questions if you have difficulty?
If Perl is not a must, here's an awk version
$ cat file
name=elephant
ParameterFromBook=234
name=star.world
ParameterFromBook=200
name=home_room1
ParameterFromBook=264
$ awk -F"=" '{s[$1]++}{print $1s[$1],$2}' OFS="=" file
name1=elephant
ParameterFromBook1=234
name2=star.world
ParameterFromBook2=200
name3=home_room1
ParameterFromBook3=264