Extract n words from string using Perl - perl

i have a text containing sentences in each line, and in front of each word its lemmetize form exemple:
he "he" went "go" to "to" school "school" with "with" his "his" freinds "freind"
i would like to extract for example three by three word in in each line. The result seems like this:
he "he" went "go" to "to" \n
went "go" to "to" school "school" \n
to "to" school "school" with "with" \n
school "school" with "with" his "his" \n
with "with" his "his" freinds "freind" \n
I'd like to do this using Perl.

thank you all for your helps, i found a solution, it works, but it's in dirty code i think, that's why i asked this question, to find a better solution, the awk solution seems great but the result not like i look,
This is the solution i fixed the window at 7 words and in front of each word it's POS and it's lemmetized form:
he "he" "PRO" went "go" "V" to "to" "PREP" school "school" "N" ...
open(F,"/home/file.txt")||die "error";
my $string;
while($ligne = <F> ) {
my #val = split(/ /, $ligne);
my $long=$#val;
for($i=0; $i<$long;$i+=3){
$string="$val[$i] $val[$i+1] $val[$i+2] $val[$i+3] $val[$i+4] $val[$i+5] $val[$i+6] $val[$i+7] $val[$i+8] $val[$i+9] $val[$i+10] $val[$i+11] $val[$i+12] $val[$i+13] $val[$i+14] $val[$i+15] $val[$i+16] $val[$i+17] $val[$i+18] $val[$i+19] $val[$i+20]";
my #val2 = split(/ /, $string);
my $long2=$#val2;
if($long2 >19){ #if length superior at 19, (3*7)
print FILEOUT "$string\n";
$string="";
}
}
}

This script starts by reading the entire line as an array of words (#words) and then uses an #aux array as a FIFO... discarding the first 2 elements at each pass and keeping the FIFO size always 6 itens... then, reapeat while there are words in the #words array:
#!/usr/bin/perl
use strict;
my $file = 'file.txt';
open(F,$file)||die "error";
my #aux;
while(<F>) {
my #words = split /\s+/;
while($#words >= 0) {
while($#aux < 5 && $#words >= 0) {
my $a = shift #words;
push #aux, $a;
}
print ((join " ", #aux)."\n");
shift #aux;
shift #aux;
}
}

Related

lowercase everything except content between single quotes - perl

Is there a way in perl to replace all text in input line except ones within single quotes(There could be more than one) using regex, I have achieved this using the code below but would like to see if it can be done with regex and map.
while (<>) {
my $m=0;
for (split(//)) {
if (/'/ and ! $m) {
$m=1;
print;
}
elsif (/'/ and $m) {
$m=0;
print;
}
elsif ($m) {
print;
}
else {
print lc;
}
}
}
**Sample input:**
and (t.TARGET_TYPE='RAC_DATABASE' or (t.TARGET_TYPE='ORACLE_DATABASE' and t.TYPE_QUALIFIER3 != 'racinst'))
**Sample output:**
and (t.target_type='RAC_DATABASE' or (t.target_type='ORACLE_DATABASE' and t.type_qualifier3 != 'racinst'))
You can give this a shot. All one regexp.
$str =~ s/(?:^|'[^']*')\K[^']*/lc($&)/ge;
Or, cleaner and more documented (this is semantically equivalent to the above)
$str =~ s/
(?:
^ | # Match either the start of the string, or
'[^']*' # some text in quotes.
)\K # Then ignore that part,
# because we want to leave it be.
[^']* # Take the text after it, and
# lowercase it.
/lc($&)/gex;
The g flag tells the regexp to run as many times as necessary. e tells it that the substitution portion (lc($&), in our case) is Perl code, not just text. x lets us put those comments in there so that the regexp isn't total gibberish.
Don't you play too hard with regexp for such a simple job?
Why not get the kid 'split' for it today?
#!/usr/bin/perl
while (<>)
{
#F = split "'";
#F = map { $_ % 2 ? $F[$_] : lc $F[$_] } (0..#F);
print join "'", #F;
}
The above is for understanding. We often join the latter two lines reasonably into:
print join "'", map { $_ % 2 ? $F[$_] : lc $F[$_] } (0..#F);
Or enjoy more, making it a one-liner? (in bash shell) In concept, it looks like:
perl -pF/'/ -e 'join "'", map { $_ % 2 ? $F[$_] : lc $F[$_] } (0..#F);' YOUR_FILE
In reality, however, we need to respect the shell and do some escape (hard) job:
perl -pF/\'/ -e 'join "'"'"'", map { $_ % 2 ? $F[$_] : lc $F[$_] } (0..#F);' YOUR_FILE
(The single-quoted single quote needs to become 5 letters: '"'"')
If it doesn't help your job, it helps sleep.
One more variant with Perl one-liner. I'm using hex \x27 for single quotes
$ cat sql_str.txt
and (t.TARGET_TYPE='RAC_DATABASE' or (t.TARGET_TYPE='ORACLE_DATABASE' and t.TYPE_QUALIFIER3 != 'racinst'))
$ perl -ne ' { #F=split(/\x27/); for my $val (0..$#F) { $F[$val]=lc($F[$val]) if $val%2==0 } ; print join("\x27",#F) } ' sql_str.txt
and (t.target_type='RAC_DATABASE' or (t.target_type='ORACLE_DATABASE' and t.type_qualifier3 != 'racinst'))
$

Aligning file output with "\t"

I have an assignment that requires me to print out some sorted lists and delimit the fields by '\t'. I've finished the assignment but I cannot seem to get all the fields to line up with just the tab character. Some of the output is below, names that are over a certain length break the fields. How can I still use '\t' and get everything aligned by only that much space?
open(DOB, ">dob.txt") || die "cannot open $!";
# Output name and DOB, sorted by month
foreach my $key (sort {$month{$a} <=> $month{$b}} keys %month)
{
my #fullName = split(/ /, $namelist{$key});
print DOB "$fullName[1], $fullName[0]\t$doblist{$key}\n";
}
close(DOB);
Current output:
Santiago, Jose 1/5/58
Pinhead, Zippy 1/1/67
Neal, Jesse 2/3/36
Gutierrez, Paco 2/28/53
Sailor, Popeye 3/19/35
Corder, Norma 3/28/45
Kirstin, Lesley 4/22/62
Fardbarkle, Fred 4/12/23
You need to know how many spaces are equivalent to a tab. Then you can work out how many tabs are covered by each entry.
If tabs take 4 spaces then the following code works:
$TAB_SPACE = 4;
$NUM_TABS = 4;
foreach my $key (sort {$month{$a} <=> $month{$b}} keys %month) {
my #fullName = split(/ /, $namelist{$key});
my $name = "$fullName[1], $fullName[0]";
# This rounds down, but that just means you need a partial tab
my $covered_tabs = int(length($name) / $TAB_SPACE);
print $name . ("\t" x ($NUM_TABS - $covered_tabs)) . $doblist{$key}\n";
}
You need to know how many tabs to pad out to, but you could work that out in a very similar way to actually printing the lines.

Exact pattern match using perl index() function

I am trying to use the index() function and I want to find the position of a word inside a string, only when it is an exact match. For example:
My string is STRING="CATALOG SCATTER CAT CATHARSIS"
And my search string is KEY=CAT
I want to say something like index($STRING, $KEY) and check match for CAT, and not CATALOG. How do I accomplish this? The documentation says
The index function searches for one string within another, but without the wildcard-like behavior of a full regular-expression pattern match.
which makes me think that it may not be that straight-forward, but my perl skills are limited :). Is it possible to do what I am trying to do?
Hopefully, I was able to articulate my question well. Thanks in advance for your help!
How about:
my $str = "CATALOG SCATTER CAT CATHARSIS";
my $key = "CAT";
if ($str =~ /\b$key\b/) {
say "match at char ",$-[0];;
} else {
say "no match";
}
output:
match at char 16
You need to learn about Regular Expressions in Perl. Perl didn't invent Regular Expressions, but tremendously expanded upon the concept. In fact, many other programming languages talk specifically about using Perl Regular Expressions.
A regular expression matches a specific word pattern. For example, /cat/ matches the sequence cat in a string.
if ( $string =~ /cat/ ) {
print "String contains the letters 'cat' in a row\n";
}
In many ways, this does the same thing as:
my $location = index ( $string, "cat" );
if ( $location =! -1 ) { # index returns -1 when substring isn't found
print "String contains the letters 'cat' in a row\n";
}
But, both of these would match:
"Don't let the cat out of the bag"
"The Sears catalog arrived in the mail"
You don't want to match the last. So, you could do this:
my $location = index $string, " cat ";
Now, index $string, " cat " won't match the word catalog. Case closed! Or is it? What about:
"cat and dog it doth rain."
Maybe you could check and say things are okay if a sentence starts with "cat":
if ( (index ($string, " cat ") != -1) or (index ($string, "cat") = 0) ) {
print "String contains the letters 'cat' in a row\n";
}
But, what about these?
"The word CAT in all uppercase"
"Stupid cat"
"Cat! Here Cat! Common Cat!": Punctuation after the word "cat"
"Don't let the 'cat' out of the 'bag'": Quotation Marks around "cat"
It could take dozens of lines to specify each and every one of these conditions.
However:
if ( $string =~ /\bcat\b/i ) {
print "String contains the word 'cat' in it\n";
}
Specifies each and every one -- and then some. The \b says this is a word boundary. This could be a space, a tab, a quote, the beginning or ending of a line. Thus /\bcat\b/ specifies that this should be the word cat and not catalog. The i on the end tells your regular expression to ignore case when matching, so you'll find Cat, cat, CAT, cAt, and all other possible combinations.
In fact, Perl's regular expressions is what made Perl such a popular language to begin with.
Fortunately, Perl comes with not one, but two tutorials on Regular Expressions:
perlretut: Perl Regular Expression Tutorial
perlrequick: Perl Regular Expression Quick Start.
Hope this helps.
That's (partial) solution of this problem with index:
use warnings;
use strict;
my $test = 'CATALOG SCATTER CAT CATHARSIS';
my $key = 'CAT';
my $k_length = length $key;
my $s_length = (length $test) - $k_length;
my $pos = -1;
while (($pos = index $test, $key, $pos + 1) > -1) {
if ($pos > 0) {
my $prev_char = substr $test, $pos - 1, 1;
### print "Previous character: '$prev_char'\n";
next if $prev_char ge 'A' && $prev_char le 'Z'
|| $prev_char ge 'a' && $prev_char le 'z';
}
if ($pos < $s_length) {
my $next_char = substr $test, $pos + $k_length, 1;
### print "Next character: '$next_char'\n";
next if $next_char ge 'A' && $next_char le 'Z'
|| $next_char ge 'a' && $next_char le 'z';
}
print "Word '$key' found at " . $pos + 1 . "th position.\n";
}
As you see, it's kinda wordy, because it uses basic Perl string functions - index and substr - only. Checking whether the substring found is indeed a word is done via checking its next and previous characters (if they exist): if they belong to either A-Z or a-z range, it's not a word.
You can simplify it a bit by trying to lowercase these characters (with lc), then check against the single character range only:
my $lc_prev_char = lc( substr $test, $pos - 1, 1 );
next if $lc_prev_char ge 'a' && $lc_prev_char le 'z';
... but then again, it's quite a minor improvement (if improvement at all).
Now consider this:
my $test = 'CATALOG SCATTER CAT CATHARSIS CAT';
my $key = 'CAT';
while ($test =~ /(?<![A-Za-z])$key(?![A-Za-z])/g) {
print "Word '$key' found at " . ($-[0] + 1) . "th position.\n";
}
... and that's it! The pattern literally tests the string given ($test) for the substring given ($key) not being either preceded with or followed by the symbol of A-Za-z range, and supporting Perl regex magic (this variable, in particular) makes it easy to get the starting position of such substring.
The bottom line: use regexes to do the regexes' work.
Regular expressions allow for the search to contain word boundaries as well as distinct characters. While
my $string = "CATALOG SCATTER CAT CATHARSIS";
index($string, 'CAT');
will return zero or greater if $string contains the characters CAT, a regular expression like
$string =~ /\bCAT\b/;
will return false as $string doesn't contain CAT preceded and followed by a word boundary. (A word boundary is either the beginning or end of the string, or between an word character and a non-word character. A word character is any alphanumeric character or an underscore.)
use \E value.
so :
#!usr/bin/perl
my $string ="Little Tony";
my $check = "Ton";
if($string =~ m/$check\E/g)
{
print "match";
}
else
{
die("No Match");
}

How do I count the "real" words in a text with Perl?

I've run into a text processing problem. I've an article, and I'd like to find out how many "real" words there are.
Here is what I mean by "real". Articles usually contain various punctuation marks such as dashes, and commas, dots, etc. What I'd like to find out is how many words there are, skipping like "-" dashes and "," commas with spaces, etc.
I tried doing this:
my #words = split ' ', $article;
print scalar #words, "\n";
But that includes various punctuations that have spaces in them as words.
So I'm thinking of using this:
my #words = grep { /[a-z0-9]/i } split ' ', $article;
print scalar #words, "\n";
This would match all words that have characters or numbers in them. What do you think, would this be good enough way to count words in an article?
Does anyone know maybe of a module on CPAN that does this?
Try to use: \W - any non-word character, and also drop _
Solution
use strict;
my $article = 'abdc, dd_ff, 11i-11, ff44';
# case David's, but it didn't work with I'm or There's
$article =~ s/\'//g;
my $number_words = scalar (split /[\W_]+/, $article);
print $number_words;
I think your solution is about as good as you're going to get without resorting to something elaborate.
You could also write it as
my #words = $article =~ /\S*\w\S*/
or count the words in a file by writing
my $n = 0;
while (<>) {
my #words = /\S*\w\S*/g;
$n += #words;
}
say "$n words found";
Try a few sample blocks of text and look at the list of "words" that it finds. If you are happy with that then your code works.

How to isolate a word that corresponds with a letter from a different column of a CSV file?

I have a CSV file, like this:
ACDB,this is a sentence
BECD,this is another sentence
BCAB,this is yet another
Each character in the first column corresponds to a word in the second column, e.g., in the first column, A corresponds with "this", C with "is", D with "a", and B, with sentence.
Given the variable character, which can be set to any of the characters appearing in the first column, I need to isolate the word which corresponds to the selected letter, e.g., if I set character="B", then the output of the above would be:
sentence
this
this another
If I set `character="C", then the output of the above would be:
is
another
is
How can I output only those words which correspond to the position of the selected letter?
The file contains many UTF-8 characters.
For every character in column 1, there is always an equal number of words in column 2.
The words in column 2 are separated by spaces.
Here is the code I have so far:
while read line
do
characters="$(echo $line | awk -F, '{print $1}')"
words="$(echo $line | awk -F, '{print $2}')"
character="B"
done < ./file.csv
This might work for you:
x=B # set wanted key variable
sed '
:a;s/^\([^,]\)\(.*,\)\([^ \n]*\) *\(.*\)/\2\4\n\1 \3/;ta # pair keys with values
s/,// # delete ,
s/\n[^'$x'] [^\n]*//g # delete unwanted keys/values
s/\n.//g # delete wanted keys
s/ // # delete first space
/^$/d # delete empty lines
' file
sentence
this
this another
or in awk:
awk -F, -vx=B '{i=split($1,a,"");split($2,b," ");c=s="";for(n=1;n<=i;n++)if(a[n]==x){c=c s b[n];s=" "} if(length(c))print c}' file
sentence
this
this another
This seems to do the trick. It reads data from within the source file using the DATA file handle, whereas you will have to obtain it from your own source. You may also have to cater for there being no word corresponding to a given letter (as for 'A' in the second data line here).
use strict;
use warnings;
my #data;
while (<DATA>) {
my ($keys, $words) = split /,/;
my #keys = split //, $keys;
my #words = split ' ', $words;
my %index;
push #{ $index{shift #keys} }, shift #words while #keys;
push #data, \%index;
}
for my $character (qw/ B C /) {
print "character = $character\n";
print join(' ', #{$_->{$character}}), "\n" for #data;
print "\n";
}
__DATA__
ACDB,this is a sentence
BECD,this is another sentence
BCAB,this is yet another
output
character = B
sentence
this
this another
character = C
is
another
is
Here's a mostly - done rump answer.
Since SO is not a "Do my work for me" site, you will need to fill in some trivial blanks.
sub get_index_of_char {
my ($character, $charset) = #_;
# Homework: read about index() function
#http://perldoc.perl.org/functions/index.html
}
sub split_line {
my ($line) = #_;
# Separate the line into a charset (before comma),
# and whitespace separated word list.
# You can use a regex for that
my ($charset, #words) = ($line =~ /^([^,]+),(?(\S+)\s+)+(\S+)$/g); # Not tested
return ($charset, \#words);
}
sub process_line {
my ($line, $character) = #_;
chomp($line);
my ($charset, $words) = split_line($line);
my $index = get_index_of_char($character, $charset);
print $words->[$index] . "\n"; # Could contain a off-by-one bug
}
# Here be the main loop calling process_line() for every line from input