Finding words in past tense using lex

I want to know how to write lex code that identifies and prints words in the past tense. I have written some sample code, but although it identifies the past-tense words, it doesn't print the whole word. Please help.
%{
#include<stdio.h>
%}
%%
[a-zA-Z]"ed" {printf("%s is in past tense\n",yytext);}
[a-zA-Z0-9,$.]
%%
main()
{
yyin = fopen("pos.c","r");
yylex();
}
When I gave the following as input:
wanted loved gained maintained decided received
This is the output I got:
ted is in past tense
ved is in past tense
ned is in past tense
ned is in past tense
ded is in past tense
ved is in past tense
This is the required output:
wanted is in past tense
loved is in past tense
gained is in past tense
maintained is in past tense
decided is in past tense
received is in past tense

Your pattern asks lex to match a single letter followed by ed, so that is all that gets printed. To match (and thus print) the whole word, the letter class has to repeat: use a pattern like [a-zA-Z]+"ed", which matches one or more letters followed by ed.
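The difference between the two patterns can be seen outside lex as well. The following is a minimal sketch in Python's re module, not lex itself; the variable names and the \b word-boundary anchors are my own additions for illustration (lex relies on longest-match instead).

```python
import re

text = "wanted loved gained maintained decided received"

# One letter followed by "ed": reproduces the truncated matches from the question.
short = re.findall(r"[a-zA-Z]ed", text)

# One or more letters followed by "ed", bounded by word edges: whole words.
whole = re.findall(r"\b[a-zA-Z]+ed\b", text)

print(short)
for word in whole:
    print(f"{word} is in past tense")
```

The first pattern can only ever consume three characters, which is why the original rule printed ted, ved, and so on.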

Related

How to determine the lines, and how many words in my song using split() in python

Music lyrics
Look what you made me do, I'm with somebody new
Ohh, baby, baby, I'm dancing with a stranger
I tried using re.split and also tried a loop.
Just a heads up: you should always post your code with the question. Anyway, .split() works fine. I would count lines by counting the newline characters '\n' and adding one for the last line, which has no trailing newline:
s = "Look what you made me do, I'm with somebody new\nOhh, baby, baby, I'm dancing with a stranger"
lines = s.count('\n') + 1  # one more line than there are newline characters
lst = s.split()  # split() with no arguments already treats '\n' as whitespace
words = len(lst)
lyric = "Look what you made me do, I'm with somebody new Ohh, baby, baby, I'm dancing with a stranger"
print(len(lyric.split()))
This splits the lyric on whitespace into a list of words and prints the length of that list, which is the number of words in the song.
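If a per-line breakdown is also wanted, the two ideas above can be combined. This is only a sketch: the helper name count_lines_and_words is hypothetical, and it assumes the lyric is stored with a newline between the two lines, as in the question.

```python
lyric = (
    "Look what you made me do, I'm with somebody new\n"
    "Ohh, baby, baby, I'm dancing with a stranger"
)

def count_lines_and_words(text):
    """Return the number of lines and a list with the word count of each line."""
    lines = text.splitlines()  # handles a missing trailing newline
    words_per_line = [len(line.split()) for line in lines]
    return len(lines), words_per_line

line_count, words_per_line = count_lines_and_words(lyric)
print(line_count, words_per_line, sum(words_per_line))
```

Using splitlines() avoids the off-by-one that comes from counting '\n' when the last line has no trailing newline.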

How can I convert text to title case?

I have a text file containing a list of titles that I need to change to title case (words should begin with a capital letter except for most articles, conjunctions, and prepositions).
For example, this list of book titles:
barbarians at the gate
hot, flat, and crowded
A DAY LATE AND A DOLLAR SHORT
THE HITCHHIKER'S GUIDE TO THE GALAXY
should be changed to:
Barbarians at the Gate
Hot, Flat, and Crowded
A Day Late and a Dollar Short
The Hitchhiker's Guide to the Galaxy
I wrote the following code:
while(<DATA>)
{
$_=~s/(\s+)([a-z])/$1.uc($2)/eg;
print $_;
}
But it capitalizes the first letter of every word, even words like "at," "the," and "a" in the middle of a title:
Barbarians At The Gate
Hot, Flat, And Crowded
A Day Late And A Dollar Short
The Hitchhiker's Guide To The Galaxy
How can I do this?
Thanks to Håkon Hægland, whose comment ("See also Lingua::EN::Titlecase") gave me the way to get the output:
use Lingua::EN::Titlecase;
while(<DATA>)
{
    # Build a title-caser for the current line; printing the object
    # prints the title-cased text.
    my $tc = Lingua::EN::Titlecase->new($_);
    print $tc;
}
You can also try this regex: ^(.)(.*?)\b|\b(at|to|that|and|this|the|a|is|was)\b|\b(\w)([\w']*?(?:[^\w'-]|$)) and replace with \U$1\L$2$3\U$4\L$5. It works by matching the first letter of each word that is not an article, capitalizing it, and then matching the rest of the word. This seems to work in PHP; I haven't tried it in Perl, but it will likely work there too.
^(.)(.*?)\b matches the first letter of the first word (group 1) and the rest of that word (group 2). This ensures the first word is capitalized even when it is an article.
\b(word|multiple words|...)\b matches any connecting word (group 3) so that it is lowercased rather than capitalized.
\b(\w)([\w']*?(?:[^\w'-]|$)) matches the first letter of any other word (group 4) and the rest of the word (group 5). Here I used [^\w'-] instead of \b so that hyphens and apostrophes count as word characters too. This prevents 's from becoming 'S.
The \U in replacement capitalizes the following characters and \L lowers them. If you want you can add more articles or words to the regex to prevent capitalizing them.
UPDATE: I changed the regex so you can include connecting phrases too (multiple words). But that will still make a very long regex...
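For comparison, the same rule (always capitalize the first word, keep connecting words lowercase, capitalize everything else) can be sketched without a regex. This is an illustrative Python version, not a port of the Perl or PHP code above; the function name title_case and the constant LOWERCASE_WORDS are hypothetical, and the word list is simply the one from the regex.

```python
# Connecting words taken from the regex above; extend as needed.
LOWERCASE_WORDS = {"at", "to", "that", "and", "this", "the", "a", "is", "was"}

def title_case(line, lowercase_words=LOWERCASE_WORDS):
    """Title-case a line, leaving connecting words lowercase except at the start."""
    result = []
    for index, word in enumerate(line.lower().split()):
        bare = word.strip(".,;:!?\"'")  # compare without surrounding punctuation
        if index > 0 and bare in lowercase_words:
            result.append(word)
        else:
            # capitalize() uppercases the first character only, so 's stays 's
            result.append(word.capitalize())
    return " ".join(result)

titles = [
    "barbarians at the gate",
    "hot, flat, and crowded",
    "A DAY LATE AND A DOLLAR SHORT",
    "THE HITCHHIKER'S GUIDE TO THE GALAXY",
]
for title in titles:
    print(title_case(title))
```

Lowercasing the whole line first is what lets the all-caps titles come out right; a hand-maintained word list has the same limitation as the regex, which is why a module like Lingua::EN::Titlecase is attractive.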

RFC 2445, RFC 5545 vs rrule.js: should DTSTART be counted as the first occurrence or not?

Recently I began working with the google-rfc-2445 library and ran into the same problem as SO user Art Zaborskiy here: Start date returns in some cases when using google-rfc-2445 (iCalendar)
SO user oberron pointed to the specification in his answer here, which states:
[...] The COUNT rule part defines the number of occurrences at which
to range-bound the recurrence. The "DTSTART" property value, if
specified, counts as the first occurrence. [...]
I was sure that my issue was resolved, and that it was a feature rather than a bug. I thought so until I found the rrule.js library here: on its demo page it returns exactly 10 occurrences, without the DTSTART occurrence, while the same RRULE string in google-rfc-2445 returns 11 occurrences.
The RRULE String is following: FREQ=WEEKLY;COUNT=10;BYDAY=MO;DTSTART=20150301
Now I am totally confused: should the DTSTART occurrence be in the list of all occurrences or not?
Thank you for your clarification.
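To make the discrepancy concrete, here is a hand-rolled sketch using only Python's datetime module. It does not call google-rfc-2445 or rrule.js and is not a statement of what either library does internally; the two "readings" are assumptions that merely show how counts of 10 and 11 could arise. Note that 2015-03-01 is a Sunday, so DTSTART itself does not match BYDAY=MO.

```python
from datetime import date, timedelta

dtstart = date(2015, 3, 1)  # DTSTART=20150301
count = 10                  # COUNT=10
MONDAY = 0                  # date.weekday() numbering: Monday is 0

# First Monday on or after DTSTART, then one Monday per week (FREQ=WEEKLY;BYDAY=MO).
first_monday = dtstart + timedelta(days=(MONDAY - dtstart.weekday()) % 7)
mondays = [first_monday + timedelta(weeks=i) for i in range(count)]

# Reading A (assumed): only dates matching the rule are listed.
reading_a = mondays

# Reading B (assumed): DTSTART is listed as well, even though it is not a Monday.
reading_b = ([dtstart] if dtstart not in mondays else []) + mondays

print(dtstart.strftime("%A"), len(reading_a), len(reading_b))
```

When DTSTART falls on a day the rule itself would generate, the two readings coincide, so the difference only shows up for a start date like this one.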

Iphone work out if string is a UK Postcode

In my app before I send a string off I need to work out if the text entered in the textbox is a UK Postcode. I don't have the regex ability to work that out for myself and after searching around I can't seem to work it out! Just wondered if anyone has done a similar thing in the past?
Or if anyone can point me in the right direction I would be most appreciative!
Tom
Wikipedia has a good section about this. Basically the answer depends on what sort of pathological cases you want to handle. For example:
An alternative short regular expression from BS7666 Schema is:
[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}
The above expressions fail to exclude many non-existent area codes (such as A, AA, Z and ZY).
Basically, read that section of Wikipedia thoroughly and decide what you need.
For postcodes written with a space (e.g. SE1 9QZ) I use the following (it hasn't failed me yet ;-) ):
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})
If the space is optional (e.g. SE19QZ or SE1 9QZ), then:
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})$
You can match most post codes with this regex:
/[A-Z]{1,2}[0-9]{1,2}\s?[0-9]{1,2}[A-Z]{1,2}/i
Which means: A-Z one or two times ({1,2}), followed by 0-9 one or two times, followed by an optional whitespace character (\s?), followed by 0-9 one or two times, followed by A-Z one or two times. The trailing i makes the match case-insensitive.
This will match some false positives, as I can make up post codes like ZZ00 00ZZ, but to accurately match all post codes, the only way is to buy post code data from the post office - which is quite expensive. You could also download free post code databases, but they do not have 100% coverage.
Hope this helps.
Wikipedia has some regexes for UK Postcodes: http://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation
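As a rough illustration of how one of these patterns might be applied, here is a sketch using the short BS7666 expression quoted above. It is written in Python rather than Objective-C purely for brevity; the helper name looks_like_uk_postcode is hypothetical, and the upper-casing and whitespace normalisation are my own assumptions about the input.

```python
import re

# The short BS7666 pattern quoted above; it requires a single space.
BS7666_PATTERN = re.compile(r"[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}")

def looks_like_uk_postcode(text):
    """Return True if text has the shape of a UK postcode (not proof that it exists)."""
    # Upper-case the input and collapse runs of whitespace to single spaces.
    candidate = " ".join(text.upper().split())
    # fullmatch rejects strings with extra characters before or after the postcode.
    return BS7666_PATTERN.fullmatch(candidate) is not None

for sample in ["SE1 9QZ", "se1  9qz", "GIR 0AA", "ZZ00 00ZZ", "SE19QZ"]:
    print(sample, looks_like_uk_postcode(sample))
```

This only checks the format; as noted above, confirming that a postcode really exists needs actual postcode data.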

Emacs fill-paragraph not breaking lines where expected

I have the following HTML code (a list item). The content isn't important--the problem is the end of line 2.
<li>Yes, you can learn how to play piano without becoming a
great notation reader,
however, <strong class="warning">you <em class="emphatic">will</em>
have to acquire a <em class="emphatic">very</em>basic amount
of notation reading skill</strong>. But the extremely
difficult task of honing your note reading skills that
classical students are required to endure for years and years
is <em class="emphatic">totally non-existant</em>as a
requirement for playing non-classical piano.</li>
The command fill-paragraph (M-q) has been applied. I can't for the life of me figure out why a line break is being placed on the second line after "reader," since there's more space available on that line to put "however,". Another weird thing I've noticed is that when I delete and then reapply the tab characters on lines 4 and 5 (starting with "have" and "of" respectively), two space characters are automatically inserted as well, like so:
<li>Yes, you can learn how to play piano without becoming a
great notation reader,
however, <strong class="warning">you <em class="emphatic">will</em>
have to acquire a <em class="emphatic">very</em>basic amount
of notation reading skill</strong>. But the extremely
difficult task of honing your note reading skills that
classical students are required to endure for years and years
is <em class="emphatic">totally non-existant</em>as a
requirement for playing non-classical piano.</li>
I don't know if this is some kind of clue or not. This doesn't happen with any of the other lines.
Is this just a bug, or does any experienced Emacs person know what might be going on here?
Thank you
This is intentional. Lines that start with an XML or SGML tag are paragraph separator lines. If Emacs broke the paragraph in such a way that the tag ended up at the start of a line, subsequent applications of fill-paragraph would stop at that line. This is to ensure that, for instance,
<p>a paragraph</p>
<!-- no blank line -->
<p>another paragraph</p>
does not turn into
<p>a paragraph</p> <!-- no blank line --> <p>another paragraph</p>
For the same reason, Emacs will not break a line after a period unless there are two or more spaces after the period, because it uses a double space to distinguish between a period that ends a sentence and a period that ends an abbreviation, and breaking a line after the period that ends an abbreviation would create an ambiguous situation.
Looks like a bug to me.
I was able to trim down your example to something like this:
<li>blabla
blabla <b>some_long_text_here</b> <b>more_long_text_here</b>
If I remove a single character of text from it, fill-paragraph works as expected. The same happens if I add a character between the two consecutive <b> elements.