How can I convert text to title case? - perl

I have a text file containing a list of titles that I need to change to title case (words should begin with a capital letter except for most articles, conjunctions, and prepositions).
For example, this list of book titles:
barbarians at the gate
hot, flat, and crowded
A DAY LATE AND A DOLLAR SHORT
THE HITCHHIKER'S GUIDE TO THE GALAXY
should be changed to:
Barbarians at the Gate
Hot, Flat, and Crowded
A Day Late and a Dollar Short
The Hitchhiker's Guide to the Galaxy
I wrote the following code:
while(<DATA>)
{
$_=~s/(\s+)([a-z])/$1.uc($2)/eg;
print $_;
}
But it capitalizes the first letter of every word, even words like "at," "the," and "a" in the middle of a title:
Barbarians At The Gate
Hot, Flat, And Crowded
A Day Late And A Dollar Short
The Hitchhiker's Guide To The Galaxy
How can I do this?

Thanks to See also Lingua::EN::Titlecase – Håkon Hægland given the way to get the output.
use Lingua::EN::Titlecase;
my $tc = Lingua::EN::Titlecase->new();
while(<DATA>)
{
my $line = $_;
my $tc = Lingua::EN::Titlecase->new($line);
print $tc;
}

You can also try using this regex: ^(.)(.*?)\b|\b(at|to|that|and|this|the|a|is|was)\b|\b(\w)([\w']*?(?:[^\w'-]|$)) and replace with \U$1\L$2\U$3\L$4. It works my matching the first letter of words that are not articles, capitalizing it, then matching the rest of the word. This seems to work in PHP, I don't know about Perl but it will likely work.
^(.)(.*?)\b matches the first letter of the first word (group 1) and the rest of the word (group 2). This is done to prevent not capitalizing the first word because it's an article.
\b(word|multiple words|...)\b matches any connecting word to prevent capitalizing them.
(\w)([\w']*?(?:[^\w'-]|$)) matches the first letter of a word (group 3) and the rest of the word (group 4). Here I used [^\w'-] instead of \b so hyphens and apostrophes are counted as word characters too. This prevent 's from becoming 'S
The \U in replacement capitalizes the following characters and \L lowers them. If you want you can add more articles or words to the regex to prevent capitalizing them.
UPDATE: I changed the regex so you can include connecting phrases too (multiple words). But that will still make a very long regex...

Related

Replace words but only after a colon

I have been researching this for quite some time but cannot seem to find an answer. Perhaps someone here can help.
I am trying to use sed to replace words in yml / yaml files. Since some of the words are included in the names I want to only replace words that appear after the colon (':').
For example. If the .yml file includes:
en:
label_some_tracker: A tracker
label_all_tracker: All trackers
label_attachment_type_trackers: Select trackers.
tracker_plural: trackers
and I want to replace all occurrences of tracker with issue in all values. The pattern:
s/tracker/issue/
also changes the names of the fields, which breaks my code.
I can reduce the size of the problem somewhat by including terms for all possible variants of a word. For example:
s/trackers/issues/
s/tracker/issue/
but that doesn't deal with all situations.
I have tried inserting a space before the search term:
s/ tracker/ issue/
but that matches names where the search term is at the beginning of the line.
If I search for whole words then it still seems to pick up the names because ':' and '_' are 'non word' characters.
If I try to put spaces at the beginning and end of the search term but then it misses words that are at the end of a line or words patterns with punctuation marks before the training space.
The only sure way seems to be to only replace words after a colon (':') but I cannot seem to figure out how to do that with sed.
Does anyone here know how?
With GNU sed:
sed -E 's/(:.*)tracker/\1issue/g' file
Output:
en:
label_some_tracker: A issue
label_all_tracker: All issues
label_attachment_type_trackers: Select issues.
tracker_plural: issues
Replace second occurance:
sed 's/tracker/issue/2' file

officejs : Search Word document using regular expression

I want to search strings like "number 1" or "number 152" or "number 36985".
In all above strings "number " will be constant but digits will change and can have any length.
I tried Search option using wildcard but it doesn't seem to work.
basic regEx operators like + seem to not work.
I tried 'number*[1-9]*' and 'number*[1-9]+' but no luck.
This regular expression only selects upto one digit. e.g. If the string is 'number 12345' it only matches number 12345 (the part which is in bold).
Does anyone know how to do this?
Word doesn't use regular expressions in its search (Find) functionality. It has its own set of wildcard rules. These are very similar to RegEx, but not identical and not as powerful.
Using Word's wildcards, the search text below locates the examples given in the question. (Note that the semicolon separator in 1;100 may be soemthing else, depending on the list separator set in Windows (or on the Mac). My European locale uses a semicolon; the United States would use a comma, for example.
"number [0-9]{1;100}"
The 100 is an arbitrary number I chose for the maximum number of repeats of the search term just before it. Depending on how long you expect a number to be, this can be much smaller...
The logic of the search text is: number is a literal; the valid range of characters following the literal are 0 through 9; there may be one to one hundred of these characters - anything in that range is a match.
The only way RegEx can be used in Word is to extract a string and run the search on the string. But this dissociates the string from the document, meaning Word-specific content (formatting, fields, etc.) will be lost.
Try putting < and > on the ends of your search string to indicate the beginning and ending of the desired strings. This works for me: '<number [1-9]*>'. So does '<number [1-9]#>' which is probably what you want. Note that in Word wildcards the # is used where + is used in other RegEx systems.

complete list of special characters (e.g \nquit)

I desperatly tried to find out what symbol '\nquit' is... and I couldnt find any reference in the web.
What I tried to find is a complete list of all of those characters (\n, \p, \0, ...) but I couldn't find any.
cheers usche
Wikipedia has a list of C language escapes here.
As noted in my comment, I believe this represents the newline (linefeed) character \n followed by the word quit (which would be forced by the newline to the beginning of the next line of output). But in that case the string should be "double"-quoted rather than 'single'-quoted.

Why does Github Flavored Markup only add newlines for lines that start with [\w\<]?

In our site (which is aimed at highly non-technical people), we let them use Markdown when sending emails. That way, they get nice things like bold, italic, etc. Being non-technical, however, they would never get past the “add two lines to make newlines actually work” quirk.
For that reason mainly, we are using a variant of Github Flavored Markdown.
We mainly borrowed this part:
# in very clear cases, let newlines become <br /> tags
text.gsub!(/^[\w\<][^\n]*\n+/) do |x|
x =~ /\n{2}/ ? x : (x.strip!; x << " \n")
end
This works well, but in some cases it doesn’t add the new-lines, and I guess the key to that is the “in very clear cases” part of that comment.
If I interpret it correctly, this is only adding newlines for lines that start with either a word character or a ‘<’.
Does anyone know why that is? Particularly, why ‘<’?
What would be the harm in just adding the two spaces to essentially anything (lines starting with spaces, hyphens, anything)?
'<' character is used at the beginning of a line to quote messages. I guess that is the reason.
The other answer to this question is quite wrong. This has nothing to do with quoting, and the character for markdown quoting is >.
^[\w\<][^\n]*\n+
Let's break the above regex into parts:
^ = anchor start of string.
[\w\<] matches a word character or the start of word boundary. \< is not a literal, but rather a GNU word boundary. See here (do a ctrl+f for \<).
[^\n]* matches any length of non-newline characters
\n matches a new line.
+ is, I believe, a possessive quantifier.
I believe, but am not 100% sure, that this simply is used to set x to a line of text. Then, the heavy work is done with the next line:
x =~ /\n{2}/ ? x : (x.strip!; x << " \n")
This says "if x satisfies the regex \n{2} (that is, has two line breaks), leave x as is. Otherwise, strip x and append a newline character.

Count the number of words in NSString

I'm trying to implement a word count function for my app that uses UITextView.
There's a space between two words in English, so it's really easy to count the number of words in an English sentence.
The problem occurs with Chinese and Japanese word counting because usually, there's no any space in the entire sentence.
I checked with three different text editors in iPad that have a word count feature and compare them with MS Words.
For example, here's a series of Japanese characters meaning the world's idea: 世界(the world)の('s)アイデア(idea)
世界のアイデア
1) Pages for iPad and MS Words count each character as one word, so it contains 7 words.
2) iPad text editor P*** counts the entire as one word --> They just used space to separate words.
3) iPad text editor i*** counts them as three words --> I believe they used CFStringTokenizer with kCFStringTokenizerUnitWord because I could get the same result)
I've researched on the Internet, and Pages and MS Words' word counting seems to be correct because each Chinese character has a meaning.
I couldn't find any class that counts the words like Pages or MS Words, and it would be very hard to implement it from scratch because besides Japanese and Chinese, iPad supports a lot of different foreign languages.
I think CFStringTokenizer with kCFStringTokenizerUnitWord is the best option though.
Is there a way to count words in NSString like Pages and MSWords?
Thank you
I recommend keep using CFStringTokenizer. Because it's platform feature, so will be upgraded by platform upgrade. And many people in Apple are working hardly to reflect real cultural difference. Which are hard to know for regular developers.
This is hard because this is not a programming problem essentially. This is a human cultural linguistic problem. You need a human language specialist for each culture. For Japanese, you need Japanese culture specialist. However, I don't think Japanese people needs word count feature seriously, because as I heard, the concept of word itself is not so important in the Japanese culture. You should define concept of word first.
And I can't understand why you want to force concept of word count into the character count. The Kanji word that you instanced. This is equal with counting universe as 2 words by splitting into uni + verse by meaning. Not even a logic. Splitting word by it's meaning is sometimes completely wrong and useless by the definition of word. Because definition of word itself are different by the cultures. In my language Korean, word is just a formal unit, not a meaning unit. The idea that each word is matching to each meaning is right only in roman character cultures.
Just give another feature like character counting for the users in east-asia if you think need it. And counting character in unicode string is so easy with -[NSString length] method.
I'm a Korean speaker, (so maybe out of your case :) and in many cases we count characters instead of words. In fact, I never saw people counting words in my whole life. I laughed at word counting feature on MS word because I guessed nobody would use it. (However now I know it's important in roman character cultures.) I have used word counting feature only once to know it works really :) I believe this is similar in Chinese or Japanese. Maybe Japanese users use the word counting because their basic alphabet is similar with roman characters which have no concept of composition. However they're using Kanji heavily which are completely compositing, character-centric system.
If you make word counting feature works greatly on those languages (which are using by people even does not feel any needs to split sentences into smaller formal units!), it's hard to imagine someone who using it. And without linguistic specialist, the feature should not correct.
This is a really hard problem if your string doesn't contain tokens identifying word breaks (like spaces). One way I know derived from attempting to solve anagrams is this:
At the start of the string you start with one character. Is it a word? It could be a word like "A" but it could also be a part of a word like "AN" or "ANALOG". So the decision about what is a word has to be made considering all of the string. You would consider the next characters to see if you can make another word starting with the first character following the first word you think you might have found. If you decide the word is "A" and you are left with "NALOG" then you will soon find that there are no more words to be found. When you start finding words in the dictionary (see below) then you know you are making the right choices about where to break the words. When you stop finding words you know you have made a wrong choice and you need to backtrack.
A big part of this is having dictionaries sufficient to contain any word you might encounter. The English resource would be TWL06 or SOWPODS or other scrabble dictionaries, containing many obscure words. You need a lot of memory to do this because if you check the words against a simple array containing all of the possible words your program will run incredibly slow. If you parse your dictionary, persist it as a plist and recreate the dictionary your checking will be quick enough but it will require a lot more space on disk and more space in memory. One of these big scrabble dictionaries can expand to about 10MB with the actual words as keys and a simple NSNumber as a placeholder for value - you don't care what the value is, just that the key exists in the dictionary, which tells you that the word is recognised as valid.
If you maintain an array as you count you get to do [array count] in a triumphal manner as you add the last word containing the last characters to it, but you also have an easy way of backtracking. If at some point you stop finding valid words you can pop the lastObject off the array and replace it at the start of the string, then start looking for alternative words. If that fails to get you back on the right track pop another word.
I would proceed by experimentation, looking for a potential three words ahead as you parse the string - when you have identified three potential words, take the first away, store it in the array and look for another word. If you find it is too slow to do it this way and you are getting OK results considering only two words ahead, drop it to two. If you find you are running up too many dead ends with your word division strategy then increase the number of words ahead you consider.
Another way would be to employ natural language rules - for example "A" and "NALOG" might look OK because a consonant follows "A", but "A" and "ARDVARK" would be ruled out because it would be correct for a word beginning in a vowel to follow "AN", not "A". This can get as complicated as you like to make it - I don't know if this gets simpler in Japanese or not but there are certainly common verb endings like "ma su".
(edit: started a bounty, I'd like to know the very best way to do this if my way isn't it.)
If you are using iOS 4, you can do something like
__block int count = 0;
[string enumerateSubstringsInRange:range
options:NSStringEnumerationByWords
usingBlock:^(NSString *word,
NSRange wordRange,
NSRange enclosingRange,
BOOL *stop)
{
count++;
}
];
More information in the NSString class reference.
There is also WWDC 2010 session, number 110, about advanced text handling, that explains this, around minute 10 or so.
I think CFStringTokenizer with kCFStringTokenizerUnitWord is the best option though.
That's right, you have to iterate through text and simply count number of word tokens encontered on the way.
Not a native chinese/japanese speaker, but here's my 2cents.
Each chinese character does have a meaning, but concept of a word is combination of letters/characters to represent an idea, isn't it?
In that sense, there's probably 3 words in "sekai no aidia" (or 2 if you don't count particles like NO/GA/DE/WA, etc). Same as english - "world's idea" is two words, while "idea of world" is 3, and let's forget about the required 'the' hehe.
That given, counting word is not as useful in non-roman language in my opinion, similar to what Eonil mentioned. It's probably better to count number of characters for those languages.. Check around with Chinese/Japanese native speakers and see what they think.
If I were to do it, I would tokenize the string with spaces and particles (at least for japanese, korean) and count tokens. Not sure about chinese..
With Japanese you can create a grammar parser and I think it is the same with Chinese. However, that is easier said than done because natural language tends to have many exceptions, but it is not impossible.
Please note it won't really be efficient since you have to parse each sentence before being able to count the words.
I would recommend the use of a parser compiler rather than building one yourself as well to start at least you can concentrate on doing the grammar than creating the parser yourself. It's not efficient, but it should get the job done.
Also have a fallback algorithm in case your grammar didn't parse the input correctly (perhaps the input really didn't make sense to begin with) you can use the length of the string to make it easier on you.
If you build it, there could be a market opportunity for you to use it as a natural language Domain Specific Language for Japanese/Chinese business rules as well.
Just use the length method:
[#"世界のアイデア" length]; // is 7
That being said, as a Japanese speaker, I think 3 is the right answer.