Change the way the assistant reads out numbers - actions-on-google

Is there a way to change how Google Assistant reads out numbers?
For example, the assistant reads 108 as "one hundred eight." Instead, I want it to say "one oh eight".

Probably. While there are some ways you can change how it reads out numbers using the <say-as> ssml tag, you still don't have complete control over how it pronounces each number.
So
<speak>
<say-as interpret-as="cardinal">108</say-as>
</speak>
speaks it the default way - "one hundred eight" (at least for US English. Different locales may speak it differently).
<speak>
<say-as interpret-as="characters">108</say-as>
</speak>
says it the way you want (again, in US English): "one oh eight".
If you wanted it as "one zero eight", however, you'd be out of luck.
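One possible workaround, not part of the answer above, is to skip <say-as> entirely and spell each digit out as a word before building the SSML. The sketch below assumes the speech engine reads plain words as written; the helper names spell_digits and to_ssml are hypothetical.

```python
# Hypothetical helpers: spell each digit as a word so that the speech
# engine has no number left to interpret.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def spell_digits(number: str) -> str:
    """Turn a string of digits into space-separated digit words."""
    return " ".join(DIGIT_WORDS[digit] for digit in number)


def to_ssml(number: str) -> str:
    """Wrap the spelled-out digits in a minimal <speak> element."""
    return f"<speak>{spell_digits(number)}</speak>"


print(to_ssml("108"))
```

Whether the result sounds natural depends on the voice and locale, so it is worth testing on a real device.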


Is there a "n/a" symbol in unicode?

Is there a Unicode symbol for "n/a"? There are some fractions like ½, but an n/a symbol seems to be missing.
If there is none, what would be the most appropriate Unicode symbol to use for n/a on a website (one that is contained in common fonts, to avoid needing a webfont)?
Looking at the Unicode code charts, I do not see a single N/A symbol. I do, however, see ⁿ (U+207F) and ₐ (U+2090), which you could separate with / (U+002F) eg: ⁿ/ₐ, or ̷ (U+0337), eg: ⁿ̷ₐ, or ̸ (U+0338), eg: ⁿ̸ₐ. Probably not what you are hoping for, though. And I don't know if "common" fonts implement them, either.
For future reference, the fastest way I know to answer questions like the OP's when I have them myself is to go to unicodelookup.com, because of the way it works: there's a search bar at the top, and you just type a string and it will return any and all unicode characters containing that string (this is also a great way to discover new and useful symbols). So in the OP's case, he could proceed like this:
first try entering "not" (without the quotes) in the search field
visually scan through the results... doing so would not reveal a "not applicable" character in this case
try again but this time entering "applic" in the search field
again, doing so would not turn up anything along the lines of what he's looking for
At that point he would be reasonably confident the current Unicode standard does not have a "n/a" symbol.
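The same kind of name search can be approximated offline. This is a minimal sketch using Python's standard unicodedata module rather than the unicodelookup.com site; the function name search_names is hypothetical, and the results depend on the Unicode version bundled with your Python.

```python
import sys
import unicodedata


def search_names(fragment: str) -> list[str]:
    """Return every character whose Unicode name contains `fragment`."""
    needle = fragment.upper()
    matches = []
    for code_point in range(sys.maxunicode + 1):
        char = chr(code_point)
        # Code points without a name (controls, unassigned) give "".
        if needle in unicodedata.name(char, ""):
            matches.append(char)
    return matches


hits = search_names("negative acknowledge")
print([f"U+{ord(c):04X} {unicodedata.name(c)}" for c in hits])
```

Scanning every code point takes a moment, but it needs no network access and covers the whole character database at once.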
If you use Firefox you can define a keyword like "uni" to search that site from the URL bar, meaning any time the browser is open and regardless of what page or site is currently showing, you could do this:
hit [F6]... this moves the cursor to the URL bar at the top
type something like "uni applic" and hit [Enter]... this brings up the unicodelookup.com website with the search results for "applic" already showing
For the above to work you would need to define your keyword ("uni" or whatever you prefer) to point to location http://unicodelookup.com/#%s.
There's a Negative Acknowledge icon...
␕ SYMBOL FOR NEGATIVE ACKNOWLEDGE (octal 022025, decimal 9237, hex 0x2415)
Found by searching negative on the Unicode Lookup site.
I'm not a fan, and for my purposes have just gone with __N/A__ (Markdown..)
I see lots of answers going head-on at the "Not Applicable" abbreviation, without exploring what a symbol is. A quick search for the equivalent phrase "out of scope" brings up a couple of variations on the No symbol: ⃠ – this seems to fit the bill (and since I was looking for a way to represent inapplicability, I'll be using it in my technical document).
Per the Wikipedia article, the Unicode codepoint U+20E0 is a combining character, so it is superimposed on the preceding character; e.g. ! ⃠ overlays an exclamation point. To get it to appear isolated, place it after a non-breaking space.
If you don't want to bother with the combining symbol, the article mentions there's also an emoji U+1F6AB 🚫 but it's typically going to be colored red, or won't render!
There's actually a single character that could be repurposed for this: the "Square Na" character ㎁ (U+3381), which is used to represent the nanoampere in fullwidth (CJK) scripts.
What about the "SYMBOL FOR NULL" ␀ (U+2400)?

iPhone: work out if a string is a UK postcode

In my app before I send a string off I need to work out if the text entered in the textbox is a UK Postcode. I don't have the regex ability to work that out for myself and after searching around I can't seem to work it out! Just wondered if anyone has done a similar thing in the past?
Or if anyone can point me in the right direction I would be most appreciative!
Tom
Wikipedia has a good section about this. Basically the answer depends on what sort of pathological cases you want to handle. For example:
An alternative short regular expression from BS7666 Schema is:
[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}
The above expressions fail to exclude many non-existent area codes (such as A, AA, Z and ZY).
Basically, read that section of Wikipedia thoroughly and decide what you need.
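To see what the BS7666 expression accepts and rejects, here is a minimal sketch in Python (chosen for brevity; the same pattern should carry over to NSRegularExpression). The function name looks_like_uk_postcode and the upper-casing step are assumptions, not part of the quoted schema.

```python
import re

# The short BS7666 expression quoted above, used unchanged.
BS7666 = re.compile(r"[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}")


def looks_like_uk_postcode(text: str) -> bool:
    """Check the format only; it cannot tell whether the postcode exists."""
    candidate = text.strip().upper()
    # fullmatch anchors the pattern at both ends of the string.
    return BS7666.fullmatch(candidate) is not None


for sample in ("SE1 9QZ", "ec1a 1bb", "SE19QZ", "AA1 1AA"):
    print(sample, looks_like_uk_postcode(sample))
```

Note that "AA1 1AA" passes even though AA is not a real postcode area, which is the limitation the quoted text warns about, and that "SE19QZ" fails because this expression requires the space.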
For postcodes with a space (e.g. SE1 9QZ) I use the following (it's not failed me yet ;-) ):
^(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))$
If the space is optional (e.g. SE19QZ or SE1 9QZ), then:
^(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2}))$
You can match most post codes with this regex:
/[A-Z]{1,2}[0-9]{1,2}\s?[0-9]{1,2}[A-Z]{1,2}/i
That is: a letter A-Z one or two times ({1,2}), then a digit 0-9 one or two times, then an optional (?) whitespace character (\s), then a digit 0-9 one or two times, then a letter A-Z one or two times. The trailing /i makes the match case-insensitive.
This will match some false positives, as I can make up post codes like ZZ00 00ZZ, but to accurately match all post codes, the only way is to buy post code data from the post office - which is quite expensive. You could also download free post code databases, but they do not have 100% coverage.
Hope this helps.
Wikipedia has some regexes for UK Postcodes: http://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation

Count the number of words in NSString

I'm trying to implement a word count function for my app that uses UITextView.
There's a space between two words in English, so it's really easy to count the number of words in an English sentence.
The problem occurs with Chinese and Japanese word counting because there usually aren't any spaces in the entire sentence.
I checked three different iPad text editors that have a word count feature and compared them with MS Word.
For example, here's a series of Japanese characters meaning the world's idea: 世界(the world)の('s)アイデア(idea)
世界のアイデア
1) Pages for iPad and MS Word count each character as one word, so they report 7 words.
2) iPad text editor P*** counts the entire string as one word --> it just uses spaces to separate words.
3) iPad text editor i*** counts it as three words --> I believe it uses CFStringTokenizer with kCFStringTokenizerUnitWord, because I could get the same result that way.
I've researched on the Internet, and the way Pages and MS Word count words seems to be correct because each Chinese character has a meaning.
I couldn't find any class that counts the words like Pages or MS Words, and it would be very hard to implement it from scratch because besides Japanese and Chinese, iPad supports a lot of different foreign languages.
I think CFStringTokenizer with kCFStringTokenizerUnitWord is the best option though.
Is there a way to count words in an NSString the way Pages and MS Word do?
Thank you
I recommend that you keep using CFStringTokenizer. It is a platform feature, so it will improve with platform upgrades. Many people at Apple are working hard to reflect real cultural differences, which are hard for regular developers to know.
This is hard because it is not essentially a programming problem. It is a human cultural and linguistic problem, and you need a human language specialist for each culture. For Japanese, you need a specialist in Japanese culture. However, I don't think Japanese people seriously need a word count feature because, as I have heard, the concept of a word is not so important in Japanese culture. You should define the concept of a word first.
And I can't understand why you want to force the concept of word count onto the character count. Take the Kanji word you gave as an example: counting each character is like counting "universe" as 2 words by splitting it into uni + verse by meaning. That is not even logical. Splitting a word by its meaning is sometimes completely wrong and useless by the definition of a word, because the definition of a word itself differs between cultures. In my language, Korean, a word is just a formal unit, not a unit of meaning. The idea that each word matches one meaning is right only in cultures that use Roman characters.
Just give users in East Asia another feature, such as character counting, if you think they need it. Counting characters in a Unicode string is easy with the -[NSString length] method.
I'm a Korean speaker (so maybe outside your case :) and in many cases we count characters instead of words. In fact, I have never seen people counting words in my whole life. I laughed at the word counting feature in MS Word because I guessed nobody would use it. (However, now I know it's important in cultures that use Roman characters.) I have used the word counting feature only once, to see whether it really works :) I believe this is similar in Chinese or Japanese. Maybe Japanese users do use word counting, because their basic alphabet is similar to Roman characters, which have no concept of composition. However, they use Kanji heavily, which is a completely compositional, character-centric system.
Even if you make a word counting feature work well for those languages (which are used by people who do not even feel any need to split sentences into smaller formal units!), it's hard to imagine anyone using it. And without a linguistic specialist, the feature is unlikely to be correct.
This is a really hard problem if your string doesn't contain tokens identifying word breaks (like spaces). One way I know derived from attempting to solve anagrams is this:
At the start of the string you start with one character. Is it a word? It could be a word like "A" but it could also be a part of a word like "AN" or "ANALOG". So the decision about what is a word has to be made considering all of the string. You would consider the next characters to see if you can make another word starting with the first character following the first word you think you might have found. If you decide the word is "A" and you are left with "NALOG" then you will soon find that there are no more words to be found. When you start finding words in the dictionary (see below) then you know you are making the right choices about where to break the words. When you stop finding words you know you have made a wrong choice and you need to backtrack.
A big part of this is having dictionaries sufficient to contain any word you might encounter. The English resource would be TWL06 or SOWPODS or other scrabble dictionaries, containing many obscure words. You need a lot of memory to do this because if you check the words against a simple array containing all of the possible words your program will run incredibly slow. If you parse your dictionary, persist it as a plist and recreate the dictionary your checking will be quick enough but it will require a lot more space on disk and more space in memory. One of these big scrabble dictionaries can expand to about 10MB with the actual words as keys and a simple NSNumber as a placeholder for value - you don't care what the value is, just that the key exists in the dictionary, which tells you that the word is recognised as valid.
If you maintain an array as you count you get to do [array count] in a triumphal manner as you add the last word containing the last characters to it, but you also have an easy way of backtracking. If at some point you stop finding valid words you can pop the lastObject off the array and replace it at the start of the string, then start looking for alternative words. If that fails to get you back on the right track pop another word.
I would proceed by experimentation, looking for a potential three words ahead as you parse the string - when you have identified three potential words, take the first away, store it in the array and look for another word. If you find it is too slow to do it this way and you are getting OK results considering only two words ahead, drop it to two. If you find you are running up too many dead ends with your word division strategy then increase the number of words ahead you consider.
Another way would be to employ natural language rules - for example "A" and "NALOG" might look OK because a consonant follows "A", but "A" and "ARDVARK" would be ruled out because it would be correct for a word beginning in a vowel to follow "AN", not "A". This can get as complicated as you like to make it - I don't know if this gets simpler in Japanese or not but there are certainly common verb endings like "ma su".
(edit: started a bounty, I'd like to know the very best way to do this if my way isn't it.)
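The backtracking idea in the answer above can be sketched as follows. This is an illustration in Python rather than Objective-C, it tries every split instead of the three-word lookahead, and the tiny dictionary is made up for the example; a real one would be a large word list such as those mentioned above.

```python
def segment(text, dictionary, words=None):
    """Return one list of dictionary words that spells `text`, or None."""
    if words is None:
        words = []
    if not text:
        return list(words)  # every character has been consumed
    for end in range(1, len(text) + 1):
        candidate = text[:end]
        if candidate in dictionary:
            words.append(candidate)
            result = segment(text[end:], dictionary, words)
            if result is not None:
                return result
            words.pop()  # dead end: backtrack and try a longer word
    return None


# Hypothetical toy dictionary; a set gives fast membership checks.
toy_dictionary = {"A", "AN", "ANALOG", "CLOCK"}
found = segment("ANALOGCLOCK", toy_dictionary)
print(found, len(found))
```

Here "A" and "AN" are tried first and abandoned once the remainder cannot be split, which is the backtracking step the answer describes. The worst case grows exponentially with the length of the string, so a long text may need the lookahead limit or memoisation.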
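If you are using iOS 4 or later, you can do something like
// Count every word in the string; the range covers the whole string.
NSRange range = NSMakeRange(0, [string length]);
__block NSUInteger count = 0;
[string enumerateSubstringsInRange:range
                           options:NSStringEnumerationByWords
                        usingBlock:^(NSString *word,
                                     NSRange wordRange,
                                     NSRange enclosingRange,
                                     BOOL *stop) {
    count++;  // the block is called once per word found
}];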
More information in the NSString class reference.
There is also WWDC 2010 session, number 110, about advanced text handling, that explains this, around minute 10 or so.
"I think CFStringTokenizer with kCFStringTokenizerUnitWord is the best option though."
That's right, you have to iterate through the text and simply count the number of word tokens encountered on the way.
Not a native Chinese/Japanese speaker, but here's my 2 cents.
Each Chinese character does have a meaning, but the concept of a word is a combination of letters/characters that represents an idea, isn't it?
In that sense, there are probably 3 words in "sekai no aidea" (or 2 if you don't count particles like NO/GA/DE/WA, etc). Same as English: "world's idea" is two words, while "idea of world" is 3, and let's forget about the required 'the' hehe.
That given, counting words is not as useful in non-Roman languages in my opinion, similar to what Eonil mentioned. It's probably better to count the number of characters for those languages. Check around with native Chinese/Japanese speakers and see what they think.
If I were to do it, I would tokenize the string on spaces and particles (at least for Japanese and Korean) and count the tokens. Not sure about Chinese.
With Japanese you can create a grammar parser and I think it is the same with Chinese. However, that is easier said than done because natural language tends to have many exceptions, but it is not impossible.
Please note it won't really be efficient since you have to parse each sentence before being able to count the words.
I would recommend using a parser compiler rather than building the parser yourself, at least to start with, so that you can concentrate on the grammar rather than on the parser. It's not efficient, but it should get the job done.
Also have a fallback algorithm in case your grammar doesn't parse the input correctly (perhaps the input really didn't make sense to begin with); you can fall back to the length of the string to make it easier on yourself.
If you build it, there could be a market opportunity for you to use it as a natural language Domain Specific Language for Japanese/Chinese business rules as well.
Just use the length method:
[@"世界のアイデア" length]; // is 7
That being said, as a Japanese speaker, I think 3 is the right answer.
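One caveat worth knowing: -[NSString length] reports UTF-16 code units, which equals the character count only when every character fits in a single unit. The sketch below illustrates the difference in Python; the helper name utf16_length is hypothetical.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, which is what NSString's length reports."""
    return len(text.encode("utf-16-le")) // 2


sample = "世界のアイデア"
emoji = "\U0001F6AB"  # a character outside the Basic Multilingual Plane

print(len(sample), utf16_length(sample))
print(len(emoji), utf16_length(emoji))
```

For the Japanese sample both counts agree, so the answer above holds; the counts diverge only for characters that need a surrogate pair, such as many emoji.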

How can I figure out what code page I am looking at?

I have a device with some documentation on how to send it text. It uses 0x00-0x7F to send 'special' characters like accented characters, euro signs, ...
I am guessing they copied an existing code page and made some changes, but I have no idea how to figure out what code page is closest to the one in my documentation.
In theory, this should be easy to do. For example, they map Á to 0x41, so if I could find some way to go through all code pages and find the ones that have this character on that position, it would be a piece of cake.
However, all I can find on the internet are links to code page dumps just like the one I'm looking at, or software that uses heuristics to read text and guess the most likely code page. Surely someone out there has made it possible to look up what code page one is looking at ?
If it uses 0x00 to 0x7F for the "special" characters, how does it encode the regular ASCII characters?
In most of the charsets that support the character Á, its codepoint is 193 (0xC1). If you subtract 128 from that, you get 65 (0x41). Maybe your "codepage" is just the upper half of one of the standard charsets like ISO-8859-1 or windows-1252, with the high-order bit set to zero instead of one (that is, subtracting 128 from each one).
If that's the case, I would expect to find a flag you can set to tell it whether the next bunch of codepoints should be converted using the "upper" or "lower" encoding. I don't know of any system that uses that scheme, but it's the most sensible explanation I can come up with for the situation you describe.
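The "upper half shifted down by 128" hypothesis can be tested mechanically. This is a minimal sketch, assuming you can transcribe a few byte-to-character pairs from the device documentation; the candidate list and the function name codecs_matching are illustrative, and the codec names are the ones Python's standard library uses.

```python
# Candidate single-byte encodings to test; extend the list as needed.
CANDIDATES = [
    "latin_1", "cp1252", "iso8859_2", "iso8859_15",
    "cp437", "cp850", "mac_roman", "koi8_r",
]


def codecs_matching(observations, candidates=CANDIDATES, offset=0x80):
    """Return the codecs in which every device byte, shifted up by
    `offset`, decodes to the character the documentation shows."""
    matches = []
    for name in candidates:
        consistent = True
        for device_byte, expected_char in observations.items():
            try:
                decoded = bytes([device_byte + offset]).decode(name)
            except UnicodeDecodeError:
                consistent = False  # byte is undefined in this codec
                break
            if decoded != expected_char:
                consistent = False
                break
        if consistent:
            matches.append(name)
    return matches


# The single data point from the question: the device maps 0x41 to Á.
print(codecs_matching({0x41: "Á"}))
```

A single pair usually leaves several candidates standing, so adding more pairs from the documentation (the euro sign is a good discriminator) narrows the list quickly.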
There is no way to auto-detect the codepage without additional information. Below the display layer it’s just bytes and all bytes are created equal. There’s no way to say “I’m a 0x41 from this and that codepage”, there’s only “I’m 0x41. Display me!”
What endian is the system? Perhaps you're flipping bit orders?
In most codepages, 0x41 is just the normal "A"; I don't think any standard codepage has "Á" in that position. The device could use a control character somewhere before the A that adds the accent, or it could use a non-standard codepage.
I don't see any use in knowing the "closest codepage", you just need to use the docs you got with the device.
Your last sentence is puzzling, what do you mean by "possible to look up what code page one is looking at"?
If you include your whole codepage, people here on SO could be more helpful and give you more insight about this issue, having one data point 0x41=Á doesn't help much.
Somewhat random idea, but if you can replicate a significant amount of the text off the device, you could try running it through something like the detect function in http://chardet.feedparser.org/.

Theory: "Lexical Encoding"

I am using the term "Lexical Encoding" for my lack of a better one.
A Word is arguably the fundamental unit of communication as opposed to a Letter. Unicode tries to assign a numeric value to each Letter of all known Alphabets. What is a Letter to one language, is a Glyph to another. Unicode 5.1 assigns more than 100,000 values to these Glyphs currently. Out of the approximately 180,000 Words being used in Modern English, it is said that with a vocabulary of about 2,000 Words, you should be able to converse in general terms. A "Lexical Encoding" would encode each Word not each Letter, and encapsulate them within a Sentence.
// A simplified example of a "Lexical Encoding"
final int QUERY = -1; // placeholder code for the question mark (value assumed)
String sentence = "How are you today?";
int[] encodedSentence = { 93, 22, 14, 330, QUERY };
In this example each Token in the String was encoded as an Integer. The Encoding Scheme here simply assigned an int value based on generalised statistical ranking of word usage, and assigned a constant to the question mark.
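To make the scheme concrete, here is a minimal runnable sketch in Python. The word codes reuse the numbers from the example; the value of QUERY and the function names encode and decode are assumptions made for illustration.

```python
import re

QUERY = -1  # assumed placeholder code for the question mark

# Codes taken from the example above; a real table would come from
# word-usage statistics.
WORD_CODES = {"how": 93, "are": 22, "you": 14, "today": 330}
PUNCTUATION_CODES = {"?": QUERY}


def encode(sentence):
    """Replace each token with its integer code."""
    tokens = re.findall(r"[a-z']+|[?]", sentence.lower())
    codes = []
    for token in tokens:
        if token in PUNCTUATION_CODES:
            codes.append(PUNCTUATION_CODES[token])
        else:
            codes.append(WORD_CODES[token])  # KeyError for unknown words
    return codes


def decode(codes):
    """Map integer codes back to lower-case tokens."""
    reverse = {code: word for word, code in WORD_CODES.items()}
    reverse.update({code: mark for mark, code in PUNCTUATION_CODES.items()})
    return [reverse[code] for code in codes]


print(encode("How are you today?"))
```

The round trip recovers only spelling and order, and it loses the original capitalisation, which hints at how much a bare number per word leaves out.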
Ultimately, a Word has both a Spelling & Meaning though. Any "Lexical Encoding" would preserve the meaning and intent of the Sentence as a whole, and not be language specific. An English sentence would be encoded into "...language-neutral atomic elements of meaning ..." which could then be reconstituted into any language with a structured Syntactic Form and Grammatical Structure.
What are other examples of "Lexical Encoding" techniques?
If you were interested in where the word-usage statistics come from :
http://www.wordcount.org
This question impinges on linguistics more than programming, but for languages which are highly synthetic (having words which are comprised of multiple combined morphemes), it can be a highly complex problem to try to "number" all possible words, as opposed to languages like English which are at least somewhat isolating, or languages like Chinese which are highly analytic.
That is, words may not be easily broken down and counted based on their constituent glyphs in some languages.
This Wikipedia article on Isolating languages may be helpful in explaining the problem.
There are several major problems with this idea. In most languages, the meaning of a word, and the word associated with a meaning, change very swiftly.
No sooner would you have assigned a number to a word than the meaning of the word would change. For instance, the word "gay" used to mean only "happy" or "merry", but it is now used mostly to mean homosexual. Another example is the phrase "thank you", which is cognate with German "danke", which is just one word. Yet another example is "goodbye", which is a shortening of "God be with ye".
Another problem is that even if one takes a snapshot of a word at any point of time, the meaning and usage of the word would be under contention, even within the same province. When dictionaries are being written, it is not uncommon for the academics responsible to argue over a single word.
In short, you wouldn't be able to do it with an existing language. You would have to consider inventing a language of your own, for the purpose, or using a fairly static language that has already been invented, such as Interlingua or Esperanto. However, even these would not be perfect for the purpose of defining static morphemes in an ever-standard lexicon.
Even in Chinese, where there is rough mapping of character to meaning, it still would not work. Many characters change their meanings depending on both context, and which characters either precede or postfix them.
The problem is at its worst when you try and translate between languages. There may be one word in English, that can be used in various cases, but cannot be directly used in another language. An example of this is "free". In Spanish, either "libre" meaning "free" as in speech, or "gratis" meaning "free" as in beer can be used (and using the wrong word in place of "free" would look very funny).
There are other words which are even more difficult to place a meaning on, such as the word beautiful in Korean; when calling a girl beautiful, there would be several candidates for substitution; but when calling food beautiful, unless you mean the food is good looking, there are several other candidates which are completely different.
What it comes down to is that although we only use about 200k words in English, our vocabularies are actually larger in some respects because we assign many different meanings to the same word. The same problems apply to Esperanto and Interlingua, and every other language meaningful for conversation. Human speech is not a well-defined, well-oiled machine. So, although you could create such a lexicon where each "word" had its own unique meaning, it would be very difficult, and nigh on impossible for machines using current techniques to translate from any human language into your special standardised lexicon.
This is why machine translation still sucks, and will for a long time to come. If you can do better (and I hope you can) then you should probably consider doing it with some sort of scholarship and/or university/government funding, working towards a PhD; or simply make a heap of money, whatever keeps your ship steaming.
It's easy enough to invent one for yourself. Turn each word into a canonical bytestream (say, lower-case decomposed UTF-32), then hash it down to an integer. 32 bits would probably be enough, but if not then 64 bits certainly would.
Before you ding me for giving you a snarky answer, consider that the purpose of Unicode is simply to assign each glyph a unique identifier. Not to rank or sort or group them, but just to map each one onto a unique identifier that everyone agrees on.
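The hashing recipe in the answer above can be sketched in a few lines. The choice of NFD normalisation, big-endian UTF-32, and SHA-256 truncated to the requested width are assumptions for illustration; the answer does not name a specific hash, and the function name word_code is hypothetical.

```python
import hashlib
import unicodedata


def word_code(word: str, bits: int = 32) -> int:
    """Hash a word down to an integer of the given width."""
    # Canonical form: lower-case, canonically decomposed, UTF-32 bytes.
    canonical = unicodedata.normalize("NFD", word.lower())
    digest = hashlib.sha256(canonical.encode("utf-32-be")).digest()
    # Keep only the leading bits // 8 bytes of the digest.
    return int.from_bytes(digest[: bits // 8], "big")


print(word_code("hello"), word_code("hello", bits=64))
```

Spelling variants that differ only in case or in composed versus decomposed accents map to the same code. Collisions between different words remain possible, and become more likely at 32 bits as the vocabulary grows.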
How would the system handle pluralization of nouns or conjugation of verbs? Would these each have their own "Unicode" value?
As a translation scheme, this is probably not going to work without a lot more work. You'd like to think that you can assign a number to each word, then mechanically translate that to another language. In reality, languages have the problem of multiple words that are spelled the same: "the wind blew her hair back" versus "wind your watch".
For transmitting text, where you'd presumably have an alphabet per language, it would work fine, although I wonder what you'd gain there as opposed to using a variable-length dictionary, like ZIP uses.
This is an interesting question, but I suspect you are asking it for the wrong reasons. Are you thinking of this 'lexical Unicode' as something that would allow you to break down sentences into language-neutral atomic elements of meaning and then be able to reconstitute them in some other concrete language? As a means to achieve a universal translator, perhaps?
Even if you can encode and store, say, an English sentence using a 'lexical unicode', you can not expect to read it and magically render it in, say, Chinese keeping the meaning intact.
Your analogy to Unicode, however, is very useful.
Bear in mind that Unicode, whilst a 'universal' code, does not embody the pronunciation, meaning or usage of the character in question. Each code point refers to a specific glyph in a specific language (or rather the script used by a group of languages). It is elemental at the visual representation level of a glyph (within the bounds of style, formatting and fonts). The Unicode code point for the Latin letter 'A' is just that. It is the Latin letter 'A'. It cannot automagically be rendered as, say, the Arabic letter Alif (ﺍ) or the Indic (Devanagari) letter 'A' (अ).
Keeping to the Unicode analogy, your Lexical Unicode would have code points for each word (word form) in each language. Unicode has ranges of code points for a specific script. Your lexical Unicode would have to have a range of codes for each language. Different words in different languages, even if they have the same meaning (synonyms), would have to have different code points. The same word having different meanings, or different pronunciations (homonyms), would have to have different code points.
In Unicode, for some languages (but not all) where the same character has a different shape depending on its position in the word - e.g. in Hebrew and Arabic, the shape of a glyph changes at the end of the word - then it has a different code point. Likewise in your Lexical Unicode, if a word has a different form depending on its position in the sentence, it may warrant its own code point.
Perhaps the easiest way to come up with code points for the English Language would be to base your system on, say, a particular edition of the Oxford English Dictionary and assign a unique code to each word sequentially. You will have to use a different code for each different meaning of the same word, and you will have to use a different code for different forms - e.g. if the same word can be used as a noun and as a verb, then you will need two codes.
Then you will have to do the same for each other language you want to include - using the most authoritative dictionary for that language.
Chances are that this exercise is all more effort than it is worth. If you decide to include all the world's living languages, plus some historic dead ones and some fictional ones - as Unicode does - you will end up with a code space that is so large that your code would have to be extremely wide to accommodate it. You will not gain anything in terms of compression - it is likely that a sentence represented as a String in the original language would take up less space than the same sentence represented as code.
P.S. for those who are saying this is an impossible task because word meanings change, I do not see that as a problem. To use the Unicode analogy, the usage of letters has changed (admittedly not as rapidly as the meaning of words), but it is not of any concern to Unicode that 'th' used to be pronounced like 'y' in the Middle ages. Unicode has a code point for 't', 'h' and 'y' and they each serve their purpose.
P.P.S. Actually, it is of some concern to Unicode that 'oe' is also 'œ' or that 'ss' can be written 'ß' in German
This is an interesting little exercise, but I would urge you to consider it nothing more than an introduction to the concept of the difference in natural language between types and tokens.
A type is a single instance of a word which represents all instances. A token is a single count for each instance of the word. Let me explain this with the following example:
"John went to the bread store. He bought the bread."
Here are some frequency counts for this example, with the counts meaning the number of tokens:
John: 1
went: 1
to: 1
the: 2
store: 1
he: 1
bought: 1
bread: 2
Note that "the" is counted twice--there are two tokens of "the". However, note that while there are ten words, there are only eight of these word-to-frequency pairs: the words are broken down into types, and each type is paired with its token count.
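The counts above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the tokeniser (lower-cased runs of letters) and the function name type_token_counts are simplifying assumptions.

```python
import re
from collections import Counter

text = "John went to the bread store. He bought the bread."


def type_token_counts(text: str) -> Counter:
    """Map each type (distinct word) to its number of tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


counts = type_token_counts(text)
print(sum(counts.values()), "tokens,", len(counts), "types")
print(counts.most_common(2))
```

Summing the values gives the token total and the number of keys gives the type total, matching the ten words and eight pairs in the example.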
Types and tokens are useful in statistical NLP. "Lexical encoding", on the other hand, I would watch out for. This is a segue into much more old-fashioned approaches to NLP, where preprogramming and rationalism abound. I don't know of any statistical MT system that actually assigns a specific "address" to a word. There are too many relationships between words, for one thing, to build any kind of well thought out numerical ontology, and if we're just throwing numbers at words to categorize them, we should be thinking about things like memory management and allocation for speed.
I would suggest checking out NLTK, the Natural Language Toolkit, written in Python, for a more extensive introduction to NLP and its practical uses.
Actually you only need about 600 words for a half decent vocabulary.