How to check whether grapheme is a letter? - unicode

How do I check whether a grapheme is a letter (or something else that is commonly used in words, such as a hieroglyph)?
After looking through Elixir's String documentation, the only way I see is to check whether String.downcase and String.upcase return the same string. Iff they do, then the grapheme is not something that is used in words.
This is how I do it, but surely there should be a simpler way?
defmodule Words do
  defp all_letters_uppercase?(string) do
    String.upcase(string) == string
  end

  defp all_letters_downcase?(string) do
    String.downcase(string) == string
  end

  defp contains_letter?(string) do
    not (all_letters_uppercase?(string) and all_letters_downcase?(string))
  end

  def single_grapheme?(string) do
    with graphemes = String.graphemes(string) do
      length(graphemes) == 1 and hd(graphemes) == string
    end
  end

  @doc """
  Check whether string is a single letter.
  """
  def letter?(string) do
    single_grapheme?(string) and contains_letter?(string)
  end
end
Update: my code doesn't work for Japanese letters:
iex(35)> Words.letter?("グ")
false

You can use regular expressions to check for certain Unicode properties, one of which is \p{Letter}, or \p{L} for short. You might want to append \p{Mark}*, or \p{M}* for short, to also match any combining diacritics that follow the letter. This would closely match the logic found in String.graphemes/1. Be sure to add the u modifier after the regex to enable these Unicode features. For example:
iex> String.match?("グ", ~r/\A\p{L}\p{M}*\z/u)
true
Also see http://erlang.org/doc/man/re.html, section on "Unicode character properties" and http://www.regular-expressions.info/unicode.html#grapheme.
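The same letter-plus-marks check can be sketched outside Elixir too. Python's standard re module does not support \p{L}, so this rough sketch uses unicodedata.category instead. The helper name is_letter is made up for illustration, and, like the regex above, it accepts one letter followed by any number of combining marks.

```python
import unicodedata


def is_letter(text: str) -> bool:
    """Return True if text is one letter followed only by combining marks."""
    if not text:
        return False
    first, rest = text[0], text[1:]
    # General categories starting with "L" are letters (Lu, Ll, Lt, Lm, Lo).
    if not unicodedata.category(first).startswith("L"):
        return False
    # Categories starting with "M" are combining marks (Mn, Mc, Me).
    return all(unicodedata.category(ch).startswith("M") for ch in rest)


print(is_letter("グ"))        # Katakana letter
print(is_letter("e\u0301"))   # "e" plus a combining acute accent
print(is_letter("1"))         # a digit, not a letter
```

This is only an approximation of full grapheme segmentation: it covers the letter-plus-diacritics case, not every extended grapheme cluster.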

This seems to be working fine:
defmodule Words do
  def letter?(string) do
    Regex.match?(~r/^\p{L}$/fu, string)
  end
end
iex(51)> Words.letter?("a")
true
iex(52)> Words.letter?("é")
true
iex(53)> Words.letter?("グ")
true
iex(54)> Words.letter?("aa")
false
iex(55)> Words.letter?("1")
false
iex(56)> Words.letter?("-")
false
iex(57)> Words.letter?("")
false
iex(58)> Words.letter?(" ")
false
iex(59)> Words.letter?("éé")
false
iex(60)> Words.letter?("a ")
false

Related

How to determine if a string contains any element of a set

I have a sentence, and I want to determine if it contains any elements of a set.
val sentence = "Hello, today is a fine day to learn scala"
val mySet = Set("day", "scala")
What about:
mySet.exists(word => sentence.contains(word))
It will return true if at least one word from the set is present in the string.
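For comparison, here is the same "any element is a substring" test as a rough Python sketch that reuses the question's data (the snake_case names are only for illustration). Like contains, it is case-sensitive and matches substrings, so "day" is also found inside "today".

```python
sentence = "Hello, today is a fine day to learn scala"
my_set = {"day", "scala"}

# True as soon as one word from the set occurs anywhere in the sentence.
found = any(word in sentence for word in my_set)
print(found)
```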
Here's a solution that...
is case-insensitive ("scala" does match "Scala")
ignores sub-strings ("rat" does not match "rats")
ignores punctuation (!?,-) unless specifically specified in mySet
mySet.mkString("(?i)\\b(", "|", ")\\b")
.r.unanchored
.matches(sentence)
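The same idea can be sketched in Python with the re module. The function name contains_any_word is hypothetical, and the re.escape call is an addition that the Scala one-liner does not have: it makes every element of the set match literally.

```python
import re
from typing import Iterable


def contains_any_word(sentence: str, words: Iterable[str]) -> bool:
    """Case-insensitive, whole-word search for any element of words."""
    escaped = [re.escape(w) for w in sorted(words)]
    if not escaped:
        return False
    # \b on both sides rejects partial matches such as "rat" inside "rats".
    pattern = re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)
    return pattern.search(sentence) is not None


print(contains_any_word("Hello, today is a fine day to learn Scala", {"day", "scala"}))
print(contains_any_word("I like rats", {"rat"}))
```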

How to strip everything except digits from a string in Scala (quick one liners)

This is driving me nuts... there must be a way to strip out all non-digit characters (or perform other simple filtering) in a String.
Example: I want to turn a phone number ("+72 (93) 2342-7772" or "+1 310-777-2341") into a simple numeric String (not an Int), such as "729323427772" or "13107772341".
I tried "[\\d]+".r.findAllIn(phoneNumber), which returns an iterator, and then I would have to recombine the matches into a String somehow... seems horribly wasteful.
I also came up with: phoneNumber.filter("0123456789".contains(_)) but that becomes tedious for other situations. For instance, removing all punctuation... I'm really after something that works with a regular expression so it has wider application than just filtering out digits.
Anyone have a fancy Scala one-liner for this that is more direct?
You can use filter, treating the string as a character sequence and testing the character with isDigit:
"+72 (93) 2342-7772".filter(_.isDigit) // res0: String = 729323427772
You can use replaceAll and Regex.
"+72 (93) 2342-7772".replaceAll("[^0-9]", "") // res1: String = 729323427772
Another approach, define the collection of valid characters, in this case
val d = '0' to '9'
and so for val a = "+72 (93) 2342-7772", filter on collection inclusion for instance with either of these,
for (c <- a if d.contains(c)) yield c
a.filter(d.contains)
a.collect{ case c if d.contains(c) => c }
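For readers coming from other languages, the regex and filter approaches above look roughly like this in Python. This is only a sketch, and the variable names are illustrative. An explicit set of ASCII digits is used for the filter because Python's str.isdigit also accepts non-ASCII digit characters.

```python
import re

phone = "+72 (93) 2342-7772"

# Regex approach: delete every character that is not an ASCII digit.
digits_regex = re.sub(r"[^0-9]", "", phone)

# Filter approach: keep only characters from an explicit set of allowed digits.
allowed = set("0123456789")
digits_filter = "".join(ch for ch in phone if ch in allowed)

print(digits_regex)   # 729323427772
print(digits_filter)  # 729323427772
```

The regex version generalizes to other filters by swapping the character class, which is what the question was after.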

ignore spaces and cases MATLAB

diary_file = tempname();
diary(diary_file);
myFun();
diary('off');
output = fileread(diary_file);
I would like to search for a string in output, ignoring spaces and upper/lower case. Here is an example of what output contains:
the test : passed
number : 4
found = 'thetest:passed'
a = strfind(output,found )
How can I ignore spaces and case when searching output?
Assuming you are not too worried about accidentally matching something like: 'thetEst:passed' here is what you can do:
Remove all spaces and only compare lower case
found = 'With spaces'
found = lower(found(found ~= ' '))
This will return
found =
withspaces
Of course you would also need to do this with each line of output.
Another way:
regexpi(output(~isspace(output)), found, 'match')
if output is a single string, or
regexpi(regexprep(output,'\s',''), found, 'match')
for the more general case (either class(output) == 'cell' or 'char').
Advantages:
Fast.
robust (ALL whitespace (not just spaces) is removed)
more flexible (you can return starting/ending indices of the match, tokenize, etc.)
will return original case of the match in output
Disadvantages:
more typing
less obvious (more documentation required)
will return original case of the match in output (yes, there are two sides to that coin)
That last point in both lists is easily forced to lower or uppercase using lower() or upper(), but if you want same-case, it's a bit more involved:
C = regexpi(output(~isspace(output)), found, 'match');
if ~isempty(C)
    C = found;
end
for single string, or
C = regexpi(regexprep(output, '\s', ''), found, 'match')
C(~cellfun('isempty', C)) = {found}
for the more general case.
You can use lower to convert everything to lowercase to solve your case problem. However ignoring whitespace like you want is a little trickier. It looks like you want to keep some spaces but not all, in which case you should split the string by whitespace and compare substrings piecemeal.
I'd recommend using a regex, e.g. like this:
a = regexpi(output, 'the\s*test\s*:\s*passed');
If you don't care about the position where the match occurs but only whether there is a match at all, removing all spaces would be a brute-force, and somewhat nasty, possibility:
a = strfind(strrep(output, ' ',''), found);
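The normalize-then-search idea from these answers is not specific to MATLAB. As a rough sketch in Python (normalize is a hypothetical helper, and the sample text is taken from the question):

```python
import re


def normalize(text: str) -> str:
    """Drop all whitespace and lowercase what is left."""
    return re.sub(r"\s+", "", text).lower()


output = "the test : passed\nnumber : 4\n"
found = "thetest:passed"

# Compare both sides in the same normalized form.
position = normalize(output).find(normalize(found))
print(position)  # index in the whitespace-stripped text, -1 if absent
```

Because newlines are removed as well, a match can span two lines of the original output. Normalize line by line if that matters.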

Finding non-adjacent subsequences in a string

Say I am searching in a string, for a subsequence, where the elements do not necessarily have to be adjacent, but have to occur within N characters. So,
search("abc","aaabbbccc",7) => True
search("abc","aabbcc",3) => False
I am looking for an efficient data structure / algorithm that will perform this comparison. I can think of a few approaches like searching for all valid combos of interior wildcards, like
search("abc",whatever,4) => "abc","a*bc","ab*c"
And using any of the multi-string search algorithms (probably Aho–Corasick), but I'm wondering if there is a better solution.
I have attached a Python code sample that does what you want. It loops through the string to be searched, and whenever the first letter of the search string is found, it takes a substring of at most max_len characters starting there and passes it to a second function. That function simply moves through the substring trying to find all of the search string's letters in order. If it finds them all it returns True, otherwise False.
def check_substring(find_me, substr):
    find_index = 0
    for letter in substr:
        if find_me[find_index] == letter:
            find_index += 1
        # if we reach the end of find_me, return true
        if find_index >= len(find_me):
            return True
    return False


def check_string(find_me, look_here, max_len):
    for index in range(len(look_here)):
        # only windows that start with the first search letter can match
        if find_me[0] == look_here[index]:
            if check_substring(find_me, look_here[index:index + max_len]):
                return True
    return False


fm = "abc"
lh = "aabbbccceee"
ml = 5
print(check_string(fm, lh, ml))
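A more compact variant of the same windowed scan, offered as a sketch rather than a faster algorithm: the name search is taken from the question's examples, and the iterator trick is my own substitution for the explicit index counter. It still costs roughly len(haystack) * max_len character comparisons in the worst case.

```python
def search(needle: str, haystack: str, max_len: int) -> bool:
    """True if needle is a subsequence of some window of max_len characters."""
    if not needle:
        return True
    for start, ch in enumerate(haystack):
        if ch != needle[0]:
            continue
        window = iter(haystack[start:start + max_len])
        # "c in window" consumes the iterator up to the first match,
        # so the characters of needle must appear in order.
        if all(c in window for c in needle):
            return True
    return False


print(search("abc", "aaabbbccc", 7))
print(search("abc", "aabbcc", 3))
```

The two calls are the examples from the question, which expects True for the first and False for the second.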

Scala string pattern matching for mathematical symbols

I have the following code:
val z: String = tree.symbol.toString
z match {
case "method +" | "method -" | "method *" | "method ==" =>
println("no special op")
false
case "method /" | "method %" =>
println("we have the special div operation")
true
case _ =>
false
}
Is it possible to create a match for the primitive operations in Scala:
"method *".matches("(method) (+-*==)")
I know that the (+-*) signs are used as quantifiers. Is there a way to match them anyway?
Thanks from an avid Scala scholar!
Sure.
val z: String = tree.symbol.toString
val noSpecialOp = "method (?:[-+*]|==)".r
val divOp = "method [/%]".r
z match {
case noSpecialOp() =>
println("no special op")
false
case divOp() =>
println("we have the special div operation")
true
case _ =>
false
}
Things to consider:
I choose to match against single characters using [abc] instead of (?:a|b|c).
Note that - has to be the first (or last) character inside [], or be escaped, or it will be interpreted as a range. Likewise, a literal ^ must not be the first character inside [], or it will be interpreted as negation.
I'm using (?:...) instead of (...) because I don't want to extract the contents. If I did want to extract the contents -- so I'd know what was the operator, for instance, then I'd use (...). However, I'd also have to change the matching to receive the extracted content, or it would fail the match.
It is important not to forget () on the matches -- like divOp(). If you forget them, a simple assignment is made (and Scala will complain about unreachable code).
And, as I said, if you are extracting something, then you need something inside those parentheses. For instance, "method ([%/])".r would match divOp(op), but not divOp().
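The character-class and grouping points above carry over to other regex engines. Here is a minimal Python sketch of the same classification. The constant and function names are invented for illustration, and fullmatch stands in for the whole-string matching that Scala's regex patterns perform.

```python
import re

# "-" comes first inside [] so it is a literal, not a range.
NO_SPECIAL_OP = re.compile(r"method (?:[-+*]|==)")
# A capturing group, so the matched operator can be read back.
DIV_OP = re.compile(r"method ([/%])")


def is_special_div(symbol: str) -> bool:
    """Mirror the match expression: True only for / and %."""
    if NO_SPECIAL_OP.fullmatch(symbol):
        return False
    match = DIV_OP.fullmatch(symbol)
    if match:
        print("special div operation:", match.group(1))
        return True
    return False


print(is_special_div("method %"))
print(is_special_div("method =="))
```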
Much the same as in Java. To escape a character in a regular expression, you prefix the character with \. However, backslash is also the escape character in standard Java/Scala strings, so to pass it through to the regular expression processing you must again prefix it with a backslash. You end up with something like:
scala> "+".matches("\\+")
res1 : Boolean = true
As James Iry points out in the comment below, Scala also has support for 'raw strings', enclosed in three quotation marks: """Raw string in which I don't need to escape things like \!""" This allows you to avoid the second level of escaping, that imposed by Java/Scala strings. Note that you still need to escape any characters that are treated as special by the regular expression parser:
scala> "+".matches("""\+""")
res1 : Boolean = true
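The two levels of escaping are not specific to Scala. As a small illustrative sketch, Python's raw string literals play the same role as triple-quoted Scala strings here, and re.escape is shown as one way to get the regex-level escaping done for you.

```python
import re

# Ordinary string literal: the backslash must be doubled to reach the regex engine.
print(bool(re.fullmatch("\\+", "+")))

# Raw string literal: one level of escaping disappears, but "+" is still
# a regex metacharacter and still needs its backslash.
print(bool(re.fullmatch(r"\+", "+")))

# re.escape adds the regex-level escapes automatically.
escaped = re.escape("a+b*")
print(bool(re.fullmatch(escaped, "a+b*")))
```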
Escaping characters in Strings works like in Java.
If you have larger Strings which need a lot of escaping, consider Scala's triple-quoted strings (""").
E.g. """String without needing to escape anything \n \d"""
If you put three """ around your regular expression, you no longer need to double the backslashes. Regex metacharacters such as + still need a single backslash.