It seems that Malayalam characters can be found between code points 3328 and 3455 (i.e. range(3328, 3456)), as per this page:
https://github.com/qburst/common-crawl-malayalam/blob/master/src/process_warc_batch.py
malayalam_unicode_decimal_list = list(range(3328, 3456)) + [8204, 8205]
def is_char_malayalam(c):
    if ord(c) in malayalam_unicode_decimal_list:
        return True
    else:
        return False
I would like to know the corresponding ranges for other languages such as Gujarati or Hindi.
As suggested in the comment, I have calculated the ranges in Python.
Devanagari:
ord("\u0900")  # 2304
ord("\u097F")  # 2431
range(2304, 2432)
Gujarati:
ord("\u0A80")  # 2688
ord("\u0AFF")  # 2815
range(2688, 2816)
Not sure if this is correct.
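One way to sanity-check ranges like these is Python's unicodedata module, which knows each character's official name. A small sketch (the helper names here are my own, mirroring is_char_malayalam above):
import unicodedata

devanagari_range = range(0x0900, 0x0980)  # 2304-2431
gujarati_range = range(0x0A80, 0x0B00)    # 2688-2815

def is_char_devanagari(c):
    return ord(c) in devanagari_range

def is_char_gujarati(c):
    return ord(c) in gujarati_range

print(unicodedata.name("अ"))     # DEVANAGARI LETTER A
print(is_char_devanagari("अ"))   # True
print(unicodedata.name("અ"))     # GUJARATI LETTER A
print(is_char_gujarati("અ"))     # True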
I have apps in Go and Swift which process strings, such as finding substrings and their indices. At first it worked nicely even with multi-byte characters (e.g. emojis), using Go's utf8.RuneCountInString() and Swift's native String.
But there are some UTF-8 characters that break the string length and indices for substrings, e.g. the string "Lorem 😀😃☺️🤔 ipsum":
Go's utf8.RuneCountInString("Lorem 😀😃☺️🤔 ipsum") returns 17 and the start index of ipsum is 12.
Swift's "Lorem 😀😃☺️🤔 ipsum".count returns 16 and the start index of ipsum is 11.
Using Swift String's utf8 or utf16 views, or casting to NSString, also gives different lengths and indices. There are also emojis composed of multiple other emojis, like 👨‍👩‍👧‍👦, which give even stranger numbers.
This is with Go 1.8 and Swift 4.1.
Is there any way to get the same string lengths and substrings' indices with same values with Go and Swift?
EDIT
I created a Swift String extension based on @MartinR's great answer:
extension String {
    func runesRangeToNSRange(from: Int, to: Int) -> NSRange {
        let length = to - from
        let start = unicodeScalars.index(unicodeScalars.startIndex, offsetBy: from)
        let end = unicodeScalars.index(start, offsetBy: length)
        let range = start..<end
        return NSRange(range, in: self)
    }
}
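For example, to pull "ipsum" back out via Foundation (a usage sketch I added; 12 and 17 are the Unicode-scalar offsets computed on the Go side):
import Foundation

let s = "Lorem 😀😃☺️🤔 ipsum"
// 12..<17 are the scalar offsets of "ipsum".
let nsRange = s.runesRangeToNSRange(from: 12, to: 17)
print((s as NSString).substring(with: nsRange)) // ipsum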
In Swift a Character is an "extended grapheme cluster," and each of "😀", "😃", "☺️", "🤔", "👨‍👩‍👧‍👦" counts as a single character.
I have no experience with Go, but as I understand it from Strings, bytes, runes and characters in Go,
a "rune" is a Unicode code point, which essentially corresponds to a UnicodeScalar in Swift.
In your example, the difference comes from "☺️", which
counts as a single Swift character, but is built from two Unicode scalars:
print("☺️".count) // 1
print("☺️".unicodeScalars.count) // 2
Here is an example of how you can compute the length and offsets in
terms of Unicode scalars:
let s = "Lorem 😀😃☺️🤔 ipsum"
print(s.unicodeScalars.count) // 17
if let idx = s.range(of: "ipsum") {
    print(s.unicodeScalars.distance(from: s.startIndex, to: idx.lowerBound)) // 12
}
As you can see, this gives the same numbers as in your example from Go.
A rune in Go identifies a specific Unicode code point; that does not necessarily mean it maps 1:1 to visually distinct characters. Some characters may be made up of multiple runes/code points, therefore counting runes may not give you what you'd expect from a visual inspection of the string. I don't know what "some text".count actually counts in Swift, so I can't offer any comparison there.
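To illustrate the Go side (my sketch, not from the original post, reusing the question's sample string): strings.Index returns a byte offset, so the rune offset comes from counting the runes in the preceding bytes.
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

func main() {
	s := "Lorem 😀😃☺️🤔 ipsum"

	// Total runes (Unicode code points), not bytes or grapheme clusters.
	fmt.Println(utf8.RuneCountInString(s)) // 17

	// Convert the byte index of "ipsum" into a rune offset.
	byteIdx := strings.Index(s, "ipsum")
	fmt.Println(utf8.RuneCountInString(s[:byteIdx])) // 12
}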
let str1 = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪"
let str2 = "🇩🇪.🇩🇪.🇩🇪.🇩🇪.🇩🇪."
println("\(countElements(str1)), \(countElements(str2))")
Result: 1, 10
But shouldn't str1 have 5 elements?
The bug seems to occur only when I use the flag emoji.
Update for Swift 4 (Xcode 9)
As of Swift 4 (tested with Xcode 9 beta) grapheme clusters break after every second regional indicator symbol, as mandated by the Unicode 9
standard:
let str1 = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪"
print(str1.count) // 5
print(Array(str1)) // ["🇩🇪", "🇩🇪", "🇩🇪", "🇩🇪", "🇩🇪"]
Also String is a collection of its characters (again), so one can
obtain the character count with str1.count.
(Old answer for Swift 3 and older:)
From "3 Grapheme Cluster Boundaries"
in the "Standard Annex #29 UNICODE TEXT SEGMENTATION":
(emphasis added):
A legacy grapheme cluster is defined as a base (such as A or カ)
followed by zero or more continuing characters. One way to think of
this is as a sequence of characters that form a "stack".
The base can be single characters, or be any sequence of Hangul Jamo
characters that form a Hangul Syllable, as defined by D133 in The
Unicode Standard, or be any sequence of Regional_Indicator (RI) characters. The RI characters are used in pairs to denote Emoji
national flag symbols corresponding to ISO country codes. Sequences of
more than two RI characters should be separated by other characters,
such as U+200B ZWSP.
(Thanks to @rintaro for the link.)
A Swift Character represents an extended grapheme cluster, so it is (according
to this reference) correct that any sequence of regional indicator symbols
is counted as a single character.
You can separate the "flags" by a ZERO WIDTH NON-JOINER:
let str1 = "🇩🇪\u{200C}🇩🇪"
print(str1.characters.count) // 2
or insert a ZERO WIDTH SPACE:
let str2 = "🇩🇪\u{200B}🇩🇪"
print(str2.characters.count) // 3
This also solves possible ambiguities, e.g. should "🇫🇷🇺🇸" be "🇫\u{200B}🇷🇺\u{200B}🇸" or "🇫🇷\u{200B}🇺🇸"?
See also How to know if two emojis will be displayed as one emoji? about a possible method
to count the number of "composed characters" in a Swift string,
which would return 5 for your let str1 = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪".
Here's how I solved that problem, for Swift 3:
let str = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪" // or whatever the string of emojis is
let range = str.startIndex..<str.endIndex
var length = 0
str.enumerateSubstrings(in: range, options: .byComposedCharacterSequences) { (substring, substringRange, enclosingRange, stop) in
    length += 1
}
print("Character Count: \(length)")
This fixes all the problems with character count and emojis, and is the simplest method I have found.
I have a Swift program in which I need to read the last 20 characters of a string.
Although I would prefer the last 20, the first 20 would also be fine if that makes it any easier.
I also need a way to read all characters except for the last 20.
You can use suffix:
String(yourString.characters.suffix(20))
It's interesting because the place you'd expect to find the answer would be the string functions: where is the Swift equivalent of JavaScript's String.substr(), for example?
What you want is
let str: String = ...
str.substringToIndex(advance(str.startIndex, 20)) // first 20 chars
str.substringFromIndex(advance(str.endIndex, -20)) // last 20 chars
In any case, you'll need to check whether str has fewer than 20 characters and, if so, just return the string itself.
You can determine the string length with
count(str) (Swift 1.2) or str.characters.count (Swift 2)
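In modern Swift (4 and later), prefix, suffix, and dropLast cover all three requests directly, and they clamp automatically when the string is shorter than 20 characters, so no length check is needed. A sketch (yourString is a stand-in name):
let yourString = "123456789012345678901234567890"
let first20 = String(yourString.prefix(20))        // first 20 characters
let last20 = String(yourString.suffix(20))         // last 20 characters
let allButLast20 = String(yourString.dropLast(20)) // everything except the last 20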
I have a table field where the data contains our member ID numbers followed by a string of characters, or characters plus numbers.
For example:
My Data
1234567Z1
2345T10
222222T10Z1
111
111A
Should Become
123456
12345
222222
111
111
I want to get just the member number (as shown in Should Become above), i.e. all the digits LEFT of the first letter.
As the length of the member number can differ for each person (the first 1 to 7 digits) and the letters used can differ (a to z, 0 to 8 characters long), I don't think I can SPLIT the field.
Right now, in Power Query, I run 27 search-and-replace commands to clean this data (e.g. find T10, replace with nothing; find T20, replace with nothing; etc.)
Can anyone suggest a better way to achieve this?
I did successfully create a formula for this in Excel... but I am now trying to do this in Power Query, and I don't know how to convert the formula - nor am I sure it is the most efficient solution.
=IFERROR(VALUE(LEFT([MEMBERID],7)),
 IFERROR(VALUE(LEFT([MEMBERID],6)),
  IFERROR(VALUE(LEFT([MEMBERID],5)),
   IFERROR(VALUE(LEFT([MEMBERID],4)),
    IFERROR(VALUE(LEFT([MEMBERID],3)),0)))))
Thanks
There are likely several ways to do this. Here's one way:
Create a query Letters:
let
    Source = { "a" .. "z" } & { "A" .. "Z" }
in
    Source
Create a query GetFirstLetterIndex:
let
    Source = (text) => let
        // For each letter, find where it first appears in the text. Text.PositionOf returns -1 when a letter doesn't appear, so replace that with the text length to keep it from winning the minimum.
        firstLetterIndex = List.Transform(Letters, each let pos = Text.PositionOf(text, _), correctedPos = if pos < 0 then Text.Length(text) else pos in correctedPos),
        minimumIndex = List.Min(firstLetterIndex)
    in minimumIndex
in
    Source
In the table containing your data, add a custom column with this formula:
Text.Range([ColumnWithData], 0, GetFirstLetterIndex([ColumnWithData]))
That formula takes everything from your data text up to (but not including) the first letter; e.g. "222222T10Z1" yields "222222".
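If you would rather avoid the helper queries, the same result should be reachable in one custom-column formula using M's built-in splitter that breaks text where a digit is followed by a letter (a sketch, assuming your column is named ColumnWithData and every value starts with digits, as in the sample data):
// Split at each digit-to-letter transition, then keep the first segment.
Splitter.SplitTextByCharacterTransition({"0".."9"}, {"A".."Z", "a".."z"})([ColumnWithData]){0}
For "222222T10Z1" the splitter produces {"222222", "T10", "Z1"}, and {0} keeps the first segment.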
String formatting:
"#0.##%;(#0.##%); "
The above will format a double into a percentage string with two decimal places, put it in parentheses if it's negative, and leave it as a blank string if it's zero.
The problem is, if the double value has no decimal digits, e.g. if the value is 2, then for some reason the resulting string is "2%" and not "2.00%".
My question is: how do I make it produce "2.00%"?
p.s. the formatting is happening on a Syncfusion grid cell object and requires a string mask.
p.p.s. the existing functionality described above in italics must be maintained.
Hashes denote an optional digit. Use "#0.00%" (etc.).
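A minimal sketch of that fix with the full three-section positive;negative;zero mask kept intact (note that in a plain .NET custom format string the % placeholder also multiplies by 100, so 0.02 renders as 2.00%; Syncfusion's mask handling may differ):
using System;

class FormatDemo
{
    static void Main()
    {
        const string mask = "#0.00%;(#0.00%); ";
        Console.WriteLine(0.02.ToString(mask));    // 2.00%
        Console.WriteLine((-0.02).ToString(mask)); // (2.00%)
        Console.WriteLine(0.0.ToString(mask));     // (blank string for zero)
    }
}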
You can use the format string #0.00% for 2 decimal places.
"#" means the digit is optional to show, while "0" means it is mandatory. In this case (#0.00%) the two decimal places are mandatory, and the digit right before the "." is mandatory as well. If there is any digit before the "0", it will show up; otherwise it won't, as that digit is optional.
e.g.
2 -> 2.00%
12 -> 12.00%
120 -> 120.00%
11.234 -> 11.23%
And using "P" or "P2" also works fine in this case. "P" stands for percent, "2" is the amount of digital places.
e.g.
double number = .2468013;
Console.WriteLine(number.ToString("P", CultureInfo.InvariantCulture));
// Displays 24.68 %
Console.WriteLine(number.ToString("P",CultureInfo.CreateSpecificCulture("hr-HR")));
// Displays 24,68%
Console.WriteLine(number.ToString("P1", CultureInfo.InvariantCulture));
// Displays 24.7 %
You can refer to the MSDN documentation for more details.