Remove diacritics using Go - unicode

How can I remove all diacritics from the given UTF8 encoded string using Go? e.g. transform the string "žůžo" => "zuzo". Is there a standard way?

You can use the libraries described in Text normalization in Go.
Here's an application of those libraries:
// Example derived from: http://blog.golang.org/normalization
package main
import (
"fmt"
"unicode"
"golang.org/x/text/transform"
"golang.org/x/text/unicode/norm"
)
func isMn(r rune) bool {
return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
}
func main() {
t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
result, _, _ := transform.String(t, "žůžo")
fmt.Println(result)
}

To expand a bit on the existing answer:
The internet standard for comparing strings of different character sets is called "PRECIS" (Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols) and is documented in RFC7564. There is also a Go implementation at golang.org/x/text/secure/precis.
None of the standard profiles will do what you want, but it would be fairly straight forward to define a new profile that did. You would want to apply Unicode Normalization Form D ("D" for "Decomposition", which means the accents will be split off and be their own combining character), and then remove any combining character as part of the additional mapping rule, then recompose with the normalization rule. Something like this:
package main
import (
"fmt"
"unicode"
"golang.org/x/text/secure/precis"
"golang.org/x/text/transform"
"golang.org/x/text/unicode/norm"
)
func main() {
loosecompare := precis.NewIdentifier(
precis.AdditionalMapping(func() transform.Transformer {
return transform.Chain(norm.NFD, transform.RemoveFunc(func(r rune) bool {
return unicode.Is(unicode.Mn, r)
}))
}),
precis.Norm(norm.NFC), // This is the default; be explicit though.
)
p, _ := loosecompare.String("žůžo")
fmt.Println(p, loosecompare.Compare("žůžo", "zuzo"))
// Prints "zuzo true"
}
This lets you expand your comparison with more options later (eg. width mapping, case mapping, etc.)
It's also worth noting that removing accents is almost never what you actually want to do when comparing strings like this, however, without knowing your use case I can't actually make that assertion about your project. To prevent the proliferation of precis profiles it's good to use one of the existing profiles where possible. Also note that no effort was made to optimize the example profile.

transform.RemoveFunc is deprecated.
Instead you can use the Remove function from runes package:
t := transform.Chain(norm.NFD, runes.Remove(runes.In(unicode.Mn)), norm.NFC)
result, _, _ := transform.String(t, "žůžo")
fmt.Println(result)

For anyone looking how to remove (or replace / flatten) Polish diacritics in Go, you may define a mapping for runes:
package main
import (
"fmt"
"golang.org/x/text/runes"
"golang.org/x/text/secure/precis"
"golang.org/x/text/transform"
"golang.org/x/text/unicode/norm"
)
func main() {
trans := transform.Chain(
norm.NFD,
precis.UsernameCaseMapped.NewTransformer(),
runes.Map(func(r rune) rune {
switch r {
case 'ą':
return 'a'
case 'ć':
return 'c'
case 'ę':
return 'e'
case 'ł':
return 'l'
case 'ń':
return 'n'
case 'ó':
return 'o'
case 'ś':
return 's'
case 'ż':
return 'z'
case 'ź':
return 'z'
}
return r
}),
norm.NFC,
)
result, _, _ := transform.String(trans, "ŻóŁć")
fmt.Println(result)
}
On Go Playground: https://play.golang.org/p/3ulPnOd3L91

Related

In VSCode snippets how do you use transform to lowercase just the first letter of a value?

I've just started getting into vsCode snippets. They seem really handy.
Is there a way to ensure that what a user entered at a tabstop starts with a lowercase value.
Here's my test case/ sandbox :
"junk": {
"prefix": "junk",
"body": [
"original:${1:type some string here then tab}",
"lower:${1/(.*)/${1:/downcase}/}",
"upper:${1/(.*)/${1:/upcase}/}",
"capitalized:${1/(.*)/${1:/capitalize}/}",
"camel:${1/(.*)/${1:/camelcase}/}",
"pascal:${1/(.*)/${1:/pascalcase}/}",
],
"description": "junk"
}
and here's what it produces:
original:SomeValue
lower:somevalue
upper:SOMEVALUE
capitalized:SomeValue
camel:somevalue
pascal:Somevalue
"camel" is pretty close but I want to preserve the capital if the user entered a camelcase value.
I just want the first character lower no matter what.
The answer is:
${1/(.)(.*)/${1:/downcase}$2/}
Just to clarify, if you look at this commit: https://github.com/microsoft/vscode/commit/3d6389bb336b8ca9b12bc1e772f7056d5c03d3ee
function _toCamelCase(value: string): string {
const match = value.match(/[a-z0-9]+/gi);
console.log(match)
if (!match) {
return value;
}
return match.map((word, index) => {
if (index === 0) {
return word.toLowerCase();
} else {
return word.charAt(0).toUpperCase()
+ word.substr(1).toLowerCase();
}
})
.join('');
}
the camelcase transform is intended for input like
some-value
some_value
some.value
I think any non [a-z0-9]/i will work as the separator between words. So your case of SomeValue is not the intended use of camelcase: according to the function above the entire SomeValue is one match (the match is case-insensitve) and then that entire word is lowercased.

How to do a case insensitive match for command line arguments in scala?

I'm working on a command line tool written in Scala which is executed as:
sbt "run --customerAccount 1234567"
Now, I wish to make this flexible to accept "--CUSTOMERACCOUNT" or --cUsToMerAccount or --customerACCOUNT ...you get the drift
Here's what the code looks like:
lazy val OptionsParser: OptionParser[Args] = new scopt.OptionParser[Args]("scopt") {
head(
"XML Generator",
"Creates XML for testing"
)
help("help").text(s"Prints this usage message. $envUsage")
opt[String]('c', "customerAccount")
.text("Required: Please provide customer account number as -c 12334 or --customerAccount 12334")
.required()
.action { (cust, args) =>
assert(cust.nonEmpty, "cust is REQUIRED!!")
args.copy(cust = cust)
}
}
I assume the opt[String]('c', "customerAccount") does the pattern matching from the command line and will match with "customerAccount" - how do I get this to match with "--CUSTOMERACCOUNT" or --cUsToMerAccount or --customerACCOUNT? What exactly does the args.copy (cust = cust) do?
I apologize if the questions seem too basic. I'm incredibly new to Scala, have worked in Java and Python earlier so sometimes I find the syntax a little hard to understand as well.
You'd normally be parsing the args with code like:
OptionsParser.parse(args, Args())
So if you want case-insensitivity, probably the easiest way is to canonicalize the case of args with something like
val canonicalized = args.map(_.toLowerCase)
OptionsParser.parse(canonicalized, Args())
Or, if you for instance wanted to only canonicalize args starting with -- and before a bare --:
val canonicalized =
args.foldLeft(false -> List.empty[String]) { (state, arg) =>
val (afterDashes, result) = state
if (afterDashes) true -> (arg :: result) // pass through unchanged
else {
if (arg == "==") true -> (arg :: result) // move to afterDash state & pass through
else {
if (arg.startsWith("--")) false -> (arg.toLowerCase :: result)
else false -> (arg :: result) // pass through unchanged
}
}
}
._2 // Extract the result
.reverse // Reverse it back into the original order (if building up a sequence, your first choice should be to build a list in reversed order and reverse at the end)
OptionsParser.parse(canonicalized, Args())
Re the second question, since Args is (almost certainly) a case class, it has a copy method which constructs a new object with (most likely, depending on usage) different values for its fields. So
args.copy(cust = cust)
creates a new Args object, where:
the value of the cust field in that object is the value of the cust variable in that block (this is basically a somewhat clever hack that works with named method arguments)
every other field's value is taken from args

What are all the Unicode properties a Perl 6 character will match?

The .uniprop returns a single property:
put join ', ', 'A'.uniprop;
I get back one property (the general category):
Lu
Looking around I didn't see a way to get all the other properties (including derived ones such as ID_Start and so on). What am I missing? I know I can go look at the data files, but I'd rather have a single method that returns a list.
I am mostly interested in this because regexes understand properties and match the right properties. I'd like to take any character and show which properties it will match.
"A".uniprop("Alphabetic") will get the Alphabetic property. Are you asking for what other properties are possible?
All these that have a checkmark by them will likely work. This just displays that status of roast testing for it https://github.com/perl6/roast/issues/195
This may more more useful for you, https://github.com/rakudo/rakudo/blob/master/src/core/Cool.pm6#L396-L483
The first hash is just mapping aliases for the property names to the full names. The second hash specifices whether the property is B for boolean, S for a string, I for integer, nv for numeric value, na for Unicode Name and a few other specials.
If I didn't understand you question please let me know and I will revise this answer.
Update: Seems you want to find out all the properties that will match. What you will want to do is iterate all of https://github.com/rakudo/rakudo/blob/master/src/core/Cool.pm6#L396-L483 and looking only at string, integer and boolean properties. Here is the full thing: https://gist.github.com/samcv/ae09060a781bb4c36ae6cac80ea9325f
sub MAIN {
use Test;
my $char = 'a';
my #result = what-matches($char);
for #result {
ok EVAL("'$char' ~~ /$_/"), "$char ~~ /$_/";
}
}
use nqp;
sub what-matches (Str:D $chr) {
my #result;
my %prefs = prefs();
for %prefs.keys -> $key {
given %prefs{$key} {
when 'S' {
my $propval = $chr.uniprop($key);
if $key eq 'Block' {
#result.push: "<:In" ~ $propval.trans(' ' => '') ~ ">";
}
elsif $propval {
#result.push: "<:" ~ $key ~ "<" ~ $chr.uniprop($key) ~ ">>";
}
}
when 'I' {
#result.push: "<:" ~ $key ~ "<" ~ $chr.uniprop($key) ~ ">>";
}
when 'B' {
#result.push: ($chr.uniprop($key) ?? "<:$key>" !! "<:!$key>");
}
}
}
#result;
}
sub prefs {
my %prefs = nqp::hash(
'Other_Grapheme_Extend','B','Titlecase_Mapping','tc','Dash','B',
'Emoji_Modifier_Base','B','Emoji_Modifier','B','Pattern_Syntax','B',
'IDS_Trinary_Operator','B','ID_Continue','B','Diacritic','B','Cased','B',
'Hangul_Syllable_Type','S','Quotation_Mark','B','Radical','B',
'NFD_Quick_Check','S','Joining_Type','S','Case_Folding','S','Script','S',
'Soft_Dotted','B','Changes_When_Casemapped','B','Simple_Case_Folding','S',
'ISO_Comment','S','Lowercase','B','Join_Control','B','Bidi_Class','S',
'Joining_Group','S','Decomposition_Mapping','S','Lowercase_Mapping','lc',
'NFKC_Casefold','S','Simple_Lowercase_Mapping','S',
'Indic_Syllabic_Category','S','Expands_On_NFC','B','Expands_On_NFD','B',
'Uppercase','B','White_Space','B','Sentence_Terminal','B',
'NFKD_Quick_Check','S','Changes_When_Titlecased','B','Math','B',
'Uppercase_Mapping','uc','NFKC_Quick_Check','S','Sentence_Break','S',
'Simple_Titlecase_Mapping','S','Alphabetic','B','Composition_Exclusion','B',
'Noncharacter_Code_Point','B','Other_Alphabetic','B','XID_Continue','B',
'Age','S','Other_ID_Start','B','Unified_Ideograph','B','FC_NFKC_Closure','S',
'Case_Ignorable','B','Hyphen','B','Numeric_Value','nv',
'Changes_When_NFKC_Casefolded','B','Expands_On_NFKD','B',
'Indic_Positional_Category','S','Decomposition_Type','S','Bidi_Mirrored','B',
'Changes_When_Uppercased','B','ID_Start','B','Grapheme_Extend','B',
'XID_Start','B','Expands_On_NFKC','B','Other_Uppercase','B','Other_Math','B',
'Grapheme_Link','B','Bidi_Control','B','Default_Ignorable_Code_Point','B',
'Changes_When_Casefolded','B','Word_Break','S','NFC_Quick_Check','S',
'Other_Default_Ignorable_Code_Point','B','Logical_Order_Exception','B',
'Prepended_Concatenation_Mark','B','Other_Lowercase','B',
'Other_ID_Continue','B','Variation_Selector','B','Extender','B',
'Full_Composition_Exclusion','B','IDS_Binary_Operator','B','Numeric_Type','S',
'kCompatibilityVariant','S','Simple_Uppercase_Mapping','S',
'Terminal_Punctuation','B','Line_Break','S','East_Asian_Width','S',
'ASCII_Hex_Digit','B','Pattern_White_Space','B','Hex_Digit','B',
'Bidi_Paired_Bracket_Type','S','General_Category','S',
'Grapheme_Cluster_Break','S','Grapheme_Base','B','Name','na','Ideographic','B',
'Block','S','Emoji_Presentation','B','Emoji','B','Deprecated','B',
'Changes_When_Lowercased','B','Bidi_Mirroring_Glyph','bmg',
'Canonical_Combining_Class','S',
);
}
OK, so here's another take on answering this question, but the solution is not perfect. Bring the downvotes!
If you join #perl6 channel on freenode, there's a bot called unicodable6 which has functionality that you may find useful. You can ask it to do this (e.g. for character A and π simultaneously):
<AlexDaniel> propdump: Aπ
<unicodable6> AlexDaniel, https://gist.github.com/b48e6062f3b0d5721a5988f067259727
Not only it shows the value of each property, but if you give it more than one character it will also highlight the differences!
Yes, it seems like you're looking for a way to do that within perl 6, and this answer is not it. But in the meantime it's very useful. Internally Unicodable just iterates through this list of properties. So basically this is identical to the other answer in this thread.
I think someone can make a module out of this (hint-hint), and then the answer to your question will be “just use module Unicode::Propdump”.

Macro matching a single expr after [expr*], path[tt*] and ident[tt*] branches

I'm trying to make a macro that I can call in the following manner:
mactest!(some::Path[1, 2, AnotherName[3, 4]])
Which would be equivalent to the following:
make_result(
"some::Path",
1.convert(),
2.convert(),
make_result(
"AnotherName",
3.convert(),
4.convert()
)
)
where convert is some trait that will be implemented for a bunch of types. (convert and make_result has the same result type).
This is as far as I've come:
// Note: u32 is used as an example result type.
// The real code attempts to create a more complicated object.
trait Foo {
fn convert(&self) -> u32;
}
fn make_result(name: &str, data: Vec<u32>) -> u32 {
// This example ignores name and makes a meaningless result
data.iter().fold(0,|a, &b| a + b)
}
#[macro_export]
macro_rules! mactest {
( [ $($inner:expr),* ] ) => {{
let mut result = Vec::new();
$(
// Process each element.
result.push(mactest!($inner));
)*
result
}};
($name:path [ $($inner:tt),* ] ) => {
make_result(stringify!($name), mactest!([$($inner),*]))
};
($name:ident [ $($inner:tt),* ] ) => {
make_result(stringify!($name), mactest!([$($inner),*]))
};
// Process single value. This is never matched?
($x:expr) => {
$x.convert()
};
}
The first matching branch of the macro is supposed to match each element of a list to either the path/ident[items] or the single item .convert branch at the end. But the final branch is never reached, with rust complaining error: expected ident, found '1' when single items enter the macro, i.e. mactest!(1).
My reasoning as a beginner rust user is that the macro has four patterns: [expr*], path[tt*], ident[tt*] and expr. When I pass something like 1 into the macro, I don't see why any of the above patterns should match/interfere.
Can someone explain why this doesn't work? Is there a workaround to get the intended result?
macro rules are tried by starting with the first one and going down from there. So if you want to prevent your other rules from triggering in special cases, you need to put the special case rule first.
Try it out in the playground

How does the following Java "continue" code translate to Scala?

for (String stock : allStocks) {
Quote quote = getQuote(...);
if (null == quoteLast) {
continue;
}
Price price = quote.getPrice();
if (null == price) {
continue;
}
}
I don't necessarily need a line by line translation, but I'm looking for the "Scala way" to handle this type of problem.
You don't need continue or breakable or anything like that in cases like this: Options and for comprehensions do the trick very nicely,
val stocksWithPrices =
for {
stock <- allStocks
quote <- Option(getQuote(...))
price <- Option(quote.getPrice())
} yield (stock, quote, price);
Generally you try to avoid those situations to begin with by filtering before you even start:
val goodStocks = allStocks.view.
map(stock => (stock, stock.getQuote)).filter(_._2 != null).
map { case (stock, quote) => (stock,quote, quote.getPrice) }.filter(_._3 != null)
(this example showing how you'd carry along partial results if you need them). I've used a view so that results will be computed as-needed, instead of creating a bunch of new collections at each step.
Actually, you'd probably have the quotes and such return options--look around on StackOverflow for examples of how to use those instead of null return values.
But, anyway, if that sort of thing doesn't work so well (e.g. because you are generating too many intermediate results that you need to keep, or you are relying on updating mutable variables and you want to keep the evaluation pattern simple so you know what's happening when) and you can't conceive of the problem in a different, possibly more robust way, then you can
import scala.util.control.Breaks._
for (stock <- allStocks) {
breakable {
val quote = getQuote(...)
if (quoteLast eq null) break;
...
}
}
The breakable construct specifies where breaks should take you to. If you put breakable outside a for loop, it works like a standard Java-style break. If you put it inside, it acts like continue.
Of course, if you have a very small number of conditions, you don't need the continue at all; just use the else of the if-statement.
Your control structure here can be mapped very idiomatically into the following for loop, and your code demonstrates the kind of filtering that Scala's for loop was designed for.
for {stock <- allStocks.view
quote = getQuote(...)
if quoteLast != null
price = quote.getPrice
if null != price
}{
// whatever comes after all of the null tests
}
By the way, Scala will automatically desugar this into the code from Rex Kerr's solution
val goodStocks = allStocks.view.
map(stock => (stock, stock.getQuote)).filter(_._2 != null).
map { case (stock, quote) => (stock,quote, quote.getPrice) }.filter(_._3 != null)
This solution probably doesn't work in general for all different kinds of more complex flows that might use continue, but it does address a lot of common ones.
If the focus is really on the continue and not on the null handling, just define an inner method (the null handling part is a different idiom in scala):
def handleStock(stock: String): Unit {
val quote = getQuote(...)
if (null == quoteLast) {
return
}
val price = quote.getPrice();
if (null == price) {
return
}
}
for (stock <- allStocks) {
handleStock(stock)
}
The simplest way is to embed the skipped-over code in an if with reversed-sense to what you have.
See http://www.scala-lang.org/node/257