This question already has answers here:
When to use Unicode Normalization Forms NFC and NFD?
(2 answers)
Normalizing Unicode
(2 answers)
What is the best way to remove accents (normalize) in a Python unicode string?
(13 answers)
Closed 4 months ago.
I was looking into Matt Parker's problem of finding five 5-letter words that together use 25 distinct letters (link if you are interested, not really relevant to the issue), but I wanted to do it with French words (and so I had to use the French alphabet).
So I downloaded the following file from GitHub, containing one French word per line, encoded in UTF-8, with the accents. But something I had never seen before occurred: there seem to be two ways (maybe more) to encode accents in UTF-8.
When I opened the file in VSCode (the same happens in Windows Notepad), every accent displayed as it should, but:
when I select an accented letter from the file, the editor shows that 2 characters are selected
when I type an accented letter on a French keyboard and then select it, the editor shows that only 1 character is selected (so the two appear to be distinct, even though they display identically)
I then tried the following code in Python:
# both letters are typed from the keyboard; prints True
print("é" == "é")
# first letter is copied from the downloaded text file, second is typed from the keyboard; prints False
print("é" == "é")
The two display identically, but they are not equal!?
I then tried to change the encoding of the file, but that either changed nothing or removed the accents: the word "abaissé" might, for example, show as "abaisse�".
After testing, it seems that the file encodes each accented letter as a separate accent character placed next to the letter it applies to. When I read the file character by character (I did it in Rust, but I'm pretty sure other languages behave the same), they come out as two characters: I first get the accent and then the letter. Rust chars are always valid Unicode scalar values, so if the pair were a single Unicode character, Rust would read it as a single char. That's not the case.
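To make the observation above concrete, here is a minimal Python sketch that lists the code points in each spelling of the letter. It assumes the file stores accents in decomposed form (a base letter followed by a combining mark), which the post suggests but does not confirm; the `describe` helper is a hypothetical name introduced for illustration.

```python
import unicodedata

# Assumption: the keyboard produces the precomposed letter and the file
# contains the decomposed pair. Escapes are used so the difference is visible.
composed = "\u00e9"      # one code point
decomposed = "e\u0301"   # base letter followed by a combining mark


def describe(text):
    """Return (code point, Unicode name) for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


print(describe(composed))
print(describe(decomposed))
print(composed == decomposed)  # the two spellings compare as different strings
```

In the decomposed spelling the combining mark is a code point of its own, which would explain why an editor counts two selected characters while rendering a single glyph.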
I'm honestly stuck, so if you know what is going on, I would be really grateful for your help.
Why do these pairs of characters display as 1 character when they should display as 2 distinct ones?
I know this is not a bug in UTF-8, but it makes handling the data very hard. So how do I recombine them, given that the accent shouldn't be decoupled from the letter in the first place?
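One common way to recombine the pairs is Unicode normalization to NFC, which the linked duplicates also point to. The sketch below uses Python's standard `unicodedata.normalize`; the `to_nfc` helper and the sample strings are illustrative assumptions, not data taken from the actual file.

```python
import unicodedata


def to_nfc(text):
    """Compose base letters with their combining marks where a precomposed form exists."""
    return unicodedata.normalize("NFC", text)


# Assumed inputs: the word as it might appear in the file (decomposed)
# and as it would be typed on a French keyboard (precomposed).
from_file = "abaisse\u0301"
typed = "abaiss\u00e9"

print(from_file == typed)                       # False
print(to_nfc(from_file) == typed)               # True
print(len(from_file), len(to_nfc(from_file)))   # 8 7
```

Applying `to_nfc` to each line while reading the word list should make every word exactly as long as it looks, which is what the letter-counting puzzle needs. Some letter and mark combinations have no precomposed form, so NFC cannot merge every possible pair, but the accented letters of French all have one.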
This question already has answers here:
UTF-8, UTF-16, and UTF-32
(14 answers)
What is the Best UTF [closed]
(6 answers)
Closed 2 years ago.
Since a Unicode character is written as U+XXXX (in hex), it seems to need only two bytes. Why, then, did we come up with several different encoding schemes, such as UTF-8, which takes from one to four bytes? Can't we just map each Unicode character to two bytes of binary data? Why would we ever need four bytes to encode one?
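The premise can be checked directly: code points run up to U+10FFFF, so not all of them fit in four hex digits or two bytes. Here is a small Python sketch comparing how many bytes a few sample characters take in each encoding; the `encoded_sizes` helper and the choice of samples are assumptions made for illustration.

```python
def encoded_sizes(ch):
    """Return the code point of ch and its encoded length in bytes per encoding."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf-8": len(ch.encode("utf-8")),
        # The "-le" variants are used so that no byte order mark is counted.
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# Samples: ASCII letter, accented Latin letter, euro sign, an emoji above U+FFFF.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(encoded_sizes(ch))
```

The last sample sits above U+FFFF, so it takes four bytes in every one of these encodings, while plain ASCII text takes a single byte per character in UTF-8. That trade-off between compactness and fixed width is one reason several schemes exist.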
This question already has answers here:
Regular Expression To Match String Not Starting With Or Ending With Spaces
(2 answers)
Regex for password: repetitive characters
(4 answers)
Closed 4 years ago.
static let password = "^(?=.*\\d)(?=.*[a-z])(?=.*[A-Z]).{8,}$"
I have this regex to validate a password. It requires:
at least one lowercase character
at least one capital character
at least one number
a length of at least 8
I need to update it to:
prevent any repeated sequence of characters or numbers
(to prevent "111111" or "aaaaaaa", for example)
prevent starting or ending with a space character
How do I update my regex to match those requirements?
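The added requirements can be sketched with extra negative lookaheads in front of the original pattern. The example is in Python rather than Swift, and it assumes that "repeated sequence" means the same character three or more times in a row; the question does not give a threshold, so adjust the `{2}` as needed. `PASSWORD_RE` and `is_valid_password` are hypothetical names.

```python
import re

# Assumption: a "repeated sequence" is one character occurring 3+ times in a row.
PASSWORD_RE = re.compile(
    r"^(?! )(?!.* $)"                   # must not start or end with a space
    r"(?!.*(.)\1{2})"                   # no character repeated 3+ times in a row
    r"(?=.*\d)(?=.*[a-z])(?=.*[A-Z])"   # original checks: digit, lowercase, uppercase
    r".{8,}$"                           # original length rule: 8 or more characters
)


def is_valid_password(candidate):
    """Return True when candidate satisfies every rule encoded in PASSWORD_RE."""
    return PASSWORD_RE.match(candidate) is not None


print(is_valid_password("Abcdef12"))    # True
print(is_valid_password("Abc111111"))   # False: run of identical digits
print(is_valid_password(" Abcdef12"))   # False: leading space
```

Inside a Swift string literal like the one in the question, each backslash would need to be doubled, for example `(?!.*(.)\\1{2})`.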
This question already has an answer here:
Learning Regular Expressions [closed]
(1 answer)
Closed 4 years ago.
I need some help writing a correct regex validation. I want a password with no spaces and at least 6 symbols; it doesn't matter whether they are numbers, letters, or symbols. The alphabet is a-zA-Z and а-яА-Я (RU). How can I do that?
"^(?=.*[A-Za-z])(?=.*\\d)[A-Za-z\\d]{6,}$"
You can take a look at this link
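Two readings of the stated requirements can be sketched in Python. The loose reading accepts any six or more non-whitespace characters; the strict reading is an assumption that only Latin letters, Russian letters, and ASCII digits are allowed. The names `LOOSE_RE` and `STRICT_RE` are hypothetical. Note that the ranges а-я and А-Я do not include ё and Ё, so those are listed explicitly.

```python
import re

# Loose reading: anything except whitespace, at least 6 characters.
LOOSE_RE = re.compile(r"\S{6,}")

# Strict reading (assumption): only Latin letters, Russian letters, ASCII digits.
STRICT_RE = re.compile(r"[A-Za-zА-Яа-яЁё0-9]{6,}")


def matches(pattern, candidate):
    """Return True when the whole candidate matches the pattern."""
    return pattern.fullmatch(candidate) is not None


print(matches(LOOSE_RE, "pass!1"))    # True
print(matches(STRICT_RE, "pass!1"))   # False: "!" is not a letter or digit
print(matches(STRICT_RE, "ёлка12"))   # True
print(matches(LOOSE_RE, "abc de"))    # False: contains a space
```

Unlike the pattern quoted above, neither sketch forces the password to contain both a letter and a digit, since the question says the mix does not matter.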
This question already has an answer here:
Inconsistent Unicode Emoji Glyphs/Symbols
(1 answer)
Closed 6 years ago.
I want to print the Unicode character U+21A9, which is the undo arrow (↩), but Apple likes to turn that into a bubbly-looking emoji.
Pick a font containing the glyph that you want, like Lucida Grande or Menlo.
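A different technique from the font choice above is to append a variation selector to the character: U+FE0E requests text presentation and U+FE0F requests emoji presentation. The Python sketch below only builds the strings; whether a given renderer honors the selector is an assumption that has to be checked on the target platform. `as_text` and the constant names are hypothetical.

```python
import unicodedata

UNDO_ARROW = "\u21A9"       # LEFTWARDS ARROW WITH HOOK
TEXT_SELECTOR = "\uFE0E"    # VARIATION SELECTOR-15: request text presentation
EMOJI_SELECTOR = "\uFE0F"   # VARIATION SELECTOR-16: request emoji presentation


def as_text(symbol):
    """Append the text-presentation selector to a single symbol."""
    return symbol + TEXT_SELECTOR


arrow = as_text(UNDO_ARROW)
print([f"U+{ord(ch):04X}" for ch in arrow])
print(unicodedata.name(UNDO_ARROW), "+", unicodedata.name(TEXT_SELECTOR))
```

The selector is an invisible code point of its own, so the resulting string has length 2 even though it should render as one glyph.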