fastText models detecting Norwegian text as Danish [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I am using fasttext (v=0.9.1) to detect the language of a text (see this).
Norwegian text is being detected as Danish when using this model.
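# Jupyter shell escape: download the pretrained language-identification model
!curl "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin" > lid.bin
# fasttext 0.9.x exposes the Python module under the lowercase name
import fasttext
language_detector = fasttext.load_model('lid.bin')
# k=3 returns the three most probable labels with their probabilities
language_detector.predict('Hei Jeg viser til hyggelig sam', k=3)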
Output:
(('__label__da', '__label__no', '__label__hu'),
array([9.16624188e-01, 8.25065151e-02, 2.37607688e-04]))
Any help?

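It seems that distinguishing Norwegian from Danish is difficult (see this). Written Norwegian Bokmål and Danish share much of their vocabulary and spelling, and a short fragment like the one above gives a classifier little to work with.
fastText is not particularly suitable for this task.
You can try polyglot, a Python library dedicated to multilingual NLP.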
from polyglot.detect import Detector
detector = Detector('Hei Jeg viser til hyggelig sam')
print(detector)
output:
Prediction is reliable: True
Language 1: name: Norwegian code: no confidence: 96.0 read bytes: 1189
Language 2: name: un code: un confidence: 0.0 read bytes: 0
Language 3: name: un code: un confidence: 0.0 read bytes: 0
A little note: if you install polyglot, please be careful with dependencies (read this and this).

Related

ISO-8859-9/Latin-9 encoding [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I know that ISO-8859-9/Latin-5 and ISO-8859-15/Latin-9 exist, but recently I had to handle some messages described as encoded in "ISO-8859-9/Latin-9".
What exactly does that mean?
There is ISO-8859-9 which is called Latin-5.
And there is ISO-8859-15 which is called Latin-9.
Yes, it is confusing. In my opinion it's simplest to always use only the ISO-8859-n name. That avoids potential confusion.
So "ISO-8859-9/Latin-9" is probably a typo (or someone wrongly thought that the suffix is identical for the "ISO-8859-" and the "Latin-" prefix).
Depending on the source of the data, you can guess which one they meant. ISO-8859-9 is used for Turkish text and ISO-8859-15 is basically the modern replacement for ISO-8859-1 (covering most of Western Europe, mostly used because it has the € symbol).
Source: ISO/IEC 8859 Wiki page.
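If you have a sample of the data, decoding it both ways can help you guess which encoding was meant. Here is a minimal Python sketch of that idea. The sample bytes are made up for illustration (they are not from any real message); the codec names and aliases are the ones in Python's standard library.

```python
import codecs

# Python ships both codecs; "latin5" and "latin9" are aliases for them.
assert codecs.lookup("latin5").name == codecs.lookup("iso-8859-9").name
assert codecs.lookup("latin9").name == codecs.lookup("iso-8859-15").name

# Illustrative sample bytes (an assumption, not taken from a real message).
raw = bytes([0xA4, 0xFD, 0xDE])

as_iso_8859_9 = raw.decode("iso-8859-9")    # Turkish reading: ¤ ı Ş
as_iso_8859_15 = raw.decode("iso-8859-15")  # Western European reading: € ý Þ

# Print code points so the result is readable on any console.
print([f"U+{ord(c):04X}" for c in as_iso_8859_9])
print([f"U+{ord(c):04X}" for c in as_iso_8859_15])
```

If the text is Turkish, the ISO-8859-9 reading should produce letters such as ı and Ş where the ISO-8859-15 reading produces ý and Þ.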

What is this unicode character - u'\xf1'? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 8 years ago.
What is this Unicode character: u'\xf1'?
Is there a lookup table on the web somewhere? I have seen tables, but none where I can search for this character and get the actual representation.
thanks
It is ñ (ntilde).
Unicode Hexadecimal: 0x00F1
Unicode Decimal: 241
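UCS-2 Hexadecimal: 0xF100 (the little-endian bytes F1 00 read as one number; the code unit itself is 0x00F1)
UCS-2 Decimal: 61696 (0xF100 in decimal)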
HTML Hexadecimal: &#xF1;
HTML Decimal: &#241;
http://www.fileformat.info/info/unicode/char/f1/index.htm
A search for "unicode character f1" returns what you ask for.
http://www.fileformat.info/info/unicode/char/f1/index.htm
See http://www.unicode.org/charts/ for a full 'lookup table' (several hundreds of these actually).
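You can also look a character up locally instead of searching the web. A minimal sketch in Python 3 (where the question's u'\xf1' is simply written "\xf1"), using only the standard library's unicodedata module; the variable names are just for illustration:

```python
import unicodedata

ch = "\xf1"  # Python 3 spelling of the question's u'\xf1'

code_point = ord(ch)                        # 241, i.e. 0xF1
name = unicodedata.name(ch)                 # official Unicode character name
utf16_le = ch.encode("utf-16-le")           # little-endian 16-bit code unit
swapped = int.from_bytes(utf16_le, "big")   # the same bytes read as one number

print(f"U+{code_point:04X} {name}")
print(utf16_le.hex(), swapped)
```

The last line shows where a value like 0xF100 (61696) comes from: it is the little-endian byte pair F1 00 read as a single number, not a different code point.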

How does compiler understand Unicode characters so quickly? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I recently wrote a document-based program.
What intrigues me is how a compiler (in my case, for Objective-C) can convert any character into Unicode so quickly when these characters are only visual representations.
I suppose A-Z and other common characters can be converted from ASCII to Unicode very easily. But what about special characters such as brand icons and the copyright sign?
I am solely interested in the internal workings of such a conversion.
Example:
How does the compiler understand what "©" is in the blink of an eye? Is it by looking it up in a Unicode table? And if I have 1000000 "©" characters, does my compiler look them up in the table 1000000 times? That would be very time-consuming, wouldn't it?
The compiler doesn't see "©". It sees whatever numerical representation of "©" occurs in the source file it's processing. No lookup is needed, because it's already in the form the compiler uses. (Some conversions might be needed if, for example, the source file is in UTF-8 and the compiler uses UTF-32 internally, but such conversions don't require a full Unicode table.)
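To make the "no table needed" point concrete, here is a rough Python sketch of how UTF-8 bytes can be turned into code points with nothing but bit shifts and masks. It is a simplified illustration, not what any particular compiler does: the function name decode_utf8 is made up, and the sketch does not reject overlong encodings or surrogate code points as a strict decoder would.

```python
def decode_utf8(data):
    """Decode UTF-8 bytes into a list of code points using only bit arithmetic."""
    code_points = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:              # 0xxxxxxx: ASCII, one byte
            length, cp = 1, first
        elif first >> 5 == 0b110:     # 110xxxxx: two-byte sequence
            length, cp = 2, first & 0x1F
        elif first >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
            length, cp = 3, first & 0x0F
        elif first >> 3 == 0b11110:   # 11110xxx: four-byte sequence
            length, cp = 4, first & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + length > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        for byte in data[i + 1:i + length]:
            if byte >> 6 != 0b10:     # continuation bytes look like 10xxxxxx
                raise ValueError(f"invalid continuation byte after offset {i}")
            cp = (cp << 6) | (byte & 0x3F)
        code_points.append(cp)
        i += length
    return code_points


# The bytes a compiler would read from a UTF-8 source file.
source = 'label = "©";'.encode("utf-8")
print([hex(cp) for cp in decode_utf8(source)])
```

In the file, "©" is the two bytes C2 A9; a mask and a shift turn them into code point 0xA9. The cost per character is a handful of integer operations, so a million copies are no problem.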

How to convert character to unicode? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
I have this character.
&#8211
How to convert this character to unicode?
Sorry if it is a silly question.
It's not a silly question, character encoding can be tricky to get your head around. I highly recommend reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (I'm sure you can guess the topic).
Unicode itself isn't an encoding, it's a very long list of characters and code points. What I'm guessing you want to do is display the dash character in some way. Where are you wanting to display or store the data? If it's in a browser, then that representation should work as that's the HTML encoded version. If you want to store it in a database then you'll need to convert that encoded version to a string and then convert that string to whatever encoding the database is using.
Take a look at this page, which shows the character's encoding in different formats:
http://www.fileformat.info/info/unicode/char/2013/index.htm
Note that each programming language has its own rules for writing this character in a string/char literal.
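As one example of "convert the encoded version to a string, then to the target encoding", here is a minimal Python sketch using the standard library's html module. UTF-8 as the storage encoding is an assumption for illustration; use whatever your database expects.

```python
import html

reference = "&#8211"               # as written in the question (no trailing ';')
ch = html.unescape(reference)      # decode the numeric character reference
code_point = ord(ch)               # 8211, i.e. U+2013 (EN DASH)
utf8_bytes = ch.encode("utf-8")    # assumed storage encoding: UTF-8

print(f"U+{code_point:04X}", utf8_bytes.hex())
```

The number in the reference (8211) is simply the decimal form of the code point, which is 0x2013 in hexadecimal.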

I saw musical symbols in HTML plain text, but does anyone know how exactly that works? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 11 years ago.
♯
♭
I saw these two symbols and copied them.
I tried looking them up as HTML entities or special characters, but couldn't find anything.
I also can't find any information on Google, because these symbols aren't searchable.
Can anyone explain which standard defines these flat and sharp musical symbols?
And how can I type or generate them and any related symbols?
♯
♭
♪
♬
♫
The standard that defines these characters is Unicode.
See Unicode Miscellaneous Symbols (includes common music symbols like ♯) and Unicode Musical Symbols (other music symbols) -- I did a search for "unicode musical symbols", there are many more hits.
Happy coding.
See How to enter Unicode characters in Microsoft Windows -- or use the Windows Character Map. However, you need to know the code point (or the general code-point area) :-)
Other operating systems have different input methods and utilities.
A quick Google search finds the following page, which lists entity codes for musical notes:
http://www.danshort.com/HTMLentities/index.php?w=music
It is in Unicode, and you can insert any Unicode character by putting a numeric character reference in HTML/XHTML markup:
&#x266C;
This gives ♬, i.e. you write &#x followed by the hex code of the character and end it with ;
P.S: This technique is used as the last resort when facing character encoding problems.
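If you want to generate such references for the symbols in the question, a small Python sketch along these lines should work. The helper name to_hex_reference is made up for illustration; ord, html.unescape and unicodedata.name are standard-library functions.

```python
import html
import unicodedata


def to_hex_reference(ch):
    """Return the hexadecimal HTML character reference for one character."""
    return f"&#x{ord(ch):X};"


for symbol in "♯♭♪♬♫":
    ref = to_hex_reference(symbol)
    # Round-trip check: the reference decodes back to the same character.
    assert html.unescape(ref) == symbol
    print(ref, unicodedata.name(symbol))
```

The printed Unicode names (for example MUSIC SHARP SIGN) are also good search terms when the symbol itself is not searchable.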
Can anyone explain which standard defines these flat and sharp musical symbols?
Unicode
And how can I type or generate them and any related symbols?
Most operating systems ship with utilities for picking Unicode characters.