Unicode file in notepad [closed]

What does it mean when I save a text file as "Unicode" in Notepad? Is it UTF-8, UTF-16 or UTF-32? Thanks in advance.

In Notepad, as in Windows software in general, “Unicode” as an encoding name means UTF-16 Little Endian (UTF-16LE). (I first thought it’s not real UTF-16, because Notepad++ recognizes it as UCS-2 and shows the content as garbage, but re-checking with BabelPad, I concluded that Notepad can encode even non-BMP characters correctly.)
Similarly, “Unicode big endian” means UTF-16 Big Endian. And “ANSI” means the system’s native legacy encoding, e.g. the 8-bit windows-1252 encoding in Western versions of Windows.
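To make the byte-level difference concrete, here is a minimal Python sketch (just an illustration, not anything Notepad itself runs) showing what each of Notepad's choices produces for a short string; the leading BOM bytes are how readers tell the variants apart:

    # Minimal sketch: the bytes behind Notepad's encoding names.
    # "Unicode" ~ UTF-16LE with a BOM, "Unicode big endian" ~ UTF-16BE with a BOM.
    import codecs

    text = "Aé"
    print(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))  # b'\xff\xfeA\x00\xe9\x00'
    print(codecs.BOM_UTF16_BE + text.encode("utf-16-be"))  # b'\xfe\xff\x00A\x00\xe9'
    print(text.encode("utf-8"))                            # b'A\xc3\xa9'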

All of these formats are "Unicode". But usually editors on Mac and Windows mean UTF-8 by that, because it is ASCII-compatible below code point 128. UTF-8 can represent far more than the 256 values that fit in a single 8-bit byte by using multi-byte sequences: a lead byte signals that one or more following bytes belong to the same character.
If you look at the output in a terminal, say with vi, and you see a space between every two characters, then you are looking at UTF-16, because there every (BMP) character takes two bytes. If the characters don't have spaces between them, that's an indication of UTF-8.
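A small Python sketch of the byte patterns described above (an illustration assuming plain BMP text, not part of the answer):

    # ASCII characters are one byte each in UTF-8; in UTF-16 every BMP character
    # takes two bytes, so ASCII text gets a 0x00 byte after each letter (which a
    # terminal may render as a gap or as ^@).
    print("Hi".encode("utf-8"))      # b'Hi'          one byte per character
    print("Hi".encode("utf-16-le"))  # b'H\x00i\x00'  interleaved NUL bytes
    print("ü".encode("utf-8"))       # b'\xc3\xbc'    lead byte + continuation byte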

Related

Encoding US-ASCII to UTF-8 add space character [closed]

I'm trying to encode some log files from US-ASCII to UTF-8
I am using iconv for that.
When I encode to UTF-8, there is a space at the start of the first line, just one character.
I tried using sed, but that makes the format US-ASCII on the server, and when I convert it to UTF-8 again, the space comes back.
I hope I was able to explain my problem.
I think you have a misconception about what character encodings are, and the relationship between ASCII and UTF-8. When we store text in a computer, we have to convert it into a binary sequence according to some code - we could choose something like "0001 means A, 0010 means B" and so on. To agree which code we're using, we give them names, like "ASCII" and "UTF-8".
If you look at a binary string, you can't tell what code its author was using; the best you can do is guess, by trying different codes and seeing which ones make sense. But some strings of bits will make sense in multiple codes - and, crucially for this question, they might mean exactly the same thing in multiple codes. For instance, two codes might both say that 0001 is an A, but one code says that 1110 means "?" and the other that it means "!". If all you have is a long line of A's, it will be written exactly the same way no matter which code you use.
In the case of UTF-8 and ASCII, this isn't coincidence; UTF-8 is deliberately designed so that anything written using ASCII will have exactly the same representation when written in UTF-8. The definition of UTF-8 basically begins with "if you can represent it using ASCII, do that; if you can't, follow these extra rules".
To use a different analogy, imagine the customer asked you to make sure the text was in British English, not American English - "colour" rather than "color", and so on. If the text you want to send is "It is raining today", there is nothing you need to change - the same exact string is both American English and British English at the same time.
If your text includes characters which can't be represented in ASCII, then it is not in ASCII. In that case, you need to know what encoding it is actually in - there are many encodings which, like UTF-8, are designed to be compatible with ASCII, so the majority of characters will be the same no matter which one you try. ISO 8859-1 and its cousin Windows-1252 are very common; ISO 8859-15 possibly more so in Europe; others in other parts of the world where they're useful for writing the local language.
The extra "space" you're seeing at the start of the file is probably the so-called "Byte Order Mark", a Unicode character deliberately defined as meaning nothing, but having a different representation in different encodings. It's there to give a hint to programs which want to guess the encoding used in a piece of text, but it is not mandatory, and the string was already valid UTF-8 before it was added.
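Here is a minimal Python sketch of both points, the ASCII/UTF-8 byte identity and the BOM (the log line is made up for illustration):

    import codecs

    line = "2024-01-01 INFO server started"               # hypothetical pure-ASCII log line
    assert line.encode("ascii") == line.encode("utf-8")   # identical bytes in both codes

    print(codecs.BOM_UTF8)                                # b'\xef\xbb\xbf' - the UTF-8 BOM
    with_bom = codecs.BOM_UTF8 + line.encode("utf-8")
    print(with_bom.decode("utf-8-sig"))                   # the 'utf-8-sig' codec strips the BOM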

How is Unicode different from ASCII? [closed]

http://www.joelonsoftware.com/articles/Unicode.html. The statement below is from this article:
"Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode". The author is trying to make the point that Unicode is not just "ASCII with more bytes" (extended ASCII); there is more to Unicode than it appears. But I am not getting how Unicode is different. To me it appears to be extended ASCII.
As Unicode has more numbers, it can map more characters.
Yes, that's it.
The ASCII character set defines a whopping 128 numbers (0 through 127), specifies which characters they represent, and says how they should be serialized as byte sequences: each number is encoded as a single byte, end of story.
Unicode has room for over a million such numbers, and specifies several different ways in which they may be serialized as byte sequences.
In addition, Unicode does quite a lot more than that - for example it doesn't just map integers to characters, it also maps characters to glyphs (the graphical symbols in a font), as well as describing various metadata for each character. But the main thing is just that Unicode defines a much bigger code space and separates the integer/character mapping from the encoding (so the same integer can be encoded as different byte sequences depending on whether you encode as UTF-8, UTF-16 or UTF-32)
Unicode assigns a unique integer to characters (and character modifiers). There are many encodings, but common ones are UTF-16 and UTF-8, which are both variable-width encodings.
ASCII is a 1-byte encoding of a subset of characters.
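A short Python sketch of that split between code points and encodings, using the euro sign as an example:

    ch = "€"                                # U+20AC EURO SIGN
    print(hex(ord(ch)))                     # 0x20ac - the code point, just an integer
    print(ch.encode("utf-8").hex(" "))      # e2 82 ac    - three bytes
    print(ch.encode("utf-16-le").hex(" "))  # ac 20       - two bytes
    print(ch.encode("utf-32-le").hex(" "))  # ac 20 00 00 - four bytes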

Why does windows notepad give possibility to save document in unicode and in utf-8? [closed]

Utf-8 is " is a variable-width encoding that can represent every character in the Unicode character set" (wikipedia), unicode is "standard for the consistent encoding, representation and handling of text" (wikipedia). They're difference things. Why does windows notepad give possibility to save document in unicode and utf-8? How can I compare two difference things?
To simplify,
Unicode says what number should represent each character.
UTF-8 says how to arrange those numbers as bytes to form strings of Unicode values.
According to this thread, what "Unicode" means in Notepad is UTF-16 Little Endian (UTF-16LE), which is another way of arranging the bits to form strings of Unicode values.
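As a rough sketch of how a reader can tell these formats apart, here is a small Python function (the name sniff_encoding is made up for illustration) that looks at the byte order mark classic Notepad writes at the start of the file:

    import codecs

    def sniff_encoding(first_bytes: bytes) -> str:
        # Guess the encoding from the BOM that classic Notepad writes.
        if first_bytes.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"     # Notepad's "UTF-8" (with BOM)
        if first_bytes.startswith(codecs.BOM_UTF16_LE):
            return "utf-16-le"     # Notepad's "Unicode"
        if first_bytes.startswith(codecs.BOM_UTF16_BE):
            return "utf-16-be"     # Notepad's "Unicode big endian"
        return "unknown (no BOM)"

    print(sniff_encoding(b"\xff\xfeH\x00i\x00"))   # utf-16-le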

What is Unicode? and how Encoding works? [closed]

A few hours ago I was reading a C programming book. While reading I came across the terms character encoding and Unicode, so I started googling for information about Unicode. I came to know that the Unicode character set has characters from every language, and that UTF-8, UTF-16 and UTF-32 can encode the characters listed in the Unicode character set.
But I was not able to understand how it works.
Does Unicode depend on the operating system?
How is it related to software and programs?
Is UTF-8 a piece of software that is installed on my computer along with the operating system?
Or is it related to hardware?
And how does a computer encode things?
I find it very confusing. Please answer in detail.
I am new to these things, so please keep that in mind when you answer.
Thank you.
I have written about this extensively in What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text. Here are some highlights:
encodings are plentiful; an encoding defines how a "character" like "A" can be encoded as bits and bytes
most encodings only specify this for a small number of selected characters; for example all (or at least most) characters needed to write English or Czech; single byte encodings typically support a set of up to 256 characters
Unicode is one large standard effort which has catalogued and specified a number ⟷ character relationship for virtually all characters and symbols of every major language in use, which is hundreds of thousands of characters
UTF-8, 16 and 32 are different sub-standards for how to encode this ginormous catalog of numbers to bytes, each with different size tradeoffs (see the sketch after this list)
software needs to specifically support Unicode and its UTF-* encodings, just like it needs to support any other kind of specialized encoding; most of the work is done by the OS these days which exposes supporting functions to an application
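As a quick sketch of those size tradeoffs, the same characters cost a different number of bytes in each UTF-* encoding (Python, for illustration only):

    for ch in ("A", "é", "漢", "😀"):
        sizes = {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
        print(f"U+{ord(ch):05X}", sizes)
    # U+00041: utf-8 1 byte,  utf-16 2 bytes, utf-32 4 bytes
    # U+000E9: utf-8 2 bytes, utf-16 2 bytes, utf-32 4 bytes
    # U+06F22: utf-8 3 bytes, utf-16 2 bytes, utf-32 4 bytes
    # U+1F600: utf-8 4 bytes, utf-16 4 bytes (surrogate pair), utf-32 4 bytes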

Does the Unicode Consortium Intend to make UTF-16 run out of characters? [closed]

The current version of UTF-16 is only capable of encoding 1,112,064 different numbers (code points): 0x0-0x10FFFF.
Does the Unicode Consortium intend to make UTF-16 run out of characters?
i.e. make a code point > 0x10FFFF
If not, why would anyone write the code for a UTF-8 parser to be able to accept 5- or 6-byte sequences, since that would add unnecessary instructions to their function?
Isn't 1,112,064 enough? Do we actually need MORE characters? I mean: how quickly are we running out?
As of 2011 we have assigned 109,449 characters and set aside 137,468 code points for private/application use (6,400 + 131,068),
leaving room for over 860,000 unused code points; plenty for CJK Extension E (~10,000 chars) and 85 more sets just like it, so that in the event of contact with the Ferengi culture, we should be ready.
In November 2003 the IETF restricted UTF-8 to end at U+10FFFF with RFC 3629, in order to match the constraints of the UTF-16 character encoding: a UTF-8 parser should not accept 5- or 6-byte sequences that would overflow the UTF-16 set, or 4-byte sequences for characters greater than 0x10FFFF.
Please add edits here listing character sets that pose a threat to the size of the Unicode code point limit, if they are over 1/3 the size of CJK Extension E (~10,000 chars):
CJK Extension E (~10,000 chars)
Ferengi culture characters (~5,000 chars)
At present time, the Unicode standard doesn't define any characters above U+10FFFF, so you would be fine to code your app to reject characters above that point.
Predicting the future is hard, but I think you're safe for the near term with this strategy. Honestly, even if Unicode extends past U+10FFFF in the distant future, it almost certainly won't be for mission critical glyphs. Your app might not be compatible with the new Ferengi fonts that come out in 2063, but you can always fix it when it actually becomes an issue.
Cutting to the chase:
It is indeed intentional that the encoding system only supports code points up to U+10FFFF
It does not appear that there is any real risk of running out any time soon.
There is no reason to write a UTF-8 parser that supports 5-6 byte sequences, except for support of any legacy systems that actually used them. The current official UTF-8 specification does not allow 5-6 byte sequences, in order to accommodate 100% lossless conversions to/from UTF-16. If there is ever a time that Unicode has to support new code points above U+10FFFF, there will be plenty of time to devise new encoding formats for the higher bit counts. Or maybe by the time that happens, memory and computational power will be sufficient that everyone will just switch to UTF-32 for everything, which can handle up to U+FFFFFFFF for over 4 billion characters.
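A small Python demonstration of that U+10FFFF ceiling: chr() refuses to build a code point beyond it, and the UTF-8 decoder rejects the 5-byte sequences that RFC 3629 removed:

    try:
        chr(0x110000)                             # one past the last valid code point
    except ValueError as e:
        print("chr:", e)                          # chr() arg not in range(0x110000)

    try:
        b"\xf8\x88\x80\x80\x80".decode("utf-8")   # a pre-RFC-3629 style 5-byte sequence
    except UnicodeDecodeError as e:
        print("decode:", e.reason)                # invalid start byte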