Do any common computers use big endian encoding? - cpu-architecture

I understand big endian and little endian. However, all the processors of all the computers accessible to me -- AMD, Intel, Broadcom -- are little endian. This leads me to wonder whether there are any common computers that use big endian. Can anyone provide examples?

The LEON processor, commonly used in spacecraft applications, is big-endian.
Apparently, the LEON is a popular processor to simulate.
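As a quick aside (my own sketch, not part of the answer): if you want to check which convention your own machine uses, Python exposes it directly.

import sys, array

print(sys.byteorder)              # 'little' on x86/x86-64 and most ARM, 'big' on e.g. s390x

# Observe it directly: how does the host lay out a 16-bit integer in memory?
a = array.array("H", [0x0102])    # one unsigned 16-bit value, native byte order
print(a.tobytes())                # b'\x02\x01' on little-endian, b'\x01\x02' on big-endian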

Related

What's Unicode/ASCII's relevance to machine code?

Even though machine language varies according to, well, the machine, as far as I've found out, Unicode/ASCII assigns specific values to characters (this whole concept is still a bit confusing). So, basically, is the binary value for the character, let's say, 'A' in Linux different from that of 'A' in Windows? If different machines understand different sequences of 1s and 0s, shouldn't 'A's 1s and 0s differ according to the machine (even though Unicode has set values for each character -- I think)?
P.S. I'm kind of new to programming and don't even know if this is the right place to ask this question. (If it isn't, sorry!)
Linux and Windows are different operating systems, which can very well run on the same machine (hardware). ASCII and Unicode (and the Unicode encodings like UTF-8) are standards independent of any specific operating system or machine. These standards define how data should be expressed, and that is independent of any specific implementation of that standard. ASCII in Windows is exactly the same as ASCII in Linux, because ASCII has been defined the way it is and different systems must make their implementation conform to that standard if they want to be interoperable.
Now, different hardware architectures may use big-endian vs. little-endian architectures, in which case the actual bytes may be processed in a different order internally. But that is merely an implementation detail; ASCII will still be ASCII.
Machines don't "understand" characters. They process bytes, made of 0s and 1s.
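To make that concrete, here is a small Python illustration of my own: the value of 'A' is fixed by ASCII/Unicode, regardless of the operating system or CPU.

print(ord("A"))             # 65 on Linux, Windows and macOS alike -- defined by ASCII/Unicode
print("A".encode("ascii"))  # b'A', i.e. the single byte 65
print("A".encode("utf-8"))  # the same single byte: UTF-8 is ASCII-compatible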

Big endian and small endian confusion

I have seen two definitions of big endian/small endian, which have caused me some confusion.
The first definition is the classic one related to machine:
Big-endian systems store the most significant byte of a word in the smallest address and the least significant byte is stored in the largest address (also see Most significant bit). Little-endian systems, in contrast, store the least significant byte in the smallest address.
This makes perfect sense and this is the definition of big/small endian in my whole life until I came across various discussions related to cryptography:
book "Cryptography for Developers" By Tom St Denis says, "the OS2IP function converts the octet string to integer by loading the octet strings in big endian fashion. That is, the first byte is the most significant."
https://crypto.stackexchange.com/questions/10824/what-does-an-rsa-signature-look-like/10826#10826
In the accepted answer of this question, it says, "The padded value is then interpreted as an integer x, by decoding it with the big-endian convention."
Apparently, these two crypto discussions do not involve anything related to machine architecture. What is their definition of big-endian fashion/convention?
Big and little endian are just conventions about representing numbers with bytes. In big endian, the most significant byte comes first, in the little endian it's the other way around. Different architectures, data formats, algorithms and networking protocols may adopt different strategies.
Moreover, good programs will not depend on the endianness of the architecture, for example, to read a number from an array you could write something like:
int read_big_endian_16(unsigned char *data) {
    /* the first byte is the most significant one */
    return (data[0] << 8) | data[1];
}
or using functions like ntohs() and friends.
In Python it's:
struct.unpack('>H', data)[0]
Binary data formats are a good example of when endianness is important, if you expect them to be cross-platform. If you write data on a little-endian platform, you want to be able to read it on a big-endian one. That's why any decent format specifies these things explicitly, and portable programs take into account the chance of being compiled/run on different architectures. Another example is multibyte character encodings like UTF-16LE and UTF-16BE.
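To make the cross-platform point concrete (a sketch of mine, not part of the original answer), Python's struct module lets you state the byte order of a binary format explicitly, so the same code reads and writes it correctly on any host:

import struct

value = 0x1234
print(struct.pack(">H", value))   # bytes 0x12 0x34: big-endian, i.e. network byte order
print(struct.pack("<H", value))   # bytes 0x34 0x12: little-endian

# The same distinction exists for multibyte text encodings:
print("A".encode("utf-16-be"))    # b'\x00A'
print("A".encode("utf-16-le"))    # b'A\x00'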
You can find a more detailed explanation here

What's the big deal with unicode?

I've heard a lot of people talk about how some new version of a language now supports unicode, and how much of an achievement unicode is. What's the big deal about being able to support a new characterset. It seems like something which would rarely if ever be used but people mention it quite often. What's the benefit or reason people use or even care about unicode?
Programming languages are used to produce software.
Software is used to solve problems faced by humans.
Producing software has a cost.
Software that solves problems for humans produces value. This value can be expressed in the form of profit, or the reduction of costs, depending on the business model of the software developer. How the value is expressed is irrelevant for the purposes of this discussion; what is relevant is that net value is produced.
There are seven billion humans in the world. A significant fraction of them are most comfortable reading text that is not written in the Latin alphabet.
Software which purports to solve a problem for some fraction of those seven billion humans who do not use the Latin alphabet does so more effectively if developers can easily manipulate text written in non-Latin alphabets.
Therefore, a programming language which supports non-Latin character sets lowers the costs of software developers, thereby enabling them to solve more problems for more people at lower costs, and thereby produce more value.
Unicode is the de facto standard for manipulation of non-Latin text.
Therefore, Unicode is important to the design and implementation of programming languages.
Our goal as programming language designers is the creation of tools which produce maximum value. Supporting Unicode is an easy way to massively increase the scope and range of real human problems that can be solved in software.
In the beginning, there were 256 possible characters and many different code pages to represent them. It became a tangled mess. Supporting multiple languages and multiple character sets became a programmer's nightmare.
Then the Unicode Consortium was formed. It created a standard that would allow a single character set with 256 x 256 = 65,536 characters (plus combinations thereof, and since expanded well beyond that) to include almost all languages of the world.
The biggest advantage is that a single character string may contain multiple languages. That is no small thing.
Unicode has been the native character representation in Windows ever since Windows 2000. It is also allowed as a character set in HTML and on websites.
If your application does not support Unicode, or is not planning to support it, then it is only a matter of time until your application will be left behind.
What's the big deal about being able to support a new characterset.
Unicode is not just "a new characterset". It's the character set that removes the need to think about character sets.
How would you rather write a string containing the Euro sign?
"\x80", "\x88", "\x9c", "\x9f", "\xa2\xe3", "\xa2\xe6", "\xa3\xe1", "\xa4", "\xa9\xa1", "\xd9\xe6", "\xdb", or "\xff" depending upon the encoding.
"\u20AC", in every locale, on every OS.
Unicode can support pretty much any language in the world. Without such an encoding you would have to worry about choosing the correct encoding for different languages, which is very bothersome (not to mention mixing multiple languages in the same text block, ugh)
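To make the Euro-sign comparison above concrete (an illustrative Python snippet of my own): the same character maps to different bytes in the legacy encodings, but to a single code point in Unicode.

euro = "\u20ac"                    # one code point, in every locale, on every OS
print(euro.encode("cp1252"))       # b'\x80'
print(euro.encode("iso-8859-15"))  # b'\xa4'
print(euro.encode("utf-8"))        # b'\xe2\x82\xac'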
Unicode support in a language means that the language's native character/string type supports all those languages as well, without the user having to worry about character encodings or multibyte characters and such while doing computations. Of course, one still has to acknowledge character encodings when doing I/O, but doing your string processing in one single sensible encoding helps a lot.
Well, if you care anything about internationalization (AKA the rest of the world), scientific notation, etc., you would care about Unicode. Unicode is difficult to deal with because we have been so ingrained in ASCII-only support. But now that modern systems support Unicode, there is no real reason not to just encode your text as UTF-8. I know I work in publishing, and for a long time we had to do hacky things like insert GIF images of formulas, etc. Now we can put Unicode straight in, people can search and copy and paste, and our code can deal with it by using Unicode regexes.
If you wish to communicate with someone whose native language is not English (either the British or American variants), you care. A lot.
As everyone says: support for all the character sets and formatting used by every other language and locale in the world. Open source and commercial developers both like that because it increases their potential user base by about 20-fold (and growing).
Unicode is a good thing because it eliminates character set problems and leaves one less thing to worry about. Even if your software never leaves the U.S., you never know when you're going to run into a filename or text field with an odd character in it, and Unicode lets you live in ignorance.
Americans like Daisetsu may not care about Unicode, but the rest of the world uses a bit more than 26 Latin letters, and there Unicode is heavily used.
We had hundreds of messed up charsets in the past solely because American computer scientists thought "why would anyone want to use more than 26 Latin characters like we have in English?"
Narrow-mindedness is a bad thing.

Why does anyone use an encoding other than UTF-8? [closed]

I want to know why any developer would need to use an encoding other than UTF-8.
Wikipedia lists advantages and disadvantages of UTF-8 as compared to a variety of other encodings:
http://en.wikipedia.org/wiki/UTF-8#Advantages_and_disadvantages
The most important disadvantages are, IMHO, that UTF-8 can use significantly more space, especially for Asian languages such as Chinese, Japanese or Hindi, and that not all code points have the same size, which makes measuring lengths more difficult and some string operations, such as indexing by character, inefficient.
Well, some do it because their tools are archaic or flawed. Some do it because they don't see a need to support anything other than ASCII. Some do it because they don't know any better.
Those are the usual excuses for not using Unicode.
As for not using UTF-8 specifically, there are different reasons. Some systems, like Windows [1] (and, stemming from that, .NET) and Java, came to be at a time when Unicode was a strict 16-bit code. Therefore, there was really only one encoding: UCS-2, which encodes code points directly as 16-bit words.
Later Unicode was expanded to 21 bits because 65536 code points weren't enough anymore. This caused encodings such as UTF-32 and UTF-16 to appear. For systems previously working with UCS-2 the transition to UTF-16 was the easiest and most sensible choice. Windows did that transition back in Ye Olde Days of Windows 2000.
So while I think that nearly all applications nowadays should support Unicode, I don't think it is entirely necessary for them to specifically use UTF-8. There are historical reasons for that, and no real benefit in converting existing systems from UTF-16 to UTF-8.
[1] NT.
Code points between U+0800 and U+FFFF take up three bytes in UTF-8 but only two in UTF-16. See the Wikipedia comparison for more details, but basically, if text heavily uses code points in this range (say, if it's Chinese), UTF-8 files will be larger than UTF-16 files with the same content.
UTF-8 is very efficient at encoding plain English text (same as ASCII). If your user base is likely to be mostly, say, Chinese, you will be much better off using UTF-16.
For more information, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
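For a rough sense of the size difference (a sketch of mine; the sample strings are made up), compare the encoded lengths in Python:

cjk = "你好，世界" * 100                   # hypothetical Chinese sample text
eng = "plain English sample text" * 100    # hypothetical English sample text

print(len(cjk.encode("utf-8")), len(cjk.encode("utf-16-le")))   # roughly 3 vs 2 bytes per character
print(len(eng.encode("utf-8")), len(eng.encode("utf-16-le")))   # 1 vs 2 bytes per character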
Because outside the English-speaking world, people have been using various encodings that predate Unicode and are tailored for their respective languages for decades. These language-specific encodings have become ingrained everywhere and are pretty much a standard. If you want to have any hope of interfacing with legacy systems, you have to use them, so all systems have to support them and usually use them as default even if they by now support UTF-8 as well. There may even be multiple legacy encodings traditionally used for different purposes.
Examples:
ISO-8859-1 in western Europe - actually outdated there as well, as you need ISO-8859-15 for the Euro sign
ISO-2022-JP in Japan for emails, Shift JIS for websites
Big5 in Taiwan
GB2312 in China
The last two examples show that encodings can even be a political issue.
Sometimes they are restricted due to historical/unsupported reasons (I'm developing on Windows using Zend Studio on a Samba share on a Linux box: and something in that mix means I keep reverting to Cp1512 instead of UTF8).
Sometimes you don't need to use UTF-8 (for example, when storing an MD5 hash in a database: you only need the hexadecimal range 0-9, A-F, so why make it a UTF-8 field, which in some databases reserves extra storage per character compared with plain ASCII).
Sometimes it's just laziness learning the UTF-8 functions for a particular language.
Because they do not know better.
The only valid criticism of UTF-8 is that encodings of common Asian languages are larger than in other encodings.
UTF-8 is superior because
It is ASCII compatible. Most known and tried string operations do not need adaptation.
It is Unicode. Anything that isn't Unicode shouldn't even be considered in this day and age. If you have important data in encoding X, spend two minutes on Google and write a conversion function. Even if you have to interface with sourceless legacy app Z, you can run your communications through a pipe so that your logic stays in the 21st century.
UTF-16 isn't fixed-length either, and assuming that it is, as many do, will only cause terrible bugs.
Additionally, Unicode is very complex, and it is almost certain that any fixed-size algorithm adapted from ASCII will yield bad results, even in UTF-32.
Say you have this UTF-16 string.
[0][1][2][F|3] [4] [5]
And you want to insert a character with code 8 between [3] and [4]
you would do insert(5,8)
If you don't check for characters outside the BMP (which requires scanning serially, as in UTF-8, since you cannot know how many double-sized characters you have), you get:
[0][1][2][F|8][3][4][5]
Two new garbage characters. So much for your fixed size encoding.
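Here is the same failure reproduced in Python (my own sketch, working on raw UTF-16 code units; indices below are 0-based, so the answer's position 5 is index 4):

import struct

s = "abc\U0001F600de"                        # 6 characters; the emoji is outside the BMP
raw = s.encode("utf-16-le")                  # UTF-16 code units, 2 bytes each, no BOM
units = list(struct.unpack("<%dH" % (len(raw) // 2), raw))
print(len(s), len(units))                    # 6 characters, but 7 code units

# Naive "fixed-width" insertion at code-unit index 4 lands inside the surrogate pair:
units.insert(4, ord("!"))
broken = struct.pack("<%dH" % len(units), *units)
print(broken.decode("utf-16-le", errors="replace"))   # 'abc?!?de' with replacement chars -- the emoji is destroyed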
You can of course disallow such characters altogether, but then, when your code interfaces with the real world, you might find your program saves the profile for this user, who lives at rm -Rf /, in .profile instead of [Classical Chinese Proverb].profile.
Or just an angry user that cannot write his thesis on Classical Chinese Proverbs with your software.
One legitimate reason is when you need to deal with legacy documents, software or hardware that are not Unicode compatible.
Another legitimate reason is that you need to use a programming language / libraries that do not support UTF8 / Unicode well ... or at all.
Other answers mention that UTF-16 is more compact than UTF-8 for Asian languages / characters.
And of course there are reasons like short-sightedness, ignorance, laziness ... and deadlines.
It's also worth remembering that in some circumstances (where a non-Latin set of characters is needed) UTF-8 can actually come out larger than a 16-bit Unicode encoding. In those cases UCS-2 or UTF-16 would be a better choice.
The reasons for using non-Unicode 8-bit character sets / encodings are all back compatibility of some kind, and/or inertia. For that matter, the most frequent reasons for using UTF-8 are compatibility with standards like XML that mandate or prefer UTF-8.
Differences in the number of bytes you think text will take up in different encodings, especially in storage, are mostly theoretical. In real world situations, compatibility requirements are more important. If compression is used, the size differences go away anyway. Even if compression is not used, total text size is hard to predict and is rarely a deciding factor.
When converting legacy code that used non-Unicode 8-bit encodings, using UTF-16 can be a tool for making sure all code has been converted, because mismatches can be caught as compile-time type errors. Many languages, runtimes and libraries like Javascript, JVM, .NET, ICU use 16-bit strings and UTF-16, even though storage and Internet protocols are usually 8-bit.
Imagine all the files you have to deal with are in GB2312 (the China mainland standard). Then you might choose GB18030 as your Unicode encoding instead. They are compatible in the same way that all ASCII is valid UTF-8. That is useful in mainland China!
You might decide even quicker when you find out that both of the mentioned GB standards are required in your IT product by law (as far as I have heard) if you want to ship in China (mainland).
Another upside is that GB2312, and therefore GB18030 as well, are also ASCII-compatible.
It is algorithmically not so robust, though. So if you have no political reasons and no GB2312 legacy, it makes no sense to use it. But if you do, here you have your answer.
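To illustrate the compatibility claims (a small sketch of mine using Python's codecs):

ascii_text = "plain ASCII: invoice 42"
# ASCII bytes come out identical in GB2312, GB18030 and UTF-8:
print(ascii_text.encode("gb2312") == ascii_text.encode("gb18030")
      == ascii_text.encode("utf-8") == ascii_text.encode("ascii"))   # True

chinese = "汉字"
# GB18030 is a superset of GB2312, so GB2312-era text keeps the same bytes:
print(chinese.encode("gb2312") == chinese.encode("gb18030"))          # True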
Related to the subject: when using MySQL, as if it wasn't complex enough, you get the option to choose which kind of UTF-8 collation you want to use. So what would you use?
utf8_general_ci
or
utf8_unicode_ci?
(I tend to use the UTF-8 variant that is used for the database connection.)
Because you sometimes want to operate easily on code points -- then you'd choose e.g. UCS-2 or UCS-4.
Many APIs require other Unicode encodings - mostly UTF-16. For instance, Java, .NET, Win32.
At my previous employer we used iso-8859-1 for some of our ASP pages to match the collation of our SQL Server, which as you can guess was not Unicode. I wanted to change the collation, but the manager said to wait till we upgrade our SQL Server to do it. Needless to say it never happened - I haven't been with them for a little over a year now, so I don't know if they finally did it.
Unicode certainly is a good place to work from in most cases, but a developer should be familiar with many different types of character encoding. Certainly ASCII might be used if the set of characters is limited.
What if you're a developer and receiving data from a source that doesn't send UTF-8? There could be lots of interface issues if you don't understand your input.
Joel's article on the must-knows for character encoding is good and worth reading.

Why isn't everything we do in Unicode?

Given that Unicode has been around for 18 years, why are there still apps that don't have Unicode support? Even my experiences with some operating systems and Unicode have been painful to say the least. As Joel Spolsky pointed out in 2003, it's not that hard. So what's the deal? Why can't we get it together?
Start with a few questions
How often...
do you need to write an application that deals with something other than ascii?
do you need to write a multi-language application?
do you write an application that has to be multi-language from its first version?
have you heard that Unicode is used to represent non-ascii characters?
have you read that Unicode is a charset? That Unicode is an encoding?
do you see people confusing UTF-8 encoded bytestrings and Unicode data?
Do you know the difference between a collation and an encoding?
Where did you first hear of Unicode?
At school? (really?)
at work?
on a trendy blog?
Have you ever, in your young days, experienced moving source files from a system in locale A to a system in locale B, edited a typo on system B, saved the files, b0rking all the non-ascii comments and... ending up wasting a lot of time trying to understand what happened? (did your editor mix things up? the compiler? the system? the... ?)
Did you end up deciding that never again you will comment your code using non-ascii characters?
Have a look at what's being done elsewhere
Python
Did I mention on SO that I love Python? No? Well I love Python.
But until Python 3.0, its Unicode support sucked. And there were all those rookie programmers, who at that time barely knew how to write a loop, getting UnicodeDecodeError and UnicodeEncodeError out of nowhere when trying to deal with non-ascii characters. Well, they basically got life-traumatized by the Unicode monster, and I know a lot of very efficient/experienced Python coders that are still frightened today by the idea of having to deal with Unicode data.
And with Python3, there is a clear separation between Unicode & bytestrings, but... look at how much trouble it is to port an application from Python 2.x to Python 3.x if you previously did not care much about the separation/if you don't really understand what Unicode is.
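For reference, this is the separation the paragraph is talking about (a minimal sketch of my own, Python 3):

data = "héllo".encode("utf-8")   # bytes: what you actually read from files and sockets
text = data.decode("utf-8")      # str: the Unicode code points you compute with

print(type(data), type(text))    # <class 'bytes'> <class 'str'>
# Mixing them now fails loudly instead of corrupting data silently:
# data + text  ->  TypeError: can't concat str to bytes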
Databases, PHP
Do you know a popular commercial website that stores its international text as Unicode?
You will (perhaps) be surprised to learn that the Wikipedia backend does not store its data using the database's native Unicode text types. All text is encoded in UTF-8 and stored as binary data in the database.
One key issue here is how to sort text data if you store it as Unicode code points. This is where Unicode collations come in: they define a sorting order on Unicode code points. But proper support for collations in databases is missing/in active development. (There are probably a lot of performance issues, too. -- IANADBA) Also, there is no widely accepted standard for collations yet: for some languages, people don't agree on how words/letters/word groups should be sorted.
Have you heard of Unicode normalization? (Basically, you should convert your Unicode data to a canonical representation before storing it.) Of course it's critical for database storage and comparisons. But PHP, for example, has only provided support for normalization since 5.2.4, which came out in August 2007.
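As a concrete example of normalization (my own sketch, using Python's unicodedata module):

import unicodedata

a = "\u00e9"      # 'é' as a single precomposed code point
b = "e\u0301"     # 'e' followed by a combining acute accent -- looks identical

print(a == b)                                   # False: different code point sequences
print(unicodedata.normalize("NFC", a) ==
      unicodedata.normalize("NFC", b))          # True: same canonical (NFC) form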
And in fact, PHP does not completely support Unicode yet. We'll have to wait for PHP 6 to get Unicode-compatible functions everywhere.
So, why isn't everything we do in Unicode?
Some people don't need Unicode.
Some people don't care.
Some people don't understand that they will need Unicode support later.
Some people don't understand Unicode.
For some others, Unicode is a bit like accessibility for webapps: you start without it, and add support for it later.
A lot of popular libraries/languages/applications lack proper, complete Unicode support, not to mention collation & normalization issues. And until all items in your development stack completely support Unicode, you can't write a clean Unicode application.
The Internet clearly helps spread the Unicode trend. And it's a good thing. Initiatives like Python 3's breaking changes help educate people about the issue. But we will have to wait patiently a bit more to see Unicode everywhere and new programmers instinctively using Unicode instead of plain byte strings where it matters.
For the anecdote: because FedEx apparently does not support international addresses, the Google Summer of Code '09 students were all asked by Google to provide an ascii-only name and address for shipping. If you think that most business actors understand the stakes behind Unicode support, you are just wrong. FedEx does not understand, and their clients do not really care. Yet.
Many product developers don't consider their apps being used in Asia or other regions where Unicode is a requirement.
Converting existing apps to Unicode is expensive and usually driven by sales opportunities.
Many companies have products maintained on legacy systems and migrating to Unicode means a totally new development platform.
You'd be surprised how many developers don't understand the full implications of Unicode in a multi-language environment. It's not just a case of using wide strings.
Bottom line - cost.
Probably because people are used to ASCII and a lot of programming is done by native English speakers.
IMO, it's a function of collective habit, rather than conscious choice.
The widespread availability of development tools for working with Unicode may be a more recent event than you suppose. Working with Unicode was, until just a few years ago, a painful task of converting between character formats and dealing with incomplete or buggy implementations. You say it's not that hard, and as the tools improve that is becoming more true, but there are a lot of ways to trip up unless the details are hidden from you by good languages and libraries. Hell, just cutting and pasting unicode characters could be a questionable proposition a few years back. Developer education also took some time, and you still see people make a ton of really basic mistakes.
The Unicode standard weighs probably ten pounds. Even just an overview of it would have to discuss the subtle distinctions between characters, glyphs, codepoints, etc. Now think about ASCII. It's 128 characters. I can explain the entire thing to someone that knows binary in about 5 minutes.
I believe that almost all software should be written with full Unicode support these days, but it's been a long road to achieving a truly international character set with encoding to suit a variety of purposes, and it's not over just yet.
Laziness, ignorance.
One huge factor is programming language support, most of which use a character set that fits in 8 bits (like ASCII) as the default for strings. Java's String class uses UTF-16, and there are others that support variants of Unicode, but many languages opt for simplicity. Space is so trivial of a concern these days that coders who cling to "space efficient" strings should be slapped. Most people simply aren't running on embedded devices, and even devices like cell phones (the big computing wave of the near future) can easily handle 16-bit character sets.
Another factor is that many programs are written only to run in English, and the developers (1) don't plan (or even know how) to localize their code for multiple languages, and (2) they often don't even think about handling input in non-Roman languages. English is the dominant natural language spoken by programmers (at least, to communicate with each other) and to a large extent, that has carried over to the software we produce. However, the apathy and/or ignorance certainly can't last forever... Given the fact that the mobile market in Asia completely dwarfs most of the rest of the world, programmers are going to have to deal with Unicode quite soon, whether they like it or not.
For what it's worth, I don't think the complexity of the Unicode standard is that big of a contributing factor for programmers, but rather for those who must implement language support. When programming in a language where the hard work has already been done, there is even less reason not to use the tools at hand. C'est la vie, old habits die hard.
All operating systems until very recently were built on the assumption that a character was a byte. Their APIs were built like that, the tools were built like that, the languages were built like that.
Yes, it would be much better if everything I wrote was already... err... UTF-8? UTF-16? UTF-7? UTF-32? Err... mmm... It seems that whatever you pick, you'll annoy someone. And, in fact, that's the truth.
If you pick UTF-16, then all of your data, as in pretty much the whole economy of the western world, stops being seamlessly readable, since you lose ASCII compatibility. Add to that, a byte ceases to be a character, which seriously breaks the assumptions today's software is built upon. Furthermore, some countries do not accept UTF-16. Now, if you pick ANY variable-length encoding, you break some basic premises of lots of software, such as not needing to traverse a string to find the nth character, or being able to read a string from any point in it.
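As a small illustration of the "byte ceases to be a character" point (my own Python sketch):

s = "naïve"
b = s.encode("utf-8")
print(len(s), len(b))    # 5 characters, 6 bytes: 'ï' takes two bytes
print(b[2:3])            # b'\xc3' -- half of 'ï', not a character at all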
And, then UTF-32... well, that's four bytes. What was the average hard drive size or memory size but 10 years ago? UTF-32 was too big!
So, the only solution is to change everything -- software, utilities, operating systems, languages, tools -- at the same time to be i18n-aware. Well. Good luck with "at the same time".
And if we can't do everything at the same time, then we always have to keep an eye out for stuff which hasn't been i18n. Which causes a vicious cycle.
It's easier for end user applications than for middleware or basic software, and some new languages are being built that way. But... we still use Fortran libraries written in the 60s. That legacy, it isn't going away.
Because UTF-16 became popular before UTF-8 and UTF-16 is a pig to work with. IMHO
Because for 99% of applications, Unicode support is not a checkbox on the customer's product comparison matrix.
Add to the equation:
It takes a conscious effort with almost no readily visible benefit.
Many programmers are afraid of it or don't understand it.
Management REALLY doesn't understand it or care about it, at least not until a customer is screaming about it.
The testing team isn't testing for Unicode compliance.
"We didn't localize the UI, so non-English speakers wouldn't be using it anyway."
Tradition and attitude. ASCII and computers are sadly synonyms to many people.
However, it would be naïve to think that the rôle of Unicode is only a matter of exotic languages from Eurasia and other parts of the world. A rich text encoding has lots of meaning to bring even to "plain" English text. Look in a book sometime.
I would say there are mainly two reasons. The first one is simply that the Unicode support of your tools just isn't up to snuff. C++ still doesn't have Unicode support and won't get it until the next standard revision, which will take maybe a year or two to be finished and then another five or ten years to be in widespread use. Many other languages aren't much better, and even if you finally have Unicode support, it might still be more cumbersome to use than plain ASCII strings.
The second reason is in part what is causing the first issue: Unicode is hard. It's not rocket science, but it gives you a ton of problems that you never had to deal with in ASCII. With ASCII you had a clear one byte == one glyph relationship, could address the Nth character of a string with a simple str[N], could just store all characters of the whole set in memory, and so on. With Unicode you can no longer do that: you have to deal with the different ways Unicode is encoded (UTF-8, UTF-16, ...), byte order marks, decoding errors, lots of fonts that have only a subset of the characters you would need for full Unicode support, more glyphs than you want to store in memory at a given time, and so on.
ASCII could be understood by just looking at an ASCII table without any further documentation; with Unicode that is simply no longer the case.
Because of the inertia caused by C++. It had (has) horrible unicode support and dragged back the developers.
I personally do not like how certain formats of Unicode break it so that you can no longer do string[3] to get the 3rd character. Sure, it could be abstracted out, but imagine how much slower a big project with strings, such as GCC, would be if it had to traverse a string to figure out the nth character. The only option is caching where "useful" positions are, and even then it's slow, and in some formats you're now taking a good 4 bytes per character. To me, that is just ridiculous.
More overhead, space requirements.
I suspect it's because software has such strong roots in the west. UTF-8 is a nice, compact format if you happen to live in America. But it's not so hot if you live in Asia. ;)
Unicode requires more work (thinking), you usually only get paid for what is required so you go with the fastest less complicated option.
Well, that's from my point of view. I guess if you expect code to use std::wstring hw(L"hello world"), you have to explain that to print a wstring you need wcout: std::wcout << hw << std::endl; (I think; but endl seems fine). So it seems like more work to me. Of course, if I was writing an international app I would have to invest in figuring it out, but until then I don't (as I suspect most developers don't).
I guess this goes back to money, time is money.
It's simple. Because we only have ASCII characters on our keyboards, why would we ever encounter, or care about characters other than those? It's not so much an attitude as it is what happens when a programmer has never had to think about this issue, or never encountered it, perhaps doesn't even know what unicode is.
edit: Put another way, Unicode is something you have to think about, and thinking is not something most people are interested in doing, even programmers.