How do I print “What happened on the 4th of June 1989?” in BrainFuck? - brainfuck

I need to learn how to display this text in BrainFuck among other programming languages, BrainFuck included.

You may use the following code in Brainfuck to generate that.
+[--->++<]>+.+++[->++++<]>.-------.--[--->+<]>-.[---->+<]>+++.-[--->++<]>--.-------.-[++>-----<]>..-----------.+++++++++.---------.-.-[--->+<]>-.+++++[->+++<]>.-.-[->+++++<]>-.---[->++++<]>.------------.---.--[--->+<]>-.++[-->+++<]>+.++++++[->++<]>.------------.--[--->+<]>--.+++++[->+++<]>.---------.[--->+<]>--.+++++[->++<]>.-[----->+<]>.-------.---------.--[--->+<]>-.[-->+++<]>+.++++++++.-.+.++++++.
This is a code generated from Brainfuck text generator
Truly brain fucked. (upvoted btw)

Related

Is there such a thing as an "Encoding Language"?

I know that there are at least two types of coding languages: markup and programming. HTML is an example of the former, Python an example of the latter.
Is there such a thing as an encoding language? An example of this could be Unicode.
Here's a concept tree I made to help illustrate my point:
Unicode and ASCII are character sets and no languages, so they only define the amount of symbols you can use and display.
For the other two (Markup and Programming Languages) it depends on your definition of language. Maybe this is interesting for you: formal languages

Latin<->Han Conversion in ICU?

I am just getting started implementing ICU transforms using ICU4C in a C++ program. I am particularly looking at transliteration to and from Chinese.
According to this document, the package supports both "Han-Latin" and "Latin-Han" conversion. As a student of Chinese, this seems surprising to me, as Latin-Han conversion is particularly difficult to do without highly advanced statistical techniques (The closest I have seen is Google Transliterate, which actually does a great job with this even without user input, but this is unfeasible for the present project), much less conversion without tone marks. I am skeptical that this is even possible, without resorting to the de facto foreign-name borrowing characters such as 比尔·莫瑞. This is the approach taken by Google Maps in their international domains, as we can see in this paper (PDF)
Anyhow, I was willing to suspend disbelief, and after consulting documentation and tutorials, I was able to construct two Transliterator objects (to and from) and perform simple transliteration using them.
While Han-Latin worked pretty passably (about 80% accuracy for simple data), Latin-Han seemed not to work at all, returning the same "latin" string that was input, which is consistent with the results I get using the online transform sample, and consistent with what I know about Chinese. I managed to find this table, which I think is what is used for both sources, as we can see here:
{ "Latin-Han", "file", "t_Hani_Latn", "REVERSE" },
{ "Han-Latin", "file", "t_Hani_Latn", "FORWARD" },
I would presume this meant that given a pinyin string it could potentially work to reproduce the original, but this does not seem to be the case.
I guess my general question is this: is this kind of transform even possible with ICU, or anything besides Google Transliterate? What is the expected output? Relatedly, is there a listing somewhere of the script-pairs that ICU actually supports, if this is not really possible?
Thank you for your time
Note that the data is from the CLDR project, http://cldr.unicode.org . The script pairs that ICU supports are many, ICU will attempt to use a pivot script ( such as Han to Latin to Russian ) which is why you can create transliterators such as "Any-Latin". You might try browsing the ICU and CLDR data set. The note at the top of the Han-Latin file says that it does not round trip.

iPhone App Development - mathematical sysmbols

I am quite the beginner - but I have a lot of experience with respect to Electrical Engineering and formulas - over 30 years worth! Trying to construct an app for the iPhone. Loaded the SDK, bought myself a Mac, got a couple of "Chapter" pages written by John at Alpha Aviation - who recomended I try this site.
How do I - well for example I enetered 400/1.732 - it came out like it was a web address with an underline! How can I get mathematical symbols like - sqare root, to the power of, subscript, superscript. I'd like to insert a few "calculator" pages but don't know how to. Appreciate I have only been doing this for the last week or so - so I am still quite rusty compared to what I wrote many years ago. Can anyone help?
cheers
The iPhone supports unicode, which has a fairly good set of mathematical symbols. The easiest way to start is to look in "Special Characters..." in the Edit menu of Xcode or pretty much any Mac application. Find a symbol you want and look for the unicode or utf8 equivalent.
For example, "\u221A" is unicode for the square root symbol. That would be "\xE2\x88\x9A" in UTF8. Either form is fine.
Once you have the string for the symbol, you can use it like any other string. So to make it a button title:
NSString *square_root = #"\xE2\x88\x9A";
[someButton setTitle:square_root forState:UIControlStateNormal];
In the Special Characters window is an insert button you can use to insert characters directly into your source files. This is pretty much fine as long as your source files are UTF8. I tend to use the escape codes from habit.
It is also common practice to put strings like these into a strings files and access them through NSBundle with things like NSLocalizedString. Even if you only have one localization, it lets you keep all the strings in one place.
For more complex mathematical layout, you will probably have to use html in a UIWebView. I think all the standard entities are supported.
You can look at here. Quite enough
A blog about Objective C mathematics

Why isn't everything we do in Unicode?

Given that Unicode has been around for 18 years, why are there still apps that don't have Unicode support? Even my experiences with some operating systems and Unicode have been painful to say the least. As Joel Spolsky pointed out in 2003, it's not that hard. So what's the deal? Why can't we get it together?
Start with a few questions
How often...
do you need to write an application that deals with something else than ascii?
do you need to write a multi-language application?
do you write an application that has to be multi-language from its first version?
have you heard that Unicode is used to represent non-ascii characters?
have you read that Unicode is a charset? That Unicode is an encoding?
do you see people confusing UTF-8 encoded bytestrings and Unicode data?
Do you know the difference between a collation and an encoding?
Where did you first heard of Unicode?
At school? (really?)
at work?
on a trendy blog?
Have you ever, in your young days, experienced moving source files from a system in locale A to a system in locale B, edited a typo on system B, saved the files, b0rking all the non-ascii comments and... ending up wasting a lot of time trying to understand what happened? (did your editor mix things up? the compiler? the system? the... ?)
Did you end up deciding that never again you will comment your code using non-ascii characters?
Have a look at what's being done elsewhere
Python
Did I mention on SO that I love Python? No? Well I love Python.
But until Python3.0, its Unicode support sucked. And there were all those rookie programmers, who at that time knew barely how to write a loop, getting UnicodeDecodeError and UnicodeEncodeError from nowhere when trying to deal with non-ascii characters. Well they basically got life-traumatized by the Unicode monster, and I know a lot of very efficient/experienced Python coders that are still frightened today about the idea of having to deal with Unicode data.
And with Python3, there is a clear separation between Unicode & bytestrings, but... look at how much trouble it is to port an application from Python 2.x to Python 3.x if you previously did not care much about the separation/if you don't really understand what Unicode is.
Databases, PHP
Do you know a popular commercial website that stores its international text as Unicode?
You will (perhaps) be surprised to learn that Wikipedia backend does not store its data using Unicode. All text is encoded in UTF-8 and is stored as binary data in the Database.
One key issue here is how to sort text data if you store it as Unicode codepoints. Here comes the Unicode collations, which define a sorting order on Unicode codepoints. But proper support for collations in Databases is missing/is in active development. (There are probably a lot of performance issues, too. -- IANADBA) Also, there is no widely-accepted standard for collations yet: for some languages, people don't agree on how words/letters/wordgroups should be sorted.
Have you heard of Unicode normalization? (Basically, you should convert your Unicode data to a canonical representation before storing it) Of course it's critical for Database storage, or local comparisons. But PHP for example only provides support for normalization since 5.2.4 which came out in August 2007.
And in fact, PHP does not completely supports Unicode yet. We'll have to wait PHP6 to get Unicode-compatible functions everywhere.
So, why isn't everything we do in Unicode?
Some people don't need Unicode.
Some people don't care.
Some people don't understand that they will need Unicode support later.
Some people don't understand Unicode.
For some others, Unicode is a bit like accessibility for webapps: you start without, and will add support for it later
A lot of popular libraries/languages/applications lack proper, complete Unicode support, not to mention collation & normalization issues. And until all items in your development stack completely support Unicode, you can't write a clean Unicode application.
The Internet clearly helps spreading the Unicode trend. And it's a good thing. Initiatives like Python3 breaking changes help educating people about the issue. But we will have to wait patiently a bit more to see Unicode everywhere and new programmers instinctively using Unicode instead of Strings where it matters.
For the anecdote, because FedEx does not apparently support international addresses, the Google Summer of Code '09 students all got asked by Google to provide an ascii-only name and address for shipping. If you think that most business actors understand stakes behind Unicode support, you are just wrong. FedEx does not understand, and their clients do not really care. Yet.
Many product developers don't consider their apps being used in Asia or other regions where Unicode is a requirement.
Converting existing apps to Unicode is expensive and usually driven by sales opportunities.
Many companies have products maintained on legacy systems and migrating to Unicode means a totally new development platform.
You'd be surprised how many developers don't understand the full implications of Unicode in a multi-language environment. It's not just a case of using wide strings.
Bottom line - cost.
Probably because people are used to ASCII and a lot of programming is done by native English speakers.
IMO, it's a function of collective habit, rather than conscious choice.
The widespread availability of development tools for working with Unicode may be a more recent event than you suppose. Working with Unicode was, until just a few years ago, a painful task of converting between character formats and dealing with incomplete or buggy implementations. You say it's not that hard, and as the tools improve that is becoming more true, but there are a lot of ways to trip up unless the details are hidden from you by good languages and libraries. Hell, just cutting and pasting unicode characters could be a questionable proposition a few years back. Developer education also took some time, and you still see people make a ton of really basic mistakes.
The Unicode standard weighs probably ten pounds. Even just an overview of it would have to discuss the subtle distinctions between characters, glyphs, codepoints, etc. Now think about ASCII. It's 128 characters. I can explain the entire thing to someone that knows binary in about 5 minutes.
I believe that almost all software should be written with full Unicode support these days, but it's been a long road to achieving a truly international character set with encoding to suit a variety of purposes, and it's not over just yet.
Laziness, ignorance.
One huge factor is programming language support, most of which use a character set that fits in 8 bits (like ASCII) as the default for strings. Java's String class uses UTF-16, and there are others that support variants of Unicode, but many languages opt for simplicity. Space is so trivial of a concern these days that coders who cling to "space efficient" strings should be slapped. Most people simply aren't running on embedded devices, and even devices like cell phones (the big computing wave of the near future) can easily handle 16-bit character sets.
Another factor is that many programs are written only to run in English, and the developers (1) don't plan (or even know how) to localize their code for multiple languages, and (2) they often don't even think about handling input in non-Roman languages. English is the dominant natural language spoken by programmers (at least, to communicate with each other) and to a large extent, that has carried over to the software we produce. However, the apathy and/or ignorance certainly can't last forever... Given the fact that the mobile market in Asia completely dwarfs most of the rest of the world, programmers are going to have to deal with Unicode quite soon, whether they like it or not.
For what it's worth, I don't think the complexity of the Unicode standard is not that big of a contributing factor for programmers, but rather for those who must implement language support. When programming in a language where the hard work has already been done, there is even less reason to not use the tools at hand. C'est la vie, old habits die hard.
All operating systems until very recently were built on the assumption that a character was a byte. It's APIs were built like that, the tools were built like that, the languages were built like that.
Yes, it would be much better if everything I wrote was already... err... UTF-8? UTF-16? UTF-7? UTF-32? Err... mmm... It seems that whatever you pick, you'll annoy someone. And, in fact, that's the truth.
If you pick UTF-16, then all of your data, as in, pretty much the western world whole economy, stops being seamlessly read, as you lose the ASCII compatibility. Add to that, a byte ceases to be a character, which seriously break the assumptions upon which today's software is built upon. Furthermore, some countries do not accept UTF-16. Now, if you pick ANY variable-length encoding, you break some basic premises of lots of software, such as not needing to traverse a string to find the nth character, of being able to read a string from any point of it.
And, then UTF-32... well, that's four bytes. What was the average hard drive size or memory size but 10 years ago? UTF-32 was too big!
So, the only solution is to change everything -- software, utilites, operating systems, languages, tools -- at the same time to be i18n-aware. Well. Good luck with "at the same time".
And if we can't do everything at the same time, then we always have to keep an eye out for stuff which hasn't been i18n. Which causes a vicious cycle.
It's easier for end user applications than for middleware or basic software, and some new languages are being built that way. But... we still use Fortran libraries written in the 60s. That legacy, it isn't going away.
Because UTF-16 became popular before UTF-8 and UTF-16 is a pig to work with. IMHO
Because for 99% of applications, Unicode support is not a checkbox on the customer's product comparison matrix.
Add to the equation:
It takes a conscious effort with almost no readily visible benefit.
Many programmers are afraid of it or don't understand it.
Management REALLY doesn't understand it or care about it, at least not until a customer is screaming about it.
The testing team isn't testing for Unicode compliance.
"We didn't localize the UI, so non-English speakers wouldn't be using it anyway."
Tradition and attitude. ASCII and computers are sadly synonyms to many people.
However, it would be naïve to think that the rôle of Unicode, is only a matter of Exotic languages from Eurasia and other parts of the world. A rich text encoding has lots of meaning to bring even to a "plain" English text. Look in a book sometime.
I would say there are mainly two reason. First one is simply that the Unicode support of your tools just isn't up to snuff. C++ still doesn't have Unicode support and won't get it until the next standard revision, which will take maybe a year or two to be finished and then another five or ten years to be in widespread use. Many other languages aren't much better and even if you finally have Unicode support, it might still be a more cumbersome to use then plain ASCII strings.
The second reason is in part what it causing the first issue, Unicode is hard, its not rocket science, but it gives you a ton of problems that you never had to deal with in ASCII. With ASCII you had a clear one byte == one glyph relationships, could address the Nth character of a string by a simple str[N], could just store all characters of the whole set in memory and so on. With Unicode you no longer can do that, you have to deal with different ways Unicode is encoded (UTF-8, UTF-16, ...), byte order marks, decoding errors, lots of fonts that have only a subset of characters which you would need for full Unicode support, more glyphs then you want to store in memory at a given time and so on.
ASCII could be understand by just looking at an ASCII table without any further documentation, with Unicode that is simply no longer the case.
Because of the inertia caused by C++. It had (has) horrible unicode support and dragged back the developers.
I personally do not like how certain formats of unicode break it so that you can no longer do string[3] to get the 3rd character. Sure it could be abstracted out, but imagine how much slower a big project with strings, such as GCC would be if it had to transverse a string to figure out the nth character. The only option is caching where "useful" positions are and even then it's slow, and in some formats your now taking a good 5 bytes per character. To me, that is just ridiculous.
More overhead, space requirements.
I suspect it's because software has such strong roots in the west. UTF-8 is a nice, compact format if you happen to live in America. But it's not so hot if you live in Asia. ;)
Unicode requires more work (thinking), you usually only get paid for what is required so you go with the fastest less complicated option.
Well that's from my point of view. I guess if you expect code to use std::wstring hw(L"hello world") you have to explain how it all works that to print wstring you need wcout : std::wcout << hw << std::endl; (I think), (but endl seems fine ..) ... so seems like more work to me - of course if I was writing international app I would have to invest into figuring it out but until then I don't (as I suspect most developers).
I guess this goes back to money, time is money.
It's simple. Because we only have ASCII characters on our keyboards, why would we ever encounter, or care about characters other than those? It's not so much an attitude as it is what happens when a programmer has never had to think about this issue, or never encountered it, perhaps doesn't even know what unicode is.
edit: Put another way, Unicode is something you have to think about, and thinking is not something most people are interested in doing, even programmers.

Theory: "Lexical Encoding"

I am using the term "Lexical Encoding" for my lack of a better one.
A Word is arguably the fundamental unit of communication as opposed to a Letter. Unicode tries to assign a numeric value to each Letter of all known Alphabets. What is a Letter to one language, is a Glyph to another. Unicode 5.1 assigns more than 100,000 values to these Glyphs currently. Out of the approximately 180,000 Words being used in Modern English, it is said that with a vocabulary of about 2,000 Words, you should be able to converse in general terms. A "Lexical Encoding" would encode each Word not each Letter, and encapsulate them within a Sentence.
// An simplified example of a "Lexical Encoding"
String sentence = "How are you today?";
int[] sentence = { 93, 22, 14, 330, QUERY };
In this example each Token in the String was encoded as an Integer. The Encoding Scheme here simply assigned an int value based on generalised statistical ranking of word usage, and assigned a constant to the question mark.
Ultimately, a Word has both a Spelling & Meaning though. Any "Lexical Encoding" would preserve the meaning and intent of the Sentence as a whole, and not be language specific. An English sentence would be encoded into "...language-neutral atomic elements of meaning ..." which could then be reconstituted into any language with a structured Syntactic Form and Grammatical Structure.
What are other examples of "Lexical Encoding" techniques?
If you were interested in where the word-usage statistics come from :
http://www.wordcount.org
This question impinges on linguistics more than programming, but for languages which are highly synthetic (having words which are comprised of multiple combined morphemes), it can be a highly complex problem to try to "number" all possible words, as opposed to languages like English which are at least somewhat isolating, or languages like Chinese which are highly analytic.
That is, words may not be easily broken down and counted based on their constituent glyphs in some languages.
This Wikipedia article on Isolating languages may be helpful in explaining the problem.
Their are several major problems with this idea. In most languages, the meaning of a word, and the word associated with a meaning change very swiftly.
No sooner would you have a number assigned to a word, before the meaning of the word would change. For instance, the word "gay" used to only mean "happy" or "merry", but it is now used mostly to mean homosexual. Another example is the morpheme "thank you" which originally came from German "danke" which is just one word. Yet another example is "Good bye" which is a shortening of "God bless you".
Another problem is that even if one takes a snapshot of a word at any point of time, the meaning and usage of the word would be under contention, even within the same province. When dictionaries are being written, it is not uncommon for the academics responsible to argue over a single word.
In short, you wouldn't be able to do it with an existing language. You would have to consider inventing a language of your own, for the purpose, or using a fairly static language that has already been invented, such as Interlingua or Esperanto. However, even these would not be perfect for the purpose of defining static morphemes in an ever-standard lexicon.
Even in Chinese, where there is rough mapping of character to meaning, it still would not work. Many characters change their meanings depending on both context, and which characters either precede or postfix them.
The problem is at its worst when you try and translate between languages. There may be one word in English, that can be used in various cases, but cannot be directly used in another language. An example of this is "free". In Spanish, either "libre" meaning "free" as in speech, or "gratis" meaning "free" as in beer can be used (and using the wrong word in place of "free" would look very funny).
There are other words which are even more difficult to place a meaning on, such as the word beautiful in Korean; when calling a girl beautiful, there would be several candidates for substitution; but when calling food beautiful, unless you mean the food is good looking, there are several other candidates which are completely different.
What it comes down to, is although we only use about 200k words in English, our vocabularies are actually larger in some aspects because we assign many different meanings to the same word. The same problems apply to Esperanto and Interlingua, and every other language meaningful for conversation. Human speech is not a well-defined, well oiled-machine. So, although you could create such a lexicon where each "word" had it's own unique meaning, it would be very difficult, and nigh on impossible for machines using current techniques to translate from any human language into your special standardised lexicon.
This is why machine translation still sucks, and will for a long time to come. If you can do better (and I hope you can) then you should probably consider doing it with some sort of scholarship and/or university/government funding, working towards a PHD; or simply make a heap of money, whatever keeps your ship steaming.
It's easy enough to invent one for yourself. Turn each word into a canonical bytestream (say, lower-case decomposed UCS32), then hash it down to an integer. 32 bits would probably be enough, but if not then 64 bits certainly would.
Before you ding for giving you a snarky answer, consider that the purpose of Unicode is simply to assign each glyph a unique identifier. Not to rank or sort or group them, but just to map each one onto a unique identifier that everyone agrees on.
How would the system handle pluralization of nouns or conjugation of verbs? Would these each have their own "Unicode" value?
As a translations scheme, this is probably not going to work without a lot more work. You'd like to think that you can assign a number to each word, then mechanically translate that to another language. In reality, languages have the problem of multiple words that are spelled the same "the wind blew her hair back" versus "wind your watch".
For transmitting text, where you'd presumably have an alphabet per language, it would work fine, although I wonder what you'd gain there as opposed to using a variable-length dictionary, like ZIP uses.
This is an interesting question, but I suspect you are asking it for the wrong reasons. Are you thinking of this 'lexical' Unicode' as something that would allow you to break down sentences into language-neutral atomic elements of meaning and then be able to reconstitute them in some other concrete language? As a means to achieve a universal translator, perhaps?
Even if you can encode and store, say, an English sentence using a 'lexical unicode', you can not expect to read it and magically render it in, say, Chinese keeping the meaning intact.
Your analogy to Unicode, however, is very useful.
Bear in mind that Unicode, whilst a 'universal' code, does not embody the pronunciation, meaning or usage of the character in question. Each code point refers to a specific glyph in a specific language (or rather the script used by a group of languages). It is elemental at the visual representation level of a glyph (within the bounds of style, formatting and fonts). The Unicode code point for the Latin letter 'A' is just that. It is the Latin letter 'A'. It cannot automagically be rendered as, say, the Arabic letter Alif (ﺍ) or the Indic (Devnagari) letter 'A' (अ).
Keeping to the Unicode analogy, your Lexical Unicode would have code points for each word (word form) in each language. Unicode has ranges of code points for a specific script. Your lexical Unicode would have to a range of codes for each language. Different words in different languages, even if they have the same meaning (synonyms), would have to have different code points. The same word having different meanings, or different pronunciations (homonyms), would have to have different code points.
In Unicode, for some languages (but not all) where the same character has a different shape depending on it's position in the word - e.g. in Hebrew and Arabic, the shape of a glyph changes at the end of the word - then it has a different code point. Likewise in your Lexical Unicode, if a word has a different form depending on its position in the sentence, it may warrant its own code point.
Perhaps the easiest way to come up with code points for the English Language would be to base your system on, say, a particular edition of the Oxford English Dictionary and assign a unique code to each word sequentially. You will have to use a different code for each different meaning of the same word, and you will have to use a different code for different forms - e.g. if the same word can be used as a noun and as a verb, then you will need two codes
Then you will have to do the same for each other language you want to include - using the most authoritative dictionary for that language.
Chances are that this excercise is all more effort than it is worth. If you decide to include all the world's living languages, plus some historic dead ones and some fictional ones - as Unicode does - you will end up with a code space that is so large that your code would have to be extremely wide to accommodate it. You will not gain anything in terms of compression - it is likely that a sentence represented as a String in the original language would take up less space than the same sentence represented as code.
P.S. for those who are saying this is an impossible task because word meanings change, I do not see that as a problem. To use the Unicode analogy, the usage of letters has changed (admittedly not as rapidly as the meaning of words), but it is not of any concern to Unicode that 'th' used to be pronounced like 'y' in the Middle ages. Unicode has a code point for 't', 'h' and 'y' and they each serve their purpose.
P.P.S. Actually, it is of some concern to Unicode that 'oe' is also 'œ' or that 'ss' can be written 'ß' in German
This is an interesting little exercise, but I would urge you to consider it nothing more than an introduction to the concept of the difference in natural language between types and tokens.
A type is a single instance of a word which represents all instances. A token is a single count for each instance of the word. Let me explain this with the following example:
"John went to the bread store. He bought the bread."
Here are some frequency counts for this example, with the counts meaning the number of tokens:
John: 1
went: 1
to: 1
the: 2
store: 1
he: 1
bought: 1
bread: 2
Note that "the" is counted twice--there are two tokens of "the". However, note that while there are ten words, there are only eight of these word-to-frequency pairs. Words being broken down to types and paired with their token count.
Types and tokens are useful in statistical NLP. "Lexical encoding" on the other hand, I would watch out for. This is a segue into much more old-fashioned approaches to NLP, with preprogramming and rationalism abound. I don't even know about any statistical MT that actually assigns a specific "address" to a word. There are too many relationships between words, for one thing, to build any kind of well thought out numerical ontology, and if we're just throwing numbers at words to categorize them, we should be thinking about things like memory management and allocation for speed.
I would suggest checking out NLTK, the Natural Language Toolkit, written in Python, for a more extensive introduction to NLP and its practical uses.
Actually you only need about 600 words for a half decent vocabulary.