L-Systems: Order of Substitution - fractals

I posted this over at the Mathematics Stack Exchange, but as it's programming related I wondered if somebody could help out here.
I am working through a subject guide involving L-Systems and have the alphabet A = {a, b, c}. The initiator is the string "a" and the rules of substitution are a → ba, b → ccb, c → a.
The study guide gives the first five generations as:
[a] → [ba] → [ccba] → [acba] → [aaba] → [aaccba]
I can't for the life of me figure out how this works. No rules regarding the order of substitution are provided, and my lecturer says that it is possible to get to this.
Does anybody have any ideas?

In your example, it looks like they're only doing one substitution each step, with later rules taking priority over earlier ones. This might be a variation on the classic L-system, but I've never seen it done that way. The rules in an L-system are supposed to be applied to all symbols in parallel each generation. The correct expansion for those rules (spaces added to show which symbol of the previous generation each group came from) would be
a
ba
ccb ba
a a ccb ccb ba
ba ba a a ccb a a ccb ccb ba
ccb ba ccb ba ba ba a a ccb ba ba a a ccb a a ccb ccb ba
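For reference, here is a small Python sketch of that standard parallel rewriting (every symbol is substituted simultaneously each generation); running it reproduces the expansion above, minus the grouping spaces:

rules = {"a": "ba", "b": "ccb", "c": "a"}

def step(s):
    # rewrite every symbol of the current generation at once
    return "".join(rules.get(ch, ch) for ch in s)

s = "a"
for _ in range(6):
    print(s)
    s = step(s)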

Related

How to Read MessagePack Response on Charles or Mitmproxy?

Is there a way to convert MessagePack response on Charles/MITM to JSON?
I have a response from https://[...]/application/x-msgpack
Raw mode reads it:
xdd\xb8\xbd\xc9? \x03\xa2Xe\xccO\xc2O\xd6"\x06\x91\xcfB\x9c\xed\x0fl
Hex mode reads it:
0000000010 06 91 cf 42 9c ed 0f 6c 2e 14 ae f1 da 2d 34 e9 ...B...l.....-4.
Hex mode shows what looks like nonsense at the end of every line.
Other modes cannot parse it and fall back to Raw mode.
If I've failed to give you any info you need, let me know.
If you can decode it in Python (e.g. using Kaitai Struct), you can implement a mitmproxy contentview to display it in a human-readable way:
https://github.com/mitmproxy/mitmproxy/blob/master/examples/simple/custom_contentview.py
https://github.com/mitmproxy/mitmproxy/tree/master/mitmproxy/contentviews
(if you go the kaitai struct route - we'd also be more than happy for a PR!)
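As a starting point (not mitmproxy-specific), here is a minimal Python sketch for decoding the raw body, assuming the msgpack package is installed, the body is a single MessagePack document, and the raw bytes have been saved to a file named response.bin (a name chosen just for this example). The decoded object could then be pretty-printed from a contentview like the linked examples:

import msgpack

# Assumption: response.bin holds the raw application/x-msgpack body
# exported from Charles or mitmproxy.
with open("response.bin", "rb") as f:
    data = f.read()

obj = msgpack.unpackb(data, raw=False)  # decode byte strings as UTF-8 text
print(obj)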

Why is TeraTerm not putting out the same bytes that are in the send call?

Using TeraTerm and a Serial port adapter I ran a Macro with this line on it:
send $55 $0B $00 $00 $00 $BB $42 $AA
The $BB was sent out as two different bytes instead of just one. I forget which ones they were specifically but the result on the O-scope looked like this:
55 0B 00 00 00 C8 E9 42 AA
Does anyone know why this is?
I looked in the Manual and verified that Send8Ctrl is set to off and so is the Debug option.
I did the research, tested, and verified that the answer to the question is that TeraTerm was using UTF-8 instead of English under the general settings menu. There are two options present that may be confusing: 'English' and 'Default'.
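For what it's worth, one plausible mechanism (an assumption on my part, since the exact bytes on the scope depend on how TeraTerm maps the byte through its character coding): once the terminal treats $BB as a character rather than a raw byte, re-encoding it as UTF-8 turns one byte into two. A Python illustration:

# Illustration only: a byte >= 0x80 interpreted as a character (here via
# Latin-1) and re-encoded as UTF-8 becomes a two-byte sequence.
b = 0xBB
print(bytes([b]))              # b'\xbb'     (what the macro asked for)
print(chr(b).encode("utf-8"))  # b'\xc2\xbb' (two bytes on the wire)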

Convert Unicode code point to UTF-8 sequence

I am not sure I've got my nomenclature right, so please correct me :)
I've received a text file representing a Pāli dictionary: a list of words separated by newline \n (0x0a) characters. Supposedly, some of the special letters are encoded using UTF-8, but I doubt that.
Loading this text file into any of my editors (vim, Notepad, TextEdit, ..) shows quite scrambled text, for example
mhiti
A closer look at the actual bytes then reveals the following (using hexdump -C)
0a 0a 1e 6d 68 69 74 69 0a 0a ...mhiti..
which seems to me to be the Unicode code point U+1E6D ("ṭ" or LATIN SMALL LETTER T WITH DOT BELOW). That particular letter has the UTF-8 encoding e1 b9 ad.
My question: is there a tool which helps me convert this particular file into actual UTF-8 encoding? I tried iconv but without success; I looked briefly into writing a Python script but would think there's an easier way to get this done. It seems that this is a useful link for this problem, but isn't there a tool that can get this done? Am I missing something?
EDIT: Just to make things a little bit more entertaining, there seem to be actual UTF-8 encoded characters scattered throughout as well. For example, the word "ākiñcaññāyatana" has the following sequence of bytes
01 01 6b 69 c3 b1 63 61 c3 b1 c3 b1 01 01 79 61 74 61 6e 61
ā k i ñ c a ñ ñ ā y a t a n a
where the "ā" is encoded by its Unicode code point U-0101, and the "ñ" is encoded by the UTF-8 sequence \xc3b1 which has Unicode code point U-00F1.
EDIT: Here's one where I can't quite figure out what it's supposed to be:
01 1e 37 01 01 76 61 6b 61
? ā v a k a
I can only guess, but that too doesn't make sense. The Unicode code point U+011E is a "Ğ" (UTF-8 \xc4\x9e), but that's not a Pāli character AFAIK; then a "7" follows, which doesn't make sense in a word. Then the Unicode code point U+1E37 is a "ḷ" (UTF-8 \xe1\xb8\xb7), which is a valid Pāli character. But that would leave the first byte \x01 by itself. If I had to guess I would think this is the name "Jīvaka", but that doesn't match the bytes. LATER: According to the author, this is "Āḷāvaka", so assuming the character-encoding heuristics from above, a \x00 is again missing. Adding it back in:
01 00 1e 37 01 01 76 61 6b 61
Ā ḷ ā v a k a
Are there "compressions" that remove \x00 bytes from UTF-16 encoded Unicode files?
I'm assuming in this context that "ṭhiti" makes sense as the contents of that file.
From your description, it looks like that file encodes characters < U+0080 as a single byte and characters > U+0100 as two-byte big-endian. That's not decodable, in general; two linefeeds (U+000A, U+000A) would have the same encoding as GURMUKHI LETTER UU (U+0A0A).
There's no invocation of iconv that'll decode it for you; you'll need to use the heuristics you know, based on character ranges or ordering in the file, to write a custom decoder (or ask for another copy in a standard encoding).
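If you do end up writing that custom decoder, here is a rough Python sketch of one possible heuristic (an assumption based on the samples above: line feeds and printable ASCII stay as single bytes, everything else is read as a two-byte big-endian code point). It does not handle the stray genuine UTF-8 sequences such as \xc3\xb1, which would need extra special-casing:

def decode_mixed(data):
    # Heuristic only: 0x0a and printable ASCII pass through unchanged,
    # anything else is taken as a two-byte big-endian Unicode code point.
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x0A or 0x20 <= b < 0x80:
            out.append(chr(b))
            i += 1
        else:
            out.append(chr((b << 8) | data[i + 1]))
            i += 2
    return "".join(out)

print(repr(decode_mixed(bytes.fromhex("0a0a1e6d686974690a0a"))))  # '\n\nṭhiti\n\n'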
I think in the end this was my own fault, somehow. Browsing to this file showed a very mangled and broken version of the original UTF-16 encoded file; the "Save as" menu from the browser then saved that broken file which created the initial question for this thread.
It seems that a web browser tries to display that UTF-16 encoded file, removes non-printable characters like \x00 and converts some others to UTF-8, thus completely mangling the original file.
Using wget to fetch the file fixed the problem, and I could convert it nicely into UTF-8 and use it further.

How to define/declare utf-8 code points for Turkish special chars (non-ascii) to use them as standard utf-8 encoding?

Turkish chars 'ÇçĞğİıÖöŞşÜü' are not handled correctly in UTF-8 encoding although they all seem to be defined. In usage, the charcode of all of them is 65533 (the replacement character, possibly used for error display), and a question mark or box is displayed depending on the selected font. In some cases 0/null is returned as the charcode. On the internet there are lots of tools which give UTF-8 definitions of them, but I am not sure if the tools use any defined (real/international) registry or dynamically create the definition with known rules and calculations. Fonts for them are well defined and there is no problem displaying them when we enter the code points manually. This proves that they are defined in UTF-8. But on the other hand they are not handled in encodings or transformations such as AJAX requests/responses.
So the base question is "HOW CAN WE DEFINE A CODEPOINT FOR A CHAR"?
The question may be detailed as follows to prevent misconception. Suppose we have prepared the encoding data for "Ç" like this:
Character : Ç
Character name : LATIN CAPITAL LETTER C WITH CEDILLA
Hex code point : 00C7
Decimal code point : 199
Hex UTF-8 bytes : C387
......
Where/How can we save this info to be a standard utf-8 char?
How can we distribute/expose it (make ready to be used by others) ?
Do we need any confirmation by any body/foundation (like the Unicode Consortium)?
How can we detect/fixup errors if they are already registered but not working correctly?
Can we have custom-utf8 configuration? If yes how?
Note: No code snippet is needed here as it is not a misusage problem.
The characters you mention are present in Unicode. Here are their character codes in hexadecimal and how they are encoded in UTF-8:
Ç ç Ğ ğ İ ı Ö ö Ş ş Ü ü
Code: 00c7 00e7 011e 011f 0130 0131 00d6 00f6 015e 015f 00dc 00fc
UTF8: c3 87 c3 a7 c4 9e c4 9f c4 b0 c4 b1 c3 96 c3 b6 c5 9e c5 9f c3 9c c3 bc
This means that if you write for example the bytes 0xc4 0x9e into a file you have written the character Ğ, and any software tool that understands UTF-8 must read it back as Ğ.
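If you want to double-check that table yourself, a quick sketch in Python (the same values should come out of any language's Unicode support):

# Print the code point and UTF-8 bytes of each Turkish character.
for ch in "ÇçĞğİıÖöŞşÜü":
    print(ch, "U+%04X" % ord(ch), ch.encode("utf-8").hex())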
Update: For correct alphabetic order and case conversions in Turkish you have to use a library that understands locales, just like for any other natural language. For example in Java:
Locale tr = new Locale("tr", "TR"); // Turkish locale
System.out.println("ÇçĞğİıÖöŞşÜü".toUpperCase(tr)); // ÇÇĞĞİIÖÖŞŞÜÜ
System.out.println("ÇçĞğİıÖöŞşÜü".toLowerCase(tr)); // ççğğiıööşşüü
Notice how i in uppercase becomes İ, and I in lowercase becomes ı. You don't say which programming language you use but surely its standard library supports locales, too.
Unicode defines the code points and certain properties for each character (for example, whether it's a digit or a letter, and for a letter whether it's uppercase, lowercase, or titlecase), and certain generic algorithms for dealing with Unicode text (e.g. how to mix right-to-left text and left-to-right text). Alphabetic order and correct case conversion are defined by national standardization bodies, like the Institute for the Languages of Finland or the Real Academia Española in Spain, independently of Unicode.
Update 2:
The test ((ch&0x20)==ch) for lower case is broken for most languages in the world, not just Turkish. So is the algorithm for converting upper case to lower case you mention. Also, the test for being a letter is incorrect: in many languages Z is not the last letter of the alphabet. To work with text correctly you must use library functions that have been written by people who know what they are doing.
Unicode is supposed to be universal. Creating national and language-specific variants of encodings is what led us to the mess that Unicode is trying to solve. Unfortunately there is no universal standard for ordering characters. For example in English a = ä < z, but in Swedish a < z < ä. In German Ü is equivalent to U by one standard, and to UE by another. In Finnish Ü = Y. There is no way to order code points so that the ordering would be correct in every language.
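To see the ordering differences in practice, here is a small Python sketch using the C library's collation via the locale module (assuming the named locales are installed on the system; available locale names vary by OS):

import locale

# Compare collation of the same words under an English and a Swedish locale.
words = ["ar", "är", "zo"]
for loc in ("en_US.UTF-8", "sv_SE.UTF-8"):
    locale.setlocale(locale.LC_COLLATE, loc)
    print(loc, sorted(words, key=locale.strxfrm))
# English sorts "är" next to "ar"; Swedish sorts it after "zo".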

Postgres upper function on Turkish character does not return expected result

It looks like the postgres upper/lower functions do not handle certain characters in the Turkish character set.
select upper('Aaı'), lower('Aaİ') from mytable;
returns :
AAı, aaİ
instead of :
AAI, aai
Note that normal English characters are converted correctly, but not the Turkish I (lower or upper).
Postgres version: 9.2 32 bit
Database encoding (Same result in any of these): UTF-8, WIN1254, C
Client encoding:
UTF-8, WIN1254, C
OS: Windows 7 enterprise edition 64bit
The SQL functions lower and upper return the following (unchanged) bytes for ı and İ on a UTF-8 encoded database:
\xc4b1
\xc4b0
And the following on WIN1254 (Turkish) encoded database
\xfd
\xdd
I hope my investigation is wrong, and there is something I missed.
Your problem is 100% Windows. (Or rather Microsoft Visual Studio, which PostgreSQL was built with, to be more precise.)
For the record, SQL UPPER ends up calling Windows' LCMapStringW (via towupper via str_toupper) with almost all the right parameters (locale 1055 Turkish for a UTF-8-encoded, Turkish_Turkey database),
but
the Visual Studio Runtime (towupper) does not set the LCMAP_LINGUISTIC_CASING bit in LCMapStringW's dwMapFlags. (I can confirm that setting it does the trick.) This is not considered a bug at Microsoft; it is by design, and will probably not ever be "fixed" (oh the joys of legacy.)
You have three ways out of this:
implement @Sorrow's wrapper solution (or write your own native function replacement (DLL).)
run your PostgreSQL instance on e.g. Ubuntu, which exhibits the right behaviour for Turkic locales (@Sorrow confirmed that it works for him); this is probably the simplest and cleanest way out.
drop in a patched 32-bit MSVCR100.DLL in your PostgreSQL bin directory (but although UPPER and LOWER would work, other things such as collation may continue to fail -- again, at the Windows level. YMMV.)
For completeness (and nostalgic fun) ONLY, here is the procedure to patch a Windows system (but remember, unless you'll be managing this PostgreSQL instance from cradle to grave you may cause a lot of grief to your successor(s); whenever deploying a new test or backup system from scratch you or your successor(s) would have to remember to apply the patch again -- and if let's say you one day upgrade to PostgreSQL 10, which say uses MSVCR120.DLL instead of MSVCR100.DLL, then you'll have to try your luck with patching the new DLL, too.) On a test system
use HxD to open C:\WINDOWS\SYSTEM32\MSVCR100.DLL
save the DLL right away with the same name under your PostgreSQL bin directory (do not attempt to copy the file using Explorer or the command line, they might copy the 64-bit version)
with the file still open in HxD, go to Search > Replace, pick Datatype: Hexvalues, then
search for...... 4E 14 33 DB 3B CB 0F 84 41 12 00 00 B8 00 01 00 00
replace with... 4E 14 33 DB 3B CB 0F 84 41 12 00 00 B8 00 01 00 01
...then once more...
search for...... FC 51 6A 01 8D 4D 08 51 68 00 02 00 00 50 E8 E2
replace with... FC 51 6A 01 8D 4D 08 51 68 00 02 00 01 50 E8 E2
...and re-save under the PostgreSQL bin directory, then restart PostgreSQL and re-run your query.
if your query still does not work (make sure your database is UTF-8 encoded with Turkish_Turkey for both LC_CTYPE and LC_COLLATE) open postgres.exe in 32-bit Dependency Walker and make sure it indicates it loads MSVCR100.DLL from the PostgreSQL bin directory.
if all functions well, copy the patched DLL to the production PostgreSQL bin directory and restart.
BUT REMEMBER, the moment you move the data off the Ubuntu system or off the patched Windows system to an unpatched Windows system you will have the problem again, and you may be unable to import this data back on Ubuntu if the Windows instance introduced duplicates in a citext field or in a UPPER/LOWER-based function index.
It seems to me that your problem is related to Windows. This is how it looks on Ubuntu (Postgres 8.4.14), database encoding UTF-8:
test=# select upper('Aaı'), lower('Aaİ');
upper | lower
-------+-------
AAI | aai
(1 row)
My recommendation would be - if you have to use Windows - to write a stored procedure that will do the conversion for you. Use the built-in replace: replace('abcdefabcdef', 'cd', 'XX') returns abXXefabXXef. There might be a better solution; I do not claim that this approach is the correct one.
This is indeed a bug in PostgreSQL (still not fixed, even in the current git tree).
Proof: https://github.com/postgres/postgres/blob/master/src/port/pgstrcasecmp.c
PostgreSQL developers even mention specifically those Turkish characters there:
SQL99 specifies Unicode-aware case normalization, which we don't yet
have the infrastructure for. Instead we use tolower() to provide a
locale-aware translation.
However, there are some locales where this is not right either (eg, Turkish may do strange things with 'i' and 'I').
Our current compromise is to use tolower() for characters with
the high bit set, and use an ASCII-only downcasing for 7-bit
characters.
pg_toupper() implemented in this file is extremely simplistic (as is its companion pg_tolower()):
unsigned char
pg_toupper(unsigned char ch)
{
    if (ch >= 'a' && ch <= 'z')
        ch += 'A' - 'a';
    else if (IS_HIGHBIT_SET(ch) && islower(ch))
        ch = toupper(ch);
    return ch;
}
As you can see, this code does not treat its parameter as a Unicode code point, and cannot possibly work 100% correctly unless the currently selected locale happens to be the one we care about (like a Turkish non-Unicode locale) and the OS-provided non-Unicode toupper() works correctly.
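To see why such a byte-at-a-time approach cannot case-map multi-byte UTF-8 characters, here is a rough Python mimic of the ASCII path above (ignoring the locale toupper() branch, which cannot correctly case a multi-byte sequence one byte at a time either):

# Rough mimic of pg_toupper()'s ASCII path applied to each byte of a UTF-8
# string; bytes with the high bit set are left unchanged in this sketch.
def pg_toupper_ascii(b):
    if ord('a') <= b <= ord('z'):
        return b - (ord('a') - ord('A'))
    return b

s = "Aaı".encode("utf-8")  # b'Aa\xc4\xb1'
print(bytes(pg_toupper_ascii(b) for b in s).decode("utf-8"))  # AAı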
This is really sad, I just hope that this will be solved in upcoming PostgreSQL releases...
The source of the problem is explained above. It seems the problem only occurs with the conversion of 'I' to 'ı' and 'i' to 'İ'. As a workaround, just replace those characters directly as below before calling the lower or upper functions:
SELECT lower(replace('IİĞ', 'I', 'ı')) -> ıiğ
SELECT upper(replace('ıiğ', 'i', 'İ')) -> IİĞ