Why does PostgreSQL store data in hex in its own format? - postgresql

I can't understand the reason why PostgreSQL stores data in its own format. The documentation says:
The "hex" format encodes binary data as 2 hexadecimal digits per byte, most significant nibble first. The entire string is preceded by the sequence \x (to distinguish it from the escape format).
Does that mean it is not plain hex, so it would not be possible to simply convert this hex to bytes, and I would have to write a parser for PostgreSQL's hex format?

The client driver usually takes care of bytea conversion for you, supplying you a native language data type like byte[] for Java. The representation of bytea on the wire shouldn't generally concern you. The only time it'll really matter is if you're using bytea literals in SQL text, rather than sending them as bind parameters.
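For instance, with psycopg2 as the Python client driver, a minimal sketch could look like the following; the connection string, the files table and the payload column are made up for illustration. The point is that you only ever see Python bytes, never the \x wire text.

import psycopg2

conn = psycopg2.connect("dbname=test")   # hypothetical connection string
with conn, conn.cursor() as cur:
    # the driver sends the bytes as a bind parameter, whatever the wire format is
    cur.execute("INSERT INTO files (payload) VALUES (%s)", (b"somestring",))
    cur.execute("SELECT payload FROM files LIMIT 1")
    value = cur.fetchone()[0]             # comes back as a memoryview in Python 3
    print(bytes(value))                   # b'somestring'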
Anyway, it is normal hex, it just has a \x prefix. So it's utterly trivial to "parse" if you do need to do so manually. E.g. in Python
r'\x736f6d65737472696e67'[2:].decode("hex")
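(That snippet is Python 2; on Python 3, where str.decode("hex") no longer exists, the standard-library equivalent is bytes.fromhex.)

raw = r'\x736f6d65737472696e67'
print(bytes.fromhex(raw[2:]))   # b'somestring'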
The reason for the \x prefix is largely historical. PostgreSQL used to use an octal escape format for bytea data. When the format was changed to hex - to make it easier for clients to consume and work with and make it a bit more compact - it was necessary for the client to be able to tell what format the data was in. Since \x can never appear in octal ("escape") format literals, any string beginning with \x must be a hex bytea literal. This is even more important when receiving data from a client, which might be sending either hex or escape style literals, and the server must be able to tell which is which.
We could've just required that all clients use the format specified by the server. But that would break compatibility for all old clients that use bytea. Personally I think that's exactly what we should've done, and required that people using old clients set bytea_format = escape or something. That's not what happened, though. The setting bytea_output controls the format the server sends, but it still understands both formats as input. That makes interoperating with old clients and scripts easier. In theory.
In practice lots of old clients blindly interpreted hex literals sent by the server as if they were escape-format even though they were invalid; they'd ignore the backslash or treat it as a literal backslash. So they'd tend to corrupt bytea data when loading it then saving it again. Exactly what we wanted to avoid.


How to store text internally in a Go program?

Software should only work with Unicode strings internally, converting to a particular encoding on output.
-- Python Docs
The above quote is from the Python docs. Python has a unicode string type, so this makes sense. Go doesn't have unicode strings; its strings are just immutable byte slices. What would be the equivalent quote for Go?
Would it be to convert text to utf-8 on entry to the program and store as utf-8 internally, and then output utf-8?
Generally speaking, in Go you will be writing a []byte, for example when using the ioutil package's WriteFile function; https://golang.org/pkg/io/ioutil/#WriteFile
So yes, the answer is that you explicitly declare the encoding. Since the string is just a byte slice, there is no inherent encoding, however string literals in Go source will be UTF-8. If you haven't already read this Go blog post by Robert Pike on strings, bytes and runes, it's worth the time; https://blog.golang.org/strings

Parsing COPY...WITH BINARY results

I'm using this:
COPY( select field1, field2, field3 from table ) TO 'C://Program Files/PostgreSql//8.4//data//output.dat' WITH BINARY
To export some fields to a file, one of them is a ByteA field. Now, I need to read the file with a custom made program.
How can I parse this file?
The general format of a file generated by COPY...BINARY is explained in the documentation, and it's non-trivial.
bytea contents are the easiest to deal with, since they're not encoded: the raw bytes are written as-is.
Every other datatype has its own encoding rules, which are described not in the documentation but in the source code. From the doc:
To determine the appropriate binary format for the actual tuple data you should consult the PostgreSQL source, in particular the *send and *recv functions for each column's data type (typically these functions are found in the src/backend/utils/adt/ directory of the source distribution).
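For a rough idea of what that parsing involves, here is a Python sketch that walks the layout the documentation describes: an 11-byte signature, a 32-bit flags word, a 32-bit header-extension length, then one 16-bit field count per tuple followed by a 32-bit length (-1 for NULL) and that many raw bytes per field. Decoding each field's contents per type is still up to you; bytea fields are usable as-is.

import struct

def copy_binary_rows(path):
    # Sketch only: yields each tuple as a list of raw byte strings (None for NULL).
    with open(path, "rb") as f:
        if f.read(11) != b"PGCOPY\n\xff\r\n\x00":
            raise ValueError("not a COPY ... WITH BINARY file")
        flags, ext_len = struct.unpack("!II", f.read(8))
        f.read(ext_len)                              # skip header extension area
        while True:
            (nfields,) = struct.unpack("!h", f.read(2))
            if nfields == -1:                        # file trailer
                break
            row = []
            for _ in range(nfields):
                (length,) = struct.unpack("!i", f.read(4))
                row.append(None if length == -1 else f.read(length))
            yield row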
It might be easier to use the text format rather than binary (so just remove the WITH BINARY). The text format has better documentation and is designed for better interoperability. The binary format is more intended for moving between postgres installations, and even there they have version incompatibilities.
Text format will write the bytea field as if it were text, and encode any non-printable characters with a \nnn octal representation (except for a few special cases that it encodes with C-style backslash escapes, such as \n and \t). These are listed in the COPY documentation.
The only caveat is that you need to be absolutely sure the character encoding used when saving the file is the same as when reading it, so that the printable characters map to the same numbers. I'd stick to SQL_ASCII, as it keeps things simpler.
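As an illustration of the text-format route, here is a rough Python sketch that undoes those backslash escapes for a bytea column; the function name is made up, and it assumes bytea_output = escape and ASCII surrounding data.

def unescape_copy_text(field):
    # Handles doubled backslashes, C-style escapes like \t and \n, and \nnn octal sequences.
    simple = {'b': 8, 't': 9, 'n': 10, 'v': 11, 'f': 12, 'r': 13}
    out = bytearray()
    i = 0
    while i < len(field):
        if field[i] != '\\':
            out.append(ord(field[i]))                # plain (ASCII) character
            i += 1
        elif field[i + 1] == '\\':
            out.append(92)                           # literal backslash
            i += 2
        elif field[i + 1] in simple:
            out.append(simple[field[i + 1]])
            i += 2
        else:
            out.append(int(field[i + 1:i + 4], 8))   # \nnn octal escape
            i += 4
    return bytes(out)

print(unescape_copy_text(r'hi\011there\134'))   # b'hi\tthere\\'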

Erlang, io_lib and unicode

I'm having a little trouble getting erlang to give me a unicode string.
Here's what works:
io:format("~ts~n", [<<226,132,162>>]).
™
ok
But instead of printing to the console, I want to assign it to a variable. So I thought:
T = lists:flatten(io_lib:format("~ts~n", [<<226,132,162>>])).
T.
[8482,10]
How can I get T in the io_lib example to contain the ™ symbol so I can write it to a network stream?
Instead of assigning the flattened version to a variable for sending on the network, can you instead re-write your code that sends over the network to accept the binary in the first place and use the formatted write mechanism ~ts when sending over the socket?
That would also let you avoid the lists:flatten, which isn't needed for the built-in IO mechanisms.
It does contain the trademark symbol: as you can see here, 8482 is its code. It isn't printed as ™ in the shell because the shell prints as strings only those lists that contain printable character codes in Latin-1. So [8482, 10] is a Unicode string (in UTF-32 encoding). If you want to convert it to a different encoding, use the unicode module.
First thing is knowing what you need to do. Then you can adapt your code the best way you find.
Erlang represents unicode strings as lists of codepoints. Unicode codepoints are integers, not bytes. Since you can only send bytes over the network, things like unicode strings need to be encoded into byte sequences by the sending side and decoded by the receiving side. UTF-8 is the most widely used encoding for unicode strings, and that's what your binary is: the UTF-8 encoding of the unicode string composed of the codepoint 8482.
What you get out of the io_lib:format call is the Erlang string representation of that codepoint plus the newline character.
A very reasonable way to send unicode strings over the network is to encode them in UTF-8. Don't use io_lib:format for that, though; unicode:characters_to_binary/1 is the function meant to transform unicode strings into UTF-8 encoded binaries.
On the receiving side (and probably even better in your whole application) you'll have to decide how you will handle the strings, either as encoded binaries (or lists) or as plain unicode lists. But over the network the only choice is binaries (or iolists, which are possibly deep lists of bytes), and I'll bet the most reasonable encoding for your application will be UTF-8.
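Purely as a cross-language comparison (Python rather than Erlang), the same idea of turning a list of codepoints into UTF-8 bytes before putting it on the wire looks like this:

codepoints = [8482, 10]                       # what io_lib:format produced
payload = "".join(chr(cp) for cp in codepoints).encode("utf-8")
print(payload)                                # b'\xe2\x84\xa2\n'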

Command-line arguments as bytes instead of strings in python3

I'm writing a python3 program that gets the names of files to process from command-line arguments. I'm confused about the proper way to handle different encodings.
I think I'd rather consider filenames as bytes and not strings, since that avoids the danger of using an incorrect encoding. Indeed, some of my file names use an incorrect encoding (latin1 when my system locale uses utf-8), but that doesn't prevent tools like ls from working. I'd like my tool to be resilient to that as well.
I have two problems: the command-line arguments are given to me as strings (I use argparse), and I want to report errors to the user as strings.
I've successfully adapted my code to use binaries, and my tool can handle files whose names are invalid in the current default encoding, as long as it finds them by recursing through the filesystem, because I convert the arguments to binaries early and use binaries when calling fs functions. When I receive a filename argument which is invalid, however, it is handed to me as a unicode string with strange characters like \udce8. I do not know what these are, and trying to encode the string always fails, whether with utf8 or with the corresponding (wrong) encoding (latin1 here).
The other problem is for reporting errors. I expect users of my tool to parse my stdout (hence wanting to preserve filenames), but when reporting errors on stderr I'd rather encode it in utf-8, replacing invalid sequences with appropriate "invalid/question mark" characters.
So,
1) Is there a better, completely different way to do it? (Yes, fixing the filenames is planned, but I'd still like my tool to be robust.)
2) How do I get the command line arguments in their original binary form (not pre-decoded for me), knowing that for invalid sequences re-encoding the decoded argument will fail, and
3) How do I tell the utf-8 codec to replace invalid, undecodable sequences with some invalid mark rather than dying on me ?
When I receive a filename argument which is invalid, however, it is handed to me as a unicode string with strange characters like \udce8.
Those are surrogate characters. The low 8 bits are the original invalid byte.
See PEP 383: Non-decodable Bytes in System Character Interfaces.
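A minimal sketch of both directions on a POSIX system (the message text is made up): os.fsencode recovers the original bytes from a surrogate-escaped argument, and errors="replace" gives a printable report instead of raising.

import os
import sys

for arg in sys.argv[1:]:
    raw = os.fsencode(arg)                         # original bytes, surrogates undone
    # ... use the bytes form for filesystem work: open(raw), os.stat(raw), ...
    printable = raw.decode("utf-8", errors="replace")
    sys.stderr.write("processing %s\n" % printable)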
Don't go against the grain: filenames are strings, not bytes.
You shouldn't use bytes when you should use a string. A bytes object is an immutable sequence of integers; a string is an immutable sequence of characters. They are different concepts. What you're doing is like using an integer when you should use a boolean.
(Aside: Python stores all strings in-memory under Unicode; all strings are stored the same way. Encoding specifies how Python converts the on-file bytes into this in-memory format.)
Your operating system stores filenames as strings under a specific encoding. I'm surprised you say that some filenames have different encodings; as far as I know, the filename encoding is system-wide. Functions like open default to the default system filename encoding, for example.

Does Lua support Unicode?

Based on the link below, I'm confused as to whether the Lua programming language supports Unicode.
http://lua-users.org/wiki/LuaUnicode
It appears it does, but has limitations. I simply don't understand: are the limitations anything big/key, or not a big deal?
You can certainly store unicode strings in Lua, as UTF-8. You can use these as you would any string.
However Lua doesn't provide any default support for higher-level "unicode aware" operations on such strings—e.g., counting string length in characters, converting lower-to-upper-case, etc. Whether this lack is meaningful for you really depends on what you intend to do with these strings.
Possible approaches, depending on your use:
If you just want to input/output/store strings, and generally use them as "whole units" (for table indexing etc), you may not need any special handling at all. In this case, you just treat these strings as binary blobs.
Due to utf8's clever design, some types of string manipulation can be done on strings containing utf8 and will yield the correct result without taking any special care.
For instance, you can append strings, split them apart before/after ascii characters, etc. As an example, if you have a string "開発.txt" and you search for "." in that string using string.find (string_var, "."), and then split it using the normal string.sub function into "開発" and ".txt", those result strings will be correct utf8 strings even though you're not using any kind of "unicode-aware" algorithm.
Similarly, you can do case-conversions on only the ASCII characters in strings (those with the high bit zero), and treat the rest of the strings as binary without screwing them up.
Some utf8-aware operations are so simple that it's easy to just write one's own functions to do them.
For instance, to calculate the length in unicode characters of a string, just count the number of bytes with the high bit zero (ASCII characters) plus the number of bytes with the top two bits set to 11 ("leading bytes" of non-ASCII characters); the length is the sum of those two (a small sketch follows at the end of this answer).
For more complex operations—e.g., case-conversion on non-ASCII characters, etc.—you'll probably have to use a Lua unicode library, such as those on the (previously mentioned) Lua-users Unicode page
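The counting rule from the length example above, sketched in Python purely to show the logic (a Lua version would make the same byte tests with string.byte):

def utf8_length(b):
    # ASCII bytes (high bit 0) plus leading bytes of multi-byte sequences
    # (top two bits 11); continuation bytes (10xxxxxx) are not counted.
    return sum(1 for byte in b if byte < 0x80 or byte >= 0xC0)

print(utf8_length("開発.txt".encode("utf-8")))   # 6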
Lua does not have any support for unicode (other than accepting any byte value in strings). The library slnunicode has a lot of unicode string functions, however. For example unicode.utf8.len.
(note: this answer is completely stolen from grom's comment on another question - I just think it deserves its own answer)
If you want a short answer, it is 'yes and no' as put on the linked site.
Lua supports Unicode in the way that specifying, storing and querying arbitrary byte values in strings is supported, so you can store any kind of Unicode-encoding encoded string in a Lua string.
What is not supported is iteration by unicode character; there is no standard function for string length in unicode characters, etc. So the higher-level kind of Unicode support (like what is available in Python: length, lower -> upper case conversion, encoding in arbitrary codings, etc.) is not available.
Lua 5.3 has since been released. It comes with a basic UTF-8 library.
You can use the utf8 library to do things related to UTF-8 encoding, like getting the length of a UTF-8 string in characters (not in bytes, as string.len gives), matching individual characters (not bytes), etc.
It doesn't provide support beyond encoding, such as answering "is this character a Chinese character?"
It supports it in the sense that you can use Unicode in Lua strings. It depends specifically on what you're planning to do, but most of the limitations can be fairly easily worked around by extending Lua with your own functions.