Can anyone help me implement a Base58 encoding stored procedure in PostgreSQL?
I've found an answer for numbers, but I'm looking for a similar stored procedure that can accept a TEXT or VARCHAR value.
On this very rare occasion I'm going to suggest you don't do this. It will be computationally possible but highly inadvisable.
https://en.wikipedia.org/wiki/Base58
In contrast to Base64, the digits of the encoding don't line up well
with byte boundaries of the original data. For this reason, the method
is well-suited to encode large integers, but not designed to encode
longer portions of binary data.
To put this another way, Base58 is not designed to encode strings / text. Your main alternatives are:
Base64, which a human may mistranscribe when copying it by hand; otherwise Base64 is safe to copy / paste
Hexadecimal, which is easy for humans to copy but significantly longer than Base64
If you feel you really need Base58 and not Base64 then it may be worth editing your requirements into your question. This may help someone give an answer more specific to your requirements:
What are these strings you need to convert (examples are preferable)?
Why do they need to be Base58 and not Base64 (what other system are you passing these to)?
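To make the quoted caveat concrete, here is a rough sketch in Python (not PL/pgSQL) of what Base58-encoding arbitrary bytes involves; the alphabet is the common Bitcoin one, and the point is only that the entire input has to be treated as one big integer:

```python
# Minimal sketch of Base58-encoding arbitrary bytes (Bitcoin alphabet assumed).
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58_encode(data: bytes) -> str:
    # The whole byte string must be interpreted as a single big-endian integer.
    n = int.from_bytes(data, "big")
    digits = []
    while n > 0:
        n, rem = divmod(n, 58)
        digits.append(ALPHABET[rem])
    # Leading zero bytes carry no numeric value, so they are encoded separately
    # (the Bitcoin convention prepends one '1' per leading zero byte).
    pad = len(data) - len(data.lstrip(b"\x00"))
    return ALPHABET[0] * pad + "".join(reversed(digits))

print(base58_encode("hello".encode("utf-8")))  # 'Cn8eVZg'
```

Because every output digit depends on the whole input value, there is no fixed byte-to-digit mapping and no way to stream the data, which is exactly why Base58 is a poor fit for longer TEXT/VARCHAR values.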
Related
I saw many resources about the usages of Base64 on today's internet. As I understand it, all of those resources seem to spell out a single use case in different ways: encode binary data in Base64 to avoid it getting misinterpreted/corrupted as something else during transit (by intermediate systems). But I found nothing that explains the following:
Why would binary data be corrupted by intermediate systems? If I am sending an image from a server to a client, any intermediate servers/systems/routers will simply forward the data to the next appropriate servers/systems/routers on the path to the client. Why would intermediate servers/systems/routers need to interpret something they receive? Are there any examples of such systems that may corrupt/wrongly interpret the data they receive, on today's internet?
Why do we worry only about binary data being corrupted? We use Base64 because we are sure that those 64 characters can never be corrupted/misinterpreted. But by this same logic, any text characters that do not belong to the Base64 set can be corrupted/misinterpreted. Why, then, is Base64 used only to encode binary data? Extending the same idea, when we use a browser, are JavaScript and HTML files transferred in Base64 form?
There are two reasons why Base64 is used:
systems that are not 8-bit clean. This stems from "the before time" when some systems took ASCII seriously and only ever considered (and transferred) 7 bits out of any 8-bit byte (since ASCII uses only 7 bits, that would be "fine", as long as all content was actually ASCII).
systems that are 8-bit clean, but try to decode the data using a specific encoding (i.e. they assume it's well-formed text).
Both of these would have similar effects when transferring binary (i.e. non-text) data over it: they would try to interpret the binary data as textual data in a character encoding that obviously doesn't make sense (since there is no character encoding in binary data) and as a consequence modify the data in an un-fixable way.
Base64 solves both of these in a fairly neat way: it maps all possible binary data streams into valid ASCII text: the 8th bit is never set on Base64-encoded data, because only regular old ASCII characters are used.
This pretty much solves the second problem as well, since most commonly used character encodings (with the notable exception of UTF-16 and UCS-2, among a few lesser-used ones) are ASCII compatible, which means: all valid ASCII streams happen to also be valid streams in most common encodings and represent the same characters (examples of these encodings are the ISO-8859-* family, UTF-8 and most Windows codepages).
As to your second question, the answer is two-fold:
textual data often comes with some kind of metadata (either an HTTP header or a meta tag inside the data) that describes the encoding to be used to interpret it. Systems built to handle this kind of data understand and either tolerate or interpret those tags.
in some cases (notably for mail transport) we do have to use various encoding techniques to ensure text doesn't get mangled. This might mean using quoted-printable encoding or sometimes even wrapping text data in Base64.
Last but not least: Base64 has a serious drawback and that's that it's inefficient. For every 3 bytes of data to encode, it produces 4 bytes of output, thus increasing the size of the data by ~33%. That's why it should be avoided when it's not necessary.
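To make both points concrete (ASCII-only output and the ~33% size increase), here is a small sketch using Python's standard base64 module:

```python
import base64, os

raw = os.urandom(300)               # arbitrary binary data
encoded = base64.b64encode(raw)

print(len(raw), len(encoded))       # 300 400 -> roughly 33% larger
print(max(encoded) < 0x80)          # True: the 8th bit is never set in the output
```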
One of the uses of Base64 is sending email.
Mail servers used a terminal to transmit data. It was also common to have translations, e.g. \r\n into a single \n and the reverse. Note: there was also no guarantee that 8 bits could be used (the email standard is old, and it also allowed non-"internet" email, with ! instead of @). Some systems were not even fully ASCII.
Also, a line containing just a dot (i.e. \n.\n) is treated as the end of the body, and mbox files treat a line starting with "From " as the start of a new mail (which is why you sometimes see ">From" escaping), so even once the 8-bit flag became common in mail servers, the problems were not totally solved.
Base64 was a good way to remove all of these problems: the content is just sent as characters that all servers must know, and encoding/decoding requires only sender and receiver agreement (and the right programs), without worrying about the many relay servers in between. Note: any \r, \n, etc. inside the encoded data are simply ignored.
Note: you can also use Base64 to encode strings in URLs, without worrying about how web browsers will interpret them. You may also see Base64 in configuration files (e.g. to include icons): specially crafted images cannot then be misinterpreted as configuration. In short, Base64 is handy for embedding binary data in protocols that were not designed for binary data.
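Picking up the point that line breaks inside the encoded data are simply ignored, here is a small sketch with Python's base64 module (the 76-character wrapping is just the module's MIME-style default):

```python
import base64, os

raw = os.urandom(120)
wrapped = base64.encodebytes(raw)        # MIME-style output, wrapped with '\n' line breaks
print(wrapped.count(b"\n"))              # 3: 120 bytes -> 160 Base64 chars -> 3 lines
print(base64.b64decode(wrapped) == raw)  # True: the line breaks are ignored when decoding
```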
There is enough information on how to implement Base32 encoding, and on the specification of Base32 encoding, but I don't understand what it is, why we need it, and what the primary applications are. Can someone please explain and give nice real-life scenarios of its usage? Thanks.
Crockford Base32
Wikipedia: Base32
Like any other "ASCII-only" encoding, base32's primary purpose is to make sure that the data it encodes will survive transportation through systems or protocols which have special restrictions on the range of characters they will accept, and will emerge unmodified.
For example, b32-encoded data can be passed to a system that accepts single-byte character input, or UTF-8 encoded string input, or appended to a URL, or added to HTML content, without being mangled or resulting in an invalid form. Base64 (which is much more common) is used for the exact same reasons.
The main advantage of b32 over b64 is that it is much more human-readable. That's not much of an advantage because the data will typically be processed by computers, hence the relative rarity of b32 versus b64 (which is more efficient space-wise).
Update: there's the same question asked about Base64 here: What is base 64 encoding used for?
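To make the readability point concrete, a small sketch with Python's base64 module (the standard library uses the RFC 4648 alphabet, A–Z plus 2–7, and can decode case-insensitively):

```python
import base64

data = b"\x00\x01\x02\x03\x04"                               # 5 bytes -> exactly 8 Base32 chars
encoded = base64.b32encode(data)
print(encoded)                                               # b'AAAQEAYE' (A-Z and 2-7 only)

# Case-insensitive decoding is part of what makes Base32 friendlier for humans.
print(base64.b32decode(b"aaaqeaye", casefold=True) == data)  # True
```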
Base32 (and Base64) encoding is motivated by situations where you need to encode unrestricted binary data within a storage or transport system that allows only data of a certain form, such as plain text. Examples include passing data through URLs, XML, or JSON, all of which are plain-text formats that don't otherwise permit or support arbitrary binary data.
To add some numbers to the previous answers comparing base32 and base64: for the same .pdf file, the encoded results are:
base64.b32encode(content) = 190400 symbols
base64.b64encode(content) = 158668 symbols
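Those numbers are consistent with the theoretical ratios (Base32 emits 8 characters per 5 input bytes, Base64 emits 4 per 3); a quick sanity check with arbitrary bytes, assuming Python's base64 module:

```python
import base64, os

content = os.urandom(100_000)              # stand-in for the PDF bytes
print(len(base64.b32encode(content)))      # 160000 -> +60%
print(len(base64.b64encode(content)))      # 133336 -> ~+33%
```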
I'm using an API that processes my files and presents optimized output, but some special characters are not preserved, for example:
Input: äöü
Output: äöü
How do I fix this? What encoding should I use?
Many thanks for your help!
It really depends on what processing you are doing to your data. But in general, one powerful technique is to convert it to UTF-8 (with iconv, for example) and pass it through ASCII-capable APIs or functions. In general, if those functions don't touch the bytes they don't recognize as ASCII, then the UTF-8 is preserved -- that's a nice property of UTF-8.
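A tiny sketch of that property, using Python purely for illustration:

```python
s = "café"
utf8 = s.encode("utf-8")
print(list(utf8))   # [99, 97, 102, 195, 169]
# 'c', 'a', 'f' keep their plain ASCII byte values, while 'é' uses only bytes >= 0x80,
# so ASCII-only code that passes unknown bytes through untouched cannot corrupt it.
```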
I am not sure what language you're using, but things like this occur when there is a mismatch between the encoding of the content when it was entered and the encoding used when it is read back in.
So you might want to specify exactly which encoding to use when reading the data. You may have to experiment with the actual encoding you need, e.g.:
string.getBytes("UTF-8")
string.getBytes("UTF-16")
string.getBytes("UTF-16LE")
string.getBytes("UTF-16BE")
etc...
Also, do some research about the system this data is coming from. For example, web services from ASP.NET deliver the content as UTF-16LE, but Java uses UTF-16BE encoding. When these two systems talk to each other with extended characters, they might not understand each other in exactly the same way.
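For what it's worth, the output shown in the question looks like the classic case of UTF-8 bytes being decoded as Latin-1 somewhere along the way; here is a Python sketch of that guess, and of how it can be reversed if that is really all that happened:

```python
text = "äöü"
mangled = text.encode("utf-8").decode("latin-1")
print(mangled)                                     # 'Ã¤Ã¶Ã¼', as in the question

# If that is the whole story, the damage is reversible:
print(mangled.encode("latin-1").decode("utf-8"))   # 'äöü'
```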
Adding support for Unicode passwords is an important feature that should not be ignored by developers.
Still, adding support for Unicode in passwords is a tricky job because the same text can be encoded in different ways in Unicode, and you don't want to prevent people from logging in because of this.
Let's say that you'll store the passwords as UTF-8; note that this question is not about Unicode encodings but about Unicode normalization.
Now the question is how you should normalize the Unicode data.
You have to be sure that you'll be able to compare it. You also need to be sure that when the next Unicode standard is released, it will not invalidate your password verification.
Note: there are still some places where Unicode passwords will probably never be used, but this question is not about why or when to use Unicode passwords; it is about how to implement them in the proper way.
1st update
Is it possible to implement this without using ICU, e.g. by using the OS for normalization?
A good start is to read Unicode TR 15: Unicode Normalization Forms. Then you realize that it is a lot of work and prone to strange errors - you probably already know this part since you are asking here. Finally, you download something like ICU and let it do it for you.
IIRC, it is a multistep process. First you decompose the sequence until you cannot further decompose - for example é would become e + ´. Then you reorder the sequences into a well-defined ordering. Finally, you can encode the resulting byte stream using UTF-8 or something similar. The UTF-8 byte stream can be fed into the cryptographic hash algorithm of your choice and stored in a persistent store. When you want to check if a password matches, perform the same procedure and compare the output of the hash algorithm with what is stored in the database.
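A minimal sketch of that flow, using Python's unicodedata and hashlib just for illustration (NFC is used here; NFD/NFKC are the other common choices, and a real system would also add a salt and a slow key-derivation function rather than a bare hash):

```python
import hashlib, unicodedata

def password_digest(pw: str) -> bytes:
    # Normalize, then encode as UTF-8, then hash.
    normalized = unicodedata.normalize("NFC", pw)
    return hashlib.sha256(normalized.encode("utf-8")).digest()

composed   = "caf\u00e9"     # 'é' as a single code point
decomposed = "cafe\u0301"    # 'e' followed by a combining acute accent
print(composed == decomposed)                                     # False as raw strings
print(password_digest(composed) == password_digest(decomposed))   # True after normalization
```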
A question back to you: can you explain why you added "without using ICU"? I see a lot of questions asking for things that ICU does (we* think) pretty well, but "without using ICU". Just curious.
Secondly, you may be interested in StringPrep/NamePrep and not just normalization: StringPrep - to map strings for comparison.
Thirdly, you may be interested in UTR#36 and UTR#39 for other Unicode security implications.
*(disclosure: ICU developer :)
If you have binary data that you need to encode, what encoding scheme do you use?
I know about:
Hex encoding. Very simple, but quite verbose, expands one byte to two.
Base 64. Most common, not so verbose, expands three bytes to four.
Base 85. Not common, less verbose again, expands four bytes to five.
Are there any other encoding schemes in common use? If so, what are their advantages and disadvantages?
Edit: This is useful, for example, when trying to store arbitrary data in a cookie. Cookies can only store text, not arbitrary data, so you need to convert it in some way, preferably with a way to convert it back. Further, assume that you are using a stateless server so that you cannot save the state on the server and just put an identifier into the cookie. Of course, if you do this you would also need some way of verifying that what the user is passing back to you is what you passed to the user, for example a signature.
Also, since the current consensus is that you should use base64 since it is widespread, I will also point out that this is what I use... I am just curious if anyone used anything else, and if so, why.
Edit: Just in case someone stumbles across this, if you do want to use Base64 to store data in a cookie, you need to use a modified Base64 implementation. See this answer for the reason why.
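For a rough feel of the relative output sizes of the three schemes listed above, a quick sketch assuming Python's standard library (which happens to provide all three):

```python
import base64, os

blob = os.urandom(60)
print(len(blob.hex()))                 # 120  (2 chars per byte)
print(len(base64.b64encode(blob)))     # 80   (4 chars per 3 bytes)
print(len(base64.b85encode(blob)))     # 75   (5 chars per 4 bytes)
```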
For encoding cookie values, you need to be careful. See this older answer:
With Version 0 cookies, values should not contain white space, brackets, parentheses, equals signs, commas, double quotes, slashes, question marks, at signs, colons, and semicolons. Empty values may not behave the same way on all browsers.
Base64 encoding can generate = symbols for certain inputs, and this technically is not permitted in cookies (version 0 cookies, anyway, which are the most widely supported). In practice, I suspect the = will actually work fine, but maybe not.
I would suggest that to be absolutely sure your encoded binary is cookie-compatible, basic hex encoding is safest (e.g. in Java).
edit: As @Paul helpfully pointed out, there is a modified version of Base64 that is "URL safe" (and, I assume, "cookie safe"). Using a modified version of a standard algorithm rather dilutes its charm, mind you.
edit: @shoosh pointed out that the = is only used to denote the end of the Base64 string, so you could trim the =, set the cookie, then reattach the = when you need to decode it.
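Putting both edits together, a sketch of the trim-and-restore approach using Python's URL-safe Base64 variant (any equivalent library call would do):

```python
import base64

raw = b"\xfb\xff\xfe\x00"
token = base64.urlsafe_b64encode(raw).rstrip(b"=")   # '-' and '_' instead of '+' and '/', '=' trimmed
print(token)                                         # b'-__-AA'

# Restore padding to a multiple of 4 before decoding.
padded = token + b"=" * (-len(token) % 4)
print(base64.urlsafe_b64decode(padded) == raw)       # True
```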
Base64 wins because it's so common that I don't have to ever worry about rolling my own encoder/decoder. I haven't run into any applications where I've been worried about saving bandwidth or filespace in encoded binary data.
Once upon a time, there was UTF-7. It's officially deprecated, but it still works as an ACE (ASCII Compatible Encoding). Now there's IDN.
uuencode is popular in some circles
HTML and XML encode Unicode using character references (e.g. &#228; or &#x00E4;)
Base64 is the de-facto standard. Using anything else is asking for trouble.