Could I use special characters to split an aggregation in PostgreSQL?

Can I, without worrying, use special characters to split up the words?
For example:
SELECT STRING_AGG(users.name, '💪')

This will work fine as long as you are using characters supported by your database encoding (check the parameter server_encoding).
I hope and expect that you are using the only sensible encoding here: UTF8. If yes, then you don't have to worry.
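As a rough illustration of what "supported by the encoding" means, here is a small Python sketch. It uses Python's own codec names as stand-ins for the server_encoding values (an assumption; PostgreSQL performs its own check), and the helper name can_encode is hypothetical.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


separator = "💪"
utf8_ok = can_encode(separator, "utf-8")      # UTF-8 covers all of Unicode
latin1_ok = can_encode(separator, "latin-1")  # Latin-1 has no emoji
print(utf8_ok, latin1_ok)
```

A separator that cannot be represented in a single-byte encoding is the case the answer warns about; with UTF8 the question does not arise.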

Related

filter specific unicode character in SAS

I need to replace a specific Unicode character in SAS, namely U+0191, with a whitespace or blank. How can I do it with COMPRESS? Thanks in advance.
You should use the KCOMPRESS function rather than COMPRESS for compressing unicode characters, as it is considered safer for Unicode and DBCS environments.
However, it sounds like you actually want to TRANSLATE, or more accurately KTRANSLATE, which actually replaces characters with whitespace or other characters (as opposed to removing them, as COMPRESS does).
Here's an example:
data have;
charvar = "Ƒellow Americans";
fixed_charvar = translate(charvar,'F','Ƒ');
kfixed_charvar= ktranslate(charvar,'F','Ƒ');
put _all_;
run;
Here I convert U+0191 to a normal F; of course you can convert to space as you wish (Replace the 'F' with whatever you want it converted to).
This will work in an instance of SAS set up in Unicode mode; if you're running in WLATIN1 or similar, you may have more difficulty, particularly with actually passing SAS the U+0191 character.
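The difference between removing a character (what COMPRESS does) and replacing it (what TRANSLATE does) can be shown with a short Python analogue. This is only an illustration of the two behaviours, not SAS code, and the variable names are made up for the example.

```python
FLORIN_F = "\u0191"  # LATIN CAPITAL LETTER F WITH HOOK
text = FLORIN_F + "ellow Americans"

# "Translate"-style: the character is swapped for a blank, so the length is unchanged.
replaced = text.replace(FLORIN_F, " ")

# "Compress"-style: the character is dropped, so the string gets shorter.
removed = text.replace(FLORIN_F, "")

print(repr(replaced), repr(removed))
```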

Unicode character creation in Python 3.4

Using Python 3.4, suppose I have some data from a file, and it is literally the 6 individual characters \ u 0 0 C 0 but I need to convert it to the single unicode character \u00C0. Is there a simple way of doing that conversion? I can't find anything in the Python 3.4 Unicode documentation that seems to provide that kind of conversion, except for a complex way using exec() of an assignment statement which I'd like to avoid if possible.
Thanks.
Well, there is:
>>> b'\\u00C0'.decode('unicode-escape')
'À'
However, the unicode-escape codec is aimed at a particular format of string encoding, the Python string literal. It may produce unexpected results when faced with other escape sequences that are special in Python, such as \xC0, \n, \\ or \U000000C0, and it may not recognise other escape sequences from other string literal formats. It may also handle characters outside the Basic Multilingual Plane incorrectly (e.g. JSON would encode U+10000 as the surrogate pair \uD800\uDC00).
So unless your input data really is a Python string literal shorn of its quote delimiters, this isn't the right thing to do and it'll likely produce unwanted results for some of these edge cases. There are lots of formats that use \u to signal Unicode characters; you should try to find out what format it is exactly, and use a decoder for that scheme. For example if the file is JSON, the right thing to do would be to use a JSON parser instead of trying to deal with \u/\n/\\/etc yourself.
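A minimal sketch of that difference, assuming for the sake of the example that the input is the body of a JSON string (the variable names are illustrative):

```python
import json

raw = "\\u00C0"  # the six literal characters \ u 0 0 C 0

# Treating the text as the body of a Python string literal.
as_python = raw.encode("ascii").decode("unicode-escape")

# Treating the text as the body of a JSON string: add the quotes and parse it.
as_json = json.loads('"' + raw + '"')

# U+10000 written the JSON way, as a surrogate pair.
pair = "\\uD800\\uDC00"
json_pair = json.loads('"' + pair + '"')                # one character, U+10000
py_pair = pair.encode("ascii").decode("unicode-escape")  # two lone surrogates

print(as_python, as_json, len(json_pair), len(py_pair))
```

Both approaches agree on \u00C0, but only the JSON parser combines the surrogate pair into a single character.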

Replace characters with multi-character strings

I am trying to replace German and Dutch umlauts such as ä, ü, or ß. They should be written like ae instead of ä. So I can't simply translate one char with another.
Is there a more elegant way to do that? At the moment it looks like this (not complete yet):
SELECT addr, REPLACE (REPLACE(addr, 'ü','ue'),'ß','ss') FROM search;
On my way trying different commands I got another problem:
When I searched for Ü I got this:
ERROR: invalid byte sequence for encoding "UTF8": 0xdc27
I tried it with U&'\0220', but it didn't replace anything. Only the lowercase ü was replaced correctly. It has something to do with Unicode, but how do I solve this issue?
Kind regards from Germany. :)
Your server encoding seems to be UTF8.
I suspect your client_encoding does not match, which might give you a wrong impression of what you are dealing with. Check with:
SHOW client_encoding; -- in your actual session
And read these related answers:
Can not insert German characters in Postgres
Replace unicode characters in PostgreSQL
The rest of the tool chain has to be in sync, too. When using PuTTY, for instance, one has to make sure the terminal agrees with the rest: Change settings... Window -> Translation -> Remote character set = UTF-8.
As for your first question, you already have the best solution. A couple of umlauts are best replaced with a string of replace() statements.
As you seem to know already as well, single character replacements are more efficient with (a single) translate() statement.
Related:
Replace unicode characters in PostgreSQL
Regex remove all occurrences of multiple characters in a string
Among other reasons, I decided to write the replacement in Python. As Erwin wrote before, it seems there is no better solution than combining replace() commands.
In general it is pretty simple; no encoding even had to be used. My "final" solution now looks like this:
ger_UE="Ü"
ger_AE="Ä"
ger_OE="Ö"
ger_SS="ß"
dk_AA="Å"
dk_OE="Ø"
dk_AE="Æ"
# Nested REPLACE calls, innermost first; the constants above are interpolated into the SQL text.
cur.execute("""SELECT addr,
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(addr,
'%s','UE'),'%s','OE'),'%s','AE'),'%s','SS'),'%s','AA'),'%s','OE'),'%s','AE')
FROM search WHERE x = '1';""" % (ger_UE,ger_OE,ger_AE,ger_SS,dk_AA,dk_OE,dk_AE))
I am now looking forward to the speed when it hits the large table. If anyone would like to make some annotations, they are very welcome.
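If the rows end up in Python anyway, the same replacement could also be done on the client side. The following is only a sketch under that assumption; the mapping table and the function name transliterate are hypothetical, and the lowercase entries follow the question (ä becomes ae) rather than the code above.

```python
# One-character keys may map to multi-character replacements in str.maketrans.
UMLAUT_TABLE = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "AE", "Ö": "OE", "Ü": "UE",
    "Å": "AA", "Ø": "OE", "Æ": "AE",
})


def transliterate(text: str) -> str:
    """Replace German and Danish special letters with ASCII digraphs."""
    return text.translate(UMLAUT_TABLE)


print(transliterate("Müller Straße"))
```

A single translate() pass avoids the deep nesting, at the cost of moving the work out of the database.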

Does Lua support Unicode?

Based on the link below, I'm confused as to whether the Lua programming language supports Unicode.
http://lua-users.org/wiki/LuaUnicode
It appears it does, but with limitations. I simply don't understand: are the limitations anything big/key, or not a big deal?
You can certainly store unicode strings in lua, as utf8. You can use these as you would any string.
However Lua doesn't provide any default support for higher-level "unicode aware" operations on such strings—e.g., counting string length in characters, converting lower-to-upper-case, etc. Whether this lack is meaningful for you really depends on what you intend to do with these strings.
Possible approaches, depending on your use:
If you just want to input/output/store strings, and generally use them as "whole units" (for table indexing etc), you may not need any special handling at all. In this case, you just treat these strings as binary blobs.
Due to utf8's clever design, some types of string manipulation can be done on strings containing utf8 and will yield the correct result without taking any special care.
For instance, you can append strings, split them apart before/after ASCII characters, etc. As an example, if you have a string "開発.txt" and you search for "." in that string using string.find (string_var, ".", 1, true) (a plain search, since "." on its own is a pattern that matches any character), and then split it using the normal string.sub function into "開発" and ".txt", those result strings will be correct utf8 strings even though you're not using any kind of "unicode-aware" algorithm.
Similarly, you can do case-conversions on only the ASCII characters in strings (those with the high bit zero), and treat the rest of the strings as binary without screwing them up.
Some utf8-aware operations are so simple that it's easy to just write one's own functions to do them.
For instance, to calculate the length in Unicode characters of a string, just count the number of bytes with the high bit zero (ASCII characters), and the number of bytes with the top two bits 11 ("leading bytes" for non-ASCII characters); the length is the sum of those two.
For more complex operations (e.g., case-conversion on non-ASCII characters), you'll probably have to use a Lua Unicode library, such as those on the (previously mentioned) Lua-users Unicode page.
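The byte-level techniques in this list (splitting at an ASCII character, counting lead bytes) can be sketched in Python for illustration. The answer itself is about Lua; the helper name utf8_length is hypothetical, and the input is assumed to be valid UTF-8.

```python
def utf8_length(data: bytes) -> int:
    """Count characters in valid UTF-8 by skipping continuation bytes (10xxxxxx)."""
    count = 0
    for byte in data:
        is_ascii = byte & 0x80 == 0x00       # high bit zero
        is_lead_byte = byte & 0xC0 == 0xC0   # top two bits 11
        if is_ascii or is_lead_byte:
            count += 1
    return count


name = "開発.txt".encode("utf-8")

# Splitting at an ASCII byte never cuts a multi-byte sequence in half.
dot = name.find(b".")
stem, extension = name[:dot], name[dot:]

print(len(name), utf8_length(name), stem.decode("utf-8"), extension.decode("utf-8"))
```

Both halves of the split decode cleanly, because bytes of a multi-byte UTF-8 sequence always have the high bit set and so can never be mistaken for ".".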
Lua does not have any support for unicode (other than accepting any byte value in strings). The library slnunicode has a lot of unicode string functions, however. For example unicode.utf8.len.
(note: this answer is completely stolen from grom's comment on another question - I just think it deserves its own answer)
If you want a short answer, it is 'yes and no' as put on the linked site.
Lua supports Unicode in the way that specifying, storing and querying arbitrary byte values in strings is supported, so you can store any kind of Unicode-encoding encoded string in a Lua string.
What is not supported is iteration by unicode character, there is no standard function for string length in unicode characters etc. So the higher-level kind of Unicode support (like what is available in Python with length, lower -> upper case conversion, encoding in arbitrary coding etc) is not available.
Lua 5.3 has now been released. It comes with a basic UTF-8 library.
You can use the utf8 library to work with UTF-8 encoded strings: for example, getting the length of a UTF-8 string in characters (not the number of bytes, as string.len returns), or matching each character (not each byte).
It doesn't provide native support beyond the encoding itself; for example, it cannot tell you whether a character is a Chinese character.
It supports it in the sense that you can use Unicode in Lua strings. It depends specifically on what you're planning to do, but most of the limitations can be fairly easily worked around by extending Lua with your own functions.

How can I convert non-ASCII characters encoded in UTF8 to ASCII-equivalent in Perl?

I have a Perl script that is being called by third parties to send me names of people who have registered my software. One of these parties encodes the names in UTF-8, so I have adapted my script accordingly to decode UTF-8 to ASCII with Encode::decode_utf8(...).
This usually works fine, but every 6 months or so one of the names contains Cyrillic, Greek or Romanian characters, so decoding the name results in garbage characters such as "ПодражанÑкаÑ". I have to follow up with the customer and ask him for a "latin character version" of his name in order to issue a registration code.
So, is there any Perl module that can detect whether there are such characters and automatically translates them to their closest ASCII representation if necessary?
It seems that I can use Lingua::Cyrillic::Translit::ICAO plus Lingua::DetectCharset to handle Cyrillic, but I would prefer something that works with other character sets as well.
I believe you could use Text::Unidecode for this, it is precisely what it tries to do.
In the documentation for Text::Unidecode, under "Caveats", it appears that this phrase is incorrect:
Make sure that the input data really is a utf8 string.
UTF-8 is a variable-length encoding, whereas Text::Unidecode only accepts a fixed-length (two-byte) encoding for each character. So that sentence should read:
Make sure that the input data really is a string of two-byte Unicode characters.
This is also referred to as UCS-2.
If you want to convert strings which really are utf8, you would do it like so:
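use Text::Unidecode;  # exports unidecode()
my $decode_status = utf8::decode($input_to_be_converted);  # decodes in place; false if the input was not valid UTF-8
my $converted_string = unidecode ($input_to_be_converted);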
If you have to deal with UTF-8 data that are not in the ascii range, your best bet is to change your backend so it doesn't choke on utf-8. How would you go about transliterating kanji signs?
If you get Cyrillic text, there is no "closest ASCII representation" for many characters.
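To illustrate that last point, here is a rough Python sketch (not Perl, and not what Text::Unidecode does): stripping diacritics via Unicode decomposition works for accented Latin letters such as Romanian ones, but leaves Cyrillic untouched, so such names can at least be detected. The function names are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters and drop combining marks (accents, cedillas, ...)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_ascii(text: str) -> bool:
    """Return True if every character is in the ASCII range."""
    return all(ord(ch) < 128 for ch in text)


romanian = strip_diacritics("Ștefănescu")
cyrillic = strip_diacritics("Подражанская")

print(romanian, is_ascii(romanian), is_ascii(cyrillic))
```

A name that is still non-ASCII after this step needs real transliteration rules or a manual follow-up.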