Reading SAS table, showing up � in value. ENCODING - encoding

Currently reading an existing table in SAS. When reading the table, I noticed some values aren't encoded correctly.
For example, Bioparc de la Gasp�.
I saved this table in a csv file to look closely on this matter. When I open this file in notepad++, I notice that the value shows up Bioparc de la GaspxC3. And referencing xC3 on https://www.i18nqa.com/debug/utf8-debug.html, it shows as the character "Ã"
I have tried changing the encoding to utf-8 but it doesn't seem to change this issue.
My expected output is Bioparc de la GaspÃ.
Any help would be appreciated

Related

Displaying foreign characters on the website

I have a small list in a foreign language and I am able to display the special foreign characters on the website I am updating. For example to display ü on the website, I write ü in the file. Or to display ö, I write ö in the file. And they are displayed correctly. So far no problems. But now I must also display the character β. Can you just write me the code for it in that same set? Or better yet, tell me where can I find the corresponding character? such as in a list? what is the name of the list I must look at? Again, I want to display character β on a website, by writing the corresponding special character on the source file, just like I am writing ü to display ü.
Mojibake is what's happening, because your text editor use ISO 8859-1 to open and save the files, but your web server serve them to your user with UTF-8. You can confirm it with https://string-functions.com/encodedecode.aspx or other tools using encode set to ISO 8859-1 but the decode set to UTF-8.
The fix is to set your text editor to use UTF-8.

Pentaho spoon search and replace especial character in rows

I have a csv file with mime type US-ASCII and one column in the dataset look like this:
id
V_name
210001
cha?ne des Puys
210030
M?los
213004
G?ll?
213021
S?phan
221110
Afd?ra
And so on.
I would like to change those characters to:
id
V_name
210001
chaine des Puys
210030
Milos
213004
Gollu
213021
Suphan
221110
Afdera
The thing is that there are 95 rows of this kind, how can I search and replace those rows?
I using the suite PDI spoon.
Thanks in advance.
As #Iłya Bursov has stated, the source file you are reading doesn't provide the correct characters, it is providing the ? in the source, so if you want to correct it, you'll have to do it manually.
I don't think it is worth it, unless you know you are going to get always the same set of V_name over time and different files. In that case you could create a file to correlate the V_name in your source with the ? characters to a V_name_corrected with the correct display for the characters. This seems to be an exercise, so I would let the names as they are. In real life, I would contact the provider of the file with the incorrect character set to tell them that they need to correct the generation of the file to provide the correct characters in the file.

Identify hidden non-UTF8 encoded characters

I am working in postgreSQL database and I have text column which in various languages like russian, chineses, korean, english etc. Although our application handles these languages well, we are having a issue dealing with non-UTF-8 characters.
For example, if you see the image from notepad++ where I have done Encoding > Encode in UTF-8, it neatly shows all the non-recognizable characters.
However, we are facing issue marking such records as non-process-able in postgres. Something like a flag should also do but I am trying something like below but it flags the valid russian records as well whereas notepad++ explicitly shows the hidden/non-UTF-8 characters.
Notepad++
Weird thing about these characters are that they do not show up regular select query but when I convert them to "UTF-8", those show up like below.
Database
Tried something like this (below query) but it does not seem to work i.e give me the desired output. Expectation is to set a flag to such records which have invalid hidden HTML references but not lose the valid text like the valid russian sentence in the snapshot. Should be able to distinctly identify only such texts.
select text, text ~ '[^[:ascii:]]', text ~ '^[\x00-\x7F]*$'
from sample_data;
Sample Data -
"Я не наркоман. Это у меня всегда, когда мне афигитительно. А если серьёзно, это интересно,…"
"Ya le dieron amor a la foto de instagram de mi #UberCALAVERITA?"
"Executive Admininstrative Assistant in Toronto, ON for a Group"
"Сегодня валютные стратеги BMO обновили прогнозы по основным валютам на ближайшие пять кварталов (на конец периода): читать далее…"
"Flicitations Gestion d'actifs pour 6 Trophes #FundGradeA+2016 de fonds communs de placement :"
This answer might help you go back to fix problems. It doesn't directly help you to go forward in the direction you are asking about.
Looking at Flicitations and F\302\202licitations, the escapes look like octal, which is possibly a presentation choice of your "IDE" and/or the convert_to function. From octal, \302\202 is 0xC2 0x82, decoding as UTF-8 gives U+0082. In Unicode, that's a control character, in ISO 8859-1 it's a non-character, either might explain why some renderings make it invisible or take no space.
Now, Google tells me that Flicitations is almost like a French word, Félicitations. So, perhaps there is a character set and encoding where é is encoded as 0x82. Wikipedia helps here—Indeed there is: IBM850, which has been used for some French text.
So, it seems that someone has mishandled the user's text, causing data loss. The fundamental rule of text encoding is that text bytes must be read with the same encoding they were written with. Don't guess; Ask, or reference a standard, specification, documentation, or convention. Maybe you can go back and find the misbehaving process/code—at least that would prevent future data loss.
"Dealing with non-UTF-8 characters": There aren't really any non-UTF-8 characters. UTF-8 is an encoding of the Unicode character set. There are areas with exceptions but, practically speaking, Unicode has all characters, and UTF-8 can encode them all. So, if you think there are non-UTF-8 characters, the writer is either non-compliant or the reader is using the wrong encoding.

Decoding special char from .rtf or .doc using php

i am trying to find a clean way to decode some "special characters" using php, i have an RTF file (aslo PDF and DOC with sane kind of issue), i manage to open it and find clear text in it, put at the end it still output some character like : é for é or ç for ç.
I have tried mb_detect_encoding (with "auto" also) but it detect "ACSII", i have try to convert using mb_convert_encoding($mytext,'ISO-8859-1'), mb_convert_encoding($mytext,'ISO-8859-15'),
mb_convert_encoding($mytext,'UTF-8') and then UTF-8 to ISO-8859-1, htmlspecialchars, utf8_decode (recursively).
I made a mapping table but i don't think it the best way to do it ?
Xavier VILAIN
PS : Most of the document have been create in france latin charset.

Meaning of an RTF faulty piece of code

I am working on an RTF file made by someone else on an unknown platform, and everything is interpreted correctly, except some characters, whatever character set I open them from in openoffice. Here is the plain text, after interpretation:
"Même taille que la Terre, même masse, même âgec Vénus a souvent été qualifiée de sœur de la Terre. "
and here is the original ANSI paragraph:
"M\u234\'3fme taille que la Terre, m\u234\'3fme masse, m\u234\'3fme \u226\'3fge\uc2 \u61825\'ff\'81\uc1 c V\u233\'3fnus a souvent \u233\'3ft\u233\'3f qualifi\u233\'3fe de s\u339\'3fur de la Terre."
To zoom in:
"âgec Vénus" becomes "\u226\'3fge\uc2 \u61825\'ff\'81\uc1 c V\u233\'3fnus"
and finally, what we come up with:
"\uc2 \u61825\'ff\'81\uc1 c"
here \uc2 and \uc1 are to say we are going back and forth between 4-bytes and 2-bytes Unicode encoding.
\u61825 is an unknown Unicode character. Indeed, according to the RTF specification, any UTF character greater than 2^15 should be written in a negative form; negative form with ANSI characters should make the "-" (minus) sign visible to the notepad, am I right? So here already I have something I don't understand, how the RTF writer used by the person who made the rtf file in the first place could have done it. Maybe I missed something in the specification, specific versions, character sets, I don't know. If taken as is, 61825 would correspond to F181 which is in a private area of the Unicode table.
And then, the \'ff\'81 would be some use of the ANSI equivalent field of the whole "specific character" group (whose structure is usually \uN\'XX), to code something that would be 4-byte long. And here again, I could not find:
what is the code page (Windows-1252, ISO-8859-1, other?) being refered to (as in all the other places in the file where a \uN\'XX sequence apears, XX are always 3F, the Windows-1252 code for "?", so it did not give me much information)
what does the \'FF (which looks like some control character inside an escape sequence!) stand for, and then why \'81... Actually, the translation of \u61825 to hex is F181, not FF81...I am lost here!
Finally, what the translated text (in French) would make us expect is the ":" (semicolon): "Same size as Earth, same mass, same age: Venus has often been qualified as Earth's sister". It would make sense. But what rtf writer could imagine such a complicated code for the semicolon?
So again, after 1 hour of search, I open the question to you fellows: does someone recognize this, and could tell me what control word encoding is used, is there a big endian/little endian/2's complement mess here with the 61825, and same with the \'ff\'81, which would assemble as FF81 instead of F181, which itself doesn't mean anything as is...here my question is only to know if there would be a way to find the complete original text back from the bizarre RTF encoding!
what the translated text (in french) would make us expect is the ":" (semicolon
Nearly: it should be the ellipsis. You can see the source text eg here.
The ellipsis should normally be written simply as three periods, but there has traditionally been a separate character representing ellipsis in order better to control their spacing, back before complex text layout algorithms existed that could do automatic glyph replacement. Consequently there exists a Unicode compatibility character U+2026 HORIZONTAL ELLIPSIS to allow round-tripping to legacy encodings such as Windows code page 1252, where it is byte 133.
That, however, is not what has been encoded in your RTF document. That would be too easy.
61825 is an unknown Unicode character.
It's a Private Use Area character, which means it could represent absolutely anything. Word has exported certain common symbol fonts as PUA characters - see this post for the background.
So someone at some point may have used a symbol font where code unit 129 (the 0x81 in U+F181, 61825) maps to something that looks like an ellipsis. Quite what that font is, I have no idea! It doesn't seem to be one of the usual suspects (Symbol, Wingdings, Webdings). You might just have to manually replace U+F181 with U+2026 for now unless you can find out more about the source.