Decoding special char from .rtf or .doc using php - special-characters

i am trying to find a clean way to decode some "special characters" using php, i have an RTF file (aslo PDF and DOC with sane kind of issue), i manage to open it and find clear text in it, put at the end it still output some character like : é for é or ç for ç.
I have tried mb_detect_encoding (with "auto" also) but it detect "ACSII", i have try to convert using mb_convert_encoding($mytext,'ISO-8859-1'), mb_convert_encoding($mytext,'ISO-8859-15'),
mb_convert_encoding($mytext,'UTF-8') and then UTF-8 to ISO-8859-1, htmlspecialchars, utf8_decode (recursively).
I made a mapping table but i don't think it the best way to do it ?
Xavier VILAIN
PS : Most of the document have been create in france latin charset.

Related

Displaying foreign characters on the website

I have a small list in a foreign language and I am able to display the special foreign characters on the website I am updating. For example to display ü on the website, I write ü in the file. Or to display ö, I write ö in the file. And they are displayed correctly. So far no problems. But now I must also display the character β. Can you just write me the code for it in that same set? Or better yet, tell me where can I find the corresponding character? such as in a list? what is the name of the list I must look at? Again, I want to display character β on a website, by writing the corresponding special character on the source file, just like I am writing ü to display ü.
Mojibake is what's happening, because your text editor use ISO 8859-1 to open and save the files, but your web server serve them to your user with UTF-8. You can confirm it with https://string-functions.com/encodedecode.aspx or other tools using encode set to ISO 8859-1 but the decode set to UTF-8.
The fix is to set your text editor to use UTF-8.

How should a properly UTF-8 encoded file look in notepad++

I am integrating data using some flat files. I'm getting the flat files delivered by FTP as .csv-files out of MS SQL exports from a business partner.
I asked him to encode it as UTF-8 (just using the standard I thought).
Now I can see in his files that a lot of UTF-8 bytes such as "& # 2 3 3 ;" (w/o the spaces) can be seen as plain text when I open it in Notedpad++ (or also using my "ETL" tool).
Before I ask him to fix it into proper UTF-8, I would like to understand the issue and whether my claim is actually correct?
Shouldn't special characters be shown as special characters when I open them in Notepad++ and not as plain text UTF-8 codes?
Any help is much appreciated :))
Cheers
Martin
é is an HTML entity. For some reason the text is HTML formatted, which I wouldn't count as "plaintext"/flat files. The file may or may not be encoded in UTF-8 in addition to that, we don't know from the information given.
A file containing "special characters" (meaning non-ASCII characters) encoded in UTF-8 opened in a text editor which correctly interprets the file as UTF-8 looks exactly like the text it should look like, e.g.:
正式名称は、ISO/IEC 10646では “UCS Transformation Format 8”、Unicodeでは “Unicode Transformation Format-8” という。両者はISO/IEC 10646とUnicodeのコード重複範囲で互換性がある。RFCにも仕様がある。
Put this in a file, save it as UTF-8, open it in another application as UTF-8, and this is what the text should look like.

Applescript: Save Word documents as plaintext while retaining accents

I'm trying to save Word documents as plain text docs. Currently, some times the accents turn into other symbols (usually the same ones, for example: é turns into a theta). Other times it works fine. How do I prevent this?
Currently using the line:
save as active document file name FullDocPath file format format Unicode text
When I encounter this error, I can save the document using the dialog (selecting Western Mac OS Roman encoding...that fixes the problem.
The applescript Word dictionary mentions:
[text encoding unsigned integer] : Text encoding to use when saving out as text file
I have no idea if this is the piece I'm missing or how to utilize it (is there a set integer that designates Western Mac OS Roman encoding?)
Anyone have any ideas?
Try:
set wordDoc to choose file
do shell script "textutil -convert txt " & quoted form of POSIX path of (wordDoc as text)
Check out StefanK's solution using textutil
This is in response to your comment beginning "Thanks Stefan and bibadiak"
With .txt file formats is that there is no universally used way to specify the encoding of a file inside the file, so either the application has to guess, or you have to know the encoding and the application has to let you tell it.
AFAIK if you do not specify an output encoding when you use textutil to convert from .doc or .docx format to text, you get UTF-8. But Mac Word just does not seem to recognise that when you try to open it, either programmatically or in the UI.
So I think you need to do some mix of the following:
a. save in, and work with, a format that uses 16-bit Unicode encoding. Word should recognise that, certainly if the BOM is preserved
b. save to UTF and work with UTF elsewhere, but use textutil to do the conversion back to (say) .docx before you re-open the document in Mac Word
c. if all your characters can be encoded using Mac OS Roman, use e.g.
textutil -convert txt -encoding 30
to save, ensure you work only with that character set, and re-open with Word. (30 is the value of the APple NSString value NSMacOSRomanStringEncoding). I think textutil will fail to convert documents that contain characters outside the MacOS Roman set.

Uploading Amazon Inventory UTF 8 Encoding

I am trying to upload my english inventory to various european amazon sites. The issue I am having is that the accents found in certain different languages are not displaying correctly when an "inventory file" is uploaded to amazon. The inventory file is a tab delimited text file.
current setup:
$type = 'text/tab-separated-values; charset=utf-8';
header('Content-Type:'.$type);
header('Content-Disposition: attachment; filename="inventory-'.$_GET['cc'].'.txt');
header('Content-Length: ' . strlen($data));
header('Content-Encoding: UTF-8');
When the text file is outputted and saved it looks exactly how it should when opened in windows (all the characters are correct) but for some reason amazon doesn't see it as UTF8 and re-encodes it with all of the characters found here:
http://www.i18nqa.com/debug/utf8-debug.html
I have tried adding the BOM to the top of the file but this just results in amazon giving an error. Has anyone else experienced this?
As #fvu pointed out in his comment, Amazon is expecting the ISO-8859-1 format, not UTF-8. That's why you should use PHP's utf8_decode method when writing to your file.
Ok so after a lot of trying it turns out that the characters needed to be decoded. I opened the text files in excel and they seemed to encode themselves as weird characters like ü using php utf8_decode turned them back into the correct characters EVEN THOUGH the text file showed them as the right characters... very confusing.
To anyone out there having difficulties with UTF 8 try decoding first.
thanks for your help

Convert non english characters into Unicode (UTF-8)

I copied large amount of text from another system to my PC. When I viewed the text in my PC, it looked weird. So I copied all the fonts from the other PC and installed them in mine too. Now the text looks okay, but actually it seems that is not in Unicode. For example, if I copy the text and paste in another UTF-8 supported editor such as Notepad++, I get English characters ("bgah;") only like shown below.
How to convert this whole text into unicode text, like the one below. So I can copy the text and paste anywhere else.
பெயர்
The above text was manually obtained using http://www.google.com/transliterate/indic/Tamil
I need this conversion to be done, so I can copy them into database tables.
'Ja-01' is a font with a custom 'visual encoding'.
That is to say, the sequence of characters really is "bgah;" and it only looks like Tamil to you because the font's shapes for the Latin characters bg look like பெ.
This is always to be avoided, because by storing the content as "bgah;" you lose the ability to search and process it as real Tamil, but this approach was common in the pre-Unicode days especially for less-widespread scripts without mature encoding standards. This application probably predates widespread use of TSCII.
Because it is a custom encoding not shared by any other font, it is very unlikely you will be able to find a tool to convert content in this encoding to proper Unicode characters. It does not appear to be any standard character ordering, so you will have to look at the font (eg in charmap.exe) and note down every character, find the matching character in Unicode and map between them.
For example here's a trivial Python script to replace characters in a file:
mapping= {
u'a': u'\u0BAF', # Tamil letter Ya
u'b': u'\u0BAA', # Tamil letter Pa
u'g': u'\u0BC6', # Tamil vowel sign E (combining)
u'h': u'\u0BB0', # Tamil letter Ra
u';': u'\u0BCD', # Tamil sign virama (combining)
# fill in the rest of the mapping information here!
}
with open('ja01data.txt', 'rb') as fp:
data= fp.read().decode('utf-8')
for char in mapping:
data= data.replace(char, mapping[char])
with open('utf8data.txt', 'wb') as fp:
fp.write(data.encode('utf-8'))
The font you found is getting you into trouble. The actual cell text is "bgah;", it gets rendered to பெயர் because you found a font that can work with 8-bit non-Unicode characters. So reading it or pasting it into Notepad++ is going to produce "bgah;" since that's the real text. It can only ever be rendered properly again by forcing the program that displays the string to use that same font.
Ditch the font and enter Unicode so it looks like this:
"bgah" looks like a Baamini based system, which is pre-unicode. It was popular in Canada (and the SL Tamil diaspora in general) in the 90s.
As the others mentioned, it looks like a custom visual encoding that mimics the performance of a foreign script while maintaining ASCII encoding.
Google "Baamini to unicode convertor". The University of Colombo seems to have put one up: http://www.ucsc.cmb.ac.lk/ltrl/services/feconverter/?maps=t_b-u.xml
Let me know if this works. If not, I can ask around and get something for you.
You could first check whether the encoding is TSCII, as this sounds most probable. It is an 8-bit encoding, and the fonts you copied are probably based on that encoding. Check out whether the TSCII to UTF-8 converter at SourceForge is suitable. The project there is called “Any Tamil Encoding to Unicode” but they say that only TSCII is supported for now.