Applescript: Save Word documents as plaintext while retaining accents - ms-word

I'm trying to save Word documents as plain text. Sometimes the accented characters turn into other symbols (usually the same ones each time; for example, é turns into a theta). Other times it works fine. How do I prevent this?
Currently using the line:
save as active document file name FullDocPath file format format Unicode text
When I encounter this error, I can save the document using the dialog and select the Western (Mac OS Roman) encoding, which fixes the problem.
The AppleScript Word dictionary mentions:
[text encoding unsigned integer] : Text encoding to use when saving out as text file
I have no idea whether this is the piece I'm missing or how to use it. Is there a set integer that designates the Western Mac OS Roman encoding?
Anyone have any ideas?

Try:
set wordDoc to choose file
do shell script "textutil -convert txt " & quoted form of POSIX path of (wordDoc as text)
Check out StefanK's solution using textutil

This is in response to your comment beginning "Thanks Stefan and bibadiak"
The problem with .txt files is that there is no universally used way to specify the encoding of a file inside the file itself, so either the application has to guess, or you have to know the encoding and the application has to let you tell it.
AFAIK if you do not specify an output encoding when you use textutil to convert from .doc or .docx format to text, you get UTF-8. But Mac Word just does not seem to recognise that when you try to open it, either programmatically or in the UI.
So I think you need to do some mix of the following:
a. save in, and work with, a format that uses 16-bit Unicode encoding. Word should recognise that, certainly if the BOM is preserved
b. save to UTF-8 and work with UTF-8 elsewhere, but use textutil to do the conversion back to (say) .docx before you re-open the document in Mac Word
c. if all your characters can be encoded using Mac OS Roman, use e.g.
textutil -convert txt -encoding 30
to save, ensure you work only with that character set, and re-open with Word. (30 is the value of Apple's NSString encoding constant NSMacOSRomanStringEncoding.) I think textutil will fail to convert documents that contain characters outside the Mac OS Roman set.
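For option c, it can help to check up front whether the text really fits into Mac OS Roman. The following is only a rough sketch in Python, not part of the textutil workflow itself: it assumes Python's built-in "mac_roman" codec covers the same character set as encoding 30, and the helper name fits_mac_roman is made up for illustration.

```python
def fits_mac_roman(text: str) -> bool:
    """Return True if every character in text can be encoded as Mac OS Roman."""
    try:
        text.encode("mac_roman")  # Python's codec name; assumed to match encoding 30
        return True
    except UnicodeEncodeError:
        return False


accented = "café"
print(fits_mac_roman(accented))      # accented Latin letters are in the set
print(accented.encode("mac_roman"))  # é is stored as the single byte 0x8E
print(fits_mac_roman("\u98a8"))      # a CJK character (風) is not in the set
```

If the check fails for a document's text, options a or b are probably the safer route.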

Related

What are the characters shown on a file after forcefully changing the extension?

Recently I changed the extension of an .apk file to .txt. Despite this, I was able to open it in Notepad, where it showed seemingly random characters, many of which are not available on the keyboard. An excerpt:
org/antlr/runtime/ANTLRFileStream.class…TmOÓP=w[×QËÀ)ê|A…ÑETÔ¢NP¢™ãË—º•Q3ZÓcüþ¿j [...]
It is not just this file: renaming files with many other extensions, such as .jar or .xapk, shows me the same kind of characters. Can anyone please explain what these characters are based on, and how exactly the OS decides which characters to show for an unsupported file?
Is there a way to get the original content through this data?
Let's say you created a text editor that can write and save text files as well as open them. You also defined the encoding that will be used to store text as binary data (all files are binary once saved). Suppose your encoding and the one Emacs uses look like this:
Your encoding          Emacs encoding
TEXT  BINARY           TEXT  BINARY
A     01000001         ă     01000001
B     01000010         Ћ     01000010
...                    ...
Z     01011010         Ϡ     01011010
Let's say you create a file with 'ABZ' as its contents. When saved, this file contains the value 010000010100001001011010. When you open the file with your text editor, the editor finds 010000010100001001011010 as the file contents, and using the encoding above it knows that this is 'ABZ', so it prints 'ABZ' on the screen.
Now let's say you open the same file in Emacs. Since Emacs (in this example) uses its own encoding, it displays "ăЋϠ". There is nothing wrong with Emacs; it just doesn't know that the data was written using your custom encoding.
So the point is that every file is written in a specific format. For example, the APK format is meant to be read by the Android system. When you open an APK file in a text editor, the editor just tries to make sense of the binary data as text, in the same way Emacs does in the example above.
Is there a way to get the original content through this data?
If you know the encoding (or file format) that was originally used to write the data, then you can read the contents of the file using that same encoding.
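The same effect can be reproduced with real encodings. Here is a minimal sketch in Python (the encodings chosen are just examples, not anything taken from the question): one fixed sequence of bytes is decoded three different ways, and only the encoding that was used for writing gives back the original text.

```python
# The bytes on disk never change; only the interpretation does.
data = "é".encode("utf-8")  # two bytes: 0xC3 0xA9

for encoding in ("utf-8", "cp1252", "mac_roman"):
    # Each codec maps the same two bytes to different characters.
    print(encoding, "->", data.decode(encoding))
```

Decoding with "utf-8" gives é back, while "cp1252" shows the two bytes as "Ã©". A text editor opening an .apk does the same thing on a larger scale with bytes that were never text to begin with.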

How to retrieve Unicode text from Notepad file saved as ANSI text file

I accidentally saved a file written in Greek as ANSI instead of Unicode. I had so much stuff written there, notes for my upcoming college exams and I really really need them. Now everything is ''???''
Is there a way to retrieve the file?
The data has been lost. Characters that your system's ANSI charset does not support were converted to ?, and you can't undo that. There is no recovering characters that were converted to ?.
Notepad should have warned you about the data loss before it allowed the file to be saved.
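To see why there is nothing left to recover, here is a small Python sketch of what such a save effectively does. It assumes the system's ANSI code page is Windows-1252 (Western European), which is an assumption about the asker's machine, and it uses Python's errors="replace" mode as a stand-in for Notepad's substitution.

```python
original = "αβγ"  # Greek letters, which Windows-1252 cannot represent

# Characters without a slot in the code page are replaced by "?" on encoding.
saved = original.encode("cp1252", errors="replace")
print(saved)

# Reading the file back only ever yields the question marks.
print(saved.decode("cp1252"))
```

Every unsupported letter becomes the same byte 0x3F, so the file no longer contains any information about which letter was there.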

How should a properly UTF-8 encoded file look in notepad++

I am integrating data using some flat files. I'm getting the flat files delivered by FTP as .csv-files out of MS SQL exports from a business partner.
I asked him to encode it as UTF-8 (just using the standard I thought).
Now I can see in his files that a lot of UTF-8 bytes such as "& # 2 3 3 ;" (w/o the spaces) can be seen as plain text when I open them in Notepad++ (or in my "ETL" tool).
Before I ask him to fix it into proper UTF-8, I would like to understand the issue and whether my claim is actually correct?
Shouldn't special characters be shown as special characters when I open them in Notepad++ and not as plain text UTF-8 codes?
Any help is much appreciated :))
Cheers
Martin
&#233; is an HTML entity (a numeric character reference for é). For some reason the text is HTML formatted, which I wouldn't count as "plaintext"/flat files. The file may or may not be encoded in UTF-8 in addition to that; we don't know from the information given.
A file containing "special characters" (meaning non-ASCII characters) encoded in UTF-8 opened in a text editor which correctly interprets the file as UTF-8 looks exactly like the text it should look like, e.g.:
正式名称は、ISO/IEC 10646では “UCS Transformation Format 8”、Unicodeでは “Unicode Transformation Format-8” という。両者はISO/IEC 10646とUnicodeのコード重複範囲で互換性がある。RFCにも仕様がある。
Put this in a file, save it as UTF-8, open it in another application as UTF-8, and this is what the text should look like.
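To make the difference concrete, here is a short Python sketch (the sample string is invented for illustration). It shows that an HTML character reference is just ASCII text that has to be unescaped, whereas UTF-8 stores the character itself as bytes.

```python
import html

raw = "caf&#233;"             # what the delivered file apparently contains
decoded = html.unescape(raw)  # turn the HTML reference into the real character
print(decoded)

# In a genuinely UTF-8 encoded file, é is the two bytes 0xC3 0xA9.
print(decoded.encode("utf-8"))
```

Something like html.unescape could serve as a workaround on the receiving side, but the cleaner fix is for the export to stop HTML-escaping the data.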

retrieving unicode text from notepad which is saved as ANSI text file

Yesterday I wrote some text in a Notepad file which was full of Unicode characters and saved the file as ANSI. Notepad gave me a warning, on which I clicked OK without reading it fully, and I closed Notepad.
Today when I opened the same file in Notepad again, it was full of ??? signs. I now understand that this happened because I saved Unicode data as ANSI text. Is there a way to get this text back? Maybe using a hex editor or similar?
No. Certain characters cannot be encoded in certain encodings. "風" cannot be encoded at all in ISO-8859 or any other single-byte encoding, for example. Each ANSI encoding also can only encode a certain subset of all possible characters. It is simply not possible to store characters not defined in a particular ANSI encoding in that encoding, they're simply not defined there.
So, they're gone. You'd better pull out a backup.

Saving outlook mail items in UTF-8 / Unicode using C#

We have created an Outlook Plugin which (amongst other things) can be used to save Mail items in text form to a specific folder. However, the text of the resulting text file is encoded in ANSI and I would like to save it as UTF8. I have already set the Codepage of the mail item like so:
mail = (MailItem)objItem;
mail.InternetCodepage = 65001; // equal UTF8 encoding; see http://msdn.microsoft.com/en-us/library/office/ff860730.aspx
mail.SaveAs(filePath, olSaveAsType);
However, the resulting file is saved as "ANSI as UTF8" and all extended characters (e.g. in Arabic or Russian) come out as question marks.
Does anyone know how I can save the mail item in utf8?
Thanks a lot.
Cheers,
Martin
Instead of trying to set the encoding, try reading InternetCodepage and then using a System.Text.Encoding object to read the saved file into a string. You could then convert and re-save the string as another file in the encoding you prefer.
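The read-then-re-save idea looks roughly like this. It is sketched in Python rather than C#, so it is not the Outlook or System.Text.Encoding API; the function name reencode and the example code page are hypothetical. It can only help if the saved file still contains the characters in some known code page: if they were already written out as question marks, no re-encoding brings them back.

```python
from pathlib import Path


def reencode(src: Path, dst: Path, src_encoding: str, dst_encoding: str = "utf-8") -> None:
    """Read src using src_encoding and write the same text to dst in dst_encoding."""
    text = src.read_text(encoding=src_encoding)
    dst.write_text(text, encoding=dst_encoding)


# Hypothetical usage: a message saved in Windows-1251 (Cyrillic), re-saved as UTF-8.
# reencode(Path("mail.txt"), Path("mail-utf8.txt"), src_encoding="cp1251")
```

In C#, the same two steps would map onto reading with the Encoding that matches InternetCodepage and writing with a UTF-8 encoding.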