Why aren't word processing program documents stored as plaintext? - encoding

Whenever an MS Word (or LibreOffice, or other word processor) document is opened in its respective program, the words appear normally on the page, but when the same document is opened in a text editor, most of it is gibberish.
I can understand why the document might have some parts that aren't legible, like bullet points or metadata, but why isn't at least some of the content stored as plaintext? Does every letter get encoded?

The current format of Microsoft Word, .docx, is XML (plain text) compressed with ZIP. You can unzip the file by renaming the .docx to .zip and then open the contained XML files in a text editor. So the content is stored as plain text, just compressed.
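For example, a minimal Python sketch using only the standard library (the file name is hypothetical):

import zipfile

# A .docx is a ZIP container; the visible text lives in word/document.xml.
with zipfile.ZipFile("report.docx") as docx:
    print(docx.namelist())                 # word/document.xml, styles, media...
    xml = docx.read("word/document.xml")   # the document body, as plain XML
    print(xml[:300].decode("utf-8"))       # readable markup with the text inside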

I suspect it is partly a branding thing. If you want, you can export a document to a plain text file.
If you go to File > Export > Change File Type > Plain Text (*.txt), you can export the document there.

Related

What are the characters shown on a file after forcefully changing the extension?

Recently I changed the extension of an .apk file to .txt, and despite this I was able to open it in Notepad, where it showed characters that aren't available on the keyboard, such as:
org/antlr/runtime/ANTLRFileStream.class…TmOÓP=w[×QËÀ)ê|A…ÑETÔ¢NP¢™ãË—º•Q3ZÓcüþ¿j",£ß4ñGÏmÇñ˽Ïs{žçœçeûùëóW ±¨á0F5d0ÖA˔‹LÈã’ŠËR˜PqEƒ†Iy\•ØkÒºÞÁЂ´¦TL«˜H­95{ÙÚ°2K/­×–Y³Üªù(ð·:%œv\'¸!Гû÷óðª#¢èUܵä¸öòæÆÛ_±^ÔÂt^Ùª­Z¾#ýæc"XwêKž_5-7¨ù¦¿éΆmÞZ^Y*ÍS “ÛÖ¹µ¹7eûUàxn]%µ‘Ð^TÊvË^…kžUˆ;u_àTw<sÁ}µDL%ÛªØ>ùÄš#º…Rø˜¨;o)\,0ǚԞ݇ؓ‡àΪ<ò6ýr³¥GsÃ횪EOÌ_…É =è•Ç¬Ž#8ª£½ú^fùõ˜Ž›¸%pü IT{`Á2þ¶<Š:î`NÇ<î긇A˜èÿïˆ8Ç0Q¥»¨#- Ze7srRÉšíVƒõÐ]0rí&tÀ”O´‡[Y±K ö¬H›¯Ü %÷¬8Ì) r+åšW·ÑÏF†¿,bd—i%h³­ˆá8½YÄiª‘
Not just this: changing many other extensions, like .jar, .xapk, etc., to .txt shows me these characters too. Can anyone please explain what these characters are based on, and how exactly the OS decides (or tries to decide) what characters to show for an unsupported file?
Is there a way to get the original content back from this data?
Let's say you created a text editor which can write, save, and open text files. You also defined the encoding that will be used to save the text as binary (all files, when saved, are binary). So your encoding looks something like the following:
Your encoding              Emacs encoding
TEXT    BINARY             TEXT    BINARY
A       01000001           ă       01000001
B       01000010           Ћ       01000010
...     ...                ...     ...
Z       01011010           Ϡ       01011010
Let's say you create a file with 'ABZ' as its contents. When saved, this file contains the value 010000010100001001011010. When you open this file with your text editor, the editor finds 010000010100001001011010 as the file contents, and using the encoding above it knows the contents are 'ABZ', so it prints 'ABZ' on the screen.
Now let's say you open the same file using Emacs. Since Emacs uses its own encoding, it displays "ăЋϠ". There is nothing wrong with Emacs; it just doesn't know that the data was written using your custom encoding.
So the point is that every file is written in a specific format; for example, the APK format can only be correctly understood by the Android system. When you try to open an APK file in a text editor, it just tries to make sense of the binary data in the same way Emacs does in the example above.
Is there a way to get the original content through this data?
If you know the encoding with which the data was originally written, then you can read the contents of the file using that same encoding.
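To see this concretely, here is a small Python sketch: the same byte sequence decoded under two real encodings gives two different texts:

# The same bytes, read under two different (real) encodings:
data = bytes([0x41, 0x42, 0x5A, 0xC3, 0xA9])

print(data.decode("utf-8"))    # ABZé   (0xC3 0xA9 is decoded as one character)
print(data.decode("latin-1"))  # ABZÃ©  (every byte becomes its own character)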

Windows Converting a Folder of Files From RTF to UTF-8

I am trying to analyze a corpus of 620 Korean-language newspaper articles using the konlpy module in Python. The files are in RTF format, but konlpy only supports files encoded in UTF-8. In Windows, how can I convert a folder containing 620 RTF articles to UTF-8 articles such that, upon opening the files in Notepad, the Korean characters are still intact?
Some things I have tried (but to no avail):
Used a freeware converter program (http://www.emreakkas.com/localization-tools/convert-rtf-to-txt) that converted the files into UNICODE, and then tried to use a Cygwin iconv batch file to convert the files using the same script as this individual did:
cygwin syntax error near unexpected token `done'
When I do this, all of the files are there, but they are 0 KB and blank. (Let me know if you need more info about this method, as I needed to do another step just to get it to loop over my files.)
Used another freeware program (my memory is a little hazy on this one) that converted the RTF files, but all the characters came out as scrambled Latin characters.
I'm thinking that there has to be an easy way to do this, but everything I tried is really complicated and does not work. Another funny thing is that whenever I manually take the original RTF file (or the file converted into UNICODE) and "Save As", choosing UTF-8, it works fine. I would love not to have to "Save As" for 620 articles.
Thanks!
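One scriptable approach, sketched here under the assumption that the third-party striprtf package (pip install striprtf) is acceptable and with hypothetical folder paths, is to convert the whole folder with a short Python script:

import pathlib
from striprtf.striprtf import rtf_to_text  # third-party: pip install striprtf

src = pathlib.Path(r"C:\corpus\rtf")   # hypothetical folder with the .rtf articles
dst = pathlib.Path(r"C:\corpus\utf8")  # hypothetical output folder
dst.mkdir(exist_ok=True)

for rtf_file in src.glob("*.rtf"):
    # RTF markup itself is ASCII; non-ASCII text such as Korean is stored
    # as \uNNNN or \'xx escapes, which survive an ASCII read.
    raw = rtf_file.read_bytes().decode("ascii", errors="ignore")
    text = rtf_to_text(raw)  # decodes the RTF escapes back to Unicode
    (dst / (rtf_file.stem + ".txt")).write_text(text, encoding="utf-8")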

Converted stream file of iText PDF not opening in MS Word

Our project has a requirement to generate the end report as both a PDF and an MS Word document. We are using iTextSharp to dynamically generate the tables and rows in the report. Finally, we upload the file to the server as PDF and as MS Word: both are converted to a byte array/stream and saved as a PDF and as an MS Word document. The uploaded PDF works as expected, but the MS Word file gives an error and does not open (attaching the screenshot).
iTextSharp doesn't produce MS Word documents, so this isn't really an iText question. When I look at your screenshot, I see that you are trying to import a PDF file into Word. Since Word can't interpret PDF syntax, it shows you the raw syntax of the PDF file:
%PDF-1.4
%âãÏÓ
1 0 obj
<</Type/Font...
I think your question is wrong. You are not using iTextSharp to create a PDF file and an MS Word file; you are using iTextSharp to create a PDF file, and nothing else.
There is no such thing as "save a PDF as an MS Word file" in iTextSharp, and it will be extremely difficult to find another tool that can convert a PDF document to a Word document in an acceptable way. (There are such tools, but the quality is suboptimal for PDFs that weren't designed to be converted to another format.)
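Renaming doesn't change the bytes, and the file's real format is in those bytes. As a rough illustration (file names are hypothetical), you can sniff the format from the leading "magic" bytes in Python:

def sniff(path):
    # Compare the first bytes against well-known file signatures.
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(b"%PDF"):
        return "PDF"
    if head.startswith(b"PK\x03\x04"):
        return "ZIP container (.docx, .xlsx, .jar, .apk, ...)"
    return "unknown"

print(sniff("report.pdf"))   # "PDF", even if the file is renamed report.doc
print(sniff("report.docx"))  # "ZIP container (...)"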

Writing CR+LF into Open XML from a Database

I'm trying to take some data stored in a database and populate a Word template's Content Controls with it using the Open XML SDK. The data contains paragraphs and so there are carriage return and line feed characters in it. The data is stored in the database as nvarchar.
When I open the generated document, the CR+LF combination shows up as a question mark in a box (I'm not sure of the name of this character). It is actually two sequences back to back, so CR+LF CR+LF comes out as two of these strange characters.
If I unzip the .docx, take the custom XML part and do a hex dump, I can clearly see 0d0a 0d0a, so the CR+LF is there; Word is just rendering it strangely.
I've tried enforcing UTF-8 encoding in my XmlWriter settings, but that didn't seem to help:
Imports System.IO
Imports System.Text
Imports System.Xml

Dim docStream As New MemoryStream()
Dim settings As New XmlWriterSettings()
settings.Encoding = New UTF8Encoding(False) ' UTF-8 without a BOM
Dim docWriter As XmlWriter = XmlWriter.Create(docStream, settings)
Does anyone know how I can get Word to render these characters correctly when written to a .docx through the Open XML SDK?
To bind to a Word 2013 rich text control, your XML element has to contain a complete docx. See [MS-DOCX]:
the data stored in the XML element will be an escaped string comprised of a flattened WordprocessingML document representing the formatted data in the structured document tag range.
Earlier versions couldn't bind a rich text control.
Things should work, though (with CR/LF, not w:br), if you bind to a plain text control with multiline set to true.
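For illustration, a hand-written WordprocessingML sketch of such a plain text control (the XPath and store item ID below are hypothetical placeholders); the w:multiLine attribute on w:text is what allows CR/LF in the bound data to render as line breaks:

<w:sdt>
  <w:sdtPr>
    <w:dataBinding w:xpath="/data/notes"
                   w:storeItemID="{11111111-2222-3333-4444-555555555555}"/>
    <w:text w:multiLine="true"/>
  </w:sdtPr>
  <w:sdtContent>
    <w:p><w:r><w:t>placeholder</w:t></w:r></w:p>
  </w:sdtContent>
</w:sdt>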

Extra characters in .doc file when opened with textpad

When I open a document in TextPad, an extra null character appears between every pair of characters.
For example, my document contains the following text:
बॉम्बे testing for webmail.
When I open it in TextPad, it comes out as:
I....M....I t.e.s.t.i.n.g. f.o.r. w.e.b.m.a.i.l.
Can anybody help me with this?
This file is in UTF-16 or UCS-2 format. When opening it, you must specify which encoding to use; your text editor does not recognize this encoding automatically.
If your text editor does not allow you to set the encoding when opening a file, try using Notepad++ or TextPad.
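A small Python sketch of what is happening: ASCII letters encoded as UTF-16 take two bytes each, and an editor that assumes a one-byte encoding renders the zero bytes as extra characters:

data = "testing".encode("utf-16-le")
print(data)                      # b't\x00e\x00s\x00t\x00i\x00n\x00g\x00'

# Misread with a single-byte encoding, every second character is a NUL,
# which an editor may show as '.', a box, or nothing at all:
print(data.decode("latin-1"))

# Decoded as UTF-16, the text is intact:
print(data.decode("utf-16-le"))  # testing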