Can a plain text file be corrupted or have a bug in the source? - plaintext

A rather simple question (I hope), but I am trying to learn more about plain text (in UTF-8) and how versatile it is. Most importantly, I'm reflecting on the usage of plain text in typesetting systems such as LaTeX and similar processors.
They say that plain text is the most stable. Plain text is stable (by design). There is no need to worry about metadata being corrupted by a program like Word or a similar WYSIWYG editor. But I'm wondering:
Can a plain text file be corrupted (in its source) or otherwise contain "bugs"?
If not, can someone explain how this works? Is plain text just parsed I/O?
This is an elementary question, I'm sure, but I'd like to understand how plain text functions inside a PC.

Who are "they"? Read the Wikipedia article on "Plain Text".
Any file can be corrupted by hardware failures. But if a file has no metadata, the metadata cannot be corrupted by defects in the software that manipulates the file.
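To illustrate the hardware-failure case, here is a minimal Python sketch. The sample string and the damaged byte position are made up for illustration; the point is only that a UTF-8 file is a sequence of bytes, and one damaged byte can make it undecodable.

```python
# Plain text on disk is only a sequence of bytes.
original = "naïve café".encode("utf-8")

# Simulate corruption (assumed scenario): overwrite the first byte of "ï"
# with 0xFF, a value that never occurs in valid UTF-8.
damaged = bytearray(original)
damaged[2] = 0xFF


def try_decode(data):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return bytes(data).decode("utf-8")
    except UnicodeDecodeError:
        return None


print(try_decode(original))  # decodes normally
print(try_decode(damaged))   # strict decoding fails, so None

# A lenient reader substitutes U+FFFD for whatever it cannot decode.
recovered = bytes(damaged).decode("utf-8", errors="replace")
print(recovered)
```

So the file has no metadata to corrupt, but its content bytes can still be damaged, and a strict reader will notice.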

Related

How to save the text on a web page in a particular encoding?

I read the following sentences at the link:
Content authors need to find out how to declare the character encoding
used for the document format they are working with.
Note that just declaring a different encoding in your page won't
change the bytes; you need to save the text in that encoding too.
As far as I know, the characters of a text are stored in the computer as one or more bytes, irrespective of the character encoding specified in the web page.
I understood the quoted text, except for the last sentence, which is in bold in the original:
you need to save the text in that encoding too
What does this sentence mean?
Is it saying that the content author/developer has to manually save the same text (which is already stored in the computer as one or more bytes) in the encoding he or she specified? If yes, how is that done and why is it needed? If no, what does this sentence actually mean?
When you make a web page publicly available, in the most basic sense you make a text file (located on hardware you control) public: when a certain address is requested, you return this file. The file may be saved on your hardware or may be generated on request (dynamic content). Either way, the user accessing your web page is given a file. Once the user has the file, they should be able to read it, and that is where the encoding comes into play. With a raw binary file you can only guess what it contains and what encoding it is in, so most web pages state the encoding of the file alongside the file itself. This is where the bold text you ask about comes in: if you declare one encoding alongside the file (for example UTF-8) but deliver the file in another encoding (for example ISO-8859-1), the user may see garbled text in places or may not be able to read it at all. And if you serve a static file, it should be saved in the correct encoding (that is, the one you declared).
As for how to store it, that depends heavily on how you provide the file. Most text editors offer a way to save a file in a specific encoding, and most tools that generate page content offer convenient ways to deliver it in a form that is easy for the user to decode.
It is just a note, probably added because some users were confused.
The text tells us that one should specify, in some form, the encoding of the file. This is straightforward: a web server usually cannot know the encoding of a file. Note that if pages are delivered by, e.g., a database, the encoding could be implicit, but the web treats files as first-class citizens, so we still need to specify the encoding.
The note just makes clear that changing the declared encoding does not make the web browser transcode the page. The page remains byte for byte the same; clients (browsers) will just misinterpret the content. So if you want to change the encoding, you should declare the new encoding and also save the file in (or convert it to) that encoding. No magic will be done (usually) by web servers.
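A small Python sketch of that difference between declaring and saving. The sample text and the two encodings are arbitrary choices for illustration.

```python
text = "café ¡hola!"

# The bytes that end up in the file depend on the encoding used when saving.
saved = text.encode("utf-8")

# Declaring a different encoding does not touch those bytes; a reader that
# trusts the declaration simply misreads them.
misread = saved.decode("iso-8859-1")
print(misread)

# Actually switching encodings means decoding with the old encoding and
# re-encoding (that is, re-saving) with the new one.
converted = saved.decode("utf-8").encode("iso-8859-1")
print(len(saved), len(converted))
print(converted.decode("iso-8859-1"))
```

The misread string shows the usual two-character garbage in place of each accented character, while the converted bytes decode cleanly under the new declaration.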
There is no text but encoded text.
The fundamental rule of character encodings is that the reader must use the same encoding as the writer. That requires communication, conventions, specifications or standards to establish an agreement.
"Is it saying that the content author/developer has to manually save the same text(which is already stored in the computer as one or more bytes) in the encoding specified by him/her? If yes, how to do it and why it is needed to do?"
Yes, it is always the case for every text file that a character encoding is chosen. Obviously, if the file already exists it is probably best not to change the encoding. You do it with some editor option (try the Save As… dialog or equivalent) or with some library property or configuration.
"save the text in that encoding too"
Actually, it's usually the other way around. You decide on the encoding you want or need to use, and the HTML editor or library updates the contents with a matching declaration and any newly necessary character entity references (e.g., does 🚲 need to be written as &amp;#x1F6B2;, that is, the literal text &#x26;#x1F6B2;? Does ¡ need to be written as the literal text &#x26;iexcl;?) as it writes or streams the document. (If your editor doesn't do that then get a real HTML editor.)
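As a rough illustration of how a library can add character references only where the target encoding needs them, here is a Python sketch using the built-in `xmlcharrefreplace` error handler. The sample string is made up, and real HTML tooling may emit named or hexadecimal references instead of the decimal ones shown here.

```python
snippet = "Bike: 🚲 ¡Hola!"

# ASCII can represent neither character, so both become numeric references.
as_ascii = snippet.encode("ascii", errors="xmlcharrefreplace")
print(as_ascii)

# ISO-8859-1 contains "¡" (byte 0xA1) but not the bicycle, so only the
# bicycle needs a reference.
as_latin1 = snippet.encode("iso-8859-1", errors="xmlcharrefreplace")
print(as_latin1)

# UTF-8 can represent every character, so no references are needed.
as_utf8 = snippet.encode("utf-8")
print(as_utf8)
```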

Notepad++ can recognize encoding?

I created a file with UTF-8 encoded content (using PHP fputcsv).
When I open this file in Notepad++, the characters are wrong (Notepad++ starts with ANSI encoding).
When I set Format->"Encode in UTF-8" from the menu, everything is fine.
I'm worried that Notepad++ should be able to recognize the encoding somehow, and that maybe something is wrong with the file I created with fputcsv. The first byte or something?
Automatically detecting an encoding is not something that can be done accurately. It's pretty much essential that the encoding be specified explicitly. It can be guessed in some cases, but even then not with 100% certainty.
This documentation (Encoding) explains the situation in relation to Notepad++.
They also point out that the difficulty arises especially if the file has not been saved with a Byte Order Mark (BOM).
Given that your file displays correctly once you manually set the encoding, I would say there's nothing wrong with how you are generating and saving the file. The only thing you can check for is whether a BOM is being saved, which might improve the chances of Notepad++ being able to automatically detect the encoding.
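The question uses PHP, but the BOM check itself is language-independent. Here is a minimal sketch in Python, with made-up CSV content, showing what "saved with a BOM" means at the byte level.

```python
import codecs

csv_text = "name;city\nZoë;Köln\n"

# "utf-8" writes no BOM; Python's "utf-8-sig" codec prepends EF BB BF.
without_bom = csv_text.encode("utf-8")
with_bom = csv_text.encode("utf-8-sig")


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


print(has_utf8_bom(without_bom))
print(has_utf8_bom(with_bom))

# Decoding with "utf-8-sig" accepts both forms and drops the BOM if present.
print(with_bom.decode("utf-8-sig") == without_bom.decode("utf-8-sig"))
```

To inspect a real file, read its first three bytes in binary mode and pass them to the same check.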
It's worth noting that although a BOM may help editors like Notepad++ identify the encoding more accurately, the Unicode Standard neither requires nor recommends a BOM for UTF-8.
You have to check the lower right corner of the Notepad++ GUI to see the actual encoding that is being used. The problem is not specific to Notepad++: guessing the right encoding is a hard problem without any real solution, so it's better to let the user decide which encoding is most appropriate in each case.
When you want to reflect the encoding of the text file in a Java program, you have to consider two things: encoding and character set. When you open a text file, you see the encoding under the "Encoding" menu. Additionally, look at the character set menu item. Under "Eastern European" you will find "ISO 8859-2", and under "Central European" you will find "Windows-1250". You can set the corresponding encoding in the Java program by looking it up in this table:
https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
For example, for the Central European character set "Windows-1250" the table suggests the Java encoding "Cp1250". Set that encoding and the characters will be displayed properly in your program.

Strange character rendered correctly in notepad, but as a control character elsewhere

I have a .csv list of businesses. The file has some strange characters in it. For example, in this field: Stocktonon-Tees, the first hyphen, between Stockton and on, seems to be a character with the value 6 rather than a hyphen, with the value 45. Stack Overflow will probably sanitize this so you can't see it, so here is a pastebin:
http://pastebin.com/NuyyaQy9
Can anyone explain why this could be? Is it some encoding issue that I have missed? Or a corruption in the dataset?
Yes, it's almost certainly an encoding issue. A file just consists of binary data - it's how you interpret that binary data that matters. It sounds like Notepad is guessing at the originally-intended encoding, but whatever else you're using isn't.
Unfortunately you haven't said anything about what software is trying to read the file or what wrote it in the first place - but you should look at what encoding Notepad thinks it is, and work from there.
If it's your code that wrote the file out, and you get to decide the encoding, I'd recommend UTF-8 as a good general purpose, platform-portable encoding.
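Before deciding on an encoding, it can help to look at the raw bytes. Here is a minimal Python sketch that lists unexpected control bytes; the sample line is reconstructed from the question's description (a byte with value 6 between "Stockton" and "on") and is an assumption, not the actual file.

```python
def find_control_bytes(data: bytes):
    """Return (offset, value) pairs for bytes below 0x20 other than tab, LF and CR."""
    allowed = {0x09, 0x0A, 0x0D}
    return [(i, b) for i, b in enumerate(data) if b < 0x20 and b not in allowed]


# Hypothetical sample standing in for one line of the .csv file.
sample = b"Stockton\x06on-Tees,Some Business\r\n"

for offset, value in find_control_bytes(sample):
    print(f"offset {offset}: byte value {value} (0x{value:02X})")
```

On a real file, read it with `open(path, "rb")` so that no decoding happens before the bytes are inspected.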

Input utf-8 characters in management studio

Hi,
[background]
We currently build files for many different companies. Our job as a company is basically to sit in between other companies and help with communication and data storage. We have begun to run into encoding issues where we receive data encoded in one format but need to send it out in another. All files were previously built using the .NET Framework default of UTF-8. However, we've discovered that certain companies cannot read UTF-8 files, I assume because they have older systems that require something else. This becomes apparent when sending certain French characters in particular.
I have a solution in place where we can build a specific file for a specific member using a specific encoding. (While I understand that this may not be enough, unfortunately this is as far as I can go at the moment due to other issues.)
[problem]
Anyway, I'm at the testing stage and I want to input UTF-8 or other characters into Management Studio, perform an update on some data, and then verify that the file is built correctly from that data. I realize that this is not perfect. I've already tried programmatically reading the file and verifying the encoding by reading preambles etc., so this is what I'm stuck with. According to this website http://www.biega.com/special-char.html ... I can input UTF-8 characters by pressing ALT+&+#+"decimal representation of character" or ALT+"decimal representation of character", but when I use the data specified by the table I get completely different characters in Management Studio. I've even saved the file in UTF-8 format using Management Studio by clicking the arrow on the save button in the save dialog and specifying the encoding. So my question is: how can I accurately enter a character so that it ends up being the character I intended, and actually put it in the data that will then be written to a file?
Thanks,
Kevin
I eventually found the solution. The website doesn't specify that you need to type ALT+0+"decimal character representation". The zero was left out. I'd been searching for this for ages.

Is there a way to get the encoding of a text file in UltraEdit?

Is there a setting in UltraEdit that allows me to see the encoding of the file?
In UltraEdit, the encoding that is being used to -display- the file is shown towards the right of the status bar, together with the line-ending type in use, for example, "U8-UNIX". You can also manually set the encoding in which the file is displayed. In version 10 this is under menu View -> Set Code Page. You can also -convert- the actual code page of the file under menu File -> Conversions.
If the file does not have a BOM header (a couple of bytes at the start of the file indicating the encoding), the -actual- encoding of the file can only be guessed. And even if the file has a BOM header, there can still be encoding issues.
All text editors do this guessing, and some are better at it than others. I haven't done a comparison to see which is best at it. At the moment (2012), I know UltraEdit fails to detect UTF-8 and other variants in 1000 line (or longer) text files if the first UTF-8 character only appears later in the document. It also fails to show the encoding properly when you set it manually.
Notepad++ is also not great at detecting it, but when you know the encoding, you can set it manually.
Sublime Text is, as far as I know, best at detecting the encoding, also in large files.
I think there are also some very good command line tools out there, ported from GNU to Windows, to detect encoding. My bet would be that that's going to be the best option.
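For what it's worth, the basic guess such tools make can be sketched in a few lines of Python. This is a simplified heuristic, not how UltraEdit or any particular tool works: the function name and the `cp1252` fallback are assumptions, and UTF-32 and other encodings are ignored.

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Guess an encoding from a BOM, then from whether the bytes are valid UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        # Strict decoding scans the whole input, so a non-ASCII character
        # that first appears late in a long file is still taken into account.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: this is only a guess at some single-byte code page.
        return fallback


long_text = "plain ASCII line\n" * 2000 + "é\n"
print(guess_encoding(long_text.encode("utf-8")))
print(guess_encoding(long_text.encode("cp1252")))
print(guess_encoding("é".encode("utf-8-sig")))
```

The fallback branch is where real detectors differ: telling one single-byte code page from another requires statistical guessing, which is why the result can never be fully certain.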