Browser's view-source: Can files be "downloaded" this way? - encoding

As you probably know, one can view the original response HTML code for any website URL by prefixing it with view-source: in the browser (e.g. view-source:https://www.google.de/).
Interestingly, this also works for URLs that lead to files of types other than HTML. For instance, view-source:https://d3.7-zip.org/a/7z2107.exe will show the .exe file (here of 7-Zip) as a byte stream (probably interpreted as latin1 or another encoding). You would get a similar result if you downloaded the .exe file normally and then opened it in Notepad.
My question is this: when I manually copy the code that view-source: gives me for a .exe file, paste it into Notepad and then save it as .exe, the file is of roughly the correct size but corrupted. Can anything be done to fix this?
(If you wonder why anyone would want to do this, the admittedly exotic case is browser automation with Selenium, which is not really able to download files normally, for a resource that is protected in such a way that it can practically only be downloaded by real browsers.)

When an application is compiled, there are static references to parts of the executable, calculated as offsets in bytes. These can be as broad as the .text and .data sections of the executable, or more low-level, like function call addresses and jumps.
If you open an exe in a real disassembler, you'll see that there are hard-coded jumps and function addresses, all expressed in bytes. When you pass an exe through a text editor, bytes that are not valid text in the editor's encoding are typically replaced or dropped, so these offsets no longer point where they should. Following such jumps would make the processor start running random code, which causes an exception. As a result, Windows no longer treats the file as a valid executable.
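The byte-level damage can be reproduced without a browser. The following is a minimal Python sketch, assuming the copy-and-paste path behaves like a decode followed by a re-encode; the helper name roundtrip and the sample data are illustrative and not taken from any browser or editor.

```python
def roundtrip(data: bytes, encoding: str) -> bytes:
    """Decode bytes as text, then encode the text back to bytes.

    This imitates (as an assumption) what happens when binary content
    is shown as text, copied, and saved again.
    """
    text = data.decode(encoding, errors="replace")
    return text.encode(encoding, errors="replace")


# Every possible byte value once, standing in for executable content.
sample = bytes(range(256))

survives_latin1 = roundtrip(sample, "latin-1") == sample
survives_utf8 = roundtrip(sample, "utf-8") == sample

print("latin-1 round trip intact:", survives_latin1)
print("utf-8 round trip intact:", survives_utf8)
```

latin-1 maps every byte to exactly one character, so that round trip is lossless in this sketch, while bytes that are not valid UTF-8 are replaced and cannot be recovered. Real editors and clipboards may introduce further changes, so even a latin-1 view is not guaranteed to survive a manual copy.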

Related

itext pdfreader not working in unix [duplicate]

I have some code that reads pdf files. The code fails at the line :
iTextSharp.text.pdf.PRTokeniser.CheckPdfHeader() at
iTextSharp.text.pdf.PdfReader.ReadPdf()
I know from other entries that this issue is coming from some invalid formatting in the pdf. However I'm not in a position to tell my users to redo their pdfs. Is there some other way around this issue, that can allow reading of the pdf despite this problem?
If a file doesn't start with %PDF- then there's nothing to fix: the file isn't a PDF file.
However, there may be another problem: maybe you're trying to access a file that has zero length due to some problem while creating the InputStream. Another context in which I've seen this happen, is a PDF loaded from a server, where the server returned a 404 message in HTML instead of a PDF file ;-)
Whenever that exception happens, you should store the bytes somewhere, and examine them. Without those bytes, nobody will be able to give you useful advice.
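As a rough illustration of "store the bytes and examine them", here is a small Python sketch (not iText code). The function names classify_header and preview are hypothetical, and the HTML check is only a heuristic for the 404-page case mentioned above.

```python
def classify_header(data: bytes) -> str:
    """Give a coarse description of what the received bytes look like."""
    if not data:
        return "empty"  # e.g. a stream that was never filled
    if data.startswith(b"%PDF-"):
        return "pdf"
    # Heuristic: an error page returned by a server instead of the PDF.
    if data.lstrip().lower().startswith((b"<!doctype", b"<html")):
        return "html"
    return "unknown"


def preview(data: bytes, length: int = 32) -> str:
    """Return the first bytes as space-separated hex for a log or bug report."""
    return data[:length].hex(" ")


print(classify_header(b"%PDF-1.7\n"), preview(b"%PDF-1.7\n"))
print(classify_header(b"<html><body>404 Not Found</body></html>"))
```

Logging the classification and the hex preview next to the exception usually makes it clear whether the input was empty, an error page, or something else entirely.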

Is there any way, any way at all, a Word document could become a PNG? (Probable case of cheating)

I think a student of mine renamed a PNG a Word document and intentionally submitted a corrupted file to buy more time (or something) on an assignment. The student denies everything and claims it was a computer malfunction. Before I submit an honor code violation I want to be sure that there's no explanation that does not involve cheating that I'm somehow overlooking.
Basically, I'm a TA and a student submitted a paper, let's say it was Smith.docx. When I was working on grading and went to open Smith.docx Word wouldn't open it and said that it was corrupted. I eventually had the idea of opening it in a text editor and there it was a massive jumbled file of all sorts of odd characters (total file size: 180kb for what was supposed to be a 5 page paper).
I noticed, though, that the first few characters of the file were:
‰PNG
I renamed the file Smith.png and it opened. Bizarrely, it was an image of the first page of a Word document. More specifically, it looks like a screenshot of a Word doc cropped so as to show just the page. What makes it seem like a screenshot is that the cursor thingy (the vertical bar marking where you're typing) shows up next to the title.
An additional interesting bit of data is that if I scroll further down in the file (opened in notepad) I come to this:
XML:com.adobe.xmp <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 5.4.0">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:exif="http://ns.adobe.com/exif/1.0/">
<exif:PixelXDimension>996</exif:PixelXDimension>
<exif:PixelYDimension>1286</exif:PixelYDimension>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
I'm not sure what all that means, but 996x1286 are the dimensions of the png image. The rest suggests to me that the file was created in some Adobe program, but I'm not sure if that's right or how to figure out more about that.
So, my actual question: Is there any conceivable explanation of any kind for how I would come to have a file called Smith.docx that is a perfectly functioning png of what sure looks like a screenshot of the first page of a Word document, other than that the student did it on purpose? The student claimed that their computer was "corrupting" files and that they had to take it to Apple for service. I find this incredibly implausible (the student has also not provided the receipt for this, which I requested).
Additionally, other than the case I laid out here, is there any positive evidence for my theory (that it was a straightforward case of cheating) that I can present to strengthen my case? E.g., is the data from the file that I posted above a smoking gun that it was created in an Adobe program, or is there any conceivable way that could come out of a Word document or other sort of corrupted file?
Also, is there anything else I can look for in the PNG file that would be a smoking gun?
Thanks in advance for any help you might be able to offer!
Just rename the file with .png at the end instead of .docx; if it really is a PNG, it should open just fine as a PNG.
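The file type can also be checked from the leading bytes instead of the extension. Below is a minimal Python sketch; the function name sniff is made up for illustration. It relies on two well-known signatures: PNG files begin with the bytes 89 50 4E 47 0D 0A 1A 0A, and a .docx file is a ZIP container that normally begins with PK\x03\x04.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
ZIP_SIGNATURE = b"PK\x03\x04"  # .docx files are ZIP containers


def sniff(data: bytes) -> str:
    """Identify a file by its leading bytes rather than by its name."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(ZIP_SIGNATURE):
        return "zip container (for example docx)"
    return "unknown"


# Byte 0x89 is shown as the per-mille sign in Windows-1252,
# which is why a text editor displays the header as "‰PNG".
print(b"\x89PNG".decode("cp1252"))
print(sniff(PNG_SIGNATURE + b"rest of file"))
```

To use it on a real file, read the first few bytes in binary mode and pass them to sniff.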
The key is that you see the cursor in the screenshot: there is no way Word would (somehow) export a docx file as a png AND draw the typing cursor. Also, any tool that could do that would save the file as png, not docx; only the user could deliberately change the file extension.
Also, does the screenshot show an empty document, or does it look like the final document your student eventually delivered?
Short answer:
The student is lying and is in fact a cheater (in my opinion).
Also, even if they were telling the truth, it is still their responsibility to have their work done, ready, and fully functional on time. Your computer is corrupting your files? Tough cookies. No one cares. You should have done your work on another computer. In the real world, excuses don't get you anywhere and they shouldn't get you anywhere in school either.
Lastly, it is very easy to re-name an extension of another file type and claim it's corrupt and very unlikely that a computer is just creating corrupted files. If their computer would otherwise create corrupted files, I would imagine it would be nearly impossible to get the computer to boot. In other words, they probably wouldn't have been able to turn on their "corrupted" computer to create "corrupted" files in the first place.

iText form filling missing PDF content

I am running into an odd problem with iText. I have a document with a few fields. On my server, I open the local document, set the fields and send the output of the stamper to the browser.
Works perfectly on my local devel machine.
The pdf generated on the server is missing the PDF contents. I only see the content of the fields I set, the rest is completely blank.
Any tips?
Your application on your local machine respects the bytes of the PDF you're using as a template. Your application on the server doesn't respect those bytes. Maybe you've copied the template using the wrong encoding, making all the binary characters corrupt. Or maybe your application is reading the template using the wrong encoding with the same result.
You can find out by opening your PDF file in a text editor (not inside a PDF viewer). Look for the keyword stream and inspect the bytes that follow this keyword. Do you see the difference? In the PDF produced on your local machine, the bytes look like a normal binary stream. In the PDF produced on your server, the bytes look awkward; for instance, they consist of plenty of question marks.
How to solve: check if the template was copied correctly. If so, check the way you're reading the document. For instance: read the PDF template into a byte array without using iText and write it to a new byte array. Can you reproduce the process of corruption? If so, tweak your application (the one that doesn't involve iText) until you've got the correct encoding.
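The question-mark symptom can be reproduced in isolation. This is a minimal Python sketch, not iText code; the sample bytes only imitate the start of a compressed PDF stream, and the ASCII round trip stands in (as an assumption) for whatever wrong-encoding step exists on the server.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 of the data, handy for comparing two copies of a template."""
    return hashlib.sha256(data).hexdigest()


# Illustrative fragment: the keyword "stream" followed by binary bytes.
template = b"stream\nx\x9c\xed\xc1\nendstream"

# A faithful copy treats the template purely as bytes.
faithful = bytes(template)

# A broken copy pushes the bytes through a text encoding on the way.
mangled = template.decode("ascii", errors="replace").encode("ascii", errors="replace")

print(digest(faithful) == digest(template))
print(mangled)
```

Comparing digests of the template on both machines, and then of the bytes your application actually reads, narrows down the step at which the corruption appears.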

firefox addon development and Unicode

So I started developing my firefox addon.
Most of the work is performed by a referenced javascript file.
The problem is that when I edit some of the HTML elements on the page and, say, set their text, it is written as pure gibberish. I am writing the text in Hebrew. I can't for the life of me figure out the reason.
Any ideas?
Javascript strings are already Unicode at runtime. However, you have to make sure that your files are encoded correctly.
Always use utf-8 (without BOM) file encoding for all your js, XUL, DTD, properties files to be sure.
Firefox might try to guess the file character set incorrectly otherwise, and even worse some stuff might not even try guessing the encoding and instead simply always assume utf-8.
Better yet, do not hard-code strings in js/xul, but use DTD/properties files for localization (XUL tutorial, XUL School).
This snippet, for example, works pretty well for me (on this very page):
document.getElementsByTagName("h1")[0].textContent="русский язык";
(Just fire up the Firefox Web Console)
"Inline" Hebrew embedded in js files might create additional problems because it is right-to-left and bidi sucks, so the localization approach should be preferred.
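The file-encoding point above can be demonstrated outside Firefox. The following is a minimal Python sketch, assuming the file was saved as UTF-8 but read as a single-byte encoding (latin-1 is used here purely as an example of a wrong guess).

```python
# The Hebrew word "shalom", written with escapes so that this example
# does not itself depend on how the source file is encoded.
original = "\u05e9\u05dc\u05d5\u05dd"

raw = original.encode("utf-8")    # bytes as stored in a UTF-8 file
misread = raw.decode("latin-1")   # what a wrong charset guess produces

# The damage is reversible only because latin-1 keeps every byte.
recovered = misread.encode("latin-1").decode("utf-8")

print(len(original), len(raw), len(misread))
print(recovered == original)
```

Four Hebrew letters become eight unrelated characters when the charset is guessed wrong, which matches the "pure gibberish" symptom described in the question.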

How is mime type of an uploaded file determined by browser?

I have a web app where the user needs to upload a .zip file. On the server-side, I am checking the mime type of the uploaded file, to make sure it is application/x-zip-compressed or application/zip.
This worked fine for me on Firefox and IE. However, when a coworker tested it, it failed for him on Firefox (sent mime type was something like "application/octet-stream") but worked on Internet Explorer. Our setups seem to be identical: IE8, FF 3.5.1 with all add-ons disabled, Windows XP SP3, WinRAR installed as native .zip file handler (not sure if that's relevant).
So my question is: How does the browser determine what mime type to send?
Please note: I know that the mime type is sent by the browser and, therefore, unreliable. I am just checking it as a convenience--mainly to give a more friendly error message than the ones you get by trying to open a non-zip file as a zip file, and to avoid loading the (presumably heavy) zip file libraries.
Chrome
Chrome (version 38 as of writing) has 3 ways to determine the MIME type and does so in a certain order. The snippet below is from file src/net/base/mime_util.cc, method MimeUtil::GetMimeTypeFromExtensionHelper.
// We implement the same algorithm as Mozilla for mapping a file extension to
// a mime type. That is, we first check a hard-coded list (that cannot be
// overridden), and then if not found there, we defer to the system registry.
// Finally, we scan a secondary hard-coded list to catch types that we can
// deduce but that we also want to allow the OS to override.
The hard-coded lists come a bit earlier in the file: https://cs.chromium.org/chromium/src/net/base/mime_util.cc?l=170 (kPrimaryMappings and kSecondaryMappings).
An example: when uploading a CSV file from a Windows system with Microsoft Excel installed, Chrome will report this as application/vnd.ms-excel. This is because .csv is not specified in the first hard-coded list, so the browser falls back to the system registry. HKEY_CLASSES_ROOT\.csv has a value named Content Type that is set to application/vnd.ms-excel.
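The lookup order described in the Chromium comment can be sketched as follows. This is a Python illustration of the order only, not Chromium code: the two tables are tiny made-up stand-ins for kPrimaryMappings and kSecondaryMappings, and the os_registry argument is a hypothetical substitute for the system registry lookup.

```python
from typing import Mapping, Optional

# Illustrative stand-ins, NOT the real Chromium tables.
PRIMARY = {"html": "text/html", "png": "image/png"}
SECONDARY = {"csv": "text/csv", "zip": "application/zip"}


def mime_from_extension(ext: str, os_registry: Mapping[str, str]) -> Optional[str]:
    """Resolve an extension: primary list, then the OS, then secondary list."""
    ext = ext.lower().lstrip(".")
    if ext in PRIMARY:          # hard-coded, cannot be overridden
        return PRIMARY[ext]
    if ext in os_registry:      # what the operating system reports
        return os_registry[ext]
    return SECONDARY.get(ext)   # fallback that the OS was allowed to override


# A machine where Excel has registered .csv, and one where nothing has.
with_excel = {"csv": "application/vnd.ms-excel"}
print(mime_from_extension(".csv", with_excel))
print(mime_from_extension(".csv", {}))
```

The same extension resolves differently depending on what the operating system reports, which mirrors the CSV example above.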
Internet Explorer
Again using the same example, the browser will report application/vnd.ms-excel. I think it's reasonable to assume Internet Explorer (version 11 as of writing) uses the registry. Possibly it also makes use of a hard-coded list like Chrome and Firefox, but its closed source nature makes it hard to verify.
Firefox
As indicated in the Chrome code, Firefox (version 32 as of writing) works in a similar way. Snippet from file uriloader\exthandler\nsExternalHelperAppService.cpp, method nsExternalHelperAppService::GetTypeFromExtension
// OK. We want to try the following sources of mimetype information, in this order:
// 1. defaultMimeEntries array
// 2. User-set preferences (managed by the handler service)
// 3. OS-provided information
// 4. our "extras" array
// 5. Information from plugins
// 6. The "ext-to-type-mapping" category
The hard-coded lists come earlier in the file, somewhere near line 441. You're looking for defaultMimeEntries and extraMimeEntries.
With my current profile, the browser will report text/csv because there's an entry for it in mimeTypes.rdf (item 2 in the list above). With a fresh profile, which does not have this entry, the browser will report application/vnd.ms-excel (item 3 in the list).
Summary
The hard-coded lists in the browsers are pretty limited. Often, the MIME type sent by the browser will be the one reported by the OS. And this is exactly why, as stated in the question, the MIME type reported by the browser is unreliable.
Kip, I spent some time reading RFCs, MSDN and MDN. Here is what I could understand. When a browser encounters a file for upload, it looks at the first buffer of data it receives and runs tests on it. These tests first determine whether the file is of a known MIME type and, if so, narrow down which one; the browser then acts accordingly. I think IE tries this first rather than just determining the file type from the extension. This page explains it for IE: http://msdn.microsoft.com/en-us/library/ms775147%28v=vs.85%29.aspx. For Firefox, what I could understand was that it reads file info from the filesystem or directory entry and then determines the file type. Here is a link for Firefox: https://developer.mozilla.org/en/XPCOM_Interface_Reference/nsIFile. I would still like to have more authoritative info on this.
This is probably OS dependent and possibly browser dependent, but on Windows, the MIME type for a given file extension can be found by looking in the registry under HKCR.
For example:
HKEY_CLASSES_ROOT\.zip
- Content Type
To go from MIME type to file extension, you can look at the keys under
HKEY_CLASSES_ROOT\Mime\Database\Content Type
to get the default extension for a particular MIME type.
While this is not an answer to your question, it does solve the problem you are trying to solve. YMMV.
As you wrote, mime type is not reliable as each browser has its way of determining it. However, browsers send the original name (including extension) of the file. So the best way to deal with the problem is to inspect extension of the file instead of the MIME type.
If you still need the mime type, you can use your own apache's mime.types to determine it server-side.
I agree with johndodo: so many variables affect the MIME types sent by browsers that they are unreliable. I would ignore the subtype that is received and just check the top-level type, such as 'application'. If your app is PHP based, you can easily do this with the explode() function.
In addition, check the file extension to make sure it is .zip or whichever compression format you are looking for!
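A server-side check that ignores the browser-supplied MIME type could look like the sketch below. It is written in Python rather than PHP, and the function name looks_like_zip_upload is hypothetical. It combines the extension check suggested above with zipfile.is_zipfile from the standard library, which inspects the data without extracting anything.

```python
import io
import zipfile


def looks_like_zip_upload(filename: str, payload: bytes) -> bool:
    """Accept an upload only if both its name and its content look like a ZIP."""
    if not filename.lower().endswith(".zip"):
        return False
    # is_zipfile accepts a file-like object as well as a path.
    return zipfile.is_zipfile(io.BytesIO(payload))


# Build a small ZIP in memory to exercise the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("hello.txt", "hello")
real_zip = buffer.getvalue()

print(looks_like_zip_upload("upload.zip", real_zip))
print(looks_like_zip_upload("upload.zip", b"definitely not a zip"))
```

This gives the friendlier early error message the question asks for without trusting either the MIME type or the extension alone.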
According to rfc1867 - Form-based file upload in HTML:
Each part should be labelled with an appropriate content-type if the
media type is known (e.g., inferred from the file extension or
operating system typing information) or as application/octet-stream.
So my understanding is, application/octet-stream is kind of like a blanket catch-all identifier if the type cannot be inferred.
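The fallback rule quoted from the RFC can be sketched in a few lines of Python using the standard mimetypes module. This only illustrates the rule; it is not how any particular browser implements it, and content_type_for is a made-up name.

```python
import mimetypes


def content_type_for(filename: str) -> str:
    """Infer a type from the file name, else fall back to the catch-all."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"


print(content_type_for("page.html"))
print(content_type_for("mystery.zzzunknownext"))
```

As with browsers, the result for known extensions can depend on the type tables available on the machine running the code.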