DFM file became binary and infected - forms

We have a DFM file which began as text file.
After some years, in one of our newer versions, the Borland Developer Studio changed it into binary format.
In addition, the file became infected.
Can someone explain me what should I do now? Where can I find how binary file structure is read?

Well, I found what happens to the DFM file, but I don't know why.
The occurence of changing from text file to binary one is known, and could be found in stack overflow in another question. I'll describe only the infection of the file.
In Pascal, the original language of DFM files, a string defines so: first byte is the length of the string (0-255) and the other characters are the string. (Unlike C, which its strings length are recognized by a null character).
Someone (maybe BDS?) while changing the file from text file to binary one, also changed all string of length 13 (0D) to be length 10 (0A). This way, the string finished after 10 chars, and the next char was a value of the property.
I downloaded binary editor, fixed all occurences of length 10, and the file was displayed and compiled well.
(Not only properties' length infected, but also one byte on Icon.Data property was replaced from 0D to 0A)

Related

Compare filenames with different encoding in Octave

I'm trying to accomplish following task in Octave:
Read filename from text file
Search for this file in particular location on hard drive
My script works for most files, but for certain files containing unicode characters I'm unable to match the filename from textfile with filename as it appears in the file system.
Filenames in textfile are in UTF-8 encoding and I read them in Octave with function fgetl().
Filenames from file system are obtained via function readdir(). I'm on Windows, NTFS file system.
For example, one problematic filename contains character "Č".
When printed out in Octave console, the characters appear exactly the same. However, a HEX viewer reveals that the characters are not actually the same. In the first case the character is encoded as 0x010C, in the second case as 0x0043 + 0x030C. Comparing both of them via strcmp() fails, of course.
What I tried to do is to omitt all non-ASCII characters from the filename and then compare them. But this didn't work, probably because in the second variant the first part of the character (0x0043) is actually ASCII.
Now I'm looking for some way of converting one format to another to be able to compare them. Any ideas?
EDIT:
As I discovered later, the character Č in the filename on Windows is actually written as C+ˇ, which is just another way you can write that character. So the difference probably insn't in encoding standard, but in 2 different ways to achieve 1 visible character (glyph).
This question basically then changes to a task of matching characters written "at once" and corresponding pair of letter+combining character.

Reading file names inside .zip file

I am familiar with the .zip file format, and able to read the internal file table content so far.
The problem occurs with non-english characters in the file name.
The specification states that file names use OEM character set, yet sometimes I get UTF-8 representation and sometimes I get OEM represantation.
The specification states the "version made by" field should be in range 0-20, yet I get versions 31 and 63 which may or may not affect the character set.
Another related problem: When I read the "extra field" there is "up" (unicode path, id=0x7075) which suppose to store the utf-8 represantation of the filename, well, it starts with 5 redundant bytes before the actual utf-8 string (Created by WinRar), yet the other softwares seems to read it correctly.
Any input about the issue?

Determine whether file is a PDF in perl?

Using perl, what is the best way to determine whether a file is a PDF?
Apparently, not all PDFs start with %PDF. See the comments on this answer: https://stackoverflow.com/a/941962/327528
Detecting a PDF is not hard, but there are some corner cases to be aware of.
All conforming PDFs contain a one-line header identifying the PDF specification to which the file conforms. Usually it's %PDF-1.N where N is a digit between 0 and 7.
The third edition of the PDF Reference has an implementation note that Acrobat viewer require only that the header appears within the first 1024 bytes of a file. (I've seen some cases where a job control prefix was added to the start of a PDF file, so '%PDF-1.' weren't the first seven bytes of the file)
The subsequent implementation note from the third edition (PDF 1.4) states: Acrobat viewers will also accept a header of the form: %!PS-Adobe-N.n PDF-M.m but note that this isn't part of the ISO32000:2008 (PDF 1.7) specification.
If the file doesn't begin immediately with %PDF-1.N, be careful because I've seen a case where a zip file containing a PDF was mistakenly identified as a PDF because that part of the embedded file wasn't compressed. so a check for the PDF file trailer is a good idea.
The end of a PDF will contain a line with '%%EOF',
The third edition of the PDF Reference has an implementation note that Acrobat viewer requires only that the %%EOF marker appears within the last 1024 bytes of a file.
Two lines above the %%EOF should be the 'startxref' token and the line in between should be a number for the byte offset from the start of the file to the last cross reference table.
In sum, read in the first and last 1kb of the file into a byte buffer, check that the relevant identifying byte string tokens are approximately where they are supposed to be and if they are then you have a reasonable expectation that you have a PDF file on your hands.
The module PDF::Parse has method called IsaPDF which
Returns true, if the file could be parsed and is a PDF-file.

understanding different character encodings

When I save a text document in UTF-8 that's basically saying: Computer, use the codepage for UTF8 that's installed somewhere on your computer to figure out, how to turn the 1's and 0's to characters, right?
When I save this content:
激光
äüß
#§
in ISO-8895-1, it becomes this (on Linux, using Kate editor):
æ¿å
äüÃ
#§
What is not displayed here is that in the first and second row that are some weird squares displayed instead of characters (can be seen in developer tools).
So my understanding is that this means that the combination of 0's and 1's that represent 激 in utf-8 is mapped to æ in ISO-8895-1, right? And the weird squares > < happen because there is no mapping for that binary number in the ISO-8895-1 character set so the computer defaults to some other encoding.
Is that correct?
Yes, sort of correct.
If you store a file as UTF-8, it usually gets a special byte combination that indicates its type of encoding at the beginning of the file. I think, Kate (don't know this editor) doesn't recognize this and just displays the file as something else. So basically, your file is still correct, but was just visualized in a wrong way.
The weird squares are another indicator, that Kate doesn't recognize those leading bytes, cause usually editors hide them from the user and just use the information to display the file correctly.
You have it pretty much right. The character U+6FC0 (激) for example is encoded with 3 bytes in UTF-8: 0xE6 0xBF 0x80.
If you interpret these bytes in ISO-8859-1, you get the characters æ¿. Depending on the version of ISO-8859-1, 0x80 is either not mapped to a character at all, or is mapped to a non-printable control character, that's why you can see only two characters for the three bytes.
If you use Windows-1252 instead of ISO-8859-1 you'll see æ¿€.

Why importdata is not working here?

I have two data text files with the same text content but they have different sizes. The following snapshot compares them (using Beyond Compare).
It seems that the hex content of the files is different.
The MATLAB function importdata reads fine the file to the left but gives the following error with the file to the right (the bigger size file):
Unable to load file. Use TEXTSCAN or FREAD for more complex formats.
What exactly is the difference between the two files ?
How to make importdata work with the file to the right ?
The problem and solution is already mentioned in the comments:
it seems like one text file is in ascii encoding (that is 8 bits per char) while the second one is unicode (16 bits per char). try converting the big file into simple ascii and re-read it
You may see the difference already in BeyondCompare: On the left, you have an ANSI-File, on the right you have a unicode with BOM (The hex-code looks like UTF-16LE. I'm not sure, which version the BeyondCompare you used to get the screenshot. My BeyondCompare would not show 'Unicode' but 'UTF16-LE' ...).