md5 checksum of pdf file - hash

Please have a look at the issue below.
1 - Applying MD5 to a .txt file containing "Hello" (without quotes, length = 5) gives some hash value (say h1).
2 - Now the file content is changed to "Hello " (without quotes, length = 6). It gives a different hash value (say h2).
3 - Now the file is changed back to "Hello" (exactly as in step 1). The hash is h1 again, which makes sense.
The problem comes when the same procedure is applied to a .pdf file. Here, rather than changing the file content, I am changing the colour of the text and then reverting to the original. In this way I am getting three different hash values.
So, is the hash different because of the way the PDF reader encodes the text and metadata, or is the analogy itself wrong?
Info: using a freeware tool on Windows to calculate the hash.
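The text-file behaviour in steps 1 to 3 can be reproduced with a short Python sketch. It uses Python's hashlib rather than the Windows freeware mentioned above, and the byte strings simply stand in for the file contents:

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the given bytes as a hex string."""
    return hashlib.md5(data).hexdigest()

h1 = md5_hex(b"Hello")    # step 1: 5 bytes
h2 = md5_hex(b"Hello ")   # step 2: trailing space added, 6 bytes
h3 = md5_hex(b"Hello")    # step 3: reverted to the original content

print(h1 == h3)  # True: identical bytes always give the identical hash
print(h1 == h2)  # False: one extra byte changes the hash
```

The hash depends only on the bytes in the file, so any difference between the PDF hashes means the saved bytes differ, even if the rendered page looks the same.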

So, is it because of the way pdf reader encode the text and meta-data, hash is different or the analogy itself is wrong?
It is the former: the metadata stored in the PDF changes. If you need to test this on your own data, open any PDF in a text editor (I use Notepad++) and scroll to the bottom (where the metadata is stored). You'll see something akin to:
<</Subject (Shipping Documents)
/CreationDate (D:20150630070941-06'00')
/Title (Shipping Documents)
/Author (SomeAuthor)
/Producer (iText by lowagie.com \(r0.99 - paulo118\))
/ModDate (D:20150630070941-06'00')
>>
Obviously, /CreationDate and /ModDate at the very least will keep changing. Even if you regenerate a PDF from identical source data, those timestamps alone are enough to change the checksum of the resulting PDF.
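A small Python sketch of that effect. The two byte strings are made-up stand-ins for two exports of the same document that differ only in the /ModDate value, and the second timestamp is invented for illustration:

```python
import hashlib

# Hypothetical stand-ins for two exports of the same document.
TEMPLATE = (
    b"<</Title (Shipping Documents)\n"
    b"/CreationDate (D:20150630070941-06'00')\n"
    b"/ModDate (%s)\n"
    b">>"
)
first_export = TEMPLATE % b"D:20150630070941-06'00'"
second_export = TEMPLATE % b"D:20150630071502-06'00'"  # invented later timestamp

digest_a = hashlib.md5(first_export).hexdigest()
digest_b = hashlib.md5(second_export).hexdigest()

print(digest_a == digest_b)  # False: the timestamp alone changes the checksum
```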

Correct: PDFs which look exactly the same can have different checksums because of metadata stored in the file, such as ModDate. I needed to detect PDFs which look the same, so I wrote a kinda-hacky TypeScript method. This isn't guaranteed to work, but at least it detects duplicates some of the time (normal checksums will rarely detect duplicate PDFs).
You can read more about the PDF format here: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf and see some similar solutions in the related SO question "Why does repeated bursting of a multi-page PDF into individual pages via pdftk change the md5 checksum of those pages?"
/**
 * The PDF format is weird, and contains various header information and other metadata.
 * Most (all?) actual pdf contents appear between keywords `stream` and `endstream`.
 * So, to ignore metadata, this function just extracts any contents between "stream" and "endstream".
 * This is not guaranteed to find _all_ contents, but it _should_ ignore all metadata.
 * Useful for generating checksums.
 */
private getRawContent(buffer: Buffer): string {
  const str = buffer.toString();
  // FIXME: If the binary stream itself happens to contain "endstream" or "ModDate", this won't work.
  const streamParts = str.split('endstream').filter(x => !x.includes('ModDate'));
  if (streamParts.length === 0) {
    return str;
  }
  const rawContent: string[] = [];
  for (const streamPart of streamParts) {
    // Ignore everything before the first `stream`
    const streamMatchIndex = streamPart.indexOf('stream');
    if (streamMatchIndex >= 0) {
      const contentStartIndex = streamMatchIndex + 'stream'.length;
      const rawPartContent = streamPart.substring(contentStartIndex);
      rawContent.push(rawPartContent);
    }
  }
  return rawContent.join('\n');
}
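For comparison, here is a rough Python sketch of the same idea that goes one step further and hashes the extracted stream bodies. It is not a port of the function above: it works on raw bytes with a regular expression, only checks the stream body (not the text before it) for "ModDate", and the names `STREAM_RE` and `content_checksum` and the two sample byte strings are made up for illustration. It shares the same weakness if the binary data happens to contain "endstream".

```python
import hashlib
import re

# Non-greedy match of everything between "stream" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n?(.*?)endstream", re.DOTALL)

def content_checksum(pdf_bytes: bytes) -> str:
    """MD5 over the stream bodies only, skipping bodies that mention ModDate."""
    bodies = [b for b in STREAM_RE.findall(pdf_bytes) if b"ModDate" not in b]
    if not bodies:
        # No usable streams found: fall back to hashing the whole file.
        return hashlib.md5(pdf_bytes).hexdigest()
    return hashlib.md5(b"\n".join(bodies)).hexdigest()

# Two hypothetical files: same stream, different /ModDate in the trailer.
pdf_a = b"1 0 obj\nstream\nBT (Hello) Tj ET\nendstream\nendobj\n<</ModDate (D:2015)>>"
pdf_b = b"1 0 obj\nstream\nBT (Hello) Tj ET\nendstream\nendobj\n<</ModDate (D:2016)>>"

print(hashlib.md5(pdf_a).hexdigest() == hashlib.md5(pdf_b).hexdigest())  # False
print(content_checksum(pdf_a) == content_checksum(pdf_b))                # True
```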

Limitations in FDW code on passing List* metadata between GetForeignPlan() and BeginForeignScan()

I'm writing an FDW for a non-SQL data source. Platform is Windows 10, C (MS Visual Studio), PostgreSQL 14. My FDW code is modeled after example FDWs I have studied, such as SQLite, JSON, CSV, File, DB2 and others. There is a common practice of storing metadata in a pg List as part of GetForeignPlan() and passing that via fdw_private. This list is then retrieved in BeginForeignScan() and made available for IterateForeignScan().
My question is how to share a large amount of metadata across the fdw_private mechanism? Using the pg List macros, I have been unable to store and retrieve more than 5 List cells. I tried passing a single List cell with a JSON string containing all of my metadata, but the string becomes corrupted along the way.
List* mdList = NIL;
char feJSON[MAX_FESTATE_JSON_SIZE];
/// ... some code to format a JSON string into the feJSON buffer.
mdList = list_make1(makeString(feJSON));
return mdList;
I have also used lappend() to try and extend the List by more than 5 cells, but the additional cells' values are not maintained across the callbacks...
#define serializeInt(x) makeConst(INT4OID, -1, InvalidOid, 4, Int32GetDatum((int32)(x)), false, true)
result = list_make5(makeInteger(feState->start), makeInteger(feState->rows), makeString(feState->ltName), makeString(feState->ftName), makeInteger(feState->myTable->npgcols));
result = lappend(result, serializeInt(feState->myTable->ncols));
There is a hint in the pg source plannodes.h suggesting the use of bytea (byte array?) as an alternative to the pg List structure, but I'm not finding any examples for that, so far.
I'm suspecting certain characters in JSON strings may be part of the issue, but I also found that...
#define MAX_FESTATE_JSON_SIZE 2048
List* serializeMetadata(...)
{
    List* mdList = NIL;
    /// ...stuff...
    char smokeTest[MAX_FESTATE_JSON_SIZE + 1];
    memset(smokeTest, 'X', MAX_FESTATE_JSON_SIZE);
    smokeTest[MAX_FESTATE_JSON_SIZE] = '\0';
    mdList = list_make1(makeString(smokeTest));
    return mdList;
}
...revealed some truncation of the List as a return value (but it's a pointer!?). So I'm not sure if casting a bytea* as a List* will help, but that's where I'm headed.
Suggestions are most welcome!

What is contained in the "function workspace" field in .mat file?

I'm working with .mat files which are saved at the end of a program. The command is save foo.mat so everything is saved. I'm hoping to determine if the program changes by inspecting the .mat files. I see that from run to run, most of the .mat file is the same, but the field labeled __function_workspace__ changes somewhat.
(I am inspecting the .mat files via scipy.io.loadmat -- just loading the files and printing them out as plain text and then comparing the text. I found that save -ascii in Matlab doesn't put string labels on things, so going through Python is roundabout, but I get labels and that's useful.)
I am trying to determine from where these changes originate. Can anyone explain what __function_workspace__ contains? Why would it not be the same from one run of a given program to the next?
The variables I am really interested in are the same, but I worry that I might be overlooking some changes that might come back to bite me. Thanks in advance for any light you can shed on this problem.
EDIT: As I mentioned in a comment, the value of __function_workspace__ is an array of integers. I looked at the elements of the array and it appears that these numbers are ASCII or non-ASCII character codes. I see runs of characters which look like names of variables or functions, so that makes sense. But there are also some characters (non-ASCII) which don't seem to be part of a name, and there are a lot of null (zero) characters too. So aside from seeing names of things in __function_workspace__, I'm not sure what that stuff is exactly.
SECOND EDIT: I found that after commenting out calls to plotting functions, the content of __function_workspace__ is the same from one run of the program to the next, so that's great. At this point the only difference from one run to the next is that there is a __header__ field which contains a timestamp for the time at which the .mat file was created, which changes from run to run.
THIRD EDIT: I found an article, http://nbviewer.jupyter.org/gist/mbauman/9121961 "Parsing MAT files with class objects in them", about reverse-engineering the __function_workspace__ field. Thanks to Matt Bauman for this very enlightening article and thanks to @mpaskov for the pointer. It appears that __function_workspace__ is an undocumented catch-all for various stuff, only one part of which is actually a "function workspace".
1) Diffing .mat files
You may want to take a look at DiffPlug. It can do diffs of MAT files and I believe there is a command line interface for it as well.
2) Contents of __function_workspace__
SciPy's __function_workspace__ refers to a special variable at the end of a MAT file that contains extra data needed for reference types (e.g. table, string, handle, etc.) and various other stuff that is not covered by the official documentation. The name is misleading as it really refers to the "Subsystem" (briefly mentioned in the official spec as an offset in the header).
For example, if you save a reference type, e.g., emptyString = "", the resulting .mat will contain the following two entries:
(1) The variable itself. It looks sort of like a UInt32 matrix, but is actually an Opaque MCOS Reference (MATLAB Class Object System) to a string object at some location in the subsystem.
[0] Compressed (81 bytes, position = 128)
  [0] Matrix (144 bytes, position = 0)
    [0] UInt32[2] = [17, 0] // Opaque
    [1] Int8[11] = ['emptyString'] // Variable Name
    [2] Int8[4] = ['MCOS'] // Object Type
    [3] Int8[6] = ['string'] // Class Name
    [4] Matrix (72 bytes, position = 72)
      [0] UInt32[2] = [13, 0] // UInt32
      [1] Int32[2] = [6, 1] // Dimensions
      [2] Int8[0] = [''] // Variable Name (not needed)
      [3] UInt32[6] = [-587202560, 2, 1, 1, 1, 1] // Data (Reference Target)
(2) A UInt8 matrix without name (SciPy renamed this to __function_workspace__) at the end of the file. Aside from the missing name it looks like a standard matrix, but the data is actually another MAT file (with a reduced header) that contains the real data.
[1] Compressed (251 bytes, position = 217)
  [0] Matrix (968 bytes, position = 0)
    [0] UInt32[2] = [9, 0] // UInt8
    [1] Int32[2] = [1, 920] // Dimensions
    [2] Int8[0] = [''] // Variable Name
    [3] ... 920 bytes ... // Data (Nested MAT File)
The format of the data is unfortunately completely undocumented and somewhat of a mess. I could post the contents of the Subsystem, but it gets somewhat overwhelming even for such a simple case. It's essentially a MAT file that contains a struct that contains a special variable (MCOS FileWrapper__) that contains a cell array with various values, including one that magically encodes various Object Properties.
Matt Bauman has done some great reverse engineering efforts (Parsing MAT files with class objects in them) that I believe all supporting implementations are based on. The MFL Java library contains a full (read-only) implementation of this (see McosFileWrapper.java).
Some updates on Matt Bauman's post that we found are:
The MCOS reference can refer to an array of handle objects and may have more than 6 values. It contains sizing information followed by an array of indices (see McosReference.java).
The Object Id field looks like a unique id, but the order seems random and sometimes doesn't match. I don't know what this value is, but completely ignoring it seems to work well :)
I've seen Segment 5 populated in .fig files, but I haven't been able to narrow down what's in there yet.
Edit: Fyi, once the string object is correctly parsed and all properties are filled in, the actual string value is encoded in yet another undocumented format (see testDoubleQuoteString)
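As a rough illustration of the "offset in the header" mentioned above, the Python sketch below reads the subsystem offset from the 128-byte header of a Level 5 MAT file. The byte layout (116 bytes of text, 8 bytes of subsystem offset, 2 bytes of version, 2 bytes of endian indicator) is my reading of the published MAT-file format description, and treating an all-spaces field as "no subsystem" is likewise an assumption. The function name `read_subsystem_offset` and the synthetic header are made up; the value 217 is borrowed from the position shown in the dump above purely as sample data.

```python
import struct

def read_subsystem_offset(header: bytes) -> int:
    """Return the subsystem data offset from a 128-byte Level 5 MAT header.

    Assumed layout: bytes 0-115 text, 116-123 subsystem offset,
    124-125 version, 126-127 endian indicator ("IM" on disk = little-endian).
    Returns 0 when the field is blank (assumed to mean "no subsystem").
    """
    if len(header) < 128:
        raise ValueError("a Level 5 MAT-file header is 128 bytes long")
    raw = header[116:124]
    if raw == b" " * 8:
        return 0
    byte_order = "<" if header[126:128] == b"IM" else ">"
    (offset,) = struct.unpack(byte_order + "Q", raw)
    return offset

# Synthetic little-endian header, built by hand for the demo.
header = (
    b"MATLAB 5.0 MAT-file".ljust(116)   # descriptive text, space padded
    + struct.pack("<Q", 217)            # subsystem data offset
    + struct.pack("<H", 0x0100)         # version
    + b"IM"                             # endian indicator
)
print(read_subsystem_offset(header))  # 217
```

On a real file, the first 128 bytes read in binary mode would be passed in instead of the synthetic header.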

Why does Open XML API Import Text Formatted Column Cell Rows Differently For Every Row

I am working on an ingestion feature that will take a strongly formatted .xlsx file and import the records to a temp storage table and then process the rows to create db records.
One of the columns is strictly formatted as "Text", but it seems the Open XML API handles the column's cells differently on a row-by-row basis. Some of the values look numeric but truly are not (which is why we format the column as Text). Some examples are "211377", "211727.01", "209395.388", "209395.435".
What these values represent is not important. What happens is that some values (using the Open XML API v2.5 library) are read in properly as text, whether retrieved from the Shared Strings collection or simply from the InnerXML property, while others get pulled in as numbers with what appears to be appended rounding or precision.
For example, "211377", "211727.01" and "209395.435" all come in exactly as they are in the spreadsheet, but the "209395.388" value is pulled in as "209395.38800000001" (this happens to other values as well).
There seems to be no rhyme or reason to which values get messed up and which import fine. What is really frustrating is that if I use the native Import feature in SQL Server Management Studio to ingest the same spreadsheet into a temp table, this does not happen. So how is it that the SSMS import can handle these values purely as text for all rows, but the Open XML API cannot?
To begin with, your main problem seems to be this value:
"209395.388" value is being pulled in as "209395.38800000001"
Yes, in the .xlsx file the value is stored as 209395.38800000001 instead of 209395.388. That is a correct way to store a floating-point number; there is nothing wrong with it. You can confirm it with the following code snippet:
string val = "209395.38800000001"; // <= What we extract from Open Xml
Console.WriteLine(double.Parse(val)); // < = Simply pass it to double and print
The output is:
209395.388 // <= yes the expected value
So there's nothing wrong with the value you extract from the .xlsx using the Open XML SDK.
Now to cells: a cell can hold a variety of formats, such as numbers, text, booleans or shared-string text. You can also apply styles to a cell, which format the stored value into the desired display in Excel (e.g. date/time formats, forced strings, etc.). This is how Excel handles such a wide variety of data; it needs this kind of formatting, and the .xlsx file format has to be a little complex to support it all.
My advice is to inspect each extracted value to identify what it represents (for example, whether it is a number or text) and then apply the matching parse method.
For example:
string val = "209395.38800000001";
Console.WriteLine(float.Parse(val)); // <= float.Parse yields a different, less precise value: 209395.4
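The same double versus single precision effect can be checked outside .NET. The Python sketch below is only an analogue of the C# snippets above: Python's float is a double, and a 4-byte struct round trip is used here to mimic single precision. The `.15g` and `.7g` formats were picked to reproduce the outputs quoted above and are not a claim about how any particular .NET version formats numbers.

```python
import struct

stored = "209395.38800000001"   # text as found in the sheet XML
as_double = float(stored)       # Python floats are IEEE 754 doubles

# 15 significant digits hides the trailing noise of the stored text.
print(format(as_double, ".15g"))   # 209395.388

# Round-trip through a 4-byte float to mimic single precision.
as_single = struct.unpack("<f", struct.pack("<f", as_double))[0]
print(as_single)                   # 209395.390625
print(format(as_single, ".7g"))    # 209395.4
```

Single precision can only step in increments of 1/64 at this magnitude, which is why the value lands on 209395.390625.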
Update:
Here's how the value is saved in the internal XML.
Try it for yourself:
Make an .xlsx file with the value 209395.388 -> change the extension to .zip -> unzip it -> go to the worksheets folder -> open Sheet1
You will notice that the value is stored as 209395.38800000001, as seen in the attached image. So there is nothing wrong with the API extracting the stored number; it is up to you to decide what format to apply.
But if you format the whole column as Text before adding the data, you will see that the .xlsx holds the data as is, simply as a string.
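The unzip-and-look steps above can also be scripted. The Python sketch below lists each cell's reference, its `t` attribute and the raw text of its `<v>` element. The XML fragment is hand-written in the shape of a worksheet part and is not taken from a real workbook; the function name `raw_cells` is made up. In SpreadsheetML a cell with `t="s"` holds an index into the shared strings table, while a cell without `t` holds a number.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

def raw_cells(sheet_xml: str):
    """Yield (reference, type attribute, raw <v> text) for each cell."""
    root = ET.fromstring(sheet_xml)
    for cell in root.iterfind(".//m:c", NS):
        value = cell.find("m:v", NS)
        yield (cell.get("r"), cell.get("t"), None if value is None else value.text)

# Hand-written fragment in the shape of a worksheet part.
SAMPLE = """<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1">
      <c r="A1"><v>209395.38800000001</v></c>
      <c r="B1" t="s"><v>0</v></c>
    </row>
  </sheetData>
</worksheet>"""

for ref, cell_type, text in raw_cells(SAMPLE):
    print(ref, cell_type, text)
```

For a real workbook, the worksheet XML could be read straight out of the archive with the standard zipfile module instead of renaming and unzipping by hand; the part is typically named xl/worksheets/sheet1.xml, but that path should be checked for the file at hand.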

Matlab save sequence of mat files from convertTDMS stored in cell array to sequence of mat files

I have data stored in the .tdms format, gathering the data of many sensors, measured every second, every day. A new tdms file is created every day, and stored in a folder per month. Using the convertTDMS function, I have converted these tdms files to mat files.
As there are some errors in some of the measurements (e.g. negative values which cannot physically occur), I have performed some corrections by loading one mat file at a time, doing the calculations and then saving the data into the original .mat file.
However, when I try to do what I described above in a loop (so: load .mat in folder, do calculations on one mat file (or channel therein), save mat file, repeat until all files in the folder have been done), I end up running into trouble with the limitations of the save function: so far I either save every variable in the workspace or am unable to save at all when using the code below.
for k = 1:nFiles
    w{k,1} = load(wMAT{k,1});
    len = length(w{k,1}.(x).(y).(z));
    pos = find(w{k,1}.(x).(y).(z)(1,len).(y)<0); %Wind speed must be >0 m/s
    for n = 1:length(pos)
        w{k,1}.(x).(y).(z)(1,len).(y)(pos(n)) = mean([w{k,1}.(x).(y).(z)(1,len).(y)(pos(n)+1),...
            w{k,1}.(x).(y).(z)(1,len).(y)(pos(n)-1)],2);
    end
    save( name{k,1});
    %save(wMAT{k,1},w{k,1}.(x),w{k,1}.ConvertVer,w{k,1}.ChanNames);
end
A bit of background information: the file names are stored in a cell array wMAT of length nFiles in the folder. Each cell in the cell array wMAT stores the fullfile path to the mat files.
The data of the files is loaded and saved into the cell array w, also of length nFiles.
Each cell in "w" has all the data stored from the tdms to mat conversion, in the format described in the convertTDMS description.
This means: to get at the actual data, I need to go from the
cell in the cell array w{k,1} (my addition)
to the struct array "ConvertedData" (Structure of all of the data objects - part of convertTDMS)
to the struct array below called "Data" (convertTDMS)
to the struct array below called "MeasuredData" (convertTDMS) -> at this level, I can access the channels which store the data.
to finally access/manipulate the values stored, I have to select a channel, e.g. (1,len), and then go via the struct array to the actual values (="Data"). (convertTDMS)
In Matlab format, this looks like "w{1, 1}.ConvertedData.Data.MeasuredData(1, len).Data(1:end)" or "w{1, 1}.ConvertedData.Data.MeasuredData(1, len).Data".
To make typing easier, I took
x = 'ConvertedData';
y = 'Data';
z = 'MeasuredData';
allowing me to write instead:
w{k,1}.(x).(y).(z)(1,len).(y)
using the dot notation.
My goal/question: I want to load the values stored in a .mat file from the original .tdms files in a loop to a cell array (or if I can do better than a cell array: please tell me), do the necessary calculations, and then save each 'corrected' .mat file using the original name.
So far, I have gotten a multitude of errors from trying a variety of solutions, going from "getfieldnames", trying to pass the name of the (dynamically changing) variable(s), etc.
Similar questions which have helped me get in the right direction include Saving matlab files with a name having a variable input, Dynamically Assign Variables in Matlab and http://www.mathworks.com/matlabcentral/answers/4042-load-files-containing-part-of-a-string-in-the-file-name-and-the-load-that-file , yet the result is that I am still no closer than doing manual labour in this case.
Any help would be appreciated.
If I understand your ultimate goal correctly, I think you're pretty much there. I think you're trying to process your .mat files and that the loading of all of the files into a cell array is not a requirement, but just part of your solution? Assuming this is the case, you could just load the data from one file, process it, save it and then repeat. This way you only ever have one file loaded at a time and shouldn't hit any limits.
Edit
You could certainly make a function out of your code and then call that in a loop, passing in the name of the file to modify. Personally I'd probably do that, as I think it's a neater solution. If you don't want to do that though, you could just replace w{k,1} with w; then each time you load a file, w would be overwritten. If you wanted to explicitly clear variables you can use the clear command with a space-separated list of variables, e.g. clear w len pos, but I don't think that this is necessary.
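The correction that such a per-file function would apply (replace each negative sample with the mean of its two neighbours) can be sketched on its own. The example below is in Python rather than MATLAB and works on a plain list; the function name `fix_negative_samples` and the edge handling are additions for illustration and are not part of the question's code, which indexes pos(n)-1 and pos(n)+1 without a bounds check. Unlike the MATLAB loop, it always reads the original neighbours, not previously corrected ones.

```python
def fix_negative_samples(samples):
    """Return a copy where each negative value is replaced by the mean
    of its two (original, uncorrected) neighbours."""
    fixed = list(samples)
    if len(samples) < 2:
        return fixed  # nothing to average against
    last = len(samples) - 1
    for i, value in enumerate(samples):
        if value < 0:
            # Edge guard (an addition): at either end, reuse the only neighbour.
            left = samples[i - 1] if i > 0 else samples[i + 1]
            right = samples[i + 1] if i < last else samples[i - 1]
            fixed[i] = (left + right) / 2
    return fixed

print(fix_negative_samples([3.0, -1.0, 5.0, 4.0]))  # [3.0, 4.0, 5.0, 4.0]
```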

Writing Private Dicom data in matlab without modifying the dictionary

I am reading a DICOM file in MATLAB, modifying some of its data and trying to save it to another file. While doing so, the private DICOM data is either not written at all (when 'WritePrivate' is set to 0) or written as a UINT8 array, which becomes incomprehensible and useless. I even tried copying the data I get from the original DICOM file into a new structure and writing that to a new DICOM file, but even though the private data remains fine in the new structure, it does not remain so in the new DICOM file. Is there any way to keep this private data intact while copying to a new DICOM file without changing the MATLAB DICOM dictionary?
I have provided the following code to show what I'm trying to do.
X=dicomread('Bad011_4CH_01.dcm');
metadata = dicominfo('Bad011_4CH_01.dcm');
metadata.PatientName.FamilyName='LastName';
metadata.PatientName.GivenName='FirstName';
birthday=metadata.PatientBirthDate;
year=birthday(1,1:4);
newyear=strcat(year,'0101');
metadata.PatientBirthDate=newyear;
names=fieldnames(metadata);
h=metadata;
dicomwrite(X,'example.dcm',h,'CreateMode','copy');
newh=dicominfo('example.dcm');
Here the data in newh contains none of the private data. If I change the code to the following
dicomwrite(X,'example.dcm',h,'CreateMode','copy','WritePrivate',1);
In this case the private data gets changed entirely into a UINT8 array and becomes useless. The ideal solution for my task would be to keep the private data in the newly created DICOM file without changing the MATLAB DICOM dictionary.
Have you tried something like:
dicomwrite(uint16(image), fileName, 'ObjectType', 'MR Image Storage', ...
'WritePrivate', true, header);
where "header" is a struct composed of name-value pairs using the same format as header data that you would get from MATLAB's dicominfo function? My general approach to image creation in MATLAB is to avoid using CreateMode 'copy' and instead build my own DICOM header by explicitly copying the attributes that it makes sense to copy and generating my own values for attributes that should have new values.
To write private tags, you would do something like:
header.Private_0045_10xx_Creator = 'MY_PRIVATE_BLOCK';
header.Private_0045_1001 = int32(65535);
If you then write this out using dicomwrite and read it back in using hdr = dicominfo('mynewimg');, you can see that it really did write the value as a 32-bit integer even though, unfortunately, it is always going to read the data in as a vector of uint8 values.
>> hdr.Private_0045_1001
ans =
255
255
0
0
As long as you know what type to expect, you should be able to typecast the data back to the desired type after you've read the header. For example:
>> typecast(hdr.Private_0045_1001, 'int32')
ans =
65535
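The same reinterpretation can be checked outside MATLAB. This short Python sketch unpacks the four bytes shown above as one little-endian signed 32-bit integer; little-endian is an assumption that matches the byte order of the values shown:

```python
import struct

raw = bytes([255, 255, 0, 0])          # the uint8 vector read back above
(value,) = struct.unpack("<i", raw)    # "<i": little-endian signed 32-bit
print(value)                           # 65535
```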
I know I'm about 8 years late, but have you tried
dicomwrite(..., 'VR', 'explicit')
?
It solves the "reading as uint8" problem for me.
Edit:
Actually, it looks like you need to specify a DICOM dictionary containing the VR of that tag. If you combine this with 'VR', 'explicit', then the program reading the DICOM file won't need the dictionary file.