How do I obtain a hash of the payload of a digital photo container, ideally in Java?

I have edited EXIF properties on digital pictures and would like to be able to identify them as identical afterwards. I believe this implies extracting the payload stream and computing a hash. What is the best way to do this, ideally in Java, and most ideally using a native implementation for performance?

JPEG files are a series of 'segments'. Some contain image data, others don't.
Exif data is stored in the APP1 segment. You could write some code to compare the other segments, and see if they match. A hash seems like a reasonable approach here. For example, you might compare a hash of only the SOI, DQT or DHT segments. You'd need to experiment to see which of these gives the best result.
Check out the JpegSegmentReader class from my metadata-extractor library.
Java: https://github.com/drewnoakes/metadata-extractor
.NET: https://github.com/drewnoakes/metadata-extractor-dotnet
With that class you can pull out specific segment(s) from a JPEG for processing.
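For example, a minimal sketch of that idea in Java might look like the following. It assumes the metadata-extractor 2.x API, where JpegSegmentReader exposes a static readSegments(File, Iterable&lt;JpegSegmentType&gt;) method and JpegSegmentData returns the raw bytes per segment type, and it hashes only the DQT and DHT segments with the JDK's MessageDigest. Which segments give the most reliable comparison for your files is something you would still need to experiment with.

```java
import com.drew.imaging.jpeg.JpegSegmentData;
import com.drew.imaging.jpeg.JpegSegmentReader;
import com.drew.imaging.jpeg.JpegSegmentType;

import java.io.File;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;

public class JpegPayloadHash {

    // Hashes selected JPEG segments so that files differing only in
    // Exif (APP1) metadata produce the same digest.
    public static String hashSegments(File jpegFile) throws Exception {
        // Segments to include in the hash -- deliberately excludes APP1 (Exif).
        // Adjust this list after experimenting with your own files.
        List<JpegSegmentType> types = Arrays.asList(
                JpegSegmentType.DQT, JpegSegmentType.DHT);

        // Assumes metadata-extractor 2.x's static readSegments API.
        JpegSegmentData data = JpegSegmentReader.readSegments(jpegFile, types);

        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        for (JpegSegmentType type : types) {
            for (byte[] segment : data.getSegments(type)) {
                sha.update(segment);
            }
        }

        StringBuilder hex = new StringBuilder();
        for (byte b : sha.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hashSegments(new File(args[0])));
    }
}
```

Two copies of the same photo that differ only in their Exif edits should then produce the same digest, while genuinely different images should not.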
Let us know how you get on!

Related

How to use VTK to efficiently write time-varying field data on a fixed mesh?

I am working on physics simulation research. In one of my projects I have a large fixed grid that does not vary with time, while the fields on the grid do vary with time during the simulation. I need to use VTK to record the field data at each step for visualization in ParaView.
The method I am currently using is to write a separate *.vtu file to disk at each time step. This basically serves the purpose, but it writes a lot of duplicate data (re-recording the geometry of the mesh at every step), which not only consumes more disk space but also wastes time on encoding and parsing.
I would like a way to write the mesh information only once and then write only the new field data at each subsequent step, while still producing the same visualization. Please let me know whether VTK and ParaView provide such an interface and how to use it.
Using .pvtu files and referring to the same .vtu as the Piece for each step should do the trick.
See this similar post on the ParaView Discourse, and the .pvtu documentation.
EDIT
This turns out to be a side effect of the format; it is not actually supported by the writer.
The correct solution is to use another file format ...
Let me provide my own research findings for reference.
As Nico said, with a combination of pvtu/vtu files we could in theory store the geometry in a single separate vtu file that is referenced by a pvtu file. Setting NumberOfPieces to 1 for the pvtu output would result in only one separate vtu file.
However, the VTK library does not expose a dedicated interface to control how the vtu files are written. No matter how it is configured, as long as the writer's input contains geometry, the writer will write that geometry to disk, and this step cannot be skipped through the exposed interface.
However, it is indeed possible to make multiple pvtu files point to the same vtu file by manually editing the Piece node in each pvtu file, and ParaView can recognize and visualize such a file group properly.
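To make the hand-editing concrete, here is a small Java sketch of what writing such a pvtu per time step could look like, with every Piece node pointing at the same mesh vtu. The file names, the field name, and the PDataArray declarations are placeholders of mine; they have to mirror what is actually stored in the shared vtu, and the limitations described in the conclusion below still apply.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SharedPiecePvtuSketch {

    // Writes a .pvtu whose single <Piece> references an existing .vtu on disk.
    // The PPoints/PPointData declarations are placeholders and must match the
    // arrays really contained in the referenced .vtu file.
    static void writePvtu(String pvtuPath, String sharedVtu, String fieldName)
            throws IOException {
        String xml =
                "<?xml version=\"1.0\"?>\n"
              + "<VTKFile type=\"PUnstructuredGrid\" version=\"0.1\" byte_order=\"LittleEndian\">\n"
              + "  <PUnstructuredGrid GhostLevel=\"0\">\n"
              + "    <PPoints>\n"
              + "      <PDataArray type=\"Float32\" NumberOfComponents=\"3\"/>\n"
              + "    </PPoints>\n"
              + "    <PPointData Scalars=\"" + fieldName + "\">\n"
              + "      <PDataArray type=\"Float32\" Name=\"" + fieldName + "\"/>\n"
              + "    </PPointData>\n"
              + "    <Piece Source=\"" + sharedVtu + "\"/>\n"
              + "  </PUnstructuredGrid>\n"
              + "</VTKFile>\n";
        Files.write(Paths.get(pvtuPath), xml.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Every time step's .pvtu points at the same shared piece.
        for (int step = 0; step < 10; step++) {
            writePvtu(String.format("field_%04d.pvtu", step),
                      "mesh_shared.vtu", "temperature");
        }
    }
}
```

ParaView typically treats numbered file names like these as a single file series when they are opened together.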
I did not proceed to try adding arrays to the unstructured grid and using pvtu output.
So, I think the conclusion is:
If you don't want to dive into VTK's library code and XML implementation, this approach doesn't make much sense.
If you are willing to write the full series of files, delete most of the vtu files, and then point every pvtu's Piece node at the single surviving vtu file by editing the pvtu files, you can save a lot of disk space, but you will not shorten the write, read, and parse times.
If you implement an XML writer yourself, you can in theory achieve all of the requirements, but it takes a lot of coding work.

Understanding the ISO/IEC 15418 barcode specification

I'm trying to learn about scanning and reading encoded data in barcodes, both 1D and 2D. ISO/IEC 15418 seems to describe very closely the data I am interested in reading. Unfortunately, the specification is not good at giving full examples of what it looks like in practice.
Things like Group Separators and Record Separators (ASCII characters 29 and 30) appear out of nowhere, without any definition in the ANSI specification.
ANS MH10.8.2-2016 (ISO/IEC 15418) PDF
Also relevant: ANS MH10.8.17-2017 (ISO/IEC 15434) PDF
So far, our way of working this out has been to scan barcodes (Data Matrix codes most commonly seem to carry data resembling what the specification shows), read the data, and check it against the document, slowly identifying patterns in how the data is structured. The specification seems to only lightly touch on the overall structure of the data, while going into exact detail about the constituent parts.
I understand that my questions are somewhat broad and unspecific, but I can barely find anything about this specification to begin with. There's barely anything on the entirety of Stack Overflow.
General Questions
Where can I find full examples and explanations of what this specification looks like in practice?
Are there any publicly available parsers or APIs surrounding this?
Where should I look for more information, ask questions etc. about this specification and ones like it?
Specific Questions
When scanning barcodes, Data Matrix codes and QR codes, how is one supposed to easily differentiate data stored in this standard from raw text with no particular standard applied? (See the envelope-detection sketch after these questions.)
a. We assume that there must be an industry standard for doing this besides "check if it kinda looks like one based on what we know".
Currently, the barcodes I am scanning seem to primarily use Data Identifiers rather than Application Identifiers. However, I did manage to find an example online of someone using an Application Identifier, and it very closely resembled the Data Identifier structure I scanned. Assuming there will be ambiguity, what is the difference between the two?
Do 1D barcodes actually even store this kind of data? So far, I cannot remember ever scanning a 1D barcode that stored more than simple 'text'. Only Data Matrix codes have provided the raw encoded data from the specification.
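Regarding the first specific question, here is a minimal sketch (in Java, with made-up class and method names) of how one might distinguish an ISO/IEC 15434 format "06" envelope, the one that carries MH10.8.2 Data Identifiers, from plain text. The message header is the compliance indicator "[)>" followed by RS, then the format indicator "06" followed by GS; data elements are separated by GS and the message ends with RS and EOT, which is where those otherwise unexplained separator characters come in. The Data Identifiers in the test string are purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class Iso15434Sketch {

    private static final char RS  = 0x1E; // Record Separator (ASCII 30)
    private static final char GS  = 0x1D; // Group Separator  (ASCII 29)
    private static final char EOT = 0x04; // End of Transmission

    // ISO/IEC 15434 messages start with the compliance indicator "[)>" + RS;
    // the format envelope "06" + GS introduces MH10.8.2 Data Identifier fields.
    static boolean looksLike15434Format06(String scan) {
        return scan.startsWith("[)>" + RS + "06" + GS);
    }

    // Splits the GS-separated data elements out of a format 06 envelope.
    static List<String> dataElements(String scan) {
        if (!looksLike15434Format06(scan)) {
            throw new IllegalArgumentException("Not a format 06 message");
        }
        // Strip the header, then cut off the RS + EOT trailer if present.
        String body = scan.substring(("[)>" + RS + "06" + GS).length());
        int end = body.indexOf(RS);
        if (end >= 0) {
            body = body.substring(0, end);
        }
        List<String> elements = new ArrayList<>();
        for (String element : body.split(String.valueOf(GS))) {
            if (!element.isEmpty()) {
                elements.add(element); // e.g. "1P..." or "S..." Data Identifier fields
            }
        }
        return elements;
    }

    public static void main(String[] args) {
        String scan = "[)>" + RS + "06" + GS + "1PABC123" + GS + "SXYZ789" + RS + EOT;
        System.out.println(looksLike15434Format06(scan)); // true
        System.out.println(dataElements(scan));           // [1PABC123, SXYZ789]
    }
}
```

Plain text that does not begin with the "[)>" compliance indicator simply fails the check and can be treated as unstructured data.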

Need some help understanding Vocabulary of Interlinked Dataset (VoID) in Linked Open Data

I have been trying to understand VoID in Linked Open Data. It would be great if anyone could help clarify some of my confusions.
Does it need to be stored in a separate file, or can it be included in the RDF dataset itself? If so, how do I query it? (A sample query would be really helpful.)
How is the information in VoID used in real life?
Does it need to be stored in a separate file, or can it be included in the RDF dataset itself? If so, how do I query it? (A sample query would be really helpful.)
In theory no, but for practical purposes yes. In the end the information is encoded as triples, so it doesn't really matter which file you put them in, and you could argue that it's actually best to put the VoID info into the data files and serve those triples with your data as meta-information. It's queryable like any other RDF: either load it into some SPARQL endpoint or use a library that can load RDF files directly. This, however, also shows why a separate file makes sense: instead of having to load potentially large data files just to get some dataset metadata, it makes sense to offer the metadata in its own file.
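As a concrete illustration of the "use a library that can directly load RDF files" option, here is a small sketch using Apache Jena (my choice of library, not something prescribed above) that loads a VoID description and lists each dataset's title and triple count. The file name void.ttl is a placeholder.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;

public class VoidQueryExample {

    public static void main(String[] args) {
        // Load the VoID description -- it is just ordinary RDF triples,
        // whether it lives in its own file or alongside the data.
        Model model = RDFDataMgr.loadModel("void.ttl"); // placeholder file name

        String query =
                "PREFIX void: <http://rdfs.org/ns/void#>\n"
              + "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
              + "SELECT ?dataset ?title ?triples WHERE {\n"
              + "  ?dataset a void:Dataset .\n"
              + "  OPTIONAL { ?dataset dcterms:title ?title }\n"
              + "  OPTIONAL { ?dataset void:triples ?triples }\n"
              + "}";

        try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("dataset") + " "
                        + row.get("title") + " "
                        + row.get("triples"));
            }
        }
    }
}
```

The same query string can just as well be sent to a remote SPARQL endpoint instead of a locally loaded model.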
How is the information in VoID used in real life?
VoID is actually used in several scenarios already, but it is mostly a recommendation and a good idea. The most prominent use case I know of is getting your dataset shown in the LOD Cloud. You currently have to register it with datahub.io and add a VoID file (there is an example in my associations dataset).
Other examples (sadly many defunct nowadays) can be found here: http://semanticweb.org/wiki/VoID.html

Better way to load content from web, JSON or XML?

I have an app which will load content from a website.
Around 100 articles will be loaded each time.
I would like to know which format is better for loading content from the web, considering:
speed
compatibility (will there be any problems with encoding if we use special characters etc.)
your experience
JSON is better if your data is huge.
Read more here: http://www.json.org/xml.html
Strongly recommend JSON for better performance and less bandwidth consumption.
JSON all the way. The Saad's link is an excellent resource for comparing the two (+1 to the Saad), but here is my take from experience and based on your post:
speed
JSON is likely to be faster in many ways. Firstly the syntax is much simpler, so it'll be quicker to parse and to construct. Secondly, it is much less verbose. This means it will be quicker to transfer over the wire.
compatibility
In theory, there are no issues with either JSON or XML here. In terms of character encodings, I think JSON wins because you must use Unicode. XML allows you to use any character encoding you like, but I've seen parsers choke because the line at the top specifies one encoding and the actual data is in a different one.
experience
I find XML to be far more difficult to hand craft. You can write JSON in any text editor but XML really needs a special XML editor in order to get it right.
XML is more difficult to manipulate in a program. Parsers have to deal with more complexity: namespaces, attributes, entities, CDATA, etc. So if you are using a stream-based parser, you need to track attributes, element content, namespace maps and so on. DOM-based parsers tend to produce complex graphs of custom objects (because they have to, in order to model the complexity). I have to admit I've never used a stream-based JSON parser, but parsers producing object graphs can use the natural Objective-C collections.
On the iPhone, there is no built in XML DOM parser in Cocoa (you can use the C based parser - libxml2) but there is a simple to use JSON parser as of iOS 5.
In summary, if I have control of both ends of the link, I'll use JSON every time. On OS X, if I need a structured human readable document format, I'll use JSON.
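The answer above is written from an Objective-C/iOS perspective, but the parsing-complexity point carries over to other platforms. As a rough Java sketch, with payloads, field names and library choices (org.json and the JDK's DOM parser) that are mine purely for illustration, the same two articles land directly in generic collections on the JSON side, while the XML side needs a DOM parse and manual traversal:

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;

import org.json.JSONArray;
import org.json.JSONObject;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class JsonVsXmlSketch {

    public static void main(String[] args) throws Exception {
        // Hypothetical payloads carrying the same two articles.
        String json = "{\"articles\":[{\"title\":\"First\"},{\"title\":\"Second\"}]}";
        String xml = "<articles><article><title>First</title></article>"
                   + "<article><title>Second</title></article></articles>";

        // JSON: the data maps straight onto generic objects and arrays.
        JSONArray articles = new JSONObject(json).getJSONArray("articles");
        for (int i = 0; i < articles.length(); i++) {
            System.out.println("JSON title: " + articles.getJSONObject(i).getString("title"));
        }

        // XML: a DOM parse, then manual traversal of node lists.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList titles = doc.getElementsByTagName("title");
        for (int i = 0; i < titles.getLength(); i++) {
            System.out.println("XML title: " + titles.item(i).getTextContent());
        }
    }
}
```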
You say you are loading "articles". If you mean documents containing rich text (stuff like italic and bold), then it's not clear that JSON is an option - JSON doesn't really do mixed content.
If it's pure, simple structured data, and if you don't have to handle complexities like the need for the software at both ends of the communication to evolve separately rather than remaining in lockstep, then JSON is simpler and cheaper: you don't need the extra power or complexity of XML.

How should I store and compress a Moose object using Perl?

I have created a package using Moose and I would like to nstore some large instances. The resulting binary files are very large (500+MB) so I would like to compress them.
What is the best way for doing that?
Should I open a filehandle with bzip2 etc. and then store using nstore_fd?
With MooseX::Storage, most of this is already done for you -- you just need to specify your serialization and I/O formats.
While compression is certainly a viable option, you might also want consider simply serializing less.
Could it be that your objects contain a lot of data that could easily be rebuilt from other data they also contain? For example, if you have attributes that are lazily built from other attributes (e.g. using Moose's lazy + builder or lazy_build), there is not much point in storing the values of those attributes at all unless the recomputation is incredibly expensive. And even then it might be worth considering, as reading lots of data off disk isn't the fastest thing either.
If you find that you want to serialize only parts of your objects, and still want to use Storable, you can define custom STORABLE_freeze and STORABLE_thaw hooks, as described in the Storable documentation.
However, there's also alternative serializers available. MooseX::Storage is one of them, and happens to support many serialization backends and formats, and can also be told easily about which attributes to serialize and which to skip for that purpose.
Have a look at Data::Serializer. It optionally uses zlib (via Compress::Zlib) or PPMd (via Compress::PPMd) to compress your serialized data.