How can I tell if two image files are the same in Perl? - perl

I have a Perl script I wrote for my own personal use that fetches image files from a website periodically. It then saves these images to a folder. These image files are quite often the same from fetch to fetch, and I'd like to not save duplicates if I can get around it.
My question: What would be the best way to compare/check if they are the same?
My only real thought so far is to open a file handle to the existing file, md5 it, md5 the $response->content from the fetch, and then compare the two. Would that work?
Is there a better way?
EDIT:
Wow, already tons of great suggestions. Does it help if I tell you that this script runs daily via cron? I.e. it is guaranteed to always run at the exact same time every day? Also: I'm looking at the Last-Modified headers on some of these, and they don't look 100% accurate, i.e. there are some that have a Last-Modified of over a week ago when I know the image is more recent than that. I'm assuming that's because the image file itself hasn't been modified on the server since then... which doesn't help me much...

Don't open and hash the stored image each time - stash the hash alongside the image when you store it. Compare sizes as well.
Don't issue a GET request straight away, do a HEAD first and compare the size, last modification date and any Etags to what you got last time.
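The hash-stashing suggestion above could look roughly like the following. This is a minimal sketch in Python for illustration only, not the asker's Perl script; the function name `save_if_new` and the `.md5` sidecar file are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def save_if_new(content: bytes, image_path: Path) -> bool:
    """Write content to image_path unless its MD5 matches the stashed digest.

    The digest is kept in a sidecar file next to the image (hypothetical
    naming: "<image name>.md5"), so the stored image never has to be
    re-read and re-hashed on later runs.
    """
    digest_path = image_path.with_name(image_path.name + ".md5")
    new_digest = hashlib.md5(content).hexdigest()

    if digest_path.exists() and digest_path.read_text().strip() == new_digest:
        return False  # same bytes as last time; nothing to save

    image_path.write_bytes(content)
    digest_path.write_text(new_digest)
    return True
```

MD5 is adequate here because the goal is spotting accidental duplicates, not resisting deliberate collisions. Comparing the content length first would be a cheap extra check before hashing.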

There are a number of HTTP headers you can use for this -- if you save the time that you last retrieved the file, you can do a conditional get with
If-Modified-Since: <date>
Or, if the server returns an Etag header with the response, you can store that with the image, (or a collection of all of the etags you have seen for that image), and do:
If-None-Match: <all of your etags here>
If the server supports conditional gets, then you will get a "304 Not Modified" response, with no body.
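A conditional GET along these lines might be sketched as follows. Python's standard `urllib` is used for illustration; the helper names `conditional_headers` and `fetch_if_changed` are assumptions, and whether the server honours either header is something you would have to check against the real site.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build the request headers for a conditional GET from stored values."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified), or None if the server says 304."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        # urllib reports "304 Not Modified" as an HTTPError
        if err.code == 304:
            return None
        raise
```

Store the returned ETag and Last-Modified values next to the image and pass them back in on the next run.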

Yep, that sounds right.
Depending on how you're getting the file and how frequently, you might also be able to check for HTTP 304 Not Modified and save yourself the download.

md5 would work, but you'd still have to pull the file. Are there any useful metadata in the HTTP headers, content-length, cache-control directives, ETags, etc. ?

There's also a nice fdupes tool for the purpose. Don't know what system you're using and what systems the tool can be built for.

Related

WebSphere MQ binary files

This might be a question that may not be answered due to the nature of the external tool I am using (lack of documentation).
Basically, I am using a tool that pushes and pulls messages from the queue; more precisely, it pushes and pulls files. It worked perfectly for text files, but when I tried pushing and then pulling a binary file, the pulled one was corrupted: its size increased in comparison with the original file (a ratio of 1.33).
For example moving a zip file wouldn't work...
I suppose it has something to do with the tool's configuration. The only settings that can be changed regarding the problem are CCSID and encoding (UTF-8, Base16, etc.). I tried playing with both, unfortunately without success.
Tried using the following CCSIDs: 65535, 1208, 819
and encodings : UTF-8, Base16, Base64
In every case the binary file was corrupted after pulling it from the queue. I'm not entirely sure how the tool accomplishes that; it's written in Java. I'm also new to MQ, so I tried searching for the correct options in IBM's docs, but I haven't found anything that makes more sense than 65535 and Base16, and it still doesn't work. Could anyone with more experience with MQ tell me whether playing with these options makes sense at all in this case and, if so, suggest what CCSID and encoding I can try to accomplish what I've described above?
More information is really needed, but my suspicion is that you are putting the message on the queue as a text message and playing around with encodings and CCSIDs to try to get it right. You really need to know how the 'Java' app achieves this: is it using JMS (e.g. JMSBytesMessage) or base Java (something like setMessageData)?
At a high level, there is a header on a message (The MD) which 'describes' the data - the MD format field. If you say the data is a string then MQ can convert between codepages should the getter request it etc. Put a tiny binary file into a message onto a queue, and browse the queue with amqsbcg or the GUI - what are the MD fields for format? What headers are on the payload - anything like RFH2's?
Post the code to give us a clue, or at least the amqsbcg output.
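One observation that may help narrow this down: the 1.33 size ratio reported in the question is exactly what Base64 encoding produces (4 output bytes for every 3 input bytes). That would be consistent with the payload being Base64-encoded on the way in and not decoded on the way out, though this is only an inference from the numbers and not something known about the tool. A quick way to check a pulled file against that hypothesis, sketched in Python with a hypothetical helper name:

```python
import base64


def looks_like_base64_of(original: bytes, pulled: bytes) -> bool:
    """True if `pulled` is the Base64 text of `original` (ignoring whitespace)."""
    expected = base64.b64encode(original)
    return b"".join(pulled.split()) == expected


# Size growth of Base64 for a payload whose length is a multiple of 3
payload = bytes(3000)
encoded = base64.b64encode(payload)
ratio = len(encoded) / len(payload)  # 4000 / 3000
```

If the check returns True for a small test file, the file contents survived intact and only a decoding step is missing.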

Why would LayoutObjectNames return an empty string in FileMaker 14?

I'm seeing some very strange behavior with FileMaker 14. I'm using LayoutObjectNames for some required functionality. On the development system it's working fine. It returns the list of named objects on the layout.
I close the file, zip it up and send it to the client, and that required functionality isn't working. He sends the file back and I open it and get a data viewer up. The function returns nothing. I go into layout mode and confirm that there are named objects on the layout.
The first time this happened, I tried recovering the file. In the recovered file it worked, so I assumed some corruption had happened on his end. I told him to trash the file I had given him and work with a new version I supplied. The problem came up again.
This morning he sent me the oldest version that the problem manifested in. I confirmed the problem, tried recovering it again, but this time it didn't fix the problem.
I'm at a loss. It works in the version I send him, doesn't on his system. We're both using FileMaker 14, although I'm using Advanced. My next step will be to work from a served file instead of a local one, but I have never seen this type of behavior in FileMaker. Has anyone seen anything similar? Any ideas on a fix? I'm almost ready to just scrap the file and build it again from scratch since we're not too far into the project.
Thanks, Chuck
There is a known issue with the Get (FileName) function when the file name contains dots (other than the one before the extension). I will amend my answer later with more details and a possible solution (I have to look it up).
Here's a quote from 2008:
This is a known issue. It affects not only the ValueListItems()
function, but any function that requires the file name. The solution
is to include the file extension explicitly in the file name. This
works even if you use Get (FileName) to return the file name
dynamically:
ValueListItems ( Get ( FileName ) & ".fp7" ; "MyValueList" )
Of course, this is not required if you take care not to use period
when naming your files.
http://fmforums.com/forums/topic/60368-fm-bug-with-valuelistitems-function/?do=findComment&comment=285448
Apparently the issue is still with us - I wonder if the solution is still the same (I cannot test this at the moment).

Is there a simple way to extract content from a webpage?

Our build software generates a webpage when the build fails, and lists the users who've committed since the last build. I'd like to have a way to parse the page for members of my team. For example:
Commit
18e1bc67b7e3123987daf8c219a4fbe2003de4
by bob.dole</b><pre>1112233- Description on header is not carried forward to BD doc after PCPROJBILL is ran<br></pre></div></td></tr><tr><td width="16"><img title="The file was modified" height="16" alt="The file was modified" width="16" src="/static/fbfd5d7f/images/16x16/document_edit.png" /></td><td><a>pcbatch/projbill.cpp</a></td></tr><tr class="pane"><td colspan="2" class="changeset"><a name="detail54"></a><div class="changeset-message"><b>
So the script would take a URL as input and search the file for 'bob.dole' and output to a file all of the details associated with him (commit hash, pre-data, etc.)
Could someone give me an idea of what would be the easiest way to accomplish this? I was thinking of using perl, but I'm not sure if there's something more straightforward.
If I got your question correctly, you want to get the webpage content and parse it to find the user name. If that is the case, I would use PHP.
Use file_get_contents("your_website"); this will return the page as a string for you to parse.
Then you can use strpos() to find the indices of substrings. This will later help you to extract the user name by using the substr() function.
Hope it helps.
The Perl module you are looking for that helps you search based on nodes is Mojo::DOM.
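For comparison, here is a minimal sketch of the extraction in Python using only the standard library. The page layout is an assumption inferred from the fragment in the question (a `changeset-message` div holding `<b>Commit HASH by USER</b>` followed by a `<pre>` block); the real page may differ, in which case the pattern needs adjusting. The function name `commits_by` is hypothetical.

```python
import re
from html import unescape

# Assumed layout, inferred from the fragment in the question:
# <div class="changeset-message"><b>Commit HASH by USER</b><pre>MESSAGE</pre>
COMMIT_RE = re.compile(
    r'<div class="changeset-message">\s*<b>\s*'
    r'Commit\s+([0-9a-f]+)\s+by\s+([^<\s]+)\s*</b>'
    r'\s*<pre>(.*?)</pre>',
    re.DOTALL,
)


def commits_by(html, user):
    """Return (hash, message) pairs for commits authored by `user`."""
    results = []
    for commit_hash, author, message in COMMIT_RE.findall(html):
        if author == user:
            # Turn <br> into newlines, then decode entities such as &amp;
            text = re.sub(r"<br\s*/?>", "\n", message)
            results.append((commit_hash, unescape(text).strip()))
    return results
```

The page itself could be fetched with `urllib.request.urlopen(url).read().decode()` and the results written to a file. A regex is fragile against markup changes; a DOM-based parser such as the Mojo::DOM module mentioned above is the sturdier choice.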

Generate a torrent/magnet link from a single file in a torrent collection

I was wondering if it is possible, given a torrent collection (i.e. a torrent containing multiple files), to extract a single file, generating an almost new torrent/magnet link to download only that single file but using the same source (announce, etc.), instead of downloading the whole torrent and then selecting what to download or not.
Thanks for any hint about.
2019 Update: Yes, you now can! In 2017 a draft BEP was released that covers the question's behaviour for magnet URIs! This is great, as it creates a standard that keeps a consistent info_hash between a magnet URI pointing to the multi-file torrent, and a magnet URI pointing to a single file within that multi-file torrent. They will share a swarm, which means you can, as the question asks "[generate] an almost new torrent/magnet link to download only that single file but using the same source".
The draft BEP:
http://www.bittorrent.org/beps/bep_0053.html BEP 53: "Magnet URI extension - Select specific file indices for download"
Example URI to request files 0, 2, 4 and the inclusive range 6 through to 8:
magnet:?xt=urn:btih:HASH&dn=NAME&tr=TRACKER&so=0,2,4,6-8
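Building such a URI from a list of file indices could be sketched like this. The helper names `select_only` and `magnet_for_files` are assumptions for illustration, and the info hash in the usage line is a placeholder; only the `so=` syntax (comma-separated indices and inclusive ranges) comes from the example above.

```python
def select_only(indices):
    """Format file indices as a `so` value, collapsing runs into ranges."""
    runs = []
    for i in sorted(set(indices)):
        if runs and i == runs[-1][1] + 1:
            runs[-1][1] = i          # extend the current run
        else:
            runs.append([i, i])      # start a new run
    return ",".join(
        str(lo) if lo == hi else f"{lo}-{hi}" for lo, hi in runs
    )


def magnet_for_files(info_hash, indices):
    """Magnet URI for the whole torrent's info_hash, limited to some files."""
    return f"magnet:?xt=urn:btih:{info_hash}&so={select_only(indices)}"


uri = magnet_for_files("HASH", [0, 2, 4, 6, 7, 8])
```

Because the info_hash is unchanged, a client that understands the `so` parameter joins the same swarm as everyone downloading the full torrent.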
And the draft BEP is making its way into BitTorrent libraries:
https://gitlab.com/proninyaroslav/libretorrent/tags/1.9 LibreTorrent 1.9 2018-NOV-26
https://github.com/webtorrent/webtorrent/issues/1395 Webtorrent 0.100.0 2018-MAY-23
2013-MAY-03 Original Answer:
Sometimes yes, but not often, and the resulting swarm has no peers.
Firstly, you need the original .torrent file, so if you only have a magnet URI you need to resolve that to a .torrent using DHT. Any bittorrent library that supports magnet URIs has the code for that task.
Once you have the .torrent, you then need to get the hashes relating to the file you're interested in. The .torrent file contains a very long string, each 20 bytes representing the hash of each piece in the torrent. Piece length is fixed for a torrent, typically between 256KB and 1MB. If the file starts at exactly a piece offset, and is sized equal to a multiple of the piece size or is the last file in the torrent then you can reuse these hashes. You can then create a new .torrent file with that information, and generate a new magnet URI from the torrent file, re-using the announce or using a new one.
Torrent info structure: https://wiki.theory.org/BitTorrentSpecification#Metainfo_File_Structure
Being lucky enough to get that offset is unlikely: with a piece length generally between 256KB and 1MB, and given that a file could start anywhere in a piece, you have roughly a 1/262144 to 1/1048576 chance of a file starting exactly on a piece boundary. If you can't re-use the hashes, you need to generate new hashes, which means you can't re-use the .torrent and would need to download the files to generate the new piece hashes.
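The alignment condition described above can be sketched as a small check. This assumes you have already decoded the .torrent and know the raw `pieces` string, the piece length, and the file's byte offset and length within the torrent; the function name is hypothetical.

```python
HASH_SIZE = 20  # each piece hash in `pieces` is a 20-byte SHA-1 digest


def reusable_piece_hashes(pieces, piece_length, file_offset, file_length,
                          total_length):
    """Return the slice of `pieces` covering the file, or None if misaligned.

    Hashes can be reused only if the file starts on a piece boundary and
    either ends on one or is the last file in the torrent.
    """
    if file_offset % piece_length != 0:
        return None
    is_last_file = file_offset + file_length == total_length
    if file_length % piece_length != 0 and not is_last_file:
        return None
    first = file_offset // piece_length
    count = -(-file_length // piece_length)  # ceiling division
    return pieces[first * HASH_SIZE:(first + count) * HASH_SIZE]
```

A `None` result means the file shares a piece with a neighbour, so new hashes would have to be computed from the downloaded data.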
The killer is that in the end, the torrent created has a different info_hash. The info_hash is the hash of the info describing the torrent, which was a description of many files and now in your new hash is the description of a single file, thus is a new torrent so there's no-one available to leech from. Peers collect into swarms based on the info_hash, and if you create a new torrent based on one file from a multifile torrent, the peers from the multifile torrent don't know about it and won't be available to leech from.
Even if you're lucky enough to get the right piece offsets, you create a torrent that doesn't have anyone sharing the file.
So, could you instead re-use the magnet URI and just specify a file name within the torrent? No, the BEP that describes how Bittorrent uses magnet URIs doesn't cover this behaviour. http://www.bittorrent.org/beps/bep_0009.html

How can I make a perl script that will rename files in a specific directory every couple minutes?

I wanted to add a live picture rotation to my site, and I could not find any other way to do it, so I decided that I need a CGI script that will:
1. Delete the first picture in the rotation (e.g. pic1.jpg)
2. Rename the rest of the pictures (e.g. rename pic2.jpg to pic1.jpg, pic3.jpg to pic2.jpg, pic4.jpg to pic3.jpg, etc...)
3. Do this every 5 minutes so that the viewers of my site will all be viewing the same picture pretty much at the same time.
Any help will be much appreciated,
Thanks.
To make this work "so that the viewers of my site will all be viewing the same picture pretty much at the same time", you need to use different urls for each image or make sure you are telling browsers and proxies not to cache the pictures...and not caching is a really bad idea; your viewers will not appreciate it, nor will your server.
Sounds like a pretty bad idea, frankly. File operations are relatively slow, and they tend to step on each other in a highly concurrent application like a webapp.
Look for another way. How about using the current time as a key to choose among pictures?
currentImageIndex = currentTimeRoundedToTheNearestFiveMinutes % totalNumberOfImages
Edited with more detail on request:
Basically, take the current time and reduce it to a count of five-minute intervals. Dividing the current epoch time in seconds by 300, using integer math, will give you this; if your language doesn't do integer division, truncate the result. Then use the modulo operator (% in Perl and many other languages; a handy operator that newcomers tend to overlook) to produce a number from 0 to n-1, where n is the total number of images you're serving. Then you can refer to a mapping table to go from that index to a filename.
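A minimal sketch of the time-as-key idea, in Python for illustration (the same arithmetic works in Perl with `time` and `int`). The function name, the image list, and the 300-second period are assumptions for the example.

```python
import time


def current_image(images, now=None, period=300):
    """Pick the image for the current five-minute slot.

    `now` is epoch seconds; it defaults to the current time. Every
    request within the same `period`-second slot gets the same image.
    """
    if now is None:
        now = time.time()
    slot = int(now) // period          # count of whole periods since the epoch
    return images[slot % len(images)]  # wrap around the image list


images = ["pic1.jpg", "pic2.jpg", "pic3.jpg"]
chosen = current_image(images)
```

No files are renamed or deleted, so concurrent requests cannot interfere with each other, and every viewer sees the same picture within a given slot.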
Since you say in a comment that you don't need caching, rather than change the filenames of each file, why not have your page point to a file which is a symlink and then change the pointer of the symlink every few minutes. Seems like it would do what you need without the overhead of major file operations.
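The symlink approach could be sketched as below, assuming a POSIX filesystem. The function name and the `.tmp` suffix are assumptions; the point of creating a temporary link and renaming it over the old one is that the swap is atomic, so a request never finds the link missing.

```python
import os


def point_symlink(link_path, target):
    """Atomically repoint link_path at target."""
    tmp_path = link_path + ".tmp"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)            # clear leftovers from a failed run
    os.symlink(target, tmp_path)
    os.replace(tmp_path, link_path)    # rename over the old link in one step
```

A cron job running every five minutes could call this with the next picture in the rotation, while the page always references the one fixed link name.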