How do you compare the content of two archive files programmatically? - perl

I'm doing some testing to ensure that the all-in-one zip file I created with a script contains the same content as the few zip files I have to create manually, with lots of clicking, through a web interface. Because of that, the zips have different folder structures.
Of course I could extract them manually and scan them with my powerful eyeball technique, or, being lazier, write a script to do it, but before I invest more time and get accused by my boss of stealing company time, I'm asking if there's a better way to do this.
I'm using Perl on a LAMP stack, by the way.
Thanks.

You can use Perl's Archive::Zip or Python's zipfile to extract the file names, sizes, and CRC checksums of the files in the archives. Create a file which contains the results sorted by file name (ignoring the path).
For your smaller ZIPs, merge the results of the script (cat list1 list2 list3 | sort).
Now, you can use diff to compare the results.
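For instance, a minimal Perl sketch of that listing step (the output format is my choice; sorting by base name ignores the differing folder structures):
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);
use File::Basename qw(basename);

# print "name size crc32" for every file in the given archive, sorted by base name
my $zip = Archive::Zip->new;
$zip->read($ARGV[0]) == AZ_OK or die "cannot read $ARGV[0]";
my @rows;
for my $m ($zip->members) {
    next if $m->isDirectory;
    push @rows, sprintf "%s %u %08x\n",
        basename($m->fileName), $m->uncompressedSize, $m->crc32;
}
print sort @rows;
Run it once per archive, redirect each output to a file (merging the lists for the smaller zips as above), and diff the results.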

I can wholeheartedly recommend Beyond Compare. Unless you're really getting underpaid, it's the biggest bang for your (boss's) buck.
[Edit] I seem to have glossed over the different folder structure, sorry about that. Beyond Compare can compare all files in folders with the same folder structure. It does not have (I believe) the intelligence to go searching for matches in files in different folders.
Regards,
Lieven

Create a CRC checksum for your files.
If the checksums are the same for the original files and the unzipped files, you can be sure the files are the same. This even works for non-text data.
A checksum can easily be created with an external program such as "SFV Checker" or programmatically (.NET and Java, for example, include libraries for this).
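On the asker's Perl stack, the core Digest::MD5 module can do the same job programmatically; a small sketch:
use strict;
use warnings;
use Digest::MD5;

# print the MD5 checksum of each file given on the command line
for my $path (@ARGV) {
    open my $fh, '<:raw', $path or die "$path: $!";
    print Digest::MD5->new->addfile($fh)->hexdigest, "  $path\n";
}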

Taking a cue from Carra's answer: if A.zip is your single big archive and B.zip is the archive generated through the web, then use the following algorithm.
Extract all files from A.zip and recursively (w.r.t. folders) compute the checksum of every file under the extraction folder (using cksum, md5sum, etc.), sort that information (pipe it through sort), and save it to a file, say A.txt; a Perl sketch of this manifest step follows this list.
Do the same for B.zip and generate B.txt.
Compare A.txt with B.txt; they should be exactly the same.
OR
Use unzip -l to get file/directory lists for both (zip) archives, then flatten the hierarchy of the user-generated zip file and compare it with the contents of your script-generated zip file using something like diff. By flattening the hierarchy I mean you may need to do some kind of pre-processing on one or both lists before you can do a meaningful comparison with diff.
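Here is a Perl sketch of the manifest step from the first option (the directory and output names are assumptions; the path is stripped so the differing folder structures don't matter):
use strict;
use warnings;
use File::Find;
use File::Basename qw(basename);
use Digest::MD5;

# build a sorted "checksum  basename" manifest of every file under $dir
sub manifest {
    my ($dir) = @_;
    my @lines;
    find({ no_chdir => 1, wanted => sub {
        return unless -f $File::Find::name;
        open my $fh, '<:raw', $File::Find::name or die "$File::Find::name: $!";
        push @lines, Digest::MD5->new->addfile($fh)->hexdigest
            . '  ' . basename($File::Find::name) . "\n";
    } }, $dir);
    return sort @lines;
}

# assumption: A.zip was extracted to extracted_A
open my $out, '>', 'A.txt' or die $!;
print {$out} manifest('extracted_A');
# ...same for extracted_B into B.txt, then: diff A.txt B.txt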


How to create a script that uses a path list as a reference for copying files, in PowerShell or a .bat script

I'm looking for a way to automate archiving: after I plug in my two external drives, I want to copy all my resources. The problem is that I have different file structures on my laptop and on both external drives, so I need to select specific folders to be copied. That means I can't just select one root folder and copy it straight over. I tried to find a way to declare more than one path in the cp command and in the copy command, without success. An example path:
/my_programming_stuff
/folder1
/folder2
/folder3
/folder4
I want to select only the first 3 folders and copy them to external drive 1 and external drive 2. The idea is to create a .bat file that will copy everything at once (in the best-case scenario it would copy to both external drives simultaneously, which would be much faster). Another problem is that the script needs to bypass the NTFS long-path limitation (max. 260 characters).
Flags that I want to use:
Copy the files and directories and all of their attributes, including ownerships and permissions.
Recursively copy directories and their contents.
When copying files from one directory to another, only copy files that either don't exist in the destination directory or are newer than the existing corresponding files there.
Data verification (so it's certain that the copy succeeded).
A progress bar with an ETA.
Until now I have been using Total Commander for this, but picking out only a few folders every day takes time and is inefficient.
I have experience with Bash and PowerShell, but I am not sure how to approach this task.
Create a static batch file with robocopy commands. I think /copyall is the only switch you need to specify for all this. Other defaults should satisfy requirements.
https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/robocopy
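For illustration, such a static .bat might look like this (the drive letters and the added switches /e, /xo, and /eta are my assumptions; note that robocopy has no built-in post-copy data verification):
rem copy the three folders to the first external drive (E: assumed)
robocopy "C:\my_programming_stuff\folder1" "E:\folder1" /e /copyall /xo /eta
robocopy "C:\my_programming_stuff\folder2" "E:\folder2" /e /copyall /xo /eta
robocopy "C:\my_programming_stuff\folder3" "E:\folder3" /e /copyall /xo /eta
rem repeat the same three lines with F: for the second drive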
I think your time will be better spent learning how to use either FastCopy or FreeFileSync. I used FreeFileSync some years ago but got disgusted with the constantly changing format of the XML file it uses for starting a backup, so I switched to FastCopy. But it looks like FreeFileSync may be getting its act together, and I aim to do some experiments over the summer to see if I want to switch back to it.
Both can handle the long-filename issues, both can be executed from a batch file, and both seem to be of high quality, but FreeFileSync has more features (and is more bloated because of them). Speed-wise, I think FastCopy is probably one of the better products out there, and very streamlined in use and design.

2 files with same hash, but 1 is corrupted and 1 isn't

I found something very weird on a project.
I have 2 files:
One is the input file, a .bip file which you can open with GIS software like QGIS.
Here's the input; this file is provided by the CCSDS and is accessible here.
The other is the output after being compressed and decompressed by a lossless compression algorithm (CCSDS 123, by ESA).
These 2 files share the exact same sha256 and sha1 hashes, so they are identical.
3226009de97d66589fc58cdc9af377e6315ccc69a7095bec8dc04447bf3cea2e test_ptn_x100y36z17_16u.bip
3226009de97d66589fc58cdc9af377e6315ccc69a7095bec8dc04447bf3cea2e test_ptn_decomp.bip (sha256 shown here).
The thing is, while QGIS opens the first file, it refuses to open the second one and shows this message (translated): "The file test_ptn_decomp.bip is not a recognized or valid data source."
Is there something I don't understand about hashes? I've tried moving the files to other directories and renaming them, but nothing changes QGIS-wise.
It is highly unlikely that you got different content with the same sha256 hash by chance, so I'll assume the files are identical. Anyway, it is easy to compare them with any diff program.
So there should be some other differences, things that come to mind:
The file name might contain some meaningful information needed by QGIS. Try renaming the decompressed file, e.g. to decomp_ptn_x100y36z17_16u.bip; maybe the x100.. part is essential?
There may be some additional files that must have matching names. Do you have a .hdr file, as explained in the QGIS tutorials?
https://www.qgistutorials.com/en/docs/open_bil_bip_bsq_files.html
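If you want to rule out any hashing mix-up entirely, a direct byte-wise comparison settles it; for example, in Perl:
use strict;
use warnings;
use File::Compare;

# compare() returns 0 if the two files are byte-identical, 1 if they differ
my $rc = compare($ARGV[0], $ARGV[1]);
print $rc == 0 ? "byte-identical\n"
    : $rc == 1 ? "different\n"
    :            "error: $!\n";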

Search for files with MATLAB

My question is how to use MATLAB to search for files of a certain type in a folder. Here is an example to illustrate my question:
Suppose we have the following folder as well as files in it:
My_folder
    Sub_folder1
        Sub_sub_folder1
            a.txt
        1.txt
        2.txt
    Sub_folder2
        3.txt
    abc.txt
In this example, I want to find all the .txt files in My_folder as well as its sub-folders. I was wondering what I could do with MATLAB. Thanks!
To my knowledge MATLAB doesn't have a built-in function to do recursive directory searches; however, there are a couple available for download on MATLAB Central: here and here.
Alternatively, you could write your own recursive function and use the dir function at each level to find files matching your criteria, or other directories to recurse into.
I agree with the MATLAB Central options -- another method I've used when MLC is not an option (no network, a customer's computer, etc.) is the quick-and-dirty DOS command:
dos(['dir /s/b ' mywildcard])
The /s flag does a recursive directory search for whatever wildcard you specify, and /b makes it so you only get file names (complete with full path, but no headers, file sizes, etc.).
This is obviously platform-dependent, so it's mostly useful when you are forced to work without the "standard" set of utilities you've accumulated.
Even though an answer has been accepted, I would like to point out MATLAB's dir function.
This built-in function returns the contents of the folder in question and indicates which entries are themselves folders. Therefore, with a little loop, one could use this function to search sub-directories as well.
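A minimal sketch of such a loop (the function name is made up; as of R2016b, dir(fullfile(folder, '**', '*.txt')) also searches recursively):
function files = find_txt_files(folder)
% Recursively collect full paths of all .txt files under folder.
entries = dir(folder);
files = {};
for k = 1:numel(entries)
    name = entries(k).name;
    if strcmp(name, '.') || strcmp(name, '..')
        continue
    end
    full = fullfile(folder, name);
    if entries(k).isdir
        files = [files; find_txt_files(full)];  % descend into sub-folder
    elseif ~isempty(regexpi(name, '\.txt$'))
        files = [files; {full}];                % record matching file
    end
end
end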

How to use Perl's Archive::Zip to recursively walk archive files?

I have a small Perl script that I use to search archives for members matching a name. I'd like to enhance this so that if it finds any members in the archive that are also archives (zip, jar, etc.), it will then recursively scan those, looking for the original desired pattern.
I've looked through the "Archive::Zip" documentation, and I thought I saw how to do this. I noticed the "fh()" and "readFromFileHandle()" methods. However, in my testing, it appears that the "fh()" call on an archive member returns the file handle for the containing archive, not the member. Perhaps I'm doing it wrong, but I would appreciate an example of how to do this.
You can't read the contents of any sort of archive member (whether it is text, picture, or another archive) without extracting it from the archive file.
Once you have identified a member that you want to view, you must call extractMember (or, more likely, extractMemberWithoutPaths if the file is to be temporary) to extract it to a disk file. Then you can create a new Archive::Zip object and read the new file while keeping the old one open.
You will presumably want to unlink the extracted archive file once you have catalogued its contents.
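A rough sketch of that extract-and-recurse approach (the nested-archive extensions and the temp-file handling are my assumptions):
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);
use File::Temp qw(tempfile);

# print members of the archive at $path that match $pattern,
# descending into any members that are themselves archives
sub scan_archive {
    my ($path, $pattern) = @_;
    my $zip = Archive::Zip->new;
    $zip->read($path) == AZ_OK or return;
    for my $member ($zip->members) {
        my $name = $member->fileName;
        print "$path: $name\n" if $name =~ $pattern;
        if ($name =~ /\.(zip|jar|war|ear)$/i) {
            my (undef, $tmp) = tempfile(SUFFIX => '.zip', OPEN => 0);
            $zip->extractMemberWithoutPaths($member, $tmp);
            scan_archive($tmp, $pattern);  # recurse into the nested archive
            unlink $tmp;                   # discard the temporary copy
        }
    }
}

scan_archive('outer.zip', qr/MyTarget/);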
Edit
I hadn't come across the Archive::Zip::MemberRead module before. It appears you were on the right track with readFromFileHandle. I would guess that it should work like this, but it would be awkward for me to test it at present.
use Archive::Zip;
use Archive::Zip::MemberRead;

my $zip = Archive::Zip->new;
$zip->read('myfile.zip');
# open the nested member as a file handle and read the inner zip from it
my $zipfh = Archive::Zip::MemberRead->new($zip, 'archive/path/to/member.zip');
my $newzip = Archive::Zip->new;
$newzip->readFromFileHandle($zipfh);

How to compare two .ear files recursively

I'm modifying a build process and I need to do a complete comparison of the contents of two .ear files. That means recursively comparing each archive in the .ear. These .ear files have archives that contain archives.
I've looked at Beyond Compare and Archive Analyzer, but they only do one level at a time. I have to manually drill down into each archive. I'm looking for something more automatic.
Eclipse and UltraCompare do a binary comparison of the two .ears which is not what I want.
How can I compare two .ear files recursively?
zipdiff provides a very good open source solution.
My problem turned out to be more than just expanding the .ear file recursively (I wrote a Java class to do that; recursion made it simple). Once the .ear files are expanded, I have to diff the directories to check for any changes. If anything other than timestamps changed, then I know that the build is producing a different binary.
The second problem is that our build process generates hundreds of .xml files, and subsequent builds re-generate those .xml files with the elements in a different order. I'm not sure why. When I expand two .ear files made by back-to-back builds with no changes to anything, the diff of the resulting directories shows hundreds of .xml files with diffs, even though they are functionally equivalent.
In addition to expanding the .ear files recursively, I need to do a diff that excludes the .xml files in certain directories. I thought that Cygwin diff would do this, but the --exclude switch doesn't recognize any path information:
Cygwin diff won't exclude files if a directory is included in the pattern
If I don't find a solution to this, I'll write another Java class to step through the whole directory structure, doing a single-level diff in each directory and excluding the .xml files in the appropriate directories.
I have the feeling that I'm re-inventing the wheel, but I can't find a wheel right now.
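For what it's worth, a rough Perl sketch of that single-level-diff idea (the directories whose .xml files get skipped are assumptions):
use strict;
use warnings;
use File::Find;
use File::Compare;

my ($dir_a, $dir_b) = @ARGV;
# assumption: directories whose regenerated .xml files should be ignored
my @skip_xml_under = ('META-INF', 'generated');

find({ no_chdir => 1, wanted => sub {
    return unless -f $File::Find::name;
    (my $rel = $File::Find::name) =~ s{^\Q$dir_a\E/?}{};
    return if $rel =~ /\.xml$/i
        && grep { $rel =~ m{^\Q$_\E/} } @skip_xml_under;
    my $other = "$dir_b/$rel";
    if    (!-e $other)                         { print "only in $dir_a: $rel\n" }
    elsif (compare($File::Find::name, $other)) { print "differs: $rel\n" }
} }, $dir_a);
# a second pass with the arguments swapped would catch files only in $dir_b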
In Beyond Compare, go into the Session Settings dialog; on the Handling tab is an Archive Handling option. If it's set to "As folders always", BC will treat archives just like folders, so it's fully recursive.