Spark Scala read a specific file within a compressed archive - scala

How do I read (without OS uncompress) a specific file in a compressed archive?
e.g. I have thousands of compressed archives with:
file1.tsv <--- read by default
file2.tsv <--- I want this one in each archive
file3.tsv
I want to read file2.tsv in every compressed archive without uncompressing each first.
When I tried spark.read("/compressed-archive.tar.gz") it only seems to read file1.tsv.
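One possible approach is to skip Spark's text reader and walk the tar stream yourself, stopping at the entry you want. The sketch below uses only the JDK. `TarGzEntryReader` and `readEntry` are hypothetical names, not part of Spark. It assumes plain ustar-style headers with an octal size field, entry names of at most 100 characters, and entries small enough to fit in memory.

```scala
import java.io.{DataInputStream, EOFException, InputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPInputStream

// Hypothetical helper, not part of Spark: pulls one named entry out of a .tar.gz stream.
object TarGzEntryReader {
  private val BlockSize = 512 // tar headers and data are laid out in 512-byte blocks

  /** Returns the bytes of the first entry named exactly `wanted`, or None. */
  def readEntry(gzipped: InputStream, wanted: String): Option[Array[Byte]] = {
    val in = new DataInputStream(new GZIPInputStream(gzipped))
    try {
      val header = new Array[Byte](BlockSize)
      var result: Option[Array[Byte]] = None
      var done = false
      while (!done) {
        val gotHeader =
          try { in.readFully(header); true }
          catch { case _: EOFException => false }
        if (!gotHeader || header.forall(_ == 0)) {
          done = true // end of stream, or the all-zero end-of-archive block
        } else {
          val name = field(header, 0, 100)
          val size = java.lang.Long.parseLong(field(header, 124, 12).trim, 8) // octal
          if (name == wanted) {
            val data = new Array[Byte](size.toInt)
            in.readFully(data)
            result = Some(data)
            done = true
          } else {
            // Entry data is padded to a whole number of blocks; skip all of it.
            skipFully(in, (size + BlockSize - 1) / BlockSize * BlockSize)
          }
        }
      }
      result
    } finally in.close()
  }

  // Reads a NUL-terminated ASCII header field.
  private def field(buf: Array[Byte], offset: Int, len: Int): String =
    new String(buf.slice(offset, offset + len).takeWhile(_ != 0), StandardCharsets.US_ASCII)

  // InputStream.skip may skip fewer bytes than asked, so read and discard instead.
  private def skipFully(in: InputStream, n: Long): Unit = {
    val buf = new Array[Byte](8192)
    var remaining = n
    while (remaining > 0) {
      val read = in.read(buf, 0, math.min(buf.length.toLong, remaining).toInt)
      if (read < 0) throw new EOFException("truncated tar entry")
      remaining -= read
    }
  }
}
```

In Spark, `sc.binaryFiles("/path/*.tar.gz")` yields `(path, PortableDataStream)` pairs, and the stream returned by `PortableDataStream.open()` could be passed to `readEntry` inside a `flatMap`. The gzip stream still has to be decompressed up to the wanted entry, but nothing is written to disk. If the archives were built from a directory, the entry name may carry a prefix such as `dir/file2.tsv`.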

Related

Does IPFS provide block-level file copying feature?

Update 11-14-2019: Please see the following discussion for feature request to IPFS: git-diff feature: Improve efficiency of IPFS for sharing updated file. Decrease file/block duplication
There is a .tar.gz file, file.tar.gz (~1 GB), which contains a data.txt file. It is stored in my IPFS repo and was pulled from another node, node-a.
I opened data.txt, added a single character at a random location in the file (beginning, middle, or end), compressed it again as file.tar.gz, and stored it in my IPFS repo.
When node-a wants to get the updated tar.gz file again, a re-sync will take place.
I want to know whether, with IPFS, the entire 1 GB file will be synced, or whether there is a way to sync only the parts of the file that changed (called the delta).
A similar question was asked about Google Drive: Does Google Drive sync entire file again on making small changes?
The feature you are asking is called "Block-Level File Copying". With
this feature, when you make a change to a file, rather than copying
the entire file from your hard drive to the cloud server again, only
the parts of the file that changed (called the delta) get sent.
As far as I know, Dropbox, pCloud, and OneDrive (which, however, only supports it for Microsoft Office documents) offer block-level sync.
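To make the idea of a delta concrete, here is a minimal sketch of block-level comparison: split both versions into fixed-size blocks, hash each block, and report which block indices differ. `BlockDelta`, its method names, and the fixed block size are assumptions for illustration. This is not a description of how IPFS or any of the services above work internally.

```scala
import java.security.MessageDigest

// Illustrative only: fixed-size block hashing to find which blocks changed.
object BlockDelta {
  /** SHA-256 hex digest of every `blockSize`-byte block of `data`. */
  def blockHashes(data: Array[Byte], blockSize: Int): Vector[String] =
    data.grouped(blockSize).map { block =>
      MessageDigest.getInstance("SHA-256").digest(block).map("%02x".format(_)).mkString
    }.toVector

  /** Indices of blocks that differ between the two versions (or exist in only one). */
  def changedBlocks(oldData: Array[Byte], newData: Array[Byte], blockSize: Int): Seq[Int] = {
    val before = blockHashes(oldData, blockSize)
    val after = blockHashes(newData, blockSize)
    (0 until math.max(before.length, after.length)).filter(i => before.lift(i) != after.lift(i))
  }
}
```

With fixed-size blocks, overwriting one byte changes only the block that contains it, while inserting one byte shifts everything after it, so every later block hashes differently. Recompressing the archive is likely to make this worse, because a small change in the input usually alters the compressed stream from that point on.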

Getting a list of all files inside a zip/rar/7z file with Scala

Is there a way to get a list of all the files inside a compressed file without decompressing it?
I don't mind using a Java library but all the solutions I found performed a decompression.
Also, if it is relevant, I know that the compressed file has sub directories in it and I want to also get the files from them.
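For the ZIP case, a minimal sketch with the JDK's `java.util.zip.ZipFile` is shown below. It reads the archive's directory of entries rather than inflating the entry data, and entries in subdirectories appear with their full path. `ZipLister` and `listFiles` are hypothetical names. RAR and 7z are not covered by the JDK and would need a third-party library.

```scala
import java.io.File
import java.util.zip.ZipFile
import scala.collection.mutable.ListBuffer

// Illustrative helper: lists file entries of a ZIP archive without extracting them.
object ZipLister {
  /** Names of all non-directory entries, e.g. "sub/dir/file.txt". */
  def listFiles(archive: File): List[String] = {
    val zip = new ZipFile(archive)
    try {
      val names = ListBuffer.empty[String]
      val entries = zip.entries()
      while (entries.hasMoreElements) {
        val entry = entries.nextElement()
        if (!entry.isDirectory) names += entry.getName
      }
      names.toList
    } finally zip.close()
  }
}
```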

Get Maximum Compression from 7zip compression algorithm

I am trying to compress some of my large document files, but most of the files are compressed by only 10% at most. I am using 7zip terminal commands.
7z a filename.7z -m0=LZMA -mx=9 -mmt=on -aoa -mfb=64 filename.pptx
Any suggestions on changing the parameters? I need at least a 30% compression ratio.
.pptx and .docx files are internally .zip archives. You cannot expect much additional compression on a file that is already compressed.
The documentation states that lzma2 handles incompressible data better, so you can try
7z a -m0=lzma2 -mx filename.7z filename.pptx
But the required 30% is almost unreachable.
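The effect is easy to reproduce with the JDK's Deflate implementation, the same family of compression used inside .zip containers. The sketch below compresses some made-up text once, then compresses the result again. `Recompress` and `sampleText` are illustrative names, and the sample data is synthetic, not a real pptx.

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.{Deflater, DeflaterOutputStream}
import scala.util.Random

// Illustrative only: shows that a second compression pass gains almost nothing.
object Recompress {
  /** Compresses `data` with Deflate at the highest standard level. */
  def deflate(data: Array[Byte]): Array[Byte] = {
    val deflater = new Deflater(Deflater.BEST_COMPRESSION)
    val bytes = new ByteArrayOutputStream()
    val out = new DeflaterOutputStream(bytes, deflater)
    try out.write(data)
    finally { out.close(); deflater.end() }
    bytes.toByteArray
  }

  /** Made-up stand-in for document text: random words from a small vocabulary. */
  def sampleText(words: Int, seed: Long): Array[Byte] = {
    val vocab = Vector("slide", "title", "chart", "table", "notes", "layout", "theme", "shape")
    val rnd = new Random(seed)
    Iterator.fill(words)(vocab(rnd.nextInt(vocab.length))).mkString(" ").getBytes("UTF-8")
  }

  def main(args: Array[String]): Unit = {
    val raw = sampleText(20000, seed = 1L)
    val once = deflate(raw)   // first pass shrinks the text considerably
    val twice = deflate(once) // second pass sees near-random bytes
    println(s"raw=${raw.length} once=${once.length} twice=${twice.length}")
  }
}
```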
If you really need that compression, you could use the fact that a pptx is just a fancy zip file:
Unzip the pptx, then compress it with 7zip. To recover an equivalent (but not identical) pptx decompress with 7zip and recompress with zip.
There are probably some complications, for example with epub there is a certain file that must be stored uncompressed as first file in the archive at a certain offset from the start. I'm not familiar with pptx, but it might have similar requirements.
I think it's unlikely that the small reduction in file size is worth the trouble, but it's the only approach I can think of.
Depending on what's responsible for the size of the pptx you could also try to compress the contained files. For example by recompressing png files with a better compressor, stripping unnecessary data (e.g. meta-data or change histories) or applying lossy compression with lower quality settings for jpeg files.
One idea for getting maximum compression is to
'recompress' these .zip archives (.docx, .pptx, .jar, ...) using -m0 (store, i.e. no compression) and then
apply lzma2 to them.
lzma2 is pretty good. However, if the file contains many JPEGs, consider giving the open-source packer PeaZip, or more specifically paq8o, a try. Paq8 has a built-in JPEG compressor and supports range compression, so it can also handle JPEGs that are embedded inside another file. WinZip's zipx, in contrast, requires pure JPEG files and is useless in this case.
But again, for PAQ to compress your target file effectively, you'll need to 'null' the zip/deflate compression, i.e. turn the file into an uncompressed zip.
PAQ is probably a little exotic; however, in my eyes it is more honest and clear than zipx. PAQ is unsupported, so as always it's a good idea to just google for whatever you don't have or know, and you will find something.
Zipx, in contrast, can be a little deceptive: it looks like a normal zip, and its files are listed properly in WinRAR or 7-Zip, but extracting the JPEGs fails, so to an inexperienced user the zip may seem corrupted. It is much harder to find out that it is a zipx, which so far only WinZip or The Unarchiver (unar.exe) can handle properly.
PPTX, XLSX, and DOCX files can indeed be compressed effectively if there are many of them. By unzipping each of them into their directories, an archiver can find commonalities between them, deduplicating the boilerplate XML as well as any common text between them.
If you must use the ZIP format, first create a zero-compression "store" archive containing all of them, then ZIP that. This is necessary because each file in a ZIP archive is compressed from scratch without taking advantage of redundancies across different files.
By taking advantage of boilerplate deduplication, 30% should be a piece of cake.
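A rough way to see the difference is to compare compressing each file on its own with compressing all of them as one stream. The sketch below does this with the JDK's Deflate on synthetic documents that share a block of boilerplate. `SolidVsPerFile`, `sampleDocs`, and the sample data are assumptions for illustration, and raw Deflate streams stand in for the actual ZIP container.

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.{Deflater, DeflaterOutputStream}
import scala.util.Random

// Illustrative only: per-file compression versus one "solid" stream.
object SolidVsPerFile {
  def deflate(data: Array[Byte]): Array[Byte] = {
    val deflater = new Deflater(Deflater.BEST_COMPRESSION)
    val bytes = new ByteArrayOutputStream()
    val out = new DeflaterOutputStream(bytes, deflater)
    try out.write(data)
    finally { out.close(); deflater.end() }
    bytes.toByteArray
  }

  /** Total size when every file is compressed from scratch, as in a normal ZIP. */
  def perFileSize(files: Seq[Array[Byte]]): Int = files.map(f => deflate(f).length).sum

  /** Size when the files are concatenated first, so shared content is seen once. */
  def solidSize(files: Seq[Array[Byte]]): Int = deflate(Array.concat(files: _*)).length

  /** Made-up documents: identical boilerplate followed by a little unique text. */
  def sampleDocs(count: Int): Seq[Array[Byte]] = {
    val rnd = new Random(42L)
    def words(n: Int): String = Iterator.fill(n)(rnd.alphanumeric.take(6).mkString).mkString(" ")
    val boilerplate = words(150)
    (1 to count).map(_ => (boilerplate + " " + words(20)).getBytes("UTF-8"))
  }

  def main(args: Array[String]): Unit = {
    val docs = sampleDocs(20)
    println(s"per-file=${perFileSize(docs)} solid=${solidSize(docs)}")
  }
}
```

Deflate only looks back 32 KB, so this particular sketch benefits only when the repeated content sits close together in the stream. Archivers with larger dictionaries, such as 7z with lzma2, can find matches much further apart.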

Talend writing file names from a list to a file for process completion

My job has the following steps:
- Connect to ftp location
- Download compressed files
- Uncompress files to different folder
- Delete compressed files
- Write file names to a tracking file
ftpConnection -OnComponentOk--> ftpList-Iterate--> ftpGet -Iterate--> fileList-Iterate--> fileUnarchive-Iterate--> fileDelete
The question is where I can write the uncompressed file names to the tracking file. When I try to Iterate from fileUnarchive to fileOutputDelimited, it does not allow me; similarly, if I want to add a map from fileDelete, it does not allow me. Do I need a map, or can I make use of the global variable somehow?
One way I could do it is right after ftpGet, but I would prefer to do it at a later stage (after unarchiving or deletion) so I don't update the file if the process fails at one of these steps.
Thanks.
Try tFileDelete --OnComponentOk--> tFixedFlowInput (here you can use the same global variable that contains the current file name from tFileList) --(main flow)--> tFileOutputDelimited...

what is the difference between tar and gZ?

When I compress the file "file.tar.gZ" in the iPhone SDK, it gives file.tar, but both the tar and the tar.gZ have the same size. Any help, please?
*.tar means that multiple files are combined into one (Tape Archive).
*.gz means that the file is compressed as well (gzip compression).
Edit: that the size is the same doesn't say a lot. Sometimes files can't be compressed.
As Rhapsody said, tar is an archive containing multiple files, and gz is a file that is compressed using gzip. The reason why two formats are used is because gzip only supports compressing one file - perhaps due to the UNIX philosophy that a program should do one thing, and do it well.
In any case, if you have the option, you may want to use bzip2 compression, which usually compresses files to a smaller size than gzip, at the cost of being slower.