Reading NetCDF file within tar.gz file without extracting the tar file - matlab

I am looking for a way to read data from NetCDF-format files stored within a tar file without extracting the file first. The reason is that we have thousands of such data files, each of significant size, and extracting them would require significant disk space and time.
Is there a way I can achieve this using Matlab, or by other means? Some online topics discuss reading a text file within a tar file without extracting it on Linux, but not a NetCDF file.
I see there may be ways to do this on a Unix/Linux machine, but is there a way to do the same on a Windows operating system?

I reached out to Matlab support and they gave me a solution that reduced the tar extraction time significantly.
Solution: instead of using Matlab's untar command, call tar directly through a system command: system('tar xzvf filename.tar.gz *.nc').
This reduced the extraction time for one file from 13 minutes to 8 seconds.
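Since the question also asks about ways other than Matlab, here is a minimal sketch in Python, assuming the standard tarfile module and the netCDF4 package (whose Dataset constructor can read from an in-memory buffer via its memory argument). The archive name, member filter, and variable listing are placeholders, not part of the original solution.

import tarfile
import netCDF4

with tarfile.open("filename.tar.gz", "r:gz") as tar:
    for member in tar.getmembers():
        if not member.name.endswith(".nc"):
            continue
        # Read the member into memory; nothing is written to disk.
        data = tar.extractfile(member).read()
        ds = netCDF4.Dataset(member.name, mode="r", memory=data)
        print(member.name, list(ds.variables))
        ds.close()

Note that the gzip stream still has to be decompressed sequentially to reach each member, so this saves disk space and file-creation overhead rather than decompression time.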

Related

Which is the fastest way to read a few lines out of a large hdfs dir using spark?

My goal is to read a few lines out of a large HDFS directory; I'm using Spark 2.2.
This directory was generated by a previous Spark job, and each task wrote a single small file into it, so the whole directory is about 1 GB in size and contains thousands of small files.
When I use collect(), head(), or limit(), Spark loads all the files and creates thousands of tasks (as seen in the Spark UI), which takes a lot of time, even though I only want to show the first few lines of the files in this directory.
So what is the fastest way to read this directory? Ideally the solution would load only a few lines of data in order to save time.
Following is my code:
sparkSession.sqlContext.read.format("csv").option("header","true").option("inferschema","true").load(file).limit(20).toJSON.toString()
sparkSession.sql(s"select * from $file").head(100).toString
sparkSession.sql(s"select * from $file").limit(100).toString
If you point Spark directly at the directory, it will load all the files anyway and only then take the requested records. So, even before any Spark logic, get a single file name from the directory using your technology of choice (Java, Scala, or Python) and pass just that file name to the textFile (or equivalent read) method; that way Spark won't load all the files.
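A hypothetical sketch of that idea in PySpark (the question's snippets are Scala, but the approach is the same); the directory path, the part-file naming, and the CSV options are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Point the reader at a single part-file instead of the whole directory,
# so only one small file is scanned and only a handful of tasks are created.
one_file = "hdfs:///path/to/output_dir/part-00000*"

df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(one_file)
      .limit(20))
df.show()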

Reading in multiple files compressed in tar.gz archive into Spark [duplicate]

This question already has answers here: Read whole text files from a compression in Spark (2 answers). Closed 6 years ago.
I'm trying to create a Spark RDD from several json files compressed into a tar.gz archive.
For example, I have 3 files
file1.json
file2.json
file3.json
And these are contained in archive.tar.gz.
I want to create a dataframe from the json files. The problem is that Spark is not reading in the json files correctly. Creating an RDD using sqlContext.read.json("archive.tar.gz") or sc.textFile("archive.tar.gz") results in garbled/extra output.
Is there some way to handle gzipped archives containing multiple files in Spark?
UPDATE
Using the method given in the answer to Read whole text files from a compression in Spark, I was able to get things running, but this method does not seem suitable for large tar.gz archives (>200 MB compressed), as the application chokes on large archive sizes. Since some of the archives I'm dealing with reach sizes of up to 2 GB after compression, I'm wondering whether there is an efficient way to deal with the problem.
I'm trying to avoid extracting the archives and then merging the files together, as this would be time-consuming.
A solution is given in Read whole text files from a compression in Spark.
Using the code sample provided, I was able to create a DataFrame from the compressed archive like so:
val jsonRDD = sc.binaryFiles("gzarchive/*")
  .flatMapValues(x => extractFiles(x).toOption)
  .mapValues(_.map(decode()))  // extractFiles and decode are defined in the linked answer

val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))
This method works fine for tar archives of a relatively small size, but is not suitable for large archive sizes.
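For reference, the same binaryFiles approach can be sketched in PySpark with Python's tarfile module; it shares the Scala version's memory limitation, and the paths and helper name below are illustrative rather than taken from the linked answer.

import io
import tarfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def extract_json(path_and_bytes):
    # Unpack one whole .tar.gz archive, received as bytes, inside the task.
    path, data = path_and_bytes
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".json"):
                yield tar.extractfile(member).read().decode("utf-8")

json_strings = sc.binaryFiles("gzarchive/*").flatMap(extract_json)
df = spark.read.json(json_strings)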
A better solution to the problem seems to be to convert the tar archives to Hadoop SequenceFiles, which are splittable and hence can be read and processed in parallel in Spark (as opposed to tar archives).
See: A Million Little Files – Digital Digressions by Stuart Sierra.
Files inside a *.tar.gz file are, as you have already mentioned, compressed. You cannot put the three files into a single compressed tar archive and expect the import function (which is looking only for text) to know how to decompress the files, unpack them from the tar archive, and then import each file individually.
I would recommend you take the time to upload each individual json file manually, since the sc.textFile and sqlContext.read.json functions cannot handle compressed archives like this.

How to use wget to download large osm dataset?

I want to create a global dataset of wetlands using the OSM database. As huge datasets cause problems for overpass-turbo and similar tools, I thought I could use wget to download the planet file and filter it for only the data I'm interested in. The problem is that I don't know much about wget, so I wanted to know whether there is a way to filter the data from the planet file while downloading and unzipping it.
In general, I'm looking for the least time- and disk-space-consuming way to get that data. Do you have any suggestions?
wget is just a download tool; it can't filter on the fly. You could probably pipe the data to a second tool that does the filtering on the fly, but I don't see any advantage in that, and the disadvantage is that you can't verify the file checksum afterwards.
Download the planet and filter it afterwards using osmosis or osmfilter.
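A hypothetical sketch of that workflow, wrapped in Python's subprocess for consistency with the other examples here; the planet URL is the usual mirror, and the file names and the natural=wetland filter are assumptions to adapt:

import subprocess

# 1. Download the planet file (resumable thanks to -c).
subprocess.run(["wget", "-c",
                "https://planet.openstreetmap.org/pbf/planet-latest.osm.pbf"],
               check=True)

# 2. Convert to the .o5m format that osmfilter works on.
subprocess.run(["osmconvert", "planet-latest.osm.pbf", "-o=planet.o5m"],
               check=True)

# 3. Keep only the wetland features.
subprocess.run(["osmfilter", "planet.o5m",
                "--keep=natural=wetland", "-o=wetlands.o5m"],
               check=True)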

Matlab: direct/efficient untar to memory to avoid slow disk interactions

Given a .tar archive, Matlab allows one to extract the contained files to disk via the UNTAR command. One can then manipulate the extracted files in the ordinary way.
Issue: when several files are stored in a tarball, they are stored contiguously on disk and, in principle, they can be accessed serially. Once such files are extracted, this contiguity no longer holds and file access can become random, hence slow and inefficient.
This is especially critical when the files are many (thousands) and small.
My question: is there any way to access the archived files while avoiding the preliminary extraction (in an HDF5-like fashion)?
In other words, would it be possible to cache the .tar so as to access the contained files from memory rather than from disk?
(In general, direct .tar manipulation is possible, e.g. with tar-cs in C#, or in Python.)
No, as far as I know.
If you're using Matlab on Linux, try extracting to tempname. This will extract to tmpfs, which should be faster to access (a bad idea if we are talking about several GB).
Otherwise, you can use system('tar xf file.tar only/needed/file') or Python to get more flexible untar behavior.
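As a sketch of the "or Python" suggestion: Python's tarfile module can hand back a single member as an in-memory file object without touching the disk (the archive and member names below reuse the example from the answer):

import tarfile

with tarfile.open("file.tar", "r") as tar:
    member = tar.getmember("only/needed/file")
    with tar.extractfile(member) as f:
        payload = f.read()   # bytes of just that one member, read straight from the archive
print(len(payload))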
After some time I finally worked out a solution which gave me unbelievable speedups (like 10x or so).
In a word: ramdisk (tested on Linux: Ubuntu and CentOS).
Recap:
Since the problem has some generality, let me state it again in a more complete fashion.
Say that I have many small files stored on disk (text, pictures; on the order of millions) which I want to manipulate (e.g. via Matlab).
Working on such files (i.e. loading them or transmitting them over the network) when they are stored on disk is tremendously slow, since the disk access is mostly random.
Hence, tarballing the files into archives (e.g. of fixed size) looked to me like a good way to keep the disk access sequential.
Problem:
If the manipulation of the .tar requires a preliminary extraction to disk (as happens with Matlab's UNTAR), the speed-up given by sequential disk access is mostly lost.
Workaround:
The tarball (provided it is reasonably small) can be extracted to memory and then processed from there. In Matlab, as I stated in the question, in-memory .tar manipulation is not possible, though.
What can be done (equivalently) is untarring to a ramdisk.
On Linux, e.g. Ubuntu, a default ramdisk is mounted at /run/shm (tmpfs). Files can be untarred there via Matlab, giving extremely fast access.
In other words, a possible workcycle is:
untar to /run/shm/mytemp
manipulate in memory
possibly tar the output again to disk
This allowed me to cut the timescale of my processing from 8 hours to 40 minutes, with the CPUs fully loaded.
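The same workcycle, sketched in Python for readers who are not in Matlab (the ramdisk path is the one described above; the archive names are placeholders):

import shutil
import tarfile

workdir = "/run/shm/mytemp"

# 1. untar to the tmpfs-backed ramdisk
with tarfile.open("archive.tar") as tar:
    tar.extractall(workdir)

# 2. manipulate the extracted files here; they live in memory-backed storage

# 3. possibly tar the results again back to disk, then clean up
with tarfile.open("results.tar", "w") as out:
    out.add(workdir, arcname="results")
shutil.rmtree(workdir)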

How To Create File System Fragmentation?

Risk factors for file fragmentation include mostly-full disks and repeated file appends. What are other risk factors for file fragmentation? And how would one write a program, in a common language like C++/C#/VB/VB.NET, that works with files and creates new files with the goal of increasing file fragmentation?
WinXP / NTFS is the target
Edit: Would something like this be a good approach? With the hard drive's free space at the start = FreeMB_atStart:
creating files of, say, 10 MB each until 90% of the remaining hard drive space is filled,
deleting every 3rd created file,
then making a file of size FreeMB_atStart * .92 / 3.
This should achieve at least some level of fragmentation on most file systems:
Write numerous small files,
Delete some of them at random,
Write a large file, byte by byte.
Writing it byte by byte is important because, if the file system is intelligent, it could otherwise just write the large file to a single contiguous place.
Another possibility would be to write several files simultaneously, byte by byte; this would probably have an even greater effect.
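A minimal sketch of that recipe in Python, combining the random deletions with several simultaneously growing files (any language with byte-level file I/O works the same way); the directory, counts, and sizes are arbitrary assumptions:

import os
import random

target = "frag_test"               # a directory on the volume to fragment
os.makedirs(target, exist_ok=True)

# 1. Fill space with numerous small files.
small = []
for i in range(10_000):
    path = os.path.join(target, f"small_{i:05d}.bin")
    with open(path, "wb") as f:
        f.write(os.urandom(64 * 1024))          # 64 KB each
    small.append(path)

# 2. Delete some of them at random to punch holes into the free space.
for path in random.sample(small, len(small) // 3):
    os.remove(path)

# 3. Grow several large files a little at a time, unbuffered and in parallel,
#    so the file system has to scatter their extents into the holes.
big = [open(os.path.join(target, f"big_{i}.bin"), "wb", buffering=0)
       for i in range(4)]
for _ in range(200_000):
    for f in big:
        f.write(b"\0" * 512)
for f in big:
    f.close()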