Using a Spark DataFrame, how to control output file size when saving text or JSON to S3 - scala

I need a way to control the output file size when saving text/JSON to S3 using Java/Scala.
For example, I would like a rolling file size of 10 MB.
How can I control this with DataFrame code?
I have experimented with spark.sql.files.maxPartitionBytes,
but it does not give accurate control:
e.g. if I set spark.sql.files.maxPartitionBytes=32MB,
the output files come out around 33 MB.
The other option is to use repartition: df.repartition(n) creates n output files,
where n = input file size / roll size (rounded up).
E.g. for an input file of 200 MB and a roll size of 32 MB,
n = ceil(200/32) = 7, which should create six files of about 32 MB and one of about 8 MB.
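A minimal sketch of that approach in Scala (the paths, the 32 MB target, and the way the input size is estimated are my assumptions; adjust for your source):

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("rolling-output").getOrCreate()
    val df = spark.read.json("s3a://my-bucket/input/")   // hypothetical input path

    // Estimate the total input size from the source directory.
    val inPath = new Path("s3a://my-bucket/input/")
    val fs = inPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
    val inputBytes = fs.getContentSummary(inPath).getLength

    // n = ceil(input size / roll size), e.g. ceil(200 MB / 32 MB) = 7
    val rollBytes = 32L * 1024 * 1024
    val n = math.max(1, math.ceil(inputBytes.toDouble / rollBytes).toInt)

    df.repartition(n).write.json("s3a://my-bucket/output/")

Note that repartition spreads rows roughly evenly, so in practice you would typically get 7 files of about 28-29 MB each rather than six 32 MB files plus a remainder, and the serialized output size can differ from the input size anyway.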
Appreciate any thoughts about controlling the output file size.
Thanks

Related

Why is reading h5 files extremely slow?

I have a data generator which works but is extremely slow at reading data from a 200k-image dataset.
I use:

    X = f[self.trName][idx * self.batch_size:(idx + 1) * self.batch_size]  # read one batch

after having opened the file with f = h5py.File(fileName, 'r').
It seems to get slower as idx grows (sequential access?),
but in any case it takes at least 10 seconds (sometimes >20 s) to read a batch, which is far too slow (especially since it's reading from an SSD!).
Any ideas?
The dataset takes 50.4 GB on disk (compressed) and its shape is:
(210000, 2, 128, 128)
(This is the shape of the training set; the targets have the same shape and are stored as another dataset inside the same .h5 file.)

What is the maximum size of TIFF metadata?

Is there a maximum limit to the amount of metadata that can be incorporated in an individual field of TIFF file metadata? I'd like to store a large text (up to a few MB) in the ImageDescription field.
There's no specific maximum limit for the ImageDescription field; however, there is a maximum size for the entire TIFF file, which is 4 GB. From the TIFF 6.0 spec:
The largest possible TIFF file is 2**32 bytes in length.
This is due to the offsets in the file structure being stored as unsigned 32-bit integers.
Thus, the theoretical maximum size is the one @cgohlke points out in the comments ("2^32 minus the offset"), but most likely you want to keep it smaller if you also intend to include pixel data...
Storing "a few MB" should not be a problem.

How to reduce size and size on disk?

We have created automation projects using Katalon Studio.
Currently the project folder properties show:
Size: 1.61 MB
Size on disk: 4.45 MB
Contains: 1,033 files, 444 folders
How can we reduce the difference between Size and Size on disk? Is this something that needs to be sorted out as the project grows?
This is probably related to your disk's cluster size. On disk, a file can occupy no less than one cluster, and clusters are usually a few KB. For example, if your cluster size is 4 KB, then a 1-byte file will still take up 4 KB on disk. Generally this is more noticeable when you have many small files. If you want to change this, you will need to reformat your filesystem and choose a smaller cluster size.
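A small sketch of that rounding, assuming a 4 KB cluster (the function name is illustrative):

    // Size on disk rounds each file up to a whole number of clusters.
    def sizeOnDisk(fileBytes: Long, clusterBytes: Long = 4096L): Long =
      if (fileBytes == 0L) 0L
      else ((fileBytes + clusterBytes - 1) / clusterBytes) * clusterBytes

    sizeOnDisk(1)     // 4096: a 1-byte file still occupies a full cluster
    sizeOnDisk(5000)  // 8192: just over one cluster takes two

With 1,033 files there can be up to 1033 x 4 KB, roughly 4 MB, of such slack, which is in line with the gap between 1.61 MB and 4.45 MB that you observe.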

Partition a large scale HDF5 dataset into sub-files

I have a pretty large HDF5 dataset of size [1 12672 1 228020], following the format [height width channel N]. This file occupies about 22 GB on disk.
I want to partition this file into smaller parts, say 2 GB each.
I have tried h5repart, but it does not work well, because I'm not able to display the partitioned files in MATLAB using h5disp('...').
One solution would be to use the 'chunk' capability of the HDF5 format.
Using the MATLAB low-level HDF5 functions, you should be able to read just the chunks you require.
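As a back-of-the-envelope check on the part sizes (plain arithmetic using the numbers from the question, assuming the slices along N are roughly uniform):

    val totalBytes = 22L * 1024 * 1024 * 1024      // ~22 GB on disk
    val n = 228020L                                // the N dimension
    val bytesPerSlice = totalBytes / n             // ~101 KiB per [1 x 12672 x 1] slice
    val slicesPerPart = 2L * 1024 * 1024 * 1024 / bytesPerSlice  // ~20729 slices per ~2 GB part

Reading on the order of 20,000 slices per part would therefore keep each part near the 2 GB target.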

What should be the largest size of a file for MATLAB import?

I am trying to import a .txt data file into MATLAB (2013a). The file is 64,261 KB. Every time I click on Import Data, the program freezes.
Is there a limit for the file size or is it just my machine?
You'll want to make sure you're using memmapfile(). MATLAB (through your OS) can memory-map up to 2 GB on a 32-bit system and up to 256 TB on a 64-bit system. For details, see:
http://www.mathworks.com/help/matlab/import_export/overview-of-memory-mapping.html
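(For comparison only: memmapfile is MATLAB-specific, but the underlying OS facility is the same one a JVM memory map uses. A Scala sketch of the idea, with a hypothetical file name:)

    import java.nio.channels.FileChannel
    import java.nio.file.{Paths, StandardOpenOption}

    val ch = FileChannel.open(Paths.get("big-data.txt"), StandardOpenOption.READ)
    val buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())  // nothing is read yet
    val firstByte = buf.get(0)  // pages are faulted in only as the buffer is touched
    ch.close()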