Spark buffer holder size limit issue - scala

I am doing the aggregate function on column level like
df.groupby("a").agg(collect_set(b))
The column value is increasing beyond default size of 2gb.
Error details:
Spark job fails with an IllegalArgumentException: Cannot grow
BufferHolder error. java.lang.IllegalArgumentException: Cannot grow
BufferHolder by size 95969 because the size after growing exceeds size
limitation 2147483632
As we already known
BufferHolder has a maximum size of 2147483632 bytes (approximately 2 GB).
If a column value exceeds this size, Spark returns the exception.
I have removed all duplicate records, repartition(), increased the default partitions and did increase all memory parameters also but no use it is giving above error.
We have huge volume of data in a column after applying the agg of collect_set.
Is there any way to increase the BufferHolder maximum size of 2gb while processing?

Related

Why is Size in Memory >> Size on disk for an RDD in spark?

For a particular RDD, when I check out the spark UI, the Size in memory is 143.8GiB whereas the Size on disk is 4.7Gib.
Why is the size in memory so large? Is it because the data stored on disk is in a highly compressed format(parquet)?

Redshift database size grew sagnificantly when the cluster was resized

In Redshift I had a cluster with 4 nodes of the type dc2.large
The total size of the cluster was 160*4=640gb. The system showed 100% storage full. The size of the database was close to 640gb
Query I use to check the size of the db:
select sum(used_mb) from (
SELECT schema as table_schema,
"table" as table_name,
size as used_mb
FROM svv_table_info d order by size desc
)
I added 2 dc2.large nodes - classic resize which set the size of the cluster to 160*6=960gb, but when I checked the size of the database suddenly I saw that it also grew and again takes almost 100% of the cluster with increased size.
Database size grew with the size of the cluster!
I had to perform additional resize operation - elastic one. From 6 nodes to 12 nodes. The size of the data remained close to 960gb
How is it possible that the size of the database grew from 640gb to 960gb as a result of cluster resize operation?
I'd guess that your database has a lot of small tables in it. There are other ways this can happen but this is by far the most likely cause. You see Redshift uses a 1MB "block" as the minimum storage unit which is great for large data table storage but is inefficient for small (< 1M rows per slice in the cluster).
If you have a table that has say 100K rows split across your 4 nodes of dc2.large nodes (8 slices), each slice holds 12.5K rows. Each column for this table will need 1 block (1MB) to store the data. However, a block on average can store 200K rows (per column) so most of the blocks for this table are mostly empty. If you add rows the on-disk size (post vacuum) doesn't increase. Now if you add 50% more nodes you are also adding 50% more slices which just adds 50% more nearly empty blocks to the table's storage.
If this isn't your case I can expand on other ways this can happen but this really is the most likely in my experience. Unfortunately the fix for this is often to revamp your data model or to offload some less used data to Spectrum (S3).

What is the maximum size of symbol data type in KDB+?

I cannot find the maximum size of the symbol data type in KDB+.
Does anyone know what it is?
If youa re talking the physical length of a symbol, well symbols exist as interred strings in kdb, so the maximum string length limit would apply. As strings are just a list of characters in kdb, the maximum size of a string would be the maximum length of a list. In 3.x this would be 264 - 1, In previous versions of kdb this limit was 2,000,000,000.
However there is a 2TB maximum serialized size limit that would likely kick in first, you can roughly work out the size of a sym by serializing it,
q)count -8!`
10
q)count -8!`a
11
q)count -8!`abc
13
So each character adds a single byte, this would give a roughly 1012 character length size limit
If you mean the maximum amount of symbols that can exist in memory, then the limit is 1.4B.

How large does the FAT structure and how large is the file?

Consider the following parameters of a FAT based lesystem:
Blocks are 8KB (213 bytes) large
FAT entries are 32 bits wide, of which 24 bits are used to store a block address
A. How large does the FAT structure need to be accommodate a 1GB (2^30 bytes) disk?
B. What is the largest theoretical le size supported by the FAT structure from part (A)?
A. How large does the FAT structure need to be accommodate a 1GB (2^30 bytes) disk?
The FAT file system splits the space into clusters, then has a table (the "cluster allocation table" or FAT) with an entry for each cluster (to say if it's free, faulty or which cluster is the next cluster in a chain of clusters). To work out size of the "cluster allocation table" divide the total size of the volume by the size of a cluster (to determine how many clusters and how many entries in the "cluster allocation table"), then multiply by the size of one entry, then maybe round up to a multiple of the cluster size or not (depending on which answer you want - actual size or space consumed).
B. What is the largest theoretical le size supported by the FAT structure from part (A)?
The largest file size supported is determined by either (whichever is smaller):
the size of "file size" field in the file's directory entry (which is 32-bit for FAT32 and would therefore be 4 GiB); or
the total size of the space minus the space consumed by the hidden/reserved/system area, cluster allocation table, directories and faulty clusters.
For a 1 GiB volume formatted with FAT32, the max. size of a file would be determined by the latter ("total space - sum of areas not usable by the file").
Note that if you have a 1 GiB disk, this might (e.g.) be split into 4 partitions and a FAT file system might be given a partition with a fraction of 1 GiB of space. Even if there is only one partition for the "whole" disk, typically (assuming "MBR partitions" and not the newer "GPT partitions" which takes more space for partition tables, etc) the partition begins on the second track (the first track is "reserved" for MBR, partition table and maybe "boot manager") or a later track (e.g. to align the start of the partition to a "4 KiB physical sector size" and avoid performance problems caused by "512 logical sector size").
In other words, the size of the disk has very little to do with the size of the volume used for FAT; and when questions only tell you the size of the disk and don't tell you the size of the partition/volume you can't provide accurate answers.
What you could do is state your assumptions clearly in your answer, for example:
"I assume that a "1 GB" disk is 1000000 KiB (1024000000 bytes, and not 1 GiB or 1073741824 bytes, and not 1 GB or 1000000000 bytes); and I assume that 1 MiB (1024 KiB) of disk space is consumed by the partition table and MBR and all remaining space is used for a single FAT partition; and therefore the FAT volume itself is 998976 KiB."

What is the maximum size of TIFF metadata?

Is there a maximum limit to the amount of metadata that can be incorporated in an individual field of TIFF file metadata? I'd like to store a large text (up to a few MB) in the ImageDescription field.
There's no specific maximum limit to the ImageDescription field, however, there's a maximum file size for the entire TIFF file. This maximum size is 4 GB. From the TIFF 6.0 spec:
The largest possible TIFF file is 2**32 bytes in length.
This is due to the offsets in the file structure, stored as unsigned 32 bit integers.
Thus, the theoretical maximum size is that which #cgohlke points out in the comments ("2^32 minus the offset"), but most likely you want to keep it smaller, if you also intend to include pixel data...
Storing "a few MB" should not be a problem.