Confluent Schema Registry - maximum number of schemas

I'm referring to the Confluent Schema Registry:
Is there reliable information on how many distinct schemas a single schema registry can support?
As I understand it, the schema registry reads the available schemas from a Kafka topic on startup.
So possible limitations could be memory consumption (i.e. the number of schemas held in memory at a time) or performance (i.e. looking up schemas from Kafka).

Internally, it uses a ConcurrentHashMap to store that information, so, in theory, the limit is roughly the maximum size of a backing Java array (see "Do Java arrays have a maximum size?").
However, there are multiple maps, and therefore JVM heap constraints will also exist. If you have larger raw-schema strings, then more memory will be used, so there is no good calculation for this.
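As a rough gauge, though, you can walk the registry's REST API and sum up the raw schema text it is holding. This is only a lower bound on heap usage (parsed-schema objects and map overhead come on top), and the registry URL below is an assumption:

import requests

REGISTRY_URL = "http://localhost:8081"  # assumption: a locally reachable registry

total_bytes = 0
versions_seen = 0
for subject in requests.get(f"{REGISTRY_URL}/subjects").json():
    for version in requests.get(f"{REGISTRY_URL}/subjects/{subject}/versions").json():
        entry = requests.get(f"{REGISTRY_URL}/subjects/{subject}/versions/{version}").json()
        total_bytes += len(entry["schema"].encode("utf-8"))  # raw schema string size
        versions_seen += 1

print(f"{versions_seen} schema versions, ~{total_bytes / 1e6:.1f} MB of raw schema text")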

I created my own benchmark tool to find out about possible limitations.
The link to the GitHub repo is here.
TL;DR:
As suspected by @OneCricketeer, the scaling factor is roughly the number of schemas times the average schema size. I created a tool to see how the registry's memory and CPU usage scale with the registration of many different AVRO schemas of the same size (using a custom field within the schema to differentiate them); a sketch of the registration idea follows below.
I ran the tool for ~48 schemas; for that, ~900 MB of memory were used, with low CPU usage.
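For reference, the core of such a benchmark can be sketched against the registry's REST API in a few lines (this is a minimal illustration, not the actual tool; the URL, subject naming and field names are assumptions):

import json
import requests

REGISTRY_URL = "http://localhost:8081"  # assumption
N_SCHEMAS = 1000                        # assumption: how many distinct schemas to register

for i in range(N_SCHEMAS):
    schema = {
        "type": "record",
        "name": "BenchRecord",
        "fields": [
            {"name": "payload", "type": "string"},
            # differentiating field: a new field name per iteration makes each schema distinct
            {"name": f"marker_{i}", "type": "string", "default": ""},
        ],
    }
    resp = requests.post(
        f"{REGISTRY_URL}/subjects/bench-{i}/versions",  # one subject per schema avoids compatibility checks
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        data=json.dumps({"schema": json.dumps(schema)}),
    )
    resp.raise_for_status()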
Findings:
The ramp-up of memory usage is a lot higher in the beginning. After the initial ramp-up, the memory usage increases step-wise as new memory is allocated to hold more schemas.
Most of the memory is used for storing the schemas in the ConcurrentHashMap (as expected).
The CPU usage does not change significantly with many schemas, and neither does the time to retrieve a schema.
There is a cache for holding RawSchema -> ParsedSchema mappings (see SCHEMA_CACHE_SIZE_CONFIG, default 1000), but at least in my tests I could not see a negative impact from a cache miss; retrieving a schema took ~1-2 ms for both hits and misses.
Memory usage chart (x axis: number of schemas, 1 unit = 100 schemas; y axis: memory in MB) - not reproduced here.
CPU usage chart (x axis: number of schemas, 1 unit = 100 schemas; y axis: CPU usage in %) - not reproduced here.
Top 10 objects in Java heap:
num #instances #bytes class name (module)
-------------------------------------------------------
1: 718318 49519912 [B (java.base#11.0.17)
2: 616621 44396712 org.apache.avro.JsonProperties$2
3: 666225 15989400 java.lang.String (java.base#11.0.17)
4: 660805 15859320 java.util.concurrent.ConcurrentLinkedQueue$Node (java.base#11.0.17)
5: 616778 14802672 java.util.concurrent.ConcurrentLinkedQueue (java.base#11.0.17)
6: 264000 12672000 org.apache.avro.Schema$Field
7: 6680 12568952 [I (java.base#11.0.17)
8: 368958 11806656 java.util.HashMap$Node (java.base#11.0.17)
9: 88345 7737648 [Ljava.util.concurrent.ConcurrentHashMap$Node; (java.base#11.0.17)
10: 197697 6326304 java.util.concurrent.ConcurrentHashMap$Node (java.base#11.0.17)

Related

How can I choose the right key-value store for my use case?

I will describe the data and case.
record {
customerId: "id", <---- indexed
binaryData: "data" <---- not indexed
}
Expectations:
customerId is a random 10-digit number
Average size of binary record data: 1-2 kilobytes
There may be up to 100 records per customerId
Overall number of records: 500M
Write pattern #1: insert one record at a time
Write pattern #2: batch, maybe in parallel, with a speed of at least 20M records per hour
Search pattern #1: find all records by customerId
Search pattern #2: find all records for a group of customerIds, in parallel, at a rate of at least 10M customerIds per hour
Data is not too important; we can trade some aspects of reliability for speed
We plan to run in AWS / GCP; ideally the key-value store is managed by the cloud provider
We want to spend no more than 1K USD per month on cloud costs for this solution
What we have tried:
We have this approach implemented in a relational database, AWS RDS MariaDB. The server has 32 GB RAM, 2 TB GP2 SSD, 8 CPUs. I found that IOPS usage was high and insert speed was not satisfactory. After investigation I concluded that, due to the random nature of customerId, there is a high rate of scattered writes to the index. After this I did the following:
input data is sorted by customerId ASC
An additional trade-off was made to reduce the index size with little degradation of single-record read speed. For this I created buckets of a sort, where records 1111111185 and 1111111186 go to the same "bucket" 11111111 (see the sketch below). This way a bucket can't contain more than 100 customerIds, so read speed stays acceptable while write speed improves.
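A minimal sketch of that bucketing, assuming the bucket key is simply the customerId with its last two digits dropped (the helper name is mine, not from the original setup):

def bucket_for(customer_id: int) -> int:
    # drop the last two digits: up to 100 consecutive customerIds share one bucket
    return customer_id // 100

assert bucket_for(1111111185) == 11111111
assert bucket_for(1111111186) == 11111111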
Even like this, I could not achieve more than 1-3M record writes per hour. Different write concurrencies were tested; the current value is 4 concurrent writers. After all modifications it's not clear what else we can improve:
IOPS is not at its limit (~4K per second),
CPU usage is not high,
the network is not fully utilized,
write and read throughputs are not capped.
Apparently, ACID guarantees are holding us back. I am looking for a horizontally scalable key-value store and would be glad to hear any ideas and rough estimations.
So if I understand you...
2kb * 500m records ≈ 1 TB of data
20m writes/hr ≈ 5.5k writes/sec
That's quite doable in NoSQL.
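Spelled out (a quick back-of-envelope check, taking the 2 KB per-record upper bound):

records = 500_000_000            # overall number of records
avg_record_bytes = 2 * 1024      # upper-bound estimate of record size

print(records * avg_record_bytes / 1e12)   # ~1.0 TB of raw data
print(20_000_000 / 3600)                   # ~5.6k writes/second for 20M writes/hour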
The scale is not the issue. It's your cost.
$1k a month for 1 TB of data sounds like a reasonable goal. I just don't think that the public clouds are quite there yet.
Let me give an example with my recommendation: Scylla Cloud and Scylla Open Source. (Disclosure: I work for ScyllaDB.)
I will caution you that your $1k/month cap on costs might cause you to consider and make some tradeoffs.
As is typical in high availability deployments, to ensure data redundancy in case of node failure, you could use 3x i3.2xlarge instances on AWS (can store 1.9 TB per instance).
You want the extra capacity to run compactions. We use incremental compaction, which saves on space amplification, but you don't want to go with the i3.xlarge (0.9 TB each), which is under your 1 TB requirement, unless you are really pressed for costs. In that case you'll have to do some sort of data eviction (like a TTL) to keep your data to around <600 GB.
Even with annual reserved pricing for Scylla Cloud (see here: https://www.scylladb.com/product/scylla-cloud/#pricing) of $764.60/server, to run the three i3.2xlarge would be $2,293.80/month. More than twice your budget.
Now, if you eschew managed services, and want to run self-service, you could go Scylla Open Source, and just look at the on-demand instance pricing (see here: https://aws.amazon.com/ec2/pricing/on-demand/). For 3x i3.2xlarge, you are running each at $0.624/hour. That's a raw on-demand cost of $449.28 each, which doesn't include incidentals like backups, data transfer, etc. But you could get three instances for $1,347.84. Open Source. Not managed.
Still over your budget, but closer. If you could get reserved pricing, that might just make it.
Edit: Found the reserve pricing:
3x i3.2xlarge is going to cost you
At monthly pricing $312.44 x 3 = $937.32, or
1 year up-front $3,482 annual/12 = $290.17/month/server x 3 = $870.50.
So, again, backups, monitoring, and other costs are above that. But you should be able to bring the raw server cost <$1,000 to meet your needs using Scylla Open Source.
But the admin burden is on your team (and their time isn't exactly zero cost).
For example, if you want monitoring on your system, you'll need to set up something like Prometheus, Grafana or Datadog. That will be other servers or services, and they aren't free. (The cost of backups and monitoring by our team are covered with Scylla Cloud. Part of the premium for the service.)
Another way to save money is to only do 2x replication, which puts your data at real risk should you lose a server. It is not recommended.
All of this was based on maximal assumptions about your data: that your records are all around 2 KB (not 1 KB), and that you're not getting much utility out of data compression, which ScyllaDB has built in - see part one (https://www.scylladb.com/2019/10/04/compression-in-scylla-part-one/) and part two (https://www.scylladb.com/2019/10/07/compression-in-scylla-part-two/).
To my mind, you should be able to squeak through with your $1k/month budget if you go reserved pricing and open source. Though adding on monitoring and backups and other incidental costs (which I haven't calculated here) may end you up back over that number again.
Otherwise, $2.3k/month in a fully-managed-cloud enterprise package and you can sleep easy at night.

Garbage Collection issues on MapPartitions

I currently have a mapPartitions job which is flatMapping each value in the iterator, and I'm running into an issue where there will be major GC costs on certain executions. Some executors will take 20 minutes, 15 of which are pure garbage collection, and I believe that a lot of it has to do with the ArrayBuffer that I am outputting. Does anyone have any suggestions as to how I can do some form of a stream output?
Also, does anyone have any advice in general for tracking down/addressing GC issues in Spark?
Please refer to the documentation below from the official Spark tuning page. I hope it will at least help give direction to your analysis:
Memory Management Overview
Memory usage in Spark largely falls under one of two categories: execution and storage. Execution memory refers to that used for computation in shuffles, joins, sorts and aggregations, while storage memory refers to that used for caching and propagating internal data across the cluster. In Spark, execution and storage share a unified region (M). When no execution memory is used, storage can acquire all the available memory and vice versa. Execution may evict storage if necessary, but only until total storage memory usage falls under a certain threshold (R). In other words, R describes a subregion within M where cached blocks are never evicted. Storage may not evict execution due to complexities in implementation.
This design ensures several desirable properties. First, applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills. Second, applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted. Lastly, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring user expertise of how memory is divided internally.
Although there are two relevant configurations, the typical user should not need to adjust them as the default values are applicable to most workloads:
spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MB) (default 0.6). The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks are immune to being evicted by execution.
The value of spark.memory.fraction should be set in order to fit this amount of heap space comfortably within the JVM’s old or “tenured” generation. See the discussion of advanced GC tuning below for details.
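If you do decide to adjust them, this is roughly where they would go (a minimal PySpark sketch; the values shown are just the defaults, not a recommendation):

from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("gc-tuning-example")
    .set("spark.memory.fraction", "0.6")         # size of M as a fraction of (heap - 300 MB)
    .set("spark.memory.storageFraction", "0.5")  # size of R as a fraction of M
)
sc = SparkContext(conf=conf)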

Spark out of memory

I have a folder with 150 G of txt files (around 700 files, on average each 200 MB).
I'm using scala to process the files and calculate some aggregate statistics in the end. I see two possible approaches to do that:
manually loop through all the files, do the calculations per file and merge the results in the end
read the whole folder to one RDD, do all the operations on this single RDD and let spark do all the parallelization
I'm leaning towards the second approach as it seems cleaner (no need for parallelization-specific code), but I'm wondering if my scenario will fit the constraints imposed by my hardware and data. I have one workstation with 16 threads and 64 GB of RAM available (so the parallelization will be strictly local, between different processor cores). I might scale the infrastructure with more machines later on, but for now I would just like to focus on tuning the settings for this one-workstation scenario.
The code I'm using:
- reads TSV files, and extracts meaningful data to (String, String, String) triplets
- afterwards some filtering, mapping and grouping is performed
- finally, the data is reduced and some aggregates are calculated
I've been able to run this code with a single file (~200 MB of data); however, I get a java.lang.OutOfMemoryError: GC overhead limit exceeded
and/or a Java out-of-heap exception when adding more data (the application breaks with 6 GB of data, but I would like to use it with 150 GB of data).
I guess I would have to tune some parameters to make this work. I would appreciate any tips on how to approach this problem (how to debug memory demands). I've tried increasing 'spark.executor.memory' and using a smaller number of cores (the rationale being that each core needs some heap space), but this didn't solve my problems.
I don't need the solution to be very fast (it can easily run for a few hours even days if needed). I'm also not caching any data, but just saving them to the file system in the end. If you think it would be more feasible to just go with the manual parallelization approach, I could do that as well.
My team and I successfully processed CSV data of over 1 TB on 5 machines with 32 GB of RAM each. It depends heavily on what kind of processing you're doing and how.
If you repartition an RDD, it requires additional computation that has overhead above your heap size. Try loading the file with more parallelism by decreasing the split size via TextInputFormat.SPLIT_MINSIZE and TextInputFormat.SPLIT_MAXSIZE (if you're using TextInputFormat) to elevate the level of parallelism.
Try using mapPartitions instead of map so you can handle the computation inside a partition. If the computation uses a temporary variable or instance and you're still facing out-of-memory errors, try lowering the amount of data per partition (by increasing the partition number).
Increase the driver memory and executor memory limits using "spark.executor.memory" and "spark.driver.memory" in the Spark configuration before creating the Spark context (a sketch follows after the note below).
Note that Spark is a general-purpose cluster computing system, so it's inefficient (IMHO) to use Spark on a single machine.
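A minimal PySpark sketch of the points above in spirit (using textFile's minPartitions instead of the Hadoop split-size settings; the path, partition count and memory size are placeholders, and in client mode the driver memory is usually passed on the spark-submit command line rather than set in code):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("aggregate-tsv").set("spark.executor.memory", "8g")
sc = SparkContext(conf=conf)

# More partitions mean less data per task and therefore less heap pressure per task.
lines = sc.textFile("/data/tsv-folder/*.txt", minPartitions=2000)

def parse_partition(lines_iter):
    # mapPartitions with a generator streams records through the partition
    # instead of materializing the whole partition in memory.
    for line in lines_iter:
        cols = line.split("\t")
        if len(cols) >= 3:
            yield (cols[0], cols[1], cols[2])

triplets = lines.mapPartitions(parse_partition)
triplets.saveAsTextFile("/data/output")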
To add another perspective based on code (as opposed to configuration): Sometimes it's best to figure out at what stage your Spark application is exceeding memory, and to see if you can make changes to fix the problem. When I was learning Spark, I had a Python Spark application that crashed with OOM errors. The reason was because I was collecting all the results back in the master rather than letting the tasks save the output.
E.g.
for item in processed_data.collect():
    print(item)
failed with OOM errors. On the other hand,
processed_data.saveAsTextFile(output_dir)
worked fine.
Yes, the PySpark RDD/DataFrame collect() function is used to retrieve all the elements of the dataset (from all nodes) to the driver node. We should use collect() only on smaller datasets, usually after filter(), group(), count(), etc. Retrieving a larger dataset this way results in out-of-memory errors.

Ways to improve Cassandra read performance in my scenario

We have recently started using the Cassandra database in production. We have a single cross-colo cluster of 24 nodes, meaning 12 nodes in PHX and 12 nodes in the SLC colo. We have a replication factor of 4, which means 2 copies are kept in each datacenter.
Below is how the keyspace and column families were created by our production DBAs.
create keyspace profile with placement_strategy =
'org.apache.cassandra.locator.NetworkTopologyStrategy' and
strategy_options = {slc:2,phx:2};
create column family PROFILE_USER
with key_validation_class = 'UTF8Type'
and comparator = 'UTF8Type'
and default_validation_class = 'UTF8Type'
and gc_grace = 86400;
We are running Cassandra 1.2.2 and it has org.apache.cassandra.dht.Murmur3Partitioner, with KeyCaching, SizeTieredCompactionStrategy and Virtual Nodes enabled as well.
Machine Specifications for Cassandra production nodes-
16 cores, 32 threads
128GB RAM
4 x 600GB SAS in Raid 10, 1.1TB usable
2 x 10GbaseT NIC, one usable
Below is the result I am getting.
Read latency (95th percentile): 9 milliseconds
Number of threads: 10
Duration the program was running: 30 minutes
Throughput: 1977 requests/second
Total number of ids requested: 3,558,701
Total number of columns requested: 65,815,867
I am not sure what other things I should try it out with Cassandra to get much better read performance. I am assuming it is hitting the disk in my case. Should I try increasing the Replication Factor to some higher number? Any other suggestion?
I believe reading data from an HDD takes around 6-12 ms, compared to SSDs? In my case it is hitting the disk every time, I guess, and enabling the key cache is not helping here. I cannot enable the row cache because it's more efficient to use the OS page cache; maintaining a row cache in the JVM is very expensive, thus the row cache is recommended only for a small number of rows, like <100K rows.
Is there any way I can verify whether keycaching is working fine in my case or not?
This is what I get when I do show schema for the column family:
create column family PROFILE
with column_type = 'Standard'
and comparator = 'UTF8Type'
and default_validation_class = 'UTF8Type'
and key_validation_class = 'UTF8Type'
and read_repair_chance = 0.1
and dclocal_read_repair_chance = 0.0
and populate_io_cache_on_flush = false
and gc_grace = 86400
and min_compaction_threshold = 4
and max_compaction_threshold = 32
and replicate_on_write = true
and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
and caching = 'KEYS_ONLY'
and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'};
Is there anything I should make a change to get good read performance?
I am assuming it is hitting the disk in my case. Should I try increasing the Replication Factor to some higher number? Any other suggestion?
If your data is much larger than memory and your access is close to random you will be hitting disk. This is consistent with latencies of ~10ms.
Increasing the replication factor might help, although it will make your cache less efficient since each node will store more data. It is probably only worth doing if your read pattern is mostly random, your data is very large, you have low consistency requirements and your access is read heavy.
If you want to decrease read latency, you can use a lower consistency level. Reading at consistency level CL.ONE generally gives the lowest read latency at a cost of consistency. You will only get consistent reads at CL.ONE if writes are at CL.ALL. But if consistency is not required it is a good tradeoff.
If you want to increase read throughput, you can decrease read_repair_chance. This number specifies the probability that Cassandra performs a read repair on each read. Read repair involves reading from available replicas and updating any that have old values.
If reading at a low consistency level, read repair incurs extra read I/O so decreases throughput. It doesn't affect latency (for low consistency levels) since read repair is done asynchronously. Again, if consistency isn't important for your application, decrease read_repair_chance to maybe 0.01 to improve throughput.
Is there any way I can verify whether keycaching is working fine in my case or not?
Look at the output of 'nodetool info' and it will output a line like:
Key Cache : size 96468768 (bytes), capacity 96468992 (bytes), 959293 hits, 31637294 requests, 0.051 recent hit rate, 14400 save period in seconds
This gives you the key cache hit rate, which is quite low in the example above.
Old post, but in case someone else comes by this:
Don't use an even RF. Your RF of 4 requires a quorum of 3 nodes, which is no different from an RF of 5.
Your key cache is probably working fine; it only tells Cassandra where on disk a row is located, so it only reduces seek times.
You have a rather large amount of RAM for a pre-3.0 node, and you're likely not leveraging all of it. Try G1GC on newer Cassandra versions.
Row key cache, make sure that your partitions are ordered in the way you intend to access them. Ex: If you're picking up only recent data, make sure you order by timestamp ASC instead of timestamp DESC as it will cache from the START of the partition.
Parallelize and bucket queries. Use nodetool cfhistograms to evaluate the size of your partitions, then try to break the partitions into smaller chunks if they exceed 100 MB. From here you change your queries to SELECT x FROM table WHERE id = X AND bucket IN (1,2,3) if you need to scan. Significant performance can then be gained by removing the "IN bucket" and moving that to 3 separate queries, e.g. running SELECT ... WHERE id = X AND bucket = 1, SELECT ... WHERE id = X AND bucket = 2, and doing the aggregation at the application layer (see the sketch below).
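A minimal sketch of that last pattern with the DataStax Python driver (keyspace, table and column names are placeholders, not the schema from the question):

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("profile")

stmt = session.prepare("SELECT col, value FROM profile_user WHERE id = ? AND bucket = ?")

def read_all_buckets(user_id, buckets=(1, 2, 3)):
    # fire one query per bucket concurrently, then merge at the application layer
    futures = [session.execute_async(stmt, (user_id, b)) for b in buckets]
    rows = []
    for future in futures:
        rows.extend(future.result())
    return rows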

Can't map file memory-mongo requires 64 bit build for larger datasets

I have a sharded cluster in 3 systems.
While inserting I get the error message:
cant map file memory-mongo requires 64 bit build for larger datasets
I know that a 32-bit machine has a limit size of 2 GB.
I have two questions to ask.
The 2 GB limit is for 1 system, so would the total be 6 GB, since my sharding is done across 3 systems? Or would it still be only 2 GB rather than 6 GB?
Even though sharding is set up properly, is all the data being stored on a single system instead of being distributed across the three sharded systems?
Does Sharding play any role in increasing the datasize limit?
Does chunk size play any vital role in performance?
I would not recommend you do anything with 32-bit MongoDB beyond running it on a development machine where you perhaps cannot run 64-bit. Once you hit the limit the file becomes unusable.
The documentation states: "Use 64 bit for production. This is important as if you hit the mmap size limit (exact limit varies but less than 2GB) you will be unable to write to the database (analogous to a disk full condition)."
Sharding is all about scaling out your data set across multiple nodes so in answer to your question, yes you have increased the possible size of your data set. Remember though that namespaces and indexes also take up space.
You haven't specified where your mongos resides? Where are you seeing the error from - a mongod or the mongos? I suspect that it's the mongod, which would seem to indicate that all your data is going to the one mongod. I believe that you need to look at pre-splitting the chunks - http://docs.mongodb.org/manual/administration/sharding/#splitting-chunks.
If you have a mongos, what does sh.status() return? Are chunks spread across all mongod's?
For testing, I'd recommend a chunk size of 1 MB. In production, it's best to stick with the default of 64 MB unless you have some really important reason not to use the default and you really know what you are doing. If your chunk size is too small, you will be performing splits far too often.