We have recently started using Cassandra in production. We have a single cross-colo cluster of 24 nodes: 12 nodes in the PHX colo and 12 nodes in the SLC colo. We use a replication factor of 4, which means 2 copies are kept in each datacenter.
Below is how the keyspace and column family were created by our production DBAs.
create keyspace profile with placement_strategy =
'org.apache.cassandra.locator.NetworkTopologyStrategy' and
strategy_options = {slc:2,phx:2};
create column family PROFILE_USER
with key_validation_class = 'UTF8Type'
and comparator = 'UTF8Type'
and default_validation_class = 'UTF8Type'
and gc_grace = 86400;
We are running Cassandra 1.2.2 with org.apache.cassandra.dht.Murmur3Partitioner, key caching, SizeTieredCompactionStrategy, and virtual nodes enabled.
Machine specifications for the Cassandra production nodes:
16 cores, 32 threads
128GB RAM
4 x 600GB SAS in Raid 10, 1.1TB usable
2 x 10GbaseT NIC, one usable
Below is the result I am getting.
Read latency (95th percentile): 9 milliseconds
Number of threads: 10
Duration the program was running: 30 minutes
Throughput: 1977 requests/second
Total number of ids requested: 3,558,701
Total number of columns requested: 65,815,867
I am not sure what else I should try with Cassandra to get much better read performance. I am assuming it is hitting the disk in my case. Should I try increasing the replication factor? Any other suggestions?
I believe a read from an HDD takes around 6-12 ms, compared to SSDs? I guess it is hitting the disk every time in my case, and enabling the key cache does not seem to help. I cannot enable the row cache because it's more efficient to use the OS page cache; maintaining the row cache in the JVM is very expensive, so the row cache is recommended only for a small number of rows, like <100K rows.
Is there any way I can verify whether key caching is working in my case or not?
This is what I get when I do show schema for the column family:
create column family PROFILE
with column_type = 'Standard'
and comparator = 'UTF8Type'
and default_validation_class = 'UTF8Type'
and key_validation_class = 'UTF8Type'
and read_repair_chance = 0.1
and dclocal_read_repair_chance = 0.0
and populate_io_cache_on_flush = false
and gc_grace = 86400
and min_compaction_threshold = 4
and max_compaction_threshold = 32
and replicate_on_write = true
and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
and caching = 'KEYS_ONLY'
and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'};
Is there anything I should change to get good read performance?
I am assuming it is hitting the disk in my case. Should I try increasing the replication factor? Any other suggestions?
If your data is much larger than memory and your access is close to random you will be hitting disk. This is consistent with latencies of ~10ms.
Increasing the replication factor might help, although it will make your cache less efficient since each node will store more data. It is probably only worth doing if your read pattern is mostly random, your data is very large, you have low consistency requirements and your access is read heavy.
If you want to decrease read latency, you can use a lower consistency level. Reading at consistency level CL.ONE generally gives the lowest read latency at a cost of consistency. You will only get consistent reads at CL.ONE if writes are at CL.ALL. But if consistency is not required it is a good tradeoff.
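As a minimal sketch with the DataStax Python driver (the contact point, user id, and query are my own illustration, not the poster's code; adapt them to your schema):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])           # hypothetical contact point
session = cluster.connect("profile")

# CL.ONE: only a single replica has to answer, giving the lowest read latency
query = SimpleStatement(
    'SELECT * FROM "PROFILE_USER" WHERE key = %s',
    consistency_level=ConsistencyLevel.ONE,
)
rows = session.execute(query, ["some-user-id"])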
If you want to increase read throughput, you can decrease read_repair_chance. This number specifies the probability that Cassandra performs a read repair on each read. Read repair involves reading from available replicas and updating any that have old values.
If reading at a low consistency level, read repair incurs extra read I/O and so decreases throughput. It doesn't affect latency (for low consistency levels) since the read repair is done asynchronously. Again, if consistency isn't important for your application, decrease read_repair_chance to maybe 0.01 to improve throughput.
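Reusing the session from the sketch above, and assuming the column family is visible through CQL3, lowering the read repair probability might look like:

# hypothetical value: repair on roughly 1% of reads
session.execute('ALTER TABLE "PROFILE_USER" WITH read_repair_chance = 0.01')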
Is there any way I can verify whether key caching is working in my case or not?
Look at the output of 'nodetool info' and it will output a line like:
Key Cache : size 96468768 (bytes), capacity 96468992 (bytes), 959293 hits, 31637294 requests, 0.051 recent hit rate, 14400 save period in seconds
This gives you the key cache hit rate, which is quite low in the example above.
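For reference, the lifetime hit rate implied by those counters is easy to compute (a quick Python check of the numbers above):

hits, requests = 959_293, 31_637_294
print(hits / requests)  # ~0.030, i.e. only ~3% of reads hit the key cache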
Old post, but in case someone else comes across this:
Don't use an even RF. Quorum is floor(RF/2) + 1, so your RF of 4 requires a quorum of 3 nodes, which is no different from an RF of 5.
Your key cache is probably working fine; it only tells Cassandra where a row is located on disk, so it only reduces seek time.
You have a rather large amount of RAM for a pre-3.0 Cassandra node, and you are likely not leveraging all of it. Try G1GC on newer Cassandra versions.
Row key cache: make sure your partitions are ordered the way you intend to access them. For example, if you're only picking up recent data, order by timestamp ASC instead of timestamp DESC, since the cache fills from the START of the partition.
Parallelize and bucket queries. Use nodetool cfhistograms to evaluate the size of your partitions, then break the partitions into smaller chunks if they exceed 100 MB. From there, change your queries to SELECT x FROM table WHERE id = X AND bucket IN (1,2,3) if you need to scan. Significant performance can then be gained by removing the IN clause and issuing separate queries, e.g. SELECT ... WHERE id = X AND bucket = 1 and SELECT ... WHERE id = X AND bucket = 2, doing the aggregation at the application layer.
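As a rough sketch (hypothetical keyspace, table, and column names), splitting the bucketed scan into concurrent per-bucket queries with the Python driver could look like this:

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["127.0.0.1"])   # hypothetical contact point
session = cluster.connect("myks")

# one prepared query per bucket, executed concurrently
stmt = session.prepare("SELECT x FROM table1 WHERE id = ? AND bucket = ?")
params = [("X", b) for b in (1, 2, 3)]

results = execute_concurrent_with_args(session, stmt, params, concurrency=3)
# aggregate at the application layer
rows = [row for ok, rs in results if ok for row in rs]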
I'm referring to the Confluent Schema Registry:
Is there reliable information on how many distinct schemas a single schema registry can support?
As I understand the schema registry, it reads the available schemas from a Kafka topic on startup.
So possible limitations could be memory consumption (the number of schemas held in memory at a time) or performance (looking up schemas from Kafka).
Internally, it uses a ConcurrentHashMap to store that information, so, in theory, the limit is roughly the max size of a backing Java array.
Do Java arrays have a maximum size?
However, there are multiple maps, and therefore, JVM heap constraints will also exist. If you have larger raw-schema strings, then more memory will be used, so there is no good calculation for this.
I created my own benchmark tool to find possible limitations.
Link to Github repo is here.
TL;DR:
As suspected by #OneCricketeer, the scaling factor is roughly the number of schemas times the average schema size. I created a tool to see how the registry's memory and CPU usage scale when registering many different AVRO schemas of the same size (using a custom field within the schema to differentiate them).
I ran the tool for ~48 schemas; for those, ~900 MB of memory were used, with low CPU usage.
Findings:
The ramp-up of memory usage is a lot steeper in the beginning. After the initial ramp-up, memory usage increases step-wise as new memory is allocated to hold more schemas.
Most of the memory is used for storing the schemas in the ConcurrentHashMap (as expected).
The CPU usage does not change significantly with many schemas, and neither does the time to retrieve a schema.
There is a cache holding RawSchema -> ParsedSchema mappings (see SCHEMA_CACHE_SIZE_CONFIG, default 1000), but at least in my tests I could not see a negative impact from a cache miss; retrieving a schema took ~1-2 ms for both hits and misses.
Memory usage (x scale = 100 schemas, y scale = 1 MB):
CPU usage (x scale = 100 schemas, y scale = usage in %):
Top 10 objects in Java heap:
num #instances #bytes class name (module)
-------------------------------------------------------
1: 718318 49519912 [B (java.base#11.0.17)
2: 616621 44396712 org.apache.avro.JsonProperties$2
3: 666225 15989400 java.lang.String (java.base#11.0.17)
4: 660805 15859320 java.util.concurrent.ConcurrentLinkedQueue$Node (java.base#11.0.17)
5: 616778 14802672 java.util.concurrent.ConcurrentLinkedQueue (java.base#11.0.17)
6: 264000 12672000 org.apache.avro.Schema$Field
7: 6680 12568952 [I (java.base#11.0.17)
8: 368958 11806656 java.util.HashMap$Node (java.base#11.0.17)
9: 88345 7737648 [Ljava.util.concurrent.ConcurrentHashMap$Node; (java.base#11.0.17)
10: 197697 6326304 java.util.concurrent.ConcurrentHashMap$Node (java.base#11.0.17)
I will describe the data and the use case.
record {
customerId: "id", <---- indexed
binaryData: "data" <---- not indexed
}
Expectations:
customerId is a random 10-digit number
Average size of binary record data - 1-2 kilobytes
There may be up to 100 records per one customerId
Overall number of records - 500M
Write pattern #1: insert one record at a time
Write pattern #2: batch, maybe in parallel, with a speed of at least 20M records per hour
Search pattern #1: find all records by customerId
Search pattern #2: find all records for a group of customerIds, in parallel, at a rate of at least 10M customerIds per hour
Data is not too important, we can trade some aspects of reliability for speed
We plan to run in AWS / GCP; ideally the key-value store is managed by the cloud provider
We want to spend no more than 1K USD per month on cloud costs for this solution
What we have tried:
We have implemented this approach in a relational database, on AWS RDS MariaDB. The server has 32 GB RAM, 2 TB GP2 SSD, and 8 CPUs. I found that IOPS usage was high and the insert speed was not satisfactory. After investigation I concluded that, due to the random nature of customerId, there is a high rate of scattered index writes. After this I did the following:
input data is sorted by customerId ASC
An additional trade-off was made to reduce index size with little degradation of single-record read speed: I introduced buckets, where customerIds 1111111185 and 1111111186 go to the same "bucket" 11111111. This way a bucket can't contain more than 100 customerIds, so read speed stays fine while write speed improves.
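A minimal sketch of that bucketing (my own illustration, not the poster's code): the bucket is the customerId with its last two digits dropped, so at most 100 customerIds share a bucket.

def bucket_for(customer_id: int) -> int:
    # drop the last two digits of the customerId
    return customer_id // 100

assert bucket_for(1111111185) == 11111111
assert bucket_for(1111111186) == 11111111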
Even so, I could not get more than 1-3M record writes per hour. Different write concurrencies were tested; the current value is 4 concurrent writers. After all modifications it's not clear what else we can improve:
IOPS is not maxed out (~4K per second),
CPU use is not high,
Network is not fully utilized,
Write and read throughputs are not capped.
Apparently, ACID guarantees are holding us back. I am looking for a horizontally scalable key-value store and would be glad to hear any ideas and rough estimations.
So if I understand you...
2 KB * 500M records ≈ 1 TB of data
20M writes/hr ≈ 5.5k writes/sec
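(A quick back-of-the-envelope check of those figures, just for reference:)

records = 500_000_000
avg_record_bytes = 2 * 1024            # assuming ~2 KB per record
writes_per_hour = 20_000_000

print(records * avg_record_bytes / 1e12)   # ~1.02 TB of raw data
print(writes_per_hour / 3600)              # ~5.6k writes/sec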
That's quite doable in NoSQL.
The scale is not the issue. It's your cost.
$1k a month for 1 TB of data sounds like a reasonable goal. I just don't think that the public clouds are quite there yet.
Let me give an example with my recommendation: Scylla Cloud and Scylla Open Source. (Disclosure: I work for ScyllaDB.)
I will caution you that your $1k/month cap on costs might force you to consider and make some tradeoffs.
As is typical in high availability deployments, to ensure data redundancy in case of node failure, you could use 3x i3.2xlarge instances on AWS (can store 1.9 TB per instance).
You want the extra capacity to run compactions. We use incremental compaction, which saves on space amplification, but you don't want to go with the i3.xlarge (0.9 TB each), which is under your 1 TB requirement, unless you are really pressed for costs; in that case you'd have to do some sort of data eviction (like a TTL) to keep your data set to around <600 GB.
Even with annual reserved pricing for Scylla Cloud (see here: https://www.scylladb.com/product/scylla-cloud/#pricing) of $764.60/server, to run the three i3.2xlarge would be $2,293.80/month. More than twice your budget.
Now, if you eschew managed services, and want to run self-service, you could go Scylla Open Source, and just look at the on-demand instance pricing (see here: https://aws.amazon.com/ec2/pricing/on-demand/). For 3x i3.2xlarge, you are running each at $0.624/hour. That's a raw on-demand cost of $449.28 each, which doesn't include incidentals like backups, data transfer, etc. But you could get three instances for $1,347.84. Open Source. Not managed.
Still over your budget, but closer. If you could get reserved pricing, that might just make it.
Edit: Found the reserve pricing:
3x i3.2xlarge is going to cost you
At monthly pricing $312.44 x 3 = $937.32, or
1 year up-front $3,482 annual/12 = $290.17/month/server x 3 = $870.50.
So, again, backups, monitoring, and other costs are above that. But you should be able to bring the raw server cost <$1,000 to meet your needs using Scylla Open Source.
But the admin burden is on your team (and their time isn't exactly zero cost).
For example, if you want monitoring on your system, you'll need to set up something like Prometheus, Grafana or Datadog. That will be other servers or services, and they aren't free. (The cost of backups and monitoring by our team are covered with Scylla Cloud. Part of the premium for the service.)
Another way to save money is to do only 2x replication, but that puts your data in a really risky place if you lose a server. It is not recommended.
All of this was based on maximal assumptions of your data. That your records are all around 2k (not 1k). That you're not getting much utility out of data compression, which ScyllaDB has built in; see part one (https://www.scylladb.com/2019/10/04/compression-in-scylla-part-one/) and part two (https://www.scylladb.com/2019/10/07/compression-in-scylla-part-two/).
To my mind, you should be able to squeak through with your $1k/month budget if you go reserved pricing and open source. Though adding on monitoring and backups and other incidental costs (which I haven't calculated here) may end you up back over that number again.
Otherwise, $2.3k/month in a fully-managed-cloud enterprise package and you can sleep easy at night.
I have setup Spark 2.0 and Cassandra 3.0 on a local machine (8 cores, 16gb ram) for testing purposes and edited spark-defaults.conf as follows:
spark.python.worker.memory 1g
spark.executor.cores 4
spark.executor.instances 4
spark.sql.shuffle.partitions 4
Next I imported 1.5 million rows in Cassandra:
create table test (
tid int,
cid int,
pid int,
ev list<double>,
primary key (tid)
)
test.ev is a list containing numeric values i.e. [2240,2081,159,304,1189,1125,1779,693,2187,1738,546,496,382,1761,680]
Now in the code, to test the whole thing I just created a SparkSession, connected to Cassandra, and made a simple select count:
cassandra = spark.read.format("org.apache.spark.sql.cassandra")
df = cassandra.load(keyspace="testks",table="test")
df.select().count()
At this point, Spark outputs the count and takes about 28 seconds to finish the Job, distributed in 13 Tasks (in Spark UI, the total Input for the Tasks is 331.6MB)
Questions:
Is that the expected performance? If not, what am I missing?
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job across. If I am setting spark.sql.shuffle.partitions to 4, why is it creating 13 tasks? (I also checked the number of partitions by calling rdd.getNumPartitions() on my DataFrame.)
Update
A common operation I would like to test over this data:
Query a large data set, say, from 100,000 ~ N rows grouped by pid
Select ev, a list<double>
Perform an average on each member, assuming by now each list has the same length i.e df.groupBy('pid').agg(avg(df['ev'][1]))
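For reference, a runnable version of that aggregation (a sketch reusing the testks.test table and an existing SparkSession named spark) might be:

from pyspark.sql import functions as F

df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="testks", table="test")
      .load())

# average of the second list element per pid; ev is a list<double>
result = df.groupBy("pid").agg(F.avg(df["ev"][1]).alias("avg_ev"))
result.show()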
As #zero323 suggested, I deployed an external machine (2 GB RAM, 4 cores, SSD) with Cassandra just for this test, and loaded the same data set. The result of df.select().count() was, as expected, higher latency and overall poorer performance compared with my previous test (it took about 70 seconds to finish the job).
Edit: I misunderstood his suggestion. #zero323 meant to let Cassandra perform the count instead of using Spark SQL, as explained here.
I also wanted to point out that I am aware of the inherent anti-pattern of using a list<double> instead of a wide row for this type of data, but my concern at the moment is more the time spent retrieving a large dataset rather than the actual average computation time.
Is that the expected performance? If not, what am I missing?
It looks slowish but it is not exactly unexpected. In general count is expressed as
SELECT 1 FROM table
followed by Spark-side summation. So while it is optimized, it is still rather inefficient because you have to fetch N long integers from the external source just to sum them locally.
As explained in the docs, Cassandra-backed RDDs (not Datasets) provide an optimized cassandraCount method which performs server-side counting.
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job in. If I am setting the spark.sql.shuffle.partitions to (...), why is creating (...) Tasks?
Because spark.sql.shuffle.partitions is not used here. This property determines the number of partitions for shuffles (when data is aggregated by some set of keys), not for Dataset creation or for global aggregations like count(*) (which always use 1 partition for the final aggregation).
If you are interested in controlling the number of initial partitions, take a look at spark.cassandra.input.split.size_in_mb, which is defined as:
Approx amount of data to be fetched into a Spark partition. Minimum number of resulting Spark partitions is 1 + 2 * SparkContext.defaultParallelism
As you can see, another factor here is spark.default.parallelism, but it is not exactly a subtle configuration, so depending on it is in general not an optimal choice.
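For example (a sketch; the host and the 128 MB value are only illustrative), the split size can be set when building the session:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-read-test")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         # larger splits mean fewer, bigger Spark partitions when scanning Cassandra
         .config("spark.cassandra.input.split.size_in_mb", "128")
         .getOrCreate())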
I see that this is a very old question, but maybe someone needs it now.
When running Spark on a local machine it is very important to set the SparkConf master to "local[*]", which according to the documentation runs Spark with as many worker threads as there are logical cores on your machine.
This improved the performance of the count() operation by 100% on my local machine compared to master "local".
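For illustration, a minimal PySpark session configured this way (the app name is hypothetical):

from pyspark.sql import SparkSession

# local[*] uses one worker thread per logical core instead of a single thread
spark = (SparkSession.builder
         .master("local[*]")
         .appName("local-count-test")
         .getOrCreate())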
Some follow-ups are at the bottom.
I have a test installation of Spark and Cassandra where I have 6 nodes with 128GiB of RAM and 16 CPU cores each. Each node runs Spark and Cassandra. I set up my keyspace with the SimpleStrategy and a replication factor of 3 (i.e., fairly standard).
My table is very simple, like this:
create table if not exists mykeyspace.values (channel_id timeuuid, day int, time bigint, value double, primary key ((channel_id, day), time)) with clustering order by (time asc)
time is simply a unix timestamp in nanoseconds (the measuring devices the values come from are that precise, and this precision is wanted), and day is the same timestamp expressed in days (i.e., days since 1970-01-01).
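For illustration (my own sketch, not code from the question), day is derived from the nanosecond timestamp like this:

NANOS_PER_DAY = 86_400 * 1_000_000_000

def day_of(time_ns: int) -> int:
    # days since 1970-01-01 for a unix timestamp in nanoseconds
    return time_ns // NANOS_PER_DAY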
I now inserted about 200 GiB of values for about 400 channels and tested a very simple thing - calculate the 10-minute average of every channel:
sc.
cassandraTable("mykeyspace", "values").
// pull out only the columns needed: time, channel_id, value
map(r => (r.getLong("time"), r.getUUID("channel_id"), r.getDouble("value"))).
// key by (10-minute bucket, channel), value = (sum, count)
map(t => (t._1 / 600L / 1000000000L, t._2) -> (t._3, 1.0)).
reduceByKey((a, b) => (a._1 + b._1) -> (a._2 + b._2)).
// back to (bucket start in nanoseconds, channel, average)
map(t => (t._1._1 * 600L * 1000000000L, t._1._2, t._2._1 / t._2._2))
When I run this calculation, even without saving the result (i.e., just calling count()), it takes a VERY long time and read performance is very bad.
When I do top on the nodes, Cassandra's java process takes about 800% CPU, which is OK because this is about half the load the node can take; the other half is taken by Spark.
However, I noticed a strange thing:
When I run iotop I expect to see a lot of disk read, but I see a lot of disk WRITE instead, all of which comes from kworker.
When I do iostat -x -t 10, I also see a lot of writes going on.
Swap is disabled.
When I run a similar calculation directly on the CSV files the data came from, which are stored in HDFS and loaded via sc.newAPIHadoopFile with a custom input format, the process finishes much faster (the calculation takes about an hour with Cassandra but about 5 minutes with files from HDFS).
So where can I start troubleshooting and tuning?
Followup 1
With the help of RussS (see comments) I discovered that logging was set to DEBUG. I disabled this, set logging to ERROR, and also disabled GC logging, but this did not change anything at all.
I also tried keyBy as the very same user pointed out, but this also did not change anything.
I also tried running it locally, once from .NET and once from Scala, and there the database was accessed as expected, i.e., no writes.
Followup 2
I think I got it. For one thing, I didn't see the forest for the trees: the hour I stated earlier for 200 GiB still works out to about 56 MiB/s of throughput. Since the hardware I run my installation on is far from optimal (it is a high-performance server which runs Microsoft Hyper-V, which in turn runs the nodes as VMs, and the hard disks of this machine are quite slow), this is indeed the throughput I should expect. Since the host is just one machine with one RAID array where the disks of the nodes are virtual HDDs, I can't expect the performance to magically go through the roof.
I also tried running Spark standalone, which improves performance a bit (I now get about 75 MiB/s), and the constant writes are gone as well; I only get occasional spikes, which I attribute to shuffling.
As for the CSV files being much faster: the raw CSV data is only about 50 GiB, my custom FileInputFormat reads it line by line, and it uses a very fast string-to-double parser which only understands the US number format but is faster than Java's parseDouble or Scala's toDouble. With this special tweaking I get about 170 MiB/s in YARN mode.
So I suppose I should, for a start, improve my CQL queries to limit the data that gets read, and try to tweak some YARN settings.
I have a sharded cluster across 3 systems.
While inserting I get the error message:
cant map file memory-mongo requires 64 bit build for larger datasets
I know that a 32-bit build has a size limit of 2 GB.
I have a few questions to ask.
Is the 2 GB limit per system, so that the total would be 6 GB since my sharding spans 3 systems? Or is it only 2 GB overall?
Even though sharding is set up properly, why is all the data stored on a single system instead of being distributed across all three shards?
Does sharding play any role in increasing the data size limit?
Does chunk size play any vital role in performance?
I would not recommend doing anything with 32-bit MongoDB beyond running it on a development machine where you perhaps cannot run 64-bit. Once you hit the limit, the files become unusable.
The documentation states "Use 64 bit for production. This is important as if you hit the mmap size limit (exact limit varies but less than 2GB) you will be unable to write to the database (analogous to a disk full condition)."
Sharding is all about scaling out your data set across multiple nodes so in answer to your question, yes you have increased the possible size of your data set. Remember though that namespaces and indexes also take up space.
You haven't specified where your mongos resides. Where are you seeing the error: on a mongod or on the mongos? I suspect it's a mongod, which would seem to indicate that all your data is going to one mongod. I believe you need to look at pre-splitting the chunks: http://docs.mongodb.org/manual/administration/sharding/#splitting-chunks.
If you have a mongos, what does sh.status() return? Are chunks spread across all the mongods?
For testing, I'd recommend a chunk size of 1 MB. In production, it's best to stick with the default of 64 MB unless you have a really important reason not to and you really know what you are doing. If your chunk size is too small, you will be performing splits far too often.
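As a hedged sketch with pymongo (the mongos host is hypothetical; the config database collections are the standard ones), you can check how chunks are spread across shards and, for testing only, lower the chunk size:

from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")  # hypothetical mongos address
config = client.config

# roughly what sh.status() summarizes: chunk counts per shard
for shard in config.shards.find():
    n = config.chunks.count_documents({"shard": shard["_id"]})
    print(shard["_id"], n, "chunks")

# testing only: set the chunk size to 1 MB (the default is 64 MB)
config.settings.update_one({"_id": "chunksize"}, {"$set": {"value": 1}}, upsert=True)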