Cassandra random read speed - nosql

We're still evaluating Cassandra for our data store. As a very simple test, I inserted a value for 4 columns into the Keyspace1/Standard1 column family on my local machine amounting to about 100 bytes of data. Then I read it back as fast as I could by row key. I can read it back at 160,000/second. Great.
Then I put in a million similar records all with keys in the form of X.Y where X in (1..10) and Y in (1..100,000) and I queried for a random record. Performance fell to 26,000 queries per second. This is still well above the number of queries we need to support (about 1,500/sec).
Finally I put ten million records in from 1.1 up through 10.1000000 and randomly queried for one of the 10 million records. Performance is abysmal at 60 queries per second and my disk is thrashing around like crazy.
I also verified that if I ask for a subset of the data, say the 1,000 records between 3,000,000 and 3,001,000, it returns slowly at first and then as they cache, it speeds right up to 20,000 queries per second and my disk stops going crazy.
I've read all over that people are storing billions of records in Cassandra and fetching them at 5-6k per second, but I can't get anywhere near that with only 10mil records. Any idea what I'm doing wrong? Is there some setting I need to change from the defaults? I'm on an overclocked Core i7 box with 6 GB of RAM, so I don't think it's the machine.
Here's my code to fetch records which I'm spawning into 8 threads to ask for one value from one column via row key:
// utf8Encoding is a UTF8Encoding instance and sRand a Random (give each thread its own Random; System.Random is not thread-safe).
ColumnPath cp = new ColumnPath();
cp.Column_family = "Standard1";
cp.Column = utf8Encoding.GetBytes("site");
// Build a random key of the form X.Y with X in 1..10 and Y in 1..1,000,000.
string key = (1 + sRand.Next(10)) + "." + (1 + sRand.Next(1000000));
// Read a single column by row key at consistency level ONE.
ColumnOrSuperColumn logline = client.get("Keyspace1", key, cp, ConsistencyLevel.ONE);
Thanks for any insights

Purely random reads are about the worst case for the caching that your OS (and Cassandra, if you set up the key or row cache) tries to do.
If you look at contrib/py_stress in the Cassandra source distribution, it has a configurable stdev so you can perform random reads with some keys hotter than others. This will be more representative of most real-world workloads.
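For a rough idea of what that means, here is a minimal sketch of a skewed key generator in Python (the X.Y key format matches the question; the Gaussian parameters are arbitrary):

import random

def skewed_key(n_rows=10_000_000, stdev_fraction=0.1):
    # Draw a row index from a Gaussian centred on the middle of the keyspace,
    # so a minority of keys receive the majority of reads (unlike pure random).
    idx = int(random.gauss(n_rows / 2, n_rows * stdev_fraction))
    idx = min(max(idx, 0), n_rows - 1)      # clamp to the valid range
    x, y = idx // 1_000_000 + 1, idx % 1_000_000 + 1
    return f"{x}.{y}"                        # same X.Y format as the test above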

Add more Cassandra nodes and give them lots of memory (-Xms / -Xmx). The more Cassandra instances you have, the more the data is partitioned across the nodes and the more likely it is to be in memory or easily reachable on disk. You'll be very limited trying to scale on a single workstation-class CPU. Also, check the default -Xms/-Xmx settings; I think the default is 1GB.

It looks like you haven't got enough RAM to store all the records in memory.
If you swap to disk then you are in trouble, and performance is expected to drop significantly, especially if you are random reading.
You could also try benchmarking some other popular alternatives, like Redis or VoltDB.

VoltDB can certainly handle this level of read performance as well as writes and operates using a cluster of servers. As an in-memory solution you need to build a large enough cluster to hold all of your data in RAM.

Related

What are the settings to look out for with Citus PostgreSQL?

We are looking into using CitusDB. After reading all the documentation, we are not clear on some fundamentals. Hoping somebody can give us some direction.
In Citus you specify a shard_count and a shard_max_size. According to the docs these settings are set on the coordinator (but, weirdly, they can also be set on a node).
What happens when you specify 1000 shards and distribute 10 tables with 100 clients?
Does it create a shard for every table (users_1, users_2, shops_1, etc.), effectively using all 1000 shards?
If we then grow by another 100 clients, we have already hit the 1000-shard limit, so how are those tables partitioned?
The shard_max_size defaults to 1GB. If a shard grows beyond 1GB a new shard is created, but what happens when the shard_count has already been hit?
Lastly, is it advisable to go for 3000 shards? We read in the docs that 128 is advised for a SaaS. But this seems low if you have 100 clients * 10 tables. (I know it depends... but...)
Former Citus/current Microsoft employee here, chiming in with some advice.
Citus shards are based on integer hash ranges of the distribution key. When a row is inserted, the value of the distribution key is hashed, the planner looks up what shard was assigned the range of hash values that that key falls into, then looks up what worker the shard lives on, and then runs the insert on that worker. This means that the customers are divided up across shards in a roughly even fashion, and when you add a new customer it'll just go into an existing shard.
It is critically important that all distributed tables that you wish to join to each other have the same number of shards and that their distribution columns have the same type. This lets us perform joins entirely on workers, which is awesome for performance.
If you've got a super big customer (100x as much data as your average customer is a decent heuristic), I'd use the tenant isolation features in advance to give them their own shard. This will make moving them to dedicated hardware much easier if you decide to do so down the road.
The shard_max_size setting has no effect on hash distributed tables. Shards will grow without limit as you keep inserting data, and hash-distributed tables will never increase their shard count under normal operations. This setting only applies to append distribution, which is pretty rarely used these days (I can think of one or two companies using it, but that's about it).
I'd strongly advise against changing the citus.shard_count to 3000 for the use case you've described. 64 or 128 is probably correct, and I'd consider 256 if you're looking at >100TB of data. It's perfectly fine if you end up having thousands of distributed tables and each one has 128 shards, but it's better to keep the number of shards per table reasonable.
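To make that concrete, here is a minimal sketch of setting the shard count and creating co-located distributed tables. Table and column names are made up; create_distributed_table, isolate_tenant_to_new_shard and the citus.shard_count setting are taken from the Citus docs, and psycopg2 is only used to run the SQL:

import psycopg2

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()

# Pick the shard count before the first distributed table is created.
cur.execute("SET citus.shard_count = 128;")

# Distribute both tables on the same column (same type) so joins between them
# stay entirely on the workers.
cur.execute("SELECT create_distributed_table('users', 'tenant_id');")
cur.execute("SELECT create_distributed_table('orders', 'tenant_id', colocate_with => 'users');")

# For an outsized tenant (hypothetical tenant_id 42), give it its own shard,
# cascading to the co-located tables.
cur.execute("SELECT isolate_tenant_to_new_shard('users', 42, 'CASCADE');")
conn.commit()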

What about expected performance in Pentaho?

I am using Pentaho to create ETLs and I am very focused on performance. I developed an ETL process that copies 163,000,000 rows from SQL Server 2008 to PostgreSQL, and it takes 17 hours.
I do not know whether this performance is good or bad. Is there a way to judge whether the time a process takes is reasonable? At least as a reference, so I know whether I need to keep working heavily on performance or not.
Furthermore, I would like to know whether it is normal that the process loads 2 million rows in the first 2 minutes. Extrapolating from that, loading all the rows should take about 6 hours, but performance then decreases and it ends up taking 17 hours.
I have been searching on Google and I cannot find any timing references or explanations about this kind of performance.
Divide and conquer, and proceed by elimination.
First, add a LIMIT to your query so it takes 10 minutes instead of 17 hours; this will make it a lot easier to try different things.
Are the processes running on different machines? If so, measure network bandwidth utilization to make sure it isn't a bottleneck. Transfer a huge file, make sure the bandwidth is really there.
Are the processes running on the same machine? Maybe one is starving the other for IO. Are source and destination the same hard drive? Different hard drives? SSDs? You need to explain...
Examine IO and CPU usage of both processes. Does one process max out one cpu core?
Does a process max out one of the disks? Check iowait, iops, IO bandwidth, etc.
How many columns? Two INTs, 500 FLOATs, or a huge BLOB with a 12 megabyte PDF in each row? Performance would vary between these cases...
Now, I will assume the problem is on the POSTGRES side.
Create a dummy table, identical to your target table, which has:
Exact same columns (CREATE TABLE dummy (LIKE your_table))
No indexes and no constraints (I think that is the default, but double-check the created table)
A BEFORE INSERT trigger on it which returns NULL and drops the row (see the sketch below).
The rows will be processed, just not inserted.
Is it fast now? OK, so the problem was insertion.
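A minimal sketch of that dummy table and trigger, assuming a hypothetical your_table and using psycopg2 just to run the SQL:

import psycopg2

# The BEFORE INSERT trigger returns NULL, so PostgreSQL does all of the insert
# work except actually storing the row.
DDL = """
CREATE TABLE dummy (LIKE your_table INCLUDING DEFAULTS);

CREATE FUNCTION discard_row() RETURNS trigger AS $$
BEGIN
    RETURN NULL;  -- drop the row
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER discard_before_insert
    BEFORE INSERT ON dummy
    FOR EACH ROW EXECUTE PROCEDURE discard_row();
"""

with psycopg2.connect("dbname=target") as conn, conn.cursor() as cur:
    cur.execute(DDL)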
Do it again, but this time using an UNLOGGED TABLE (or a TEMPORARY TABLE). These have no crash resistance because they don't write to the WAL, but for importing data that's OK... if it crashes during the insert you're going to wipe it out and restart anyway.
Still No indexes, No constraints. Is it fast?
If slow => IO write bandwidth issue, possibly caused by something else hitting the disks
If fast => IO is OK, problem not found yet!
With the table loaded with data, add the indexes and constraints one by one, and find out whether you have, say, a CHECK that uses a slow SQL function, or an FK into a table which has no index, that kind of thing. Just check how long it takes to create each constraint.
Note: on an import like this you would normally add indices and constraints after the import.
My gut feeling is that PG is checkpointing like crazy because the checkpoint settings in the config are too low for this volume of data, or some issue like that, probably related to random IO writes. You put the WAL on a fast SSD, right?
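If checkpointing does turn out to be the culprit, the relevant knobs look roughly like this (values are illustrative only, not recommendations; max_wal_size exists on PostgreSQL 9.5+, older releases use checkpoint_segments instead):

import psycopg2

conn = psycopg2.connect("dbname=target")
conn.autocommit = True   # ALTER SYSTEM cannot run inside a transaction block
cur = conn.cursor()
cur.execute("ALTER SYSTEM SET max_wal_size = '16GB';")                # fewer, larger checkpoints
cur.execute("ALTER SYSTEM SET checkpoint_completion_target = 0.9;")   # spread checkpoint writes out
cur.execute("SELECT pg_reload_conf();")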
17 hours is far too much. Even 6 hours is a lot for 200 million rows.
Hints for optimization:
Check the memory size: edit spoon.bat, find the line containing -Xmx and change it to half your machine's memory size. Details vary with the Java version; the example given is for PDI V7.1.
Check that the query against the source database is not too slow (because it is too complex, because of server memory size, or something else).
Check the target commit size (try 25,000 for PostgreSQL), that Use batch update for inserts is on, and that the indexes and constraints are disabled.
Play with Enable lazy conversion in the Table input step. Warning: you may produce errors due to data casting that are difficult to identify and debug.
In the transformation properties you can tune the Nr of rows in rowset (click anywhere, select Properties, then the Miscellaneous tab). On the same tab, check that the transformation is NOT transactional.

MongoDB High Avg. Flush Time - Write Heavy

I'm using MongoDB with approximately 4 million documents and around 5-6GB database size. The machine has 10GB of RAM, and free only reports around 3.7GB in use. The database is used for a video game related ladder (rankings) website, separated by region.
It's a fairly write-heavy operation, but it still gets a significant number of reads as well. We use an updater which queries an outside source every hour or two. This updater then processes the records and updates documents in the database. The updater only processes one region at a time (see previous paragraph), so approximately 33% of the database is updated.
When the updater runs, and for the duration that it runs, the average flush time spikes up to around 35-40 seconds, and we experience general slowdowns with other queries. The updater is RUN on a SEPARATE MACHINE and only queries MongoDB at the end, when all the data has been retrieved and processed from the third party.
Some people have suggested slowing down the number of updates, or only updating players who have changed, but the problem comes down to rankings. Since we support ties between players, we need to pre-calculate the ranks - so even if only a few users have actually changed ranks, we still need to update the rest of the users' ranks accordingly. At least, that was the case with MySQL - I'm not sure if there is a good solution with MongoDB for ranking ~800K-1.2 million documents while supporting ties.
My question is: how can we improve the flush time and the slowdown we're experiencing? Why is it spiking so high? Would disabling journaling (to take some load off the I/O) help? Data loss isn't something I'm worried about, since the database is updated frequently regardless.
Server status: http://pastebin.com/w1ETfPWs
You are using the wrong tool for the job. MongoDB isn't designed for ranking large ladders in real time, at least not quickly.
Use something like Redis. Redis has something called a sorted set that is designed for exactly this job; with it you can have 100 million entries and still fetch, say, the 5,000,000th to 5,001,000th entries at sub-millisecond speed.
From the official site (Redis - Sorted sets):
Sorted sets
With sorted sets you can add, remove, or update elements in a very fast way (in a time proportional to the logarithm of the number of elements). Since elements are taken in order and not ordered afterwards, you can also get ranges by score or by rank (position) in a very fast way. Accessing the middle of a sorted set is also very fast, so you can use Sorted Sets as a smart list of non repeating elements where you can quickly access everything you need: elements in order, fast existence test, fast access to elements in the middle!
In short with sorted sets you can do a lot of tasks with great performance that are really hard to model in other kind of databases.
With Sorted Sets you can:
Take a leader board in a massive online game, where every time a new score is submitted you update it using ZADD. You can easily take the top users using ZRANGE, you can also, given an user name, return its rank in the listing using ZRANK. Using ZRANK and ZRANGE together you can show users with a score similar to a given user. All very quickly.
Sorted Sets are often used in order to index data that is stored inside Redis. For instance if you have many hashes representing users, you can use a sorted set with elements having the age of the user as the score and the ID of the user as the value. So using ZRANGEBYSCORE it will be trivial and fast to retrieve all the users with a given interval of ages.
Sorted Sets are probably the most advanced Redis data types, so take some time to check the full list of Sorted Set commands to discover what you can do with Redis!
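A minimal sketch of that leaderboard pattern with the redis-py client (key and member names are made up), including one way to get a shared rank for tied scores:

import redis

r = redis.Redis()   # assumes a local Redis instance

# Submit or refresh scores; ZADD keeps the set ordered as you write.
r.zadd("ladder:na", {"player:42": 1873, "player:99": 1873, "player:7": 2010})

# Top 10, highest score first.
top = r.zrevrange("ladder:na", 0, 9, withscores=True)

# Position of one player (0-based; tied members are ordered lexicographically).
pos = r.zrevrank("ladder:na", "player:42")

# Shared rank for ties: count members with a strictly higher score.
score = r.zscore("ladder:na", "player:42")
shared_rank = r.zcount("ladder:na", f"({score}", "+inf")

# A page of 1,000 players starting at position 500,000.
page = r.zrevrange("ladder:na", 500_000, 500_999, withscores=True)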
Without seeing any disk statistics, I am of the opinion that you are saturating your disks.
This can be checked with iostat -xmt 2, looking at the %util column.
Please don't disable journalling - you will only cause more issues later down the line when your machine crashes.
Separating collections will have no effect. Separating databases may, but if you're IO bound, this will do nothing to help you.
Options
If I am correct, and your disks are saturated, adding more disks in a RAID 10 configuration will vastly help performance and durability - more so if you separate the journal off to an SSD.
Assuming that this machine is a single server, you can set up a replica set and send your read queries to the secondaries. This should help you a fair bit, but not as much as addressing the disks.
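A sketch of what that looks like from the application side with pymongo (hosts, replica set name, and collection names are hypothetical; note that secondaries can serve slightly stale data):

from pymongo import MongoClient

client = MongoClient(
    "mongodb://db1.example.com,db2.example.com/?replicaSet=rs0",
    readPreference="secondaryPreferred",   # read from a secondary when one is available
)

# Ranking reads go to the secondaries, leaving the primary free for the updater's writes.
ladder = client.game.ladder
top_100 = list(ladder.find().sort("rank", 1).limit(100))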

Billions of rows in PostgreSQL: to partition or not to partition?

What I have:
A simple server with one Xeon with 8 logical cores, 16 GB RAM, and an mdadm RAID 1 of 2x 7200 rpm drives.
PostgreSQL
A lot of data to work with: up to 30 million rows are being imported per day.
Time: complex queries can take up to an hour to execute.
Simplified schema of the table that will be very big:
id| integer | not null default nextval('table_id_seq'::regclass)
url_id | integer | not null
domain_id | integer | not null
position | integer | not null
The problem with the schema above is that I don't have the exact answer on how to partition it.
Data for all periods is going to be used (NO queries will have date filters).
I thought about partitioning on "domain_id" field, but the problem is that it is hard to predict how many rows each partition will have.
My main question is:
Does it make sense to partition the data if I don't use partition pruning and I am not going to delete old data?
What would be the pros and cons of that?
How will my import speed be affected if I don't do partitioning?
Another question related to normalization:
Should url be extracted into a separate table?
Pros of normalization
The table is going to have rows with an average size of 20-30 bytes.
Joins on the "url_id" field are supposed to be much faster than joins on the "url" field.
Pros of denormalization
Data can be imported much, much faster, as I don't have to do a lookup into the "url" table before each insert.
Can anybody give me any advice? Thanks!
Partitioning is most useful if you are going to either have selection criteria in most queries which allow the planner to skip access to most of the partitions most of the time, or if you want to periodically purge all rows that are assigned to a partition, or both. (Dropping a table is a very fast way to delete a large number of rows!) I have heard of people hitting a threshold where partitioning helped keep indexes shallower, and therefore boost performance; but really that gets back to the first point, because you effectively move the first level of the index tree to another place -- it still has to happen.
On the face of it, it doesn't sound like partitioning will help.
Normalization, on the other hand, may improve performance more than you expect; by keeping all those rows narrower, you can get more of them into each page, reducing overall disk access. I would do proper 3rd normal form normalization, and only deviate from that based on evidence that it would help. If you see a performance problem while you still have disk space for a second copy of the data, try creating a denormalized table and seeing how performance is compared to the normalized version.
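For reference, a sketch of the normalized layout described above (names are made up to match the question's simplified schema; ON CONFLICT needs PostgreSQL 9.5+, on older versions you would SELECT first and INSERT on a miss):

import psycopg2

SCHEMA = """
CREATE TABLE urls (
    id  serial PRIMARY KEY,
    url text   NOT NULL UNIQUE
);
CREATE TABLE positions (
    id        serial  PRIMARY KEY,
    url_id    integer NOT NULL REFERENCES urls(id),
    domain_id integer NOT NULL,
    position  integer NOT NULL
);
"""

# During import, resolve (or create) the url_id once instead of storing the
# full URL text in every row.
UPSERT_URL = """
INSERT INTO urls (url) VALUES (%s)
ON CONFLICT (url) DO UPDATE SET url = EXCLUDED.url
RETURNING id;
"""

with psycopg2.connect("dbname=stats") as conn, conn.cursor() as cur:
    cur.execute(SCHEMA)
    cur.execute(UPSERT_URL, ("http://example.com/",))
    url_id = cur.fetchone()[0]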
I think it makes sense, depending on your use cases. I don't know how far back in time your 30B row history goes, but it makes sense to partition if your transactional database doesn't need more than a few of the partitions you decide on.
For example, partitioning by month makes perfect sense if you only query for two months' worth of data at a time. The other ten months of the year can be moved into a reporting warehouse, keeping the transactional store smaller.
There are restrictions on the fields you can use in the partition. You'll have to be careful with those.
Get a performance baseline, do your partition, and remeasure to check for performance impacts.
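A sketch of month-based partitioning using the declarative syntax from PostgreSQL 10+ (the created_at column is hypothetical, since the simplified schema in the question has no date field, and the partition bounds are only examples):

import psycopg2

DDL = """
CREATE TABLE rankings (
    id         bigserial,
    url_id     integer NOT NULL,
    domain_id  integer NOT NULL,
    position   integer NOT NULL,
    created_at date    NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE rankings_2023_01 PARTITION OF rankings
    FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');
CREATE TABLE rankings_2023_02 PARTITION OF rankings
    FOR VALUES FROM ('2023-02-01') TO ('2023-03-01');
"""

with psycopg2.connect("dbname=stats") as conn, conn.cursor() as cur:
    cur.execute(DDL)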
With the given amount of data in mind, you'll be waiting on IO mostly. If possible, perform some tests with different HW configurations trying to get best IO figures for your scenarios. IMHO, 2 disks will not be enough after a while, unless there's something else behind the scenes.
Your table will be growing daily at a known rate, and most likely it will be queried daily. As you haven't mentioned data being purged (if it will be, then do partition it), queries will run slower each day. At some point you'll start looking at how to optimize your queries. One of the possibilities is to parallelize the query at the application level (see the sketch at the end of this answer). But some conditions should be met for that:
your table should be partitioned in order to parallelize queries;
HW should be capable of delivering the requested amount of IO in N parallel streams.
All answers should be given by the performance tests of different setups.
And as others mentioned, there are more benefits for the DBA in partitioned tables, so I personally would go for partitioning any table that is expected to receive more than 5M rows per interval, be it day, week or month.
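As an illustration of the application-level parallelism mentioned above, here is a sketch that fans one query out over partitions with a thread pool (partition names and the query are hypothetical; each thread gets its own connection):

from concurrent.futures import ThreadPoolExecutor
import psycopg2

PARTITIONS = ["rankings_2023_01", "rankings_2023_02", "rankings_2023_03"]

def count_for_domain(partition, domain_id):
    # One connection per thread: psycopg2 connections should not be shared.
    with psycopg2.connect("dbname=stats") as conn, conn.cursor() as cur:
        cur.execute(f"SELECT count(*) FROM {partition} WHERE domain_id = %s", (domain_id,))
        return cur.fetchone()[0]

with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
    total = sum(pool.map(lambda p: count_for_domain(p, 42), PARTITIONS))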

MongoDB Insert performance - Huge table with a couple of Indexes

I am testing MongoDB to be used in a database with a huge table of about 30 billion records of about 200 bytes each. I understand that sharding is needed for that kind of volume, so I am trying to get 1 to 2 billion records on one machine. I have reached 1 billion records on a machine with 2 CPUs / 6 cores each and 64 GB of RAM. I mongoimport-ed without indexes, and speed was okay (average 14k records/s). I added indexes, which took a very long time, but that is okay as it is a one-time thing. Now inserting new records into the database is taking a very long time. As far as I can tell, the machine is not loaded while inserting records (CPU, RAM, and I/O are in good shape). How is it possible to speed up inserting new records?
I would recommend adding this host to MMS (http://mms.10gen.com/help/overview.html#installation) - make sure you install with munin-node support and that will give you the most information. This will allow you to track what might be slowing you down. Sorry I can't be more specific in the answer, but there are many, many possible explanations here. Some general points:
Adding indexes means that the indexes as well as your working data set need to be in RAM now; this may have strained your resources (look for page faults)
Now that you have indexes, they must be updated when you are inserting - if everything fits in RAM this should be OK, see first point
You should also check your Disk IO to see how that is performing - how does your background flush average look?
Are you running the correct filesystem (XFS, ext4) and a kernel version later than 2.6.25? (earlier versions have issues with fallocate())
Some good general information for follow up can be found here:
http://www.mongodb.org/display/DOCS/Production+Notes
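A quick way to keep an eye on the metrics mentioned above with pymongo (the backgroundFlushing fields come from the MMAPv1-era serverStatus output, so they may be absent on newer storage engines):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("serverStatus")

flush = status.get("backgroundFlushing", {})
print("average flush ms:", flush.get("average_ms"))
print("last flush ms:   ", flush.get("last_ms"))
print("page faults:     ", status.get("extra_info", {}).get("page_faults"))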