What the significance of thread in YCSB - nosql

I try to compare performances of Mongodb and Cassandra by changing the number of threads. In fact, I don't really understand what the significance of threads is ? (Number of requests or number of clients or number of cpus?)
Please help. Thanks.

Related

How to figure out optimal max_worker_processes?

Yet I can not find any reliable recommendation regarding the optimal value for max_worker_processes.
Some sources suppose that the value should not be higher than the number of available cores, but is that correct taking that server threads do a lot of IO?
Say I have 8 cores for PG container and plan to handle about 100 clients in parallel. Is that feasible, especially with the default max_worker_processes=8 ?
Any trusted reference would be much appreciated.
The reasonable limit dies not depend on the number of client connections, but on the actual upper limit on concurrent queries.
If it is guaranteed that only one of these clients will ever be active at the same time, you could set max_worker_processes, max_parallel_workers and max_parallel_workers_per_gather one less than the number of cores or parallel I/O operations that your storage can handle, whatever of the two is smaller. In essence, one query can then consume all the available resources.
On the other hand, if many of these clients are likely to run queries concurrently, you should disable parallel query by setting max_parallel_workers_per_gather to 0 to avoid overloading your database.
Concerning your comments to my answer: if your goal is to limit the number of active queries, use a connection pool.

Clickhouse prioritize kafka import threads over other queries

TLDR; Is there any way to prioritize the kafka-engine import threads over any other CH threads OR can i reserve CPUs for the kafka consumers?
In my setup, the kafkalag increases too much when issuing a big query. I guess, this is because the import thread doesn't receive enough CPU time when there is too much CPU load. I tried to set a maximum thread cap for users as well as setting nice values. Nothing seems to work, so any advice is welcome.
upgrade to 20.9.7.11
recreate kafka engine tables with settings kafka_num_consumers=5(10), kafka_thread_per_consumer=1
add to default profile (users.xml) background_schedule_pool_size=30

Writing more than 50 millions from Pyspark df to PostgresSQL, best efficient approach

What would be the most efficient way to insert millions of records say 50-million from a Spark dataframe to Postgres Tables.
I have done this from spark to
MSSQL in the past by making use of bulk copy and batch size option which was successful too.
Is there something similar that can be here for Postgres?
Adding the code I have tried and the time it took to run the process:
def inserter():
start = timer()
sql_res.write.format("jdbc").option("numPartitions","5").option("batchsize","200000")\
.option("url", "jdbc:postgresql://xyz.com:5435/abc_db") \
.option("dbtable", "public.full_load").option("user", "root").option("password", "password").save()
end = timer()
print(timedelta(seconds=end-start))
inserter()
So I did the above approach for 10 million records and had 5 parallel connections as specified in numPartitions and also tried batch size of 200k.
The total time it took for the process was 0:14:05.760926 (fourteen minutes and five seconds).
Is there any other efficient approach which would reduce the time?
What would be the efficient or optimal batch size I can use ? Will increasing my batch size do the job quicker ? Or opening multiple connections i.e > 5 help me make the process quicker ?
On an average 14 mins for 10 million records is not bad, but looking for people out there who would have done this before to help answer this question.
I actually did kind of the same work a while ago but using Apache Sqoop.
I would say that for answering this questions we have to try to optimize the communication between Spark and PostgresSQL, specifically the data flowing from Spark to PostgreSql.
But be careful, do not forget Spark side. It does not make sense to execute mapPartitions if the number of partitions is too high compared with the number of maximum connections which
PostgreSQL support, if you have too many partitions and you are opening a connection for each one, you will probably have the following error org.postgresql.util.PSQLException: FATAL: sorry, too many clients already.
In order to tune the insertion process I would approach the problem following the next steps:
Remember the number of partitions is important. Check the number of partitions and then adjust it based on the number of parallel connection you want to have. You might want to have one connection per partition, so I would suggest to check coalesce, as is mentioned here.
Check the max number of connections which your postgreSQL instance support and you want to increase the number.
For inserting data into PostgreSQL is recommended using COPY command. Here is also a more elaborated answer about how to speed up postgreSQL insertion.
Finally, there is no silver bullet to do this job. You can use all the tips I mentioned above but it will really depends on your data and use cases.

How configure mongodb for support 1100 threads?

How can i configure mongodb's pool connection for support 1100 threads per seconds?
I tried some configurations like bellow without sucess.
connectionsPerHost = 200
threadsAllowedToBlockForConnectionMultiplier = 5
Can someone help me?
Thanks.
It won't.
That number of threads may be prejudicial, there's a lot of techniques to calculate some ideal number, and none of them get even close to 1100. If you're looking to attend a large number of users you should work with server redundancy. You won't get speed because 99.9% (really) of your threads will be locked waiting for a resource become available.
I've worked with java in fast processing, using distributed systems and threads, we used 0mq(tcp alternative) to acelerate communication and get more use of the threads, but we found that moderate number of threads was the ideal (if I remember correctly, 12).
Instead of letting hundreds of threads do the job, try to keep a limited number of workers threads, you won't have more resources anyway. The ideal for this kind of application would be have many servers attending your users.

Scaling DB2 to increase tps

I wanna have brainstorm with you guys all about scaling option that DB2 have. Hope can helped me to resolve the problem.
I need to scale my DB2 database to anticipate flash crowd transaction to database server. My database can only serve around 200 transaction++ per sec in application term not database tps before my database totally stalled and out of cpu.
What are you guys think, if I want to increase to reach 2000++ or 10 times before, what options i have to scale my database?
Recently i read about pureScale feature. Its look promising but its not flexible solution by mean it just can be deploy on IBM System X and ours is not. Are there other solution like pureScale in shared-everything approach?
The second option maybe database partition. Is database partition or shared-nothing approach can help resolve my problem? Can add processing power to my system?
Thanks and regards,
Fritz
Before you worry about how to scale up (more hardware in 1 server) or out (more servers), look at how to tune your database. Buying your way out of a performance problem is almost always more expensive than spending time to find and fix the performance problem.
Assuming that the process(es) consuming CPU on your database server are the database engine, then high CPU activity and low I/O activity is indicative that you're doing a LOT of reads, but they are just all in memory. Scanning a huge table is still in inefficient, even if that table is stored completely in memory (buffer pools).
Find the SQL statements that are using the most CPU. Look at the explain plans, and figure out how to make them more efficient. There are LOTS of resources on the web for database performance tuning.