Dealing with NoHostAvailableException with phantom DSL - scala

When trying to insert several thousand records at once into a remote Cassandra db, I reproducibly run into timeouts (with 5 to 6 thousand elements on a slow connection).
error:
All host(s) tried for query failed (tried: /...:9042
(com.datastax.driver.core.exceptions.OperationTimedOutException: [/...]
Timed out waiting for server response))
com.datastax.driver.core.exceptions.NoHostAvailableException:
All host(s) tried for query failed (tried: /...:9042
(com.datastax.driver.core.exceptions.OperationTimedOutException: [/...]
Timed out waiting for server response))
the model:
class RecordModel extends CassandraTable[ConcreteRecordModel, Record] {
  object id extends StringColumn(this) with PartitionKey[String]
  ...
}

abstract class ConcreteRecordModel extends RecordModel
  with RootConnector with ResultSetFutureHelper {

  def store(rec: Record): Future[ResultSet] =
    insert.value(_.id, rec.id).value(...).future()

  def store(recs: List[Record]): Future[List[ResultSet]] = Future.traverse(recs)(store)
}
the connector:
val connector = ContactPoints(hosts).withClusterBuilder(
  _.withCredentials(
    config.getString("username"),
    config.getString("password")
  ).withPoolingOptions(
    new PoolingOptions()
      .setCoreConnectionsPerHost(HostDistance.LOCAL, 4)
      .setMaxConnectionsPerHost(HostDistance.LOCAL, 10)
      .setCoreConnectionsPerHost(HostDistance.REMOTE, 2)
      .setMaxConnectionsPerHost(HostDistance.REMOTE, 4)
      .setMaxRequestsPerConnection(HostDistance.LOCAL, 32768)
      .setMaxRequestsPerConnection(HostDistance.REMOTE, 2000)
      .setPoolTimeoutMillis(10000)
  )
).keySpace(keyspace)
I have tried tweaking the pooling options, separately and together, but even doubling all of the REMOTE settings did not change the timeouts noticeably.
My current workaround, which I would like to avoid, is splitting the list into batches and waiting for the completion of each:
def store(recs: List[Record]): Future[List[ResultSet]] = {
  val rs: Iterator[List[ResultSet]] = recs.grouped(1000) map { slice =>
    Await.result(Future.traverse(slice)(store), 100 seconds)
  }
  Future.successful(rs.to[List].flatten)
}
What would be a good way to handle this issue?
Thank you
EDIT
The errors do suggest a failing/overloaded cluster, but I suspect the network plays a major role here. The numbers provided above are from a remote machine. They are MUCH higher when the same C* is fed from a machine in the same datacenter. Another suspicious detail is that feeding the same C* instance with Quill does not encounter any timeout issues, remote or not.
What I really dislike about this kind of throttling is that the batch size is arbitrary and static, while it should be adaptive.

Sounds like you're hitting the limits of your cluster. If you want to avoid timeouts you will need to add more capacity to be able to handle the load. If you just want to do burst writes, you should throttle them (as you are doing), as sending too many queries to too few nodes will inhibit performance. You can also increase the timeouts on the server side (read_request_timeout_in_ms, write_request_timeout_in_ms, request_timeout_in_ms) if you are willing to wait until the writes go through; however, this is not advisable, as you will not give Cassandra any time to recover and will likely cause large amounts of ParNew GC.
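If you mainly want to get rid of the blocking Await in your workaround, a minimal sketch of non-blocking throttling is to chain the batches with flatMap, so the next slice is only submitted once the previous one has completed. This is meant to live next to the existing store methods in ConcreteRecordModel; the batch size is still a knob you have to pick, it is not adaptive:

  def storeThrottled(recs: List[Record], batchSize: Int = 1000): Future[List[ResultSet]] =
    recs.grouped(batchSize).foldLeft(Future.successful(List.empty[ResultSet])) { (acc, slice) =>
      acc.flatMap { done =>
        // submit the next batch only after the previous one has finished
        Future.traverse(slice)(r => store(r)).map(done ++ _)
      }
    }

To make it adaptive you would replace the fixed batchSize with a limiter on in-flight requests (a semaphore-style gate), but the overall structure stays the same.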

Related

How to reduce the getConnection time consumption in Atomikos

I am using Atomikos XA configuration for Oracle.
This is my code for creating the datasource connection.
OracleXADataSource oracleXADataSource = new OracleXADataSource();
oracleXADataSource.setURL(sourceURL);
oracleXADataSource.setUser(UN);
oracleXADataSource.setPassword(PS);
AtomikosDataSourceBean sourceBean= new AtomikosDataSourceBean();
sourceBean.setXaDataSource(oracleXADataSource);
sourceBean.setUniqueResourceName(resourceName);
sourceBean.setMaxPoolSize(max-pool-size); // 10
atomikos:
datasource:
resourceName: insight
max-pool-size: 10
min-pool-size: 3
transaction-manager-id: insight-services-tm
This configuration is fine for a medium user load of around 5000 requests.
But when the user count increases, say to more than 10000 requests, com.atomikos.jdbc.AbstractDataSourceBean:getConnection starts consuming more time than normal.
The call then takes approximately 1500 ms, whereas it normally takes less than 10 ms. I understand that as user demand increases, getConnection goes into a wait state until a free connection becomes available in the pool. If I increase my max-pool-size, will that sort out my problem, or is there another option available?
Try setting concurrentConnectionValidation=true on your datasource.
If that does not help then consider a free trial of the commercial Atomikos product:
https://www.atomikos.com/Main/ExtremeTransactionsFreeTrial
Best

Calling a rest service from Spark

I'm trying to figure out the best approach to call a Rest endpoint from Spark.
My current approach (solution [1]) looks something like this -
val df = ... // some dataframe
val repartitionedDf = df.repartition(numberPartitions)
lazy val restEndPoint = new restEndPointCaller() // lazy evaluation of the object which creates the connection to REST. lazy vals are also initialized once per JVM (executor)
val enrichedDf = repartitionedDf
  .map(rec => restEndPoint.getResponse(rec)) // calls the rest endpoint for every record
  .toDF
I know I could have used .mapPartitions() instead of .map(), but looking at the DAG, it looks like spark optimizes the repartition -> map to a mapPartition anyway.
In this second approach (solution [2]), a connection is created once for every partition and reused for all records within the partition.
val newDs = myDs.mapPartitions(partition => {
  val restEndPoint = new restEndPointCaller() // creates a REST connection per partition
  val newPartition = partition.map(record => {
    restEndPoint.getResponse(record)
  }).toList // consumes the iterator, thus calls getResponse for every record
  restEndPoint.close() // close the connection here
  newPartition.iterator // create a new iterator
})
In this third approach (solution [3]), a connection is created once per JVM (executor) and reused across all partitions processed by that executor.
lazy val connection = new DbConnection // intended: one db connection per JVM (executor)
val newDs = myDs.mapPartitions(partition => {
  val newPartition = partition.map(record => {
    readMatchingFromDB(record, connection)
  }).toList // consumes the iterator, thus calls readMatchingFromDB
  newPartition.iterator // create a new iterator
})
connection.close() // close the db connection here
[a] With Solutions [1] and [3], which are very similar, is my understanding of how lazy vals work correct? The intention is to restrict the number of connections to 1 per executor/JVM and reuse the open connections for processing subsequent requests. Will I be creating 1 connection per JVM or 1 connection per partition?
[b] Are there any other ways by which I can control the number of requests (RPS) we make to the REST endpoint?
[c] Please let me know if there are better and more efficient ways to do this.
Thanks!
IMO the second solution with mapPartitions is better. First, you explicitly tell what you're expecting to achieve. The name of the transformation and the implemented logic say it pretty clearly. For the first option you need to be aware of how Apache Spark optimizes the processing. It may be obvious to you right now, but you should also think about the people who will work on your code, or simply about yourself in 6 months, 1 year, 2 years and so forth. They will understand mapPartitions more easily than repartition + map.
Moreover, the internal optimization of repartition + map may change one day (I don't think it will, but you can still consider it a valid point), and at that moment your job would perform worse.
Finally, with the 2nd solution you avoid a lot of problems that you can encounter with serialization. In the code you wrote, the driver creates one instance of the endpoint object, serializes it and sends it to the executors. So yes, maybe it'll be a single instance, but only if it's serializable.
[edit]
Thanks for the clarification. You can achieve what you are looking for in different ways. To have exactly 1 connection per JVM you can use the singleton design pattern. In Scala it's expressed pretty easily as an object (the first link I found on Google: https://alvinalexander.com/scala/how-to-implement-singleton-pattern-in-scala-with-object)
This is also quite convenient because you don't need to serialize anything. Singletons are read directly from the classpath on the executor side, so you're sure to have exactly one instance of a given object.
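A minimal sketch of that idea, reusing the restEndPointCaller and myDs names from the question (everything else here is illustrative):

  // an object is initialized lazily, once per JVM (executor), and nothing gets serialized
  object RestClientHolder {
    lazy val restEndPoint = new restEndPointCaller()
  }

  val enrichedDs = myDs.mapPartitions { partition =>
    // every partition handled by this executor reuses the same client instance
    partition.map(record => RestClientHolder.restEndPoint.getResponse(record))
  }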
[a] With Solutions [1] and [3], which are very similar, is my understanding of how lazy vals work correct? The intention is to restrict the number of connections to 1 per executor/JVM and reuse the open connections for processing subsequent requests. Will I be creating 1 connection per JVM or 1 connection per partition?
It'll create 1 connection per partition. You can execute this small test to see that:
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FlatSpec

class SerializationProblemsTest extends FlatSpec {
  val conf = new SparkConf().setAppName("Spark serialization problems test").setMaster("local")
  val sparkContext = SparkContext.getOrCreate(conf)

  "lazy object" should "be created once per partition" in {
    lazy val restEndpoint = new NotSerializableRest()
    sparkContext.parallelize(0 to 120).repartition(12)
      .mapPartitions(numbers => {
        //val restEndpoint = new NotSerializableRest()
        numbers.map(nr => restEndpoint.enrich(nr))
      })
      .collect()
  }
}

class NotSerializableRest() {
  println("Creating REST instance")
  def enrich(id: Int): String = s"${id}"
}
It should print "Creating REST instance" 12 times (the number of partitions).
[b] Are there ways by which I can control the number of requests (RPS) we make to the REST endpoint?
To control the number of requests you can use an approach similar to database connection pools: HTTP connection pool (one quickly found link: HTTP connection pooling using HttpClient).
But maybe another valid approach would be to process smaller subsets of data? So instead of taking 30000 rows to process, you can split them into smaller micro-batches (if it's a streaming job). It should give your web service a little bit more "rest".
Otherwise you can also try to send bulk requests (Elasticsearch does this to index/delete multiple documents at once: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html), but it's up to the web service to allow you to do so.
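If the service does support bulk calls, a rough sketch of combining them with mapPartitions could look like this (getBulkResponse and bulkSize are hypothetical; they are not part of the restEndPointCaller from the question):

  val bulkSize = 100 // illustrative value
  val enrichedDs = myDs.mapPartitions { partition =>
    val restEndPoint = new restEndPointCaller()
    partition.grouped(bulkSize).flatMap { batch =>
      // one HTTP call for the whole batch instead of one call per record
      restEndPoint.getBulkResponse(batch.toList) // hypothetical bulk API
    }
  }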

Scala Play 2.5 with Slick 3 and Specs2

I have a Play application using Slick that I want to test using Specs2, but I keep getting the error org.postgresql.util.PSQLException: FATAL: sorry, too many clients already. I have tried to shut down the database connection by using
val mockApp = new GuiceApplicationBuilder()
val db = mockApp.injector.instanceOf[DBApi].database("default")
...
override def afterAll = {
  db.getConnection().close()
  db.shutdown()
}
But the error persists. The Slick configuration is
slick.dbs.default.driver="slick.driver.PostgresDriver$"
slick.dbs.default.db.driver="org.postgresql.Driver"
slick.dbs.default.db.url="jdbc:postgresql://db:5432/hygge_db"
slick.dbs.default.db.user="*****"
slick.dbs.default.db.password="*****"
getConnection of DBApi either gets a connection from the underlying data source's pool (JdbcDataSource, I presume) or creates a new one. I see no pool specified in your configuration, so I think it always creates a new one for you. So if you didn't close the connection inside the test, getConnection won't help - it will just try to create a new one or take a random connection from the pool (if pooling is enabled).
So the solution is to either configure connection pooling:
When using a connection pool (which is always recommended in production environments) the minimum size of the connection pool should also be set to at least the same size. The maximum size of the connection pool can be set much higher than in a blocking application. Any connections beyond the size of the thread pool will only be used when other connections are required to keep a database session open (e.g. while waiting for the result from an asynchronous computation in the middle of a transaction) but are not actively doing any work on the database.
so you can just set the maximum number of pooled connections in your config. Note that in Slick 3 the connectionPool key selects the pool implementation (HikariCP by default); the pool size itself is set with keys such as numThreads and maxConnections:
slick.dbs.default.db.numThreads = 5
slick.dbs.default.db.maxConnections = 5
Or you can share the same connection (you'll probably have to ensure sequential execution then):
object SharedConnectionForAllTests {
  val connection = db.getConnection()
  def close() = connection.close()
}
It's better to inject it with Spring/Guice of course, so you can conveniently manage the connection's lifecycle.
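For example, with Specs2 you could tie the shared connection to the AfterAll hook (a minimal sketch, assuming the SharedConnectionForAllTests object above is in scope):

  import org.specs2.mutable.Specification
  import org.specs2.specification.AfterAll

  class MyDaoSpec extends Specification with AfterAll {
    sequential // one shared connection, so run the examples sequentially

    // ... examples using SharedConnectionForAllTests.connection ...

    def afterAll(): Unit = SharedConnectionForAllTests.close()
  }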

Akka actor pipeline and congested store actor

I am attempting to implement a message processing pipeline using actors. The steps of the pipeline include functions such as reading, filtering, augmentation and, finally, storage into a database.
Something similar to this: http://sujitpal.blogspot.nl/2013/12/akka-content-ingestion-pipeline-part-i.html
The issue is that the reading, filtering and augmentation steps are much faster than the storage step, which results in a congested store actor and an unreliable system.
I am considering the following option: have the store actor pull the processed, ready-to-store messages. Is this a good option? Are there better suggestions?
Thank you
You may consider several options:
- If the order of messages doesn't matter, just execute every storage operation inside a separate actor (or future). This lets all data storage happen in parallel; I recommend using a separate thread pool for that. If some messages are amendments to others or participate in the same transaction, you may create a separate actor per messageId/transactionId to avoid pessimistic/optimistic lock problems (don't forget to kill such actors at transaction end or by timeout).
- Use bounded mailboxes (back-pressure): then you block new messages from your input while older ones are still not processed (for example, you may block the receiving thread until the message is acknowledged by the last actor in the chain; see the sketch after this list). This moves the responsibility to the source system. It works pretty well with JMS durable subscriptions: messages are stored reliably on the JMS broker side until your system has finally processed them.
- Combine the previous two.
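A minimal sketch of the bounded-mailbox option (classic Akka actors; the mailbox name, capacity and StoreActor are illustrative):

  import akka.actor.{Actor, ActorSystem, Props}
  import com.typesafe.config.ConfigFactory

  class StoreActor extends Actor {
    def receive: Receive = { case msg => /* slow write to the database */ }
  }

  val mailboxConfig = ConfigFactory.parseString(
    """
      |store-mailbox {
      |  mailbox-type = "akka.dispatch.BoundedMailbox"
      |  mailbox-capacity = 1000
      |  mailbox-push-timeout-time = 10s
      |}
    """.stripMargin)

  val system = ActorSystem("pipeline", mailboxConfig.withFallback(ConfigFactory.load()))
  // senders block for up to 10s when the mailbox is full, which pushes the back-pressure upstream
  val store = system.actorOf(Props[StoreActor].withMailbox("store-mailbox"), "store")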
I am using an approach similar to this: Akka Work Pulling Pattern (source code here: WorkPullingPattern.scala). It has the advantage that it works both locally & with Akka Cluster. Plus the whole approach is fully asynchronous, no blocking at all.
If your processed "objects" won't all fit into memory, or one of the steps is slow, it is an awesome solution. If you spawn N workers, then N "tasks" will be processed at one time. It might be a good idea to put the "steps" into BalancingPools also with parallelism N (or less).
I have no idea if your processing "pipeline" is sequential or not, but if it is, just a couple of hours ago I developed a type-safe abstraction based on the above plus the Shapeless library. A glimpse at the code, before it was merged with WorkPullingPattern, is here: Pipeline.
It takes any pipeline of functions (of properly matching signatures), spawns them in BalancingPools, creates Workers and links them to a master actor which can be used for scheduling the tasks.
The new Akka Streams library (still in beta) has back-pressure built in. It's designed to solve exactly this problem.
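A rough sketch of what that looks like with Akka Streams (the source, the storeToDb function and the parallelism value are all illustrative stand-ins, not code from the question):

  import akka.actor.ActorSystem
  import akka.stream.ActorMaterializer
  import akka.stream.scaladsl.{Sink, Source}
  import scala.concurrent.Future

  object BackPressuredPipeline extends App {
    implicit val system = ActorSystem("pipeline")
    implicit val materializer = ActorMaterializer()
    import system.dispatcher

    // stand-in for the slow storage step
    def storeToDb(msg: String): Future[Unit] = Future { /* write to the database */ }

    Source(1 to 10000)
      .map(i => s"message-$i")              // stand-in for the read/filter/augment steps
      .mapAsync(parallelism = 4)(storeToDb) // at most 4 writes in flight; a slow store back-pressures upstream
      .runWith(Sink.ignore)
  }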
You could also use receive pipeline on actors:
class PipelinedActor extends Actor with ReceivePipeline {
  // Increment
  pipelineInner { case i: Int ⇒ Inner(i + 1) }
  // Double
  pipelineInner { case i: Int ⇒ Inner(i * 2) }

  def receive: Receive = { case any ⇒ println(any) }
}

actor ! 5 // prints 12 = (5 + 1) * 2
http://doc.akka.io/docs/akka/2.4/contrib/receive-pipeline.html
It suits your needs best if you have small pipelining tasks before/after the actor processes the message. It is blocking code, but I believe that is fine for your case.

How to check that Cluster Sharding has started properly?

I want to check whether ClusterSharding has started or not for one region. Here is the code:
def someMethod(): Unit = {
  val system = ActorSystem("ClusterSystem", ConfigFactory.load())
  val region: ActorRef = ClusterSharding(system).shardRegion("someActorName")
}
The method akka.contrib.pattern.ClusterSharding#shardRegion throws an IllegalArgumentException if it does not find the shard region. I do not like the approach of catching IllegalArgumentException just to check that ClusterSharding has not started.
Is there another approach like ClusterSharding(system).isStarted(shardRegionName = "someActorName")?
Or is it assumed that I should start all shard regions at ActorSystem startup?
You should indeed start all regions as soon as possible. According to the docs:
"When using the sharding extension you are first, typically at system startup on each node in the cluster, supposed to register the supported entry types with the ClusterSharding.start method."
Startup of a region is not immediate. In particular, even in local cases, it takes at the very least the time specified in the akka.contrib.cluster.sharding.retry-interval parameter of your configuration (the name is misleading: this value is both the initial registration delay and the retry interval) before your sharded actors can effectively receive messages (messages sent during that period are not lost, but they are not delivered until a while later).
If you want to be 100% sure that your region has started, you should have one of your sharded actors respond to an identify message after you call ClusterSharding.start. Once it replies, you are guaranteed that your region is up and running. You can use an ask pattern if you want to block and await on the ask future.
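A sketch of that approach (Ping is a hypothetical probe message that your shard-id extractor must route and your sharded entity must answer):

  import akka.pattern.ask
  import akka.util.Timeout
  import scala.concurrent.Await
  import scala.concurrent.duration._

  case class Ping(id: String) // hypothetical probe, handled by your entity actor

  implicit val timeout: Timeout = Timeout(10.seconds)
  val region = ClusterSharding(system).shardRegion("someActorName")

  // blocks until an entity actually answers, i.e. the region is up and running
  Await.result(region ? Ping("probe-entity"), timeout.duration)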