Play Framework 2.5 requests timing out randomly - PostgreSQL

Symptom
After running just fine for some time, our backend stops giving responses for most of its endpoints; it simply starts behaving like a black hole for those requests. Once in this state, it stays there unless we take action.
Update
We can reproduce this behaviour with a DB dump we made when the backend was in the non-responding state.
Infrastructure Setup
We are running Play 2.5 in AWS on an EC2 instance behind a load balancer, with a PostgreSQL database on RDS. We are using slick-pg as our database connector.
What we know
Here are a few things we have figured out so far.
About the HTTP requests
Our logs and debugging show us that the requests are passing the filters. We also see that during authentication (we are using Silhouette for that) the application is able to perform database queries to retrieve the identity for the request. The controller's action, however, is never called.
The backend does respond to HEAD requests. Further logging showed that controllers using injected services (we are using Google's Guice for that) seem to be the ones whose methods are no longer being called. Controllers without injected services seem to work fine.
About the EC2 instance
Unfortunately, we are not able to get much information from that one. We are using Boxfuse, which gives us immutable infrastructure that we cannot SSH into. We are about to switch to a Docker-based deployment and might be able to provide more information soon. Nevertheless, we have New Relic set up to monitor our servers and cannot find anything suspicious there: memory and CPU usage look fine.
Still, this setup gives us a new EC2 instance on every deployment anyway, and even after a redeployment the issue usually persists. Occasionally, though, a redeployment does resolve it.
Even weirder, we can run the backend locally against the database on AWS and everything works just fine there.
So it is hard for us to say where the problem lies. The DB seems not to work with any EC2 instance (until it eventually works with a new one), yet it does work with our local machines.
About the Database
The DB is the only stateful entity in this setup, so we think the issue should somehow be related to it.
As we have a production and a staging environment, we can dump the production DB into staging when the latter stops working, and we found that this indeed resolves the issue immediately. For a long time we were unable to capture a snapshot of a somehow corrupt database to dump into the staging environment and see whether that would break it immediately. We now have a snapshot of the DB taken while the backend was not responding; when we dump it into our staging environment, the backend stops responding immediately.
The number of connections to the DB is around 20 according to the AWS console, which is normal.
TL;DR
Our backend eventually starts behaving like a black hole for some of its endpoints
The requests never reach the controller actions
A new EC2 instance might resolve this, but not necessarily
Locally, with the very same DB, everything works fine
Dumping a working DB into it resolves the issue
CPU and memory usage of the EC2 instances, as well as the number of connections to the DB, look totally fine
We can reproduce the behaviour with a DB dump we made when the backend was no longer responding (see Update 2)
With new Slick thread pool settings, we got ThreadPoolExecutor exceptions from Slick after a reboot of the DB followed by a reboot of our EC2 instance (see Update 3)
Update 1
Responding to marcospereira:
Take for instance this ApplicationController.scala:
package controllers

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

import akka.actor.ActorRef
import com.google.inject.Inject
import com.google.inject.name.Named
import com.mohiva.play.silhouette.api.Silhouette
import play.api.i18n.{ I18nSupport, MessagesApi }
import play.api.mvc.Action
import play.api.mvc.Controller

import jobs.jobproviders.BatchJobChecker.UpdateBasedOnResourceAvailability
import utils.auth.JobProviderEnv

/**
 * The basic application controller.
 *
 * @param messagesApi The Play messages API.
 * @param webJarAssets The webjar assets implementation.
 */
class ApplicationController @Inject() (
  val messagesApi: MessagesApi,
  silhouette: Silhouette[JobProviderEnv],
  implicit val webJarAssets: WebJarAssets,
  @Named("batch-job-checker") batchJobChecker: ActorRef
) extends Controller with I18nSupport {

  def index = Action.async { implicit request =>
    Future.successful(Ok)
  }

  def admin = Action.async { implicit request =>
    Future.successful(Ok(views.html.admin.index.render))
  }

  def taskChecker = silhouette.SecuredAction.async {
    batchJobChecker ! UpdateBasedOnResourceAvailability
    Future.successful(Ok)
  }
}
The index and admin actions work just fine. The taskChecker action, however, shows the weird behaviour.
Update 2
We are able to reproduce this issue now! We found a DB dump we had made the last time our backend stopped responding. When we dump it into our staging database, the backend stops responding immediately.
We have now started logging the number of threads in one of our filters using Thread.getAllStackTraces.keySet.size and found that between 50 and 60 threads are running.
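For reference, such a filter can be as small as the following sketch (the class name and log message are placeholders, not our actual filter code):

import javax.inject.Inject
import scala.concurrent.Future
import akka.stream.Materializer
import play.api.Logger
import play.api.mvc.{ Filter, RequestHeader, Result }

// Diagnostic filter: logs the live JVM thread count for every incoming request.
class ThreadCountFilter @Inject() (implicit val mat: Materializer) extends Filter {
  def apply(next: RequestHeader => Future[Result])(rh: RequestHeader): Future[Result] = {
    Logger.info(s"Live JVM threads: ${Thread.getAllStackTraces.keySet.size}")
    next(rh)
  }
}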
Update 3
As @AxelFontaine suggested, we enabled Multi-AZ deployment failover for the database and rebooted the database with failover. Before, during and after the reboot the backend was not responding.
After the reboot we noticed that the number of connections to the DB stayed at 0. We also no longer got any logs for authentication (before, we did; the authentication step could even make DB requests and got responses).
After rebooting the EC2 instance, we are now getting
play.api.UnexpectedException: Unexpected exception[RejectedExecutionException: Task slick.backend.DatabaseComponent$DatabaseDef$$anon$2@76d6ac53 rejected from java.util.concurrent.ThreadPoolExecutor@6ea1d0ce[Running, pool size = 4, active threads = 4, queued tasks = 5, completed tasks = 157]]
(we did not get those before)
for our requests as well as for our background jobs that need to access the DB. Our Slick settings now include
numThreads = 4
queueSize = 5
maxConnections = 10
connectionTimeout = 5000
validationTimeout = 5000
as suggested here
Update 4
After we got the exceptions described in Update 3, the backend is now running fine again. We did not do anything for that; this was the first time the backend recovered from this state without our involvement.

It sounds like a thread management issue at first glance. Slick will provide its own execution context for database operations if you are using Slick 3.1, but you do want to manage the queue size so that it maps out to roughly the same size as the database:
myapp = {
  database = {
    driver = org.h2.Driver
    url = "jdbc:h2:./test"
    user = "sa"
    password = ""

    // The number of threads determines how many things you can *run* in parallel;
    // the number of connections determines how many things you can *keep in memory* at the same time
    // on the database server.
    // numThreads = (core_count (hyperthreading included))
    numThreads = 4

    // queueSize = ((core_count * 2) + effective_spindle_count)
    // on a MBP 13, this is 2 cores * 2 (hyperthreading not included) + 1 hard disk
    queueSize = 5

    // https://groups.google.com/forum/#!topic/scalaquery/Ob0R28o45eM
    // make larger than numThreads + queueSize
    maxConnections = 10

    connectionTimeout = 5000
    validationTimeout = 5000
  }
}
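For context, a block like the one above is typically picked up with Slick's Database.forConfig. A minimal sketch (the "myapp.database" path matches the sample config above; swap in your own profile and path):

import slick.driver.PostgresDriver.api._

object DbProvider {
  // Slick reads numThreads, queueSize, maxConnections, connectionTimeout and
  // validationTimeout from this config path and sizes its thread and connection
  // pools accordingly.
  lazy val db: Database = Database.forConfig("myapp.database")
}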
Also, you may want to use a custom ActionBuilder, inject a Futures component, and add
import play.api.libs.concurrent.Futures._
Once you have done that, you can call future.withTimeout(500 milliseconds) to time out the future so that an error response comes back. There is an example of a custom ActionBuilder in the Play example project:
https://github.com/playframework/play-scala-rest-api-example/blob/2.5.x/app/v1/post/PostAction.scala
class PostAction @Inject()(messagesApi: MessagesApi)(
    implicit ec: ExecutionContext)
  extends ActionBuilder[PostRequest]
  with HttpVerbs {

  type PostRequestBlock[A] = PostRequest[A] => Future[Result]

  private val logger = org.slf4j.LoggerFactory.getLogger(this.getClass)

  override def invokeBlock[A](request: Request[A],
                              block: PostRequestBlock[A]): Future[Result] = {
    if (logger.isTraceEnabled()) {
      logger.trace(s"invokeBlock: request = $request")
    }

    val messages = messagesApi.preferred(request)
    val future = block(new PostRequest(request, messages))

    future.map { result =>
      request.method match {
        case GET | HEAD =>
          result.withHeaders("Cache-Control" -> s"max-age: 100")
        case other =>
          result
      }
    }
  }
}
So you would add the timeout, metrics (or a circuit breaker if the database is down) here.
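As a rough sketch of where that timeout would go, here is the same idea expressed with akka.pattern.after, which is already available on Play 2.5 (the controller, the loadFromDb helper and the 500 ms value are placeholders, not code from the question):

import javax.inject.Inject
import scala.concurrent.duration._
import scala.concurrent.{ ExecutionContext, Future, TimeoutException }
import akka.actor.ActorSystem
import play.api.mvc.{ Action, Controller }

class TimedController @Inject()(system: ActorSystem)(implicit ec: ExecutionContext) extends Controller {

  private def loadFromDb(): Future[String] = Future.successful("ok") // stand-in for a Slick call

  // Race the real future against one that fails after the given duration.
  private def withTimeout[T](f: Future[T], d: FiniteDuration): Future[T] =
    Future.firstCompletedOf(Seq(
      f,
      akka.pattern.after(d, system.scheduler)(Future.failed(new TimeoutException(s"timed out after $d")))
    ))

  def index = Action.async {
    withTimeout(loadFromDb(), 500.milliseconds)
      .map(body => Ok(body))
      .recover { case _: TimeoutException => ServiceUnavailable("Database call timed out") }
  }
}

This fails fast with an error response instead of letting a stuck database call hang the request indefinitely.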

After some more investigation we found that one of our jobs was generating deadlocks in our database. The issue we were running into is a known bug in the Slick version we were using and is reported on GitHub.
So the problem was that we were running DB transactions with .transactionally within a .map of a DBIOAction on too many threads at the same time.
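To make the shape of that bug concrete, here is a hypothetical sketch (the table, helpers and plain PostgresDriver profile are made up for illustration, not our actual job code or slick-pg profile) of the nested-transaction pattern versus a composed single transaction:

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import slick.driver.PostgresDriver.api._

object DeadlockSketch {
  // Made-up helpers purely for illustration:
  def jobIds: DBIO[Seq[Long]] = sql"select id from jobs".as[Long]
  def touchJob(id: Long): DBIO[Int] = sqlu"update jobs set checked = now() where id = $id"

  // Anti-pattern: a nested, transactional db.run is started inside the .map of another
  // action. Each inner transaction needs an extra thread/connection while the outer run
  // still holds one, so with numThreads = 4 the pool can end up deadlocked.
  def bad(db: Database): Future[Seq[Future[Int]]] =
    db.run(jobIds).map { ids =>
      ids.map(id => db.run(touchJob(id).transactionally))
    }

  // Safer: compose everything into one DBIOAction and run it as a single transaction.
  def good(db: Database): Future[Unit] =
    db.run(
      jobIds
        .flatMap(ids => DBIO.sequence(ids.map(touchJob)))
        .transactionally
        .map(_ => ())
    )
}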

Related

Flask -- Reloading database which was loaded before first request

I have an API running on Flask which is connected to MongoDB, and it uses this DB for reading only.
I connect to the DB before the first request:
@app.before_first_request
def load_dicti():
    c = MongoClient('mongodb://' + app.config['MONGO_DSN'], connect=False)
    db = c.my_name
    app.first, app.second = dictionary_compilation(db.my_base, another_dictionary)
However, this MongoDB may be updated from time to time. My API doesn't know about it because the DB was already loaded before the first request.
What's the most efficient way to cope with this? I'd be grateful for explanations and code examples.
I can't quite figure out what you are trying to do, but the application context may be the best practice here. As in the demo in the Flask docs, you could do:
from flask import g
from pymongo import MongoClient

# `app` is your existing Flask application object
def get_db():
    """Opens a new database connection if there is none yet for the
    current application context.
    """
    if not hasattr(g, 'db'):
        c = MongoClient('mongodb://' + app.config['MONGO_DSN'], connect=False)
        g.db = c.my_name
    return g.db
Then you can use get_db() directly in your view functions; MongoDB will be connected only once, when there is no db attribute on g yet.
If your connection is so unstable that you need to refresh it every time, you could connect on every request or every session.

How to combine py.test fixtures with Flask-SQLAlchemy and PostgreSQL?

I'm struggling to write py.test fixtures for managing my app's database that maximize speed, support pytest-xdist parallelization of tests, and isolate the tests from each other.
I'm using Flask-SQLAlchemy 2.1 against a PostgreSQL 9.4 database.
Here's the general outline of what I'm trying to accomplish:
$ py.test -n 3 spins up three test sessions for running tests.
Within each session, a py.test fixture runs once to set up a transaction, create the database tables, and then, at the end of the session, roll back the transaction. Creating the database tables needs to happen within a PostgreSQL transaction that's only visible to that particular test session, otherwise the parallelized test sessions created by pytest-xdist conflict with each other.
A second py.test fixture that runs for every test connects to the existing transaction in order to see the created tables, creates a nested savepoint, runs the test, then rolls back to the nested savepoint.
Ideally, these pytest fixtures support tests that call db.session.rollback(). There's a potential recipe for accomplishing this at the bottom of this SQLAlchemy doc.
Ideally the pytest fixtures should yield the db object, not just the session, so that folks can write tests without having to remember to use a session that's different from the standard db.session they use throughout the app.
Here's what I have so far:
import pytest

# create_app() is my Flask application factory
# db is just 'db = SQLAlchemy()' + 'db.init_app(app)' within the create_app() function
from app import create_app, db as _db


@pytest.yield_fixture(scope='session', autouse=True)
def app():
    '''Session-wide test application'''
    a = create_app('testing')
    with a.app_context():
        yield a


@pytest.yield_fixture(scope='session')
def db_tables(app):
    '''Session-wide test database'''
    connection = _db.engine.connect()
    trans = connection.begin()  # begin a non-ORM transaction

    # Theoretically this creates the tables within the transaction
    _db.create_all()

    yield _db

    trans.rollback()
    connection.close()


@pytest.yield_fixture(scope='function')
def db(db_tables):
    '''db session that is joined to existing transaction'''
    # I am quite sure this is broken, but it's the general idea

    # bind an individual Session to the existing transaction
    db_tables.session = db_tables.Session(bind=db_tables.connection)

    # start the session in a SAVEPOINT...
    db_tables.session.begin_nested()

    # yield the db object, not just the session so that tests
    # can be written transparently using the db object
    # without requiring someone to understand the intricacies of these
    # py.test fixtures or having to remember when to use a session that's
    # different than db.session
    yield db_tables

    # rollback to the savepoint before the test ran
    db_tables.session.rollback()
    db_tables.session.remove()  # not sure this is needed
Here are the most useful references I've found while googling:
http://docs.sqlalchemy.org/en/latest/orm/session_transaction.html#joining-a-session-into-an-external-transaction-such-as-for-test-suites
http://koo.fi/blog/2015/10/22/flask-sqlalchemy-and-postgresql-unit-testing-with-transaction-savepoints/
https://github.com/mitsuhiko/flask-sqlalchemy/pull/249
I'm a couple of years late here, but you might be interested in pytest-flask-sqlalchemy, a plugin I wrote to help address this exact problem.
The plugin provides two fixtures, db_session and db_engine, which you can use like regular Session and Engine objects to run updates that will get rolled back at the end of the test. It also exposes a few configuration directives (mocked-engines and mocked-sessions) that will mock out connectables in your app and replace them with these fixtures so that you can run methods and be sure that any state changes will get cleaned up when the test exits.
The plugin should work with a variety of databases, but it's been tested most heavily against Postgres 9.6 and is in production in the test suite for https://dedupe.io. You can find some examples in the documentation that should help you get started, but if you're willing to provide some code I'd be happy to demonstrate how to use the plugin, too.
I had a similar issue trying to combine yield fixtures. Unfortunately, according to the docs you are not able to combine more than one yield level.
But you might be able to find a workaround using a finalizer registered via request.addfinalizer:
@pytest.fixture(scope='session', autouse=True)
def app():
    '''Session-wide test application'''
    a = create_app('testing')
    with a.app_context():
        return a


@pytest.fixture(scope='session')
def db_tables(request, app):
    '''Session-wide test database'''
    connection = _db.engine.connect()
    trans = connection.begin()  # begin a non-ORM transaction

    # Theoretically this creates the tables within the transaction
    _db.create_all()

    def close_db_session():
        trans.rollback()
        connection.close()

    request.addfinalizer(close_db_session)

    return _db

RedisClient fails strategy

I'm using Play Framework with Scala. I also use the rediscala driver (this one https://github.com/etaty/rediscala ) to communicate with Redis. If Redis doesn't contain the data, my app looks for it in MongoDB.
When Redis fails or is just not available for some reason, the application waits too long for a response. How can I implement a failover strategy in this case? I would like to stop querying Redis if requests take too long, and start working with Redis again when it is back online.
To clarify the question, my code currently looks like the following:
private def getUserInfo(userName: String): Future[Option[UserInfo]] = {
  CacheRepository.getBaseUserInfo(userName) flatMap {
    case Some(userInfo) =>
      Logger.trace(s"AuthenticatedAction.getUserInfo($userName). User has been found in cache")
      Future.successful(Some(userInfo))
    case None =>
      getUserFromMongo(userName)
  }
}
I think you need to distinguish between the following cases (in order of their likelihood of occurrence):
1. No data in the cache (Redis) - I guess in this case Redis will return very quickly and you have to get the data from Mongo. In your code above you need to set the data in Redis after you get it from Mongo, so that you have it in the cache for subsequent calls.
You need to wrap your RedisClient in application code that is aware of any disconnects/reconnects. Essentially, have two states: first, when Redis is working properly; second, when Redis is down/slow.
2. Redis is slow - this could happen because of one of the following.
2.1. Network is slow: Again, you cannot do much about this except return a message to your client. Going to Mongo is unlikely to resolve this if your network itself is slow.
2.2. Operation is slow: This happens if you are trying to get a lot of data or you are running a range query on a sorted set, for example. In this case you need to revisit the Redis data structure you are using and the amount of data you are storing in Redis. However, it looks like in your example this is not going to be an issue. Single Redis get operations are generally low latency on a LAN.
3. Redis node is not reachable - I'm not sure how often this is going to happen unless your network is down. In such a case you will have trouble connecting to MongoDB as well. I believe this can also happen when the node running Redis is down or its disk is full, etc., so you should handle this in your design. Having said that, the rediscala client will automatically detect any disconnects and reconnect automatically. I have personally done this: stopped Redis, upgraded the Redis version, and restarted Redis without touching my running client (JVM).
Finally, you can use a Future with a timeout (see - Scala Futures - built in timeout?) in your program above. If the Future is not completed by the timeout you can take your other action(s) (go to Mongo or return an error message to the user). Given that #1 and #2 are likely to happen much more frequently than #3, your timeout value should reflect these two cases. Given that #1 and #2 are fast on a LAN, you can start with a timeout value of 100ms.
Soumya Simanta provided a detailed answer, and I would just like to post the code I used for the timeout. The code requires Play Framework, which is used in my project:
private def get[B](key: String, valueExtractor: Map[String, ByteString] => Option[B], logErrorMessage: String): Future[Option[B]] = {
  val timeoutFuture = Promise.timeout(None, Duration(Settings.redisTimeout))

  val mayBeHaveData = redisClient.hgetall(key) map { value =>
    valueExtractor(value)
  } recover {
    case e =>
      Logger.info(logErrorMessage + e)
      None
  }

  // if a timeout occurs, None will be the result of this method
  Future.firstCompletedOf(List(mayBeHaveData, timeoutFuture))
}

How to use Casbah MongoDB connections?

Note: I realise there is a similar question on SO but it talks about an old version of Casbah, plus, the behaviour explained in the answer is not what I see!
I was under the impression that Casbah's MongoClient handled connection pooling. However, doing lsof on my process I see a big and growing number of mongodb connections, which makes me doubt this pooling actually exists.
Basically, this is what I'm doing:
class MongodbDataStore {
  val mongoClient = MongoClient("host", 27017)("database")

  def getObject1(): Object1 = {
    val collection = mongoClient("object1Collection")
    ...
  }

  def getObject2(): Object2 = {
    val collection = mongoClient("object2Collection")
    ...
  }
}
So, I never close MongoClient.
Should I be closing it after every query? Implement my own pooling? What then?
Thank you
Casbah is a wrapper around the MongoDB Java client, so the connection is actually managed by it.
According to the Java driver documentation (http://docs.mongodb.org/ecosystem/drivers/java-concurrency/):
If you are using in a web serving environment, for example, you should create a single MongoClient instance, and you can use it in every request. The MongoClient object maintains an internal pool of connections to the database (default maximum pool size of 100). For every request to the DB (find, insert, etc) the Java thread will obtain a connection from the pool, execute the operation, and release the connection. This means the connection (socket) used may be different each time.
By the way, that's what I've experienced in production. I did not see any problem with this.
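In practice that "one MongoClient per application" approach can look like the following minimal sketch (host, database and collection names are placeholders, not from the question):

import com.mongodb.casbah.Imports._

object Mongo {
  // Created once for the whole app; the underlying Java driver maintains the connection pool.
  val client: MongoClient = MongoClient("host", 27017)
  val db: MongoDB = client("database")
}

class MongodbDataStore {
  private val objects1 = Mongo.db("object1Collection")

  def findObject1(id: String): Option[DBObject] =
    objects1.findOne(MongoDBObject("_id" -> id))
}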

Re-using sessions in ScalaQuery?

I need to do small (but frequent) operations on my database from one of my API methods. When I try wrapping them in "withSession" each time, I get terrible performance.
db withSession {
  SomeTable.insert(a,b)
}
Running the above example 100 times takes 22 seconds. Running them all in a single session is instantaneous.
Is there a way to re-use the session in subsequent function invocations?
Do you have some type of connection pooling in place (see JDBC Connection Pooling: Connection Reuse?)? If not, you'll be using a new connection for every withSession(...), and that is a very slow approach. See http://groups.google.com/group/scalaquery/browse_thread/thread/9c32a2211aa8cea9 for a description of how to use C3P0 with ScalaQuery.
If you use a managed resource from an application server you'll usually get this for "free", but in stand-alone servers (for example, Jetty) you'll have to configure this yourself.
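A rough sketch of the C3P0 approach from the linked thread (connection details are placeholders, and it assumes your ScalaQuery version exposes Database.forDataSource, as its successor Slick does):

import com.mchange.v2.c3p0.ComboPooledDataSource
import org.scalaquery.session.Database

object Db {
  private val pool = new ComboPooledDataSource
  pool.setDriverClass("org.postgresql.Driver")
  pool.setJdbcUrl("jdbc:postgresql://localhost/mydb")
  pool.setUser("user")
  pool.setPassword("password")
  pool.setMinPoolSize(5)
  pool.setMaxPoolSize(20)

  // withSession now borrows an already-open connection from the pool
  // instead of opening a fresh one on every call.
  val database: Database = Database.forDataSource(pool)
}

// Usage: Db.database withSession { SomeTable.insert(a, b) }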
I'm probably stating the obvious, but you could just put more calls inside the withSession block, like:
db withSession {
  SomeTable.insert(a,b)
  SomeOtherTable.insert(a,b)
}
Alternatively, you can create an implicit session, do your work, and close it when you're done:
implicit val session = db.createSession
SomeTable.insert(a,b)
SomeOtherTable.insert(a,b)
session.close
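A slightly safer variant of the same idea (just a sketch) is to close the session in a finally block, so it is released even if one of the inserts throws:

implicit val session = db.createSession()
try {
  SomeTable.insert(a, b)
  SomeOtherTable.insert(a, b)
} finally {
  session.close()
}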