I am trying to query a MySQL database asynchronously using Slick. The following code template, which I use to query about 90k rows in a for comprehension, seems to work initially, but the program consumes several gigabytes of RAM and fails without warning after around 200 queries.
import scala.concurrent.{future, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.slick.jdbc.{StaticQuery => Q}

// db is a Slick Database obtained elsewhere (Database.forURL or a pooled DataSource)
def doQuery(): Future[List[String]] = future {
  val q = "select name from person"
  db withSession {
    Q.queryNA[String](q).list  // queryNA: static query with no bind parameters
  }
}
I have tried setting up connections both with the fromURL method and with a c3p0 connection pool. My question is: is this the right way to make asynchronous calls to the database?
Async is still an open issue for Slick.
You could try using an Iterable and streaming the data instead of storing it in memory, with a solution similar to this: Treating an SQL ResultSet like a Scala Stream. Please omit the .toStream call at the end, though: it caches the data in memory, whereas an Iterable does not.
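For illustration, here is a rough sketch of that idea against a plain JDBC ResultSet (the column name "name" is taken from the question; obtaining and closing the ResultSet is left out):

import java.sql.ResultSet

// Sketch only: walk the ResultSet row by row instead of collecting everything
// into a List, so nothing is retained in memory after a row has been consumed.
def names(rs: ResultSet): Iterator[String] =
  Iterator.continually(rs).takeWhile(_.next()).map(_.getString("name"))

The resulting Iterator can only be traversed once, and the ResultSet and its Statement should still be closed when the traversal is finished.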
If you want an async version of Iterable, you could look into Observables.
It turns out that this is a non-issue (it was actually a bug in my code, which opened a new database connection for each query). In my experience, you can wrap DB queries in Futures as shown above and compose them later with Scala Async or Rx, as shown here. All that is required for good performance is a large thread pool (2x the number of CPUs in my case) and an equally large connection pool.
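For reference, a rough sketch of that setup; the pool sizes are the assumption described above rather than a recommendation, and db and Q come from the question's code:

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Sketch only: a dedicated thread pool for blocking JDBC calls, sized at roughly
// 2x the number of CPUs, to be matched by an equally sized connection pool.
implicit val dbExecutionContext: ExecutionContext =
  ExecutionContext.fromExecutor(
    Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors * 2))

def doQueryAsync(): Future[List[String]] = Future {
  db withSession {
    Q.queryNA[String]("select name from person").list  // blocking call runs on the dedicated pool
  }
}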
Slick 3 (Reactive Slick) looks like it might address this.
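For illustration, a minimal sketch of what that looks like with Slick 3 (assuming Slick 3.2+, a MySQL profile, and an already configured db; the query is the one from the question):

import slick.basic.DatabasePublisher
import slick.jdbc.MySQLProfile.api._

// Sketch only: db.stream returns a Reactive Streams Publisher, so rows are
// delivered asynchronously with backpressure instead of being collected into
// one large in-memory List.
val names: DatabasePublisher[String] =
  db.stream(sql"select name from person".as[String])

names.foreach(println)  // returns a Future that completes when the stream ends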
We have developed multiple services, each using Akka actors, and the services communicate via Akka gRPC. One service fills an in-memory database; another service, called Reader, applies some queries, shapes the data, and transfers it to an Elasticsearch service for insertion/update. The volume of data in each reading phase is about 1M rows.
The problem arises when Reader transfers a large amount of data, so that Elasticsearch cannot process and insert/update it all.
I used Akka Streams for the communication between these two services. I also use the ScalikeJDBC library and code like the following to read and insert data in batches rather than all at once.
def applyQuery(query: String, mergeResult: Map[String, Any] => Unit) = {
  val publisher = DB readOnlyStream {
    SQL(s"${query}").map(_.toMap()).list().fetchSize(100000)
      .iterator()
  }
  Source.fromPublisher(publisher).runForeach(mergeResult)
}
////////////////////////////////////////////////////////
var batchRows: ListBuffer[Map[String, Any]] = new ListBuffer[Map[String, Any]]
val batchSize: Int = 100000

def mergeResult(row: Map[String, Any]): Unit = {
  batchRows :+= row
  if (batchRows.size == batchSize) {
    send2StorageServer(readyOutput(batchRows))
    batchRows.clear()
  }
}

def readyOutput(res: ListBuffer[Map[String, Any]]): ListBuffer[StorageServerRequest] = {
  // code to format res
}
Now, when using the foreach command, operations become much slower. I tried different batch sizes, but it made no difference. Am I wrong to use the foreach command, or is there a better way to resolve the speed problem using Akka streams, flows, etc.?
I found that the correct operation for appending to a ListBuffer is
batchRows += row
Using :+= does not produce a bug, but it is very inefficient: it desugars to batchRows = batchRows :+ row, which builds a new collection on every append, whereas += appends in place. With the correct operator, foreach is no longer slow. However, a speed problem remains: this time reading the data is fast, but writing to Elasticsearch is slow.
After some searching, I came up with these solutions:
1. Using a queue as a buffer between the database and Elasticsearch may help.
2. If blocking the read operation until the write is done is not too costly, it can be another solution (see the sketch below).
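For what it's worth, here is a minimal Akka Streams sketch of the second idea, assuming Akka 2.6+ and assuming send2StorageServer can be adapted into a function that returns a Future completing once Elasticsearch has accepted the batch (that Future-returning signature is my assumption, not the question's code):

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import akka.{Done, NotUsed}
import scala.concurrent.Future

// Sketch only: group rows into batches and let backpressure pause reading from
// the database while the previous batch is still being written.
def insertWithBackpressure(
    rows: Source[Map[String, Any], NotUsed],
    batchSize: Int,
    writeBatch: Seq[Map[String, Any]] => Future[Done]  // e.g. a Future-based send2StorageServer
)(implicit system: ActorSystem): Future[Done] =
  rows
    .grouped(batchSize)                     // replaces the mutable ListBuffer batching
    .mapAsync(parallelism = 1)(writeBatch)  // at most one batch in flight; upstream reads wait
    .runWith(Sink.ignore)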
I am using Scala.
I tried to fetch all the data from a table with about 4 million rows. I used a stream, and the code looks like this:
val stream: Stream[Record] = expression.stream().iterator().asScala.toStream
stream.map(println(_))
expression is a SelectFinalStep[Record] in jOOQ.
However, the first line is too slow: it takes minutes. Am I doing something wrong?
Use the Stream API directly
If you're using Scala 2.12, you don't have to transform the Java stream returned by expression.stream() into a Scala Iterator and then into a Scala Stream. Simply call:
expression.stream().forEach(r => println(r))
While jOOQ's ResultQuery.stream() method creates a lazy Java 8 Stream, which is discarded after consumption, Scala's Stream keeps previously fetched records in memory for re-traversal. That is probably what causes most of the performance issues when fetching 4 million records.
A note on resources
Do note that expression.stream() returns a resourceful stream, keeping an open underlying ResultSet and PreparedStatement. It is probably a good idea to close the stream explicitly after consumption.
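For example, a minimal sketch (expression being the SelectFinalStep[Record] from the question):

// Sketch only: the Java 8 Stream returned by jOOQ is AutoCloseable; closing it
// releases the underlying ResultSet and PreparedStatement.
val s = expression.stream()
try {
  s.forEach(r => println(r))
} finally {
  s.close()
}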
Optimise JDBC fetch size
Also, you might want to look into calling expression.fetchSize(), which calls through to JDBC's Statement.setFetchSize(). This allows the JDBC driver to fetch rows in batches of N. Some JDBC drivers default to a reasonable fetch size; others default to fetching all rows into memory before passing them to the client.
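For example (the value 500 is only an illustration; driver behaviour varies, and some drivers, such as PostgreSQL's, only honour the fetch size with auto-commit disabled):

// Sketch only: ask the JDBC driver to fetch rows in chunks rather than
// materializing the whole result set at once.
expression
  .fetchSize(500)
  .stream()
  .forEach(r => println(r))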
Another solution would be to fetch the records lazily and construct a Scala Stream. For example:
def allRecords(): Stream[Record] = {
  val cur = expression.fetchLazy()
  def inner(): Stream[Record] = {
    if (cur.hasNext) {
      val next = cur.fetchOne
      next #:: inner()
    } else
      Stream.empty
  }
  inner()
}
In the following code:
val users = TableQuery[Users]

def getUserById(id: Int) = db.run(users.filter(_.id === id).result)
From what I understand, getUserById would create a prepared statement every time it is executed and then discard it. Is there a way to cache the prepared statement so it is created only once and reused many times?
The Slick documentation indicates that you need to enable prepared statement caching in the connection pool configuration.
Here is also quite a good article on it.
The summary is that Slick caches the SQL strings used to prepare the statements, but delegates caching of the actual prepared statements to the underlying connection pool implementation.
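As an illustration (one possible wiring, not the only one), here is a sketch that configures a HikariCP pool with the MySQL Connector/J statement-cache properties and hands it to Slick; the URL and credentials are placeholders:

import com.zaxxer.hikari.{HikariConfig, HikariDataSource}
import slick.jdbc.MySQLProfile.api._

// Sketch only: prepared statement caching is enabled on the pool/driver,
// not inside Slick itself.
val config = new HikariConfig()
config.setJdbcUrl("jdbc:mysql://localhost:3306/mydb")
config.setUsername("user")
config.setPassword("password")
config.addDataSourceProperty("cachePrepStmts", "true")
config.addDataSourceProperty("prepStmtCacheSize", "250")
config.addDataSourceProperty("prepStmtCacheSqlLimit", "2048")

val db = Database.forDataSource(new HikariDataSource(config), maxConnections = Some(10))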
I'm using the Play Framework with Scala. I also use the rediscala driver (this one: https://github.com/etaty/rediscala ) to communicate with Redis. If Redis doesn't contain the data, my app looks for it in MongoDB.
When Redis fails or is simply unavailable for some reason, the application waits too long for a response. How can I implement a failover strategy in this case? I would like to stop querying Redis if requests take too long, and start working with Redis again when it is back online.
To clarify the question, my code currently looks like the following:
private def getUserInfo(userName: String): Future[Option[UserInfo]] = {
  CacheRepository.getBaseUserInfo(userName) flatMap {
    case Some(userInfo) =>
      Logger.trace(s"AuthenticatedAction.getUserInfo($userName). User has been found in cache")
      Future.successful(Some(userInfo))
    case None =>
      getUserFromMongo(userName)
  }
}
I think you need to distinguish between the following cases (in order of their likelihood of occurrence):
1. No data in cache (Redis) - I guess in this case Redis will return very quickly and you have to get the data from Mongo. In your code above, you need to set the data in Redis after you get it from Mongo, so that it is in the cache for subsequent calls.
(In general, you need to wrap your RedisClient in application code that is aware of any disconnects/reconnects. Essentially have two states: first, when Redis is working properly; second, when Redis is down/slow.)
2. Redis is slow - this could happen for one of the following reasons.
2.1. Network is slow: Again, you cannot do much about this except return a message to your client. Going to Mongo is unlikely to help if your network itself is slow.
2.2. Operation is slow: This happens if you are trying to get a lot of data or you are running a range query on a sorted set, for example. In this case you need to revisit the Redis data structure you are using and the amount of data you are storing in Redis. However, it looks like this is not going to be an issue in your example; single Redis get operations are generally low latency on a LAN.
3. Redis node is not reachable - I'm not sure how often this is going to happen unless your network is down, in which case you will have trouble connecting to MongoDB as well. I believe it can also happen when the node running Redis is down or its disk is full, etc., so you should handle this in your design. Having said that, the rediscala client will automatically detect any disconnects and reconnect automatically. I have personally done this: stopped Redis, upgraded its version, and restarted it without touching my running client (JVM).
Finally, you can use a Future with a timeout (see Scala Futures - built in timeout?) in your program above. If the Future is not completed by the timeout, you can take your other action(s) (go to Mongo or return an error message to the user). Given that #1 and #2 are likely to happen much more frequently than #3, your timeout value should reflect these two cases. Given that #1 and #2 are fast on a LAN, you can start with a timeout value of 100ms.
Soumya Simanta provided a detailed answer, and I would just like to post the code I used for the timeout. The code requires the Play Framework, which is used in my project.
private def get[B](key: String, valueExtractor: Map[String, ByteString] => Option[B], logErrorMessage: String): Future[Option[B]] = {
  val timeoutFuture = Promise.timeout(None, Duration(Settings.redisTimeout))
  val mayBeHaveData = redisClient.hgetall(key) map { value =>
    valueExtractor(value)
  } recover {
    case e =>
      Logger.info(logErrorMessage + e)
      None
  }
  // if the timeout occurs first, None will be the result of the method
  Future.firstCompletedOf(List(mayBeHaveData, timeoutFuture))
}
I need to perform small (but frequent) operations on my database from one of my API methods. When I try wrapping each of them in withSession, I get terrible performance.
db withSession {
  SomeTable.insert(a, b)
}
Running the above example 100 times takes 22 seconds. Running them all in a single session is instantaneous.
Is there a way to re-use the session in subsequent function invocations?
Do you have some type of connection pooling in place (see JDBC Connection Pooling: Connection Reuse?)? If not, you'll be opening a new connection for every withSession(...), and that is a very slow approach. See http://groups.google.com/group/scalaquery/browse_thread/thread/9c32a2211aa8cea9 for a description of how to use c3p0 with ScalaQuery.
If you use a managed resource from an application server you'll usually get this for "free", but in standalone servers (for example Jetty) you'll have to configure the pooling yourself.
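A rough sketch of that wiring with c3p0 (driver class, URL, and credentials are placeholders; the Database import depends on the ScalaQuery/Slick version in use):

import com.mchange.v2.c3p0.ComboPooledDataSource
import org.scalaquery.session.Database  // or scala.slick.session.Database in Slick 1.x

// Sketch only: hand ScalaQuery a pooled DataSource so each withSession borrows
// a connection from the pool instead of opening a new one.
val ds = new ComboPooledDataSource
ds.setDriverClass("com.mysql.jdbc.Driver")
ds.setJdbcUrl("jdbc:mysql://localhost:3306/mydb")
ds.setUser("user")
ds.setPassword("password")

val db = Database.forDataSource(ds)  // db withSession { ... } now reuses pooled connections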
I'm probably stating the obvious, but you could just put more calls inside the withSession block, like:
db withSession {
  SomeTable.insert(a, b)
  SomeOtherTable.insert(a, b)
}
Alternatively, you can create an implicit session, do your work, and then close it when you're done:
implicit val session = db.createSession
SomeTable.insert(a, b)
SomeOtherTable.insert(a, b)
session.close()