Why does adding an async boundary in Akka Streams cost a lot of CPU? - scala

I've found that my Akka Streams program had unexpected CPU usage.
Here is a simple example:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
implicit val system: ActorSystem = ActorSystem.create("QuickStart")
implicit val materializer: ActorMaterializer = ActorMaterializer()
Source.repeat(Unit)
.to(Sink.ignore)
.run()
The code above runs the source and the sink in the same actor.
It uses about 105% CPU on my laptop, which is what I expected.
Then I added an async boundary:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
implicit val system: ActorSystem = ActorSystem.create("QuickStart")
implicit val materializer: ActorMaterializer = ActorMaterializer()
Source.repeat(Unit)
.async // <------ async boundary here
.to(Sink.ignore)
.run()
This version uses about 600% CPU on my 4-core, 8-thread laptop.
I expected that with an async boundary the stream would run in 2 separate actors and use a little more than 200% CPU. Instead it uses far more than that.
What causes an async boundary to use that much CPU?

By default, akka.actor.default-dispatcher is backed by a ForkJoinPool. Its size is computed by a call to ThreadPoolConfig.scaledPoolSize, so the pool starts at (number of processors * 3) threads and is capped at parallelism-max (64).
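As a rough illustration of that sizing rule, here is a minimal sketch that does not depend on Akka. The object name and the explicit processors parameter are made up for this example, and the floor of 8 used in the usage note is an assumed parallelism-min value. Only the factor of 3 and the cap of 64 come from the answer above.

```scala
object PoolSizeSketch {
  // Scale the processor count by a multiplier, then clamp the result
  // between a floor and a ceiling.
  // `processors` is passed in explicitly here (an assumption for the sketch)
  // so the calculation does not depend on the machine it runs on.
  def scaledPoolSize(floor: Int, multiplier: Double, ceiling: Int, processors: Int): Int =
    math.min(math.max((processors * multiplier).ceil.toInt, floor), ceiling)
}
```

With 8 hardware threads, PoolSizeSketch.scaledPoolSize(8, 3.0, 64, 8) gives 24, so under these assumed settings the dispatcher may have many more threads available than the two actors strictly need.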

Related

Akka Streams - why is an ActorSystem accepted as a Materializer only when it is marked as implicit?

I'm looking at the akka streams quickstart tutorial and I wanted to understand how a piece of code works. The following code from the example prints out the values from 1 to 100 to the console:
import akka.stream.scaladsl._
import akka.{Done, NotUsed}
import akka.actor.ActorSystem
object Akka_Streams extends App{
implicit val system: ActorSystem = ActorSystem("QuickStart")
val source: Source[Int, NotUsed] = Source(1 to 100)
source.runForeach(i => println(i))
}
What I don't understand is, when I change the code to the following and remove the implicit, the code no longer works. I get a type mismatch error (shown below the following code):
object Akka_Streams extends App{
val system: ActorSystem = ActorSystem("QuickStart")
val source: Source[Int, NotUsed] = Source(1 to 100)
source.runForeach(i => println(i))(system)
}
Error:
type mismatch;
found : akka.actor.ActorSystem
required: akka.stream.Materializer
source.runForeach(i => println(i))(system)
Why did this work before but not now? The source.runForeach method takes a Materializer, so I wonder why this was working at all to begin with. From what I can see, an ActorSystem is not a Materializer or a subtype of it, so I'm confused.
This comes down to how the Scala compiler derives a Materializer from an ActorSystem.
It is done through implicit resolution, using the following method:
/**
* Implicitly provides the system wide materializer from a classic or typed `ActorSystem`
*/
implicit def matFromSystem(implicit provider: ClassicActorSystemProvider): Materializer =
SystemMaterializer(provider.classicSystem).materializer
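Its provider parameter is itself implicit, so this is not a conversion that kicks in when you pass an ActorSystem explicitly.
Marking the actor system as implicit lets the compiler pick it up whenever it needs a Materializer, so you do not have to define a materializer in scope yourself. Without the implicit keyword there is nothing for matFromSystem to resolve, and passing system explicitly fails because an ActorSystem is not a Materializer.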
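The same mechanism can be reproduced with plain Scala. The sketch below uses hypothetical stand-in types (Sys for ActorSystem, Mat for Materializer, run for runForeach); none of these names are Akka's, and the code only mirrors the shape of matFromSystem shown above.

```scala
object ImplicitDerivationSketch {
  final class Sys(val name: String)   // hypothetical stand-in for ActorSystem
  final class Mat(val owner: String)  // hypothetical stand-in for Materializer

  object Mat {
    // Lives in the companion, so the compiler finds it whenever a Mat is needed.
    // It can only be used if an implicit Sys is in scope.
    implicit def matFromSys(implicit sys: Sys): Mat = new Mat(sys.name)
  }

  // Hypothetical stand-in for runForeach: needs an implicit Mat.
  def run(label: String)(implicit mat: Mat): String = s"$label ran on ${mat.owner}"

  implicit val sys: Sys = new Sys("QuickStart")

  // Compiles: the compiler builds a Mat from the implicit Sys.
  val viaImplicit: String = run("stream")

  // run("stream")(sys) would not compile: a Sys is not a Mat.
  // To pass things explicitly, build the Mat yourself.
  val viaExplicit: String = run("stream")(Mat.matFromSys(sys))
}
```

In the real API, the explicit equivalent would be to pass SystemMaterializer(system).materializer, which is what the body of matFromSystem does.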

Processing a big table with Slick fails with OutOfMemoryError

I am querying a big MySQL table with Akka Streams and Slick, but it fails with an OutOfMemoryError. It seems that Slick is loading all the results into memory (it does not fail if the query is limited to a few rows). Why is this the case, and what is the solution?
val dbUrl = "jdbc:mysql://..."
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.alpakka.slick.scaladsl.SlickSession
import akka.stream.alpakka.slick.scaladsl.Slick
import akka.stream.scaladsl.Source
import akka.stream.{ActorMaterializer, Materializer}
import com.typesafe.config.ConfigFactory
import slick.jdbc.GetResult
import scala.concurrent.Await
import scala.concurrent.duration.Duration
val slickDbConfig = s"""
|profile = "slick.jdbc.MySQLProfile$$"
|db {
| dataSourceClass = "slick.jdbc.DriverDataSource"
| properties = {
| driver = "com.mysql.jdbc.Driver",
| url = "$dbUrl"
| }
|}
|""".stripMargin
implicit val actorSystem: ActorSystem = ActorSystem()
implicit val materializer: Materializer = ActorMaterializer()
implicit val slickSession: SlickSession = SlickSession.forConfig(ConfigFactory.parseString(slickDbConfig))
import slickSession.profile.api._
val responses: Source[String, NotUsed] = Slick.source(
sql"select my_text from my_table".as(GetResult(r => r.nextString())) // limit 100
)
val future = responses.runForeach((myText: String) =>
println("my_text: " + myText.length)
)
Await.result(future, Duration.Inf)
From the Slick documentation:
Note: Some database systems may require session parameters to be set in a certain way to support streaming without caching all data at once in memory on the client side. For example, PostgreSQL requires both .withStatementParameters(rsType = ResultSetType.ForwardOnly, rsConcurrency = ResultSetConcurrency.ReadOnly, fetchSize = n) (with the desired page size n) and .transactionally for proper streaming.
In other words, to prevent all the query results from being loaded into memory on the client side, additional configuration may be needed. This configuration is database dependent. The MySQL documentation states the following:
By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate and, due to the design of the MySQL network protocol, is easier to implement. If you are working with ResultSets that have a large number of rows or large values and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row.
To set the above configuration in Slick:
import slick.jdbc._
val query =
sql"select my_text from my_table".as(GetResult(r => r.nextString()))
.withStatementParameters(
rsType = ResultSetType.ForwardOnly,
rsConcurrency = ResultSetConcurrency.ReadOnly,
fetchSize = Int.MinValue
)//.transactionally <-- I'm not sure whether you need ".transactionally"
val responses: Source[String, NotUsed] = Slick.source(query)

Play Framework test helpers need implicit `Materializer`

I'm using Play 2.6.x and the test helper for status(result) has the method:
def status(of: Accumulator[ByteString, Result])(implicit timeout: Timeout, mat: Materializer): Int = status(of.run())
Running the tests fails because the compiler can't find the implicit value:
could not find implicit value for parameter mat: akka.stream.Materializer
What is the Materializer? I'm assuming it's part of Akka HTTP.
And how can I provide one?
From the Akka Streams docs:
The Materializer is a factory for stream execution engines, it is the
thing that makes streams run [...]
The Materializer is the cornerstone of Akka Streams, on which Akka HTTP is built. You need one to be implicitly resolvable for your test to compile.
Presently the ActorMaterializer is the only available implementation of Materializer. It is a Materializer based on Akka actors. This is the reason why, to create one, you need in turn to have an ActorSystem in scope.
The following code is what you need in your test:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
implicit val sys = ActorSystem("MyTest")
implicit val mat = ActorMaterializer()
There's also a status method of the form:
def status(of: Future[Result])(implicit timeout: Timeout): Int
Make sure the controller's return type is correct, so that the action returns a Future[Result].
How about doing this:
implicit val materializer = ActorMaterializer()
As of Akka 2.6 (used by Play 2.8), ActorMaterializer() is deprecated, but you can do this instead:
val as = ActorSystem()
implicit val materializer = Materializer(as)

How to make it pure?

I have the following Scala code:
import akka.Done
import akka.actor.ActorSystem
import akka.kafka.ConsumerMessage.CommittableOffsetBatch
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import scala.concurrent.Future
object TestConsumer {
def main(args: Array[String]): Unit = {
implicit val system = ActorSystem("KafkaConsumer")
implicit val materializer = ActorMaterializer()
val consumerSettings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
.withBootstrapServers("localhost:9092")
.withGroupId("group1")
.withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
val result = Consumer
.committableSource(consumerSettings, Subscriptions.topics("test"))
.mapAsync(2)(rec => Future.successful(rec.record.value()))
.runWith(Sink.foreach(ele => {
print(ele)
system.terminate()
}))
}
}
As you can see, the application consumes messages from Kafka and prints them to the shell.
runWith is not pure: it causes side effects, printing the received message and shutting down the actor system.
The question is, how can I make it pure with cats IO effects? Is it possible?
You don't need cats IO to make it pure. Note that your sink is already pure, because it's just the value that describes what will happen when it's used (in this case using means "connecting to the Source and running the stream").
val sink: Sink[String, Future[Done]] = Sink.foreach(ele => {
print(ele)
// system.terminate() // PROBLEM: terminating the system before stream completes!
})
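The idea that a sink is only a description can be illustrated without Akka. In this minimal sketch, Blueprint and its runWith method are hypothetical stand-ins for Sink and running a stream. They are not Akka APIs, and a real stream runs asynchronously rather than in a simple loop.

```scala
import scala.collection.mutable.ListBuffer

object DescriptionVsRun {
  // Hypothetical stand-in for a Sink: constructing it only stores the function.
  final case class Blueprint[A](onElement: A => Unit) {
    // Hypothetical stand-in for running the stream: side effects happen here.
    def runWith(elements: Seq[A]): Int = {
      elements.foreach(onElement)
      elements.size
    }
  }

  // Returns what had been recorded before running and after running.
  def demo(): (List[String], List[String]) = {
    val seen = ListBuffer.empty[String]
    // Creating the blueprint records nothing yet.
    val blueprint = Blueprint[String] { element => seen += element; () }
    val before = seen.toList
    blueprint.runWith(Seq("a", "b"))
    val after = seen.toList
    (before, after)
  }
}
```

Calling DescriptionVsRun.demo() returns an empty list for the state before the run and List("a", "b") for the state after it, which is the sense in which the description itself has no side effects.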
The problem you described has nothing to do with purity. The problem is that the sink above closes over the value of system, and then tries to terminate it when processing each element of the source.
Terminating the system means that you are destroying the whole runtime environment (used by ActorMaterializer) that is used to run the stream. This should only be done when your stream completes.
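import system.dispatcher // provides the ExecutionContext that onComplete needs
val result: Future[Done] = Consumer
.committableSource(consumerSettings, Subscriptions.topics("test"))
.mapAsync(2)(rec => Future.successful(rec.record.value()))
.runWith(sink)
// Terminate the system only once the stream has completed
result.onComplete(_ => system.terminate())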

Schedule in Akka - not found

Here is a basic example of using schedule in Akka:
import akka.pattern
import akka.util.Timeout
import scala.concurrent.Await
import akka.actor.Actor
import akka.actor.Props
import akka.actor.ActorSystem
import akka.pattern.ask
import scala.concurrent.duration
object Application extends App {
val supervisor = ActorSystem().actorOf(Props[Supervisor])
implicit val timeout = Timeout(10 seconds)
import system.dispatcher
supervisor.scheduler.scheduleOnce(120 seconds) {
val future = supervisor ? Supervisor.Start
val resultIdList = Await.result(future, timeout.duration).asInstanceOf[List[MyIdList]]
supervisor ! resultIdList
}
}
I'm really confused by Akka's documentation. In the question "Having problems with Akka 2.1.2 Scheduler ('system' not recognized)" it was said that import system.dispatcher is not a package import but something else. What is it, then?
What is system? Do I have to replace it with supervisor? Even if I didn't do that and keep using system, I'd have pretty much the same errors:
//(using system)
value scheduler is not a member of akka.actor.ActorRef
not found: value system
//or (using supervisor)
not found: value system
not found: value system
Try this ;)
val system = ActorSystem()
val supervisor = system.actorOf(Props[Supervisor])
(Posting as answer since does not fit as comment)
Marius, you were referring to another question which started with this line:
val system = akka.actor.ActorSystem("system")
That is the identifier 'system' the import statement is referring to.
The line
import system.dispatcher
means that the dispatcher member of the value system becomes available in scope (from that point on you can use the name 'dispatcher' to refer to 'system.dispatcher'). Since dispatcher is an implicit, it also becomes available for implicit resolution. Note that the signature of scheduleOnce is
scheduleOnce(delay: FiniteDuration, runnable: Runnable)(implicit executor: ExecutionContext): Cancellable
So it either needs an explicitly passed ExecutionContext, or an implicit one. By using the import statement you bring the dispatcher (which is an ExecutionContext) into scope, so you don't have to provide it manually.
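Importing a member of a value, as opposed to a package, can be tried out with the standard library alone. In the sketch below, FakeSystem is a hypothetical stand-in for ActorSystem whose only job is to expose an implicit dispatcher, and Future plays the role of the scheduler call that needs an ExecutionContext.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ImportMemberSketch {
  // Hypothetical stand-in for ActorSystem: a plain class with an implicit member.
  final class FakeSystem {
    implicit val dispatcher: ExecutionContext = ExecutionContext.global
  }

  val system = new FakeSystem

  // Imports a member of the value `system`, not a package.
  import system.dispatcher

  // Future.apply needs an implicit ExecutionContext; the import above supplies it.
  def doubled(): Int = Await.result(Future(21 * 2), 5.seconds)
}
```

If the import line is removed, the Future call no longer compiles because no implicit ExecutionContext is found. This is the same kind of failure as in the question, where no value named system existed to import from.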