Too many instances of a DoFn in a Dataflow streaming pipeline

I am currently developing a Dataflow streaming pipeline that has many interactions with Cloud SQL.
The pipeline interacts with a Postgres instance in Cloud SQL through the Python connector developed by Google.
It connects through DoFns that inherit from a base class "CloudSqlDoFn", which manages a pool of connections via the setup() and teardown() calls.
In total, we have 16 DoFns that inherit from this CloudSqlDoFn class.
import apache_beam as beam
from google.cloud.sql.connector import Connector, IPTypes
from sqlalchemy import create_engine

INSTANCE_CONNECTION_NAME = "########"
DB_USER = "#########"
DB_PASS = "#########"
DB_NAME = "#########"
POOL_SIZE = 5


class CloudSqlDoFn(beam.DoFn):
    def __init__(self, local):
        self.local = local
        self.connected_pool = None
        self.instance_connection_name = INSTANCE_CONNECTION_NAME
        self.db_user = DB_USER
        self.db_pass = DB_PASS
        self.db_name = DB_NAME
        self.pool_size = POOL_SIZE

    def get_conn(self):
        """Create a single connection through the Cloud SQL connector."""
        conn = Connector().connect(
            self.instance_connection_name,
            "pg8000",
            user=self.db_user,
            password=self.db_pass,
            db=self.db_name,
            ip_type=IPTypes.PRIVATE
        )
        return conn

    def get_pool(self):
        """Create a SQLAlchemy pool of connections."""
        pool = create_engine(
            "postgresql+pg8000://",
            creator=self.get_conn,
            pool_size=self.pool_size,
            pool_recycle=1800
        )
        return pool

    def setup(self):
        """Open a pool of connections to Postgres (called once per DoFn instance)."""
        self.connected_pool = self.get_pool()

    def teardown(self):
        """Dispose of the pool of connections to Postgres."""
        self.connected_pool.dispose()
In short, we are facing a typical "backpressure" problem: we receive a lot of "Too Many Requests" errors from the Cloud SQL Admin API (which is called to set up each SQL connection) when too many files arrive at the same time.
RuntimeError: aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('https://sqladmin.googleapis.com/sql/v1beta4/projects/.../instances/db-csql:generateEphemeralCert')
We know that this is due to the creation of many DoFn instances, each of which calls setup() and therefore requests too many connections, but we are not able to control the number of connections.
We thought that by limiting the maximum number of workers and threads we could force the latency to go up (which would be acceptable), but it seems that other parameters determine the number of instances of a DoFn.
My questions:
Aside from the number of threads and workers, what determines the number of instances of a DoFn instantiated at the same time in a streaming Dataflow pipeline?
How could we force the system to accept higher latency/lower freshness so that we don't saturate the Cloud SQL Admin API?
Thank you for your help.

You can make your pool process-level (i.e. attach it to some global static variable/module, or the DoFn class itself) and share it among all DoFn instances to limit the number of connections per process regardless of the number of DoFns instantiated. If you need more than one of these, you can give each DoFn a unique identifier, and then have a static map of ids to pools.
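For illustration, here is a minimal sketch of the first option, assuming the same constants as in the question; the class name SharedPoolDoFn, the lock, and the helper names are illustrative and not part of any Beam API:
import threading

import apache_beam as beam
from google.cloud.sql.connector import Connector, IPTypes
from sqlalchemy import create_engine

class SharedPoolDoFn(beam.DoFn):
    # Process-level state: these class attributes are shared by every DoFn
    # instance running in the same SDK process, so the process never opens
    # more than POOL_SIZE connections no matter how many DoFns exist.
    _lock = threading.Lock()
    _connector = None
    _pool = None

    @classmethod
    def _shared_pool(cls):
        with cls._lock:
            if cls._pool is None:
                cls._connector = Connector()

                def _connect():
                    return cls._connector.connect(
                        INSTANCE_CONNECTION_NAME,
                        "pg8000",
                        user=DB_USER,
                        password=DB_PASS,
                        db=DB_NAME,
                        ip_type=IPTypes.PRIVATE,
                    )

                cls._pool = create_engine(
                    "postgresql+pg8000://",
                    creator=_connect,
                    pool_size=POOL_SIZE,
                    max_overflow=0,      # hard cap: never grow past pool_size
                    pool_recycle=1800,
                )
        return cls._pool

    def setup(self):
        # Every instance reuses the same process-wide pool.
        self.pool = self._shared_pool()

    # Intentionally no teardown(): the shared pool outlives any single
    # DoFn instance, so individual instances must not dispose() it.
If you need several distinct pools, the same pattern works with a class-level dict keyed by an identifier passed to each DoFn's constructor, created lazily under the same lock.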
On Dataflow, you can also set the no_use_multiple_sdk_containers experiment to limit the number of SDK processes per worker VM (though this will of course limit CPU utilization in other parts of your pipeline as well).
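As a rough sketch of how those knobs could be combined when launching the pipeline (the values are placeholders, not recommendations, and worth checking against your SDK version):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative only: total connections is roughly
# workers x SDK processes x threads x pool_size, so each factor is capped.
options = PipelineOptions(
    streaming=True,
    max_num_workers=4,                   # cap autoscaling
    number_of_worker_harness_threads=4,  # cap threads per SDK process
    experiments=["no_use_multiple_sdk_containers"],  # one SDK process per VM
)

with beam.Pipeline(options=options) as pipeline:
    ...  # build the pipeline as before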

Related

Azure Function creating too many connections to PostgreSQL

I have an Azure Durable Function that interacts with a PostgreSQL database, also hosted in Azure.
The PostgreSQL database has a connection limit of 50, and furthermore, my connection string limits the connection pool size to 40, leaving space for super user / admin connections.
Nonetheless, under some loads I get the error
53300: remaining connection slots are reserved for non-replication superuser connections
This documentation from Microsoft seemed relevant, but it doesn't seem like I can make a static client, and, as it mentions,
because you can still run out of connections, you should optimize connections to the database.
I have this method
private IDbConnection GetConnection()
{
    return new NpgsqlConnection(Environment.GetEnvironmentVariable("PostgresConnectionString"));
}
and when I want to interact with PostgreSQL, I do something like this:
using (var connection = GetConnection())
{
    connection.Open();
    return await connection.QuerySingleAsync<int>(settings.Query().Insert, settings);
}
So I am creating (and disposing) lots of NpgsqlConnection objects, but according to this, that should be fine because connection pooling is handled behind the scenes. But there may be something about Azure Functions that invalidates this thinking.
I have noticed (in pgAdmin) that I end up with a lot of idle connections.
Based on that, I've tried fiddling with Npgsql connection parameters like Connection Idle Lifetime, Timeout, and Pooling, but the problem of too many connections persists to one degree or another. I've also tried limiting the number of concurrent orchestrator and activity functions (see this doc), but that seems to partially defeat the purpose of Azure Functions being scalable. It does help: I get fewer of the "too many connections" errors. Presumably, if I keep testing with lower limits I may even eliminate the error, but again, that seems to defeat the point, and there may be another solution.
How can I use PostgreSQL with Azure Functions without maxing out connections?
I don't have a good solution, but I think I have the explanation for why this happens.
Why is Azure Function App maxing out connections?
Even though you specify a limit of 40 for the pool size, it is only honored within one instance of the function app. Note that a function app can scale out based on load: it can process several requests concurrently in the same instance, and it can also create new instances of the app. Concurrent requests in the same instance will honor the pool size setting, but with multiple instances, each instance ends up using a pool size of 40.
Even the concurrency throttles in durable functions don't solve this issue, because they only throttle within a single instance, not across instances.
How can I use PostgreSQL with Azure Functions without maxing out connections?
Unfortunately, Azure Functions doesn't provide a native way to do this. Note that the connection pool size is not managed by the Functions runtime but by Npgsql's library code, and library code running on different instances can't coordinate.
This is the classic problem of sharing a limited resource; you have 50 of these resources (connections) in this case. The most effective way to support more consumers is to reduce the time each consumer holds the resource. Reducing Connection Idle Lifetime substantially is probably the most effective lever. Increasing Timeout does help reduce errors (and is a good choice), but it doesn't increase throughput; it just smooths out the load. Reducing Maximum Pool Size is also good.
Think of it in terms of locks on a shared resource: you want to hold the lock for the minimal amount of time. An open connection is a lock on one of the 50 total connections. In general, SQL libraries pool connections and keep them open to save the setup cost of each new connection. However, if this is limiting concurrency, it's best to kill idle connections as soon as possible. In a single instance of an app, the library does this automatically when the max pool size is reached, but with multiple instances it can't kill another instance's connections.
One thing to note is that reducing Maximum Pool Size doesn't necessarily limit the concurrency of your app. In most cases, it just decreases the number of idle connections, at the cost of paying the connection setup time again when a new connection has to be established later.
Update
WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT might be useful. You could set it to 5 and the pool size to 8, or similar. I would go this way if reducing Maximum Pool Size and Connection Idle Lifetime does not help.
This is where Dependency Injection can be really helpful. You can register a singleton client and it will do the job perfectly. If you want to know more about service lifetimes, you can read about them in the docs.
First, add the NuGet package Microsoft.Azure.Functions.Extensions.DependencyInjection.
Now add a new class like the one below and register your client.
[assembly: FunctionsStartup(typeof(MyFunction.Startup))]
namespace MyFunction
{
    class Startup : FunctionsStartup
    {
        public override void Configure(IFunctionsHostBuilder builder)
        {
            ResolveDependencies(builder);
        }

        public void ResolveDependencies(IFunctionsHostBuilder builder)
        {
            var conStr = Environment.GetEnvironmentVariable("PostgresConnectionString");
            builder.Services.AddSingleton((s) =>
            {
                return new NpgsqlConnection(conStr);
            });
        }
    }
}
Now you can easily consume it from any of your functions:
public class FunctionA
{
    private readonly NpgsqlConnection _connection;

    public FunctionA(NpgsqlConnection conn)
    {
        _connection = conn;
    }

    public async Task<HttpResponseMessage> Run()
    {
        // do something with your _connection
    }
}
Here's an example of using a static HttpClient, an approach you should consider so that you don't need to explicitly manage connections and can let the client do it instead:
public static class PeriodicHealthCheckFunction
{
    private static HttpClient _httpClient = new HttpClient();

    [FunctionName("PeriodicHealthCheckFunction")]
    public static async Task Run(
        [TimerTrigger("0 */5 * * * *")]TimerInfo healthCheckTimer,
        ILogger log)
    {
        string status = await _httpClient.GetStringAsync("https://localhost:5001/healthcheck");
        log.LogInformation($"Health check performed at: {DateTime.UtcNow} | Status: {status}");
    }
}

Scala Play 2.5 with Slick 3 and specs2

I have a Play application using Slick that I want to test with specs2, but I keep getting the error org.postgresql.util.PSQLException: FATAL: sorry, too many clients already. I have tried to shut down the database connection by using
val mockApp = new GuiceApplicationBuilder()
val db = mockApp.injector.instanceOf[DBApi].database("default")
...
override def afterAll = {
  db.getConnection().close()
  db.shutdown()
}
But the error persists. The Slick configuration is
slick.dbs.default.driver="slick.driver.PostgresDriver$"
slick.dbs.default.db.driver="org.postgresql.Driver"
slick.dbs.default.db.url="jdbc:postgresql://db:5432/hygge_db"
slick.dbs.default.db.user="*****"
slick.dbs.default.db.password="*****"
getConnection on DBApi either takes a connection from the underlying data source's pool (a JdbcDataSource, I presume) or creates a new one. I see no pool specified in your configuration, so I think it always creates a new one for you. So if you didn't close the connection inside the test, getConnection won't help: it will just create a new connection or take a random one from the pool (if pooling is enabled).
So the solution is to either configure connection pooling:
When using a connection pool (which is always recommended in
production environments) the minimum size of the connection pool
should also be set to at least the same size. The maximum size of the
connection pool can be set much higher than in a blocking application.
Any connections beyond the size of the thread pool will only be used
when other connections are required to keep a database session open
(e.g. while waiting for the result from an asynchronous computation in
the middle of a transaction) but are not actively doing any work on
the database.
so you can just set the maximum number of available connections in your config:
connectionPool = 5
Or you can share the same connection (you'll probably have to ensure sequentiality then):
object SharedConnectionForAllTests {
  val connection = db.getConnection()
  def close() = connection.close()
}
It's better to inject it with Spring/Guice of course, so you can conveniently manage the connection's lifecycle.

Creating a global connection object across an executor in spark when pushing data to mongodb

In the link it was suggested to create a connection pool that is available across multiple RDDs in the Spark Streaming job.
rdd.foreachPartition( iter => {
  val client = MongoClient(host, port)
  val col = client.getDatabase("testDataBase").getCollection("testCollection")
  // I am basically inserting the data in the iterator into testCollection
})
However, I was not able to figure out how to create a connection pool that returns a connection object to a MongoDB collection. I was able to use foreachPartition to create a single connection for the whole partition. Can someone please let me know how to create a connection object that is available across the executor for reuse?
The MongoDB Spark Connector internally uses broadcast variables to achieve this:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
So you should be able to share the MongoClient and connection pool across tasks.
The MongoDB Spark connector doesn't help with collecting exceptions; how do we do that? Also, when we insert in batches, if one insert fails it stops the remaining inserts. The Mongo Spark driver can insert multiple documents, and you can set ordered = false so that it still inserts the remaining documents even if there are duplicates or timeouts.

Dealing with NoHostAvailableException with phantom DSL

When trying to insert several thousand records at once into a remote Cassandra DB, I reproducibly run into timeouts (with 5 to 6 thousand elements on a slow connection).
error:
All host(s) tried for query failed (tried: /...:9042
(com.datastax.driver.core.exceptions.OperationTimedOutException: [/...]
Timed out waiting for server response))
com.datastax.driver.core.exceptions.NoHostAvailableException:
All host(s) tried for query failed (tried: /...:9042
(com.datastax.driver.core.exceptions.OperationTimedOutException: [/...]
Timed out waiting for server response))
the model:
class RecordModel extends CassandraTable[ConcreteRecordModel, Record] {
  object id extends StringColumn(this) with PartitionKey[String]
  ...
}

abstract class ConcreteRecordModel extends RecordModel
  with RootConnector with ResultSetFutureHelper {

  def store(rec: Record): Future[ResultSet] =
    insert.value(_.id, rec.id).value(...).future()

  def store(recs: List[Record]): Future[List[ResultSet]] = Future.traverse(recs)(store)
}
the connector:
val connector = ContactPoints(hosts).withClusterBuilder(
  _.withCredentials(
    config.getString("username"),
    config.getString("password")
  ).withPoolingOptions(
    new PoolingOptions()
      .setCoreConnectionsPerHost(HostDistance.LOCAL, 4)
      .setMaxConnectionsPerHost(HostDistance.LOCAL, 10)
      .setCoreConnectionsPerHost(HostDistance.REMOTE, 2)
      .setMaxConnectionsPerHost(HostDistance.REMOTE, 4)
      .setMaxRequestsPerConnection(HostDistance.LOCAL, 32768)
      .setMaxRequestsPerConnection(HostDistance.REMOTE, 2000)
      .setPoolTimeoutMillis(10000)
  )
).keySpace(keyspace)
I have tried tweaking the pooling options, separately and together, but even doubling all of the REMOTE settings did not change the timeouts noticeably.
My current workaround, which I would like to avoid, is splitting the list into batches and waiting for the completion of each:
def store(recs: List[Record]): Future[List[ResultSet]] = {
  val rs: Iterator[List[ResultSet]] = recs.grouped(1000) map { slice =>
    Await.result(Future.traverse(slice)(store), 100.seconds)
  }
  Future.successful(rs.to[List].flatten)
}
What would be a good way to handle this issue?
Thank you
EDIT
The errors do suggest a failing/overloaded cluster, but I suspect the network plays a major role here. The numbers above are from a remote machine; they are MUCH higher when the same C* cluster is fed from a machine in the same datacenter. Another suspicious detail is that feeding the same C* instance with Quill does not run into any timeout issues, remote or not.
What I really dislike about the throttling workaround is that the batch sizes are random and static, while they should be adaptive.
It sounds like you're hitting the limits of your cluster. If you want to avoid timeouts, you will need to add more capacity to handle the load. If you just want to do burst writes, you should throttle them (as you are doing), since sending too many queries to too few nodes will hurt performance. You can also increase the timeouts on the server side (read_request_timeout_in_ms, write_request_timeout_in_ms, request_timeout_in_ms) if you are willing to wait longer for writes, but this is not advisable, as you won't give Cassandra any time to recover and will likely cause large amounts of ParNew GC.

How to free Redis Scala client allocated by RedisClientPool?

I am using debasishg/scala-redis as my Redis Client.
I want it to support multi-threaded execution. Following their documentation (https://github.com/debasishg/scala-redis), I defined
val clients = new RedisClientPool("localhost", 6379)
and then using it on each access to redis:
clients.withClient {
  client => {
    ...
  }
}
My question is, do I need to free each allocated client? And if so, what is a correct way to do it?
If you look at the constructor for RedisClientPool, there is a default value for maxIdle ("the maximum number of objects that can sit idle in the pool", as per this) and a default value for poolWaitTimeout. You can change those values, but basically, if you wait poolWaitTimeout you are guaranteed to have your resources cleaned up, except for the maxIdle clients kept on stand-by.
Also, if you can't stand the idea of idle clients, you can shut down the whole pool with mypool.close and create it again when needed, but depending on your use case that might defeat the purpose of using a pool (if it's a cron job, I guess that's fine).