RDS Data API BatchExecute taking significantly longer than standard connection - PostgreSQL

I have an AWS Lambda function that needs to insert several thousand rows of data into an RDS PostgreSQL database within a Serverless Cluster. Previously I used a normal database connection with psycopg2, but I switched to the RDS Data API in order to improve performance. However, with the Data API, BatchExecute exceeds the 5-minute Lambda limit and still fails to commit the transaction within that time. Meanwhile, the psycopg2 solution, which uses a different transfer protocol, inserts all the data in under 30 seconds.
How is this possible? Shouldn't the Data API give superior performance, since it doesn't need to establish a connection? Are there any settings I can change to make the RDS Data API perform acceptably?
I don't believe I am hitting any of the data size limits, because the Lambda times out rather than explicitly throwing an error. I also know that the connection is succeeding, as other small queries execute successfully.
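For reference, a minimal sketch of what a chunked BatchExecute insert looks like with boto3; the ARNs, database, table and column names are placeholders, and the chunk size is only illustrative:

import boto3

rds_data = boto3.client("rds-data")

cluster_arn = "arn:aws:rds:us-east-1:123456789012:cluster:my-cluster"          # placeholder
secret_arn = "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-secret"  # placeholder

rows = [{"id": i, "name": f"row-{i}"} for i in range(5000)]  # example payload

# Run the whole insert inside one Data API transaction so it commits atomically.
tx = rds_data.begin_transaction(resourceArn=cluster_arn, secretArn=secret_arn, database="mydb")

# batch_execute_statement runs one statement against many parameter sets,
# so the rows are sent in chunks rather than one call per row.
for start in range(0, len(rows), 1000):
    chunk = rows[start:start + 1000]
    rds_data.batch_execute_statement(
        resourceArn=cluster_arn,
        secretArn=secret_arn,
        database="mydb",
        transactionId=tx["transactionId"],
        sql="INSERT INTO my_table (id, name) VALUES (:id, :name)",
        parameterSets=[
            [
                {"name": "id", "value": {"longValue": r["id"]}},
                {"name": "name", "value": {"stringValue": r["name"]}},
            ]
            for r in chunk
        ],
    )

rds_data.commit_transaction(resourceArn=cluster_arn, secretArn=secret_arn, transactionId=tx["transactionId"])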

Related

How to set up multi-tenancy using row level security on Postgres with knex

I am architecting a database where I expect to have thousands of tenants, with some data shared between tenants. I am currently planning on using Postgres with row level security for tenant isolation. I am also using knex and Objection.js to model the database in node.js.
Most of the tutorials I have seen look like this, where you create a separate knex connection per tenant. However, I've run into a problem on my development machine: after I create ~100 connections, I receive this error: "remaining connection slots are reserved for non-replication superuser connections".
I'm investigating a few possible solutions/work-arounds, but I was wondering if anyone has been able to make this setup work the way I'm intending. Thanks!
Perhaps one solution might be to cache a limited number of connections, and destroy the oldest cached connection when the limit is reached. See this code as an example.
That code should probably be improved, however, to use a Map as the knexCache instead of a plain object, since a Map preserves insertion order.

Streaming PostgreSQL tables into Google BigQuery

I would like to automatically stream data from an external PostgreSQL database into a Google Cloud Platform BigQuery database in my GCP account. So far, I have seen that one can query external databases (MySQL or PostgreSQL) with the EXTERNAL_QUERY() function, e.g.:
https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries
But for that to work, the database has to be in GCP Cloud SQL. I tried to see what options there are for streaming from the external PostgreSQL into a Cloud SQL PostgreSQL database, but I could only find information about replicating it as a one-time copy, not streaming:
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external
The reason why I want this streaming into BigQuery is that I am using Google Data Studio to create reports from the external PostgreSQL, which works great, but GDS can only accept SQL query parameters if the data source is a Google BigQuery database. E.g. if we have a table with 1M entries, and we want a Google Data Studio parameter to be supplied by the user, this turns into:
SELECT * from table WHERE id=#parameter;
which means that the query will be faster and won't hit the 100K-record limit in Google Data Studio.
What's the best way of creating a connection between an external PostgreSQL (read-only access) and Google BigQuery so that when querying via BigQuery, one gets the same live results as querying the external PostgreSQL?
Perhaps you missed the options described in the Google Cloud user guide?
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external#setup-replication
Notice that in this section it says:
"When you set up your replication settings, you can also decide whether the Cloud SQL replica should stay in-sync with the source database server after the initial import is complete. A replica that should stay in-sync is online. A replica that is only updated once, is offline."
I suspect online mode is what you are looking for.
What you are looking for will require some architecture design based on your needs, and some coding. There isn't a feature to automatically sync your PostgreSQL database with BigQuery (apart from the EXTERNAL_QUERY() functionality, which has some limitations: one connection per database, performance, the total number of connections, etc.).
If you are not looking for the data in real time, what you can do with Airflow, for instance, is have a DAG that connects to all your DBs once per day (using KubernetesPodOperator, for instance), extracts the previous day's data and loads it into BQ. A typical ETL process, though in this case more of an EL(T). You can run this process more often if you cannot wait a full day for the previous day's data.
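A minimal sketch of such a daily DAG, using a plain PythonOperator instead of KubernetesPodOperator to keep it short; the DAG id, task id and the body of the extract-and-load callable are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_and_load(**context):
    # Placeholder: read the previous day's rows from PostgreSQL (e.g. with psycopg2)
    # and load them into BigQuery (e.g. with the google-cloud-bigquery client).
    pass

with DAG(
    dag_id="postgres_to_bigquery_daily",  # placeholder
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="extract_previous_day_and_load_to_bq",
        python_callable=extract_and_load,
    )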
On the other hand, if streaming is what you are looking for, then I can think of a Dataflow job. I guess you can connect using a JDBC connector.
In addition, depending on how your pipeline is structured, it might be easier to implement (but harder to maintain) to stream the data into BigQuery at the same moment you write it to your PostgreSQL DB.
Not sure if you have tried this already, but instead of adding a parameter, if you add a dropdown filter based on a dimension, Data Studio will push that down to the underlying Postgres db in this form:
SELECT * from table WHERE id=$filter_value;
This should achieve the same results you want without going through BigQuery.

Expose Redshift tables through REST API

I am currently contemplating how to expose data in Redshift tables in a meaningful and consistent way through a REST API.
The way I want it to work is that the caller calls the API and then we do some kind of dynamic querying on the tables. I am worried about latency, as the queries could range from simple to very complicated. Since Redshift requires connecting to the database as a client, some of the approaches we could take are:
Create a Lambda function connecting to Redshift, which is invoked through API Gateway
Use OData to create RESTful APIs. However, I don't think Redshift supports OData out of the box.
I am leaning towards OData since it has advanced filtering options along with pagination.
I am seeking advice: will OData be enough, and if so, how exactly does one integrate OData with Redshift?
Any other advice/approaches are welcome too.
Thanks!
Let me go over the different options:
Redshift Data API
The Redshift Data API lets you invoke queries and get their results in an asynchronous manner.
You can use the API directly from the front-end, or you can put it behind API Gateway.
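A rough sketch of that asynchronous flow with boto3; the cluster identifier, secret ARN, database and SQL are placeholders:

import time

import boto3

client = boto3.client("redshift-data")

# Submit the query; the call returns immediately with a statement Id.
stmt = client.execute_statement(
    ClusterIdentifier="my-cluster",                    # placeholder
    Database="dev",                                    # placeholder
    SecretArn="arn:aws:secretsmanager:...:my-secret",  # placeholder credentials
    Sql="SELECT col_a, col_b FROM my_table LIMIT 100",
)

# Poll until the statement finishes, then fetch the result set.
while True:
    desc = client.describe_statement(Id=stmt["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if desc["Status"] == "FINISHED":
    result = client.get_statement_result(Id=stmt["Id"])
    print(result["Records"])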
Lambda
If you trust your users and can put proper authentication in place, you can simply invoke the Lambda directly from the front-end and pass it some SQL to run, or generate the SQL based on the parameters. You can potentially swap this with Athena using a federated query. Optionally you can add API Gateway for some additional features like rate limiting and different forms of authentication. Keep in mind that both Lambda and API Gateway have limits in terms of data returned and execution time.
For long-running queries I would suggest that the Lambda, API Gateway, or even the front-end itself invoke an AWS Glue Python Shell job which uses an unload query to drop the results in S3. The front-end can poll for when the job is done.
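A sketch of the kind of unload such a job could run; it is submitted here through the Redshift Data API for brevity (a Glue Python Shell job could just as well use a direct connection), and the cluster, bucket, IAM role and query are all placeholders:

import boto3

client = boto3.client("redshift-data")

# UNLOAD writes the query results straight to S3, where the front-end
# (or a status endpoint) can poll for the output files.
unload_sql = (
    "UNLOAD ('SELECT col_a, col_b FROM my_table') "
    "TO 's3://my-results-bucket/exports/run-123/' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-unload-role' "
    "FORMAT AS PARQUET;"
)

client.execute_statement(
    ClusterIdentifier="my-cluster",                    # placeholder
    Database="dev",                                    # placeholder
    SecretArn="arn:aws:secretsmanager:...:my-secret",  # placeholder credentials
    Sql=unload_sql,
)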
If you only have a few types of queries, then you can build a proper REST API.
Instead of Lambda, you can also use Amazon Athena Federated Query, which you can actually query directly from the front-end.
OData Implementation
There are third-party OData implementations for Redshift; just google it. With a front-end library that consumes OData (I used KendoUI in the past) you can potentially make a working, feature-rich front-end in days. The main concern with this option is that the tool costs may be over your budget. Of course, the hours you spend building things are also a cost, but it really depends on what your actual requirements are.
So how to choose?
Depending on your requirements, I would suggest simply going through the options and selecting one based on cost, time to implement, performance, reliability and security.
How about Redshift performance?
This is the most difficult part about Redshift and on-demand queries. On Redshift you don't have indexes, data can be compressed, and the data is stored in a columnar fashion. All of these can make Redshift slower than your average relational database for a random query.
However, if you make sure that your table is sorted, with a distribution style that matches your queries, and your queries use the columnar storage to their advantage (not all columns are requested), then it can be faster.
Another thing to keep in mind is that Redshift doesn't handle concurrency well. I believe that by default there can only be 8 concurrent queries; you can increase that, but you definitely wouldn't want to go above 20.
If your users can wait for their queries (I've seen bad queries go over 2 hours, and I'm sure you can make them take longer), then Redshift is fine. If not, you could try putting Postgres in front of Redshift by using external tables, and then use ordinary indexes on the Postgres side to speed things up.

Processing multiple concurrent read queries in Postgres

I am planning to use AWS RDS Postgres version 10.4 or above for storing data in a single table comprising ~15 columns.
My use case is to serve:
1. Periodically (every hour) store/update rows in this table.
2. Periodically (every hour) fetch data from the table, say 500 rows at a time.
3. Frequently fetch small amounts of data (10 rows) from the table (hundreds of queries in parallel)
Does AWS RDS Postgres support serving all of the above use cases?
I am aware of read replica support, but is there any built-in load balancer to serve the queries that come in parallel?
How many read queries can Postgres process concurrently?
Thanks in advance
Your use cases seem to be a normal fit for any relational database system, so I would say: yes.
The question is: how fast can the DB handle the 100 parallel queries (use case 3)?
In general, the PostgreSQL documentation is one of the best I have ever read. So give it a try:
https://www.postgresql.org/docs/10/parallel-query.html
But also take into consideration how big your data is!
That said, try without read replicas first! You might not need them.
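For what it's worth, the hard ceiling on concurrent sessions (and therefore on truly parallel queries) can be checked directly; a small psycopg2 sketch with placeholder connection parameters:

import psycopg2

# Each concurrent query occupies one server connection, so the number of
# parallel queries is ultimately bounded by max_connections.
conn = psycopg2.connect(host="my-rds-endpoint", dbname="mydb", user="app", password="***")  # placeholders
with conn.cursor() as cur:
    cur.execute("SHOW max_connections;")
    print(cur.fetchone()[0])
conn.close()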

How should I manage postgres database handles in a serverless environment?

I have an API running in AWS Lambda and AWS API Gateway using Up. My API creates a database connection on startup, and therefore Lambda does this when the function is triggered for the first time. My API is written in Node.js using Express and pg-promise to connect to and query the database.
The problem is that Lambda creates new instances of the function as it sees fit, and sometimes it appears as though there are multiple instances of it at one time.
I keep running out of DB connections as my Lambda function is using up too many database handles. If I log into Postgres and look at the pg_stat_activity table I can see lots of connections to the database.
What is the recommended pattern for solving this issue? Can one limit the number of simultaneous instances of a function in Lambda? Can you share a connection pool across instances of a function (I doubt it).
UPDATE
AWS now provides a product called RDS Proxy which is a managed connection pooling solution to solve this very issue: https://aws.amazon.com/blogs/compute/using-amazon-rds-proxy-with-aws-lambda/
There are a couple of ways that you can run out of database connections:
You have more concurrent Lambda executions than you have available database connections. This is certainly possible.
Your Lambda function is opening database connections but not closing them. This is a likely culprit, since web frameworks tend to keep database connections open across requests (which is more efficient), but on Lambda they have no opportunity to close them, since AWS will silently terminate the instance.
You can solve 1 by controlling the number of available connections on the database server (the max_connections setting on PostgreSQL) and the maximum number of concurrent Lambda function invocations (as documented here). Of course, that just trades one problem for another, since Lambda will return 429 errors when it hits the limit.
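As an illustration, the Lambda side of that cap can be set with a single call; a boto3 sketch where the function name and the value are placeholders (max_connections itself is changed through the RDS parameter group):

import boto3

lambda_client = boto3.client("lambda")

# Reserve a fixed concurrency for the function so the number of simultaneously
# running instances (and therefore open database connections) stays bounded.
lambda_client.put_function_concurrency(
    FunctionName="my-api-function",   # placeholder
    ReservedConcurrentExecutions=20,  # placeholder value
)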
Addressing 2 is trickier. The traditional and right way of dealing with database connection exhaustion is to use connection pooling. But with Lambda you can't do that on the client, and with RDS you don't have the option to do that on the server. You could set up an intermediary persistent connection pooler, but that makes for a more complicated setup.
In the absence of pooling, one option is to create and destroy a database connection on each function invocation. Unfortunately that will add quite a bit of overhead and latency to your requests.
Another option is to carefully control your client-side and server-side connection parameters. The idea is first to have the database close connections after a relatively short idle time (on PostgreSQL this is controlled by the tcp_keepalives_* settings). Then, to make sure that the client never tries to use a closed connection, you set a connection timeout on the client (how to do so will be framework dependent) that is shorter than that value.
My hope is that AWS will give us a solution for this at some point (such as server-side RDS connection pooling). You can see various proposed solutions in this AWS forum thread.
You have two options to fix this:
You can tweak Postgres to disconnect those idle connections. This is the best way but may require some trial-and-error.
You have to make sure that you connect to the database inside your handler and disconnect before your function returns or exits. In Express, you'll have to connect/disconnect inside your route handlers.