I am currently contemplating on how to expose data present in Redshift tables in a meaningful and consistent way through REST API.
The way I want it to work is that caller calls the API and then we do some kind of dynamic querying on the tables. I am worried about the latency as the queries could range from simple to very complicated ones. Since Redshift requires connecting to the database as a client, Some of the approaches we could have are:
Create a lambda function connecting to Redshift, which is invoked through API gateway
Using OData to create RESTful APIs. However, I don't think Redshift supports OData out of the box.
I am inclining towards OData since it has advance filtering options along with pagination.
I am seeking advice, will OData be enough and if yes, how exactly one integrates OData with redshift.
Any other advise/approaches are welcome too.
Thanks!
Let me go over the different options:
Redshift data api
Redshift data API let's you invoke queries and get their results in an asynchronous manner.
You can use the API directly from front-end, or you can put it behind API Gateway.
Lambda
If you trust your users and can get a proper authentication you can simply invoke the Lambda directly from front-end and pass it some SQL to run or generate SQL based on the parameters. You can potentially swap this with Athena using federated query. Optionally you can add in API Gateway for some additional features like rate-limiting and different forms of authentication. Keep in mind that both Lambda and API Gateway have limit in terms of data returned and execution time.
For long running queries I would suggest that the Lambda, API Gateway or even from the front-end itself invoke an AWS Glue Python Shell Job which will use an unload query to drop the results in S3. The front-end can pool for when the job is done.
If you have few types of queries then you can make proper rest API.
Instead of Lambda, you can also use Amazon Athena Federated Query, which you can actually query with directly from the front-end.
OData Implementation
There are third party OData implementations for Redshift. Just google it. With a front-end library that consumes OData(I used KendoUI in the past) you can potentially make a working feature rich front-end in days. The main concern with this option is the tools costs may over your budget. Of course the hours you spent making things is also a cost but it really depends on what are your actual requirements.
So how to choose?
Depending on your requirements I would suggest simply going through the option and selecting them based on costs, time to implementation, performance, reliability and security.
How about Redshift performance?
This is the most difficult part about Redshift and on-demand queries. On Redshift you don't have indexes, data can be compressed and the data is stored columnar fashion. All of these can make Redshift slower than your average relational database for a random query.
However you can make sure that your table is sorted with a distribution style that matches your queries and your queries use the columnar storage to their advantage(not all columns are requested), then it can be faster.
Another thing to keep in mind is that Redshift doesn't handle concurrency well, I believe by default there can only 8 concurrent queries, which you can increase it but you definitely wouldn't to go more than 20.
If your users can wait for their queries(I've seen bad queries go over 2 hours. I'm sure you can make them take longer, then Redshift is fine, if not then you could try putting Postgres in front of Redshift by using external tables and then use your average indexes in front of it to speed things up.
Related
Does it make sense to get data from REST API and store it as JSON in an Azure Data Lake? Or the data should be stored directly into Azure SQL?
I've tried both options, but it's not clear in which scenario it is worth to save the data into Azure Data Lake.
Yes this is a perfectly normal pattern that has emerged for collecting large volumes in particular. Writing to a database is great but there are (at least) two aspects to consider:
schema-on-write - you have to know the schema before you write to the database. That means all columns, all datatypes, nullability, collation even before you can even think about writing a record. How are you going to handle the schema of your JSON changing for example?
transaction logging - most Microsoft SQL databases work with write-ahead-log or WAL, which means the transaction logging has to complete before the transaction is considered complete as part of an ACID transaction. What will happen in situations of heavy load on the database or high concurrency - queuing and blocking. Often these things take milliseconds but low tiers etc come into play. Alternate patterns like eventual consistency eg with Cosmos are a possibility if you need that sort of thing.
Data Lakes in contract are schema-on-read, ie you do not have to know the schema in order to write to the lake, so you can just land it and figure out the other stuff later.
This does not necessarily apply to your other question about Synapse as you run the risk of losing your perfectly good SQL Server datatypes. Look at one of the migration wizards for that instead.
I am using the delta lake oss version 0.8.0.
Let's assume we calculated aggregated data and cubes using the raw data and saved the results in a gold table using delta lake.
My question is, is there a well known way to access these gold table data and deliver them to a web dashboard for example?
In my understanding, you need a running spark session to query a delta table.
So one possible solution could be to write a web api, which executes these spark queries.
Also you could write the gold results in a database like postgres to access it, but that seems just duplicating the data.
Is there a known best practice solution?
The real answer depends on your requirements regarding latency, number of requests per second, amount of data, deployment options (cloud/on-prem, where data located - HDFS/S3/...), etc. Possible approaches are:
Have the Spark running in the local mode inside your application - it may require a lot of memory, etc.
Run Thrift JDBC/ODBC server as a separate process, and access data via JDBC/ODBC
Read data directly using the Delta Standalone Reader library for JVM, or via delta-rs library that works with Rust/Python/Ruby
I am working on a project which uses graphql and PostgreSQL where we want to select data from the database with a value after a certain date. It is currently selecting all data from the database and then filtering it on the server:
.filter(({time}) => moment(time).isAfter(startTime))
However I would have thought it would be best to do this filtering in the database query as the full dataset is never used.
Is there any benefit to doing it on the server rather than in the database query?
Barring some unusual edge case -- such as other parts of your backend code really do need all the data for some reason -- it would definitely be more efficient to filter everything on the Postgres side via the SQL that is being used to fetch the data in the first place.
This is true for several reasons:
Assuming the table is properly indexed, the filtering will be able to occur much faster within the database.
The unneeded data will not need to be serialized and sent over the wire to the backend, only to then be discarded by the backend's own filtering.
The memory footprint should be reduced on both the Postgres and server end due to needing to process only a portion of the results.
I've not worked with GraphQL myself, but from doing a bit of poking around through its docs, it appears GraphQL often uses other mechanisms in different layers (outside of the database) to try to improve performance.
It would be worth seeing what the actual SQL is that your GraphQL query is generating (that may be possible via a function in GraphQL; it could also be done by enabling certain log settings on the Postgres server and correlating the log output to the query). That may lead to further optimization possibilities if you want to keep things purely GraphQL.
Jumping down to a raw query seems like it would be a good possibility though. Certainly that is something that is often done with ORMs like Django and ActiveRecord.
There is a web application which is running for a years and during its life time the application has gathered a lot of user data. Data is stored in relational DB (postgres). Not all of this data is needed to run application (to do the business). However form time to time business people ask me to provide reports of this data data. And this causes some problems:
sometimes these SQL queries are long running
quires are executed against production DB (not cool)
not so easy to deliver reports on weekly or monthly base
some parts of data is stored in way which is not suitable for such
querying (queries are inefficient)
My idea (note that I am a developer not the data mining specialist) how to improve this whole process of delivering reports is:
create separate DB which regularly is update with production data
optimize how data is stored
create a dashboard to present reports
Question: But is there a better way? Is there another DB which better fits for such data analysis? Or should I look into modern data mining tools?
Thanks!
Do you really do data mining (as in: classification, clustering, anomaly detection), or is "data mining" for you any reporting on the data? In the latter case, all the "modern data mining tools" will disappoint you, because they serve a different purpose.
Have you used the indexing functionality of Postgres well? Your scenario sounds as if selection and aggregation are most of the work, and SQL databases are excellent for this - if well designed.
For example, materialized views and triggers can be used to process data into a scheme more usable for your reporting.
There are a thousand ways to approach this issue but I think that the path of least resistance for you would be postgres replication. Check out this Postgres replication tutorial for a quick, proof-of-concept. (There are many hits when you Google for postgres replication and that link is just one of them.) Here is a link documenting streaming replication from the PostgreSQL site's wiki.
I am suggesting this because it meets all of your criteria and also stays withing the bounds of the technology you're familiar with. The only learning curve would be the replication part.
Replication solves your issue because it would create a second database which would effectively become your "read-only" db which would be updated via the replication process. You would keep the schema the same but your indexing could be altered and reports/dashboards customized. This is the database you would query. Your main database would be your transactional database which serves the users and the replicated database would serve the stakeholders.
This is a wide topic, so please do your diligence and research it. But it's also something that can work for you and can be quickly turned around.
If you really want try Data Mining with PostgreSQL there are some tools which can be used.
The very simple way is KNIME. It is easy to install. It has full featured Data Mining tools. You can access your data directly from database, process and save it back to database.
Hardcore way is MADLib. It installs Data Mining functions in Python and C directly in Postgres so you can mine with SQL queries.
Both projects are stable enough to try it.
For reporting, we use non-transactional (read only) database. We don't care about normalization. If I were you, I would use another database for reporting. I will desing the tables following OLAP principals, (star schema, snow flake), and use an ETL tool to dump the data periodically (may be weekly) to the read only database to start creating reports.
Reports are used for decision support, so they don't have to be in realtime, and usually don't have to be current. In other words it is acceptable to create report up to last week or last month.
With version 8.4 PostgreSQL finally integrated a proprietary API into their JDBC driver, which allows stream based inserts and selects. The so called Copy API grants access to COPY TO/COPY FROM SQL commands, which read text data from a stream/reader into one table at a time or write text data to a stream/writer from one table. Constraints and triggers are regarded for insert operations. Basic transformations (delimiter, quotation, null values etc.) are available. The performance gain is quite impressive, which probably is because of less object instantiation and a much simpler protocol between client and server backend.
Has anyone experiences with this API, good or bad. Is it production ready? Are there any pitfalls one has to be aware of? BTW: The fact that it is a proprietary API is a non-issue for me.
The COPY API is present in PostgreSQL C library for at least 6 years. It is very stable.
See: http://www.postgresql.org/docs/9.0/interactive/libpq-copy.html
and http://www.postgresql.org/docs/9.0/interactive/sql-copy.html
JDBC implementation should have same properties, but I haven't used it.
PS. I think there is a misunderstanding when you call this "proprietary". Both protocol specification and server/client/driver source code is free (as in freedom).