How to access gold table data in Delta Lake for web dashboards and other consumers?

I am using the Delta Lake OSS version 0.8.0.
Let's assume we calculated aggregated data and cubes from the raw data and saved the results in a gold table using Delta Lake.
My question is: is there a well-known way to access this gold table data and deliver it, for example, to a web dashboard?
In my understanding, you need a running Spark session to query a Delta table.
So one possible solution could be to write a web API which executes these Spark queries.
You could also write the gold results to a database like Postgres to access them, but that seems like just duplicating the data.
Is there a known best practice solution?

The real answer depends on your requirements regarding latency, number of requests per second, amount of data, deployment options (cloud/on-prem, where the data is located: HDFS/S3/...), etc. Possible approaches are:
Run Spark in local mode inside your application; this may require a lot of memory, etc.
Run the Thrift JDBC/ODBC server as a separate process and access the data via JDBC/ODBC.
Read the data directly using the Delta Standalone Reader library for the JVM, or via the delta-rs library that works with Rust/Python/Ruby (see the sketch below).
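For the last option, a minimal sketch using the delta-rs Python bindings (the `deltalake` package); the gold table path below is a hypothetical example, and no Spark session is needed:

```python
# A minimal sketch, assuming the `deltalake` (delta-rs) Python bindings are installed.
# The gold table path is hypothetical.
from deltalake import DeltaTable

dt = DeltaTable("s3://my-bucket/gold/sales_cube")   # read the Delta transaction log
df = dt.to_pyarrow_table().to_pandas()              # materialize the aggregates in pandas

# a lightweight web API (Flask, FastAPI, ...) could now serve `df` to the dashboard
print(df.head())
```

This avoids keeping a Spark cluster around just for serving, which is usually the main concern for dashboard backends.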

Related

GCP Dataflow vs Cloud Functions to automate scraping output and file-on-cloud merge into JSON format to insert in DB

I have two sources:
A CSV that will be uploaded to a cloud storage service, probably GCP Cloud Storage.
The output of a scraping process done with Python.
When a user updates 1) (the cloud-stored file), an event should be triggered to execute 2) (the scraping process), and then some transformation should take place in order to merge these two sources into one in JSON format. Finally, the content of this JSON file should be stored in a DB with easy access and low cost. The files the user will update are at most 5 MB, and the updates will take place once a week.
From what I've read, I can use GCP Cloud Functions to accomplish this whole process, or I can use Dataflow too. I've even considered using both. I've also thought of using MongoDB to store the JSON objects resulting from the final merge of the two sources.
Why should I use Cloud Functions, Dataflow or both? What are your thoughts on the DB? I'm open to different approaches. Thanks.
Regarding the use of Cloud Functions and Dataflow: in your case I would go for Cloud Functions, as you don't have a big volume of data. Dataflow is more complex, more expensive, and you would have to use Apache Beam. If you are comfortable with Python, and taking your scenario into consideration, I would choose Cloud Functions. Easy, convenient...
To trigger a Cloud Function when a Cloud Storage object is updated, you will have to configure the trigger. Pretty easy.
https://cloud.google.com/functions/docs/calling/storage
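For illustration, a minimal sketch of such a background function in Python; the function name, bucket and downstream steps are assumptions, not your actual setup:

```python
# A minimal sketch of a background Cloud Function triggered by a Cloud Storage event.
# Function name, bucket and the downstream steps are hypothetical.
def merge_sources(event, context):
    """Runs whenever the CSV in the configured bucket is created or overwritten."""
    bucket, name = event["bucket"], event["name"]
    print(f"Processing gs://{bucket}/{name}")
    # 1. run the scraping step
    # 2. merge the CSV with the scraped output into a single JSON document
    # 3. write the merged JSON to the chosen database (e.g. Datastore or MongoDB)

# Deployed with something like:
#   gcloud functions deploy merge_sources --runtime python39 \
#       --trigger-resource my-bucket --trigger-event google.storage.object.finalize
```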
Regarding the DB: MongoDB is a good option, but if you want something quick and inexpensive, consider Datastore.
As a managed service it will make your life easy with a lot of native integrations. It also has a very interesting free tier.

Getting data from REST API and storing it in Azure Data Lake

Does it make sense to get data from a REST API and store it as JSON in an Azure Data Lake? Or should the data be stored directly into Azure SQL?
I've tried both options, but it's not clear in which scenarios it is worth saving the data into Azure Data Lake.
Yes, this is a perfectly normal pattern that has emerged, in particular for collecting large volumes of data. Writing to a database is great, but there are (at least) two aspects to consider:
schema-on-write - you have to know the schema before you write to the database. That means all columns, all datatypes, nullability, even collation, before you can even think about writing a record. How are you going to handle the schema of your JSON changing, for example?
transaction logging - most Microsoft SQL databases work with a write-ahead log (WAL), which means the transaction logging has to complete before the transaction is considered complete, as part of an ACID transaction. What happens in situations of heavy load on the database or high concurrency? Queuing and blocking. Often these things take milliseconds, but low service tiers etc. come into play. Alternative patterns like eventual consistency, e.g. with Cosmos DB, are a possibility if you need that sort of thing.
Data lakes, in contrast, are schema-on-read, i.e. you do not have to know the schema in order to write to the lake, so you can just land the data and figure out the other stuff later.
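As an illustration of that "just land it" approach, a minimal sketch using the azure-storage-file-datalake package; the account URL, credential, container and file path are hypothetical:

```python
# A minimal sketch of landing a raw REST API response as JSON in ADLS Gen2.
# Account URL, credential, container ("raw") and file path are hypothetical.
import json
import requests
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential="<account-key>",  # or a token credential from azure-identity
)
filesystem = service.get_file_system_client("raw")

payload = requests.get("https://api.example.com/orders").json()

# schema-on-read: land the payload as-is and work out the schema later
file_client = filesystem.get_file_client("orders/2021/01/15/orders.json")
file_client.upload_data(json.dumps(payload), overwrite=True)
```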
This does not necessarily apply to your other question about Synapse, as you run the risk of losing your perfectly good SQL Server datatypes. Look at one of the migration wizards for that instead.

Streaming PostgreSQL tables into Google BigQuery

I would like to automatically stream data from an external PostgreSQL database into a Google Cloud Platform BigQuery database in my GCP account. So far, I have seen that one can query external databases (MySQL or PostgreSQL) with the EXTERNAL_QUERY() function, e.g.:
https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries
But for that to work, the database has to be in GCP Cloud SQL. I tried to see what options there are for streaming from the external PostgreSQL into a Cloud SQL PostgreSQL database, but I could only find information about replicating it as a one-time copy, not streaming:
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external
The reason why I want this streaming into BigQuery is that I am using Google Data Studio to create reports from the external PostgreSQL, which works great, but GDS can only accept SQL query parameters if the data comes from a Google BigQuery database. E.g. if we have a table with 1M entries and we want a Google Data Studio parameter to be added by the user, this will turn into:
SELECT * from table WHERE id=#parameter;
which means that the query will be faster, and won't hit the 100K records limit in Google Data Studio.
What's the best way of creating a connection between an external PostgreSQL (read-only access) and Google BigQuery so that when querying via BigQuery, one gets the same live results as querying the external PostgreSQL?
Perhaps you missed the options stated in the Google Cloud user guide?
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external#setup-replication
Notice in this section, it says:
"When you set up your replication settings, you can also decide whether the Cloud SQL replica should stay in-sync with the source database server after the initial import is complete. A replica that should stay in-sync is online. A replica that is only updated once, is offline."
I suspect online mode is what you are looking for.
What you are looking for will require some architecture design based on your needs, and some coding. There isn't a feature to automatically sync your PostgreSQL database with BigQuery (apart from the EXTERNAL_QUERY() functionality, which has some limitations: one connection per database, performance, total number of connections, etc.).
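For reference, this is roughly what the federated-query approach looks like from the BigQuery Python client; a minimal sketch in which the Cloud SQL connection resource and table are hypothetical:

```python
# A minimal sketch of EXTERNAL_QUERY() via the google-cloud-bigquery client.
# Project, region, Cloud SQL connection id and table are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my-cloudsql-connection',
  'SELECT id, name FROM users WHERE id = 42'
)
"""
for row in client.query(sql).result():
    print(row["id"], row["name"])
```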
In case you are not looking for the data in real time, what you can do with Airflow, for instance, is have a DAG that connects to all your DBs once per day (using KubernetesPodOperator, for instance), extracts the data (from the past day) and loads it into BQ. A typical ETL process, but in this case more EL(T). You can run this process more often if you cannot wait one day for the previous day of data.
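A minimal sketch of such a daily EL(T) DAG, assuming Airflow with the Postgres and Google providers installed; connection ids, bucket, table and dataset names are hypothetical:

```python
# A minimal sketch of a daily Postgres -> GCS -> BigQuery DAG.
# Connection ids, bucket, table and dataset names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.postgres_to_gcs import PostgresToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG("pg_to_bq_daily", start_date=datetime(2021, 1, 1), schedule_interval="@daily") as dag:
    extract = PostgresToGCSOperator(
        task_id="extract_yesterday",
        postgres_conn_id="external_pg",
        sql="SELECT * FROM events WHERE created_at::date = '{{ ds }}'",
        bucket="my-staging-bucket",
        filename="events/{{ ds }}.json",
        export_format="json",
    )
    load = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket="my-staging-bucket",
        source_objects=["events/{{ ds }}.json"],
        destination_project_dataset_table="analytics.events",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_APPEND",
        autodetect=True,
    )
    extract >> load
```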
On the other hand, if streaming is what you are looking for, then I can think of a Dataflow job. I guess you can connect using a JDBC connector.
In addition, depending on how your pipeline is structured, it might be easier to implement (but harder to maintain) if, at the same moment you write to your PostgreSQL DB, you also stream your data into BigQuery.
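A minimal sketch of that dual-write idea, assuming the google-cloud-bigquery client and a psycopg2 cursor; the table id, columns and event shape are hypothetical:

```python
# A minimal sketch of writing to PostgreSQL and streaming the same row into BigQuery.
# Table id, columns and the event shape are hypothetical.
from google.cloud import bigquery

bq = bigquery.Client()
TABLE_ID = "my-project.analytics.events"

def save_event(pg_cursor, event):
    # write to the transactional PostgreSQL DB first
    pg_cursor.execute(
        "INSERT INTO events (id, payload) VALUES (%s, %s)",
        (event["id"], event["payload"]),
    )
    # then stream the same row into BigQuery
    errors = bq.insert_rows_json(TABLE_ID, [event])
    if errors:
        raise RuntimeError(f"BigQuery streaming insert failed: {errors}")
```

The "harder to maintain" part mostly shows up in handling partial failures, e.g. when the Postgres write succeeds but the streaming insert does not.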
Not sure if you have tried this already, but instead of adding a parameter, if you add a dropdown filter based on a dimension, Data Studio will push that down to the underlying Postgres db in this form:
SELECT * from table WHERE id=$filter_value;
This should achieve the same results you want without going through BigQuery.

Data mining with postgres in production environment - is there a better way?

There is a web application which has been running for years, and during its lifetime the application has gathered a lot of user data. The data is stored in a relational DB (Postgres). Not all of this data is needed to run the application (to do the business). However, from time to time business people ask me to provide reports on this data. And this causes some problems:
sometimes these SQL queries are long-running
queries are executed against the production DB (not cool)
it is not so easy to deliver reports on a weekly or monthly basis
some parts of the data are stored in a way which is not suitable for such querying (queries are inefficient)
My idea (note that I am a developer, not a data mining specialist) for improving this whole process of delivering reports is:
create a separate DB which is regularly updated with production data
optimize how data is stored
create a dashboard to present reports
Question: but is there a better way? Is there another DB which is a better fit for such data analysis? Or should I look into modern data mining tools?
Thanks!
Do you really do data mining (as in: classification, clustering, anomaly detection), or is "data mining" for you any reporting on the data? In the latter case, all the "modern data mining tools" will disappoint you, because they serve a different purpose.
Have you used the indexing functionality of Postgres well? Your scenario sounds as if selection and aggregation are most of the work, and SQL databases are excellent for this - if well designed.
For example, materialized views and triggers can be used to process the data into a schema more usable for your reporting.
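A minimal sketch of that idea, run from Python with psycopg2; the connection string, orders table and columns are hypothetical:

```python
# A minimal sketch: a materialized view that pre-aggregates data for reporting.
# Connection string, table and columns are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=app user=report_user")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS monthly_sales AS
        SELECT date_trunc('month', created_at) AS month,
               count(*)                        AS orders,
               sum(total)                      AS revenue
        FROM orders
        GROUP BY 1
    """)
    # refresh on a schedule (e.g. nightly via cron) so reports stay reasonably current
    cur.execute("REFRESH MATERIALIZED VIEW monthly_sales")
```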
There are a thousand ways to approach this issue, but I think that the path of least resistance for you would be Postgres replication. Check out this Postgres replication tutorial for a quick proof of concept. (There are many hits when you Google for Postgres replication, and that link is just one of them.) Here is a link documenting streaming replication from the PostgreSQL site's wiki.
I am suggesting this because it meets all of your criteria and also stays within the bounds of the technology you're familiar with. The only learning curve would be the replication part.
Replication solves your issue because it would create a second database which would effectively become your "read-only" DB, updated via the replication process. You would keep the schema the same, but your indexing could be altered and reports/dashboards customized. This is the database you would query. Your main database would be your transactional database which serves the users, and the replicated database would serve the stakeholders.
This is a wide topic, so please do your due diligence and research it. But it's also something that can work for you and can be turned around quickly.
If you really want to try data mining with PostgreSQL, there are some tools which can be used.
The very simple way is KNIME. It is easy to install and has full-featured data mining tools. You can access your data directly from the database, process it, and save it back to the database.
The hardcore way is MADlib. It installs data mining functions written in Python and C directly in Postgres, so you can mine with SQL queries.
Both projects are stable enough to try.
For reporting, we use a non-transactional (read-only) database. We don't care about normalization. If I were you, I would use another database for reporting. I would design the tables following OLAP principles (star schema, snowflake) and use an ETL tool to dump the data periodically (maybe weekly) to the read-only database to start creating reports.
Reports are used for decision support, so they don't have to be in real time, and usually don't have to be current. In other words, it is acceptable to create a report with data up to last week or last month.

Is there a way to persist HSQLDB data?

We have all of our unit tests written so that they create and populate tables in HSQL. I want the developers who use this to be able to write queries against this HSQL DB (1) because by writing queries they can better understand the data model, and the ones not as familiar with SQL can play with the data before writing the runtime statements, and 2) because they don't have access to the test DB, for security reasons). Is there a way to persist the test data so that it may be examined and analyzed with an SQL client?
Right now I am jury-rigging it by switching the data source to a different DB (like DB2/MySQL, then connecting to that DB on my machine so I can play with persistent data); however, it would be easier for me if HSQL supported persisting this than to explain how to do this to every new developer.
Just to be clear, I need an SQL client to interact with persistent data, so debugging and checking memory won't cut it. This has more to do with initial development than with debugging/maintenance/testing.
If you use an HSQLDB Server instance for your tests, the data will survive the test run.
If the server uses a jdbc:hsqldb:mem:aname (all-in-memory) URL for its database, then the data will be available while the server is running. Alternatively, the server can use a jdbc:hsqldb:file:filepath URL and the data is persisted to files.
The latest HSQLDB docs explain the different options. Most of the observations also apply to older (1.8.x) versions. However, the latest version 2.0.1 supports starting a server and creating databases dynamically upon the first connection, which can simplify testing a lot.
http://hsqldb.org/doc/2.0/guide/deployment-chapt.html#N13C3D
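If anyone wants to reach such a server from Python rather than from a Java SQL client, here is a minimal sketch with jaydebeapi; the database alias, credentials, table and jar location are hypothetical:

```python
# A minimal sketch of querying a running HSQLDB Server over JDBC from Python.
# Database alias, credentials, table and the path to hsqldb.jar are hypothetical.
import jaydebeapi

conn = jaydebeapi.connect(
    "org.hsqldb.jdbc.JDBCDriver",
    "jdbc:hsqldb:hsql://localhost/testdb",  # server started with a file: or mem: database
    ["SA", ""],
    "/path/to/hsqldb.jar",
)
cur = conn.cursor()
cur.execute("SELECT * FROM some_test_table")
print(cur.fetchall())
cur.close()
conn.close()
```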