I have a Spark job that reads from an Oracle table into a dataframe. The way it seems the jdbc.read method works is to pull an entire table in at once, so I constructed a spark-submit job to work in batch. Whenever I have data I need manipulated I put it in a table and run the spark-submit.
However, I would like this to be more event driven...essentially I want it so anytime data is moved into this table it is run through Spark, so I can have events in a UI drive these insertions and spark is just running. I was thinking about using a spark streaming context just to have it watching and operating on the table all the time, but with a long wait between streaming contexts. This way I can use the results (also written to Oracle in part) to trigger a deletion of the read table and not run data more than once.
Is this a bad idea? Will this work? It seems more elegant than using a cron-job.
Related
We are using Spark SQL with Apache Hive tables (via AWS Glue Data catalog). One problem is that when we execute a Spark SQL query without specifying the partitions to read via the WHERE clause, it gives us/the user no warning about the fact that it will proceed to load all partitions and thus likely time out or fail.
Is there a way to ideally error out, or at least give some warning, when a user executes a Spark SQL query on Apache Hive table without specifying partition keys? It's very easy to forget to do this.
I searched for existing solutions to this and found none, both on Stack Overflow and on the wider internet. I was expecting some configuration option/code that would help me achieve the goal.
I have flow files with data records in it. I'm able to place it on S3 bucket. From there on I want to run COPY command and update command with joins to achieve MERGE / UPSERT operation. Can anyone suggest ways to solve this as firehose only executes copy command and I can't make UPSERT / MERGE operation as prescribed by AWS docs directly, so has to copy into staging table and update or insert using some conditions.
There are a number of ways to do this but I usually go with a lambda function run every 5 minutes or so that takes the data put in Redshift from firehose and merges it with existing data. Redshift likes to run on larger "chunks" of data and it is most efficient if you build up some size before performing these operations. The best practice is to move the data from the firehose target in an atomic operation like ALTER TABLE APPEND and use this new table as the source for merging. This is so firehose can keep adding data while the merge is in process.
Working on an ETL process that requires me to pull data from one postgres table and update data to another Postgres table in a seperate environment (same columns names). Currently, I am running the python job in a windows EC2 instance, and I am using pangres upsert library to update existing rows and insert new rows.
However, my organization wants me to move the python ETL script in Managed Apache Airflow on AWS.
I have been learning DAGs and most of the tutorials and articles are about querying data from postgres table using hooks or operators.
However, I am looking to understand how to update existing table A incrementally (i.e. upsert) using new records from table B (and ignore/overwrite existing matching rows).
Any chunk of code (DAG) that explains how to perform this simple task would be extremely helpful.
In Apache Airflow, operations are done using operators. You can package any Python code into an operator, but your best bet is always to use a pre-existing open source operator if one already exists. There is an operator for Postgres (https://airflow.apache.org/docs/apache-airflow-providers-postgres/stable/operators/postgres_operator_howto_guide.html).
It will be hard to provide a complete example of what you should write for your situation, but it sounds to be like the best approach for you to take here is to take any SQL present in your Python ETL script and use it with the Postgres operator. The documentation I linked should be a good example.
They demonstrate inserting data, reading data, and even creating a table as a pre-requisite step. Just like how in a Python script, lines execute one at a time, in a DAG, operators execute in a particular order, depending on how they're wired up, like in their example:
create_pet_table >> populate_pet_table >> get_all_pets >> get_birth_date
In their example, populating the pet table won't happen until the create pet table step succeeds, etc.
Since your use case is about copying new data from one table to another, a few tips I can give you:
Use a scheduled DAG to copy the data over in batches. Airflow isn't meant to be used a streaming system for many small pieces of data.
Use the "logical date" of the DAG run (https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html) in your DAG to know the interval of data that run should process. This works well for your requirement that only new data should be copied over during each run. It will also give you repeatable runs in case you need to fix code, then re-run each run (one batch a time) after pushing your fix.
We have queries that run on a schedule every several minutes that join a few different glue tables (via Athena) before coming up with some results. For the table in question, we have Glue Crawlers set up and partitions based on snapshot_date and a couple other columns.
In the query, we grab the latest snapshot_date and use only data from that snapshot_date for the query. The data in S3 gets updated and put into the right folder a few times a day, but it looks like sometimes, if we try to query the data right as the data in S3 is getting updated, we end up with empty results due to the query trying to access the new snapshot_date partition while Glue is still getting the data set up(?)
Is there a built-in way to ensure that our glue partitions are ready before we start querying them? So far, we considered building in artificial time "buffers" in our query around when we expect the snapshot_date partition data to be written and the glue update to be complete, but I'm aware that that's really brittle and depends on the exact timing.
We are having a tricky situation while performing ACID operation using Databricks Spark .
We want to perform UPSERT on a Azure Synapse table over a JDBC connection using PySpark . We are aware of Spark providing only 2 mode for writing data . APPEND and OVERWRITE (only these two use full in our case) . So based these two mode we thought of below options:
We will write whole dataframe into a stage table . And we will use this stage table to perform MERGE operation( ~ UPSERT )with final Table .Stage table will be truncated / dropped after that .
We Will bring target table data into Spark also. Inside Spark We will perform MERGE using Delta lake and will generate a final Dataframe .This dataframe will be written back to Target table in OVERWRITE mode.
Considering the cons. sides..
in Option 1 , We have to use two table just to write the final data. And In,case both Stage and target tables are big , then performing MERGE operation inside Synapse is another herculean task and May take time .
in option 2 ,We have to bring the Target table into Spark in-memory. Even though network IO is not much of our concern as both Databricks and Synpse will be in same Azure AZ, It may leads to memory issue in Spark side.
Is there any other feasible options ?? Or any recommendation ??
Answer would depend on many factors not listed in your question. It's a very open ended question.
(Given the way your question is phrased I'm assuming you're using Dedicated SQL Pools and not an On-demand Synapse)
Here are some thoughts:
You'll be using spark cluster's compute in option 1 and Synapse' compute in option 2. Compare cost.
Pick the lower cost.
Read and write to/from Spark to/from Synapse using their driver uses Datalake as stage. I.e. while reading a table from Synapse into a datafrmae in Spark, driver will first make Synapse export data to Datalake (as parquet IIRC) and then read the files in Datalake to create the Dataframe. This scales nicely if you're talking about 10s or million or billions of rows. But the overhead could become a performance overhead if row counts are low (10-100s of thousands).
Test and pick the faster one.
Remember that Synapse is not like a traditional MySQL or SQL-Server. It's an MPP DB.
"performing MERGE operation inside Synapse is another herculean task and May take time" is a wrong statement. It scales just like a Spark cluster.
It may leads to memory issue in Spark side, yes and no. One one hand all data isn't going to be loaded into a single worker node. OTOH yes, you do need enough memory for each node to do it's own part.
Although Synapse can be scaled up and down dynamically, I've seen it take up to 40 minutes to complete a scale up. Databricks on the other hand is fully on-demand and you can probably get away with turning on cluster, do upsert, shutdown cluster. With Synapse you'll probably have other clients using it, so may not be able to shut it down.
So with Synapse either you'll have to live with 40-80 minutes down time for each upsert (scale up, upsert, scale down), OR
pay for high DWU flat-rate all the time, though your usage is high only when you upsert but otherwise it's pretty low.
Lastly, remember that MERGE is in preview at the time of writing this. Means no Sev-A support cases/immediate support if something breaks in your prod because you're using MERGE.
You can always use DELETE + INSERT instead. Assumes the delta you receive has all columns from target table and not just updated ones.
Did you try creating checksum to do merge upsert only for rows that have actual data change?