Good way to ensure data is ready to be queried while Glue partition is created? - aws-glue-data-catalog

We have queries that run on a schedule every several minutes that join a few different glue tables (via Athena) before coming up with some results. For the table in question, we have Glue Crawlers set up and partitions based on snapshot_date and a couple other columns.
In the query, we grab the latest snapshot_date and use only data from that snapshot_date. The data in S3 gets updated and written to the right folder a few times a day, but it looks like sometimes, if we query right as the data in S3 is being updated, we end up with empty results because the query hits the new snapshot_date partition while Glue is still setting it up(?)
Is there a built-in way to ensure that our Glue partitions are ready before we start querying them? So far, we considered building artificial time "buffers" into our query around when we expect the snapshot_date partition data to be written and the Glue update to be complete, but I'm aware that's really brittle and depends on the exact timing.
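One less brittle alternative to a fixed time buffer could be to have the scheduled job ask the Glue Data Catalog whether the new snapshot_date partition has been registered before firing the Athena query. A minimal boto3 sketch (database, table, and date values are hypothetical; this only confirms the partition is registered in the catalog, not that every file underneath it has finished writing):
import boto3

glue = boto3.client("glue")

def partition_is_registered(database, table, snapshot_date):
    # True once the crawler (or whatever registers partitions) has added
    # this snapshot_date to the catalog.
    resp = glue.get_partitions(
        DatabaseName=database,
        TableName=table,
        Expression=f"snapshot_date = '{snapshot_date}'",
    )
    return len(resp["Partitions"]) > 0

if partition_is_registered("my_db", "my_table", "2024-01-15"):
    # reasonably safe to kick off the scheduled Athena query for this snapshot_date
    pass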

Related

How to get schema of Delta table without reading content?

I have a Delta table with millions of rows and several columns of various types, including nested Structs, and I want to create an empty DataFrame clone of it at runtime - i.e. same schema, no rows.
Can I read the schema without reading any content of the table (so that I can then create an empty DataFrame based on that schema)? I assumed it would be possible, since the Delta transaction log exists and Delta itself needs to access table schemas quickly.
What I tried:
df.schema - Accessing schema immediately after the delta table load took several minutes as well.
limit(0) - Calling limit(0) immediately after the load still took several minutes.
limit(0).cache() - limit sometimes gets moved around in the plan, so I also tried adding cache() to "fix its position".
Are there any other options? Would it be correct to just access the transaction log JSON and read the schema from the latest transaction? (Given that we
Context: I want to add a step into our CI that checks the code and various assumptions around schema before it gets actually run with the data.
Accessing the schema of a Delta table doesn't go through all the data, because Delta stores the schema in the transaction log itself, so df.schema should be enough. When the transaction log is accessed, it may take some time to reconstruct the current schema from the JSON/Parquet files that make up the log, but several minutes is quite strange and you'd need to dig into the execution plan.
I wouldn't recommend reading the transaction log directly, as its format is an internal detail; besides, the latest transaction may not contain the schema (it isn't written into every log file, only when it changes).
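To make that concrete, a minimal PySpark sketch of what the answer suggests - load the table lazily, take df.schema, and build the empty clone from it (the path is hypothetical, and this assumes the Delta Lake package is available in the session):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Loading is lazy; .schema is resolved from the Delta transaction log rather
# than by scanning the data files.
df = spark.read.format("delta").load("/path/to/delta_table")
schema = df.schema

# Empty DataFrame with the same schema and no rows.
empty_clone = spark.createDataFrame([], schema)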

Results of rdd.count, count via spark sql are the same, but they are different from count with hive sql

I use count to get the number of rows in an RDD and got 13673153, but after I convert the RDD to a DataFrame, insert it into Hive, and count again, I get 13673182. Why?
rdd.count
spark.sql("select count(*) from ...").show()
hive sql: select count(*) from ...
This could be caused by a mismatch between data in the underlying files and the metadata registered in hive for that table. Try running:
MSCK REPAIR TABLE tablename;
in Hive, and see if the issue is fixed. The command updates the partition information of the table. You can find more info in the Hive documentation for MSCK REPAIR TABLE.
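If it's convenient to run the repair and the recount from the same Spark session, a small sketch (the table name is hypothetical, assuming the session is built with Hive support):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Re-sync the partition metadata, then count again through the metastore.
spark.sql("MSCK REPAIR TABLE my_db.my_table")
spark.sql("SELECT COUNT(*) FROM my_db.my_table").show()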
During a Spark action, as part of the SparkContext, Spark records which files were in scope for processing, so that if the DAG needs to recover and reprocess that action, the same results are produced. By design.
Hive QL has no such considerations.
UPDATE
As you noted, the other answer did not help in this use case.
So, when Spark processes Hive tables, it takes note of the list of files it will use for the action.
In the case of a failure (node failure, etc.) it recomputes data from the generated DAG. If it needs to go back and recompute as far as the initial read from Hive itself, it knows which files to use - i.e. the same files - so the same results are produced instead of non-deterministic outcomes. Think of partitioning aspects, for example: it's handy that the same results can be recomputed!
It's that simple. It's by design. Hope this helps.

Spark Streaming dropDuplicates

Spark 2.1.1 (scala api) streaming json files from an s3 location.
I want to deduplicate any incoming records based on an ID column ("event_id") found in the json for every record. I do not care which record is kept, even if the duplication of the record is only partial. The data is merely being enriched/filtered via the spark.sql() method, with no group by/window aggregations, and I use append mode to write parquet files to s3.
According to the documentation, I should be able to use dropDuplicates without watermarking in order to deduplicate (obviously this is not effective in long-running production). However, this fails with the error:
User class threw exception: org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets
That error seems odd as I am doing no aggregation (unless dropDuplicates or sparkSQL counts as an aggregation?).
I know that duplicates won't occur more than 3 days apart, so I then tried it again by adding a watermark (using .withWatermark() immediately before the dropDuplicates). However, it seems to want to wait until the 3 days are up before writing the data (i.e. since today is July 24, only data up to the same time on July 21 is written to the output).
As there is no aggregation, I want to write every row immediately after the batch is processed, and simply throw away any rows with an event id that has occurred in the previous 3 days. Is there a simple way to accomplish this?
Thanks
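For reference, the pattern described above - watermark the event-time column, then dropDuplicates - looks roughly like this in PySpark (the question uses the Scala API; schema, column, and path names here are hypothetical). In newer Spark versions streaming deduplication emits rows as they arrive and uses the watermark only to expire state, so the 3-day delay described above should not occur there; behaviour on 2.1.1 may differ:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream
    .schema(event_schema)              # file sources need an explicit schema
    .json("s3a://my-bucket/incoming/")
)

deduped = (
    events
    .withWatermark("event_time", "3 days")
    # The event-time column is included so the watermark can expire old state;
    # Spark 3.5+ also offers dropDuplicatesWithinWatermark for dedup on the id alone.
    .dropDuplicates(["event_id", "event_time"])
)

query = (
    deduped.writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "s3a://my-bucket/deduped/")
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/dedup/")
    .start()
)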
In my case, I used to achieve this in one of two ways with DStreams:
One way (a rough sketch follows these steps):
load tmp_data (containing the last 3 days of unique data, see below)
receive batch_data and do a leftOuterJoin with tmp_data
filter the result of step 2 and output the new unique data
update tmp_data with the new unique data from step 2's result and drop old data (more than 3 days)
save tmp_data on HDFS or wherever
repeat the above again and again
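A rough PySpark sketch of that per-batch loop, using pair RDDs keyed by event_id (the value layout and the epoch-seconds time handling are my own assumptions):
import time

THREE_DAYS = 3 * 24 * 3600

def process_batch(batch_rdd, tmp_rdd):
    # Both are pair RDDs keyed by event_id: (event_id, (event_time_epoch, payload)).
    # Keep only records whose event_id is not already in tmp_data.
    new_unique = (
        batch_rdd.leftOuterJoin(tmp_rdd)            # (id, (batch_val, tmp_val or None))
                 .filter(lambda kv: kv[1][1] is None)
                 .mapValues(lambda vals: vals[0])
    )
    # ... write new_unique to the enriched output here ...

    # Fold the new ids into tmp_data and age out anything older than 3 days.
    cutoff = time.time() - THREE_DAYS
    updated_tmp = (
        tmp_rdd.union(new_unique)
               .filter(lambda kv: kv[1][0] >= cutoff)
    )
    return new_unique, updated_tmp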
Another way (sketched below):
create a table in MySQL with a UNIQUE INDEX on event_id
receive batch_data and just save event_id + event_time + whatever to MySQL
MySQL will ignore duplicates automatically (e.g. with INSERT IGNORE)
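A sketch of this second approach with pymysql; the table layout, connection details, and the use of INSERT IGNORE are assumptions on my part:
import pymysql

conn = pymysql.connect(host="mysql-host", user="etl", password="secret", database="dedup")
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS seen_events (
            event_id   VARCHAR(64) NOT NULL,
            event_time DATETIME    NOT NULL,
            UNIQUE KEY uq_event_id (event_id)
        )
    """)
    # INSERT IGNORE silently skips rows whose event_id is already present.
    cur.execute(
        "INSERT IGNORE INTO seen_events (event_id, event_time) VALUES (%s, %s)",
        ("evt-123", "2017-07-24 12:00:00"),
    )
conn.commit()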
The solution we used was a custom implementation of org.apache.spark.sql.execution.streaming.Sink that inserts into a Hive table after dropping duplicates within the batch and performing a left anti join against the previous few days' worth of data in the target Hive table.
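In newer Spark versions, foreachBatch can play the role of such a custom Sink without touching internal APIs. A rough sketch of the same drop-duplicates-then-left-anti-join idea (table and column names are hypothetical):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def dedup_and_append(batch_df, batch_id):
    # 1. Drop duplicates within the micro-batch itself.
    batch = batch_df.dropDuplicates(["event_id"])

    # 2. Left anti join against the last few days already in the target table.
    recent_ids = (
        spark.table("target_db.events")
             .where(F.col("event_time") >= F.date_sub(F.current_date(), 3))
             .select("event_id")
    )
    new_rows = batch.join(recent_ids, on="event_id", how="left_anti")

    # 3. Append only the genuinely new rows (target table assumed to exist).
    new_rows.write.mode("append").insertInto("target_db.events")

# deduped_stream.writeStream.foreachBatch(dedup_and_append).start()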

Aggregate as part of ETL or within the database?

Is there a general preference or best practice when it comes to whether data should be aggregated in memory on an ETL worker (with pandas groupby or pd.pivot_table, for example) versus doing a GROUP BY query at the database level?
At the visualization layer, I connect to the last 30 days of detailed interaction-level data, and then the last few years of aggregated data (daily level).
I suppose that if I plan on materializing the aggregated table, it would be best to just do it during the ETL phase since that can be done remotely and not waste the resources of the database server. Is that correct?
If your concern is to put as little load on the source database server as possible, it is best to pull the tables from the source database into a staging area and do the joins and aggregations there. But take care that the ETL tool does not perform a nested loop join against the source database tables, that is, pull in one of the tables and then run thousands of queries against the other table to find matching rows.
If your goal is to perform joins and aggregations as fast and efficiently as possible, by all means push them down to the source database. This may put more load on the source database, though. I say "may" because if all you need is an aggregation on a single table, it can be cheaper to perform it in the source database than to pull the whole table.
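To make the trade-off concrete, the two options look roughly like this with pandas and SQLAlchemy (connection string, table, and column names are hypothetical):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pw@dbhost/warehouse")

# Option A: pull the detail rows and aggregate in memory on the ETL worker.
detail = pd.read_sql("SELECT user_id, event_date, amount FROM interactions", engine)
daily_on_worker = detail.groupby(["user_id", "event_date"], as_index=False)["amount"].sum()

# Option B: push the aggregation down to the source database.
daily_in_db = pd.read_sql(
    """
    SELECT user_id, event_date, SUM(amount) AS amount
    FROM interactions
    GROUP BY user_id, event_date
    """,
    engine,
)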
If you aggregate by day, what happens when your boss wants it aggregated by hour or by week?
The general rule is: Your fact table granularity should be as granular as possible. Then you can drill-down.
You can create pre-aggregated tables too, for example by hour, day, week, month, etc. Space is cheap these days.
Tools like Pentaho Aggregation Designer can automate this for you.
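A small pandas sketch of building such pre-aggregates at several grains from the most granular fact table (file and column names are hypothetical):
import pandas as pd

fact = pd.read_parquet("fact_interactions.parquet")        # one row per interaction
fact = fact.set_index(pd.to_datetime(fact["event_ts"]))

# Same granular source, several pre-aggregated grains.
hourly = fact["amount"].resample("H").sum()
daily = fact["amount"].resample("D").sum()
weekly = fact["amount"].resample("W").sum()

daily.to_frame("amount").to_parquet("agg_daily.parquet")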

hadoop - large database query

Situation: I have a Postgres DB that contains a table with several million rows and I'm trying to query all of those rows for a MapReduce job.
From the research I've done on DBInputFormat, Hadoop might try to reuse the same query for a new mapper, and since these queries take a considerable amount of time, I'd like to prevent this in one of two ways I've thought up:
1) Limit the job to only run 1 mapper that queries the whole table and call it good.
or
2) Somehow incorporate an offset in the query so that if Hadoop does try to use a new mapper it won't grab the same stuff.
I feel like option (1) seems more promising, but I don't know if such a configuration is possible. Option (2) sounds nice in theory, but I have no idea how I would keep track of the mappers being made and if it is at all possible to detect that and reconfigure.
Help is appreciated and I'm namely looking for a way to pull all of the DB table data and not have several of the same query running because that would be a waste of time.
DBInputFormat essentially already does your option 2. It uses LIMIT and OFFSET in its queries to divide up the work. For example:
Mapper 1 executes: SELECT field1, field2 FROM mytable ORDER BY keyfield LIMIT 100
Mapper 2 executes: SELECT field1, field2 FROM mytable ORDER BY keyfield LIMIT 100 OFFSET 100
Mapper 3 executes: SELECT field1, field2 FROM mytable ORDER BY keyfield LIMIT 100 OFFSET 200
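Not the actual Java implementation, just a rough Python illustration of how that kind of splitting carves one table into per-mapper queries:
import math

def split_queries(total_rows, num_mappers,
                  base="SELECT field1, field2 FROM mytable ORDER BY keyfield"):
    chunk = math.ceil(total_rows / num_mappers)
    return [f"{base} LIMIT {chunk} OFFSET {i * chunk}" for i in range(num_mappers)]

for q in split_queries(total_rows=300, num_mappers=3):
    print(q)   # reproduces the three mapper queries shown above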
So if you have proper indexes on the key field, you probably shouldn't mind that multiple queries are being run. Where you do get some possible re-work is with speculative execution. Sometimes Hadoop will schedule multiple copies of the same task and simply use the output from whichever finishes first. If you wish, you can turn this off by setting the following property:
mapred.map.tasks.speculative.execution=false
However, all of this is out the window if you don't have a sensible key for which you can efficiently do these ORDER, LIMIT, OFFSET queries. That's where you might consider using your option number 1. You can definitely do that configuration. Set the property:
mapred.map.tasks=1
Technically, the InputFormat gets "final say" over how many Map tasks are run, but DBInputFormat always respects this property.
Another option you can consider is a utility called Sqoop, which is built for transferring data between relational databases and Hadoop. This would make it a two-step process, however: first copy the data from Postgres to HDFS, then run your MapReduce job.