delta lake merge missing records - pyspark

I am executing a Delta Lake merge on AWS. However, I am not getting the correct result. Below is the PySpark script. It ran successfully; however, the output contains fewer records than the original table.
deltaTable.alias("old") \
    .merge(df.alias("new"), join_string) \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
As the image below shows, numOutputRows should be ~226k; however, I only get 21k in the final result.
[Screenshot of the merge operation metrics: ~226k records (numOutputRows) for the output table.]
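As a first diagnostic (a sketch, not a known fix), one could compare the metrics Delta recorded for the MERGE commit with the row count of the current snapshot, reusing the deltaTable variable from the script above:

# Show the metrics Delta logged for the latest commit (numOutputRows etc.)
deltaTable.history(1).select("operation", "operationMetrics").show(truncate=False)
# Count the rows actually present in the current snapshot
print("rows in current snapshot:", deltaTable.toDF().count())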

Related

How to create a BQ external table on top of a delta table in GCS and show only the latest snapshot

I am trying to create a BQ external table on top of a delta table which uses Google Cloud Storage as its storage layer. On the delta table, we perform DML which includes deletes.
I can create a BQ external table on top of the GCS bucket where all the delta files exist. However, it also pulls in the deleted records, since a BQ external table cannot read the delta transaction log, which says which parquet files to consider and which ones to ignore.
Is there a way we can expose the latest snapshot of the delta table (GCS location) in BQ as an external table, apart from programmatically copying data from delta to BQ?
This question was asked more than a year ago, but I have a tricky yet powerful addition to Oliver's answer which eliminates data duplication and additional load logic.
Step 1 As Oliver suggested, generate symlink_format_manifest files; you can either trigger the generation every time you update the table, or you can add a table property as stated here to automatically create those files when the delta table is updated:
ALTER TABLE delta.`<path-to-delta-table>` SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)
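If you prefer to trigger the manifest generation manually from a PySpark job instead, a minimal sketch (the path is a placeholder):

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "<path-to-delta-table>")
deltaTable.generate("symlink_format_manifest")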
Step 2 Create an external table that points to the delta table location
> bq mkdef --source_format=PARQUET "gs://test-delta/*.parquet" > bq_external_delta_logs
> bq mk --external_table_definition=bq_external_delta_logs test.external_delta_logs
Step 3 Create another external table pointing to the symlink_format_manifest/manifest file
> bq mkdef --autodetect --source_format=CSV gs://test-delta/_symlink_format_manifest/manifest > bq_external_delta_manifest
> bq mk --table --external_table_definition=bq_external_delta_manifest test.external_delta_manifest
Step 4 Create a view with the following query
> bq mk \
--use_legacy_sql=false \
--view \
'SELECT
*
FROM
`project_id.test.external_delta_logs`
WHERE
_FILE_NAME in (select * from `project_id.test.external_delta_manifest`)' \
test.external_delta_snapshot
Now you can get the latest snapshot, whenever your delta table is refreshed, from the test.external_delta_snapshot view without any additional loading or data duplication. A downside of this solution is that, in case of schema changes, you have to add new fields to your table definition either manually or from your Spark pipeline using the BQ client, etc. For those who are curious about how this solution works, please continue reading.
How this works:
Symlink manifest files contain a list of parquet files in newline-delimited format, pointing to the current delta version's partitions:
gs://delta-test/......-part1.parquet
gs://delta-test/......-part2.parquet
....
In addition to our delta location, we are defining another external table by treating this manifest file as a CSV file (it's actually a single-column CSV file). The view we've defined takes advantage of the _FILE_NAME pseudo column mentioned here, which points to the parquet file location of every row in the table. As stated in the docs, the _FILE_NAME pseudo column is defined for every external table that points to data stored in Cloud Storage and Google Drive.
So at this point, we have the list of parquet files required for loading the latest snapshot and the ability to filter the files we want to read using the _FILE_NAME column. The view we have defined simply encodes this procedure for getting the latest snapshot. Whenever our delta table gets updated, the manifest and the delta log table will point to the newest data, therefore we will always get the latest snapshot without any additional loading or data duplication.
Last word: it's a known fact that queries on external tables are more expensive (execution cost) than on BQ managed tables, so it's better to experiment with both dual writes as Oliver suggested and the external table solution you asked about. Storage is cheaper than execution, so there may be cases where keeping the data in both GCS and BQ costs less than keeping an external table like this.
I'm also developing this kind of pipeline where we dump our delta lake files into GCS and present them in BigQuery. Generating the manifest file from your GCS delta files will give you the latest snapshot, based on whatever version is currently set on your delta files. Then you need to create a custom script to parse that manifest file to get the list of files, and then run a bq load mentioning those files; a rough sketch of such a script follows the snippet below.
import io.delta.tables._

val deltaTable = DeltaTable.forPath(<path-to-delta-table>)
deltaTable.generate("symlink_format_manifest")
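A rough Python sketch of such a custom script, assuming the google-cloud-storage client library and the bucket/table names used earlier in this thread (all of them placeholders):

import subprocess
from google.cloud import storage

# Download the manifest, which is a newline-delimited list of gs:// parquet paths
client = storage.Client()
manifest = client.bucket("test-delta") \
    .blob("_symlink_format_manifest/manifest") \
    .download_as_text()
files = [line.strip() for line in manifest.splitlines() if line.strip()]

# Load exactly those files into a BigQuery table, replacing its previous contents
subprocess.run(
    ["bq", "load", "--replace", "--source_format=PARQUET",
     "test.delta_snapshot", ",".join(files)],
    check=True)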
The workaround below might work for small datasets.
Have a separate BQ table.
Read the delta lake files into a DataFrame and then overwrite the BigQuery table with df.write, as in the sketch below.
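A minimal PySpark sketch of that approach, assuming the spark-bigquery connector is available on the cluster (paths, bucket and table names are placeholders):

df = spark.read.format("delta").load("gs://my-bucket/path/to/delta-table")

df.write.format("bigquery") \
    .option("table", "my_dataset.my_table") \
    .option("temporaryGcsBucket", "my-temp-bucket") \
    .mode("overwrite") \
    .save()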

PySpark structured streaming and filtered processing for parts

I want to evaluate a streamed (unbounded) data frame within Spark 2.4:
time         id  value
6:00:01.000  1   333
6:00:01.005  1   123
6:00:01.050  2   544
6:00:01.060  2   544
When all the data for id 1 has arrived in the dataframe and the data of the next id 2 comes in, I want to do calculations on the complete data of id 1. But how do I do that? I think I cannot use the window functions, since I do not know the time in advance, and it also varies for each id. And I do not know the ids from any source other than the streamed data frame.
The only solution that comes to my mind involves a variable comparison (a memory) and a while loop:
id_old = 0  # start value
while True:
    id_cur = id_from_dataframe
    if id_cur != id_old:  # id has changed
        # do calculation for id_cur
        id_old = id_cur
But I do not think that this is the right solution. Can you give me a hint or point me to documentation that helps, since I cannot find examples or documentation?
I got it running with a combination of watermarking and grouping:
import pyspark.sql.functions as F

d2 = d1.withWatermark("time", "60 second") \
    .groupby('id',
             F.window('time', "40 second")) \
    .agg(F.count("*").alias("count"),
         F.min("time").alias("time_start"),
         F.max("time").alias("time_stop"),
         F.round(F.avg("value"), 1).alias('value_avg'))
Most of the documentation only shows the basic case of grouping by time alone; I saw just one example with another parameter in the grouping, so I put my 'id' there.
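For completeness, a minimal sketch of actually starting the streaming query on d2; the console sink and append output mode are illustrative choices, not part of the original answer:

# With the watermark in place, append mode emits each window once it is considered final
query = d2.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()
query.awaitTermination()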

create_dynamic_frame_from_catalog returning zero results

I'm trying to create a Glue dynamic frame from an Athena table, but I keep getting an empty data frame.
The Athena table is part of my Glue data catalog.
The create_dynamic_frame.from_catalog call doesn't raise any error. As a sanity check, I tried loading a random table and it did complain.
I know the Athena table has data, since querying the exact same table using Athena returns results.
The table is an external, partitioned JSON table on S3.
I'm using PySpark as shown below:
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())

# Create a DynamicFrame using the 'raw_data' table
raw_data_df = glueContext.create_dynamic_frame.from_catalog(
    database="***",
    table_name="raw_***")

# Print out information about this data; I'm getting zero here
print("Count: ", raw_data_df.count())

# Also getting nothing here
raw_data_df.printSchema()
Is anyone facing the same issue? Could this be a permissions issue or a Glue bug, since no errors are raised?
There are several poorly documented features/gotchas in Glue, which can be frustrating.
I would suggest investigating the following configurations of your Glue job:
Does the S3 bucket name have the aws-glue-* prefix?
Put the files in an S3 folder and make sure the crawler table definition points to the folder rather than an actual file.
I have also written a blog on LinkedIn about other Glue gotchas if that helps.
Do you have subfolders under the path your Athena table points to? glueContext.create_dynamic_frame.from_catalog does not read the data recursively. Either put the data at the root of where the table points to, or add additional_options = {"recurse": True} to your from_catalog call, as sketched below.
credit: https://stackoverflow.com/a/56873939/5112418
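For reference, a minimal sketch of that option applied to the question's script (database and table names are the question's placeholders):

raw_data_df = glueContext.create_dynamic_frame.from_catalog(
    database="***",
    table_name="raw_***",
    additional_options={"recurse": True})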

Spark Dataframe returns an inconsistent value on count()

I am using PySpark to perform some computations on data obtained from a PostgreSQL database. My pipeline is something like this:
limit = 1000
query = "(SELECT * FROM table LIMIT {}) as filter_query"
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://path/to/db") \
    .option("dbtable", query.format(limit)) \
    .option("user", "user") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()
df.createOrReplaceTempView("table")
df.count()  # 1000
So far, so good. The problem starts when I perform some transformations on the data:
counted_data = spark.sql("SELECT column1, count(*) as count FROM table GROUP BY column1").orderBy("column1")
counted_data.count()  # First value
# my_udf_function is a UDF defined elsewhere; it is applied to column1 here
counted_data_with_additional_column = counted_data.withColumn("column1", my_udf_function("column1"))
counted_data_with_additional_column.count()  # Second value, inconsistent with the first count (should be the same)
The first transformation alters the number of rows (the value should be <= 1000). However, the second one does not; it just adds a new column. How can it be that I am getting a different result for count()?
The explanation is actually quite simple, but a bit tricky. Spark might perform additional reads of the input source (in this case a database). Since some other process is inserting data into the database, these additional calls read slightly different data than the original read, causing this inconsistent behaviour. A simple call to df.cache() after the read prevents the further reads. I figured this out by analyzing the traffic between the database and my computer, and indeed, some further SQL commands were issued that matched my transformations. After adding the cache() call, no further traffic appeared.
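A minimal sketch of that fix, reusing the question's read (the connection options stay the same):

df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://path/to/db") \
    .option("dbtable", query.format(limit)) \
    .option("user", "user") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load() \
    .cache()  # cache right after the read so later actions reuse the same data
df.count()  # materializes the cache before further transformations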
Since you are using LIMIT 1000, you might be getting a different 1000 records on each execution. And since you will be getting different records each time, the result of the aggregation will be different. In order to get consistent behaviour with LIMIT, you can try the following approaches.
Either try to cache your dataframe with cache() or the persist method, which will ensure that Spark uses the same data for as long as it is available in memory.
But a better approach could be to sort the data based on some unique column and then get the 1000 records, which will ensure that you get the same 1000 records each time; see the sketch below.
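For example, a sketch assuming the table has a unique column named id (any unique key would do):

query = "(SELECT * FROM table ORDER BY id LIMIT {}) as filter_query"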
Hope it helps.

How to create partitioned table from Google Bucket?

Every weekend I add a few files to a Google bucket and then run something from the command line to "update" a table with the new data.
By "update" I mean that I delete the table and then remake it using all the files in the bucket, including the new files.
I do everything by using Python to execute the following command in the Windows command line:
bq mk --table --project_id=hippo_fence-5412 mouse_data.partition_test gs://mybucket/mouse_data/* measurement_date:TIMESTAMP,symbol:STRING,height:FLOAT,weight:FLOAT,age:FLOAT,response_time:FLOAT
This table is getting massive (>200 GB) and it would be much cheaper for the lab to use partitioned tables.
I've tried to partition the table in a few ways, including what is recommended by the official docs, but I can't make it work.
The most recent command I tried was just inserting --time_partitioning_type=DAY like:
bq mk --table --project_id=hippo_fence-5412 --time_partitioning_type=DAY mouse_data.partition_test gs://mybucket/mouse_data/* measurement_date:TIMESTAMP,symbol:STRING,height:FLOAT,weight:FLOAT,age:FLOAT,response_time:FLOAT
but that didn't work, giving me the error:
FATAL Flags parsing error: Unknown command line flag 'time_partitioning_type'
How can I make this work?
For the old data, a possible solution would be to create an empty partitioned table and then import each bucket file into the desired day partition. Unfortunately, it didn't work with wildcards when I tested it.
1. Create the partitioned table
bq mk --table --time_partitioning_type=DAY [MY_PROJECT]:[MY_DATASET].[PROD_TABLE] measurement_date:TIMESTAMP,symbol:STRING,height:FLOAT,weight:FLOAT,age:FLOAT,response_time:FLOAT
2. Import each file into the desired day partition. Here is an example for a file from 22nd February 2018.
bq load [MY_PROJECT]:[MY_DATASET].[PROD_TABLE]$20180222 gs://MY-BUCKET/my_file.csv
3. Process the current uploads normally and they will automatically be placed in the partition for the day of the upload.
bq load [MY_PROJECT]:[MY_DATASET].[PROD_TABLE] gs://MY-BUCKET/files*
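A rough Python sketch of automating step 2, assuming the file names contain the measurement date as YYYYMMDD (e.g. mouse_data_20180222.csv); the project, bucket and prefix come from the question, everything else is illustrative:

import re
import subprocess
from google.cloud import storage

client = storage.Client(project="hippo_fence-5412")
for blob in client.bucket("mybucket").list_blobs(prefix="mouse_data/"):
    match = re.search(r"(\d{8})", blob.name)  # pull the partition date out of the file name
    if not match:
        continue
    subprocess.run(
        ["bq", "load",
         "[MY_PROJECT]:[MY_DATASET].[PROD_TABLE]$" + match.group(1),
         "gs://mybucket/" + blob.name],
        check=True)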