Snowpipe skips files - snowflake-pipe

We have been using Snowpipe for ~10 months now, and we recently ran into a case where only some of the files in a stage got loaded into the corresponding Snowflake table and newly staged files were no longer picked up. We verified that the underlying stage and pipe are in valid states.
Let's assume that the staging location is s3://<some_bucket>/some/path and there are 5 files file1.csv, file2.csv, file3.csv, file4.csv, file5.csv
select metadata$filename, count(*) from @<DB_NAME>.<SCHEMA>.<STAGE_NAME> group by metadata$filename;
The output indicates that all 5 files were detected and the counts align with what's expected. But file4.csv and file5.csv never got ingested.
select * from table(information_schema.copy_history(table_name=>'<TABLE_NAME>', start_time=> dateadd(hours, -1000, current_timestamp())));
does not show any copy history, which makes us suspect that the table/pipe is in a non-deterministic state. Is there a way out of this?
Note: the copy_history command works for other tables in the database.
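For completeness, two standard Snowflake diagnostics that are commonly suggested in this kind of situation (the pipe name below is a placeholder, and whether they resolve this particular case is only an assumption):
-- report the pipe's executionState and pendingFileCount
select system$pipe_status('<DB_NAME>.<SCHEMA>.<PIPE_NAME>');
-- re-scan the stage and queue any files staged within the last 7 days that were never loaded
alter pipe <DB_NAME>.<SCHEMA>.<PIPE_NAME> refresh;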

Related

Capturing Each Skipped Record - Copy Data Activity

I'm doing a large-scale project with multiple pipelines, millions of records per pipeline. I'm trying to develop a generic skipped row capture process.
What I need to do is: for every source row skipped due to any error encountered on the attempted load, I want to capture a key column value from the row and write it to a distinct log file (or separate DB table row). This can't be summary data: for each individual row that fails, I need to capture the row key from that row so we can review/re-load later (I will add in system variable values to identify pipeline, component, time stamp, etc). Pipeline must complete with all successful rows loaded, all unsuccessful rows logged.
This is no-brainer functionality in most ETL tools; I have to be overlooking something in ADF, because I can't find a way to do this. Appreciate any/all suggestions.
You can enable Fault tolerance and choose the 'Skip incompatible rows' option. It will skip rows that are incompatible between the source and target store during the copy, e.g. type or field mismatches or PK violations.
Then enable the session log and choose the Warning log level in the copy activity to log the skipped rows. Finally, you can save your log file in Azure Storage or Azure Data Lake Storage Gen2.
Reference:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-fault-tolerance
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-log
With your first copy activity, check the fault tolerance option in 'Settings' to log the skipped (faulted) rows.
Make sure to place your row key column first in the mapping definition.
Get the copy activity's logFilePath from the activity output into a variable.
Add another copy activity to load the skipped rows into a relational table:
- Its source path will be the variable that holds logFilePath.
- Set the file path type to 'Wildcard file path'.
- Keep the 'Wildcard file path' empty; the value will go in 'Wildcard file name'.
- Make sure that the delimited file dataset's escape character is set to quotation marks.
The OperationItem field of the log file holds your record's fields separated by commas; because we placed the row ID first in the mapping, it will appear first in OperationItem as well.
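Once the log rows are loaded into a relational table, pulling the row keys back out is a small query. A sketch only, assuming a SQL Server target and a hypothetical staging table named copy_skip_log holding the session log's Level, OperationItem and Message columns:
-- copy_skip_log is a hypothetical name for the table loaded from the session log;
-- the row key is the first comma-separated value in OperationItem because it was mapped first
SELECT
    LEFT(OperationItem, CHARINDEX(',', OperationItem + ',') - 1) AS row_key,
    Message AS error_message
FROM dbo.copy_skip_log
WHERE [Level] = 'Warning';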
Good luck.

Spark Delta Table Updates

I am working in Microsoft Azure Databricks environment using sparksql and pyspark.
So I have a delta table on a lake where data is partitioned by, say, file_date. Every partition contains files storing millions of records per day with no primary/unique key. All these records have a "status" column which is either NULL (if everything looks good on that specific record) or not null (say, if a particular lookup mapping for a particular column is not found). Additionally, my process has another folder called "mapping" which gets refreshed on a periodic basis, let's say nightly to keep it simple, and that is where the mappings come from.
On a daily basis, there is a good chance that about 100-200 rows get errored out (status column containing non-null values). From these files, on a daily basis (hence the partitioning by file_date), a downstream job pulls all the valid records and sends them for further processing, ignoring those 100-200 errored records while waiting for the correct mapping file to be received. The downstream job, in addition to the valid-status records, should also check whether a mapping is now available for the errored records and, if one is present, take those down the pipeline as well (after, of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? Ideally, I would first update the delta table/lake with the correct mapping and set the status column to "available_for_reprocessing"; my downstream job would then pull the valid data for the day plus the "available_for_reprocessing" data and, after processing, update the status back to "processed". But this seems to be super difficult using Delta.
I was looking at https://docs.databricks.com/delta/delta-update.html and the update example there only shows a simple update with constant values, not an update driven by another table.
The other, and most inefficient, option is to pull ALL the data (both processed and errored) for, say, the last 30 days, get the mapping for the errored records, and write the dataframe back into the delta lake using the replaceWhere option. This is super inefficient, as we are reading everything (hundreds of millions of records) and writing everything back just to process, at most, about a thousand records. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at https://docs.databricks.com/delta/delta-update.html, the example provided there is also only a very simple update. Without a unique key, it seems impossible to update specific records as well. Can someone please help?
I can use pyspark or sparksql, but I am lost.
If you want to update one column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (where 'status' is currently incorrect), I think the UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works for both delete and update operations, effectively allowing you to update/delete records on joins.
For your scenario I believe the sql command would look something like this:
UPDATE your_db.table_name AS a
SET status = 'correct'
WHERE EXISTS
(
SELECT *
FROM your_db.table_name AS b
JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
-- ... add further lookups if needed
WHERE
b.status = 'incorrect' AND
a.lookup_column_a = b.lookup_column_a AND
a.lookup_column_b = b.lookup_column_b
)
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN UPDATE SET maindept.dname = upddept.updated_name, maindept.location = upddept.updated_location
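For the original scenario, the same idea can be driven by the refreshed mapping data. A sketch only: the table names (events_delta, mapping) and column names below are assumptions, not taken from the question.
-- join the errored rows in recent partitions to the refreshed mapping
-- and flip them to available_for_reprocessing in place
MERGE INTO events_delta AS e
USING mapping AS m
ON e.lookup_key = m.lookup_key
AND e.file_date >= date_sub(current_date(), 30) -- partition filter keeps the rewrite small
WHEN MATCHED AND e.status IS NOT NULL -- only the errored rows
THEN UPDATE SET e.mapped_value = m.mapped_value, e.status = 'available_for_reprocessing'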

SSIS connection manager error with PostgreSQL 'Value does not fall within the expected range'

I have a very simple SSIS package which brings 8 columns from PostgreSQL into SQL Server. There are only 10 rows feeding through. Now the requirement has changed by adding:
- a join
- an extra column to bring back
- and removing one WHERE clause condition.
After making the change and previewing in the OLE DB source editor, it throws the error 'Value does not fall within the expected range'.
The script runs successfully in PostgreSQL within 40 seconds (though it should not take that long).
In the OLE DB source editor:
When adding the join - it previews successfully
When adding the join and bringing in the column - it previews successfully
When adding the join, bringing in the column and removing the condition (the condition removal lets it bring back 149 records instead of 10) - it throws the error
I guess it is a very basic error and someone with experience will be able to spot it quickly. I will be very thankful for a kind and prompt response.
I have checked the data type for the new column in the source and destination and it matches: character(255).
I have run the script using pgAdmin 4 and it runs successfully in PostgreSQL.

BigQuery Merge - Size of Query expands with DELETE clause

When attempting a MERGE statement, BigQuery is only scanning the requested partitions UNTIL the DELETE clause is added, at which point it reverts to scanning the whole dataset (blossoming from 1GB to >1TB in this case).
Is there a way to use the full features of MERGE, including DELETE, without incurring the extra cost?
Generic sample that matches my effort below:
MERGE target_table AS t ## all dates, partitioned on activity_date
USING source_table AS s ## one date, only yesterday
ON t.field_a = s.field_a
AND t.activity_date >=
DATE_ADD(DATE(current_timestamp(),'America/Los_Angeles'), INTERVAL -1 DAY) ## use partition to limit to yesterday
WHEN MATCHED
THEN UPDATE SET
field_b = s.field_b
WHEN NOT MATCHED
THEN INSERT
(field_a, field_b)
VALUES
(field_a, field_b)
WHEN NOT MATCHED BY SOURCE
THEN DELETE
Based on the query you have provided, it is not expected behavior for the merge to be applied to the whole dataset. After the query has run, you should analyze your dataset and check its validity to ensure that the query only touched the specific partitions.
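As a sketch of that kind of validity check (using the table and column names from the sample above), comparing per-partition row counts captured before and after the MERGE would show whether rows were inserted or deleted outside the expected partitions:
## not from the answer above; run before and after the MERGE and compare the results
SELECT activity_date, COUNT(*) AS row_count
FROM target_table
GROUP BY activity_date
ORDER BY activity_date;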
If, after further inspection, no unexpected changes were made to your dataset, the 1 TB of data noted may be simply explained as BigQuery ingesting that data into memory as a side step to be able to run the query.
However, to confirm, it is recommended to submit a ticket in the issue tracker with your BigQuery JobID so that BigQuery engineering can properly inspect the issue.

Sort unmatched records using joinkeys

I have two GDG files (-1 & 0 versions). Using these two files, a flat file needs to be generated which will have Insert records (records which are not in the -1 version but are in the +0 version), Delete records (records which are in the -1 version but are not in the +0 version) and Update records (records which are in both versions, but where the +0 version might have changes in some of the fields). How can I get those update records? Can I do it using JOINKEYS, and if yes, how?
Note: The update can be anywhere from column 1 to the last column of the file(+0 version of the GDG)
It is a simple JOINKEYS:
OPTION COPY
JOINKEYS F1=INA,FIELDS=(4,80),SORTED,NOSEQCK
JOINKEYS F2=INB,FIELDS=(4,80),SORTED,NOSEQCK
JOIN UNPAIRED
REFORMAT FIELDS=(F1:1,227,F2:1,227,?)
The OPTION COPY is for the Main Task, the bit which runs after the joined file is produced. SORT FIELDS=COPY is equivalent to OPTION COPY.
The assumption is that your data is in key order already. If not, remove the SORTED,NOSEQCKs, but bear in mind that you may then get "spurious" matches where equal keys are not in the same position on the file relative to inserts and deletes.
JOIN UNPAIRED gives you matches and both types of mismatch. JOIN UNPAIRED,F1,F2 is equivalent.
The REFORMAT statement defines the records on the joined file. You want all the data from both/either record, and you want to know whether there was a match, and if no match, which input file had the record. That is what the question-mark (?) is. It will contain 'B' (on both files), '1' (on F1, or the first physically present JOINKEYS, only) or '2' (on the other JOINKEYS file only).
Then you need to output the data. I'll assume you want the data in separate places:
OUTFIL FNAMES=INSERT,
INCLUDE=(455,1,CH,EQ,C'1'),
BUILD=(1,227)
OUTFIL FNAMES=DELETE,
INCLUDE=(455,1,CH,EQ,C'2'),
BUILD=(228,227)
OUTFIL FNAMES=CHANGE,
INCLUDE=(455,1,CH,EQ,C'B',
AND,
1,227,CH,NE,228,227,CH),
BUILD=(1,454)
OUTFIL FNAMES=UNCHNGE,
SAVE,
BUILD=(1,227)
INCLUDE= (or OMIT=) includes or omits the data from the "OUTFIL Group". OUTFILs "run" concurrently (as in the same record is presented to each in turn, then the next record, etc).
FNAMES gives you the DDname to put in the JCL.
For CHANGE, the INCLUDE is for the first record (known to match due to the test for 'B') not being equal to the second. It is not exactly clear what output you want here. Currently those are output as F2 appended to F1, and the entire (twice the size) record is written. You could also write the records in "pairs" (BUILD=(1,227,/,228,227)) or just one or the other of the records.
SAVE is a thing which says "if this record hasn't appeared on any OUTFIL, output it here". It is certainly useful for testing, even if you don't want it in the final code.