How can we handle data validations in Snowpipe in Snowflake?

My scenario: I have data in flat files in AWS S3.
I am using SNS to trigger Snowpipe when a new file arrives in S3.
To load the data from the flat files in S3 into a Snowflake table I am using Snowpipe.
While Snowpipe loads the data from the flat files into the Snowflake table, can I handle data validation and a couple of calculations on the source data?
Please let me know if there is any way to do this.
Thanks in advance.

The VALIDATION_MODE copy option is not yet supported by Snowpipe. However, Snowpipe does support simple transformations in the pipe's COPY statement, such as column reordering and casts. The best way to perform calculations and transform your data is to load it into a staging table and process it downstream into the target tables.
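For illustration, a pipe that reorders columns and applies casts in its COPY statement could look roughly like the sketch below; the database, stage, table, and column names are placeholders, so adjust them to your environment:

create or replace pipe mydb.public.orders_pipe
  auto_ingest = true
  aws_sns_topic = '<your_sns_topic_arn>'
as
  copy into mydb.public.orders_staging (order_id, order_date, amount)
  from (
    select $1::number,          -- cast the first file column to NUMBER
           $3::date,            -- reorder: take the third file column second
           $2::number(10,2)     -- cast the second file column to a decimal
    from @mydb.public.s3_stage
  )
  file_format = (type = 'csv' skip_header = 1);

Anything more involved (validation rules, calculations, lookups) would then run downstream against that staging table before the data lands in the final target tables.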
Reference:
https://docs.snowflake.net/manuals/sql-reference/sql/create-pipe.html#usage-notes
https://docs.snowflake.net/manuals/user-guide/data-load-transform.html

Related

Incrementally loading into a Synapse table using Spark

I am creating a data warehouse, using Azure Data Factory to extract data from a MySQL table and save it in Parquet format in an ADLS Gen2 filesystem. From there, I use Synapse notebooks to process and load the data into destination tables.
The initial load is fairly easy using spark.write.saveAsTable('orders'); however, I am running into issues with the incremental loads that follow. In particular, I have not been able to find a way to reliably insert/update information in an existing Synapse table.
Since Spark does not allow DML operations on a table, I have resorted to reading the current table into a Spark DataFrame and inserting/updating records in that DataFrame. However, when I try to save that DataFrame using spark.write.saveAsTable('orders', mode='overwrite', format='parquet'), I run into a "Cannot overwrite table 'orders' that is also being read from" error.
A solution indicated by this suggests creating a temporary table and then inserting from it, but that still results in the above error.
Another solution in this post suggests writing the data into a temporary table, dropping the target table, and then renaming the temporary table, but upon doing this Spark gives me FileNotFound errors about metadata.
I know Delta tables can fix this issue pretty reliably, but our company is not yet ready to move over to Databricks.
All suggestions are greatly appreciated.

Azure Data Factory: Implementing SCD2 on txt files

I have flat files in an ADLS source.
For the full load we add two columns, an insert flag and a datetimestamp.
For the change load we need to look up against the full data: records already present in the full data should be treated as updates, and records not present should be treated as inserts and copied.
Below is the approach I tried to work out, but I'm unable to get it working.
Can anyone help me with this?
Thank you, and I'm waiting for a quick response.
Currently, updating an existing flat file through the Azure Data Factory sink is not supported; you have to create a new flat file.
You can also use a Data Flow activity to read the full and incremental data and load the result into a new file in the sink transformation.

How to Update Table in Snowflake using Azure Data Factory

I have two tables in Snowflake named table1 and table2. table1 is the source table, which contains incremental data, and table2 is the target table.
My use case is that I have to take the data from table1 and update table2 with it, and this process has to be done using Azure Data Factory.
I tried to create a Data Flow in ADF, but it didn't allow me to connect to Snowflake directly, as Snowflake is not in the list of supported sources; the native Snowflake connector only supports the Copy Data activity. As a workaround, I first created a Copy activity that copies the data from Snowflake to Azure Blob storage, then used the Azure Blob data as the source for a Data Flow to build my SCD1 implementation and saved the output in CSV files.
My question now is how I should update the data in the target table2, because if I use a Copy activity to load the CSV files directly into Snowflake, it will result in duplicate records on the Snowflake side. For instance, let's say table2 contains the row
id,name,age,data
1234,kristopher,24,somedata
and table1 contains
id,name,age,data
1234,kristopher,24,some-new-data
So now I have the table1 data in CSV files that have to be loaded into Snowflake. If I load them directly, the result looks something like this:
id,name,age,data
1234,kristopher,24,somedata
1234,kristopher,24,some-new-data
But I only need
1234,kristopher,24,some-new-data
Let me know if more explanation is required. I am new to both Azure Data Factory and Snowflake.
Thanks
As you have observed, the ADF Data Flows currently don't support Snowflake datasets as a source.
You could theoretically follow this design pattern, but it seems like a lot of work for the requirement you have described. An alternative would be to go down the Azure Function route, but again I would weigh the requirement against the effort required.
If it didn't have to be in ADF, then a quick approach would be to use a Snowflake Task to schedule some SQL to manage the SCD behavior for you.
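For example, a task that runs a MERGE from table1 into table2 could look roughly like this; the warehouse name and schedule are placeholders, and the join key and columns are taken from your sample rows, so adjust them to your real tables:

create or replace task merge_table1_into_table2
  warehouse = my_wh          -- placeholder warehouse
  schedule  = '15 MINUTE'    -- adjust to your load frequency
as
  merge into table2 t
  using table1 s
    on t.id = s.id
  when matched then update set
    t.name = s.name,
    t.age  = s.age,
    t.data = s.data
  when not matched then insert (id, name, age, data)
    values (s.id, s.name, s.age, s.data);

-- tasks are created suspended, so remember to resume it
alter task merge_table1_into_table2 resume;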
I hope this helps.
Best regards,
Dan.
You can put your logic in a Snowflake stored procedure, then execute your stored procedure from ADF.
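For instance, a minimal sketch of such a procedure, assuming a Snowflake Scripting (SQL) procedure and reusing the same SCD1 MERGE as in the task example above (names and keys are placeholders to adapt):

create or replace procedure upsert_table2()
  returns varchar
  language sql
as
$$
begin
  -- same SCD1 merge as in the scheduled-task example above
  merge into table2 t
  using table1 s
    on t.id = s.id
  when matched then update set t.name = s.name, t.age = s.age, t.data = s.data
  when not matched then insert (id, name, age, data) values (s.id, s.name, s.age, s.data);
  return 'table2 updated';
end;
$$;

call upsert_table2();

ADF then only has to issue the CALL statement.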

AWS Glue Scala Upsert

I am trying to upsert data into an existing S3 bucket from another bucket using AWS Glue in Scala. Is there a standard way to do this? One of the methods I found was to use SQL's MERGE statement. What are the advantages and disadvantages of using that?
Thanks
You can't really implement the SQL MERGE method on S3, since it's not possible to update existing data objects in place.
A workaround is to load the existing rows in a Glue job, merge them with the incoming dataset, drop the obsolete records, and overwrite all the objects on S3. If you have a lot of data, it is more efficient to partition it by some columns and then overwrite only those partitions that should contain new data.
If your goal is preventing duplicates, you can do something similar: load the existing data, drop the records from the incoming dataset that already exist in S3 (loaded in the previous step), and then write only the new records to S3.

How to extract data from a Hive table to CSV using Talend

I want to transfer data from one Hadoop server to another with the help of Talend.
Through my research I have come to know that we can transfer data through flat files. Can anyone suggest how to transfer data from Hive to a flat file? If there is any other way to transfer the data using Talend, please suggest it.
You can use the tHiveInput component to read the data from the Hive table, and connect it with a row link to a tFileOutputDelimited component to write the CSV file.
If you want to transfer the data directly to another Hadoop system instead, you can use a tHiveOutput or a tHDFSOutput in place of the file output component.