How long does Snowflake really maintain file load history?

Source #1 of 3 - https://docs.snowflake.com/en/user-guide/data-load-considerations-load.html#load-metadata says -
"Snowflake maintains detailed metadata for each table into which data
is loaded ... This load metadata expires after 64 days..." followed by
an explanation of the LOAD_UNCERTAIN_FILES copy option. This option
tells Snowflake whether to load files whose load metadata has expired
because it is more than 64 days old.
#2 of 3 - https://docs.snowflake.com/en/user-guide/data-load-local-file-system-copy.html#monitoring-files-staged-internally says -
"Snowflake retains historical data for COPY INTO commands executed
within the previous 14 days... Use the LOAD_HISTORY Information Schema
view to retrieve the history of data loaded into tables using the COPY
INTO command"
#3 of 3 - https://docs.snowflake.com/en/sql-reference/account-usage/copy_history.html#copy-history-view says -
"This Account Usage view can be used to query Snowflake data loading
history for the last 365 days (1 year). The view displays load
activity for both COPY INTO statements and continuous data
loading using Snowpipe. The view avoids the 10,000 row limitation of
the LOAD_HISTORY View."
Question #1
#3 seems to supersede #2: the retention is 365 days, and it covers not only bulk loads but continuous loading (Snowpipe) as well. Also, apparently there's a 10,000-row limit on #2.
The view in #3 is available only to the ACCOUNTADMIN role by default.
But if Snowflake does have the information for the last 365 days, why force the use of LOAD_UNCERTAIN_FILES after just 64 days?
Question #2
Aren't sources #1 and 2 inconsistent?

The most important number for the behavior of copying files into tables is 64 days. If you run a COPY INTO command from a stage without limiting the files with a list or pattern, Snowflake will not reload a file it has already loaded within the previous 64 days.
You can override that with FORCE = TRUE in the copy options (https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html#copy-options-copyoptions). That makes it load the files whether they're marked as loaded or not.
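For example, a minimal sketch (my_table and my_stage are placeholder names, and the CSV file format is just an assumption):
COPY INTO my_table
  FROM @my_stage
  FILE_FORMAT = (TYPE = 'CSV')
  FORCE = TRUE;  -- reload files even if the load metadata says they were already loaded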
When the following three conditions are all met, Snowflake does not know, for the purposes of the COPY INTO command, whether a file has already been loaded:
The file's LAST_MODIFIED date (i.e. the date when the file was staged) is
older than 64 days.
The initial set of data was loaded into the table more than 64 days
earlier.
If the file was already loaded successfully into the table, that load
happened more than 64 days earlier.
Under those conditions, the LOAD_UNCERTAIN_FILES option applies.
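If you do want those uncertain files loaded, you can say so explicitly; a minimal sketch with the same placeholder names:
COPY INTO my_table
  FROM @my_stage
  LOAD_UNCERTAIN_FILES = TRUE;  -- load files whose 64-day load metadata has expired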
The other two retention periods concern reporting, not the behavior of the COPY INTO command. The 14 days applies to the information returned for staged files by the LIST (LS) command. The 365 days applies to the data Snowflake shares back to customers through the "snowflake" database. Depending on the view in question, data can take between 15 minutes and 3 hours to appear in that database, and it remains available for 365 days after that.
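For what it's worth, once a role has access to the shared "snowflake" database (ACCOUNTADMIN by default, as noted above), the 365-day view from source #3 can be queried like any other view; a sketch, with columns as named in the linked COPY_HISTORY documentation:
SELECT file_name, table_name, last_load_time, status
FROM snowflake.account_usage.copy_history
WHERE last_load_time > DATEADD(day, -90, CURRENT_TIMESTAMP())
ORDER BY last_load_time DESC;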

Related

Azure data factory copy activity mystery (one duped row, one missing row from source)

Okay, I am at a total loss here. I have tried tweaking every setting I can think of in the copy activity I am dealing with, but for some reason the sink ends up with a duplicate row inserted and a row missing from the data source it pulls.
Data comes in via REST from our CRM system - this works fine, all records come through - then this mystery happens where the duplicated row gets inserted (it even occurred when I changed the write behavior from insert to upsert and specified the row's GUID as the key - how the heck??)
Overall flow is: grab an auth token from the REST source -> grab records via the REST endpoint in a Copy data activity.
(Screenshots: Copy data source settings; sink settings.)
Settings of the copy data step are default - I have tried editing DIUs, parallelism, max concurrent connections, different batch sizes, etc., with no impact.
It always appears to be the 15,000th row that is duplicated - row 15,001 is a duplicate of row 15,000 - so in my mind I am thinking it has something to do with how the data is partitioned and not indexed properly? I just don't get how this is happening - it doesn't happen with any of the other 60+ pipelines we run nightly.
Any help or suggestions are greatly appreciated!

Azure Data Factory ForEach is seemingly not running data flow in parallel

In Azure Data Factory I am using a Lookup activity to get a list of files to download, then pass it to a ForEach where a data flow processes each file.
I do not have 'Sequential' mode turned on, so I would assume the data flows should run in parallel. However, their runtimes are not the same; there is an almost constant gap between them (the first data flow ran 4 minutes, the second 6, the third 8, and so on). It seems as if the second data flow waits for the first one to finish and then uses its cluster to process the file.
Is that intended behavior? I have a TTL set on the cluster, but that did not help much. If it is intended, what is a workaround? I am currently working on creating a list of files first and using that instead of a ForEach, but I am not sure whether I will see an increase in efficiency.
I have not been able to solve the issue of the parallel data flows not executing in parallel; however, I have managed to change the solution in a way that increases performance.
What was there before: a Lookup activity that got a list of files to process, passed on to a ForEach loop with a data flow activity.
What I am testing now: a data flow activity that gets the list of files and saves it to a text file in ADLS, then the data flow activity that was previously in the ForEach loop, with its source changed to use "List of Files" and pointed at that list.
The result was an increase in efficiency (using the same cluster, 40 files took around 80 minutes with ForEach and only 2-3 minutes with the list of files); however, debugging is not easy now that everything is in one data flow.
You can overwrite the list-of-files file on each run, or use dynamic expressions and name it after the pipeline run ID or something else.

Keeping database table clean using Entity Framework

I am using Entity Framework and I have a table that I use to record events generated by a 3rd party. These events are valid for 3 days, so I would like to clean out any events older than 3 days to keep my database table lean. Is there any way I can do this - some approach that won't cause performance issues during the cleanup?
To do what you mention above, there are a couple of options:
1) Define a stored procedure mapped in your EF model, and use a Quartz.NET trigger to execute it on a schedule (a sketch of such a procedure follows this list).
https://www.quartz-scheduler.net
2) A SQL Server Agent job that runs every day at the least busy time and removes the old rows.
https://learn.microsoft.com/en-us/sql/ssms/agent/schedule-a-job?view=sql-server-2017
If you only need this for cleanup and nothing else, I recommend option 2.
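A rough sketch of such a cleanup procedure, assuming an Events table with a CreatedAt datetime column (both names are illustrative). Deleting in batches keeps locks and transaction-log growth small, which addresses the performance concern:
CREATE PROCEDURE dbo.PurgeOldEvents
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @rows INT = 1;
    WHILE @rows > 0
    BEGIN
        -- delete in small batches so the job never holds large locks
        DELETE TOP (5000) FROM dbo.Events
        WHERE CreatedAt < DATEADD(DAY, -3, GETDATE());
        SET @rows = @@ROWCOUNT;
    END
END
The same procedure can be executed from a Quartz.NET job (option 1) or scheduled directly with SQL Server Agent (option 2).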
First of all, make sure the 3rd party writes a timestamp on every record; that way you will be able to track how old each record is.
Then create a script that deletes all records older than 3 days:
DELETE FROM yourTable WHERE DATEDIFF(day,getdate(),thatColumn) < -3
Now create a scheduled job with SQL Server Agent in SQL Server Management Studio:
In SQL Server Management Studio, navigate to the server, expand the SQL Server Agent item, and then the Jobs folder to view, edit, and add scheduled jobs.
Set the script to run once every day, or whatever schedule pleases you :)

Kibana - what logs are not reporting

I am currently using Kibana 5.0. Almost 45 log sources are integrated with Kibana (IIS, VPN, ASA, etc.). My question is: how can I create a visualization that shows which log sources are not reporting to Kibana? Can anybody help with this?
Quick and dirty solution...
Make sure each log source is given a unique and meaningful tag as soon as their data enters the logstash workflow.
As each document is processed, write an entry to a separate index - call it masterlist.idx (do not give this index a date suffix). Use the tag you assigned as the document ID when you write entries to masterlist.idx.
The masterlist.idx index should really just contain a list of your log sources, with each entry having a timestamp. Then all you've got to do is visualise masterlist showing all the entries. You should have 45 documents, each with a timestamp showing its latest update. I think the default time picker on Kibana's Discover tab will do the job. Any source that hasn't been updated in X days (or whatever your threshold is) has a problem.

Powershell - Copying CSV, Modifying Headers, and Continuously Updating New CSV

We have a log that tracks faxes sent through our fax server. It is a .csv that contains Date_Time, Duration, CallerID, Direction (i.e. inbound/outbound), Dialed#, and Answered#. This file is overwritten every 10 minutes with any new info that was tracked on the fax server. This cannot be changed to be appended.
Sometimes our faxes fail, and the duration on those is equal to 00:00:00. We really don't know that they are failing until users let us know they are getting complaints about missing faxes. I am trying to create a PowerShell script that can read the file and notify us via email if there are n failures.
I started working on it, but it quickly became a big mess as I ran into more problems. One issue I was trying to overcome was having it email us over and over about the same failures. Since I can't save anything to the original .csv's, I was trying to perform these ideas in the script:
Copy the .csv, adding a new column titled "LoggedFailure". Create the file if it doesn't exist.
Compare the two files, and add any differing data (i.e. updates in the original) to the copy.
Check the copied .csv for Durations equal to 00:00:00. If found, mark the LoggedFailure column as "Yes" or some value.
If there are n failures, email us.
Have this script run as a scheduled task (every hour or so).
I'm having difficulty with maintaining the data. I haven't done a lot of work with scripting or programming, so I'm having trouble getting the logic right. I can look up cmdlets and understand them, but my main issue is the logic. Does anyone have tips or ideas on how best to update the data, track failures without sending duplicate notifications, and keep this running on a schedule?
I'd use a hash table with the Dialed# as the key. Create PSCustomObjects that have LastFail date and FailCount properties as the values. Read through the log in chronological order, and add/increment a new entry in the hash table every time it finds an entry with Duration of 00:00:00 that's newer than what's already in the hash table. If it finds a successful delivery event, delete the entry with that Dialed# key from the hash table if it exists.
When it's done, the hash table keys will be the collection of dialed numbers that are failing, and the objects in the values will tell you how many failures there have been and when the last one was. Use that to determine whether an alert needs to be sent, and which numbers to report.
When a problem with a given fax number is resolved, a successful fax to that number will clear the entry from the hash table, and stop the alerts.
Save the hash table between runs by exporting it as CLIXML, and re-import it at the beginning of each run.
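A rough PowerShell sketch of that bookkeeping (the log path, state-file path, and failure threshold are assumptions; the column names come from the question):
# Re-import the hash table saved by the previous run, if it exists
$stateFile = 'C:\FaxMonitor\faxFailures.xml'   # assumed location
$threshold = 3                                 # assumed alert threshold
$failures  = @{}
if (Test-Path $stateFile) { $failures = Import-Clixml $stateFile }

# Read the log in chronological order
$log = Import-Csv 'C:\FaxLogs\faxlog.csv' | Sort-Object { [datetime]$_.Date_Time }

foreach ($entry in $log) {
    $number = $entry.'Dialed#'
    $time   = [datetime]$entry.Date_Time

    if ($entry.Duration -eq '00:00:00') {
        # Failure: add a new entry, or update an existing one if this event is newer
        if (-not $failures.ContainsKey($number)) {
            $failures[$number] = [pscustomobject]@{ LastFail = $time; FailCount = 1 }
        }
        elseif ($time -gt $failures[$number].LastFail) {
            $failures[$number].LastFail = $time
            $failures[$number].FailCount++
        }
    }
    elseif ($failures.ContainsKey($number)) {
        # A later successful fax clears the alert state for that number
        $failures.Remove($number)
    }
}

# Numbers that have crossed the threshold; feed these into your email step
$toAlert = $failures.GetEnumerator() | Where-Object { $_.Value.FailCount -ge $threshold }

# Persist state between scheduled runs
$failures | Export-Clixml $stateFile
This only covers the state tracking; the email notification and the scheduled task wrapper are left to your existing setup.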