I have a PowerShell script that creates some output using Hive on HDInsight. The output is placed in a blob, and then I copy it to a local machine using AzCopy. I do this a lot to get various pieces of data that I need, often calling that script multiple times. The problem is that at some point AzCopy errors out with the message "The condition specified using HTTP conditional header(s) is not met.", but only after numerous successful iterations.
I am not sure what this means, and a Fiddler transcript did not help much either. I tried deleting the file and repeating the AzCopy call, and the error persisted, so it might have something to do with the AzCopy HTTP session. Can anyone enlighten me?
PS C:\hive> AzCopy /Y /Source:https://msftcampusdata.blob.core.windows.net/crunch88-1 /Dest:c:\hive\extracts\data\ /SourceKey:attEwHZ9AGq7pzzTYwRvjWwcmwLvFqnkxIvJcTblYnZAs1GSsCCtvbBKz9T/TTtwDSVMDuU3DenBbmOYqPIMhQ== /Pattern:hivehost/stdout
AzCopy : [2015/05/10 15:08:44][ERROR] hivehost/stdout: The remote server returned an error: (412) The condition specified using HTTP conditional header(s)
is not met..
At line:1 char:1
+ AzCopy /Y /Source:https://msftcampusdata.blob.core.windows.net/crunch88-1 /Dest: ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: ([2015/05/10 15:...s) is not met..:String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError
The condition specified using HTTP conditional header(s) is not met.
To ensure data integrity during the whole download, AzCopy passes the ETag of the source blob in the HTTP "If-Match" header while reading data from the source blob. So HTTP status code 412 (Precondition Failed), "The condition specified using HTTP conditional header(s) is not met.", simply means that your blobs were being changed while AzCopy was downloading them.
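In HTTP terms, the download is roughly the conditional read sketched below (an illustration only; the SAS token is a placeholder, since raw REST calls need their own authorization):
# Capture the blob's current ETag, then read the blob only if that ETag still matches.
# If the blob changes between the two calls, the service answers 412 Precondition Failed.
$blobUrl = "https://msftcampusdata.blob.core.windows.net/crunch88-1/hivehost/stdout?<SAS token>"
$etag = (Invoke-WebRequest -Uri $blobUrl -Method Head -UseBasicParsing).Headers["ETag"]
Invoke-WebRequest -Uri $blobUrl -Headers @{ "If-Match" = $etag } -OutFile "stdout" -UseBasicParsing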
Please avoid changing source blobs while they are being downloaded. If you have to change the source blobs at the same time, you can try the following workaround:
First, take a snapshot of the source blob, then download the blob with AzCopy with the /Snapshot option specified, so that AzCopy tries to download the source blob and all of its snapshots. Although the download of the source blob may fail with 412 (Precondition Failed), the download of the snapshot can succeed. The file name of the downloaded snapshot is: {blob name without extension} ({snapshot timestamp}).{extension}.
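A minimal sketch of that workaround, assuming the classic Azure PowerShell storage cmdlets and using a placeholder for the storage key:
# Take a snapshot of the source blob; the snapshot is immutable, so its ETag cannot change mid-download
$ctx  = New-AzureStorageContext -StorageAccountName "msftcampusdata" -StorageAccountKey "<storage key>"
$blob = Get-AzureStorageBlob -Container "crunch88-1" -Blob "hivehost/stdout" -Context $ctx
$blob.ICloudBlob.CreateSnapshot()
# Download the blob and its snapshots; even if the base blob fails with 412, the snapshot copy succeeds
AzCopy /Y /Snapshot /Source:https://msftcampusdata.blob.core.windows.net/crunch88-1 `
       /Dest:c:\hive\extracts\data\ /SourceKey:"<storage key>" /Pattern:hivehost/stdout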
For further information about AzCopy and the /Snapshot option, please refer to Getting Started with the AzCopy Command-Line Utility.
Some Updates:
Did you terminate AzCopy and then resume it with the same command line? If so, you need to make sure the source blob wasn't changed after the previous execution, because AzCopy has to ensure the source blob remains unchanged between the moment it first started downloading it and the moment the download completes successfully. To check whether a resume occurred, look for the following in the AzCopy output: "Incomplete operation with same command line detected at the journal directory {Dir Path}, AzCopy will start to resume."
Because /Y is specified in your command line, the resume prompt is always answered "Yes". To avoid the resume behavior, you can clean up the default journal folder "%LocalAppData%\Microsoft\Azure\AzCopy" before executing AzCopy, or specify /Z: to configure a unique journal folder for each execution.
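For example, a small sketch of both options (the paths reuse the command from the question, the storage key is a placeholder, and the journal folder name is arbitrary):
# Option 1: clear the default journal folder so no previous run can be resumed
$journal = Join-Path $env:LOCALAPPDATA "Microsoft\Azure\AzCopy"
if (Test-Path $journal) { Remove-Item $journal -Recurse -Force }
# Option 2: give every execution its own journal folder via /Z:
AzCopy /Y /Z:"c:\hive\journal\run-$(Get-Date -Format yyyyMMddHHmmss)" `
       /Source:https://msftcampusdata.blob.core.windows.net/crunch88-1 `
       /Dest:c:\hive\extracts\data\ /SourceKey:"<storage key>" /Pattern:hivehost/stdout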
Related
I need to get to two different Excel files that are nested within 360 parent directories (XXXX), each with a \ME (Month End) directory, then a year directory, and finally a yyyymm directory.
Example: Z500\ME\2022\202205\Z500_contributions_202205.xls.
I tried the Copy Data activity and killed it while it was still spinning on the "listing source" step. I thought about the Lookup and Get Metadata activities, but those have limits of 5,000 rows. Any thoughts on what would be the fastest way to do this?
Code for creating the file list (I'll clean the results up in Excel):
dir L:*.xls /s /b > "C:\Foo.txt"
Right now I am creating a list of files with the DOS dir command, hoping that if I give the copy activity a file list it will run faster because it doesn't have to go through the "list sources" step and interrogate the filesystem.
Any thoughts on an ADF option?
If you are facing issues with the copy activity, you can instead try azcopy, which can also be used for copying from on-premises to Blob storage.
You can try the command below:
azcopy copy "<local path>/*" "https://<storage account name>.blob.core.windows.net/<container name>/<path to blob>" --recursive=true --include-pattern "*.xlsx"
Please go through this Microsoft documentation to learn how to use azcopy.
The above command copies all the Excel files from the nested folders recursively, but it copies the folder structure to blob storage as well.
After copying to blob storage, you can use Start-AzureStorageBlobCopy in PowerShell to copy all the Excel files from the nested folders into a single folder.
Start-AzureStorageBlobCopy -SrcFile $sourcefile -DestCloudBlob "Destination path"
Please refer to this SO thread to list out the files in the blob recursively.
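As a rough sketch of that flattening step (the account, key, container name, and the "flat/" prefix are placeholders, and the older Azure.Storage cmdlets are assumed):
$ctx = New-AzureStorageContext -StorageAccountName "<account>" -StorageAccountKey "<key>"
# Enumerate every .xlsx blob in the container, whatever folder it sits in,
# and copy it under a single "flat/" prefix, keeping only the file name
Get-AzureStorageBlob -Container "<container name>" -Context $ctx |
    Where-Object { $_.Name -like "*.xlsx" } |
    ForEach-Object {
        $flatName = "flat/" + $_.Name.Split('/')[-1]
        Start-AzureStorageBlobCopy -SrcContainer "<container name>" -SrcBlob $_.Name `
            -DestContainer "<container name>" -DestBlob $flatName -Context $ctx
    }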
If you are creating the list of files on-premises, then you can use either azcopy or the copy activity, as you wish.
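If you do go the file-list route, azcopy can also consume that list directly through --list-of-files; the paths inside the list must be relative to the source you pass on the command line (the drive letter, account, and SAS below are placeholders):
# Build a list of paths relative to L:\ for every .xls, then hand the list to azcopy
Get-ChildItem L:\ -Recurse -Filter *.xls |
    ForEach-Object { $_.FullName.Substring(3) } |    # strip the leading "L:\"
    Set-Content C:\Foo.txt
azcopy copy "L:\" "https://<storage account name>.blob.core.windows.net/<container name>?<SAS>" --list-of-files "C:\Foo.txt"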
Here I am using azcopy with a SAS token; you can use it either with SAS or with Azure Active Directory, as mentioned in the documentation above.
Is there an option to handle the following situation?
I have a pipeline with a Copy Files task in it; it is used to upload a static HTML file from git to a blob. Everything works perfectly. But sometimes I need this file to be changed in blob storage (using hosted application tools). So the question is: can I "detect" whether my git file is older than the target blob file and have the copy task skip it, leaving it untouched? My initial idea was to use Azure File Copy and its "Optional Arguments" textbox; however, I couldn't find the required option in the documentation. Does it allow such things, or should this case be handled some other way?
I think you're looking for the ifSourceNewer value for the --overwrite option.
--overwrite string Overwrite the conflicting files and blobs at the destination if this flag is set to true. (default true) Possible values include true, false, prompt, and ifSourceNewer.
More info: azcopy copy - Options
Agree with ickvdbosch. The ifSourceNewer value for the --overwrite option could meet your requirements.
error: couldn't parse "ifSourceNewer" into a "OverwriteOption"
Based on my test, I could reproduce this issue in the Azure File Copy task.
It seems that the ifSourceNewer value cannot be passed to the Overwrite option in the Azure File Copy task.
Workaround: you could use a PowerShell task to run the azcopy command and upload the files with --overwrite=ifSourceNewer.
For example:
azcopy copy "filepath" "BlobURLwithSASToken" --overwrite=ifSourceNewer --recursive
For more detailed info, you could refer to this doc.
For the issue with the Azure File Copy task, I suggest that you submit a feedback ticket via the following link: Report task issues.
I wrote a PowerShell script using azcopy to sync a local folder to my Azure blob storage account.
When the script finishes, it states that all the files have been successfully uploaded.
I want to restart the script as soon as it finishes, so that the folder is permanently synced with the cloud. Is this possible? How?
You can synchronize local storage with Azure Blob storage by using the AzCopy v10 command-line utility, i.e., you can synchronize the contents of your local file system with a blob container.
Note that this synchronization is one way. In other words, you choose which of these two endpoints is the source and which one is the destination.
In your case, the container is the destination, and the local file system is the source.
Here is the Syntax:
azcopy sync 'https://<storage-account-name>.blob.core.windows.net/<container-name>' 'C:\myDirectory' --recursive
and an example:
azcopy sync 'https://mystorageaccount.blob.core.windows.net/mycontainer' 'C:\myDirectory' --recursive
Command reference: azcopy sync
Note: If you set the --delete-destination flag to true (default false), AzCopy deletes files without providing a prompt. If you want a prompt to appear before AzCopy deletes a file, set the --delete-destination flag to prompt.
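To keep the folder permanently synced, one simple approach is to wrap the sync call in a loop (a sketch with placeholder names; a SAS token on the destination URL is assumed, and the pause length is arbitrary):
# Re-run the sync as soon as the previous run finishes; stop with Ctrl+C
while ($true) {
    azcopy sync 'C:\myDirectory' 'https://mystorageaccount.blob.core.windows.net/mycontainer?<SAS>' --recursive
    Start-Sleep -Seconds 60   # short pause so runs with no changes don't hammer the service
}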
When running the AzCopy copy command to get 2 pictures from a blob container/folder, it results in a 404 "blob not found" error. The error does not occur if I specify the filename (it downloads the folder structure with the file in it).
I tested 3 different versions of the code but cannot get a recursive version to work.
Examples that do not work:
azcopy copy "https://bloburl.blob.core.windows.net/Container/Folder/*?SASKey" "C:\Users\MyFolder\Pictures"
azcopy copy "https://bloburl.blob.core.windows.net/Container/Folder?SASKey" "C:\Users\MyFolder\Pictures"
Example that works:
azcopy copy "https://bloburl.blob.core.windows.net/Container/Folder/UNSC_Infinity.jpg?SASKey" "C:\Users\MyFolder\Pictures"
My goal is to download all the files in the blob container/folder, not the folder structure itself.
SAS URI: https://s00aops01stg01blkbsa.blob.core.windows.net/5740-christianmimms?st=2019-06-03T17%3A09%3A34Z&se=2020-06-04T17%3A09%3A00Z&sp=rwl&sv=2018-03-28&sr=c&sig=3HyCaMQ1JCkb4Yof%2BOWExx8amHtPTmZHpEZbZPX8Iqs%3D
SAS URI w/Folder: https://s00aops01stg01blkbsa.blob.core.windows.net/5740-christianmimms/images/*?st=2019-06-03T17%3A09%3A34Z&se=2020-06-04T17%3A09%3A00Z&sp=rwl&sv=2018-03-28&sr=c&sig=3HyCaMQ1JCkb4Yof%2BOWExx8amHtPTmZHpEZbZPX8Iqs%3D
Currently we do our data loads from an on-premises Hadoop server to SQL DW [via ADF staged copy and the on-premises DMG server]. We noticed that the ADF pipelines fail when there are no files in the on-premises Hadoop server location [we do not expect our upstreams to send files every day, so having ZERO files in that location is a valid scenario].
Do you have a solution for this kind of scenario?
The error message is given below:
Failed execution Copy activity encountered a user error:
ErrorCode=UserErrorFileNotFound,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Cannot
find the 'HDFS' file.
,Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.Net.WebException,Message=The
remote server returned an error: (404) Not Found.,Source=System,'.
Thanks,
Aravind
This requirement can be solved by using the ADF v2 Get Metadata activity to check for file existence and then skipping the copy activity if the file or folder does not exist:
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
You can change the File Path Type to Wildcard, add the name of the file, and add a "*" at the end of the name or in any other place that suits you.
This is a simple way to stop the pipeline from failing when there is no file.
Do you have an Input Dataset for your pipeline? See if you can skip your Input Dataset dependency.
Mmmm, this is a tricky one. I'll upvote the question, I think.
A couple of options I can think of here...
1) I would suggest the best approach is to create a custom activity that runs ahead of the copy to check the source directory first. This could handle the behaviour when there isn't a file present, rather than just throwing an error, and you could code it to return gracefully and not block the downstream ADF activities.
2) Use some PowerShell to inspect the ADF activity for the missing-file error, then simply set the dataset slice to either Skipped or Ready using the cmdlet to override the status.
For example:
Set-AzureRmDataFactorySliceStatus `
-ResourceGroupName $ResourceGroup `
-DataFactoryName $ADFName.DataFactoryName `
-DatasetName $Dataset.OutputDatasets `
-StartDateTime $Dataset.WindowStart `
-EndDateTime $Dataset.WindowEnd `
-Status "Ready" `
-UpdateType "Individual"
This of course isn't ideal, but would be quicker to develop than a custom activity using Azure Automation.
Hope this helps.
I know I'm late to the party, but if you're like me and running into this issue, it looks like they made an update a while back to allow for no files found.