Loading Amazon Redshift with a manifest, with an error in one file

When using the COPY command to load Amazon Redshift with a manifest, suppose one of the files contains an error.
Is there a way to just log the error for that file, but continue loading the other files?

The manifest file indicates whether a file is mandatory and whether an error should be generated if a file is not found. (Using a Manifest to Specify Data Files)
The COPY command will retry if it cannot read a file. (Errors When Reading Multiple Files)
The COPY command can specify a MAXERROR parameter that permits a certain number of errors before the COPY command fails, as illustrated in the sketch below. (MAXERROR)
When loading data from files, Amazon Redshift will report any errors in the STL_LOAD_ERRORS table. (STL_LOAD_ERRORS)
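For illustration, here is a minimal sketch of a manifest-driven load, assuming a hypothetical target table my_table, a placeholder IAM role, and a manifest at s3://my-bucket/load.manifest (all names are placeholders):

COPY my_table
FROM 's3://my-bucket/load.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
MANIFEST
MAXERROR 10;

-- Inspect the rejected rows, including which file each one came from
SELECT query, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 20;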

As noted above, the MAXERROR parameter should satisfy this requirement.
In addition, the NOLOAD parameter checks the validity of the data without loading it. Running COPY with NOLOAD is much faster, as it only parses the files; a sketch follows.
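As a sketch of that validation pass (same placeholder table, role, and manifest as above):

COPY my_table
FROM 's3://my-bucket/load.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
MANIFEST
NOLOAD;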

Related

Redshift system view with S3 URI to manifest

Is there any Redshift system view that shows manifest file used during COPY command?
I tried to find it in STL_LOAD_COMMITS, but that view contains only the data file paths. STL_FILE_SCAN looked useful, but it did not help either.
I could record the link to the manifest when dynamically building the COPY command in my Python script, but I would like to join it with the Redshift system views. The manifest is new for each COPY command, so it would be a good candidate for a hashed join key.
Interesting question. I believe you will need to parse stl_querytext to extract the manifest file name.
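A rough sketch of that approach, using only standard system-view columns (the string parsing of the manifest URI out of the reconstructed statement is left out here):

-- Reassemble the COPY statement text for queries that committed loads;
-- the S3 URI of the manifest appears in the FROM clause of each statement.
SELECT q.query,
       LISTAGG(q.text, '') WITHIN GROUP (ORDER BY q.sequence) AS copy_sql
FROM stl_querytext q
WHERE q.query IN (SELECT DISTINCT query FROM stl_load_commits)
GROUP BY q.query;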

Azure Data Factory copy data - how do I save an http downloaded zip to blob store?

I have a simple copy data activity, with an HTTP connector source, and Azure Blob Storage as the sink. The file is a zip file so I am using a binary dataset for source and a binary for sink.
The data is properly fetched (I believe - looking at bytes transferred). However, I cannot save it to the Blob Store. In this scenario, you do not get to set the filename, only the path (container/directory). The filename used is the name of the file that I fetched.
However, the filename used in the sink step is prefixed with a backslash. The backslash does not exist in the source filename, I can find no way to remove it, and with a filename like that, I get a failure:
Failure happened on 'Sink' side. ErrorCode=UserErrorFailedFileOperation,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Upload file failed at path extract/coEDW\XXXX_Data_etc.zip.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.WindowsAzure.Storage.StorageException,Message=The remote server returned an error: (404) Not Found.,Source=Microsoft.WindowsAzure.Storage,StorageExtendedMessage=The specified resource does not exist. RequestId:bfe4e2f6-501e-002e-6a21-eaf10e000000 Time:2021-01-14T02:59:24.3300081Z,,''Type=System.Net.WebException,Message=The remote server returned an error: (404) Not Found.,Source=Microsoft.WindowsAzure.Storage,'
(filename masked by me)
I am sure the fix is simple, but I cannot figure this out. Can anyone help?
Thanks.
You will have to add a dynamic file name for your Blob sink.
In the sink dataset, set the file name using a dynamic content expression built from variables; for example, an expression along the lines of @concat('output_', formatDateTime(utcnow(), 'yyyyMMddHHmmss'), '.zip') (the prefix and format here are only placeholders) gives the file name date and time fields so that each file is marked with its date and time.
Let me know if that works.

How to set file type when using TextIO.write to Google Cloud Storage

I wrote a Dataflow pipeline that outputs a single small CSV file on Google Cloud Storage. The content type of that file is text/plain, but I want it to be application/csv.
This is the code I use:
TextIO.write()
    .to("gs://bucket/path/to/filename")
    .withoutSharding()
    .withSuffix(".csv")
    .withDelimiter(new char[]{'\r', '\n'})
How do I specify the content type so that it will be application/csv after the pipeline completes?
TextIO always writes content type text/plain. This is configured here: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextSink.java#L95
One option might be to update the content type of the objects after your Dataflow pipeline has finished writing them. This can be done with the gsutil tool, for example something like gsutil setmeta -h "Content-Type:application/csv" gs://bucket/path/to/filename.csv (the path here matches the snippet above). See https://cloud.google.com/storage/docs/gsutil/commands/setmeta for more information.

n_tied_num.mdef and alltriphones.mdef for reading was not found

While I was training my native language (Amharic) using SphinxTrain-5prealpha, a fatal error happened when creating the prune tree and training the context-dependent models.
The training WAV data is about 19 hours, with WAVFILE_SRATE 16000.
Check this link to see the sphinx_train.cfg file.
The log files are:
It is supposed to create these files by itself. Is there any configuration that is needed, or some configuration that I have missed?
After the error happened, I looked into the files and folders created during the training process. Among them there is a folder named model_architecture.
In the model_architecture folder there are, among others, two mdef files: YOUR_DB_NAME.ci.mdef and YOUR_DB_NAME.untied.mdef. (My database name was amharic, as you can see from the error log, so mine was amharic.untied.mdef.)
I copied both mdef files within the model_architecture folder and RENAMED the copy of YOUR_DB_NAME.ci.mdef to YOUR_DB_NAME.3000.mdef (3000 is the n_tied number, so it could be different for you) and the copy of YOUR_DB_NAME.untied.mdef to YOUR_DB_NAME.alltriphones.mdef.
If it is a directory that is missing, do the same.

How to copy a local directory with files to a remote server in Talend

In Talend (Data Integration) I am trying to copy a local directory to a remote directory, but when I run the job I can only copy files, not the folders inside the directory. Please help me with this job.
In my Talend job I am using local connection and remote connection components:
tFileList -> tFileProperties (to store path and name in one table) -> tMSSqlInput (extracting the path from that table) -> iteration -> tSSH (if the directory is not available, create it) -> finally tFTPPut to connect and copy to the remote directory.
When I store the entries in a table using tFileProperties, files get a non-zero size but folders come through with a size of zero. I use this condition to create the directory with the tSSH component, but I am unable to create the folders. Please help me.
Do you get an error message?
I believe the output of the tMSSqlInput should be row-based, rather than an iteration. That might be the source of the problem.
From the tMSSqlInput docs:
tMSSqlInput executes a DB query with a strictly defined order which
must correspond to the schema definition. Then it passes on the field
list to the next component via a Main row link.