I am using Kinesis Firehose to deliver data to a Redshift database. I am stuck at the point where Firehose tries to execute the COPY command from the saved stream object in the S3 bucket.
The error is
ERROR:Invalid value.
That's all. To narrow it down, I tried to reproduce the error without a manifest:
COPY firehose_test_table FROM 's3://xx/terraform-kinesis-firehose-test-stream-2-1-2022-05-19-14-37-02-53dc5a65-ae25-4089-8acf-77e199fd007c.gz' CREDENTIALS 'aws_iam_role=arn:aws:iam::xx' format as json 'auto ignorecase';
The data inside the .gz is the default AWS demo streaming data:
{"CHANGE":0.58,"PRICE":13.09,"TICKER_SYMBOL":"WAS","SECTOR":"RETAIL"}{"CHANGE":1.17,"PRICE":177.33,"TICKER_SYMBOL":"BNM","SECTOR":"TECHNOLOGY"}{"CHANGE":-0.78,"PRICE":29.5,"TICKER_SYMBOL":"PPL","SECTOR":"HEALTHCARE"}{"CHANGE":-0.5,"PRICE":41.47,"TICKER_SYMBOL":"KFU","SECTOR":"ENERGY"}
and the target table is defined as:
Create table firehose_test_table
(
ticker_symbol varchar(4),
sector varchar(16),
change float,
price float
);
I am not sure what to do next; the error is too uninformative to pinpoint the problem. I also tried JSONPaths by defining
{
"jsonpaths": [
"$['change']",
"$['price']",
"$['ticker_symbol']",
"$['sector']"
]
}
however, the same error was raised. What am I missing?
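For reference, the jsonpaths variant pointed COPY at that file instead of 'auto ignorecase', along these lines (bucket and key shortened the same way as above):
COPY firehose_test_table FROM 's3://xx/terraform-kinesis-firehose-test-stream-2-...gz' CREDENTIALS 'aws_iam_role=arn:aws:iam::xx' format as json 's3://xx/jsonpaths.json';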
A few things to try...
Specify GZIP in the COPY options configuration. This is explicitly stated in the Kinesis Delivery Stream documentation:
Parameters that you can specify in the Amazon Redshift COPY command. These might be required for your configuration. For example, "GZIP" is required if Amazon S3 data compression is enabled.
Explicitly specify Redshift column names in the Kinesis Delivery Stream configuration. The order of the comma-separated list of column names must match the order of the fields in the message: change,price,ticker_symbol,sector.
Query the STL_LOAD_ERRORS Redshift system table (STL_LOAD_ERRORS docs) to view the error details of the COPY command. You should be able to see the exact error. Example: select * from stl_load_errors order by starttime desc limit 10;
Verify that no varchar field exceeds its column size limit. You can specify the TRUNCATECOLUMNS COPY option if this is acceptable for your use case (TRUNCATECOLUMNS docs). A combined example COPY command is shown after this list.
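Putting the GZIP, column-list and TRUNCATECOLUMNS suggestions together, a manual test COPY against the gzipped Firehose object would look roughly like this (object key and role ARN shortened as in the question; adjust names to your own setup):
COPY firehose_test_table (change, price, ticker_symbol, sector)
FROM 's3://xx/terraform-kinesis-firehose-test-stream-2-...gz'
CREDENTIALS 'aws_iam_role=arn:aws:iam::xx'
FORMAT AS JSON 'auto ignorecase'
GZIP
TRUNCATECOLUMNS;
In the delivery stream configuration, the equivalent is listing the column names in the Columns setting and adding GZIP (and, if acceptable, TRUNCATECOLUMNS) to the COPY options.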
Related
I am loading data from SQL Server into Delta Lake tables. Recently I had to repoint the source to another table (same columns), but one column's data type is different in the new table. This causes an error while loading data into the Delta table. I am getting the following error:
Failed to merge fields 'COLUMN1' and 'COLUMN1'. Failed to merge incompatible data types LongType and DecimalType(32,0)
The command I use to write data to the Delta table:
DF.write.mode("overwrite").format("delta").option("mergeSchema", "true").save("s3 path")
The only option I can think of right now is to set overwriteSchema to true, along the lines of the sketch below (the S3 path is still a placeholder):
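DF.write.mode("overwrite").format("delta").option("overwriteSchema", "true").save("s3 path")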
But this will rewrite my target schema completely. I am concerned that any sudden change in the source schema would then replace the existing target schema without any notification or alert.
Also, I can't explicitly convert these columns, because the Databricks notebook I am using is a parameterized one used to load data from source to target (we read a CSV file that contains all the details about the target table, source table, partition key, etc.).
Is there any better way to tackle this issue?
Any help is much appreciated!
We have an Azure Data Factory data flow that sinks into Delta. We have the Overwrite and Allow insert options set and Vacuum = 1.
When we run the pipeline over and over with no change in the table structure, the pipeline succeeds.
But when the structure of the table being written to the sink changes, e.g. data types change, the pipeline fails with the error below.
Error code: DFExecutorUserError
Failure type: User configuration issue
Details: Job failed due to reason: at Sink 'ConvertToDelta': Job aborted.
We tried setting Vacuum to 0 and back, setting Merge schema, and switching between Overwrite and Truncate back and forth; the pipeline still failed.
Can you try enabling Delta Lake's schema evolution (more information)? By default, Delta Lake has schema enforcement enabled, which means that a change coming from the source table is not allowed and results in an error.
Even with overwrite enabled, unless you specify schema evolution, overwrite will fail because by default the schema cannot be changed.
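Outside the ADF UI, the same idea can be sketched in PySpark; the options below are the standard Delta schema-evolution switches, the paths are placeholders, and in the data flow sink the corresponding setting is the Merge schema option:
# Assumes a Delta-enabled Spark session (e.g. Databricks) and an existing DataFrame df
(df.write.format("delta")
    .mode("overwrite")
    .option("mergeSchema", "true")        # schema evolution: accept newly added columns
    # .option("overwriteSchema", "true")  # replace the target schema entirely (e.g. for data type changes)
    .save("abfss://container@account.dfs.core.windows.net/output/delta"))  # placeholder path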
I created an ADLS Gen2 storage account, created input and output folders, and uploaded a Parquet file into the input folder.
I created a pipeline and created a data flow as below:
I have taken the Parquet file as the source.
Dataflow Source:
Dataset of Source:
Data preview of Source:
I created a derived column transformation to change the structure of the table.
Derived column:
I updated the 'difficulty' column of the Parquet file, changing its data type from long to double using the code below:
difficulty : toDouble(difficulty)
Image for reference:
I updated the 'transactions_len' column of the Parquet file, changing its data type from integer to float using the code below:
transactions_len : toFloat(transactions_len)
I updated the 'number' column of the Parquet file, changing its data type from long to string using the code below:
number : toString(number)
Image for reference:
Data preview of Derived column:
I have taken Delta as the sink.
Dataflow sink:
Sink settings:
Data preview of Sink:
I ran the pipeline and it executed successfully.
Image for reference:
The output was stored successfully in my storage account's output folder.
Image for reference:
The procedure worked on my machine; please recheck from your end.
The source (ingestion) was generated to Azure Blob with a specific filename. Whenever we generated the source Parquet files without specifying a specific filename, only a directory, the sink worked.
Has anybody had any success in applying a run_date parameter when creating a Transfer in BigQuery using the Transfer Service UI?
I'm taking a CSV file from Google Cloud storage and I want to mirror this into my ingestion date partitioned table, table_a.
Initially I set the destination table as table_a, which resulted in the following message in the job log:
Partition suffix has to be set for date-partitioned tables. Please recreate your transfer config with a valid table name. For example, to load new files to partition of the run date, specify your table name as transferTest${run_date} for daily partitioning or transferTest${run_time|"%Y%m%d%H"} for hourly partitioning.
I then set the destination to table_a$(run_date), which issues the warning:
Invalid table name. Please use only alphabetic, numeric characters or underscore with supported parameters wrapped in brackets.
However, it won't accept table_a_(run_date) either. Could anyone please advise?
best wishes
Dave
Apologies, I've identified the correct syntax now:
table_a_{run_date}
I have created a CDC task that captures changes in a source PostgreSQL schema and writes them in Parquet format into a target S3 bucket. The task captures the inserts, updates and deletes correctly but fails to capture column name and type changes in the source.
When I change a column name or type of a table in the source and insert new rows to the table, the resulting Parquet file uses the old column name and type.
Is there a specific configuration I am missing, or is it not possible to achieve the desired outcome with this task in DMS?
If you change a column at the source, DMS will pick it up automatically and update the destination. Check your DMS settings. You do not need to manually add the column at the destination.
Make sure you have the HandleSourceTableAltered parameter set to true in the task settings.[1] (The setting applies when the target metadata parameter BatchApplyEnabled is set to either true or false.)
Same goes for HandleSourceTableDropped or HandleSourceTableTruncated if this is relevant in your case.
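These parameters live in the ChangeProcessingDdlHandlingPolicy block of the task settings JSON; an excerpt would look roughly like this (values shown for illustration):
"ChangeProcessingDdlHandlingPolicy": {
    "HandleSourceTableDropped": true,
    "HandleSourceTableTruncated": true,
    "HandleSourceTableAltered": true
}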
Obviously, previously replicated Parquet files on S3 will not change to reflect this DDL change on the source.
[1] https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.DDLHandling.html
When attempting a MERGE statement, BigQuery only scans the requested partitions UNTIL the DELETE statement is added, at which point it reverts to scanning the whole dataset (blossoming from 1 GB to more than 1 TB in this case).
Is there a way to use the full features of MERGE, including DELETE, without incurring the extra cost?
Generic sample that matches my effort below:
MERGE target_table AS t  ## all dates, partitioned on activity_date
USING source_table AS s  ## one date, only yesterday
ON t.field_a = s.field_a
  AND t.activity_date >= DATE_ADD(DATE(current_timestamp(), 'America/Los_Angeles'), INTERVAL -1 DAY)  ## use partition to limit to yesterday
WHEN MATCHED THEN
  UPDATE SET field_b = s.field_b
WHEN NOT MATCHED THEN
  INSERT (field_a, field_b)
  VALUES (field_a, field_b)
WHEN NOT MATCHED BY SOURCE THEN
  DELETE
Based on the query you have provided, it is not expected behavior for the merge to be applied to the whole dataset. After the query has run, you should analyze your dataset and check its validity to ensure that the query only ran on the specified partitions.
If, after further inspection, no unexpected changes were made to your dataset, the 1 TB noted may simply be explained by BigQuery reading that data into memory as a side step in order to run the query.
However, to confirm, it is recommended to submit a ticket in the issue tracker with your BigQuery job ID so that BigQuery engineering can properly inspect the issue.
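If the extra bytes do come from the DELETE branch considering every partition, one workaround worth testing (shown only as a sketch, not confirmed behavior) is to repeat the partition predicate as a search condition on the WHEN NOT MATCHED BY SOURCE clause, so rows outside yesterday's partition are never candidates for deletion:
MERGE target_table AS t
USING source_table AS s
ON t.field_a = s.field_a
  AND t.activity_date >= DATE_ADD(DATE(current_timestamp(), 'America/Los_Angeles'), INTERVAL -1 DAY)
WHEN MATCHED THEN
  UPDATE SET field_b = s.field_b
WHEN NOT MATCHED THEN
  INSERT (field_a, field_b) VALUES (field_a, field_b)
WHEN NOT MATCHED BY SOURCE
  AND t.activity_date >= DATE_ADD(DATE(current_timestamp(), 'America/Los_Angeles'), INTERVAL -1 DAY)  ## same filter, applied to the delete branch
THEN DELETE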