BigQuery Transfer Service UI - run_date parameter - google-cloud-storage

Has anybody had any success applying a run_date parameter when creating a transfer in BigQuery using the Transfer Service UI?
I'm taking a CSV file from Google Cloud Storage and I want to mirror it into my ingestion-date-partitioned table, table_a.
Initially I set the destination table to table_a, which resulted in the following message in the job log:
Partition suffix has to be set for date-partitioned tables. Please recreate your transfer config with a valid table name. For example, to load new files to partition of the run date, specify your table name as transferTest${run_date} for daily partitioning or transferTest${run_time|"%Y%m%d%H"} for hourly partitioning.
I then set the destination to table_a$(run_date), which produced the warning:
Invalid table name. Please use only alphabetic, numeric characters or underscore with supported parameters wrapped in brackets.
However, it won't accept table_a_(run_date) either. Could anyone please advise?
best wishes
Dave

Apologies, I've identified the correct syntax now:
table_a_{run_date}
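For anyone setting this up programmatically rather than through the UI, here is a rough sketch using the Python client library (project, dataset, bucket and schedule are placeholders, and the parameter names are my best understanding of the Cloud Storage data source). Per the job-log message above, a $-prefixed suffix such as table_a${run_date} targets a partition of the date-partitioned table, while table_a_{run_date} produces a date-suffixed table name:

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",                      # placeholder
    display_name="GCS CSV to table_a",
    data_source_id="google_cloud_storage",
    params={
        "data_path_template": "gs://my-bucket/path/*.csv",    # placeholder
        # Parameters are wrapped in braces; use table_a${run_date} instead to
        # write into the ingestion-date partition of table_a.
        "destination_table_name_template": "table_a_{run_date}",
        "file_format": "CSV",
        "skip_leading_rows": "1",
    },
    schedule="every 24 hours",
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path("my-project"),          # placeholder
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")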

Related

Delta Lake Data Load Datatype mismatch

I am loading data from SQL Server into Delta Lake tables. Recently I had to repoint the source to another table (same columns), but one of the data types is different in the new table. This causes an error while loading data into the Delta table. I am getting the following error:
Failed to merge fields 'COLUMN1' and 'COLUMN1'. Failed to merge incompatible data types LongType and DecimalType(32,0)
The command I use to write data to the Delta table:
DF.write.mode("overwrite").format("delta").option("mergeSchema", "true").save("s3 path")
The only option I can think of right now is to set overwriteSchema to true. But this will rewrite my target schema completely, and I am concerned that a sudden change in the source schema would then replace the existing target schema without any notification or alert.
Also, I can't explicitly convert these columns, because the Databricks notebook I am using is a parametrized one used to load data from source to target (we read a CSV file that contains all the details about the target table, source table, partition key, etc.).
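For illustration, an explicit cast driven by the target table's schema would look something like the sketch below (assuming a SparkSession spark and the source DataFrame DF from above; the Delta path is a placeholder):

import pyspark.sql.functions as F

target_path = "s3://bucket/path/to/delta_table"   # placeholder
target_schema = spark.read.format("delta").load(target_path).schema

# Cast each incoming column to the type already present in the target table,
# so e.g. a LongType COLUMN1 becomes DecimalType(32,0) before the write.
# The question says the source has the same columns as the target.
aligned_df = DF.select([
    F.col(field.name).cast(field.dataType).alias(field.name)
    for field in target_schema.fields
])

aligned_df.write.mode("overwrite").format("delta").save(target_path)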
Is there any better way to tackle this issue?
Any help is much appreciated!

Invalid value Error on AWS Redshift delivery by Firehose

I am using Kinesis Firehose to deliver to a Redshift database. I am stuck at the point where Firehose tries to execute the COPY query from the saved stream objects in the S3 bucket.
The error is
ERROR:Invalid value.
That's all. To investigate the error, I tried to reproduce it without the manifest:
COPY firehose_test_table FROM 's3://xx/terraform-kinesis-firehose-test-stream-2-1-2022-05-19-14-37-02-53dc5a65-ae25-4089-8acf-77e199fd007c.gz' CREDENTIALS 'aws_iam_role=arn:aws:iam::xx' format as json 'auto ignorecase';
The data inside the .gz is default AWS streaming data,
{"CHANGE":0.58,"PRICE":13.09,"TICKER_SYMBOL":"WAS","SECTOR":"RETAIL"}{"CHANGE":1.17,"PRICE":177.33,"TICKER_SYMBOL":"BNM","SECTOR":"TECHNOLOGY"}{"CHANGE":-0.78,"PRICE":29.5,"TICKER_SYMBOL":"PPL","SECTOR":"HEALTHCARE"}{"CHANGE":-0.5,"PRICE":41.47,"TICKER_SYMBOL":"KFU","SECTOR":"ENERGY"}
and the target table is defined as:
Create table firehose_test_table
(
ticker_symbol varchar(4),
sector varchar(16),
change float,
price float
);
I am not sure what to do next; the error is too unrevealing to understand the problem. I also tried JSONPaths by defining:
{
  "jsonpaths": [
    "$['change']",
    "$['price']",
    "$['ticker_symbol']",
    "$['sector']"
  ]
}
however, the same error was raised. What am I missing?
A few things to try...
Specify GZIP in the COPY options configuration. This is explicitly stated in the Kinesis Delivery Stream documentation.
Parameters that you can specify in the Amazon Redshift COPY command. These might be required for your configuration. For example, "GZIP" is required if Amazon S3 data compression is enabled.
Explicitly specify Redshift column names in the Kinesis Delivery Stream configuration. The order of the comma-separated list of column names must match the order of the fields in the message: change,price,ticker_symbol,sector.
Query STL_LOAD_ERRORS Redshift table (STL_LOAD_ERRORS docs) to view error details of the COPY command. You should be able to see the exact error. Example: select * from stl_load_errors order by starttime desc limit 10;
Verify that no varchar field exceeds its column size limit. You can specify the TRUNCATECOLUMNS COPY option if this is acceptable for your use case (TRUNCATECOLUMNS docs).
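Putting the GZIP, column-list and STL_LOAD_ERRORS suggestions together, a rough sketch of the manual check (using psycopg2; connection details and the S3 object key are placeholders):

import psycopg2

# Connection details are placeholders.
conn = psycopg2.connect(
    host="my-cluster.xxxx.eu-west-1.redshift.amazonaws.com",
    dbname="dev", user="awsuser", password="***", port=5439,
)
conn.autocommit = True
cur = conn.cursor()

# COPY with an explicit column list and GZIP, mirroring what the delivery
# stream should be configured to run.
cur.execute("""
    COPY firehose_test_table (change, price, ticker_symbol, sector)
    FROM 's3://xx/your-firehose-object.gz'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::xx'
    FORMAT AS JSON 'auto ignorecase'
    GZIP;
""")

# If it still fails, the real reason shows up in stl_load_errors.
cur.execute("SELECT err_reason, colname, raw_line FROM stl_load_errors ORDER BY starttime DESC LIMIT 10;")
for row in cur.fetchall():
    print(row)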

Cassandra Alter Column type from Timestamp to Date

Is there any way to alter a Cassandra column from timestamp to date without data loss? For example, '2021-02-25 20:30:00+0000' to '2021-02-25'.
If not, what is the easiest way to migrate this column(timestamp) to the new column(date)?
It's impossible to change the type of an existing column, so you need to add a new column with the correct data type and perform a migration. The migration can be done via Spark + the Spark Cassandra Connector; it's the most flexible solution, and it can even be done from a single machine with Spark running in local master mode (the default). The code could look something like this (try it on test data first):
import pyspark.sql.functions as F
options = { "table": "tbl", "keyspace": "ks"}
spark.read.format("org.apache.spark.sql.cassandra").options(**options).load()\
.select("pk_col1", "pk_col2", F.col("timestamp_col").cast("date").alias("new_name"))\
.write.format("org.apache.spark.sql.cassandra").options(**options).save()
P.S. You can use DSBulk instead, for example, but you need to have enough space to offload the data (although you only need the primary key columns + your timestamp column).
To add to Alex Ott's answer, there are validations in Cassandra that prevent changing the data type of a column. The reason is that SSTables (Cassandra data files) are immutable -- once they are written to disk, they are never modified/edited/updated. They can only be compacted into new SSTables.
Some try to get around this by dropping the column from the table and then adding it back with a new data type. Unlike in a traditional RDBMS, the existing data in the SSTables doesn't get updated, so if you try to read the old data you'll get a CorruptSSTableException because the CQL type of the data on disk won't match that of the schema.
For this reason, it is no longer possible to drop/recreate columns with the same name (CASSANDRA-14948). If you're interested, I've explained it in a bit more detail in this post -- https://community.datastax.com/questions/8018/. Cheers!
You can use toDate() to convert it when reading. For example, table Email has a column Date in the format 2001-08-29 13:03:35.000000+0000.
SELECT Date, toDate(Date) AS Convert FROM keyspace.Email;

 date                             | convert
----------------------------------+------------
 2001-08-29 13:03:35.000000+0000  | 2001-08-29

Hive - the correct way to permanently change the date and type in the entire column

I would be grateful if someone could explain, step by step, what the process of changing the date format and the column type from string to date should look like for a table imported via Hive View into HDP 2.6.5.
The data source is the well-known MovieLens 100K dataset (the 'u.item' file) from:
https://grouplens.org/datasets/movielens/100k/
$ hive --version is: 1.2.1000.2.6.5.0-292
Date format for the column is: '01-Jan-1995'
Data type of column is: 'string'
ACID Transactions is 'On'
Ultimately, I would like to permanently convert the data in the entire column to the correct Hive format 'yyyy-MM-dd' and then change the column type to 'Date'.
I have looked at over a dozen threads on similar questions before. Of course, the problem is not displaying the column like this; that can easily be done using just:
SELECT from_unixtime(unix_timestamp(prod_date,'dd-MMM-yyyy'),'yyyy-MM-dd') FROM moviesnames;
The problem is writing it back permanently in that format. Unfortunately, this cannot be done via UPDATE in the following way, despite ACID transactions being enabled in the Hive config:
UPDATE moviesnames SET prodate = (select to_date(from_unixtime(UNIX_TIMESTAMP(prod_date,'dd-MMM-yyyy'))) from moviesnames);
What's the easiest way to achieve the above using Hive-SQL? By copying and transforming a column or an entire table?
Try this:
UPDATE moviesnames SET prodate = to_date(from_unixtime(UNIX_TIMESTAMP(prod_date,'dd-MMM-yyyy')));
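If the ACID UPDATE route remains problematic, the copy-and-transform approach mentioned in the question is also an option from Spark; a hedged sketch (table and column names are taken from the question, the new table name is a placeholder):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Rewrite the table with prod_date converted from 'dd-MMM-yyyy' strings to
# proper dates, into a new table that can then replace the original.
converted = (
    spark.table("moviesnames")
         .withColumn("prod_date", F.to_date("prod_date", "dd-MMM-yyyy"))
)
converted.write.mode("overwrite").saveAsTable("moviesnames_converted")   # placeholder name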

AWS DMS CDC task does not detect column name and type changes

I have created a CDC task that captures changes in a source PostgreSQL schema and writes them in Parquet format into a target S3 bucket. The task captures the inserts, updates and deletes correctly but fails to capture column name and type changes in the source.
When I change a column name or type of a table in the source and insert new rows to the table, the resulting Parquet file uses the old column name and type.
Is there a specific configuration I am missing, or is it not possible to achieve the desired outcome with this task in DMS?
If you change a column at the source, DMS should pick it up automatically and update it at the destination. Check your DMS settings; you don't need to add the column manually at the destination.
Make sure you have the HandleSourceTableAltered parameter set to true in the task settings.[1] (The setting applies when the target metadata parameter BatchApplyEnabled is set to either true or false.)
Same goes for HandleSourceTableDropped or HandleSourceTableTruncated if this is relevant in your case.
Obviously, previously replicated Parquet files on S3 will not change to reflect this DDL change on the source.
[1] https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.DDLHandling.html
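For reference, a rough sketch of setting those flags with boto3 (the task ARN is a placeholder, the task has to be stopped before it can be modified, and in practice you would merge this into the task's existing settings JSON rather than send it alone):

import json
import boto3

dms = boto3.client("dms")

# Only the DDL-handling section is shown here; merge it into the task's
# existing settings JSON in practice.
ddl_handling = {
    "ChangeProcessingDdlHandlingPolicy": {
        "HandleSourceTableAltered": True,
        "HandleSourceTableDropped": True,
        "HandleSourceTableTruncated": True,
    }
}

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:eu-west-1:123456789012:task:EXAMPLE",  # placeholder
    ReplicationTaskSettings=json.dumps(ddl_handling),
)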