Cassandra Alter Column type from Timestamp to Date - date

Is there any way to alter the Cassandra column from timestamp to date without data lost? For example '2021-02-25 20:30:00+0000' to '2021-02-25'
If not, what is the easiest way to migrate this column(timestamp) to the new column(date)?

It's impossible to change a type of the existing column, so you need to add a new column with correct data type, and perform migration. Migration could be done via Spark + Spark Cassandra Connector - it could be most flexible solution, and even could be done via single node machine with Spark running in the local master mode (default). Code could look something like this (try on test data first):
import pyspark.sql.functions as F
options = { "table": "tbl", "keyspace": "ks"}
spark.read.format("org.apache.spark.sql.cassandra").options(**options).load()\
.select("pk_col1", "pk_col2", F.col("timestamp_col").cast("date").alias("new_name"))\
.write.format("org.apache.spark.sql.cassandra").options(**options).save()
P.S. you can use DSBulk, for example, but you need to have enough space to offload the data (although you need only primary key column + your timestamp)

To add to Alex Ott's answer, there are validations done in Cassandra that prevents changing the data type of a column. The reason is that SSTables (Cassandra data files) are immutable -- once they are written to disk, they are never modified/edited/updated. They can only be compacted to new SSTables.
Some try to get around it by dropping the column from the table then adding it back in with a new data type. Unlike traditional RDBMS, the existing data in the SSTables don't get updated so if you tried to read the old data, you'll get a CorruptSSTableException because the CQL type of the data on disk won't match that of the schema.
For this reason, it is no longer possible to drop/recreate columns with the same name (CASSANDRA-14948). If you're interested, I've explained it in a bit more detail in this post -- https://community.datastax.com/questions/8018/. Cheers!

You can use ToDate to change it. For example: Table Email has column Date with format: 2001-08-29 13:03:35.000000+0000.
Select Date, ToDate(Date) as Convert from keyspace.Email:
date | convert ---------------------------------+------------ 2001-08-29 13:03:35.000000+0000 | 2001-08-29

Related

Hive - the correct way to permanently change the date and type in the entire column

I would be grateful if someone could explain here step by step what the process of changing the date format and column type from string to date should look like in the table imported via Hive View to HDP 2.6.5.
The data source is the well-known MovieLens 100K Dataset set ('u.item' file) from:
https://grouplens.org/datasets/movielens/100k/
$ hive --version is: 1.2.1000.2.6.5.0-292
Date format for the column is: '01-Jan-1995'
Data type of column is: 'string'
ACID Transactions is 'On'
Ultimately, I would like to convert permanently the data in the entire column to the correct Hive format 'yyyy-MM-dd' and next column type to 'Date'.
I have looked at over a dozen threads regarding similar questions before. Of course, the problem is not to display the column like this, it can be easily done using just:
SELECT from_unixtime(unix_timestamp(prod_date,'dd-MMM-yyyy'),'yyyy-MM-dd') FROM moviesnames;
The problem is to finally write it down this way. Unfortunately, this cannot be done via UPDATE in the following way, despite the inclusion of atomic operations in Hive config.
UPDATE moviesnames SET prodate = (select to_date(from_unixtime(UNIX_TIMESTAMP(prod_date,'dd-MMM-yyyy'))) from moviesnames);
What's the easiest way to achieve the above using Hive-SQL? By copying and transforming a column or an entire table?
Try this:
UPDATE moviesnames SET prodate = to_date(from_unixtime(UNIX_TIMESTAMP(prod_date,'dd-MMM-yyyy')));

BigQuery Transfer Service UI - run_date parameter

Has anybody had any success in applying a run_date parameter when creating a Transfer in BigQuery using the Transfer service UI ?
I'm taking a CSV file from Google Cloud storage and I want to mirror this into my ingestion date partitioned table, table_a.
Initally I set the destination table as table_a, which resulted in the following message in the job log:
Partition suffix has to be set for date-partitioned tables. Please recreate your transfer config with a valid table name. For example, to load new files to partition of the run date, specify your table name as transferTest${run_date} for daily partitioning or transferTest${run_time|"%Y%m%d%H"} for hourly partitioning.
I then set the destination to table_a$(run_date), which then issues the warning:
Invalid table name. Please use only alphabetic, numeric characters or underscore with supported parameters wrapped in brackets.
However it won't accept table_a_(run_date) either - could anyone please advise?
best wishes
Dave
Apologies - i've identified the correct syntax now
table_a_{run_date}

How to automate the execution process of data quality rules?

One of our clients has a requirement to build/develop data quality rules using hiveQL.
E.g, Replace NULL values, Change date format in YYYY-MM-DD, Standardize amount column values in US & EU format, etc.
Problem Statement:
I have the set of data quality rules in one hive table(dq_rules), want to execute each rule one by one and store the errors(the data issues such as null column, incorrect date format column) in another hive table(dq_logging) for reporting/logging purpose.
Please Suggest me solution by keeping one thing in mind that, I want to make this solution generic and executable for any hive table/columns(It means it should be parameterized).
Restriction: I cannot use existing Data Quality tools. I need to complete it using a hive only(Restriction is given by Client).
Schema for Tables:
dq_rules => Validation Rule ID,Rule Category,DQ Dimension,Rule Description Date Added,Date Retired
dq_logging => Error_ID,Source_Name,Erroneous_Source_Fields,Source_File_Record,Validation Rule ID
If anyone has a solution related to writing shell/python script that will also work for me. I just need to make it end to end process.

Partial inserts with Cassandra and Phantom DSL

I'm building a simple Scala Play app which stores data in a Cassandra DB using the Phantom DSL driver for Scala. One of the nice features of Cassandra is that you can do partial updates i.e. so long as you provide the key columns, you do not have to provide values for all the other columns in the table. Cassandra will merge the data into your existing record based on the key.
Unfortunately, it seems this doesn't work with Phantom DSL. I have a table with several columns, and I want to be able to do an update, specifying values just for the key and one of the data columns, and let Cassandra merge this into the record as usual, while leaving all the other data columns for that record unchanged.
But Phantom DSL overwrites existing columns with null if you don't specify values in your insert/update statement.
Does anybody know of a work-around for this? I don't want to have to read/write all the data columns every time, as eventually the data columns will be quite large.
FYI I'm using the same approach to my Phantom coding as in these examples:
https://github.com/thiagoandrade6/cassandra-phantom/blob/master/src/main/scala/com/cassandra/phantom/modeling/model/GenericSongsModel.scala
It would be great to see some code, but partial updates are possible with phantom. Phantom is an immutable builder, it will not override anything with null by default. If you don't specify a value it won't do anything about it.
database.table.update.where(_.id eqs id).update(_.bla setTo "newValue")
will produce a query where only the values you've explicitly set to something will be set to null. Please provide some code examples, your problem seems really strange as queries don't keep track of table columns to automatically add in what's missing.
Update
If you would like to delete column values, e.g set them to null inside Cassandra basically, phantom offers a different syntax which does the same thing:
database.table.delete(_.col1, _.col2).where(_.id eqs id)`
Furthermore, you can even delete map entries in the same fashion:
database.table.delete(_.props("test"), _.props("test2").where(_.id eqs id)
This assumes props is a MapColumn[Table, Record, String, _], as the props.apply(key: T) is typesafe, so it will respect the keytype you define for the map column.

Talend Data Itegration: Avoid nulls coming out of tExtractXMLField?

I have this simple flow in Talend DI 6 (simplified for posting on SO):
The last step crashes with a NullPointerException, because missing XML attributes are returned as null.
Is there a way to get empty string values instead of nulls?
For now I'm using a tReplace step to remove nulls as a work-around, but it's tedious and adds to the cost of maintenance by creating one more place where the list of attributes needs to be maintained.
In Talend DI 5.6.2 it is possible to add default data values to the schema. The column in the schema is called "Default". If you expect strings, you can set an empty string, which is set if the column value is null:
Talend schema view with Default column
Works also for other data types. Talend DI 6 should still be able to do this, although the field might be renamed.