Are there disadvantages to using a non-primitive column (date) as the partition column in Hive?

Is there any reason why I shouldn't use a column formatted as date as the partitioning column in a table in Apache Hive?
The official documentation says:
Although currently there is not restriction on the data type of the partitioning column, allowing non-primitive columns to be partitioning column probably doesn't make sense. The dynamic partitioning column's type should be derived from the expression. The data type has to be able to be converted to a string in order to be saved as a directory name in HDFS.
https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions#DynamicPartitions-Designissues
I don't see why columns formatted as date would create any issue, since by design these could be converted into string.
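As a rough illustration of the reasoning above, the sketch below shows how a date value maps onto a partition directory name of the form column=value, the layout Hive uses for partitions. The helper name partition_dir and the column name event_date are hypothetical and only serve the example; this is not Hive's own code.

```python
from datetime import date


def partition_dir(column, value):
    """Render a partition directory name in the `column=value` style.

    Hypothetical helper: it only illustrates that a date has an
    unambiguous string form (ISO 8601, yyyy-MM-dd) usable as a path segment.
    """
    return f"{column}={value.isoformat()}"


# Example with an assumed partition column called `event_date`.
print(partition_dir("event_date", date(2021, 2, 25)))
```

Because the ISO form sorts lexicographically in date order, directory listings for such partitions also come out in chronological order.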

Related

Cassandra Alter Column type from Timestamp to Date

Is there any way to alter a Cassandra column from timestamp to date without data loss? For example, '2021-02-25 20:30:00+0000' to '2021-02-25'.
If not, what is the easiest way to migrate this column (timestamp) to a new column (date)?
It's not possible to change the type of an existing column, so you need to add a new column with the correct data type and migrate the data. The migration could be done with Spark and the Spark Cassandra Connector. This is probably the most flexible solution, and it can even run on a single machine with Spark in local master mode (the default). The code could look something like this (try it on test data first):
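import pyspark.sql.functions as F

# Assumes an existing SparkSession named `spark` with the Spark Cassandra Connector available
options = {"table": "tbl", "keyspace": "ks"}
spark.read.format("org.apache.spark.sql.cassandra").options(**options).load() \
    .select("pk_col1", "pk_col2", F.col("timestamp_col").cast("date").alias("new_name")) \
    .write.format("org.apache.spark.sql.cassandra").options(**options) \
    .mode("append").save()  # append mode, because the target table already exists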
P.S. You can also use DSBulk, for example, but you need enough disk space to offload the data (although you only need the primary key columns plus your timestamp column).
To add to Alex Ott's answer, Cassandra has validations that prevent changing the data type of a column. The reason is that SSTables (Cassandra data files) are immutable -- once they are written to disk, they are never modified/edited/updated. They can only be compacted into new SSTables.
Some try to get around this by dropping the column from the table and then adding it back with a new data type. Unlike in a traditional RDBMS, the existing data in the SSTables doesn't get updated, so if you try to read the old data, you'll get a CorruptSSTableException because the CQL type of the data on disk won't match that of the schema.
For this reason, it is no longer possible to drop/recreate columns with the same name (CASSANDRA-14948). If you're interested, I've explained it in a bit more detail in this post -- https://community.datastax.com/questions/8018/. Cheers!
You can use ToDate to convert the value when querying. For example, table Email has a column Date with values in the format 2001-08-29 13:03:35.000000+0000.
Select Date, ToDate(Date) as Convert from keyspace.Email:
 date                            | convert
---------------------------------+------------
 2001-08-29 13:03:35.000000+0000 | 2001-08-29

Hive - the correct way to permanently change the date and type in the entire column

I would be grateful if someone could explain here step by step what the process of changing the date format and column type from string to date should look like in the table imported via Hive View to HDP 2.6.5.
The data source is the well-known MovieLens 100K Dataset ('u.item' file) from:
https://grouplens.org/datasets/movielens/100k/
$ hive --version is: 1.2.1000.2.6.5.0-292
Date format for the column is: '01-Jan-1995'
Data type of column is: 'string'
ACID Transactions is 'On'
Ultimately, I would like to permanently convert the data in the entire column to the correct Hive format 'yyyy-MM-dd' and then change the column type to 'Date'.
I have looked at over a dozen threads regarding similar questions before. Of course, the problem is not to display the column like this, it can be easily done using just:
SELECT from_unixtime(unix_timestamp(prod_date,'dd-MMM-yyyy'),'yyyy-MM-dd') FROM moviesnames;
The problem is to finally write it down this way. Unfortunately, this cannot be done via UPDATE in the following way, despite the inclusion of atomic operations in Hive config.
UPDATE moviesnames SET prodate = (select to_date(from_unixtime(UNIX_TIMESTAMP(prod_date,'dd-MMM-yyyy'))) from moviesnames);
What's the easiest way to achieve the above using Hive-SQL? By copying and transforming a column or an entire table?
Try this:
UPDATE moviesnames SET prodate = to_date(from_unixtime(UNIX_TIMESTAMP(prod_date,'dd-MMM-yyyy')));
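To sanity-check the expected result of the conversion outside Hive, a minimal sketch in Python is shown below. It mirrors the 'dd-MMM-yyyy' to 'yyyy-MM-dd' transformation only; the function name to_hive_date is hypothetical, and it assumes an English locale for the month abbreviation.

```python
from datetime import datetime


def to_hive_date(text):
    """Convert a 'dd-MMM-yyyy' string such as '01-Jan-1995' to 'yyyy-MM-dd'.

    %b parses the abbreviated month name and is locale-dependent;
    an English (or C) locale is assumed here.
    """
    return datetime.strptime(text, "%d-%b-%Y").strftime("%Y-%m-%d")


print(to_hive_date("01-Jan-1995"))
```

Running a few sample values from 'u.item' through a check like this can help confirm the target format before touching the table.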

Mapping Data Flow Common Data Model source connector datetime/timestamp columns nullified?

We are using Azure Data Factory Mapping data flow to read from Common Data Model (model.json).
We use dynamic pattern – where Entity is parameterised and we do not project any columns and we have selected Allow schema drift.
Problem: We are having an issue with the “Source” in the mapping data flow (Source Type is Common Data Model). All the datetime/timestamp columns are read as null in the source activity.
We also tried Infer drifted column types in the Projection tab, where we provide a format for timestamp columns. However, this only works for some timestamp columns, since each datetime column in the source has a different timestamp format:
11/20/2020 12:45:01 PM
2020-11-20T03:18:45Z
2018-01-03T07:24:20.0000000+00:00
Question: How to prevent datetime columns becoming null? Ideally, we do not want Mapping Data Flow to typecast any columns - is there a way to just read all columns as string?
In the Projection tab, we do not specify a schema, to allow schema drift and to load more than one entity dynamically.
In the Data Preview tab: ModifiedOn, SinkCreatedOn, and SinkModifiedOn are all system columns and will definitely have values in the source.
This is now resolved after a separate conversation with the Azure Data Factory team.
Firstly, there is no way to 'stringify' all the columns in the Source, because the CDM connector gets its metadata from model.json (this file can be manipulated if needed, but that is not ideal for my scenario).
To stop the datetime/timestamp columns becoming null: under the Projection tab, select Infer drifted column types; you can then add multiple time formats that you expect to come from CDM. You can select formats from the dropdown. If your particular datetime format is not listed there (which was my case), you can edit the code behind the data flow (i.e. the data flow script) to add your format.
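The idea of supplying several expected formats can be sketched in plain Python: try each format in turn and only give up when none matches. This is an illustration of the concept, not of how Azure Data Factory implements it, and the format strings use Python's strptime syntax rather than the data flow syntax. The names FORMATS and parse_any are hypothetical.

```python
import re
from datetime import datetime

# One format per sample value quoted in the question.
FORMATS = [
    "%m/%d/%Y %I:%M:%S %p",    # 11/20/2020 12:45:01 PM
    "%Y-%m-%dT%H:%M:%SZ",      # 2020-11-20T03:18:45Z
    "%Y-%m-%dT%H:%M:%S.%f%z",  # 2018-01-03T07:24:20.0000000+00:00
]


def parse_any(text):
    """Return a datetime for the first matching format, or None."""
    # strptime's %f accepts at most 6 fractional digits, so trim longer fractions.
    text = re.sub(r"(\.\d{6})\d+", r"\1", text)
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match; try the next one
    return None


for sample in ["11/20/2020 12:45:01 PM",
               "2020-11-20T03:18:45Z",
               "2018-01-03T07:24:20.0000000+00:00"]:
    print(sample, "->", parse_any(sample))
```

With only one format configured, two of the three samples would fail to parse, which matches the symptom of some columns coming back null. Note that in this sketch only the third format yields a timezone-aware value; the other two return naive datetimes.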

What are Pros and Cons in using prefixes and suffixes in PostgreSQL dialect for timestamp columns

I have analysed several articles about naming conventions for Date/Time types in SQL data models.
Most of them suggest a database design where the timestamp type is used only for registered event values, literally timestamping the event at the moment it happens, and naturally they suggest the datetime type for any other need to store an instant of time. They also suggest avoiding, at all costs, suffixes and prefixes that match known data types, such as date and time, so that a column name conveys only the column's purpose and is not confused with a data type.
But PostgreSQL dialect does not have that datetime type at all, so there is only the timestamp type for all cases when just date and time are not enough for the column which is expected to store a value of past or future instant of time.
So, basically, which prefixes or suffixes, if any, would you suggest for PostgreSQL columns, given that some of them would store past, present, and future time instants? And why: for what benefits, or because of what limitations?
Should we use timestamp and datetime as prefixes or suffixes to distinguish the purpose of different timestamp type columns by their names? Or would that be a bad practice since there is actually a data type named timestamp and no data type named datetime in PostgreSQL dialect?
Or should we maybe use something very neutral, like the noun instant, as a prefix or suffix to denote the purpose of the column?

Datatype conversion of Parquet using Spark SQL - dynamically, without specifying a column name explicitly

I am looking for a way to handle data type conversion dynamically in Spark DataFrames. I load the data into a DataFrame using a Hive SQL query and then write it to a Parquet file. Hive is unable to read some of the data types, so I want to convert the decimal columns to double. Instead of specifying each column name separately, is there a way to handle the data types dynamically? Let's say my DataFrame has 50 columns, of which 8 are decimals, and I need to convert all 8 to double without specifying the column names. Can we do that directly?
There is no direct way to do this data type conversion. Here are some options:
Either cast those columns in the Hive query,
or
create/use a case class with the data types you require, populate it with the data, and use it to generate the Parquet file,
or
read the data types from the Hive query metadata and use dynamic code to achieve option one or option two.
There are two options:
1. Use the schema from the dataframe and dynamically generate query statement
2. Use the create table...select * option with spark sql
This is already answered and this post has details, with code.
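The first option, generating the query from the schema, can be sketched roughly as follows. The function build_cast_query, the sample schema, and the view name my_view are all hypothetical; the input has the shape of PySpark's DataFrame.dtypes, a list of (column name, type string) pairs, so the sketch runs without Spark.

```python
def build_cast_query(dtypes, table):
    """Build a SELECT that casts every decimal column to DOUBLE.

    `dtypes` is a list of (column_name, type_string) pairs in the shape
    PySpark's DataFrame.dtypes returns, e.g. ("price", "decimal(10,2)").
    """
    exprs = []
    for name, dtype in dtypes:
        if dtype.startswith("decimal"):
            # Cast decimals and keep the original column name.
            exprs.append(f"CAST(`{name}` AS DOUBLE) AS `{name}`")
        else:
            # Pass every other column through unchanged.
            exprs.append(f"`{name}`")
    return f"SELECT {', '.join(exprs)} FROM {table}"


# Hypothetical schema; with Spark this would come from df.dtypes.
sample_dtypes = [("id", "int"), ("price", "decimal(10,2)"), ("name", "string")]
query = build_cast_query(sample_dtypes, "my_view")
print(query)
```

With a live session, one would presumably register the DataFrame with df.createOrReplaceTempView("my_view") and run spark.sql(query) before writing the result to Parquet.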