Hive - the correct way to permanently change the date format and type of an entire column to date

I would be grateful if someone could explain here step by step what the process of changing the date format and column type from string to date should look like in the table imported via Hive View to HDP 2.6.5.
The data source is the well-known MovieLens 100K Dataset set ('u.item' file) from:
https://grouplens.org/datasets/movielens/100k/
$ hive --version is: 1.2.1000.2.6.5.0-292
Date format for the column is: '01-Jan-1995'
Data type of column is: 'string'
ACID Transactions is 'On'
Ultimately, I would like to permanently convert the data in the entire column to the correct Hive format 'yyyy-MM-dd' and then change the column type to 'Date'.
I have looked at over a dozen threads about similar questions before. Of course, the problem is not displaying the column this way; that can easily be done using just:
SELECT from_unixtime(unix_timestamp(prod_date,'dd-MMM-yyyy'),'yyyy-MM-dd') FROM moviesnames;
The problem is writing it back permanently in this form. Unfortunately, this cannot be done via UPDATE in the following way, despite ACID transactions being enabled in the Hive config.
UPDATE moviesnames SET prodate = (select to_date(from_unixtime(UNIX_TIMESTAMP(prod_date,'dd-MMM-yyyy'))) from moviesnames);
What's the easiest way to achieve the above using Hive-SQL? By copying and transforming a column or an entire table?

Try this:
UPDATE moviesnames SET prodate = to_date(from_unixtime(UNIX_TIMESTAMP(prod_date,'dd-MMM-yyyy')));
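Note that an UPDATE only rewrites the values; the column itself stays a string. If the type should also become DATE permanently, one option is to rebuild the table and reload the data. Below is a minimal sketch, assuming the string column is prod_date and using placeholder names for the remaining columns (the real u.item file has more fields, so list your actual schema):
-- Sketch only: movie_id and title are placeholder columns, list your real schema.
CREATE TABLE moviesnames_new (
  movie_id  INT,
  title     STRING,
  prod_date DATE
)
STORED AS ORC;
-- (add CLUSTERED BY ... INTO n BUCKETS plus 'transactional'='true' if the new table also needs ACID)

INSERT INTO TABLE moviesnames_new
SELECT
  movie_id,
  title,
  CAST(from_unixtime(unix_timestamp(prod_date, 'dd-MMM-yyyy'), 'yyyy-MM-dd') AS DATE)
FROM moviesnames;

-- After verifying the new data:
-- DROP TABLE moviesnames;
-- ALTER TABLE moviesnames_new RENAME TO moviesnames;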

Related

Cassandra Alter Column type from Timestamp to Date

Is there any way to alter a Cassandra column from timestamp to date without data loss? For example, '2021-02-25 20:30:00+0000' to '2021-02-25'.
If not, what is the easiest way to migrate this column (timestamp) to a new column (date)?
It's impossible to change the type of an existing column, so you need to add a new column with the correct data type and perform a migration. The migration can be done via Spark + the Spark Cassandra Connector - it is the most flexible solution, and it can even be done on a single machine with Spark running in local master mode (the default). The code could look something like this (try it on test data first):
import pyspark.sql.functions as F
options = { "table": "tbl", "keyspace": "ks"}
spark.read.format("org.apache.spark.sql.cassandra").options(**options).load()\
.select("pk_col1", "pk_col2", F.col("timestamp_col").cast("date").alias("new_name"))\
.write.format("org.apache.spark.sql.cassandra").options(**options).save()
P.S. You can use DSBulk, for example, but you need to have enough space to offload the data (although you only need the primary key columns + your timestamp column).
To add to Alex Ott's answer, there are validations in Cassandra that prevent changing the data type of a column. The reason is that SSTables (Cassandra data files) are immutable -- once they are written to disk, they are never modified/edited/updated. They can only be compacted into new SSTables.
Some try to get around it by dropping the column from the table and then adding it back with a new data type. Unlike in a traditional RDBMS, the existing data in the SSTables doesn't get updated, so if you try to read the old data you'll get a CorruptSSTableException because the CQL type of the data on disk won't match that of the schema.
For this reason, it is no longer possible to drop/recreate columns with the same name (CASSANDRA-14948). If you're interested, I've explained it in a bit more detail in this post -- https://community.datastax.com/questions/8018/. Cheers!
You can use ToDate to change it. For example, the table Email has a column Date in the format 2001-08-29 13:03:35.000000+0000.
Select Date, ToDate(Date) as Convert from keyspace.Email:
 date                            | convert
---------------------------------+------------
 2001-08-29 13:03:35.000000+0000 | 2001-08-29

IBM DataStage assumes a column is WVARCHAR while it's a date

I'm doing an ETL job. For the data source stage, I input a custom SELECT statement. In the output tab of the data source stage, I defined the INCEPTION column data type as Timestamp. The right data type for INCEPTION is date; I checked it via DBeaver. But somehow IBM DataStage assumes that it is WVARCHAR. It says ODBC_Connector_0: Schema reconciliation detected a type mismatch for field INCEPTION. When moving data from field type WVARCHAR(min=0,max=10) into DATETIME(fraction=6), data corruption can occur (CC_DBSchemaRules::reportTypeMismatch, file CC_DBSchemaRules.cpp, line 1,838). I don't know why this is, since the database shows that INCEPTION is definitely a Date column, and I don't think I'm making a mistake. What did I do wrong and how do I fix it?
Where did DataStage get its table definition? DataStage is a computer program; it can't "decide" anything. If you import the table definition from the source, what data type is INCEPTION? If it is either Date or Timestamp, load that table definition into your DataStage job. Otherwise, explicitly convert the string using the StringToTimestamp() function in a Transformer stage.

Updating the table through tOracleOutput in Talend using an additional SQL query

I have a job where a flow goes into tOracleOutput, where I am updating the table. Now I have to update that table using an additional SQL statement, which I guess we have an option for in the Advanced settings of tOracleOutput, but I don't know how to use it; you could say I'm not understanding the settings properly. I referred to the official documentation but could not understand it. Can anyone explain the fields Name, SQL expression, Position, and Reference Column in a better way?
the SQL query which I am using is:
update set COL1=SOMETHING1
where COL2=SOMETHING2
Now, the value for COL1 is coming from the flow, but COL2 is a column in the table that is not coming from the flow.
Have a look at tOracleRow for such a case.
Hope this helps.
TRF
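If you go the tOracleRow route, a rough sketch of the query could look like the one below (my_table is a placeholder for your table name; the COL1 value coming from the flow can be bound as a prepared-statement parameter via the Use PreparedStatement option in the component's Advanced settings):
UPDATE my_table SET COL1 = ? WHERE COL2 = 'SOMETHING2'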
Using tOracleOutput is helpful when you have a ready data source (a table or file (...) with the same columns as the destination). The more elaborate your query is, the more you should do as TRF said (and use tOracleRow), but here's an example for your question:
The file contains 3 columns,
the destination DB table contains 4 columns, where the 4th is the date of the update (the first 3 are identical to the input),
so you add the destination column's name in Name, put the SQL function for the date (e.g. SYSDATE) in SQL expression, and specify where to put it (Position) relative to a column of your choice (Reference Column).
In my view this helps avoid using a tMap just for one additional column when you want to Insert. But you want to Update, in which case the component doesn't offer the additional columns section, plus I don't think you can add the WHERE clause here.
Hope it helps

How to assign csv field value to SQL query written inside table input step in Pentaho Spoon

I am pretty new to Pentaho so my query might sound very novice.
I have written a transformation in which I am using a CSV file input step and a Table input step.
Steps I followed:
Initially, I created a parameter in the transformation properties. The parameter birthdate doesn't have any default value set.
I have used this parameter in a PostgreSQL query in the Table input step in the following manner:
select * from person where EXTRACT(YEAR FROM birthdate) > ${birthdate};
I am reading the CSV file using CSV file input step. How do I assign the birthdate value which is present in my CSV file to the parameter which I created in the transformation?
(OR)
Could you guide me through the process of assigning the CSV field value directly to the SQL query used in the Table input step, without the use of a parameter?
TLDR;
I recommend using a "database join" step like in my third suggestion below.
See the last image for reference
First idea - Using Table Input as originally asked
Well, you don't need any parameter for that, unless you are going to provide the value for that parameter when you ask the transformation to run. If you need to read the data from a CSV, you can do it with this approach.
First, read your CSV and make sure your rows are ok.
After that, use a Select values step to keep only the columns to be used as parameters.
In the Table input step, use a placeholder (?) to determine where to place the data, and ask it to run for each row it receives from the source step, as in the sketch below.
Just keep in mind that the order of the columns received by the Table input step (the columns coming out of the Select values step) is the same order in which they will be used for the placeholders (?). This shouldn't be a problem for your question, which uses only one placeholder, but keep it in mind as you ramp up with Pentaho.
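As a sketch, the query from the question rewritten for this approach (with the ${birthdate} variable replaced by a positional placeholder) would look something like:
select * from person where EXTRACT(YEAR FROM birthdate) > ?
with Insert data from step pointing at the Select values step and Execute for each row enabled in the Table input step.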
Second idea, using a Database Lookup
This is another approach where you can't customize the query made to the database, but you may see better performance because you can set an "Enable cache" flag. If you don't need to use a function in your WHERE clause, this is really recommended.
Third idea, using a Database Join
That is my recommended approach if you need a function in your WHERE clause. It looks a lot like the Table Input approach, but you can skip the Select values step and choose which columns to use, repeat the same column as many times as needed, and enable an "outer join" flag that returns rows even when the query produces no result.
ProTip: If the transformation is running too slowly, try using multiple copies of the step (documentation here) and, obviously, make sure the table has the appropriate indexes in place.
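For this particular query, an appropriate index would most likely be an expression index on the extracted year, along these lines (assuming the person/birthdate names from the question; verify with EXPLAIN that it is actually used):
CREATE INDEX idx_person_birth_year ON person ((EXTRACT(YEAR FROM birthdate)));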
Yes, there's a way of assigning it directly without the use of a parameter. Do as follows.
Use a Block this step until steps finish step to halt the Table input step until the CSV file input step completes.
Following is how you configure each step.
Note:
Postgres query should be select * from person where EXTRACT(YEAR FROM birthdate) > ?::integer
Check Execute for each row and Replace variables in the Table input step.
Select only the birthdate column in the CSV file input step.

Redshift - Adding a column, do we have to change our previous CSVs to include it?

I currently have a Redshift table in our database that has 10 columns, and I want to add another. It's trivial to do this with an ALTER TABLE.
My question: when I do this, will all my old CSV files fail to insert into Redshift (via COPY from S3), given that they won't have this new column?
I was hoping the column would just be NULL rather than the import failing, but I haven't seen any documentation on this.
Ideally I wish I could specify the actual column names in the header row of the CSV, but I haven't seen anywhere whether that is possible.
FILLRECORD in the COPY command does that: 'Allows data files to be loaded when contiguous columns are missing at the end of some of the records.'
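A hedged sketch of what such a COPY could look like (table name, S3 path, and IAM role are placeholders):
COPY my_table
FROM 's3://my-bucket/old-files/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'
FORMAT AS CSV
FILLRECORD;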