Is there any way to change data type inside struct column in a table?
Example:
emp_details: struct
emp_id: integer
emp_name: string
If emp_details is a struct-type column in a table, containing emp_id and emp_name, how can I change emp_id to string?
Yes, you can. Explicitly cast the column and rebuild emp_details from the casted column. Once you have the desired dataframe, you can overwrite the table in Databricks to store it with the new schema.
This should look something like this:
from pyspark.sql.functions import col, struct

# For code readability, let's first create the correctly casted column
casted_df = original_df.withColumn("casted_emp_id", col("emp_details.emp_id").cast("string"))
# Rebuild the struct from the renamed casted column and the original nested emp_name
final_df = casted_df.select(struct(col("casted_emp_id").alias("emp_id"), col("emp_details.emp_name").alias("emp_name")).alias("emp_details"))
# Finally, overwrite the table (assuming a Delta table in Databricks; table name is illustrative)
final_df.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable("my_table")
Related
I have a PostgreSQL table that has a Date column of type TEXT.
Values look like this: 2019-07-19 00:00
I want to either cast this column to type DATE so I can query based on latest values, etc., or create a new column of type DATE and cast into there (so I have both for the future). I would appreciate any advice on both options!
Hope this isn't a dupe, but I haven't found any answers on SO.
Some context: I will need to add more data later on to the table that only has the TEXT column, which is why I want to keep the original, but I'm open to suggestions.
You can alter the column type with the simple command:
alter table my_table alter my_col type date using my_col::date
This seems to be the best solution as maintaining duplicate columns of different types is a potential source of future trouble.
Note that all values in the column have to be null or be recognizable by Postgres as a date, otherwise the conversion will fail.
However, if you insist on creating a new column, add it and then populate it with an UPDATE:
alter table my_table add my_date_col date;
update my_table
set my_date_col = my_col::date;
Is there a way to ignore values with missing columns when using INSERT INTO in PostgreSQL?
For example:
INSERT INTO tblExample(col_Exist1, col_Exist2, col_NotExist) VALUES ('Val1', 'Val2', 'Val3')
I want to insert a new row containing values Val1 and Val2, but ignore Val3 since its column does not exist, so the result would be:
# | col_Exist1 | col_Exist2
-----------------------------
1 | Val1 | Val2
I see that there is an INSERT ... ON CONFLICT DO NOTHING construct, but this seems to apply to an entire row only, not a single value.
For context: I realise this may not be best practice, but my application uses dynamically created queries based on properties from documents. The properties can vary, but there are lots of columns, so defining them explicitly is painful. Instead, I'm using a 'template' document to define them and, hopefully, I can just ignore properties from other documents that don't exist in the template document.
Thanks in advance.
EDIT: I've figured out a workaround for now - I'm just querying the table to get the list of columns - if the column name exists, add the property to the new INSERT INTO query. The original question still stands.
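That column-filtering workaround can be sketched in plain Python. All names here are illustrative, and the query construction is shown without a database connection; in practice the set of existing columns would come from a one-time query against information_schema.columns, and the %s placeholders match psycopg2-style parameter binding.

```python
# Sketch: keep only the document properties whose keys are real columns,
# then build a parameterized INSERT from the survivors.

def build_insert(table, existing_columns, properties):
    """Return (sql, params) using only properties that match existing columns."""
    cols = [c for c in properties if c in existing_columns]
    placeholders = ", ".join(["%s"] * len(cols))
    sql = f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
    params = [properties[c] for c in cols]
    return sql, params

sql, params = build_insert(
    "tblExample",
    {"col_Exist1", "col_Exist2"},  # discovered once from information_schema
    {"col_Exist1": "Val1", "col_Exist2": "Val2", "col_NotExist": "Val3"},
)
# col_NotExist is silently dropped from both the column list and the params
print(sql)     # INSERT INTO tblExample (col_Exist1, col_Exist2) VALUES (%s, %s)
print(params)  # ['Val1', 'Val2']
```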
What about moving the document's data to JSON?
Create one table with the following fields:
Table: Documents
id: uuid4
name: varchar or text
data: json type, per https://www.postgresql.org/docs/devel/datatype-json.html
After this trick you can store any dynamic data you'd like.
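A self-contained sketch of that pattern, using Python's built-in sqlite3 so it runs anywhere; in Postgres you would declare data as json or jsonb and query it with operators like ->>. The table and document contents are made up for illustration.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# "data" holds the serialized document; no per-property columns needed
conn.execute("CREATE TABLE Documents (id TEXT PRIMARY KEY, name TEXT, data TEXT)")

# Any property set fits, regardless of what a "template" document defines
doc = {"col_Exist1": "Val1", "col_Exist2": "Val2", "col_NotExist": "Val3"}
conn.execute(
    "INSERT INTO Documents (id, name, data) VALUES (?, ?, ?)",
    (str(uuid.uuid4()), "example.doc", json.dumps(doc)),
)

# Read a property back out by deserializing the stored JSON
row = conn.execute(
    "SELECT data FROM Documents WHERE name = ?", ("example.doc",)
).fetchone()
print(json.loads(row[0])["col_NotExist"])  # Val3
```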
I have multiple CSV files in a folder, like employee.csv, student.csv, etc., all with headers.
I also have tables for all the files (both the header and the table column names are the same).
employee.csv
id|name|is_active
1|raja|1
2|arun|0
student.csv
id|name
1|raja
2|arun
Table Structure:
employee:
id INT, name VARCHAR, is_active BIT
student:
id INT, name VARCHAR
Now I'm trying to run a copy activity over all the files using a ForEach activity.
The student table copied successfully, but the employee table was not copied; an error is thrown while reading the employee.csv file.
Error Message:
{"Code":27001,"Message":"ErrorCode=TypeConversionInvalidHexLength,Exception occurred when converting value '0' for column name 'is_active' from type 'String' (precision:, scale:) to type 'ByteArray' (precision:0, scale:0). Additional info: ","EventType":0,"Category":5,"Data":{},"MsgId":null,"ExceptionType":"Microsoft.DataTransfer.Common.Shared.PluginRuntimeException","Source":null,"StackTrace":"","InnerEventInfos":[]}
Use a Data Flow activity.
In the data flow, select the Source.
After this, add a Derived Column transformation and change the datatype of the is_active column from BIT to String.
For example, if a Salary column has a string datatype, you can change it to integer the same way.
To modify the datatype, use the expression builder; here you can use toString(is_active).
This way you can change the datatype before the sink.
In the last step, set the Sink to PostgreSQL and run the pipeline.
I have a table with an INTEGER Column which has NOT NULL constraint and a DEFAULT value = 0;
I need to copy data from a series of csv files.
In some of these files this column is an empty string.
So far, I have set the NULL parameter in the COPY command to a non-existent value so that an empty string is not converted to NULL, but now I get an error saying that an empty string is an invalid value for the INTEGER column.
I would like to use COPY command because of its speed, but maybe it is not possible.
The file contains no header. All columns in the file have their counterparts in the table.
Is there a way to specify that:
an empty string is zero, or
if there is an empty string, the default column value is used?
You could create a view on the table that does not contain the column and create an INSTEAD OF INSERT trigger on it. When you COPY data into that view, the default value will be used for the table. I don't know whether the performance will be good enough.
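A sketch of that setup, with one variation: since the asker's files do contain the problem column, the view below keeps it but exposes it as text, so the CSV column count still matches and the trigger can map empty strings to the default. All object names are hypothetical, and the default value 0 is hardcoded in the trigger rather than read from the catalog.

```sql
-- Hypothetical target table; int_col is the NOT NULL DEFAULT 0 column
CREATE TABLE my_table (
    id      integer,
    int_col integer NOT NULL DEFAULT 0
);

-- The view exposes int_col as text so COPY accepts the raw file value
CREATE VIEW my_table_load AS
    SELECT id, int_col::text AS int_col FROM my_table;

CREATE FUNCTION my_table_load_ins() RETURNS trigger AS $$
BEGIN
    -- '' becomes NULL via nullif, which coalesce replaces with the default 0
    INSERT INTO my_table (id, int_col)
    VALUES (NEW.id, coalesce(nullif(NEW.int_col, '')::integer, 0));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER my_table_load_ins_trg
    INSTEAD OF INSERT ON my_table_load
    FOR EACH ROW EXECUTE FUNCTION my_table_load_ins();

-- COPY targets the view instead of the table
COPY my_table_load FROM '/path/to/data.csv' WITH (FORMAT csv);
```

On PostgreSQL versions before 11, the trigger would use EXECUTE PROCEDURE instead of EXECUTE FUNCTION.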
Is there a way to rename a table column such that all references to that column in existing functions are automatically updated?
e.g. Doing this
ALTER TABLE public.person RENAME COLUMN name TO firstname;
would automatically change a reference like the following in any function:
return query
select * from person where name is null;
Since function bodies are just strings, there is no way to automatically change references to columns in function bodies when you rename a column.
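There is no automatic rewrite, but before renaming you can at least locate the functions whose stored source mentions the old column name. This is a plain substring search over pg_proc.prosrc, so expect false positives (comments, unrelated identifiers):

```sql
-- Find functions in the public schema whose body mentions the old name
SELECT n.nspname, p.proname
FROM pg_proc p
JOIN pg_namespace n ON n.oid = p.pronamespace
WHERE n.nspname = 'public'
  AND p.prosrc ILIKE '%name%';  -- substitute the old column name
```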