I am working with Spark in Scala and want to transform an existing dataset (DataFrame) into a new table that contains a nested structure.
Example input:
columnA columnB columnC columnD columnE
Example output:
columnA columnB columnC columnF:{columnD, columnE} (create a new column that contains both D and E as a nested structure)
It's straightforward to transform columnA-C since I can use .withColumn(), but I am not sure how to specify a new nested column structure and populate it with data from the same table. I think StructType(Array(StructField1, StructField2)) is needed to define the nested schema; how do I combine a StructType with a "regular" column and insert the data?
If you are using Spark version 3.1.1, then there is a new API, withField, introduced on the Column class.
You can use it as:
dataframe
  .withColumn("ColumnF", $"ColumnF".withField("ColumnD", $"ColumnD"))
  .withColumn("ColumnF", $"ColumnF".withField("ColumnE", $"ColumnE"))
Make sure to import the Spark implicits (import spark.implicits._) so that the $ column syntax works.
For older versions of Spark you can leverage the struct function:
dataframe
  .withColumn("ColumnF", struct($"ColumnF.*", $"ColumnD"))
  .withColumn("ColumnF", struct($"ColumnF.*", $"ColumnE"))
I have a table in my database called products that has ProductId, ProductName, BrandId and BrandName. I need to create Delta tables for each brand by passing the brand id as a parameter, and the table name should be the corresponding brand.delta. Every time new data is inserted into products (the master table), the data in the brand tables needs to be truncated and reloaded into the brand.delta tables. Could you please let me know if this is possible within Databricks using Spark or dynamic SQL?
It's easy to do; really there are a few variants:
in Spark: read data from the source table, filter it, etc., and use .saveAsTable in overwrite mode:
df = spark.read.table("products")
... transform df
brand_table_name = "brand1"
df.write.mode("overwrite").saveAsTable(brand_table_name)
in SQL by using CREATE OR REPLACE TABLE (you can use spark.sql to substitute variables into this text):
CREATE OR REPLACE TABLE brand1
USING delta
AS SELECT * FROM products where .... filter condition
for a list of brands you just need to use spark.sql in a loop:
for brand in brands:
    spark.sql(f"""CREATE OR REPLACE TABLE {brand}
    USING delta
    AS SELECT * FROM products where .... filter condition""")
P.S. Really, I think that you just need to define views (doc) over the products table that have the corresponding condition - in this case you avoid data duplication and don't incur compute costs for those writes.
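For example, a minimal Scala sketch of that view-based variant (the brand ids and view names are illustrative; the BrandId column comes from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Illustrative brand ids; in practice you could collect them from the products table
val brandIds = Seq(1, 2, 3)

// One view per brand: just a filter over the master table, no data is copied
brandIds.foreach { brandId =>
  spark.sql(
    s"""CREATE OR REPLACE VIEW brand_$brandId AS
       |SELECT * FROM products WHERE BrandId = $brandId""".stripMargin)
}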
How to insert data into a Delta table with a changing schema in Databricks.
In Databricks with Scala, I'm exploding a Map column and loading it into a Delta table. I have a predefined schema for the Delta table.
Let's say the schema has 4 columns A, B, C, D.
So, on day 1 I'm loading my dataframe with 4 columns into the Delta table using the code below.
loadfinaldf.write.format("delta").option("mergeSchema", "true")
  .mode("append").insertInto("table")
The columns in the dataframe change every day. For instance, on day 2, two new columns E and F are added and there is no C column. Now I have 5 columns A, B, D, E, F in the dataframe. When I load this data into the Delta table, columns E and F should be dynamically created in the table schema, the corresponding data should load into these two columns, and column C should be populated with NULL. I was assuming that spark.conf.set("spark.databricks.delta.schema.autoMerge","true") would do the job, but I'm unable to achieve this.
My approach:
I was thinking of listing the predefined Delta schema and the dataframe schema and comparing both before loading the data into the Delta table, as sketched below.
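A Scala sketch of that comparison step might look like the following; it assumes missing target columns can simply be filled with NULLs, and it writes with saveAsTable rather than insertInto so that the mergeSchema option can add the new columns (the table and dataframe names follow the snippet above):

import org.apache.spark.sql.functions.lit

// loadfinaldf and spark are the dataframe and session from the question
val targetSchema = spark.table("table").schema

// Add target columns that are missing from today's dataframe (e.g. C) as typed NULLs
val missing = targetSchema.filterNot(f => loadfinaldf.columns.contains(f.name))
val aligned = missing.foldLeft(loadfinaldf) { (df, field) =>
  df.withColumn(field.name, lit(null).cast(field.dataType))
}

// Columns that are new in the dataframe (e.g. E, F) stay in `aligned`;
// mergeSchema lets Delta add them to the table schema on write
aligned.write.format("delta")
  .option("mergeSchema", "true")
  .mode("append")
  .saveAsTable("table")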
Can you use some Python logic?
import pandas as pd
result = pd.concat([df1, df2], axis=1, join="inner")
Then, push your dataframe into a dynamically created SQL table?
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html
I need to update a Hive table like this:
update A from B
set
Col5 = A.Col2,
Col2 = B.Col2,
DT_Change = B.DT,
Col3 = B.Col3,
Col4 = B.Col4
where A.Col1 = B.Col1 and A.Col2 <> B.Col2
Using Scala Spark RDD
How can I do this?
I want to split this question into two parts to explain it simply.
First question: How to write Spark RDD data to a Hive table?
The simplest way is to convert the RDD into a Spark SQL DataFrame using the rdd.toDF() method. Then register the DataFrame as a temp table using df.registerTempTable("temp_table"). Now you can query the temp table and insert into the Hive table using sqlContext.sql("insert into table my_table select * from temp_table").
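Putting those steps together, a minimal sketch (the case class, the sample RDD and the Hive table name my_table are illustrative) would be:

import org.apache.spark.sql.hive.HiveContext

// Illustrative record type and RDD; sc is the existing SparkContext
case class Record(col1: String, col2: Int)
val rdd = sc.parallelize(Seq(Record("a", 1), Record("b", 2)))

val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

// 1. RDD -> DataFrame
val df = rdd.toDF()
// 2. Register the DataFrame as a temp table
df.registerTempTable("temp_table")
// 3. Insert from the temp table into the Hive table
sqlContext.sql("insert into table my_table select * from temp_table")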
Second question: How to update a Hive table from Spark?
As of now, Hive is not the best fit for record-level updates. Updates can only be performed on tables that support ACID, and one primary limitation is that only the ORC format supports updating Hive tables. You can find more information at https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
You can refer to "How to Updata an ORC Hive table form Spark using Scala" for this.
A few methods might have been deprecated with Spark 2.x; you can check the Spark 2.0 documentation for the latest methods.
While there could be better approaches, this is the simplest one I can think of that works.
I have a scenario where I join 2 DataFrames and then want to calculate a usage value from 2 columns. The logic currently exists in SQL and I want to convert it to Spark DataFrames:
bml.Usage * COALESCE(u.ConValue, 1) AS acUsage
where bml is a table inner-joined with the u table. In DataFrames I have bml.join(u, Seq("id"), "inner").select(COALESCE ???) - how can I perform this operation, either through a UDF or otherwise?
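A minimal sketch of that expression using the built-in coalesce and lit functions, with no UDF needed (bml and u are the two DataFrames from the question, and the column names follow the SQL snippet above):

import org.apache.spark.sql.functions.{coalesce, lit}

val result = bml.join(u, Seq("id"), "inner")
  .select(
    // SQL equivalent: bml.Usage * COALESCE(u.ConValue, 1) AS acUsage
    (bml("Usage") * coalesce(u("ConValue"), lit(1))).as("acUsage")
  )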
What is the difference between saving the DataFrame as a temp table and then processing it using SQL queries, versus directly using the DataFrame API methods?
For example:
df.registerTempTable("tablename")
sqlCtx.sql("select column1 from tablename where column2='value2' group by column1")
and this
df.where($"column2"==="value2").groupBy($"column1").select($"column1")
Is there any performance difference between these two?