Data insertion into delta table with changing schema - scala

How to insert data into delta table with changing schema in Databricks.
In Databricks Scala, I'm exploding a Map column and loading it into a delta table. I have a predefined schema of the delta table.
Let's say the schema has 4 columns A, B, C, D.
So on day 1, I'm loading my dataframe with 4 columns into the delta table using the code below.
loadfinaldf.write.format("delta")
  .option("mergeSchema", "true")
  .mode("append")
  .insertInto("table")
The columns in the dataframe change every day. For instance, on day 2, two new columns E and F are added and there is no C column, so I have 5 columns A, B, D, E, F in the dataframe. When I load this data into the delta table, columns E and F should be created dynamically in the table schema, the corresponding data should load into these two columns, and column C should be populated with NULL. I was assuming that spark.conf.set("spark.databricks.delta.schema.autoMerge","true") would do the job, but I'm unable to achieve this.
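For reference, a minimal sketch of an append through which mergeSchema can take effect; it assumes the target is a Delta table registered in the metastore under the name "table" and uses saveAsTable, which resolves columns by name (insertInto matches columns by position). This is a sketch of the idea, not a verified fix:
// Sketch: name-based append so that mergeSchema can add new columns (E, F)
// to the table schema and leave columns missing from the dataframe (C) as NULL
// for the appended rows.
loadfinaldf.write
  .format("delta")
  .option("mergeSchema", "true")
  .mode("append")
  .saveAsTable("table")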
My approach:
I was thinking of listing the predefined delta table schema and the dataframe schema and comparing the two before loading the data into the delta table.
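A rough sketch of that comparison, assuming the table already exists and the columns missing from the dataframe should simply be filled with NULL (the helper name alignToTable and the bare table name "table" are illustrative, not a tested solution):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Sketch: add every column that exists in the table but not in the dataframe
// as a typed NULL, so the dataframe lines up with the table before appending.
// Columns that are new in the dataframe are left for mergeSchema to add.
def alignToTable(df: DataFrame, tableName: String): DataFrame = {
  val tableSchema = spark.table(tableName).schema
  val dfCols = df.columns.toSet
  tableSchema.fields.foldLeft(df) { (acc, field) =>
    if (dfCols.contains(field.name)) acc
    else acc.withColumn(field.name, lit(null).cast(field.dataType))
  }
}

// alignToTable(loadfinaldf, "table") can then be appended with the mergeSchema write shown above.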

Can you use some Python logic?
result = pd.concat([df1, df2], axis=1, join="inner")
Then, push your dataframe into a dynamically created SQL table?
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html

Related

merge into deltalake table updates all rows

I'm trying to update a Delta Lake table using a Spark dataframe. What I want to do is update all rows in the Delta Lake table that differ from the Spark dataframe, and insert all rows that are missing from the Delta Lake table.
I tried to do this as follows:
import io.delta.tables._

val not_equal_string = df.schema.fieldNames.map(fn =>
  s"coalesce(not ((updates.${fn} = history.${fn}) or (isnull(history.${fn}) and isnull(updates.${fn}))), false)"
).reduceLeft((x, y) => s"$x OR $y")

val deltaTable = DeltaTable.forPath(spark, "s3a://sparkdata/delta-table")

deltaTable.as("history").merge(
  df.as("updates"), "updates.EquipmentKey = history.EquipmentKey"
).whenMatched(not_equal_string).updateAll().whenNotMatched().insertAll().execute()
This works, but when I look at the resulting delta table I see that it effectively doubled in size, even though I didn't update a single record. A new JSON file was generated with a remove for every old partition and an add with all new partitions.
When I just run a SQL join with the whenMatched criterion as a where condition, I don't get a single row.
I would expect the delta table to be untouched after such a merge operation. Am I missing something simple?
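One commonly suggested mitigation (not part of the original post) is to pre-filter the source dataframe down to the rows that actually changed or are new before merging, so that far fewer target files contain a match and have to be rewritten. A sketch, reusing not_equal_string from above and assuming EquipmentKey is the join key:
import io.delta.tables._
import org.apache.spark.sql.functions.expr

// Sketch: keep only changed or brand-new rows, then merge just those.
val history = spark.read.format("delta").load("s3a://sparkdata/delta-table")
val changedOrNew = df.as("updates")
  .join(history.as("history"), expr("updates.EquipmentKey = history.EquipmentKey"), "left_outer")
  .where(expr(not_equal_string) || expr("history.EquipmentKey is null"))
  .select("updates.*")

DeltaTable.forPath(spark, "s3a://sparkdata/delta-table").as("history")
  .merge(changedOrNew.as("updates"), "updates.EquipmentKey = history.EquipmentKey")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()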

column values change between loading two partitioned tables in KDB (q)

I have two partitioned kdb tables on disk (one called trades, one called books). I created the data
using
.Q.dpft[`:I:/check/trades/;2020.01.01;`symTrade;`trades]
and
.Q.dpft[`:I:/check/books/;2020.01.01;`sym;`books]
for each day. If I select data from the trades table and then load the books table (without selecting any data), the values in the symTrade column of my result change to new values. I assume it has something to do with the partitioning in the books table getting applied to the result from the trades table (also, the trades table is no longer accessible after loading the books table).
How do I:
keep the trades table accessible after loading the books table?
avoid having my symTrade column overwritten by the sym values in the books table?
Here is an example:
system "l I:/check/trades/";
test: 10 sublist select from trades where date=2020.01.01;
show cols test;
// gives `date`symTrade`time`Price`Qty`Volume
select distinct symTrade from test;
// gives TICKER1
// now loading another table
system "l I:/check/books";
select distinct symTrade from test;
// now gives a different value e.g. TICKER200
I think the problem is that you are saving these tables to two different databases.
The first argument in .Q.dpft is the path to the root of the database, and the fourth argument is the name of the table you want to store. So when you do
.Q.dpft[`:I:/check/trades/;2020.01.01;`symTrade;`trades]
you are storing the trades table in a database in I:/check/trades, and when you do
.Q.dpft[`:I:/check/books/;2020.01.01;`sym;`books]
you are storing the books table in a database in I:/check/books. I think q can only load one database at a time, so that might be the problem.
Try doing this
.Q.dpft[`:I:/check/;2020.01.01;`symTrade;`trades]
.Q.dpft[`:I:/check/;2020.01.01;`sym;`books]
system "l I:/check/";
Let us know if that works!

Postgresql trigger and trigger function

I have created 6 tables: A, B, C, D, E and F, where A is the parent table. Now I want to create a trigger to insert data into table F after successful insertion of data into tables A, B, C, D and E. The data in F should contain the values from table A (so basically the trigger should be created on table A).
In addition, a column in table F is json, so I need to combine 3 columns from table A (which are character varying) and insert them into the json field of table F.
The json data should appear as column name : value pairs.
Please suggest the right approach.

Join Multiple Data frames in Spark

I am implementing a project where MySQL data is imported to HDFS using Sqoop. It has nearly 30 tables. I am reading each table as a dataframe by inferring the schema and registering them as temp tables. I have a few questions in doing this...
1. Several joins need to be implemented across the tables, say df1 to df10. In MySQL the query would be
select a.id, b.name, c.AccountName from accounts a, priority b, bills c where a.id = b.id and c.name = a.name
Instead of using
sqlContext.sql("select a.id, b.name, c.AccountName from accounts a, priority b, bills c where a.id = b.id and c.name = a.name")
is there another way to join all the data frames effectively based on conditions?
2. Is converting the tables to data frames and querying on top of them the correct approach, or is there a better way to handle this kind of joining and querying in Spark?
I had a similar problem and I ended up using:
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.DataFrame
val df_list = ListBuffer[DataFrame]()
df_list.toList.reduce((a, b) => a.join(b, a.col(a.schema.head.name) === b.col(b.schema.head.name), "left_outer"))
You could also write a free-form SQL statement in Sqoop and join everything there, or use Spark JDBC to do the same job.
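For comparison, a sketch of the same kind of join expressed with the DataFrame API instead of a single SQL string (accountsDf, priorityDf and billsDf are assumed to be the dataframes read from the Sqoop-imported tables; column names are taken from the question):
import org.apache.spark.sql.functions.col

// Sketch: chain explicit joins and select the wanted columns.
val joined = accountsDf.as("a")
  .join(priorityDf.as("b"), col("a.id") === col("b.id"))
  .join(billsDf.as("c"), col("c.name") === col("a.name"))
  .select(col("a.id"), col("b.name"), col("c.AccountName"))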

running tasks in parallel on separate Hive partitions using Scala and Spark to speed up loading Hive and writing results to Hive or Parquet

This question is a spin-off from this one: saving a list of rows to a Hive table in pyspark.
EDIT: please see my update at the bottom of this post.
I have used both Scala and now Pyspark to do the same task, but I am having problems with VERY slow saves of a dataframe to parquet or csv, or converting a dataframe to a list or array type data structure. Below is the relevant python/pyspark code and info:
#Table is a List of Rows from a small Hive table I loaded using:
#query = "SELECT * FROM Table"
#Table = sqlContext.sql(query).collect()
for i in range(len(Table)):
    val1 = Table[i][0]
    val2 = Table[i][1]
    count = Table[i][2]
    x = 100 - count
    #hivetemp is a table that I copied from Hive to my hdfs using:
    #create external table IF NOT EXISTS hivetemp LIKE hivetableIwant2copy LOCATION "/user/name/hiveBackup";
    #INSERT OVERWRITE TABLE hivetemp SELECT * FROM hivetableIwant2copy;
    query = "SELECT * FROM hivetemp WHERE col1<>\""+val1+"\" AND col2 ==\""+val2+"\" ORDER BY RAND() LIMIT "+str(x)
    rows = sqlContext.sql(query)
    rows = rows.withColumn("col4", lit(10))
    rows = rows.withColumn("col5", lit(some_string))
    #writing to parquet is heck slow AND I can't work with pandas because the library is not installed on the server
    rows.saveAsParquetFile("rows"+str(i)+".parquet")
    #tried this before and it was heck slow as well
    #rows_list = rows.collect()
    #shuffle(rows_list)
I have tried to do the above in Scala and had similar problems. I could easily load a hive table or query one, but doing a random shuffle or storing a large dataframe runs into memory issues. There were also some challenges with adding the 2 extra columns.
The Hive table (hiveTemp) that I want to add rows to has 5,570,000 (~5.5 million) rows and 120 columns.
The Hive table that I am iterating through in the for loop has 5000 rows and 3 columns. There are 25 unique val1 values (a column in hiveTemp) and 3000 combinations of val1 and val2. Val2 could be one of 5 columns and its specific cell value. This means that if I tweaked the code, I could reduce the row lookups from 5000 down to 26, but the number of rows I would have to retrieve, store and randomly shuffle would still be pretty large and hence a memory issue (unless anyone has suggestions on this).
The total number of rows I need to add to the table is about 100,000.
The ultimate goal is to have the original table of 5.5 million rows appended with the 100k+ rows and written as a Hive or Parquet table. If it's easier, I am fine with writing the 100k rows to their own table and merging it into the 5.5 million row table later.
Scala or Python is fine, though Scala is preferred.
Any advice on this and the options that would be best would be great.
Thanks a lot!
EDIT
Some additional thoughts I had on this problem:
I used the hash partitioner to partition the hive table into 26 partitions, based on a column that has 26 distinct values. The operations I want to perform in the for loop could be generalized so that they only need to happen on each of these partitions.
That being said, how could I write the Scala code to do this (or what guide can I look at online), so that a separate executor runs the loop for each partition? I am thinking this would make things much faster.
I know how to do something like this using multithreading, but I am not sure how to do it in the Scala/Spark paradigm.
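For what it's worth, a rough sketch of one way to approach this in Scala: run the per-value work concurrently from the driver with Futures, so Spark can schedule the resulting jobs across the cluster in parallel. Everything here is illustrative; partitionCol is a stand-in for the column with the 26 distinct values, and doWorkForValue is a placeholder for the body of the for loop above:
import org.apache.spark.sql.functions.col
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Sketch: one Spark job per distinct value, submitted concurrently from the driver.
val distinctVals = spark.table("hivetemp")
  .select("partitionCol").distinct().collect().map(_.getString(0))

def doWorkForValue(v: String): Unit = {
  val rows = spark.table("hivetemp").where(col("partitionCol") === v)
  // ... add the extra columns, sample/shuffle, etc., as in the loop above ...
  rows.write.mode("overwrite").parquet(s"/tmp/rows_$v.parquet")
}

val futures = distinctVals.toSeq.map(v => Future(doWorkForValue(v)))
Await.result(Future.sequence(futures), Duration.Inf)
Setting spark.scheduler.mode to FAIR can also let the concurrent jobs share executors instead of running strictly FIFO, though whether that helps depends on the cluster size.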