I have a table in my database called products with columns productId, ProductName, BrandId, and BrandName. I need to create a delta table for each brand by passing the brand id as a parameter, and each table should be named after the corresponding brand with a .delta suffix. Every time new data is inserted into products (the master table), the brand tables need to be truncated and reloaded from it. Could you please let me know if this is possible within Databricks using Spark or dynamic SQL?
It's easy to do; there are really a few variants:
In Spark: read the data from the source table, filter/transform it, and use .saveAsTable in overwrite mode:
df = spark.read.table("products")
# ... transform df ...
brand_table_name = "brand1"
df.write.mode("overwrite").saveAsTable(brand_table_name)
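For example, a minimal sketch of that filter step, assuming the BrandId column described in the question (brand id 1 and the brand1 table name are just illustrative):
df = spark.read.table("products").where("BrandId = 1")
df.write.mode("overwrite").saveAsTable("brand1")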
In SQL, by using CREATE OR REPLACE TABLE (you can use spark.sql to substitute variables into this text):
CREATE OR REPLACE TABLE brand1
USING delta
AS SELECT * FROM products where .... filter condition
For a list of brands you just need to use spark.sql in a loop (a fleshed-out sketch follows the snippet):
for brand in brands:
    spark.sql(f"""CREATE OR REPLACE TABLE {brand}
        USING delta
        AS SELECT * FROM products where .... filter condition""")
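Here is a minimal runnable sketch of that loop, assuming products has the BrandId column from the question; the brand_{...} table names are illustrative:
brands = [r["BrandId"] for r in spark.sql("SELECT DISTINCT BrandId FROM products").collect()]
for brand in brands:
    # CREATE OR REPLACE TABLE atomically replaces the previous contents,
    # which covers the truncate-and-reload requirement
    spark.sql(f"""CREATE OR REPLACE TABLE brand_{brand}
        USING delta
        AS SELECT * FROM products WHERE BrandId = {brand}""")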
P.S. Really, I think you just need to define views (doc) over the products table with the corresponding condition; that way you avoid data duplication and don't incur compute costs for those writes.
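For example, a sketch of the view-based variant (same BrandId assumption as above; the _v suffix is just illustrative):
for brand in brands:
    spark.sql(f"""CREATE OR REPLACE VIEW brand_{brand}_v
        AS SELECT * FROM products WHERE BrandId = {brand}""")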
I have a handful of tables, each only a few MBs in file size, that I want to capture as Delta tables. Inserting new data into them takes an extraordinarily long time, 15+ minutes, which astonishes me.
The culprit, I am guessing, is that while the tables are very small, they have over 300 columns.
I have tried the following methods, with the former being faster than the latter (unsurprisingly?): (1) INSERT INTO, (2) MERGE INTO.
Before inserting data into the Delta tables, I apply a handful of Spark functions to clean the data and then register the result as a temp table (e.g., INSERT INTO DELTA_TBL_OF_INTEREST (cols) SELECT * FROM tempTable).
Any recommendations on speeding this process up for trivial data?
If you're performing data transformations using PySpark before putting the data into the destination table, then you don't need to drop to the SQL level; you can just write the data using append mode.
If you're using a registered table:
df = ...  # transform source data
df.write.mode("append").format("delta").saveAsTable("table_name")
If you're using a file path:
df = ...  # transform source data
df.write.mode("append").format("delta").save("path_to_delta")
I would like to organize my PostgreSQL table in ascending order by the date each row was created on.
So I tried:
SELECT *
FROM price
ORDER BY created_on;
And it did show me the rows in that order; however, it did not save that order.
Is there a way I can make it so it gets saved?
Tables in a relational database represent unordered sets. There is no such thing as the "order of rows" in a table.
If you need a specific sort order, the only way is to use an order by in a select statement as you did.
If you don't want to type the order by each time, you can create a view that does that:
create view sorted_price
as
select *
from price
order by created_on;
But be warned: if you sort the rows from the view in a different way, e.g. select * from sorted_price order by created_on desc, Postgres will actually apply two sorts. The query optimizer is unfortunately not smart enough to remove the one stored in the view's definition.
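You can see the two sorts for yourself with EXPLAIN (using the sorted_price view defined above):
explain select * from sorted_price order by created_on desc;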
I have two partitioned kdb tables on disk (one called trades, one called books). I created the data by using
.Q.dpft[`:I:/check/trades/;2020.01.01;`symTrade;`trades]
and
.Q.dpft[`:I:/check/books/;2020.01.01;`sym;`books]
for each day. If I select data from the trades table and then load the books table (without selecting any data), the values in the symTrade column of my result change to new values. I assume it has got something to do with the partitioning in the books table getting applied to the result from the trades table (also, the trades table is no longer accessible after loading the books table).
How do I:
keep the trades table accessible after loading the books table?
avoid having my symTrade column overwritten by the sym values in
the books table?
Here is an example:
system "l I:/check/trades/";
test: 10 sublist select from trades where date=2020.01.01;
show cols test;
// gives `date`symTrade`time`Price`Qty`Volume
select distinct symTrade from test;
// gives TICKER1
// now loading another table
system "l I:/check/books";
select distinct symTrade from test;
// now gives a different value e.g. TICKER200
I think the problem is that you are saving these tables to two different databases.
The first argument in .Q.dpft is the path to the root of the database, and the fourth argument is the name of the table you want to store. So when you do
.Q.dpft[`:I:/check/trades/;2020.01.01;`symTrade;`trades]
You are storing the trades table in a database in I:/check/trades and when you do
.Q.dpft[`:I:/check/books/;2020.01.01;`sym;`books]
you are storing the books table in a database in I:/check/books. q can only load one database at a time, and each database root keeps its own sym file of enumerated symbols; loading the second root replaces the sym list that the enumerated symTrade values in your result index into, which is why they suddenly resolve to different tickers.
Try doing this
.Q.dpft[`:I:/check/;2020.01.01;`symTrade;`trades]
.Q.dpft[`:I:/check/;2020.01.01;`sym;`books]
system "l I:/check/";
Let us know if that works!
In most common cases we have two (or more) tables in the DB, termed master (e.g. SalesOrderHeader) and child (e.g. SalesOrderDetail).
We can read records from the DB with one SELECT with an INNER JOIN and an additional WHERE constraint to lessen the volume of data loaded from the DB (using "Adapter.Fill(DataSet)"):
#"SELECT d.SalesOrderID, d.SalesOrderDetailID, d.OrderQty,
d.ProductID, d.UnitPrice
FROM Sales.SalesOrderDetail d
INNER JOIN Sales.SalesOrderHeader h
ON d.SalesOrderID = h.SalesOrderID
WHERE DATEPART(YEAR, OrderDate) = #year;"
Did I understand right that in this case we receive one table in the DataSet, without primary and foreign keys, and without the possibility to set a constraint between master and child tables?
This DataSet is then useful only for queries against the columns and records that exist in the DataSet?
We can't use DbCommandBuilder to create the SqlCommands for Insert, Update, and Delete based on the SelectCommand which was used for filling the DataSet, and simply update the data in these tables in the DB?
If we want to organize local data modification in the tables using the disconnected layer of ADO.NET, we must populate the DataSet with two SELECTs:
"SELECT *
FROM Sales.SalesOrderHeader;"
"SELECT *
FROM Sales.SalesOrderDetail;"
After that we must create the primary keys for both tables, set the constraint between master and child table, and create the Insert, Update, and Delete SqlCommands with DbCommandBuilder.
In that case we will have the possibility to modify the data in these tables locally and afterwards update the records in the DB (using "Adapter.Update(DataSet)").
If we use one SelectCommand to load data into two tables in the DataSet, can we use that SelectCommand with DbCommandBuilder to create the other SqlCommands for Update, and update all tables in the DataSet with one "Adapter.Update(DataSet)", or must we create a separate adapter to update every table?
If, to save resources, I load only part of the records from a table (e.g. SalesOrderDetail, see below), do I understand right that in this case I can have problems when I send new records to the DB (by Update), because new records can conflict by primary key with records already existing in the DB (records which have another value in the OrderDate field)?
"SELECT *
FROM Sales.SalesOrderDetail
WHERE DATEPART(YEAR, OrderDate) = @year;"
There is nothing preventing you from writing your own Insert, Update and Delete commands for your first select statement with the join. Of course you will have to determine a way to assure that the foreign keys exist.
Insert Into SalesOrderDetail (SalesOrderID, OrderQty, ProductID, UnitPrice) Values (@SalesOrderID, @OrderQty, @ProductID, @UnitPrice);
Update SalesOrderDetail Set OrderQty = @OrderQty Where SalesOrderDetailID = @ID;
Delete From SalesOrderDetail Where SalesOrderDetailID = @ID;
You would execute these with ADO.NET commands instead of using the adapter. I wrote the sample code in VB.NET, but I am sure it is easy to change to C# if you prefer.
Private Sub UpdateQuantity(Quant As Integer, DetailID As Integer)
    Using cn As New SqlConnection("Your connection string"),
          cmd As New SqlCommand("Update SalesOrderDetail Set OrderQty = @OrderQty Where SalesOrderDetailID = @ID;", cn)
        cmd.Parameters.Add("@OrderQty", SqlDbType.Int).Value = Quant
        cmd.Parameters.Add("@ID", SqlDbType.Int).Value = DetailID
        cn.Open()
        cmd.ExecuteNonQuery()
    End Using
End Sub
I am fairly new to DB2 (and SQL in general) and I am having trouble finding an efficient method to DECODE columns.
Currently the database has a number of tables, most of which have a significant number of numeric columns whose values correspond to a lookup table with the real values. We are talking 9,500 different values (e.g. '502 = yes' or '1413 = Graduate Student').
Normally I would just use a WHERE clause joining on equality, but since there are 20-30 columns that need to be decoded per table, I can't really do this (that I know of).
Is there a way to effectively just display the corresponding value from the other table?
Example:
SELECT TEST_ID, DECODE(TEST_STATUS, 5111, 'Approved', 5112, 'In Progress') TEST_STATUS
FROM TEST_TABLE
The above works fine.......but I manually look up the numbers and review them to build the statements. As I mentioned, some tables have 20-30 columns that would need this AND some need DECODE statements that would be 12-15 conditions.
Is there anything that would allow me to do something simpler like:
SELECT TEST_ID, DECODE(TEST_STATUS = *TableWithCodeValues*) TEST_STATUS
FROM TEST_TABLE
EDIT: Also, to be clearer, I know I can do a ton of INNER JOINs, but I wasn't sure if there was a more efficient way than that.
From a logical point of view, I would consider splitting the lookup table into several domain/dimension tables. Not sure if that is possible to do for you, so I'll leave that part.
As mentioned in my comment, I would stay away from using DECODE as described in your post. I would start by doing it with the usual joins:
SELECT a.TEST_STATUS
, b.TEST_STATUS_DESCRIPTION
, a.ANOTHER_STATUS
, c.ANOTHER_STATUS_DESCRIPTION
, ...
FROM TEST_TABLE as a
JOIN TEST_STATUS_TABLE as b
ON a.TEST_STATUS = b.TEST_STATUS
JOIN ANOTHER_STATUS_TABLE as c
ON a.ANOTHER_STATUS = c.ANOTHER_STATUS
JOIN ...
If things are too slow there are a couple of things you can try:
Create a statistical view that can help determine cardinalities from the joins (this may help the optimizer create a better plan):
https://www.ibm.com/support/knowledgecenter/sl/SSEPGG_9.7.0/com.ibm.db2.luw.admin.perf.doc/doc/c0021713.html
If your license admits, you can experiment with Materialized Query Tables (MQTs). Note that there is a penalty for modifications of the base tables, so if you have more of an OLTP workload, this is probably not a good idea:
https://www.ibm.com/developerworks/data/library/techarticle/dm-0509melnyk/index.html
A third option, if your lookup table is fairly static, is to cache the lookup table in the application: read TEST_TABLE from the database and look up the descriptions in the application. A further improvement may be to add triggers that invalidate the cache when the lookup table is modified.
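A minimal sketch of that caching idea in Python, assuming a DB-API style connection object conn and a code/description table like the test.lookup table shown in the next answer (all names illustrative):
cur = conn.cursor()
cur.execute("SELECT id, text FROM test.lookup")
cache = dict(cur.fetchall())  # code -> description, loaded once

def describe(code):
    # None when the code is unknown, mirroring the SQL function's NULL
    return cache.get(code)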
If you don't want to do all these joins, you can create your own LOOKUP function:
create or replace function lookup(IN_ID INTEGER)
returns varchar(32)
deterministic reads sql data
begin atomic
declare OUT_TEXT varchar(32);--
set OUT_TEXT=(select text from test.lookup where id=IN_ID);--
return OUT_TEXT;--
end;
With a table TEST.LOOKUP like
create table test.lookup(id integer, text varchar(32))
containing some id/text pairs, this will return the text value corresponding to an id, or NULL if it is not found.
With your mentioned 10k id/text pairs and an index on the ID field, this shouldn't be a performance issue, as that amount of data should easily fit in the corresponding bufferpool.
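For example, with the question's table names, the query then simply becomes (lookup is the function defined above):
SELECT TEST_ID, lookup(TEST_STATUS) AS TEST_STATUS
FROM TEST_TABLE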