I have a handful of tables, only a few MB in size each, that I want to capture as Delta tables. Inserting new data into them takes an extraordinarily long time, 15+ minutes, which I am astonished at.
The culprit, I am guessing, is that while the tables are very small, they have over 300 columns.
I have tried the following methods, with the former being faster than the latter (unsurprisingly?): (1) INSERT INTO, (2) MERGE INTO.
Before inserting data into the Delta tables, I apply a handful of Spark functions to clean the data and then lastly register it as a temp table (e.g., INSERT INTO DELTA_TBL_OF_INTEREST (cols) SELECT * FROM tempTable).
Any recommendations on speeding this process up for trivial data?
If you're performing data transformations using PySpark before putting the data into the destination table, then you don't need to drop down to SQL; you can just write the data using append mode.
If you're using a registered table:
df = ... transform source data ...
df.write.mode("append").format("delta").saveAsTable("table_name")
If you're using a file path:
df = ... transform source data ...
df.write.mode("append").format("delta").save("path_to_delta")
Related
I have a table in my database called products that has productId, ProductName, BrandId and BrandName. I need to create Delta tables for each brand by passing the brand id as a parameter, and the table name should be the corresponding brand .delta table. Every time new data is inserted into products (the master table), the data in the brand tables needs to be truncated and reloaded into the brand .delta tables. Could you please let me know if this is possible within Databricks using Spark or dynamic SQL?
It's easy to do; there are really a few variants:
in Spark: read data from the source table, filter it, etc., and use .saveAsTable in overwrite mode:
df = spark.read.table("products")
... transform df
brand_table_name = "brand1"
df.write.mode("overwrite").saveAsTable(brand_table_name)
in SQL by using CREATE OR REPLACE TABLE (You can use spark.sql to substitute variables in this text):
CREATE OR REPLACE TABLE brand1
USING delta
AS SELECT * FROM products where .... filter condition
for a list of brands you just need to use spark.sql in a loop:
for brand in brands:
    spark.sql(f"""CREATE OR REPLACE TABLE {brand}
        USING delta
        AS SELECT * FROM products where .... filter condition""")
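Outside of Spark, what the loop does is plain string formatting, one statement per brand. A minimal sketch of the statement generation (the brand names and the brand_id filter column are made-up placeholders, not from the original schema):

```python
# Hypothetical mapping of brand table names to brand ids.
brands = {"brand1": 1, "brand2": 2}

statements = []
for brand, brand_id in brands.items():
    # Build one CREATE OR REPLACE TABLE statement per brand.
    stmt = (
        f"CREATE OR REPLACE TABLE {brand} "
        f"USING delta "
        f"AS SELECT * FROM products WHERE brand_id = {brand_id}"
    )
    statements.append(stmt)
    # In Databricks you would then run: spark.sql(stmt)

for s in statements:
    print(s)
```

CREATE OR REPLACE TABLE atomically swaps in the new contents, which covers the "truncate and reload" requirement in one statement.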
P.S. Really, I think you just need to define views (doc) over the products table with the corresponding condition; that way you avoid data duplication and don't incur compute costs for those writes.
I have a requirement to transfer data from 2 tables (Table A and Table B) into a new table.
I am using a query to join both A and B tables using an ID column.
Table A and B are archive tables without any indexes. (Millions of records)
Table X and Y are a replica of A and B with good indexes. (Some thousands of records)
Below is the code for my project.
with data as
(
SELECT a.*, b.* FROM A_archive a
join B_archive b on a.transaction_id = b.transaction_id
UNION
SELECT x.*, y.* FROM X x
join Y y on x.transaction_id = y.transaction_id
)
INSERT INTO
Another_Table
(
columns
)
select * from data
On Conflict(transaction_id)
do update ...
The above whole thing is running in production environment and has nearly 140 million records.
Because of this, the production database is taking almost 10 hours to process the data, and it is failing.
I also have a distributed job scheduler in AWS that schedules this query inside a function and retrieves the latest records every 5 hours. The archive tables store closed invoice data. The Pega UI will use this table to retrieve data about closed invoices and show it to the customer.
Please suggest something that is a bit more performant.
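For reference, the ON CONFLICT clause in the query above is a standard upsert, and its behavior is easy to check in a scratch database. A small sketch in sqlite (the table and columns here are illustrative; Postgres syntax has the same shape):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# transaction_id must carry a unique constraint for ON CONFLICT to target it.
con.execute("CREATE TABLE invoices (transaction_id INTEGER PRIMARY KEY, amount REAL)")
con.execute("INSERT INTO invoices VALUES (1, 10.0)")

# Upsert: insert new rows, update existing rows on key conflict.
con.executemany(
    """INSERT INTO invoices (transaction_id, amount) VALUES (?, ?)
       ON CONFLICT(transaction_id) DO UPDATE SET amount = excluded.amount""",
    [(1, 99.0), (2, 20.0)],
)

rows = con.execute("SELECT * FROM invoices ORDER BY transaction_id").fetchall()
print(rows)  # row 1 was updated, row 2 was inserted
```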
UNION removes duplicate rows. On big unindexed tables that is an expensive operation. Try UNION ALL if you don't need deduplication. It will save the huge amount of data shuffling and comparison work required for deduplication.
Without indexes on your archival tables your JOIN operation will be grossly inefficient. Index, at a minimum, the transaction_id columns you use in your ON clause.
You don't say what you want to do with the resulting table. In many cases you'll be able to use a VIEW rather than a table for your purposes. A VIEW removes the work of creating the derived table. Actually it defers the work to the time of SELECT operations using the derived structure. If your SELECT operations have highly selective WHERE clauses the savings can be astonishing. For this to work well you may need to put appropriate indexes on your archival tables.
You use SELECT * when you could enumerate the columns you need. That certainly puts one redundant column into your result: it generates two copies of transaction_id. It also may generate other redundant or unused data. Always avoid SELECT * in production software unless you know you need it.
Keep this in mind: SQL is declarative, not procedural. You declare (describe) the result you require, and you let the server work out the best way to get it. VIEWs let the server do this work for you in cases like your table combination. It will use the indexes you provide as best it can.
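To make the VIEW suggestion concrete, here is a small sketch in sqlite (the table names and columns are illustrative, not your real schema; the same idea applies in Postgres). The view stores nothing; the join runs only when the view is queried, and a selective WHERE on an indexed column keeps that cheap:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a_archive (transaction_id INTEGER, amount REAL);
    CREATE TABLE b_archive (transaction_id INTEGER, status TEXT);
    CREATE INDEX idx_a ON a_archive (transaction_id);
    CREATE INDEX idx_b ON b_archive (transaction_id);
    INSERT INTO a_archive VALUES (1, 10.0), (2, 20.0);
    INSERT INTO b_archive VALUES (1, 'closed'), (2, 'closed');

    -- The view materializes no data; the join is deferred to SELECT time.
    CREATE VIEW invoices AS
    SELECT a.transaction_id, a.amount, b.status
    FROM a_archive a
    JOIN b_archive b ON a.transaction_id = b.transaction_id;
""")

# A selective WHERE on the indexed join column touches only the rows it needs.
row = con.execute(
    "SELECT amount, status FROM invoices WHERE transaction_id = ?", (2,)
).fetchone()
print(row)
```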
That UNION must be costly: it pretty much builds a temp table in the background containing all the A-B and X-Y records, sorts it (over all fields), and then removes any duplicates. If 100 million records are involved, that's a LOT of sorting going on, most likely involving swapping out to disk.
Keep in mind that you only need to do this if duplicates are expected:
- in the result from the JOIN between A and B
- in the result from the JOIN between X and Y
- in the combined result of the two above
If none of those are expected, just use UNION ALL.
In fact, in that case, why not have 1 INSERT operation for A-B and another one for X-Y? Going by the description I'd say that whatever is in X-Y should overrule whatever is in A-B anyway, right?
Also, as mentioned by O.Jones, archive tables or not, they should come at least with a (preferably clustered) index on the transaction_id fields you're JOINing on. (same for the Another_Table btw)
All that said, processing 100M records in 1 transaction IS going to take some time, it's just a lot of data that's being moved around. But 10h does sound excessive indeed.
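The semantic difference between the two is easy to verify in a scratch database. A quick sketch in sqlite (Postgres behaves the same way): UNION must compare everything to drop duplicates, while UNION ALL just concatenates the two result sets.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1 (id INTEGER);
    CREATE TABLE t2 (id INTEGER);
    INSERT INTO t1 VALUES (1), (2);
    INSERT INTO t2 VALUES (2), (3);
""")

# UNION deduplicates across both inputs (the value 2 appears once)...
union_rows = con.execute(
    "SELECT id FROM t1 UNION SELECT id FROM t2 ORDER BY id").fetchall()

# ...UNION ALL keeps every row from both inputs (the value 2 appears twice).
union_all_rows = con.execute(
    "SELECT id FROM t1 UNION ALL SELECT id FROM t2 ORDER BY id").fetchall()

print(union_rows)
print(union_all_rows)
```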
I'm using pyspark and want to show the user a preview of a (very large, 10 million rows for example) table. For example, the user can see 5000 rows of the table (first/last/random, any 5000 rows are OK), so what is the fastest way to get n rows from the table? I have tried limit and sample, but these functions will still scan the whole table; the time complexity is O(N), which takes a lot of time.
spark.sql('select * from some_table').limit(N)
Can someone help me?
spark.sql('select * from some_table limit 10')
Since you are making a SQL call from Python, this is by far the easiest solution. And it's fast. I don't think it scans the whole table when the LIMIT is inside the SQL statement. Assuming your table is already cached, are you sure the delay is caused by scanning the table, or is it caused by materializing the table?
As an alternative, assuming you have a Python dataframe handle, df_some_table, it gets trickier because .head() returns a list of Row objects and .show() only prints (returning None) rather than a dataframe, but they can work for peeking at the dataframe.
df_some_table.head(N)
df_some_table.show(N)
I have built a system where data is loaded from S3 into Redshift every few minutes (from a Kinesis Firehose). I then grab data from that main table and split it into a table per customer.
The main table has a few hundred million rows.
creating the subtable is done with a query like this:
create table {$table} as select * from {$source_table} where customer_id = '{$customer_id}' and time between '{$start}' and '{$end}'
I have keys defined as:
SORTKEY (customer_id, time)
DISTKEY customer_id
Everything I have read suggests this would be the optimal way to structure my tables/queries, but the performance is absolutely awful. Building the sub tables takes over a minute even with only a few rows to select.
Am I missing something or do I just need to scale the cluster?
If you do not have a better key you may have to consider using DISTSTYLE EVEN, keeping the same sort key.
Ideally the distribution key should be a value that is used in joins and spreads your data evenly across the cluster. By using customer_id as the distribution key and then filtering on that key you're forcing all work to be done on just one slice.
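A toy model of why this happens: with customer_id as the distribution key, every row for a given customer hashes to the same slice, so a query filtered to one customer runs on a single slice while the rest of the cluster sits idle. The slice count and hash function below are made up for illustration; Redshift's real distribution hash is internal.

```python
from collections import Counter

NUM_SLICES = 8  # hypothetical cluster

def slice_for(customer_id: str) -> int:
    # Stand-in for the real distribution hash.
    return sum(ord(c) for c in customer_id) % NUM_SLICES

# 10,000 rows spread across 50 hypothetical customers.
rows = [(f"cust{i % 50}", i) for i in range(10_000)]
per_slice = Counter(slice_for(cid) for cid, _ in rows)

# Filtering on the DISTKEY: every matching row lives on exactly one slice.
target = {slice_for(cid) for cid, _ in rows if cid == "cust7"}
print(sorted(per_slice.values()))
print(target)  # a single slice does all the work for this customer
```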
To see this in action look in the system tables. First, find an example query:
SELECT *
FROM stl_query
WHERE userid > 1
ORDER BY starttime DESC
LIMIT 10;
Then, look at the bytes per slice for each step of your query in svl_query_report:
SELECT *
FROM svl_query_report
WHERE query = <your query id>
ORDER BY query,segment,step,slice;
For a very detailed guide on designing the best table structure have a look at our "Amazon Redshift Engineering’s Advanced Table Design Playbook"
this question is a spin-off from this one (saving a list of rows to a Hive table in pyspark).
EDIT please see my update edits at the bottom of this post
I have used both Scala and now Pyspark to do the same task, but I am having problems with VERY slow saves of a dataframe to parquet or csv, or converting a dataframe to a list or array type data structure. Below is the relevant python/pyspark code and info:
#Table is a List of Rows from a small Hive table I loaded using
#query = "SELECT * FROM Table"
#Table = sqlContext.sql(query).collect()
for i in range(len(Table)):
    val1 = Table[i][0]
    val2 = Table[i][1]
    count = Table[i][2]
    x = 100 - count
    #hivetemp is a table that I copied from Hive to my hdfs using:
    #create external table IF NOT EXISTS hivetemp LIKE hivetableIwant2copy LOCATION "/user/name/hiveBackup";
    #INSERT OVERWRITE TABLE hivetemp SELECT * FROM hivetableIwant2copy;
    query = "SELECT * FROM hivetemp WHERE col1<>\""+val1+"\" AND col2 ==\""+val2+"\" ORDER BY RAND() LIMIT "+str(x)
    rows = sqlContext.sql(query)
    rows = rows.withColumn("col4", lit(10))
    rows = rows.withColumn("col5", lit(some_string))
    #writing to parquet is heck slow AND I can't work with pandas due to the library not being installed on the server
    rows.saveAsParquetFile("rows"+str(i)+".parquet")
    #tried this before and it was heck slow also
    #rows_list = rows.collect()
    #shuffle(rows_list)
I have tried to do the above in Scala, and I had similar problems. I could easily load the Hive table or a query of a Hive table, but needing to do a random shuffle or store a large dataframe ran into memory issues. There were also some challenges with being able to add the 2 extra columns.
The Hive table (hiveTemp) that I want to add rows to has 5,570,000 (~5.5 million) rows and 120 columns.
The Hive table that I am iterating through in the for loop has 5000 rows and 3 columns. There are 25 unique values of val1 (a column in hiveTemp), and 3000 combinations of val1 and val2. Val2 could be one of 5 columns and its specific cell value. This means that if I tweaked the code, I could reduce the row lookups from 5000 down to 26, but the number of rows I would have to retrieve, store and randomly shuffle would be pretty large and hence a memory issue (unless anyone has suggestions on this).
As far as how many total rows I need to add to the table, it might be about 100,000.
The ultimate goal is to have the original table of 5.5 million rows appended with the 100k+ rows, written as a Hive or parquet table. If it's easier, I am fine with writing the 100k rows into their own table that can be merged with the 5.5 million row table later.
Scala or Python is fine, though Scala is preferred.
Any advice on this and the options that would be best would be great.
Thanks a lot!
EDIT
Some additional thoughts I had on this problem:
I used the hash partitioner to partition the Hive table into 26 partitions, based on a column that has 26 distinct values. The operations I want to perform in the for loop could be generalized so that they only need to happen on each of these partitions.
That being said, how could I, or what guide can I look at online, to write the Scala code to do this so that a separate executor handles each of these loops on its own partition? I am thinking this would make things much faster.
I know how to do something like this using multithreading, but I'm not sure how to in the Scala/Spark paradigm.
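One common pattern here (shown as a Python sketch; the same idea works in Scala with Futures) is to submit the per-partition work from separate driver threads, so each call becomes its own Spark job and the cluster overlaps them. The process_partition body below is a placeholder for the real per-partition Spark work, not code from the question:

```python
from concurrent.futures import ThreadPoolExecutor

PARTITION_KEYS = [chr(ord("a") + i) for i in range(26)]  # stand-in for the 26 partitions

def process_partition(key: str) -> str:
    # Placeholder for the real per-partition work, e.g. something like
    # spark.sql(f"SELECT ... FROM hivetemp WHERE part = '{key}' ...").write...
    # Each action submitted from its own thread becomes a separate Spark job.
    return f"done:{key}"

# Threads (not processes) are usually enough: the driver mostly waits on the
# cluster while each job runs, so the jobs overlap.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process_partition, PARTITION_KEYS))

print(len(results))
```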