I am performing an incremental load on data coming from a Teradata database and storing it as a Parquet file. Because the tables in Teradata contain billions of rows, I would like my PySpark script to compare hash values.
Teradata Table:
An example table from Teradata
Current Stored Parquet File:
Data stored in parquet file
My PySpark script uses a JDBC read connection to make the call to teradata:
tdDF = spark.read \
    .format("jdbc") \
    .option("driver", "com.teradata.jdbc.TeraDriver") \
    .option("url", "jdbc:teradata://someip/DATABASE=somedb,MAYBENULL=ON") \
    .option("dbtable", "(SELECT * FROM somedb.table) tmp") \
    .load()
Spark script that reads in the parquet:
myDF = spark.read.parquet("myParquet")
myDF.createOrReplaceTempView("myDF")
spark.sql("select * from myDF").show()
How can I:
Include a hash function in my call to Teradata that returns the hash of the entire row's values (this hash should be computed on Teradata)
Include a hash function in my PySpark code when reading in the parquet file that returns the hash of the entire row's values (this hash should be computed in Spark)
Compare these two hashes to determine the delta from Teradata that needs to be loaded
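For the Spark-side hash and the comparison, a minimal PySpark sketch is shown below. Note that two hashes are only comparable when the same algorithm and column order are used on both sides, so in this sketch both hashes are computed in Spark (on the JDBC result and on the parquet data); the column list is an assumption taken from the example tables.

from pyspark.sql import functions as F

cols = ["Name", "Account", "Product"]  # assumption: the table's full column list
# Note: concat_ws skips NULLs, which may need extra handling for collision safety.
row_hash = F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in cols]), 256)

td_hashed = tdDF.withColumn("row_hash", row_hash)
pq_hashed = myDF.withColumn("row_hash", row_hash)

# Rows whose hash exists on the Teradata side but not in the parquet file form the delta to load.
delta = td_hashed.join(pq_hashed.select("row_hash"), on="row_hash", how="left_anti").drop("row_hash")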
You want to insert new rows or, if rows with the same identifying info exist, update them. This is called 'upsert', or in Teradata, 'merge'.
It depends on which columns are allowed to change and which ones make a row 'new'.
In your examples you have:
Teradata
Name Account Product
------+--------+---------
Sam 1234 Speakers
Jane 1256 Earphones
Janet 3214 Laptop
Billy 5678 HardDisk
parquet
Name Account Product
------+--------+---------
Sam 1234 Speakers
Jane 1256 Earphones
So if any Name,Account combination should be unique, the database table should have a unique key defined for it.
With that, the database won't allow insert of another row with the same unique key, but will allow you to update it.
So going by this example, with your example data, your SQL commands would look like:
UPDATE somedb.table SET product = 'Speakers' WHERE name = 'Sam' AND account = 1234 ELSE INSERT INTO somedb.table(name, account, product) VALUES('Sam',1234,'Speakers');
UPDATE somedb.table SET product = 'Earphones' WHERE name = 'Jane' AND account = 1256 ELSE INSERT INTO somedb.table(name, account, product) VALUES('Jane',1256,'Earphones');
UPDATE somedb.table SET product = 'Laptop' WHERE name = 'Janet' AND account = 3214 ELSE INSERT INTO somedb.table(name, account, product) VALUES('Janet',3214,'Laptop');
UPDATE somedb.table SET product = 'HardDisk' WHERE name = 'Billy' AND account = 5678 ELSE INSERT INTO somedb.table(name, account, product) VALUES('Billy',5678,'HardDisk');
But this is a very simplistic approach that will likely perform very poorly.
Googling 'teradata bulk upload' finds links such as:
https://kontext.tech/article/483/teradata-fastload-load-csv-file
https://etl-sql.com/6-ways-to-load-data-file-into-teradata-table/
There are likely many others.
Related
I have a table in my database called products that has productId, ProductName, BrandId and BrandName. I need to create delta tables for each brand by passing the brand id as a parameter, and the table name should be the corresponding .delta table. Every time new data is inserted into products (the master table), the data in the brand tables needs to be truncated and reloaded into the brand .delta tables. Could you please let me know if this is possible within Databricks using Spark or dynamic SQL?
It's easy to do; really, there are a few variants:
in Spark: read data from the source table, filter/transform it, etc., and use .saveAsTable in overwrite mode:
df = spark.read.table("products")
# ... transform df
brand_table_name = "brand1"
df.write.mode("overwrite").saveAsTable(brand_table_name)
in SQL by using CREATE OR REPLACE TABLE (You can use spark.sql to substitute variables in this text):
CREATE OR REPLACE TABLE brand1
USING delta
AS SELECT * FROM products where .... filter condition
for a list of brands you just need to use spark.sql in a loop:
for brand in brands:
    spark.sql(f"""CREATE OR REPLACE TABLE {brand}
        USING delta
        AS SELECT * FROM products where .... filter condition""")
P.S. Really, I think that you just need to define views (doc) over the products table with the corresponding condition; in this case you avoid data duplication and don't incur compute costs for those writes.
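For reference, a hedged sketch of that view-based variant (the view naming pattern and the list of brand ids are assumptions; BrandId is the column mentioned in the question):

# One view per brand: no data is copied, and reads always reflect the current products table.
for brand_id in [1, 2, 3]:  # assumption: the list of brand ids
    spark.sql(f"""CREATE OR REPLACE VIEW brand_{brand_id}_delta AS
        SELECT * FROM products WHERE BrandId = {brand_id}""")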
We are running SQL Server 2019 with CU12 and an external data source that points to an ADLS Gen2 storage account. We have two parquet files in the same directory, where one file has 2 columns and the other file has 3 columns. We did this on purpose to test the reject options, knowing that our schemas will change over time.
/employee/file1.csv (2 columns/5 rows)
/employee/file2.csv (3 columns/5 rows)
Based on the documentation for reject options, we should be able to query the external table and get the non-dirty rows back in the result set, as long as the rejected rows fall within the reject configuration listed below.
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15&tabs=dedicated
CREATE EXTERNAL TABLE [dbo].[Employee] (
[FirstName] varchar(100) NOT NULL,
[LastName] varchar(100) NOT NULL
)
WITH (LOCATION='/employee/',
DATA_SOURCE = DATA_LAKE,
FILE_FORMAT = ParquetFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 1000000
);
When we select from the external table, I would expect to have it return the 5 rows from the one file that has 2 columns and reject the 5 rows from the file that contains 3 columns. Instead, we get no rows at all with the following exception.
Unexpected error encountered creating the record reader.
HadoopExecutionException: Column count mismatch. Source file has 3
columns, external table definition has 2 columns.
I feel like I must be missing something or my understanding of how reject options support file schema differences is incorrect. Can anyone shed any light on this?
The way PolyBase works is that it first checks the schema of each file against the external table specification; since one of them doesn't match, the query fails rather than succeeding on one file and rejecting the other. Both files must adhere to the external table specification first, and then you can have some records with two columns and other records with three columns. You can learn more about how PolyBase works in my book "Hands-on data virtualization with Polybase".
I'm having some concerns regarding the behaviour of dataframes after writing them to Hive tables.
Context:
I run a Spark Scala (version 2.2.0.2.6.4.105-1) job through spark-submit in my production environment, which has Hadoop 2.
I do multiple computations and store some intermediate data to Hive ORC tables; after storing a table, I need to re-use the dataframe to compute a new dataframe to be stored in another Hive ORC table.
E.g.:
// dataframe with ~10 million records
val df = prev_df.filter(some_filters)
val df_temp_table_name = "temp_table"
val df_table_name = "table"
sql("SET hive.exec.dynamic.partition = true")
sql("SET hive.exec.dynamic.partition.mode = nonstrict")
df.createOrReplaceTempView(df_temp_table_name)
sql(s"""INSERT OVERWRITE TABLE $df_table_name PARTITION(partition_timestamp)
SELECT * FROM $df_temp_table_name """)
These steps always work and the table is properly populated with the correct data and partitions.
After this, I need to use the just computed dataframe (df) to update another table. So I query the table to be updated into dataframe df2, then I join df with df2, and the result of the join needs to overwrite the table of df2 (a plain, non-partitioned table).
val table_name_to_be_updated = "table2"
// Query the table to be updated
val df2 = sql(s"SELECT * FROM $table_name_to_be_updated")
val df3 = df.join(df2).filter(some_filters).withColumn(something)
val temp = "temp_table2"
df3.createOrReplaceTempView(temp)
sql(s"""INSERT OVERWRITE TABLE $table_name_to_be_updated
SELECT * FROM $temp """)
At this point, df3 is always found empty, so the resulting Hive table is always empty as well. This happens also when I .persist() it to keep it in memory.
When testing with spark-shell, I have never encountered the issue. This happens only when the flow is scheduled in cluster-mode under Oozie.
What do you think might be the issue? Do you have any advice on approaching a problem like this with efficient memory usage?
I don't understand if it's the first df that turns empty after writing to a table, or if the issue is because I first query and then try to overwrite the same table.
Thank you very much in advance and have a great day!
Edit:
Previously, df was computed in an individual script and then inserted into its respective table. In a second script, that table was queried into a new variable df; then table_to_be_updated was also queried and stored in a variable, old_df2 let's say. The two were then joined and computed upon in a new variable df3, which was then inserted with overwrite into table_to_be_updated.
I have a requirement to extract the row count of each table in a Hive database (which has multiple schemas). I wrote a PySpark job which extracts the count of each table; it works fine when I try it for some of the schemas, but it fails with a GC overhead error when I try it for all schemas. I tried creating a union all of all table queries across the database, and also tried a union all of all tables within a schema; both failed with the GC error.
Can you please advise how to avoid this error? Below is my script:
# For loop over schemas starts here
for schema in schemas_list:
    # Dataframes with all table names available in the given schema for level1 and level2
    tables_1_df = tables_df(schema, 1)
    tables_1_list = formatted_list(tables_1_df, 1)
    tables_2_df = tables_df(schema, 2)
    tables_2_list = formatted_list(tables_2_df, 2)
    tables_list = list(set(tables_1_list) & set(tables_2_list))  # intersection of level1 and level2 tables per schema name
    # For loop over tables starts here
    for table in tables_list:
        # Creating dataframes with the row count of the given table for level1 and level2
        level_1_query = prep_query(schema, table, 1)
        level_2_query = prep_query(schema, table, 2)
        level_1_count_df = level_1_count_df.union(table_count(level_1_query))
        level_1_count_df.persist()
        level_2_count_df = level_2_count_df.union(table_count(level_2_query))
        level_2_count_df.persist()

# Validate whether level1 and level2 reconcile; if not, write the row into a dataframe which will in turn be written to a file in the S3 location
level_1_2_join_df = level_1_count_df.alias("one") \
    .join(level_2_count_df.alias("two"),
          (level_1_count_df.schema_name == level_2_count_df.schema_name) &
          (level_1_count_df.table_name == level_2_count_df.table_name),
          'inner') \
    .select(col("one.schema_name"), col("two.table_name"), col("level_1_count"), col("level_2_count"))
main_df = header_df.union(level_1_2_join_df)
if extracttype == 'DELTA':
    main_df = main_df.filter(main_df.level_1_count != main_df.level_2_count)
main_df = main_df.select(concat(col("schema_name"), lit(","), col("table_name"), lit(","), col("level_1_count"), lit(","), col("level_2_count")))
# creates file in temp location
file_output(main_df, tempfolder)  # writes to a txt file in hadoop
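One pattern that tends to avoid this kind of GC overhead is to collect each count as a plain Python value (the counts themselves are tiny) and build a single dataframe at the end, instead of unioning hundreds of dataframes, which makes the logical plan grow with every iteration. A hedged sketch, reusing the helpers from the script above and assuming prep_query returns a COUNT(*) query yielding one row and one column, and that a SparkSession named spark is available:

counts = []  # small Python tuples on the driver instead of hundreds of unioned dataframes
for schema in schemas_list:
    tables_list = list(set(formatted_list(tables_df(schema, 1), 1)) &
                       set(formatted_list(tables_df(schema, 2), 2)))
    for table in tables_list:
        level_1_count = spark.sql(prep_query(schema, table, 1)).collect()[0][0]
        level_2_count = spark.sql(prep_query(schema, table, 2)).collect()[0][0]
        counts.append((schema, table, level_1_count, level_2_count))

# A single small dataframe is built once at the end, so the plan stays tiny.
counts_df = spark.createDataFrame(counts, ["schema_name", "table_name", "level_1_count", "level_2_count"])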
This question is a spin-off from an earlier one (saving a list of rows to a Hive table in pyspark).
EDIT: please see my updated edits at the bottom of this post.
I have used both Scala and now PySpark to do the same task, but I am having problems with VERY slow saves of a dataframe to parquet or csv, or with converting a dataframe to a list or array type data structure. Below is the relevant python/pyspark code and info:
# Table is a list of Rows from a small Hive table I loaded using:
# query = "SELECT * FROM Table"
# Table = sqlContext.sql(query).collect()
for i in range(len(Table)):
    val1 = Table[i][0]
    val2 = Table[i][1]
    count = Table[i][2]
    x = 100 - count

    # hivetemp is a table that I copied from Hive to my hdfs using:
    # create external table IF NOT EXISTS hivetemp LIKE hivetableIwant2copy LOCATION "/user/name/hiveBackup";
    # INSERT OVERWRITE TABLE hivetemp SELECT * FROM hivetableIwant2copy;
    query = "SELECT * FROM hivetemp WHERE col1 <> \"" + val1 + "\" AND col2 == \"" + val2 + "\" ORDER BY RAND() LIMIT " + str(x)
    rows = sqlContext.sql(query)
    rows = rows.withColumn("col4", lit(10))
    rows = rows.withColumn("col5", lit(some_string))

    # writing to parquet is heck slow AND I can't work with pandas because the library is not installed on the server
    rows.saveAsParquetFile("rows" + str(i) + ".parquet")

    # tried this before and it was heck slow also
    # rows_list = rows.collect()
    # shuffle(rows_list)
I have tried to do the above in Scala as well, and I had similar problems. I could easily load the Hive table or a query of a Hive table, but doing a random shuffle or storing a large dataframe ran into memory issues. There were also some challenges with being able to add the 2 extra columns.
The Hive table (hiveTemp) that I want to add rows to has 5,570,000 (~5.5 million) rows and 120 columns.
The Hive table that I am iterating through in the for loop has 5,000 rows and 3 columns. There are 25 unique val1 values (a column in hiveTemp), and 3,000 combinations of val1 and val2. Val2 could be one of 5 columns and its specific cell value. This means that if I tweaked the code, I could reduce the row lookups from 5,000 down to 26, but the number of rows I would have to retrieve, store and randomly shuffle would be pretty large and hence a memory issue (unless anyone has suggestions on this).
As far as how many total rows I need to add to the table, it might be about 100,000.
The ultimate goal is to have the original table of 5.5 million rows appended with the 100k+ rows, written as a Hive or parquet table. If it's easier, I am fine with writing the 100k rows to their own table that can be merged into the 5.5 million row table later.
Scala or Python is fine, though Scala is preferred.
Any advice on this and the options that would be best would be great.
Thanks a lot!
EDIT
Some additional thoughts I had on this problem:
I used the hash partitioner to partition the Hive table into 26 partitions. This is based on a column that has 26 distinct values. The operations I want to perform in the for loop could be generalized so that they only need to happen on each of these partitions.
That being said, how could I write the Scala code to do this (or what guide can I look at online), so that a separate executor handles each of these loops on its own partition? I am thinking this would make things much faster.
I know how to do something like this using multithreading, but I am not sure how to do it in the Scala/Spark paradigm.
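Since Python was listed as acceptable, here is a hedged PySpark sketch of one way to run the ~25 per-value workloads concurrently: the Spark scheduler is thread-safe, so submitting one job per distinct val1 from a small driver-side thread pool lets those jobs share the cluster. It assumes a Spark 2.x SparkSession named spark, and the column names, per-value work and output paths are placeholders.

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import functions as F

# Distinct values of the partitioning column (assumed to be col1 here).
distinct_vals = [r[0] for r in spark.table("hivetemp").select("col1").distinct().collect()]

def process_one(value):
    # Placeholder for the per-value work: filter, sample, add the two columns, write.
    subset = (spark.table("hivetemp")
              .where(F.col("col1") == value)
              .withColumn("col4", F.lit(10))
              .withColumn("col5", F.lit("some_string")))
    subset.write.mode("overwrite").parquet("rows_" + str(value) + ".parquet")  # hypothetical output path

# One Spark job per value, submitted from a driver-side thread pool.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(process_one, distinct_vals))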