I have a requirement to extract the row count of each table in a Hive database (which has multiple schemas). I wrote a PySpark job which extracts the count of each table; it works fine when I try it for some of the schemas, but it fails with a GC overhead error when I try it for all schemas. I tried creating a UNION ALL of the per-table queries across the database, and also a UNION ALL of all tables within each schema; both failed with the GC error.
Can you please advise how to avoid this error? Below is my script:
# For loop for schemas starts here
for schema in schemas_list:
    # DataFrames with all table names available in the given schema for level1 and level2
    tables_1_df = tables_df(schema, 1)
    tables_1_list = formatted_list(tables_1_df, 1)
    tables_2_df = tables_df(schema, 2)
    tables_2_list = formatted_list(tables_2_df, 2)
    tables_list = list(set(tables_1_list) & set(tables_2_list))  # Intersection of level1 and level2 tables per schema name

    # For loop for tables starts here
    for table in tables_list:
        # Creating DataFrames with the row count of the given table for level1 and level2
        level_1_query = prep_query(schema, table, 1)
        level_2_query = prep_query(schema, table, 2)
        level_1_count_df = level_1_count_df.union(table_count(level_1_query))
        level_1_count_df.persist()
        level_2_count_df = level_2_count_df.union(table_count(level_2_query))
        level_2_count_df.persist()

# Validate whether level1 and level2 are reconciled; if not, write the row into a DataFrame which will in turn be written to a file in the S3 location
level_1_2_join_df = (
    level_1_count_df.alias("one")
    .join(
        level_2_count_df.alias("two"),
        (level_1_count_df.schema_name == level_2_count_df.schema_name)
        & (level_1_count_df.table_name == level_2_count_df.table_name),
        "inner",
    )
    .select(col("one.schema_name"), col("two.table_name"), col("level_1_count"), col("level_2_count"))
)
main_df = header_df.union(level_1_2_join_df)
if extracttype == 'DELTA':
    main_df = main_df.filter(main_df.level_1_count != main_df.level_2_count)
main_df = main_df.select(
    concat(col("schema_name"), lit(","), col("table_name"), lit(","), col("level_1_count"), lit(","), col("level_2_count"))
)
# Creates a file in the temp location
file_output(main_df, tempfolder)  # writes to a txt file in Hadoop
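For reference, this is the kind of restructuring I was wondering about (a rough, untested sketch): collect each count back to the driver as a plain row inside the loop, and build the two small count DataFrames only once at the end, instead of growing a unioned, persisted DataFrame per table.

# Rough, untested sketch: materialise each per-table count as a single collected Row
# instead of unioning and persisting a DataFrame per table, then build the two small
# DataFrames once at the end. tables_df/formatted_list/prep_query/table_count are the
# same helpers as above; table_count() is assumed to return a one-row DataFrame,
# and "spark" is the SparkSession.
level_1_rows, level_2_rows = [], []
for schema in schemas_list:
    tables_list = list(set(formatted_list(tables_df(schema, 1), 1)) &
                       set(formatted_list(tables_df(schema, 2), 2)))
    for table in tables_list:
        level_1_rows.append(table_count(prep_query(schema, table, 1)).collect()[0])
        level_2_rows.append(table_count(prep_query(schema, table, 2)).collect()[0])

level_1_count_df = spark.createDataFrame(level_1_rows)  # schema inferred from the collected Rows
level_2_count_df = spark.createDataFrame(level_2_rows)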
I'm trying to check the size of the different tables we're generating in our data warehouse, so we can have an automatic way to calculate partition sizes in subsequent runs.
To get the table size, I'm reading the stats from the DataFrame in the following way:
val db = "database"
val table_name = "table_name"
val table_size_bytes = spark.read.table(s"$db.$table_name").queryExecution.analyzed.stats.sizeInBytes
This was working fine until I started running the same code on partitioned tables. Each time I ran it on a partitioned table I got the same value for sizeInBytes: 9223372036854775807, which is Long.MaxValue.
Is this a bug in Spark or should I be running this in a different way for partitioned tables?
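For what it's worth, a possible fallback (sketched here in PySpark, untested) would be to compute table-level statistics explicitly and read the size back from the catalog metadata rather than from the analyzed plan; the database and table names below are the placeholders from the snippet above.

# Untested fallback sketch: compute statistics first, then read them back via
# DESCRIBE EXTENDED. "database" and "table_name" are the placeholders used above.
spark.sql("ANALYZE TABLE database.table_name COMPUTE STATISTICS")
stats_row = (spark.sql("DESCRIBE EXTENDED database.table_name")
                  .filter("col_name = 'Statistics'")
                  .collect())
# When statistics exist, stats_row[0].data_type looks like "123456 bytes, 789 rows"
print(stats_row)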
I have a handful of tables, each only a few MB in file size, that I want to capture as Delta tables. Inserting new data into them takes an extraordinarily long time, 15+ minutes, which astonishes me.
The culprit, I am guessing, is that while the tables are very small, they have over 300 columns each.
I have tried the following methods, with the former being faster than the latter (unsurprisingly?): (1) INSERT INTO, (2) MERGE INTO.
Before inserting data into the Delta tables, I apply a handful of Spark functions to clean the data and then register it as a temp table (e.g., INSERT INTO DELTA_TBL_OF_INTEREST (cols) SELECT * FROM tempTable).
Any recommendations on speeding this process up for trivial data?
If you're performing the data transformations with PySpark before putting the data into the destination table, then you don't need to drop down to SQL at all; you can just write the data using append mode.
If you're using a registered table:
df = ... transform source data ...
df.write.mode("append").format("delta").saveAsTable("table_name")
If you're using a file path:
df = ... transform source data ...
df.write.mode("append").format("delta").save("path_to_delta")
We are running SQL Server 2019 CU12 with an external data source that points to an ADLS Gen2 storage account. We have two parquet files in the same directory, where one file has 2 columns and the other has 3 columns. We did this on purpose to test the reject options, knowing that our schemas will change over time.
/employee/file1.csv (2 columns/5 rows)
/employee/file2.csv (3 columns/5 rows)
Based on the documentation for reject options, we should be able to query across the external table and get the non-dirty rows back in the result set, as long as the rejected rows fall within the reject configuration listed below.
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15&tabs=dedicated
CREATE EXTERNAL TABLE [dbo].[Employee] (
[FirstName] varchar(100) NOT NULL,
[LastName] varchar(100) NOT NULL
)
WITH (LOCATION='/employee/',
DATA_SOURCE = DATA_LAKE,
FILE_FORMAT = ParquetFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 1000000
);
When we select from the external table, I would expect it to return the 5 rows from the file with 2 columns and reject the 5 rows from the file with 3 columns. Instead, we get no rows at all, with the following exception:
Unexpected error encountered creating the record reader.
HadoopExecutionException: Column count mismatch. Source file has 3
columns, external table definition has 2 columns.
I feel like I must be missing something or my understanding of how reject options support file schema differences is incorrect. Can anyone shed any light on this?
The way PolyBase works is that it first checks the schema of both files to see if they match the external table specification; since they don't, PolyBase fails on the mismatched file rather than skipping it. Both files must adhere to the external table specification first; only then can you have some records with two columns and other records with three columns handled by the reject options. You can learn more about how PolyBase works in my book, "Hands-on data virtualization with Polybase".
I am performing an incremental load on data coming from a Teradata database and storing it as a parquet file. Because the tables in Teradata contain billions of rows, I would like my PySpark script to compare hash values.
Teradata Table:
An example table from Teradata
Current Stored Parquet File:
Data stored in parquet file
My PySpark script uses a JDBC read connection to make the call to Teradata:
tdDF = spark.read \
    .format("jdbc") \
    .option("driver", "com.teradata.jdbc.TeraDriver") \
    .option("url", "jdbc:teradata://someip/DATABASE=somedb,MAYBENULL=ON") \
    .option("dbtable", "(SELECT * FROM somedb.table) tmp") \
    .load()
Spark script that reads in the parquet:
myDF = spark.read.parquet("myParquet")
myDF.createOrReplaceTempView("myDF")
spark.sql("select * from myDF").show()
How can I:
include a hash function in my call to Teradata that returns the hash of the entire row's values (this hash should be computed in Teradata)
include a hash function in my PySpark code when reading in the parquet file that returns the hash of the entire row's values (this hash should be computed in Spark)
compare these two hashes to see which rows are the delta from Teradata that needs to be loaded (a rough sketch of what I have in mind for the Spark side is below)
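Something like this is what I mean for the Spark side (a rough, untested sketch; it assumes both DataFrames have the same columns and hashes entirely in Spark rather than pushing the hash down to Teradata):

from pyspark.sql import functions as F

# Untested sketch: hash every row by concatenating all columns (cast to string),
# then keep the Teradata rows whose hash is not yet present in the parquet data.
cols = sorted(tdDF.columns)

td_hashed = tdDF.withColumn(
    "row_hash", F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in cols]), 256))
pq_hashed = myDF.withColumn(
    "row_hash", F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in cols]), 256))

# Rows present in Teradata but not in the parquet file = the delta to load
delta_df = td_hashed.join(pq_hashed.select("row_hash"), on="row_hash", how="left_anti")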
You want to insert new rows, or, if rows with the identifying info already exist, update them. This is called an 'upsert', or in Teradata, a MERGE.
It depends on which columns are allowed to change and which ones make a row 'new'.
In your example you have:
terradata
Name Account Product
------+--------+---------
Sam 1234 Speakers
Jane 1256 Earphones
Janet 3214 Laptop
Billy 5678 HardDisk
parquet
Name Account Product
------+--------+---------
Sam 1234 Speakers
Jane 1256 Earphones
So if any Name,Account combination should be unique, the database table should have a unique key defined for it.
With that, the database won't allow insert of another row with the same unique key, but will allow you to update it.
So going by this example, with your example data, your SQL commands would look like:
UPDATE somedb.table SET product = 'Speakers' WHERE name = 'Sam' AND account = 1234 ELSE INSERT INTO somedb.table(name, account, product) VALUES('Sam',1234,'Speakers');
UPDATE somedb.table SET product = 'Earphones' WHERE name = 'Jane' AND account = 1256 ELSE INSERT INTO somedb.table(name, account, product) VALUES('Jane',1256,'Earphones');
UPDATE somedb.table SET product = 'Laptop' WHERE name = 'Janet' AND account = 3214 ELSE INSERT INTO somedb.table(name, account, product) VALUES('Janet',3214,'Laptop');
UPDATE somedb.table SET product = 'HardDisk' WHERE name = 'Billy' AND account = 5678 ELSE INSERT INTO somedb.table(name, account, product) VALUES('Billy',5678,'HardDisk');
But this is a very simplistic approach that will likely perform very poorly.
Googling 'teradata bulk upload' finds links such as:
https://kontext.tech/article/483/teradata-fastload-load-csv-file
https://etl-sql.com/6-ways-to-load-data-file-into-teradata-table/
There are likely many others.
This question is a spin-off from this one (saving a list of rows to a Hive table in pyspark).
EDIT: please see my update/edits at the bottom of this post.
I have used both Scala and now PySpark to do the same task, but I am having problems with VERY slow saves of a DataFrame to parquet or csv, and with converting a DataFrame to a list or array type data structure. Below is the relevant python/pyspark code and info:
# Table is a list of Rows from a small Hive table I loaded using:
#   query = "SELECT * FROM Table"
#   Table = sqlContext.sql(query).collect()
for i in range(len(Table)):
    val1 = Table[i][0]
    val2 = Table[i][1]
    count = Table[i][2]
    x = 100 - count

    # hivetemp is a table that I copied from Hive to my HDFS using:
    #   CREATE EXTERNAL TABLE IF NOT EXISTS hivetemp LIKE hivetableIwant2copy LOCATION "/user/name/hiveBackup";
    #   INSERT OVERWRITE TABLE hivetemp SELECT * FROM hivetableIwant2copy;
    query = "SELECT * FROM hivetemp WHERE col1<>\"" + val1 + "\" AND col2==\"" + val2 + "\" ORDER BY RAND() LIMIT " + str(x)
    rows = sqlContext.sql(query)
    rows = rows.withColumn("col4", lit(10))
    rows = rows.withColumn("col5", lit(some_string))

    # Writing to parquet is heck slow, AND I can't work with pandas because the library is not installed on the server
    rows.saveAsParquetFile("rows" + str(i) + ".parquet")

    # Tried this before and it was heck slow also:
    # rows_list = rows.collect()
    # shuffle(rows_list)
I have tried to do the above in Scala, and I had similar problems. I could easily load the Hive table or a query of a Hive table, but doing a random shuffle or storing a large DataFrame ran into memory issues. There were also some challenges with adding the 2 extra columns.
The Hive table (hiveTemp) that I want to add rows to has 5,570,000 (~5.5 million) rows and 120 columns.
The Hive table that I am iterating through in the for loop has 5000 rows and 3 columns. There are 25 unique values of val1 (a column in hiveTemp), and 3000 combinations of val1 and val2. Val2 could be one of 5 columns and its specific cell value. This means that if I tweaked the code, I could reduce the lookups of rows to add from 5000 down to 26, but the number of rows I would have to retrieve, store, and randomly shuffle would be pretty large and hence a memory issue (unless anyone has suggestions on this).
The total number of rows I need to add to the table is about 100,000.
The ultimate goal is to have the original table of 5.5 million rows appended with the 100k+ rows and written as a Hive or parquet table. If it's easier, I am fine with writing the 100k rows into their own table that can be merged with the 5.5 million row table later.
Scala or Python is fine, though Scala is preferred.
Any advice on this, and on which options would be best, would be great.
Thanks a lot!
EDIT
Some additional thoughts I had on this problem:
I used the hash partitioner to partition the Hive table into 26 partitions. This is based on a column that has 26 distinct values. The operations I want to perform in the for loop could be generalized so that they only need to happen on each of these partitions.
That being said, how could I write the Scala code to do this (or what guide could I look at online), so that a separate executor runs the loop for each partition? I am thinking this would make things much faster.
I know how to do something like this using multiple threads, but I'm not sure how in the Scala/Spark paradigm. A rough sketch of the kind of thing I mean is below.
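In PySpark terms (an untested sketch; the Scala equivalent would use parallel collections or Futures), process_value is a hypothetical stand-in for the body of the for loop above, the LIMIT is simplified, and the output path is a placeholder:

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql.functions import lit

# Untested sketch: submit the per-value work as concurrent Spark jobs from the driver,
# one thread per distinct col1 value, instead of running the loop serially.
def process_value(val1):
    rows = sqlContext.sql(
        'SELECT * FROM hivetemp WHERE col1 <> "{0}" ORDER BY RAND() LIMIT 100'.format(val1))
    rows = rows.withColumn("col4", lit(10))
    rows.write.mode("append").parquet("/user/name/sampled_rows")  # placeholder output path

distinct_vals = [r[0] for r in sqlContext.sql("SELECT DISTINCT col1 FROM hivetemp").collect()]

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(process_value, distinct_vals))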