I have Scala code that I run in spark-shell to extract data from JSON files, load it first into a Hive staging table, and then move the data from staging into the final main table in Hive.
I use the following commands:
//read json to DF
val df = hiveContext.read.schema(schema1).json(file)
//DF to Staging
df.write.mode("append").saveAsTable("stg")
//Staging to Final
hiveContext.sql("insert into final select distinct columns from stg left outer join final on stg.id = final.id where stg.id is not null and final.id is null")
I want to be able to understand the counts of records read/written at every stage.
For JSON to DF, I understand that I can do df.count(), but what about the remaining two stages? How do I get the record counts for the saveAsTable and INSERT INTO statements?
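One low-tech way to get these numbers (a sketch, assuming the same hiveContext, schema1, file, and table names as above; neither saveAsTable nor the INSERT statement returns a row count by itself) is to count the DataFrame once before writing it, and to count the target tables before and after each load:
// Cache so the count and the write share one scan of the JSON files
val df = hiveContext.read.schema(schema1).json(file).cache()
val readCount = df.count()                       // records read from JSON
df.write.mode("append").saveAsTable("stg")
val stgTotal = hiveContext.table("stg").count()  // rows now in stg (includes earlier appends)
val finalBefore = hiveContext.table("final").count()
hiveContext.sql("insert into final select distinct columns from stg left outer join final on stg.id = final.id where stg.id is not null and final.id is null")
val insertedCount = hiveContext.table("final").count() - finalBefore  // rows added by this insert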
I'm trying to update a Delta Lake table using a Spark DataFrame. What I want to do is update all rows in the Delta Lake table that differ from the Spark DataFrame, and insert all rows that are missing from the Delta Lake table.
I tried to do this as follows:
import io.delta.tables._
val not_equal_string = df.schema.fieldNames.map(fn =>
s"coalesce(not ((updates.${fn} = history.${fn}) or (isnull(history.${fn}) and isnull(updates.${fn})) ),false)"
).reduceLeft((x,y) => s"$x OR $y ")
val deltaTable = DeltaTable.forPath(spark, "s3a://sparkdata/delta-table")
deltaTable.as("history").merge(
df.as("updates"), "updates.EquipmentKey = history.EquipmentKey"
).whenMatched(not_equal_string).updateAll().whenNotMatched().insertAll().execute()
This works, but when I look at the resulting Delta table I see that it effectively doubled in size, even though I didn't update a single record. A new JSON file was generated with a remove for every old partition and an add for all the new partitions.
When I just run a SQL join with the whenMatched criterion as a WHERE condition, I don't get a single row.
I would expect the Delta table to be untouched after such a merge operation. Am I missing something simple?
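For reference, the check mentioned above can be written as a plain join that reuses the generated not_equal_string as a filter (a minimal sketch, assuming the same spark session, df, and table path as in the snippet):
import org.apache.spark.sql.functions.{col, expr}
// Join the current table contents against the incoming updates and apply the
// same "any column differs" condition that merge uses in whenMatched.
val history = spark.read.format("delta").load("s3a://sparkdata/delta-table")
val changed = df.as("updates")
  .join(history.as("history"), col("updates.EquipmentKey") === col("history.EquipmentKey"))
  .where(expr(not_equal_string))
changed.count()   // 0 in the scenario described above, yet the merge still rewrote files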
I'm having some concerns regarding the behaviour of dataframes after writing them to Hive tables.
Context:
I run a Spark Scala (version 2.2.0.2.6.4.105-1) job through spark-submit in my production environment, which has Hadoop 2.
I do multiple computations and store some intermediate data to Hive ORC tables; after storing a table, I need to re-use the dataframe to compute a new dataframe to be stored in another Hive ORC table.
E.g.:
// dataframe with ~10 million records
val df = prev_df.filter(some_filters)
val df_temp_table_name = "temp_table"
val df_table_name = "table"
sql("SET hive.exec.dynamic.partition = true")
sql("SET hive.exec.dynamic.partition.mode = nonstrict")
df.createOrReplaceTempView(df_temp_table_name)
sql(s"""INSERT OVERWRITE TABLE $df_table_name PARTITION(partition_timestamp)
SELECT * FROM $df_temp_table_name """)
These steps always work and the table is properly populated with the correct data and partitions.
After this, I need to use the just-computed dataframe (df) to update another table. So I query the table to be updated into dataframe df2, then I join df with df2, and the result of the join needs to overwrite the table of df2 (a plain, non-partitioned table).
val table_name_to_be_updated = "table2"
// Query the table to be updated
val df2 = sql(s"SELECT * FROM $table_name_to_be_updated")
val df3 = df.join(df2).filter(some_filters).withColumn(something)
val temp = "temp_table2"
df3.createOrReplaceTempView(temp)
sql(s"""INSERT OVERWRITE TABLE $table_name_to_be_updated
SELECT * FROM $temp """)
At this point, df3 is always found empty, so the resulting Hive table is always empty as well. This also happens when I .persist() it to keep it in memory.
When testing with spark-shell, I have never encountered the issue. This happens only when the flow is scheduled in cluster-mode under Oozie.
What do you think might be the issue? Do you have any advice on approaching a problem like this with efficient memory usage?
I don't understand whether it's the first df that turns empty after writing to a table, or whether the issue is that I first query and then try to overwrite the same table.
Thank you very much in advance and have a great day!
Edit:
Previously, df was computed in an individual script and then inserted into its respective table. In a second script, that table was queried into a new variable df; then the table_to_be_updated was also queried and stored in a variable, say old_df2. The two were then joined and computed upon in a new variable df3, which was then inserted with overwrite into the table_to_be_updated.
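One thing that may be worth trying (a sketch of a possible workaround, not a confirmed fix; the scratch table name is hypothetical) is to fully materialize df3 into its own storage before the INSERT OVERWRITE, so that nothing in the plan still needs to read table2 at the moment it gets overwritten:
// Materialize the join result first, then overwrite the table it was derived from
val scratch = "temp_table2_scratch"
df3.write.mode("overwrite").saveAsTable(scratch)
val materialized = spark.table(scratch)
materialized.createOrReplaceTempView(temp)
spark.sql(s"""INSERT OVERWRITE TABLE $table_name_to_be_updated
              SELECT * FROM $temp """)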
When we use Spark to read data from CSV (or a DB) as follows, it will automatically split the data into multiple partitions and send them to the executors:
spark
.read
.option("delimiter", ",")
.option("header", "true")
.option("mergeSchema", "true")
.option("codec", properties.getProperty("sparkCodeC"))
.format(properties.getProperty("fileFormat"))
.load(inputFile)
Currently, I have an ID list like:
[1,2,3,4,5,6,7,8,9,...1000]
What I want to do is split this list into multiple partitions and send them to the executors; in each executor, run SQL like:
ids.foreach(id => {
select * from table where id = id
})
When we load data from Cassandra, the connector generates the query SQL like:
select columns from table where Token(k) >= ? and Token(k) <= ?
This means the connector will scan the whole table. Actually, I don't need to scan the whole table; I just want to get all the data from the table where k (the partition key) is in the ID list.
The table schema is:
CREATE TABLE IF NOT EXISTS tab.events (
k int,
o text,
event text,
PRIMARY KEY (k, o)
);
Or, how can I use Spark to load data from Cassandra using a predefined SQL statement without scanning the whole table?
You simply need to use the joinWithCassandraTable function to select only the data required for your operation. But be aware that this function is only available via the RDD API.
Something like this:
val joinWithRDD = your_df.rdd.joinWithCassandraTable("tab","events")
You need to make sure that the column names in your DataFrame match the partition key names in Cassandra; see the documentation for more information.
The DataFrame implementation is only available in the DSE version of the Spark Cassandra Connector, as described in the following blog post.
Update (September 2020): support for joins with Cassandra was added in Spark Cassandra Connector 2.5.0.
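For the ID-list case in the question, a minimal sketch with the RDD API might look like this (assuming the connector is on the classpath and spark.cassandra.connection.host is configured; the 1 to 1000 range stands in for your real ID list):
import com.datastax.spark.connector._
// Parallelize the IDs and join them against the partition key k, so only the
// listed partitions are fetched instead of scanning the whole table.
val ids = spark.sparkContext.parallelize(1 to 1000).map(Tuple1(_))
val events = ids.joinWithCassandraTable("tab", "events").on(SomeColumns("k"))
events.take(10).foreach(println)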
I am performing an incremental load on data coming from a Teradata database and storing it as a parquet file. Because the tables from Teradata contain billions of rows, I would like my PySpark script to compare hash values.
Teradata Table:
An example table from Teradata
Current Stored Parquet File:
Data stored in parquet file
My PySpark script uses a JDBC read connection to make the call to Teradata:
tdDF = spark.read \
    .format("jdbc") \
    .option("driver", "com.teradata.jdbc.TeraDriver") \
    .option("url", "jdbc:teradata://someip/DATABASE=somedb,MAYBENULL=ON") \
    .option("dbtable", "(SELECT * FROM somedb.table) tmp") \
    .load()
Spark script that reads in the parquet:
myDF = spark.read.parquet("myParquet")
myDF.createOrReplaceTempView("myDF")
spark.sql("select * from myDF").show()
How can I:
Include a hash function in my call to Teradata that returns the hash of the entire row's values (this hash should be computed on Teradata)?
Include a hash function in my PySpark code when reading in the parquet file that returns the hash of the entire row's values (this hash should be computed in Spark)?
Compare these two hashes to determine the delta from Teradata that needs to be loaded?
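A sketch of one way to do the comparison (written in Scala like the rest of this page; the same functions exist in PySpark) is to compute an identical hash over the concatenated columns of both DataFrames and anti-join on it. Note this computes both hashes in Spark after the JDBC read rather than inside Teradata, which sidesteps having to reproduce a Teradata hash function byte-for-byte:
import org.apache.spark.sql.functions.{col, concat_ws, sha2}
// Hash every column of a row into one value; column order must match on both sides.
// concat_ws skips nulls, so consider coalescing nullable columns first to avoid collisions.
def withRowHash(df: org.apache.spark.sql.DataFrame) =
  df.withColumn("row_hash", sha2(concat_ws("||", df.columns.map(col): _*), 256))
val tdHashed = withRowHash(tdDF)   // rows currently in Teradata
val pqHashed = withRowHash(myDF)   // rows already stored in the parquet file
// Rows whose hash is not yet in the parquet file are the delta to load.
val delta = tdHashed.join(pqHashed.select("row_hash"), Seq("row_hash"), "left_anti")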
You want to insert new rows or, if rows with identifying info already exist, update them. This is called an 'upsert', or in Teradata, a 'merge'.
It depends on which columns are allowed to change and which ones make a row 'new'.
In your examples you have:
Teradata
Name Account Product
------+--------+---------
Sam 1234 Speakers
Jane 1256 Earphones
Janet 3214 Laptop
Billy 5678 HardDisk
parquet
Name Account Product
------+--------+---------
Sam 1234 Speakers
Jane 1256 Earphones
So if each Name, Account combination should be unique, the database table should have a unique key defined on those columns.
With that, the database won't allow insert of another row with the same unique key, but will allow you to update it.
So going by this example, with your example data, your SQL commands would look like:
UPDATE somedb.table SET product = 'Speakers' WHERE name = 'Sam' AND account = 1234 ELSE INSERT INTO somedb.table(name, account, product) VALUES('Sam',1234,'Speakers');
UPDATE somedb.table SET product = 'Earphones' WHERE name = 'Jane' AND account = 1256 ELSE INSERT INTO somedb.table(name, account, product) VALUES('Jane',1256,'Earphones');
UPDATE somedb.table SET product = 'Laptop' WHERE name = 'Janet' AND account = 3214 ELSE INSERT INTO somedb.table(name, account, product) VALUES('Janet',3214,'Laptop');
UPDATE somedb.table SET product = 'HardDisk' WHERE name = 'Billy' AND account = 5678 ELSE INSERT INTO somedb.table(name, account, product) VALUES('Billy',5678,'HardDisk');
But this is a very simplistic approach that will likely perform very poorly.
Googling 'teradata bulk upload' finds links such as:
https://kontext.tech/article/483/teradata-fastload-load-csv-file
https://etl-sql.com/6-ways-to-load-data-file-into-teradata-table/
There are likely many others.
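If you do end up driving the simplistic row-by-row upsert from Spark anyway, a hedged sketch could look like the following (this assumes the Teradata JDBC driver accepts the atomic UPDATE ... ELSE INSERT form as a prepared statement; deltaDF, the column order (name, account, product), the URL, and the credentials are placeholders for your own delta DataFrame and connection details):
import java.sql.DriverManager
import org.apache.spark.sql.Row
// One connection per partition; simple, but slow for large volumes, as noted above.
deltaDF.foreachPartition { (rows: Iterator[Row]) =>
  val conn = DriverManager.getConnection(
    "jdbc:teradata://someip/DATABASE=somedb", "user", "password")
  val upsert = conn.prepareStatement(
    "UPDATE somedb.table SET product = ? WHERE name = ? AND account = ? " +
    "ELSE INSERT INTO somedb.table(name, account, product) VALUES (?, ?, ?)")
  try {
    rows.foreach { r =>
      val (name, account, product) = (r.getString(0), r.getInt(1), r.getString(2))
      upsert.setString(1, product); upsert.setString(2, name); upsert.setInt(3, account)
      upsert.setString(4, name); upsert.setInt(5, account); upsert.setString(6, product)
      upsert.executeUpdate()
    }
  } finally conn.close()
}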
I was able to insert data into a Hive table from my Spark code using HiveContext, like below:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS e360_models.employee(id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
sqlContext.sql("insert into table e360_models.employee select t.* from (select 1210, 'rahul', 55) t")
sqlContext.sql("insert into table e360_models.employee select t.* from (select 1211, 'sriram pv', 35) t")
sqlContext.sql("insert into table e360_models.employee select t.* from (select 1212, 'gowri', 59) t")
val result = sqlContext.sql("FROM e360_models.employee SELECT id, name, age")
result.show()
But this approach creates a separate file in the warehouse for every insertion, like below:
part-00000
part-00000_copy_1
part-00000_copy_2
part-00000_copy_3
Is there any way to avoid this and just append the new data to a single file, or is there a better way to insert data into Hive from Spark?
No, there is no way to do that. Each new insert will create a new file. It's not a Spark "issue", but a general behavior you can experience with Hive too. The only way is to perform a single insert with the UNION of all your data, but if you need to do multiple inserts, you'll have multiple files.
The only thing you can do is enable file merging in Hive (see: Hive Create Multi small files for each insert in HDFS and https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties).
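As a sketch of the single-insert idea (same sqlContext and table as in the question; the example rows are just the three from above combined into one batch):
import sqlContext.implicits._
// Collect the whole batch into one DataFrame and write it with a single insert,
// so only one set of part files is produced.
val newEmployees = Seq(
  (1210, "rahul", 55),
  (1211, "sriram pv", 35),
  (1212, "gowri", 59)
).toDF("id", "name", "age")
newEmployees.coalesce(1)            // optional: a single output file for this batch
  .write.insertInto("e360_models.employee")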