Spark- Load data frame contents in table in a loop - scala

I use scala/ spark to insert data into a Hive parquet table as follows
for(*lots of current_Period_Id*){//This loop is on a result of another query that returns multiple rows of current_Period_Id
val myDf = hiveContext.sql(s"""SELECT columns FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id""")
val count: Int = myDf.count().toInt
if(count>0){
hiveContext.sql(s"""INSERT INTO destinationtable PARTITION(period_id=$current_Period_Id) SELECT columns FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id""")
}
}
This approach takes a lot of time to complete because the select statement is being executed twice.
I'm trying to avoid selecting data twice and one way I've thought of is writing the dataframe myDf to the table directly.
This is the gist of the code I'm trying to use for the purpose
val sparkConf = new SparkConf().setAppName("myApp")
.set("spark.yarn.executor.memoryOverhead","4096")
val sc = new SparkContext(sparkConf)
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition","true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
for(*lots of current_Period_Id*){//This loop is on a result of another query
val myDf = hiveContext.sql("SELECT COLUMNS FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id")
val count: Int = myDf.count().toInt
if(count>0){
myDf.write.mode("append").format("parquet").partitionBy("PERIOD_ID").saveAsTable("destinationtable")
}
}
But I get an error in the myDf.write part.
java.util.NoSuchElementException: key not found: period_id
The destination table is partitioned by period_id.
Could someone help me with this?
The spark version I'm using is 1.5.0-cdh5.5.2.

The dataframe schema and table's description differs from each other. The PERIOD_ID != period_id column name is Upper case in your DF but in UPPER case in table. Try in sql with lowercase the period_id

Related

How to add a new column to a Delta Lake table?

I'm trying to add a new column to data stored as a Delta Table in Azure Blob Storage. Most of the actions being done on the data are upserts, with many updates and few new inserts. My code to write data currently looks like this:
DeltaTable.forPath(spark, deltaPath)
.as("dest_table")
.merge(myDF.as("source_table"),
"dest_table.id = source_table.id")
.whenNotMatched()
.insertAll()
.whenMatched(upsertCond)
.updateExpr(upsertStat)
.execute()
From these docs, it looks like Delta Lake supports adding new columns on insertAll() and updateAll() calls only. However, I'm updating only when certain conditions are met and want the new column added to all the existing data (with a default value of null).
I've come up with a solution that seems extremely clunky and am wondering if there's a more elegant approach. Here's my current proposed solution:
// Read in existing data
val myData = spark.read.format("delta").load(deltaPath)
// Register table with Hive metastore
myData.write.format("delta").saveAsTable("input_data")
// Add new column
spark.sql("ALTER TABLE input_data ADD COLUMNS (new_col string)")
// Save as DataFrame and overwrite data on disk
val sqlDF = spark.sql("SELECT * FROM input_data")
sqlDF.write.format("delta").option("mergeSchema", "true").mode("overwrite").save(deltaPath)
Alter your delta table first and then you do your merge operation:
from pyspark.sql.functions import lit
spark.read.format("delta").load('/mnt/delta/cov')\
.withColumn("Recovered", lit(''))\
.write\
.format("delta")\
.mode("overwrite")\
.option("overwriteSchema", "true")\
.save('/mnt/delta/cov')
New columns can also be added with SQL commands as follows:
ALTER TABLE dbName.TableName ADD COLUMNS (newColumnName dataType)
UPDATE dbName.TableName SET newColumnName = val;
This is the approach that worked for me using scala
Having a delta table, named original_table, which path is:
val path_to_delta = "/mnt/my/path"
This table currently has got 1M records with the following schema: pk, field1, field2, field3, field4
I want to add a new field, named new_field, to the existing schema without loosing the data already stored in original_table.
So I first created a dummy record with a simple schema containing just pk and newfield
case class new_schema(
pk: String,
newfield: String
)
I created a dummy record using that schema:
import spark.implicits._
val dummy_record = Seq(new new_schema("delete_later", null)).toDF
I inserted this new record (the existing 1M records will have newfield populated as null). I also removed this dummy record from the original table:
dummy_record
.write
.format("delta")
.option("mergeSchema", "true")
.mode("append")
.save(path_to_delta )
val original_dt : DeltaTable = DeltaTable.forPath(spark, path_to_delta )
original_dt .delete("pk = 'delete_later'")
Now the original table will have 6 fields: pk, field1, field2, field3, field4 and newfield
Finally I upsert the newfield values in the corresponding 1M records using pk as join key
val df_with_new_field = // You bring new data from somewhere...
original_dt
.as("original")
.merge(
df_with_new_field .as("new"),
"original.pk = new.pk")
.whenMatched
.update( Map(
"newfield" -> col("new.newfield")
))
.execute()
https://www.databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html
Have you tried using the merge statement?
https://docs.databricks.com/spark/latest/spark-sql/language-manual/merge-into.html

Storing Spark DataFrame Value In scala Variable

I need to Check the duplicate filename in my table and if file count is 0 then i need to load a file in my table using sparkSql. I wrote below code.
val s1=spark.sql("select count(filename) from mytable where filename='myfile.csv'") //giving '2'
s1: org.apache.spark.sql.DataFrame = [count(filename): bigint]
s1.show //giving 2 as output
//s1 is giving me the filecount from my table then i need to compare this count value using if statement.
I'm using below code.
val s2=s1.count //not working always giving 1
val s2=s1.head.count() // error: value count is not a member of org.apache.spark.sql.Row
val s2=s1.size //value size is not a member of Unit
if(s1>0){ //code } //value > is not a member of org.apache.spark.sql.DataFrame
can someone please give me a hint how should i do this.How can i get the dataframe value and can use as variable to check the condition.
i.e.
if(value of s1(i.e.2)>0){
//my code
}
You need to extract the value itself. Count will return the number of rows in the df, which is just one row.
So you can keep your original query and extract the value after with first and getInt methods
val s1 = spark.sql("select count(filename) from mytable where filename='myfile.csv'")`
val valueToCompare = s1.first().getInt(0)
And then:
if(valueToCompare>0){
//my code
}
Another option is performing the count outside the query, then the count will give you the desired value:
val s1 = spark.sql("select filename from mytable where filename='myfile.csv'")
if(s1.count>0){
//my code
}
I like the most the second option, but there is no reason other than that i think it is more clear
spark.sql("select count(filename) from mytable where filename='myfile.csv'") returns a dataframe and you need to extract both the first row and the first column of that row. It is much simpler to directly filter the dataset and count the number of rows in Scala:
val s1 = df.filter($"filename" === "myfile.csv").count
if (s1 > 0) {
...
}
where df is the dataset that corresponds to the mytable table.
If you got the table from some other source and not by registering a view, use SparkSession.table() to get a dataframe using the instance of SparkSession that you already have. For example, in Spark shell the pre-set variable spark holds the session and you'll do:
val df = spark.table("mytable")
val s1 = df.filter($"filename" === "myfile.csv").count

How to Compare columns of two tables using Spark?

I am trying to compare two tables() by reading as DataFrames. And for each common column in those tables using concatenation of a primary key say order_id with other columns like order_date, order_name, order_event.
The Scala Code I am using
val primary_key=order_id
for (i <- commonColumnsList){
val column_name = i
val tempDataFrameForNew = newDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
val tempDataFrameOld = oldDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
//Get those records which aren common in both old/new tables
matchCountCalculated = tempDataFrameForNew.intersect(tempDataFrameOld)
//Get those records which aren't common in both old/new tables
nonMatchCountCalculated = tempDataFrameOld.unionAll(tempDataFrameForNew).except(matchCountCalculated)
//Total Null/Non-Null Counts in both old and new tables.
nullsCountInNewDataFrame = newDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
nullsCountInOldDataFrame = oldDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
nonNullsCountInNewDataFrame = newDFCount - nullsCountInNewDataFrame
nonNullsCountInOldDataFrame = oldDFCount - nullsCountInOldDataFrame
//Put the result for a given column in a Seq variable, later convert it to Dataframe.
tempSeq = tempSeq :+ Row(column_name, matchCountCalculated.toString, nonMatchCountCalculated.toString, (nullsCountInNewDataFrame - nullsCountInOldDataFrame).toString,
(nonNullsCountInNewDataFrame - nonNullsCountInOldDataFrame).toString)
}
// Final Step: Create DataFrame using Seq and some Schema.
spark.createDataFrame(spark.sparkContext.parallelize(tempSeq), schema)
The above code is working fine for a medium set of Data, but as the number of Columns and Records increases in my New & Old Table, the execution time is increasing. Any sort of advice is appreciated.
Thank you in Advance.
You can do the following:
1. Outer join the old and new dataframe on priamary key
joined_df = df_old.join(df_new, primary_key, "outer")
2. Cache it if you possibly can. This will save you a lot of time
3. Now you can iterate over columns and compare columns using spark functions (.isNull for not matched, == for matched etc)
for (col <- df_new.columns){
val matchCount = df_joined.filter(df_new[col].isNotNull && df_old[col].isNotNull).count()
val nonMatchCount = ...
}
This should be considerably faster, especially when you can cache your dataframe. If you can't it might be a good idea so save the joined df to disk in order to avoid a shuffle each time

Looping a dataframe from the column from the same table in scala

I have DataFrame in which it will contain table name with data. I need to loop the DataFrame with the table column name. Is there a better way to do it with a collect at first?
val tablename:Array[String] = df1.select("msgname").distinct().rdd.map(row=>row.getString(0).trim).collect
tablename.foreach{table =>
//print(table)
//val columns:Array[String] = df1.filter(s"msgname = '$table'").select("columns").distinct().rdd.map(row=>row.toString()).collect
df1.filter(s"msgname = '$table'").select("record_data").write.saveAsTable(s"$table")
//.toDF(columns:_*).show()
//.toDF(columns:_*).show()
}
2 ideas to improve performance: cache df1 and/or fire parallel spark jobs e.g. using parallel collections, like this:
df1.cache()
val tablename:Array[String] = df1.select(trim("msgname")).distinct().as[String].collect
tablename
.par // enable parallel execution
.foreach{table =>
df1.filter(s"msgname ='$table'").select("record_data").write.saveAsTable(s"$table")
}

Accessing column in a dataframe using Spark

I am working on SPARK 1.6.1 version using SCALA and facing a unusual issue. When creating a new column using an existing column created during same execution getting "org.apache.spark.sql.AnalysisException".
WORKING:.
val resultDataFrame = dataFrame.withColumn("FirstColumn",lit(2021)).withColumn("SecondColumn",when($"FirstColumn" - 2021 === 0, 1).otherwise(10))
resultDataFrame.printSchema().
NOT WORKING
val resultDataFrame = dataFrame.withColumn("FirstColumn",lit(2021)).withColumn("SecondColumn",when($"FirstColumn" - **max($"FirstColumn")** === 0, 1).otherwise(10))
resultDataFrame.printSchema().
Here i am creating my SecondColumn using the FirstColumn created during the same execution. Question is why it does not work while using avg/max functions. Please let me know how can i resolve this problem.
If you want to use aggregate functions together with "normal" columns, the functions should come after a groupBy or with a Window definition clause. Out of these cases they make no sense. Examples:
val result = df.groupBy($"col1").max("col2").as("max") // This works
In the above case, the resulting DataFrame will have both "col1" and "max" as columns.
val max = df.select(min("col2"), max("col2"))
This works because there are only aggregate functions in the query. However, the following will not work:
val result = df.filter($"col1" === max($"col2"))
because I am trying to mix a non aggregated column with an aggregated column.
If you want to compare a column with an aggregated value, you can try a join:
val maxDf = df.select(max("col2").as("maxValue"))
val joined = df.join(maxDf)
val result = joined.filter($"col1" === $"maxValue").drop("maxValue")
Or even use the simple value:
val maxValue = df.select(max("col2")).first.get(0)
val result = filter($"col1" === maxValue)