Spark SQL - how to write a dynamic query in Spark SQL (Scala)

I have one Hive table and I want to create dynamic Spark SQL queries. At the time of spark-submit I specify a rule name, and the query should be generated based on that rule name. For example:
spark-submit <RuleName> IncorrectAge
It should fire my Scala object code:
select tablename, filter, condition from all_rules where rulename="IncorrectAge"
My table: Rules (input table)
| rowkey | rule_name    | rule_run_status | tablename       | condition | filter        | level |
|--------|--------------|-----------------|-----------------|-----------|---------------|-------|
| 1      | IncorrectAge | In_Progress     | VDP_Vendor_List | age>18    | gender=Male   | NA    |
| 2      | Customer_age | In_Progress     | Customer_List   | age<25    | gender=Female | NA    |
I fetch the rule by name:
select tablename, filter, condition from all_rules where rulename="IncorrectAge";
After executing this query, I get a result like this:
| tablename       | filter      | condition |
|-----------------|-------------|-----------|
| VDP_Vendor_List | gender=Male | age>18    |
Now I want to build the Spark SQL queries dynamically:
select count(*) from VDP_Vendor_List                      -- first column: tablename
select count(*) from VDP_Vendor_List where gender=Male    -- tablename and filter
select * from EMP where gender=Male AND age >18           -- tablename, filter and condition
My code (Spark 2.2):
import org.apache.spark.sql.{ Row, SparkSession }
import org.apache.log4j._

object allrules {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().master("local[*]")
      .appName("Spark Hive")
      .enableHiveSupport().getOrCreate()
    import spark.implicits._

    val sampleDF = spark.read.json("C:/software/sampletableCopy.json") // for testing purpose i converted hive table to json data
    sampleDF.registerTempTable("sampletable")
    val allrulesDF = spark.sql("SELECT * FROM sampletable")
    allrulesDF.show()

    val TotalCount: Long = allrulesDF.count()
    println("==============> Total count ======>" + allrulesDF.count())

    val df1 = allrulesDF.select(allrulesDF.col("tablename"), allrulesDF.col("condition"), allrulesDF.col("filter"), allrulesDF.col("rule_name"))
    df1.show()
    val df2 = df1.where(df1.col("rule_name").equalTo("IncorrectAge")).show()
    println(df2)

    // var table_name = ""
    // var condition = ""
    // var filter = "";
    // df1.foreach(row => {
    //   table_name = row.get(1).toString();
    //   condition = row.get(2).toString();
    //   filter = row.get(3).toString();
    // })
  }
}

You can pass arguments from spark-submit to your application:
bin/spark-submit --class allrules something.jar tablename filter condition
Then, in your main function, you will have your params:
def main(args: Array[String]): Unit = {
  // args(0), args(1), ... are your params
}
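Building on that, here is a minimal sketch of how the passed rule name could drive the metadata query and the generated query. It assumes the all_rules table and the column names from the question, and that the rule name arrives as args(0):
import org.apache.spark.sql.SparkSession

object RuleRunner {
  def main(args: Array[String]): Unit = {
    // args(0) is assumed to be the rule name, e.g. "IncorrectAge"
    val ruleName = args(0)

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Look up the rule metadata; table and column names are taken from the question.
    val metadata = spark.sql(
      s"select tablename, filter, condition from all_rules where rule_name = '$ruleName'")

    metadata.collect().foreach { row =>
      val (table, filter, condition) = (row.getString(0), row.getString(1), row.getString(2))
      // Build the dynamic query; filter and condition strings are used verbatim, as in the question.
      spark.sql(s"select count(*) from $table where $filter AND $condition").show()
    }
  }
}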

You can pass your arguments to your driver class like this:
import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession

object DriverClass {
  val log = Logger.getLogger(getClass.getName)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("yarn").config("spark.sql.warehouse.dir", "path").enableHiveSupport().getOrCreate()
    if (args == null || args.isEmpty || args.length != 2) {
      log.error("Invalid number of arguments passed.")
      log.error("Arguments Usage: <Rule Name> <Rule Type>")
      log.error("Stopping the flow")
      System.exit(1)
    }
    import spark.implicits._
    val ruleName: String = String.valueOf(args(0).trim())
    val ruleType: String = String.valueOf(args(1).trim())
    val processSQL: String = s"Select tablename, filter, condition from all_rules where $ruleName=$ruleType"
    val metadataDF = spark.sql(processSQL)
    val (tblnm, fltr, cndtn) = metadataDF.rdd.map(f => (f.get(0).toString(), f.get(1).toString(), f.get(2).toString())).collect()(0)
    val finalSql_1 = s"select count(*) from $tblnm"                 // first column
    val finalSql_2 = s"select count(*) from $tblnm where $fltr"
    val finalSql_3 = s"select * from EMP where $fltr AND $cndtn"
    spark.sql(finalSql_1).show()
    spark.sql(finalSql_2).show()
    spark.sql(finalSql_3).show()
  }
}
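One caveat: if no row matches the rule, the collect()(0) at the end will fail with an ArrayIndexOutOfBoundsException. A small guard sketch you could use instead of that line (same variable names as above assumed):
// Guard against an empty metadata result before destructuring the first row.
val rows = metadataDF.collect()
if (rows.isEmpty) {
  log.error(s"No rule metadata found for $ruleName / $ruleType")
  System.exit(1)
}
val (tblnm, fltr, cndtn) = (rows(0).get(0).toString, rows(0).get(1).toString, rows(0).get(2).toString)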

Related

Assigning elements in an array into the same DataFrame using Scala and Spark

I input an array of emojis, and I want to get their Unicode code points and store them in a DataFrame. Here is my code:
def getUnicodeOfEmoji(emojiArray: Array[String]): DataFrame = {
  val existingSparkSession = SparkSession.builder().getOrCreate()
  import existingSparkSession.implicits._
  var result: DataFrame = null
  var df: DataFrame = null
  for (i <- 0 until emojiArray.length) {
    df = Seq(emojiArray(i)).toDF("emoji")
    df.show()
    result = df.selectExpr(
      "emoji",
      "'U+' || trim('0' , string(hex(encode(emoji, 'utf-32')))) as result"
    )
  }
  result.show(false)
  return result
}
Input: val emojis = "😃😜😍"
Actual output:
+-----+-------+
|emoji|result |
+-----+-------+
|😍   |U+1F60D|
+-----+-------+
But I need to have all 3 emojis with their specific unicodes within the dataframe.
You don't need a for loop to construct the dataframe. You can convert the array to a Seq and use the toDF method of a Seq to construct the resulting dataframe.
def getUnicodeOfEmoji(emojiArray: Array[String]): DataFrame = {
  val existingSparkSession = SparkSession.builder().getOrCreate()
  import existingSparkSession.implicits._
  val df = emojiArray.toSeq.toDF("emoji")
  val result = df.selectExpr(
    "emoji",
    "'U+' || trim('0' , string(hex(encode(emoji, 'utf-32')))) as result"
  )
  result.show(false)
  return result
}
val emojis = "😃😜😍"
val input = raw"\p{block=Emoticons}".r.findAllIn(emojis).toArray
val converted = getUnicodeOfEmoji(input)
+-----+-------+
|emoji|result |
+-----+-------+
|😃 |U+1F603|
|😜 |U+1F61C|
|😍 |U+1F60D|
+-----+-------+
A slight improvement is to convert your string of emojis to a Seq[String] directly before feeding into the function, e.g.
def getUnicodeOfEmoji(emojiArray: Seq[String]): DataFrame = {
  val existingSparkSession = SparkSession.builder().getOrCreate()
  import existingSparkSession.implicits._
  val df = emojiArray.toDF("emoji")
  val result = df.selectExpr(
    "emoji",
    "'U+' || trim('0' , string(hex(encode(emoji, 'utf-32')))) as result"
  )
  result.show(false)
  return result
}
val emojis = "😃😜😍"
val input = raw"\p{block=Emoticons}".r.findAllIn(emojis).toSeq
val converted = getUnicodeOfEmoji(input)
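If your input may contain emoji outside the Emoticons block, a small alternative sketch (an assumption, not part of the original answer; requires Java 8+) that splits any string into individual code points before calling the function:
// Split any emoji string into one String per code point, not just the Emoticons block
val emojis = "😃😜😍"
val input: Seq[String] = emojis.codePoints().toArray.toSeq.map(cp => new String(Character.toChars(cp)))
val converted = getUnicodeOfEmoji(input)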

How to perform UPSERT or MERGE operation in Apache Spark?

I am trying to update and insert records into an old DataFrame using the unique column "ID" in Apache Spark.
In order to update the DataFrame, you can perform a "left_anti" join on the unique columns and then UNION it with the DataFrame that contains the new records:
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{col, lit}

def refreshUnion(oldDS: Dataset[_], newDS: Dataset[_], usingColumns: Seq[String]): Dataset[_] = {
  val filteredNewDS = selectAndCastColumns(newDS, oldDS)
  oldDS.join(
      filteredNewDS,
      usingColumns,
      "left_anti")
    .select(oldDS.columns.map(columnName => col(columnName)): _*)
    .union(filteredNewDS.toDF)
}

def selectAndCastColumns(ds: Dataset[_], refDS: Dataset[_]): Dataset[_] = {
  val columns = ds.columns.toSet
  ds.select(refDS.columns.map(c => {
    if (!columns.contains(c)) {
      lit(null).cast(refDS.schema(c).dataType) as c
    } else {
      ds(c).cast(refDS.schema(c).dataType) as c
    }
  }): _*)
}

val df = refreshUnion(oldDS, newDS, Seq("ID"))
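For a concrete picture, here is a small hypothetical usage example (the column names and sample data are assumptions for illustration):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical old and new datasets sharing the unique column "ID"
val oldDS = Seq((1, "old-a"), (2, "old-b")).toDF("ID", "payload")
val newDS = Seq((2, "new-b"), (3, "new-c")).toDF("ID", "payload")

// ID 1 is kept from oldDS, ID 2 is replaced by newDS, ID 3 is inserted
refreshUnion(oldDS, newDS, Seq("ID")).show()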
Spark DataFrames are immutable structures, so you can't do any in-place update based on the ID.
The way to update a DataFrame is to merge the older DataFrame and the newer DataFrame and save the merged DataFrame on HDFS. To resolve duplicates of the same ID you need some de-duplication key (a timestamp, for example).
I am adding sample code for this in Scala. You need to call the merge function with the unique ID and the timestamp column names. The timestamp should be a Long.
case class DedupableDF(unique_id: String, ts: Long)

def merge(snapshot: DataFrame)(
    delta: DataFrame)(uniqueId: String, timeStampStr: String): DataFrame = {
  val mergedDf = snapshot.union(delta)
  return dedupeData(mergedDf)(uniqueId, timeStampStr)
}

def dedupeData(dataFrameToDedupe: DataFrame)(
    uniqueId: String,
    timeStampStr: String): DataFrame = {
  import sqlContext.implicits._

  def removeDuplicates(
      duplicatedDataFrame: DataFrame): Dataset[DedupableDF] = {
    val dedupableDF = duplicatedDataFrame.map(a =>
      DedupableDF(a(0).asInstanceOf[String], a(1).asInstanceOf[Long]))
    val mappedPairRdd =
      dedupableDF.map(row => (row.unique_id, (row.unique_id, row.ts))).rdd
    val reduceByKeyRDD = mappedPairRdd
      .reduceByKey((row1, row2) => {
        if (row1._2 > row2._2) {
          row1
        } else {
          row2
        }
      })
      .values
    val ds = reduceByKeyRDD.toDF.map(a =>
      DedupableDF(a(0).asInstanceOf[String], a(1).asInstanceOf[Long]))
    return ds
  }

  /** get distinct unique_id, timestamp combinations **/
  val filteredData =
    dataFrameToDedupe.select(uniqueId, timeStampStr).distinct
  val dedupedData = removeDuplicates(filteredData)

  dataFrameToDedupe.createOrReplaceTempView("duplicatedDataFrame")
  dedupedData.createOrReplaceTempView("dedupedDataFrame")

  val dedupedDataFrame =
    sqlContext.sql(s""" select distinct duplicatedDataFrame.*
                 from duplicatedDataFrame
                 join dedupedDataFrame on
                 (duplicatedDataFrame.${uniqueId} = dedupedDataFrame.unique_id
                 and duplicatedDataFrame.${timeStampStr} = dedupedDataFrame.ts)""")
  return dedupedDataFrame
}
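A shorter alternative to the RDD-based de-duplication, if you prefer, is a window function that keeps only the latest record per key. This is a sketch of that swapped-in technique, assuming a Long timestamp column and a unique key column whose names you pass in:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

def dedupeWithWindow(mergedDf: DataFrame, uniqueId: String, timeStampStr: String): DataFrame = {
  // Rank rows per key by timestamp, newest first, and keep only the top row
  val w = Window.partitionBy(col(uniqueId)).orderBy(col(timeStampStr).desc)
  mergedDf
    .withColumn("rn", row_number().over(w))
    .where(col("rn") === 1)
    .drop("rn")
}
Usage would mirror the merge above: union the snapshot and the delta, then call dedupeWithWindow on the result.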

Is it possible (and how) to specify an SQL query on the command line with spark-submit?

I have the following code:
def main(args: Array[String]) {
  var dvfFiles: String = "g:/data/gouv/dvf/raw"
  var q: String = ""
  //q = "SELECT distinct DateMutation, NVoie, IndVoie, Voie, Valeur, CodeTypeLocal, TypeLocal, Commune FROM mutations WHERE Commune = 'ICI' and Valeur > 100000 and CodeTypeLocal in (1, 2) order by Valeur desc"
  args.sliding(2, 2).toList.collect {
    case Array("--sfiles", argFiles: String) => dvfFiles = argFiles
    case Array("--squery", argQ: String) => q = argQ
  }
  println(s"files from: ${dvfFiles}")
If I run the following command:
G:\dev\fromGit\dvf\spark>spark-submit .\target\scala-2.11\dfvqueryer_2.11-1.0.jar \
--squery "SELECT distinct DateMutation, NVoie, IndVoie, Voie, Valeur, CodeTypeLocal, \
TypeLocal, Commune FROM mutations WHERE (Commune = 'ICI') and (Valeur > 100000) and (CodeTypeLocal in (1, 2)) order by Valeur desc"
I got the following result:
== SQL ==
SELECT distinct DateMutation, NVoie, IndVoie, Voie, Valeur, CodeTypeLocal, TypeLocal, Commune FROM mutations WHERE (Commune = 'ICI') and (Valeur and (CodeTypeLocal in (1, 2)) order by Valeur desc
----------------------------------------------------------------------------------------------^^^
the ^^^ pointing the FROM
I also notice the missing > 100000 after Valeur.
The query is correct, because if I uncomment the //q = ... line, package the code and submit it, everything works fine.
It seems that part of the query is being dropped while the arguments are read. One solution to this problem is to send the entire SELECT query as a single argument and read it into a string value. In that format it can be passed straight to the sql function to run your query. Below is how you can build out the function:
//The Package Tree
package stack.overFlow

//Call all needed packages
import org.apache.spark.sql.{DataFrame, SparkSession, Column, SQLContext}
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql

//Object Name
object demoCode {
  def main(args: Array[String]) {
    ///Build the contexts
    var spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    var sc = spark.sparkContext
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    //Set the query as a string for argument 1
    val commandQuery: String = args(0)

    //Pass query to the sql function
    val inputDF = spark.sql(commandQuery)
  }
}
When the code compiles you will need two things: (1) the jar and (2) the package tree and class for running it. When running both of those with --class, all you need to do is include a space after the jar path and pass through the SQL query, so that at run time it is loaded into the Spark session.
spark-submit --class stack.overFlow.demoCode /home/user/demo_code/target/demoCode-compilation-jar.jar \
SELECT distinct DateMutation, NVoie, IndVoie, Voie, Valeur, CodeTypeLocal,TypeLocal, Commune FROM mutations WHERE (Commune = 'ICI') and (Valeur > 100000) and (CodeTypeLocal in (1, 2)) order by Valeur desc
Would this help your use-case or do you need it to be in another format?
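Note that if the query is passed without surrounding quotes, as in the spark-submit line above, it arrives split across several elements of args. A small sketch (an assumption about how the job is invoked) that rejoins the arguments before calling sql:
//Rejoin all application arguments into a single query string
val commandQuery: String = args.mkString(" ")
val inputDF = spark.sql(commandQuery)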

How to select a field from select * in Flink

I am trying to join two selects.
I have to build a query in code that looks like this:
select *
from Data
where numPers > 10 && Object = P1
and this
select *
from Data
where numPers < 20 && Object == P1
And I need only the timestamps from the data, without repeats.
The program code I use is shown below:
object Prog {
  def main(args: Array[String]): Unit = {
    org.apache.log4j.BasicConfigurator.configure()
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val tableEnv = TableEnvironment.getTableEnvironment(env)

    val csvTableSource = CsvTableSource
      .builder
      .path("src/main/resources/data.stream")
      .field("numPers", Types.INT)
      .field("Object", Types.STRING)
      .field("TIMESTAMP", Types.STRING)
      .fieldDelimiter(",")
      .ignoreFirstLine
      .ignoreParseErrors
      .commentPrefix("%")
      .build()

    tableEnv.registerTableSource("Data", csvTableSource)

    val table = tableEnv.scan("Data") //this works
      .filter("numPers > 10")
      .select("*")

    val ds = tableEnv.toAppendStream(table, classOf[Row])
    ds.print()
    env.execute()
  }
}
But how can I add the second query to the first?
If I understand your requirements correctly, you don't need a join but just a BETWEEN predicate:
val query = "SELECT * FROM Data WHERE numPers BETWEEN 10 AND 20 AND Object = 'P1'"
val table = tableEnv.sqlQuery(query)
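If you then only want the distinct timestamps, a sketch along the same lines (this is an assumption about your exact needs; note that DISTINCT on a streaming table produces updates, so the result is converted with toRetractStream rather than toAppendStream):
// Keep only the TIMESTAMP column, without repeats
val tsQuery = "SELECT DISTINCT `TIMESTAMP` FROM Data WHERE numPers BETWEEN 10 AND 20 AND Object = 'P1'"
val tsTable = tableEnv.sqlQuery(tsQuery)
// DISTINCT yields a retract stream: each element is (isAdd, row)
val tsStream = tableEnv.toRetractStream(tsTable, classOf[Row])
tsStream.print()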

NullPointerException when using UDF in Spark

I have a DataFrame in Spark such as this one:
var df = List(
  (1, "{NUM.0002}*{NUM.0003}"),
  (2, "{NUM.0004}+{NUM.0003}"),
  (3, "END(6)"),
  (4, "END(4)")
).toDF("CODE", "VALUE")
+----+---------------------+
|CODE| VALUE|
+----+---------------------+
| 1|{NUM.0002}*{NUM.0003}|
| 2|{NUM.0004}+{NUM.0003}|
| 3| END(6)|
| 4| END(4)|
+----+---------------------+
My task is to iterate through the VALUE column and do the following: check if there is a substring such as {NUM.XXXX}, get the XXXX number, get the row where $"CODE" === XXXX, and replace the {NUM.XXXX} substring with the VALUE string in that row.
I would like the dataframe to look like this in the end:
+----+--------------------+
|CODE| VALUE|
+----+--------------------+
| 1|END(4)+END(6)*END(6)|
| 2| END(4)+END(6)|
| 3| END(6)|
| 4| END(4)|
+----+--------------------+
This is the best I've come up with:
val process = udf((ln: String) => {
  var newln = ln
  while (newln contains "{NUM.") {
    var num = newln.slice(newln.indexOf("{") + 5, newln.indexOf("}")).toInt
    var new_value = df.where($"CODE" === num).head.getAs[String](1)
    newln = newln.replace(newln.slice(newln.indexOf("{"), newln.indexOf("}") + 1), new_value)
  }
  newln
})

var df2 = df.withColumn("VALUE", when('VALUE contains "{NUM.", process('VALUE)).otherwise('VALUE))
Unfortunately, I get a NullPointerException when I try to filter/select/save df2, and no error when I just show df2. I believe the error appears when I access the DataFrame df within the UDF, but I need to access it every iteration, so I can't pass it as an input. Also, I've tried saving a copy of df inside the UDF but I don't know how to do that. What can I do here?
Any suggestions to improve the algorithm are very welcome! Thanks!
I wrote something that works, but it is not very optimized, I think. I actually do recursive joins on the initial DataFrame to replace the NUMs by ENDs. Here is the code:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

case class Data(code: Long, value: String)

def main(args: Array[String]): Unit = {
  val sparkSession: SparkSession = SparkSession.builder().master("local").getOrCreate()
  val data = Seq(
    Data(1, "{NUM.0002}*{NUM.0003}"),
    Data(2, "{NUM.0004}+{NUM.0003}"),
    Data(3, "END(6)"),
    Data(4, "END(4)"),
    Data(5, "{NUM.0002}")
  )
  val initialDF = sparkSession.createDataFrame(data)

  val endDF = initialDF.filter(!(col("value") contains "{NUM"))
  val numDF = initialDF.filter(col("value") contains "{NUM")

  val resultDF = endDF.union(replaceNumByEnd(initialDF, numDF))
  resultDF.show(false)
}

val parseNumUdf = udf((value: String) => {
  if (value.contains("{NUM")) {
    val regex = """.*?\{NUM\.(\d+)\}.*""".r
    value match {
      case regex(code) => code.toLong
    }
  } else {
    -1L
  }
})

val replaceUdf = udf((value: String, replacement: String) => {
  val regex = """\{NUM\.(\d+)\}""".r
  regex.replaceFirstIn(value, replacement)
})

def replaceNumByEnd(initialDF: DataFrame, currentDF: DataFrame): DataFrame = {
  if (currentDF.count() == 0) {
    currentDF
  } else {
    val numDFWithCode = currentDF
      .withColumn("num_code", parseNumUdf(col("value")))
      .withColumnRenamed("code", "code_original")
      .withColumnRenamed("value", "value_original")

    val joinedDF = numDFWithCode.join(initialDF, numDFWithCode("num_code") === initialDF("code"))

    val replacedDF = joinedDF.withColumn("value_replaced", replaceUdf(col("value_original"), col("value")))

    val nextDF = replacedDF.select(col("code_original").as("code"), col("value_replaced").as("value"))

    val endDF = nextDF.filter(!(col("value") contains "{NUM"))
    val numDF = nextDF.filter(col("value") contains "{NUM")

    endDF.union(replaceNumByEnd(initialDF, numDF))
  }
}
If you need more explanation, don't hesitate.