Test sparksql query - scala

I have a Dataframe that I want to run a simple query on like this:
def runQuery(df: DataFrame, queryString: String): DataFrame = {
df.createOrReplaceTempView("myDataFrame")
spark.sql(queryString)
}
Where queryString can be something like
"SELECT name, age FROM myDataFrame WHERE age > 30"
But I'd really like to know ahead of time whether the query will work without having to throw an Exception. For instance, what if df doesn't have the columns name and age? I want to write something like this to handle it:
def runQuery(df: DataFrame, queryString: String): DataFrame = {
if (/*** df and queryString are compatible ***/) {
df.createOrReplaceTempView("myDataFrame")
spark.sql(queryString)
} else {
spark.createDataFrame(sc.emptyRDD[Row], df.schema)
}
}
Is there a way to check this in an 'if' statement?

I wouldn't worry too much about exceptions. Just wrap it with Try:
import scala.util.Try
import org.apache.spark.sql.catalyst.encoders.RowEncoder
def runQuery(df: DataFrame, queryString: String): DataFrame = Try {
df.createOrReplaceTempView("myDataFrame")
df.sparkSession.sql(queryString)
}.getOrElse(df.sparkSession.emptyDataset(RowEncoder(df.schema)))
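If swallowing every exception feels too broad, one option is to catch only AnalysisException, which is what Spark raises when a query refers to columns or tables it cannot resolve. The sketch below assumes a Spark 2.x or later session, where spark.sql analyzes the query eagerly, so the failure surfaces before any job runs. The empty fallback mirrors the one in the question.

```scala
import org.apache.spark.sql.{AnalysisException, DataFrame, Row}

// Sketch: return an empty DataFrame with df's schema when the query
// cannot be analyzed (for example, a missing column); rethrow anything else.
def runQuery(df: DataFrame, queryString: String): DataFrame = {
  val spark = df.sparkSession
  df.createOrReplaceTempView("myDataFrame")
  try {
    // Analysis happens here, before any Spark job is triggered.
    spark.sql(queryString)
  } catch {
    case _: AnalysisException =>
      spark.createDataFrame(spark.sparkContext.emptyRDD[Row], df.schema)
  }
}
```

Runtime failures that only appear when the result is evaluated (a failing UDF, for instance) are not covered by this check.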

You can check whether all the columns are present in the DataFrame without triggering a Spark job:
def runQuery(df: DataFrame, queryString: String): DataFrame =
if(Array("name", "age", "address").forall(df.columns.contains)) {
df.createOrReplaceTempView("myDataFrame")
df.sparkSession.sql(queryString)
} else {
df.sparkSession.emptyDataset(RowEncoder(df.schema))
}
You can use df.schema to check the data types as well.
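A minimal sketch of the df.schema idea follows. The helper name schemaHas and the required-columns map are made up for illustration. The comparison is case-sensitive on column names, which is stricter than Spark's default column resolution.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: true when every required column exists in the
// schema with exactly the expected data type. Only metadata is inspected,
// so no Spark job runs.
def schemaHas(schema: StructType, required: Map[String, DataType]): Boolean =
  required.forall { case (name, dataType) =>
    schema.fields.exists(f => f.name == name && f.dataType == dataType)
  }

// Possible use inside runQuery:
//   if (schemaHas(df.schema, Map("name" -> StringType, "age" -> IntegerType))) ...
```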

Related

How to batch columns of spark dataframe, process with REST API and add it back?

I have a dataframe in spark and I need to process a particular column in that dataframe using a REST API. The API does some transformation to a string and returns a result string. The API can process multiple strings at a time.
I can iterate over the columns of the dataframe, collect n values of the column in a batch and call the api and then add it back to the dataframe, and continue with the next batch. But this seems like the normal way of doing it without taking advantage of spark.
Is there a better way to do this which can take advantage of spark sql optimiser and spark parallel processing?
For parallel processing in Spark you can use mapPartitions:
case class Input(col: String)
case class Output ( col : String,new_col : String )
val data = spark.read.csv("/a/b/c").as[Input].repartition(n)
def declare(partitions: Iterator[Input]): Iterator[Output] ={
val url = ""
implicit val formats: DefaultFormats.type = DefaultFormats
var list = new ListBuffer[Output]()
val httpClient =
try {
while (partitions.hasNext) {
val x = partitions.next()
val col = x.col
val concat_url =""
val apiResp = HttpClientAcceptSelfSignedCertificate.call(httpClient, concat_url)
if (apiResp.isDefined) {
val json = parse(apiResp.get)
val new_col = (json \\"value_to_take_from_api").children.head.values.toString
val output = Output(col,new_col)
list+=output
}
else {
val new_col = "Not Found"
val output = Output(col,new_col)
list+=output
}
}
} catch {
case e: Exception => println("api Exception with : " + e.getMessage)
}
finally {
HttpClientAcceptSelfSignedCertificate.close(httpClient)
}
list.iterator
}
val dd:Dataset[Output] =data.mapPartitions(x=>declare(x))
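Since the question mentions that the API accepts several strings per call, the per-partition iterator can also be cut into batches. This is only a sketch: callApi is a hypothetical stand-in for the real REST call and is assumed to return one result per input, in the same order.

```scala
// Sketch: group a partition's values into batches of batchSize, send each
// batch through callApi (hypothetical), and pair every input with its result.
def processInBatches(values: Iterator[String], batchSize: Int)(
    callApi: Seq[String] => Seq[String]): Iterator[(String, String)] =
  values.grouped(batchSize).flatMap { batch =>
    val results = callApi(batch) // one request per batch
    batch.zip(results)           // assumes results align with inputs
  }

// Possible wiring with the Input/Output case classes above
// (needs spark.implicits._ in scope for the Output encoder):
//   data.mapPartitions(p =>
//     processInBatches(p.map(_.col), 50)(callApi).map { case (c, r) => Output(c, r) })
```

Because the batches are produced lazily, only one batch per partition is held in memory at a time, unlike the ListBuffer approach.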

Dynamic Query on Scala Spark

I'm basically trying to do something like this, but Spark doesn't recognize it.
val colsToLower: Array[String] = Array("col0", "col1", "col2")
val selectQry: String = colsToLower.map((x: String) => s"""lower(col(\"${x}\")).as(\"${x}\"), """).mkString.dropRight(2)
df
.select(selectQry)
.show(5)
Is there a way to do something like this in spark/scala?
If you need to lowercase the name of your columns there is a simple way of doing it. Here is one example:
df.columns.foreach(c => {
val newColumnName = c.toLowerCase
df = df.withColumnRenamed(c, newColumnName)
})
This will allow you to lowercase the column names, and update it in the spark dataframe.
I believe I found a way to build it:
def lowerTextColumns(cols: Array[String])(df: DataFrame): DataFrame = {
val remainingCols: String = (df.columns diff cols).mkString(", ")
val lowerCols: String = cols.map((x: String) => s"""lower(${x}) as ${x}, """).mkString.dropRight(2)
val selectQry: String =
if (remainingCols.nonEmpty) lowerCols + ", " + remainingCols
else lowerCols
df
.selectExpr(selectQry.split(","):_*)
}
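The same result can probably be reached without building SQL strings, by mapping the column names to Column expressions. This is a sketch under the assumption that the column names contain no dots or other characters that need quoting. It also keeps the original column order.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lower}

// Sketch: lowercase the values of the listed columns and pass every other
// column through unchanged, preserving the original column order.
def lowerTextColumns(cols: Seq[String])(df: DataFrame): DataFrame = {
  val toLower = cols.toSet
  val projected = df.columns.map { c =>
    if (toLower.contains(c)) lower(col(c)).as(c) else col(c)
  }
  df.select(projected: _*)
}
```

The curried shape matches the version above, so it can be used as df.transform(lowerTextColumns(Seq("col0", "col1"))).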

How to perform UPSERT or MERGE operation in Apache Spark?

I am trying to update and insert records to old Dataframe using unique column "ID" using Apache Spark.
To update a DataFrame, you can perform a "left_anti" join on the unique columns and then union the result with the DataFrame that contains the new records:
def refreshUnion(oldDS: Dataset[_], newDS: Dataset[_], usingColumns: Seq[String]): Dataset[_] = {
val filteredNewDS = selectAndCastColumns(newDS, oldDS)
oldDS.join(
filteredNewDS,
usingColumns,
"left_anti")
.select(oldDS.columns.map(columnName => col(columnName)): _*)
.union(filteredNewDS.toDF)
}
def selectAndCastColumns(ds: Dataset[_], refDS: Dataset[_]): Dataset[_] = {
val columns = ds.columns.toSet
ds.select(refDS.columns.map(c => {
if (!columns.contains(c)) {
lit(null).cast(refDS.schema(c).dataType) as c
} else {
ds(c).cast(refDS.schema(c).dataType) as c
}
}): _*)
}
val df = refreshUnion(oldDS, newDS, Seq("ID"))
Spark DataFrames are immutable structures, so you can't update rows in place based on the ID.
The way to update a DataFrame is to merge the older DataFrame with the newer one and save the merged result on HDFS. To replace the older version of an ID you need some de-duplication key (a timestamp, for example).
Sample code in Scala follows. Call the merge function with the names of the unique ID column and the timestamp column. The timestamp should be a Long.
case class DedupableDF(unique_id: String, ts: Long);
def merge(snapshot: DataFrame)(
delta: DataFrame)(uniqueId: String, timeStampStr: String): DataFrame = {
val mergedDf = snapshot.union(delta)
return dedupeData(mergedDf)(uniqueId, timeStampStr)
}
def dedupeData(dataFrameToDedupe: DataFrame)(
uniqueId: String,
timeStampStr: String): DataFrame = {
import sqlContext.implicits._
def removeDuplicates(
duplicatedDataFrame: DataFrame): Dataset[DedupableDF] = {
val dedupableDF = duplicatedDataFrame.map(a =>
DedupableDF(a(0).asInstanceOf[String], a(1).asInstanceOf[Long]))
val mappedPairRdd =
dedupableDF.map(row ⇒ (row.unique_id, (row.unique_id, row.ts))).rdd;
val reduceByKeyRDD = mappedPairRdd
.reduceByKey((row1, row2) ⇒ {
if (row1._2 > row2._2) {
row1
} else {
row2
}
})
.values;
val ds = reduceByKeyRDD.toDF.map(a =>
DedupableDF(a(0).asInstanceOf[String], a(1).asInstanceOf[Long]))
return ds;
}
/** get distinct unique_id, timestamp combinations **/
val filteredData =
dataFrameToDedupe.select(uniqueId, timeStampStr).distinct
val dedupedData = removeDuplicates(filteredData)
dataFrameToDedupe.createOrReplaceTempView("duplicatedDataFrame");
dedupedData.createOrReplaceTempView("dedupedDataFrame");
val dedupedDataFrame =
sqlContext.sql(s""" select distinct duplicatedDataFrame.*
from duplicatedDataFrame
join dedupedDataFrame on
(duplicatedDataFrame.${uniqueId} = dedupedDataFrame.unique_id
and duplicatedDataFrame.${timeStampStr} = dedupedDataFrame.ts)""")
return dedupedDataFrame
}
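The merge-then-dedupe idea can also be written with a window function instead of the RDD reduceByKey step. This is a different technique from the code above, shown only as a sketch: the name upsertLatest and the helper column _rn are made up, both DataFrames are assumed to have the same columns in the same order (union is positional), and ties on the timestamp are broken arbitrarily, so exactly one row per ID survives.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Sketch: union snapshot and delta, then keep only the newest row per ID.
def upsertLatest(snapshot: DataFrame, delta: DataFrame,
                 uniqueId: String, timeStampCol: String): DataFrame = {
  // Rank the rows of each ID from newest to oldest.
  val newestFirst = Window.partitionBy(col(uniqueId)).orderBy(col(timeStampCol).desc)
  snapshot.union(delta)
    .withColumn("_rn", row_number().over(newestFirst))
    .filter(col("_rn") === 1) // the newest row per ID
    .drop("_rn")
}
```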

Why does a Spark SQL UDF return a DataFrame with column names in the format UDF("Original Column Name")?

So the dataframe I get after running the following code is exactly how I want it to be. It is the same dataframe as the original but all cells with purely numeric data have had all brackets and slashes removed (brackets are replaced with a minus sign at the front).
stringModifierIterator takes in a dataframe and returns a List[Column]. The List[Column] can then be used like in the command dataframe.select(List[Column]: _*) to create a new dataframe.
Unfortunately, the column names have been altered to something like UDF("Original Column Name") and I can't figure out why.
def stringModifierIterator(dataFrame: DataFrame, dataFrameColumns: Array[String], uDF: UserDefinedFunction): List[Column] ={
if(dataFrameColumns.isEmpty){
Nil
} else {
uDF(dataFrame(dataFrameColumns.head)) :: stringModifierIterator(dataFrame, dataFrameColumns.tail, uDF)
}
}
val stringModifierFunction: (String => String) = { s: String => Option(s).map(modifier).getOrElse("0") }
def modifier(inputString: String): String = {
???
}
This is what the column names look like when I use df.show()
You can solve this by explicitly naming the columns you create with the UDF in stringModifierIterator using Column.as:
def stringModifierIterator(dataFrame: DataFrame, dataFrameColumns: Array[String], uDF: UserDefinedFunction): List[Column] ={
if(dataFrameColumns.isEmpty){
Nil
} else {
val col = dataFrameColumns.head
uDF(dataFrame(col)).as(col) :: stringModifierIterator(dataFrame, dataFrameColumns.tail, uDF)
}
}
BTW, this method can be much shorter and simpler without recursion:
def stringModifierIterator(dataFrame: DataFrame, dataFrameColumns: Array[String], uDF: UserDefinedFunction): List[Column] ={
dataFrameColumns.toList.map(col => uDF(dataFrame(col)).as(col))
}

How do I filter rows based on whether a column value is in a Set of Strings in a Spark DataFrame

Is there a more elegant way of filtering based on values in a Set of String?
def myFilter(actions: Set[String], myDF: DataFrame): DataFrame = {
val containsAction = udf((action: String) => {
actions.contains(action)
})
myDF.filter(containsAction('action))
}
In SQL you can do
select * from myTable where action in ('action1', 'action2', 'action3')
How about this:
myDF.filter("action in (1,2)")
OR
import org.apache.spark.sql.functions.lit
myDF.where($"action".in(Seq(1,2).map(lit(_)):_*))
OR
import org.apache.spark.sql.functions.lit
myDF.where($"action".in(Seq(lit(1),lit(2)):_*))
Additional support will be added to make this cleaner in Spark 1.5.
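From Spark 1.5 onward, Column.isin accepts plain values as varargs, so the Set from the question can be passed directly without a UDF or lit. A minimal sketch, keeping the question's function name and the "action" column:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch: keep the rows whose "action" value is one of the given strings.
def myFilter(actions: Set[String], myDF: DataFrame): DataFrame =
  myDF.filter(col("action").isin(actions.toSeq: _*))
```

Unlike the UDF version, this stays a native expression that the optimizer can see through. For a very large set, a join against a small DataFrame of allowed values may be a better fit.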