How do I filter rows based on whether a column value is in a Set of Strings in a Spark DataFrame - scala

Is there a more elegant way of filtering based on values in a Set of Strings?
def myFilter(actions: Set[String], myDF: DataFrame): DataFrame = {
  val containsAction = udf((action: String) => {
    actions.contains(action)
  })
  myDF.filter(containsAction('action))
}
In SQL you can do
select * from myTable where action in ('action1', 'action2', 'action3')

How about this:
myDF.filter("action in (1,2)")
OR
import org.apache.spark.sql.functions.lit
myDF.where($"action".in(Seq(1,2).map(lit(_)):_*))
OR
import org.apache.spark.sql.functions.lit
myDF.where($"action".in(Seq(lit(1),lit(2)):_*))
Additional support will be added to make this cleaner in 1.5
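For the original question's Set[String], a sketch using Column.isin (added in Spark 1.5) avoids the UDF entirely; actions and myDF are the names from the question:
import org.apache.spark.sql.functions.col

// Pass the set's elements as varargs to isin; no UDF needed
myDF.filter(col("action").isin(actions.toSeq: _*))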

Related

Dynamic conversion of Array of double columns into multiple columns in nested spark dataframe

My current DataFrame looks like as below:
{"id":"1","inputs":{"values":{"0.2":[1,1],"0.4":[1,1],"0.6":[1,1]}},"id1":[1,2]}
I want to transform this dataframe into the below dataFrame:
{"id":"1", "v20":[1,1],"v40":[1,1],"v60":[1,1],"id1":[1,2]}
This means that each key under 'values' (0.2, 0.4 and 0.6) should be multiplied by 100, prefixed with the letter 'v', and its value extracted into a separate column.
How would the code look in order to achieve this? I have tried withColumn but couldn't achieve it.
Try the code below; the inline comments explain each step.
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

object DynamicCol {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.read.json("src/main/resources/dyamicCol.json") // Load the JSON file
    // Temp DataFrame for fetching the nested values
    val dfTemp = df.select(col("inputs.values").as("values"))
    val index = dfTemp.schema.fieldIndex("values")
    val propSchema = dfTemp.schema(index).dataType.asInstanceOf[StructType]
    // Fold over the nested fields, adding one renamed column per nested key
    val dfFinal = propSchema.fields.foldLeft(df)((df, field) => {
      val colNameInt = (field.name.toDouble * 100).toInt
      val colName = s"v$colNameInt"
      df.withColumn(colName, col("inputs.values.`" + field.name + "`")) // Add the nested column mapping
    }).drop("inputs") // Drop the extra column
    dfFinal.write.mode(SaveMode.Overwrite).json("src/main/resources/dyamicColOut.json") // Output the JSON file
  }
}
I would split the column-renaming logic into two parts: one for names that are numeric values, and one for names that stay unchanged.
def stringDecimalToVNumber(colName: String): String =
  "v" + (colName.toFloat * 100).toInt.toString
and form a single function that transforms a name according to which case it matches:
val floatRegex = """(\d+\.?\d*)""".r

def transformColumnName(colName: String): String = colName match {
  case floatRegex(v) => stringDecimalToVNumber(v) // it's a float, transform it
  case x => x // not numeric, keep it as-is
}
Now that we have the function to transform the column names, let's pick up the schema dynamically.
val flattenDF = df.select("id", "id1", "inputs.values.*")
val finalDF = flattenDF
  .schema.names
  .foldLeft(flattenDF)((dfacum, x) => {
    val newName = transformColumnName(x)
    if (newName == x)
      dfacum // the name didn't need to be changed
    else
      dfacum.withColumnRenamed(x, newName)
  })
This will dynamically rename all the columns that came from inputs.values and keep them next to id and id1.
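As a quick sanity check (not part of the original answer, and assuming the definitions above), the helper maps the numeric keys as expected:
transformColumnName("0.2") // "v20"
transformColumnName("0.6") // "v60"
transformColumnName("id1") // "id1" (not numeric, unchanged)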

Getting datatype of all columns in a dataframe using scala

I have a data frame into which the following data is loaded (shown as an image in the original post). I am trying to develop code which will read data from any source, load it into a data frame, and return the output shown in a second image: for each column, its name, its data type, and whether it is a Dimension or a Measure.
You can use the schema property and then iterate over the fields.
Example:
Seq(("A", 1))
.toDF("Field1", "Field2")
.schema
.fields
.foreach(field => println(s"${field.name}, ${field.dataType}"))
Results:
Field1, StringType
Field2, IntegerType
Make sure to take a look at the Spark ScalaDoc.
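As a related shortcut (not part of the answer above), DataFrame also exposes dtypes, which returns the column names paired with their data type names as strings; a minimal sketch:
// dtypes returns Array[(String, String)] of (columnName, dataTypeName)
Seq(("A", 1))
  .toDF("Field1", "Field2")
  .dtypes
  .foreach { case (name, dtype) => println(s"$name, $dtype") }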
That's the closest I could get to the output. Create a schema from a case class, then build a DataFrame from the list of schema columns, mapping each column to its name, its data type, and its Dimension/Measure label.
import java.sql.Date

object GetColumnDf {

  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess // project helper that returns a SparkSession

    // Classify each column as a Dimension or a Measure
    val map = Map("Emp_ID" -> "Dimension", "Cust_Name" -> "Dimension", "Cust_Age" -> "Measure",
      "Salary" -> "Measure", "DoJ" -> "Dimension")

    import spark.implicits._
    // Build a one-row DataFrame from the case class, then read its schema fields
    val list = Seq(Bean123("C-1001", "Jack", 25, 3000, new Date(2000000))).toDF().schema.fields
      .map(col => (col.name, col.dataType.toString, map.get(col.name)))
      .toList.toDF("Headers", "Data_Type", "Type")
    list.show()
  }
}

case class Bean123(Emp_ID: String, Cust_Name: String, Cust_Age: Int, Salary: Int, DoJ: Date)

How to select a column in a dataframe by its number instead of its name

I'd like to select a column in a Spark DataFrame by its number instead of its name. Is it possible?
Thank you
If you want to write your own method for this you can do:
package utils

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object Extensions {
  implicit class DataFrameExtensions(df: DataFrame) {
    // Select columns by their positional (0-based) indices
    def selecti(indices: Int*) = {
      val cols = df.columns
      df.select(indices.map(i => col(cols(i))): _*)
    }
  }
}
Now you can import and use this method as:
import utils.Extensions._
df.selecti(1,2,3)
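Without the implicit class, the same idea can be written inline (a sketch; df here stands for any DataFrame with at least three columns):
import org.apache.spark.sql.functions.col

// Pick the 1st and 3rd columns by position (0-based indices into df.columns)
df.select(col(df.columns(0)), col(df.columns(2)))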

Test sparksql query

I have a Dataframe that I want to run a simple query on like this:
def runQuery(df: DataFrame, queryString: String): DataFrame = {
  df.createOrReplaceTempView("myDataFrame")
  spark.sql(queryString)
}
Where queryString can be something like
"SELECT name, age FROM myDataFrame WHERE age > 30"
But I'd really like to know ahead of time whether the query will work without having to throw an Exception. For instance, what if df doesn't have the columns name and age? I want to write something like this to handle it:
def runQuery(df: DataFrame, queryString: String): DataFrame = {
  if (/*** df and queryString are compatible ***/) {
    df.createOrReplaceTempView("myDataFrame")
    spark.sql(queryString)
  } else {
    spark.createDataFrame(sc.emptyRDD[Row], df.schema)
  }
}
Is there a way to check this in an 'if' statement?
I wouldn't worry too much about exceptions. Just wrap it with Try:
import scala.util.Try
import org.apache.spark.sql.catalyst.encoders.RowEncoder
def runQuery(df: DataFrame, queryString: String): DataFrame = Try {
  df.createOrReplaceTempView("myDataFrame")
  df.sparkSession.sql(queryString)
}.getOrElse(df.sparkSession.emptyDataset(RowEncoder(df.schema)))
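Usage sketch (the query strings are hypothetical): Spark analyzes the query eagerly when sql is called, so a reference to a missing column raises an AnalysisException inside the Try and you simply get back an empty DataFrame with df's schema:
val ok = runQuery(df, "SELECT name, age FROM myDataFrame WHERE age > 30")
val bad = runQuery(df, "SELECT no_such_column FROM myDataFrame") // empty DataFrame, no exception thrown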
You can check whether all the required columns are present in the DataFrame, without triggering a Spark job:
def runQuery(df: DataFrame, queryString: String): DataFrame =
  if (Array("name", "age", "address").forall(df.columns.contains)) {
    df.createOrReplaceTempView("myDataFrame")
    df.sparkSession.sql(queryString)
  } else {
    df.sparkSession.emptyDataset(RowEncoder(df.schema))
  }
You can use df.schema to match data types as well.
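For example, a minimal sketch of such a type check (the expected column types here are assumptions for illustration):
import org.apache.spark.sql.types.{IntegerType, StringType}

// Expected name -> type pairs (hypothetical); extend as needed
val expectedTypes = Map("name" -> StringType, "age" -> IntegerType)

val typesMatch = expectedTypes.forall { case (colName, dataType) =>
  df.schema.find(_.name == colName).exists(_.dataType == dataType)
}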

What is the similar alternative to reduceByKey in DataFrames

Given the following code:
case class Contact(name: String, phone: String)
case class Person(name: String, ts:Long, contacts: Seq[Contact])
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
val people = sqlContext.read.format("orc").load("people")
What is the best way to dedupe users by timestamp, so that only the user with the max ts stays in the collection?
In Spark using RDDs I would run something like this
rdd.reduceByKey(_ maxTS _)
and would add the maxTS method to Person or add implicits ...
def maxTS(that: Person): Person =
  that.ts > ts match {
    case true => that
    case false => this
  }
Is it possible to do the same with DataFrames? And will the performance be similar?
We are using Spark 1.6.
You can use Window functions, I'm assuming that the key is name:
import org.apache.spark.sql.functions.{rowNumber, max, broadcast}
import org.apache.spark.sql.expressions.Window
val df = // convert to DataFrame
val win = Window.partitionBy('name).orderBy('ts.desc)
df.withColumn("personRank", rowNumber.over(win))
.where('personRank === 1).drop("personRank")
For each person it will create personRank: every row with a given name gets a unique number, and the row with the latest ts gets the lowest rank, equal to 1. Then you drop the temporary rank column.
You can do a groupBy and use your preferred aggregation method like max, sum etc.
df.groupBy($"name").agg(max($"ts").alias("maxTS"))