I have a status dataset like below:
I want to select all the rows from this dataset which have "FAILURE" in any of these 5 status columns.
So, I want the result to contain only IDs 1,2,4 as they have FAILURE in one of the Status columns.
I guess in SQL we can do something like below:
SELECT * FROM status WHERE 'FAILURE' IN (Status1, Status2, Status3, Status4, Status5);
In Spark, I know I can do a filter by comparing each Status column with "FAILURE":
status.filter(s => s.Status1.equals("FAILURE") || s.Status2.equals("FAILURE") ... and so on)
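To make that concrete (assuming, just for illustration, that the data is a Dataset of a case class along these lines):
case class StatusRow(id: Int, Status1: String, Status2: String, Status3: String, Status4: String, Status5: String)

// keep a row if any of the five columns holds "FAILURE"
val failed = status.filter(s =>
  s.Status1 == "FAILURE" || s.Status2 == "FAILURE" || s.Status3 == "FAILURE" ||
  s.Status4 == "FAILURE" || s.Status5 == "FAILURE")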
But I would like to know if there is a smarter way of doing this in Spark SQL.
Thanks in advance!
In case there are many columns to be examined, consider a recursive function that short-circuits upon the first match, as shown below:
import spark.implicits._

val df = Seq(
(1, "T", "F", "T", "F"),
(2, "T", "T", "T", "T"),
(3, "T", "T", "F", "T")
).toDF("id", "c1", "c2", "c3", "c4")
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}
// true only if none of the given columns equals elem; stops at the first match
def checkFor(elem: Column, cols: List[Column]): Column = cols match {
  case Nil => lit(true)
  case h :: tail => when(h === elem, lit(false)).otherwise(checkFor(elem, tail))
}
val cols = df.columns.filter(_.startsWith("c")).map(col).toList
df.where(checkFor(lit("F"), cols)).show
// +---+---+---+---+---+
// | id| c1| c2| c3| c4|
// +---+---+---+---+---+
// | 2| T| T| T| T|
// +---+---+---+---+---+
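To keep the rows that do contain the value instead (your FAILURE case), just negate the predicate. A sketch, assuming your DataFrame is called statusDf and its columns start with "Status":
// keep rows where at least one Status column equals "FAILURE"
val statusCols = statusDf.columns.filter(_.startsWith("Status")).map(col).toList
statusDf.where(!checkFor(lit("FAILURE"), statusCols)).show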
Here is a similar example you can adapt: add a count column, then filter on the new column (a sketch of that adaptation follows the example). Here I check for zeroes in all columns except the first:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = sc.parallelize(Seq(
("r1", 0.0, 0.0, 0.0, 0.0),
("r2", 6.4, 4.9, 6.3, 7.1),
("r3", 4.2, 0.0, 7.2, 8.4),
("r4", 1.0, 2.0, 0.0, 0.0)
)).toDF("ID", "a", "b", "c", "d")
val count_some_val = df.columns.tail.map(x => when(col(x) === 0.0, 1).otherwise(0)).reduce(_ + _)
val df2 = df.withColumn("some_val_count", count_some_val)
df2.filter(col("some_val_count") > 0).show(false)
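The adaptation mentioned above is just a different condition; a sketch against your Status columns (the DataFrame name statusDf is assumed):
val failure_count = statusDf.columns.filter(_.startsWith("Status"))
  .map(c => when(col(c) === "FAILURE", 1).otherwise(0))
  .reduce(_ + _)
statusDf.withColumn("failure_count", failure_count).filter(col("failure_count") > 0).show(false)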
As far as I know it is not easy to stop at the first match with column expressions, but I remember someone smarter than me showing me this approach with a lazy exists, which I believe does stop at the first match it encounters. A different approach, then, but one I like:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = sc.parallelize(Seq(
("r1", 0.0, 0.0, 0.0, 0.0),
("r2", 6.0, 4.9, 6.3, 7.1),
("r3", 4.2, 0.0, 7.2, 8.4),
("r4", 1.0, 2.0, 0.0, 0.0)
)).toDF("ID", "a", "b", "c", "d")
df.map { r => (r.getString(0), r.toSeq.tail.exists(c => c.asInstanceOf[Double] == 0)) }
  .toDF("ID", "ones")
  .show()
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> import spark.implicits._
import spark.implicits._
scala> val df = Seq(
| ("Prop1", "SUCCESS", "SUCCESS", "SUCCESS", "FAILURE" ,"SUCCESS"),
| ("Prop2", "SUCCESS", "FAILURE", "SUCCESS", "FAILURE", "SUCCESS"),
| ("Prop3", "SUCCESS", "SUCCESS", "SUCCESS", "SUCCESS", "SUCCESS" ),
| ("Prop4", "SUCCESS", "FAILURE", "SUCCESS", "FAILURE", "SUCCESS"),
| ("Prop5", "SUCCESS", "SUCCESS", "SUCCESS", "SUCCESS","SUCCESS")
| ).toDF("Name", "Status1", "Status2", "Status3", "Status4","Status5")
df: org.apache.spark.sql.DataFrame = [Name: string, Status1: string ... 4 more fields]
scala> df.show
+-----+-------+-------+-------+-------+-------+
| Name|Status1|Status2|Status3|Status4|Status5|
+-----+-------+-------+-------+-------+-------+
|Prop1|SUCCESS|SUCCESS|SUCCESS|FAILURE|SUCCESS|
|Prop2|SUCCESS|FAILURE|SUCCESS|FAILURE|SUCCESS|
|Prop3|SUCCESS|SUCCESS|SUCCESS|SUCCESS|SUCCESS|
|Prop4|SUCCESS|FAILURE|SUCCESS|FAILURE|SUCCESS|
|Prop5|SUCCESS|SUCCESS|SUCCESS|SUCCESS|SUCCESS|
+-----+-------+-------+-------+-------+-------+
scala> df.where($"Name".isin("Prop1","Prop4") and $"Status1".isin("SUCCESS","FAILURE")).show
+-----+-------+-------+-------+-------+-------+
| Name|Status1|Status2|Status3|Status4|Status5|
+-----+-------+-------+-------+-------+-------+
|Prop1|SUCCESS|SUCCESS|SUCCESS|FAILURE|SUCCESS|
|Prop4|SUCCESS|FAILURE|SUCCESS|FAILURE|SUCCESS|
+-----+-------+-------+-------+-------+-------+
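If you want the check across all Status columns without listing them in isin, you can also fold the comparisons into a single predicate. A sketch, using the same df as above:
val anyFailure = df.columns.filter(_.startsWith("Status"))
  .map(c => col(c) === "FAILURE")
  .reduce(_ || _)
df.where(anyFailure).show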
Related
I've recently been trying to apply a default function to aggregated values as they are calculated, so that I don't have to reprocess them afterwards. However, I'm getting the following error:
Caused by: java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
From the following function.
val defaultUDF: UserDefinedFunction = udf[Column, Column, Any](defaultFunction)
def defaultFunction(col: Column, value: Any): Column = {
when(col.equalTo(value), null).otherwise(col)
}
And I'm applying it in the following way:
val initialDF = Seq(
("a", "b", 1),
("a", "b", null),
("a", null, 0)
).toDF("field1", "field2", "field3")
initialDF
.groupBy("field1", "field2")
.agg(
defaultUDF(functions.count("field3"), lit(0)).as("counter") // exception thrown here
)
Am I trying to do black magic here, or is there something I'm missing?
The issue is in the implementation of your UserDefinedFunction:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
def defaultFunction(col: Column, value: Any): Column = {
when(col.equalTo(value), null).otherwise(col)
}
val defaultUDF: UserDefinedFunction = udf[Column, Column, Any](defaultFunction)
// java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
// at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789)
// at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:724)
// at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
// at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:906)
// at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:46)
// at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:723)
// at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:720)
// at org.apache.spark.sql.functions$.udf(functions.scala:3914)
// ... 65 elided
The error you're getting is basically because Spark cannot map the return type (i.e. Column) of your UserDefinedFunction defaultFunction to a Spark DataType.
Your defaultFunction has to accept and return Scala types that correspond with a Spark DataType. You can find the list of supported Scala types here: https://spark.apache.org/docs/latest/sql-reference.html#data-types
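For illustration only, a UDF that sticks to supported Scala types would be accepted; a sketch (the name defaultToNullUDF and the Long signature are assumptions, not your original function):
import org.apache.spark.sql.functions.udf

// works because Long and Option[Long] both map to Spark data types (None becomes null)
val defaultToNullUDF = udf { (value: Long, default: Long) =>
  if (value == default) None else Option(value)
}
// usage sketch: defaultToNullUDF(count("field3"), lit(0L)).as("counter")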
In any case, you don't need a UserDefinedFunction if your function takes Columns and returns a Column. For your use-case, the following code will work:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class Record(field1: String, field2: String, field3: java.lang.Integer)
val df = Seq(
Record("a", "b", 1),
Record("a", "b", null),
Record("a", null, 0)
).toDS
df.show
// +------+------+------+
// |field1|field2|field3|
// +------+------+------+
// | a| b| 1|
// | a| b| null|
// | a| null| 0|
// +------+------+------+
def defaultFunction(col: Column, value: Any): Column = {
when(col.equalTo(value), null).otherwise(col)
}
df
.groupBy("field1", "field2")
.agg(defaultFunction(count("field3"), lit(0)).as("counter"))
.show
// +------+------+-------+
// |field1|field2|counter|
// +------+------+-------+
// | a| b| 1|
// | a| null| 1|
// +------+------+-------+
How do you aggregate dynamically in Scala Spark based on data types?
For example:
SELECT ID, SUM(when DOUBLE type)
, APPEND(when STRING), MAX(when BOOLEAN)
from tbl
GROUP BY ID
You can do this by inspecting the runtime schema and matching on each column's data type, for example:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._
val df =Seq(
(1, 1.0, true, "a"),
(1, 2.0, false, "b")
).toDF("id","d","b","s")
val dataTypes: Map[String, DataType] = df.schema.map(sf => (sf.name,sf.dataType)).toMap
def genericAgg(c: String) = {
  dataTypes(c) match {
    case DoubleType  => sum(col(c))
    case StringType  => concat_ws(",", collect_list(col(c))) // "append"
    case BooleanType => max(col(c))
  }
}
val aggExprs: Seq[Column] = df.columns.filterNot(_ == "id") // all columns except the grouping key
  .map(c => genericAgg(c))
df
.groupBy("id")
.agg(
aggExprs.head,aggExprs.tail:_*
)
.show()
gives
+---+------+------+-----------------------------+
| id|sum(d)|max(b)|concat_ws(,, collect_list(s))|
+---+------+------+-----------------------------+
| 1| 3.0| true| a,b|
+---+------+------+-----------------------------+
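If the data can also contain types not listed above, the match throws a MatchError at runtime. A hedged variant adds a fallback case (the choice of first here is just an assumption; pick whatever default aggregation suits you):
def genericAggWithFallback(c: String) = dataTypes(c) match {
  case DoubleType  => sum(col(c))
  case StringType  => concat_ws(",", collect_list(col(c)))
  case BooleanType => max(col(c))
  case _           => first(col(c)) // arbitrary default for unlisted types
}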
Is there a way to get the DataFrame that results from unioning DataFrames in a loop?
This is a sample code:
var fruits = List(
"apple"
,"orange"
,"melon"
)
for (x <- fruits){
var df = Seq(("aaa","bbb",x)).toDF("aCol","bCol","name")
}
I would like to obtain something like this:
aCol | bCol | fruitsName
aaa  | bbb  | apple
aaa  | bbb  | orange
aaa  | bbb  | melon
Thanks again
You could create a sequence of DataFrames and then use reduce:
val results = fruits.
map(fruit => Seq(("aaa", "bbb", fruit)).toDF("aCol","bCol","name")).
reduce(_.union(_))
results.show()
Steffen Schmitz's answer is the most concise one, I believe.
Below is a more detailed answer if you are looking for more customization (of field types, etc.):
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
//initialize DF
val schema = StructType(
StructField("aCol", StringType, true) ::
StructField("bCol", StringType, true) ::
StructField("name", StringType, true) :: Nil)
var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)
//list to iterate through
var fruits = List(
"apple"
,"orange"
,"melon"
)
for (x <- fruits) {
//union returns a new dataset
initialDF = initialDF.union(Seq(("aaa", "bbb", x)).toDF)
}
//initialDF.show()
references:
How to create an empty DataFrame with a specified schema?
https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/Dataset.html
https://docs.databricks.com/spark/latest/faq/append-a-row-to-rdd-or-dataframe.html
If you already have multiple DataFrames, you can use the code below, which is efficient:
val newDFs = Seq(DF1,DF2,DF3)
newDFs.reduce(_ union _)
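One caveat: reduce throws on an empty sequence, so if the list of DataFrames might be empty (an assumption about your input), guard it, for example:
val maybeUnion = newDFs.reduceOption(_ union _) // Option[DataFrame]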
In a for loop:
val fruits = List("apple", "orange", "melon")
( for(f <- fruits) yield ("aaa", "bbb", f) ).toDF("aCol", "bCol", "name")
You can first create a sequence and then use toDF to create a DataFrame.
scala> var dseq : Seq[(String,String,String)] = Seq[(String,String,String)]()
dseq: Seq[(String, String, String)] = List()
scala> for ( x <- fruits){
| dseq = dseq :+ ("aaa","bbb",x)
| }
scala> dseq
res2: Seq[(String, String, String)] = List((aaa,bbb,apple), (aaa,bbb,orange), (aaa,bbb,melon))
scala> val df = dseq.toDF("aCol","bCol","name")
df: org.apache.spark.sql.DataFrame = [aCol: string, bCol: string, name: string]
scala> df.show
+----+----+------+
|aCol|bCol| name|
+----+----+------+
| aaa| bbb| apple|
| aaa| bbb|orange|
| aaa| bbb| melon|
+----+----+------+
Well... I think your question is a bit misguided.
As per my limited understanding of what you are trying to do, you should be doing the following:
val fruits = List(
"apple",
"orange",
"melon"
)
val df = fruits
.map(x => ("aaa", "bbb", x))
.toDF("aCol", "bCol", "name")
And this should be sufficient.
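For reference, df.show() should then print the same rows as the desired output above:
+----+----+------+
|aCol|bCol|  name|
+----+----+------+
| aaa| bbb| apple|
| aaa| bbb|orange|
| aaa| bbb| melon|
+----+----+------+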
I'm trying to create a simple DataFrame as follows:
import sqlContext.implicits._
val lookup = Array("one", "two", "three", "four", "five")
val theRow = Array("1",Array(1,2,3), Array(0.1,0.4,0.5))
val theRdd = sc.makeRDD(theRow)
case class X(id: String, indices: Array[Integer], weights: Array[Float] )
val df = theRdd.map{
case Array(s0,s1,s2) => X(s0.asInstanceOf[String],s1.asInstanceOf[Array[Integer]],s2.asInstanceOf[Array[Float]])
}.toDF()
df.show()
df is defined as
df: org.apache.spark.sql.DataFrame = [id: string, indices: array<int>, weights: array<float>]
which is what I want.
Upon executing, I get
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 13.0 failed 1 times, most recent failure: Lost task 1.0 in stage 13.0 (TID 50, localhost): scala.MatchError: 1 (of class java.lang.String)
Where is this MatchError coming from? And, is there a simpler way to create sample DataFrames programmatically?
First, theRow should be a Row and not an Array. Then, if you modify your types so that the compatibility between Java and Scala is respected, your example will work:
import org.apache.spark.sql.Row

val theRow = Row("1", Array[java.lang.Integer](1, 2, 3), Array[Double](0.1, 0.4, 0.5))
val theRdd = sc.makeRDD(Array(theRow))
case class X(id: String, indices: Array[Integer], weights: Array[Double] )
val df = theRdd.map {
  case Row(s0, s1, s2) => X(s0.asInstanceOf[String], s1.asInstanceOf[Array[Integer]], s2.asInstanceOf[Array[Double]])
}.toDF()
df.show()
//+---+---------+---------------+
//| id| indices| weights|
//+---+---------+---------------+
//| 1|[1, 2, 3]|[0.1, 0.4, 0.5]|
//+---+---------+---------------+
Here is another example you can refer to:
import spark.implicits._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val columns=Array("id", "first", "last", "year")
val df1=sc.parallelize(Seq(
(1, "John", "Doe", 1986),
(2, "Ive", "Fish", 1990),
(4, "John", "Wayne", 1995)
)).toDF(columns: _*)
val df2=sc.parallelize(Seq(
(1, "John", "Doe", 1986),
(2, "IveNew", "Fish", 1990),
(3, "San", "Simon", 1974)
)).toDF(columns: _*)
Using Spark 1.5 and Scala 2.10.6
I'm trying to filter a dataframe via a field "tags" that is an array of strings. Looking for all rows that have the tag 'private'.
val report = df.select("*")
.where(df("tags").contains("private"))
getting:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve 'Contains(tags, private)' due to data type mismatch:
argument 1 requires string type, however, 'tags' is of array
type.;
Is the filter method better suited?
UPDATED:
The data is coming from the Cassandra adapter, but here is a minimal example that shows what I'm trying to do and that also produces the above error:
def testData (sc: SparkContext): DataFrame = {
val stringRDD = sc.parallelize(Seq("""
{ "name": "ed",
"tags": ["red", "private"]
}""",
"""{ "name": "fred",
"tags": ["public", "blue"]
}""")
)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
sqlContext.read.json(stringRDD)
}
def run(sc: SparkContext) {
val df1 = testData(sc)
df1.show()
val report = df1.select("*")
.where(df1("tags").contains("private"))
report.show()
}
UPDATED: the tags array can be any length and the 'private' tag can be in any position
UPDATED: one solution that works: UDF
import scala.collection.mutable
import org.apache.spark.sql.functions.udf

val filterPriv = udf { (tags: mutable.WrappedArray[String]) => tags.contains("private") }
val report = df1.filter(filterPriv(df1("tags")))
I think if you use where(array_contains(...)) it will work. Here's my result:
scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext
scala> import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrame
scala> def testData (sc: SparkContext): DataFrame = {
| val stringRDD = sc.parallelize(Seq
| ("""{ "name": "ned", "tags": ["blue", "big", "private"] }""",
| """{ "name": "albert", "tags": ["private", "lumpy"] }""",
| """{ "name": "zed", "tags": ["big", "private", "square"] }""",
| """{ "name": "jed", "tags": ["green", "small", "round"] }""",
| """{ "name": "ed", "tags": ["red", "private"] }""",
| """{ "name": "fred", "tags": ["public", "blue"] }"""))
| val sqlContext = new org.apache.spark.sql.SQLContext(sc)
| import sqlContext.implicits._
| sqlContext.read.json(stringRDD)
| }
testData: (sc: org.apache.spark.SparkContext)org.apache.spark.sql.DataFrame
scala>
| val df = testData (sc)
df: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]
scala> val report = df.select ("*").where (array_contains (df("tags"), "private"))
report: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]
scala> report.show
+------+--------------------+
| name| tags|
+------+--------------------+
| ned|[blue, big, private]|
|albert| [private, lumpy]|
| zed|[big, private, sq...|
| ed| [red, private]|
+------+--------------------+
Note that it works if you write where(array_contains(df("tags"), "private")), but if you write where(df("tags").array_contains("private")) (more directly analogous to what you wrote originally) it fails with array_contains is not a member of org.apache.spark.sql.Column. Looking at the source code for Column, I see there's some stuff to handle contains (constructing a Contains instance for that) but not array_contains. Maybe that's an oversight.
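If you prefer the SQL-expression form, the same check can be written with expr (a sketch; as far as I know array_contains is also available as a SQL function in 1.5):
import org.apache.spark.sql.functions.expr
val report2 = df.where(expr("array_contains(tags, 'private')"))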
You can use an ordinal to refer to elements of the JSON array, e.g. in your case df("tags")(0). Here is a working sample:
scala> val stringRDD = sc.parallelize(Seq("""
| { "name": "ed",
| "tags": ["private"]
| }""",
| """{ "name": "fred",
| "tags": ["public"]
| }""")
| )
stringRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[87] at parallelize at <console>:22
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> sqlContext.read.json(stringRDD)
res28: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]
scala> val df=sqlContext.read.json(stringRDD)
df: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]
scala> df.columns
res29: Array[String] = Array(name, tags)
scala> df.dtypes
res30: Array[(String, String)] = Array((name,StringType), (tags,ArrayType(StringType,true)))
scala> val report = df.select("*").where(df("tags")(0).contains("private"))
report: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]
scala> report.show
+----+-------------+
|name| tags|
+----+-------------+
| ed|List(private)|
+----+-------------+
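Note that the ordinal form df("tags")(0) only inspects the first element of the array, so if "private" can appear at any position (as per the update above), the array_contains approach from the earlier answer is the safer choice, e.g.:
import org.apache.spark.sql.functions.array_contains
df.where(array_contains(df("tags"), "private")).show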