I'm trying to create a simple DataFrame as follows:
import sqlContext.implicits._
val lookup = Array("one", "two", "three", "four", "five")
val theRow = Array("1",Array(1,2,3), Array(0.1,0.4,0.5))
val theRdd = sc.makeRDD(theRow)
case class X(id: String, indices: Array[Integer], weights: Array[Float] )
val df = theRdd.map {
  case Array(s0, s1, s2) => X(s0.asInstanceOf[String], s1.asInstanceOf[Array[Integer]], s2.asInstanceOf[Array[Float]])
}.toDF()
df.show()
df is defined as
df: org.apache.spark.sql.DataFrame = [id: string, indices: array<int>, weights: array<float>]
which is what I want.
Upon executing, I get
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 13.0 failed 1 times, most recent failure: Lost task 1.0 in stage 13.0 (TID 50, localhost): scala.MatchError: 1 (of class java.lang.String)
Where is this MatchError coming from? And, is there a simpler way to create sample DataFrames programmatically?
First, theRow should be a Row and not an Array. Then, if you adjust your types so that Java/Scala compatibility is respected, your example will work:
import org.apache.spark.sql.Row

val theRow = Row("1", Array[java.lang.Integer](1, 2, 3), Array[Double](0.1, 0.4, 0.5))
val theRdd = sc.makeRDD(Array(theRow))
case class X(id: String, indices: Array[Integer], weights: Array[Double])
val df = theRdd.map {
  case Row(s0, s1, s2) => X(s0.asInstanceOf[String], s1.asInstanceOf[Array[Integer]], s2.asInstanceOf[Array[Double]])
}.toDF()
df.show()
//+---+---------+---------------+
//| id|  indices|        weights|
//+---+---------+---------------+
//|  1|[1, 2, 3]|[0.1, 0.4, 0.5]|
//+---+---------+---------------+
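As for the second part of the question, a simpler way to build a small sample DataFrame programmatically (a minimal sketch, assuming a Spark 2.x session whose implicits are in scope) is to skip the RDD entirely and call toDF on a Seq of tuples:
import spark.implicits._
// schema becomes: id string, indices array<int>, weights array<double>
val df = Seq(
  ("1", Array(1, 2, 3), Array(0.1, 0.4, 0.5))
).toDF("id", "indices", "weights")
df.show()
//+---+---------+---------------+
//| id|  indices|        weights|
//+---+---------+---------------+
//|  1|[1, 2, 3]|[0.1, 0.4, 0.5]|
//+---+---------+---------------+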
Here is another example you can refer to:
import spark.implicits._
val sqlContext = new org.apache.spark.sql.SQLContext(sc) // not strictly needed once spark.implicits._ is in scope
val columns = Array("id", "first", "last", "year")
val df1 = sc.parallelize(Seq(
  (1, "John", "Doe", 1986),
  (2, "Ive", "Fish", 1990),
  (4, "John", "Wayne", 1995)
)).toDF(columns: _*)
val df2 = sc.parallelize(Seq(
  (1, "John", "Doe", 1986),
  (2, "IveNew", "Fish", 1990),
  (3, "San", "Simon", 1974)
)).toDF(columns: _*)
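The snippet stops after creating df1 and df2; as a usage sketch (purely illustrative, since the intended use isn't stated), the two DataFrames can then be compared directly:
// rows of df1 that do not appear in df2
df1.except(df2).show()
// all rows from both DataFrames
df1.union(df2).show()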
Related
I have a status dataset with an ID column and five status columns (Status1 through Status5).
I want to select all the rows from this dataset which have "FAILURE" in any of these 5 status columns.
So, I want the result to contain only IDs 1,2,4 as they have FAILURE in one of the Status columns.
I guess in SQL we can do something like below:
SELECT * FROM status WHERE "FAILURE" IN (Status1, Status2, Status3, Status4, Status5);
In Spark, I know I can filter by comparing each Status column with "FAILURE":
status.filter(s => { s.Status1.equals("FAILURE") || s.Status2.equals("FAILURE") ... and so on.. })
But I would like to know if there is a smarter way of doing this in Spark SQL.
Thanks in advance!
In case there are many columns to be examined, consider a recursive function that short-circuits upon the first match, as shown below:
val df = Seq(
(1, "T", "F", "T", "F"),
(2, "T", "T", "T", "T"),
(3, "T", "T", "F", "T")
).toDF("id", "c1", "c2", "c3", "c4")
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}

// Returns false as soon as one of the columns equals `elem`, true otherwise,
// so the where() below keeps only the rows in which no "c" column contains "F".
def checkFor(elem: Column, cols: List[Column]): Column = cols match {
  case Nil =>
    lit(true)
  case h :: tail =>
    when(h === elem, lit(false)).otherwise(checkFor(elem, tail))
}
val cols = df.columns.filter(_.startsWith("c")).map(col).toList
df.where(checkFor(lit("F"), cols)).show
// +---+---+---+---+---+
// | id| c1| c2| c3| c4|
// +---+---+---+---+---+
// | 2| T| T| T| T|
// +---+---+---+---+---+
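Note that checkFor as written keeps the rows in which none of the columns match. If, as in the original question, you want the rows that do contain "FAILURE", a minimal sketch (assuming the question's status DataFrame with columns Status1 through Status5) is to reduce a single OR predicate over the columns:
import org.apache.spark.sql.functions.col
val statusCols = (1 to 5).map(i => col(s"Status$i"))
// true if any of the five status columns equals "FAILURE"
val hasFailure = statusCols.map(_ === "FAILURE").reduce(_ || _)
status.where(hasFailure).show()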
Here is a similar example that you can adapt: it adds a count column and then filters on it, in this case counting zeroes in every column except the first:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = sc.parallelize(Seq(
  ("r1", 0.0, 0.0, 0.0, 0.0),
  ("r2", 6.4, 4.9, 6.3, 7.1),
  ("r3", 4.2, 0.0, 7.2, 8.4),
  ("r4", 1.0, 2.0, 0.0, 0.0)
)).toDF("ID", "a", "b", "c", "d")
// count how many of the value columns (all except ID) are zero in each row
val count_some_val = df.columns.tail.map(x => when(col(x) === 0.0, 1).otherwise(0)).reduce(_ + _)
val df2 = df.withColumn("some_val_count", count_some_val)
df2.filter(col("some_val_count") > 0).show(false)
As far as I know there is no easy way to stop at the first match, but I remember someone smarter than me showing me this approach with a lazy exists, which I believe does stop at the first match it encounters. It's a different approach, but one that I like:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = sc.parallelize(Seq(
  ("r1", 0.0, 0.0, 0.0, 0.0),
  ("r2", 6.0, 4.9, 6.3, 7.1),
  ("r3", 4.2, 0.0, 7.2, 8.4),
  ("r4", 1.0, 2.0, 0.0, 0.0)
)).toDF("ID", "a", "b", "c", "d")
df.map { r =>
  (r.getString(0), r.toSeq.tail.exists(c => c.asInstanceOf[Double] == 0))
}.toDF("ID", "ones")
  .show()
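Adapted to the original status question, the same lazy-exists idea might look like this (a sketch, assuming the question's status DataFrame with an ID or Name column first, followed by the five status columns, and spark.implicits._ in scope):
status.map { r =>
  // first column as the row key, then lazily check the remaining status columns
  (r.get(0).toString, r.toSeq.tail.exists(_ == "FAILURE"))
}.toDF("ID", "hasFailure")
  .filter($"hasFailure")
  .show()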
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> import spark.implicits._
import spark.implicits._
scala> val df = Seq(
| ("Prop1", "SUCCESS", "SUCCESS", "SUCCESS", "FAILURE" ,"SUCCESS"),
| ("Prop2", "SUCCESS", "FAILURE", "SUCCESS", "FAILURE", "SUCCESS"),
| ("Prop3", "SUCCESS", "SUCCESS", "SUCCESS", "SUCCESS", "SUCCESS" ),
| ("Prop4", "SUCCESS", "FAILURE", "SUCCESS", "FAILURE", "SUCCESS"),
| ("Prop5", "SUCCESS", "SUCCESS", "SUCCESS", "SUCCESS","SUCCESS")
| ).toDF("Name", "Status1", "Status2", "Status3", "Status4","Status5")
df: org.apache.spark.sql.DataFrame = [Name: string, Status1: string ... 4 more fields]
scala> df.show
+-----+-------+-------+-------+-------+-------+
| Name|Status1|Status2|Status3|Status4|Status5|
+-----+-------+-------+-------+-------+-------+
|Prop1|SUCCESS|SUCCESS|SUCCESS|FAILURE|SUCCESS|
|Prop2|SUCCESS|FAILURE|SUCCESS|FAILURE|SUCCESS|
|Prop3|SUCCESS|SUCCESS|SUCCESS|SUCCESS|SUCCESS|
|Prop4|SUCCESS|FAILURE|SUCCESS|FAILURE|SUCCESS|
|Prop5|SUCCESS|SUCCESS|SUCCESS|SUCCESS|SUCCESS|
+-----+-------+-------+-------+-------+-------+
scala> df.where($"Name".isin("Prop1","Prop4") and $"Status1".isin("SUCCESS","FAILURE")).show
+-----+-------+-------+-------+-------+-------+
| Name|Status1|Status2|Status3|Status4|Status5|
+-----+-------+-------+-------+-------+-------+
|Prop1|SUCCESS|SUCCESS|SUCCESS|FAILURE|SUCCESS|
|Prop4|SUCCESS|FAILURE|SUCCESS|FAILURE|SUCCESS|
+-----+-------+-------+-------+-------+-------+
How do I convert a Dataset[Seq[T]] to Dataset[T]?
For example, Dataset[Seq[Car]] to Dataset[Car].
You can do flatMap:
val df = Seq(Seq(1, 2, 3), Seq(4, 5, 6, 7)).toDF("s").as[Seq[Int]]
df.flatMap(x => x.toList)
You can also try the explode function:
df.select(explode('s)).select("col.*").as[Car]
Full example:
import spark.implicits._
import org.apache.spark.sql.functions._
case class Car(i: Int)
val df = Seq(List(Car(1), Car(2), Car(3))).toDF("s").as[List[Car]]
val df1 = df.flatMap(x => x.toList)
val df2 = df.select(explode('s)).select("col.*").as[Car]
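Both df1 and df2 end up as Dataset[Car] with a single column i; as a quick check, their contents would look like this (output sketched by hand, not copied from a run):
df1.show()
// +---+
// |  i|
// +---+
// |  1|
// |  2|
// |  3|
// +---+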
I am new to Scala.
I have a Dataframe with fields
ID:string, Time:timestamp, Items:array(struct(name:string,ranking:long))
I want to convert each row of the Items field to a hashmap, with the name as the key.
I am not very sure how to do this.
This can be done using a UDF:
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
// Sample data:
val df = Seq(
  ("id1", "t1", Array(("n1", 4L), ("n2", 5L))),
  ("id2", "t2", Array(("n3", 6L), ("n4", 7L)))
).toDF("ID", "Time", "Items")
// Create a UDF converting an array of (String, Long) structs to Map[String, Long]
val arrayToMap = udf[Map[String, Long], Seq[Row]] {
  array => array.map { case Row(key: String, value: Long) => (key, value) }.toMap
}
// apply UDF
val result = df.withColumn("Items", arrayToMap($"Items"))
result.show(false)
// +---+----+---------------------+
// |ID |Time|Items |
// +---+----+---------------------+
// |id1|t1 |Map(n1 -> 4, n2 -> 5)|
// |id2|t2 |Map(n3 -> 6, n4 -> 7)|
// +---+----+---------------------+
I can't see a way to do this without a UDF (using Spark's built-in functions only).
Since Spark 2.4.0, one can use map_from_entries:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
  Array(("n1", 4L), ("n2", 5L)),
  Array(("n3", 6L), ("n4", 7L))
).toDF("Items")
df.select(map_from_entries($"Items")).show
/*
+-----------------------+
|map_from_entries(Items)|
+-----------------------+
| [n1 -> 4, n2 -> 5]|
| [n3 -> 6, n4 -> 7]|
+-----------------------+
*/
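Applied to a DataFrame shaped like the one in the question (ID, Time, Items), the same built-in can replace the Items column in place; a sketch, assuming Spark 2.4+ and a hypothetical dfFull with those three columns:
// Items becomes a map<string,bigint> column; ID and Time are untouched
val result = dfFull.withColumn("Items", map_from_entries($"Items"))
result.show(false)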
I have a DataFrame where I want to create multiple UDFs dynamically to determine if certain rows match. I am just testing one example right now. My test code looks like the following.
//create the dataframe
import spark.implicits._
val df = Seq(("t","t"), ("t", "f"), ("f", "t"), ("f", "f")).toDF("n1", "n2")
//create the scala function
def filter(v1: Seq[Any], v2: Seq[String]): Int = {
  for (i <- 0 until v1.length) {
    if (!v1(i).equals(v2(i))) {
      return 0
    }
  }
  return 1
}
//create the udf
import org.apache.spark.sql.functions.udf
val fudf = udf(filter(_: Seq[Any], _: Seq[String]))
//apply the UDF
df.withColumn("filter1", fudf(Seq($"n1"), Seq("t"))).show()
However, when I run the last line, I get the following error.
:30: error: not found: value df
df.withColumn("filter1", fudf($"n1", Seq("t"))).show()
^
:30: error: type mismatch;
found : Seq[String]
required: org.apache.spark.sql.Column
df.withColumn("filter1", fudf($"n1", Seq("t"))).show()
^
Any ideas on what I'm doing wrong? Note, I am on Scala v2.11.x and Spark 2.0.x.
On another note: if we can solve this "dynamic" UDF question, my use case would be to add many such columns to the dataframe. With test code like the following, it takes forever (it doesn't even finish; I had to Ctrl-C to break out). I'm guessing that doing a bunch of .withColumn calls in a for loop is a bad idea in Spark. If so, please let me know and I'll abandon this approach altogether.
import spark.implicits._
val df = Seq(("t","t"), ("t", "f"), ("f", "t"), ("f", "f")).toDF("n1", "n2")
import org.apache.spark.sql.functions.udf
val fudf = udf( (x: String) => if (x.equals("t")) 1 else 0)
var df2 = df
for (i <- 0 until 10000) {
  df2 = df2.withColumn("filter" + i, fudf($"n1"))
}
Enclose "t" in lit()
df.withColumn("filter1", fudf($"n1", Seq(lit("t")))).show()
Try registering the UDF on the sqlContext; see: Spark 2.0 UDF registration.
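A minimal sketch of that registration approach (assuming Spark 2.0's spark.udf.register and the callUDF helper; the function name isT is just illustrative):
import org.apache.spark.sql.functions.callUDF
spark.udf.register("isT", (x: String) => if (x == "t") 1 else 0)
df.withColumn("filter1", callUDF("isT", $"n1")).show()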
Using Spark 1.5 and Scala 2.10.6
I'm trying to filter a DataFrame on a field "tags" that is an array of strings, looking for all rows that have the tag "private".
val report = df.select("*")
.where(df("tags").contains("private"))
getting:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve 'Contains(tags, private)' due to data type mismatch:
argument 1 requires string type, however, 'tags' is of array type.;
Is the filter method better suited?
UPDATED:
the data is coming from the Cassandra adapter, but here is a minimal example that shows what I'm trying to do and also produces the above error:
def testData(sc: SparkContext): DataFrame = {
  val stringRDD = sc.parallelize(Seq(
    """{ "name": "ed",
       "tags": ["red", "private"]
    }""",
    """{ "name": "fred",
       "tags": ["public", "blue"]
    }"""
  ))
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._
  sqlContext.read.json(stringRDD)
}
def run(sc: SparkContext) {
  val df1 = testData(sc)
  df1.show()
  val report = df1.select("*")
    .where(df1("tags").contains("private"))
  report.show()
}
UPDATED: the tags array can be any length and the 'private' tag can be in any position
UPDATED: one solution that works: UDF
import scala.collection.mutable
import org.apache.spark.sql.functions.udf
val filterPriv = udf { (tags: mutable.WrappedArray[String]) => tags.contains("private") }
val report = df1.filter(filterPriv(df1("tags")))
I think if you use where(array_contains(...)) it will work. Here's my result:
scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext
scala> import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrame
scala> def testData (sc: SparkContext): DataFrame = {
| val stringRDD = sc.parallelize(Seq
| ("""{ "name": "ned", "tags": ["blue", "big", "private"] }""",
| """{ "name": "albert", "tags": ["private", "lumpy"] }""",
| """{ "name": "zed", "tags": ["big", "private", "square"] }""",
| """{ "name": "jed", "tags": ["green", "small", "round"] }""",
| """{ "name": "ed", "tags": ["red", "private"] }""",
| """{ "name": "fred", "tags": ["public", "blue"] }"""))
| val sqlContext = new org.apache.spark.sql.SQLContext(sc)
| import sqlContext.implicits._
| sqlContext.read.json(stringRDD)
| }
testData: (sc: org.apache.spark.SparkContext)org.apache.spark.sql.DataFrame
scala>
| val df = testData (sc)
df: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]
scala> val report = df.select ("*").where (array_contains (df("tags"), "private"))
report: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]
scala> report.show
+------+--------------------+
| name| tags|
+------+--------------------+
| ned|[blue, big, private]|
|albert| [private, lumpy]|
| zed|[big, private, sq...|
| ed| [red, private]|
+------+--------------------+
Note that it works if you write where(array_contains(df("tags"), "private")), but if you write where(df("tags").array_contains("private")) (more directly analogous to what you wrote originally) it fails with array_contains is not a member of org.apache.spark.sql.Column. Looking at the source code for Column, I see there's some stuff to handle contains (constructing a Contains instance for that) but not array_contains. Maybe that's an oversight.
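As for the side question about filter: filter and where are aliases on DataFrame, so the same predicate works with either. A sketch, assuming org.apache.spark.sql.functions._ is imported:
val report = df.filter(array_contains(df("tags"), "private"))
report.show()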
You can use an ordinal to refer to elements of the JSON array, e.g. df("tags")(0) in your case. Here is a working sample:
scala> val stringRDD = sc.parallelize(Seq("""
| { "name": "ed",
| "tags": ["private"]
| }""",
| """{ "name": "fred",
| "tags": ["public"]
| }""")
| )
stringRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[87] at parallelize at <console>:22
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> sqlContext.read.json(stringRDD)
res28: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]
scala> val df=sqlContext.read.json(stringRDD)
df: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]
scala> df.columns
res29: Array[String] = Array(name, tags)
scala> df.dtypes
res30: Array[(String, String)] = Array((name,StringType), (tags,ArrayType(StringType,true)))
scala> val report = df.select("*").where(df("tags")(0).contains("private"))
report: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]
scala> report.show
+----+-------------+
|name| tags|
+----+-------------+
| ed|List(private)|
+----+-------------+