How to create a sample DataFrame in Scala / Spark

I'm trying to create a simple DataFrame as follows:
import sqlContext.implicits._
val lookup = Array("one", "two", "three", "four", "five")
val theRow = Array("1",Array(1,2,3), Array(0.1,0.4,0.5))
val theRdd = sc.makeRDD(theRow)
case class X(id: String, indices: Array[Integer], weights: Array[Float] )
val df = theRdd.map {
  case Array(s0, s1, s2) => X(s0.asInstanceOf[String], s1.asInstanceOf[Array[Integer]], s2.asInstanceOf[Array[Float]])
}.toDF()
df.show()
df is defined as
df: org.apache.spark.sql.DataFrame = [id: string, indices: array<int>, weights: array<float>]
which is what I want.
Upon executing, I get
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 13.0 failed 1 times, most recent failure: Lost task 1.0 in stage 13.0 (TID 50, localhost): scala.MatchError: 1 (of class java.lang.String)
Where is this MatchError coming from? And, is there a simpler way to create sample DataFrames programmatically?

First, theRow should be a Row rather than an Array. If you also adjust the types so that Java/Scala compatibility is respected, your example works:
val theRow =Row("1",Array[java.lang.Integer](1,2,3), Array[Double](0.1,0.4,0.5))
val theRdd = sc.makeRDD(Array(theRow))
case class X(id: String, indices: Array[Integer], weights: Array[Double] )
val df=theRdd.map{
case Row(s0,s1,s2)=>X(s0.asInstanceOf[String],s1.asInstanceOf[Array[Integer]],s2.asInstanceOf[Array[Double]])
}.toDF()
df.show()
//+---+---------+---------------+
//| id| indices| weights|
//+---+---------+---------------+
//| 1|[1, 2, 3]|[0.1, 0.4, 0.5]|
//+---+---------+---------------+
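As for the second part of the question, sample DataFrames can usually be built without an RDD or Row at all, straight from a Seq of case class instances. A minimal sketch (assuming the implicits import is in scope; the name and the plain Scala element types are illustrative):

// Minimal sketch: build a sample DataFrame directly from case class instances
case class Sample(id: String, indices: Array[Int], weights: Array[Double])
val sampleDf = Seq(Sample("1", Array(1, 2, 3), Array(0.1, 0.4, 0.5))).toDF()
sampleDf.show()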

Here is another example you can refer to:
import spark.implicits._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val columns = Array("id", "first", "last", "year")
val df1 = sc.parallelize(Seq(
  (1, "John", "Doe", 1986),
  (2, "Ive", "Fish", 1990),
  (4, "John", "Wayne", 1995)
)).toDF(columns: _*)

val df2 = sc.parallelize(Seq(
  (1, "John", "Doe", 1986),
  (2, "IveNew", "Fish", 1990),
  (3, "San", "Simon", 1974)
)).toDF(columns: _*)
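If you need full control over column types and nullability, another option is createDataFrame with an explicit schema. A sketch (assuming a SparkSession named spark is available; the column subset here is illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Explicit schema: useful when the types inferred by toDF are not what you want
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("first", StringType, nullable = true)
))
val rows = Seq(Row(1, "John"), Row(2, "Ive"))
val df3 = spark.createDataFrame(sc.parallelize(rows), schema)
df3.show()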

Related

Spark SQL - Check for a value in multiple columns

I have a status dataset with an ID column and five status columns (Status1 through Status5).
I want to select all the rows from this dataset which have "FAILURE" in any of these 5 status columns.
So, I want the result to contain only IDs 1,2,4 as they have FAILURE in one of the Status columns.
I guess in SQL we can do something like below:
SELECT * FROM status WHERE "FAILURE" IN (Status1, Status2, Status3, Status4, Status5);
In Spark, I know I can do a filter by comparing each Status column with "FAILURE":
status.filter(s => {s.Status1.equals(FAILURE) || s.Status2.equals(FAILURE) ... and so on..})
But I would like to know if there is a smarter way of doing this in Spark SQL.
Thanks in advance!
In case there are many columns to be examined, consider a recursive function that short-circuits upon the first match, as shown below:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}

val df = Seq(
  (1, "T", "F", "T", "F"),
  (2, "T", "T", "T", "T"),
  (3, "T", "T", "F", "T")
).toDF("id", "c1", "c2", "c3", "c4")

// Returns true only if none of the given columns equals elem,
// short-circuiting at the first column that does match
def checkFor(elem: Column, cols: List[Column]): Column = cols match {
  case Nil =>
    lit(true)
  case h :: tail =>
    when(h === elem, lit(false)).otherwise(checkFor(elem, tail))
}

val cols = df.columns.filter(_.startsWith("c")).map(col).toList
df.where(checkFor(lit("F"), cols)).show
// +---+---+---+---+---+
// | id| c1| c2| c3| c4|
// +---+---+---+---+---+
// | 2| T| T| T| T|
// +---+---+---+---+---+
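For the original FAILURE use case, a non-recursive sketch is to fold an OR condition over the status columns (assuming statusDf stands for the questioner's DataFrame with Status1..Status5):

import org.apache.spark.sql.functions.col

// Build one boolean expression: Status1 === "FAILURE" OR Status2 === "FAILURE" OR ...
val statusCols = statusDf.columns.filter(_.startsWith("Status")).map(col)
val anyFailure = statusCols.map(_ === "FAILURE").reduce(_ || _)
statusDf.where(anyFailure).show()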
Here is a similar example that you can modify to filter on the newly added column; I leave that to you. This one counts zeroes in every column except the first:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = sc.parallelize(Seq(
  ("r1", 0.0, 0.0, 0.0, 0.0),
  ("r2", 6.4, 4.9, 6.3, 7.1),
  ("r3", 4.2, 0.0, 7.2, 8.4),
  ("r4", 1.0, 2.0, 0.0, 0.0)
)).toDF("ID", "a", "b", "c", "d")

// For every column except ID, emit 1 when the value is 0.0, then sum the flags
val count_some_val = df.columns.tail.map(x => when(col(x) === 0.0, 1).otherwise(0)).reduce(_ + _)
val df2 = df.withColumn("some_val_count", count_some_val)
df2.filter(col("some_val_count") > 0).show(false)
As far as I know it is not easy to stop at the first match with that approach, but I remember being shown a version using a lazy exists, which I believe does stop at the first match. It takes a different approach, but one that I like:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = sc.parallelize(Seq(
  ("r1", 0.0, 0.0, 0.0, 0.0),
  ("r2", 6.0, 4.9, 6.3, 7.1),
  ("r3", 4.2, 0.0, 7.2, 8.4),
  ("r4", 1.0, 2.0, 0.0, 0.0)
)).toDF("ID", "a", "b", "c", "d")

df.map { r => (r.getString(0), r.toSeq.tail.exists(c => c.asInstanceOf[Double] == 0)) }
  .toDF("ID", "ones")
  .show()
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> import spark.implicits._
import spark.implicits._
scala> val df = Seq(
| ("Prop1", "SUCCESS", "SUCCESS", "SUCCESS", "FAILURE" ,"SUCCESS"),
| ("Prop2", "SUCCESS", "FAILURE", "SUCCESS", "FAILURE", "SUCCESS"),
| ("Prop3", "SUCCESS", "SUCCESS", "SUCCESS", "SUCCESS", "SUCCESS" ),
| ("Prop4", "SUCCESS", "FAILURE", "SUCCESS", "FAILURE", "SUCCESS"),
| ("Prop5", "SUCCESS", "SUCCESS", "SUCCESS", "SUCCESS","SUCCESS")
| ).toDF("Name", "Status1", "Status2", "Status3", "Status4","Status5")
df: org.apache.spark.sql.DataFrame = [Name: string, Status1: string ... 4 more fields]
scala> df.show
+-----+-------+-------+-------+-------+-------+
| Name|Status1|Status2|Status3|Status4|Status5|
+-----+-------+-------+-------+-------+-------+
|Prop1|SUCCESS|SUCCESS|SUCCESS|FAILURE|SUCCESS|
|Prop2|SUCCESS|FAILURE|SUCCESS|FAILURE|SUCCESS|
|Prop3|SUCCESS|SUCCESS|SUCCESS|SUCCESS|SUCCESS|
|Prop4|SUCCESS|FAILURE|SUCCESS|FAILURE|SUCCESS|
|Prop5|SUCCESS|SUCCESS|SUCCESS|SUCCESS|SUCCESS|
+-----+-------+-------+-------+-------+-------+
scala> df.where($"Name".isin("Prop1","Prop4") and $"Status1".isin("SUCCESS","FAILURE")).show
+-----+-------+-------+-------+-------+-------+
| Name|Status1|Status2|Status3|Status4|Status5|
+-----+-------+-------+-------+-------+-------+
|Prop1|SUCCESS|SUCCESS|SUCCESS|FAILURE|SUCCESS|
|Prop4|SUCCESS|FAILURE|SUCCESS|FAILURE|SUCCESS|
+-----+-------+-------+-------+-------+-------+
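If you prefer the SQL form from the question, a sketch on the same Status1..Status5 frame is to pass the IN predicate through expr (or register the frame as a temp view and run the SQL directly):

import org.apache.spark.sql.functions.expr

// Same predicate as the SQL in the question, evaluated row by row
df.where(expr("'FAILURE' IN (Status1, Status2, Status3, Status4, Status5)")).show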

How to convert a Dataset[Seq[T]] to Dataset[T] in Spark

How do I convert a Dataset[Seq[T]] to Dataset[T]?
For example, Dataset[Seq[Car]] to Dataset[Car].
You can do flatMap:
val df = Seq(Seq(1, 2, 3), Seq(4, 5, 6, 7)).toDF("s").as[Seq[Int]]
df.flatMap(x => x.toList)
You can also try the explode function:
df.select(explode('s)).select("col.*").as[Car]
Full example:
import org.apache.spark.sql.functions._

case class Car(i: Int)
val df = Seq(List(Car(1), Car(2), Car(3))).toDF("s").as[List[Car]]
val df1 = df.flatMap(x => x.toList)
val df2 = df.select(explode('s)).select("col.*").as[Car]
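As a minimal typed sketch of the same idea (assuming spark.implicits._ is in scope; the names here are illustrative):

import org.apache.spark.sql.Dataset

case class Car(i: Int)
val nested: Dataset[Seq[Car]] = Seq(Seq(Car(1), Car(2)), Seq(Car(3))).toDS()
val flat: Dataset[Car] = nested.flatMap(identity)  // each inner Seq is flattened into individual Car rows
flat.show()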

Spark Scala Dataframe convert a column of Array of Struct to a column of Map

I am new to Scala.
I have a Dataframe with fields
ID:string, Time:timestamp, Items:array(struct(name:string,ranking:long))
I want to convert each row of the Items field to a hashmap, with the name as the key.
I am not very sure how to do this.
This can be done using a UDF:
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

// Sample data:
val df = Seq(
  ("id1", "t1", Array(("n1", 4L), ("n2", 5L))),
  ("id2", "t2", Array(("n3", 6L), ("n4", 7L)))
).toDF("ID", "Time", "Items")

// Create a UDF converting an array of (String, Long) structs to Map[String, Long]
val arrayToMap = udf[Map[String, Long], Seq[Row]] {
  array => array.map { case Row(key: String, value: Long) => (key, value) }.toMap
}

// Apply the UDF
val result = df.withColumn("Items", arrayToMap($"Items"))
result.show(false)
// +---+----+---------------------+
// |ID |Time|Items |
// +---+----+---------------------+
// |id1|t1 |Map(n1 -> 4, n2 -> 5)|
// |id2|t2 |Map(n3 -> 6, n4 -> 7)|
// +---+----+---------------------+
I can't see a way to do this without a UDF (using Spark's built-in functions only).
Since 2.4.0, one can use map_from_entries:
import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq(
  Array(("n1", 4L), ("n2", 5L)),
  Array(("n3", 6L), ("n4", 7L))
).toDF("Items")

df.select(map_from_entries($"Items")).show
/*
+-----------------------+
|map_from_entries(Items)|
+-----------------------+
| [n1 -> 4, n2 -> 5]|
| [n3 -> 6, n4 -> 7]|
+-----------------------+
*/
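To keep the original column name on a frame like the questioner's (a brief usage note, assuming df is the three-column ID/Time/Items frame from the UDF example and Spark 2.4+):

// Replace the Items array with its map form in place
val result = df.withColumn("Items", map_from_entries($"Items"))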

How do I dynamically create a UDF in Spark?

I have a DataFrame where I want to create multiple UDFs dynamically to determine if certain rows match. I am just testing one example right now. My test code looks like the following.
//create the dataframe
import spark.implicits._
val df = Seq(("t","t"), ("t", "f"), ("f", "t"), ("f", "f")).toDF("n1", "n2")
//create the scala function
def filter(v1: Seq[Any], v2: Seq[String]): Int = {
  for (i <- 0 until v1.length) {
    if (!v1(i).equals(v2(i))) {
      return 0
    }
  }
  return 1
}
//create the udf
import org.apache.spark.sql.functions.udf
val fudf = udf(filter(_: Seq[Any], _: Seq[String]))
//apply the UDF
df.withColumn("filter1", fudf(Seq($"n1"), Seq("t"))).show()
However, when I run the last line, I get the following error.
:30: error: not found: value df
df.withColumn("filter1", fudf($"n1", Seq("t"))).show()
^
:30: error: type mismatch;
found : Seq[String]
required: org.apache.spark.sql.Column
df.withColumn("filter1", fudf($"n1", Seq("t"))).show()
^
Any ideas on what I'm doing wrong? Note, I am on Scala v2.11.x and Spark 2.0.x.
On another note, if we can solve this "dynamic" UDF question, my use case would be to add many such columns to the dataframe. With test code like the following, it takes forever (it doesn't even finish; I had to Ctrl-C to break out). I'm guessing that doing a bunch of .withColumn calls in a for loop is a bad idea in Spark. If so, please let me know and I'll abandon this approach altogether.
import spark.implicits._
val df = Seq(("t","t"), ("t", "f"), ("f", "t"), ("f", "f")).toDF("n1", "n2")
import org.apache.spark.sql.functions.udf
val fudf = udf( (x: String) => if (x.equals("t")) 1 else 0)
var df2 = df
for (i <- 0 until 10000) {
  df2 = df2.withColumn("filter" + i, fudf($"n1"))
}
Enclose "t" in lit()
df.withColumn("filter1", fudf($"n1", Seq(lit("t")))).show()
Alternatively, try registering the UDF on the sqlContext; see:
Spark 2.0 UDF registration
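On the second concern, chaining thousands of withColumn calls grows the logical plan one node at a time, which is why the loop appears to hang. A sketch of a usually faster alternative (reusing the single-argument fudf and df from the loop example) is to build all the derived columns first and add them in a single select:

import org.apache.spark.sql.functions.col

// Build every derived column up front, then add them all in one select
val newCols = (0 until 10000).map(i => fudf($"n1").as("filter" + i))
val df2 = df.select(df.columns.map(col) ++ newCols: _*)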

filter spark dataframe with row field that is an array of strings

Using Spark 1.5 and Scala 2.10.6
I'm trying to filter a dataframe via a field "tags" that is an array of strings. Looking for all rows that have the tag 'private'.
val report = df.select("*")
.where(df("tags").contains("private"))
getting:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve 'Contains(tags, private)' due to data type mismatch:
argument 1 requires string type, however, 'tags' is of array
type.;
Is the filter method better suited?
UPDATED:
The data is coming from the Cassandra adapter, but here is a minimal example that shows what I'm trying to do and also produces the above error:
def testData(sc: SparkContext): DataFrame = {
  val stringRDD = sc.parallelize(Seq(
    """{ "name": "ed",
        "tags": ["red", "private"]
      }""",
    """{ "name": "fred",
        "tags": ["public", "blue"]
      }"""))
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._
  sqlContext.read.json(stringRDD)
}

def run(sc: SparkContext) {
  val df1 = testData(sc)
  df1.show()
  val report = df1.select("*")
    .where(df1("tags").contains("private"))
  report.show()
}
UPDATED: the tags array can be any length and the 'private' tag can be in any position
UPDATED: one solution that works: UDF
import scala.collection.mutable
import org.apache.spark.sql.functions.udf

val filterPriv = udf { (tags: mutable.WrappedArray[String]) => tags.contains("private") }
val report = df1.filter(filterPriv(df1("tags")))
I think if you use where(array_contains(...)) it will work (array_contains comes from org.apache.spark.sql.functions, which needs to be imported). Here's my result:
scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext
scala> import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrame
scala> def testData (sc: SparkContext): DataFrame = {
| val stringRDD = sc.parallelize(Seq
| ("""{ "name": "ned", "tags": ["blue", "big", "private"] }""",
| """{ "name": "albert", "tags": ["private", "lumpy"] }""",
| """{ "name": "zed", "tags": ["big", "private", "square"] }""",
| """{ "name": "jed", "tags": ["green", "small", "round"] }""",
| """{ "name": "ed", "tags": ["red", "private"] }""",
| """{ "name": "fred", "tags": ["public", "blue"] }"""))
| val sqlContext = new org.apache.spark.sql.SQLContext(sc)
| import sqlContext.implicits._
| sqlContext.read.json(stringRDD)
| }
testData: (sc: org.apache.spark.SparkContext)org.apache.spark.sql.DataFrame
scala>
| val df = testData (sc)
df: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]
scala> val report = df.select ("*").where (array_contains (df("tags"), "private"))
report: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]
scala> report.show
+------+--------------------+
| name| tags|
+------+--------------------+
| ned|[blue, big, private]|
|albert| [private, lumpy]|
| zed|[big, private, sq...|
| ed| [red, private]|
+------+--------------------+
Note that it works if you write where(array_contains(df("tags"), "private")), but if you write where(df("tags").array_contains("private")) (more directly analogous to what you wrote originally) it fails with array_contains is not a member of org.apache.spark.sql.Column. Looking at the source code for Column, I see there's some stuff to handle contains (constructing a Contains instance for that) but not array_contains. Maybe that's an oversight.
You can use an ordinal to refer to elements of the JSON array, e.g. df("tags")(0) in your case. Here is a working sample:
scala> val stringRDD = sc.parallelize(Seq("""
| { "name": "ed",
| "tags": ["private"]
| }""",
| """{ "name": "fred",
| "tags": ["public"]
| }""")
| )
stringRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[87] at parallelize at <console>:22
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> sqlContext.read.json(stringRDD)
res28: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]
scala> val df=sqlContext.read.json(stringRDD)
df: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]
scala> df.columns
res29: Array[String] = Array(name, tags)
scala> df.dtypes
res30: Array[(String, String)] = Array((name,StringType), (tags,ArrayType(StringType,true)))
scala> val report = df.select("*").where(df("tags")(0).contains("private"))
report: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]
scala> report.show
+----+-------------+
|name| tags|
+----+-------------+
| ed|List(private)|
+----+-------------+
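Note that the ordinal form above only inspects the first element, and the question states that 'private' can appear at any position. A sketch covering the general case (Spark 1.5+, same df as in this sample) is the array_contains form from the other answer:

import org.apache.spark.sql.functions.array_contains

// Matches 'private' anywhere in the tags array, not just at index 0
val report = df.where(array_contains(df("tags"), "private"))
report.show()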