Spark SQL 'explode' command failing on AWS EC2 but succeeding locally - scala

I am using Spark SQL (I mention that it is Spark SQL in case that affects the syntax - I'm not familiar enough to be sure yet) and I have a table that I am trying to restructure. I have an approach that works locally, but when I run the same command on an AWS EC2 instance I get an error reporting that I have an 'unresolved operator'.
Basically I have data that looks like:
userId  someString  varA
1       "example1"  [0,2,5]
2       "example2"  [1,20,5]
and I use an 'explode' command on varA via sqlContext. When I run this locally, things return correctly, but on AWS they fail.
I can reproduce this with the following commands:
val data = List(
  ("1", "example1", Array(0,2,5)), ("2", "example2", Array(1,20,5)))
val distData = sc.parallelize(data)
val distTable = distData.toDF("userId", "someString", "varA")
distTable.registerTempTable("distTable_tmp")
val temp1 = sqlContext.sql("select userId, someString, varA from distTable_tmp")
val temp2 = sqlContext.sql(
  "select userId, someString, explode(varA) as varA from distTable_tmp")
Locally, temp1.show() and temp2.show() return what I'd expect, namely:
scala> temp1.show()
+------+----------+----------+
|userId|someString|      varA|
+------+----------+----------+
|     1|  example1| [0, 2, 5]|
|     2|  example2|[1, 20, 5]|
+------+----------+----------+
scala> temp2.show()
+------+----------+----+
|userId|someString|varA|
+------+----------+----+
|     1|  example1|   0|
|     1|  example1|   2|
|     1|  example1|   5|
|     2|  example2|   1|
|     2|  example2|  20|
|     2|  example2|   5|
+------+----------+----+
On AWS, the temp1 sqlContext command works fine, but temp2 fails with the message:
scala> val temp2 = sqlContext.sql("select userId, someString, explode(varA) as varA from distTable_tmp")
15/11/05 22:46:49 INFO parse.ParseDriver: Parsing command: select userId, someString, explode(varA) as varA from distTable_tmp
15/11/05 22:46:49 INFO parse.ParseDriver: Parse Completed
org.apache.spark.sql.AnalysisException: unresolved operator 'Project [userId#3,someString#4,HiveGenericUdtf#org.apache.hadoop.hive.ql.udf.generic.GenericUDTFExplode(varA#5) AS varA#6];
...
Many thanks.

The source of the problem is the Spark version you use on EC2. The explode function was introduced in Spark 1.4, so it cannot work on 1.3.1. As a workaround, it is possible to use RDD and flatMap like this:
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
val rows: RDD[Row] = distTable.rdd.flatMap(
  row => row.getAs[Seq[Int]](2).map(v => Row.fromSeq(row.toSeq :+ v)))
val newSchema = StructType(
  distTable.schema.fields :+ StructField("varA_exploded", IntegerType, true))
sqlContext.createDataFrame(rows, newSchema).show
// userId someString varA varA_exploded
// 1 example1 ArrayBuffer(0, 2, 5) 0
// 1 example1 ArrayBuffer(0, 2, 5) 2
// 1 example1 ArrayBuffer(0, 2, 5) 5
// 2 example2 ArrayBuffer(1, 20... 1
// 2 example2 ArrayBuffer(1, 20... 20
// 2 example2 ArrayBuffer(1, 20... 5
but I doubt it is worth all the fuss.
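If upgrading is possible, on Spark 1.4+ the same result can also be obtained through the DataFrame API instead of the SQL string; a minimal sketch, assuming Spark 1.4 or later:
import org.apache.spark.sql.functions.explode

// explode is available as a DataFrame function from Spark 1.4 onwards.
val exploded = distTable.select(
  distTable("userId"), distTable("someString"), explode(distTable("varA")).as("varA"))
exploded.show()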

Related

How to write a function that takes a list of column names of a DataFrame, reorders the selected columns to the left, and preserves the unselected columns

I'd like to build a function
def reorderColumns(columnNames: List[String]) = ...
that can be applied to a Spark DataFrame such that the columns specified in columnNames get reordered to the left, and the remaining columns (in any order) stay to the right.
Example:
Given a df with the following 5 columns
| A | B | C | D | E
df.reorderColumns(List("D", "B", "A")) returns a df with columns ordered like so:
| D | B | A | C | E
Try this one:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def reorderColumns(df: DataFrame, columns: Array[String]): DataFrame = {
  val restColumns: Array[String] = df.columns.filterNot(c => columns.contains(c))
  df.select((columns ++ restColumns).map(col): _*)
}
Usage example:
val spark: SparkSession = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._
val df = List((1, 3, 1, 6), (2, 4, 2, 5), (3, 6, 3, 4)).toDF("colA", "colB", "colC", "colD")
reorderColumns(df, Array("colC", "colB")).show
// output:
//+----+----+----+----+
//|colC|colB|colA|colD|
//+----+----+----+----+
//|   1|   3|   1|   6|
//|   2|   4|   2|   5|
//|   3|   6|   3|   4|
//+----+----+----+----+
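If you want the df.reorderColumns(...) call style from the question, one option (my own sketch, using a hypothetical wrapper name) is an implicit class around the same logic:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical wrapper so the helper can be called as a method on any DataFrame.
implicit class ReorderOps(df: DataFrame) {
  def reorderColumns(columns: List[String]): DataFrame = {
    val rest = df.columns.filterNot(c => columns.contains(c))
    df.select((columns ++ rest).map(col): _*)
  }
}

// Usage: df.reorderColumns(List("D", "B", "A"))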

Perform lookup on a broadcasted Map conditioned on column value in Spark using Scala

I want to perform a lookup on myMap. When the col2 value is "0000", I want to update it with the value associated with the col1 key. Otherwise I want to keep the existing col2 value.
val myDF :
+-----+-----+
|col1 |col2 |
+-----+-----+
|1    |a    |
|2    |0000 |
|3    |c    |
|4    |0000 |
+-----+-----+
val myMap: Map[String, String] = Map("2" -> "b", "4" -> "d")
val broadcastMyMap = spark.sparkContext.broadcast(myMap)
def lookup = udf((key:String) => broadcastMyMap.value.get(key))
myDF.withColumn("col2", when ($"col2" === "0000", lookup($"col1")).otherwise($"col2"))
I've used the code above in spark-shell and it works fine but when I build the application jar and submit it to Spark using spark-submit it throws an error:
org.apache.spark.SparkException: Failed to execute user defined function(anonfun$5: (string) => string)
Caused by: java.lang.NullPointerException
Is there a way to perform the lookup without using a UDF (UDFs aren't the best option in terms of performance), or to fix the error?
I think I can't simply use a join, because some of the myDF.col2 values that have to be kept could be substituted in that operation.
I cannot reproduce your NullPointerException; the sample program below works perfectly fine. Try executing it:
package com.example

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.UserDefinedFunction

object MapLookupDF {
  Logger.getLogger("org").setLevel(Level.OFF)

  def main(args: Array[String]) {
    import org.apache.spark.sql.functions._
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("MapLookupDF")
      .getOrCreate()
    import spark.implicits._

    val mydf = Seq((1, "a"), (2, "0000"), (3, "c"), (4, "0000")).toDF("col1", "col2")
    mydf.show

    val myMap: Map[String, String] = Map("2" -> "b", "4" -> "d")
    println(myMap.toString)
    val broadcastMyMap = spark.sparkContext.broadcast(myMap)

    def lookup: UserDefinedFunction = udf((key: String) => {
      println("getting the value for the key " + key)
      broadcastMyMap.value.get(key)
    })

    val finaldf = mydf.withColumn("col2",
      when($"col2" === "0000", lookup($"col1")).otherwise($"col2"))
    finaldf.show
  }
}
Result :
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
+----+----+
|col1|col2|
+----+----+
|   1|   a|
|   2|0000|
|   3|   c|
|   4|0000|
+----+----+
Map(2 -> b, 4 -> d)
getting the value for the key 2
getting the value for the key 4
+----+----+
|col1|col2|
+----+----+
|   1|   a|
|   2|   b|
|   3|   c|
|   4|   d|
+----+----+
Note: there won't be significant performance degradation for broadcasting a small map.
If you want to go with a DataFrame approach instead, you can convert the map to a DataFrame:
val df = myMap.toSeq.toDF("key", "val")
Map(2 -> b, 4 -> d) in DataFrame format will look like:
+---+---+
|key|val|
+---+---+
|  2|  b|
|  4|  d|
+---+---+
and then join it to myDF yourself; a sketch of that join follows.
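For completeness, here is one way that join could look. This is my own sketch (not part of the original answer), reusing the key/val DataFrame from above and assuming the myDF from the question:
import org.apache.spark.sql.functions.{coalesce, col, when}

val mapDF = myMap.toSeq.toDF("key", "val")

// Left-join on col1 (cast to string to match the map keys), then only replace
// the "0000" placeholder; rows without a match keep their original col2.
val result = myDF
  .join(mapDF, myDF("col1").cast("string") === mapDF("key"), "left")
  .withColumn("col2",
    when(col("col2") === "0000", coalesce(col("val"), col("col2")))
      .otherwise(col("col2")))
  .drop("key")
  .drop("val")

result.show()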

Spark: Joining with array

I need to join a dataframe with a single-value column to one with an array column, so that the rows join if one of the values in the array is matched.
I tried this, but I guess it's not supported.
Any other way to do this?
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("test")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
import spark.implicits._
val left = spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF("col1")
val right = spark.sparkContext.parallelize(Seq((Array(1, 2), "Yes"),(Array(3),"No"))).toDF("col1", "col2")
left.join(right,"col1")
Throws:
org.apache.spark.sql.AnalysisException: cannot resolve '(col1 = col1)' due to data type mismatch: differing types in '(col1 = col1)' (int and array).;;
One option is to create a UDF for building your join condition:
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
val left = spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF("col1")
val right = spark.sparkContext.parallelize(Seq((Array(1, 2), "Yes"),(Array(3),"No"))).toDF("col1", "col2")
val checkValue = udf {
(array: WrappedArray[Int], value: Int) => array.contains(value)
}
val result = left.join(right, checkValue(right("col1"), left("col1")), "inner")
result.show
+----+------+----+
|col1|  col1|col2|
+----+------+----+
|   1|[1, 2]| Yes|
|   2|[1, 2]| Yes|
|   3|   [3]|  No|
+----+------+----+
The most succinct way to do this is to use the array_contains Spark SQL expression, as shown below. That said, I've compared its performance with that of doing an explode and join (as shown in a previous answer), and the explode seems more performant.
import org.apache.spark.sql.functions.expr
import spark.implicits._
val left = Seq(1, 2, 3).toDF("col1")
val right = Seq((Array(1, 2), "Yes"),(Array(3),"No")).toDF("col1", "col2").withColumnRenamed("col1", "col1_array")
val joined = left.join(right, expr("array_contains(col1_array, col1)")).show
+----+----------+----+
|col1|col1_array|col2|
+----+----------+----+
|   1|    [1, 2]| Yes|
|   2|    [1, 2]| Yes|
|   3|       [3]|  No|
+----+----------+----+
Note you can't use the org.apache.spark.sql.functions.array_contains function directly as it requires the second argument to be a literal as opposed to a column expression.
You could use explode on your array column before the join. Explode creates a new row for each element in the array:
import org.apache.spark.sql.functions.explode

val rightExploded = right.withColumn("exploded_col", explode(right("col1")))
rightExploded.show()
+------+----+------------+
|  col1|col2|exploded_col|
+------+----+------------+
|[1, 2]| Yes|           1|
|[1, 2]| Yes|           2|
|   [3]|  No|           3|
+------+----+------------+
Then you can easily join with your first dataset.
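For example (my own sketch, reusing the rightExploded DataFrame from above and dropping the helper column afterwards):
// Equi-join the plain column against the exploded array elements,
// then drop the helper column.
val joined = left
  .join(rightExploded, left("col1") === rightExploded("exploded_col"))
  .drop("exploded_col")
joined.show()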

How to achieve the below requirement using a Spark RDD

I am really new to Spark; could you please help with the requirement below?
I have the source file shown here. The first field is a name, the second field is a group id. I need to count how many groups each name has, and list all the groups along with the count.
abc 1
abc 2
abc 3
xyz 1
xyz 3
def 2
def 4
lmn 6
I want to get output like the example below:
name  dept   count
abc   1,2,3  3
xyz   1,3    2
def   2,4    2
lmn   6      1
thanks in advance.
Assuming you have a CSV file, first create a DataFrame using the following steps.
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

val members = sc.textFile("member").map(lines => lines.split(",")).map(a => Row(a(0), a(1)))
val rddStruct = new StructType(Array(
  StructField("name", StringType, nullable = true),
  StructField("depart", StringType, nullable = true)))
val df = sqlContext.createDataFrame(members, rddStruct)
To achieve the output, the following steps can be followed.
Apply a groupBy and collect all departments as a set:
val df2 = df.groupBy("Name").agg(collect_set("Depart").as("Depart"))
df2.show
+----+---------+
|Name|   Depart|
+----+---------+
| lmn|      [6]|
| def|   [2, 4]|
| abc|[1, 2, 3]|
| xyz|   [1, 3]|
+----+---------+
Then apply a size function on the Depart column to get the count.
val df3 = df2.withColumn("Count", size(df2("Depart")))
df3.show
+----+---------+-----+
|Name|   Depart|Count|
+----+---------+-----+
| lmn|      [6]|    1|
| def|   [2, 4]|    2|
| abc|[1, 2, 3]|    3|
| xyz|   [1, 3]|    2|
+----+---------+-----+
If the result should be sorted in descending order, you can apply an orderBy function on the above output.
val df4 = df3.orderBy(desc("Count"))
df4.show
+----+---------+-----+
|Name|   Depart|Count|
+----+---------+-----+
| abc|[1, 2, 3]|    3|
| def|   [2, 4]|    2|
| xyz|   [1, 3]|    2|
| lmn|      [6]|    1|
+----+---------+-----+
You can read more about StructType in the Spark documentation.
You can make it simpler using RDD transformations:
scala> val rdd = sc.textFile("/data_test1")
scala> rdd.map(x => x.split(" ")).
         map(x => (x(0), x(1))).
         groupByKey().
         map(x => (x._1, x._2.toSet.mkString(","), x._2.size)).
         toDF("name", "dept", "count").show()
Output:
+----+-----+-----+
|name| dept|count|
+----+-----+-----+
| abc|1,2,3|    3|
| lmn|    6|    1|
| def|  2,4|    2|
| xyz|  1,3|    2|
+----+-----+-----+
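A closely related variant (my own sketch, not from the original answer) replaces groupByKey with reduceByKey, which merges the per-name sets before the shuffle and generally scales better on large data:
// Same output as above, but the sets are combined per partition before shuffling.
rdd.map(x => x.split(" ")).
    map(x => (x(0), Set(x(1)))).
    reduceByKey(_ ++ _).
    map { case (name, groups) => (name, groups.mkString(","), groups.size) }.
    toDF("name", "dept", "count").show()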

Spark: Add column to dataframe conditionally

I am trying to take my input data:
A  B    C
--------------
4  blah 2
2       3
56 foo  3
And add a column to the end based on whether B is empty or not:
A  B    C  D
--------------------
4  blah 2  1
2       3  0
56 foo  3  1
I can do this easily by registering the input dataframe as a temp table, then typing up a SQL query.
But I'd really like to know how to do this with just Scala methods and not having to type out a SQL query within Scala.
I've tried .withColumn, but I can't get that to do what I want.
Try withColumn with the function when as follows:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for `toDF` and $""
import org.apache.spark.sql.functions._ // for `when`

val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)))
  .toDF("A", "B", "C")
val newDf = df.withColumn("D", when($"B".isNull or $"B" === "", 0).otherwise(1))
newDf.show() shows
+---+----+---+---+
|  A|   B|  C|  D|
+---+----+---+---+
|  4|blah|  2|  1|
|  2|    |  3|  0|
| 56| foo|  3|  1|
|100|null|  5|  0|
+---+----+---+---+
I added the (100, null, 5) row for testing the isNull case.
I tried this code with Spark 1.6.0, but as commented in the code of when, it works on versions after 1.4.0.
My bad, I had missed one part of the question.
The best, cleanest way is to use a UDF. The explanation is within the code:
// create some example data...BY DataFrame
// note, third record has an empty string
case class Stuff(a: String, b: Int)
val d = sc.parallelize(Seq(("a", 1), ("b", 2), ("", 3), ("d", 4))
  .map { x => Stuff(x._1, x._2) }).toDF

// now the good stuff.
import org.apache.spark.sql.functions.udf
// function that returns 0 if the string is empty, 1 otherwise
val func = udf( (s: String) => if (s.isEmpty) 0 else 1 )
// create new dataframe with added column named "notempty"
val r = d.select( $"a", $"b", func($"a").as("notempty") )
scala> r.show
+---+---+--------+
|  a|  b|notempty|
+---+---+--------+
|  a|  1|       1|
|  b|  2|       1|
|   |  3|       0|
|  d|  4|       1|
+---+---+--------+
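One caveat (my own note, not from the original answer): the UDF calls s.isEmpty directly, so it would throw a NullPointerException if the column ever contains nulls. A null-safe variant could look like this:
// Treat null the same as an empty string.
val funcNullSafe = udf( (s: String) => if (s == null || s.isEmpty) 0 else 1 )
val rSafe = d.select( $"a", $"b", funcNullSafe($"a").as("notempty") )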
How about something like this?
val newDF = df.filter($"B" === "").take(1) match {
  case Array() => df
  case _ => df.withColumn("D", $"B" === "")
}
Using take(1) should have a minimal performance hit.
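If D should be the 0/1 integer from the question (1 for a non-empty B), a cast over the negated check works as well; this is my own sketch, assuming the same df and that a null B counts as empty:
import org.apache.spark.sql.functions.not

// 1 when B is non-empty (and non-null), 0 otherwise.
val withD = df.withColumn("D", not($"B".isNull || $"B" === "").cast("int"))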