Spark sqlContext UDF acting on Sets

Spark sqlContext UDF acting on Sets - scala

I've been trying to define a function that works within Spark's DataFrame which takes scala sets as input and outputs an integer. I'm getting the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 20 in stage 25.0 failed 1 times, most recent failure: Lost task 20.0 in stage 25.0 (TID 473, localhost): java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.collection.immutable.Set
Here is a simple code that gives the crux of the issue:
// generate sample data
case class Dummy( x:Array[Integer] )
val df = sqlContext.createDataFrame(Seq(
Dummy(Array(1,2,3)),
Dummy(Array(10,20,30,40))
))
// define the UDF
import org.apache.spark.sql.functions._
def setSize(A:Set[Integer]):Integer = {
A.size
}
// For some reason I couldn't get it to work without this valued function
val sizeWrap: (Set[Integer] => Integer) = setSize(_)
val sizeUDF = udf(sizeWrap)
// this produces the error
df.withColumn("colSize", sizeUDF('x)).show
What am I missing here? How can I get this to work? I know I can do this by casting to RDD but I don't want to go back and forth between RDD and DataFrames.

Use Seq:
val sizeUDF = udf((x: Seq) => setSize(x.toSet))

Related

java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast t

I am trying to convert a Row in a dataframe to a case class and getting following error
2019-08-19 20:13:08 Executor task launch worker for task 1 ERROR
Executor:91 - Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.ClassCastException:
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot
be cast to Models.City
Sample Log = {"Id": "1","City": {"name": "A","state": "B"}}
Below is the code that is reading a text file having data in json format which is throwing above error
case class City(name: String, state: String)
val file = new File("src/test/resources/log.txt")
val logs = spark.
read.
text(file.getAbsolutePath).
select(col("value").
as("body"))
import spark.implicits._
var logDF: DataFrame = spark.read.json(logs.as[String])
logDF.map(row => row.getAs[City]("City").state).show()
Basically I can not perform any operation on the dataframe itself due to some restrictions.
So given a row how can we cast it into a case class (i cannot use match pattern here as case class can have lot of fields and nested case classes)
Thanks in advance.
Any help is greatly appreciated!!

I had the same issue (Spark SQL 3.1.3). The solution was to use Spark to convert Dataframe to Dataset and then access the fields.
import spark.implicits._
var logDF: DataFrame = spark.read.json(logs.as[String])
logDF.select("City").as[City].map(city => city.state).show()

spark create Dataframe in UDF

I have a example, want to create Dataframe in a UDF. Something like the one below
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.feature.VectorAssembler
data to Dataframe
val df = Seq((1,1,34,23,34,56),(2,1,56,34,56,23),(3,0,34,23,23,78),(4,0,23,34,78,23),(5,1,56,23,23,12),
(6,1,67,34,56,34),(7,0,23,23,23,56),(8,0,12,34,45,89),(9,1,12,34,12,34),(10,0,12,34,23,34)).toDF("id","label","tag1","tag2","tag3","tag4")
val assemblerDF = new VectorAssembler().setInputCols(Array("tag1", "tag2", "tag3","tag4")).setOutputCol("features")
val data = assemblerDF.transform(df)
val Array(train,test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val testData=test.toDF
val loadmodel=LogisticRegressionModel.load("/user/xu/savemodel")
sc.broadcast(loadmodel)
val assemblerFe = new VectorAssembler().setInputCols(Array("a", "b", "c","d")).setOutputCol("features")
sc.broadcast(assemblerFe)
UDF
def predict(predictSet:Vector):Double={
val set=Seq((1,2,3,4)).toDF("a","b","c","d")
val predata = assemblerFe.transform(set)
val result=loadmodel.transform(predata)
result.rdd.take(1)(0)(3).toString.toDouble}
spark.udf.register("predict", predict _)
testData.registerTempTable("datatable")
spark.sql("SELECT predict(features) FROM datatable").take(1)
i get an error like
ERROR Executor: Exception in task 3.0 in stage 4.0 (TID 7) [Executor task launch worker for task 7]
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => double)
and
WARN TaskSetManager: Lost task 3.0 in stage 4.0 (TID 7, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => double)
Is dataframe not supported? I am using Spark 2.3.0 and Scala 2.11. thanks

As mentioned in comments, you don't need UDF here to apply the Trained model to test data. You can apply the model to test dataframe in the main program itself as below:
val df = Seq((1,1,34,23,34,56),(2,1,56,34,56,23),(3,0,34,23,23,78),(4,0,23,34,78,23),(5,1,56,23,23,12),
(6,1,67,34,56,34),(7,0,23,23,23,56),(8,0,12,34,45,89),(9,1,12,34,12,34),(10,0,12,34,23,34)).toDF("id","label","tag1","tag2","tag3","tag4")
val assemblerDF = new VectorAssembler().setInputCols(Array("tag1", "tag2", "tag3","tag4")).setOutputCol("features")
val data = assemblerDF.transform(df)
val Array(train,test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val testData=test.toDF
val loadmodel=LogisticRegressionModel.load("/user/xu/savemodel")
sc.broadcast(loadmodel)
val assemblerFe = new VectorAssembler().setInputCols(Array("a", "b", "c","d")).setOutputCol("features")
sc.broadcast(assemblerFe)
val set=Seq((1,2,3,4)).toDF("a","b","c","d")
val predata = assemblerFe.transform(set)
val result=loadmodel.transform(predata) // Applying model on predata dataframe. You can apply model on any DataFrame.
result is a DataFrame now, you can Reigister the DataFrame as a table and query predictionLabel and features using SQL OR you can directly select the predictLabel and other fields from DataFrame.
Please note, UDF is a feature of Spark SQL to define new Column-based functions that extend the vocabulary of Spark SQL’s DSL for transforming Datasets. It doesnt return the DataFrame itself as a return type. and generally its not advised to use UDF's unless necessary, refer to: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs-blackbox.html

Merge two columns of type Array[string] into a new Array[string] column

I have two columns in a Spark SQL DataFrame with each entry in either column as an array of strings.
val ngramDataFrame = Seq(
(Seq("curious", "bought", "20"), Seq("iwa", "was", "asj"))
).toDF("filtered_words", "ngrams_array")
I want to merge the arrays in each row to make a single array in a new column. My code is as follows:
def concat_array(firstarray: Array[String],
secondarray: Array[String]) : Array[String] =
{ (firstarray ++ secondarray).toArray }
val concatUDF = udf(concat_array _)
val concatFrame = ngramDataFrame.withColumn("full_array", concatUDF($"filtered_words", $"ngrams_array"))
I can successfully use the concat_array function on two arrays. However when I run the above code, I get the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 16.0 failed 1 times, most recent failure: Lost task 0.0 in stage 16.0 (TID 12, localhost): org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (array, array) => array) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String; at $line80.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:76) ... 13 more Driver stacktrace:

In Spark 2.4 or later you can use concat (if you want to keep duplicates):
ngramDataFrame.withColumn(
"full_array", concat($"filtered_words", $"ngrams_array")
).show
+--------------------+---------------+--------------------+
| filtered_words| ngrams_array| full_array|
+--------------------+---------------+--------------------+
|[curious, bought,...|[iwa, was, asj]|[curious, bought,...|
+--------------------+---------------+--------------------+
or array_union (if you want to drop duplicates):
ngramDataFrame.withColumn(
"full_array",
array_union($"filtered_words", $"ngrams_array")
)
These can be also composed from the other higher order functions, for example
ngramDataFrame.withColumn(
"full_array",
flatten(array($"filtered_words", $"ngrams_array"))
)
with duplicates, and
ngramDataFrame.withColumn(
"full_array",
array_distinct(flatten(array($"filtered_words", $"ngrams_array")))
)
without.
On a side note, you shouldn't use WrappedArray when working with ArrayType columns. Instead you should expect the guaranteed interface, which is Seq. So the udf should use function with following signature:
(Seq[String], Seq[String]) => Seq[String]
Please refer to SQL Programming Guide for details.

Arjun there is an error in the udf you had created.when you are passing the array type columns .data type is not Array[String] it is WrappedArray[String].below i am pasting the modified udf along with output.
val SparkCtxt = new SparkContext(sparkConf)
val sqlContext = new SQLContext(SparkCtxt)
import sqlContext.implicits
import org.apache.spark.sql.functions._
val temp=SparkCtxt.parallelize(Seq(Row(Array("String1","String2"),Array("String3","String4"))))
val df= sqlContext.createDataFrame(temp,
StructType(List(
StructField("Col1",ArrayType(StringType),true),
StructField("Col2",ArrayType(StringType),true)
)
) )
def concat_array(firstarray: mutable.WrappedArray[String],
secondarray: mutable.WrappedArray[String]) : mutable.WrappedArray[String] =
{
(firstarray ++ secondarray)
}
val concatUDF = udf(concat_array _)
val df2=df.withColumn("udftest",concatUDF(df.col("Col1"), df.col("Col2")))
df2.select("udftest").foreach(each=>{println("***********")
println(each(0))})
df2.show(true)
OUTPUT:
+------------------+------------------+--------------------+
| Col1| Col2| udftest|
+------------------+------------------+--------------------+
|[String1, String2]|[String3, String4]|[String1, String2...|
+------------------+------------------+--------------------+
WrappedArray(String1, String2, String3, String4)

Spark reducer and summation result issue

Here is sample file
Department,Designation,costToCompany,State
Sales,Trainee,12000,UP
Sales,Lead,32000,AP
Sales,Lead,32000,LA
Sales,Lead,32000,TN
Sales,Lead,32000,AP
Sales,Lead,32000,TN
Sales,Lead,32000,LA
Sales,Lead,32000,LA
Marketing,Associate,18000,TN
Marketing,Associate,18000,TN
HR,Manager,58000,TN
Produce an output as csv
Group by department, desigination, State
With additional columns with sum(costToCompany) and sum(TotalEmployeeCount)
Result should be like
Dept,Desg,state,empCount,totalCost
Sales,Lead,AP,2,64000
Sales,Lead,LA,3,96000
Sales,Lead,TN,2,64000
Following is the solution and writing to file is resulting in an error. What am i doing wrong here?
Step #1: Load file
val file = sc.textFile("data/sales.txt")
Step #2: Create a case class to represt the data
scala> case class emp(Dept:String, Desg:String, totalCost:Double, State:String)
defined class emp
Step #3: Split data and create RDD of emp object
scala> val fileSplit = file.map(_.split(","))
scala> val data = fileSplit.map(x => emp(x(0), x(1), x(2).toDouble, x(3)))
Step #4: Turn the data into Key/value par with key=(dept, desg,state) and value=(1,totalCost)
scala> val keyVals = data.map(x => ((x.Dept,x.Desg,x.State),(1,x.totalCost)))
Step #5: Group by using reduceByKey, as we want summation as well for total number of employees and the cost
scala> val results = keyVals.reduceByKey{(a,b) => (a._1+b._1, a._2+b._2)} //(a.count+ b.count, a.cost+b.cost)
results: org.apache.spark.rdd.RDD[((String, String, String), (Int, Double))] = ShuffledRDD[41] at reduceByKey at <console>:55
Step #6: save the results
scala> results.repartition(1).saveAsTextFile("data/result")
Error
17/08/16 22:16:59 ERROR executor.Executor: Exception in task 0.0 in stage 20.0 (TID 23)
java.lang.NumberFormatException: For input string: "costToCompany"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
at java.lang.Double.parseDouble(Double.java:540)
at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
at $line85.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:51)
at $line85.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
17/08/16 22:16:59 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 20.0 (TID 23, localhost, executor driver): java.lang.NumberFormatException: For input string: "costToCompany"
Update 1
Forgot to remove header. update code here. Save is throwing a different error now. also, need to put the header back in the file.
scala> val file = sc.textFile("data/sales.txt")
scala> val header = fileSplit.first()
scala> val noHeaderData = fileSplit.filter(_(0) != header(0))
scala> case class emp(Dept:String, Desg:String, totalCost:Double, State:String)
scala> val data = noHeaderData.map(x => emp(x(0), x(1), x(2).toDouble, x(3)))
scala> val keyVals = data.map(x => ((x.Dept,x.Desg,x.State),(1,x.totalCost)))
scala> val resultSpecific = results.map(x => (x._1._1, x._1._2, x._1._3, x._2._1, x._2._2))
scala> resultSpecific.repartition(1).saveASTextFile("data/specific")
<console>:64: error: value saveASTextFile is not a member of org.apache.spark.rdd.RDD[(String, String, String, Int, Double)]
resultSpecific.repartition(1).saveASTextFile("data/specific")

To answer your question as well as comments:
It would be easier for you to utilize dataframes in this case, as your file is in csv format you can use the following way to load and save the data. In this way, you do not need to concern yourself with splitting the rows in the file as well as taking care of the header (both when loading and saving).
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "true") //reading the headers
.load("csv/file/path");
The dataframe column names will then be the same as the header in the file. Instead of reduceByKey() you can use the dataframe's groupBy() and agg():
val res = df.groupBy($"Department", $"Designation", $"State")
.agg(count($"costToCompany").alias("empCount"), sum($"costToCompany").alias("totalCost"))
Then save it:
res.coalesce(1)
.write.format("com.databricks.spark.csv")
.option("header", "true")
.save("results.csv")

when you are trying to cast into double, costToCompany string wont cast thats why its stuck when try to fire action. just drop first record from file and then it will work . you can also do such operation on dataframe also which will be easy

The error is straight forward and it says that
:64: error: value saveASTextFile is not a member of
org.apache.spark.rdd.RDD[(String, String, String, Int, Double)]
resultSpecific.repartition(1).saveASTextFile("data/specific")
In fact, you got no method called saveASTextFile(...) but saveAsTextFile(???). you have case error on your method name.

Spark recommendation ALS nullpointer exception

I am trying to build an ALS model using Spark.mllib.recommendation.
I am getting a null pointer exception.
I do not see any null values in the columns I am using. help needed.
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.ml.recommendation.ALS
val path = "DataPath"
val data = spark.read.json(path)
data.printSchema()
data.createOrReplaceTempView("reviews")
val raw_reviews = spark.sql("Select reviewerID, cast(asin as int) as ProductID, overall from reviews")
raw_reviews.printSchema()
import org.apache.spark.ml.feature.StringIndexer
val stringindexer = new StringIndexer()
.setInputCol("reviewerID")
.setOutputCol("userID")
val modelc = stringindexer.fit(raw_reviews)
val df = modelc.transform(raw_reviews)
val Array(training,test) = df.randomSplit(Array(0.8,0.2))
val als = new ALS().setMaxIter(5).setRegParam(0.01).setUserCol("userID").setItemCol("ProductID").setRatingCol("overall")
val model = als.fit(training)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 30.0 failed 1 times, most recent failure: Lost task 1.0 in stage 30.0 (TID 94, localhost): java.lang.NullPointerException: Value at index 1 is null

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Spark sqlContext UDF acting on Sets - scala

Use Seq: val sizeUDF = udf((x: Seq) => setSize(x.toSet))

Related

java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast t

spark create Dataframe in UDF

Merge two columns of type Array[string] into a new Array[string] column

Spark reducer and summation result issue

Spark recommendation ALS nullpointer exception

Categories

Resources