How to group by a dataframe of a specific class - Scala

I have one dataframe with this schema:
|-- Agreement_A1: string (nullable = true)
|-- Line_A1: string (nullable = true)
|-- Line_A2: string (nullable = true)
I create a new dataframe with this code:
val df2 = df.map(row => new MapResultRequestLine().apply(row))(Encoders.bean(classOf[AgreementLine]))
Function apply() is this:
public AgreementLine apply(Row row) {
    AgreementLine agrLine = new AgreementLine();
    agrLine.Agreement_A1 = row.getAs("Agreement_A1");
    Line res = new Line();
    res.Line_A1 = row.getAs("Line_A1");
    res.Line_A2 = row.getAs("Line_A2");
    agrLine.line = res;
    return agrLine;
}
Class AgreementLine looks like this:
public class AgreementLine {
    public String Agreement_A1;
    public Line line;
}
Class Line is this:
public class Line {
    public String Line_A1;
    public String Line_A2;
}
How can I group df2 so that the resulting dataframe has an Agreement_A1 column and a list of Line objects?
I have tried it this way:
val groupedDF = df2.groupBy($"Agreement_A1").agg(collect_set((array($"line"))).as("lines"))
But it shows an error "cannot resolve 'Agreement_A1' given input columns: [];"

The issue is here:
val df2 = df.map(row => new MapResultRequestLine().apply(row))(Encoders.bean(classOf[AgreementLine]))
Scala doesn't show the data type here, so you might think it's a DataFrame (that is, a Dataset[Row]).
But it's actually a Dataset[AgreementLine]. And because of your encoder, it has lost the whole schema, which is why df2.printSchema returns an empty result.
Thus, when you call df2.groupBy($"Agreement_A1"), it throws that exception because there is no column named "Agreement_A1".
The solution, then, is to restore the schema of your Dataset (df2 in your case).
Sadly, I have no idea how to do that directly (I'm a rookie as well).
My only solution is to convert the dataset back to an RDD (note that df2.rdd is an RDD[AgreementLine], not an RDD[Row]) and build a new DataFrame with a custom schema.
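A minimal sketch of that workaround, assuming a SparkSession in scope as spark and the field names shown above (agreementSchema, rowRDD and df3 are names introduced here just for illustration):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, collect_list}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Schema mirroring AgreementLine: a top-level string plus a nested Line struct
val agreementSchema = StructType(Seq(
  StructField("Agreement_A1", StringType, nullable = true),
  StructField("line", StructType(Seq(
    StructField("Line_A1", StringType, nullable = true),
    StructField("Line_A2", StringType, nullable = true)
  )), nullable = true)
))

// df2.rdd is an RDD[AgreementLine]; map it to Rows and rebuild the DataFrame
// (assumes line is non-null for every record)
val rowRDD = df2.rdd.map(a => Row(a.Agreement_A1, Row(a.line.Line_A1, a.line.Line_A2)))
val df3 = spark.createDataFrame(rowRDD, agreementSchema)

// The columns now resolve, so the grouping from the question works
val groupedDF = df3.groupBy("Agreement_A1").agg(collect_list(col("line")).as("lines"))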
Hope you'll get a better solution.

Related

Scala/Spark - Convert Word2vec output to Dataset[_]

I believe the case class type should match the DataFrame. However, I'm confused about what the case class type should be for the text column.
My code below:
case class vectorData(value: Array[String], vectors: Array[Float])

def main(args: Array[String]) {
  val word2vec = new Word2Vec()
    .setInputCol("value").setOutputCol("vectors")
    .setVectorSize(5).setMinCount(0).setWindowSize(5)
  val dataset = spark.createDataset(data)
  val model = word2vec.fit(dataset)
  val encoder = org.apache.spark.sql.Encoders.product[vectorData]
  val result = model.transform(dataset)
  result.foreach(row => println(row.get(0)))
  println("###################################")
  result.foreach(row => println(row.get(1)))
  val output = result.as(encoder)
}
As shown, when I print the first column, I get this:
WrappedArray(#marykatherine_q, know!, I, heard, afternoon, wondered, thing., Moscow, times)
WrappedArray(laying, bed, voice..)
WrappedArray(I'm, sooo, sad!!!, killed, Kutner, House, whyyyyyyyy)
when I print the second column, I get this:
[-0.0495405454809467,0.03403271486361821,0.011959535030958552,-0.008446224654714266,0.0014322120696306229]
[-0.06924172700382769,0.02562551060691476,0.01857258938252926,-0.0269106051127892,-0.011274430900812149]
[-0.06266747579416808,0.007715661790879334,0.047578315007472956,-0.02747830021989477,-0.015755867421188775]
The error I'm getting:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`text`' given input columns: [result, value];
It seems apparent that my case class has a type mismatch with the actual result. What should the correct one be? I want val output to be a Dataset[_].
Thank you
EDIT:
I've modified the case class column names to be the same as the word2vec output. Now I'm getting this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: need an array field but got struct<type:tinyint,size:int,indices:array<int>,values:array<double>>;
From what I see, it is just a problem of attribute naming. What Spark is telling you is that it cannot find the attribute text in the dataframe result.
You do not say how you create the data object, but it must have a value attribute, since Word2Vec manages to find it. model.transform simply adds a result column to that dataset and turns it into a dataframe of the following type:
root
|-- value: array (nullable = true)
| |-- element: string (containsNull = true)
|-- vector: array (nullable = true)
| |-- element: float (containsNull = false)
|-- result: vector (nullable = true)
So when you try to turn it into a dataset, Spark tries to find a text column and throws that exception. Just rename the value column and it will work:
val output = result.withColumnRenamed("value", "text").as(encoder)
After checking the source code of Word2Vec, I realised that the output of transform is actually not Array[Float]; it is Vector (from o.a.s.ml.linalg).
It worked by changing case class as below:
case class vectorData(value: Array[String], vectors: Vector)
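For reference, a minimal sketch of the pieces that changed, assuming the transformed dataframe's column names line up with the case class fields; the key point is that Vector here is Spark's ml type, not Scala's collection type:
import org.apache.spark.ml.linalg.Vector   // Spark's ml Vector, not scala.collection.immutable.Vector
import org.apache.spark.sql.Encoders

case class vectorData(value: Array[String], vectors: Vector)

// assumes the transformed dataframe's columns are named value and vectors
val output = result.as(Encoders.product[vectorData])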

How to optimize dataset aggregation on top of a dataset of Java objects

Could you please help me with a (probably stupid) question:
I have some Java class:
public class ProbePoint implements Serializable, Cloneable {
    private long arrivalTimeMillis = 0;
    private long captureTimeMillis = 0;
    //...
}

public class Trip implements Serializable, Cloneable {
    private ArrayList<ProbePoint> points = new ArrayList<>();
    //...
}
I have a Dataset[Trip]. I need to collect some min/max values. What would be a better implementation than the following:
public class DataRanges implements Serializable {
    private long minCaptureTs;
    private long maxArrivalTs;
}
val timesDs: Dataset[DataRanges] = trips.mapPartitions(t => {
  var minCaptTime = Long.MaxValue
  var maxArrTime = Long.MinValue
  t.foreach(f => {
    if (f.points.head.getCaptureTimeMillis < minCaptTime) minCaptTime = f.points.head.getCaptureTimeMillis
    if (f.points.last.getArrivalTimeMillis > maxArrTime) maxArrTime = f.points.last.getArrivalTimeMillis
  })
  Iterator[DataRanges](new DataRanges(minCaptTime, maxArrTime))
})(Encoders.bean(classOf[DataRanges]))

val times = timesDs.agg(min("minCaptureTs"), max("maxArrivalTs")).head()
Looking at the Java classes, the schema of Dataset[Trip] should be
root
|-- points: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- arrivalTimeMillis: long (nullable = false)
| | |-- captureTimeMillis: long (nullable = false)
It would be possible to explode the array and then take the min and max of the resulting columns, thus simplifying the code a bit:
val df = tripsDF
.withColumn("exploded", explode($"points"))
.withColumn("arrivalTimeMillis", $"exploded.arrivalTimeMillis")
.withColumn("captureTimeMillis", $"exploded.captureTimeMillis")
val Row(minCaptureTs: Long, maxArrivalTs: Long) =
  df.agg(min("captureTimeMillis"), max("arrivalTimeMillis")).head
println(minCaptureTs)
println(maxArrivalTs)
The code in the question assumes that the arrays in the Trip class are sorted: the minimal capture time is always taken from the first element of the array and the maximum arrival time is always taken from the last one. This code takes the minimum and maximum over all ProbePoints, so the logic is slightly different.
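If keeping the question's first/last semantics is what matters, here is a hedged sketch of the same idea written directly against the array column (names follow the question's DataRanges fields; it assumes the points arrays really are sorted):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, max, min, size}

// captureTimeMillis of the first point, arrivalTimeMillis of the last point, per Trip
val firstLast = tripsDF
  .withColumn("captureTimeMillis", col("points")(0)("captureTimeMillis"))
  .withColumn("arrivalTimeMillis", col("points")(size(col("points")) - 1)("arrivalTimeMillis"))

val Row(minCaptureTs: Long, maxArrivalTs: Long) =
  firstLast.agg(min("captureTimeMillis"), max("arrivalTimeMillis")).head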

Filter an array column based on a provided list

I have the following types in a dataframe:
root
|-- id: string (nullable = true)
|-- items: array (nullable = true)
| |-- element: string (containsNull = true)
input:
val rawData = Seq(("id1",Array("item1","item2","item3","item4")),("id2",Array("item1","item2","item3")))
val data = spark.createDataFrame(rawData)
and a list of items:
val filter_list = List("item1", "item2")
I would like to filter out items that are not in the filter_list, similar to how array_contains would work, but it doesn't work on a provided list of strings, only on a single value.
so the output would look like this:
val rawData = Seq(("id1",Array("item1","item2")),("id2",Array("item1","item2")))
val data = spark.createDataFrame(rawData)
I tried solving this with the following UDF, but I am probably mixing up types between Scala and Spark:
def filterItems(flist: List[String]) = udf {
  (recs: List[String]) => recs.filter(item => flist.contains(item))
}
I'm using Spark 2.2
thanks!
Your code is almost right. All you have to do is replace List with Seq:
def filterItems(flist: List[String]) = udf {
  (recs: Seq[String]) => recs.filter(item => flist.contains(item))
}
It would also make sense to change the signature from List[String] => UserDefinedFunction to Seq[String] => UserDefinedFunction, but it is not required.
Reference SQL Programming Guide - Data Types.
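For completeness, a sketch of how the fixed udf might be applied, assuming the tuple columns of data are renamed to match the schema shown in the question:
import org.apache.spark.sql.functions.col

val filtered = data
  .toDF("id", "items")                                   // rename the tuple columns _1/_2
  .withColumn("items", filterItems(filter_list)(col("items")))

filtered.show(false)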

Spark UDAF: How to get value from input by column field name in UDAF (User-Defined Aggregation Function)?

I am trying to use Spark UDAF to summarize two existing columns into a new column. Most of the tutorials on Spark UDAF out there use indices to get the values in each column of the input Row. Like this:
input.getAs[String](1)
, which is used in my update method (override def update(buffer: MutableAggregationBuffer, input: Row): Unit). It works in my case as well. However, I want to use the field name of that column to get the value, like this:
input.getAs[String](ColumnNames.BehaviorType)
, where ColumnNames.BehaviorType is a String defined in an object:
/**
* Column names in the original dataset
*/
object ColumnNames {
val JobSeekerID = "JobSeekerID"
val JobID = "JobID"
val Date = "Date"
val BehaviorType = "BehaviorType"
}
This time it does not work. I got the following exception:
java.lang.IllegalArgumentException: Field "BehaviorType" does not exist.
  at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:292)
  ...
  at org.apache.spark.sql.Row$class.getAs(Row.scala:333)
  at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:165)
  at com.recsys.UserBehaviorRecordsUDAF.update(UserBehaviorRecordsUDAF.scala:44)
Some relevant code segments:
This is part of my UDAF:
class UserBehaviorRecordsUDAF extends UserDefinedAggregateFunction {

  override def inputSchema: StructType = StructType(
    StructField("JobID", IntegerType) ::
    StructField("BehaviorType", StringType) :: Nil)

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    println("XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")
    println(input.schema.treeString)
    println
    println(input.mkString(","))
    println
    println(this.inputSchema.treeString)
    // println
    // println(bufferSchema.treeString)
    input.getAs[String](ColumnNames.BehaviorType) match { //ColumnNames.BehaviorType //1 //TODO WHY??
      case BehaviourTypes.viewed_job =>
        buffer(0) =
          buffer.getAs[Seq[Int]](0) :+ //Array[Int] //TODO WHY??
            input.getAs[Int](0) //ColumnNames.JobID
      case BehaviourTypes.bookmarked_job =>
        buffer(1) =
          buffer.getAs[Seq[Int]](1) :+ //Array[Int]
            input.getAs[Int](0) //ColumnNames.JobID
      case BehaviourTypes.applied_job =>
        buffer(2) =
          buffer.getAs[Seq[Int]](2) :+ //Array[Int]
            input.getAs[Int](0) //ColumnNames.JobID
    }
  }
The following is the part of the code that calls the UDAF:
val ubrUDAF = new UserBehaviorRecordsUDAF

val userProfileDF = userBehaviorDS
  .groupBy(ColumnNames.JobSeekerID)
  .agg(
    ubrUDAF(
      userBehaviorDS.col(ColumnNames.JobID),
      userBehaviorDS.col(ColumnNames.BehaviorType)
    ).as("profile str"))
It seems the field names in the schema of the input Row are not passed into the UDAF:
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root
|-- input0: integer (nullable = true)
|-- input1: string (nullable = true)
30917,viewed_job
root
|-- JobID: integer (nullable = true)
|-- BehaviorType: string (nullable = true)
What is the problem in my code?
I also want to use the field names from my inputSchema in my update method to create maintainable code.
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

class MyUDAF extends UserDefinedAggregateFunction {
  def update(buffer: MutableAggregationBuffer, input: Row) = {
    // Re-attach the input schema so fields can be looked up by name
    val inputWSchema = new GenericRowWithSchema(input.toSeq.toArray, inputSchema)
    // ... inputWSchema.getAs[String](ColumnNames.BehaviorType) now resolves
  }
}
Ultimately I switched to an Aggregator, which ran in half the time.
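The answer only mentions the switch, so here is a rough, hedged sketch of what an Aggregator-based version could look like; Behavior, Profile and the literal behaviour strings are hypothetical stand-ins for the real classes and constants:
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical typed records; field and behaviour names mirror the question's ColumnNames
case class Behavior(JobSeekerID: String, JobID: Int, BehaviorType: String)
case class Profile(viewed: Seq[Int], bookmarked: Seq[Int], applied: Seq[Int])

object UserBehaviorAggregator extends Aggregator[Behavior, Profile, Profile] {
  def zero: Profile = Profile(Nil, Nil, Nil)

  // Field access is by name on the case class, so no positional getAs calls are needed
  def reduce(b: Profile, a: Behavior): Profile = a.BehaviorType match {
    case "viewed_job"     => b.copy(viewed = b.viewed :+ a.JobID)
    case "bookmarked_job" => b.copy(bookmarked = b.bookmarked :+ a.JobID)
    case "applied_job"    => b.copy(applied = b.applied :+ a.JobID)
    case _                => b
  }

  def merge(b1: Profile, b2: Profile): Profile = Profile(
    b1.viewed ++ b2.viewed,
    b1.bookmarked ++ b2.bookmarked,
    b1.applied ++ b2.applied)

  def finish(reduction: Profile): Profile = reduction

  def bufferEncoder: Encoder[Profile] = Encoders.product[Profile]
  def outputEncoder: Encoder[Profile] = Encoders.product[Profile]
}

// Usage (assuming userBehaviorDS is a Dataset[Behavior] and spark.implicits._ is imported):
// userBehaviorDS.groupByKey(_.JobSeekerID).agg(UserBehaviorAggregator.toColumn.name("profile"))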

Spark - recursive function as udf generates an Exception

I am working with DataFrames whose elements have a schema similar to:
root
|-- NPAData: struct (nullable = true)
| |-- NPADetails: struct (nullable = true)
| | |-- location: string (nullable = true)
| | |-- manager: string (nullable = true)
| |-- service: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- serviceName: string (nullable = true)
| | | |-- serviceCode: string (nullable = true)
|-- NPAHeader: struct (nullable = true)
| |-- npaNumber: string (nullable = true)
| |-- date: string (nullable = true)
In my DataFrame I want to group all elements which have the same NPAHeader.code, so to do that I am using the following line:
val groupedNpa = orderedNpa.groupBy($"NPAHeader.code" ).agg(collect_list(struct($"NPAData",$"NPAHeader")).as("npa"))
After this I have a dataframe with the following schema:
StructType(StructField(npaNumber,StringType,true), StructField(npa,ArrayType(StructType(StructField(NPAData...)))))
An example of each Row would be something similar to:
[1234,WrappedArray([npaNew,npaOlder,...npaOldest])]
Now what I want is to generate another DataFrame that picks just one of the elements in the WrappedArray, so I want an output similar to:
[1234,npaNew]
Note: the chosen element from the WrappedArray is the one that matches some complex logic after iterating over the whole WrappedArray. But to simplify the question, I will always pick the last element of the WrappedArray (after iterating over all of it).
To do so, I want to define a recursive udf:
import org.apache.spark.sql.functions.udf
def returnRow(elementList : Row)(index:Int): Row = {
val dif = elementList.size - index
val row :Row = dif match{
case 0 => elementList.getAs[Row](index)
case _ => returnRow(elementList)(index + 1)
}
row
}
val returnRow_udf = udf(returnRow _)
groupedNpa.map{row => (row.getAs[String]("npaNumber"),returnRow_udf(groupedNpa("npa")(0)))}
But I am getting the following error in the map:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type Int => Unit is not supported
What am I doing wrong?
As an aside, I am not sure if I am passing the npa column correctly with groupedNpa("npa"). I am accessing the WrappedArray as a Row, because I don't know how to iterate over Array[Row] (the get(index) method is not present in Array[Row]).
TL;DR Just use one of the methods described in How to select the first row of each group?
If you want to use complex logic, and return Row you can skip SQL API and use groupByKey:
val f: (String, Iterator[org.apache.spark.sql.Row]) => Row
val encoder: Encoder
df.groupByKey(_.getAs[Row]("NPAHeader").getAs[String]("code")).mapGroups(f)(encoder)
or better:
val g: (Row, Row) => Row
df.groupByKey(_.getAs[Row]("NPAHeader").getAs[String]("code")).reduceGroups(g)
where encoder is a valid RowEncoder (Encoder error while trying to map dataframe row to updated row).
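As a concrete illustration of the reduceGroups route, a hedged sketch that keeps the row with the greatest NPAHeader.date per group (NPAHeader.code follows the question's grouping key, spark.implicits._ supplies the key encoder, and the date strings are assumed to sort chronologically):
import org.apache.spark.sql.Row
import spark.implicits._   // provides the Encoder[String] that groupByKey needs

val latestPerCode = df
  .groupByKey(_.getAs[Row]("NPAHeader").getAs[String]("code"))
  .reduceGroups { (a: Row, b: Row) =>
    val da = a.getAs[Row]("NPAHeader").getAs[String]("date")
    val db = b.getAs[Row]("NPAHeader").getAs[String]("date")
    if (da >= db) a else b   // keep the row with the later date
  }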
Your code is faulty in multiple ways:
groupBy doesn't guarantee the order of values. So:
orderBy(...).groupBy(....).agg(collect_list(...))
can have non-deterministic output. If you really decide to go this route you should skip orderBy and sort the collected array explicitly.
You cannot pass a curried function to udf. You'd have to uncurry it first, but that would require a different order of arguments (see the example below).
Even if you could, this wouldn't be a valid way to call it (note that you omit the second argument):
returnRow_udf(groupedNpa("npa")(0))
To make it worse, you call it inside map, where udfs are not applicable at all.
udf cannot return Row. It has to return external Scala type.
External representation for array<struct> is Seq[Row]. You cannot just substitute it with Row.
SQL arrays can be accessed by index with apply:
df.select($"array"(size($"array") - 1))
but it is not a correct method due to non-determinism. You could apply sort_array, but as pointed out at the beginning, there are more efficient solutions.
Surprisingly, recursion is not that relevant here. You could design a function like this:
def size(i: Int = 0)(xs: Seq[Any]): Int = xs match {
  case Seq() => i
  case null => i
  case Seq(h, t @ _*) => size(i + 1)(t)
}
val size_ = udf(size() _)
and it would work just fine:
Seq((1, Seq("a", "b", "c"))).toDF("id", "array")
.select(size_($"array"))
although recursion is overkill if you can just iterate over the Seq.
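For example, a non-recursive equivalent with the same null handling and the same usage:
import org.apache.spark.sql.functions.udf

// Seq already knows its own length; Option covers null-valued array columns
val size2 = udf((xs: Seq[String]) => Option(xs).map(_.size).getOrElse(0))

// toDF and $ assume spark.implicits._ is in scope, as in the example above
Seq((1, Seq("a", "b", "c"))).toDF("id", "array")
  .select(size2($"array"))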