Selecting a row from array<struct> based on given condition - scala

I've a dataframe with following schema -
|-- ID: string (nullable = true)
|-- VALUES: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _v1: string (nullable = true)
| | |-- _v2: string (nullable = true)
VALUES are like -
[["ABC","a"],["PQR","c"],["XYZ","b"],["TUV","d"]]
[["PQR","g"]]
[["TUV","f"],["ABC","e"]]
I've to select a single struct from this array based on the value of _v1. There is a hierarchy in these values like -
"ABC" -> "XYZ" -> "PQR" -> "TUV"
Now, if "TUV" is present, we will select the row with "TUV" in its _v1. Else we will check for "PQR". If "PQR" is present, take its row. Else check for "XYZ" and so on.
The result df should look like - (which will be StructType now, not Array[Struct])
["TUV","d"]
["PQR","g"]
["TUV","f"]
Can someone please guide me how can I approach this problem by creating a udf ?
Thanks in advance.

you can do something like below
import org.apache.spark.sql.functions._
import scala.collection.mutable
def getIndex = udf((array : mutable.WrappedArray[String]) => {
if(array.contains("TUV")) array.indexOf("TUV")
else if(array.contains("PQR")) array.indexOf("PQR")
else if(array.contains("XYZ")) array.indexOf("XYZ")
else if(array.contains("ABC")) array.indexOf("ABC")
else 0
})
df.select($"VALUES"(getIndex($"VALUES._v1")).as("selected"))
You should have following output
+--------+
|selected|
+--------+
|[TUV,d] |
|[PQR,g] |
|[TUV,f] |
+--------+
I hope the answer is helpful
Updated
You can select the fields of a struct column using dot notation. Here $"VALUES._v1" selects the _v1 field of every struct in the array and passes them to the udf as an array, in the same order.
For example, for [["ABC","a"],["PQR","c"],["XYZ","b"],["TUV","d"]], $"VALUES._v1" returns ["ABC","PQR","XYZ","TUV"], which is passed to the udf.
Inside the udf, the index of the matching string is returned. For example, for ["ABC","PQR","XYZ","TUV"], "TUV" matches, so it returns 3.
For the first row, getIndex($"VALUES._v1") returns 3, so $"VALUES"(getIndex($"VALUES._v1")) is equivalent to $"VALUES"(3), which is the fourth element of [["ABC","a"],["PQR","c"],["XYZ","b"],["TUV","d"]], i.e. ["TUV","d"].
I hope the explanation is clear.
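To make the steps above concrete, here is a small sketch in plain Scala (ordinary collections, no Spark) that traces the first row. The object name IndexTrace and the modelling of each struct as a tuple are assumptions made only for illustration.

```scala
object IndexTrace {
  // One row of VALUES, with each struct modelled as a (_v1, _v2) tuple.
  val values: Seq[(String, String)] =
    Seq(("ABC", "a"), ("PQR", "c"), ("XYZ", "b"), ("TUV", "d"))

  // Counterpart of $"VALUES._v1": the _v1 fields in their original order.
  val v1s: Seq[String] = values.map(_._1)

  // Same branching as the udf: the highest-priority value present wins.
  def getIndex(array: Seq[String]): Int =
    if (array.contains("TUV")) array.indexOf("TUV")
    else if (array.contains("PQR")) array.indexOf("PQR")
    else if (array.contains("XYZ")) array.indexOf("XYZ")
    else if (array.contains("ABC")) array.indexOf("ABC")
    else 0

  // Counterpart of $"VALUES"(getIndex($"VALUES._v1")).
  val selected: (String, String) = values(getIndex(v1s))
}
```

Note that the final else 0 falls back to the first element when none of the four values is present, which may or may not be what you want.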

This should work as long as each row contains each _v1 value at most once. The UDF returns the index of the best value in the hierarchy list. The struct containing this value in _v1 is then selected and put into the "select" column.
val hierarchy = List("TUV", "PQR", "XYZ", "ABC")
val findIndex = udf((rows: Seq[String]) => {
val s = rows.toSet
val best = hierarchy.filter(h => s contains h).head
rows.indexOf(best)
})
df.withColumn("select", $"VALUES"(findIndex($"VALUES._v1")))
A list is used for the order to make it easy to extend to more than 4 values.
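One thing to watch for: calling .head on an empty list throws NoSuchElementException, so the lookup above fails for a row that contains none of the hierarchy values. A minimal sketch of a safer lookup, written in plain Scala without Spark (SafeIndex is a hypothetical name), could return an Option instead:

```scala
object SafeIndex {
  // Highest priority first, as in the answer above.
  val hierarchy: List[String] = List("TUV", "PQR", "XYZ", "ABC")

  // Index of the best-ranked value present in rows, or None if none is present.
  def findIndex(rows: Seq[String]): Option[Int] =
    hierarchy.find(h => rows.contains(h)).map(h => rows.indexOf(h))
}
```

If the same body were used inside a udf, a None result would be expected to surface as a null in the resulting column, but that is worth verifying on your Spark version.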

Related

Scala/Spark - Convert Word2vec output to Dataset[_]

I believe the case class type should match with DataFrame. However, I'm confused what should be my case class type for text column?
My code below:
case class vectorData(value: Array[String], vectors: Array[Float])
def main(args: Array[String]) {
val word2vec = new Word2Vec()
.setInputCol("value").setOutputCol("vectors")
.setVectorSize(5).setMinCount(0).setWindowSize(5)
val dataset = spark.createDataset(data)
val model = word2vec.fit(dataset)
val encoder = org.apache.spark.sql.Encoders.product[vectorData]
val result = model.transform(dataset)
result.foreach(row => println(row.get(0)))
println("###################################")
result.foreach(row => println(row.get(1)))
val output = result.as(encoder)
}
As shown, when I print the first column, I get this:
WrappedArray(#marykatherine_q, know!, I, heard, afternoon, wondered, thing., Moscow, times)
WrappedArray(laying, bed, voice..)
WrappedArray(I'm, sooo, sad!!!, killed, Kutner, House, whyyyyyyyy)
when I print the second column, I get this:
[-0.0495405454809467,0.03403271486361821,0.011959535030958552,-0.008446224654714266,0.0014322120696306229]
[-0.06924172700382769,0.02562551060691476,0.01857258938252926,-0.0269106051127892,-0.011274430900812149]
[-0.06266747579416808,0.007715661790879334,0.047578315007472956,-0.02747830021989477,-0.015755867421188775]
The error I'm getting:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`text`' given input columns: [result, value];
It seems apparent that my case class has a type mismatch with the actual result. What should the correct type be? I want val output to be a Dataset[_].
Thank you
EDIT:
I've modified the case class column names to be same as the word2vec output. Now I'm getting this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: need an array field but got struct<type:tinyint,size:int,indices:array<int>,values:array<double>>;
From what I see, it is just a problem of attribute naming. What spark is telling you is that it cannot find the attribute text in the dataframe result.
You do not say how you create the data object but it must have an attribute value since Word2vec manages to find it. model.transform simply adds a result column to that dataset, and turns it into a dataframe of the following type:
root
|-- value: array (nullable = true)
| |-- element: string (containsNull = true)
|-- vector: array (nullable = true)
| |-- element: float (containsNull = false)
|-- result: vector (nullable = true)
So when you try to turn it into a dataset, spark tries to find a text column and throws that exception. Just rename the value column and it will work:
val output = result.withColumnRenamed("value", "text").as(encoder)
After checking the source code of Word2Vec, I realised that the output of transform is not Array[Float] but Vector (from org.apache.spark.ml.linalg).
It worked after changing the case class as below:
import org.apache.spark.ml.linalg.Vector
case class vectorData(value: Array[String], vectors: Vector)

Dump array of map column of a spark dataframe into csv file

I have the following spark dataframe and its corresponding schema
+----+--------------------+
|name| subject_list|
+----+--------------------+
| Tom|[[Math -> 99], [P...|
| Amy| [[Physics -> 77]]|
+----+--------------------+
root
|-- name: string (nullable = true)
|-- subject_list: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: integer (valueContainsNull = false)
How can I dump this dataframe into a csv file separated by "\t", as follows:
Tom [(Math, 99), (Physics, 88)]
Amy [(Physics, 77)]
Here's a link to a similar post, but it is for dumping an array of strings, not an array of maps.
I'd appreciate any help, thanks.
The reason it throws an error, along with other details, is explained in the link you shared. Here is a modified version of stringify for an array of maps:
def stringify = udf((vs: Seq[Map[String, Int]]) => vs match {
case null => null
case x => "[" + x.flatMap(_.toList).mkString(",") + "]"
})
credits: link
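The formatting logic can be tried outside Spark first. The following is only a sketch in plain Scala (SubjectFormat is a hypothetical name) that renders an array of maps with the ", " separators shown in the question; the same function body could then be wrapped in udf(...).

```scala
object SubjectFormat {
  // Renders an array-of-maps value as "[(k, v), (k, v)]"; a null input stays null.
  def stringify(vs: Seq[Map[String, Int]]): String =
    if (vs == null) null
    else
      vs.flatMap(_.toList)                       // flatten every map into (key, value) pairs
        .map { case (k, v) => s"($k, $v)" }      // format each pair
        .mkString("[", ", ", "]")                // join with brackets around the whole list
}
```

Formatting each pair explicitly, rather than relying on the tuple's toString, is what gives control over the space after the comma.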
You can write a udf to convert the Map to a string in the format you want, like this:
val mapToString = udf((marks: Map[String, String]) => {
marks.map{case (k, v) => (s"(${k},${v})")}.mkString("[",",", "]")
})
dff.withColumn("marks", mapToString($"marks"))
.write.option("delimiter", "\t")
.csv("csvoutput")
Output:
Tom [(Math,99),(Physics,88)]
Amy [(Physics,77)]
But I don't recommend doing this: you will have problems when reading the file back and will have to parse it manually.
It's better to flatten the maps:
dff.select($"name", explode($"marks")).write.csv("csvNewoutput")
which will be stored as
Tom,Math,99
Tom,Physics,88
Amy,Physics,77

function to each row of Spark Dataframe

I have a Spark DataFrame (df) with 2 columns (Report_id and Cluster_number).
I want to apply a function (getClusterInfo) to df which will return the name for each cluster, i.e. if the cluster number is '3', then for a specific report_id the 3 rows below will be written:
{"cluster_id":"1","influencers":[{"screenName":"A"},{"screenName":"B"},{"screenName":"C"},...]}
{"cluster_id":"2","influencers":[{"screenName":"D"},{"screenName":"E"},{"screenName":"F"},...]}
{"cluster_id":"3","influencers":[{"screenName":"G"},{"screenName":"H"},{"screenName":"E"},...]}
I am using foreach on df to apply getClusterInfo function, but can't figure out how to convert o/p to a Dataframe (Report_id, Array[cluster_info]).
Here is the code snippet:
df.foreach(row => {
val report_id = row(0)
val cluster_no = row(1).toString
val cluster_numbers = new Range(0, cluster_no.toInt - 1, 1)
for (cluster <- cluster_numbers.by(1)) {
val cluster_id = report_id + "_" + cluster
//get cluster influencers
val result = getClusterInfo(cluster_id)
println(result.get)
val res : String = result.get.toString()
// TODO ?
}
.. //TODO ?
})
Generally speaking, you shouldn't use foreach when you want to map something into something else; foreach is good for applying functions that only have side effects and return nothing.
In this case, if I got the details right (probably not), you can use a User-Defined Function (UDF) and explode the result:
import org.apache.spark.sql.functions._
import spark.implicits._
// I'm assuming we have these case classes (or similar)
case class Influencer(screenName: String)
case class ClusterInfo(cluster_id: String, influencers: Array[Influencer])
// I'm assuming this method is supplied - with your own implementation
def getClusterInfo(clusterId: String): ClusterInfo =
ClusterInfo(clusterId, Array(Influencer(clusterId)))
// some sample data - assuming both columns are integers:
val df = Seq((222, 3), (333, 4)).toDF("Report_id", "Cluster_number")
// actual solution:
// UDF that returns an array of ClusterInfo;
// Array size is 'clusterNo', creates cluster id for each element and maps it to info
val clusterInfoUdf = udf { (clusterNo: Int, reportId: Int) =>
(1 to clusterNo).map(v => s"${reportId}_$v").map(getClusterInfo)
}
// apply UDF to each record and explode - to create one record per array item
val result = df.select(explode(clusterInfoUdf($"Cluster_number", $"Report_id")))
result.printSchema()
// root
// |-- col: struct (nullable = true)
// | |-- cluster_id: string (nullable = true)
// | |-- influencers: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- screenName: string (nullable = true)
result.show(truncate = false)
// +-----------------------------+
// |col |
// +-----------------------------+
// |[222_1,WrappedArray([222_1])]|
// |[222_2,WrappedArray([222_2])]|
// |[222_3,WrappedArray([222_3])]|
// |[333_1,WrappedArray([333_1])]|
// |[333_2,WrappedArray([333_2])]|
// |[333_3,WrappedArray([333_3])]|
// |[333_4,WrappedArray([333_4])]|
// +-----------------------------+
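The difference between foreach and a mapping operation can also be seen on ordinary Scala collections. The sketch below uses no Spark; ClusterIds and idsFor are hypothetical names, and the rows mirror the sample data above.

```scala
object ClusterIds {
  // Cluster ids for one report: "<reportId>_1" up to "<reportId>_<clusterNo>".
  def idsFor(reportId: Int, clusterNo: Int): Seq[String] =
    (1 to clusterNo).map(v => s"${reportId}_$v")

  // Sample rows as (Report_id, Cluster_number) pairs.
  val rows: Seq[(Int, Int)] = Seq((222, 3), (333, 4))

  // foreach only runs side effects and returns Unit, so nothing can be collected from it.
  val nothing: Unit = rows.foreach { case (r, n) => idsFor(r, n) }

  // flatMap returns a new collection: one element per (report, cluster) pair,
  // which is the plain-collections counterpart of "UDF + explode".
  val exploded: Seq[(Int, String)] =
    rows.flatMap { case (r, n) => idsFor(r, n).map(id => (r, id)) }
}
```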

Spark - recursive function as udf generates an Exception

I am working with DataFrames whose elements have a schema similar to:
root
|-- NPAData: struct (nullable = true)
| |-- NPADetails: struct (nullable = true)
| | |-- location: string (nullable = true)
| | |-- manager: string (nullable = true)
| |-- service: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- serviceName: string (nullable = true)
| | | |-- serviceCode: string (nullable = true)
|-- NPAHeader: struct (nullable = true)
| | |-- npaNumber: string (nullable = true)
| | |-- date: string (nullable = true)
In my DataFrame I want to group all elements which have the same NPAHeader.code, so to do that I am using the following line:
val groupedNpa = orderedNpa.groupBy($"NPAHeader.code" ).agg(collect_list(struct($"NPAData",$"NPAHeader")).as("npa"))
After this I have a dataframe with the following schema:
StructType(StructField(npaNumber,StringType,true), StructField(npa,ArrayType(StructType(StructField(NPAData...)))))
An example of each Row would be something similar to:
[1234,WrappedArray([npaNew,npaOlder,...npaOldest])]
Now I want to generate another DataFrame which picks just one of the elements in the WrappedArray, so I want an output similar to:
[1234,npaNew]
Note: The chosen element from the WrappedArray is the one that matches a complex logic after iterating over the whole WrappedArray. To simplify the question, I will always pick the last element of the WrappedArray (after iterating over all of it).
To do so, I want to define a recursive udf:
import org.apache.spark.sql.functions.udf
def returnRow(elementList : Row)(index:Int): Row = {
val dif = elementList.size - index
val row :Row = dif match{
case 0 => elementList.getAs[Row](index)
case _ => returnRow(elementList)(index + 1)
}
row
}
val returnRow_udf = udf(returnRow _)
groupedNpa.map{row => (row.getAs[String]("npaNumber"),returnRow_udf(groupedNpa("npa")(0)))}
But I am getting the following error in the map:
Exception in thread "main" java.lang.UnsupportedOperationException:
Schema for type Int => Unit is not supported
What am I doing wrong?
As an aside, I am not sure whether I am passing the npa column correctly with groupedNpa("npa"). I am accessing the WrappedArray as a Row because I don't know how to iterate over Array[Row] (the get(index) method is not present in Array[Row]).
TL;DR Just use one of the methods described in How to select the first row of each group?
If you want to use complex logic, and return Row you can skip SQL API and use groupByKey:
val f: (String, Iterator[org.apache.spark.sql.Row]) => Row
val encoder: Encoder
df.groupByKey(_.getAs[String]("NPAHeader.code")).mapGroups(f)(encoder)
or better:
val g: (Row, Row) => Row
df.groupByKey(_.getAs[String]("NPAHeader.code")).reduceGroups(g)
where encoder is a valid RowEncoder (Encoder error while trying to map dataframe row to updated row).
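As a rough illustration of the reduceGroups idea, the same "group, then reduce each group with a binary function" shape can be tried on plain Scala collections (no Spark). The Npa case class, its fields, and the ISO-style date strings are hypothetical stand-ins for the real rows, and "latest date wins" is only an assumed example of the selection logic.

```scala
object LatestPerGroup {
  // Hypothetical flattened record holding just the fields needed here.
  final case class Npa(npaNumber: String, date: String, location: String)

  // Binary "keep the better one" function, the counterpart of g: (Row, Row) => Row.
  def newer(a: Npa, b: Npa): Npa =
    if (a.date.compareTo(b.date) >= 0) a else b

  // Group by key, then reduce each group to a single element.
  def latest(records: Seq[Npa]): Map[String, Npa] =
    records.groupBy(_.npaNumber).map { case (key, group) =>
      key -> group.reduce((a, b) => newer(a, b))
    }
}
```

Because the reduction only ever compares two elements at a time, it does not depend on the order in which the group's elements arrive, which sidesteps the ordering problem described below.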
Your code is faulty in multiple ways:
groupBy doesn't guarantee the order of values. So:
orderBy(...).groupBy(....).agg(collect_list(...))
can have non-deterministic output. If you really decide to go this route you should skip orderBy and sort collected array explicitly.
You cannot pass curried function to udf. You'd have to uncurry it first, but it would require different order of arguments (see example below).
If you could, this might be the correct way to call it (Note that you omit the second argument):
returnRow_udf(groupedNpa("npa")(0))
To make it worse, you call it inside map, where udfs are not applicable at all.
udf cannot return Row. It has to return external Scala type.
External representation for array<struct> is Seq[Row]. You cannot just substitute it with Row.
SQL arrays can be accessed by index with apply:
df.select($"array"(size($"array") - 1))
but it is not a correct method due to non-determinism. You could apply sort_array, but as pointed out at the beginning, there are more efficient solutions.
Surprisingly recursion is not so relevant. You could design function like this:
def size(i: Int=0)(xs: Seq[Any]): Int = xs match {
case Seq() => i
case null => i
case Seq(h, t @ _*) => size(i + 1)(t)
}
val size_ = udf(size() _)
and it would work just fine:
Seq((1, Seq("a", "b", "c"))).toDF("id", "array")
.select(size_($"array"))
although recursion is an overkill, if you can just iterate over Seq.
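For comparison, here is a minimal sketch (plain Scala, no Spark) of an uncurried, tail-recursive version of the "walk to the last element" idea from the question, next to the built-in lastOption. SeqHelpers and lastFrom are hypothetical names.

```scala
import scala.annotation.tailrec

object SeqHelpers {
  // Walks the sequence by index and returns the last element, or None for null/empty input.
  @tailrec
  def lastFrom[A](xs: Seq[A], index: Int = 0): Option[A] =
    if (xs == null || xs.isEmpty) None
    else if (index >= xs.length - 1) Some(xs(xs.length - 1))
    else lastFrom(xs, index + 1)

  // The non-recursive equivalent for non-null input.
  def lastSimple[A](xs: Seq[A]): Option[A] = xs.lastOption
}
```

The recursion here adds nothing over lastOption; it only shows that a recursive helper has to be an ordinary, uncurried function before it can be reused elsewhere.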

splitting contents of a dataframe column using Spark 1.4 for nested json data

I am having issues with splitting contents of a dataframe column using Spark 1.4. The dataframe was created by reading a nested complex json file. I used df.explode but keep getting error message. The json file format is as follows:
[
{
"neid":{ },
"mi":{
"mts":"20100609071500Z",
"gp":"900",
"tMOID":"Aal2Ap",
"mt":[ ],
"mv":[
{
"moid":"ManagedElement=1,TransportNetwork=1,Aal2Sp=1,Aal2Ap=r1552q",
"r":
[
1,
2,
5
]
},
{
"moid":"ManagedElement=1,TransportNetwork=1,Aal2Sp=1,Aal2Ap=r1542q",
"r":
[
1,
2,
5
]
}
]
}
},
{
"neid":{
"neun":"RC003",
"nedn":"SubNetwork=ONRM_RootMo_R,SubNetwork=RC003,MeContext=RC003",
"nesw":"CP90831_R9YC/11"
},
"mi":{
"mts":"20100609071500Z",
"gp":"900",
"tMOID":"PlugInUnit",
"mt":"pmProcessorLoad",
"mv":[
{
"moid":"ManagedElement=1,Equipment=1,Subrack=MS,Slot=6,PlugInUnit=1",
"r":
[ 1, 2, 5
]
},
{
"moid":"ManagedElement=1,Equipment=1,Subrack=ES-1,Slot=1,PlugInUnit=1",
"r":
[ 1, 2, 5
]
}
]
}
}
]
I used following code to load in Spark 1.4
scala> val df = sqlContext.read.json("/Users/xx/target/statsfile.json")
scala> df.show()
+--------------------+--------------------+
| mi| neid|
+--------------------+--------------------+
|[900,["pmEs","pmS...|[SubNetwork=ONRM_...|
|[900,["pmIcmpInEr...|[SubNetwork=ONRM_...|
|[900,pmUnsuccessf...|[SubNetwork=ONRM_...|
|[900,["pmBwErrBlo...|[SubNetwork=ONRM_...|
|[900,["pmSctpStat...|[SubNetwork=ONRM_...|
|[900,["pmLinkInSe...|[SubNetwork=ONRM_...|
|[900,["pmGrFc","p...|[SubNetwork=ONRM_...|
|[900,["pmReceived...|[SubNetwork=ONRM_...|
|[900,["pmIvIma","...|[SubNetwork=ONRM_...|
|[900,["pmEs","pmS...|[SubNetwork=ONRM_...|
|[900,["pmEs","pmS...|[SubNetwork=ONRM_...|
|[900,["pmExisOrig...|[SubNetwork=ONRM_...|
|[900,["pmHDelayVa...|[SubNetwork=ONRM_...|
|[900,["pmReceived...|[SubNetwork=ONRM_...|
|[900,["pmReceived...|[SubNetwork=ONRM_...|
|[900,["pmAverageR...|[SubNetwork=ONRM_...|
|[900,["pmDchFrame...|[SubNetwork=ONRM_...|
|[900,["pmReceived...|[SubNetwork=ONRM_...|
|[900,["pmNegative...|[SubNetwork=ONRM_...|
|[900,["pmUsedTbsQ...|[SubNetwork=ONRM_...|
+--------------------+--------------------+
scala> df.printSchema()
root
|-- mi: struct (nullable = true)
| |-- gp: long (nullable = true)
| |-- mt: string (nullable = true)
| |-- mts: string (nullable = true)
| |-- mv: string (nullable = true)
|-- neid: struct (nullable = true)
| |-- nedn: string (nullable = true)
| |-- nesw: string (nullable = true)
| |-- neun: string (nullable = true)
scala> val df1=df.select("mi.mv").show()
+--------------------+
| mv|
+--------------------+
|[{"r":[0,0,0],"mo...|
|{"r":[0,4,0,4],"m...|
|{"r":5,"moid":"Ma...|
|[{"r":[2147483647...|
|{"r":[225,1112986...|
|[{"r":[83250,0,0,...|
|[{"r":[1,2,529982...|
|[{"r":[26998564,0...|
|[{"r":[0,0,0,0,0,...|
|[{"r":[0,0,0],"mo...|
|[{"r":[0,0,0],"mo...|
|{"r":[0,0,0,0,0,0...|
|{"r":[0,0,1],"moi...|
|{"r":[4587,4587],...|
|[{"r":[180,180],"...|
|[{"r":["0,0,0,0,0...|
|{"r":[0,35101,0,0...|
|[{"r":["0,0,0,0,0...|
|[{"r":[0,1558],"m...|
|[{"r":["7484,4870...|
+--------------------+
scala> df1.explode("mv","mvnew")(mv: String => mv.split(","))
<console>:1: error: ')' expected but '(' found.
df1.explode("mv","mvnew")(mv: String => mv.split(","))
^
<console>:1: error: ';' expected but ')' found.
df1.explode("mv","mvnew")(mv: String => mv.split(","))
Am I doing something wrong? I need to extract the data under mi.mv into separate columns so I can apply some transformations.
I know this is old, but I have a solution that may be useful to someone who is searching for a solution to this problem (as I was). I have been using Spark 1.5 built with Scala 2.10.4.
It appears to just be a format thing. I was replicating all of the errors above and what worked for me was
df1.explode("mv","mvnew"){mv: String => mv.asInstanceOf[String].split(",")}
I don't entirely understand why I need to define mv as a string twice and if anyone would care to explain that, I'd be interested, but this should enable someone to explode a dataframe column.
One more gotcha. If you are splitting on a special character (say a "?") you need to escape it twice. So in the above, splitting on a "?" would give:
df1.explode("mv","mvnew"){mv: String => mv.asInstanceOf[String].split("\\?")}
I hope this helps someone somewhere.
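On the double escaping: String.split takes a regular expression, so "?" has to be escaped once for the regex engine and the backslash once more for the Scala string literal. The sketch below (plain Scala, no Spark; SplitDemo is a hypothetical name) shows this, together with the parenthesised form of a typed lambda parameter. As far as I can tell, the original parse error comes from Scala reading (mv: String => ...) inside plain parentheses as a type ascription on mv, which is why braces or an extra pair of parentheses around the parameter are needed; the asInstanceOf[String] cast should not be necessary.

```scala
import java.util.regex.Pattern

object SplitDemo {
  // Typed lambda inside parentheses: the parameter gets its own parentheses.
  val byComma: Seq[String] =
    Seq("a,b", "c").flatMap((mv: String) => mv.split(","))

  // "?" is a regex metacharacter: "\\?" in source is the regex \? at runtime.
  val byQuestionMark: Seq[String] = "x?y?z".split("\\?").toSeq

  // Pattern.quote treats the delimiter literally and avoids manual escaping.
  val quoted: Seq[String] = "x?y?z".split(Pattern.quote("?")).toSeq
}
```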
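Give the typed parameter its own parentheses, like so:
df1.explode("mv","mvnew")((mv: String) => mv.split(","))
because inside plain parentheses mv: String => ... is parsed as a type ascription on mv rather than as a lambda with a typed parameter. The type cannot simply be dropped, since explode cannot infer it from the column name.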
Update (see comment)
Then you get a different error, because df1 is of type Unit, not DataFrame. You can fix this as follows:
val df1=df.select("mi.mv")
df1.show()
df1.explode...
That's because show() returns a value of type Unit which you previously attempted to run explode on. The above ensures that you run explode on the actual DataFrame.