Schema of dataframe df10
root
|-- ID: string (nullable = true)
|-- KEY: array (nullable = true)
| |-- element: string (containsNull = true)
Code
val gid1 = 505
val array1: Array[String] = Array("atm_P3", "fee_P6", "c_P8", "card_P4", "iss_P5", "vat_P7")
//simplistic udf
val isSubsetArrayUDF = udf { a: Seq[String] => if (a.forall(array1.contains)) gid1 else 0 }
val df11 = df10.withColumn("is_subset_KEY", isSubsetArrayUDF(col("KEY")))
I need to assign each 'KEY' in df10 a 'GID' using the given maps:
Map(KEY -> WrappedArray(atm_P3, fee_P6, c_P8, card_P4, iss_P5, vat_P7, cif_P1, cif_P2), GID -> 505)
Map(KEY -> WrappedArray(atm_P3, fee_P6, c_P8, card_P4, iss_P5, vat_P7, cif_P2), GID -> 423)
...
How can I achieve this using a UDF?
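One possible approach, sketched under the assumption that the (KEY, GID) rules can be collected on the driver as plain Scala data (the names rules and assignGid below are hypothetical):

import org.apache.spark.sql.functions.{col, udf}

// Hypothetical driver-side representation of the maps above: (allowed KEY set, GID)
val rules: Seq[(Set[String], Int)] = Seq(
  (Set("atm_P3", "fee_P6", "c_P8", "card_P4", "iss_P5", "vat_P7", "cif_P1", "cif_P2"), 505),
  (Set("atm_P3", "fee_P6", "c_P8", "card_P4", "iss_P5", "vat_P7", "cif_P2"), 423)
)

// Assign the GID of the first rule whose key set covers every element of KEY, or 0 if none does
val assignGid = udf { keys: Seq[String] =>
  rules.collectFirst { case (keySet, gid) if keys.forall(keySet.contains) => gid }.getOrElse(0)
}

val df11 = df10.withColumn("GID", assignGid(col("KEY")))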
Related
I am new to Spark and am trying to call an existing converter method. I have a Spark DataFrame df with two columns. I want to convert these two columns to a List[(Long,Int)]
val df = (spark.read.parquet(s"<file path>")
    .select(explode($"groups") as "groups"))
  .select(explode($"groups.lanes") as "lanes")
  .select($"lanes.lane_path" as "lane_path")
Schema
root
|-- lane_path: struct (nullable = true)
| |-- coordinates: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- diffs: array (nullable = true)
| | |-- element: integer (containsNull = true)
UDF
val generateValues = spark.udf.register("my_udf",
  (laneAttrs: List[(Long, Int)], x: Long, y: Int) => {
    Converter.convert(laneAttrs, x, y)
      .map {
        case coord @ Left(f) => throw new InterruptedException(s"$coord: $f")
        case Right(result)   => Map("a" -> result.a,
                                    "b" -> result.b)
      }
  })
How can I pass "$lane_path.coordinates" and "$lane_path.diffs" as a List[(Long,Int)] to call the generateValues UDF?
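One possible approach is to let the UDF take the two arrays separately and zip them into the List[(Long, Int)] itself. A minimal sketch, assuming Converter.convert keeps the signature shown above and that x and y are supplied elsewhere (the lit(0L)/lit(0) placeholders below are hypothetical):

import org.apache.spark.sql.functions.{col, lit, udf}

// Zip the coordinates and diffs arrays inside the UDF instead of passing a List of tuples directly
val generateValuesFromArrays = udf { (coordinates: Seq[Long], diffs: Seq[Int], x: Long, y: Int) =>
  val laneAttrs: List[(Long, Int)] = coordinates.zip(diffs).toList
  Converter.convert(laneAttrs, x, y).map {
    case coord @ Left(f) => throw new InterruptedException(s"$coord: $f")
    case Right(result)   => Map("a" -> result.a, "b" -> result.b)
  }
}

val withValues = df.withColumn("values",
  generateValuesFromArrays(col("lane_path.coordinates"), col("lane_path.diffs"), lit(0L), lit(0)))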
I have a dataframe df with the following schema:
|-- type: string (nullable = true)
|-- record_sales: array (nullable = false)
| |-- element: string (containsNull = false)
|-- record_marketing: array (nullable = false)
| |-- element: string (containsNull = false)
and a map
val typemap = Map("sales" -> "record_sales", "marketing" -> "record_marketing")
I want a new column "record" that is either the value of record_sales or record_marketing based on the value of type.
I've tried some variants of this:
val typeMapCol = typedLit(typemap)
val df2 = df.withColumn("record", col(typeMapCol(col("type"))))
But nothing has worked. Does anyone have any idea? Thanks!
You can iterate over the map typemap and use the when function to build case/when expressions depending on the value of the type column:
val recordCol = typemap.map{case (k,v) => when(col("type") === k, col(v))}.toSeq
val df2 = df.withColumn("record", coalesce(recordCol: _*))
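For illustration, a quick run of the answer above on a toy DataFrame (column and map names taken from the question; assumes a SparkSession named spark and spark.implicits._ in scope):

import org.apache.spark.sql.functions.{coalesce, col, when}
import spark.implicits._

val typemap = Map("sales" -> "record_sales", "marketing" -> "record_marketing")

val df = Seq(
  ("sales", Seq("s1", "s2"), Seq("m1")),
  ("marketing", Seq("s3"), Seq("m2", "m3"))
).toDF("type", "record_sales", "record_marketing")

val recordCol = typemap.map { case (k, v) => when(col("type") === k, col(v)) }.toSeq
df.withColumn("record", coalesce(recordCol: _*)).show(false)
// the "record" column holds record_sales for the sales row and record_marketing for the marketing row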
I have a set of dataframes, dfs, with different schemas, for example:
root
|-- A_id: string (nullable = true)
|-- b_cd: string (nullable = true)
|-- c_id: integer (nullable = true)
|-- d_info: struct (nullable = true)
| |-- eid: string (nullable = true)
| |-- oid: string (nullable = true)
|-- l: array (nullable = true)
| |-- m: struct (containsNull = true)
| | |-- n: string (nullable = true)
| | |-- o: string (nullable = true)
..........
I want to check if, for example, "oid" is present in one of the columns (here under the d_info column). How can I search inside the schemas of a set of dataframes and distinguish them? PySpark or Scala suggestions are both helpful. Thank you.
A dictionary/map of [node, root-to-node path] could be created for a DataFrame StructType (including nested StructTypes) using a recursive function.
import org.apache.spark.sql.types.{ArrayType, StructType}

// Recursively record the root-to-node path of every field, including fields nested
// inside structs and inside arrays of structs
def addPaths(schema: StructType, path: String, paths: scala.collection.mutable.Map[String, String]): Unit = {
  for (field <- schema.fields) {
    val _path = s"$path.${field.name}"
    paths += (field.name -> _path)
    field.dataType match {
      case structType: StructType => addPaths(structType, _path, paths)
      case ArrayType(elementType: StructType, _) => addPaths(elementType, _path, paths)
      case _ => // do nothing
    }
  }
}

def searchSchema(schema: StructType, key: String, path: String): String = {
  val paths = scala.collection.mutable.Map[String, String]()
  addPaths(schema, path, paths)
  paths(key)
}

val df = spark.read.json("nested_data.json")
val path = searchSchema(df.schema, "n", "root")
Input and output
Input = {"A_id":"A_id","b_cd":"b_cd","c_id":1,"d_info":{"eid":"eid","oid":"oid"},"l":[{"m":{"n":"n1","o":"01"}},{"m":{"n":"n2","o":"02"}}]}
Output = Map(n -> root.l.m.n, b_cd -> root.b_cd, d_info -> root.d_info, m -> root.l.m, oid -> root.d_info.oid, c_id -> root.c_id, l -> root.l, o -> root.l.m.o, eid -> root.d_info.eid, A_id -> root.A_id)
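To distinguish a whole set of DataFrames, the same helper can be reused per schema; a minimal sketch, assuming the DataFrames are available as a Seq (the names dfs and withOid are hypothetical):

import org.apache.spark.sql.DataFrame

// Hypothetical collection of DataFrames to inspect
val dfs: Seq[DataFrame] = Seq(df /* , ... */)

// Keep only the DataFrames whose schema contains a field named "oid" at any nesting level
val withOid = dfs.filter { d =>
  val paths = scala.collection.mutable.Map[String, String]()
  addPaths(d.schema, "root", paths)
  paths.contains("oid")
}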
I'm working with Apache Spark's ALS model, and the recommendForAllUsers method returns a dataframe with the schema
root
|-- user_id: integer (nullable = false)
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- item_id: integer (nullable = true)
| | |-- rating: float (nullable = true)
In practice, the recommendations are a WrappedArray like:
WrappedArray([636958,0.32910484], [995322,0.31974298], [1102140,0.30444127], [1160820,0.27908015], [1208899,0.26943958])
I'm trying to extract just the item_ids and return them as a 1D array. So the above example would be [636958,995322,1102140,1160820,1208899]
This is what's giving me trouble. So far I have:
val numberOfRecs = 20
val userRecs = model.recommendForAllUsers(numberOfRecs).cache()
val strippedScores = userRecs.rdd.map(row => {
  val user_id = row.getInt(0)
  val recs = row.getAs[Seq[Row]](1)
  val item_ids = new Array[Int](numberOfRecs)
  recs.toArray.foreach(x => {
    item_ids :+ x.get(0)
  })
  item_ids
})
But this just returns [I@2f318251, and if I get the string value of it via mkString(","), it returns 0,0,0,0,0,0
Any thoughts on how I can extract the item_ids and return them as a separate, 1D array?
I found in the Spark ALSModel docs that recommendForAllUsers returns
"a DataFrame of (userCol: Int, recommendations), where recommendations
are stored as an array of (itemCol: Int, rating: Float) Rows"
(https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.recommendation.ALSModel)
By array, it means WrappedArray, so instead of trying to cast it to Seq[Row], I cast it to mutable.WrappedArray[Row]. I was then able to get each item_id like:
val userRecItems = userRecs.rdd.map(row => {
  val user_id = row.getInt(0)
  val recs = row.getAs[mutable.WrappedArray[Row]](1)
  for (rec <- recs) {
    val item_id = rec.getInt(0)
    userRecommendations += item_id
  }
})
where userRecommendations was a mutable ArrayBuffer
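Note that appending to a driver-side ArrayBuffer from inside rdd.map relies on a side effect that does not propagate back from executors on a real cluster; a minimal alternative sketch that instead returns one (user_id, item_ids) pair per row:

import scala.collection.mutable
import org.apache.spark.sql.Row

// One (user_id, item_ids) pair per user, without a shared mutable buffer
val itemIdsPerUser = userRecs.rdd.map { row =>
  val user_id = row.getInt(0)
  val item_ids = row.getAs[mutable.WrappedArray[Row]](1).map(_.getInt(0)).toArray
  (user_id, item_ids)
}
itemIdsPerUser.take(5).foreach { case (u, ids) => println(s"$u -> ${ids.mkString(",")}") }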
You can use a fully qualified name to access a structure element in the array:
scala> case class Recommendation(item_id: Int, rating: Float)
defined class Recommendation
scala> val userReqs = Seq(Array(Recommendation(636958,0.32910484f), Recommendation(995322,0.31974298f), Recommendation(1102140,0.30444127f), Recommendation(1160820,0.27908015f), Recommendation(1208899,0.26943958f))).toDF
userReqs: org.apache.spark.sql.DataFrame = [value: array<struct<item_id:int,rating:float>>]
scala> userReqs.printSchema
root
|-- value: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- item_id: integer (nullable = false)
| | |-- rating: float (nullable = false)
scala> userReqs.select("value.item_id").show(false)
+-------------------------------------------+
|item_id |
+-------------------------------------------+
|[636958, 995322, 1102140, 1160820, 1208899]|
+-------------------------------------------+
scala> val ids = userReqs.select("value.item_id").collect().flatMap(_.getAs[Seq[Int]](0))
ids: Array[Int] = Array(636958, 995322, 1102140, 1160820, 1208899)
I have a UDF that converts a Map (in this case String -> String) to an Array of Struct using the Scala built-in toArray function
val toArray = udf((vs: Map[String, String]) => vs.toArray)
The field names of the structs are _1 and _2.
How can I change the UDF definition so that the key field is named "key" and the value field is named "value"?
[{"_1":"aKey","_2":"aValue"}]
to
[{"key":"aKey","value":"aValue"}]
You can use a case class:
case class KV(key:String, value: String)
val toArray = udf((vs: Map[String, String]) => vs.map {
  case (k, v) => KV(k, v)
}.toArray)
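For illustration, applying it to a toy DataFrame (assumes spark.implicits._ is in scope):

import spark.implicits._

val df = Seq(Map("aKey" -> "aValue")).toDF("map_col")
val result = df.select(toArray($"map_col") as "kvs")
// the "kvs" column now has type array<struct<key:string,value:string>>
result.printSchema()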
Spark 3.0+
map_entries($"col_name")
This converts a map to an array of struct with struct field names key and value.
Example:
val df = Seq((Map("aKey"->"aValue", "bKey"->"bValue"))).toDF("col_name")
val df2 = df.withColumn("col_name", map_entries($"col_name"))
df2.printSchema()
// root
// |-- col_name: array (nullable = true)
// | |-- element: struct (containsNull = false)
// | | |-- key: string (nullable = false)
// | | |-- value: string (nullable = true)
For custom field names, just cast the column to a new schema:
val new_schema = "array<struct<k2:string,v2:string>>"
val df2 = df.withColumn("col_name", map_entries($"col_name").cast(new_schema))
df2.printSchema()
// root
// |-- col_name: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- k2: string (nullable = true)
// | | |-- v2: string (nullable = true)