I'm working with Apache Spark's ALS model, and the recommendForAllUsers method returns a dataframe with the schema
root
|-- user_id: integer (nullable = false)
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- item_id: integer (nullable = true)
| | |-- rating: float (nullable = true)
In practice, the recommendations are a WrappedArray like:
WrappedArray([636958,0.32910484], [995322,0.31974298], [1102140,0.30444127], [1160820,0.27908015], [1208899,0.26943958])
I'm trying to extract just the item_ids and return them as a 1D array. So the above example would be [636958,995322,1102140,1160820,1208899]
This is what's giving me trouble. So far I have:
val numberOfRecs = 20
val userRecs = model.recommendForAllUsers(numberOfRecs).cache()
val strippedScores = userRecs.rdd.map(row => {
val user_id = row.getInt(0)
val recs = row.getAs[Seq[Row]](1)
val item_ids = new Array[Int](numberOfRecs)
recs.toArray.foreach(x => {
item_ids :+ x.get(0)
})
item_ids
})
But this just returns [I#2f318251, and if I get the string value of it via mkString(","), it returns 0,0,0,0,0,0
Any thoughts on how I can extract the item_ids and return them as a separate, 1D array?
Found in the Spark ALSModel docs that recommendForAllUsers returns
"a DataFrame of (userCol: Int, recommendations), where recommendations
are stored as an array of (itemCol: Int, rating: Float) Rows"
(https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.recommendation.ALSModel)
By array, it means WrappedArray, so instead of trying to cast it to Seq[Row], I cast it to mutable.WrappedArray[Row]. I was then able to get each item_id like:
val userRecItems = userRecs.rdd.map(row => {
val user_id = row.getInt(0)
val recs = row.getAs[mutable.WrappedArray[Row]](1)
for (rec <- recs) {
val item_id = rec.getInt(0)
userRecommendations += item_id
}
})
where userRecommendations was a mutable ArrayBuffer
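For reference, a purely functional sketch that returns the ids from the map itself (assuming the same userRecs DataFrame as above) avoids relying on a shared ArrayBuffer inside the RDD closure:
import org.apache.spark.sql.Row
// Map each row to (user_id, Array(item_id, ...)) directly;
// a WrappedArray is a Seq, so getAs[Seq[Row]] works here too.
val userRecItems = userRecs.rdd.map { row =>
  val userId = row.getInt(0)
  val itemIds = row.getAs[Seq[Row]](1).map(_.getInt(0)).toArray
  (userId, itemIds)
}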
You can use a fully qualified name to access a structure element in the array:
scala> case class Recommendation(item_id: Int, rating: Float)
defined class Recommendation
scala> val userReqs = Seq(Array(Recommendation(636958,0.32910484f), Recommendation(995322,0.31974298f), Recommendation(1102140,0.30444127f), Recommendation(1160820,0.27908015f), Recommendation(1208899,0.26943958f))).toDF
userReqs: org.apache.spark.sql.DataFrame = [value: array<struct<item_id:int,rating:float>>]
scala> userReqs.printSchema
root
|-- value: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- item_id: integer (nullable = false)
| | |-- rating: float (nullable = false)
scala> userReqs.select("value.item_id").show(false)
+-------------------------------------------+
|item_id |
+-------------------------------------------+
|[636958, 995322, 1102140, 1160820, 1208899]|
+-------------------------------------------+
scala> val ids = userReqs.select("value.item_id").collect().flatMap(_.getAs[Seq[Int]](0))
ids: Array[Int] = Array(636958, 995322, 1102140, 1160820, 1208899)
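Applied to the ALS output from the question (assuming the user_id and recommendations column names from its schema), the same trick gives one array of item ids per user:
import org.apache.spark.sql.functions.col
// Selecting a nested field of an array<struct> column yields an array of that field.
val userItemIds = userRecs.select(col("user_id"), col("recommendations.item_id").as("item_ids"))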
I have 5 queries like below:
select * from table1
select * from table2
select * from table3
select * from table4
select * from table5
Now, what I want is to execute these queries sequentially and then keep appending the output to a single JSON file, but my code stores the output of each query in different part files instead of one.
Below is my code:
def store(jobEntity: JobDetails, jobRunId: Int): Unit = {
UDFUtil.registerUdfFunctions()
var outputTableName: String = null
val jobQueryMap = jobEntity.jobQueryList.map(jobQuery => (jobQuery.sequenceId, jobQuery))
val sortedQueries = scala.collection.immutable.TreeMap(jobQueryMap.toSeq: _*).toMap
LOGGER.debug("sortedQueries ===>" + sortedQueries)
try {
outputTableName = jobEntity.destinationEntity
var resultDF: DataFrame = null
sortedQueries.values.foreach(jobQuery => {
LOGGER.debug(s"jobQuery.query ===> ${jobQuery.query}")
resultDF = SparkSession.builder.getOrCreate.sqlContext.sql(jobQuery.query)
if (jobQuery.partitionColumn != null && !jobQuery.partitionColumn.trim.isEmpty) {
resultDF = resultDF.repartition(jobQuery.partitionColumn.split(",").map(col): _*)
}
if (jobQuery.isKeepInMemory) {
resultDF = resultDF.persist(StorageLevel.MEMORY_AND_DISK_SER)
}
if (jobQuery.isCheckpointEnabled) {
val checkpointDir = ApplicationConfig.getAppConfig(JobConstants.CHECKPOINT_DIR)
val fs = FileSystem.get(new Storage(JsonUtil.toMap[String](jobEntity.sourceConnection)).asHadoopConfig())
val path = new Path(checkpointDir)
if (!fs.exists(path)) {
fs.mkdirs(path)
}
resultDF.explain(true)
SparkSession.builder.getOrCreate.sparkContext.setCheckpointDir(checkpointDir)
resultDF = resultDF.checkpoint
}
resultDF = {
if (jobQuery.isBroadCast) {
import org.apache.spark.sql.functions.broadcast
broadcast(resultDF)
} else
resultDF
}
tempViewsList.+=(jobQuery.queryAliasName)
resultDF.createOrReplaceTempView(jobQuery.queryAliasName)
// resultDF.explain(true)
val map: Map[String, String] = JsonUtil.toMap[String](jobEntity.sinkConnection)
LOGGER.debug("sink details :: " + map)
if (resultDF != null && !resultDF.take(1).isEmpty) {
resultDF.show(false)
val sinkDetails = new Storage(JsonUtil.toMap[String](jobEntity.sinkConnection))
val path = sinkDetails.basePath + File.separator + jobEntity.destinationEntity
println("path::: " + path)
resultDF.repartition(1).write.mode(SaveMode.Append).json(path)
}
}
)
Just ignore the other things (checkpointing, logging, auditing) that I am doing in this method along with the reading and writing.
Use the below example as a reference for your problem.
I have three tables with Json data (with different schema) as below:
table1 --> Personal Data Table
table2 --> Company Data Table
table3 --> Salary Data Table
I am reading these three tables one by one in sequential mode, as per your requirement, and doing a few transformations over the data (exploding the JSON array column) with the help of the list TableColList, which holds each array column name paired with its table using a semicolon (":") separator.
OutDFList is the list of all transformed DataFrames.
At the end, I am reducing all DataFrames from OutDFList into a single dataframe and writing it into one JSON file.
Note: I have used join to reduce all the DataFrames. You can also use
union (if they have the same columns), or whatever fits your requirement.
Check below code:
scala> spark.sql("select * from table1").printSchema
root
|-- Personal: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- DOB: string (nullable = true)
| | |-- EmpID: string (nullable = true)
| | |-- Name: string (nullable = true)
scala> spark.sql("select * from table2").printSchema
root
|-- Company: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- EmpID: string (nullable = true)
| | |-- JoinDate: string (nullable = true)
| | |-- Project: string (nullable = true)
scala> spark.sql("select * from table3").printSchema
root
|-- Salary: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- EmpID: string (nullable = true)
| | |-- Monthly: string (nullable = true)
| | |-- Yearly: string (nullable = true)
scala> val TableColList = List("table1:Personal", "table2:Company", "table3:Salary")
TableColList: List[String] = List(table1:Personal, table2:Company, table3:Salary)
scala> val OutDFList = TableColList.map{ X =>
| val table = X.split(":")(0)
| val arrayColumn = X.split(":")(1)
| val df = spark.sql(s"""SELECT * FROM """ + table).select(explode(col(arrayColumn)) as "data").select("data.*")
| df}
OutDFList: List[org.apache.spark.sql.DataFrame] = List([DOB: string, EmpID: string ... 1 more field], [EmpID: string, JoinDate: string ... 1 more field], [EmpID: string, Monthly: string ... 1 more field])
scala> val FinalOutDF = OutDFList.reduce((df1, df2) => df1.join(df2, "EmpID"))
FinalOutDF: org.apache.spark.sql.DataFrame = [EmpID: string, DOB: string ... 5 more fields]
scala> FinalOutDF.printSchema
root
|-- EmpID: string (nullable = true)
|-- DOB: string (nullable = true)
|-- Name: string (nullable = true)
|-- JoinDate: string (nullable = true)
|-- Project: string (nullable = true)
|-- Monthly: string (nullable = true)
|-- Yearly: string (nullable = true)
scala> FinalOutDF.write.json("/FinalJsonOut")
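Since the goal is a single JSON file, you may also want to collapse to one partition before writing; a minimal tweak to the last line (keeping the same output path, and using append mode as in your requirement):
FinalOutDF.coalesce(1).write.mode("append").json("/FinalJsonOut")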
First things first, you need to union all the schemas:
import org.apache.spark.sql.functions._
val df1 = sc.parallelize(List(
(42, 11),
(43, 21)
)).toDF("foo", "bar")
val df2 = sc.parallelize(List(
(44, true, 1.0),
(45, false, 3.0)
)).toDF("foo", "foo0", "foo1")
val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union
def expr(myCols: Set[String], allCols: Set[String]) = {
allCols.toList.map(x => x match {
case x if myCols.contains(x) => col(x)
case _ => lit(null).as(x)
})
}
val unionedDF = df1.select(expr(cols1, total):_*).unionAll(df2.select(expr(cols2, total):_*))
unionedDF.show()
And, of course, save to a single JSON file:
df.coalesce(1).write.mode("append").json("/some/path")
UPD
If you are not using DataFrames, just go with plain SQL queries (writing to a single file remains the same: coalesce(1) or repartition(1)):
spark.sql(
"""
|SELECT id, name
|FROM (
| SELECT first.id, first.name FROM first
| UNION
| SELECT second.id, second.name FROM second
| ORDER BY second.name
| ) t
""".stripMargin).show()
I have a dataframe:
+--------------------+------+
|people |person|
+--------------------+------+
|[[jack, jill, hero]]|joker |
+--------------------+------+
It's schema:
root
|-- people: struct (nullable = true)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
Here, root--person is a string. So, I can update this field using udf as:
def updateString = udf((s: String) => {
"Mr. " + s
})
df.withColumn("person", updateString(col("person"))).select("person").show(false)
output:
+---------+
|person |
+---------+
|Mr. joker|
+---------+
I want to do same operation on root--people--person column which contains array of person. How to achieve this using udf?
def updateArray = udf((arr: Seq[Row]) => ???)
df.withColumn("people", updateArray(col("people.person"))).select("people").show(false)
expected:
+------------------------------+
|people |
+------------------------------+
|[Mr. hero, Mr. jack, Mr. jill]|
+------------------------------+
Edit: I also want to preserve its schema after updating root--people--person.
Expected schema of people:
df.select("people").printSchema()
root
|-- people: struct (nullable = false)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
Thanks,
The problem here is that people is a struct with only 1 field. In your UDF, you need to return a Tuple1 and then further cast the output of your UDF to keep the names correct:
def updateArray = udf((r: Row) => Tuple1(r.getAs[Seq[String]](0).map(x=>"Mr."+x)))
val newDF = df
.withColumn("people",updateArray($"people").cast("struct<person:array<string>>"))
newDF.printSchema()
newDF.show()
gives
root
|-- people: struct (nullable = true)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
+--------------------+------+
| people|person|
+--------------------+------+
|[[Mr.jack, Mr.jil...| joker|
+--------------------+------+
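On Spark 3.0+ you could also skip the UDF entirely and rebuild the struct with the transform higher-order function; a minimal sketch, assuming the same df as above:
import org.apache.spark.sql.functions.{col, concat, lit, struct, transform}
// Rebuild the people struct from the transformed inner array;
// .as("person") keeps the original field name, so the schema is preserved.
val newDF = df.withColumn(
  "people",
  struct(transform(col("people.person"), p => concat(lit("Mr. "), p)).as("person"))
)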
For your case, you just need to update your function and everything else remains the same.
here is the code snippet.
scala> df2.show
+------+------------------+
|people| person|
+------+------------------+
| joker|[jack, jill, hero]|
+------+------------------+
// just the column order is changed
I just updated your function: instead of using Row, I am using Seq[String] here.
scala> def updateArray = udf((arr: Seq[String]) => arr.map(x=>"Mr."+x))
scala> df2.withColumn("test",updateArray($"person")).show(false)
+------+------------------+---------------------------+
|people|person |test |
+------+------------------+---------------------------+
|joker |[jack, jill, hero]|[Mr.jack, Mr.jill, Mr.hero]|
+------+------------------+---------------------------+
// kept all the columns for testing purposes; you could drop them if you don't want them.
Let me know if you want to know more.
Let's create data for testing
scala> val data = Seq((List(Array("ja", "ji", "he")), "person")).toDF("people", "person")
data: org.apache.spark.sql.DataFrame = [people: array<array<string>>, person: string]
scala> data.printSchema
root
|-- people: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
Create a UDF for our requirements:
scala> def arrayConcat(array:Seq[Seq[String]], str: String) = array.map(_.map(str + _))
arrayConcat: (array: Seq[Seq[String]], str: String)Seq[Seq[String]]
scala> val arrayConcatUDF = udf(arrayConcat _)
arrayConcatUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(ArrayType(StringType,true),true), StringType)))
Applying the UDF:
scala> data.withColumn("dasd", arrayConcatUDF($"people", lit("Mr."))).show(false)
+--------------------------+------+-----------------------------------+
|people |person|dasd |
+--------------------------+------+-----------------------------------+
|[WrappedArray(ja, ji, he)]|person|[WrappedArray(Mr.ja, Mr.ji, Mr.he)]|
+--------------------------+------+-----------------------------------+
You may need to tweak it a bit (though I think hardly any tweaking is required), but this contains most of what you need to solve your problem.
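For example, to overwrite the original column instead of adding a new one (same data and UDF, with org.apache.spark.sql.functions._ in scope for col and lit):
val out = data.withColumn("people", arrayConcatUDF(col("people"), lit("Mr.")))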
I have the following schema that I read from csv:
val PersonSchema = StructType(Array(StructField("PersonID",StringType,true), StructField("Name",StringType,true)))
val AddressSchema = StructType(Array(StructField("PersonID",StringType,true), StructField("StreetNumber",StringType,true), StructField("StreetName",StringType,true)))
One person can have multiple addresses and is related through PersonID.
Can someone help transform the records to a PersonAddress records as in the following case class definition?
case class Address(StreetNumber:String, StreetName:String)
case class PersonAddress(PersonID:String, Name:String, Addresses:Array[Address])
I have tried the following, but it is giving an exception in the last step:
val results = personData.join(addressData, Seq("PersonID"), "left_outer").groupBy("PersonID","Name").agg(collect_list(struct("StreetNumber","StreetName")) as "Addresses")
val personAddresses = results.map(data => PersonAddress(data.getAs("PersonID"), data.getAs("Name"), data.getAs("Addresses")))
personAddresses.show
Gives an error:
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to $line26.$read$$iw$$iw$Address
The easiest solution in this situation would be to use a UDF. First, collect the street numbers and names as two separate lists, then use the UDF to convert everything into a dataframe of PersonAddress.
val convertToCase = udf((id: String, name: String, streetName: Seq[String], streetNumber: Seq[String]) => {
val addresses = streetNumber.zip(streetName)
PersonAddress(id, name, addresses.map(t => Address(t._1, t._2)).toArray)
})
val results = personData.join(addressData, Seq("PersonID"), "left_outer")
.groupBy("PersonID","Name")
.agg(collect_list($"StreetNumber").as("StreetNumbers"),
collect_list($"StreetName").as("StreetNames"))
val personAddresses = results.select(convertToCase($"PersonID", $"Name", $"StreetNumbers", $"StreetNames").as("Person"))
This will give you a schema as below.
root
|-- Person: struct (nullable = true)
| |-- PersonID: string (nullable = true)
| |-- Name: string (nullable = true)
| |-- Addresses: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- StreetNumber: string (nullable = true)
| | | |-- StreetName: string (nullable = true)
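If you would rather avoid the UDF, a minimal alternative sketch (assuming Spark 2.x or later with import spark.implicits._ and the case classes above defined outside the method) is to name the aggregated struct fields to match Address and let the encoder do the conversion:
import org.apache.spark.sql.functions.{col, collect_list, struct}
// The struct fields default to the column names, which already match Address,
// so .as[PersonAddress] can map the rows without an explicit UDF.
val personAddressDS = personData
  .join(addressData, Seq("PersonID"), "left_outer")
  .groupBy("PersonID", "Name")
  .agg(collect_list(struct(col("StreetNumber"), col("StreetName"))).as("Addresses"))
  .as[PersonAddress]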
I have a UDF that converts a Map (in this case String -> String) to an Array of Struct using the Scala built-in toArray function
val toArray = udf((vs: Map[String, String]) => vs.toArray)
The field names of structs are _1 and _2.
How can I change the UDF definition such that field (key) name was "key" and value name "value" as part of the UDF definition?
[{"_1":"aKey","_2":"aValue"}]
to
[{"key":"aKey","value":"aValue"}]
You can use a case class:
case class KV(key:String, value: String)
val toArray = udf((vs: Map[String, String]) => vs.map {
case (k, v) => KV(k, v)
}.toArray )
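A quick usage sketch (hypothetical column name "props", assuming spark.implicits._ is in scope for toDF):
import org.apache.spark.sql.functions.col
val df = Seq(Map("aKey" -> "aValue", "bKey" -> "bValue")).toDF("props")
val out = df.withColumn("props", toArray(col("props")))
// The struct fields are now named "key" and "value", after the KV case class fields.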
Spark 3.0+
map_entries($"col_name")
This converts a map to an array of struct with struct field names key and value.
Example:
val df = Seq((Map("aKey"->"aValue", "bKey"->"bValue"))).toDF("col_name")
val df2 = df.withColumn("col_name", map_entries($"col_name"))
df2.printSchema()
// root
// |-- col_name: array (nullable = true)
// | |-- element: struct (containsNull = false)
// | | |-- key: string (nullable = false)
// | | |-- value: string (nullable = true)
For custom field names, just cast to a new column schema:
val new_schema = "array<struct<k2:string,v2:string>>"
val df2 = df.withColumn("col_name", map_entries($"col_name").cast(new_schema))
df2.printSchema()
// root
// |-- col_name: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- k2: string (nullable = true)
// | | |-- v2: string (nullable = true)
I'm working through a Databricks example. The schema for the dataframe looks like:
> parquetDF.printSchema
root
|-- department: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
|-- employees: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- firstName: string (nullable = true)
| | |-- lastName: string (nullable = true)
| | |-- email: string (nullable = true)
| | |-- salary: integer (nullable = true)
In the example, they show how to explode the employees column into 4 additional columns:
val explodeDF = parquetDF.explode($"employees") {
case Row(employee: Seq[Row]) => employee.map{ employee =>
val firstName = employee(0).asInstanceOf[String]
val lastName = employee(1).asInstanceOf[String]
val email = employee(2).asInstanceOf[String]
val salary = employee(3).asInstanceOf[Int]
Employee(firstName, lastName, email, salary)
}
}.cache()
display(explodeDF)
How would I do something similar with the department column (i.e. add two additional columns to the dataframe called "id" and "name")? The methods aren't exactly the same, and I can only figure out how to create a brand new data frame using:
val explodeDF = parquetDF.select("department.id","department.name")
display(explodeDF)
If I try:
val explodeDF = parquetDF.explode($"department") {
case Row(dept: Seq[String]) => dept.map{dept =>
val id = dept(0)
val name = dept(1)
}
}.cache()
display(explodeDF)
I get the warning and error:
<console>:38: warning: non-variable type argument String in type pattern Seq[String] is unchecked since it is eliminated by erasure
case Row(dept: Seq[String]) => dept.map{dept =>
^
<console>:37: error: inferred type arguments [Unit] do not conform to method explode's type parameter bounds [A <: Product]
val explodeDF = parquetDF.explode($"department") {
^
In my opinion, the most elegant solution is to star-expand a struct using a select operator, as shown below:
var explodedDf2 = explodedDf.select("department.*","*")
https://docs.databricks.com/spark/latest/spark-sql/complex-types.html
You could use something like this:
var explodeDeptDF = explodeDF.withColumn("id", explodeDF("department.id"))
explodeDeptDF = explodeDeptDF.withColumn("name", explodeDeptDF("department.name"))
which you helped me arrive at, along with these questions:
Flattening Rows in Spark
Spark 1.4.1 DataFrame explode list of JSON objects
This seems to work (though maybe not the most elegant solution).
var explodeDF2 = explodeDF.withColumn("id", explodeDF("department.id"))
explodeDF2 = explodeDF2.withColumn("name", explodeDF2("department.name"))