Spark Scala flatMap over bson document with subdocument from Mongo - mongodb

I'm new to Spark and Scala. I have a Mongo collection with documents like this:
{
"_id": "doc_1",
"posts": {
"a": { "total": 1 },
"b": { "total": 2 }
}
}
I'm loading this into a Spark RDD like this
val rc = ReadConfig(Map("collection" -> "my_collection"), Some(ReadConfig(sparkSession)))
val rdd = MongoSpark.load(sparkContext, rc)
I would like to use flatMap (or another suitable function) to flatten out the posts subdocuments into a new RDD like this:
|--------|---------|-------|
| doc_id | post_id | total |
|--------|---------|-------|
| doc_1 | a | 1 |
| doc_1 | b | 2 |
| doc_2 | ... | ... |
|--------|---------|-------|
(I'm using an RDD rather than a DataFrame because the documents are large and this appears to use less memory).
The signature of flatMap is flatMap[U](f: (T) => TraversableOnce[U])(implicit arg0: ClassTag[U]): RDD[U]. Each object in the RDD is an org.bson.Document from the Mongo connector, so I want to write something like:
val newRdd = rdd.flatMap( { x: org.bson.Document => { x.posts }})
But this gives:
value posts is not a member of org.bson.Document
I've done a lot of Googling. Seems like this should be simple but I can't figure it out. Can you point me in the right direction?

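It's not JavaScript :) You can only use members that the class actually defines; JavaScript-style dot access to document fields is not allowed.
As far as I can see, Document has a get method that takes the key and the expected class, which you can use. The nested value is itself an org.bson.Document:
val newRdd = rdd.flatMap { x: org.bson.Document =>
  val posts = x.get("posts", classOf[org.bson.Document])
  // do something with posts
}
Where instead of // do something you should put your transformation, which has to return a collection for flatMap to flatten.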
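As a rough sketch of what that transformation could look like, assuming every entry under posts is itself a subdocument with a numeric total field (PostTotal and flattenPosts are names made up for this example, they are not part of the connector):

```scala
import org.bson.Document
import scala.collection.JavaConverters._

// Hypothetical row type for the flattened output.
case class PostTotal(docId: String, postId: String, total: Int)

// Turns one document into zero or more rows, one per entry under "posts".
def flattenPosts(doc: Document): Seq[PostTotal] = {
  val docId = doc.getString("_id")
  // "posts" may be missing; Option(null) becomes None, which yields no rows.
  Option(doc.get("posts", classOf[Document])).toSeq.flatMap { posts =>
    posts.entrySet().asScala.toSeq.map { entry =>
      // Assumption: each value under "posts" is a subdocument.
      val post = entry.getValue.asInstanceOf[Document]
      // Read as Number so both int and double encodings of "total" work.
      PostTotal(docId, entry.getKey, post.get("total", classOf[Number]).intValue())
    }
  }
}

// With the RDD from the question this would be used as:
// val newRdd = rdd.flatMap(flattenPosts)
```

Keeping the per-document logic in a plain function means you can try it on a single Document (for example one built with Document.parse) before running it over the whole collection.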

Related

Write transformation in typesafe way

How do I write the below code in typesafe manner in spark scala with Dataset Api:
val schema: StructType = Encoders.product[CaseClass].schema
//read json from a file
val readAsDataSet: Dataset[CaseClass] = sparkSession.read.option("mode", mode).schema(schema).json(path).as[CaseClass]
//below code needs to be written in type safe way:
val someDF= readAsDataSet.withColumn("col1",explode(col("col_to_be_exploded")))
.select(from_unixtime(col("timestamp").divide(1000))
.as("date"), col("col1"))
As someone in the comments said, you can create a Dataset[CaseClass] and do your operations on there. Let's set it up:
import spark.implicits._
case class MyTest (timestamp: Long, col_explode: Seq[String])
val df = Seq(
MyTest(1673850366000L, Seq("some", "strings", "here")),
MyTest(1271850365998L, Seq("pasta", "with", "cream")),
MyTest(611850366000L, Seq("tasty", "food"))
).toDF("timestamp", "col_explode").as[MyTest]
df.show(false)
+-------------+---------------------+
|timestamp |col_explode |
+-------------+---------------------+
|1673850366000|[some, strings, here]|
|1271850365998|[pasta, with, cream] |
|611850366000 |[tasty, food] |
+-------------+---------------------+
Typically, you can do many operations with the map function and plain Scala.
A map function returns exactly as many elements as the input has. The explode function that you're using, however, can return a different number of elements. You can implement this behaviour using the flatMap function.
So, using plain Scala and the flatMap function together, you can do something like this:
import java.time.LocalDateTime
import java.time.ZoneOffset
case class Exploded (datetime: String, exploded: String)
val output = df.flatMap { case MyTest(timestamp, col_explode) =>
  col_explode.map { value =>
    val date = LocalDateTime.ofEpochSecond(timestamp / 1000, 0, ZoneOffset.UTC).toString
    Exploded(date, value)
  }
}
output.show(false)
+-------------------+--------+
|datetime |exploded|
+-------------------+--------+
|2023-01-16T06:26:06|some |
|2023-01-16T06:26:06|strings |
|2023-01-16T06:26:06|here |
|2010-04-21T11:46:05|pasta |
|2010-04-21T11:46:05|with |
|2010-04-21T11:46:05|cream |
|1989-05-22T14:26:06|tasty |
|1989-05-22T14:26:06|food |
+-------------------+--------+
As you see, we've created a second case class called Exploded which we use to type our output dataset. Our output dataset has the following type: org.apache.spark.sql.Dataset[Exploded] so everything is completely type safe.
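One way to keep this easy to check is to pull the per-row logic out into a plain function, so it can be tried without a Spark session. This is only a sketch; explodeRow is a name invented here, and the case classes are repeated so the snippet stands alone:

```scala
import java.time.{LocalDateTime, ZoneOffset}

case class MyTest(timestamp: Long, col_explode: Seq[String])
case class Exploded(datetime: String, exploded: String)

// One input row becomes one output row per element of col_explode.
def explodeRow(row: MyTest): Seq[Exploded] = {
  // The timestamp is in milliseconds, ofEpochSecond expects seconds.
  val date = LocalDateTime.ofEpochSecond(row.timestamp / 1000, 0, ZoneOffset.UTC).toString
  row.col_explode.map(value => Exploded(date, value))
}

// On plain Scala collections flatMap behaves the same way as on a Dataset:
val rows = Seq(MyTest(1673850366000L, Seq("some", "strings", "here")))
val exploded = rows.flatMap(explodeRow)
exploded.foreach(println)

// With the Dataset from above this would be: df.flatMap(explodeRow)
```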

How to convert datasets from CSV to JSON with nested elements?

I have the following csv file:
LID,Name,age,CID
122,David,29,ECB4
122,Frank,31,ECB4
567,David,29,ECB4
567,Daniel,35,ECB4
I want to group the data first by CID and then by LID, and save the result as JSON with roughly this structure:
{
  "CID": "ECB4",
  "logs": [
    {
      "LID": 122,
      "body": [
        { "name": "David", "age": 29 },
        { "name": "Frank", "age": 31 }
      ]
    },
    {
      "LID": 567,
      "body": [
        { "name": "David", "age": 29 },
        { "name": "Daniel", "age": 35 }
      ]
    }
  ]
}
I have already defined a schema and loaded the data into a dataframe:
sparkSession.sqlContext.read.format("csv")
.option("delimiter",",").schema(someSchema).load("...")
But I have no idea how to group the dataframe in the way I want. The groupBy function returns a RelationalGroupedDataset, which I cannot save as JSON.
A SQL query requires an aggregation after grouping.
I would appreciate any help.
groupBy only defines the groupings, which you can later run aggregations on. To save the result in JSON format you have to define the aggregation that acts on those groupings.
groupBy(col1: String, cols: String*): RelationalGroupedDataset Groups the Dataset using the specified columns, so that we can run aggregation on them.
See RelationalGroupedDataset for all the available aggregate functions.
In other words, you have to run an aggregation through the RelationalGroupedDataset interface, where the most generic operator is agg.
agg(expr: Column, exprs: Column*): DataFrame Compute aggregates by specifying a series of aggregate columns.
If I'm not mistaken (by looking at the output JSON file), you do groupBy to collect the name and age fields per LID.
You should do the following then:
// Load your dataset
val cids = spark.read.option("header", true).csv("cids.csv")
scala> cids.show
+---+------+---+----+
|LID| Name|age| CID|
+---+------+---+----+
|122| David| 29|ECB4|
|122| Frank| 31|ECB4|
|567| David| 29|ECB4|
|567|Daniel| 35|ECB4|
+---+------+---+----+
With the dataset, you first have to combine the name and age columns into a struct, since they take part in the aggregation.
val name_ages = cids.withColumn("name_age", struct("name", "age"))
scala> name_ages.show
+---+------+---+----+-----------+
|LID| Name|age| CID| name_age|
+---+------+---+----+-----------+
|122| David| 29|ECB4| [David,29]|
|122| Frank| 31|ECB4| [Frank,31]|
|567| David| 29|ECB4| [David,29]|
|567|Daniel| 35|ECB4|[Daniel,35]|
+---+------+---+----+-----------+
Now, it should be fairly straightforward.
val logs = name_ages.groupBy("CID", "LID")
.agg(collect_list("name_age") as "logs") // <-- that's the missing piece in the puzzle
scala> logs.show(truncate = false)
+----+---+-------------------------+
|CID |LID|logs |
+----+---+-------------------------+
|ECB4|122|[[David,29], [Frank,31]] |
|ECB4|567|[[David,29], [Daniel,35]]|
+----+---+-------------------------+
Save away...(left as a home exercise :))
Hint: You may want to use struct once more.
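For anyone who wants to check their solution to the exercise, here is one possible sketch. It assumes a local Spark session and builds the sample rows inline instead of reading cids.csv; perLid and perCid are names chosen for this example. The order of elements produced by collect_list is not guaranteed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, struct}

val spark = SparkSession.builder().master("local[*]").appName("csv-to-nested-json").getOrCreate()
import spark.implicits._

// Same rows as the CSV in the question.
val cids = Seq(
  (122, "David", 29, "ECB4"),
  (122, "Frank", 31, "ECB4"),
  (567, "David", 29, "ECB4"),
  (567, "Daniel", 35, "ECB4")
).toDF("LID", "Name", "age", "CID")

// First level: one row per (CID, LID) with the people collected into "body".
val perLid = cids
  .groupBy("CID", "LID")
  .agg(collect_list(struct(col("Name").as("name"), col("age"))).as("body"))

// Second level: struct once more, then one row per CID with "logs".
val perCid = perLid
  .groupBy("CID")
  .agg(collect_list(struct(col("LID"), col("body"))).as("logs"))

// Print the JSON documents; perCid.write.json(outputPath) would save them instead.
perCid.toJSON.collect().foreach(println)
```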

How to convert org.apache.spark.sql.ColumnName to string,Decimal type in Spark Scala?

I have a JSON like below
{"name":"method1","parameter1":"P1name","parameter2": 1.0}
I am loading my JSON file
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("C:/Users/test/Desktop/te.txt")
scala> df.show()
+-------+----------+----------+
| name|parameter1|parameter2|
+-------+----------+----------+
|method1| P1name| 1.0 |
+-------+----------+----------+
I have a function like below:
def method1(P1: String, P2: Double) = {
  print(P1)
  print(P2)
}
I want to call method1 based on the value in the name column. After running the code below, method1 should be executed:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
df.withColumn("methodCalling", when($"name" === "method1", method1($"parameter1",$"parameter2")).otherwise(when($"name" === "method2", method2($"parameter1",$"parameter2")))).show(false)
But I am getting the error below:
<console>:63: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: String
Please let me know how to convert org.apache.spark.sql.ColumnName data type to String
When you pass arguments as
method1($"parameter1",$"parameter2")
you are passing columns to the function, not primitive values. So I would suggest turning method1 and method2 into udf functions if you want to work with primitive data types inside them. A udf also has to return a value for each row of the new column.
import org.apache.spark.sql.functions._
def method1 = udf((P1:String, P2:Double)=>{
print(P1)
print(P2)
P1+P2
})
def method2 = udf((P1:String, P2:Double)=>{
print(P1)
print(P2)
P1+P2
})
Then your withColumn api should work properly
df.withColumn("methodCalling", when($"name" === "method1", method1($"parameter1",$"parameter2")).otherwise(when($"name" === "method2", method2($"parameter1",$"parameter2")))).show(false)
Note: udf functions serialize and deserialize the column data so that it can be processed row by row, which adds complexity and memory usage. Built-in Spark functions should be used wherever possible.
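To illustrate the note: if all the methods do is build a value from the two parameters, the same column can be produced with built-in functions only. This is a minimal sketch assuming a local Spark session and the single sample row from the question; the printing side effect of the udf version is intentionally left out.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, when}

val spark = SparkSession.builder().master("local[*]").appName("no-udf-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("method1", "P1name", 1.0)).toDF("name", "parameter1", "parameter2")

// Same result as P1 + P2 in the udf: string concatenation of both parameters.
val combined = concat(col("parameter1"), col("parameter2").cast("string"))

// Rows whose name matches neither branch get null in methodCalling.
val result = df.withColumn(
  "methodCalling",
  when(col("name") === "method1", combined)
    .when(col("name") === "method2", combined)
)

result.show(false)
```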
You can try like this:
scala> def method1(P1:String, P2:Double): Int = {
| println(P1)
| println(P2)
| 0
| }
scala> def method2(P1:String, P2:Double): Int = {
| println(P1)
| println(P2)
| 1
| }
df.withColumn("methodCalling", when($"name" === "method1", method1(df.select($"parameter1").map(_.getString(0)).collect.head,df.select($"parameter2").map(_.getDouble(0)).collect.head))
.otherwise(when($"name" === "method2", method2(df.select($"parameter1").map(_.getString(0)).collect.head,df.select($"parameter2").map(_.getDouble(0)).collect.head)))).show
//output
P1name
1.0
+-------+----------+----------+-------------+
| name|parameter1|parameter2|methodCalling|
+-------+----------+----------+-------------+
|method1| P1name| 1.0| 0|
+-------+----------+----------+-------------+
You have to return something from your method; otherwise it returns Unit, and you get an error after the result is printed:
java.lang.RuntimeException: Unsupported literal type class scala.runtime.BoxedUnit ()
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:75)
at org.apache.spark.sql.functions$.lit(functions.scala:101)
at org.apache.spark.sql.functions$.when(functions.scala:1245)
... 50 elided
Thanks.
I think you just want to read the JSON and based on that call the methods.
Since you have already created a dataframe, you can do something like :
df.map( row => (row.getString(0), row.getString(1) , row.getDouble(2) ) ).collect
.foreach { x =>
x._1.trim.toLowerCase match {
case "method1" => method1(x._2, x._3)
//case "method2" => method2(x._2, x._3)
//case _ => methodn(x._2, x._3)
}
}
// Output : P1name1.0
// Because you used `print` and not `println` ;)

How to generate multiple records based on column?

I have records like below. I would like to convert a single record into two records with values EXTERNAL and INTERNAL each if the 3rd attribute is All.
Input dataset:
Surender,cts,INTERNAL
Raja,cts,EXTERNAL
Ajay,tcs,All
Expected output:
Surender,cts,INTERNAL
Raja,cts,EXTERNAL
Ajay,tcs,INTERNAL
Ajay,tcs,EXTERNAL
My Spark Code :
case class Customer(name:String,organisation:String,campaign_type:String)
val custRDD = sc.textFile("/user/cloudera/input_files/customer.txt")
val mapRDD = custRDD.map(record => record.split(","))
  .map(arr => (arr(0), arr(1), arr(2)))
  .map(tuple => {
    val name = tuple._1.trim
    val organisation = tuple._2.trim
    val campaign_type = tuple._3.trim.toUpperCase
    Customer(name, organisation, campaign_type)
  })
mapRDD.toDF().registerTempTable("customer_processed")
sqlContext.sql("SELECT * FROM customer_processed").show
Could someone help me fix this issue?
Since it's Scala...
If you want to write more idiomatic Scala code (perhaps trading away some performance, since Spark cannot optimize it as well), you can use the flatMap operator (implicit parameter removed):
flatMap[U](func: (T) ⇒ TraversableOnce[U]): Dataset[U] Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
NOTE: flatMap is equivalent to explode function, but you don't have to register a UDF (as in the other answer).
A solution could be as follows:
// I don't care about the names of the columns since we use Scala
// as you did when you tried to write the code
scala> input.show
+--------+---+--------+
| _c0|_c1| _c2|
+--------+---+--------+
|Surender|cts|INTERNAL|
| Raja|cts|EXTERNAL|
| Ajay|tcs| All|
+--------+---+--------+
val result = input.
  as[(String, String, String)].
  flatMap { case r @ (name, org, campaign) =>
    if ("all".equalsIgnoreCase(campaign)) {
      Seq("INTERNAL", "EXTERNAL").map { cname =>
        (name, org, cname)
      }
    } else Seq(r)
  }
scala> result.show
+--------+---+--------+
| _1| _2| _3|
+--------+---+--------+
|Surender|cts|INTERNAL|
| Raja|cts|EXTERNAL|
| Ajay|tcs|INTERNAL|
| Ajay|tcs|EXTERNAL|
+--------+---+--------+
Comparing the performance of the two queries, i.e. flatMap-based vs explode-based, I think the explode-based one may be slightly faster and better optimized, as more of the code is under Spark's control (it uses logical operators before they get mapped to their physical counterparts). With flatMap the entire optimization is your responsibility as a Scala developer.
In the query plan, the flatMap-based code is the part wrapped in the DeserializeToObject and SerializeFromObject operators, which are very expensive.
What's interesting is the number of Spark jobs per query and their durations. It appears that the explode-based query takes 2 Spark jobs and 200 ms, while the flatMap-based one takes only 1 Spark job and 43 ms.
That surprises me a lot and suggests that the flatMap-based query could be faster (!)
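If you would rather keep the Customer case class from the question than work with tuples, the per-record logic can be sketched as a plain function. The name expand is invented for this example; the snippet runs on ordinary Scala collections, and the same function could be handed to flatMap on an RDD[Customer] or Dataset[Customer].

```scala
case class Customer(name: String, organisation: String, campaign_type: String)

// A record with campaign type "All" (any casing) becomes two records,
// every other record is passed through unchanged.
def expand(c: Customer): Seq[Customer] =
  if ("all".equalsIgnoreCase(c.campaign_type))
    Seq("INTERNAL", "EXTERNAL").map(t => c.copy(campaign_type = t))
  else Seq(c)

val input = Seq(
  Customer("Surender", "cts", "INTERNAL"),
  Customer("Raja", "cts", "EXTERNAL"),
  Customer("Ajay", "tcs", "All")
)

val result = input.flatMap(expand)
result.foreach(println)
```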
You can use a udf to map the campaign_type column to a Seq of campaign types and then explode it:
val campaignType_ : (String => Seq[String]) = {
case s if s == "ALL" => Seq("EXTERNAL", "INTERNAL")
case s => Seq(s)
}
val campaignType = udf(campaignType_)
val df = Seq(("Surender", "cts", "INTERNAL"),
("Raja", "cts", "EXTERNAL"),
("Ajay", "tcs", "ALL"))
.toDF("name", "organisation", "campaign_type")
val step1 = df.withColumn("campaign_type", campaignType($"campaign_type"))
step1.show
// +--------+------------+--------------------+
// | name|organisation| campaign_type|
// +--------+------------+--------------------+
// |Surender| cts| [INTERNAL]|
// | Raja| cts| [EXTERNAL]|
// | Ajay| tcs|[EXTERNAL, INTERNAL]|
// +--------+------------+--------------------+
val step2 = step1.select($"name", $"organisation", explode($"campaign_type"))
step2.show
// +--------+------------+--------+
// | name|organisation| col|
// +--------+------------+--------+
// |Surender| cts|INTERNAL|
// | Raja| cts|EXTERNAL|
// | Ajay| tcs|EXTERNAL|
// | Ajay| tcs|INTERNAL|
// +--------+------------+--------+
EDIT:
You don't actually need a udf; you can use a when().otherwise expression in step1 instead, as follows:
val step1 = df.withColumn("campaign_type",
  // lit(...) is needed here: array("EXTERNAL", "INTERNAL") would look up columns with those names
  when(col("campaign_type") === "ALL", array(lit("EXTERNAL"), lit("INTERNAL")))
    .otherwise(array(col("campaign_type"))))
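Put together, the udf-free variant could look like the sketch below. It assumes a local Spark session and the sample rows from the question, and it upper-cases the value before comparing so that "All" matches as well.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, explode, lit, upper, when}

val spark = SparkSession.builder().master("local[*]").appName("when-explode-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("Surender", "cts", "INTERNAL"),
  ("Raja", "cts", "EXTERNAL"),
  ("Ajay", "tcs", "All")
).toDF("name", "organisation", "campaign_type")

// "ALL" turns into a two-element array, anything else into a one-element array.
val campaigns = when(upper(col("campaign_type")) === "ALL", array(lit("INTERNAL"), lit("EXTERNAL")))
  .otherwise(array(col("campaign_type")))

// explode produces one output row per array element.
val result = df.select(col("name"), col("organisation"), explode(campaigns).as("campaign_type"))

result.show(false)
```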

Spark SQL - Nested array conditional select

I have a Spark SQL question; I'd appreciate some guidance on the best way to do a conditional select from a nested array of structs.
I have an example json document below
```
{
"id":"p1",
"externalIds":[
{"system":"a","id":"1"},
{"system":"b","id":"2"},
{"system":"c","id":"3"}
]
}
```
In spark SQL I want to select the "id" of one of the array structs based on some conditional logic.
e.g for above, select the id field of array sub element that has "system" = "b", namely the id of "2".
How best to do this in SparkSQL?
Cheers and thanks!
Using a UDF, this could look like this, given a DataFrame (all attributes of type String):
+---+---------------------+
|id |externalIds |
+---+---------------------+
|p1 |[[a,1], [b,2], [c,3]]|
+---+---------------------+
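Define a UDF to traverse your array and find the desired element:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
def getExternal(system: String) = {
  // Look the fields up by name so the result does not depend on their order in the struct
  udf((row: Seq[Row]) =>
    row.find(_.getAs[String]("system") == system)
      .map(_.getAs[String]("id"))
  )
}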
and use it like this:
df
.withColumn("external",getExternal("b")($"externalIds"))
.show(false)
+---+---------------------+--------+
|id |externalIds |external|
+---+---------------------+--------+
|p1 |[[a,1], [b,2], [c,3]]|2 |
+---+---------------------+--------+
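If you would rather avoid a UDF, one alternative is to explode the array and filter on the struct field. This is a sketch assuming a local Spark session and the JSON document from the question; note that, unlike the UDF version, rows without a matching element are dropped instead of getting a null.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().master("local[*]").appName("nested-select-sketch").getOrCreate()
import spark.implicits._

val json =
  """{"id":"p1","externalIds":[{"system":"a","id":"1"},{"system":"b","id":"2"},{"system":"c","id":"3"}]}"""
val df = spark.read.json(Seq(json).toDS)

val result = df
  // One row per element of externalIds, kept as a struct column named "ext".
  .select(col("id"), explode(col("externalIds")).as("ext"))
  // Keep only the element for the wanted system.
  .where(col("ext.system") === "b")
  .select(col("id"), col("ext.id").as("external"))

result.show(false)
```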