Need the best way to find object - scala

Hi everyone!
I have 3 classes:
case class Foo(id: String, bars: Option[List[Bar]])
case class Bar(id: String, buzzes: Option[List[Buz]])
case class Buz(id: String, name: String)
And the collection:
val col: Option[List[Foo]]
I need to get:
val search: String = "find me"
val (x: Option[Foo], y: Option[Bar], z: Option[Buz]) = where buz.name == search ???
Help please :)
upd:
i have the json
{
"foos": [{
"id": "...",
"bars": [{
"id": "...",
"buzzes": [{
"id": "...",
"name": "find me"
}]
}]
}]
}
and the name will be unique in the current context.
My first thought was to transform the collection into a list of tuples, like this:
{
(foo)(bar)(buz),
(foo)(bar)(buz),
(foo)(bar)(buz)
}
and filter by buz with name == search
But I don't know how to do it :)

A big problem here is that after digging down far enough to find what you're looking for, the result type is going to reflect that morass of types: Option[List[Option[List...etc.
It's going to be easier (not necessarily better) to save off the find as a side effect.
val bz1 = Buz("bz1", "blah")
val bz2 = Buz("bz2", "target")
val bz3 = Buz("bz3", "bliss")
val br1 = Bar("br1", Option(List(bz1,bz3)))
val br2 = Bar("br2", None)
val br3 = Bar("br3", Option(List(bz1,bz2)))
val fo1 = Foo("fo1", Option(List(br1,br2)))
val fo2 = Foo("fo2", None)
val fo3 = Foo("fo3", Option(List(br2,br3)))
val col: Option[List[Foo]] = Option(List(fo1,fo2,fo3))
import collection.mutable.MutableList

val res: MutableList[(String, String, String)] = MutableList()
col.foreach(_.foreach(f =>
  f.bars.foreach(_.foreach(br =>
    br.buzzes.foreach(_.collect {
      case bz if bz.name == "target" => res += ((f.id, br.id, bz.id))
    })))))
res // res: MutableList[(String, String, String)] = MutableList((fo3,br3,bz2))
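For comparison, here is a side-effect-free sketch over the same sample data: a for-comprehension flattens the Option/List nesting and yields every matching (Foo, Bar, Buz) triple, and headOption picks the first (the name is unique in this context).
val found: Option[(Foo, Bar, Buz)] = (for {
  foos   <- col.toList
  foo    <- foos
  bars   <- foo.bars.toList
  bar    <- bars
  buzzes <- bar.buzzes.toList
  buz    <- buzzes
  if buz.name == "target"
} yield (foo, bar, buz)).headOption
// with the sample data above this yields the fo3 / br3 / bz2 triple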

Related

Raise exception while parsing JSON with the wrong schema on the Optional field

During JSON parsing, I want an exception to be raised when an optional sequence field's schema differs from my case class.
Let me elaborate.
I have the following case class:
case class SimpleFeature(
column: String,
valueType: String,
nullValue: String,
func: Option[String])
case class TaskConfig(
taskInfo: TaskInfo,
inputType: String,
training: Table,
testing: Table,
eval: Table,
splitStrategy: SplitStrategy,
label: Label,
simpleFeatures: Option[List[SimpleFeature]],
model: Model,
evaluation: Evaluation,
output: Output)
And this is part of JSON file I want to point attention to:
"simpleFeatures": [
{
"column": "pcat_id",
"value": "categorical",
"nullValue": "DUMMY"
},
{
"column": "brand_code",
"valueType": "categorical",
"nullValue": "DUMMY"
}
]
As you can see, the first element has an error in its schema ("value" instead of "valueType"), and I want parsing to raise an error. At the same time, I want to keep the optional behavior in case there is no object to parse.
One idea I've been researching for a while is to create a custom serializer and manually check the fields, but I'm not sure I'm on the right track:
object JSONSerializer extends CustomKeySerializer[SimpleFeatures](format => {
case jsonObj: JObject => {
case Some(simplFeatures (jsonObj \ "simpleFeatures")) => {
// Extraction logic goes here
}
}
})
I might be not quite proficient in Scala and json4s so any advice is appreciated.
json4s version: 3.2.10
scala version: 2.11.12
jdk version: 1.8.0
I think you need to extend the CustomSerializer class, since CustomKeySerializer is used to implement custom logic for JSON keys:
import org.json4s.{CustomSerializer, DefaultFormats, MappingException}
import org.json4s.JsonAST._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
case class SimpleFeature(column: String,
                         valueType: String,
                         nullValue: String,
                         func: Option[String])

case class TaskConfig(simpleFeatures: Option[Seq[SimpleFeature]])

object Main extends App {

  implicit val formats = new DefaultFormats {
    override val strictOptionParsing: Boolean = true
  } + new SimpleFeatureSerializer()

  class SimpleFeatureSerializer extends CustomSerializer[SimpleFeature](_ => ({
    case jsonObj: JObject =>
      val requiredKeys = Set[String]("column", "valueType", "nullValue")
      val diff = requiredKeys.diff(jsonObj.values.keySet)
      if (diff.nonEmpty)
        throw new MappingException(s"Fields [${requiredKeys.mkString(",")}] are mandatory. Missing fields: [${diff.mkString(",")}]")
      val column = (jsonObj \ "column").extract[String]
      val valueType = (jsonObj \ "valueType").extract[String]
      val nullValue = (jsonObj \ "nullValue").extract[String]
      val func = (jsonObj \ "func").extract[Option[String]]
      SimpleFeature(column, valueType, nullValue, func)
  }, {
    case sf: SimpleFeature =>
      ("column" -> sf.column) ~
        ("valueType" -> sf.valueType) ~
        ("nullValue" -> sf.nullValue) ~
        ("func" -> sf.func)
  }))

  // case 1: Test single feature
  val singleFeature = """
  {
    "column": "pcat_id",
    "valueType": "categorical",
    "nullValue": "DUMMY"
  }
  """
  val singleFeatureValid = parse(singleFeature).extract[SimpleFeature]
  println(singleFeatureValid)
  // SimpleFeature(pcat_id,categorical,DUMMY,None)

  // case 2: Test task config
  val taskConfig = """{
    "simpleFeatures": [
      {
        "column": "pcat_id",
        "valueType": "categorical",
        "nullValue": "DUMMY"
      },
      {
        "column": "brand_code",
        "valueType": "categorical",
        "nullValue": "DUMMY"
      }]
  }"""
  val taskConfigValid = parse(taskConfig).extract[TaskConfig]
  println(taskConfigValid)
  // TaskConfig(List(SimpleFeature(pcat_id,categorical,DUMMY,None), SimpleFeature(brand_code,categorical,DUMMY,None)))

  // case 3: Invalid json
  val invalidSingleFeature = """
  {
    "column": "pcat_id",
    "value": "categorical",
    "nullValue": "DUMMY"
  }
  """
  val singleFeatureInvalid = parse(invalidSingleFeature).extract[SimpleFeature]
  // throws MappingException
}
Analysis: the main question here is how to gain access to the keys of jsonObj in order to check whether a key is invalid or missing. One way to achieve that is through jsonObj.values.keySet. For the implementation, we first assign the mandatory fields to the requiredKeys variable, then compare requiredKeys with the keys actually present using requiredKeys.diff(jsonObj.values.keySet). If the difference is not empty, a mandatory field is missing, and in that case we throw an exception with the necessary information.
Note 1: we should not forget to add the new serializer to the available formats.
Note 2: we throw an instance of MappingException, which json4s already uses internally when parsing JSON strings.
UPDATE
In order to force validation of Option fields you need to set strictOptionParsing option to true by overriding the corresponding method:
implicit val formats = new DefaultFormats {
override val strictOptionParsing: Boolean = true
} + new SimpleFeatureSerializer()
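As a small usage sketch (assuming the formats, serializer, and the invalidSingleFeature value from the example above are in scope), extraction of a malformed object now fails loudly instead of silently yielding None, so the caller can react to the MappingException:
try {
  parse(invalidSingleFeature).extract[SimpleFeature]
} catch {
  case e: MappingException => println(s"Rejected invalid feature: ${e.getMessage}")
}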
Resources
https://nmatpt.com/blog/2017/01/29/json4s-custom-serializer/
https://danielasfregola.com/2015/08/17/spray-how-to-deserialize-entities-with-json4s/
https://www.programcreek.com/scala/org.json4s.CustomSerializer
You can try using play.api.libs.json
"com.typesafe.play" %% "play-json" % "2.7.2",
"net.liftweb" % "lift-json_2.11" % "2.6.2"
You just need to define the case class and formatters.
Example:
case class Example(a: String, b: String)
implicit val formats: DefaultFormats.type = DefaultFormats
implicit val instancesFormat = Json.format[Example]
and then just do :
Json.parse(jsonData).asOpt[Example]
In case the above gives some errors, try adding "net.liftweb" % "lift-json_2.11" % "2.6.2" to your dependencies too.
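For illustration, here is a hedged play-json sketch (the names below are examples, not taken from the answer above): validate reports which path failed instead of returning None the way asOpt does, which matches the original requirement of failing on a wrong schema.
import play.api.libs.json._

case class SimpleFeature(column: String, valueType: String, nullValue: String, func: Option[String])
implicit val simpleFeatureFormat: OFormat[SimpleFeature] = Json.format[SimpleFeature]

// "value" instead of "valueType", as in the question's first element
Json.parse("""{"column": "pcat_id", "value": "categorical", "nullValue": "DUMMY"}""")
  .validate[SimpleFeature] match {
    case JsSuccess(feature, _) => println(feature)
    case JsError(errors)       => println(s"Schema mismatch: $errors")
  }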

How do I detect if a Spark DataFrame has a column

When I create a DataFrame from a JSON file in Spark SQL, how can I tell if a given column exists before calling .select?
Example JSON schema:
{
"a": {
"b": 1,
"c": 2
}
}
This is what I want to do:
potential_columns = Seq("b", "c", "d")
df = sqlContext.read.json(filename)
potential_columns.map(column => if(df.hasColumn(column)) df.select(s"a.$column"))
but I can't find a good function for hasColumn. The closest I've gotten is to test if the column is in this somewhat awkward array:
scala> df.select("a.*").columns
res17: Array[String] = Array(b, c)
Just assume it exists and let it fail with Try. Plain and simple, and it supports arbitrary nesting:
import scala.util.Try
import org.apache.spark.sql.DataFrame
def hasColumn(df: DataFrame, path: String) = Try(df(path)).isSuccess
val df = sqlContext.read.json(sc.parallelize(
"""{"foo": [{"bar": {"foobar": 3}}]}""" :: Nil))
hasColumn(df, "foobar")
// Boolean = false
hasColumn(df, "foo")
// Boolean = true
hasColumn(df, "foo.bar")
// Boolean = true
hasColumn(df, "foo.bar.foobar")
// Boolean = true
hasColumn(df, "foo.bar.foobaz")
// Boolean = false
Or even simpler:
val columns = Seq(
"foobar", "foo", "foo.bar", "foo.bar.foobar", "foo.bar.foobaz")
columns.flatMap(c => Try(df(c)).toOption)
// Seq[org.apache.spark.sql.Column] = List(
// foo, foo.bar AS bar#12, foo.bar.foobar AS foobar#13)
Python equivalent:
from pyspark.sql.utils import AnalysisException
from pyspark.sql import Row
def has_column(df, col):
try:
df[col]
return True
except AnalysisException:
return False
df = sc.parallelize([Row(foo=[Row(bar=Row(foobar=3))])]).toDF()
has_column(df, "foobar")
## False
has_column(df, "foo")
## True
has_column(df, "foo.bar")
## True
has_column(df, "foo.bar.foobar")
## True
has_column(df, "foo.bar.foobaz")
## False
Another option which I normally use is
df.columns.contains("column-name-to-check")
This returns a boolean
Actually, you don't even need to call select in order to use columns; you can just call it on the DataFrame itself:
// define test data
case class Test(a: Int, b: Int)
val testList = List(Test(1,2), Test(3,4))
val testDF = sqlContext.createDataFrame(testList)
// define the hasColumn function
def hasColumn(df: org.apache.spark.sql.DataFrame, colName: String) = df.columns.contains(colName)
// then you can just use it on the DF with a given column name
hasColumn(testDF, "a") // <-- true
hasColumn(testDF, "c") // <-- false
Alternatively you can define an implicit class using the pimp my library pattern so that the hasColumn method is available on your dataframes directly
implicit class DataFrameImprovements(df: org.apache.spark.sql.DataFrame) {
def hasColumn(colName: String) = df.columns.contains(colName)
}
Then you can use it as:
testDF.hasColumn("a") // <-- true
testDF.hasColumn("c") // <-- false
Try is not optimal, as it will evaluate the expression inside Try before making the decision.
For large data sets, use the below in Scala:
df.schema.fieldNames.contains("column_name")
For those who stumble across this looking for a Python solution, I use:
if 'column_name_to_check' in df.columns:
# do something
When I tried @Jai Prakash's answer of df.columns.contains('column-name-to-check') using Python, I got AttributeError: 'list' object has no attribute 'contains'.
Your other option for this would be to do some array manipulation (in this case an intersect) on the df.columns and your potential_columns.
// Loading some data (so you can just copy & paste right into spark-shell)
case class Document( a: String, b: String, c: String)
val df = sc.parallelize(Seq(Document("a", "b", "c")), 2).toDF
// The columns we want to extract
val potential_columns = Seq("b", "c", "d")
// Get the intersect of the potential columns and the actual columns,
// we turn the array of strings into column objects
// Finally turn the result into a vararg (: _*)
df.select(potential_columns.intersect(df.columns).map(df(_)): _*).show
Alas, this will not work for your inner object scenario above. You will need to look at the schema for that.
I'm going to change your potential_columns to fully qualified column names:
val potential_columns = Seq("a.b", "a.c", "a.d")
// Our object model
case class Document( a: String, b: String, c: String)
case class Document2( a: Document, b: String, c: String)
// And some data...
val df = sc.parallelize(Seq(Document2(Document("a", "b", "c"), "b2", "c2")), 2).toDF
// We go through each of the fields in the schema.
// For StructTypes we return an array of parentName.fieldName
// For everything else we return an array containing just the field name
// We then flatten the complete list of field names
// Then we intersect that with our potential_columns leaving us just a list of column we want
// we turn the array of strings into column objects
// Finally turn the result into a vararg (: _*)
df.select(
  df.schema
    .map(a => a.dataType match {
      case s: org.apache.spark.sql.types.StructType => s.fieldNames.map(x => a.name + "." + x)
      case _                                        => Array(a.name)
    })
    .flatMap(x => x)
    .intersect(potential_columns)
    .map(df(_)): _*
).show
This only goes one level deep, so to make it generic you would have to do more work.
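One way to do that extra work, sketched here on the assumption that only StructType nesting needs handling (the helper name is mine): recursively collect fully qualified field names and reuse the intersect trick at any depth.
import org.apache.spark.sql.types.{StructField, StructType}

// walk the schema and return every fully qualified field name, e.g. "a.b"
def flattenFieldNames(schema: StructType, prefix: String = ""): Seq[String] =
  schema.fields.toSeq.flatMap {
    case StructField(name, nested: StructType, _, _) => flattenFieldNames(nested, s"$prefix$name.")
    case StructField(name, _, _, _)                  => Seq(s"$prefix$name")
  }

df.select(flattenFieldNames(df.schema).intersect(potential_columns).map(df(_)): _*).show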
In PySpark you can simply run
'field' in df.columns
If you shred your JSON using a schema definition when you load it, then you don't need to check for the column. If it's not in the JSON source, it will appear as a null column.
val schemaJson = """
{
"type": "struct",
"fields": [
{
"name": field1
"type": "string",
"nullable": true,
"metadata": {}
},
{
"name": field2
"type": "string",
"nullable": true,
"metadata": {}
}
]
}
"""
import org.apache.spark.sql.types.{DataType, StructType}

val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
val djson = sqlContext.read
  .schema(schema)
  .option("badRecordsPath", readExceptionPath)
  .json(dataPath)
import scala.util.Try

def hasColumn(df: org.apache.spark.sql.DataFrame, colName: String) =
  Try(df.select(colName)).isSuccess
Use the above function to check for the existence of a column, including nested column names.
In PySpark, df.columns gives you a list of columns in the dataframe, so
"colName" in df.columns
would return True or False. Give it a try. Good luck!
For nested columns you can use
df.schema.simpleString().find('column_name')

Scala filter and print a class

So I have defined here a class:
class CsvEntry(val organisation: String, val yearAndQuartal : String, val medKF:Int, val trueOrFalse: Int, val name: String, val money:String){
override def toString = s"$organisation, $yearAndQuartal, $medKF, $trueOrFalse, $name, $money"
}
Let's assume the value of "medKF" is either 2, 4, or 31.
How can I filter all CsvEntries so that I only get output for those where "medKF" is 2 (with foreach println)?
Creating two random entries:
val a = new CsvEntry("s", "11", 12, 0, "asdasd", "1")
val b = new CsvEntry("s", "11", 2, 0, "asdasd", "234,123.23")
Then filter:
List(a,b).withFilter(_.medKF == 2).foreach(println)
To sum those entries:
List(a,b).map(_.medKF).sum
To add the money:
def moneyToCent(money: String): Long = (money.replace(",","").toDouble*100).toLong
List(a,b).withFilter(_.medKF == 2).map(x => moneyToCent(x.money)).sum

Strange result when using squeryl and scala

I'm trying to select the coupled user by getting the correct linkedAccount. The query that is created is correct, but when trying to use a property on dbuser, e.g. dbuser.lastName, I get a compile error since dbuser is not of type User but Query1 size=?. It's probably something really simple, but I can't figure it out since I'm a Scala and Squeryl noob! Why doesn't it return the correct value, and what have I done wrong in my query? Also, saving works without any issues.
User:
class User(
  @Column("id") val id: Long,
  @Column("first_name") val firstName: String,
  @Column("last_name") val lastName: String,
  @Column("email") val email: String,
  @Column("email_validated") val emailValidated: Boolean = false,
  @Column("last_login") val lastLogin: Timestamp = null,
  val created: Timestamp,
  val modified: Timestamp,
  val active: Boolean = false
) extends KeyedEntity[Long] {
  lazy val linkedAccounts: OneToMany[LinkedAccount] = AppDB.usersToLinkedAccounts.left(this)
}
LinkedAccount:
class LinkedAccount(
  @Column("id") val id: Long,
  @Column("user_id") val userId: Long,
  @Column("provider_user_id") val providerUserId: String,
  @Column("salt") val salt: String,
  @Column("provider_key") val providerKey: String) extends KeyedEntity[Long] {
  lazy val user: ManyToOne[User] = AppDB.usersToLinkedAccounts.right(this)
}
AppDB:
object AppDB extends Schema {
  val users = table[User]("users")
  val linkedAccounts = table[LinkedAccount]("linked_account")
  val usersToLinkedAccounts = oneToManyRelation(users, linkedAccounts).via((u, l) => u.id === l.userId)

  def userByLinkedAccount(prodivderKey: String, providerUserId: String) = {
    from(AppDB.users)(u =>
      where(u.id in
        from(AppDB.linkedAccounts)(la =>
          where(la.userId === u.id and la.providerKey === prodivderKey and la.providerUserId === providerUserId)
          select (la.userId)
        )
      )
      select (u)
    )
  }
}
The call:
val dbuser = inTransaction {
val u2 = AppDB.userByLinkedAccount(id.providerId, id.id)
println(u2.statement)
}
println(dbuser.lastName)
The SQL generated:
Select
users10.last_login as users10_last_login,
users10.email as users10_email,
users10.modified as users10_modified,
users10.last_name as users10_last_name,
users10.first_name as users10_first_name,
users10.id as users10_id,
users10.created as users10_created,
users10.email_validated as users10_email_validated,
users10.active as users10_active
From
users users10
Where
(users10.id in ((Select
linked_account13.user_id as linked_account13_user_id
From
linked_account linked_account13
Where
(((linked_account13.user_id = users10.id) and (linked_account13.provider_key = 'facebook')) and (linked_account13.provider_user_id = 'XXXXXXXXXX'))
) ))
BTW, in the documentation for @Column and @ColumnBase it is said:
The preferred way to define column metadata is to not define them (!)
So, you can define columns just as
val id: Long,
instead of
@Column("id") val id: Long
OK, figured it out. I needed to make the call, in this case:
.headOption
I also fixed the query after some tips from Per:
def userByLinkedAccount(providerKey: String, providerUserId: String) = {
  inTransaction {
    from(AppDB.users, AppDB.linkedAccounts)((u, la) =>
      where(u.id === la.userId and la.providerKey === providerKey and la.providerUserId === providerUserId)
      select (u)
    ).headOption
  }
}
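A short usage sketch of the corrected query (reusing the id.providerId and id.id values from the question): headOption makes the result an Option[User], so the user's fields can be read after a simple match.
AppDB.userByLinkedAccount(id.providerId, id.id) match {
  case Some(user) => println(user.lastName)
  case None       => println("no user found for that linked account")
}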

Working with nested maps from a JSON string

Given a JSON string like this:
{"Locations":
{"list":
[
{"description": "some description", "name": "the name", "id": "dev123"},
{"description": "other description", "name": "other name", "id": "dev59"}
]
}
}
I'd like to return a list of "id"s from a function parsing the above string. JSON.parseFull() (from scala.util.parsing.json) gives me a result of type Option[Any]; the Scala REPL shows it as Some(Map(Locations -> Map(list -> List(Map(id -> dev123, ..., and as a beginner in Scala I'm puzzled as to how to approach it.
The Scala API docs suggest "to treat it as a collection or monad and use map, flatMap, filter, or foreach". The top-level element is an Option[Any]; however, it should be a Some holding a Map that contains a single key "Locations", which contains a single key "list", which finally is a List. What would be an idiomatic way in Scala to write a function retrieving the "id"s?
First of all, you should cast the JSON from Any to the right type:
val json = anyJson.asInstanceOf[Option[Map[String,List[Map[String,String]]]]]
And then you may extract ids from Option using map method:
val ids = json.map(_("Locations")("list").map(_("id"))).getOrElse(List())
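Putting the parse and the cast together, a minimal sketch of the whole function (assuming the input always has the Locations/list shape shown in the question):
import scala.util.parsing.json.JSON

def extractIds(jsonStr: String): List[String] =
  JSON.parseFull(jsonStr)
    .map(_.asInstanceOf[Map[String, Map[String, List[Map[String, String]]]]])
    .map(_("Locations")("list").map(_("id")))
    .getOrElse(Nil)
// returns List(dev123, dev59) for the JSON above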
Because Any is everywhere in the returned result, you'll have to cast. Using one of my earlier answers:
class CC[T] { def unapply(a:Any):Option[T] = Some(a.asInstanceOf[T]) }
object M extends CC[Map[String, Any]]
object L extends CC[List[Any]]
object S extends CC[String]
object D extends CC[Double]
object B extends CC[Boolean]
for {
  Some(M(map)) <- List(JSON.parseFull(jsonString))
  M(locMap) = map("Locations")
  L(list) = locMap("list")
  description <- list
  M(desc) = description
  S(id) = desc("id")
} yield id
// res0: List[String] = List(dev123, dev59)
For this type of task, you should take a look at Rapture.io. I'm also a Scala beginner, but from what I've searched for, this seems to have the friendliest syntax. Here's a short example, taken from a gist:
import rapture.io._
// Let's parse some JSON
val src: Json = Json.parse("""
{
"foo": "Hello world",
"bar": {
"baz": 42
}
}
""")
// We can now access the value bar.baz
val x: Json = src.bar.baz
// And get it as an integer
val y: Int = x.get[Int]
// Alternatively, we can use an extractor to get the values we want:
val json""" { "bar": { "baz": $x }, "foo": $z }""" = src
// Now x = 42 and z = "Hello world".
Is this what you need? (using lift-json)
scala> import net.liftweb.json._
import net.liftweb.json._
scala> implicit val formats = DefaultFormats
formats: net.liftweb.json.DefaultFormats.type = net.liftweb.json.DefaultFormats$#79e379
scala> val jsonString = """{"Locations":
{"list":
[
{"description": "some description", "name": "the name", "id": "dev123"},
{"description": "other description", "name": "other name", "id": "dev59"}
]
}
}"""
jsonString: java.lang.String =
{"Locations":
{"list":
[
{"description": "some description", "name": "the name", "id": "dev123"},
{"description": "other description", "name": "other name", "id": "dev59"}
]
}
}
scala> Serialization.read[Map[String, Map[String, List[Map[String, String]]]]](jsonString)
res43: Map[String,Map[String,List[Map[String,String]]]] = Map(Locations -> Map(list -> List(Map(description -> some desc
ription, name -> the name, id -> dev123), Map(description -> other description, name -> other name, id -> dev59))))
scala> val json = parse(jsonString)
json: net.liftweb.json.package.JValue = JObject(List(JField(Locations,JObject(List(JField(list,JArray(List(JObject(List(
JField(description,JString(some description)), JField(name,JString(the name)), JField(id,JString(dev123)))), JObject(Lis
t(JField(description,JString(other description)), JField(name,JString(other name)), JField(id,JString(dev59))))))))))))
scala> json \\ "id"
res44: net.liftweb.json.JsonAST.JValue = JObject(List(JField(id,JString(dev123)), JField(id,JString(dev59))))
scala> compact(render(res44))
res45: String = {"id":"dev123","id":"dev59"}
In a branch of SON of JSON, this will work. Note that I'm not using the parser. Not that it doesn't exist; it's just that creating a JSON object using the builder methods is easier:
scala> import nl.typeset.sonofjson._
import nl.typeset.sonofjson._
scala> var all = obj(
     |   locations = arr(
     |     obj(description = "foo", id = "807"),
     |     obj(description = "bar", id = "23324")
     |   )
     | )
scala> all.locations.map(_.id).as[List[String]]
res2: List[String] = List(23324, 807)
Or use a for comprehension:
scala> (for (location <- all.locations) yield location.id).as[List[String]]
res4: List[String] = List(23324, 807)