Non-recursive extraction in Lift JSON for-comprehension - scala

I'm using Lift JSON's for-comprehensions to parse some JSON. The JSON is recursive, so e.g. the field id exists at each level. Here is an example:
val json = """
{
"id": 1
"children": [
{
"id": 2
},
{
"id": 3
}
]
}
"""
The following code
var ids = for {
  JObject(parent) <- parse(json)
  JField("id", JInt(id)) <- parent
} yield id
println(ids)
produces List(1, 2, 3). I was expecting it to produce List(1).
In my program this results in quadratic computation, though I only need linear.
Is it possible to use for-comprehensions to match the top level id fields only?

I haven't delved deep enough to figure out why the default comprehension is recursive, but you can solve this by simply qualifying your search root:
scala> for ( JField( "id", JInt( id ) ) <- parent.children ) yield id
res4: List[BigInt] = List(1)
Note the use of parent.children.
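Another option (not from the answer above, just a sketch) is lift-json's \ selector, which only looks at the top level, unlike the recursive \\. Assuming a version of lift-json where \ returns the field's value (as in json4s):
scala> val JInt(topId) = parse(json) \ "id"
topId: BigInt = 1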

Related

How to efficiently delete subset in spark RDD

While working on a research problem, I'm finding it difficult to delete all the subset entries in a Spark RDD.
The data structure is RDD[(key,set)]. For example, it could be:
RDD[ ("peter",Set(1,2,3)), ("mike",Set(1,3)), ("jack",Set(5)) ]
Since mike's set (Set(1,3)) is a subset of peter's (Set(1,2,3)), I want to delete "mike", ending up with
RDD[ ("peter",Set(1,2,3)), ("jack",Set(5)) ]
This is easy to implement locally in Python with two nested "for" loops. But when I want to scale it out with Scala and Spark, it is not that easy to find a good solution.
Thanks
I doubt we can avoid comparing each element against every other (the equivalent of a double loop in a non-distributed algorithm). The subset relation between sets is not symmetric, meaning that we need to check both whether "alice" is a subset of "bob" and whether "bob" is a subset of "alice".
To do this using the Spark API, we can resort to multiplying the data with itself using a cartesian product and verifying each entry of the resulting matrix:
val data = Seq(("peter", Set(1,2,3)), ("mike", Set(1,3)), ("olga", Set(1,2,3)), ("anne", Set(7)),
               ("jack", Set(5,4,1)), ("lizza", Set(5,1)), ("bart", Set(5,4)), ("maggie", Set(5)))
// expected result from this dataset = peter, olga, anne, jack
val userSet = sparkContext.parallelize(data)
val prod = userSet.cartesian(userSet)
val subsetMembers = prod.collect {
  // keep (name2, set2) when set2 is a strict subset of set1; the (set1 -- set2).nonEmpty
  // check stops equal sets (like peter and olga) from removing each other
  case ((name1, set1), (name2, set2)) if (name1 != name2) && set2.subsetOf(set1) && (set1 -- set2).nonEmpty => (name2, set2)
}
val superset = userSet.subtract(subsetMembers)
// lets see the results:
superset.collect()
// Array[(String, scala.collection.immutable.Set[Int])] = Array((olga,Set(1, 2, 3)), (peter,Set(1, 2, 3)), (anne,Set(7)), (jack,Set(5, 4, 1)))
This can also be achieved with the RDD.fold function.
In this case the output required is a "List" (ItemList) of superset items. For this, the input should also be converted to a "List" (an RDD of ItemList).
import org.apache.spark.rdd.RDD
// type aliases for convenience
type Item = Tuple2[String, Set[Int]]
type ItemList = List[Item]
// Source RDD
val lst:RDD[Item] = sc.parallelize( List( ("peter",Set(1,2,3)), ("mike",Set(1,3)), ("jack",Set(5)) ) )
// Wrap each element in a List. This is needed for using the fold function on the RDD,
// since fold requires its parameters to have the same data type as its result
val listOflst:RDD[ItemList] = lst.map(x => List(x))
// For each element in the second ItemList:
// - if it is not a subset of any element in the first ItemList, add it
// - then remove existing elements that are subsets of the newly added one
def combiner(first: ItemList, second: ItemList): ItemList = {
  def helper(lst: ItemList, i: Item): ItemList = {
    val isSubset: Boolean = lst.exists(x => i._2.subsetOf(x._2))
    if (isSubset) lst else i :: lst.filterNot(x => x._2.subsetOf(i._2))
  }
  second.foldLeft(first)(helper)
}
listOflst.fold(List())(combiner)
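For the sample data above this should leave only the non-subset items, e.g. List((jack,Set(5)), (peter,Set(1, 2, 3))) (the ordering may vary with partitioning).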
You can use filter after a map.
You can build a map function that returns None for the rows you want to delete. First build the function (note this example is PySpark rather than Scala):
def filter_mike(line):
    if line[1] != {1, 3}:
        return line
    else:
        return None
Then you can filter like this:
your_rdd.map(filter_mike).filter(lambda x: x is not None)
This will work.
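Since the question asks for Scala, a minimal Scala/Spark equivalent of this hard-coded approach might be (a sketch, not from the answer above):
your_rdd.filter { case (_, s) => s != Set(1, 3) }
Though, like the Python version, this only removes one known set rather than detecting all subsets.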

Scala on Eclipse gives errors on Map operations

I am trying to write a word count program using Maps in Scala. From various sources on the internet, I gathered that 'contains', adding elements to a Map with '+', and updating existing values are all valid operations. But Eclipse gives me errors when I try to use those operations in my code:
object wc {
  def main(args: Array[String]) = {
    val story = """ Once upon a time there was a poor lady with a son who was lazy
      she was worried how she will grow up and
      survive after she goes """
    count(story.split("\n ,.".toCharArray()))
  }
  def count(s: Array[String]) = {
    var count = scala.collection.mutable.Map
    for (i <- 0 until s.size) {
      if (count.contains(s(i))) {
        count(s(i)) = count(s(i)) + 1
      }
      else count = count + (s(i), 1)
    }
    println(count)
  }
}
Eclipse flags errors on the contains call and on the lines adding and updating entries (error screenshots omitted).
I tried these operations in the REPL and they worked fine without any errors. Any help would be appreciated. Thank you!
You need to instantiate a typed mutable Map (otherwise you're looking for a contains member on the companion object Map.type, which isn't there):
def count(s: Array[String]) = {
  val count = scala.collection.mutable.Map[String, Int]()
  for (i <- 0 until s.size) {
    if (count.contains(s(i))) {
      // count += s(i) -> (count(s(i)) + 1)
      // can be rewritten as
      count(s(i)) += 1
    }
    else count += s(i) -> 1
  }
  println(count)
}
Note: I also fixed up the lines updating count.
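For example (a hypothetical run; mutable Map printing may differ by Scala version):
scala> count(Array("a", "b", "a"))
Map(a -> 2, b -> 1)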
Perhaps this is better written as a groupBy:
scala> val a = List("a", "a", "b", "c", "c", "c")
a: List[String] = List(a, a, b, c, c, c)
scala> a.groupBy({ s: String => s }).mapValues(_.length)
res0: scala.collection.immutable.Map[String,Int] = Map(b -> 1, a -> 2, c -> 3)

Collections in scala, how to get elements in map

From lift-json:
scala> val json = parse("""
{
  "name": "joe",
  "addresses": {
    "address1": {
      "street": "Bulevard",
      "city": "Helsinki"
    },
    "address2": {
      "street": "Soho",
      "city": "London"
    }
  }
}""")
scala> case class Address(street: String, city: String)
scala> case class PersonWithAddresses(name: String, addresses: Map[String, Address])
scala> implicit val formats = net.liftweb.json.DefaultFormats // required for extract
scala> val joe = json.extract[PersonWithAddresses]
joe: PersonWithAddresses = PersonWithAddresses(joe,Map(address1 -> Address(Bulevard,Helsinki), address2 -> Address(Soho,London)))
I want to access elements of joe. For example, I want to know Joe's address1 city. How?
Bonus Question:
what if PersonWithAddresses was
case class PersonWithAddress(name:String, addresses: Map[String, List[Address]])
how would I extract the size of that list?
P.S. question:
what's the difference between joe.addresses("address1").size() and
joe.addresses.get("address1").size ?
Your question doesn't really have anything to do with JSON and Lift. You already have your object; you just don't know how to use Scala collections.
In the case without a list, you can get the city with:
# joe.addresses("address1")
res4: Address = Address("Bulevard", "Helsinki")
# res4.city
res5: String = "Helsinki"
or joe.addresses("address1").city for short.
In the case of a list,
case class PersonWithAddress(name: String, addresses: Map[String, List[Address]])
you just call size on the list:
joe.addresses("address1").size
As for a difference between these two:
# res7.addresses("address1").size
res8: Int = 1
# res7.addresses.get("address1").size
res9: Int = 1
There is a big difference; see what happens when you call get:
# res7.addresses.get("address1")
res10: Option[List[Address]] = Some(List(Address("Bulevard", "Helsinki")))
It returns an Option which could be viewed as a collection of size 0 or 1. Checking its size is not what you want to do.
map.get("key")
returns an Option which is either Some(value) if value is present in map, or None if it's not
map("key") or desugared map.apply("key") returns the item associated with key or exception if element is not present in the map.

How to create a List of Wildcard elements Scala

I'm trying to write a function that returns a list (for querying purposes) that has some wildcard elements:
def createPattern(query: List[(String,String)]) = {
  val l = List[(_,_,_,_,_,_,_)]
  var iter = query
  while (iter != null) {
    val x = iter.head._1 match {
      case "userId" => 0
      case "userName" => 1
      case "email" => 2
      case "userPassword" => 3
      case "creationDate" => 4
      case "lastLoginDate" => 5
      case "removed" => 6
    }
    l(x) = iter.head._2
    iter = iter.tail
  }
  l
}
So, the user enters some query terms as a list. The function parses through these terms and inserts them into val l. The fields that the user doesn't specify are entered as wildcards.
The val l is causing me trouble. Am I going the right route, or is there a better way to do this?
Thanks!
Gosh, where to start. I'd begin by getting an IDE (IntelliJ / Eclipse) which will tell you when you're writing nonsense and why.
Read up on how List works. It's an immutable linked list so your attempts to update by index are very misguided.
Don't use tuples - use case classes.
You shouldn't ever need to use null; I suspect here you mean Nil.
Don't use var and while; use a for-expression, or the relevant higher-order functions (foreach, map, etc.).
Your code doesn't make much sense as it is, but it seems you're trying to return a 7-element list with the second element of each tuple in the input list mapped via a lookup to position in the output list.
To improve it... don't do that. What you're doing (as programmers have done since arrays were invented) is using the index as a crude proxy for a Map from Int to whatever. What you want is an actual Map. I don't know what you want to do with it, but wouldn't it be nicer if it were keyed by these strings themselves, rather than by a number? If so, you can simplify your whole method to
def createPattern(query: List[(String,String)]) = query.toMap
at which point you should realise you probably don't need the method at all, since you can just use toMap at the call site.
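For example (hypothetical input):
scala> List("userId" -> "1", "email" -> "foo@bar.com").toMap
res0: scala.collection.immutable.Map[String,String] = Map(userId -> 1, email -> foo@bar.com)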
If you insist on using an Int index, you could write
def createPattern(query: List[(String,String)]) = {
  def intVal(x: String) = x match {
    case "userId" => 0
    case "userName" => 1
    case "email" => 2
    case "userPassword" => 3
    case "creationDate" => 4
    case "lastLoginDate" => 5
    case "removed" => 6
  }
  val tuples = for ((key, value) <- query) yield (intVal(key), value)
  tuples.toMap
}
Not sure what you want to do with the resulting list, but you can't create a List of wildcards like that.
What do you want to do with the resulting list, and what type should it be?
Here's how you might build something if you wanted the result to be a List[String], and if you wanted wildcards to be "*":
def createPattern(query: List[(String,String)]) = {
  val wildcard = "*"
  def orElseWildcard(key: String) = query.find(_._1 == key).getOrElse(("", wildcard))._2
  orElseWildcard("userId") ::
  orElseWildcard("userName") ::
  orElseWildcard("email") ::
  orElseWildcard("userPassword") ::
  orElseWildcard("creationDate") ::
  orElseWildcard("lastLoginDate") ::
  orElseWildcard("removed") ::
  Nil
}
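For instance (a hypothetical call):
scala> createPattern(List("userId" -> "42", "email" -> "foo@bar.com"))
res0: List[String] = List(42, *, foo@bar.com, *, *, *, *)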
You're not using List, Tuple, iterator, or wild-cards correctly.
I'd take a different approach - maybe something like this:
case class Pattern(valueMap: Map[String,String]) {
  def this(valueList: List[(String,String)]) = this(valueList.toMap)
  val Seq(
    userId, userName, email, userPassword, creationDate,
    lastLoginDate, removed
  ): Seq[Option[String]] = Seq(
    "userId", "userName", "email", "userPassword",
    "creationDate", "lastLoginDate", "removed"
  ).map(valueMap.get(_))
}
Then you can do something like this:
scala> val pattern = new Pattern( List( "userId" -> "Fred" ) )
pattern: Pattern = Pattern(Map(userId -> Fred))
scala> pattern.email
res2: Option[String] = None
scala> pattern.userId
res3: Option[String] = Some(Fred)
Or just use the map directly.

MongoDB+Scala: Accessing deep nested data

I think there should be an easy solution to this, but I wasn't able to find it.
I start by accessing data from MongoDB with the following Scala code:
val search = MongoDBObject("_id" -> new ObjectId("xxx"))
val fields = MongoDBObject("community.member.name" -> 1, "community.member.age" -> 1)
for (res <- mongoColl.find(search, fields)) {
  val memberInfo = res.getAs[BasicDBObject]("community").get
  println(memberInfo)
}
and get a BasicDBObject as result:
{
  "member" : [
    {
      "name" : "John Doe",
      "age" : "32"
    },
    {
      "name" : "Jane Doe",
      "age" : "29"
    },
    ...
  ]
}
I know that I can access values with getAs[String], though this is not working here...
Does anyone have an idea? I've been searching for a solution for several hours...
If you're working with complex MongoDB objects, you can use Salat, which provides simple case class serialization.
Sample with your data:
case class Community(members:Seq[Member], _id: ObjectId = new ObjectId)
case class Member(name:String, age:Int)
val mongoColl: MongoCollection = ??? // your collection goes here
val dao = new SalatDAO[Community, ObjectId](mongoColl) {}
val community = Community(Seq(Member("John Doe", 32), Member("Jane Doe", 29)))
dao.save(community)
for {
  c <- dao.findOneById(community._id)
  m <- c.members
} println("%s (%s)" format (m.name, m.age))
I think you should try
val member = memberInfo.as[MongoDBList]("member").as[BasicDBObject](0)
println(member("name"))
This problem doesn't really have much to do with MongoDB; it's about your data structure. Your JSON/BSON structure contains:
An object community, which includes
an array of members, where
each member has the properties name and age.
Your problem is completely equivalent to the following:
case class Community(members: List[Member])
case class Member(name: String, age: Int)
val a = List(member1, member2)
// a.name does not compile: name is a property defined on a member, not on the list
Yes, you can do this beautifully with comprehensions. You could do the following:
for {
  record <- mongoColl.find(search, fields).toList
  community <- record.getAs[MongoDBObject]("community")
  member <- community.getAs[MongoDBObject]("member")
  name <- member.getAs[String]("name")
} yield name
This would work just to get the name. To get multiple values, I think you would do:
for {
  record <- mongoColl.find(search, fields).toList
  community <- record.getAs[MongoDBObject]("community")
  member <- community.getAs[MongoDBObject]("member")
  field <- List("name", "age")
} yield member.get(field).toString