Spark: querying MongoDB with Stratio and RDD

I am querying MongoDB with Spark using Stratio (0.11.). I am interested in using RDDs (not DataFrames).
What I am doing right now is:
val mongoRDD = new MongodbRDD(sqlContext, readConfig, new MongodbPartitioner(readConfig))
mongoRDD.foreach(println)
and it displays the collection content correctly.
Is there a way to use a query (as a String, or built via QueryBuilder) with Stratio (in my case a $near query) and apply it to the MongodbRDD?

As @zero323 has hinted, the way to do that is using the filters parameter. These filters are checked by the library and matched against the filters available in the MongoDB QueryBuilder.
From Spark-MongoDB source code:
sFilters.foreach {
  case EqualTo(attribute, value) =>
    queryBuilder.put(attribute).is(checkObjectID(attribute, value))
  case GreaterThan(attribute, value) =>
    queryBuilder.put(attribute).greaterThan(checkObjectID(attribute, value))
  case GreaterThanOrEqual(attribute, value) =>
    queryBuilder.put(attribute).greaterThanEquals(checkObjectID(attribute, value))
  case In(attribute, values) =>
    queryBuilder.put(attribute).in(values.map(value => checkObjectID(attribute, value)))
  case LessThan(attribute, value) =>
    queryBuilder.put(attribute).lessThan(checkObjectID(attribute, value))
  case LessThanOrEqual(attribute, value) =>
    queryBuilder.put(attribute).lessThanEquals(checkObjectID(attribute, value))
  case IsNull(attribute) =>
    queryBuilder.put(attribute).is(null)
  case IsNotNull(attribute) =>
    queryBuilder.put(attribute).notEquals(null)
  case And(leftFilter, rightFilter) if !parentFilterIsNot =>
    queryBuilder.and(filtersToDBObject(Array(leftFilter)), filtersToDBObject(Array(rightFilter)))
  case Or(leftFilter, rightFilter) if !parentFilterIsNot =>
    queryBuilder.or(filtersToDBObject(Array(leftFilter)), filtersToDBObject(Array(rightFilter)))
  case StringStartsWith(attribute, value) if !parentFilterIsNot =>
    queryBuilder.put(attribute).regex(Pattern.compile("^" + value + ".*$"))
  case StringEndsWith(attribute, value) if !parentFilterIsNot =>
    queryBuilder.put(attribute).regex(Pattern.compile("^.*" + value + "$"))
  case StringContains(attribute, value) if !parentFilterIsNot =>
    queryBuilder.put(attribute).regex(Pattern.compile(".*" + value + ".*"))
  case Not(filter) =>
    filtersToDBObject(Array(filter), true)
}
As you can see, near is not covered, but it seems like it could easily be added to the connector, since QueryBuilder offers methods for that MongoDB operator.
You can try to modify the connector yourself; in any case, I'll try to implement it and open a PR in the following days.
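For reference, since the MongoDB QueryBuilder already exposes a near method, the missing branch could look roughly like this. This is only a sketch: Near is a hypothetical source filter type that does not exist in the released connector.
case Near(attribute, x, y) =>
  // Hypothetical: delegates to com.mongodb.QueryBuilder#near(x, y)
  queryBuilder.put(attribute).near(x, y)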
EDIT:
A PR has been opened that adds a source filter type describing $near, so you can use MongodbRDD as:
val mongoRDD = new MongodbRDD(
  sqlContext,
  readConfig,
  new MongodbPartitioner(readConfig),
  filters = FilterSection(Array(Near("x", 3.0, 4.0)))
)

Related

Flink Scala case class

I want to know how to replace x._1._2 and x._1._3 with the names of the fields, using a case class:
def keyuid(l: Array[String]): (String, Long, String) = {
  //val l = s.split(",")
  val ip = l(3).split(":")(1)
  val values = Array("", 0, 0, 0)
  val uid = l(1).split(":")(1)
  val timestamp = l(2).split(":")(1).toLong * 1000
  val impression = l(4).split(":")(1)
  return (uid, timestamp, ip)
}
val cli_ip = click.map(_.split(","))
  .map(x => (keyuid(x), 1.0)).assignAscendingTimestamps(x => x._1._2)
  .keyBy(x => x._1._3)
  .timeWindow(Time.seconds(10))
  .sum(1)
Use Scala pattern matching when writing lambda functions, using curly braces and the case keyword:
val cli_ip = click.map(_.split(","))
  .map(x => (keyuid(x), 1.0))
  .assignAscendingTimestamps {
    case ((_, timestamp, _), _) => timestamp
  }
  .keyBy {
    case ((_, _, ip), _) => ip
  }
  .timeWindow(Time.seconds(10))
  .sum(1)
More information on Tuples and their pattern matching syntax here: https://docs.scala-lang.org/tour/tuples.html
Pattern matching is indeed a good idea and makes the code more readable.
To answer your question: to make the keyuid function return a case class, you first need to define it, for instance:
case class Click(uid: String, timestamp: Long, ip: String)
Then, instead of return (uid, timestamp, ip), you do return Click(uid, timestamp, ip).
Case classes are not related to Flink but to Scala: https://docs.scala-lang.org/tour/case-classes.html
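Putting the two answers together, here is a sketch of the full pipeline using the case class, assuming click is a DataStream[String] as in the question:
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class Click(uid: String, timestamp: Long, ip: String)

def keyuid(l: Array[String]): Click = {
  val ip = l(3).split(":")(1)
  val uid = l(1).split(":")(1)
  val timestamp = l(2).split(":")(1).toLong * 1000
  Click(uid, timestamp, ip)
}

// Field names now replace the positional _1._2 / _1._3 accessors.
val cli_ip = click.map(_.split(","))
  .map(x => (keyuid(x), 1.0))
  .assignAscendingTimestamps { case (c, _) => c.timestamp }
  .keyBy { case (c, _) => c.ip }
  .timeWindow(Time.seconds(10))
  .sum(1)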

Read a file in Scala and get key-value pairs as Map[String, List[String]]

I am reading a file and getting the records as a Map[String, List[String]] in Spark/Scala. I want to achieve the same thing in pure Scala, without any Spark references (i.e. not reading an RDD). What should I change to make this work in a pure Scala way?
rdd
  .filter(x => (x != null) && (x.length > 0))
  .zipWithIndex()
  .map {
    case (line, index) =>
      val array = line.split("~").map(_.trim)
      (array(0), array(1), index)
  }
  .groupBy(_._1)
  .mapValues(x => x.toList.sortBy(_._3).map(_._2))
  .collect
  .toMap
For the most part it will remain the same, except for the groupBy part of the RDD code. Scala's List also has the map, filter, reduce, etc. methods, so they can be used in almost the same fashion.
val lines = Source.fromFile("filename.txt").getLines.toList
Once the file is read and stored in a List, those methods can be applied to it.
For the groupBy part, one simple approach can be to sort the tuples on the key. That will effectively cluster the tuples with the same key together.
val grouped = scala.util.Sorting.stableSort(arr,
  (e1: (String, String, Int), e2: (String, String, Int)) => e1._1 < e2._1)
There can be better solutions definitely, but this would effectively do the same task.
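For completeness, here is a sketch of a direct pure-Scala translation of the Spark pipeline above, relying on List's own zipWithIndex and groupBy (the file name is illustrative):
import scala.io.Source

val result: Map[String, List[String]] =
  Source.fromFile("filename.txt").getLines.toList
    .filter(x => (x != null) && (x.length > 0))
    .zipWithIndex
    .map { case (line, index) =>
      val array = line.split("~").map(_.trim)
      (array(0), array(1), index)
    }
    .groupBy(_._1)
    // sort each group by the original line index, then keep the values
    .map { case (key, group) => key -> group.sortBy(_._3).map(_._2) }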
I came up with the approach below:
Source.fromInputStream(getClass.getResourceAsStream(filePath))
  .getLines
  .filter(line => (line != null) && (line.length > 0))
  .map(_.split("~"))
  .toList
  .groupBy(_(0))
  .map { case (key, values) => (key, values.map(_(1))) }

Scala List Regex match in Map and return key

I want to match a string against the lists of regexes in a Map[String, List[Regex]] and return the key (a String) when there is a match.
e.g:
//Map[String, List[Regex]]
Map(m3 -> List(([^ ]*)(rule3)([^ ]*)), m1 -> List(([^ ]*)(rule1)([^ ]*)), m4 -> List(([^ ]*)(rule5)([^ ]*)), m2 -> List(([^ ]*)(rule2)([^ ]*)))
If the string is "***rule3****" it should return the key "m3"; similarly, if the string is "****rule5****" it should return the key "m4".
How do I implement this?
Something that I tried, which is not working:
rulesMap.mapValues (y => y.par.foreach (x => x.findFirstMatchIn("description"))).keys.toString()
For Scala 2.13.x
rulesMap
  .filter { case (_, regexList) => regexList.exists(regex => regex.matches("yourString")) }
  .keys
For Scala 2.12.x
rulesMap
  .filter { case (_, regexList) => regexList.exists(regex => regex.findFirstIn("yourString").isDefined) }
  .keys
collect is the best way to both filter and map a collection, because it only does a single pass over the data.
def findKeys(s: String) =
  rulesMap.collect {
    case (key, exps) if exps.exists(_.findFirstIn(s).nonEmpty) => key
  }
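For instance, with the map from the question (rebuilt here as actual Regex values, since the question only shows its toString output):
import scala.util.matching.Regex

val rulesMap: Map[String, List[Regex]] = Map(
  "m3" -> List("""([^ ]*)(rule3)([^ ]*)""".r),
  "m1" -> List("""([^ ]*)(rule1)([^ ]*)""".r),
  "m4" -> List("""([^ ]*)(rule5)([^ ]*)""".r),
  "m2" -> List("""([^ ]*)(rule2)([^ ]*)""".r)
)

findKeys("***rule3****")  // Iterable("m3")
findKeys("****rule5****") // Iterable("m4")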

CSV to Avro without Apache Spark in Scala

Is there a way I can convert a CSV file to Avro without using Apache Spark? I see most of the posts suggest using Spark, which I cannot do in my case. I have a schema in a separate file. I was thinking of some custom serializer and deserializer that would use the schema and convert CSV to Avro. Any kind of reference would work for me.
Thanks
If you only have strings and primitives, you could put together a crude implementation like this fairly easily:
import scala.collection.JavaConverters._
import scala.io.Source
import org.apache.avro.Schema
import org.apache.avro.Schema.Type._
import org.apache.avro.generic.GenericData

def csvToAvro(file: String, schema: Schema): Seq[GenericData.Record] = {
  // map each field's position in the schema to its declared Avro type
  val types = schema
    .getFields.asScala
    .map { f => f.pos -> f.schema.getType }

  Source.fromFile(file)
    .getLines
    .map(_.split(",").toSeq)
    .map { data =>
      // build one record per CSV line
      val rec = new GenericData.Record(schema)
      (data zip types).foreach {
        case (str, (idx, STRING)) => rec.put(idx, str)
        case (str, (idx, INT)) => rec.put(idx, str.toInt)
        case (str, (idx, LONG)) => rec.put(idx, str.toLong)
        case (str, (idx, FLOAT)) => rec.put(idx, str.toFloat)
        case (str, (idx, DOUBLE)) => rec.put(idx, str.toDouble)
        case (str, (idx, BOOLEAN)) => rec.put(idx, str.toBoolean)
        case (str, (idx, unknown)) =>
          throw new IllegalArgumentException(s"Don't know how to convert $str to $unknown at $idx")
      }
      rec
    }
    .toSeq
}
Note this does not handle nullable fields: for those the type is going to be UNION, and you'll have to look inside the schema to find out the actual data type.
Also, "parsing csv" is very crude here (just splitting at the comma isn't really a good idea, because it'll break if a string field happens to contain , in the data, or if fields are escaped with double-quotes).
And also, you'll probably want to add some sanity-checking to make sure, for example, that the number of fields in the csv line matches the number of fields in the schema etc.
But the above considerations notwithstanding, this should be sufficient to illustrate the approach and get you started.
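A hypothetical usage, with the schema parsed from a separate file as the question describes (file names are illustrative):
// Schema.Parser is part of the standard Avro Java API.
val schema = new Schema.Parser().parse(new java.io.File("user.avsc"))
val records = csvToAvro("users.csv", schema)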
Avro is an open format; there are many languages which support it.
Just pick one, like Python for example, which also supports CSV. But Go would do, and Java too.

Filtering on Scala Option[Set]

I have a Scala value of type Option[Set[String]] that I am trying to use in a collection filter method:
val opt: Option[Set[String]] = ...
collection.filter { value =>
  opt match {
    case Some(set) => set.contains(value)
    case None => true
  }
}
If the opt value is Some(...) I want to use the enclosed Set to filter the collection, otherwise I want to include all items in the collection.
Is there a better (more idiomatic) way to use the Option (map, filter, getOrElse, etc.)?
The opt comes from an optional command line argument containing a list of terms to include. If the command line argument is missing, then include all terms.
I'd use the fact that Set[A] extends A => Boolean and pass the constant function returning true to getOrElse:
val p: String => Boolean = opt.getOrElse(_ => true)
Or:
val p = opt.getOrElse(Function.const(true) _)
Now you can just write collection.filter(p). If you want a one-liner, either of the following will work as well:
collection filter opt.getOrElse(_ => true)
collection filter opt.getOrElse(Function.const(true) _)
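A quick sketch of both branches, with illustrative values:
val collection = List("a", "b", "c")

val some: Option[Set[String]] = Some(Set("a", "c"))
collection filter some.getOrElse(_ => true) // List(a, c)

val none: Option[Set[String]] = None
collection filter none.getOrElse(_ => true) // List(a, b, c)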
It seems a bit wasteful to filter on _ => true, so I would prefer:
opt match {
  case Some(s) => collection filter s
  case None => collection
}
I guess this equates to:
opt map (collection filter _) getOrElse collection