How can I take the highest ranking filter condition that ends up matching in a dataframe? - scala

The wording of my question might be confusing, so let me explain. Say I have an array of strings, ranked in order of the best-case match: the pattern at index 0 is the one we always want to exist in the dataframe column, but if it doesn't, index 1 is the next best option, and so on. I have written this logic as shown below, but I don't feel this is the most efficient way to do it. Is there a better way?
The datasets are quite small, but I fear this approach can't really scale very well.
import scala.util.control.Breaks._ // needed for breakable/break()

val df = spark.createDataFrame(data)
val nameArray = Array[String]("Name", "Name%", "%Name%", "Person Name", "Person Name%", "%Person Name%")

breakable {
  nameArray.foreach(x => {
    val nameDf = df.where("text like '" + x + "'")
    if (nameDf.count() > 0) {
      nameDf.show(1)
      break()
    }
  })
}

If the values are ordered by preference from left (highest precedence) to right (lowest precedence), and the lower-precedence patterns already cover the higher-precedence ones (it doesn't look like that is the case in your example), you can generate an expression like this:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._ // col, lit, when, max

def matched(df: DataFrame, nameArray: Seq[String], c: String = "text") = {
  // nested when/otherwise returning the index of the first matching pattern, or -1
  val matchIdx = nameArray.zipWithIndex.foldRight(lit(-1)) {
    case ((s, i), acc) => when(col(c) like s, lit(i)).otherwise(acc)
  }
  df.select(max(matchIdx)).first match {
    case Row(-1)     => None // No pattern matches all records
    case Row(i: Int) => Some(nameArray(i))
  }
}
Usage example (this assumes spark.implicits._ is in scope for toDF):
matched(Seq("Some Name", "Name", "Name Surname").toDF("text"), Seq("Name", "Name%", "%Name%"))
// Option[String] = Some(%Name%)
There are two advantages of this method:
Only one action is required.
Pattern matching can be short circuited.
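For illustration (a sketch added here, not part of the original answer), with the three example patterns the folded matchIdx column is equivalent to the following hand-written nested expression, which is what allows the matching to short-circuit at the first pattern that applies:
// equivalent expression for nameArray = Seq("Name", "Name%", "%Name%")
val matchIdx =
  when(col("text") like "Name", lit(0))
    .otherwise(when(col("text") like "Name%", lit(1))
      .otherwise(when(col("text") like "%Name%", lit(2))
        .otherwise(lit(-1))))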
If the pre-conditions are not satisfied, you can instead count, for each pattern, how many records it fails to match:
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"text" column syntax

val unmatchedCount: Map[String, Long] = df.select(
  nameArray.map(s => count(when(not($"text" like s), 1)).alias(s)): _*
).first.getValuesMap[Long](nameArray)
Unlike the first approach it will check all patterns, but it requires only one action.
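A possible follow-up, sketched here under the assumption that you still want the highest-precedence pattern that matches every record (bestPattern is just an illustrative name):
// first pattern, in precedence order, for which no record failed to match
val bestPattern: Option[String] =
  nameArray.find(p => unmatchedCount.getOrElse(p, Long.MaxValue) == 0L)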

Related

How to create two sequences out of one by comparing one custom object with another in that sequence?

case class Submission(name: String, plannedDate: Option[LocalDate], revisedDate: Option[LocalDate])
val submission_1 = Submission("Åwesh Care", Some(LocalDate.parse("2020-05-11")), Some(LocalDate.parse("2020-06-11")))
val submission_2 = Submission("robin Dore", Some(LocalDate.parse("2020-05-11")), Some(LocalDate.parse("2020-05-30")))
val submission_3 = Submission("AIMS Hospital", Some(LocalDate.parse("2020-01-24")), Some(LocalDate.parse("2020-07-30")))
val submissions = Seq(submission_1, submission_2, submission_3)
Split the submissions so that submissions sharing the same plannedDate and/or revisedDate go to sameDateGroup and the others go to remainder.
val (sameDateGroup, remainder) = someFunction(submissions)
Example result as below:
sameDateGroup should have
Seq(Submission("Åwesh Care", Some(2020-05-11), Some(2020-06-11)),
Submission("robin Dore", Some(2020-05-11), Some(2020-05-30)))
and remainder should have:
Seq(Submission("AIMS Hospital", Some(2020-01-24), Some(2020-07-30)))
So, if I understand the logic here, submission A shares a date with submission B (and both would go in the sameDateGroup) IFF:
subA.plannedDate == subB.plannedDate
OR subA.plannedDate == subB.revisedDate
OR subA.revisedDate == subB.plannedDate
OR subA.revisedDate == subB.revisedDate
Likewise, and conversely, submission C belongs in the remainder category IFF:
subC.plannedDate // is unique among all planned dates
AND subC.plannedDate // does not exist among all revised dates
AND subC.revisedDate // is unique among all revised dates
AND subC.revisedDate // does not exist among all planned dates
Given all that, I think this does what you're describing.
import java.time.LocalDate

case class Submission(name: String,
                      plannedDate: Option[LocalDate],
                      revisedDate: Option[LocalDate])

val submission_1 = Submission("Åwesh Care",
                              Some(LocalDate.parse("2020-05-11")),
                              Some(LocalDate.parse("2020-06-11")))
val submission_2 = Submission("robin Dore",
                              Some(LocalDate.parse("2020-05-11")),
                              Some(LocalDate.parse("2020-05-30")))
val submission_3 = Submission("AIMS Hospital",
                              Some(LocalDate.parse("2020-01-24")),
                              Some(LocalDate.parse("2020-07-30")))
val submissions = Seq(submission_1, submission_2, submission_3)

val pDates = submissions.groupBy(_.plannedDate)
val rDates = submissions.groupBy(_.revisedDate)

val (sameDateGroup, remainder) = submissions.partition(sub =>
  pDates(sub.plannedDate).lengthIs > 1 || // another submission has the same planned date
  rDates(sub.revisedDate).lengthIs > 1 || // another submission has the same revised date
  pDates.keySet(sub.revisedDate) ||       // its revised date appears among planned dates
  rDates.keySet(sub.plannedDate))         // its planned date appears among revised dates
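A quick check against the sample data (a sketch of the expected output, matching the example in the question):
sameDateGroup.map(_.name) // List(Åwesh Care, robin Dore)
remainder.map(_.name)     // List(AIMS Hospital)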
A simple way to do this is to count the number of matching submissions for each submission in the list, and use that to partition the list:
def matching(s1: Submission, s2: Submission) =
  s1.plannedDate == s2.plannedDate || s1.revisedDate == s2.revisedDate

val (sameDateGroup, remainder) =
  submissions.partition { s1 =>
    submissions.count(s2 => matching(s1, s2)) > 1
  }
The matching function can contain whatever specific test is required.
This is O(n^2) so a more sophisticated algorithm would be needed for very long lists.
I think this will do the trick.
I'm sorry, some of the variable names are not very meaningful, because I used a different case class when trying this. For some reason I only thought about using .groupBy later, so I don't really recommend using this, as it is a bit hard to follow and can be solved more easily with groupBy.
case class Submission(name: String, plannedDate: Option[String], revisedDate: Option[String])

val l =
  List(
    Submission("Åwesh Care", Some("2020-05-11"), Some("2020-06-11")),
    Submission("robin Dore", Some("2020-05-11"), Some("2020-05-30")),
    Submission("AIMS Hospital", Some("2020-01-24"), Some("2020-07-30")))

val t = l
  .map((_, 1))
  .foldLeft(Map.empty[Option[String], (List[Submission], Int)])((acc, idnTuple) => idnTuple match {
    case (idn, count) =>
      acc
        .get(idn.plannedDate)
        .map {
          case (mapIdn, mapCount) => acc + (idn.plannedDate -> (idn :: mapIdn, mapCount + count))
        }.getOrElse(acc + (idn.plannedDate -> (List(idn), count)))
  })
  .values
  .partition(_._2 > 1)

val r = (t._1.map(_._1).flatten, t._2.map(_._1).flatten)
println(r)
It basically follows the map-reduce word-count scheme.
If someone sees this and knows how to do the tuple deconstruction more easily, please let me know in the comments.
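Since the answer itself suggests that groupBy would be simpler, here is a hedged sketch of what that could look like (same String-dated case class as above, and, like the fold above, grouping on plannedDate only):
// group by plannedDate, then split the groups by whether they contain more than one submission
val (dupGroups, singleGroups) = l.groupBy(_.plannedDate).values.partition(_.size > 1)
val sameDateGroup = dupGroups.flatten.toList
val remainder = singleGroups.flatten.toList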

Decompose Scala sequence into member values

I'm looking for an elegant way of accessing two items in a Seq at the same time. I've checked earlier in my code that the Seq will have exactly two items. Now I would like to be able to give them names, so they have meaning.
records
  .sliding(2) // makes sure we get `Seq` with two items
  .map(recs => {
    // Something like this...
    val (former, latter) = recs
  })
Is there an elegant and/or idiomatic way to achieve this in Scala?
I'm not sure if it is any more elegant, but you can also unpick the sequence like this:
val former +: latter +: _ = recs
You can access the elements by their index:
map { recs => {
  val (former, latter) = (recs(0), recs(1))
}}
You can use pattern matching to decompose the structure of your list:
val records = List("first", "second")

records match {
  case first +: second +: Nil => println(s"1: $first, 2: $second")
  case _ => // Won't happen (you can omit this)
}
will output
1: first, 2: second
Each element produced by sliding is a List. Using a pattern match, you can give names to those elements like this:
map { case List(former, latter) =>
  ...
}
Note that since it's a pattern match, you need to use {} instead of ().
For records of a known element type (for example, Int):
records.sliding(2).map {
  case List(former: Int, latter: Int) => former + latter
}
Note that this will combine elements (0, 1), then (1, 2), then (2, 3), and so on. To combine the elements pairwise instead, use sliding(2, 2):
val pairs = records.sliding(2, 2).map {
  case List(former: Int, latter: Int) => former + latter
  case List(one: Int) => one
}.toList
Note that you now need an extra case for a single element, because the last window contains only one element when the records size is odd.
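To make the difference concrete, a small illustrative example (not from the original answer):
val nums = List(1, 2, 3)
nums.sliding(2).toList    // List(List(1, 2), List(2, 3)) - overlapping windows
nums.sliding(2, 2).toList // List(List(1, 2), List(3))    - non-overlapping pairs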

Extracting data from RDD in Scala/Spark

So I have a large dataset that is a sample of a stackoverflow userbase. One line from this dataset is as follows:
<row Id="42" Reputation="11849" CreationDate="2008-08-01T13:00:11.640" DisplayName="Coincoin" LastAccessDate="2014-01-18T20:32:32.443" WebsiteUrl="" Location="Montreal, Canada" AboutMe="A guy with the attention span of a dead goldfish who has been having a blast in the industry for more than 10 years.
Mostly specialized in game and graphics programming, from custom software 3D renderers to accelerated hardware pipeline programming." Views="648" UpVotes="337" DownVotes="40" Age="35" AccountId="33" />
I would like to extract the number from Reputation (in this case "11849") and the number from Age (in this example "35"), and I would like to have them as floats.
The file is located in HDFS, so it comes in as an RDD.
val linesWithAge = lines.filter(line => line.contains("Age=")) // filter out data which doesn't have an age
val repSplit = linesWithAge.flatMap(line => line.split("\"")) // split the data wherever there is a "
So when I split it on quotation marks, the reputation ends up at index 3 and the age at index 23, but how do I assign these to a map or a variable so I can use them as floats? Also, I need to do this for every line in the RDD.
EDIT:
val linesWithAge = lines.filter(line => line.contains("Age=")) //transformations from the original input data
val repSplit = linesWithAge.flatMap(line => line.split("\""))
val withIndex = repSplit.zipWithIndex
val indexKey = withIndex.map{case (k,v) => (v,k)}
val b = indexKey.lookup(3)
println(b)
So I added an index to the array, and I've now successfully managed to assign one value to a variable, but I can only do it for one item in the RDD. Does anyone know how I could do it for all items?
What we want to do is to transform each element in the original dataset (represented as an RDD) into a tuple containing (Reputation, Age) as numeric values.
One possible approach is to transform each element of the RDD using String operations in order to extract the values of the elements "Age" and "Reputation", like this:
// define a function to extract the value of an element, given the name
def findElement(src: Array[String], name: String): Option[String] = {
  for {
    entry <- src.find(_.startsWith(name))
    value <- entry.split("\"").lift(1)
  } yield value
}
We then use that function to extract the interesting values from every record:
val reputationByAge = lines.flatMap { line =>
  val elements = line.split(" ")
  for {
    age <- findElement(elements, "Age")
    rep <- findElement(elements, "Reputation")
  } yield (rep.toInt, age.toInt)
}
Note how we don't need to filter on "Age" before doing this. If we process a record that does not have "Age" or "Reputation", findElement will return None. Hence the result of the for-comprehension will be None and the record will be dropped by the flatMap operation.
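As a small check, here is how findElement behaves on a few attribute tokens taken from the sample record above (a sketch; only a subset of the attributes is shown):
val tokens = "Reputation=\"11849\" Age=\"35\" Views=\"648\"".split(" ")
findElement(tokens, "Age")        // Option[String] = Some(35)
findElement(tokens, "Reputation") // Option[String] = Some(11849)
findElement(tokens, "Location")   // Option[String] = None (missing attribute)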
A better way to approach this problem is by realizing that we are dealing with structured XML data. Scala provides built-in support for XML, so we can do this:
import scala.xml.XML

// helper function to map Strings to Option, where empty strings become None
def emptyStrToNone(str: String): Option[String] = if (str.isEmpty) None else Some(str)

val xmlReputationByAge = lines.flatMap { line =>
  val record = XML.loadString(line)
  for {
    rep <- emptyStrToNone((record \ "@Reputation").text)
    age <- emptyStrToNone((record \ "@Age").text)
  } yield (rep.toInt, age.toInt)
}
This method relies on the structure of the XML record to extract the right attributes. As before, we use the combination of Option values and flatMap to remove records that do not contain all the information we require.
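For the sample row shown in the question (Reputation="11849", Age="35"), either version should therefore produce a pair like this (a sketch of the expected output):
xmlReputationByAge.take(1).foreach(println) // (11849,35)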
First, you need a function which extracts the value for a given key of your line (getValueForKeyAs[T]), then do:
val rdd = linesWithAge.map(line => (getValueForKeyAs[Float](line,"Age"), getValueForKeyAs[Float](line,"Reputation")))
This should give you an rdd of type RDD[(Float,Float)]
getValueForKeyAs could be implemented like this:
def getValueForKeyAs[A](line: String, key: String): A = {
  val res = line.split(key + "=")
  if (res.size == 1) throw new RuntimeException(s"no value for key $key")
  val value = res(1).split("\"")(1)
  // note: asInstanceOf does not convert the String to a numeric type;
  // for a Float you would need an explicit conversion such as value.toFloat
  value.asInstanceOf[A]
}

How to skip entries that don't match a pattern when constructing an RDD

I'm doing some pattern matching on an RDD and would like to select only those rows/records matching a pattern. Here is what I currently have:
val idPattern = """Id="([^"]*)""".r.unanchored
val typePattern = """PostTypeId="([^"]*)""".r.unanchored
val datePattern = """CreationDate="([^"]*)""".r.unanchored
val tagPattern = """Tags="([^"]*)""".r.unanchored
val projectedPostsAnswers = postsAnswers.map { line =>
  val id = line match { case idPattern(x) => x }
  val typeId = line match { case typePattern(x) => x }
  val date = line match { case datePattern(x) => x }
  val tags = line match { case tagPattern(x) => x }
  Post(Some(id), Some(typeId), Some(date), Some(tags))
}
case class Post(Id: Option[String], Type: Option[String], CreationDate: Option[String], Tags: Option[String])
I'm only interested in rows/records which match all the patterns (that is, records that have all four fields). How can I skip the rows/records that don't satisfy my requirement?
You can use RDD.collect(scala.PartialFunction f) to do the filtering and mapping in one step.
For example, if you know the order of these fields in your input, you can merge the regular expressions into one and use a single case:
val pattern = """Id="([^"]*).*PostTypeId="([^"]*).*CreationDate="([^"]*).*Tags="([^"]*)""".r.unanchored
val projectedPostsAnswers = postsAnswers.collect {
  case pattern(id, typeId, date, tags) => Post(Some(id), Some(typeId), Some(date), Some(tags))
}
The returned RDD[Post] will only contain records for which the case matched. Notice that this collect(PartialFunction) has nothing to do with collect() - it does not collect the entire data set into driver memory.
Use the filter API.
val projectedPostAnswers = postsAnswers.filter(line => f(line)).map{....
Create a function 'f' that does your data cleansing for you. Make sure the function returns true or false, as that's what filter uses to decide whether or not to pass the record.
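A hedged sketch of what such a function f could look like, reusing the four unanchored patterns from the question (the name f is just a placeholder):
// keep only lines on which all four patterns find a match
def f(line: String): Boolean =
  Seq(idPattern, typePattern, datePattern, tagPattern)
    .forall(_.findFirstIn(line).isDefined)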

spark scala - replace text if exists in list

I have 2 datasets.
One is a dataframe with a bunch of data, one column has comments (a string).
The other is a list of words.
If a comment contains a word in the list, I want to replace the word in the comment with ##### and return the comment in full with the replaced words.
Here's some sample data:
CommentSample.txt
1 A badword small town
2 "Love the truck, though rattle is annoying."
3 Love the paint!
4
5 "Like that you added the ""oh badword2"" handle to passenger side."
6 "badword you. specific enough for you, badword3?"
7 This car is a piece if badword2
ProfanitySample.txt
badword
badword2
badword3
Here's my code so far:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Response(UniqueID: Int, Comment: String)
val response = sc.textFile("file:/data/CommentSample.txt").map(_.split("\t")).filter(_.size == 2).map(r => Response(r(0).trim.toInt, r(1).trim)).toDF()
var profanity = sc.textFile("file:/data/ProfanitySample.txt").map(x => x.toLowerCase()).collect();
def replaceProfanity(s: String): String = {
  val l = s.toLowerCase()
  val r = "#####"
  if (profanity.contains(s))
    r
  else
    s
}

def processComment(s: String): String = {
  val commentWords = sc.parallelize(s.split(' '))
  commentWords.foreach(replaceProfanity)
  commentWords.collect().mkString(" ")
}
response.select(processComment("Comment")).show(100)
It compiles, it runs, but the words are not replaced.
I don't know how to debug in Scala.
I'm totally new! This is my first project ever!
Many thanks for any pointers.
-M
First, I think the use case you describe here won't benefit much from the use of DataFrames - it's simpler to implement using RDDs only (DataFrames are mostly convenient when your transformations can easily be described using SQL, which isn't the case here).
So - here's a possible implementation using RDDs. This assumes the list of profanities isn't too large (i.e. up to ~thousands), so we can collect it into non-distributed memory. If that's not the case, a different approach (involving a join) might be needed.
import org.apache.spark.rdd.RDD

case class Response(UniqueID: Int, Comment: String)

val mask = "#####"

val responses: RDD[Response] = sc.textFile("file:/data/CommentSample.txt")
  .map(_.split("\t"))
  .filter(_.size == 2)
  .map(r => Response(r(0).trim.toInt, r(1).trim))

val profanities: Array[String] = sc.textFile("file:/data/ProfanitySample.txt").collect()

val result = responses.map { r =>
  // using foldLeft here means we'll replace profanities one by one,
  // with the result of each replace as the input of the next,
  // starting with the original comment
  profanities.foldLeft(r.Comment) {
    case (updatedComment, profanity) => updatedComment.replaceAll(s"(?i)\\b$profanity\\b", mask)
  }
}
result.take(10).foreach(println) // just printing some examples...
Note that the case-insensitivity and the "words only" limitations are implemented in the regex itself: "(?i)\\bSomeWord\\b".
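As a quick illustration of that regex (a sketch, not part of the original answer):
"Badword you. specific enough?".replaceAll("(?i)\\bbadword\\b", "#####")
// res: String = ##### you. specific enough?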