How can I match two collections in Scala by field equality?

I have this code:
case class DataText(name:String)
val dataModels = Seq(DataText("a.dm"),DataText("b.dm"),DataText("c.dm"),DataText("d.dm"),DataText("e.dm"),DataText("f.dm"))
val dataReports = Seq(DataText("a.d0"),DataText("b1.do"),DataText("c2.do"),DataText("d.do"),DataText("e3.do"),DataText("f5.do"))
How can I match dataModels against dataReports, where an item in dataModels matches an item in dataReports when the prefix before the dot is equal, i.e. name.split(".").head on one side equals name.split(".").head on the other?
The expected result would be:
Seq(DataText("a.dm"),DataText("d.dm"))
I have tried map with a nested filter, but it doesn't work.

I would transform dataReports into a Set of the target prefixes and filter via contains, which is effectively a constant-time operation:
val dataReportsSet = dataReports.map(_.name.split("\\.")(0)).toSet
dataModels.filter(dm => dataReportsSet.contains(dm.name.split("\\.")(0)))
// res1: Seq[DataText] = List(DataText(a.dm), DataText(d.dm))
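If you also need the matched pairs rather than just the models, a minimal sketch of the same idea (the prefix helper and the pairing are illustrative additions, not part of the original answer):
// Hypothetical extension: pair each model with its matching report.
def prefix(d: DataText): String = d.name.split("\\.")(0)
val reportsByPrefix: Map[String, DataText] =
  dataReports.map(r => prefix(r) -> r).toMap
val matchedPairs: Seq[(DataText, DataText)] =
  dataModels.flatMap(m => reportsByPrefix.get(prefix(m)).map(r => (m, r)))
// matchedPairs: Seq((DataText(a.dm),DataText(a.d0)), (DataText(d.dm),DataText(d.do)))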

Related

Extracting data from RDD in Scala/Spark

So I have a large dataset that is a sample of the Stack Overflow user base. One line from this dataset looks as follows:
<row Id="42" Reputation="11849" CreationDate="2008-08-01T13:00:11.640" DisplayName="Coincoin" LastAccessDate="2014-01-18T20:32:32.443" WebsiteUrl="" Location="Montreal, Canada" AboutMe="A guy with the attention span of a dead goldfish who has been having a blast in the industry for more than 10 years.
Mostly specialized in game and graphics programming, from custom software 3D renderers to accelerated hardware pipeline programming." Views="648" UpVotes="337" DownVotes="40" Age="35" AccountId="33" />
I would like to extract the number from Reputation (here "11849") and the number from Age (here "35"), and I would like to have them as floats.
The file is located in HDFS, so it comes in as an RDD:
val linesWithAge = lines.filter(line => line.contains("Age=")) // filter out data which doesn't have Age
val repSplit = linesWithAge.flatMap(line => line.split("\"")) // split the data on each " character
So when I split on quotation marks, Reputation is at index 3 and Age at index 23, but how do I assign these to a map or a variable so I can use them as floats?
I also need to do this for every line of the RDD.
EDIT:
val linesWithAge = lines.filter(line => line.contains("Age=")) //transformations from the original input data
val repSplit = linesWithAge.flatMap(line => line.split("\""))
val withIndex = repSplit.zipWithIndex
val indexKey = withIndex.map{case (k,v) => (v,k)}
val b = indexKey.lookup(3)
println(b)
So I added an index to the array and have now successfully managed to assign one value to a variable, but I can only do it for one item in the RDD. Does anyone know how I could do it for all items?
What we want to do is to transform each element in the original dataset (represented as an RDD) into a tuple containing (Reputation, Age) as numeric values.
One possible approach is to transform each element of the RDD using String operations in order to extract the values of the elements "Age" and "Reputation", like this:
// define a function to extract the value of an element, given the name
def findElement(src: Array[String], name: String): Option[String] = {
  for {
    entry <- src.find(_.startsWith(name))
    value <- entry.split("\"").lift(1)
  } yield value
}
We then use that function to extract the interesting values from every record:
val reputationByAge = lines.flatMap { line =>
  val elements = line.split(" ")
  for {
    age <- findElement(elements, "Age")
    rep <- findElement(elements, "Reputation")
  } yield (rep.toInt, age.toInt)
}
Note how we don't need to filter on "Age" before doing this. If we process a record that lacks "Age" or "Reputation", findElement will return None, so the result of the for-comprehension will be None and the record will be dropped by the flatMap operation.
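To illustrate that Option-flattening behaviour, a toy sketch (not part of the original code; assumes sc is the SparkContext):
// flatMap drops the None produced for the malformed record
val toy = sc.parallelize(Seq("Age=\"35\"", "no-age"))
toy.flatMap(s => s.split("\"").lift(1)).collect() // Array(35)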
A better way to approach this problem is by realizing that we are dealing with structured XML data. Scala provides built-in support for XML, so we can do this:
import scala.xml.XML
// helper function to map Strings to Option, where empty strings become None
def emptyStrToNone(str: String): Option[String] = if (str.isEmpty) None else Some(str)
val xmlReputationByAge = lines.flatMap { line =>
  val record = XML.loadString(line)
  for {
    rep <- emptyStrToNone((record \ "@Reputation").text)
    age <- emptyStrToNone((record \ "@Age").text)
  } yield (rep.toInt, age.toInt)
}
This method relies on the structure of the XML record to extract the right attributes. As before, we use the combination of Option values and flatMap to remove records that do not contain all the information we require.
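One caveat: XML.loadString throws on malformed input. If some lines may not be well-formed XML, a minimal sketch that guards the parse with Try (an addition to the original answer):
import scala.util.Try
val xmlReputationByAgeSafe = lines.flatMap { line =>
  for {
    record <- Try(XML.loadString(line)).toOption // malformed lines become None
    rep <- emptyStrToNone((record \ "@Reputation").text)
    age <- emptyStrToNone((record \ "@Age").text)
  } yield (rep.toInt, age.toInt)
}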
First, you need a function which extracts the value for a given key of your line (getValueForKeyAs), then do:
val rdd = linesWithAge.map(line => (getValueForKeyAs(line, "Age")(_.toFloat), getValueForKeyAs(line, "Reputation")(_.toFloat)))
This should give you an RDD of type RDD[(Float, Float)].
getValueForKeyAs could be implemented like this. Note that casting a String to A with asInstanceOf would only fail at runtime, so we pass an explicit conversion function instead:
def getValueForKeyAs[A](line: String, key: String)(convert: String => A): A = {
  val res = line.split(key + "=")
  if (res.size == 1) throw new RuntimeException(s"no value for key $key")
  convert(res(1).split("\"")(1))
}

How to efficiently delete subset in spark RDD

While doing some research, I found it surprisingly difficult to delete all the subsets in a Spark RDD.
The data structure is RDD[(key,set)]. For example, it could be:
RDD[ ("peter",Set(1,2,3)), ("mike",Set(1,3)), ("jack",Set(5)) ]
Since the set of mike (Set(1,3)) is a subset of peter's (Set(1,2,3)), I want to delete "mike", which will end up with
RDD[ ("peter",Set(1,2,3)), ("jack",Set(5)) ]
It is easy to implement locally in Python with two for loops. But when I want to scale out to the cloud with Scala and Spark, it is not that easy to find a good solution.
Thanks
I doubt we can avoid comparing each element to each other (the equivalent of a double loop in a non-distributed algorithm). The subset relation between sets is not symmetric, so we need to check both whether "alice" is a subset of "bob" and whether "bob" is a subset of "alice".
To do this using the Spark API, we can resort to multiplying the data with itself using a cartesian product and verifying each entry of the resulting matrix:
val data = Seq(("peter",Set(1,2,3)), ("mike",Set(1,3)), ("anne", Set(7)),("jack",Set(5,4,1)), ("lizza", Set(5,1)), ("bart", Set(5,4)), ("maggie", Set(5)))
// expected result from this dataset = peter, anne, jack
val userSet = sparkContext.parallelize(data)
val prod = userSet.cartesian(userSet)
val subsetMembers = prod.collect{case ((name1, set1), (name2,set2)) if (name1 != name2) && (set2.subsetOf(set1)) && (set1 -- set2).nonEmpty => (name2, set2) }
val superset = userSet.subtract(subsetMembers)
// let's see the results:
superset.collect()
// Array[(String, scala.collection.immutable.Set[Int])] = Array((peter,Set(1, 2, 3)), (anne,Set(7)), (jack,Set(5, 4, 1)))
This can be achieved by using the RDD.fold function.
In this case the required output is a "List" (ItemList) of superset items. For this, the input should also be converted to "List"s (an RDD of ItemList).
import org.apache.spark.rdd.RDD
// type alias for convenience
type Item = Tuple2[String, Set[Int]]
type ItemList = List[Item]
// Source RDD
val lst:RDD[Item] = sc.parallelize( List( ("peter",Set(1,2,3)), ("mike",Set(1,3)), ("jack",Set(5)) ) )
// Convert each element as a List. This is needed for using fold function on RDD
// since the data-type of the parameters are the same as output parameter
// data-type for fold function
val listOflst:RDD[ItemList] = lst.map(x => List(x))
// For each element in the second ItemList:
// - check whether it is a subset of any element in the first ItemList; if not, add it
// - remove existing elements that are subsets of the newly added element
def combiner(first: ItemList, second: ItemList): ItemList = {
  def helper(lst: ItemList, i: Item): ItemList = {
    val isSubset: Boolean = lst.exists(x => i._2.subsetOf(x._2))
    if (isSubset) lst else i :: lst.filterNot(x => x._2.subsetOf(i._2))
  }
  second.foldLeft(first)(helper)
}
listOflst.fold(List())(combiner)
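For the sample RDD above, this should yield the superset items, e.g.:
// List((jack,Set(5)), (peter,Set(1,2,3)))  (element order may vary across partitions)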
You can use filter after a map.
First, build a function that returns the line for entries you want to keep and None for the ones you want to delete (this answer is written in Python/PySpark style):
def filter_mike(line):
    if line[1] != {1, 3}:
        return line
    else:
        return None
Then you can filter like this:
your_rdd.map(filter_mike).filter(lambda x: x is not None)
This will work.
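Since the question is about Scala, a rough Scala equivalent of the same map-then-filter idea (a sketch; yourRdd stands for your RDD[(String, Set[Int])]):
def filterMike(entry: (String, Set[Int])): Option[(String, Set[Int])] =
  if (entry._2 != Set(1, 3)) Some(entry) else None
yourRdd.map(filterMike).filter(_.isDefined).map(_.get)
// or in one step, since filter takes a Boolean predicate directly:
yourRdd.filter(_._2 != Set(1, 3))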

Sort files by name

How can I sort a group of files in ascending/descending order based on their names, given the following naming convention: myPath\numberTheFileInt.ext?
I would like to obtain something like the following:
myPath\1.csv
myPath\02.csv
...
myPath\21.csv
...
myPath\101.csv
Here is what I have at the moment:
myFiles = getFiles(myFilesDirectory).sortWith(_.getName < _.getName)
Despite the files being sorted in the directory, they are unsorted in myFiles.
The output I get is:
myPath\1.csv
myPath\101.csv
myPath\02.csv
...
myPath\21.csv
I tried multiple things but it always throws a NoSuchElementException.
Has anyone already done this?
Comparing strings yields a lexicographic order based on the Unicode values of the strings being compared. What you need is to extract the file number and order based on that number as an Int.
import java.io.File
val extractor = "([\\d]+)\\.csv$".r
val files = List(
  "myPath/1.csv",
  "myPath/101.csv",
  "myPath/02.csv",
  "myPath/21.csv",
  "myPath/33.csv"
).map(new File(_))
val sorted = files.sortWith { (l, r) =>
  val extractor(lFileNumber) = l.getName
  val extractor(rFileNumber) = r.getName
  lFileNumber.toInt < rFileNumber.toInt
}
sorted.foreach(println)
sorted.foreach(println)
Results:
myPath/1.csv
myPath/02.csv
myPath/21.csv
myPath/33.csv
myPath/101.csv
UPDATE
An alternative as proposed by @dhg:
val sorted = files.sortBy { f =>
  f.getName match {
    case extractor(n) => n.toInt
  }
}
A cleaner version of J.Romero's answer, using sortBy:
val Extractor = "([\\d]+)\\.csv".r
val sorted = files.map(_.getName).sortBy{ case Extractor(n) => n.toInt }

How to skip entries that don't match a pattern when constructing an RDD

I'm doing some pattern matching on an RDD and would like to select only those rows/records matching a pattern. Here is what I currently have:
val idPattern = """Id="([^"]*)""".r.unanchored
val typePattern = """PostTypeId="([^"]*)""".r.unanchored
val datePattern = """CreationDate="([^"]*)""".r.unanchored
val tagPattern = """Tags="([^"]*)""".r.unanchored
val projectedPostsAnswers = postsAnswers.map { line =>
  val id = line match { case idPattern(x) => x }
  val typeId = line match { case typePattern(x) => x }
  val date = line match { case datePattern(x) => x }
  val tags = line match { case tagPattern(x) => x }
  Post(Some(id), Some(typeId), Some(date), Some(tags))
}
case class Post(Id: Option[String], Type: Option[String], CreationDate: Option[String], Tags: Option[String])
I'm only interested in rows/records which match all of the patterns (that is, records that have all four fields). How can I skip the rows/records that don't satisfy my requirement?
You can use RDD.collect(scala.PartialFunction f) to do the filtering and mapping in one step.
For example, if you know the order of these fields in your input, you can merge the regular expressions into one and use a single case:
val pattern = """Id="([^"]*).*PostTypeId="([^"]*).*CreationDate="([^"]*).*Tags="([^"]*)""".r.unanchored
val projectedPostsAnswers = postsAnswers.collect {
  case pattern(id, typeId, date, tags) => Post(Some(id), Some(typeId), Some(date), Some(tags))
}
The returned RDD[Post] will only contain records for which the case matched. Note that this collect(PartialFunction) has nothing to do with the parameterless collect(): it does not pull the entire dataset into driver memory.
Use the filter API:
val projectedPostAnswers = postsAnswers.filter(line => f(line)).map{....
Create a function f that does your data cleansing for you. Make sure the function returns true or false, as that's what filter uses to decide whether to keep the record.
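A minimal sketch of such a predicate, reusing the four unanchored patterns defined in the question (assuming "cleansing" here means that all four fields must be present):
// true only when every pattern finds a match in the line
def f(line: String): Boolean =
  List(idPattern, typePattern, datePattern, tagPattern)
    .forall(p => p.findFirstIn(line).isDefined)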

Find an element in a List by comparing with another list in scala

I have two lists.
val lis1= List("pt1","pt2","")
val lis2= List("pt1","")
I need to find the empty string in lis1, so I am trying:
val find = lis1.find(lis => lis2.contains(""))
Here, instead of returning "", it returns "pt1". Kindly help me: how can I get the empty string instead of "pt1"?
Your predicate ignores its argument: lis2.contains("") is always true for your data, so find simply returns the first element of lis1. It sounds like you actually want the intersection of the two lists. You can use filter + contains, similar to your original approach, or alternatively the intersect method.
val lis1 = List("pt1", "pt2", "")
val lis2 = List("pt1", "")
lis1.filter(item => lis2.contains(item))
// > res0: List[String] = List(pt1, "")
lis1.intersect(lis2)
// > res1: List[String] = List(pt1, "")