Simplify two filters in Scala

Is there a way to simplify this scala code into a for comprehension?
val selectedNames = names filter {setOfNames}
val selectedPersons = persons filter {p => selectedNames contains p.name}
Here I'm assuming that persons have a name attribute.
Edit
Of course the value names is obtained as
val names = persons map (_.name)

How about
val selectedPersons = persons filter { person => setOfNames contains person.name }

I'm not sure this is much of a simplification. It's just doing the same thing via a for comprehension as requested.
val selectedPersons = for {
p <- persons
if setOfNames(p.name)
} yield p
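A minimal self-contained sketch of both forms side by side. The Person case class and the sample data are hypothetical, since the question does not show them. It relies on the fact that a Set[String] is itself a function String => Boolean, which is why setOfNames(p.name) works as a guard.

```scala
object SelectPersons {
  // Hypothetical model and sample data, for illustration only
  final case class Person(name: String, age: Int)

  val persons: List[Person] =
    List(Person("Ann", 31), Person("Bob", 45), Person("Cid", 27))
  val setOfNames: Set[String] = Set("Ann", "Cid")

  // A Set[A] extends A => Boolean, so it can be applied directly as a predicate
  val viaFilter: List[Person] = persons.filter(p => setOfNames(p.name))

  // The equivalent for comprehension, which desugars to withFilter followed by map
  val viaFor: List[Person] = for {
    p <- persons
    if setOfNames(p.name)
  } yield p

  def main(args: Array[String]): Unit = {
    println(viaFilter)
    println(viaFor)
  }
}
```

Both values hold the same elements, so the choice between the two is mostly a matter of taste.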

Scala Tuple of seq to seq of object

I have a tuple of the form (DBIO[Seq[Person]], DBIO[Seq[Address]]) whose elements map one to one. Person and Address are separate tables in an RDBMS. The profile is defined as Profile(person: Person, address: Address). I want to convert the tuple into a DBIO[Seq[Profile]]. The following snippet shows how I obtained (DBIO[Seq[Person]], DBIO[Seq[Address]]):
for {
person <- personQuery if person.personId === personId
address <- addressQuery if address.addressId === profile.addressId
} yield (person.result, address.result)
I need help with this transformation to DBIO[Seq[Profile]].
Assuming you can't use a join and you need to work with two actions (two DBIOs), what you can do is combine the two actions into a single one:
// Combine two actions into a single action
val pairs: DBIO[ ( Seq[Person], Seq[Address] ) ] =
(person.result).zip(address.result)
(zip is just one of many combinators you can use to manipulate DBIO).
From there you can use DBIO.map to convert the pair into the data structure you want.
For example:
// Use Slick's DBIO.map to map the DBIO value into a sequence of profiles:
val profiles: DBIO[Seq[Profile]] = pairs.map { case (ppl, places) =>
// We now use a regular Scala `zip` on two sequences:
ppl.zip(places).map { case (person, place) => Profile(person, place) }
}
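The inner step is ordinary collection code, so it can be tried without a database. A small sketch, assuming hypothetical fields for Person and Address (the question does not show the real row classes):

```scala
object ZipProfiles {
  // Hypothetical shapes; the real table row classes are not shown in the question
  final case class Person(personId: Long, name: String)
  final case class Address(addressId: Long, city: String)
  final case class Profile(person: Person, address: Address)

  val people: Seq[Person] = Seq(Person(1L, "Ann"), Person(2L, "Bob"))
  val places: Seq[Address] = Seq(Address(10L, "Oslo"), Address(20L, "Rome"))

  // zip pairs elements by position and stops at the shorter sequence,
  // so it only gives correct profiles when both results are in matching order
  val profiles: Seq[Profile] =
    people.zip(places).map { case (person, place) => Profile(person, place) }

  def main(args: Array[String]): Unit = profiles.foreach(println)
}
```

Because the pairing is purely positional, this is only safe when the two queries are known to return rows in corresponding order. If that is not guaranteed, a join is probably the more robust choice.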
I am unfamiliar with whatever DBIO is. Assuming it is a case class of some type T :
val (DBIO(people), DBIO(addresses)) = for {
person <- personQuery if person.personId === personId
address <- addressQuery if address.addressId === profile.addressId
} yield (person.result, address.result)
val profiles = DBIO(people.zip(addresses).map{ case (person, address) => Profile(person, address)})

Scala broadcast join with "one to many" relationship

I am fairly new to Scala and RDDs.
I have a very simple scenario yet it seems very hard to implement with RDDs.
Scenario:
I have two tables. One large and one small. I broadcast the smaller table.
I then want to join the table and finally aggregate the values after the join to a final total.
Here is an example of the code:
val bigRDD = sc.parallelize(List(("A",1,"1Jan2000"),("B",2,"1Jan2000"),("C",3,"1Jan2000"),("D",3,"1Jan2000"),("E",3,"1Jan2000")))
val smallRDD = sc.parallelize(List(("A","Fruit","Apples"),("A","ZipCode","1234"),("B","Fruit","Apples"),("B","ZipCode","456")))
val broadcastVar = sc.broadcast(smallRDD.keyBy{ a => (a._1,a._2) } // turn to pair RDD
.collectAsMap() // collect as Map
)
//first join
val joinedRDD = bigRDD.map( accs => {
//get list of groups
val groups = List("Fruit", "ZipCode")
val i = "Fruit"
//for each group
//for(i <- groups) {
if (broadcastVar.value.get(accs._1, i) != None) {
( broadcastVar.value.get(accs._1, i).get._1,
broadcastVar.value.get(accs._1, i).get._2,
accs._2, accs._3)
} else {
None
}
//}
}
)
//expected after this
//("A","Fruit","Apples",1, "1Jan2000"),("B","Fruit","Apples",2, "1Jan2000"),
//("A","ZipCode","1234", 1,"1Jan2000"),("B","ZipCode","456", 2,"1Jan2000")
//then group and sum
//cannot do anything with the joinedRDD!!!
//error == value copy is not a member of Product with Serializable
// Final Expected Result
//("Fruit","Apples",3, "1Jan2000"),("ZipCode","1234", 1,"1Jan2000"),("ZipCode","456", 2,"1Jan2000")
My questions:
Is this the best approach first of all with RDDs?
Disclaimer - I have done this whole task using dataframes successfully. The idea is to create another version using only RDDs to compare performance.
Why is the type of my joinedRDD not recognised after it was created so that I can continue to use functions like copy on it?
How can I avoid calling .collectAsMap() when broadcasting the variable? I currently have to include the first two items in the key to enforce uniqueness and avoid dropping any values.
Thanks for the help in advance!
Final solution for anyone interested
case class dt (group:String, group_key:String, count:Long, date:String)
val bigRDD = sc.parallelize(List(("A",1,"1Jan2000"),("B",2,"1Jan2000"),("C",3,"1Jan2000"),("D",3,"1Jan2000"),("E",3,"1Jan2000")))
val smallRDD = sc.parallelize(List(("A","Fruit","Apples"),("A","ZipCode","1234"),("B","Fruit","Apples"),("B","ZipCode","456")))
val broadcastVar = sc.broadcast(smallRDD.keyBy{ a => (a._1) } // turn to pair RDD
.groupByKey() // so that no data is lost
.collectAsMap() // collect as Map
)
//first join
val joinedRDD = bigRDD.flatMap( accs => {
if (broadcastVar.value.get(accs._1) != None) {
val bc = broadcastVar.value.get(accs._1).get
bc.map(p => {
dt(p._2, p._3,accs._2, accs._3)
})
} else {
None
}
}
)
//expected after this
//("Fruit","Apples",1, "1Jan2000"),("Fruit","Apples",2, "1Jan2000"),
//("ZipCode","1234", 1,"1Jan2000"),("ZipCode","456", 2,"1Jan2000")
//then group and sum
var finalRDD = joinedRDD.map(s => {
(s.copy(count=0),s.count) //trick to keep code to minimum (count = 0)
})
.reduceByKey(_ + _)
.map(pair => {
pair._1.copy(count=pair._2)
})
In your map statement you return either a tuple or None, depending on the if condition. These types do not match, so the compiler falls back to a common supertype and joinedRDD becomes an RDD[Product with Serializable], which is not what you want at all (it's basically RDD[Any]). You need to make sure all paths return the same type. In this case, you probably want an Option[(String, String, Int, String)]. All you need to do is wrap the tuple result in a Some:
if (broadcastVar.value.get(accs._1, i) != None) {
// the map values are plain tuples, so use positional accessors
Some(( broadcastVar.value.get(accs._1, i).get._1,
broadcastVar.value.get(accs._1, i).get._2,
accs._2, accs._3))
} else {
None
}
And now your types will match up. This makes joinedRDD an RDD[Option[(String, String, Int, String)]]. Now that the type is correct the data is usable; however, you will need to map over the Option to work with the tuples. If you don't need the None values in the final result, you can use flatMap instead of map to create joinedRDD, which will unwrap the Options for you and filter out all the Nones.
collectAsMap is the correct way to turn an RDD into a HashMap, but you need multiple values for a single key. After mapping smallRDD into key-value pairs but before calling collectAsMap, use groupByKey to group all of the values for a single key together. Then, when you look up a key in your HashMap, you can map over the values, creating a new record for each one.
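The same idea can be tried on plain Scala collections, without Spark. The sketch below is only an analogy: groupBy stands in for keyBy + groupByKey + collectAsMap, and a local groupBy with sum stands in for reduceByKey. The Dt case class mirrors the one in the question's final solution, and the names lookup, joined and totals are made up for illustration.

```scala
object OneToManyLookup {
  final case class Dt(group: String, groupKey: String, count: Long, date: String)

  val big: List[(String, Int, String)] =
    List(("A", 1, "1Jan2000"), ("B", 2, "1Jan2000"), ("C", 3, "1Jan2000"))
  val small: List[(String, String, String)] = List(
    ("A", "Fruit", "Apples"), ("A", "ZipCode", "1234"),
    ("B", "Fruit", "Apples"), ("B", "ZipCode", "456"))

  // One key maps to all of its rows, so nothing is overwritten
  val lookup: Map[String, List[(String, String, String)]] = small.groupBy(_._1)

  // flatMap emits one record per matching row and nothing for absent keys ("C")
  val joined: List[Dt] = big.flatMap { case (key, count, date) =>
    lookup.getOrElse(key, Nil).map { case (_, group, groupKey) =>
      Dt(group, groupKey, count.toLong, date)
    }
  }

  // Sum the counts per (group, groupKey, date)
  val totals: Map[(String, String, String), Long] =
    joined
      .groupBy(d => (d.group, d.groupKey, d.date))
      .map { case (k, ds) => k -> ds.map(_.count).sum }

  def main(args: Array[String]): Unit = totals.foreach(println)
}
```

Using getOrElse(key, Nil) keeps every branch at the same type, which is the point made above about mixing tuples and None.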

Filter a list based on a parameter

I want to filter the employees based on name and return the id of each employee
case class Company(emp:List[Employee])
case class Employee(id:String,name:String)
val emp1=Employee("1","abc")
val emp2=Employee("2","def")
val cmpy= Company(List(emp1,emp2))
val a = cmpy.emp.find(_.name == "abc")
val b = a.map(_.id)
val c = cmpy.emp.find(_.name == "def")
val d = c.map(_.id)
println(b)
println(d)
I want to create a generic function that contains the filter logic, so that I can use it with different kinds of lists and different filter parameters.
Example: employeeIdByName, which takes the parameters
Updated
criteria for the filter, e.g. _.name and id
list to filter, e.g. cmpy.emp
value for comparison, e.g. abc/def
Is there a better way to achieve this result?
So far I have used map and find.
If you really want a "generic" filter function, that can filter any list of elements, by any property of these elements, based on a closed set of "allowed" values, while mapping results to some other property - it would look something like this:
def filter[T, P, R](
list: List[T], // input list of elements with type T (in our case: Employee)
propertyGetter: T => P, // function extracting value for comparison, in our case a function from Employee to String
values: List[P], // "allowed" values for the result of propertyGetter
resultMapper: T => R // function extracting result from each item, in our case from Employee to String
): List[R] = {
list
// first we filter only items for which the result of
// applying "propertyGetter" is one of the "allowed" values:
.filter(item => values.contains(propertyGetter(item)))
// then we map remaining values to the result using the "resultMapper"
.map(resultMapper)
}
// for example, we can use it to filter by name and return id:
filter(
List(emp1, emp2),
(emp: Employee) => emp.name, // function that takes an Employee and returns its name
List("abc"),
(emp: Employee) => emp.id // function that takes an Employee and returns its id
)
// List(1)
However, this is a ton of noise around a very simple Scala operation: filtering and mapping a list. This specific use case can be written as:
val goodNames = List("abc")
val input = List(emp1, emp2)
val result = input.filter(emp => goodNames.contains(emp.name)).map(_.id)
Or even:
val result = input.collect {
case Employee(id, name) if goodNames.contains(name) => id
}
Scala's built-in map, filter, collect functions are already "generic" in the sense that they can filter/map by any function that applies to the elements in the collection.
You can use Shapeless. If you have an employees: List[Employee], you can use
import shapeless._
import shapeless.record._
employees.map(LabelledGeneric[Employee].to(_).toMap)
This converts each Employee to a map from field key to field value. You can then apply the filters to the map.

Extracting data from RDD in Scala/Spark

So I have a large dataset that is a sample of a stackoverflow userbase. One line from this dataset is as follows:
<row Id="42" Reputation="11849" CreationDate="2008-08-01T13:00:11.640" DisplayName="Coincoin" LastAccessDate="2014-01-18T20:32:32.443" WebsiteUrl="" Location="Montreal, Canada" AboutMe="A guy with the attention span of a dead goldfish who has been having a blast in the industry for more than 10 years.
Mostly specialized in game and graphics programming, from custom software 3D renderers to accelerated hardware pipeline programming." Views="648" UpVotes="337" DownVotes="40" Age="35" AccountId="33" />
I would like to extract the number from Reputation (in this case "11849") and the number from Age (in this example "35"), and I would like to have them as floats.
The file is located in HDFS, so it is loaded as an RDD.
val linesWithAge = lines.filter(line => line.contains("Age=")) //This is filtering data which doesnt have age
val repSplit = linesWithAge.flatMap(line => line.split("\"")) //Here I am trying to split the data where there is a "
When I split on quotation marks, the reputation is at index 3 and the age at index 23, but how do I assign these to a map or a variable so I can use them as floats?
Also I need it to do this for every line on the RDD.
EDIT:
val linesWithAge = lines.filter(line => line.contains("Age=")) //transformations from the original input data
val repSplit = linesWithAge.flatMap(line => line.split("\""))
val withIndex = repSplit.zipWithIndex
val indexKey = withIndex.map{case (k,v) => (v,k)}
val b = indexKey.lookup(3)
println(b)
So I added an index to the array and have now managed to assign a value to a variable, but I can only do it for one item in the RDD. Does anyone know how I could do it for all items?
What we want to do is to transform each element in the original dataset (represented as an RDD) into a tuple containing (Reputation, Age) as numeric values.
One possible approach is to transform each element of the RDD using String operations in order to extract the values of the elements "Age" and "Reputation", like this:
// define a function to extract the value of an element, given the name
def findElement(src: Array[String], name:String):Option[String] = {
for {
entry <- src.find(_.startsWith(name))
value <- entry.split("\"").lift(1)
} yield value
}
We then use that function to extract the interesting values from every record:
val reputationByAge = lines.flatMap{line =>
val elements = line.split(" ")
for {
age <- findElement(elements, "Age")
rep <- findElement(elements, "Reputation")
} yield (rep.toInt, age.toInt)
}
Note how we don't need to filter on "Age" before doing this. If we process a record that does not have "Age" or "Reputation", findElement will return None. Hence the result of the for-comprehension will be None, and the record will be dropped by the flatMap operation.
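To see the None-dropping behaviour without a cluster, the same logic can be run on a plain List. This is only a sketch: the two rows are shortened, made-up versions of the sample line, and the second one deliberately has no Age attribute.

```scala
object ExtractAttributes {
  // Same helper as above: first token starting with `name`,
  // then the text between its double quotes
  def findElement(src: Array[String], name: String): Option[String] =
    for {
      entry <- src.find(_.startsWith(name))
      value <- entry.split("\"").lift(1)
    } yield value

  // Shortened, made-up rows; the second one has no Age attribute
  val lines: List[String] = List(
    """<row Id="42" Reputation="11849" DisplayName="Coincoin" Age="35" AccountId="33" />""",
    """<row Id="43" Reputation="101" DisplayName="NoAge" AccountId="34" />"""
  )

  // The for-comprehension yields None for the second row, and flatMap drops it
  val reputationByAge: List[(Int, Int)] = lines.flatMap { line =>
    val elements = line.split(" ")
    for {
      age <- findElement(elements, "Age")
      rep <- findElement(elements, "Reputation")
    } yield (rep.toInt, age.toInt)
  }

  def main(args: Array[String]): Unit = println(reputationByAge)
}
```

One caveat of the prefix match is that "Age" would also match any token that merely starts with those letters, including words inside a free-text attribute such as AboutMe, so this approach is best treated as a quick approximation.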
A better way to approach this problem is to recognize that we are dealing with structured XML data. Scala has XML support through the scala-xml library (a separate module since Scala 2.11), so we can do this:
import scala.xml.XML
import scala.xml.XML._
// help function to map Strings to Option where empty strings become None
def emptyStrToNone(str:String):Option[String] = if (str.isEmpty) None else Some(str)
val xmlReputationByAge = lines.flatMap{line =>
val record = XML.loadString(line)
for {
rep <- emptyStrToNone((record \ "@Reputation").text)
age <- emptyStrToNone((record \ "@Age").text)
} yield (rep.toInt, age.toInt)
}
This method relies on the structure of the XML record to extract the right attributes. As before, we use the combination of Option values and flatMap to remove records that do not contain all the information we require.
First, you need a function that extracts the value for a given key from a line and converts it to the type you want (getValueForKeyAs[T]), then do:
val rdd = linesWithAge.map(line => (getValueForKeyAs(line, "Age")(_.toFloat), getValueForKeyAs(line, "Reputation")(_.toFloat)))
This should give you an rdd of type RDD[(Float,Float)]
getValueForKeyAs could be implemented like this:
// convert is passed explicitly because a plain asInstanceOf[A] cannot turn a String into a Float
def getValueForKeyAs[A](line: String, key: String)(convert: String => A): A = {
val res = line.split(key + "=")
if (res.size == 1) throw new RuntimeException(s"no value for key $key")
// the value is the text between the first pair of double quotes after key=
val value = res(1).split("\"")(1)
convert(value)
}

Sort files by name

How can I sort in ascending/descending order a group of files based on their name with the following naming convention: myPath\numberTheFileInt.ext?
I would like to obtain something like the following:
myPath\1.csv
myPath\02.csv
...
myPath\21.csv
...
myPath\101.csv
Here is what I have at the moment:
myFiles = getFiles(myFilesDirectory).sortWith(_.getName < _.getName)
Despite the files being sorted in the directory, they are unsorted in myFiles.
I have in output:
myPath\1.csv
myPath\101.csv
myPath\02.csv
...
myPath\21.csv
I tried multiple things, but it always throws a NoSuchElementException.
Has anyone already done this?
Comparing strings yields an order based on the Unicode values of their characters. What you need is to extract the file number and order by that as an integer.
import java.io.File
val extractor = "([\\d]+)\\.csv$".r
val files = List(
"myPath/1.csv",
"myPath/101.csv",
"myPath/02.csv",
"myPath/21.csv",
"myPath/33.csv"
).map(new File(_))
val sorted = files.sortWith {(l, r) =>
val extractor(lFileNumber) = l.getName
val extractor(rFileNumber) = r.getName
lFileNumber.toInt < rFileNumber.toInt
}
sorted.foreach(println)
Results:
myPath/1.csv
myPath/02.csv
myPath/21.csv
myPath/33.csv
myPath/101.csv
UPDATE
An alternative as proposed by @dhg
val sorted = files.sortBy { f => f.getName match {
case extractor(n) => n.toInt
}}
A cleaner version of J.Romero's answer, using sortBy:
val Extractor = "([\\d]+)\\.csv".r
val sorted = files.map(_.getName).sortBy{ case Extractor(n) => n.toInt }
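Both sortBy variants throw a MatchError if a name does not match the pattern. A hedged sketch that skips such names and also covers the descending order asked about in the question; the object name, the sample names and the stray "notes.txt" entry are made up for illustration:

```scala
object NumericFileSort {
  // Capture the digits before ".csv"; a Regex used as an extractor
  // must match the whole string
  val Extractor = "(\\d+)\\.csv".r

  val names: List[String] =
    List("1.csv", "101.csv", "02.csv", "21.csv", "notes.txt")

  // collect keeps only the names that match, pairing each with its number
  private val numbered: List[(Int, String)] =
    names.collect { case name @ Extractor(n) => (n.toInt, name) }

  // Ascending by the extracted number
  val sorted: List[String] = numbered.sortBy(_._1).map(_._2)

  // Descending order via a reversed Ordering
  val sortedDesc: List[String] =
    numbered.sortBy(_._1)(Ordering[Int].reverse).map(_._2)

  def main(args: Array[String]): Unit = {
    sorted.foreach(println)
    sortedDesc.foreach(println)
  }
}
```

Silently dropping non-matching names is a design choice. If unexpected files should be reported instead, a partition on the match result may be a better fit.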