Scala filtering through nested Seq Collection

I'm trying to filter through a nested Seq collection.
The aim is to iterate through the myList Seq and identify all Campaigns whose country is 'UK' and which have a banner whose width and height match the parameters w and h below. c defines the country filter parameter.
The filter parameters:
val h = 250 // height filter
val w = 300 // width filter
val c = "UK" // country filter
So far the filtering code I've got below gives me the error:
error: value height is not a member of List[Banner]
val filteredList = myList.filter(c => (c.country == c) && (c.banners.height == h) && (c.banners.width == w))
Thanks in advance!
Sample of the Seq collection:
case class Campaign(id: Int, country: String, targeting: Targeting, banners: List[Banner], bid: Double)
case class Targeting(targetedSiteIds: Seq[String])
case class Banner(id: Int, src: String, width: Int, height: Int)
val myList = Seq(
  Campaign( // Campaign 1
    id = 1,
    country = "LT",
    targeting = Targeting(
      targetedSiteIds = Seq("0006a522ce0f4bbbbaa6b3c38cafaa0f") // TargetedSiteIds
    ),
    banners = List(
      Banner(
        id = 1,
        src = "https://business.eskimi.com/wp-content/uploads/2020/06/openGraph.jpeg", // URL source
        width = 100,
        height = 200
      )
    ),
    bid = 5d
  ),
  Campaign( // Campaign 2
    id = 1,
    country = "UK",
    targeting = Targeting(
      targetedSiteIds = Seq("0006a522ce0f4bbbbaa6b3c38cafaa0f") // TargetedSiteIds
    ),
    banners = List(
      Banner(
        id = 1,
        src = "https://business.eskimi.com/wp-content/uploads/2020/06/openGraph.jpeg", // URL source
        width = 300,
        height = 250
      )
    ),
    bid = 5d
  )
)

val filteredList =
  myList.filter(c => c.country == "UK" &&
    c.banners.exists(_.height == 250) &&
    c.banners.exists(_.width == 300))
Or, probably more correct, since this version requires a single banner to match both dimensions:
val filteredList =
  myList.filter(c => c.country == "UK" &&
    c.banners.exists(b => b.height == 250 &&
      b.width == 300))
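The same predicate can also be written against the question's h, w and c values; a minimal sketch, assuming those vals are in scope (the lambda parameter is renamed so it no longer shadows the country filter c):
val filteredList =
  myList.filter(camp => camp.country == c &&
    camp.banners.exists(b => b.height == h && b.width == w))
// With the sample data this keeps only Campaign 2 (the UK campaign with the 300x250 banner).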

Related

How to get the limit bound for a specific value from an array in Scala?

I have an Array
val bins = Array(0, 100, 250, 500, 1000, 2000, 3000)
and here is my piece of code:
private var cumulativeDelay: Map[String, Double] =
  linkIds.zip(freeFlowDelay).groupBy(_._1).mapValues(l => l.map(_._2).sum)
private var cumulativeCapacity: Map[String, Double] =
  linkIds.zip(linkCapacity).groupBy(_._1).mapValues(l => l.map(_._2).sum)
cumulativeCapacity foreach {
  case (linkId, capacity) => {
    val rangeToValue = bins.zip(bins.tail)
      .collectFirst { case (left, right) if capacity >= left && capacity <= right =>
        Map(s"$left-$right" -> cumulativeDelay.get(linkId))
      }
      .getOrElse(Map.empty[String, Double])
  }
}
The value of rangeToValue comes out as Map(1000-2000 -> Some(625)), but I want rangeToValue: Map[String, Double] = Map(1000-2000 -> 625).
You should try something like this, but it doesn't work with values out of range:
val bins = Array(0, 100, 250, 500, 1000, 2000, 3000)
val effectiveValue = 625
val rangeToValue = bins.zip(bins.tail)
  .collectFirst { case (left, right) if effectiveValue >= left && effectiveValue <= right =>
    Map(s"$left-$right" -> effectiveValue)
  }
  .getOrElse(Map.empty[String, Int])
rangeToValue("500-1000")
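If the goal is a plain Map[String, Double] rather than a Map holding an Option, the Option returned by cumulativeDelay.get can be unwrapped where the map is built; a sketch of the relevant part of the original foreach (where linkId and capacity are in scope), assuming a default delay of 0.0 when the link id is missing:
val rangeToValue: Map[String, Double] = bins.zip(bins.tail)
  .collectFirst { case (left, right) if capacity >= left && capacity <= right =>
    Map(s"$left-$right" -> cumulativeDelay.getOrElse(linkId, 0.0)) // getOrElse removes the Some wrapper
  }
  .getOrElse(Map.empty[String, Double])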

Spark and Scala: Apply a function to each element of an RDD

I have a VertexRDD[(VertexId, Long)] structured as follows:
(533, 1)
(571, 2)
(590, 0)
...
Each element is composed of the vertex id (533, 571, 590, etc.) and its number of outgoing edges (1, 2, 0, etc.).
I want to apply a function to each element of this RDD. This function must perform a comparison between the number of outgoing edges and 4 thresholds.
If the number of outgoing edges is less than or equal to one of the 4 thresholds, then the corresponding vertex id must be inserted into an Array (or another similar data structure), so that at the end I have 4 data structures, each containing the ids of the vertices that satisfy the comparison with the corresponding threshold.
I need the ids that satisfy the comparison with the same threshold to be accumulated in the same data structure. How can I implement this approach in parallel with Spark and Scala?
My code:
val usersGraphQuery = "MATCH (u1:Utente)-[p:PIU_SA_DI]->(u2:Utente) RETURN id(u1), id(u2), type(p)"
val usersGraph = neo.rels(usersGraphQuery).loadGraph[Any, Any]
val numUserGraphNodes = usersGraph.vertices.count
val numUserGraphEdges = usersGraph.edges.count
val maxNumOutDegreeEdgesPerNode = numUserGraphNodes - 1
// get id and number of outgoing edges of each node from the graph
// except those that have 0 outgoing edges (default behavior of the outDegrees API)
var userNodesOutDegreesRdd: VertexRDD[Int] = usersGraph.outDegrees
/* userNodesOutDegreesRdd.foreach(println)
* Now you can see
* (533, 1)
* (571, 2)
*/
// I also get ids of nodes with zero outgoing edges
var fixedGraph: Graph[Any, Any] = usersGraph.outerJoinVertices(userNodesOutDegreesRdd)( (vid: Any, defaultOutDegrees: Any, outDegOpt: Option[Any]) => outDegOpt.getOrElse(0L) )
var completeUserNodesOutDregreesRdd = fixedGraph.vertices
/* completeUserNodesOutDregreesRdd.foreach(println)
* Now you can see
* (533, 1)
* (571, 2)
* (590, 0) <--
*/
// 4 thresholds that identify the 4 clusters of User nodes based on the number of their outgoing edges
var soglia25: Double = (maxNumOutDegreeEdgesPerNode.toDouble/100)*25
var soglia50: Double = (maxNumOutDegreeEdgesPerNode.toDouble/100)*50
var soglia75: Double = (maxNumOutDegreeEdgesPerNode.toDouble/100)*75
var soglia100: Double = maxNumOutDegreeEdgesPerNode
println("soglie: "+soglia25+", "+soglia50+", "+soglia75+", "+soglia100)
// containers of individual clusters
var lowSAUsers = new ListBuffer[(Long, Any)]()
var mediumLowSAUsers = new ListBuffer[(Long, Any)]()
var mediumHighSAUsers = new ListBuffer[(Long, Any)]()
var highSAUsers = new ListBuffer[(Long, Any)]()
// overall container of the 4 clusters
var clustersContainer = new ListBuffer[ (String, ListBuffer[(Long, Any)]) ]()
// I WANT PARALLEL FROM HERE -----------------------------------------------
// from RDD to Array
var completeUserNodesOutDregreesArray = completeUserNodesOutDregreesRdd.take(numUserGraphNodes.toInt)
// examine each User node and assign it to the cluster it belongs to
for (i <- 0 to numUserGraphNodes.toInt - 1) {
  // compare the number of outgoing edges (converted via a string)
  // against the thresholds to decide which class the User node belongs to
  if ((completeUserNodesOutDregreesArray(i)._2).toString().toLong <= soglia25) {
    println("ok soglia25 ")
    lowSAUsers += completeUserNodesOutDregreesArray(i)
  } else if ((completeUserNodesOutDregreesArray(i)._2).toString().toLong <= soglia50) {
    println("ok soglia50 ")
    mediumLowSAUsers += completeUserNodesOutDregreesArray(i)
  } else if ((completeUserNodesOutDregreesArray(i)._2).toString().toLong <= soglia75) {
    println("ok soglia75 ")
    mediumHighSAUsers += completeUserNodesOutDregreesArray(i)
  } else if ((completeUserNodesOutDregreesArray(i)._2).toString().toLong <= soglia100) {
    println("ok soglia100 ")
    highSAUsers += completeUserNodesOutDregreesArray(i)
  }
}
// I put each cluster in the final container
clustersContainer += Tuple2("lowSAUsers", lowSAUsers)
clustersContainer += Tuple2("mediumLowSAUsers", mediumLowSAUsers)
clustersContainer += Tuple2("mediumHighSAUsers", mediumHighSAUsers)
clustersContainer += Tuple2("highSAUsers", highSAUsers)
/* clustersContainer.foreach(println)
* Now you can see
* (lowSAUsers,ListBuffer((590,0)))
* (mediumLowSAUsers,ListBuffer((533,1)))
* (mediumHighSAUsers,ListBuffer())
* (highSAUsers,ListBuffer((571,2)))
*/
// ---------------------------------------------------------------------
How about you create an array of tuples representing the different bins:
val bins = Seq(0, soglia25, soglia50, soglia75, soglia100).sliding(2)
.map(seq => (seq(0), seq(1))).toArray
Then for each element of your RDD you find a corresponding bin, make it a key, convert id to Seq and reduce by key:
def getBin(bins: Array[(Double, Double)], value: Int): Int = {
  bins.indexWhere { case (a: Double, b: Double) => a < value && b >= value }
}
userNodesOutDegreesRdd.map {
  case (id, value) => (getBin(bins, value), Seq(id))
}.reduceByKey(_ ++ _)
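If you then want the four named clusters from the question rather than bare bin indices, one option is to map each index back to a label and collect the result to the driver; a sketch (the labels array is my addition, and it assumes every out-degree falls inside one of the 4 bins):
val labels = Array("lowSAUsers", "mediumLowSAUsers", "mediumHighSAUsers", "highSAUsers")
val clusters = userNodesOutDegreesRdd
  .map { case (id, value) => (getBin(bins, value), Seq(id)) } // bin index becomes the key
  .reduceByKey(_ ++ _)                                        // ids accumulate per bin, in parallel
  .map { case (binIdx, ids) => (labels(binIdx), ids) }
  .collect()                                                  // Array[(String, Seq[VertexId])] on the driver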

filtering dataframe in scala

Let's say I have a DataFrame created from a text file using a case class schema. Below is the data stored in the DataFrame.
id, Type, qt, P
1, X, 10, 100.0
2, Y, 20, 200.0
1, Y, 15, 150.0
1, X, 5, 120.0
I need to filter the DataFrame by "id" and Type, and for every "id" iterate through the DataFrame for some calculation.
I tried it this way but it did not work. Code snippet:
case class MyClass(id: Int, type: String, qt: Long, PRICE: Double)
val df = sc.textFile("xyz.txt")
.map(_.split(","))
.map(p => MyClass(p(0).trim.toInt, p(1), p(2).trim.toLong, p(3).trim.toDouble)
.toDF().cache()
val productList: List[Int] = df.map{row => row.getInt(0)}.distinct.collect.toList
val xList: List[RDD[MyClass]] = productList.map {
productId => df.filter({ item: MyClass => (item.id== productId) && (item.type == "X" })}.toList
val yList: List[RDD[MyClass]] = productList.map {
productId => df.filter({ item: MyClass => (item.id== productId) && (item.type == "Y" })}.toList
Taking the distinct idea from your example, simply iterate over all the IDs and filter the DataFrame according to the current ID. After this you have a DataFrame with only the relevant data:
val df3 = sc.textFile("src/main/resources/importantStuff.txt") // Your data here
  .map(_.split(","))
  .map(p => MyClass(p(0).trim.toInt, p(1), p(2).trim.toLong, p(3).trim.toDouble)).toDF().cache()
val productList: List[Int] = df3.map { row => row.getInt(0) }.distinct.collect.toList
println(productList)
productList.foreach(id => {
  val sqlDF = df3.filter(df3("id") === id)
  sqlDF.show()
})
sqlDF in the loop is the DataFrame with the relevant data; you can then run your calculations on it.
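Since the question also filters on Type, the same idea extends to both columns; a sketch, assuming the Type column ends up named "type" in the DataFrame (mirroring the case class field):
productList.foreach(id => {
  val xDF = df3.filter(df3("id") === id && df3("type") === "X") // rows for this id with Type X
  val yDF = df3.filter(df3("id") === id && df3("type") === "Y") // rows for this id with Type Y
  xDF.show()
  yDF.show()
})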

Scala groupBy of a tuple to calculate stock basis

I am working on an exercise to calculate stock basis given a list of stock purchases in the form of triples (ticker, qty, stock_price). I've got it working, but would like to do the calculation part in a more functional way. Does anyone have an answer for this?
// input:
// List(("TSLA", 20, 200),
//      ("TSLA", 20, 100),
//      ("FB", 10, 100))
// output:
// List(("FB", (10, 100)),
//      ("TSLA", (40, 150)))
def generateBasis(trades: Iterable[(String, Int, Int)]) = {
  val basises = trades groupBy(_._1) map {
    case (key, pairs) =>
      val quantity = pairs.map(_._2).toList
      val price = pairs.map(_._3).toList
      var totalPrice: Int = 0
      for (i <- quantity.indices) {
        totalPrice += quantity(i) * price(i)
      }
      key -> (quantity.sum, totalPrice / quantity.sum)
  }
  basises
}
This looks like it might work for you. (updated)
def generateBasis(trades: Iterable[(String, Int, Int)]) =
  trades.groupBy(_._1).mapValues {
    _.foldLeft((0, 0)) { case ((tq, tp), (_, q, p)) => (tq + q, tp + q * p) }
  }.map { case (k, (q, p)) => (k, q, p / q) } // turn the Map into tuples (triples)
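A quick sanity check against the sample input from the question (note the result comes back as (ticker, qty, basis) triples rather than the nested pairs shown in the question, and the order follows the underlying Map):
val trades = List(("TSLA", 20, 200), ("TSLA", 20, 100), ("FB", 10, 100))
generateBasis(trades).toList // List((FB,10,100), (TSLA,40,150)) -- order may vary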
I came up with the solution below. Thanks everyone for their input. I'd love to hear if anyone had a more elegant solution.
// input:
// List(("TSLA", 20, 200),
//      ("TSLA", 10, 100),
//      ("FB", 5, 50))
// output:
// List(("FB", (5, 50)),
//      ("TSLA", (30, 166)))
def generateBasis(trades: Iterable[(String, Int, Int)]) = {
  val groupedTrades = (trades groupBy(_._1)) map {
    case (key, pairs) =>
      key -> (pairs.map(e => (e._2, e._3)))
  } // List((FB,List((5,50))), (TSLA,List((20,200), (10,100))))
  val costBasises = for {
    groupedTrade <- groupedTrades
    tradeCost = for {
      tup <- groupedTrade._2 // (qty, cost)
    } yield tup._1 * tup._2 // (trade_qty * trade_cost)
    tradeQuantity = for {
      tup <- groupedTrade._2
    } yield tup._1 // trade_qty
  } yield (groupedTrade._1, tradeQuantity.sum, tradeCost.sum / tradeQuantity.sum)
  costBasises.toList // List(("FB", (5, 50)),("TSLA", (30, 166)))
}
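For comparison, the same kind of check on this version, using the sample input from the comments above:
generateBasis(List(("TSLA", 20, 200), ("TSLA", 10, 100), ("FB", 5, 50)))
// List((FB,5,50), (TSLA,30,166)) -- also triples, with integer division for the basis; order may vary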

Slick build filter criteria

I think the code below mostly speaks for itself, but here's a short explanation.
I have a list of ids that need to be added to a query condition. I can easily "and" the conditions onto the query (see val incorrect below), but am having trouble coming up with a good way to "or" the conditions.
The list of ids is not static; I just put some in there as an example. If possible, I'd like to know how to do it both with a for comprehension and without one.
Also, you should be able to drop this code into a REPL and add some imports if you want to run it.
object Tbl1Table {
  case class Tbl1(id: Int, gid: Int, item: Int)

  class Tbl1Table(tag: Tag) extends Table[Tbl1](tag, "TBL1") {
    val id = column[Int]("id")
    val gid = column[Int]("gid")
    val item = column[Int]("item")
    def * = (id, gid, item) <> (Tbl1.tupled, Tbl1.unapply)
  }

  lazy val theTable = new TableQuery(tag => new Tbl1Table(tag))

  val ids = List((204, 11), (204, 12), (204, 13), (205, 19))

  val query = for {
    x <- theTable
  } yield x
  println(s"select is ${query.selectStatement}")
  //prints: select is select x2."id", x2."gid", x2."item" from "TBL1" x2

  val idsGrp = ids.groupBy(_._1)
  val incorrect = idsGrp.foldLeft(query)((b, a) =>
    b.filter(r => (r.gid is a._1) && (r.item inSet(a._2.map(_._2))))
  )
  println(s"select is ${incorrect.selectStatement}")
  //prints: select is select x2."id", x2."gid", x2."item" from "TBL1" x2
  //        where ((x2."gid" = 205) and (x2."item" in (19))) and
  //        ((x2."gid" = 204) and (x2."item" in (11, 12, 13)))
  //but want to "or" everything, ie:
  //prints: select is select x2."id", x2."gid", x2."item" from "TBL1" x2
  //        where ((x2."gid" = 205) and (x2."item" in (19))) or
  //        ((x2."gid" = 204) and (x2."item" in (11, 12, 13)))
}
This seems to work fine:
import scala.slick.driver.PostgresDriver.simple._

case class Tbl1Row(id: Int, gid: Int, item: Int)

class Tbl1Table(tag: Tag) extends Table[Tbl1Row](tag, "TBL1") {
  val id = column[Int]("id")
  val gid = column[Int]("gid")
  val item = column[Int]("item")
  def * = (id, gid, item) <> (Tbl1Row.tupled, Tbl1Row.unapply)
}

lazy val theTable = new TableQuery(tag => new Tbl1Table(tag))

val ids = List((204, 11), (204, 12), (204, 13), (205, 19))
val idsGrp = ids.groupBy(_._1)

val correct = theTable.filter(r =>
  idsGrp.map(t => (r.gid is t._1) && (r.item inSet(t._2.map(_._2)))).reduce(_ || _))

println(s"select is ${correct.selectStatement}")
Output is
select is select s16."id", s16."gid", s16."item" from "TBL1" s16 where ((s16."gid" = 205) and (s16."item" in (19))) or ((s16."gid" = 204) and (s16."item" in (11, 12, 13)))
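One caveat: reduce(_ || _) throws an UnsupportedOperationException when the collection is empty, so if ids can ever be empty you may want to guard that case; a minimal sketch (the empty check is my addition, not part of the answer above):
val correct =
  if (idsGrp.isEmpty) theTable // no criteria: keep the unfiltered query, or substitute whatever default suits you
  else theTable.filter(r =>
    idsGrp.map(t => (r.gid is t._1) && (r.item inSet(t._2.map(_._2)))).reduce(_ || _))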