I think the code below mostly speaks for itself, but here's a short explanation.
I have a list of ids that need to be added to a query condition. I can easily "and" the conditions onto the query (see val incorrect below), but am having trouble coming up with a good way to "or" the conditions.
The list of ids is not static; I just put some in there as an example. If possible, I'd like to know how to do it both with a for comprehension and without one.
Also, you should be able to drop this code into a REPL (adding a couple of imports) if you want to run it.
object Tbl1Table {
  case class Tbl1(id: Int, gid: Int, item: Int)

  class Tbl1Table(tag: Tag) extends Table[Tbl1](tag, "TBL1") {
    val id = column[Int]("id")
    val gid = column[Int]("gid")
    val item = column[Int]("item")
    def * = (id, gid, item) <> (Tbl1.tupled, Tbl1.unapply)
  }

  lazy val theTable = new TableQuery(tag => new Tbl1Table(tag))

  val ids = List((204, 11), (204, 12), (204, 13), (205, 19))

  val query = for {
    x <- theTable
  } yield x
  println(s"select is ${query.selectStatement}")
  //prints: select is select x2."id", x2."gid", x2."item" from "TBL1" x2

  val idsGrp = ids.groupBy(_._1)
  val incorrect = idsGrp.foldLeft(query)((b, a) =>
    b.filter(r => (r.gid is a._1) && (r.item inSet a._2.map(_._2)))
  )
  println(s"select is ${incorrect.selectStatement}")
  //prints: select is select x2."id", x2."gid", x2."item" from "TBL1" x2
  //        where ((x2."gid" = 205) and (x2."item" in (19))) and
  //              ((x2."gid" = 204) and (x2."item" in (11, 12, 13)))
  //but want to "or" everything, ie:
  //prints: select is select x2."id", x2."gid", x2."item" from "TBL1" x2
  //        where ((x2."gid" = 205) and (x2."item" in (19))) or
  //              ((x2."gid" = 204) and (x2."item" in (11, 12, 13)))
}
This seems to work fine:
import scala.slick.driver.PostgresDriver.simple._

case class Tbl1Row(id: Int, gid: Int, item: Int)

class Tbl1Table(tag: Tag) extends Table[Tbl1Row](tag, "TBL1") {
  val id = column[Int]("id")
  val gid = column[Int]("gid")
  val item = column[Int]("item")
  def * = (id, gid, item) <> (Tbl1Row.tupled, Tbl1Row.unapply)
}

lazy val theTable = new TableQuery(tag => new Tbl1Table(tag))

val ids = List((204, 11), (204, 12), (204, 13), (205, 19))
val idsGrp = ids.groupBy(_._1)

val correct = theTable.filter(r =>
  idsGrp.map(t => (r.gid is t._1) && (r.item inSet t._2.map(_._2))).reduce(_ || _)
)
println(s"select is ${correct.selectStatement}")
Output is
select is select s16."id", s16."gid", s16."item" from "TBL1" s16 where ((s16."gid" = 205) and (s16."item" in (19))) or ((s16."gid" = 204) and (s16."item" in (11, 12, 13)))
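The question also asked for a for-comprehension version; the same disjunction appears to work as a guard (a quick sketch using the definitions above, not verified against a database):

val correctFor = for {
  r <- theTable
  if idsGrp.map(t => (r.gid is t._1) && (r.item inSet t._2.map(_._2))).reduce(_ || _)
} yield r
println(s"select is ${correctFor.selectStatement}")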
Related
I have a Dataset[Metric] and transform it to a KeyValueGroupedDataset (grouping by metricId) in order to then perform reduceGroups.
The problem I've faced is that when there is just one record with some metricId, like metric3 in the example below, it is returned as-is and the processTime field is not updated. However, when there is more than one record with the same metricId, they are reduced and the processTime field is updated correctly.
I guess this happens because reduceGroups needs at least 2 records in a group and otherwise just returns the single record unchanged.
But I can't figure out how to update the processTime field when there is only a single record in a group.
import java.sql.Timestamp
import java.time.LocalDateTime
import org.apache.spark.sql.{Dataset, KeyValueGroupedDataset}
import spark.implicits._ // assumes a SparkSession named spark (e.g. in spark-shell)

case class Metric(
  metricId: String,
  rank: Int,
  features: List[Feature],
  processTime: Timestamp
)

case class Feature(
  featureId: String,
  name: String,
  value: String
)

val f1 = Feature("1", "f1", "v1")
val f2 = Feature("1", "f2", "v2")
val f3 = Feature("2", "f3", "v3")

val metric1 = Metric("1", 1, List(f1, f2, f3), Timestamp.valueOf("2019-07-01 00:00:00"))
val metric2 = Metric("1", 2, List(f3, f2), Timestamp.valueOf("2019-07-01 00:00:00"))
val metric3 = Metric("2", 1, List(f1, f2), Timestamp.valueOf("2019-07-21 00:00:00"))

val metricsList = List(metric1, metric2, metric3)
val metricsDS: Dataset[Metric] = metricsList.toDS()

val groupedMetrics: KeyValueGroupedDataset[String, Metric] = metricsDS.groupByKey(x => x.metricId)

val aggregatedMetrics: Dataset[(String, Metric)] = groupedMetrics.reduceGroups {
  (m1: Metric, m2: Metric) =>
    val theMetric: Metric = if (m2.rank >= m1.rank) m2 else m1
    Metric(
      theMetric.metricId,
      theMetric.rank,
      m2.features ++ m1.features,
      Timestamp.valueOf(LocalDateTime.now())
    )
}
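One possible workaround (a sketch of mine, not from the original post): use mapGroups instead of reduceGroups, since mapGroups runs for every group, including single-record ones, so processTime can be stamped unconditionally:

// Sketch (assumption): mapGroups sees every group, even single-record ones,
// so the processTime stamp below is applied unconditionally.
val aggregated: Dataset[(String, Metric)] = groupedMetrics.mapGroups { (id, metrics) =>
  val merged = metrics.reduceLeft { (m1, m2) =>
    val higher = if (m2.rank >= m1.rank) m2 else m1
    higher.copy(features = m2.features ++ m1.features)
  }
  (id, merged.copy(processTime = Timestamp.valueOf(LocalDateTime.now())))
}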
I have an RDD, say sample_rdd, of type RDD[(String, String, Int)] with 3 columns: id, item, count. Sample data:
id1|item1|1
id1|item2|3
id1|item3|4
id2|item1|3
id2|item4|2
I want to join each id against a lookup_rdd like this:
item1|0
item2|0
item3|0
item4|0
item5|0
The output should give me the following for id1 (an outer join with the lookup table):
item1|1
item2|3
item3|4
item4|0
item5|0
Similarly, for id2 I should get:
item1|3
item2|0
item3|0
item4|2
item5|0
Finally, the output for each id should have all counts with the id:
id1,1,3,4,0,0
id2,3,0,0,2,0
IMPORTANT: this output should always be ordered according to the order in the lookup.
This is what I have tried:
val line = sample_rdd.map { case (id, item, count) => (id, (item, count)) }.groupByKey()

get(line).map(l => (l._1, l._2)).mapValues(item_count => lookup_rdd.leftOuterJoin(item_count))

def get(line: RDD[(String, Iterable[(String, Int)])]) = {
  for {
    (id, item_cnt) <- line
    i = item_cnt.map(tuple => (tuple._1, tuple._2))
  } yield (id, i)
}
Try the code below. Run each step in your local console to understand what's happening in detail.
The idea is to zipWithIndex and form a sequence based on lookup_rdd:
(i1,0), (i2,1) ... (i5,4) and (id1,0), (id2,1)
Index of the final result = [delta (the length of the lookup_rdd sequence) * index of id1..id2] + index of i1..i5
So the base seq generated will be (0,(i1,id1)), (1,(i2,id1)) ... (8,(i4,id2)), (9,(i5,id2)),
and then, based on the key (i1, id1), we reduce and calculate the count.
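For reference, the sample inputs assumed by the code below could look like this (hypothetical arr and cart values matching the question's data):

// Hypothetical sample collections matching the question's data:
// arr plays the role of sample_rdd's rows, cart the lookup_rdd's rows.
val arr  = Seq(("id1", "item1", 1), ("id1", "item2", 3), ("id1", "item3", 4),
               ("id2", "item1", 3), ("id2", "item4", 2))
val cart = Seq(("item1", 0), ("item2", 0), ("item3", 0), ("item4", 0), ("item5", 0))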
val res2 = sc.parallelize(arr)  // sample_rdd
val res3 = sc.parallelize(cart) // lookup_rdd

val delta = res3.count

// base seq: key (item, id), value (final index = delta * idIndex + itemIndex, count 0)
val res83 = res3.map(_._1).zipWithIndex
  .cartesian(res2.map(_._1).distinct.zipWithIndex)
  .map(x => ((x._1._1, x._2._1), ((delta * x._2._2) + x._1._2, 0)))

// actual counts keyed by (item, id)
val res86 = res2.map(x => ((x._2, x._1), x._3)).reduceByKey(_ + _)

val res88 = res83.leftOuterJoin(res86)

// keep the real count where present, the 0 placeholder otherwise, keyed by the final index
val res91 = res88.map(x => {
  x._2._2 match {
    case Some(x1) => (x._2._1._1, (x._1, x._2._1._2 + x1))
    case None     => (x._2._1._1, (x._1, x._2._1._2))
  }
})

// sort by the final index and gather the counts per id in lookup order
val res97 = res91.sortByKey(true)
  .map(x => (x._2._1._2, List(x._2._2)))
  .reduceByKey(_ ++ _)
res97.collect
// SOLUTION: Array((id1,List(1,3,4,0,0)),(id2,List(3,0,0,2,0)))
I am new to Scala. I have two RDDs and I need to separate out my training and testing data. In one file I have all the data and in another just the testing data. I need to remove the testing data from my complete data set.
The complete data file is of the format (userID, MovID, Rating, Timestamp):
res8: Array[String] = Array(1, 31, 2.5, 1260759144)
The test data file is of the format (userID, MovID):
res10: Array[String] = Array(1, 1172)
How do I generate ratings_train so that it does not contain the cases matched with the testing dataset?
I am using the following function, but the returned list comes back empty:
def create_training(data: RDD[String], ratings_test: RDD[String]): ListBuffer[Array[String]] = {
  val ratings_split = dropheader(data).map(line => line.split(","))
  val ratings_testing = dropheader(ratings_test).map(line => line.split(",")).collect()
  var ratings_train = new ListBuffer[Array[String]]()
  ratings_split.foreach(x => {
    ratings_testing.foreach(y => {
      if (x(0) != y(0) || x(1) != y(1)) {
        ratings_train += x
      }
    })
  })
  return ratings_train
}
EDIT: I changed the code, but I am running into memory issues.
This may work:
def create_training(data: RDD[String], ratings_test: RDD[String]): Array[Array[String]] = {
  val ratings_split = dropheader(data).map(line => line.split(","))
  val ratings_testing = dropheader(ratings_test).map(line => line.split(","))
  ratings_split.filter(x => {
    ratings_testing.exists(y =>
      (x(0) == y(0)) && (x(1) == y(1))
    ) == false
  })
}
The code snippets you posted are not logically correct. A row should only be part of the final data if it is not present in the test data, but your code keeps a row whenever it fails to match a single test record; instead, you have to check that it matches none of the test records before deciding it is a valid row.
You are using RDDs, but not exploring their full power. I guess you are reading the input from a CSV file. You can then structure your data within the RDD; there is no need to split the string on the comma character and manually process it as a row. You can take a look at the DataFrame API of Spark. These links may help: https://www.tutorialspoint.com/spark_sql/spark_sql_dataframes.htm , http://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes
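For example, here is a rough sketch of the same exclusion with the DataFrame API (my addition; it assumes Spark 2.x, a SparkSession named spark, header rows in both files, and hypothetical file paths and column names):

// Sketch: read both CSVs with headers and drop every rating whose (userID, MovID)
// pair appears in the test set, using a left anti join.
val ratingsDF = spark.read.option("header", "true").csv("ratings.csv")      // userID, MovID, Rating, Timestamp
val testDF    = spark.read.option("header", "true").csv("ratings_test.csv") // userID, MovID
val trainDF   = ratingsDF.join(testDF, Seq("userID", "MovID"), "left_anti")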
Using a regex:

import org.apache.spark.rdd.RDD

def main(args: Array[String]): Unit = {
  // creating a test data set
  val data = spark.sparkContext.parallelize(Seq(
    // "userID, MovID, Rating, Timestamp",
    "1, 31, 2.5, 1260759144",
    "2, 31, 2.5, 1260759144"))
  val ratings_test = spark.sparkContext.parallelize(Seq(
    // "userID, MovID",
    "1, 31",
    "2, 30",
    "30, 2"
  ))
  val result = getData(data, ratings_test).collect()
  // the result will only contain "2, 31, 2.5, 1260759144"
}

def getData(data: RDD[String], ratings_test: RDD[String]): RDD[String] = {
  // dropheader comes from the question's code and drops the CSV header line
  val ratings = dropheader(data)
  val ratings_testing = dropheader(ratings_test)
  // Broadcast the test rating data to all Spark nodes, since we collect it beforehand.
  // The reason we collect the test data is to avoid calling collect in the filter logic.
  val ratings_testing_bc = spark.sparkContext.broadcast(ratings_testing.collect.toSet)
  ratings.filter(rating => {
    ratings_testing_bc.value.exists(testRating => regexMatch(rating, testRating)) == false
  })
}

def regexMatch(data: String, testData: String): Boolean = {
  // Regular expression to extract the first two columns
  val regex = """^([^,]*), ([^,\r\n]*),?""".r
  val (dataCol1, dataCol2) = regex findFirstIn data match {
    case Some(regex(col1, col2)) => (col1, col2)
  }
  val (testDataCol1, testDataCol2) = regex findFirstIn testData match {
    case Some(regex(col1, col2)) => (col1, col2)
  }
  (dataCol1 == testDataCol1) && (dataCol2 == testDataCol2)
}
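One caveat (my note, not from the original answer): the pattern match in regexMatch is not exhaustive, so a line the regex cannot parse would throw a MatchError. A defensive variant could treat such lines as non-matching:

// Sketch: fall back to "no match" when either line cannot be parsed by the regex.
def regexMatchSafe(data: String, testData: String): Boolean = {
  val regex = """^([^,]*), ([^,\r\n]*),?""".r
  (regex.findFirstMatchIn(data), regex.findFirstMatchIn(testData)) match {
    case (Some(d), Some(t)) => d.group(1) == t.group(1) && d.group(2) == t.group(2)
    case _                  => false
  }
}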
Let's say I have a DataFrame created from a text file using a case class schema. Below is the data stored in the DataFrame.
id - Type - qt - P
1, X, 10, 100.0
2, Y, 20, 200.0
1, Y, 15, 150.0
1, X, 5, 120.0
I need to filter the DataFrame by "id" and Type, and for every "id" iterate through the DataFrame for some calculation.
I tried this way but it did not work. Code snippet:
case class MyClass(id: Int, `type`: String, qt: Long, PRICE: Double)

val df = sc.textFile("xyz.txt")
  .map(_.split(","))
  .map(p => MyClass(p(0).trim.toInt, p(1), p(2).trim.toLong, p(3).trim.toDouble))
  .toDF().cache()

val productList: List[Int] = df.map { row => row.getInt(0) }.distinct.collect.toList

val xList: List[RDD[MyClass]] = productList.map { productId =>
  df.filter { item: MyClass => (item.id == productId) && (item.`type` == "X") }
}
val yList: List[RDD[MyClass]] = productList.map { productId =>
  df.filter { item: MyClass => (item.id == productId) && (item.`type` == "Y") }
}
Taking the distinct idea from your example, simply iterate over all the IDs and filter the DataFrame according to the current ID. After this you have a DataFrame with only the relevant data:
val df3 = sc.textFile("src/main/resources/importantStuff.txt") // your data here
  .map(_.split(","))
  .map(p => MyClass(p(0).trim.toInt, p(1), p(2).trim.toLong, p(3).trim.toDouble))
  .toDF().cache()

val productList: List[Int] = df3.map { row => row.getInt(0) }.distinct.collect.toList
println(productList)

productList.foreach(id => {
  val sqlDF = df3.filter(df3("id") === id)
  sqlDF.show()
})
sqlDF in the loop is the DataFrame with the relevant data; you can then run your calculations on it.
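For instance, a possible calculation per id could be an aggregation over Type (a sketch of my own; it assumes the columns end up named type, qt and PRICE, matching the case class fields):

import org.apache.spark.sql.functions.{avg, sum}

productList.foreach(id => {
  val sqlDF = df3.filter(df3("id") === id)
  // hypothetical per-id calculation: total quantity and average price per type
  sqlDF.groupBy("type").agg(sum("qt").as("total_qt"), avg("PRICE").as("avg_price")).show()
})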
I have the data below, which needs to be processed using Spark (Scala) in such a way that I only get the id of the person who visited "Walmart" but not "Bestbuy". A store might appear repeatedly because a person can visit a store any number of times.
Input Data:
id, store
1, Walmart
1, Walmart
1, Bestbuy
2, Target
3, Walmart
4, Bestbuy
Output Expected:
3, Walmart
I have got the output using DataFrames and running SQL queries on the Spark context. But is there any way to do this using groupByKey/reduceByKey etc. without DataFrames? Can someone help me with the code? After map -> groupByKey, a ShuffledRDD has been formed and I am facing difficulty in filtering the CompactBuffer!
The code with which I got it using sqlContext is below:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class Person(id: Int, store: String)

val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0).trim.toInt, p(1)))
people.registerTempTable("people")

val result = sqlContext.sql("select id, store from people left semi join (select id from people where store in ('Walmart','Bestbuy') group by id having count(distinct store) = 1) sample on people.id = sample.id and people.store = 'Walmart'")
The code which I am trying now is this, but I am stuck after the third step:
val data = sc.textFile("examples/src/main/resources/people.txt")
  .map(x => (x.split(",")(0), x.split(",")(1)))
  .filter(_._1 != "id") // drop the header row

val dataGroup = data.groupByKey()

val dataFiltered = dataGroup.map { case (x, y) =>
  val url = y.flatMap(x => x.split(",")).toList
  if (!url.contains("Bestbuy") && url.contains("Walmart")) {
    x.map(x => (x, y))
  }
}
If I do dataFiltered.collect(), I am getting:
Array[Any] = Array(Vector((3,Walmart)), (), ())
Please help me extract the output after this step.
To filter an RDD, just use RDD.filter:
val dataGroup = data.groupByKey()
val dataFiltered = dataGroup.filter {
  // keep only lists that contain Walmart but do not contain Bestbuy:
  case (x, y) => val l = y.toList; l.contains("Walmart") && !l.contains("Bestbuy")
}
dataFiltered.foreach(println) // prints: (3,CompactBuffer(Walmart))

// if you want to flatten this back to tuples of (id, store):
val result = dataFiltered.flatMap { case (id, stores) => stores.map(store => (id, store)) }
result.foreach(println) // prints: (3,Walmart)
I also tried it another way and it worked out
val data = sc.textFile("examples/src/main/resources/people.txt")
  .filter(!_.startsWith("id")) // drop the header row
  .map(x => (x.split(",")(0), x.split(",")(1)))
data.cache()

val dataWalmart = data.filter { case (x, y) => y.contains("Walmart") }.distinct()
val dataBestbuy = data.filter { case (x, y) => y.contains("Bestbuy") }.distinct()
val result = dataWalmart.subtractByKey(dataBestbuy)

data.unpersist()
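Since the question asked specifically about reduceByKey, here is one more variant (a sketch of my own, not from the answers above): fold each id's visits into a pair of flags and keep the ids that saw Walmart but not Bestbuy.

// Sketch: per id, combine store visits into (sawWalmart, sawBestbuy) flags with reduceByKey,
// then keep only the ids where the Walmart flag is set and the Bestbuy flag is not.
val flags = data
  .mapValues(store => (store == "Walmart", store == "Bestbuy"))
  .reduceByKey((a, b) => (a._1 || b._1, a._2 || b._2))

val walmartOnly = flags
  .filter { case (_, (sawWalmart, sawBestbuy)) => sawWalmart && !sawBestbuy }
  .map { case (id, _) => (id, "Walmart") }

walmartOnly.foreach(println) // prints: (3,Walmart)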