filtering dataframe in scala - scala

Let's say I have a DataFrame created from a text file using a case class schema. Below is the data stored in the DataFrame:
id, Type, qt, P
1, X, 10, 100.0
2, Y, 20, 200.0
1, Y, 15, 150.0
1, X, 5, 120.0
I need to filter the DataFrame by "id" and Type, and for every "id" iterate through the DataFrame for some calculation.
I tried it this way but it did not work. Code snippet:
case class MyClass(id: Int, type: String, qt: Long, PRICE: Double)
val df = sc.textFile("xyz.txt")
.map(_.split(","))
.map(p => MyClass(p(0).trim.toInt, p(1), p(2).trim.toLong, p(3).trim.toDouble))
.toDF().cache()
val productList: List[Int] = df.map{row => row.getInt(0)}.distinct.collect.toList
val xList: List[RDD[MyClass]] = productList.map { productId =>
  df.filter({ item: MyClass => (item.id == productId) && (item.type == "X") })
}.toList
val yList: List[RDD[MyClass]] = productList.map { productId =>
  df.filter({ item: MyClass => (item.id == productId) && (item.type == "Y") })
}.toList

Taking the distinct idea from your example, simply iterate over all the IDs and filter the DataFrame according to the current ID. After this you have a DataFrame with only the relevant data:
val df3 = sc.textFile("src/main/resources/importantStuff.txt") //Your data here
.map(_.split(","))
.map(p => MyClass(p(0).trim.toInt, p(1), p(2).trim.toLong, p(3).trim.toDouble)).toDF().cache()
val productList: List[Int] = df3.map{row => row.getInt(0)}.distinct.collect.toList
println(productList)
productList.foreach(id => {
val sqlDF = df3.filter(df3("id") === id)
sqlDF.show()
})
sqlDF in the loop is the DataFrame with the relevant data; you can then run your calculations on it.
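Since the question also needs to filter on Type, a hedged extension of the same loop might look like this (assuming the Type column ends up named "type"; note the `type` field in the case class would need backticks to compile):
productList.foreach(id => {
  // hypothetical: split each id's data by Type for the per-id calculation
  val xDF = df3.filter(df3("id") === id && df3("type") === "X")
  val yDF = df3.filter(df3("id") === id && df3("type") === "Y")
  xDF.show()
  yDF.show()
})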

Related

False/True Column constant

TL;DR: I need this Spark constant:
val False : Column = lit(1) === lit(0)
Any idea how to make it prettier?
Problem Context
I want to filter a DataFrame from a collection. For example:
case class Condition(column: String, value: String)
val conditions = Seq(
Condition("name", "bob"),
Condition("age", 18)
)
val personsDF = Seq(
("bob", 30),
("anna", 20),
("jack", 18)
).toDF("name", "age")
When applying my collection to personsDF I expect:
val expected = Seq(
("bob", 30),
("jack", 18)
)
To do so, I am creating a filter from the collection and applying it to the DataFrame:
val conditionsFilter = conditions.foldLeft(initialValue) {
case (cumulatedFilter, Condition(column, value)) =>
cumulatedFilter || col(column) === value
}
personsDF.filter(conditionsFilter)
Pretty sweet, right?
But to do so, I need the neutral value of the OR operator, which is False. Since False doesn't exist in Spark, I used:
val False : Column = lit(1) === lit(0)
Any idea how to do this without tricks?
You can just do:
val False : Column = lit(false)
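As a minimal sketch (reusing conditions and personsDF from the question), lit(false) then slots in directly as the fold's neutral element:
import org.apache.spark.sql.functions.{col, lit}

val conditionsFilter = conditions.foldLeft(lit(false)) {
  case (cumulatedFilter, Condition(column, value)) =>
    cumulatedFilter || col(column) === value
}
personsDF.filter(conditionsFilter).show()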
This should be your initialValue, right? You can avoid that by using head and tail:
val buildCondition = (c:Condition) => col(c.column)===c.value
val initialValue = buildCondition(conditions.head)
val conditionsFilter = conditions.tail.foldLeft(initialValue)(
(cumulatedFilter, condition) =>
cumulatedFilter || buildCondition(condition)
)
Even shorter, you could use reduce:
val buildCondition = (c:Condition) => col(c.column)===c.value
val conditionsFilter = conditions.map(buildCondition).reduce(_ or _)
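One caveat worth hedging: reduce throws on an empty collection, so if conditions can be empty you can fall back to the neutral element explicitly (lit imported from org.apache.spark.sql.functions):
val conditionsFilter = conditions
  .map(buildCondition)
  .reduceOption(_ or _)
  .getOrElse(lit(false)) // no conditions, so the filter matches nothing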

spark: join rdd based on sequence of another rdd

I have an RDD, say sample_rdd, of type RDD[(String, String, Int)] with 3 columns id, item, count. Sample data:
id1|item1|1
id1|item2|3
id1|item3|4
id2|item1|3
id2|item4|2
I want to join each id against a lookup_rdd like this:
item1|0
item2|0
item3|0
item4|0
item5|0
The output should give me the following for id1, an outer join with the lookup table:
item1|1
item2|3
item3|4
item4|0
item5|0
Similarly for id2 I should get:
item1|3
item2|0
item3|0
item4|2
item5|0
Finally, the output for each id should have all counts with the id:
id1,1,3,4,0,0
id2,3,0,0,2,0
IMPORTANT: this output should always be ordered according to the order in the lookup.
This is what I have tried:
val line = rdd_sample.map { case (id, item, count) => (id, (item, count)) }.map(row => (row._1, row._2)).groupByKey()
get(line).map(l => (l._1, l._2)).mapValues(item_count => lookup_rdd.leftOuterJoin(item_count))
def get(line: RDD[(String, Iterable[(String, Int)])]) = {
  for {
    (id, item_cnt) <- line
    i = item_cnt.map(tuple => (tuple._1, tuple._2))
  } yield (id, i)
}
Try the code below. Run each step on your local console to understand what's happening in detail.
The idea is to zipWithIndex and form sequences based on lookup_rdd and the distinct ids:
(i1,0),(i2,1)...(i5,4) and (id1,0),(id2,1)
Index wanted in the final result = delta (the length of the lookup_rdd seq) * index of id1..id2 + index of i1..i5
So the base seq generated will be (0,(i1,id1)),(1,(i2,id1))...(8,(i4,id2)),(9,(i5,id2))
and then, based on the key (i1,id1), reduce and calculate the count.
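As a quick sanity check of that formula, here is a hedged sketch with delta = 5 (the size of lookup_rdd):
// (i1, id1) -> 5 * 0 + 0 = 0 and (i4, id2) -> 5 * 1 + 3 = 8,
// matching the (0,(i1,id1)) and (8,(i4,id2)) entries in the base seq above.
def flatIndex(delta: Long, idIndex: Long, itemIndex: Long): Long =
  delta * idIndex + itemIndex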
val res2 = sc.parallelize(arr) //sample_rdd
val res3 = sc.parallelize(cart) //lookup_rdd
val delta = res3.count
val res83 = res3.map(_._1).zipWithIndex.cartesian(res2.map(_._1).distinct.zipWithIndex).map(x => ((x._1._1, x._2._1), ((delta * x._2._2) + x._1._2, 0)))
val res86 = res2.map(x => ((x._2,x._1),x._3)).reduceByKey(_+_)
val res88 = res83.leftOuterJoin(res86)
val res91 = res88.map( x => {
x._2._2 match {
case Some(x1) => (x._2._1._1, (x._1,x._2._1._2+x1))
case None => (x._2._1._1, (x._1,x._2._1._2))
}
})
val res97 = res91.sortByKey(true).map( x => {
(x._2._1._2,List(x._2._2))}).reduceByKey(_++_)
res97.collect
// SOLUTION: Array((id1,List(1,3,4,0,0)),(id2,List(3,0,0,2,0)))
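For reference, since arr and cart are not shown in the snippet, a hedged sketch of the sample inputs it assumes (arr mirrors sample_rdd, cart mirrors lookup_rdd) would be:
val arr = Seq(
  ("id1", "item1", 1), ("id1", "item2", 3), ("id1", "item3", 4),
  ("id2", "item1", 3), ("id2", "item4", 2))
val cart = Seq(
  ("item1", 0), ("item2", 0), ("item3", 0), ("item4", 0), ("item5", 0))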

Combining files

I am new to Scala. I have two RDDs and I need to separate out my training and testing data. In one file I have all the data and in another just the testing data. I need to remove the testing data from my complete data set.
The complete data file is of the format (userID, MovID, Rating, Timestamp):
res8: Array[String] = Array(1, 31, 2.5, 1260759144)
The test data file is of the format (userID, MovID):
res10: Array[String] = Array(1, 1172)
How do I generate ratings_train that will not have the cases matched with the testing dataset?
I am using the following function, but the returned list comes back empty:
def create_training(data: RDD[String], ratings_test: RDD[String]): ListBuffer[Array[String]] = {
val ratings_split = dropheader(data).map(line => line.split(","))
val ratings_testing = dropheader(ratings_test).map(line => line.split(",")).collect()
var ratings_train = new ListBuffer[Array[String]]()
ratings_split.foreach(x => {
ratings_testing.foreach(y => {
if (x(0) != y(0) || x(1) != y(1)) {
ratings_train += x
}
})
})
return ratings_train
}
EDIT: I changed the code but am running into memory issues.
This may work:
def create_training(data: RDD[String], ratings_test: RDD[String]): Array[Array[String]] = {
val ratings_split = dropheader(data).map(line => line.split(","))
val ratings_testing = dropheader(ratings_test).map(line => line.split(","))
ratings_split.filter(x => {
ratings_testing.exists(y =>
(x(0) == y(0) && x(1) == y(1))
) == false
})
}
The code snippets you posted are not logically correct. A row should only be part of the final data if it is not present in the test data, but your code keeps a row as soon as it fails to match one of the test rows; you should instead check that it matches none of the test rows, and only then treat it as a valid row.
You are using RDDs, but not exploiting their full power. I guess you are reading the input from a CSV file. You can structure your data inside the RDD instead of splitting the string on the comma character and manually processing each row. Take a look at Spark's DataFrame API. These links may help: https://www.tutorialspoint.com/spark_sql/spark_sql_dataframes.htm , http://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes
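As a minimal sketch of that DataFrame route (assuming the files have a header row, spark is a SparkSession, and the file paths here are made up), a left_anti join does exactly this filtering:
// Keep only rows whose (userID, MovID) pair does not appear in the test set.
val ratingsDF = spark.read.option("header", "true").csv("ratings.csv")
val testDF = spark.read.option("header", "true").csv("ratings_test.csv")
val ratings_train = ratingsDF.join(testDF, Seq("userID", "MovID"), "left_anti")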
Using Regex:
def main(args: Array[String]): Unit = {
// creating test data set
val data = spark.sparkContext.parallelize(Seq(
// "userID, MovID, Rating, Timestamp",
"1, 31, 2.5, 1260759144",
"2, 31, 2.5, 1260759144"))
val ratings_test = spark.sparkContext.parallelize(Seq(
// "userID, MovID",
"1, 31",
"2, 30",
"30, 2"
))
val result = getData(data, ratings_test).collect()
// the result will only contain "2, 31, 2.5, 1260759144"
}
def getData(data: RDD[String], ratings_test: RDD[String]): RDD[String] = {
val ratings = dropheader(data)
val ratings_testing = dropheader(ratings_test)
// Broadcast the test rating data to all Spark nodes, since we collect it beforehand.
// The reason we collect the test data is to avoid calling collect inside the filter logic.
val ratings_testing_bc = spark.sparkContext.broadcast(ratings_testing.collect.toSet)
ratings.filter(rating => {
ratings_testing_bc.value.exists(testRating => regexMatch(rating, testRating)) == false
})
}
def regexMatch(data: String, testData: String): Boolean = {
// Regular expression to find first two columns
val regex = """^([^,]*), ([^,\r\n]*),?""".r
val (dataCol1, dataCol2) = regex findFirstIn data match {
case Some(regex(col1, col2)) => (col1, col2)
}
val (testDataCol1, testDataCol2) = regex findFirstIn testData match {
case Some(regex(col1, col2)) => (col1, col2)
}
(dataCol1 == testDataCol1) && (dataCol2 == testDataCol2)
}

build inverted index in spark application using scala

I am new to Spark and the Scala programming language. My input is a CSV file. I need to build an inverted index on the values in the CSV file, as explained below with an example.
Input: file.csv
attr1, attr2, attr3
1, AAA, 23
2, BBB, 23
3, AAA, 27
Output format: value -> (rowid, columnid) pairs
for example: AAA -> ((1,2),(3,2))
27 -> (3,3)
I have started with the following code. I am stuck after that. Kindly help.
import org.apache.spark.{SparkConf, SparkContext}

object Main {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Invert Me!").setMaster("local[2]")
val sc = new SparkContext(conf)
val txtFilePath = "/home/person/Desktop/sample.csv"
val txtFile = sc.textFile(txtFilePath)
val nRows = txtFile.count()
val data = txtFile.map(line => line.split(",").map(elem => elem.trim()))
val nCols = data.collect()(0).length
}
}
Code preserving your style could look like this:
val header = sc.broadcast(data.first())
val cells = data.zipWithIndex().filter(_._2 > 0).flatMap { case (row, index) =>
row.zip(header.value).map { case (value, column) => value ->(column, index) }
}
val index: RDD[(String, Vector[(String, Long)])] =
cells.aggregateByKey(Vector.empty[(String, Long)])(_ :+ _, _ ++ _)
Here the index value should contain the desired mapping from CellValue to the pair (ColumnName, RowIndex).
The underscores in the methods above are just shorthand lambdas; the same thing could be written another way as:
val cellsVerbose = data.zipWithIndex().flatMap {
case (row, 0L) => IndexedSeq.empty // skip the header row (index 0)
case (row, index) => row.zip(header.value).map {
case (value, column) => value ->(column, index)
}
}
val indexVerbose: RDD[(String, Vector[(String, Long)])] =
cellsVerbose.aggregateByKey(zeroValue = Vector.empty[(String, Long)])(
seqOp = (keys, key) => keys :+ key,
combOp = (keysA, keysB) => keysA ++ keysB)
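If numeric column ids are preferred (as in the expected output) rather than column names, a hedged variant is to index the header once and zip each row with those indices (assuming the same data RDD and RDD import used above):
// 1-based column ids so attr2 becomes column 2, matching AAA -> ((1,2),(3,2))
val headerIdx = sc.broadcast(data.first().zipWithIndex.map { case (_, i) => i + 1 })
val cellsById = data.zipWithIndex().filter(_._2 > 0).flatMap { case (row, rowIdx) =>
  row.zip(headerIdx.value).map { case (value, colIdx) => value -> (rowIdx, colIdx) }
}
val indexById: RDD[(String, Vector[(Long, Int)])] =
  cellsById.aggregateByKey(Vector.empty[(Long, Int)])(_ :+ _, _ ++ _)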

spark join operation based on two columns

I'm trying to join two datasets based on two columns. It works when I join on one column, but it fails with the error below when I use two:
:29: error: value join is not a member of org.apache.spark.rdd.RDD[(String, String, (String, String, String, String, Double))]
val finalFact = fact.join(dimensionWithSK).map { case(nk1,nk2, ((parts1,parts2,parts3,parts4,amount), (sk, prop1,prop2,prop3,prop4))) => (sk,amount) }
Code:
import org.apache.spark.rdd.RDD
def zipWithIndex[T](rdd: RDD[T]) = {
val partitionSizes = rdd.mapPartitions(p => Iterator(p.length)).collect
val ranges = partitionSizes.foldLeft(List((0, 0))) { case(accList, count) =>
val start = accList.head._2
val end = start + count
(start, end) :: accList
}.reverse.tail.toArray
rdd.mapPartitionsWithIndex( (index, partition) => {
val start = ranges(index)._1
val end = ranges(index)._2
val indexes = Iterator.range(start, end)
partition.zip(indexes)
})
}
val dimension = sc.
textFile("dimension.txt").
map{ line =>
val parts = line.split("\t")
(parts(0),parts(1),parts(2),parts(3),parts(4),parts(5))
}
val dimensionWithSK =
zipWithIndex(dimension).map { case((nk1,nk2,prop3,prop4,prop5,prop6), idx) => (nk1,nk2,(prop3,prop4,prop5,prop6,idx + nextSurrogateKey)) }
val fact = sc.
textFile("fact.txt").
map { line =>
val parts = line.split("\t")
// we need to output (Naturalkey, (FactId, Amount)) in
// order to be able to join with the dimension data.
(parts(0),parts(1), (parts(2),parts(3), parts(4),parts(5),parts(6).toDouble))
}
val finalFact = fact.join(dimensionWithSK).map { case(nk1,nk2, ((parts1,parts2,parts3,parts4,amount), (sk, prop1,prop2,prop3,prop4))) => (sk,amount) }
Can someone help here?
Thanks,
Sridhar
If you look at the signature of join, it works on an RDD of pairs:
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
You have a triple. I guess you're trying to join on the first two elements of the tuple, so you need to map your triple to a pair whose first element is itself a pair containing the first two elements of the triple, e.g. for any types V1 and V2:
val left: RDD[(String, String, V1)] = ??? // some rdd
val right: RDD[(String, String, V2)] = ??? // some rdd
left.map {
case (key1, key2, value) => ((key1, key2), value)
}
.join(
right.map {
case (key1, key2, value) => ((key1, key2), value)
})
This will give you an RDD of the form RDD[((String, String), (V1, V2))].
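Applied to the RDDs in the question, a hedged sketch of that re-keying (field names follow the snippets above, with the surrogate key last in the dimension tuple) would be:
val factByKey = fact.map { case (nk1, nk2, factValues) => ((nk1, nk2), factValues) }
val dimByKey = dimensionWithSK.map { case (nk1, nk2, dimValues) => ((nk1, nk2), dimValues) }
val finalFact = factByKey.join(dimByKey).map {
  case ((nk1, nk2), ((p1, p2, p3, p4, amount), (prop3, prop4, prop5, prop6, sk))) => (sk, amount)
}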
rdd1 schema:
field1,field2, field3, fieldX,.....
rdd2 schema:
field1, field2, field3, fieldY,.....
val joinResult = rdd1.join(rdd2,
Seq("field1", "field2", "field3"), "outer")
joinResult schema:
field1, field2, field3, fieldX, fieldY, ......
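Note that the Seq-based, multi-column join above is a DataFrame method, not an RDD one, so a hedged sketch would first convert the RDDs (assuming they hold tuples, spark is a SparkSession, and the column names here are made up; add the remaining columns as needed):
import spark.implicits._

val df1 = rdd1.toDF("field1", "field2", "field3", "fieldX")
val df2 = rdd2.toDF("field1", "field2", "field3", "fieldY")
val joinResult = df1.join(df2, Seq("field1", "field2", "field3"), "outer")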
val emp = sc.
textFile("emp.txt").
map { line =>
val parts = line.split("\t")
// we need to output (Naturalkey, (FactId, Amount)) in
// order to be able to join with the dimension data.
((parts(0), parts(2)),parts(1))
}
val emp_new = sc.
textFile("emp_new.txt").
map { line =>
val parts = line.split("\t")
// we need to output (Naturalkey, (FactId, Amount)) in
// order to be able to join with the dimension data.
((parts(0), parts(2)),parts(1))
}
val finalemp =
emp_new.join(emp).
map { case((nk1,nk2) ,((parts1), (val1))) => (nk1,parts1,val1) }