False/True Column constant - Scala

TL;DR: I need this Spark constant:
val False : Column = lit(1) === lit(0)
Any idea how to make it prettier?
Problem Context
I want to filter a dataframe using a collection of conditions. For example:
case class Condition(column: String, value: Any)
val conditions = Seq(
  Condition("name", "bob"),
  Condition("age", 18)
)
val personsDF = Seq(
  ("bob", 30),
  ("anna", 20),
  ("jack", 18)
).toDF("name", "age")
When applying my collection to personsDF I expect:
val expected = Seq(
  ("bob", 30),
  ("jack", 18)
)
To do so, I build a filter from the collection and apply it to the dataframe:
val conditionsFilter = conditions.foldLeft(initialValue) {
  case (cumulatedFilter, Condition(column, value)) =>
    cumulatedFilter || col(column) === value
}
personsDF.filter(conditionsFilter)
Pretty sweet, right?
But to do so, I need the neutral element of the OR operator, which is False. Since False doesn't exist in Spark I used:
val False : Column = lit(1) === lit(0)
Any idea how to do this without tricks?

You can just do:
val False : Column = lit(false)
This should be your initialValue, right? You can avoid that by using head and tail:
val buildCondition = (c: Condition) => col(c.column) === c.value
val initialValue = buildCondition(conditions.head)
val conditionsFilter = conditions.tail.foldLeft(initialValue)(
  (cumulatedFilter, condition) =>
    cumulatedFilter || buildCondition(condition)
)
Even shorter, you could use reduce:
val buildCondition = (c: Condition) => col(c.column) === c.value
val conditionsFilter = conditions.map(buildCondition).reduce(_ or _)
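One hedged caveat to add: reduce throws on an empty collection, whereas the foldLeft seeded with lit(false) simply yields a filter that matches nothing. A small sketch combining the two ideas:
// reduceOption keeps the one-liner but falls back to a "match nothing" filter
// when the conditions collection happens to be empty
val buildCondition = (c: Condition) => col(c.column) === c.value
val safeFilter: Column =
  conditions.map(buildCondition).reduceOption(_ or _).getOrElse(lit(false))
personsDF.filter(safeFilter).show()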

Related

Azure Databricks Scala: How to replace rows following a respective hierarchy

Given a dataset with columns ID, ACTUAL_ID and DATA (created in the code below), I would like to obtain an extra column with the resolved final ID for each row. Basically, the idea is to follow the path indicated by column ACTUAL_ID until it is null (if it wasn't already).
I tried to use a UDF to which I was passing the full initial DataFrame and which would then recursively find what I want, but it seems it is not possible to pass DataFrames to UDFs. I also looked into replacing the value of a row, but it seems that is not possible either.
My latest attempt:
def calculateLatestImdate(df: DataFrame, lookupId: String): String = {
  // The row only contains ACTUAL_ID after the select, so read that column
  val foundId = df.filter($"ID" === lookupId).select($"ACTUAL_ID").first.getAs[String]("ACTUAL_ID")
  if (foundId == null || foundId == "") {
    lookupId
  } else {
    calculateLatestImdate(df, foundId)
  }
}
val calculateLatestImdateUdf = udf((df: DataFrame, s: String) => {
  calculateLatestImdate(df, s)
})
val df = sc.parallelize(Seq(("1", "", "A"), ("2", "3", "B"), ("3", "6", "C"), ("4", "5", "D"), ("5", "", "E"), ("6", "", "F"))).toDF("ID","ACTUAL_ID", "DATA")
val finalDf = df.withColumn("FINAL_ID", when($"ACTUAL_ID".isNull || $"ACTUAL_ID" === "", $"ID").otherwise(calculateLatestImdateUdf(df, $"ACTUAL_ID")))
This looked a bit like a graph problem to me, so I worked up an answer using Scala and graphframes. It makes use of the connectedComponents algorithm and the outDegrees method of the GraphFrame. I've assumed that the end of each tree is unique, as per your sample data, but this assumption needs to be checked. I'd be interested to see what the performance is like with more data; let me know what you think of the solution.
The complete script:
// NB graphframes had to be installed separately with the right Scala version
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.graphframes._
// Create the test data
// Vertices dataframe
val v2 = sqlContext.createDataFrame(List(
( 1, 0, "A" ), ( 2, 3, "B" ), ( 3, 6, "C" ),
( 4, 5, "D" ), ( 5, 0, "E" ), ( 6, 0, "F" )
)).toDF("id", "actual_id", "data")
// Edge dataframe
val e2 = sqlContext.createDataFrame(List(
(2, 3, "is linked to"),
(3, 6, "is linked to"),
(4, 5, "is linked to")
)).toDF("src", "dst", "relationship")
// Create the graph frame
val g2 = GraphFrame(v2, e2)
print(g2)
// The connected components adds a component id to each 'group'
sc.setCheckpointDir("/tmp/graphframes-example-connected-components")
val components = g2.connectedComponents.run() // doesn't work on Spark 1.4
display(components)
// "end" of tree nodes have no outDegree, so add that in to the component df
val endOfTree = components.join(g2.outDegrees, Seq("id"), "left")
  .where("outDegree is null")
  .select("component", "data")
endOfTree.show()
components.as("c").join(endOfTree.as("t"), $"c.component" === $"t.component")
  .select($"c.id", $"c.component", $"t.data")
  .orderBy("id")
  .show()
My results:
If your data is already in a dataframe, it's easy to generate the edges dataframe from the original with just a select and a where filter, e.g.:
// Create the GraphFrame from the dataframe
val v2 = df
val e2 = df
.select("id", "actual_id")
.withColumn("rel", lit("is linked to"))
.where("actual_id > 0")
.toDF("src", "dst", "rel")
val g2 = GraphFrame(v2, e2)
print(g2)
g2.vertices.show()
g2.edges.show()
I believe I have found an answer to my problem.
def calculateLatestId(df: DataFrame): DataFrame = {
  var joinedDf = df.as("df1")
    .join(df.as("df2"), $"df1.ACTUAL_ID" === $"df2.ID", "outer")
    .withColumn("FINAL_ID",
      when($"df2.ID".isNull, $"df1.ID")
        .when($"df2.ACTUAL_ID".isNotNull, $"df2.ACTUAL_ID")
        .otherwise($"df2.ID"))
    .select($"df1.*", $"FINAL_ID")
    .filter($"df1.ID".isNotNull)
  val differentIds = joinedDf.filter($"df1.ACTUAL_ID" =!= $"FINAL_ID")
  joinedDf = joinedDf.withColumn("ACTUAL_ID", $"FINAL_ID").drop($"FINAL_ID")
  if (differentIds.count > 0) {
    calculateLatestId(joinedDf)
  } else {
    joinedDf = joinedDf.as("df1")
      .join(joinedDf.as("df2"), $"df1.ACTUAL_ID" === $"df2.ID", "inner")
      .select($"df1.ID", $"df2.*")
      .drop($"df2.ID")
    joinedDf
  }
}
I believe the performance can be improved somehow, probably by reducing the number of rows after each iteration and doing some sort of join and clean-up at the end.
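On the performance point, one possible tweak (a hedged sketch, not something from the answer above): each recursive call stacks another self-join onto the same query plan, so the count action and the joins can slow down as the lineage grows. Truncating the lineage between iterations with Dataset.checkpoint may help, assuming a checkpoint directory such as /tmp is acceptable:
import org.apache.spark.sql.DataFrame

// A checkpoint directory must be configured once before checkpointing
sc.setCheckpointDir("/tmp/latest-id-checkpoints")

// Materialises the intermediate result and cuts its lineage; calling this on joinedDf
// before the recursive calculateLatestId(joinedDf) call would keep each iteration's plan small.
def truncateLineage(df: DataFrame): DataFrame = df.checkpoint()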

Scala - String and Column objects

Here the variable "exprs" is of Column type (i.e. exprs: Array[org.apache.spark.sql.Column] = Array(sum(country), sum(value), sum(price))).
Why does exprs: _* run into an error? Why should I provide head and tail, which to my understanding is only needed for String types?
val resGroupByDF2 = data.groupBy($"country").agg(exprs: _*) // why does this not work
case class cname(
  country: String,
  value: Double,
  price: Double
)
val data = Seq(
  cname("NA", 2, 14),
  cname("EU", 4, 61),
  cname("FE", 5, 1)
).toDF()
val exprs = data.columns.map(sum(_)) // here it returns exprs: Array[org.apache.spark.sql.Column] = Array(sum(country), sum(value), sum(price))
val resGroupByDF2 = data.groupBy($"country").agg(exprs.head, exprs.tail: _*) // why just agg(exprs: _*) does not work in select or agg as it is already a column type
It is because of the signature of agg.
The signature is (expr: org.apache.spark.sql.Column, exprs: org.apache.spark.sql.Column*): org.apache.spark.sql.DataFrame. It expects at least one Column plus an optional var-arg, so a single splatted sequence cannot fill the required first parameter.
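To make the signature point concrete, here is a hedged sketch showing the failing call, the working head/tail call, and the Map-based overload of agg as an alternative (note it also sums the country column, just like the original exprs):
// Does not compile: the splat alone cannot fill the required first Column parameter
// data.groupBy($"country").agg(exprs: _*)

// Compiles: exprs.head fills `expr`, the splatted tail fills the var-arg
val byHeadTail = data.groupBy($"country").agg(exprs.head, exprs.tail: _*)

// Alternative overload taking a Map of column name -> aggregate function name
val byMap = data.groupBy($"country").agg(data.columns.map(c => c -> "sum").toMap)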

Combining files

I am new to Scala. I have two RDDs and I need to separate out my training and testing data. In one file I have all the data and in another just the testing data. I need to remove the testing data from my complete data set.
The complete data file is of the format (userID, MovID, Rating, Timestamp):
res8: Array[String] = Array(1, 31, 2.5, 1260759144)
The test data file is of the format (userID, MovID):
res10: Array[String] = Array(1, 1172)
How do I generate ratings_train that will not have the cases matched with the testing dataset?
I am using the following function but the returned list is showing empty:
def create_training(data: RDD[String], ratings_test: RDD[String]): ListBuffer[Array[String]] = {
  val ratings_split = dropheader(data).map(line => line.split(","))
  val ratings_testing = dropheader(ratings_test).map(line => line.split(",")).collect()
  var ratings_train = new ListBuffer[Array[String]]()
  ratings_split.foreach(x => {
    ratings_testing.foreach(y => {
      if (x(0) != y(0) || x(1) != y(1)) {
        ratings_train += x
      }
    })
  })
  ratings_train
}
EDIT: changed code but running into memory issues.
This may work.
def create_training(data: RDD[String], ratings_test: RDD[String]): Array[Array[String]] = {
  val ratings_split = dropheader(data).map(line => line.split(","))
  // The test set is small, so collect it for use inside the filter closure
  val ratings_testing = dropheader(ratings_test).map(line => line.split(",")).collect()
  ratings_split.filter(x =>
    !ratings_testing.exists(y => x(0) == y(0) && x(1) == y(1))
  ).collect()
}
The code snippets you posted are not logically correct. A row should only be part of the final data if it has no match in the test data, but your code keeps a row as soon as it fails to match one of the test rows. You need to check that it matches none of the test rows before deciding it is a valid row.
You are using RDDs, but not exploiting their full power. I guess you are reading the input from a CSV file; you can then structure your data in the RDD, with no need to split the string on commas and manually process the fields as a row. You can take a look at the DataFrame API of Spark. These links may help: https://www.tutorialspoint.com/spark_sql/spark_sql_dataframes.htm , http://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes
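Following up on the DataFrame suggestion, a hedged sketch (the file paths and header option are assumptions; the column names come from the formats described above): reading both files as DataFrames lets Spark do the anti-join instead of hand-rolled nested loops:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("train-test-split").getOrCreate()

// Assumed file locations and a header row; adjust to the real inputs
val ratingsDF = spark.read.option("header", "true").csv("ratings.csv")      // userID, MovID, Rating, Timestamp
val testDF = spark.read.option("header", "true").csv("ratings_test.csv")    // userID, MovID

// Keep only the rows whose (userID, MovID) pair does not appear in the test set
val trainDF = ratingsDF.join(testDF, Seq("userID", "MovID"), "left_anti")
trainDF.show()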
Using Regex:
def main(args: Array[String]): Unit = {
  // creating the test data set
  val data = spark.sparkContext.parallelize(Seq(
    // "userID, MovID, Rating, Timestamp",
    "1, 31, 2.5, 1260759144",
    "2, 31, 2.5, 1260759144"))
  val ratings_test = spark.sparkContext.parallelize(Seq(
    // "userID, MovID",
    "1, 31",
    "2, 30",
    "30, 2"
  ))
  val result = getData(data, ratings_test).collect()
  // the result will only contain "2, 31, 2.5, 1260759144"
}

def getData(data: RDD[String], ratings_test: RDD[String]): RDD[String] = {
  val ratings = dropheader(data)
  val ratings_testing = dropheader(ratings_test)
  // Collect the test ratings up front and broadcast them to all Spark nodes,
  // so we never call an action such as collect inside the filter logic.
  val ratings_testing_bc = spark.sparkContext.broadcast(ratings_testing.collect.toSet)
  ratings.filter(rating =>
    !ratings_testing_bc.value.exists(testRating => regexMatch(rating, testRating))
  )
}

def regexMatch(data: String, testData: String): Boolean = {
  // Regular expression capturing the first two columns
  // (assumes every line has at least two comma-separated fields; otherwise the match throws)
  val regex = """^([^,]*), ([^,\r\n]*),?""".r
  val (dataCol1, dataCol2) = regex findFirstIn data match {
    case Some(regex(col1, col2)) => (col1, col2)
  }
  val (testDataCol1, testDataCol2) = regex findFirstIn testData match {
    case Some(regex(col1, col2)) => (col1, col2)
  }
  (dataCol1 == testDataCol1) && (dataCol2 == testDataCol2)
}
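A simpler check with the same intent as regexMatch, under the same assumption that fields never contain commas, is to compare the first two comma-separated fields directly (a hedged alternative, not part of the answer above):
// Returns true when the first two CSV fields of both lines match; false for malformed lines
def firstTwoFieldsMatch(data: String, testData: String): Boolean = {
  val d = data.split(",").map(_.trim)
  val t = testData.split(",").map(_.trim)
  d.length >= 2 && t.length >= 2 && d(0) == t(0) && d(1) == t(1)
}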

Filtering a dataframe in Scala

Let's say I have a dataframe created from a text file using a case class schema. Below is the data stored in the dataframe:
id - Type - qt - P
1, X, 10, 100.0
2, Y, 20, 200.0
1, Y, 15, 150.0
1, X, 5, 120.0
I need to filter the dataframe by "id" and Type, and for every "id" iterate through the dataframe for some calculation.
I tried this way but it did not work. Code snapshot:
case class MyClass(id: Int, `type`: String, qt: Long, PRICE: Double)
val df = sc.textFile("xyz.txt")
  .map(_.split(","))
  .map(p => MyClass(p(0).trim.toInt, p(1), p(2).trim.toLong, p(3).trim.toDouble))
  .toDF().cache()
val productList: List[Int] = df.map{row => row.getInt(0)}.distinct.collect.toList
val xList: List[RDD[MyClass]] = productList.map {
  productId => df.filter({ item: MyClass => (item.id == productId) && (item.`type` == "X") })
}.toList
val yList: List[RDD[MyClass]] = productList.map {
  productId => df.filter({ item: MyClass => (item.id == productId) && (item.`type` == "Y") })
}.toList
Taking the distinct idea from your example, simply iterate over all the IDs and filter the DataFrame according to the current ID. After this you have a DataFrame with only the relevant data:
val df3 = sc.textFile("src/main/resources/importantStuff.txt") //Your data here
  .map(_.split(","))
  .map(p => MyClass(p(0).trim.toInt, p(1), p(2).trim.toLong, p(3).trim.toDouble)).toDF().cache()
val productList: List[Int] = df3.map{row => row.getInt(0)}.distinct.collect.toList
println(productList)
productList.foreach(id => {
  val sqlDF = df3.filter(df3("id") === id)
  sqlDF.show()
})
sqlDF in the loop is the DF with the relevant data; you can run your calculations on it afterwards.
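Since the question also wants to restrict each id to a given Type, the same loop extends with one more predicate; a small sketch, assuming the X/Y column in df3 is named type as in the case class:
productList.foreach(id => {
  // Rows for this id, split by Type (the column name "type" is assumed)
  val xDF = df3.filter(df3("id") === id && df3("type") === "X")
  val yDF = df3.filter(df3("id") === id && df3("type") === "Y")
  xDF.show()
  yDF.show()
})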

obtain a specific value from a RDD according to another RDD

I want to map an RDD by looking values up in another RDD with this code:
val product = numOfT.map { case ((a, b), c) =>
  val h = keyValueRecords.lookup(b).take(1).mkString.toInt
  (a, h * c)
}
a and b are Strings and c is an Integer. keyValueRecords is of type RDD[(String, String)].
I got a type mismatch error. How can I fix it? What is my mistake?
This is a sample of data:
userId,movieId,rating,timestamp
1,16,4.0,1217897793
1,24,1.5,1217895807
1,32,4.0,1217896246
2,3,2.0,859046959
3,7,3.0,8414840873
I'm trying with this code:
val lines = sc.textFile("ratings.txt").map(s => {
  val substrings = s.split(",")
  (substrings(0), (substrings(1), substrings(1)))
})
val shoppingList = lines.groupByKey()
val coOccurence = shoppingList.flatMap { case (k, v) =>
  val arry1 = v.toArray
  val arry2 = v.toArray
  val pairs = for (pair1 <- arry1; pair2 <- arry2) yield ((pair1, pair2), 1)
  pairs.iterator
}
val numOfT = coOccurence.reduceByKey((a, b) => (a + b)) // (((item,rate),(item,rate)), co-occurrence)
// produce recommendations for a particular user
val keyValueRecords = sc.textFile("ratings.txt").map(s => {
  val substrings = s.split(",")
  (substrings(0), (substrings(1), substrings(2)))
}).filter { case (k, v) => k == "1" }.groupByKey().flatMap { case (k, v) =>
  val arry1 = v.toArray
  val arry2 = v.toArray
  val pairs = for (pair1 <- arry1; pair2 <- arry2) yield ((pair1, pair2), 1)
  pairs.iterator
}
val numOfTForaUser = keyValueRecords.reduceByKey((a, b) => (a + b))
val joined = numOfT.join(numOfTForaUser).map { case (k, v) => (k._1._1, (k._2._2.toFloat * v._1.toFloat)) }.collect.foreach(println)
The last RDD won't be produced. Is it wrong?
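Regarding the first snippet: lookup is an RDD action, and actions cannot be called from inside another RDD's map, which is the usual cause of this kind of error. A hedged sketch of a common workaround, assuming keyValueRecords really is the small RDD[(String, String)] described above so it fits on the driver:
// Collect the lookup RDD into a plain Map and broadcast it, so the lookup happens
// locally inside the transformation instead of as a nested RDD action.
val keyValueMap = sc.broadcast(keyValueRecords.collectAsMap())

val product = numOfT.map { case ((a, b), c) =>
  val h = keyValueMap.value.get(b).map(_.toInt).getOrElse(0)
  (a, h * c)
}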