How to filter an RDD by data type? - Scala

I have an RDD that I am trying to filter for only float values. Do Spark RDDs provide any way of doing this?
I have a CSV from which I need only the float values greater than 40 in a new RDD. To achieve this, I am checking whether the column is an instance of type Float and filtering on that. When I filter with a !, all the strings are still in the output, and when I don't use !, the output is empty.
val airports1 = airports.filter(line => !line.split(",")(6).isInstanceOf[Float])
val airports2 = airports1.filter(line => line.split(",")(6).toFloat > 40)
At the .toFloat, I run into a NumberFormatException, which I've tried to handle in a try/catch block.

Since each line is a plain string and you are trying to get float values out of it, you are not actually filtering by type, but by whether the values can be parsed as a Float.
You can accomplish that using a flatMap together with Option.
import org.apache.spark.sql.SparkSession
import scala.util.Try
val spark = SparkSession.builder.master("local[*]").appName("Float caster").getOrCreate()
val sc = spark.sparkContext
val data = List("x,10", "y,3.3", "z,a")
val rdd = sc.parallelize(data) // rdd: RDD[String]
val filtered = rdd.flatMap(line => Try(line.split(",")(1).toFloat).toOption) // filtered: RDD[Float]
filtered.collect() // res0: Array[Float] = Array(10.0, 3.3)
For the > 40 part you can either perform another filter afterwards or filter the inner Option.
(Both should perform more or less the same thanks to Spark's laziness, so choose whichever is clearer to you.)
// Option 1 - Another filter.
val filtered2 = filtered.filter(x => x > 40)
// Option 2 - Filter the inner option in one step.
val filtered = rdd.flatMap(line => Try(line.split(",")(1).toFloat).toOption.filter(x => x > 40))
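If you need to keep the whole CSV line rather than just the parsed value, the same idea works with a plain filter; here is a sketch against the airports RDD from the question, assuming the value sits at column index 6 as there:
// Keep the original lines whose 7th column parses as a Float greater than 40.
val airportsOver40 = airports.filter { line =>
  Try(line.split(",")(6).toFloat).toOption.exists(_ > 40)
}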
Let me know if you have any questions.

Related

Scala map-filtering methods

I'm new to Scala and Spark. I'm trying to remove duplicate rows from a text file.
Each row contains three columns (vector values), such as: -4.5,-4.2,2.7
This is my program:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import scala.collection.mutable.Map

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val input = sc.textFile("/opt/spark/WC/WC_input.txt")
    val keys = input.flatMap(line => line.split("/n"))
    val singleKeys = keys.distinct
    singleKeys.foreach(println)
  }
}
It works, but I wanted to know if there was a way to employ the filter function. I have to use it in my program, but I don't know how to iterate over all the rows and remove the duplicates (with a loop, for example).
If anybody has an idea, that would be great!
Thank you!
I think using filter for this wouldn't be a very efficient solution. For each element you would have to either check whether it is already present in some sort of temporary dataset, or count how many of these elements occur in the dataset processed so far.
If you want to iterate over the data and maybe do some on-the-fly edits, you can apply map and then reduceByKey to group identical elements, like this:
val singleKeys = keys
  .map(element => (element, 0))
  .reduceByKey((element, count) => element)
  .map(_._1)
where you can make changes to the dataset in the first map step. The count value is not actually used; it is only there because reduceByKey operates on key-value pairs, so each element has to be wrapped in a tuple and the reduce function has to combine two values.
I think this is basically how distinct internally works.
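For reference, Spark's own distinct is built on the same pattern; roughly (a sketch of the idea, not the exact source), using the keys RDD from the question:
// Pair each element with a dummy value, collapse duplicates per key, then drop the dummy.
val distinctKeys = keys.map(x => (x, null)).reduceByKey((x, _) => x).map(_._1)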
Elements of an RDD that are duplicated can be dropped entirely in this way (note that, unlike distinct, this keeps only the rows that occur exactly once):
val data = List("-4.5,-4.2,2.7", "10,20,30", "-4.5,-4.2,2.7")
val rdd = sparkContext.parallelize(data)
val result = rdd.map((_, 1)).reduceByKey(_ + _).filter(_._2 == 1).map(_._1)
result.foreach(println)
Result:
10,20,30
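If what you want instead is classic deduplication, i.e. keeping one copy of each row, the same pattern works without the count filter; a sketch:
// Keep one copy of every row, duplicated or not (output order may vary).
val deduped = rdd.map((_, 1)).reduceByKey(_ + _).map(_._1)
deduped.foreach(println)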

Spark Scala: filter multiple RDDs based on string length

I am trying to solve one of the quiz questions, which is as below:
Write the missing code in the given program to display the expected output, identifying animals that have names with four letters.
Output: Array((4,lion))
Program
val a = sc.parallelize(List("dog","tiger","lion","cat","spider","eagle"),2)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant","falcon","squid"),2)
val d = c.keyBy(_.length)
I have tried to write the code in the Spark shell, but I am stuck on the syntax for combining the four RDDs and applying the filter.
How about using the PairRDD lookup method:
b.lookup(4).toArray
// res1: Array[String] = Array(lion)
d.lookup(4).toArray
// res2: Array[String] = Array()
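If the quiz expects the exact output Array((4,lion)), combining the two keyed RDDs and filtering on the key should work as well; a sketch using the RDDs from the question:
// Union the keyed RDDs and keep only the pairs whose key (the name length) is 4.
val fourLetter = b.union(d).filter { case (len, _) => len == 4 }
fourLetter.collect()
// Array[(Int, String)] = Array((4,lion))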

How to update an RDD

I have an RDD[(Int, Array[Double], Double, Double)].
val full_data = rdd.map(row => {
  val label = row._1
  val feature = row._2.map(_.toDouble)
  val QD = k_function(feature)
  val alpha = 0.0
  (label, feature, QD, alpha)
})
Now I want to update the value of alpha in each record (to 10, say):
var tmp = full_data.map(x => {
  x._4 = 10
})
I got the error
Error: reassignment to val
x._4 = 10
I have changed all the vals to vars but the error still occurs. How do I update the value of alpha? I would also like to know how to update the full row or a specific row in an RDD.
RDDs are immutable by nature. They are made this way for easy caching, sharing and replication. It is always safer to copy than to mutate in a distributed system like Spark, for fault tolerance and correctness of processing; recreating immutable data is much easier than reasoning about mutable data.
A transformation is essentially copying the RDD's data into another RDD. Every element is treated as a val, i.e. it is immutable, so if you want to replace the last double with 10, what you can do is rebuild the tuple:
val tmp = full_data.map(x => {
  (x._1, x._2, x._3, 10.0)
})
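To update only specific rows rather than every record, rebuild the tuple conditionally in the same way; here is a sketch that, purely for illustration, assumes alpha should become 10.0 only where the label equals 1:
// Rebuild each tuple, changing alpha only where the (hypothetical) condition on the label holds.
val updated = full_data.map { case (label, feature, qd, alpha) =>
  if (label == 1) (label, feature, qd, 10.0)
  else (label, feature, qd, alpha)
}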

Spark DataFrame Scala: groupBy doesn't work after unionAll

I used unionAll to combine the source DF (with negative weights) and the target DF (with positive weights) into a node DF. Then I perform a groupBy to sum all the weights of the same nodes, but I don't know why groupBy didn't work on the unioned DF at all. Did anyone face the same problem?
val src = file.map(_.split("\t")).map(p => node(p(0), (0-p(2).trim.toInt))).toDF()
val target = file.map(_.split("\t")).map(p => node(p(1), p(2).trim.toInt)).toDF()
val srcfl = src.filter(src("weight") != -1)
val targetfl = target.filter(target("weight") != 1)
val nodes = srcfl.unionAll(targetfl)
nodes.groupBy("name").sum()
nodes.map(x => x.mkString("\t")).saveAsTextFile("hdfs://localhost:8020" + args(1))
You're simply ignoring the result of the groupBy operation: just like all DataFrame transformations, .groupBy(...).sum() doesn't mutate the original DataFrame (nodes), it produces a new one. I suspect that if you actually use the return value from sum() you'll see the result you're looking for:
val result = nodes.groupBy("name").sum()
result.map(x => x.mkString("\t")).saveAsTextFile("hdfs://localhost:8020" + args(1))
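As a side note, the summed column in result will be named sum(weight) by default; if you want a cleaner column name, you can aggregate explicitly instead, for example (a sketch):
// Same aggregation, but with an explicit name for the summed column.
import org.apache.spark.sql.functions.sum
val summed = nodes.groupBy("name").agg(sum("weight").as("weight"))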

Spark DataFrame zipWithIndex

I am using a DataFrame to read in .parquet files, but then turning them into an RDD to do the normal processing I wanted to do on them.
So I have my file:
val dataSplit = sqlContext.parquetFile("input.parquet")
val convRDD = dataSplit.rdd
val columnIndex = convRDD.flatMap(r => r.zipWithIndex)
I get the following error even when I convert from a DataFrame to an RDD:
:26: error: value zipWithIndex is not a member of org.apache.spark.sql.Row
Does anyone know how to do what I am trying to do, which is essentially getting the value and the column index?
I was thinking something like:
val dataSplit = sqlContext.parquetFile(inputVal.toString)
val schema = dataSplit.schema
val columnIndex = dataSplit.flatMap(r => 0 until schema.length
but I am getting stuck on the last part, as I am not sure how to do the same thing zipWithIndex does.
You can simply convert Row to Seq:
convRDD.flatMap(r => r.toSeq.zipWithIndex)
An important thing to note here is that extracting type information becomes tricky: Row.toSeq returns Seq[Any], so the resulting RDD is RDD[(Any, Int)].
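If you know the column types up front, you can recover them with a pattern match after zipping; a sketch that, for illustration, assumes every column holds a Double:
// Keep only the (value, index) pairs whose value is a Double, recovering the element type.
val typed = convRDD.flatMap(r => r.toSeq.zipWithIndex.collect {
  case (v: Double, i) => (v, i)
})
// typed: RDD[(Double, Int)]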