How to convert Flink DataSet tuple to one column - scala

I have graph data like:
1 2
1 4
4 1
4 2
4 3
3 2
2 3
But I couldn't find a way to convert it to a one-column dataset like:
1
2
1
4
4
1
...
Here is my code. I used a Scala ListBuffer, but I couldn't find a way to do it with a Flink DataSet:
val params: ParameterTool = ParameterTool.fromArgs(args)
val env = ExecutionEnvironment.getExecutionEnvironment
env.getConfig.setGlobalJobParameters(params)
val text = env.readTextFile(params.get("input"))
val tupleText = text.map { line =>
  val arr = line.split(" ")
  (arr(0), arr(1))
}
var x: Seq[(String, String)] = tupleText.collect()
var tempList = new ListBuffer[String]
x.foreach(line => {
  tempList += line._1
  tempList += line._2
})
tempList.foreach(println)

You can do that with flatMap:
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

// get some input
val input: DataSet[(Int, Int)] = env.fromElements((1, 2), (2, 3), (3, 4))
// emit every tuple element as its own record
val output: DataSet[Int] = input.flatMap { (t, out: Collector[Int]) =>
  out.collect(t._1)
  out.collect(t._2)
}
// print result
output.print()
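Applied to the space-separated input from the question (reusing its env and params; singleColumn is just an illustrative name), the same flatMap idea looks roughly like this:
val text = env.readTextFile(params.get("input"))
// split each line and emit every token as its own single-column record
val singleColumn: DataSet[String] = text.flatMap(line => line.split(" "))
singleColumn.print()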

Related

Outlier Elimination in Spark With InterQuartileRange Results in Error

I have the following recursive function that determines outliers using the interquartile range (IQR) method:
def interQuartileRangeFiltering(df: DataFrame): DataFrame = {
  @scala.annotation.tailrec
  def inner(cols: List[String], acc: DataFrame): DataFrame = cols match {
    case Nil => acc
    case column :: xs =>
      val quantiles = acc.stat.approxQuantile(column, Array(0.25, 0.75), 0.0) // TODO: values should come from config
      println(s"$column ${quantiles.size}")
      val q1 = quantiles(0)
      val q3 = quantiles(1)
      val iqr = q1 - q3
      val lowerRange = q1 - 1.5 * iqr
      val upperRange = q3 + 1.5 * iqr
      val filtered = acc.filter(s"$column < $lowerRange or $column > $upperRange")
      inner(xs, filtered)
  }
  inner(df.columns.toList, df)
}
val outlierDF = interQuartileRangeFiltering(incomingDF)
val outlierDF = interQuartileRangeFiltering(incomingDF)
So basically what I'm doing is recursively iterating over the columns and eliminating the outliers. Strangely, it results in an ArrayIndexOutOfBoundsException and prints the following:
housing_median_age 2
inland 2
island 2
population 2
total_bedrooms 2
near_bay 2
near_ocean 2
median_house_value 0
java.lang.ArrayIndexOutOfBoundsException: 0
at inner$1(<console>:75)
at interQuartileRangeFiltering(<console>:83)
... 54 elided
What is wrong with my approach?
Here is what I came up with, and it works like a charm:
def outlierEliminator(df: DataFrame, colsToIgnore: List[String])(fn: (String, DataFrame) => (Double, Double)): DataFrame = {
  val ID_COL_NAME = "id"
  val dfWithId = DataFrameUtils.addColumnIndex(spark, df, ID_COL_NAME)
  val dfWithIgnoredCols = dfWithId.drop(colsToIgnore: _*)

  @tailrec
  def inner(
      cols: List[String],
      filterIdSeq: List[Long],
      dfWithId: DataFrame
  ): List[Long] = cols match {
    case Nil => filterIdSeq
    case column :: xs =>
      if (column == ID_COL_NAME) {
        inner(xs, filterIdSeq, dfWithId)
      } else {
        val (lowerBound, upperBound) = fn(column, dfWithId)
        val filteredIds =
          dfWithId
            .filter(s"$column < $lowerBound or $column > $upperBound")
            .select(col(ID_COL_NAME))
            .map(r => r.getLong(0))
            .collect
            .toList
        inner(xs, filteredIds ++ filterIdSeq, dfWithId)
      }
  }

  val filteredIds = inner(dfWithIgnoredCols.columns.toList, List.empty[Long], dfWithIgnoredCols)
  dfWithId.except(dfWithId.filter($"$ID_COL_NAME".isin(filteredIds: _*)))
}
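The fn bounds function is not shown above. A minimal sketch of one possible implementation is below (an assumption, not part of the original answer); note that it computes the IQR as q3 - q1, whereas the question's code used q1 - q3. Also, the median_house_value 0 line in the question's output shows that approxQuantile returned an empty array, which Spark does when a column has no non-null, non-NaN values left, so quantiles(0) is what throws the ArrayIndexOutOfBoundsException.
// Hypothetical bounds function to plug into outlierEliminator's fn parameter (sketch only)
def iqrBounds(column: String, df: DataFrame): (Double, Double) = {
  val quantiles = df.stat.approxQuantile(column, Array(0.25, 0.75), 0.0)
  val q1 = quantiles(0)
  val q3 = quantiles(1)
  val iqr = q3 - q1 // q3 - q1, not q1 - q3
  (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
}

val outlierFreeDF = outlierEliminator(incomingDF, List.empty[String])(iqrBounds)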

How to update a global variable inside RDD map operation

I have an RDD[(Int, Array[Double])], and after that I call a class function:
val rdd = spark.sparkContext.parallelize(Seq(
(1, Array(2.0,5.0,6.3)),
(5, Array(1.0,3.3,9.5)),
(1, Array(5.0,4.2,3.1)),
(2, Array(9.6,6.3,2.3)),
(1, Array(8.5,2.5,1.2)),
(5, Array(6.0,2.4,7.8)),
(2, Array(7.8,9.1,4.2))
)
)
val new_class = new ABC
new_class.demo(rdd)
Inside the class, a global variable value = 0 is declared. Inside demo(), a new variable new_value = 0 is declared. During the map operation, new_value gets updated, and the updated value is printed inside the map.
class ABC extends Serializable {
  var value = 0
  def demo(data_new: RDD[(Int, Array[Double])]): Unit = {
    var new_value = 0
    data_new.coalesce(1).map(x => {
      if (x._1 == 1)
        new_value = new_value + 1
      println(new_value)
      value = new_value
    }).count()
    println("Outside-->" + value)
  }
}
OUTPUT:
1
1
2
2
3
3
3
Outside-->0
How can I update the global variable value after the map operation?
I'm not sure exactly what you are doing, but you need to use accumulators to perform the type of operation where you need to add values like this.
Here is an example :
scala> val rdd = spark.sparkContext.parallelize(Seq(
| (1, Array(2.0,5.0,6.3)),
| (5, Array(1.0,3.3,9.5)),
| (1, Array(5.0,4.2,3.1)),
| (2, Array(9.6,6.3,2.3)),
| (1, Array(8.5,2.5,1.2)),
| (5, Array(6.0,2.4,7.8)),
| (2, Array(7.8,9.1,4.2))
| )
| )
rdd: org.apache.spark.rdd.RDD[(Int, Array[Double])] = ParallelCollectionRDD[83] at parallelize at <console>:24
scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 46181, name: Some(My Accumulator), value: 0)
scala> rdd.foreach { x => if(x._1 == 1) accum.add(1) }
scala> accum.value
res38: Long = 3
And as mentioned by @philantrovert, if you wish to count the number of occurrences of each key, you can do the following:
scala> rdd.mapValues(_ => 1L).reduceByKey(_ + _).take(3)
res41: Array[(Int, Long)] = Array((1,3), (2,2), (5,2))
You can also use countByKey, but it should be avoided with big datasets.
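For completeness, a countByKey call on the same rdd would look like the sketch below; it returns the full result map to the driver, which is why it is discouraged for a large number of distinct keys:
// countByKey returns a scala.collection.Map[Int, Long] on the driver
val countsPerKey = rdd.countByKey()
// for the sample data above this contains 1 -> 3, 5 -> 2, 2 -> 2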
No, you can't change global variables from inside a map.
If you are trying to count how many records have key 1, then you can use filter:
val value = data_new.filter(x => (x._1 == 1)).count
println("Outside-->" +value)
Output:
Outside-->3
Also, it is not recommended to use mutable variables (var). You should always try to use immutable ones (val).
I hope this helps!
Or you can achieve this in the following way as well:
class ABC extends Serializable {
  // count how many times each key occurs; the count for key 1 can be read from the result
  def demo(data_new: RDD[(Int, Array[Double])]): Array[(Int, Int)] = {
    data_new
      .map { case (key, _) => (key, 1) }
      .reduceByKey(_ + _)
      .collect()
  }
}

println("Outside-->" + new ABC().demo(rdd).mkString(", "))

Replacing the values of an RDD with another

I have two data sets like the ones below. Each data set has comma-separated numbers on each line.
Dataset 1
1,2,0,8,0
2,0,9,0,3
Dataset 2
7,5,4,6,3
4,9,2,1,8
I have to replace the zeroes of the first data set with the corresponding values from data set 2.
So the result would look like this:
1,2,4,8,3
2,9,9,1,3
I replaced the values with the code below.
val rdd1 = sc.textFile(dataset1).flatMap(l => l.split(","))
val rdd2 = sc.textFile(dataset2).flatMap(l => l.split(","))
val result = rdd1.zip(rdd2).map( x => if(x._1 == "0") x._2 else x._1)
The output I got is of the format RDD[String]. But I need the output in the format RDD[Array[String]] as this format would be more suitable for my further transformations.
If you want an RDD[Array[String]], where each array corresponds to a line, don't flatMap the values after splitting, just map them:
scala> val rdd1 = sc.parallelize(List("1,2,0,8,0", "2,0,9,0,3")).map(l => l.split(","))
rdd1: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[1] at map at <console>:27
scala> val rdd2 = sc.parallelize(List("7,5,4,6,3", "4,9,2,1,8")).map(l => l.split(","))
rdd2: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[3] at map at <console>:27
scala> val result = rdd1.zip(rdd2).map{case(arr1, arr2) => arr1.zip(arr2).map{case(v1, v2) => if(v1 == "0") v2 else v1}}
result: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:31
scala> result.collect
res0: Array[Array[String]] = Array(Array(1, 2, 4, 8, 3), Array(2, 9, 9, 1, 3))
or maybe less verbose:
val result = rdd1.zip(rdd2).map(t => t._1.zip(t._2).map(x => if(x._1 == "0") x._2 else x._1))

Average word length in Spark

I have a list of values and the aggregated lengths of all their occurrences, as an array.
Ex: If my sentence is
"I have a cat. The cat looks very cute"
My array looks like
Array((I,1), (have,4), (a,1), (cat,6), (The, 3), (looks, 5), (very ,4), (cute,4))
Now I want to compute the average length of each word, i.e. the length / number of occurrences.
I tried to do the coding using Scala as follows:
val avglen = arr.reduceByKey( (x,y) => (x, y.toDouble / x.size.toDouble) )
I'm getting the following error at x.size:
error: value size is not a member of Int
Please help me where I'm going wrong here.
After your comment I think I got it:
val words = sc.parallelize(Array(("i", 1), ("have", 4),
("a", 1), ("cat", 6),
("the", 3), ("looks", 5),
("very", 4), ("cute", 4)))
val avgs = words.map { case (word, count) => (word, count / word.length.toDouble) }
println("My averages are: ")
avgs.take(100).foreach(println)
Suppose you have a paragraph with those words and you want to calculate the mean size of the words in the paragraph.
In two steps, with a map-reduce approach and in spark-1.5.1:
val words = sc.parallelize(Array(("i", 1), ("have", 4),
("a", 1), ("cat", 6),
("the", 3), ("looks", 5),
("very", 4), ("cute", 4)))
val wordCount = words.map { case (word, count) => count}.reduce((a, b) => a + b)
val wordLength = words.map { case (word, count) => word.length * count}.reduce((a, b) => a + b)
println("The avg length is: " + wordLength / wordCount.toDouble)
I ran this code in an .ipynb connected to a Spark kernel.
If I understand the problem correctly:
val rdd: RDD[(String, Int)] = ???
val ave: RDD[(String, Double)] =
  rdd.map { case (name, numOccurance) =>
    (name, name.length.toDouble / numOccurance)
  }
This is a slightly confusing question. If your data is already in an Array[(String, Int)] collection (presumably after a collect() to the driver), then you need not use any RDD transformations. In fact, there's a nifty trick you can run with fold*() to grab the average over a collection:
val average = arr.foldLeft(0.0) { case (sum: Double, (_, count: Int)) => sum + count } /
  arr.foldLeft(0.0) { case (sum: Double, (word: String, count: Int)) => sum + count / word.length }
Kind of long-winded, but it essentially aggregates the total number of characters in the numerator and the number of words in the denominator. Run on your example, I see the following:
scala> val arr = Array(("I",1), ("have",4), ("a",1), ("cat",6), ("The", 3), ("looks", 5), ("very" ,4), ("cute",4))
arr: Array[(String, Int)] = Array((I,1), (have,4), (a,1), (cat,6), (The,3), (looks,5), (very,4), (cute,4))
scala> val average = ...
average: Double = 3.111111111111111
If you have your (String, Int) tuples distributed across an RDD[(String, Int)], you can use accumulators to solve this problem quite easily:
val chars = sc.accumulator(0.0)
val words = sc.accumulator(0.0)
wordsRDD.foreach { case (word: String, count: Int) =>
chars += count; words += count / word.length
}
val average = chars.value / words.value
When running on the above example (placed in an RDD), I see the following:
scala> val arr = Array(("I",1), ("have",4), ("a",1), ("cat",6), ("The", 3), ("looks", 5), ("very" ,4), ("cute",4))
arr: Array[(String, Int)] = Array((I,1), (have,4), (a,1), (cat,6), (The,3), (looks,5), (very,4), (cute,4))
scala> val wordsRDD = sc.parallelize(arr)
wordsRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:14
scala> val chars = sc.accumulator(0.0)
chars: org.apache.spark.Accumulator[Double] = 0.0
scala> val words = sc.accumulator(0.0)
words: org.apache.spark.Accumulator[Double] = 0.0
scala> wordsRDD.foreach { case (word: String, count: Int) =>
| chars += count; words += count / word.length
| }
...
scala> val average = chars.value / words.value
average: Double = 3.111111111111111

In Scala, how do I keep track of running totals without using var?

For example, suppose I wish to read in fat, carbs and protein and wish to print the running total of each variable. An imperative style would look like the following:
var totalFat = 0.0
var totalCarbs = 0.0
var totalProtein = 0.0
var lineNumber = 0
for (lineData <- allData) {
  totalFat += lineData...
  totalCarbs += lineData...
  totalProtein += lineData...
  lineNumber += 1
  printCSV(lineNumber, totalFat, totalCarbs, totalProtein)
}
How would I write the above using only vals?
Use scanLeft.
val zs = allData.scanLeft((0, 0.0, 0.0, 0.0)) { case (r, c) =>
  val lineNr = r._1 + 1
  val fat = r._2 + c...
  val carbs = r._3 + c...
  val protein = r._4 + c...
  (lineNr, fat, carbs, protein)
}
zs foreach Function.tupled(printCSV)
Recursion. Pass the sums from the previous row to a function that adds them to the values from the current row, prints them to CSV, and passes them on to itself, as sketched below.
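A minimal sketch of that recursive idea, assuming each line exposes fat, carbs and protein fields and that printCSV and allData are available as in the question (the Line case class and its field names are assumptions):
// Hypothetical line type and recursive accumulation -- a sketch only
case class Line(fat: Double, carbs: Double, protein: Double)

@scala.annotation.tailrec
def loop(rows: List[Line], lineNumber: Int, totalFat: Double, totalCarbs: Double, totalProtein: Double): Unit =
  rows match {
    case Nil => ()
    case row :: rest =>
      val (f, c, p) = (totalFat + row.fat, totalCarbs + row.carbs, totalProtein + row.protein)
      printCSV(lineNumber + 1, f, c, p)   // print the running totals for this row
      loop(rest, lineNumber + 1, f, c, p) // recurse with the updated totals
  }

loop(allData.toList, 0, 0.0, 0.0, 0.0)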
You can transform your data with map and get the total result with sum:
val total = allData map { ... } sum
With scanLeft you get the particular sums of each step:
val steps = allData.scanLeft(0) { case (sum,lineData) => sum+lineData}
val result = steps.last
If you want to create several new values in one iteration step, I would prefer a class which holds the values:
case class X(i: Int, str: String)

object X {
  def empty = X(0, "")
}

(1 to 10).scanLeft(X.empty) { case (sum, data) => X(sum.i + data, sum.str + data) }
It's just a jump to the left,
and then a fold to the right /:
class Data(val a: Int, val b: Int, val c: Int)

val list = List(new Data(3, 4, 5), new Data(4, 2, 3),
                new Data(0, 6, 2), new Data(2, 4, 8))

val res = (new Data(0, 0, 0) /: list)((acc, x) =>
  new Data(acc.a + x.a, acc.b + x.b, acc.c + x.c))