Scala sort output on Key and then alphabetically

I'm trying out my first Scala program to sort the following output such that when the value is identical, words are sorted alphabetically.
cookie 8
document 6
function 5
name 5
start 5
My current code is as follows:
import org.apache.spark.{SparkConf, SparkContext}

object Problem1 {
  def main(args: Array[String]) {
    val inputFile = args(0)
    val outputFolder = args(1)
    val kValue = args(2)
    val conf = new SparkConf().setAppName("Problem1").setMaster("local")
    val sc = new SparkContext(conf)
    val input = sc.textFile(inputFile)
    val words = input.flatMap(line => line.toLowerCase().split("[\\s*&#^'\\,.:;?!\\[\\](){}<>~\\-_]+"))
      .filter(x => x.matches("[A-Za-z]+") && x.length > 2)
      .map(word => (word, 1)).reduceByKey(_ + _).map(_.swap)
    val freq = words.sortByKey(false, 1).map(_.swap).take(kValue.toInt)
    val topKrdd = sc.parallelize(freq)
    val tabSeparated = topKrdd.map(f => f._1 + "\t" + f._2)
    tabSeparated.saveAsTextFile(outputFolder)
  }
}
Can someone help me with the alphabetical sort for the lines where the numerical value is identical?

Usually Scala provides and uses an implicit Ordering for methods like sortByKey, but you can also construct a custom one and pass it in explicitly. The Ordering trait and companion object have a fair few helpful methods for this. You could do this:
val ord = Ordering.Tuple2(Ordering[Int].reverse, Ordering[String])
val freq = words.takeOrdered(kValue.toInt)(ord).map(_.swap)
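For completeness, here is a sketch of how that slots into the tail of the pipeline from the question (takeOrdered returns a local Array on the driver, so sc.parallelize is still needed before saving):
// Sketch: top-k by count (descending), ties broken alphabetically by word.
val ord = Ordering.Tuple2(Ordering[Int].reverse, Ordering[String])
// `words` is the RDD[(Int, String)] of (count, word) pairs built above.
val freq = words.takeOrdered(kValue.toInt)(ord).map(_.swap)          // Array[(String, Int)]
val tabSeparated = sc.parallelize(freq).map(f => f._1 + "\t" + f._2)
tabSeparated.saveAsTextFile(outputFolder)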

Related

Understanding the operation of map function

I came across the following example from the book "Fast Data Processing with Spark" by Holden Karau. I did not understand what the following lines of code do in the program:
val splitLines = inFile.map(line => {
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
})
val numericData = splitLines.map(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
The program is:
package pandaspark.examples

import spark.SparkContext
import spark.SparkContext._
import spark.SparkFiles
import au.com.bytecode.opencsv.CSVReader
import java.io.StringReader

object LoadCsvExample {
  def main(args: Array[String]) {
    if (args.length != 2) {
      System.err.println("Usage: LoadCsvExample <master> <inputfile>")
      System.exit(1)
    }
    val master = args(0)
    val inputFile = args(1)
    val sc = new SparkContext(master, "Load CSV Example",
      System.getenv("SPARK_HOME"), Seq(System.getenv("JARS")))
    sc.addFile(inputFile)
    val inFile = sc.textFile(inputFile)
    val splitLines = inFile.map(line => {
      val reader = new CSVReader(new StringReader(line))
      reader.readNext()
    })
    val numericData = splitLines.map(line => line.map(_.toDouble))
    val summedData = numericData.map(row => row.sum)
    println(summedData.collect().mkString(","))
  }
}
I broadly understand what the program does: it parses the input CSV and sums all the rows. But how exactly those three lines of code achieve that is what I am unable to understand.
Also could anyone explain how the output would change if those lines are replaced with flatMap? Like:
val splitLines = inFile.flatMap(line => {
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
})
val numericData = splitLines.flatMap(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
val splitLines = inFile.map(line => {
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
})
val numericData = splitLines.map(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
So this code basically reads CSV data and sums each row's values.
Suppose your CSV file is something like -
10,12,13
1,2,3,4
1,2
Here inFile fetches the data from the CSV file, like -
val inFile = sc.textFile("your CSV file path")
so inFile is an RDD which holds the data as text.
When you apply collect on it, it will look like this -
Array[String] = Array(10,12,13 , 1,2,3,4 , 1,2)
and when you apply map over it, each call sees one line -
line = 10,12,13
line = 1,2,3,4
line = 1,2
To read each line in CSV format, it uses -
val reader = new CSVReader(new StringReader(line))
reader.readNext()
so after reading the data as CSV, splitLines looks like -
Array(
  Array(10,12,13),
  Array(1,2,3,4),
  Array(1,2)
)
On splitLines, it applies
splitLines.map(line => line.map(_.toDouble))
Here line is, for example, Array(10,12,13), and on it the code applies
line.map(_.toDouble)
which changes every element's type from String to Double. So in numericData you get the same arrays,
Array(Array(10.0, 12.0, 13.0), Array(1.0, 2.0, 3.0, 4.0), Array(1.0, 2.0))
but with every element now a Double.
Finally it takes the sum of each individual row/array, so the result is something like -
Array(35.0, 10.0, 3.0)
which you will see when you apply summedData.collect().
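If it helps to see the same three steps without Spark, here is a minimal local sketch on plain Scala collections, using split(",") in place of CSVReader for simplicity and the sample rows assumed above:
// Minimal local sketch of the three steps on plain Scala collections.
val inFile = Seq("10,12,13", "1,2,3,4", "1,2")
val splitLines = inFile.map(line => line.split(","))               // Seq[Array[String]]
val numericData = splitLines.map(line => line.map(_.toDouble))     // Seq[Array[Double]]
val summedData = numericData.map(row => row.sum)                   // Seq[Double]
println(summedData.mkString(","))                                  // 35.0,10.0,3.0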
First of all, there is no flatMap operation in your code sample, so the title is misleading. But in general, map called on a collection returns a new collection with the function applied to each element of the collection.
Going line by line through your code snippet:
val splitLines = inFile.map(line => {
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
})
The type of inFile is RDD[String]. You take every such string, create a CSV reader out of it and call readNext (which returns an array of strings). So at the end you get RDD[Array[String]].
val numericData = splitLines.map(line => line.map(_.toDouble))
A slightly trickier line, with two nested map operations. Again, you take each element of the RDD (which is now an Array[String]) and apply the _.toDouble function to every element of that array. At the end you get RDD[Array[Double]].
val summedData = numericData.map(row => row.sum)
You take the elements of the RDD and apply the sum function to them. Since every element is an Array[Double], sum produces a single Double value. At the end you get RDD[Double].
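As for the flatMap variant asked about in the question: flatMap flattens the per-line arrays into one flat collection, so the per-row grouping is lost and the later row.sum no longer type-checks. A rough local sketch of the difference, again on plain Scala collections with the assumed sample rows:
// map keeps the per-line structure: one Array[Double] per input line.
val splitLines = Seq(Array("10", "12", "13"), Array("1", "2", "3", "4"), Array("1", "2"))
val perRow = splitLines.map(line => line.map(_.toDouble))          // Seq[Array[Double]]
println(perRow.map(_.sum))                                         // List(35.0, 10.0, 3.0)
// flatMap flattens everything into a single sequence of Doubles.
val flattened = splitLines.flatMap(line => line.map(_.toDouble))   // Seq[Double]
println(flattened.sum)                                             // 48.0
// With flatMap, numericData holds plain Doubles, so `row.sum` in summedData
// would not compile: a single Double has no sum method.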

how to join two datasets by key in scala spark

I have two datasets, and each dataset has two elements.
Below are examples.
Data1: (name, animal)
('abc,def', 'monkey(1)')
('df,gh', 'zebra')
...
Data2: (name, fruit)
('a,efg', 'apple')
('abc,def', 'banana(1)')
...
Results expected: (name, animal, fruit)
('abc,def', 'monkey(1)', 'banana(1)')
...
I want to join these two datasets using the first column, 'name'. I have tried to do this for a couple of hours, but I couldn't figure it out. Can anyone help me?
val sparkConf = new SparkConf().setAppName("abc").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val text1 = sc.textFile(args(0))
val text2 = sc.textFile(args(1))
val joined = text1.join(text2)
The above code is not working!
join is defined on RDDs of pairs, that is, RDDs of type RDD[(K,V)].
The first step needed is to transform the input data into the right type.
We first need to transform the original data of type String into pairs of (Key, Value):
val parse: String => (String, String) = s => {
  val regex = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r
  s match {
    case regex(k, v) => (k, v)
    case _ => ("", "")
  }
}
(Note that we can't use a simple split(",") expression because the key contains commas)
Then we use that function to parse the text input data:
val s1 = Seq("('abc,def', 'monkey(1)')","('df,gh', 'zebra')")
val s2 = Seq("('a,efg', 'apple')","('abc,def', 'banana(1)')")
val rdd1 = sparkContext.parallelize(s1)
val rdd2 = sparkContext.parallelize(s2)
val kvRdd1 = rdd1.map(parse)
val kvRdd2 = rdd2.map(parse)
Finally, we use the join method to join the two RDDs
val joined = kvRdd1.join(kvRdd2)
// Let's check out results
joined.collect
// res31: Array[(String, (String, String))] = Array((abc,def,(monkey(1),banana(1))))
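If you want the flat (name, animal, fruit) shape from the question rather than the nested (name, (animal, fruit)) pairs that join produces, one more map does it (a small sketch on top of the result above):
// Reshape the joined pairs into flat (name, animal, fruit) tuples.
val flat = joined.map { case (name, (animal, fruit)) => (name, animal, fruit) }
flat.collect
// Array[(String, String, String)] = Array((abc,def,monkey(1),banana(1)))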
You have to create pair RDDs for your data sets first, and then apply the join transformation. Your data sets do not look accurate.
Please consider the example below.
**Dataset1**
a 1
b 2
c 3
**Dataset2**
a 8
b 4
Your Scala code should look like this:
val pairRDD1 = sc.textFile("/path_to_yourfile/first.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val pairRDD2 = sc.textFile("/path_to_yourfile/second.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val joinRDD = pairRDD1.join(pairRDD2)
joinRDD.collect
Here is the result from the Scala shell:
res10: Array[(String, (String, String))] = Array((a,(1,8)), (b,(2,4)))

Scala Spark: Split collection into several RDD?

Is there any Spark function that allows splitting a collection into several RDDs according to some criteria? Such a function would avoid excessive iteration. For example:
def main(args: Array[String]) {
  val logFile = "file.txt"
  val conf = new SparkConf().setAppName("Simple Application")
  val sc = new SparkContext(conf)
  val logData = sc.textFile(logFile, 2).cache()
  val lineAs = logData.filter(line => line.contains("a")).saveAsTextFile("linesA.txt")
  val lineBs = logData.filter(line => line.contains("b")).saveAsTextFile("linesB.txt")
}
In this example I have to iterate over logData twice just to write the results to two separate files:
val lineAs = logData.filter(line => line.contains("a")).saveAsTextFile("linesA.txt")
val lineBs = logData.filter(line => line.contains("b")).saveAsTextFile("linesB.txt")
It would be nice instead to have something like this:
val resultMap = logData.map(line => if (line.contains("a")) ("a", line) else if (line.contains("b")) ("b", line) else (" - ", line))
resultMap.writeByKey("a", "linesA.txt")
resultMap.writeByKey("b", "linesB.txt")
Any such thing?
Maybe something like this would work:
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

def singlePassMultiFilter[T: ClassTag](   // ClassTag is needed for the flatMap calls below
    rdd: RDD[T],
    f1: T => Boolean,
    f2: T => Boolean,
    level: StorageLevel = StorageLevel.MEMORY_ONLY
): (RDD[T], RDD[T], Boolean => Unit) = {
  val tempRDD = rdd mapPartitions { iter =>
    val abuf1 = ArrayBuffer.empty[T]
    val abuf2 = ArrayBuffer.empty[T]
    for (x <- iter) {
      if (f1(x)) abuf1 += x
      if (f2(x)) abuf2 += x
    }
    Iterator.single((abuf1, abuf2))
  }
  tempRDD.persist(level)
  val rdd1 = tempRDD.flatMap(_._1)
  val rdd2 = tempRDD.flatMap(_._2)
  (rdd1, rdd2, (blocking: Boolean) => tempRDD.unpersist(blocking))
}
Note that an action called on rdd1 (resp. rdd2) will cause tempRDD to be computed and persisted. This is practically equivalent to computing rdd2 (resp. rdd1), since the overhead of the flatMap in the definitions of rdd1 and rdd2 is, I believe, pretty negligible.
You would use singlePassMultiFilter like so:
val (rdd1, rdd2, cleanUp) = singlePassMultiFilter(rdd, f1, f2)
rdd1.persist() //I'm going to need `rdd1` more later...
println(rdd1.count)
println(rdd2.count)
cleanUp(true) //I'm done with `rdd2` and `rdd1` has been persisted so free stuff up...
println(rdd1.distinct.count)
Clearly this could be extended to an arbitrary number of filters, collections of filters, etc.; see the sketch below for one way to generalise it.
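For instance, a rough generalisation over a sequence of predicates might look like this (a sketch under the same assumptions as above, not tied to any particular Spark version):
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

// Sketch: one pass over `rdd`, one output RDD per predicate in `filters`.
def singlePassMultiFilterN[T: ClassTag](
    rdd: RDD[T],
    filters: Seq[T => Boolean],
    level: StorageLevel = StorageLevel.MEMORY_ONLY
): (Seq[RDD[T]], Boolean => Unit) = {
  val tempRDD = rdd.mapPartitions { iter =>
    val bufs = Array.fill(filters.size)(ArrayBuffer.empty[T])
    for (x <- iter; i <- filters.indices if filters(i)(x)) bufs(i) += x
    Iterator.single(bufs)
  }
  tempRDD.persist(level)
  val rdds = filters.indices.map(i => tempRDD.flatMap(_(i)))
  (rdds, (blocking: Boolean) => tempRDD.unpersist(blocking))
}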
Have a look at the following question.
Write to multiple outputs by key Spark - one Spark job
You can flatMap an RDD with a function like the following and then do a groupBy on the key.
def multiFilter(words:List[String], line:String) = for { word <- words; if line.contains(word) } yield { (word,line) }
val filterWords = List("a","b")
val filteredRDD = logData.flatMap( line => multiFilter(filterWords, line) )
val groupedRDD = filteredRDD.groupBy(_._1)
But depending on the size of your input RDD you may or may not see any performance gain, because any groupBy operation involves a shuffle.
On the other hand, if you have enough memory in your Spark cluster you can cache the input RDD, and then running multiple filter operations may not be as expensive as you think; see the sketch below.
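A minimal sketch of that caching approach, reusing the file names from the question:
// Cache the input once, then run each filter over the cached data.
val logData = sc.textFile("file.txt").cache()
logData.filter(_.contains("a")).saveAsTextFile("linesA.txt")
logData.filter(_.contains("b")).saveAsTextFile("linesB.txt")
logData.unpersist()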

convert scala string to RDD[Seq[String]]

// 4 workers
val sc = new SparkContext("local[4]", "naivebayes")
// Load documents (one per line).
val documents: RDD[Seq[String]] = sc.textFile("/tmp/test.txt").map(_.split(" ").toSeq)
documents.zipWithIndex.foreach {
  case (e, i) =>
    val collectedResult = Tokenizer.tokenize(e.mkString)
}
val hashingTF = new HashingTF()
//pass collectedResult instead of document
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
In the above code snippet, I want to extract collectedResult so I can reuse it for hashingTF.transform. How can this be achieved, given that the signature of the tokenize function is
def tokenize(content: String): Seq[String] = {
...
}
Looks like you want map rather than foreach. I don't understand what you're using zipWithIndex for, nor why you're calling split on your lines only to join them straight back up again with mkString.
val lines: RDD[String] = sc.textFile("/tmp/test.txt")
val tokenizedLines = lines.map(tokenize)            // RDD[Seq[String]]
val hashes = hashingTF.transform(tokenizedLines)    // hashingTF from the question
hashes.cache()
...
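Roughly, the IDF part of the question's pipeline would then hang off hashes in the same way (a sketch, assuming MLlib's HashingTF and IDF as in the question):
import org.apache.spark.mllib.feature.IDF

// Continue the question's pipeline on the hashed term frequencies.
val idf = new IDF().fit(hashes)
val tfidf = idf.transform(hashes)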

sortByKey in Spark

New to Spark and Scala. Trying to sort a word counting example. My code is based on this simple example.
I want to sort the results alphabetically by key. If I add the key sort to an RDD:
val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
then I get a compile error:
error: No implicit view available from java.io.Serializable => Ordered[java.io.Serializable].
[INFO] val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
I don't know what the lack of an implicit view means. Can someone tell me how to fix it? I am running the Cloudera 5 Quickstart VM. I think it bundles Spark version 0.9.
Source of the Scala job
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val files = sc.textFile(args(0)).map(_.split(","))
    def f(x: Array[String]) = {
      if (x.length > 3)
        x(3)
      else
        Array("NO NAME")
    }
    val names = files.map(f)
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
    System.out.println(wordCounts.collect().mkString("\n"))
  }
}
Some (unsorted) output
("INTERNATIONAL EYELETS INC",879)
("SHAQUITA SALLEY",865)
("PAZ DURIGA",791)
("TERESSA ALCARAZ",824)
("MING CHAIX",878)
("JACKSON SHIELDS YEISER",837)
("AUDRY HULLINGER",875)
("GABRIELLE MOLANDS",802)
("TAM TACKER",775)
("HYACINTH VITELA",837)
No implicit view means there is no Scala function like this defined:
implicit def SerializableToOrdered(x: java.io.Serializable) = new Ordered[java.io.Serializable](x) // note this function doesn't work
This error comes up because your function returns two different types whose common supertype is java.io.Serializable (one branch returns a String, the other an Array[String]). Also, sortByKey for obvious reasons requires the key type to have an Ordering. Fix it like this:
object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val files = sc.textFile(args(0)).map(_.split(","))
    def f(x: Array[String]) = {
      if (x.length > 3)
        x(3)
      else
        "NO NAME"
    }
    val names = files.map(f)
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
    System.out.println(wordCounts.collect().mkString("\n"))
  }
}
Now the function just returns Strings instead of two different types, so the key type is String and sortByKey can find its implicit Ordering.
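A minimal sketch to see it work, with names taken from the sample output above:
// With String keys, sortByKey picks up the implicit Ordering[String]
// and sorts the pairs alphabetically.
val names = sc.parallelize(Seq("MING CHAIX", "AUDRY HULLINGER", "TAM TACKER", "MING CHAIX"))
val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
println(wordCounts.collect().mkString("\n"))
// (AUDRY HULLINGER,1)
// (MING CHAIX,2)
// (TAM TACKER,1)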