How to sort a text file containing integers in Spark - Scala?

I am new to Spark programming. I have a data file named "test1.in" which contains random numbers in the following way:
123
34
1
45
65
I want to sort these numbers using Spark and write the output to a new file. Here is my code so far:
import org.apache.spark.{SparkContext, SparkConf}
val conf = new SparkConf().setMaster("local[*]").setAppName("SortingApp")
val sc = new SparkContext(conf)
val data = sc.textFile("src/main/resources/test1.in")
val d1 = data.map(_.sorted)
d1.foreach(println _)
The result is not what I expected.

When you call:
data.map(_.sorted)
You map each record (which is a String) to its "sorted" version, which means each String is treated as a sequence of characters and those characters are sorted.
What you need is NOT map, which applies your function to each record separately (hence it can't order the records relative to each other), but RDD.sortBy:
data.map(_.toInt).sortBy(t => t)
The t => t is the identity function returning the input as-is, which can be replaced with Scala's built-in generic implementation:
data.map(_.toInt).sortBy(identity)
Or, the shortest version:
data.sortBy(_.toInt)
(which would return a result of type RDD[String])
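For completeness, since the question also asks to write the output to a new file, here is a minimal end-to-end sketch along those lines (the output path is just a placeholder; saveAsTextFile writes a directory of part files):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setMaster("local[*]").setAppName("SortingApp")
val sc = new SparkContext(conf)
val data = sc.textFile("src/main/resources/test1.in")
val sortedNumbers = data.map(_.toInt).sortBy(identity)        // RDD[Int], ascending
sortedNumbers.saveAsTextFile("src/main/resources/test1.out")  // placeholder output directory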

Use the line below to convert the text file data to Int and then sort it:
val d1 = data.map(_.toInt).sortBy(identity)

Related

Converting Fields to Ints, Doubles, etc. in Scala in Spark Shell RDD

I have an assignment where I need to load a CSV dataset in the spark-shell using spark.read.csv() and accomplish the following:
Convert the dataset to an RDD
Remove the header (the first record/line in the dataset)
Convert the first two fields to integers
Convert the other fields, except the last one, to doubles; question marks should be NaN. The last field should be converted to a Boolean.
I was able to do steps 1 and 2 with the following code:
//load the dataset as an RDD
val dataRDD = spark.read.csv("block_1.csv").rdd //output is org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[14] at rdd at <console>:23
dataRDD.count() //output 574914
//import Row since RDD is of Row
import org.apache.spark.sql.Row
//function to recognize if a string contains "id_1"
def isHeader(r : Row) = r.toString.contains("id_1")
//the filter function will take the !isHeader function and apply it to all lines in dataRDD; the return will form another RDD
val nohead = dataRDD.filter(x => !isHeader(x))
nohead.count() //output is now 574913
nohead.first //output is [37291,53113,0.833333333333333,?,1,?,1,1,1,1,0,TRUE]
nohead //output is org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[15] at filter at <console>:28
I'm trying to convert the fields, but every time I use a function like toDouble I get an error stating that it is not a member of the type:
:25: error: value toDouble is not a member of
org.apache.spark.sql.Row
if ("?".equals(s)) Double.NaN else s.toDouble
I'm not sure what I'm doing wrong. I've taken a look at https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Row.html#anyNull() but I still can't see the problem.
I'm not sure how to convert something if there isn't a toDouble, toInt, or toBoolean function.
Can someone please guide me in the right direction to figure out what I'm doing wrong, or point me to where I can look for an answer? I need to convert the first two fields to integers and the other fields, except for the last one, to doubles. Question marks should be NaN. The last field should be converted to a Boolean.
Convert the first two fields to integers
Convert other fields except the last one to doubles. Question marks should be NaN. The last field should be converted to a Boolean.
You can do both 3 and 4 at once using a parse function.
First create the toDouble function since it is used in the parse function:
def toDouble(s: String) = {
  if ("?".equals(s)) Double.NaN else s.toDouble
}
def parse(line: String) = {
  val pieces = line.split(',')
  val id1 = pieces(0).toInt
  val id2 = pieces(1).toInt
  val scores = pieces.slice(2, 11).map(toDouble)
  val matched = pieces(11).toBoolean
  (id1, id2, scores, matched)
}
After you do this, you can call parse on each row in your RDD using map. However, you still have the type issue. To fix this, you could convert nohead from an RDD[Row] to an RDD[String], but it's probably easier to just convert each row to a string as you pass it:
val parsed = nohead.map(row => parse(row.mkString(",")))
This will give parsed as type: RDD[(Int, Int, Array[Double], Boolean)]
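If you would rather stay with Row instead of joining the fields back into a comma-separated String, here is a sketch along the same lines, reusing the toDouble above (this assumes every column was read as a String, which is what spark.read.csv gives you when no schema is supplied):
import org.apache.spark.sql.Row
def parseRow(r: Row) = {
  val id1 = r.getString(0).toInt
  val id2 = r.getString(1).toInt
  val scores = (2 until 11).map(i => toDouble(r.getString(i))).toArray
  val matched = r.getString(11).toBoolean
  (id1, id2, scores, matched)
}
val parsedRows = nohead.map(parseRow)   // RDD[(Int, Int, Array[Double], Boolean)]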

How to split 1 RDD into 6 parts in a performant manner?

I have built a Spark RDD where each element of this RDD is a JAXB Root Element representing an XML Record.
I want to split this RDD so as to produce 6 RDDs from this set. Essentially this job simply converts the hierarchical XML structure into 6 sets of flat CSV records. I am currently passing over the same RDD six times to do this.
xmlRdd.cache()
val rddOfTypeA = xmlRdd.map { iterate over XML Object and create Type A }
rddOfTypeA.saveAsTextFile("s3://...")
val rddOfTypeB = xmlRdd.map { iterate over XML Object and create Type B }
rddOfTypeB.saveAsTextFile("s3://...")
val rddOfTypeC = xmlRdd.map { iterate over XML Object and create Type C }
rddOfTypeC.saveAsTextFile("s3://...")
val rddOfTypeD = xmlRdd.map { iterate over XML Object and create Type D }
rddOfTypeD.saveAsTextFile("s3://...")
val rddOfTypeE = xmlRdd.map { iterate over XML Object and create Type E }
rddOfTypeE.saveAsTextFile("s3://...")
val rddOfTypeF = xmlRdd.map { iterate over XML Object and create Type F }
rddOfTypeF.saveAsTextFile("s3://...")
My input dataset is 35 million records split into 186 files of 448 MB each, stored in Amazon S3. My output directory is also on S3. I am using Spark on EMR.
With a six-node m4.4xlarge cluster it takes 38 minutes to finish this splitting and write the output.
Is there an efficient way to achieve this without walking over the RDD six times?
The easiest solution (from a Spark developer's perspective) is to do the map and saveAsTextFile per RDD on a separate thread.
What's not widely known (and hence rarely exploited) is the fact that SparkContext is thread-safe and so can be used to submit jobs from separate threads.
With that said, you could do the following (using the simplest Scala solution with Future, though not necessarily the best one, since a Future starts its computation at instantiation time, not when you tell it to):
xmlRdd.cache()
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
val f1 = Future {
  val rddOfTypeA = xmlRdd.map { /* map xml to csv for Type A */ }
  rddOfTypeA.saveAsTextFile("s3://...")
}
val f2 = Future {
  val rddOfTypeB = xmlRdd.map { /* map xml to csv for Type B */ }
  rddOfTypeB.saveAsTextFile("s3://...")
}
...
Future.sequence(Seq(f1, f2)).onComplete { ... }
That could cut the wall-clock time for the mapping and saving, but it would not cut the number of scans over the dataset. That should not be a big deal anyway, since the dataset is cached and hence served from memory and/or disk (note that RDD.cache() defaults to MEMORY_ONLY; it is Spark SQL's Dataset.cache that defaults to MEMORY_AND_DISK).
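One more caveat about the snippet above: onComplete only registers a callback and does not block, so a standalone driver could reach the end of main before the saves have finished. Blocking on the combined future avoids that, for example:
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
// f1, f2, ... are the futures created above; wait until every save has completed
Await.result(Future.sequence(Seq(f1, f2)), Duration.Inf)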
Depending on your requirements regarding output paths, you can solve it using a simple partitionBy clause with the standard DataFrameWriter.
Instead of multiple maps, design a function that takes an element of xmlRdd and returns a Seq of tuples. The general structure would be like this:
def extractTypes(value: T): Seq[(String, String)] = {
  val a: String = extractA(value)
  val b: String = extractB(value)
  ...
  val f: String = extractF(value)
  Seq(("A", a), ("B", b), ..., ("F", f))
}
flatMap, convert to Dataset and write:
// toDF on an RDD requires the SparkSession implicits in scope, i.e. import spark.implicits._
xmlRdd.flatMap(extractTypes _).toDF("id", "value").write
  .partitionBy("id")
  .option("escapeQuotes", "false")
  .csv(...)
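With partitionBy("id") the writer produces one sub-directory per type under the output path (id=A/, id=B/, ..., id=F/), so each type can be consumed on its own afterwards; for example (the path is the same placeholder as above, and spark is assumed to be a SparkSession):
// layout: .../id=A/part-*.csv, .../id=B/part-*.csv, ..., .../id=F/part-*.csv
val typeAOnly = spark.read.csv("s3://.../id=A")   // reads back only the Type A records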

Spark scala filter multiple rdd based on string length

I am trying to solve a quiz question, which is as below:
Write the missing code in the given program to display the expected output, identifying animals that have names with four letters.
Output: Array((4,lion))
Program
val a = sc.parallelize(List("dog","tiger","lion","cat","spider","eagle"),2)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant","falcon","squid"),2)
val d = c.keyBy(_.length)
I have tried to write the code in the spark shell but got stuck on the syntax for combining the RDDs and applying a filter.
How about using the PairRDD lookup method:
b.lookup(4).toArray
// res1: Array[String] = Array(lion)
d.lookup(4).toArray
// res2: Array[String] = Array()
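If the quiz really expects the key to be kept in the result (Array((4,lion))) and the two pair RDDs to be combined, a union plus a filter is one sketch that yields exactly that output:
b.union(d).filter { case (len, _) => len == 4 }.collect()
// Array[(Int, String)] = Array((4,lion))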

Counting all the characters in the file using Spark/scala?

How can I count all the characters in a file using Spark/Scala? Here is what I am doing in the spark shell:
scala> val logFile = sc.textFile("ClasspathLength.txt")
scala> val counts = logFile.flatMap(line=>line.split("").map(char=>(char,1))).reduceByKey(_ + _)
scala> println(counts.count())
62
I am getting an incorrect count here. Could someone help me fix this?
What you're doing here is:
Counting the number of times each unique character appears in the input:
val counts = logFile.flatMap(line=>line.split("").map(char=>(char,1))).reduceByKey(_ + _)
and then:
Counting the number of records in this result (using counts.count()), which ignores the actual values you calculated in the previous step.
If you're interested in the total number of characters in the file, there's no need for grouping at all: you can map each line to its length and then use the implicit conversion to DoubleRDDFunctions to call sum():
logFile.map(_.length).sum()
Alternatively you can flatMap into separate record per character and then use count:
logFile.flatMap(_.toList).count
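If you would rather keep the counts RDD you already built, you can also just sum its values instead of counting its records:
counts.values.sum()   // total number of characters across all lines (newlines not counted, since textFile strips them)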
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[4]")
  .appName("Nos of Word Count")
  .getOrCreate()
val sc = spark.sparkContext
sc.setLogLevel("ERROR")

val rdd1 = sc.textFile("data/mini.txt")
println(rdd1.count())                 // number of lines
val rdd2 = rdd1.flatMap(_.split(" ")) // split each line into words
println(rdd2.count())                 // number of words
val rdd3 = rdd2.map(_.length)         // length of each word
println(rdd3.sum().round)             // total characters, excluding spaces and newlines
All you need here is flatMap + count
logFile.flatMap(_.toSeq).count

Spark DataFrame zipWithIndex

I am using a DataFrame to read in .parquet files, but then turning them into an RDD to do the normal processing I want to do on them.
So I have my file:
val dataSplit = sqlContext.parquetFile("input.parquet")
val convRDD = dataSplit.rdd
val columnIndex = convRDD.flatMap(r => r.zipWithIndex)
I get the following error even when I convert from a DataFrame to an RDD:
:26: error: value zipWithIndex is not a member of
org.apache.spark.sql.Row
Does anyone know how to do what I am trying to do, essentially getting the value and the column index?
I was thinking something like:
val dataSplit = sqlContext.parquetFile(inputVal.toString)
val schema = dataSplit.schema
val columnIndex = dataSplit.flatMap(r => 0 until schema.length
but I am getting stuck on the last part, as I am not sure how to do the equivalent of zipWithIndex.
You can simply convert Row to Seq:
convRDD.flatMap(r => r.toSeq.zipWithIndex)
An important thing to note here is that extracting type information becomes tricky: Row.toSeq returns Seq[Any], and the resulting RDD is RDD[(Any, Int)].
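If you also need to know which column each value came from, you can zip the values with the field names from the schema as well (convRDD and dataSplit as defined in the question; the column names are whatever the Parquet schema declares):
val fieldNames = dataSplit.schema.fieldNames.toSeq   // column names from the Parquet schema
val indexed = convRDD.flatMap { r =>
  r.toSeq.zip(fieldNames).zipWithIndex               // ((value, columnName), columnIndex)
}
// indexed: RDD[((Any, String), Int)]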