Counting all the characters in a file using Spark/Scala? - scala

How can I count all the characters in a file using Spark/Scala? Here is what I am doing in the spark shell:
scala> val logFile = sc.textFile("ClasspathLength.txt")
scala> val counts = logFile.flatMap(line=>line.split("").map(char=>(char,1))).reduceByKey(_ + _)
scala> println(counts.count())
62
I am getting incorrect count here. Could someone help me fix this?

What you're doing here is:
Counting the number of times each unique character appears in the input:
val counts = logFile.flatMap(line=>line.split("").map(char=>(char,1))).reduceByKey(_ + _)
and then:
Counting the number of records in this result (using counts.count()), which ignores the actual values you calculated in the previous step. That's why you get 62: it's the number of distinct characters in the file, not the total number of characters.
If you're interested in the total number of characters in the file, there's no need for grouping at all: you can map each line to its length and then use the implicit conversion into DoubleRDDFunctions to call sum():
logFile.map(_.length).sum()
Alternatively, you can flatMap into a separate record per character and then use count:
logFile.flatMap(_.toList).count
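If you do want to keep the per-character counts and still get the total, one option (a small sketch reusing the counts RDD defined in the question) is to sum the values of the pair RDD instead of counting its records:
// total number of characters = sum of all per-character counts
counts.values.sum()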

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[4]")
  .appName("Nos of Word Count")
  .getOrCreate()
val sc = spark.sparkContext
sc.setLogLevel("ERROR")

val lines = sc.textFile("data/mini.txt")
println(lines.count())            // number of lines
val words = lines.flatMap(_.split(" "))
println(words.count())            // number of words
val wordLengths = words.map(_.length)
println(wordLengths.sum().round)  // total number of characters, excluding spaces

All you need here is flatMap + count:
logFile.flatMap(_.toSeq).count

Related

Write scala filter on data frame, a column should have more than two words

val tempDf = Df.filter(Df("column_1") === "200")
Now I want to filter tempDf on the basis of one column (column_2), which should have more than 2 words.
val extractedDf = tempDf.filter(*)
How can we write the filter in Scala at *?
You can use the size and split functions:
val extractedDf = tempDf.filter(size(split($"column_2"," ")) > 2)
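For reference, a minimal self-contained sketch; the column names match the question, but the SparkSession setup and sample data are made up for illustration:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{size, split}

val spark = SparkSession.builder().master("local[*]").appName("FilterByWordCount").getOrCreate()
import spark.implicits._

// hypothetical data with the two columns from the question
val Df = Seq(
  ("200", "one two three"),
  ("200", "one two"),
  ("300", "four five six seven")
).toDF("column_1", "column_2")

val tempDf = Df.filter(Df("column_1") === "200")
// keep rows whose column_2 splits into more than 2 space-separated words
val extractedDf = tempDf.filter(size(split($"column_2", " ")) > 2)
extractedDf.show()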

Spark scala filter multiple rdd based on string length

I am trying to solve one of the quiz questions, which is as below:
Write the missing code in the given program to display the expected output, i.e. identify animals that have names with four letters.
Output: Array((4,lion))
Program
val a = sc.parallelize(List("dog","tiger","lion","cat","spider","eagle"),2)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant","falcon","squid"),2)
val d = c.keyBy(_.length)
I have tried to write the code in the spark shell but got stuck on the syntax for combining the four RDDs and applying a filter.
How about using the PairRDD lookup method:
b.lookup(4).toArray
// res1: Array[String] = Array(lion)
d.lookup(4).toArray
// res2: Array[String] = Array()
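If you need the result in the exact (length, name) pair form from the expected output, a possible sketch (assuming the b and d RDDs defined above) is to union the keyed RDDs and filter on the key:
// combine both keyed RDDs and keep only the animals whose name has 4 letters
b.union(d).filter { case (len, _) => len == 4 }.collect()
// Array[(Int, String)] = Array((4,lion))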

How to sort a text file containing integers in Spark - Scala?

I am new to Spark programming. I have a data file named "test1.in" which contains random numbers in the following way:
123
34
1
45
65
I want to sort these numbers using Spark and write the output to a new file. Here is my code so far:
import org.apache.spark.{SparkContext, SparkConf}
val conf = new SparkConf().setMaster("local[*]").setAppName("SortingApp")
val sc = new SparkContext(conf)
val data = sc.textFile("src/main/resources/test1.in")
val d1 = data.map(_.sorted)
d1.foreach(println _)
The result is not what is expected.
When you call:
data.map(_.sorted)
You map each record (which is a String) into its "sorted" version, which means the String is converted into a sequence of characters and those characters are sorted.
What you need to do is NOT use map, which applies your function to each record separately (hence it can't sort across records), but use RDD.sortBy:
data.map(_.toInt).sortBy(t => t)
The t => t is the identity function returning the input as-is, which can be replaced with Scala's built-in generic implementation:
data.map(_.toInt).sortBy(identity)
Or, the shortest version:
data.sortBy(_.toInt)
(which would return a result of type RDD[String])
Use the line below to convert the text file data into Int values and then sort them:
val d1 = data.map(_.toInt).sortBy(identity)
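Since the question also asks to write the sorted numbers to a new file, a minimal end-to-end sketch (the output path is made up for illustration):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("SortingApp")
val sc = new SparkContext(conf)

val data = sc.textFile("src/main/resources/test1.in")
val sorted = data.map(_.toInt).sortBy(identity)          // ascending numeric sort
sorted.saveAsTextFile("src/main/resources/test1_sorted") // hypothetical output directory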

How to remove lines with length less than 3 words?

I have a corpus of millions of documents and I want to remove lines whose length is less than 3 words (in Scala and Spark).
How can I do this?
It all depends on how you define words, but assuming a very simple approach:
def naiveTokenizer(text: String): Array[String] = text.split("""\s+""")
def naiveWordCount(text: String): Int = naiveTokenizer(text).length
val lines = sc.textFile("README.md")
lines.filter(naiveWordCount(_) >= 3)
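To actually materialize the cleaned corpus rather than just define the filter, a hedged follow-up (the output path is an assumption):
val cleaned = lines.filter(naiveWordCount(_) >= 3)
cleaned.saveAsTextFile("README_cleaned")  // hypothetical output directory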
You could use the count function (this assumes a single ' ' between words; a line with at least 3 words contains at least 2 spaces):
scala.io.Source.fromFile("myfile.txt").getLines().filter(_.count(' '==) >= 2)
zero33's answer is valid if you want to keep the lines intact. However, if you want the lines tokenized also, then flatMap is more efficient.
val lines = sc.textFile("README.md")
lines.flatMap(line => {
  val tokenized = line.split("""\s+""")
  if (tokenized.length >= 3) tokenized
  else Seq()
})

Spark processing columns in parallel

I've been playing with Spark, and I managed to get it to crunch my data. My data consists of a flat delimited text file with 50 columns and about 20 million rows. I have Scala scripts that process each column.
In terms of parallel processing, I know that RDD operations run on multiple nodes. So every time I process a column, that column is processed in parallel, but the columns themselves are processed sequentially.
A simple example: my data is a 5-column tab-delimited text file, each column contains text, and I want to do a word count for each column. I would do:
for (i <- 0 until 5) {
  data.map(_.split("\t", -1)(i)).map((_, 1)).reduceByKey(_ + _)
}
Although each column's operation is run in parallel, the columns themselves are processed sequentially (bad wording, I know. Sorry!). In other words, column 2 is processed after column 1 is done, column 3 is processed after columns 1 and 2 are done, and so on.
My question is: is there any way to process multiple columns at a time? If you know a way, or a tutorial, would you mind sharing it with me?
Thank you!!
Suppose each input row is a Seq. The following can be done to process the columns concurrently. The basic idea is to use the pair (column index, value) as the key.
scala> val rdd = sc.parallelize((1 to 4).map(x=>Seq("x_0", "x_1", "x_2", "x_3")))
rdd: org.apache.spark.rdd.RDD[Seq[String]] = ParallelCollectionRDD[26] at parallelize at <console>:12
scala> val rdd1 = rdd.flatMap{x=>{(0 to x.size - 1).map(idx=>(idx, x(idx)))}}
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = FlatMappedRDD[27] at flatMap at <console>:14
scala> val rdd2 = rdd1.map(x=>(x, 1))
rdd2: org.apache.spark.rdd.RDD[((Int, String), Int)] = MappedRDD[28] at map at <console>:16
scala> val rdd3 = rdd2.reduceByKey(_+_)
rdd3: org.apache.spark.rdd.RDD[((Int, String), Int)] = ShuffledRDD[29] at reduceByKey at <console>:18
scala> rdd3.take(4)
res22: Array[((Int, String), Int)] = Array(((0,x_0),4), ((3,x_3),4), ((2,x_2),4), ((1,x_1),4))
In the example output, ((0,x_0),4) means: the column index is 0 (the first column), the key is x_0, and the count is 4. You can start from here to process further.
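For example, to pull out the word counts of a single column from rdd3 (a sketch reusing the RDDs from the transcript above; column 0 is just an illustration):
// keep only the entries for column 0, then drop the column index from the key
val column0Counts = rdd3
  .filter { case ((colIdx, _), _) => colIdx == 0 }
  .map { case ((_, word), count) => (word, count) }
column0Counts.collect()   // Array((x_0,4)) for the sample data above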
You can try the following code, which uses the Scala parallel collections feature:
(0 until 5).map(index => (index, data)).par.map(x => {
  x._2.map(_.split("\t", -1)(x._1)).map((_, 1)).reduceByKey(_ + _).collect()
})
data is a reference, so duplicating it will not cost much. And an RDD is read-only, so processing it in parallel works. The par method uses Scala's parallel collections feature. You can check the parallel jobs on the Spark web UI.
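Pulling the two ideas together, a minimal self-contained sketch of per-column word counts driven by a parallel collection; the file path and column count are assumptions, and collect() is the action that actually triggers each Spark job:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("PerColumnWordCount")
val sc = new SparkContext(conf)

val data = sc.textFile("data/columns.tsv")   // hypothetical tab-delimited input
val numColumns = 5                           // assumed number of columns

// submit one Spark job per column from the parallel collection,
// so several jobs can run concurrently inside the same application
val countsPerColumn = (0 until numColumns).par.map { i =>
  val counts = data
    .map(_.split("\t", -1)(i))               // extract column i
    .map((_, 1))
    .reduceByKey(_ + _)                      // word count for that column
  i -> counts.collect()
}.seq.toMap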