How to do data cleansing in Scala - scala

I just started Scala on Spark, so I am not sure if my question is workable or should I turn to other solution/tool:
I have a text file for word counting and sorting, here is the file.
I load the file into HDFS
I then use the following code in Scala to do the counting:
val file = sc.textFile("hdfs://localhost:9000/Peter")
val counts = file.flatMap(line => line.split(" ")).map(p => (p,1)).reduceByKey(_+_).sortByKey(true,1)
counts.saveAsTextFile("Peter_SortedOutput6")
I checked the result on hdfs by hdfs dfs -cat hdfs://localhost:9000/user/root/Peter_SortedOutput5/part-00000
Part of the result is posted here for the convenience of reading:
((For,1)
((not,1)
(1,8)
(10,8)
(11,8)
(12,8)
(13,8)
(14,8)
(15,7)
(16,7)
(17,7)
(18,7)
(19,6)
(2,8)
(20,5)
(21,5)
(22,4)
(23,2)
(24,2)
(25,2)
(3,8)
(4,8)
(5,8)
(6,8)
(7,8)
(8,8)
(9,8)
(Abraham,,1)
(According,1)
(Amen.,4)
(And,19)
(As,5)
(Asia,,1)
(Babylon,,1)
(Balaam,1)
(Be,2)
(Because,1)
First, this is really not what I expect, I want the result showing in the desc order of count.
Second, there are result like the following:
(God,25)
(God's,1)
(God,,9)
(God,),1)
(God.,6)
(God:,2)
(God;,2)
(God?,1)
How to do some cleansing in the split so these occurrences can be grouped into one (God, 47)
Thank you very much.

There is a course BerkeleyX: CS105x Introduction to Apache Spark on edx.org by Berkerly&Databricks. One of the assignment is doing word count.
The steps are
remove punctuation, by replace "[^A-Za-z0-9\s]+" with "", or not include numbers "[^A-Za-z\s]+"
trim all spaces
lower all words
we can add extra step like
remove stop words
Code as follows
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.functions.split
// val reg = raw"[^A-Za-z0-9\s]+" // with numbers
val reg = raw"[^A-Za-z\s]+" // no numbers
val lines = sc.textFile("peter.txt").
map(_.replaceAll(reg, "").trim.toLowerCase).toDF("line")
val words = lines.select(split($"line", " ").alias("words"))
val remover = new StopWordsRemover()
.setInputCol("words")
.setOutputCol("filtered")
val noStopWords = remover.transform(words)
val counts = noStopWords.select(explode($"filtered")).map(word =>(word, 1))
.reduceByKey(_+_)
// from word -> num to num -> word
val mostCommon = counts.map(p => (p._2, p._1)).sortByKey(false, 1)
mostCommon.take(5)

Clean data by use of replaceAll:
val counts = file.flatMap(line => line.trim.toLowerCase.split(" ").replaceAll("[$,?+.;:\'s\\W\\d]", ""));
sort by value in scala API:
.map(item => item.swap) // interchanges position of entries in each tuple
.sortByKey(true, 1) // 1st arg configures ascending sort, 2nd arg configures one task
.map(item => item.swap)
sort by value in python API:
.map(lambda (a, b): (b, a)) \
.sortByKey(1, 1) \ # 1st arg configures ascending sort, 2nd configures 1 task
.map(lambda (a, b): (b, a))
Code should look like this (you may see syntax error, please fix if any):
val file = sc.textFile("hdfs://localhost:9000/Peter")
val counts = file.flatMap(line => line.trim.toLowerCase.split(" ").replaceAll("[$,?+.;:\'s\\W\\d]", ""))
.map(p => (p,1))
.reduceByKey(_+_)
.map(rec => rec.swap)
.sortByKey(true, 1)
.map(rec => rec.swap)
counts.saveAsTextFile("Peter_SortedOutput6")
see scala_regular_expressions - for what [\\W] or [\\d] or [;:',.?] mean.

Related

How to Sorting results after run code in Spark

I created some code lines of scala to count number of words in a text file (in Spark). The result such like this:
(further,,1)
(Hai,,2)
(excluded,1)
(V.,5)
I wonder that can I sort the result as follow:
(V.,5)
(Hai,,2)
(excluded,1)
(further,,1)
The code as showed bellow, thank you for your help!
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()
wordCounts.saveAsTextFile("./WordCountTest")
If you want to sort your first dataset by the second field, you can use the following code:
val wordCounts = Seq(
("V.",5),
("Hai",2),
("excluded",1),
("further",1)
)
val wcOrdered = wordCounts.sortBy(_._2).reverse
which yields the following result
wcOrdered: Seq[(String, Int)] = List((V.,5), (Hai,2), (further,1), (excluded,1))
You can just call wordCounts.sortBy(_._2, false). Method sortBy from RDD takes boolean as the second argument, which tells if the result should be sorted ascending (true - default) or descending (false).
textFile
.flatMap(_.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
.sortBy(_._2, false)

Reading one file as Map(K,V) and pass V as keys while reading the second file as Map

I have two files. One is a text file and another one is CSV. I want to read the text file as Map(keys, values) and pass these values from the first file as key in Map when I read the second file (CSV file).
I am able to read the first file and get Map(key, value). From this Map, I have extracted the values and passed these values as keys in the second file but didn't get the desired result.
first file - text file
sdp:field(0)
meterNumber:field(1)
date:field(2)
time:field(3)
value:field(4),field(5),field(6),field(7),field(8),field(9),
field(10),field(11),field(12),field(13),field(14),
field(15),field(16),field(17)
second file - csv file
SDP,METERNO,READINGDATE,TIME,Reset Count.,Kilowatt-Hour Last Reset .,Kilowatt-Hour Rate A Last Reset.,Kilowatt-Hour Rate B Last Reset.,Kilowatt-Hour Rate C Last Reset.,Max Kilowatt Rate A Last Reset.,Max Kilowatt Rate B Last Reset.,Max Kilowatt Rate C Last Reset.,Accumulate Kilowatt Rate A Current.,Accumulate Kilowatt Rate B Current.,Accumulate Kilowatt Rate C Current.,Total Kilovar-Hour Last Reset.,Max Kilovar Last Reset.,Accumulate Kilovar Last Reset.
9000000001,500001,02-09-2018,00:00:00,2,48.958,8.319333333,24.31933333,16.31933333,6,24,15,10,9,6,48.958,41,40
this is what I have done to read the first file.
val lines = scala.io.Source.fromFile("D:\\JSON_READER\\dailymapping.txt", "UTF8")
.getLines
.map(line=>line.split(":"))
.map(fields => (fields(0),fields(1))).toMap;
val sdp = lines.get("sdp").get;
val meterNumber = lines.get("meterNumber").get;
val date = lines.get("date").get;
val time = lines.get("time").get;
val values = lines.get("value").get;
now I can see sdp has field(0), meterNumber has field(1), date has field(2), time has field(3) and values has field(4) .. to field(17).
Second file which I m reading using below code
val keyValuePairs = scala.io.Source.fromFile("D:\\JSON_READER\\Daily.csv")
.getLines.drop(1).map(_.stripLineEnd.split(",", -1))
.map{field => ((field(0),field(1),field(2),field(3)) -> (field(4),field(5)))}.toList
val map = Map(keyValuePairs : _*)
System.out.println(map);
above code giving me the following output which is desired output.
Map((9000000001,500001,02-09-2018,00:00:00) -> (2,48.958))
But I want to replace field(0), field(1), field(2), field(3) with sdp, meterNumber, date, time in the above code. So, I don't have to mention keys when I read the second file, keys will come from the first file.
I tried to replace but I got below output which is not desired output.
Map((field(0),field(1),field(2),field(3)) -> (,))
Can somebody please guide me on how can I achieve the desired output.
This might get you close to what you're after. The first Map is used to lookup the correct index into the CSV data.
val fieldRE = raw"field\((\d+)\)".r
val idx = io.Source
.fromFile(<txt_file>, "UTF8")
.getLines
.map(_.split(":"))
.flatMap(fields => fieldRE.replaceAllIn(fields(1), _.group(1))
.split(",")
.map(fields(0) -> _.toInt))
.toMap
val resMap = io.Source
.fromFile(<csv_file>)
.getLines
.drop(1)
.map(_.stripLineEnd.split(",", -1))
.map{ fld =>
(fld(idx("sdp")),fld(idx("meterNumber")),fld(idx("date")),fld(idx("time"))) ->
(fld(4),fld(5)) }
.toMap
//resMap: Map((9000000001,500001,02-09-2018,00:00:00) -> (2,48.958))
UPDATE
Changing the Map of (String identifiers -> Int index values) into a Map of (String identifiers -> collection of Int index values) can be done. I'm not sure what that buys you, but it's doable.
val fieldRE = raw"field\((\d+)\)".r
val idx = io.Source
.fromFile(<txt_file>, "UTF8")
.getLines
.map(_.split(":"))
.flatMap(fields => fieldRE.replaceAllIn(fields(1), _.group(1))
.split(",")
.map(fields(0) -> _.toInt))
.foldLeft(Map[String,Seq[Int]]()){ case (m,(k,v)) =>
m + (k -> (m.getOrElse(k,Seq()) :+ v))
}
val resMap = io.Source
.fromFile(<csv_file>)
.getLines
.drop(1)
.map(_.stripLineEnd.split(",", -1))
.map{fld => (fld(idx("sdp").head)
,fld(idx("meterNumber").head)
,fld(idx("date").head)
,fld(idx("time").head)) -> (fld(4),fld(5))}
.toMap

Spark - calculate max ocurrence per day-event

I have the following RDD[String]:
1:AAAAABAAAAABAAAAABAAABBB
2:BBAAAAAAAAAABBAAAAAAAAAA
3:BBBBBBBBAAAABBAAAAAAAAAA
The first number is supposed to be days and the following characters are events.
I have to calculate the day where each event has the maximum occurrence.
The expected result for this dataset should be:
{ "A" -> Day2 , "B" -> Day3 }
(A has repeated 10 times in day2 and b 10 times in day3)
I am splitting the original dataset
val foo = rdd.map(_.split(":")).map(x => (x(0), x(1).split("")) )
What could be the best implementation for count and aggregation?
Any help is appreciated.
This should do the trick:
import org.apache.spark.sql.functions._
val rdd = sqlContext.sparkContext.makeRDD(Seq(
"1:AAAAABAAAAABAAAAABAAABBB",
"2:BBAAAAAAAAAABBAAAAAAAAAA",
"3:BBBBBBBBAAAABBAAAAAAAAAA"
))
val keys = Seq("A", "B")
val seqOfMaps: RDD[(String, Map[String, Int])] = rdd.map{str =>
val split = str.split(":")
(s"Day${split.head}", split(1).groupBy(a => a.toString).mapValues(_.length))
}
keys.map{key => {
key -> seqOfMaps.mapValues(_.get(key).get).sortBy(a => -a._2).first._1
}}.toMap
The processing that need to be done consist in transforming the data into a rdd that is easy to apply on functions like: find the maximum for a list
I will try to explain step by step
I've used dummy data for "A" and "B" chars.
The foo rdd is the first step it will give you RDD[(String, Array[String])]
Let's extract each char for the Array[String]
val res3 = foo.map{case (d,s)=> (d, s.toList.groupBy(c => c).map{case (x, xs) => (x, xs.size)}.toList)}
(1,List((A,18), (B,6)))
(2,List((A,20), (B,4)))
(3,List((A,14), (B,10)))
Next we will flatMap over values to expand our rdd by char
res3.flatMapValues(list => list)
(3,(A,14))
(3,(B,10))
(1,(A,18))
(2,(A,20))
(2,(B,4))
(1,(B,6))
Rearrange the rdd in order to look better
res5.map{case (d, (s, c)) => (s, c, d)}
(A,20,2)
(B,4,2)
(A,18,1)
(B,6,1)
(A,14,3)
(B,10,3)
Now we are groupy by char
res7.groupBy(_._1)
(A,CompactBuffer((A,18,1), (A,20,2), (A,14,3)))
(B,CompactBuffer((B,6,1), (B,4,2), (B,10,3)))
Finally we are taking the maxium count for each row
res9.map{case (s, list) => (s, list.maxBy(_._2))}
(B,(B,10,3))
(A,(A,20,2))
Hope this help
Previous answers are good, but I prefer such solution:
val data = Seq(
"1:AAAAABAAAAABAAAAABAAABBB",
"2:BBAAAAAAAAAABBAAAAAAAAAA",
"3:BBBBBBBBAAAABBAAAAAAAAAA"
)
val initialRDD = sparkContext.parallelize(data)
// to tuples like (1,'A',18)
val charCountRDD = initialRDD.flatMap(s => {
val parts = s.split(":")
val charCount = parts(1).groupBy(i => i).mapValues(_.length)
charCount.map(i => (parts(0), i._1, i._2))
})
// group by character, and take max value from grouped collection
val result = charCountRDD.groupBy(i => i._2).map(k => k._2.maxBy(z => z._3))
result.foreach(println(_))
Result is:
(3,B,10)
(2,A,20)

Spark Split RDD into chunks and concatenate

I have a relatively simple problem.
I have an large Spark RDD[String] (containing JSON). In my use case I want to group (concatenate) N strings together into a new RDD[String], so that it will have the size of oldRDD.size/N.
pseudo example:
val oldRDD : RDD[String] = ['{"id": 1}', '{"id": 2}', '{"id": 3}', '{"id": 4}']
val newRDD : RDD[String] = someTransformation(oldRDD, ",", 2)
newRDD = ['{"id": 1},{"id": 2}','{"id": 3},{"id": 4}']
val anotherRDD : RDD[String] = someTransformation(oldRDD, ",", 3)
anotherRDD = ['{"id": 1},{"id": 2},{"id": 3}','{"id": 4}']
I already looked for a similar case, but couldnt find anything.
Thanks!
Here you have to use zipWithIndex function and then calculate group.
For example, index = 3 and n (number of groups) = 2 gives you 2nd group. 3 / 2 = 1 (integer divide), so 0-based 2nd group
val n = 3;
val newRDD1 = oldRDD.zipWithIndex() // creates tuples (element, index)
// map to tuple (group, content)
.map(x => (x._2 / n, x._1))
// merge
.reduceByKey(_ + ", " + _)
// remove key
.map(x => x._2)
One note: order of "zipWithIndex" is internal order. It can make no sense in business logic, you must check if order is ok in your case. If not, sort RDD and then use zipWithIndex

word count(frequency) spark rdd scala

if I have an rdd accross cluster and I want to do the word count
not only count the appear times,
I want to get the frequency, which is defined as count/total count
What is the best and efficient way to do so in scala?
How can I do reduction job and calculate total number at the same time within one workflow?
BTW I know purely word count can be done in this way.
text_file = spark.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
but what is the difference if I use aggregate? in terms of spark job workflow
val result = pairs
.aggregate(Map[String, Int]())((acc, pair) =>
if(acc.contains(pair._1))
acc ++ Map[String, Int]((pair._1, acc(pair._1)+1))
else
acc ++ Map[String, Int]((pair._1, pair._2))
,
(a, b) =>
(a.toSeq ++ b.toSeq)
.groupBy(_._1)
.mapValues(_.map(_._2).reduce(_ + _))
)
You can use this
val total = counts.map(x => x._2).sum()
val freq = counts.map(x => (x._1, x._2/total))
There exists also the concept of Accumulator which is a write-only variable and you could use it to avoid using the sum() action, but your code would need a lot of change.