Union all files from directory and sort based on first column - scala

After running the code below:
def ngram(s: String, inSep: String, outSep: String, n:Int): Set[String] = {
s.toLowerCase.split(inSep).sliding(n).map(_.sorted.mkString(outSep)).toSet
}
val fPath = "/user/root/data/data220K.txt"
val resultPath = "data/result220K"
val lines = sc.textFile(fPath) // lines: RDD[String]
val ngramNo = 2
val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo)).map(word => (word, 1)).reduceByKey((a, b) => a+b)
val sortedResult = result.map(pair => pair.swap).sortByKey(true)
println(sortedResult.count + "============================")
sortedResult.take(10)
sortedResult.saveAsTextFile(resultPath)
I'm getting a large number of files in HDFS with this schema:
(Freq_Occurrencies, FieldA, FieldB)
Is it possible to merge all the files from that directory? Every row is different, but I want to have only one file, sorted by Freq_Occurrencies. Is that possible?
Many thanks!

sortedResult
.coalesce(1, shuffle = true)
.saveAsTextFile(resultPath)
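coalesce makes Spark use a single task for saving, thus creating only one part file. The downside is, of course, performance: all data has to be shuffled to a single executor and written by a single thread. Note that the shuffle redistributes rows, so the sorted order is not guaranteed to survive it; sorting directly into one partition with sortByKey(true, 1) is one alternative.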
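To check what the single merged file should contain, the pipeline from the question can be emulated on local Scala collections. This is only a sketch without Spark: `sortedCounts` is a hypothetical helper name, and `groupBy` stands in for `reduceByKey`.

```scala
// n-gram helper from the question: sorted, lower-cased word windows of size n
def ngram(s: String, inSep: String, outSep: String, n: Int): Set[String] =
  s.toLowerCase.split(inSep).sliding(n).map(_.sorted.mkString(outSep)).toSet

// Hypothetical local stand-in for the flatMap / reduceByKey / swap / sortByKey chain
def sortedCounts(lines: Seq[String], n: Int): Seq[(Int, String)] =
  lines
    .flatMap(line => ngram(line, " ", "+", n))
    .groupBy(identity)
    .toSeq // leave the Map before building (count, gram) pairs, or equal counts would collide as keys
    .map { case (gram, hits) => (hits.size, gram) }
    .sorted // ascending by count, then by n-gram

val sampleCounts = sortedCounts(Seq("a b c", "b a d"), 2)
// sampleCounts: (1,a+d), (1,b+c), (2,a+b)
```

Comparing a small sample like this against the first lines of the saved part file is a quick way to confirm that the output really is ordered by frequency.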

Related

spark 2.x with mapPartitions large number of records parallel processing

I am trying to use Spark mapPartitions with Datasets (Spark 2.x) to copy a large list of files (1 million records) from one location to another in parallel.
However, at times I see that one record gets copied multiple times.
The idea is to split the 1 million files into a number of partitions (here, 24), then perform the copy operation for each partition in parallel, and finally collect the result from each partition to perform further actions.
Can someone please tell me what I am doing wrong?
def process(spark: SparkSession): DataFrame = {
import spark.implicits._
//Get source and target List for 1 million records
val sourceAndTargetList =
List(("source1" -> "target1"), ("source 1 Million" -> "Target 1 Million"))
// convert list to dataframe with number of partitions as 24
val SourceTargetDataSet =
sourceAndTargetList.toDF.repartition(24).as[(String, String)]
var dfBuffer = new ListBuffer[DataFrame]()
dfBuffer += SourceTargetDataSet
.mapPartitions(partition => {
println("partition id: " + TaskContext.getPartitionId)
//for each partition
val result = partition
.map(row => {
val source = row._1
val target = row._2
val copyStatus = copyFiles(source, target) // Function to copy files that returns a boolean
val dataframeRow = (target, copyStatus)
dataframeRow
})
.toList
result.toIterator
})
.toDF()
val dfList = dfBuffer.toList
val newDF = dfList.tail.foldLeft(dfList.head)(
(accDF, newDF) => accDF.join(newDF, Seq("_1"))
)
println("newDF Count " + newDF.count)
newDF
}
Update 2: I changed the function as shown below, and so far it gives me consistent results as expected. What was I doing wrong, and am I getting the required parallelism with the function below? If not, how can it be optimized?
def process(spark: SparkSession): DataFrame = {
import spark.implicits._
//Get source and target List for 1 million records
val sourceAndTargetList =
List(("source1" -> "target1"), ("source 1 Million" -> "Target 1 Million"))
// convert list to dataframe with number of partitions as 24
val SourceTargetDataSet =
sourceAndTargetList.toDF.repartition(24).as[(String, String)]
val iterator = SourceTargetDataSet.toDF
.mapPartitions(
(it: Iterator[Row]) =>
it.toList
.map(row => {
println(row)
val source = row.toString.split(",")(0).drop(1)
val target = row.toString.split(",")(1).dropRight(1)
println("source : " + source)
println("target: " + target)
val copyStatus = copyFiles(source, target) // Function to copy files that returns a boolean
val dataframeRow = (target, copyStatus)
dataframeRow
})
.iterator
)
.toLocalIterator
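val df = iterator.asScala.toList.toDF("targetKey", "copyStatus") // toLocalIterator returns a java.util.Iterator; asScala needs import scala.collection.JavaConverters._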
df
}
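One should avoid performing write operations inside map transformations, because they can be replayed: when an executor dies, the same map has to be run again by another executor. Transformations are also lazy and are recomputed for every action unless the result is cached, so calling count on the result and then running further actions on it can execute the copy more than once.
I'd choose foreach instead.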
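Why a side effect inside a lazy map can run more than once is easy to reproduce without Spark. The sketch below is only an analogy: a Scala collection view stands in for the Dataset, `toList` stands in for a Spark action, and materializing the view once stands in for caching. `countCopies` and its counter are hypothetical names, and no real files are copied.

```scala
// Counts how often the "copy" side effect runs across two traversals
def countCopies(materializeOnce: Boolean): Int = {
  var copies = 0
  val files = List("source1" -> "target1", "source2" -> "target2")

  // Lazy, like a transformation: nothing runs until the view is traversed
  val copied = files.view.map { case (_, target) =>
    copies += 1 // stand-in for copyFiles(source, target)
    (target, true)
  }

  // Materializing once plays the role of cache(); otherwise keep the lazy view
  val results: Iterable[(String, Boolean)] =
    if (materializeOnce) copied.toList else copied

  results.toList // first "action"
  results.toList // second "action": re-runs the map only if results is still lazy
  copies
}

val withoutMaterializing = countCopies(materializeOnce = false) // 4: every file "copied" twice
val withMaterializing = countCopies(materializeOnce = true)     // 2: every file "copied" once
```

In the first version of process, newDF.count and any later action on the returned DataFrame play the role of the two traversals.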

Add column while maintaining correlation of the existing columns in Apache Spark Scala

I have a dataframe with columns review and rating in Spark Scala
val stopWordsList = scala.io.Source.fromFile("stopWords").getLines.toList
val downSampleReviewsDF = sqlContext.sql("SELECT review, rating FROM ds");
I have written a function which will remove stopWords from a given review (String)
def cleanTextFunc(text: String, removeList: List[String]): String = removeList.fold(text) {
case (text, termToRemove) => text.replaceAll("\\b" + termToRemove + "\\b" , "").replaceAll("""[\p{Punct}&&[^.]]""", "").replaceAll(" +", " ")
}
How do I add another column, "new_review", alongside review and rating? new_review should hold the result of cleanTextFunc() for every row. cleanTextFunc takes two arguments: (1) the text to clean and (2) the list of stop words to remove from the text.
The output should have the columns Text | Rating | New_Text.
It takes just a few more lines:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
// Factory method that builds a UDF closing over removeList
def getStopWordRemoverUdf(removeList: List[String]): UserDefinedFunction = {
  udf { (text: String) =>
    cleanTextFunc(text, removeList)
  }
}
// Create the UDF by passing in the stop-word list from the question
val stopWordRemoverUdf: UserDefinedFunction = getStopWordRemoverUdf(stopWordsList)
// Use the UDF to create the new column
val cleanedDownSampleReviewsDf: DataFrame = downSampleReviewsDF.withColumn("new_review", stopWordRemoverUdf(downSampleReviewsDF("review")))
References
Passing extra parameters to UDF in Spark
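Before wiring the cleaning function into a UDF, it can be checked on plain strings. The sketch below is a variant, not the asker's exact function: `cleanText` is a hypothetical name, each stop word is wrapped in `Pattern.quote` so that regex metacharacters in the list are treated literally, and the punctuation and whitespace cleanup runs once after all stop words are removed instead of once per stop word.

```scala
import java.util.regex.Pattern

// Hypothetical variant of cleanTextFunc: remove whole-word stop words,
// strip punctuation except '.', and collapse repeated spaces
def cleanText(text: String, removeList: List[String]): String =
  removeList
    .foldLeft(text) { (acc, term) =>
      acc.replaceAll("\\b" + Pattern.quote(term) + "\\b", "")
    }
    .replaceAll("""[\p{Punct}&&[^.]]""", "")
    .replaceAll(" +", " ")
    .trim

val cleanedSample = cleanText("This is a good, solid movie.", List("is", "a"))
// cleanedSample: This good solid movie.
```

The word boundaries keep the "is" inside "This" intact, which is the behaviour the UDF relies on.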

I want to save Map[String, String] to disk and later read it back as same type. Somehow I cannot find collectAsMap method with my sparkContext

I am working on Spark Scala and there is a requirement to save Map[String, String] to the disk so that a different Spark application can read it.
(x,1),(y,2)...
To Save:
sc.parallelize(itemMap.toSeq).coalesce(1).saveAsTextFile(fileName)
I am doing a coalesce as the data is only 450 rows.
But when reading it back, I am not able to convert it to Map[String, String]:
val myMap = sc.textFile(fileName).zipWithUniqueId().collect.toMap
The data comes back as
((x,1),0),((y,2),1)...
What is a possible solution?
Thanks.
Loading a text file results in an RDD[String], so you have to deserialize the string representations of the tuples.
You can either parse the "(k,v)" strings as they were written, as shown next, or change the save operation to write a delimiter between the key and the value.
val d = spark.sparkContext.textFile(fileName)
val myMap = d.map(s => {
val parsedVals = s.substring(1, s.length-1).split(",")
(parsedVals(0), parsedVals(1))
}).collect.toMap
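The parsing step above can be tried on plain strings before running it on the RDD. This is a minimal sketch with a hypothetical helper name, `parseTuple`, and it assumes that keys contain no comma; splitting with a limit of 2 at least keeps commas inside values intact.

```scala
// Turn the text form "(k,v)" written by saveAsTextFile back into a pair
def parseTuple(s: String): (String, String) = {
  // Drop the surrounding parentheses, then split on the first comma only
  val Array(k, v) = s.substring(1, s.length - 1).split(",", 2)
  (k, v)
}

val savedLines = Seq("(x,1)", "(y,2)")
val restoredLocal: Map[String, String] = savedLines.map(parseTuple).toMap
// restoredLocal: Map(x -> 1, y -> 2)
```

The same function can be passed to map on the RDD[String] in place of the inline lambda.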
Alternatively, you can change your save operation to write a delimiter (like a comma) between key and value and parse the structure that way:
sc.parallelize(itemMap.toSeq.map(kv => kv._1 + "," + kv._2)).coalesce(1).saveAsTextFile(fileName)
val myMap = spark.sparkContext.textFile(fileName)
.map(_.split(","))
.map(d => (d(0), d(1)))
.collect.toMap
The collectAsMap method is defined in the PairRDDFunctions class, which means it is available only on pair RDDs, that is, RDD[(K, V)].
If you want to use collectAsMap, the code below is one way to do it. A DataFrame is used to store the data in CSV format and avoid hand-written parsing:
val originalMap = Map("x" -> 1, "y" -> 2)
// write
sparkContext.parallelize(originalMap.toSeq).coalesce(1).toDF("k", "v").write.csv(path)
// read
val restoredDF = spark.read.csv(path)
val restoredMap = restoredDF.rdd.map(r => (r.getString(0), r.getString(1))).collectAsMap()
println("restored map: " + restoredMap)
Output:
restored map: Map(y -> 2, x -> 1)

how to join two datasets by key in scala spark

I have two datasets, and each record in them has two elements.
Below are examples.
Data1: (name, animal)
('abc,def', 'monkey(1)')
('df,gh', 'zebra')
...
Data2: (name, fruit)
('a,efg', 'apple')
('abc,def', 'banana(1)')
...
Results expected: (name, animal, fruit)
('abc,def', 'monkey(1)', 'banana(1)')
...
I want to join these two datasets on the first column, 'name'. I have tried to do this for a couple of hours, but I couldn't figure it out. Can anyone help me?
val sparkConf = new SparkConf().setAppName("abc").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val text1 = sc.textFile(args(0))
val text2 = sc.textFile(args(1))
val joined = text1.join(text2)
The above code does not work!
join is defined on RDDs of pairs, that is, RDDs of type RDD[(K,V)].
The first step needed is to transform the input data into the right type.
We first need to transform the original data of type String into pairs of (Key, Value):
val parse:String => (String, String) = s => {
val regex = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r
s match {
case regex(k,v) => (k,v)
case _ => ("","")
}
}
(Note that we can't use a simple split(",") expression because the key contains commas)
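The regular expression can be exercised on the sample lines without a Spark context. The sketch below is a variant under stated assumptions: `parsePair` is a hypothetical name, it returns an Option instead of the ("", "") fallback so that malformed lines are dropped rather than joined on an empty key, and a for-comprehension over local Seqs stands in for the RDD inner join.

```scala
// Same pattern as in the answer, compiled once
val pairPattern = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r

// Hypothetical variant of parse: None for lines that do not match
def parsePair(s: String): Option[(String, String)] = s match {
  case pairPattern(k, v) => Some((k, v))
  case _                 => None
}

val animals = Seq("('abc,def', 'monkey(1)')", "('df,gh', 'zebra')").flatMap(parsePair)
val fruits  = Seq("('a,efg', 'apple')", "('abc,def', 'banana(1)')").flatMap(parsePair)

// Local stand-in for an inner join on the key
val joinedLocal = for {
  (k1, animal) <- animals
  (k2, fruit)  <- fruits
  if k1 == k2
} yield (k1, (animal, fruit))
// joinedLocal: List((abc,def,(monkey(1),banana(1))))
```

The result has the same shape, (key, (left, right)), that join produces on pair RDDs.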
Then we use that function to parse the text input data:
val s1 = Seq("('abc,def', 'monkey(1)')","('df,gh', 'zebra')")
val s2 = Seq("('a,efg', 'apple')","('abc,def', 'banana(1)')")
val rdd1 = sparkContext.parallelize(s1)
val rdd2 = sparkContext.parallelize(s2)
val kvRdd1 = rdd1.map(parse)
val kvRdd2 = rdd2.map(parse)
Finally, we use the join method to join the two RDDs
val joined = kvRdd1.join(kvRdd2)
// Let's check out results
joined.collect
// res31: Array[(String, (String, String))] = Array((abc,def,(monkey(1),banana(1))))
You first have to create pair RDDs from your data sets and then apply the join transformation. Your sample data does not look well-formed, so please consider the simplified example below.
Dataset1:
a 1
b 2
c 3
Dataset2:
a 8
b 4
Your code should look like this in Scala:
val pairRDD1 = sc.textFile("/path_to_yourfile/first.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val pairRDD2 = sc.textFile("/path_to_yourfile/second.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val joinRDD = pairRDD1.join(pairRDD2)
joinRDD.collect
Here is the result from the Scala shell:
res10: Array[(String, (String, String))] = Array((a,(1,8)), (b,(2,4)))

How to group by a select number of fields in an RDD looking for duplicates based on those fields

I am new to Scala and Spark. I am working in the Spark Shell.
I need to group and sort by the first three fields of this file, looking for duplicates. If I find duplicates within a group, I need to append a counter to the third field, starting at "1" and incrementing by "1" for each record in the duplicate group, and resetting to "1" at the start of each new group. When no duplicates are found, the counter "1" is still appended.
CSV File contains the following:
("00111","00111651","4444","PY","MA")
("00111","00111651","4444","XX","MA")
("00112","00112P11","5555","TA","MA")
val csv = sc.textFile("file.csv")
val recs = csv.map(line => line.split(","))
If I apply the logic properly on the above example, the resulting RDD of recs would look like this:
("00111","00111651","44441","PY","MA")
("00111","00111651","44442","XX","MA")
("00112","00112P11","55551","TA","MA")
How about grouping the data, changing it and putting it back:
val csv = sc.parallelize(List(
"00111,00111651,4444,PY,MA",
"00111,00111651,4444,XX,MA",
"00112,00112P11,5555,TA,MA"
))
val recs = csv.map(_.split(","))
val grouped = recs.groupBy(line=>(line(0),line(1), line(2)))
val numbered = grouped.mapValues(dataList=>
dataList.zipWithIndex.map{case(data, idx) => data match {
case Array(fst,scd,thd,rest @ _*) => Array(fst,scd,thd+(idx+1)) ++ rest
}
})
numbered.flatMap{case (key, values)=>values}
Another take on the same idea (group the data, change it, put it back), here on a plain Scala List of tuples:
val lists= List(("00111","00111651","4444","PY","MA"),
("00111","00111651","4444","XX","MA"),
("00112","00112P11","5555","TA","MA"))
val grouped = lists.groupBy{case(a,b,c,d,e) => (a,b,c)}
val indexed = grouped.mapValues(
_.zipWithIndex
.map {case ((a,b,c,d,e), idx) => (a,b,c + (idx+1).toString,d,e)})
val unwrapped = indexed.flatMap(_._2)
//List((00112,00112P11,55551,TA,MA),
// (00111,00111651,44442,XX,MA),
// (00111,00111651,44441,PY,MA))
Version working on Arrays (of arbitrary length >= 3):
val lists= List(Array("00111","00111651","4444","PY","MA"),
Array("00111","00111651","4444","XX","MA"),
Array("00112","00112P11","5555","TA","MA"))
// Arrays compare by reference, so convert the key to a List to group by value
val grouped = lists.groupBy{_.take(3).toList}
val indexed = grouped.mapValues(
_.zipWithIndex
.map {case (Array(a,b,c, rest @ _*), idx) => Array(a,b,c+ (idx+1).toString) ++ rest})
val unwrapped = indexed.flatMap(_._2)
// List(Array(00112, 00112P11, 55551, TA, MA),
// Array(00111, 00111651, 44442, XX, MA),
// Array(00111, 00111651, 44441, PY, MA))
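The question also asks for the result to be sorted by the first three fields, which the grouped versions above do not guarantee because Map iteration order is unspecified. A minimal sketch on local collections, with the hypothetical helper name `numberDuplicates`, groups on a tuple key (tuples compare by value and have an implicit Ordering) and sorts the groups before numbering:

```scala
// Append a per-group counter to the third field, with groups sorted by key
def numberDuplicates(rows: Seq[Array[String]]): Seq[Array[String]] =
  rows
    .groupBy(r => (r(0), r(1), r(2))) // tuple keys compare by value, unlike Array keys
    .toSeq
    .sortBy { case (key, _) => key }  // order the groups by the first three fields
    .flatMap { case (_, group) =>
      // groupBy keeps the original row order inside each group
      group.zipWithIndex.map { case (row, idx) =>
        row.updated(2, row(2) + (idx + 1)) // counter starts at 1 in every group
      }
    }

val sampleRows = Seq(
  Array("00112", "00112P11", "5555", "TA", "MA"),
  Array("00111", "00111651", "4444", "PY", "MA"),
  Array("00111", "00111651", "4444", "XX", "MA")
)
val numberedRows = numberDuplicates(sampleRows).map(_.mkString(","))
// numberedRows: 00111,00111651,44441,PY,MA / 00111,00111651,44442,XX,MA / 00112,00112P11,55551,TA,MA
```

On an RDD the same shape would need a groupBy on the tuple key followed by a sort on that key; the local version only illustrates the grouping and numbering logic.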