How to remove lines with fewer than 3 words? - scala
I have a corpus of millions of documents
and I want to remove lines whose length is less than 3 words (in Scala and Spark).
How can I do this?
It all depends on how you define words, but assuming a very simple approach:
def naiveTokenizer(text: String): Array[String] = text.split("""\s+""")
def naiveWordCount(text: String): Int = naiveTokenizer(text).length
val lines = sc.textFile("README.md")
lines.filter(naiveWordCount(_) >= 3)
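For example, a minimal usage sketch (the output path below is just a placeholder) that keeps the surviving lines and writes them back out as text:

// keep only lines with at least 3 whitespace-separated tokens
val filtered = lines.filter(naiveWordCount(_) >= 3)
// "filtered-output" is a hypothetical directory used for illustration
filtered.saveAsTextFile("filtered-output")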
You could use the count function (this assumes words are separated by single ' ' characters):

// a line with at least 3 words contains at least 2 separating spaces
scala.io.Source.fromFile("myfile.txt").getLines().filter(_.count(_ == ' ') >= 2)
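Since the question also mentions Spark, the same space-count test can be applied to an RDD of lines (a small sketch; the file name is only an example):

// keep lines containing at least 2 spaces, i.e. at least 3 space-separated words
sc.textFile("corpus.txt").filter(_.count(_ == ' ') >= 2)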
zero33's answer is valid if you want to keep the lines intact. However, if you also want the lines tokenized, then flatMap is more efficient.
val lines = sc.textFile("README.md")
lines.flatMap { line =>
  val tokenized = line.split("""\s+""")
  // emit the tokens of lines with at least 3 words, drop shorter lines entirely
  if (tokenized.length >= 3) tokenized
  else Seq.empty[String]
}
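Note that this yields an RDD of individual tokens rather than whole lines. If you need both, a small sketch (the names here are illustrative, not from the answer above) keeps each surviving line paired with its tokens:

// pair each line with its tokens, then drop lines with fewer than 3 tokens
val linesWithTokens = lines
  .map(line => (line, line.split("""\s+""")))
  .filter { case (_, tokens) => tokens.length >= 3 }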
Related
Scala: Convert a string to string array with and without split given that all special characters except "(" and ")" are allowed
I have an array:

val a = "((x1,x2),(y1,y2),(z1,z2))"

I want to parse this into a Scala array:

val arr = Array(("x1","x2"),("y1","y2"),("z1","z2"))

Is there a way of doing this directly with an expr() equivalent? If not, how would one do this using split?
Note: x1, x2, x3 etc. are strings and can contain special characters, so the key is to use the "(" and ")" delimiters to parse the data.
Code I munged from Dici and Bogdan Vakulenko:

val x2 = a.getString(1).trim.split("[\\()]").grouped(2).map(x=>x(0).trim).toArray
val x3 = x2.drop(1) // first grouping is always null dont know why
var jmap = new java.util.HashMap[String, String]()
for (i <- x3) {
  val index = i.lastIndexOf(",")
  val fv = i.slice(0, index)
  val lv = i.substring(index + 1).trim
  jmap.put(fv, lv)
}

This is still susceptible to "," in the second string.
Actually, I think regexes are the most convenient way to solve this.

val a = "((x1,x2),(y1,y2),(z1,z2))"
val regex = "(\\((\\w+),(\\w+)\\))".r

println(
  regex.findAllMatchIn(a)
    .map(matcher => (matcher.group(2), matcher.group(3)))
    .toList
)

Note that I made some assumptions about the format:
- no whitespace in the string (the regex could easily be updated to fix this if needed)
- always tuples of two elements, never more
- an empty string is not valid as a tuple element
- only alphanumeric characters allowed (this also would be easy to fix)
val a = "((x1,x2),(y1,y2),(z1,z2))"

a.replaceAll("[\\(\\) ]", "")
  .split(",")
  .sliding(2, 2)   // step of 2 so consecutive pairs do not overlap
  .map(x => (x(0), x(1)))
  .toArray
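For reference, a quick usage check of the regex-based approach above (a sketch; the expected value is shown as a comment, assuming the same input string):

val a = "((x1,x2),(y1,y2),(z1,z2))"
val regex = "(\\((\\w+),(\\w+)\\))".r
val arr = regex.findAllMatchIn(a).map(m => (m.group(2), m.group(3))).toArray
// arr: Array[(String, String)] = Array((x1,x2), (y1,y2), (z1,z2))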
Is there a better way of converting Iterator[Char] to Seq[String]?
Following is the code that I have used to convert an Iterator[Char] to a Seq[String]:

val result = IOUtils.toByteArray(new FileInputStream(new File(fileDir)))
val remove_comp = result.grouped(11).map{ arr => arr.update(2, 32); arr }.flatMap{ arr => arr.update(3, 32); arr }
val convert_iter = remove_comp.map(_.toChar.toString).toSeq.mkString.split("\n")
val rdd_input = Spark.sparkSession.sparkContext.parallelize(convert_iter)

The file at fileDir contains:

12**34567890
12##34567890
12!!34567890
12¬¬34567890
12 '34567890

I am not happy with this code, as the data size is big and converting everything to a single string would blow up the heap space.

val convert_iter = remove_comp.map(_.toChar)
convert_iter: Iterator[Char] = non-empty iterator

Is there a better way of coding this?
By completely disregarding corner cases about empty Strings etc., I would start with something like:

val test = Iterable('s','f','\n','s','d','\n','s','v','y')

val (allButOne, last) = test.foldLeft((Seq.empty[String], Seq.empty[Char])) {
  case ((strings, chars), char) =>
    if (char == '\n') (strings :+ chars.mkString, Seq.empty)
    else (strings, chars :+ char)
}

val result = allButOne :+ last.mkString

I am sure it could be made more elegant, and handle corner cases better (once you define how you want them handled), but I think it is a nice starting point. To be honest, though, I am not entirely sure what you want to achieve; I just guessed that you want to group chars divided by \n together and turn them into Strings.
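For reference, on the sample input above this produces roughly the following (a quick sketch; the exact Seq implementation may differ):

// result: Seq[String] = List(sf, sd, svy)
result.foreach(println)   // prints sf, sd and svy on separate lines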
Looking at your code, I see that you are trying to replace special characters such as **, ## and so on from a file that contains the following data:

12**34567890
12##34567890
12!!34567890
12¬¬34567890
12 '34567890

For that you can just read the data using sparkContext textFile and use regex replaceAllIn:

import scala.util.matching.Regex

val pattern = new Regex("[¬~!##$^%&*\\(\\)_+={}\\[\\]|;:\"'<,>.?` /\\-]")
val result = sc.textFile(fileDir).map(line => pattern.replaceAllIn(line, ""))

and you should have your result as an RDD[String], which is also an iterator:

1234567890
1234567890
1234567890
1234567890
12 34567890

Updated

If there are \n and \r in between the texts at the 3rd and 4th place, and if the result is all fixed-length 10-digit text, then you can use the wholeTextFiles api of sparkContext and use the following regex:

val pattern = new Regex("[¬~!##$^%&*\\(\\)_+={}\\[\\]|;:\"'<,>.?` /\\-\r\n]")
val result = sc.wholeTextFiles(fileDir).flatMap(line => pattern.replaceAllIn(line._2, "").grouped(10))

You should get the output as:

1234567890
1234567890
1234567890
1234567890
1234567890

I hope the answer is helpful.
Efficiently counting occurrences of each character in a file - scala
I am new to Scala. I want the fastest way to get a map of the number of occurrences of each character in a text file. How can I do that? (I used groupBy, but I believe it is too slow.)
I think that groupBy() is probably pretty efficient, but it simply collects the elements, which means that counting them requires a 2nd traversal. To count all Chars in a single traversal you'd probably need something like this:

val tally = Array.ofDim[Long](127)
io.Source.fromFile("someFile.txt").foreach(tally(_) += 1)

Array was used for its fast indexing. The index is the character that was counted.

tally('e')  //res0: Long = 74
tally('x')  //res1: Long = 1
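If you specifically want the result as a Map (as the question asks), one way to finish, assuming the tally array above and plain ASCII input, is to keep only the non-zero entries (a sketch, not part of the original answer):

// pair each index (the character code) with its count and drop characters that never occurred
val countMap: Map[Char, Long] =
  tally.zipWithIndex.collect { case (n, i) if n > 0 => (i.toChar, n) }.toMap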
You can do the following. Read the file first:

import scala.io.Source

val lines = Source.fromFile("/Users/Al/.bash_profile").getLines.toSeq

You can then write a method that takes the list of lines read and counts the occurrences of a given character:

def getCharCount(c: Char, lines: Seq[String]) = {
  lines.foldLeft(0) { (acc, elem) =>
    elem.toSeq.count(_ == c) + acc
  }
}
Counting all the characters in the file using Spark/scala?
How can I count all the characters in a file using Spark/Scala? Here is what I am doing in the spark shell:

scala> val logFile = sc.textFile("ClasspathLength.txt")
scala> val counts = logFile.flatMap(line => line.split("").map(char => (char, 1))).reduceByKey(_ + _)
scala> println(counts.count())
62

I am getting an incorrect count here. Could someone help me fix this?
What you're doing here is: counting the number of times each unique character appears in the input:

val counts = logFile.flatMap(line => line.split("").map(char => (char, 1))).reduceByKey(_ + _)

and then counting the number of records in this result (using counts.count()), which ignores the actual values you calculated in the previous step.

If you're interested in the total number of characters in the file, there's no need for grouping at all. You can map each line to its length and then use the implicit conversion into DoubleRDDFunctions to call sum():

logFile.map(_.length).sum()

Alternatively you can flatMap into a separate record per character and then use count:

logFile.flatMap(_.toList).count
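If you want to keep the per-character counts and still get the total, one option (a sketch, assuming the counts RDD from the question) is to sum its values instead of counting its records:

// total characters = sum of all per-character counts
counts.map(_._2.toLong).reduce(_ + _)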
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[4]")
  .appName("Nos of Word Count")
  .getOrCreate()

val sparkConfig = spark.sparkContext
sparkConfig.setLogLevel("ERROR")

val rdd1 = sparkConfig.textFile("data/mini.txt")
println(rdd1.count())

val rdd2 = rdd1.flatMap(f => f.split(" ")) //.map(x => x.toInt)
println(rdd2.count())

val rdd3 = rdd2.map(w => w.count(p => true)).map(w => w.toInt)
println(rdd3.sum().round)
All you need here is flatMap + count:

logFile.flatMap(line => line).count
Splitting string in dataset Apache Spark
I am absolutely new to Spark. I have a txt dataset with categorical attributes, looking like this:

10000,5,0,1,0,0,5,3,2,2,1,0,1,0,4,3,0,2,0,0,1,0,0,0,0,10,0,1,0,1,0,1,4,2,2,3,0,2,0,2,1,4,3,0,0,0,3,1,0,3,22,0,3,0,1,0,1,0,0,0,5,0,2,1,1,0,11,1,0
10001,6,1,1,0,0,7,5,2,2,0,0,3,0,1,1,0,1,0,0,0,0,1,0,0,4,0,2,0,0,0,1,4,1,2,2,0,2,0,2,2,4,2,1,0,0,1,1,0,2,10,0,1,0,1,0,1,0,0,0,1,0,2,1,1,0,5,1,0
10002,3,1,2,0,0,7,4,2,2,0,0,1,0,4,4,0,1,0,1,0,0,0,0,0,1,0,2,0,4,0,10,4,1,2,4,0,2,0,2,1,4,2,2,0,0,0,1,0,2,10,0,6,0,1,0,1,0,0,0,2,0,2,1,1,0,10,1,0
10003,4,1,2,0,0,1,3,2,2,0,0,3,0,3,3,0,1,0,0,0,0,0,0,1,4,0,2,0,2,0,1,4,1,2,2,0,2,0,2,1,2,2,0,0,0,1,1,0,2,10,0,4,0,1,0,1,0,0,0,1,0,1,1,1,0,10,1,0
10004,7,1,1,0,0,0,0,2,2,0,0,3,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2,2,0,0,0,4,1,2,0,0,2,0,2,1,4,0,1,0,0,0,6,0,2,22,0,1,0,1,0,1,0,0,3,0,0,0,2,2,0,5,6,0
10005,1,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,4,0,0,0,1,0,0,0,0,0,2,0,4,0,2,0,121,0,0,1,0,10,1,0,0,2,0,1,0,0,0,0,0,0,0,0,0,4,0,0
10006,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,4,0,0,0,1,0,0,0,0,0,2,1,0,0,2,0,121,0,0,1,0,10,1,0,0,2,0,0,0,0,0,0,0,0,0,0,0,4,0,0
10007,4,1,2,0,0,6,0,2,2,0,0,4,0,5,5,0,2,1,0,0,0,0,0,0,9,0,2,0,0,0,11,4,1,2,3,0,2,0,2,1,2,3,1,0,0,0,1,0,3,10,0,1,0,1,0,1,0,0,0,0,0,2,1,1,0,11,1,0
10008,6,1,1,0,0,1,0,2,2,0,0,7,0,1,0,0,1,0,0,0,0,0,0,0,7,0,2,2,0,0,0,4,1,2,6,0,2,0,2,1,2,2,1,0,0,0,6,0,2,10,0,1,0,1,0,1,0,0,3,0,0,1,1,2,0,10,1,0
10009,3,1,12,0,0,1,0,2,2,0,0,0,0,3,0,0,1,0,0,0,0,0,0,0,4,0,2,2,4,0,0,2,1,2,6,0,2,0,2,1,0,2,2,0,0,0,3,0,2,10,0,6,1,1,1,0,0,0,1,0,0,1,1,2,0,8,1,1
10010,5,11,1,0,0,1,3,2,2,0,0,0,0,3,3,0,3,0,0,0,0,0,0,0,6,0,2,0,0,0,1,4,1,2,1,0,2,0,2,1,0,4,0,0,0,1,1,0,4,21,0,1,0,1,0,0,0,0,0,4,0,2,1,1,0,11,1,0
10011,4,0,1,0,0,1,5,2,2,0,0,3,0,1,1,0,1,0,0,0,0,0,0,0,7,0,2,0,0,0,1,4,1,2,1,0,2,0,2,1,3,2,1,0,0,1,1,0,2,10,0,1,0,1,0,1,0,0,0,2,0,2,1,1,0,10,1,0
10012,1,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,2,0,0,0,2,0,112,0,0,1,0,10,1,0,0,1,0,0,0,0,0,0,0,0,0,2,0,1,0,0
10013,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,4,0,0,0,1,0,0,0,0,0,2,1,4,0,2,0,121,0,0,1,0,10,1,0,0,2,0,1,0,0,0,0,0,0,0,0,0,4,0,0
10014,3,11,1,0,0,6,4,2,2,0,0,1,0,2,2,0,0,1,0,0,0,0,0,0,3,0,2,0,3,0,1,4,2,2,5,0,2,0,1,2,4,2,10,0,0,1,1,0,2,10,0,5,0,1,0,1,0,0,0,3,0,1,1,1,0,7,1,0
10015,4,3,1,0,0,1,3,2,2,1,0,0,0,3,5,0,3,0,0,1,0,0,0,0,4,0,1,0,0,1,1,2,2,2,2,0,2,0,2,0,0,4,0,0,0,1,1,0,4,10,0,1,3,1,1,0,0,0,0,3,0,2,1,1,0,11,1,1
10016,4,11,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,2,2,4,0,0,4,1,1,0,0,1,0,0,2,0,0,12,0,0,0,6,0,2,23,0,6,0,1,0,0,0,0,3,0,0,0,2,0,0,5,7,0
10017,7,1,1,0,0,0,0,2,2,0,0,3,0,0,0,0,0,0,0,1,1,0,1,0,0,0,2,2,0,0,0,4,1,2,0,0,2,0,2,1,4,0,1,0,0,0,6,0,2,10,0,1,0,1,0,1,0,0,3,0,0,0,2,2,0,6,6,0

The task is to get the number of strings where the numeral at the 57th position, e.g.

10001,6,1,1,0,0,7,5,2,2,0,0,3,0,1,1,0,1,0,0,0,0,1,0,0,4,0,2,0,0,0,1,4,1,2,2,0,2,0,2,2,4,2,1,0,0,1,1,0,2,10,0,1,0,1,0,((1)),0,0,0,1,0,2,1,1,0,5,1,0

is 1. The problem is that the strings are the elements of the RDD, so I need to split each string and make an array (x, y) to get the position I need. I tried to use

val censusText = sc.textFile("USCensus1990.data.txt")
val splitRDD = censusText.map(line => line.split(","))

but it didn't help, and I have no idea how to do it. Can you please help me?
You can try:

censusText.filter(l => l.split(",")(56) == "1").count
// res5: Long = 12

Or you can first split the RDD and then do the filter / count:

val splitRDD = censusText.map(l => l.split(","))
splitRDD.filter(r => r(56) == "1").count
// res7: Long = 12
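If some rows could have fewer than 57 fields, a slightly more defensive variant (a sketch, not part of the original answer) checks the length first to avoid an ArrayIndexOutOfBoundsException:

censusText
  .map(_.split(","))
  .filter(fields => fields.length > 56 && fields(56) == "1")
  .count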