Is there a better way of converting Iterator[Char] to Seq[String]? - scala

Following is the code I have used to convert Iterator[Char] to Seq[String].
val result = IOUtils.toByteArray(new FileInputStream (new File(fileDir)))
val remove_comp = result.grouped(11).map{arr => arr.update(2, 32);arr}.flatMap{arr => arr.update(3, 32); arr}
val convert_iter = remove_comp.map(_.toChar.toString).toSeq.mkString.split("\n")
val rdd_input = Spark.sparkSession.sparkContext.parallelize(convert_iter)
The file at fileDir contains:
12**34567890
12##34567890
12!!34567890
12¬¬34567890
12
'34567890
I am not happy with this code because the data size is big, and converting everything to a String would run out of heap space.
val convert_iter = remove_comp.map(_.toChar)
convert_iter: Iterator[Char] = non-empty iterator
Is there a better way of coding?

Completely disregarding corner cases (empty Strings etc.), I would start with something like:
val test = Iterable('s','f','\n','s','d','\n','s','v','y')
val (allButOne, last) = test.foldLeft((Seq.empty[String], Seq.empty[Char])) {
  case ((strings, chars), char) =>
    if (char == '\n')
      (strings :+ chars.mkString, Seq.empty)
    else
      (strings, chars :+ char)
}
val result = allButOne :+ last.mkString
I am sure it could be made more elegant and handle corner cases better (once you define how you want them handled), but I think it is a nice starting point.
To be honest, though, I am not entirely sure what you want to achieve; I just guessed that you want to group the chars separated by \n together and turn them into Strings.
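If heap usage is the main worry, here is a minimal sketch (my addition, assuming the remove_comp byte iterator from the question) that groups an Iterator[Char] into lines lazily, so only the current line is materialised at a time:
// A minimal sketch, not the asker's code: lazily group an Iterator[Char] into lines,
// so only the line currently being built is held in memory.
def toLines(chars: Iterator[Char]): Iterator[String] = new Iterator[String] {
  def hasNext: Boolean = chars.hasNext
  def next(): String = {
    val sb = new StringBuilder
    var done = false
    while (!done && chars.hasNext) {
      val c = chars.next()
      if (c == '\n') done = true else sb.append(c)
    }
    sb.toString
  }
}
// Usage with the names from the question:
// val convert_iter = remove_comp.map(_.toChar)
// val rdd_input = Spark.sparkSession.sparkContext.parallelize(toLines(convert_iter).toSeq)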

Looking at your code, I see that you are trying to remove special characters such as ** and ## from a file that contains the following data:
12**34567890
12##34567890
12!!34567890
12¬¬34567890
12
'34567890
For that you can just read the data using sparkContext's textFile and use Regex.replaceAllIn:
import scala.util.matching.Regex

val pattern = new Regex("[¬~!##$^%&*\\(\\)_+={}\\[\\]|;:\"'<,>.?` /\\-]")
val result = sc.textFile(fileDir).map(line => pattern.replaceAllIn(line, ""))
and you should have your result as an RDD[String] (which you can also iterate over):
1234567890
1234567890
1234567890
1234567890
12
34567890
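As a quick local sanity check (no Spark needed), the same pattern can be applied directly to sample lines from the question:
pattern.replaceAllIn("12**34567890", "")  // "1234567890"
pattern.replaceAllIn("12¬¬34567890", "")  // "1234567890"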
Updated
If there are \n and \r inside the records (at the 3rd and 4th positions) and the result is all fixed-length 10-digit text, then you can use the wholeTextFiles API of sparkContext with the following regex:
val pattern = new Regex("[¬~!##$^%&*\\(\\)_+={}\\[\\]|;:\"'<,>.?` /\\-\r\n]")
val result = sc.wholeTextFiles(fileDir).flatMap(line => pattern.replaceAllIn(line._2, "").grouped(10))
You should get the output as
1234567890
1234567890
1234567890
1234567890
1234567890
I hope the answer is helpful

Related

Scala: Convert a string to string array with and without split given that all special characters except "(" and ")" are allowed

I have a string
val a = "((x1,x2),(y1,y2),(z1,z2))"
I want to parse this into a Scala array
val arr = Array(("x1","x2"),("y1","y2"),("z1","z2"))
Is there a way of directly doing this with an expr() equivalent?
If not, how would one do this using split?
Note: x1, x2, x3, etc. are strings and can contain special characters, so the key is to use the () delimiters to parse the data.
Code I munged from Dici and Bogdan Vakulenko
val x2 = a.getString(1).trim.split("[\\()]").grouped(2).map(x => x(0).trim).toArray
val x3 = x2.drop(1) // first grouping is always null, don't know why
var jmap = new java.util.HashMap[String, String]()
for (i <- x3) {
  val index = i.lastIndexOf(",")
  val fv = i.slice(0, index)
  val lv = i.substring(index + 1).trim
  jmap.put(fv, lv)
}
This is still susceptible to "," appearing in the second string.
Actually, I think regexes are the most convenient way to solve this.
val a = "((x1,x2),(y1,y2),(z1,z2))"
val regex = "(\\((\\w+),(\\w+)\\))".r
println(
  regex.findAllMatchIn(a)
    .map(matcher => (matcher.group(2), matcher.group(3)))
    .toList
)
Note that I made some assumptions about the format:
no whitespaces in the string (the regex could easily be updated to fix this if needed)
always tuples of two elements, never more
empty string not valid as a tuple element
only alphanumeric characters allowed (this also would be easy to fix; see the sketch below)
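If the whitespace or alphanumeric-only assumptions need to be relaxed, a looser variant could be (my sketch, not part of the original answer):
// A looser pattern (assumption on my part): any characters except commas and
// parentheses inside the elements, with optional whitespace around them.
val looseRegex = "\\(\\s*([^,()]+?)\\s*,\\s*([^,()]+?)\\s*\\)".r
looseRegex.findAllMatchIn("((x 1, x#2),(y1,y2))")
  .map(m => (m.group(1), m.group(2)))
  .toList
// List((x 1,x#2), (y1,y2))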
val a = "((x1,x2),(y1,y2),(z1,z2))"
a.replaceAll("[\\(\\) ]","")
  .split(",")
  .grouped(2)
  .map(x => (x(0), x(1)))
  .toArray

Replace multiple occurrences of a duplicate string in Scala with empty

I have a string as
something,'' something,nothing_something,op nothing_something,'' cat,cat
I want to achieve my output as
'' something,op nothing_something,cat
Is there any way to achieve it?
If I understand your requirement correctly, here's one approach with the following steps:
Split the input string by "," and create a list of indexed-CSVs and convert it to a Map
Generate 2-combinations of the indexed-CSVs
Check each of the indexed-CSV pairs and capture the index of any CSV which is contained within the other CSV
Since the CSVs corresponding to the captured indexes are contained within some other CSV, removing these indexes will result in remaining indexes we would like to keep
Use the remaining indexes to look up CSVs from the CSV Map and concatenate them back to a string
Here is sample code applied to a string with slightly more general comma-separated values:
val str = "cats,a cat,cat,there is a cat,my cat,cats,cat"
val csvIdxList = (Stream from 1).zip(str.split(",")).toList
val csvMap = csvIdxList.toMap
val csvPairs = csvIdxList.combinations(2).toList
val csvContainedIdx = csvPairs.collect {
  case List(x, y) if x._2.contains(y._2) => y._1
  case List(x, y) if y._2.contains(x._2) => x._1
}.distinct
// csvContainedIdx: List[Int] = List(3, 6, 7, 2)
val csvToKeepIdx = (1 to csvIdxList.size) diff csvContainedIdx
// csvToKeepIdx: scala.collection.immutable.IndexedSeq[Int] = Vector(1, 4, 5)
val strDeduped = csvToKeepIdx.map( csvMap.getOrElse(_, "") ).mkString(",")
// strDeduped: String = cats,there is a cat,my cat
Applying the above to your sample string something,'' something,nothing_something,op nothing_something would yield the expected result:
strDeduped: String = '' something,op nothing_something
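The same containment idea can also be written as a single pass over the values, without building the index map (a compact sketch of mine, not the answer's code): keep a value unless another value strictly contains it, or an equal value came earlier.
// Compact variant (my sketch): drop a value if another value strictly contains it,
// or if an identical value appeared at an earlier index.
val values = str.split(",").toList
val deduped = values.zipWithIndex.collect {
  case (v, i) if !values.zipWithIndex.exists { case (w, j) =>
    j != i && w.contains(v) && (w != v || j < i)
  } => v
}.mkString(",")
// deduped: String = cats,there is a cat,my cat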
First create an Array of words by splitting the given String on commas, then apply filter and mkString as below:
s.split(",").filter(_.contains(' ')).mkString(",")
In Scala REPL:
scala> val s = "something,'' something,nothing_something,op nothing_something"
s: String = something,'' something,nothing_something,op nothing_something
scala> s.split(",").filter(_.contains(' ')).mkString(",")
res27: String = '' something,op nothing_something
As per Leo C's comment, I tested it as below with another String:
scala> val s = "something,'' something anything anything anything anything,nothing_something,op op op nothing_something"
s: String = something,'' something anything anything anything anything,nothing_something,op op op nothing_something
scala> s.split(",").filter(_.contains(' ')).mkString(",")
res43: String = '' something anything anything anything anything,op op op nothing_something

Splitting string in dataset Apache Spark

I am absolutely new to Spark.
I have a txt dataset with categorical attributes, looking like this:
10000,5,0,1,0,0,5,3,2,2,1,0,1,0,4,3,0,2,0,0,1,0,0,0,0,10,0,1,0,1,0,1,4,2,2,3,0,2,0,2,1,4,3,0,0,0,3,1,0,3,22,0,3,0,1,0,1,0,0,0,5,0,2,1,1,0,11,1,0
10001,6,1,1,0,0,7,5,2,2,0,0,3,0,1,1,0,1,0,0,0,0,1,0,0,4,0,2,0,0,0,1,4,1,2,2,0,2,0,2,2,4,2,1,0,0,1,1,0,2,10,0,1,0,1,0,1,0,0,0,1,0,2,1,1,0,5,1,0
10002,3,1,2,0,0,7,4,2,2,0,0,1,0,4,4,0,1,0,1,0,0,0,0,0,1,0,2,0,4,0,10,4,1,2,4,0,2,0,2,1,4,2,2,0,0,0,1,0,2,10,0,6,0,1,0,1,0,0,0,2,0,2,1,1,0,10,1,0
10003,4,1,2,0,0,1,3,2,2,0,0,3,0,3,3,0,1,0,0,0,0,0,0,1,4,0,2,0,2,0,1,4,1,2,2,0,2,0,2,1,2,2,0,0,0,1,1,0,2,10,0,4,0,1,0,1,0,0,0,1,0,1,1,1,0,10,1,0
10004,7,1,1,0,0,0,0,2,2,0,0,3,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2,2,0,0,0,4,1,2,0,0,2,0,2,1,4,0,1,0,0,0,6,0,2,22,0,1,0,1,0,1,0,0,3,0,0,0,2,2,0,5,6,0
10005,1,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,4,0,0,0,1,0,0,0,0,0,2,0,4,0,2,0,121,0,0,1,0,10,1,0,0,2,0,1,0,0,0,0,0,0,0,0,0,4,0,0
10006,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,4,0,0,0,1,0,0,0,0,0,2,1,0,0,2,0,121,0,0,1,0,10,1,0,0,2,0,0,0,0,0,0,0,0,0,0,0,4,0,0
10007,4,1,2,0,0,6,0,2,2,0,0,4,0,5,5,0,2,1,0,0,0,0,0,0,9,0,2,0,0,0,11,4,1,2,3,0,2,0,2,1,2,3,1,0,0,0,1,0,3,10,0,1,0,1,0,1,0,0,0,0,0,2,1,1,0,11,1,0
10008,6,1,1,0,0,1,0,2,2,0,0,7,0,1,0,0,1,0,0,0,0,0,0,0,7,0,2,2,0,0,0,4,1,2,6,0,2,0,2,1,2,2,1,0,0,0,6,0,2,10,0,1,0,1,0,1,0,0,3,0,0,1,1,2,0,10,1,0
10009,3,1,12,0,0,1,0,2,2,0,0,0,0,3,0,0,1,0,0,0,0,0,0,0,4,0,2,2,4,0,0,2,1,2,6,0,2,0,2,1,0,2,2,0,0,0,3,0,2,10,0,6,1,1,1,0,0,0,1,0,0,1,1,2,0,8,1,1
10010,5,11,1,0,0,1,3,2,2,0,0,0,0,3,3,0,3,0,0,0,0,0,0,0,6,0,2,0,0,0,1,4,1,2,1,0,2,0,2,1,0,4,0,0,0,1,1,0,4,21,0,1,0,1,0,0,0,0,0,4,0,2,1,1,0,11,1,0
10011,4,0,1,0,0,1,5,2,2,0,0,3,0,1,1,0,1,0,0,0,0,0,0,0,7,0,2,0,0,0,1,4,1,2,1,0,2,0,2,1,3,2,1,0,0,1,1,0,2,10,0,1,0,1,0,1,0,0,0,2,0,2,1,1,0,10,1,0
10012,1,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,2,0,0,0,2,0,112,0,0,1,0,10,1,0,0,1,0,0,0,0,0,0,0,0,0,2,0,1,0,0
10013,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,4,0,0,0,1,0,0,0,0,0,2,1,4,0,2,0,121,0,0,1,0,10,1,0,0,2,0,1,0,0,0,0,0,0,0,0,0,4,0,0
10014,3,11,1,0,0,6,4,2,2,0,0,1,0,2,2,0,0,1,0,0,0,0,0,0,3,0,2,0,3,0,1,4,2,2,5,0,2,0,1,2,4,2,10,0,0,1,1,0,2,10,0,5,0,1,0,1,0,0,0,3,0,1,1,1,0,7,1,0
10015,4,3,1,0,0,1,3,2,2,1,0,0,0,3,5,0,3,0,0,1,0,0,0,0,4,0,1,0,0,1,1,2,2,2,2,0,2,0,2,0,0,4,0,0,0,1,1,0,4,10,0,1,3,1,1,0,0,0,0,3,0,2,1,1,0,11,1,1
10016,4,11,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,2,2,4,0,0,4,1,1,0,0,1,0,0,2,0,0,12,0,0,0,6,0,2,23,0,6,0,1,0,0,0,0,3,0,0,0,2,0,0,5,7,0
10017,7,1,1,0,0,0,0,2,2,0,0,3,0,0,0,0,0,0,0,1,1,0,1,0,0,0,2,2,0,0,0,4,1,2,0,0,2,0,2,1,4,0,1,0,0,0,6,0,2,10,0,1,0,1,0,1,0,0,3,0,0,0,2,2,0,6,6,0
The task is to get the number of strings where the numeral in the 57th position, e.g. in
10001,6,1,1,0,0,7,5,2,2,0,0,3,0,1,1,0,1,0,0,0,0,1,0,0,4,0,2,0,0,0,1,4,1,2,2,0,2,0,2,2,4,2,1,0,0,1,1,0,2,10,0,1,0,1,0,((1)),0,0,0,1,0,2,1,1,0,5,1,0
is 1. The problem is that the strings are elements of the RDD, so I need to split each string and make an array to get at the position I need.
I tried to use
val censusText = sc.textFile("USCensus1990.data.txt")
val splitRDD = censusText.map(line=>line.split(","))
but it didn't help, and I have no idea how to proceed from there.
Can you please help me?
You can try:
censusText.filter(l => l.split(",")(56) == "1").count
// res5: Long = 12
Or you can first split the RDD then do the filter / count:
val splitRDD = censusText.map(l => l.split(","))
splitRDD.filter(r => r(56) == "1").count
// res7: Long = 12
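A slightly more defensive variant (my addition, not part of the answer) skips malformed rows that have fewer than 57 comma-separated fields:
censusText
  .map(_.split(","))
  .filter(fields => fields.length > 56 && fields(56) == "1")  // guard against short rows
  .count
// same count as above when every row is well-formed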

Replace every word in a document based on a defined pattern

I have a file with thousands of documents, and I want to read every document and replace each word in it with a (Key, Value) pattern; at first all values are zero (0).
docs file:
Problems installing software weak tech support.
great idea executed perfect.
A perfect solution for anyone who listens to the radio.
...
and I have a score_file containing many words, e.g.:
idea 1
software 1
weak 1
who 1
perfect 1
price -1
...
The output should follow this pattern:
(Problems,0) (installing,1) (software,1) (develop,2) (weak,1) (tech,1) (support,0).
(great,1) (idea,1) (executed,2) (perfect,1).
(perfect,1) (solution,1) (for,0) (anyone,1) (who,1) (listens,1) (to,0) (the,0) (radio,0).
If a word of a document occurs in this score_file, then the values of (left word, this word, right word) in the document are each increased by 1 or -1 according to that word's score.
I've tried:
val Words_file = sc.textFile("score_file.txt")
val Words_List = sc.broadcast(Words_file.map({ (line) =>
  val Array(a, b) = line.split(" ").map(_.trim)
  (a, b.toInt)
}).collect().toMap)
val Docs_file = sc.textFile("docs_file.txt")
val tokens = Docs_file.map(line => (line.split(" ").map(word => Words_List.value.getOrElse(word, 0)).reduce(_ + _), line.split(" ").filter(Words_List.value.contains(_) == false).mkString(" ")))
val out_Docs = tokens.map(s => if (s._2.length > 3) {s._1 + "," + s._2})
But it scores every document, not its words. How can I generate my desired output?
It is kinda hard to read your code; you seem to use a weird mix of CamelCase and underscores, with vals sometimes starting with uppercase and sometimes not.
I'm not sure I completely got the question, but to get an output where each word in a given line is replaced by itself and the number coming from the other file, this might work:
val sc = new SparkContext()

val wordsFile = sc.textFile("words_file.txt")
val words = sc.broadcast(wordsFile.map(line => {
  val Array(a, b) = line.split(" ").map(_.trim)
  (a, b.toInt)
}).collect().toMap)

val docs = sc.textFile("docs_file.txt")
val tokens = docs.map(line => {
  line.split(" ")
    .map(token => s"(${token}, ${words.value.getOrElse(token, 0)})")
    .mkString(" ")
})
tokens ends up being just an RDD[String], like the input, preserving lines (documents). I reformatted the code a bit to make it more readable.
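If the question really asks for the neighbour scoring described above (each word's value being the sum of the scores of the word itself and of its immediate left and right neighbours), a sketch built on the same words broadcast and docs RDD might look like this (my interpretation, not a confirmed requirement):
// My interpretation of the expected output: a word's value is the sum of the scores
// of the word itself and of its left and right neighbours.
val neighbourScored = docs.map { line =>
  val tokens = line.split(" ")
  val scores = tokens.map(t => words.value.getOrElse(t, 0))
  tokens.indices.map { i =>
    val left  = if (i > 0) scores(i - 1) else 0
    val right = if (i < scores.length - 1) scores(i + 1) else 0
    s"(${tokens(i)},${left + scores(i) + right})"
  }.mkString(" ")
}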

How to remove lines with length less than 3 words?

I have a corpus of millions of documents, and I want to remove lines whose length is less than 3 words (in Scala and Spark).
How can I do this?
It all depends on how you define words, but assuming a very simple approach:
def naiveTokenizer(text: String): Array[String] = text.split("""\s+""")
def naiveWordCount(text: String): Int = naiveTokenizer(text).length
val lines = sc.textFile("README.md")
lines.filter(naiveWordCount(_) >= 3)
You could use the count function (this assumes single spaces ' ' separate the words, so a line with at least 3 words contains at least 2 spaces):
scala.io.Source.fromFile("myfile.txt").getLines().filter(_.count(_ == ' ') >= 2)
zero33's answer is valid if you want to keep the lines intact. However, if you want the lines tokenized also, then flatMap is more efficient.
val lines = sc.textFile("README.md")
lines.flatMap(line => {
  val tokenized = line.split("""\s+""")
  if (tokenized.length >= 3) tokenized
  else Seq()
})
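If you want the tokens grouped per line rather than flattened into individual words, a small variant of the flatMap answer (my sketch) keeps one Array[String] per surviving line:
// Keep an Array of tokens per line instead of flattening (assumes per-line tokens are wanted).
val tokenizedLines = lines
  .map(_.split("""\s+"""))
  .filter(_.length >= 3)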