Remove leading (left) zeroes in Spark Scala
The content of my file is:
0001,02,003
004,0005,06
0007,8,9
I read the file as:
val file1 = spark.read.textFile("file1").map( x => x.toLowerCase())
file1.collect
res7: Array[String] = Array(0001,02,003, 004,0005,06, 0007,8,9)
I want to remove the leading zeroes.
I know ltrim exists, but it only removes leading spaces from strings.
Just cast them to Int and you should be fine:
val file1 = spark.sparkContext.textFile("file1").map( x => x.split(",").map(_.trim.toInt).mkString(","))
file1.collect
//res0: Array[String] = Array(1,2,3, 4,5,6, 7,8,9)
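If some fields might not parse as integers, or you want to keep the values as strings, a regex works as well. A minimal sketch of that alternative (plain Scala on the collected lines; the (?!$) lookahead keeps a lone "0" intact):

val lines = Seq("0001,02,003", "004,0005,06", "0007,8,9")
val trimmed = lines.map(_.split(",").map(_.replaceFirst("^0+(?!$)", "")).mkString(","))
// trimmed: Seq[String] = List(1,2,3, 4,5,6, 7,8,9)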
Related
How do I extract each word from a text file in Scala
I'm pretty new to Scala. I have a text file that has only one line, with five words separated by semicolons (;). I want to extract each word, remove the white spaces, convert all of them to lowercase, and access each word by its index. Below is how I approached it. newListUpper2.txt contains:

(Bed; chairs;spoon; CARPET;curtains )

val file = sc.textFile("myfile.txt")
val lower = file.map(x=>x.toLowerCase)
val result = lower.flatMap(x=>x.trim.split(";"))
result.collect.foreach(println)

Below is a copy of the REPL session when I executed the code:

scala> val file = sc.textFile("newListUpper2.txt")
file: org.apache.spark.rdd.RDD[String] = newListUpper2.txt MapPartitionsRDD[5] at textFile at <console>:24

scala> val lower = file.map(x=>x.toLowerCase)
lower: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[6] at map at <console>:26

scala> val result = lower.flatMap(x=>x.trim.split(";"))
result: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at flatMap at <console>:28

scala> result.collect.foreach(println)
bed
 chairs
spoon
 carpet
curtains

scala> result(0)
<console>:31: error: org.apache.spark.rdd.RDD[String] does not take parameters
       result(0)

The results are not trimmed, and passing an index as a parameter to get the word at that position gives an error. My expected outcome, if I pass the index of each word as a parameter, is:

result(0) = bed
result(1) = chairs
result(2) = spoon
result(3) = carpet
result(4) = curtains

What am I doing wrong?
newListUpper2.txt contains (Bed; chairs;spoon; CARPET;curtains ).

val file = sc.textFile("myfile.txt")
val lower = file.map(x=>x.toLowerCase)
val result = lower.flatMap(x=>x.trim.split(";")) // x = `bed; chairs;spoon; carpet;curtains`, so x.trim does nothing useful here: trim only strips the head and tail of the whole line
result.collect.foreach(println)

Try it with the trim applied to each token after the split instead:

val result = lower.flatMap(x=>x.split(";").map(x=>x.trim))
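A quick REPL-style check of the difference (plain Scala, no Spark needed): trim only strips the ends of the whole line, so the spaces around the inner tokens survive until each token is trimmed after the split.

scala> "bed; chairs;spoon; carpet;curtains ".trim
res0: String = bed; chairs;spoon; carpet;curtains

scala> "bed; chairs;spoon; carpet;curtains ".split(";").map(_.trim).toList
res1: List[String] = List(bed, chairs, spoon, carpet, curtains)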
1) Issue 1:

scala> result(0)
<console>:31: error: org.apache.spark.rdd.RDD[String] does not take parameters

result is an RDD, and an RDD cannot be indexed this way. To inspect the first few elements you can use result.take(10).foreach(println) instead (show is a Dataset method, not an RDD one).

2) Issue 2 - to get result(0) = bed, result(1) = chairs, and so on, read the file into a plain Scala List:

scala> var result = scala.io.Source.fromFile("/path/to/File").getLines().flatMap(x=>x.split(";").map(x=>x.trim)).toList
result: List[String] = List(Bed, chairs, spoon, CARPET, curtains)

scala> result(0)
res21: String = Bed

scala> result(1)
res22: String = chairs
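If you would rather stay in Spark instead of re-reading the file with scala.io.Source, one option is to collect the RDD into a local Array, which does take an index. A sketch, assuming the lower RDD from the question and data small enough to fit on the driver:

val words: Array[String] = lower.flatMap(x => x.split(";").map(_.trim)).collect()
words(0) // bed
words(1) // chairs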
Replace multiple occurrences of duplicate strings in Scala with empty
I have a string:

something,'' something,nothing_something,op nothing_something,'' cat,cat

and I want to achieve this output:

'' something,op nothing_something,cat

Is there any way to achieve it?
If I understand your requirement correctly, here's one approach with the following steps:

1. Split the input string by "," and create a list of indexed CSVs, then convert it to a Map
2. Generate 2-combinations of the indexed CSVs
3. Check each pair of indexed CSVs and capture the index of any CSV which is contained within the other CSV
4. Since the CSVs corresponding to the captured indexes are contained within some other CSV, removing these indexes leaves the indexes we want to keep
5. Use the remaining indexes to look up CSVs from the CSV Map and concatenate them back into a string

Here is sample code, applied to a string with slightly more general comma-separated values:

val str = "cats,a cat,cat,there is a cat,my cat,cats,cat"

val csvIdxList = (Stream from 1).zip(str.split(",")).toList
val csvMap = csvIdxList.toMap
val csvPairs = csvIdxList.combinations(2).toList

val csvContainedIdx = csvPairs.collect{
  case List(x, y) if x._2.contains(y._2) => y._1
  case List(x, y) if y._2.contains(x._2) => x._1
}.distinct
// csvContainedIdx: List[Int] = List(3, 6, 7, 2)

val csvToKeepIdx = (1 to csvIdxList.size) diff csvContainedIdx
// csvToKeepIdx: scala.collection.immutable.IndexedSeq[Int] = Vector(1, 4, 5)

val strDeduped = csvToKeepIdx.map( csvMap.getOrElse(_, "") ).mkString(",")
// strDeduped: String = cats,there is a cat,my cat

Applying the above to your sample string something,'' something,nothing_something,op nothing_something would yield the expected result:

strDeduped: String = '' something,op nothing_something
First create an Array of words separated by commas using the split command on the given String, then do the remaining operations with filter and mkString as below:

s.split(",").filter(_.contains(' ')).mkString(",")

In the Scala REPL:

scala> val s = "something,'' something,nothing_something,op nothing_something"
s: String = something,'' something,nothing_something,op nothing_something

scala> s.split(",").filter(_.contains(' ')).mkString(",")
res27: String = '' something,op nothing_something

As per Leo C's comment, I tested it with another String:

scala> val s = "something,'' something anything anything anything anything,nothing_something,op op op nothing_something"
s: String = something,'' something anything anything anything anything,nothing_something,op op op nothing_something

scala> s.split(",").filter(_.contains(' ')).mkString(",")
res43: String = '' something anything anything anything anything,op op op nothing_something
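For comparison, the containment idea from the first answer can also be written as one filter: keep a value only if it is not a substring of any other value. A sketch, checked against that answer's sample string:

val str = "cats,a cat,cat,there is a cat,my cat,cats,cat"
val values = str.split(",").toVector
val kept = values.distinct.filter(v => !values.exists(o => o != v && o.contains(v)))
kept.mkString(",") // cats,there is a cat,my cat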
Is there a better way of converting Iterator[Char] to Seq[String]?
Following is the code I have used to convert Iterator[Char] to Seq[String]:

import java.io.{File, FileInputStream}
import org.apache.commons.io.IOUtils

val result = IOUtils.toByteArray(new FileInputStream (new File(fileDir)))
val remove_comp = result.grouped(11).map{arr => arr.update(2, 32);arr}.flatMap{arr => arr.update(3, 32); arr}
val convert_iter = remove_comp.map(_.toChar.toString).toSeq.mkString.split("\n")
val rdd_input = Spark.sparkSession.sparkContext.parallelize(convert_iter)

The file at fileDir contains:

12**34567890
12##34567890
12!!34567890
12¬¬34567890
12 '34567890

I am not happy with this code, as the data size is big and converting it to a single string would blow the heap space.

val convert_iter = remove_comp.map(_.toChar)
convert_iter: Iterator[Char] = non-empty iterator

Is there a better way of coding this?
Completely disregarding corner cases about empty Strings etc., I would start with something like:

val test = Iterable('s','f','\n','s','d','\n','s','v','y')

val (allButOne, last) = test.foldLeft( (Seq.empty[String], Seq.empty[Char]) ) {
  case ((strings, chars), char) =>
    if (char == '\n') (strings :+ chars.mkString, Seq.empty)
    else (strings, chars :+ char)
}

val result = allButOne :+ last.mkString

I am sure it could be made more elegant and handle corner cases better (once you define how you want them handled), but I think it is a nice starting point. To be honest, I am not entirely sure what you want to achieve; I guessed that you want to group the chars separated by \n together and turn each group into a String.
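For context, the point of the fold is that it only buffers one line's worth of chars at a time. If the whole input did fit comfortably in memory, the same grouping would collapse to a one-liner (essentially the mkString approach the question is trying to move away from):

val test = Iterable('s','f','\n','s','d','\n','s','v','y')
val result: Seq[String] = test.mkString.split("\n").toList
// result: Seq[String] = List(sf, sd, svy)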
Looking at your code, I see that you are trying to replace the special characters such as **, ## and so on in a file containing the following data:

12**34567890
12##34567890
12!!34567890
12¬¬34567890
12 '34567890

For that you can just read the data using sparkContext.textFile and use regex replaceAllIn:

import scala.util.matching.Regex

val pattern = new Regex("[¬~!##$^%&*\\(\\)_+={}\\[\\]|;:\"'<,>.?` /\\-]")
val result = sc.textFile(fileDir).map(line => pattern.replaceAllIn(line, ""))

You should then have your result as an RDD[String], which is also an iterator:

1234567890
1234567890
1234567890
1234567890
12 34567890

Updated: if there are \n and \r characters between the texts at the 3rd and 4th places, and the result is all fixed-length 10-digit text, then you can use the wholeTextFiles API of sparkContext with the following regex:

val pattern = new Regex("[¬~!##$^%&*\\(\\)_+={}\\[\\]|;:\"'<,>.?` /\\-\r\n]")
val result = sc.wholeTextFiles(fileDir).flatMap(line => pattern.replaceAllIn(line._2, "").grouped(10))

You should get the output as:

1234567890
1234567890
1234567890
1234567890
1234567890

I hope the answer is helpful.
Splitting strings in a dataset in Apache Spark
I am absolutely new to Spark. I have a txt dataset with categorical attributes, looking like this:

10000,5,0,1,0,0,5,3,2,2,1,0,1,0,4,3,0,2,0,0,1,0,0,0,0,10,0,1,0,1,0,1,4,2,2,3,0,2,0,2,1,4,3,0,0,0,3,1,0,3,22,0,3,0,1,0,1,0,0,0,5,0,2,1,1,0,11,1,0
10001,6,1,1,0,0,7,5,2,2,0,0,3,0,1,1,0,1,0,0,0,0,1,0,0,4,0,2,0,0,0,1,4,1,2,2,0,2,0,2,2,4,2,1,0,0,1,1,0,2,10,0,1,0,1,0,1,0,0,0,1,0,2,1,1,0,5,1,0
10002,3,1,2,0,0,7,4,2,2,0,0,1,0,4,4,0,1,0,1,0,0,0,0,0,1,0,2,0,4,0,10,4,1,2,4,0,2,0,2,1,4,2,2,0,0,0,1,0,2,10,0,6,0,1,0,1,0,0,0,2,0,2,1,1,0,10,1,0
10003,4,1,2,0,0,1,3,2,2,0,0,3,0,3,3,0,1,0,0,0,0,0,0,1,4,0,2,0,2,0,1,4,1,2,2,0,2,0,2,1,2,2,0,0,0,1,1,0,2,10,0,4,0,1,0,1,0,0,0,1,0,1,1,1,0,10,1,0
10004,7,1,1,0,0,0,0,2,2,0,0,3,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2,2,0,0,0,4,1,2,0,0,2,0,2,1,4,0,1,0,0,0,6,0,2,22,0,1,0,1,0,1,0,0,3,0,0,0,2,2,0,5,6,0
10005,1,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,4,0,0,0,1,0,0,0,0,0,2,0,4,0,2,0,121,0,0,1,0,10,1,0,0,2,0,1,0,0,0,0,0,0,0,0,0,4,0,0
10006,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,4,0,0,0,1,0,0,0,0,0,2,1,0,0,2,0,121,0,0,1,0,10,1,0,0,2,0,0,0,0,0,0,0,0,0,0,0,4,0,0
10007,4,1,2,0,0,6,0,2,2,0,0,4,0,5,5,0,2,1,0,0,0,0,0,0,9,0,2,0,0,0,11,4,1,2,3,0,2,0,2,1,2,3,1,0,0,0,1,0,3,10,0,1,0,1,0,1,0,0,0,0,0,2,1,1,0,11,1,0
10008,6,1,1,0,0,1,0,2,2,0,0,7,0,1,0,0,1,0,0,0,0,0,0,0,7,0,2,2,0,0,0,4,1,2,6,0,2,0,2,1,2,2,1,0,0,0,6,0,2,10,0,1,0,1,0,1,0,0,3,0,0,1,1,2,0,10,1,0
10009,3,1,12,0,0,1,0,2,2,0,0,0,0,3,0,0,1,0,0,0,0,0,0,0,4,0,2,2,4,0,0,2,1,2,6,0,2,0,2,1,0,2,2,0,0,0,3,0,2,10,0,6,1,1,1,0,0,0,1,0,0,1,1,2,0,8,1,1
10010,5,11,1,0,0,1,3,2,2,0,0,0,0,3,3,0,3,0,0,0,0,0,0,0,6,0,2,0,0,0,1,4,1,2,1,0,2,0,2,1,0,4,0,0,0,1,1,0,4,21,0,1,0,1,0,0,0,0,0,4,0,2,1,1,0,11,1,0
10011,4,0,1,0,0,1,5,2,2,0,0,3,0,1,1,0,1,0,0,0,0,0,0,0,7,0,2,0,0,0,1,4,1,2,1,0,2,0,2,1,3,2,1,0,0,1,1,0,2,10,0,1,0,1,0,1,0,0,0,2,0,2,1,1,0,10,1,0
10012,1,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,2,0,0,0,2,0,112,0,0,1,0,10,1,0,0,1,0,0,0,0,0,0,0,0,0,2,0,1,0,0
10013,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,4,0,0,0,1,0,0,0,0,0,2,1,4,0,2,0,121,0,0,1,0,10,1,0,0,2,0,1,0,0,0,0,0,0,0,0,0,4,0,0
10014,3,11,1,0,0,6,4,2,2,0,0,1,0,2,2,0,0,1,0,0,0,0,0,0,3,0,2,0,3,0,1,4,2,2,5,0,2,0,1,2,4,2,10,0,0,1,1,0,2,10,0,5,0,1,0,1,0,0,0,3,0,1,1,1,0,7,1,0
10015,4,3,1,0,0,1,3,2,2,1,0,0,0,3,5,0,3,0,0,1,0,0,0,0,4,0,1,0,0,1,1,2,2,2,2,0,2,0,2,0,0,4,0,0,0,1,1,0,4,10,0,1,3,1,1,0,0,0,0,3,0,2,1,1,0,11,1,1
10016,4,11,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,2,2,4,0,0,4,1,1,0,0,1,0,0,2,0,0,12,0,0,0,6,0,2,23,0,6,0,1,0,0,0,0,3,0,0,0,2,0,0,5,7,0
10017,7,1,1,0,0,0,0,2,2,0,0,3,0,0,0,0,0,0,0,1,1,0,1,0,0,0,2,2,0,0,0,4,1,2,0,0,2,0,2,1,4,0,1,0,0,0,6,0,2,10,0,1,0,1,0,1,0,0,3,0,0,0,2,2,0,6,6,0

The task is to count the strings whose numeral at the 57th position, marked ((1)) below, is 1:

10001,6,1,1,0,0,7,5,2,2,0,0,3,0,1,1,0,1,0,0,0,0,1,0,0,4,0,2,0,0,0,1,4,1,2,2,0,2,0,2,2,4,2,1,0,0,1,1,0,2,10,0,1,0,1,0,((1)),0,0,0,1,0,2,1,1,0,5,1,0

The problem is that the strings are elements of the RDD, so I need to split each string and make an array to get at the position I need. I tried

val censusText = sc.textFile("USCensus1990.data.txt")
val splitRDD = censusText.map(line=>line.split(","))

but it didn't help, and I have no idea how to go on from there. Can you please help me?
You can try:

censusText.filter(l => l.split(",")(56) == "1").count
// res5: Long = 12

Or you can first split the RDD and then do the filter / count:

val splitRDD = censusText.map(l => l.split(","))
splitRDD.filter(r => r(56) == "1").count
// res7: Long = 12
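On Spark 2.x the same count can also be expressed with the DataFrame API; a sketch, assuming the file is a headerless CSV so Spark auto-names the columns _c0 onward, making the 57th field _c56:

val df = spark.read.csv("USCensus1990.data.txt")
df.filter(df("_c56") === "1").count
// res: Long = 12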
How do I escape the tilde character in Scala?
Given a file that looks like this, using tilde (~) as a delimiter:

CS~84~Jimmys Bistro~Jimmys
...

how can I split it?

val company = dataset.map(k=>k.split("""\~""")).map( k => Company(k(0).trim, k(1).toInt, k(2).trim, k(3).trim))

The above doesn't work.
Hmmm, I don't see where it needs to be escaped:

scala> val str = """CS~84~Jimmys Bistro~Jimmys"""
str: String = CS~84~Jimmys Bistro~Jimmys

scala> str.split('~')
res15: Array[String] = Array(CS, 84, Jimmys Bistro, Jimmys)

And the array elements don't need to be trimmed unless you know that errant spaces can be part of the input.
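As an aside: if the file is loaded through Spark anyway, the csv reader accepts a tilde separator directly, which sidesteps the escaping question entirely. A sketch (the file name is hypothetical; Company is the case class from the question):

val parts = spark.read.option("sep", "~").csv("companies.txt")

Each tilde-delimited field then arrives as its own column, so no manual split is needed.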