How to iterate over Array[String] in Spark Scala?

Here is my sample input:
val list=List("a;bc:de;f","uvw:xy;z","123:456")
I am applying the following operation:
val upper=list.map(x=>x.split(":")).map(x=>x.split(";"))
but it is throwing an error:
error: value split is not a member of Array[String]
Can anyone help with how to use both splits so that I can get the answer?
Thank you in advance.

Using list.map(x => x.split(":")) will give you a List of Arrays:
upper: List[Array[String]] = List(Array(a;bc, de;f), Array(uvw, xy;z), Array(123, 456))
Mapping over that afterwards, each item is an Array[String], which is what you are trying to call split on, hence the error.
You might use flatMap instead, which will first give you List(a;bc, de;f, uvw, xy;z, 123, 456), and then you can map over those items, splitting on ;:
val upper = list.flatMap(_.split(":")).map(_.split(";"))
Output:
upper: List[Array[String]] = List(Array(a, bc), Array(de, f), Array(uvw), Array(xy, z), Array(123), Array(456))
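Since the question mentions Spark, the same two-step split also works on an RDD; here is a minimal sketch, assuming a SparkContext named sc is in scope (as in spark-shell):
val rdd = sc.parallelize(list)
val upperRdd = rdd.flatMap(_.split(":")).map(_.split(";"))
upperRdd.collect()
// Array(Array(a, bc), Array(de, f), Array(uvw), Array(xy, z), Array(123), Array(456))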

You can use split with both delimiters at once (a regex character class) in a single map pass:
val upper = list.map(x => x.split("[:;]"))
//upper: List[Array[String]] = List(Array(a, bc, de, f), Array(uvw, xy, z), Array(123, 456))

Here is the code I tried, and it worked:
val upper=list.map(x=>x.split(":")).map(x=>x.map(x=>x.split(";")))
which gives the output:
upper: List[Array[Array[String]]] = List(Array(Array(a, bc), Array(de, f)), Array(Array(uvw), Array(xy, z)), Array(Array(123), Array(456)))
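If you then want each input string flattened back into a single Array of tokens (the same shape as the regex answer above), a small follow-up sketch:
val flattened = upper.map(_.flatten)
// List(Array(a, bc, de, f), Array(uvw, xy, z), Array(123, 456))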

Transpose unevenly-sized lists

I'm having some trouble transposing lists with uneven numbers of elements. Let's say our data is:
ABC
E
GHI
I want my list to be:
List(List('A','E','G'), List('B',' ','H'), List('C',' ','I'))
I cannot manage to get empty spaces (' ') where I need them. With my code:
val l = List("ABC", "E", "GHI")
println(((l map (_.toArray)) toArray).transpose map (_.toList) toList)
// I get: List(List('A', 'E', 'G'), List('B', 'H'), List('C', 'I'))
A solution may be to get the longest line and add white spaces to the rest of the lines, but it's really not clean. Is there a simple solution to this?
Here is a code-golf solution for an input list l:
(0 until l.map(_.size).max).map(i => l.map(s => Try(s(i)).getOrElse(' ')))
which returns:
Vector(List(A, E, G), List(B, , H), List(C, , I))
This:
Retrieves the maximum length of a string in the input list.
Loops from 0 until that maximum length.
Within each iteration, gets each element's char at the current index.
The Try handles elements that are shorter than the current index; those cases return ' ' instead.
To use Try, you need to import:
import scala.util.Try
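Putting the import and the expression together, a minimal runnable sketch:
import scala.util.Try

val l = List("ABC", "E", "GHI")
val maxLen = l.map(_.size).max
val transposed = (0 until maxLen).map(i => l.map(s => Try(s(i)).getOrElse(' ')))
// Vector(List(A, E, G), List(B, , H), List(C, , I))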
One approach is to use padTo, although this will involve multiple traversals of the list:
val l = List("ABC", "E", "GHI")
val maxSize = l.map(_.size).max // 3
val transposed = l.map(_.toList).map(_.padTo(maxSize, ' ')).transpose
// List[List[Char]] = List(List(A, E, G), List(B, , H), List(C, , I))

Scala - have List of String, must split by comma, and put into Map

I have a List of the form: ("string,num", "string,num", ...)
I have found solutions online for how to do this with a single string, but have not been able to adapt it to a list of strings.
Also, numerical values should be cast to Int/Double before being mapped.
I would appreciate any help.
This is a perfect job for a fold
// Your input
val lines = List("a,1", "b,2", "gretzky,99", "tyler,11")
// Fold over the lines, and insert them into a map
val map = lines.foldLeft(Map.empty[String, Int]) {
  case (map, line) =>
    // Split the line on the comma and separate the two parts
    val Array(string, num) = line.split(",")
    // Add new entry to the map
    map + (string -> num.toInt)
}
println(map)
Output:
Map(a -> 1, b -> 2, gretzky -> 99, tyler -> 11)
Probably not the best way to do that, but it should meet your needs
yourlist.groupBy( _.split(',')(0) ).mapValues(v=>v(0).split(',')(1))
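For completeness, a map plus toMap sketch (not from either answer above) that also does the Int conversion the question asks for:
val map = lines.map { line =>
  val Array(key, num) = line.split(",")
  key -> num.toInt
}.toMap
// Map(a -> 1, b -> 2, gretzky -> 99, tyler -> 11)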

Spark print result of Arrays or saveAsTextFile

I have been bogged down by this for some hours now... I tried collect and mkString and still I am not able to print to the console or save as a text file.
scala> val au1 = sc.parallelize(List(("a",Array(1,2)),("b",Array(1,2))))
scala> val au2 = sc.parallelize(List(("a",Array(3)),("b",Array(2))))
scala> val au3 = au1.union(au2)
The result of the union is:
Array[(String, Array[Int])] = Array((a,Array(1, 2)), (b,Array(1, 2)), (a,Array(3)), (b,Array(2)))
All my print attempts result in the following when I do x(0) and x(1):
Array[Int]) does not take parameters
In my last attempt I performed the following, and it results in an error:
scala> val au4 = au3.map(x => (x._1, x._2._1._1, x._2._1._2))
<console>:33: error: value _1 is not a member of Array[Int]
val au4 = au3.map(x => (x._1, x._2._1._1, x._2._1._2))
._1 and ._2 work on tuples, not on arrays.
("a",Array(1,2)) is a tuple, so ._1 is a and ._2 is Array(1,2).
So if you want to get elements of an array, you need to use (), as in x._2(0).
But the au2 arrays have only one element, so x._2(1) will work for au1 and not for au2. You can use Option or Try (import scala.util.Try) as below:
val au4 = au3.map(x => (x._1, x._2(0), Try(x._2(1)) getOrElse(0)))
The result of au3 is not Array[(String, Array[Int])], it is RDD[(String, Array[Int])],
so this is how you could write the output to a file:
au3.map( r => (r._1, r._2.map(_.toString).mkString(":")))
.saveAsTextFile("data/result")
You need to map over the array and create a string from it so that it can be written to the file as:
(a,1:2)
(b,1:2)
(a,3)
(b,2)
You could write to the file without the brackets as below (Row here is org.apache.spark.sql.Row, so import it first):
import org.apache.spark.sql.Row

au3.map( r => Row(r._1, r._2.map(_.toString).mkString(":")).mkString(","))
  .saveAsTextFile("data/result")
Output:
a,1:2
b,1:2
a,3
b,2
The values are separated by a comma "," and the array elements are separated by ":".
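If you only want to print to the console rather than write a file, a minimal sketch (collect() brings everything to the driver, so only use it for small data):
au3.collect().foreach { case (k, arr) => println(s"$k,${arr.mkString(":")}") }
// a,1:2
// b,1:2
// a,3
// b,2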
Hope this helps!

Storing the contents of a file in an immutable Map in scala

I am trying to implement a simple word count in Scala using an immutable map (this is intentional), and the way I am trying to accomplish it is as follows:
Create an empty immutable map
Create a scanner that reads through the file.
While the scanner.hasNext() is true:
Check if the Map contains the word; if it doesn't, initialize the count to zero.
Create a new entry with the key=word and the value=count+1
Update the map
At the end of the iteration, the map is populated with all the values.
My code is as follows:
val wordMap = Map.empty[String,Int]
val input = new java.util.Scanner(new java.io.File("textfile.txt"))
while(input.hasNext()){
  val token = input.next()
  val currentCount = wordMap.getOrElse(token,0) + 1
  val wordMap = wordMap + (token,currentCount)
}
The idea is that wordMap will have all the word counts at the end of the iteration...
Whenever I try to run this snippet, I get the following error:
recursive value wordMap needs type.
Can somebody point out why I am getting this error and what I can do to remedy it?
Thanks
val wordMap = wordMap + (token,currentCount)
This line declares a new val named wordMap, which shadows the earlier one; the compiler then reads the wordMap on the right-hand side as a reference to the value being defined, hence "recursive value wordMap needs type". If you want to update the map like this, you need to define wordMap as a var and then just use
wordMap = wordMap + (token -> currentCount)
Though how about this instead?:
io.Source.fromFile("textfile.txt")           // read from the file
  .getLines.flatMap { line =>                // for each line
    line.split("\\s+")                       // split the line into tokens
      .groupBy(identity).mapValues(_.size)   // count each token in the line
  }                                          // this produces an iterator of token counts
  .toStream                                  // make a Stream so we can groupBy
  .groupBy(_._1).mapValues(_.map(_._2).sum)  // combine all the per-line counts
  .toList
Note that the per-line pre-aggregation is used to try to reduce the memory required; counting across the entire file at once might need too much memory.
If your file is really massive, I would suggest doing this in parallel (since word counting is trivial to parallelize) using either Scala's parallel collections or Hadoop (using one of the cool Scala Hadoop wrappers like Scrunch or Scoobi).
EDIT: Detailed explanation:
Ok, first look at the inner part of the flatMap. We take a string, and split it apart on whitespace:
val line = "a b c b"
val tokens = line.split("\\s+") // Array(a, b, c, b)
Now identity is a function that just returns its argument, so if we groupBy(identity), we map each distinct word type to all of its tokens:
val grouped = tokens.groupBy(identity) // Map(c -> Array(c), a -> Array(a), b -> Array(b, b))
And finally, we want to count up the number of tokens for each type:
val counts = grouped.mapValues(_.size) // Map(c -> 1, a -> 1, b -> 2)
Since we map this over all the lines in the file, we end up with token counts for each line.
So what does flatMap do? Well, it runs the token-counting function over each line, and then combines all the results into one big collection.
Assume the file is:
a b c b
b c d d d
e f c
Then we get:
val countsByLine =
  io.Source.fromFile("textfile.txt")           // read from the file
    .getLines.flatMap { line =>                // for each line
      line.split("\\s+")                       // split the line into tokens
        .groupBy(identity).mapValues(_.size)   // count each token in the line
    }                                          // this produces an iterator of token counts
println(countsByLine.toList) // List((c,1), (a,1), (b,2), (c,1), (d,3), (b,1), (c,1), (e,1), (f,1))
So now we need to combine the counts of each line into one big set of counts. The countsByLine variable is an Iterator, so it doesn't have a groupBy method. Instead we can convert it to a Stream, which is basically a lazy list. We want laziness because we don't want to have to read the entire file into memory before we start. Then the groupBy groups all counts of the same word type together.
val groupedCounts = countsByLine.toStream.groupBy(_._1)
println(groupedCounts.mapValues(_.toList)) // Map(e -> List((e,1)), f -> List((f,1)), a -> List((a,1)), b -> List((b,2), (b,1)), c -> List((c,1), (c,1), (c,1)), d -> List((d,3)))
And finally we can sum up the counts from each line for each word type by grabbing the second item from each tuple (the count), and summing:
val totalCounts = groupedCounts.mapValues(_.map(_._2).sum)
println(totalCounts.toList)
List((e,1), (f,1), (a,1), (b,3), (c,3), (d,3))
And there you have it.
You have a few mistakes: you've defined wordMap twice (val declares a value that cannot be reassigned). Also, Map is immutable by default, so you either have to declare wordMap as a var or use a mutable Map (I suggest the former).
Try this:
var wordMap = Map.empty[String,Int] withDefaultValue 0
val input = new java.util.Scanner(new java.io.File("textfile.txt"))
while (input.hasNext()) {
  val token = input.next()
  wordMap += token -> (wordMap(token) + 1)
}
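Since the question specifically asks for an immutable Map, here is a foldLeft sketch (assuming whitespace-separated tokens) that never reassigns anything:
import scala.io.Source

val counts = Source.fromFile("textfile.txt").getLines()
  .flatMap(_.split("\\s+"))
  .foldLeft(Map.empty[String, Int]) { (m, token) =>
    m + (token -> (m.getOrElse(token, 0) + 1))
  }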

Standard function to enumerate all strings of given length over given alphabet

Suppose, I have an alphabet of N symbols and want to enumerate all different strings of length M over this alphabet. Does Scala provide any standard library function for that?
Taking inspiration from another answer:
val letters = Seq("a", "b", "c")
val n = 3
Iterable.fill(n)(letters) reduceLeft { (a, b) =>
  for (a <- a; b <- b) yield a + b
}
Seq[java.lang.String] = List(aaa, aab, aac, aba, abb, abc, aca, acb, acc, baa, bab, bac, bba, bbb, bbc, bca, bcb, bcc, caa, cab, cac, cba, cbb, cbc, cca, ccb, ccc)
To work with something other than strings:
val letters = Seq(1, 2, 3)
Iterable.fill(n)(letters).foldLeft(List(List[Int]())) { (a, b) =>
  for (a <- a; b <- b) yield b :: a
}
The need for the extra type annotation is a little annoying but it will not work without it (unless someone knows another way).
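The reduceLeft idea is also easy to wrap in a small reusable helper; a sketch (the name enumerate is mine, not from the answer):
def enumerate(alphabet: Seq[String], length: Int): Seq[String] =
  if (length == 0) Seq("")
  else Seq.fill(length)(alphabet).reduceLeft { (acc, letters) =>
    for (prefix <- acc; letter <- letters) yield prefix + letter
  }

enumerate(Seq("a", "b"), 2) // List(aa, ab, ba, bb)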
Another solution:
val alph = List("a", "b", "c")
val n = 3
alph.flatMap(List.fill(alph.size)(_))
.combinations(n)
.flatMap(_.permutations).toList
Update: If you want to get a list of strings in the output, then alph should be a string.
val alph = "abcd"