Making a string of a 2-D array in Scala

So I have the following in Scala:
scala> val example = "hello \tmy \nname \tis \nmaria \tlee".split("\n").map(_.split("\\s+"))
example: Array[Array[String]] = Array(Array(hello, my), Array(name, is), Array(maria, lee))
I want to take each 1-D array and turn it into a string, and make an array of these strings (comma-separated). How do I do this?

scala> example.map(_.mkString)
res0: Array[String] = Array(hellomy, nameis, marialee)
To make the strings comma separated:
scala> example.map(_.mkString(","))
res0: Array[String] = Array(hello,my, name,is, maria,lee)
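If the intermediate Array[Array[String]] isn't needed, the split and the join can be combined in a single expression (a minimal sketch using the same example string):
scala> "hello \tmy \nname \tis \nmaria \tlee".split("\n").map(_.split("\\s+").mkString(","))
res1: Array[String] = Array(hello,my, name,is, maria,lee)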

Related

Replace multiple occurrences of duplicate strings in Scala with empty

I have a string such as:
something,'' something,nothing_something,op nothing_something,'' cat,cat
I want to achieve the following output:
'' something,op nothing_something,cat
Is there any way to achieve it?
If I understand your requirement correctly, here's one approach with the following steps:
Split the input string on "," and create a list of indexed CSV values, then convert it to a Map
Generate 2-combinations of the indexed CSV values
Check each pair and capture the index of any value that is contained within the other value
Since the values at the captured indexes are contained in some other value, removing those indexes leaves exactly the indexes we want to keep
Use the remaining indexes to look up values in the Map and concatenate them back into a string
Here is sample code applied to a string with slightly more general comma-separated values:
val str = "cats,a cat,cat,there is a cat,my cat,cats,cat"
val csvIdxList = (Stream from 1).zip(str.split(",")).toList
val csvMap = csvIdxList.toMap
val csvPairs = csvIdxList.combinations(2).toList
val csvContainedIdx = csvPairs.collect {
  case List(x, y) if x._2.contains(y._2) => y._1
  case List(x, y) if y._2.contains(x._2) => x._1
}.distinct
// csvContainedIdx: List[Int] = List(3, 6, 7, 2)
val csvToKeepIdx = (1 to csvIdxList.size) diff csvContainedIdx
// csvToKeepIdx: scala.collection.immutable.IndexedSeq[Int] = Vector(1, 4, 5)
val strDeduped = csvToKeepIdx.map( csvMap.getOrElse(_, "") ).mkString(",")
// strDeduped: String = cats,there is a cat,my cat
Applying the above to your sample string something,'' something,nothing_something,op nothing_something would yield the expected result:
strDeduped: String = '' something,op nothing_something
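If this needs to be reused, the same steps can be packaged into a small helper (a sketch based on the code above; the name dedupContained is just for illustration):
def dedupContained(s: String): String = {
  // index each comma-separated value starting from 1
  val csvIdxList = (Stream from 1).zip(s.split(",")).toList
  val csvMap = csvIdxList.toMap
  // indexes of values contained in some other value
  val csvContainedIdx = csvIdxList.combinations(2).toList.collect {
    case List(x, y) if x._2.contains(y._2) => y._1
    case List(x, y) if y._2.contains(x._2) => x._1
  }.distinct
  // keep the remaining indexes and join the values back into a string
  val csvToKeepIdx = (1 to csvIdxList.size) diff csvContainedIdx
  csvToKeepIdx.map(csvMap.getOrElse(_, "")).mkString(",")
}
// dedupContained("something,'' something,nothing_something,op nothing_something")
// => '' something,op nothing_something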
First split the given String on commas to get an Array of values, then keep only the ones you want with filter and join them back with mkString, as below:
s.split(",").filter(_.contains(' ')).mkString(",")
In Scala REPL:
scala> val s = "something,'' something,nothing_something,op nothing_something"
s: String = something,'' something,nothing_something,op nothing_something
scala> s.split(",").filter(_.contains(' ')).mkString(",")
res27: String = '' something,op nothing_something
As per Leo C's comment, I tested it with another String as below:
scala> val s = "something,'' something anything anything anything anything,nothing_something,op op op nothing_something"
s: String = something,'' something anything anything anything anything,nothing_something,op op op nothing_something
scala> s.split(",").filter(_.contains(' ')).mkString(",")
res43: String = '' something anything anything anything anything,op op op nothing_something

Flattening Array[Array[String]] to Array[String]

I would like to flatten my data structure of type Array[Array[String]] to Array[String], where there are some empty Array()s too.
For example:
val test=Array(Array("foo"), Array("bar"), Array(),...)
To be converted to:
Array(foo,bar,"")
I tried:
test.flatMap(x=>x.toString())
But this gets broken down into a char array:
Array([f, o, o,..])
What am I doing wrong?
You can do this using
test.flatten
The reason your initial approach didn't work is that x in x=>x.toString() is an Array[String], so each toString produces the string representation of that Array, and flatMap then flattens each of those strings into its individual characters.
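A quick REPL comparison (a sketch with the example above); note that flatten simply drops the empty inner arrays, while map(_.mkString) keeps one (empty) entry per inner array if that is what's wanted:
scala> val test = Array(Array("foo"), Array("bar"), Array())
test: Array[Array[String]] = Array(Array(foo), Array(bar), Array())
scala> test.flatten
res0: Array[String] = Array(foo, bar)
scala> test.map(_.mkString)
res1: Array[String] = Array(foo, bar, "")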

Split function difference between char and string arguments

I tried the following code in the Scala REPL:
"ASD-ASD.KZ".split('.')
res7: Array[String] = Array(ASD-ASD, KZ)
"ASD-ASD.KZ".split(".")
res8: Array[String] = Array()
Why do these function calls have different results?
There's a big difference between the two calls.
The split function is overloaded; this is the implementation from the Scala source code:
/** For every line in this string:
  *  Strip a leading prefix consisting of blanks or control characters
  *  followed by | from the line.
  */
def stripMargin: String = stripMargin('|')

private def escape(ch: Char): String = "\\Q" + ch + "\\E"

@throws(classOf[java.util.regex.PatternSyntaxException])
def split(separator: Char): Array[String] = toString.split(escape(separator))

@throws(classOf[java.util.regex.PatternSyntaxException])
def split(separators: Array[Char]): Array[String] = {
  val re = separators.foldLeft("[")(_+escape(_)) + "]"
  toString.split(re)
}
So when you call split with a Char, you ask to split on that specific character:
scala> "ASD-ASD.KZ".split('.')
res0: Array[String] = Array(ASD-ASD, KZ)
And when you call split with a String, the argument is interpreted as a regular expression. So to get the same result using double quotes, you need to do:
scala> "ASD-ASD.KZ".split("\\.")
res2: Array[String] = Array(ASD-ASD, KZ)
Where:
The first \ escapes the second backslash inside the string literal, so the regex engine receives \.
In the regex, \. matches a literal dot; an unescaped . is a metacharacter that matches any character
. is the character we want to split the string on
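If you'd rather not hand-escape the separator, you can also quote it so it is treated as a literal rather than as a regex (a sketch using java.util.regex.Pattern.quote):
scala> "ASD-ASD.KZ".split(java.util.regex.Pattern.quote("."))
res3: Array[String] = Array(ASD-ASD, KZ)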

How to save a two-dimensional array into HDFS in spark?

Something like:
val arr : Array[Array[Double]] = new Array(featureSize)
sc.parallelize(arr, 100).saveAsTextFile(args(1))
Then Spark will store the data into HDFS.
Array in Scala corresponds exactly to Java arrays: in particular, it's a mutable type, and its toString method returns the default JVM representation (a type tag plus a hash code) rather than the elements. When you save this RDD as a text file, saveAsTextFile invokes toString on each element of the RDD, and therefore gives you gibberish. If you want to output the actual elements of each Array, you first have to turn it into a String, for example by applying mkString(",") to each array. Example from the Spark shell:
scala> Array(1,2,3).toString
res11: String = [I@31cba915
scala> Array(1,2,3).mkString(",")
res12: String = 1,2,3
For nested arrays:
scala> sc.parallelize(Array( Array(1,2,3), Array(4,5,6), Array(7,8,9) )).collect.mkString("\n")
res15: String =
[I@41ff41b0
[I@5d31aba9
[I@67fd140b
scala> sc.parallelize(Array( Array(1,2,3), Array(4,5,6), Array(7,8,9) ).map(_.mkString(","))).collect.mkString("\n")
res16: String =
1,2,3
4,5,6
7,8,9
So, your code should be:
sc.parallelize(arr.map(_.mkString(",")), 100).saveAsTextFile(args(1))
or
sc.parallelize(arr, 100).map(_.mkString(",")).saveAsTextFile(args(1))
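If the file needs to be read back later, the transformation can be reversed by splitting each saved line on the comma (a sketch; it assumes the same path in args(1) and the Array[Array[Double]] element type from above):
val restored = sc.textFile(args(1)).map(_.split(",").map(_.toDouble))
// restored is an RDD[Array[Double]] with one inner array per saved line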

How would I map a list of strings with a known format to a list of tuples?

I have an array of strings. Each string has 2 parts separated by white space. It looks like:
x <white space> y
I want to turn it into an array of Tuples where each tuple has (x, y)
How can I write this in Scala? I know it will need something similar to:
val results = listOfStrings.collect { str => (str.left, str.right) }
I'm not sure how I can break up each str into the left and right sides needed...
You could take advantage of the fact that in Scala, Regular expressions are also "extractors".
scala> val PairWithSpaces = "(\\w+)\\s+(\\w+)".r
PairWithSpaces: scala.util.matching.Regex = (\w+)\s+(\w+)
scala> val PairWithSpaces(l, r) = "1 17"
l: String = 1
r: String = 17
Now you can build your extractor into a natural looking "map":
scala> Array("a b", "1 3", "Z x").map{case PairWithSpaces(x,y) => (x, y) }
res10: Array[(String, String)] = Array((a,b), (1,3), (Z,x))
Perhaps overkill for you, but can really help readability if your regex gets fancy. I also like how this approach will fail fast if an illegal string is given.
Warning, not sure if the regex matches exactly what you need...
You could (assuming that you want to drop without complaint any string that doesn't fit the pattern):
val results = listOfStrings.map(_.split("\\s+")).collect { case Array(l,r) => (l,r) }
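A quick REPL check of this version (a sketch; note that a string without whitespace, such as "bad", is silently dropped by the collect):
scala> val listOfStrings = List("a b", "1 17", "bad")
listOfStrings: List[String] = List(a b, 1 17, bad)
scala> listOfStrings.map(_.split("\\s+")).collect { case Array(l, r) => (l, r) }
res0: List[(String, String)] = List((a,b), (1,17))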