Scala Type Mismatch underlying type - scala

I'm writing methods to convert a Set[Tuple2[String, String]] to a String and vice versa.
I'm saving the string value as v1,v2#v3,v4#v5,v6
In order to fill the Set I'm splitting the string by '#', and in order to extract the values I'm trying to split each part by ',', but I receive
type mismatch: found: x.type (with underlying type Array[String])
The code I tried using is
val x = overwriters.split("#")
for (tuple <- x) {
  tuple.split(",")
}
The return type of split is Array[String], so it is not clear to me why I cannot split each member of the returned array.

tuple.split(",") returns an array of two elements. You need to convert it to a tuple.
val overwriters = "v1,v2#v3,v4#v5,v6"
val x = overwriters.split("#").toSet
for (tuple <- x) yield {
  val t = tuple.split(",")
  (t(0), t(1))
}

overwriters.split("#").map(_.split(",")).map(x => (x(0), x(1))).toSet
This achieves the same thing in a slightly more idiomatic way.
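For the reverse direction (Set back to String), a minimal sketch using mkString, assuming the same v1,v2#v3,v4 encoding as above; the helper names setToString and stringToSet are made up for illustration:
def setToString(set: Set[(String, String)]): String =
  set.map { case (a, b) => s"$a,$b" }.mkString("#")
def stringToSet(s: String): Set[(String, String)] =
  s.split("#").map { part =>
    val Array(a, b) = part.split(",") // assumes every part holds exactly two comma-separated values
    (a, b)
  }.toSet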

Related

Creating a list of StructFields from data frame

I need to ultimately build a schema from a CSV. I can read the CSV into a data frame, and I've got a case class defined.
case class metadata_class(colname: String, datatype: String, length: Option[Int], precision: Option[Int])
val foo = spark.read.format("csv").option("delimiter", ",").option("header", "true").schema(Encoders.product[metadata_class].schema).load("/path/to/file").as[metadata_class].toDF()
Now I'm trying to iterate through that data frame and build a list of StructFields. My current effort:
val sList: List[StructField] = List(
  for (m <- foo.as[metadata_class].collect) {
    StructField(m.colname, getType(m.datatype))
  })
That gives me a type mismatch:
found : Unit
required: org.apache.spark.sql.types.StructField
for (m <- foo.as[metadata_class].collect) {
^
What am I doing wrong here? Or am I not even close?
It is not usual to use a for loop (without yield) in Scala. Such a for loop has return type Unit, so in your code the resulting value of sList would be List[Unit]:
val sList: List[Unit] = List(
  for (m <- foo.as[metadata_class].collect) {
    StructField(m.colname, getType(m.datatype))
  }
)
but you declared sList as List[StructField], which is the cause of the compile error.
You should use the map function instead of a for loop to iterate over the metadata_class objects and create StructFields from them:
val structFields: List[StructField] = foo.as[metadata_class]
  .collect
  .map(m => StructField(m.colname, getType(m.datatype)))
  .toList
This way you get a List[StructField].
In Scala every statement is an expression with a return type; a for loop without yield is no exception, and its return type is Unit.
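If you prefer the for-comprehension syntax, adding yield makes the loop produce a value per element instead of Unit; a sketch equivalent to the map version above (getType is the helper from the question):
val structFields: List[StructField] =
  (for (m <- foo.as[metadata_class].collect) yield
    StructField(m.colname, getType(m.datatype))).toList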
read more about statements/expressions:
statement vs expression in scala
statements and expressions in scala

Accessing parts of a split string in Scala

I am working on a Scala application. I have a string as follows:
val str = "abc,def,xyz"
I want to split this string and access the split parts separately, like abc, def and xyz. My code is as follows:
val splittedString = str.split(',')
To access each part of this split string I am trying something like splittedString._1, splittedString._2, splittedString._3. But IntelliJ gives me an error stating "cannot resolve symbol _1", and the same error for parts 2 and 3 as well. How can I access each element of the split string?
The method split is defined over Strings to return an Array[String].
What you can do is access by (zero-based) index, splittedString(0) being the first item.
Alternatively, if you know the length of the resulting array, you can convert it to a tuple and access it with the accessor methods you were referring to:
val tuple = str.split(",") match {
  case Array(a, b, c) => (a, b, c)
  case _ => throw new IllegalArgumentException
}
tuple._1 will now contain abc in your example.
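Putting the two approaches together, a small self-contained sketch based on the question's string:
val str = "abc,def,xyz"
val parts = str.split(",")   // Array("abc", "def", "xyz")
val first = parts(0)         // "abc" -- zero-based index access
val (a, b, c) = str.split(",") match {
  case Array(x, y, z) => (x, y, z)
  case _ => throw new IllegalArgumentException("expected exactly three parts")
}
// a == "abc", b == "def", c == "xyz"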

Convert Spark Row to typed Array of Doubles

I am using Spark 1.3.1 with Hive and have a Row object that is a long series of doubles to be passed to a Vectors.dense constructor. However, when I convert a Row to an array via
SparkDataFrame.map{r => r.toSeq.toArray}
all type information is lost and I get back an Array[Any]. I am unable to cast this to Double using
SparkDataFrame.map{r =>
  val array = r.toSeq.toArray
  array.map(_.toDouble)
} // Fails with value toDouble is not a member of any
as does
SparkDataFrame.map{r =>
  val array = r.toSeq.toArray
  array.map(_.asInstanceOf[Double])
} // Fails with java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double
I see that the Row object has an API that supports getting specific elements as a type, via:
SparkDataFrame.map{r =>
r.getDouble(5)}
However, even this fails with java.lang.Integer cannot be cast to java.lang.Double.
The only workaround I have found is the following:
SparkDataFrame.map{r =>
  val doubleArray = Array(r.getInt(5).toDouble, r.getInt(6).toDouble)
  Vectors.dense(doubleArray) }
However, this is prohibitively tedious when indices 5 through 1000 need to be converted to an array of doubles.
Is there any way around explicitly indexing the row object?
Let's look at your code blocks one by one.
SparkDataFrame.map{r =>
  val array = r.toSeq.toArray
  val doubleArray = array.map(_.toDouble)
} // Fails with value toDouble is not a member of any
map returns the value of the last statement (i.e. there is an implied return in any Scala function: the last expression is your return value). Your last statement is of type Unit (like Void), because assigning to a val has no return value. To fix that, take out the assignment (this has the side benefit of being less code to read).
SparkDataFrame.map{r =>
  val array = r.toSeq.toArray
  array.map(_.toDouble)
}
_.toDouble is not a cast. You can call it on a String or, in your case, an Int, and it produces a new value of type Double; calling _.toDouble on an Int is more like doing Double.parseDouble(inputInt) in Java.
_.asInstanceOf[Double] would be a cast, which, if your data really is a Double, just changes the type. But I'm not sure you need a cast here; avoid casting if you can.
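To illustrate the difference, a small plain-Scala sketch (no Spark involved) that reproduces the ClassCastException from the question:
val i: Any = 5                // boxed as a java.lang.Integer once it is typed as Any
5.toDouble                    // 5.0 -- a conversion that produces a new Double value
"3.5".toDouble                // 3.5 -- parses the String
i.asInstanceOf[Double]        // throws java.lang.ClassCastException:
                              // java.lang.Integer cannot be cast to java.lang.Double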
Update
So you changed the code to this
SparkDataFrame.map{r =>
  val array = r.toSeq.toArray
  array.map(_.toDouble)
} // Fails with value toDouble is not a member of any
You are calling toDouble on a value from your SparkDataFrame. Apparently it is not something that has a toDouble method, i.e. it is not an Int, a String, or a Long.
If this works
SparkDataFrame.map{r =>
  val doubleArray = Array(r.getInt(5).toDouble, r.getInt(6).toDouble)
  Vectors.dense(doubleArray) }
but you need to do this for indices 5 through 1000, why not do:
SparkDataFrame.map{r =>
  val doubleArray = (for (i <- 5 to 1000) yield r.getInt(i).toDouble).toArray
  Vectors.dense(doubleArray)
}
You should use Double.parseDouble from Java.
import java.lang.Double
SparkDataFrame.map{r =>
  val doubleArray = (for (i <- 5 to 1000) yield Double.parseDouble(r.get(i).toString)).toArray
  Vectors.dense(doubleArray)
}
I had a similar, harder problem, in that my features are not all Double. Here's how I was able to convert my DataFrame (also pulled from a Hive table) to a LabeledPoint RDD:
val loaff = oaff.map(r =>
  LabeledPoint(
    if (r.getString(classIdx) == "NOT_FRAUD") 0 else 1,
    Vectors.dense(featIdxs.map(r.get(_) match {
      case null      => Double.NaN
      case d: Double => d
      case l: Long   => l
    }).toArray)))
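If more value types can show up in a row, the same pattern-match idea can be pulled out into a helper; a sketch using a hypothetical toDoubleSafe name, covering the cases discussed in this thread:
// Hypothetical helper: convert a raw row value to Double, covering the cases mentioned above.
def toDoubleSafe(v: Any): Double = v match {
  case null      => Double.NaN
  case d: Double => d
  case l: Long   => l.toDouble
  case i: Int    => i.toDouble
  case s: String => s.toDouble
  case other     => other.toString.toDouble
}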

Scala: type mismatch error

I am new to Scala and I am practicing it with the k-means algorithm, following the tutorial from k-means.
I am confused by this part of the tutorial:
var newCentroids = pointsGroup.mapValues(ps => average(ps)).collectAsMap()
This causes a type mismatch error because the function average needs a Seq, while we give it an Iterable. How can I fix this? What caused the error?
Well, Seq is a subtype of Iterable but not vice versa, so the compiler will not accept an Iterable where a Seq is required.
There is an explicit conversion available by writing average(ps.toSeq). This conversion iterates over the Iterable and collects the items into a Seq.
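Applied to the line from the tutorial, the call-site fix is just:
var newCentroids = pointsGroup.mapValues(ps => average(ps.toSeq)).collectAsMap()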
Alternatively, we could replace Seq with Iterable in the provided average function:
def average(ps: Iterable[Vector]): Vector = {
  val numVectors = ps.size
  var out = new Vector(ps.head.elements) // start from the first vector...
  ps.tail foreach (out += _)             // ...then add the remaining ones
  out / numVectors
}
Or, computing each coordinate directly, without a mutable accumulator:
def average(ps: Iterable[Vector]): Vector = {
  val numVectors = ps.size
  val vSize = ps.head.elements.length
  def element(index: Int): Double = ps.map(_(index)).sum / numVectors
  new Vector((0 until vSize).map(element).toArray)
}
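A quick usage sketch, assuming the tutorial's Vector class wraps an Array[Double] (as the elements field used above suggests); the sample points are made up:
val points: Iterable[Vector] = List(new Vector(Array(1.0, 2.0)), new Vector(Array(3.0, 4.0)))
val centroid = average(points) // expected coordinates: (2.0, 3.0)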

Yield String from List[Char]

I have an l: List[Char] of characters which I want to concatenate and return as a String in one for loop.
I tried this
val x: String = for(i <- list) yield(i)
leading to
error: type mismatch;
found : List[Char]
required: String
So how can I change the result type of yield?
Thanks!
Try this:
val x: String = list.mkString
This syntax:
for (i <- list) yield i
is syntactic sugar for:
list.map(i => i)
and will thus return an unchanged copy of your original list.
You can use the following:
val x: String = (for(i <- list) yield(i))(collection.breakOut)
See this question for more information about breakOut.
You can use any of the three mkString overloads. Basically, it converts a collection into a flat String by calling each element's toString method; the overloads let you add a custom separator between elements and an optional prefix and suffix.
It is an Iterable method, so you can also use it on a Map or a Set.
See http://www.scala-lang.org/api/2.7.2/scala/Iterable.html for more details.
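For reference, a short sketch of the three overloads applied to the question's list:
val list = List('a', 'b', 'c')
list.mkString                  // "abc"
list.mkString(", ")            // "a, b, c"
list.mkString("[", ", ", "]")  // "[a, b, c]"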