I have this code below:
// TABLE FROM Hive
val df = hiveContext.sql("select * from test_table where date ='20160721' LIMIT 300")
//ERROR ON THE LINE BELOW
val row = df.flatMap(row => ((row.get(0), row.get(1), row.get(2)), 1))
I get this error in the code above saying:
Type mismatch, expected: (Row) => Traversable[NotInferedU], actual : (Row) => ((Any, Any, Any), Int)
Can someone check what is wrong in my flatMap function? I am not able to understand what this error is stating.
You probably should use map instead: ((row.get(0), row.get(1), row.get(2)), 1) is not a Traversable, which is what flatMap expects, as the error message states.
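For illustration, a minimal sketch of the suggested fix, assuming Spark 1.x where DataFrame.map returns an RDD (on Spark 2.x you would write df.rdd.map, since a tuple containing Any has no Encoder):

// map produces exactly one output element per Row
val pairs = df.map(row => ((row.get(0), row.get(1), row.get(2)), 1))
// flatMap would only type-check if each Row mapped to a collection, e.g.
// df.flatMap(row => Seq(((row.get(0), row.get(1), row.get(2)), 1)))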
I need to ultimately build a schema from a CSV. I can read the CSV into data frame, and I've got a case class defined.
case class metadata_class(colname: String, datatype: String, length: Option[Int], precision: Option[Int])
val foo = spark.read.format("csv").option("delimiter", ",").option("header", "true").schema(Encoders.product[metadata_class].schema).load("/path/to/file").as[metadata_class].toDF()
Now I'm trying to iterate through that data frame and build a list of StructFields. My current effort:
val sList: List[StructField] = List(
  for (m <- foo.as[metadata_class].collect) {
    StructField(m.colname, getType(m.datatype))
  })
That gives me a type mismatch:
found : Unit
required: org.apache.spark.sql.types.StructField
for (m <- foo.as[metadata_class].collect) {
^
What am I doing wrong here? Or am I not even close?
It is not usual to use a for-loop in Scala. A for-loop (without yield) has return type Unit, so in your code the value of sList will be List[Unit]:
val sList: List[Unit] = List(
  for (m <- foo.as[metadata_class].collect) {
    StructField(m.colname, getType(m.datatype))
  }
)
but you declared sList as List[StructField], which is the cause of the compile error.
I suggest you use the map function instead of a for-loop to iterate over the metadata_class objects and create StructFields from them:
val structFields: List[StructField] = foo.as[metadata_class]
  .collect
  .map(m => StructField(m.colname, getType(m.datatype)))
  .toList
This way you will get a List[StructField].
In Scala every statement is an expression with a return type; a for-loop (without yield) is an expression too, and its return type is Unit.
Read more about statements/expressions:
statement vs expression in scala
statements and expressions in scala
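As an aside, a for comprehension with yield is itself an expression that returns a collection, so a sketch like the one below (reusing the getType helper from the question) should also give you a List[StructField]; StructType can then wrap the list if you need a full schema:

val structFields: List[StructField] =
  (for (m <- foo.as[metadata_class].collect) yield
    StructField(m.colname, getType(m.datatype))).toList

val schema = StructType(structFields)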
I am new to Scala and Spark and have loaded my dataset into an RDD. Here is my sample data set:
scala> flightdata.collect
res39: Array[(String, Int)] = Array((DFW,11956), (DTW,588), (SEA,607), (JFK,1595), (SJC,327), (ORD,4664), (PHX,4993), (STL,661),
From the above dataset, I need to find the total sum, hence I wrote this:
scala> flightdata.values.sum
res40: Double = 445827.0
scala> flightdata.map(_._2).reduce( (a,b) => a + b)
res41: Int = 445827
Both values.sum and map with reduce give the right answer, but I am trying to rewrite the same computation with reduce directly on the tuples:
scala> flightdata.reduce( (s1,s2) => s1._2 + s2._2)
<console>:26: error: type mismatch;
found : Int
required: (String, Int)
flightdata.reduce( (s1,s2) => s1._2 + s2._2)
It causes a type mismatch error. Why does it cause a type mismatch error?
It happens because reduce must combine two tuples into another tuple of the same type, but your function returns an integer.
You should return a tuple, ("", s1._2 + s2._2), instead of s1._2 + s2._2.
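A quick sketch of that fix: the first element of the tuple is just a placeholder, and the sum you want ends up in the second position (the same 445827 as above):

// reduce must return the element type, so wrap the running sum back into a tuple
flightdata.reduce((s1, s2) => ("", s1._2 + s2._2))._2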
I have this code:
1. var data = sc.textFile("test3.tsv")
2. var satir = data.map(line=> ((line.split("\t")(1),line.split("\t")(2)),(1,1)))
3. satir.reduce(((a,b),(c,k)) => k + k)
The first and second lines work properly. What I want, when reducing, is to work with the last '1' of the value.
For example, like this:
((a,b),(1,1))
But when I compile the third line I get this error:
<console>:29: error: type mismatch;
found : (Int, Int)
required: String
satir.reduce({ case ((a,b),(k,o)) =>o+o})
What should I do?
When you reduce, the output type must be the same as the element type. You can use a fold instead, because a fold can return a different type:
scala.io.Source.fromFile("test3.tsv")
  .getLines
  .toList
  .map { line =>
    val value = line.split("\t")
    ((value(0), value(1)), (1, 1))
  }
  .foldLeft(0)((response, tuple) => response + tuple._2._2)
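If you want to stay inside Spark instead of re-reading the file with scala.io.Source, a sketch of the same idea directly on the RDD is shown below; aggregate plays the role of foldLeft here, because it also lets the result type differ from the element type:

satir.aggregate(0)(
  (acc, pair) => acc + pair._2._2, // fold each record's last '1' into the Int accumulator
  (a, b) => a + b                  // combine partial sums from different partitions
)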
If you care to understand the theory behind this:
Fold explanation
A tutorial on the universality and expressiveness of fold
A Translation from Attribute Grammars to Catamorphisms
Scala: create a Map from a list of Option elements?
myMap = (0 to r.numRows - 1).map { i =>
  if (Option(r.getValue(i, "Name")).isDefined)
    (r.getValue(i, "Name")) -> (r.getValue(i, "isAvailable").toString)
}.toMap
foo(myMap) // at this point I'm getting the exception
I have tried the above code but it does not compile:
Exception:
Error:(158, 23) Cannot prove that Any <:< (T, U).
}.toMap
^
Maybe try this code:
val myMap = (0 until r.numRows).flatMap { i =>
  for {
    name <- Option(r.getValue(i, "Name"))
    available = r.getValue(i, "isAvailable").toString
  } yield (name, available)
}.toMap
foo(myMap)
Your problem is most likely that you use if without else; if is an expression in Scala, which means it always evaluates to a value.
if (2 > 3) true
is equivalent to
if (2 > 3) true else ()
so the type of the whole expression is the common supertype of the two branches. In your code that supertype is Any, and this is the error you get.
Note that you can replace to r.numRows - 1 with until r.numRows, which does the same thing but is more readable.
You shouldn't check an Option with isDefined and then act on the result if that's true; use a map operation (or a for comprehension) on the Option instead.
To explain my implementation a little: the for part will evaluate to Option[(String, String)], and will contain your tuple if the "Name" value was present, otherwise None. Conceptually, flatMap will first change your range of indices into a sequence of Option[(String, String)], and then flatten it, i.e. remove all the Nones and unwrap all the Somes.
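A tiny stand-alone illustration of that flattening step, with made-up values:

val opts = Seq(Some("a" -> "true"), None, Some("b" -> "false"))
opts.flatten        // Seq((a,true), (b,false)): the None is removed
opts.flatten.toMap  // Map(a -> true, b -> false)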
If you want to check "isAvailable" for null as well, you can do a similar thing:
for {
  name <- Option(r.getValue(i, "Name"))
  available <- Option(r.getValue(i, "isAvailable"))
} yield (name, available.toString)
I am encountering this error:
java.lang.ClassCastException: scala.collection.immutable.$colon$colon cannot be cast to [Ljava.lang.Object;
whenever I try to use "contains" to find if a string is inside an array. Is there a more appropriate way of doing this? Or, am I doing something wrong? (I am fairly new to Scala)
Here is the code:
val matches = Set[JSONObject]()
val config = new SparkConf()
val sc = new SparkContext("local", "SparkExample", config)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val ebay = sqlContext.read.json("/Users/thomassquires/Downloads/products.json")
val catalogue = sqlContext.read.json("/Users/thomassquires/Documents/catalogue2.json")
val eins = ebay.map(item => (item.getAs[String]("ID"), Option(item.getAs[Set[Row]]("itemSpecifics"))))
.filter(item => item._2.isDefined)
.map(item => (item._1 , item._2.get.find(x => x.getAs[String]("k") == "EAN")))
.filter(x => x._2.isDefined)
.map(x => (x._1, x._2.get.getAs[String]("v")))
.collect()
def catEins = catalogue.map(r => (r.getAs[String]("_id"), Option(r.getAs[Array[String]]("item_model_number"))))
  .filter(r => r._2.isDefined)
  .map(r => (r._1, r._2.get))
  .collect()
def matched = for(ein <- eins) yield (ein._1, catEins.filter(z => z._2.contains(ein._2)))
The exception occurs on the last line. I have tried a few different variants.
My data structure is one List[Tuple2[String, String]] and one List[Tuple2[String, Array[String]]]. I need to find the zero or more matches from the second list that contain the string.
Thanks
Long story short (there is still a part that eludes me here*): you're using the wrong types. getAs is implemented as fieldIndex (String => Int) followed by get (Int => Any) followed by asInstanceOf.
Since Spark doesn't store array column data as Array or Set but as WrappedArray, calls like getAs[Array[String]] or getAs[Set[Row]] are not valid. If you want specific types you should use either getAs[Seq[T]] or getSeq[T](i) and convert your data to the desired type with toSet / toArray.
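For example, inside the two map calls from the question (item and r are the lambda parameters there), the change could look roughly like this:

// instead of item.getAs[Set[Row]]("itemSpecifics")
Option(item.getAs[Seq[Row]]("itemSpecifics")).map(_.toSet)

// instead of r.getAs[Array[String]]("item_model_number")
Option(r.getAs[Seq[String]]("item_model_number")).map(_.toArray)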
* See Why wrapping a generic method call with Option defers ClassCastException?