Unable to use collectAsMap() in Scala code

val titleMap = movies.map(line => line.split("\\|")).take(2)
// converting movie id and movie name into a map (key-value pairs)
val title1 = titleMap.map(array => (array(0).toInt, array(1)))

val titles = movies.map(line => line.split("\\|").take(2))
  .map(array => (array(0).toInt, array(1)))
  .collectAsMap()
What is wrong with "title1" here? I am unable to apply the collectAsMap function to it, yet I can apply the same function to "titles".

title1 is not an RDD, so it does not have the method collectAsMap(). Calling take(2) on movies.map(...) returns the first two records to the driver as a plain Array, so titleMap, and therefore title1, is a local Array rather than an RDD.
titles is an RDD, because there take(2) is applied to the split array inside map, so the data stays in an RDD and collectAsMap() is available.
I advise reading up on types: https://en.wikipedia.org/wiki/Type_safety, https://en.wikipedia.org/wiki/Type_system
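If the goal is just a lookup map, a minimal sketch (assuming movies is an RDD[String] of pipe-delimited lines, as the code above suggests) is to finish the local version with the standard library's toMap, since title1 is an ordinary Array[(Int, String)]:

// title1 lives on the driver, so the plain Scala collections API applies
val titleMapLocal: Map[Int, String] = title1.toMap

To do the work on the cluster instead, keep take(2) inside map as in the titles line, which leaves the data as an RDD and makes collectAsMap() available.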

Related

Change getValuesMap of Row in Spark Scala

I am working with ForeachWriter[Row] to implement a custom Spark sink.
In the process function I want to get the value of one field as an Int.
For example, if my row is Row("city", "name", "age"), I want to get the age as an Int and the remaining fields as Strings.
def process(row: Row) = {
  val fieldNames = row.schema.fieldNames
  val rowAsMap = row.getValuesMap(fieldNames)
}
With getValuesMap every field comes back as a String.
I thought about using pattern matching instead of getValuesMap:
val rowAsMap = fieldNames.map {
  case "age" => row.getAs[Int]("age")
  case _ => row.getAs[String]
}.toMap
This is not working: age is always written as a String in the sink. Any help or ideas on how to get values of the expected types out of a Row?
Could you add details on "not working"? Does it still return "age" as a String, throw an exception, or do some other problems occur?
Overall, your solution seems OK, though I'm not sure about the toMap call at the end - you're not providing a key for the map. Maybe try something like:
val rowAsMap = fieldNames.map {
  case "age" => "age" -> row.getAs[Int]("age")
  case rowName => rowName -> row.getAs[String](rowName)
}.toMap
I am not sure why you are putting that type-casting logic inside ForeachWriter[Row]. If you want age to be an Int, isn't it the caller's responsibility to make age an Int in the Row's schema?
Also, I don't think there is a need for doing:
val rowAsMap = fieldNames.map {
  case "age" => row.getAs[Int]("age")
  case _ => row.getAs[String]
}.toMap
row.getValuesMap(fieldNames) does the same thing.
Please check the source code of getValuesMap.
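Putting the pieces together, a minimal sketch of such a sink (the class name MySink is hypothetical, and it assumes the stream really delivers age as an integer column):

import org.apache.spark.sql.{ForeachWriter, Row}

class MySink extends ForeachWriter[Row] {
  def open(partitionId: Long, version: Long): Boolean = true

  def process(row: Row): Unit = {
    val fieldNames = row.schema.fieldNames
    // build a Map[String, Any]; each value keeps the type it has in the Row
    val rowAsMap: Map[String, Any] = fieldNames.map {
      case "age" => "age" -> row.getAs[Int]("age")
      case other => other -> row.getAs[String](other)
    }.toMap
    // ... write rowAsMap to the external system here ...
  }

  def close(errorOrNull: Throwable): Unit = ()
}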

Converting Iterable[(Double, Double)] to Seq(Seq(Double))

I want to convert the values of my pair RDD "myRDD" from Iterable[(Double,Double)] to Seq(Seq(Double)); however, I am not sure how to do it. I tried the following, but it does not work.
val groupedrdd: RDD[(BB, Iterable[(Double, Double)])] = RDDofPoints.groupByKey()
val RDDofSeq = groupedrdd.mapValues { case (x, y) => Seq(x, y) }
This grouped RDD is formed using a groupByKey operation on RDDofPoints, with each point's bounding box as the key. BB is a case class and serves as the key for a set of points of type (Double, Double). I want RDDofSeq to have the type RDD[(BB, Seq[Seq[Double]])]; however, after groupByKey the RDD has the type RDD[(BB, Iterable[(Double, Double)])].
This gives the error:
Error:(107, 58) constructor cannot be instantiated to expected type;
found : (T1, T2)
required: Iterable[(Double, Double)]
I am new to Scala, any help in this regard is appreciated. Thanks.
ANSWER: The following accomplishes the above goal:
val RDDofSeq = groupedrdd.mapValues(iterable => iterable.toSeq.map { case (x, y) => Seq(x, y) })
I tried this on Scalafiddle
val myRDD: Iterable[(Double,Double)] = Seq((1.1, 1.2), (2.1, 2.2))
val RDDofSeq = myRDD.map{case (x,y) => Seq(x,y)}
println(RDDofSeq) // returns List(List(1.1, 1.2), List(2.1, 2.2))
The only difference is that I used myRDD.map(...) instead of myRDD.mapValues(...).
Make sure that myRDD is really of the type Iterable[(Double,Double)]!
Update after comment:
If I understand you correctly you want a Seq[Double] and not a Seq[Seq[Double]]
That would be this:
val RDDofSeq = myRDD.map{case (k,v) => v} // returns List(1.2, 2.2)
Update now that the type is clear:
The values are of type Iterable[(Double,Double)] so you cannot match on a pair.
Try this:
val RDDofSeq = groupedrdd.mapValues { iterable =>
  Seq(iterable.head._1, iterable.head._2)
}
You just need map, not mapValues.
val RDDofSeq = myRDD.map{case (x,y) => Seq(x,y)}
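Putting the accepted answer into runnable form, a small sketch (the BB definition and the sample points are assumptions, and sc is an active SparkContext):

import org.apache.spark.rdd.RDD

case class BB(id: Int)

val RDDofPoints: RDD[(BB, (Double, Double))] =
  sc.parallelize(Seq(BB(1) -> (1.1, 1.2), BB(1) -> (2.1, 2.2), BB(2) -> (3.1, 3.2)))

val groupedrdd: RDD[(BB, Iterable[(Double, Double)])] = RDDofPoints.groupByKey()

// each Iterable[(Double, Double)] becomes a Seq[Seq[Double]]
val RDDofSeq: RDD[(BB, Seq[Seq[Double]])] =
  groupedrdd.mapValues(_.toSeq.map { case (x, y) => Seq(x, y) })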

How to get datatype of column in spark dataframe dynamically

I have a DataFrame and converted its dtypes to a map.
val dfTypesMap: Map[String, String] = df.dtypes.toMap
Output:
(PRODUCT_ID,StringType)
(PRODUCT_ID_BSTP_MAP,MapType(StringType,IntegerType,false))
(PRODUCT_ID_CAT_MAP,MapType(StringType,StringType,true))
(PRODUCT_ID_FETR_MAP_END_FR,ArrayType(StringType,true))
When I hardcode the type [String] in row.getAs[String], there is no compilation error.
df.foreach(row => {
  val prdValue = row.getAs[String]("PRODUCT_ID")
})
I want to iterate over the map dfTypesMap above and get the corresponding value type for each column. Is there any way to convert the DataType column types to general Scala types like below?
StringType --> String
MapType(StringType,IntegerType,false) ---> Map[String,Int]
MapType(StringType,StringType,true) ---> Map[String,String]
ArrayType(StringType,true) ---> List[String]
As mentioned, Datasets make it easier to work with types.
Dataset is basically a collection of strongly-typed JVM objects.
You can map your data to case classes like so
case class Foo(PRODUCT_ID: String, PRODUCT_NAME: String)
val ds: Dataset[Foo] = df.as[Foo]
Then you can safely operate on your typed objects. In your case you could do
ds.foreach(foo => {
  val prdValue = foo.PRODUCT_ID
})
For more on Datasets, check out
https://spark.apache.org/docs/latest/sql-programming-guide.html#creating-datasets
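If you do need the dynamic mapping asked for above rather than a typed Dataset, a minimal sketch (it only covers the handful of types from the question and falls back to Spark's own type name otherwise) can pattern match on DataType directly:

import org.apache.spark.sql.types._

def toScalaTypeName(dt: DataType): String = dt match {
  case StringType                => "String"
  case IntegerType               => "Int"
  case MapType(k, v, _)          => s"Map[${toScalaTypeName(k)}, ${toScalaTypeName(v)}]"
  case ArrayType(elementType, _) => s"List[${toScalaTypeName(elementType)}]"
  case other                     => other.simpleString
}

// df.schema exposes the DataType objects themselves, unlike df.dtypes which only gives their string form
df.schema.fields.foreach(f => println(s"${f.name} -> ${toScalaTypeName(f.dataType)}"))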

Change schema of Spark Dataframe

I have a DataFrame[SimpleType]. SimpleType is a class that contains 16 fields, but I have to change it into a DataFrame[ComplexType].
I only have the schema of ComplexType (it has more than 400 fields); there is no case class for this type. I know how to map the necessary fields (but I don't know how to map from DataFrame[SimpleType] to DataFrame[ComplexType]); the rest of the fields I want to leave as nulls. Does anyone know how to do this in the most efficient way?
Thanks
edit
class SimpleType{
field1
field2
field3
field4
.
.
.
field16
}
I have a DataFrame that contains this simple type, and I also have the schema of the complex type.
I want to convert this DataFrame[SimpleType] to DataFrame[ComplexType].
It's quite simple:
// function to get the field names of a case class
import scala.reflect.runtime.universe._
import org.apache.spark.sql.functions.{col, lit}

def classAccessors[T: TypeTag]: List[String] =
  typeOf[T].members.collect { case m: MethodSymbol if m.isCaseAccessor => m }
    .toList
    .map(_.name.toString)

val typeComplexFields = classAccessors[ComplexType]

// select every ComplexType field: reuse the column if simpleDF has it, otherwise fill it with null
val newDataFrame = simpleDF
  .select(typeComplexFields.map(c =>
    if (simpleDF.columns.contains(c)) col(c) else lit(null).as(c)): _*)
  .as[ComplexType]
Credit also to the author of "Scala. Get field names list from case class"; I copied their function to get the field names, with modifications.
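Since the question says there is no case class for ComplexType, a hedged variant (the helper name widenToSchema is hypothetical) can work from the StructType schema alone instead of using reflection:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructType

def widenToSchema(simpleDF: DataFrame, complexSchema: StructType): DataFrame =
  simpleDF.select(complexSchema.fields.map { f =>
    // reuse existing columns cast to the target type; fill everything else with typed nulls
    if (simpleDF.columns.contains(f.name)) col(f.name).cast(f.dataType)
    else lit(null).cast(f.dataType).as(f.name)
  }: _*)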

Using contains in scala - exception

I am encountering this error:
java.lang.ClassCastException: scala.collection.immutable.$colon$colon cannot be cast to [Ljava.lang.Object;
whenever I try to use "contains" to find if a string is inside an array. Is there a more appropriate way of doing this? Or, am I doing something wrong? (I am fairly new to Scala)
Here is the code:
val matches = Set[JSONObject]()
val config = new SparkConf()
val sc = new SparkContext("local", "SparkExample", config)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val ebay = sqlContext.read.json("/Users/thomassquires/Downloads/products.json")
val catalogue = sqlContext.read.json("/Users/thomassquires/Documents/catalogue2.json")
val eins = ebay.map(item => (item.getAs[String]("ID"), Option(item.getAs[Set[Row]]("itemSpecifics"))))
  .filter(item => item._2.isDefined)
  .map(item => (item._1, item._2.get.find(x => x.getAs[String]("k") == "EAN")))
  .filter(x => x._2.isDefined)
  .map(x => (x._1, x._2.get.getAs[String]("v")))
  .collect()
def catEins = catalogue.map(r => (r.getAs[String]("_id"), Option(r.getAs[Array[String]]("item_model_number"))))
  .filter(r => r._2.isDefined)
  .map(r => (r._1, r._2.get))
  .collect()
def matched = for(ein <- eins) yield (ein._1, catEins.filter(z => z._2.contains(ein._2)))
The exception occurs on the last line. I have tried a few different variants.
My data structures are one List[Tuple2[String, String]] and one List[Tuple2[String, Array[String]]]. I need to find the zero or more entries in the second list whose array contains the string.
Thanks
Long story short (there is still a part that eludes me here*): you're using the wrong types. getAs is implemented as fieldIndex (String => Int) followed by get (Int => Any) followed by asInstanceOf.
Since Spark doesn't use Arrays or Sets but WrappedArray to store array column data, calls like getAs[Array[String]] or getAs[Set[Row]] are not valid. If you want specific types you should use getAs[Seq[T]] or getSeq[T] and convert your data to the desired type with toSet / toArray.
* See Why wrapping a generic method call with Option defers ClassCastException?
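A hedged sketch of that fix for one of the two calls, reusing the column names from the original code:

// read the array column back as a Seq, then convert to an Array only if it is really needed later
def catEins = catalogue
  .map(r => (r.getAs[String]("_id"), Option(r.getAs[Seq[String]]("item_model_number"))))
  .filter(r => r._2.isDefined)
  .map(r => (r._1, r._2.get.toArray))
  .collect()

The same change applies to the other call: getAs[Seq[Row]]("itemSpecifics"), followed by toSet if a Set is really wanted, instead of getAs[Set[Row]].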