I am trying to use the join method between 2 RDDs and save the result to Cassandra, but my code doesn't work. At the beginning I had one huge main method and everything worked well, but when I moved the code into a function and a class it stopped working. I am new to Scala and Spark.
The code is:
class Migration extends Serializable {

  case class userId(offerFamily: String, bp: String, pdl: String) extends Serializable
  case class siteExternalId(site_external_id: Option[String]) extends Serializable
  case class profileData(begin_ts: Option[Long], Source: Option[String]) extends Serializable

  def SparkMigrationProfile(sc: SparkContext) = {

    val test = sc.cassandraTable[siteExternalId](KEYSPACE, TABLE)
      .keyBy[userId]
      .filter(x => x._2.site_external_id != None)

    val profileRDD = sc.cassandraTable[profileData](KEYSPACE, TABLE)
      .keyBy[userId]

    // doesn't work
    test.join(profileRDD)
      .foreach(println)

    // doesn't work either
    test.join(profileRDD)
      .saveToCassandra(keyspace, table)
  }
}
At the beginning I get the famous: Exception in thread "main" org.apache.spark.SparkException: Task not serializable at . . .
So I made my main class extend Serializable, and the case classes as well, but it still doesn't work.
I think you should move your case classes from the Migration class to a dedicated file and/or object. This should solve your problem. Note also that Scala case classes are serializable by default.
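As a minimal sketch of that refactoring (keeping the names from the question, and assuming KEYSPACE/TABLE are constants defined elsewhere, as in the original code), the case classes live at the top level so their instances carry no reference to the enclosing Migration instance:

import org.apache.spark.SparkContext
import com.datastax.spark.connector._

// Case classes defined outside Migration; case classes are already Serializable.
case class userId(offerFamily: String, bp: String, pdl: String)
case class siteExternalId(site_external_id: Option[String])
case class profileData(begin_ts: Option[Long], Source: Option[String])

class Migration extends Serializable {
  def sparkMigrationProfile(sc: SparkContext): Unit = {
    val test = sc.cassandraTable[siteExternalId](KEYSPACE, TABLE)
      .keyBy[userId]
      .filter(_._2.site_external_id.isDefined)

    val profileRDD = sc.cassandraTable[profileData](KEYSPACE, TABLE)
      .keyBy[userId]

    // The join should no longer trip "Task not serializable",
    // since no inner classes of Migration are captured in the closures.
    test.join(profileRDD).foreach(println)
    test.join(profileRDD).saveToCassandra(KEYSPACE, TABLE)
  }
}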
I have the following classes:
case class AucLog(timestamp: UUID, modelname: String, good: Int,
                  list: List[Double])

class AucDatabase(override val connector: CassandraConnection)
  extends Database[AucDatabase](connector) {
  object users extends CMetrics with Connector
}

object AucDatabase extends AucDatabase(AucConnector.connector)

abstract class AucMetrics extends Table[AucMetrics, AucLog] {
  object id extends UUIDColumn with PartitionKey
  object name extends StringColumn
  object ud extends IntColumn
  object zob extends ListColumn[Double]
}

abstract class CMetrics extends AucMetrics with RootConnector {
  def store(metric: AucLog): Future[ResultSet] = {
    insert.value(_.id, metric.timestamp)
      .value(_.name, metric.modelname)
      .value(_.ud, metric.good)
      .value(_.zob, metric.list)
      .consistencyLevel_=(ConsistencyLevel.ONE)
      .future()
  }
}
DmpDatabase.create()
AucDatabase.create()

val pd = DmpDatabase.users.myselect()
val timeout = new Timeout(500000)
val result = Await.result(pd, timeout.duration)

// <--- this attempt to read from my database is working - no problemo --->
val todf = result.records.map { elem => elem.idcat }
val rdd = spark.sparkContext.parallelize(todf)
import spark.implicits._
rdd.toDF().show(100)

// I'm storing one line in my database to be sure that it is not empty when I read it.
AucDatabase.users.store(new AucLog(UUIDs.timeBased(), "tyron", 0, List(0.1)))

val second = AucDatabase.users.myselect()
val resultmetric = Await.result(second, timeout.duration)

// -----> this line causes the exception
val r = spark.sparkContext.parallelize(resultmetric.records).toDF().show()
What I do not understand is that I'm doing basically the same thing with both databases, yet one of them throws the following error: UnsupportedOperationException: No encoder found for com.outworkers.phantom.dsl.UUID.
Thank you.
First of all, the store method is macro-generated, so you don't need to write one yourself. The problem you are having is likely not related to phantom at all, but to some kind of Spark construct.
The phantom UUID is nothing more than a type alias for java.util.UUID, so I'm quite surprised there is no straight-up encoder for such a default type. If you give me the full name of the Encoder class, including its package, I can figure out exactly what is broken.
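If it does turn out to be Spark's Dataset encoder derivation failing on the UUID field, there are a couple of workarounds. This is only a sketch, assuming resultmetric.records is a Seq[AucLog] as in the question and spark is your SparkSession:

import org.apache.spark.sql.{Encoder, Encoders}
import spark.implicits._

// Workaround 1: flatten the UUID to a String before building the DataFrame,
// so the derived encoder only has to deal with primitive-friendly fields.
val flattened = resultmetric.records.map(r => (r.timestamp.toString, r.modelname, r.good, r.list))
spark.sparkContext.parallelize(flattened).toDF("timestamp", "modelname", "good", "list").show()

// Workaround 2: encode the whole case class with Kryo (the resulting Dataset has a
// single opaque binary column, so this only helps for Dataset-style processing).
implicit val aucLogEncoder: Encoder[AucLog] = Encoders.kryo[AucLog]
spark.createDataset(resultmetric.records).show()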
I am trying to read an Avro file into the case class UserItemIds, which contains a field of case class type User (sbt, Scala 2.11):
case class User(id: Long, description: String)
case class UserItemIds(user: User, itemIds: List[Long])
val UserItemIdsInputStream = env.createInput(new AvroInputFormat[UserItemIds](user_item_ids_In, classOf[UserItemIds]))
UserItemIdsInputStream.print()
but I receive the error:
Caused by: java.lang.NoSuchMethodException: schema.User.<init>()
Can anyone guide me on how to work with these types, please? This example uses Avro files, but it could just as well be Parquet or any custom DB input.
Do I need to use TypeInformation? For example something like this, and if yes, how do I do so?
val tupleInfo: TypeInformation[(User, List[Long])] = createTypeInformation[(User, List[Long])]
I also saw env.registerType(); does it relate to the issue at all? Any help is greatly appreciated.
I found that the usual Java fix for this error is adding a default constructor. In this case I tried adding a factory method to the Scala case class by putting it in a companion object:
object UserItemIds {
  case class UserItemIds(user: User, itemIds: List[Long])

  def apply(user: User, itemIds: List[Long]) = new UserItemIds(user, itemIds)
}
but this did not resolve the issue.
You have to add a default constructor for the User and UserItemIds types. It could look like this:
case class User(id: Long, description: String) {
def this() = this(0L, "")
}
case class UserItemIds(user: User, itemIds: List[Long]) {
def this() = this(new User(), List())
}
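With those no-argument constructors in place, the read from the question should work unchanged. A rough usage sketch with the batch API (the import path for AvroInputFormat depends on your Flink version, and user_item_ids_In is assumed to be the same org.apache.flink.core.fs.Path as in the question):

import org.apache.flink.api.scala._
import org.apache.flink.formats.avro.AvroInputFormat

val env = ExecutionEnvironment.getExecutionEnvironment

// Avro instantiates UserItemIds/User reflectively via the no-arg constructors added above.
val userItemIdsInput = env.createInput(
  new AvroInputFormat[UserItemIds](user_item_ids_In, classOf[UserItemIds]))
userItemIdsInput.print()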
I want to list all the case classes which implement a particular trait. I am currently using Clapper ClassUtil for this. I am able to get the case classes that directly implement a trait; however, I am not able to get the classes that implement the trait only indirectly. How can I get all classes which directly or indirectly implement a trait?
val finder = ClassFinder()
finder.getClasses().filter(_.isConcrete).filter(_.implements("com.myapp.MyTrait"))
Scala Version : 2.11
Clapper Class Util Version : 1.0.6
Is there any other way I can get this information? Can someone point me in the right direction?
I tried using scala.reflect but could not figure out how to get it.
EDIT:
Sample traits and usages:
trait BaseEntity
trait NamedEntity { val name: String }
trait MasterDataEntity extends NamedEntity

case class Department(id: Long, override val name: String) extends MasterDataEntity
case class Employee(id: Long, name: String) extends BaseEntity
case class User(id: Long, override val name: String) extends NamedEntity
Now, if I give the trait as NamedEntity, I should get both Department and User, since both implement NamedEntity directly or indirectly. With the implements method I only get User. I also tried the interfaces method, but it likewise only provides the direct superclasses.
Looking at the source code, the problem seems to be that it doesn't follow the interface hierarchy. If you do that yourself, you find all the implementors:
package foo

import java.io.File
import org.clapper.classutil.{ClassFinder, ClassInfo}

object Main extends App {
  val jar     = new File("target/scala-2.11/class_test_2.11-0.1.0.jar")
  val finder  = ClassFinder(jar :: Nil)
  val classes = ClassFinder.classInfoMap(finder.getClasses().iterator)
  val impl    = find("foo.NamedEntity", classes)
  impl.foreach(println)

  def find(ancestor: String, classes: Map[String, ClassInfo]): List[ClassInfo] =
    classes.get(ancestor).fold(List.empty[ClassInfo]) { ancestorInfo =>
      val ancestorName = ancestorInfo.name

      def compare(info: ClassInfo): Boolean =
        info.name == ancestorName ||
          (info.superClassName :: info.interfaces).exists {
            n => classes.get(n).exists(compare)
          }

      val it = classes.valuesIterator
      it.filter { info => info.isConcrete && compare(info) }.toList
    }
}
ClassUtil now contains this functionality (v1.4.0, maybe also in earlier versions):
val finder = ClassFinder()
val impl = ClassFinder.concreteSubclasses("foo.NamedEntity", finder.getClasses())
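For the sample hierarchy from the question, either approach should now report the indirect implementors as well; for example, printing the matches:

// Expected to include both Department (via MasterDataEntity) and User for NamedEntity.
impl.foreach(info => println(info.name))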
Here is a runnable demo:
import java.io._
object Main extends App {
case class Value(s: String)
val serializer = new ObjectOutputStream(new ByteArrayOutputStream())
serializer.writeObject(Value("123"))
println("success") //> success
}
Please note that the program succeeds even though I do not mark my class with Serializable. Does Serializable make sense in Scala?
Case classes extend Serializable by default in Scala. If you create a regular class, it will need to extend Serializable; otherwise serializing an instance will throw an error.
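As a quick illustration of that difference (class names here are made up for the demo, not from the question):

import java.io._

class PlainValue(val s: String)                        // ordinary class, not Serializable
class MarkedValue(val s: String) extends Serializable  // explicitly Serializable

object SerializationDemo extends App {
  val out = new ObjectOutputStream(new ByteArrayOutputStream())

  out.writeObject(new MarkedValue("ok"))               // fine
  println("MarkedValue written")

  try out.writeObject(new PlainValue("boom"))          // throws java.io.NotSerializableException
  catch { case e: NotSerializableException => println(s"PlainValue failed: $e") }
}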
I am consistently having errors regarding Task not serializable.
I have made a small class and it extends Serializable, which I believe is what is required when you need values in it to be serialized.
class SGD(filePath: String) extends Serializable {

  val rdd = sc.textFile(filePath)

  val mappedRDD = rdd.map(x => x.split(" ").slice(0, 3))
    .map(y => Rating(y(0).toInt, y(1).toInt, y(2).toDouble))
    .cache

  val RNG = new Random(1)

  val factorsRDD = mappedRDD.map(x => (x.user, (x.product, x.rating)))
    .groupByKey
    .mapValues(listOfItemsAndRatings =>
      Vector(Array.fill(2){RNG.nextDouble}))
}
The final line always results in a Task not serializable error. What I do not understand is: the class is Serializable, and the class Random is also Serializable according to the API. So what am I doing wrong? I consistently can't get things like this to work; therefore, I imagine my understanding is wrong. I keep being told the class must be Serializable... well, it is, and it still doesn't work!?
scala.util.Random was not Serializable until 2.11.0-M2.
Most likely you are using an earlier version of Scala.
A class doesn't become Serializable until all its members are Serializable as well (or some other mechanism is provided to serialize them, e.g. @transient or readObject/writeObject).
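As a sketch of the transient route mentioned above (the field placement is illustrative, not the exact code from the question): marking the generator @transient lazy val keeps it out of the serialized instance and re-creates it (with the same seed) wherever it is first used, e.g. on each executor.

import scala.util.Random

class SGD(filePath: String) extends Serializable {
  // Excluded from serialization; lazily re-initialized on first access.
  @transient lazy val RNG = new Random(1)
}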
I get the following stack trace when running the given example on Spark 1.3:
Caused by: java.io.NotSerializableException: scala.util.Random
Serialization stack:
- object not serializable (class: scala.util.Random, value: scala.util.Random#52bbf03d)
- field (class: $iwC$$iwC$SGD, name: RNG, type: class scala.util.Random)
One way to fix it is to move the instantiation of the random variable inside mapValues:
mapValues { listOfItemsAndRatings =>
  val RNG = new Random(1)
  Vector(Array.fill(2)(RNG.nextDouble))
}