I am reading a fixed-width text file which I need to convert to CSV. My program works fine on my local machine, but when I run it on a cluster it throws a "Task not serializable" exception.
I tried to solve the same problem with both map and mapPartitions.
It works fine using toLocalIterator on the RDD, but that doesn't work with large files (I have files of 8 GB).
Below is the code using mapPartitions which I recently tried:
//reading source file and creating RDD
def main() {
  var inpData = sc.textFile(s3File)
  LOG.info(s"\n inpData >>>>>>>>>>>>>>> [${inpData.count()}]")

  val rowRDD = inpData.mapPartitions(iter => {
    var listOfRow = new ListBuffer[Row]
    while (iter.hasNext) {
      var line = iter.next()
      if (line.length() >= maxIndex) {
        listOfRow += getRow(line, indexList)
      } else {
        counter += 1
      }
    }
    listOfRow.toIterator
  })

  rowRDD.foreach(println)
}
case class StartEnd(startingPosition: Int, endingPosition: Int) extends Serializable

def getRow(x: String, inst: List[StartEnd]): Row = {
  val columnArray = new Array[String](inst.size)
  for (f <- 0 to inst.size - 1) {
    columnArray(f) = x.substring(inst(f).startingPosition, inst(f).endingPosition)
  }
  Row.fromSeq(columnArray)
}
// Note: for your reference, indexList is built using the StartEnd case class and looks like this after creation:
[List(StartEnd(0,4), StartEnd(4,10), StartEnd(7,12), StartEnd(10,14))]
This program works fine on my local machine, but when I run it on a cluster (AWS) it throws the exception shown below.
>>>>>>>>Map(ResultantDF -> [], ExceptionString ->
Exception occurred while applying the FileConversion transformation and the exception Message is :Task not serializable
Exception occurred while applying the FileConversion transformation and the exception Message is :Task not serializable)
[Driver] TRACE reflection.ReflectionInvoker$.invokeDTMethod - Exit
I am not able to understand what's wrong here, what is not serializable, and why this exception is thrown.
Any help is appreciated.
Thanks in advance!
You call the getRow method inside a Spark mapPartitions transformation. That forces Spark to pass an instance of your main class to the workers. The main class contains LOG as a field, and that logger is apparently not serialization-friendly.
You can:
a) move getRow and LOG to a different object (the general way to solve such issues)
b) make LOG a lazy val
c) use another logging library
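For option (a), a minimal sketch of what that could look like, assuming slf4j as the logging library (RowParser is just a placeholder name; StartEnd, indexList and maxIndex stay as you have them):

import org.apache.spark.sql.Row
import org.slf4j.LoggerFactory

// Standalone helper object: everything the closure needs lives here, so Spark no
// longer has to serialize the enclosing main class (and its LOG field).
object RowParser {

  // transient + lazy, so the logger is re-created on each executor instead of
  // being serialized (this combines options a and b).
  @transient lazy val LOG = LoggerFactory.getLogger(getClass)

  def getRow(x: String, inst: List[StartEnd]): Row = {
    val columnArray = new Array[String](inst.size)
    for (f <- inst.indices) {
      columnArray(f) = x.substring(inst(f).startingPosition, inst(f).endingPosition)
    }
    Row.fromSeq(columnArray)
  }
}

// In the transformation, call the object instead of a method on the main class:
// val rowRDD = inpData.mapPartitions { iter =>
//   iter.filter(_.length >= maxIndex).map(line => RowParser.getRow(line, indexList))
// }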
I've written a try-catch construct which attempts to read parquet files based on an S3 path (paths) obtained from an earlier function. The try-catch construct will change the path if the try block fails and will then try again using the new path:
val rawDF: DataFrame = {
  try {
    spark.read.format("parquet").load(paths)
  } catch {
    case NonFatal(e) =>
      Thread.sleep(3600000)
      val hour = LocalDateTime.now().format(hourParser)
      val date = LocalDateTime.now().format(dateParser)
      val paths = f"s3a://twitter-kafka-app/processed-data/date=$date/hour=$hour/*"
      try {
        spark.read.format("parquet").load(paths)
      } catch {
        case NonFatal(e) =>
          None
          print("No path found.")
      }
  }
}

rawDF.show()
All of these blocks work fine except the final catch block, which causes a type mismatch because it returns Unit. I've had to declare rawDF as a DataFrame for rawDF.show() to work. I've tried adding null because I was under the assumption that this is how you get around this kind of problem, but it just throws a NullPointerException, even though I know the first try block should successfully return the DataFrame.
Is there an easy way around this that I'm missing? Thanks.
Much of this has already been said, but using a Try is the better option to get rid of the try-catch blocks. It also signals that the operation might fail, while we're not really interested in what it throws as long as it's non-fatal and we can recover from it. Either could also be used, but it's semantically different.
The only question that remains is how you will treat the failure. I would return an empty DataFrame. Something like:
def tryOfRawDF(paths: String): Try[DataFrame] = Try {
  spark.read.format("parquet").load(paths)
}

val rawDF: DataFrame = tryOfRawDF(paths).getOrElse {
  Thread.sleep(3600000)
  val hour = LocalDateTime.now().format(hourParser)
  val date = LocalDateTime.now().format(dateParser)
  val paths = f"s3a://twitter-kafka-app/processed-data/date=$date/hour=$hour/*"
  tryOfRawDF(paths).getOrElse {
    println("No path found.")
    spark.emptyDataFrame // there is no DataFrame.empty; the SparkSession can create an empty DataFrame
  }
}

rawDF.show()
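If you would rather keep the retry chained in a single expression, Try#recoverWith reads naturally as well. A sketch under the same assumptions (spark, hourParser, dateParser and the initial paths are in scope):

import java.time.LocalDateTime
import scala.util.Try
import scala.util.control.NonFatal
import org.apache.spark.sql.DataFrame

val rawDF: DataFrame = Try(spark.read.format("parquet").load(paths))
  .recoverWith { case NonFatal(_) =>
    Thread.sleep(3600000)
    val hour = LocalDateTime.now().format(hourParser)
    val date = LocalDateTime.now().format(dateParser)
    val retryPaths = f"s3a://twitter-kafka-app/processed-data/date=$date/hour=$hour/*"
    Try(spark.read.format("parquet").load(retryPaths))
  }
  .getOrElse {
    println("No path found.")
    spark.emptyDataFrame
  }

rawDF.show()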
I'm trying to understand how serialization works in the case of a self-constructed case class and a parser in a separate object -- and I'm failing.
I tried to boil down the problem to:
parsing a string into case classes
constructing an RDD from those
taking the first element in order to print it
case class article(title: String, text: String) extends Serializable {
  override def toString = title + s"/" + text
}

object parser {
  def parse(line: String): article = {
    val subs = "</end>"
    val i = line.indexOf(subs)
    val title = line.substring(6, i)
    val text = line.substring(i + subs.length, line.length)
    article(title, text)
  }
}

val text = """"<beg>Title1</end>Text 1"
"<beg>Title2</end>Text 2"
"""

val lines = text.split('\n')
val res = lines.map(line => parser.parse(line))

val rdd = sc.parallelize(res)
rdd.take(1).map(println)
I get a
Job aborted due to stage failure: Failed to serialize task, not attempting to retry it. Exception during serialization: java.io.NotSerializableException
Can a gifted Scala expert please help me -- just so I understand the interaction of serialization between workers and master -- to fix the parser / article interaction such that serialization works?
Thank you very much.
In your map function, lines.map(line => parser.parse(line)), you call parser.parse, and parser is your object, which is not serializable. Spark internally uses partitions that are spread across the cluster. The map function is called on each partition. Because the partitions do not live in the same JVM process, the function called on each partition needs to be serializable; that is why your parser object has to obey the same rule.
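A minimal sketch of how that could look, keeping your parsing logic as-is (which variant you need depends on where the code runs, e.g. spark-shell vs. a compiled job):

// Variant 1: make the helper object explicitly Serializable so the closure that
// references it can be shipped to the executors.
object parser extends Serializable {
  def parse(line: String): article = {
    val subs = "</end>"
    val i = line.indexOf(subs)
    article(line.substring(6, i), line.substring(i + subs.length))
  }
}

// Variant 2: parallelize the raw strings and parse on the executors, so only the
// parse function (not driver-side pre-built objects) travels with the task.
val rdd = sc.parallelize(lines).map(parser.parse)
rdd.take(1).foreach(println)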
I'm trying Flink and wrote the following example program:
object IFJob {

  @SerialVersionUID(1L)
  final class StringInputFormat extends GenericInputFormat[String] {

    val N = 100
    var i = 0L

    override def reachedEnd(): Boolean = this.synchronized {
      i >= N
    }

    override def nextRecord(ot: String): String = this.synchronized {
      i += 1
      return (i % 2) + ""
    }
  }

  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val text: DataSet[String] = env.createInput(new StringInputFormat())

    val map = text.map {
      (_, 1)
    }
    // map.print()

    val by = map.groupBy(0)
    val aggregate: AggregateDataSet[(String, Int)] = by.aggregate(Aggregations.SUM, 1)
    aggregate.print()
  }
}
I am creating a StringInputFormat once and reading it in parallel (with the default parallelism of 8).
When I run the above program, the results vary between executions, i.e., they are not deterministic. The results are duplicated between 1 and 8 times.
For example I get the following results:
// first run
(0,150)
(1,150)
// second run
(0,50)
(1,50)
// third run
(0,200)
(1,200)
The expected result would be
(0,400)
(1,400)
because the StringInputFormat should generate 8 × 50 "0" records and 8 × 50 "1" records.
I even added synchronization to the input format, but it didn't help.
What am I missing in the Flink computation model?
The behavior you observe is the result of how Flink assigns work to an InputFormat. This works as follows:
On the master (JobManager), the createInputSplits() method is called which returns an array of InputSplit. An InputSplit is a chunk of data to read (or generate). The GenericInputFormat creates one InputSplit for each parallel task instance. In your case, it creates 8 InputSplit objects and each InputSplit should generate 50 "1" and 50 "0" records.
The parallel instances of a DataSourceTask are started on the workers (TaskManagers). Each DataSourceTask has its own instance of the InputFormat.
Once started, a DataSourceTask requests an InputSplit from the master and calls the open() method of its InputFormat with that InputSplit. When the InputFormat has finished processing the InputSplit, the DataSourceTask requests a new one from the master.
In your case, each InputSplit is processed very quickly. Hence, there is a race between the DataSourceTasks requesting InputSplits for their InputFormats, and some InputFormats process more than one InputSplit. Since an InputFormat does not reset its internal state (i.e., set i = 0) when it opens a new InputSplit, it will only generate data for the first InputSplit it processes.
You can fix this by adding this method to the StringInputFormat:
override def open(split: GenericInputSplit): Unit = {
  super.open(split)
  i = 0
}
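With that reset in place, every InputSplit an instance happens to process emits its 50 "0" and 50 "1" records again, so you get the expected (0,400) and (1,400) regardless of how the 8 splits are distributed across the DataSourceTasks.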
EDIT2:
So another heads up on this:
I still have no idea why this happens, but I now have a similar problem with jOOQ and the dialect I pass to it. My code here looks like this:
object MyDB {
  private lazy val dialect = SQLDialect.POSTGRES

  def withSession[T](f: DSLContext => T) = f(DSL.using(getConnectionPool, dialect))
}
If I remove the "lazy", it blows up when I try to execute jOOQ queries, at line 552 of https://github.com/jOOQ/jOOQ/blob/version-3.2.0/jOOQ/src/main/java/org/jooq/impl/DefaultRenderContext.java
That happens to be the line where the dialect is evaluated. After I added the lazy, everything works as expected.
Maybe this is an issue with Lift's threading and the executing thread does not see the correct value of the val? I have no idea...
EDIT:
I have found a way to do what I want, simply by adding lazy to the vals in the first, broken version. So with lazy vals it all works well.
However, I'll leave this question open, as I have no idea how to explain this behavior.
Original Post:
So I am trying to use Parameterized Queries in Slick.
My code is below. My problem is that I get an NPE (see the comments) when I run this from within the web application (Lift, container started with sbt); the application creates an instance of the class PlayerListCollector with the string "cola".
When I execute the object as an App from within Eclipse, the println at the bottom works just fine.
class PlayerListCollector(term: String) {
  import PlayerListCollector._

  val searchResult = executeSearch(term)
}

object PlayerListCollector extends Loggable with App {

  private val searchNameCurrent = Parameters[String].flatMap {
    case (term) => {
      for {
        p <- Players if p.uberName.isNotNull
        n <- p.displayName if (n.displayName.toLowerCase.like(term))
      } yield (p.id, n.displayName)
    }
  }

  private def executeSearch(term: String) = {
    val lowerTerm = "%" + term.toLowerCase() + "%"
    logger info "HELLO " + lowerTerm       // prints HELLO %cola%

    val foo = searchNameCurrent(lowerTerm) // NPE right in this line
    logger info foo                        // never executed from here on ...

    val byCurrent = foo.list
    logger info byCurrent
    [...]
  }

  // this works if run directly from within eclipse!
  println(DB withSession {
    searchNameCurrent("%cola%").list
  })
}
The problem vanishes when I change the code to look like this:
[...]
object PlayerListCollector extends Loggable with App {

  private def executeSearch(term: String) = {
    val searchNameCurrent = Parameters[String].flatMap {
      case (term) => {
        for {
          p <- Players if p.uberName.isNotNull
          n <- p.displayName if (n.displayName.toLowerCase.like(term))
        } yield (p.id, n.displayName)
      }
    }

    val lowerTerm = "%" + term.toLowerCase() + "%"
    logger info "HELLO " + lowerTerm       // prints HELLO %cola%

    val foo = searchNameCurrent(lowerTerm) // executes just fine when the query is in a local val
    logger info foo

    val byCurrent = foo.list
    logger info byCurrent                  // prints expected output
    [...]
  }
  [...]
}
I have no idea whatsoever why this happens.
Isn't the whole point of a parameterized query to put it in a val that is filled with a value only once, so that it does not need to be compiled multiple times?
So it turns out I used the App-Trait (http://www.scala-lang.org/api/current/index.html#scala.App) on these objects.
Reading the big fat caveat there tells us what is happening, I guess: App is implemented with DelayedInit, so the object's fields (including searchNameCurrent) are not initialized before the main method has been executed.
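For anyone else hitting this, here is a tiny, self-contained illustration of that caveat (not the Slick code, and specific to Scala 2, where App is based on DelayedInit): vals in an object that extends App are only initialized when its main runs, so code touching the object before that sees null.

object WithApp extends App {
  val greeting = "hello" // deferred by DelayedInit until WithApp's main is invoked
}

object PlainObject {
  val greeting = "hello" // initialized as soon as the object is first referenced
}

object Demo {
  def main(args: Array[String]): Unit = {
    println(WithApp.greeting)     // prints: null  (WithApp's main never ran)
    println(PlainObject.greeting) // prints: hello
  }
}

Dropping the App trait (or, as in the EDIT above, making the vals lazy) sidesteps this, because the query val is then initialized before executeSearch touches it.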
I have an actor-based system in which I am reading an external file sitting in an S3 bucket, taking each of the file's lines, and sending it over to another actor that processes that particular line. What I am having trouble understanding is what happens when an exception is thrown while reading the file.
My code is as follows:
import akka.actor._
import akka.actor.ActorSystem

class FileWorker(processorWorker: ActorRef) extends Actor with ActorLogging {

  val fileUtils = new S3Utils()

  private def processFile(fileLocation: String): Unit = {
    try {
      fileUtils.getLinesFromLocation(fileLocation).foreach { r =>
        // Some processing happens for the line
      }
    } catch {
      case e: Exception =>
        log.error("Issue processing files from the following location %s".format(fileLocation))
    }
  }

  def receive = {
    case fileLocation: String =>
      processFile(fileLocation)
  }
}
In my S3Utils class I have defined the getLinesFromLocation method as follows:
def getLinesFromLocation(fileLocation: String): Iterator[String] = {
  try {
    for {
      fileEntry <- getFileInfo(root, fileLocation)
    } yield fileEntry
  } catch {
    case e: Exception =>
      logger.error("Issue with file location %s: %s".format(fileLocation, e.getStackTraceString))
      throw e
  }
}
The actual reading of the file happens in the private method getFileInfo:
private def getFileInfo(rootBucket: String, fileLocation: String): Iterator[String] = {
  implicit val codec = Codec(Codec.UTF8)
  codec.onMalformedInput(CodingErrorAction.IGNORE)
  codec.onUnmappableCharacter(CodingErrorAction.IGNORE)

  Source.fromInputStream(
    s3Client.getObject(rootBucket, fileLocation).getObjectContent()
  ).getLines
}
I have written the above pieces with the assumption that the underlying file sitting on S3 will not be cached anywhere and that I will simply be iterating through the individual lines in constant space and processing them. In case there's an issue with reading a particular line, the iterator will move on without affecting the Actor.
My first question would be: is my understanding of iterators correct? Am I actually reading the lines from the underlying file system (in this case the S3 bucket) without putting any pressure on memory or introducing any memory leaks?
The next question would be: if the iterator encounters an error while reading an individual entry, is the entire iteration killed, or does it move on to the next entry?
My last question would be, is my file-processing logic written correctly?
It will be great to get some insights into this.
Thanks
It looks like the Amazon S3 client has no async implementation, so we are stuck with pinned actors. Your implementation is correct, provided you allocate a thread per connection, do not block other actors, and do not use too many connections.
Important steps to take:
1) processFile should not block the current thread. Preferably it should delegate its input to another actor:
private def processFile(fileLocation: String): Unit = {
  ...
  fileUtils.getLinesFromLocation(fileLocation).foreach { r =>
    lineWorker ! FileLine(fileLocation, r)
  }
  ...
}
2) Make FileWorker a pinned actor:
# in application.conf:
my-pinned-dispatcher {
  executor = "thread-pool-executor"
  type = PinnedDispatcher
}

// in the code:
val fileWorker = context.actorOf(
  Props(classOf[FileWorker], lineWorker).withDispatcher("my-pinned-dispatcher"),
  "FileWorker")
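For completeness, FileLine and lineWorker above are placeholders; a minimal sketch of what they could look like (adjust the per-line processing to your real logic):

import akka.actor.{Actor, ActorLogging, ActorSystem, Props}

// Hypothetical message type used in step 1.
case class FileLine(fileLocation: String, line: String)

// A simple worker that handles one line at a time on the default dispatcher,
// so the blocking S3 read inside FileWorker never stalls it.
class LineWorker extends Actor with ActorLogging {
  def receive = {
    case FileLine(location, line) =>
      // ... process the line ...
      log.debug(s"processed one line from $location")
  }
}

// Wiring it together:
// val system     = ActorSystem("files")
// val lineWorker = system.actorOf(Props[LineWorker], "LineWorker")
// val fileWorker = system.actorOf(
//   Props(classOf[FileWorker], lineWorker).withDispatcher("my-pinned-dispatcher"),
//   "FileWorker")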
If the iterator encounters an error while reading an individual entry, is the entire iteration killed?
Yes, the entire iteration will be killed, and the actor will take the next job from its mailbox.
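If you want a single bad line to be skipped instead of aborting the whole file, a small sketch (reusing the hypothetical lineWorker / FileLine from above): failures in the per-line work can be contained inside the foreach, while failures thrown by the underlying S3 stream during reading will still end the iteration as described.

import scala.util.control.NonFatal

fileUtils.getLinesFromLocation(fileLocation).foreach { line =>
  try {
    lineWorker ! FileLine(fileLocation, line) // or whatever per-line work you do
  } catch {
    case NonFatal(e) => log.error(e, s"skipping one bad line from $fileLocation")
  }
}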