I am a newbie in Scala and I want to convert my Java code to Scala.
My database is Cassandra, and the column family definition looks like this:
Family name :colFam
Rowkey: rowKey1
=>(name=comkey1:comkey1,value='xyz',timestamp=1554515485)
=>(name=comkey1:comkey2,value='xyz',timestamp=1554515485)
=>(name=comkey1:comkey3,value='xyz',timestamp=1554515485)
=>(name=comkey1:comkey4,value='xyz',timestamp=1554515485)
=>(name=comkey1:comkey5,value='xyz',timestamp=1554515485)
-------------------------------------------------------
Rowkey: rowKey2
=>(name=comkey1:comkey3,value='abc',timestamp=1554515485)
-------------------------------------------------------
Rowkey: rowKey4
=>(name=comkey1:comkey4,value='pqr',timestamp=1554515485)
-------------------------------------------------------
Now I want to fetch all records from Cassandra. My column query works fine:
val sliceQuery = HFactory.createColumnQuery(rankqKeyspace, StringSerializer.get(), new CompositeSerializer(), StringSerializer.get())
The problem is in my slice query. Normally a slice query works fine with simple column names, but when I use a composite column type, it gives me an error:
var startKey = new Composite();
var endKey = new Composite();
startKey.addComponent("comkey1", se);
startKey.addComponent("comkey2", se);
endKey.addComponent("comkey1", se);
endKey.addComponent("comkey4", se);
val slicesQuery = HFactory.createSliceQuery(rankqKeyspace, se, new CompositeSerializer(), se)
slicesQuery.setColumnFamily("colFam");
slicesQuery.setKey(rowKey1)
slicesQuery.setRange(startKey, endKey, false, Integer.MAX_VALUE);
val result = slicesQuery.execute()
val orderedRows = result.get();
It gives me a "NullPointerException":
val orderedRows = result.get();
println(orderedRows)
This line always returns a null value.
Edit Question
Stack Trace
java.lang.NullPointerException
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:191)
at com.google.common.collect.ImmutableClassToInstanceMap.getInstance(ImmutableClassToInstanceMap.java:147)
at me.prettyprint.hector.api.beans.AbstractComposite.serializerForComparator(AbstractComposite.java:321)
at me.prettyprint.hector.api.beans.AbstractComposite.getSerializer(AbstractComposite.java:344)
at me.prettyprint.hector.api.beans.AbstractComposite.deserialize(AbstractComposite.java:708)
at me.prettyprint.cassandra.serializers.CompositeSerializer.fromByteBuffer(CompositeSerializer.java:29)
at me.prettyprint.cassandra.serializers.CompositeSerializer.fromByteBuffer(CompositeSerializer.java:17)
at me.prettyprint.cassandra.model.HColumnImpl.getName(HColumnImpl.java:111)
at me.prettyprint.cassandra.model.HColumnImpl.toString(HColumnImpl.java:202)
at java.lang.String.valueOf(String.java:2854)
at java.lang.StringBuilder.append(StringBuilder.java:128)
at java.util.AbstractCollection.toString(AbstractCollection.java:450)
at me.prettyprint.cassandra.model.ColumnSliceImpl.toString(ColumnSliceImpl.java:54)
at java.lang.String.valueOf(String.java:2854)
at java.io.PrintStream.println(PrintStream.java:821)
at scala.Console$.println(Console.scala:240)
at scala.Predef$.println(Predef.scala:287)
at models.PinModel$.testColCompositeKey(PinModel.scala:178)
at controllers.PinController$$anonfun$test$1.apply(PinController.scala:52)
at controllers.PinController$$anonfun$test$1.apply(PinController.scala:49)
at play.api.mvc.ActionBuilder$$anonfun$apply$10.apply(Action.scala:221)
at play.api.mvc.ActionBuilder$$anonfun$apply$10.apply(Action.scala:220)
at play.api.mvc.Action$.invokeBlock(Action.scala:357)
at play.api.mvc.ActionBuilder$$anon$1.apply(Action.scala:309)
at play.api.mvc.Action$$anonfun$apply$1$$anonfun$apply$4$$anonfun$apply$5.apply(Action.scala:109)
at play.api.mvc.Action$$anonfun$apply$1$$anonfun$apply$4$$anonfun$apply$5.apply(Action.scala:109)
at play.utils.Threads$.withContextClassLoader(Threads.scala:18)
at play.api.mvc.Action$$anonfun$apply$1$$anonfun$apply$4.apply(Action.scala:108)
at play.api.mvc.Action$$anonfun$apply$1$$anonfun$apply$4.apply(Action.scala:107)
at scala.Option.map(Option.scala:145)
at play.api.mvc.Action$$anonfun$apply$1.apply(Action.scala:107)
at play.api.mvc.Action$$anonfun$apply$1.apply(Action.scala:100)
at play.api.libs.iteratee.Iteratee$$anonfun$mapM$1.apply(Iteratee.scala:481)
at play.api.libs.iteratee.Iteratee$$anonfun$mapM$1.apply(Iteratee.scala:481)
at play.api.libs.iteratee.Iteratee$$anonfun$flatMapM$1.apply(Iteratee.scala:517)
at play.api.libs.iteratee.Iteratee$$anonfun$flatMapM$1.apply(Iteratee.scala:517)
at play.api.libs.iteratee.Iteratee$$anonfun$flatMap$1$$anonfun$apply$13.apply(Iteratee.scala:493)
at play.api.libs.iteratee.Iteratee$$anonfun$flatMap$1$$anonfun$apply$13.apply(Iteratee.scala:493)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
For anyone running into the same problem: Hector doesn't play well with the latest version of Google Guava, which results in this error. Check whether any of your libraries depend on a different version of Guava. In my case I ended up using Guava 13.0, since it also contained the parts that the other library needed.
I would recommend migrating away from Hector if at all possible, as it is no longer maintained. I've personally started moving to the new DataStax driver with CQL, and it's making a lot of things easier.
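One way to check which Guava jar actually ends up on the classpath is to ask the JVM where a class was loaded from. The sketch below uses only the standard library. The class name ClassOriginDemo and the helper originOf are made up for illustration, and com.google.common.base.Preconditions is simply the Guava class that appears in the stack trace above.

```java
import java.security.CodeSource;

class ClassOriginDemo {
    // Returns the location (usually a jar URL) a class was loaded from,
    // or a short note when the class is missing or has no code source.
    static String originOf(String className) {
        try {
            Class<?> cls = Class.forName(className);
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            return source == null
                    ? "(bootstrap or unknown)"
                    : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Run this inside the application to see which Guava jar wins.
        System.out.println(originOf("com.google.common.base.Preconditions"));
    }
}
```

If the printed jar is not the Guava version you expected, some other dependency is probably pulling in its own copy.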
I guess you need to explicitly set up a serializer by position for each component of your composite key:
var startKey = new Composite();
startKey.setSerializerByPosition(0, comkey1Serializer)
startKey.setSerializerByPosition(1, comkey2Serializer)
startKey.addComponent("comkey1", se);
startKey.addComponent("comkey2", se);
I have an Apache Flink application in which I want to filter data by country. The data is read from topic v01, and the filtered data is written to topic v02. For testing purposes I tried to write everything in uppercase.
My Code:
package org.example;
import org.apache.avro.Schema;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.formats.avro.registry.confluent.ConfluentRegistryAvroDeserializationSchema;
import org.apache.flink.formats.avro.registry.confluent.ConfluentRegistryAvroSerializationSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import java.util.Properties;
public class KafkaRead {
public static void main(String[] args) throws Exception {
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "192.168.100.100:9092");
properties.setProperty("group.id", "luft");
String schemaRegistryUrl = "http://192.168.100.100:8081";
String valueSchema = "{\"connect.name\":\"com.github.jcustenborder.kafka.connect.model.Value\",\"fields\":[{\"default\":null,\"name\":\"Datum\",\"type\":[\"null\",\"string\"]},{\"default\":null,\"name\":\"Country\",\"type\":[\"null\",\"string\"]},{\"default\":null,\"name\":\"City\",\"type\":[\"null\",\"string\"]},{\"default\":null,\"name\":\"Specie\",\"type\":[\"null\",\"string\"]},{\"default\":null,\"name\":\"count\",\"type\":[\"null\",\"string\"]},{\"default\":null,\"name\":\"min\",\"type\":[\"null\",\"string\"]},{\"default\":null,\"name\":\"max\",\"type\":[\"null\",\"string\"]},{\"default\":null,\"name\":\"median\",\"type\":[\"null\",\"string\"]},{\"default\":null,\"name\":\"variance\",\"type\":[\"null\",\"string\"]}],\"name\":\"Value\",\"namespace\":\"com.github.jcustenborder.kafka.connect.model\",\"type\":\"record\"}";
Schema schema = new Schema.Parser().parse(valueSchema);
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
FlinkKafkaConsumer kafkaConsumer = new FlinkKafkaConsumer<> ("v01", ConfluentRegistryAvroDeserializationSchema.forGeneric(schema, schemaRegistryUrl) , properties);
kafkaConsumer.setStartFromEarliest();
DataStream<String> streamIn = env.addSource(kafkaConsumer);
FlinkKafkaProducer kafkaProducer = new FlinkKafkaProducer("v02",ConfluentRegistryAvroSerializationSchema.forGeneric("v02-value",schema,schemaRegistryUrl),properties);
DataStream<String> streamOut = streamIn.map(new MapFunction<String, String>() { //<--Error in this line
@Override
public String map(String value) throws Exception {
return value.toUpperCase();
}
});
streamOut.addSink(kafkaProducer);
env.execute("Flink Streaming In/Out Kafka");
}
}
When executing I get following error:
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to java.lang.String
at org.example.KafkaRead$1.map(KafkaRead.java:45)
It works without problems if I don't use a map function and use my input as my output:
DataStream<String> streamOut = streamIn;
For Context:
The data that gets read in the first place looks like this
Date,Country,City,Specie,count,min,max,median,variance
2020-03-24,DE,Hamburg,humidity,288,26.0,54.0,36.5,966.48
2020-03-26,DE,Hamburg,humidity,288,25.0,71.5,44.0,1847.14
2020-04-01,DE,Hamburg,humidity,288,61.0,83.0,75.0,418.07
The CSV files are read via the SpoolDirCsvSourceConnector in Kafka. Apache Flink reads from topic v01, whose schema is generated automatically by the connector and stored as Avro. In the next step I want to analyse the data with filters inside Flink and then write it back to v02, for example filtering by country.
Complete Error:
Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:147)
at org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$2(MiniClusterJobClient.java:119)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:229)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:996)
at akka.dispatch.OnComplete.internal(Future.scala:264)
at akka.dispatch.OnComplete.internal(Future.scala:261)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572)
at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:22)
at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21)
at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:224)
at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:217)
at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:208)
at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:610)
at org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:89)
at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:419)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:286)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:201)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
... 4 more
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to java.lang.String
at org.example.KafkaRead$1.map(KafkaRead.java:45)
at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:41)
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:71)
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:46)
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:26)
at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:52)
at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:30)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollectWithTimestamp(StreamSourceContexts.java:310)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collectWithTimestamp(StreamSourceContexts.java:409)
at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecordsWithTimestamps(AbstractFetcher.java:352)
at org.apache.flink.streaming.connectors.kafka.internals.KafkaFetcher.partitionConsumerRecordsHandler(KafkaFetcher.java:181)
at org.apache.flink.streaming.connectors.kafka.internals.KafkaFetcher.runFetchLoop(KafkaFetcher.java:137)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:761)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:215)
Just to extend the comment that has been added: if you use ConfluentRegistryAvroDeserializationSchema.forGeneric, the data produced by the consumer isn't really String but rather GenericRecord.
So, the moment you try to use it in your map function, which expects String, it will fail, because your DataStream is not a DataStream<String> but a DataStream<GenericRecord>.
It works when you remove the map only because you haven't specified type parameters when declaring your FlinkKafkaConsumer and your FlinkKafkaProducer. The compiler then only emits unchecked warnings and, because of type erasure, nothing checks the element type at runtime until code actually treats an element as a String. Your FlinkKafkaProducer is actually a FlinkKafkaProducer<GenericRecord>, so there is no problem there and it works as it should.
In this particular case you don't seem to need Avro at all, since the data is just raw CSV.
UPDATE:
It seems that you are actually processing Avro. In that case you need to change the type of your DataStream<String> to DataStream<GenericRecord>, and all the functions you write will operate on GenericRecord, not String.
So, you need something like:
.map(new MapFunction<GenericRecord, T>(){...})
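The raw-type effect described in this answer can be reproduced without Flink or Avro. The following is a minimal sketch using only the standard library, not the actual Flink job: FakeRecord is a hypothetical stand-in for GenericRecord, a raw List stands in for the untyped FlinkKafkaConsumer, and an anonymous Function plays the role of the MapFunction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class RawTypeCastDemo {
    // Hypothetical stand-in for Avro's GenericRecord.
    static final class FakeRecord {
        final String country;
        FakeRecord(String country) { this.country = country; }
    }

    // A raw List lets a non-String element travel inside a List<String>,
    // much like a raw consumer feeding a DataStream<String>.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static List<String> rawSource() {
        List raw = new ArrayList();
        raw.add(new FakeRecord("de"));
        return raw; // unchecked conversion: compiles with a warning only
    }

    // Passing the element along without touching it as a String works,
    // because no cast to String is ever executed.
    static String passThrough() {
        List<?> stream = rawSource();
        Object first = stream.get(0);
        return first.getClass().getSimpleName();
    }

    // Calling a function typed on String forces the cast and fails.
    static boolean mapFails() {
        List<String> stream = rawSource();
        Function<String, String> upper = new Function<String, String>() {
            @Override
            public String apply(String value) {
                return value.toUpperCase();
            }
        };
        try {
            upper.apply(stream.get(0));
            return false;
        } catch (ClassCastException e) {
            return true; // FakeRecord cannot be cast to String
        }
    }

    public static void main(String[] args) {
        System.out.println(passThrough());
        System.out.println(mapFails());
    }
}
```

Declaring the type parameters everywhere (for example FlinkKafkaConsumer<GenericRecord> in the real job) should turn this runtime failure into a compile-time error.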
I'm trying to develop a K-means model in Flink (Scala), using Zeppelin.
This is part of my simple code:
//Reading data
val mapped : DataSet[Vector] = data.map {x => DenseVector (x._1,x._2) }
//Create algorithm
val knn = KNN()
.setK(3)
.setBlocks(10)
.setDistanceMetric(SquaredEuclideanDistanceMetric())
.setUseQuadTree(false)
.setSizeHint(CrossHint.SECOND_IS_SMALL)
...
//Just to learn I use the same data predicting the model
val result = knn.predict(mapped).collect()
When I print the data or use the predict method, I get this error:
org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Job execution failed.
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:409)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:95)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:382)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:369)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:344)
at org.apache.flink.client.RemoteExecutor.executePlanWithJars(RemoteExecutor.java:211)
at org.apache.flink.client.RemoteExecutor.executePlan(RemoteExecutor.java:188)
at org.apache.flink.api.java.RemoteEnvironment.execute(RemoteEnvironment.java:172)
at org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:896)
at org.apache.flink.api.scala.ExecutionEnvironment.execute(ExecutionEnvironment.scala:637)
at org.apache.flink.api.scala.DataSet.collect(DataSet.scala:547)
... 36 elided
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:822)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:768)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:768)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.api.common.io.ParseException: Line could not be parsed: '-6.59 -44.68'
ParserError NUMERIC_VALUE_FORMAT_ERROR
Expect field types: class java.lang.Double, class java.lang.Double
in file: /home/borja/flink/kmeans/points
at org.apache.flink.api.common.io.GenericCsvInputFormat.parseRecord(GenericCsvInputFormat.java:407)
at org.apache.flink.api.java.io.CsvInputFormat.readRecord(CsvInputFormat.java:110)
at org.apache.flink.api.common.io.DelimitedInputFormat.nextRecord(DelimitedInputFormat.java:470)
at org.apache.flink.api.java.io.CsvInputFormat.nextRecord(CsvInputFormat.java:78)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:162)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:585)
at java.lang.Thread.run(Thread.java:748)
I don't know whether the problem is in how I load the data or in something else.
Thanks for any help! :)
You haven't shown us the code you are using to read and parse the data, which is where the error is occurring. But given the error message, I'll hazard a guess that you are using readCsvFile with data that is delimited by spaces or tabs, and didn't specify the fieldDelimiter (which defaults to comma). If that's the case, see the docs for how to configure the CSV parser.
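To see why the wrong delimiter produces a numeric format error, here is a small standard-library sketch. It is not Flink's CSV parser: DelimiterDemo, parseLine and canParse are hypothetical names, and the sample line is the one quoted in the ParseException above.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class DelimiterDemo {
    // Splits a line on a literal delimiter and parses every field as a double.
    static double[] parseLine(String line, String delimiter) {
        return Arrays.stream(line.split(Pattern.quote(delimiter)))
                .mapToDouble(Double::parseDouble)
                .toArray();
    }

    // Reports whether the line parses cleanly with the given delimiter.
    static boolean canParse(String line, String delimiter) {
        try {
            parseLine(line, delimiter);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String line = "-6.59 -44.68";
        // With a comma the whole line is one field, which is not a number.
        System.out.println(canParse(line, ","));
        // With a space the line splits into two numeric fields.
        System.out.println(canParse(line, " "));
    }
}
```

With a comma as the delimiter the whole line is treated as a single field, which matches the NUMERIC_VALUE_FORMAT_ERROR in the trace.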
I'm trying to read in some JSON, infer a schema, and write it out again as Parquet to S3 (s3a). For some reason, about a third of the way through the writing portion of the run, Spark always fails with the error included below. I can't find any obvious reason for the issue: it isn't out of memory, and there are no long GC pauses. There don't seem to be any additional error messages in the logs of the individual executors.
The script runs fine on another set of data that I have, which is of a very similar structure, but several orders of magnitude smaller.
I am running Spark 2.0.1-hadoop-2.7 and am using the FileOutputCommitter. The algorithm version doesn't seem to matter.
Edit:
This does not appear to be a problem of badly formed JSON or corrupted files. I have unzipped and read in each file individually with no error.
Here's a simplified version of the script:
object Foo {
def parseJson(json: String): Option[Map[String, Any]] = {
if (json == null)
Some(Map())
else
parseOpt(json).map((j: JValue) => j.values.asInstanceOf[Map[String, Any]])
}
}
// read in as text and parse json using json4s
val jsonRDD: RDD[String] = sc.textFile(inputPath)
.map(row => Foo.parseJson(row))
// infer a schema that will encapsulate the most rows in a sample of size sampleRowNum
val schema: StructType = Infer.getMostCommonSchema(sc, jsonRDD, sampleRowNum)
// get documents compatibility with schema
val jsonWithCompatibilityRDD: RDD[(String, Boolean)] = jsonRDD
.map(js => (js, Infer.getSchemaCompatibility(schema, Infer.inferSchema(js)).toBoolean))
.repartition(partitions)
val jsonCompatibleRDD: RDD[String] = jsonWithCompatibilityRDD
.filter { case (js: String, compatible: Boolean) => compatible }
.map { case (js: String, _: Boolean) => js }
// create a dataframe from documents with compatible schema
val dataFrame: DataFrame = spark.read.schema(schema).json(jsonCompatibleRDD)
It completes the earlier schema-inferring steps successfully. The error itself occurs on the last line, but I suppose that could encompass at least the immediately preceding statement, if not earlier ones:
org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Failed to commit task
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.org$apache$spark$sql$execution$datasources$DefaultWriterContainer$$commitTask$1(WriterContainer.scala:275)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:257)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1345)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
... 8 more
Suppressed: java.lang.NullPointerException
at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:147)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetFileFormat.scala:569)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.org$apache$spark$sql$execution$datasources$DefaultWriterContainer$$abortTask$1(WriterContainer.scala:282)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$2.apply$mcV$sp(WriterContainer.scala:258)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1354)
... 9 more
Caused by: com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3480)
at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:604)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:962)
at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:1147)
at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:1136)
at org.apache.hadoop.fs.s3a.S3AOutputStream.close(S3AOutputStream.java:142)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:400)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:117)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetFileFormat.scala:569)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.org$apache$spark$sql$execution$datasources$DefaultWriterContainer$$commitTask$1(WriterContainer.scala:267)
... 13 more
Caused by: com.amazonaws.AmazonClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:150)
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseListBucketObjectsResponse(XmlResponsesSaxParser.java:279)
at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:75)
at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:72)
at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:712)
... 29 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 2; XML document structures must start and end within the same entity.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.endEntity(Unknown Source)
at org.apache.xerces.impl.XMLDocumentScannerImpl.endEntity(Unknown Source)
at org.apache.xerces.impl.XMLEntityManager.endEntity(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.skipChar(Unknown Source)
at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:141)
... 35 more
Here's my conf:
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G -XX:+HeapDumpOnOutOfMemoryError
spark.executor.memory 16G
spark.executor.uri https://s3.amazonaws.com/foo/spark-2.0.1-bin-hadoop2.7.tgz
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.buffer.dir /raid0/spark
spark.hadoop.fs.s3n.buffer.dir /raid0/spark
spark.hadoop.fs.s3a.connection.timeout 500000
spark.hadoop.fs.s3n.multipart.uploads.enabled true
spark.hadoop.parquet.block.size 2147483648
spark.hadoop.parquet.enable.summary-metadata false
spark.jars.packages com.databricks:spark-avro_2.11:3.0.1
spark.local.dir /raid0/spark
spark.mesos.coarse false
spark.mesos.constraints priority:1
spark.network.timeout 600
spark.rpc.message.maxSize 500
spark.speculation false
spark.sql.parquet.mergeSchema false
spark.sql.planner.externalSort true
spark.submit.deployMode client
spark.task.cpus 1
I can think of three possible reasons for this problem.
JVM version. The AWS SDK checks for the following versions: "1.6.0_06", "1.6.0_13", "1.6.0_17", "1.6.0_65", "1.7.0_45". If you are using one of them, try upgrading.
Old AWS SDK. Refer to https://github.com/aws/aws-sdk-java/issues/460 for a workaround.
If you have lots of files in the directory where you are writing these files, you might be hitting https://issues.apache.org/jira/browse/HADOOP-13164. Consider increasing the timeout to a larger value.
A SAXParseException may indicate a badly formatted XML file. Since the job fails roughly a third of the way through consistently, this means it's probably failing in the same place every time (a file whose partition is roughly a third of the way through the partition list).
Can you paste your script? It may be possible to wrap the Spark step in a try/catch block that prints out the file when this error occurs, which will let you easily zoom in on the problem.
From the logs:
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 2; XML document structures must start and end within the same entity.
and
Caused by: com.amazonaws.AmazonClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler
It looks like you have a corrupted/incorrectly formatted file, and your error is actually occurring during the read portion of the task. You could confirm this by trying another operation that will force the read such as count().
If confirmed, the goal would then be to find the corrupted file. You could do this by listing the s3 files, sc.parallelize() that list, and then trying to read the files in a custom function using map().
import boto3
from pyspark.sql import Row

def scanKeys(startKey, endKey):
    # List keys under the prefix, starting after startKey and stopping at endKey
    bucket = boto3.resource('s3').Bucket('bucketName')
    for obj in bucket.objects.filter(Prefix='prefix', Marker=startKey):
        if obj.key < endKey:
            yield obj.key
        else:
            return

def testFile(s3Path):
    # Fetch the object for this key (the original passed an undefined name `key`)
    s3obj = boto3.resource('s3').Object(bucket_name='bucketName', key=s3Path)
    body = s3obj.get()['Body']
    ...
    # logic to test file format, or use a try/except and attempt to parse it;
    # it should set fileFormattedCorrectly
    ...
    if fileFormattedCorrectly:
        return Row(status='Good', key=s3Path)
    else:
        return Row(status='Fail', key=s3Path)

keys = list(scanKeys(startKey, endKey))
keyListRdd = sc.parallelize(keys, 1000)
# asDict is a method on Row, so it must be called
keyListRdd.map(testFile).filter(lambda x: x.asDict().get('status') == 'Fail').collect()
This will return rows holding the S3 keys of the incorrectly formatted files.
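The placeholder step in testFile ("use a try/except and attempt to parse it") could be sketched roughly as below. This is only an illustration: check_payload is a hypothetical helper name, and json.loads stands in for whatever parser matches your real file format, which is not stated in the question. It works on raw bytes so it can be tried locally before wiring it into the S3 read.

```python
import json


def check_payload(key, payload, parse=json.loads):
    """Return a status record for one object, given its raw bytes.

    `parse` is an assumption: swap in a parser for your actual format.
    """
    try:
        parse(payload)
    except Exception as exc:  # any parse failure marks the object as bad
        return {"status": "Fail", "key": key, "error": type(exc).__name__}
    return {"status": "Good", "key": key, "error": None}


# Local check with one well-formed and one truncated payload
results = [
    check_payload("prefix/good.json", b'{"a": 1}'),
    check_payload("prefix/bad.json", b'{"a": '),
]
failed = [r["key"] for r in results if r["status"] == "Fail"]
print(failed)
```

Inside testFile you would pass the bytes read from body to such a helper and build the Row from its status.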
For Googlers:
If you:
have a versioned bucket
use s3a://
see ListBucketHandler and listObjects in your error message
Quick solution:
use s3:// instead of s3a://, which will use the S3 driver provided by EMR
You may see this error because older versions of s3a:// use the S3::ListObjects (v1) API instead of S3::ListObjectsV2. The former returns extra info such as the owner, and is not robust against a large number of delete markers. Newer versions of the s3a:// driver solved this problem, but you could always use the s3:// driver instead.
Quote:
the V1 list API experience always returns 5000 entries (as set in fs.s3a.paging.maximum)
except for the final entry
if you have versioning turned on in your bucket, deleted entries retain tombstone markers with references to their versions
which will surface in the S3-side of list calls, but get stripped out from the response
so...for a very large tree, you may end up with S3 having to keep a channel open while it skips over thousands to millions of deleted
objects before it can find actual ones to return.
which can time out connections.
Quote:
Introducing a new version of the ListObjects (ListObjectsV2) API that allows listing objects with a large number of delete markers.
Quote:
If there are thousands of delete markers, the list operation might timeout.
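To check whether delete markers are a plausible cause for your bucket, you could count them under the prefix you are listing. The sketch below is a minimal example, assuming a boto3 S3 client is passed in; count_delete_markers is a hypothetical helper name, while the list_object_versions paginator and the DeleteMarkers response key are part of the S3 API.

```python
def count_delete_markers(client, bucket, prefix=""):
    """Count delete markers under a prefix via the ListObjectVersions API.

    `client` is expected to be a boto3 S3 client (an assumption here).
    """
    paginator = client.get_paginator("list_object_versions")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "DeleteMarkers" is absent from pages that contain none
        total += len(page.get("DeleteMarkers", []))
    return total
```

For example, count_delete_markers(boto3.client('s3'), 'bucketName', 'prefix') with your own bucket and prefix. A count in the thousands or more would be consistent with the timeouts described in the quotes above.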
This is the example code that came with Spark. I copied the code below, and this is the link to it: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala. However, when I tried to run the program using the command "bin/run-example org.apache.spark.examples.streaming.StatefulNetworkWordCount localhost 9999", I got the following error:
14/07/20 11:52:57 ERROR ActorSystemImpl: Uncaught fatal error from thread [spark-akka.actor.default-dispatcher-4] shutting down ActorSystem [spark]
java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
at org.apache.spark.streaming.scheduler.JobScheduler.getPendingTimes(JobScheduler.scala:114)
at org.apache.spark.streaming.Checkpoint.<init>(Checkpoint.scala:43)
at org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:259)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:167)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:76)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Exception in thread "Thread-37" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:639)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:638)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:638)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1215)
at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:201)
at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:163)
at akka.actor.ActorCell.terminate(ActorCell.scala:338)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:431)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:240)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/07/20 11:53:00 ERROR Executor: Exception in task ID 0
java.lang.IllegalStateException: cannot create children while terminating or terminated
at akka.actor.dungeon.Children$class.makeChild(Children.scala:184)
at akka.actor.dungeon.Children$class.attachChild(Children.scala:42)
at akka.actor.ActorCell.attachChild(ActorCell.scala:338)
at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:518)
at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.<init>(ReceiverSupervisorImpl.scala:67)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:263)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
14/07/20 11:53:06 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[spark-akka.actor.default-dispatcher-13,5,main]
org.apache.spark.SparkException: Error sending message to BlockManagerMaster [message = HeartBeat(BlockManagerId(<driver>, x-131-212-225-148.uofm-secure.wireless.umn.edu, 47668, 0))]
at org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:251)
at org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:51)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$heartBeat(BlockManager.scala:113)
at org.apache.spark.storage.BlockManager$$anonfun$initialize$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(BlockManager.scala:158)
at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:790)
at org.apache.spark.storage.BlockManager$$anonfun$initialize$1.apply$mcV$sp(BlockManager.scala:158)
at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80)
at akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.run(Scheduler.scala:464)
at akka.actor.LightArrayRevolverScheduler$$anonfun$close$1.apply(Scheduler.scala:281)
at akka.actor.LightArrayRevolverScheduler$$anonfun$close$1.apply(Scheduler.scala:280)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at akka.actor.LightArrayRevolverScheduler.close(Scheduler.scala:279)
at akka.actor.ActorSystemImpl.stopScheduler(ActorSystem.scala:630)
at akka.actor.ActorSystemImpl$$anonfun$_start$1.apply$mcV$sp(ActorSystem.scala:582)
at akka.actor.ActorSystemImpl$$anonfun$_start$1.apply(ActorSystem.scala:582)
at akka.actor.ActorSystemImpl$$anonfun$_start$1.apply(ActorSystem.scala:582)
at akka.actor.ActorSystemImpl$$anon$3.run(ActorSystem.scala:596)
at akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$run$1.runNext$1(ActorSystem.scala:750)
at akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$run$1.apply$mcV$sp(ActorSystem.scala:753)
at akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$run$1.apply(ActorSystem.scala:746)
at akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$run$1.apply(ActorSystem.scala:746)
at akka.util.ReentrantGuard.withGuard(LockUtil.scala:15)
at akka.actor.ActorSystemImpl$TerminationCallbacks.run(ActorSystem.scala:746)
at akka.actor.ActorSystemImpl$$anonfun$terminationCallbacks$1.apply(ActorSystem.scala:593)
at akka.actor.ActorSystemImpl$$anonfun$terminationCallbacks$1.apply(ActorSystem.scala:593)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.pattern.AskTimeoutException: Recipient[Actor[akka://spark/user/BlockManagerMaster#1887396223]] had already been terminated.
at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
at org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:236)
****************CODE********************
object StatefulNetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
      System.exit(1)
    }
    StreamingExamples.setStreamingLogLevels()
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }
    val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount")
    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    ssc.checkpoint(".")
    // Create a NetworkInputDStream on target ip:port and count the
    // words in input stream of \n delimited test (eg. generated by 'nc')
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val words = lines.flatMap(_.split(" "))
    val wordDstream = words.map(x => (x, 1))
    // Update the cumulative count using updateStateByKey
    // This will give a Dstream made of state (which is the cumulative count of the words)
    val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)
    stateDstream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
I wonder if this is because it tries to set up the checkpoint on my local file system via ssc.checkpoint("."), while that directory is not Hadoop-compatible (the checkpoint directory must be on a Hadoop-compatible file system). If so, how could I fix it? Thanks!
Which JRE release are you running, 1.7 or 1.8? I had a similar issue: I compiled the Spark source code with JDK 1.8, and running it on 1.7 triggers this error. Switching the runtime back to 1.8 solves it.
In JDK 1.8, ConcurrentHashMap.java (line 812) declares:
// views
private transient KeySetView<K,V> keySet;
private transient ValuesView<K,V> values;
private transient EntrySetView<K,V> entrySet;
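JDK 1.7 does not have the above code, so the keySet() signature returning KeySetView that the compiled code expects is missing at run time.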
Hope it will help ^_^
The development part of the shark/spark wiki is really brief, so I tried to put together some code to query a table programmatically. Here it is:
object Test extends App {
  val master = "spark://localhost.localdomain:8084"
  val jobName = "scratch"
  val sparkHome = "/home/shengc/Downloads/software/spark-0.6.1"
  val executorEnvVars = Map[String, String](
    "SPARK_MEM" -> "1g",
    "SPARK_CLASSPATH" -> "",
    "HADOOP_HOME" -> "/home/shengc/Downloads/software/hadoop-0.20.205.0",
    "JAVA_HOME" -> "/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64",
    "HIVE_HOME" -> "/home/shengc/Downloads/software/hive-0.9.0-bin"
  )
  val sc = new shark.SharkContext(master, jobName, sparkHome, Nil, executorEnvVars)
  sc.sql2console("create table src")
  sc.sql2console("load data local inpath '/home/shengc/Downloads/software/hive-0.9.0-bin/examples/files/kv1.txt' into table src")
  sc.sql2console("select count(1) from src")
}
I can create table src and load data into it fine, but the last query threw an NPE and failed. Here is the output:
13/01/06 17:33:20 INFO execution.SparkTask: Executing shark.execution.SparkTask
13/01/06 17:33:20 INFO shark.SharkEnv: Initializing SharkEnv
13/01/06 17:33:20 INFO execution.SparkTask: Adding jar file:///home/shengc/workspace/shark/hive/lib/hive-builtins-0.9.0.jar
java.lang.NullPointerException
at shark.execution.SparkTask$$anonfun$execute$5.apply(SparkTask.scala:58)
at shark.execution.SparkTask$$anonfun$execute$5.apply(SparkTask.scala:55)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:34)
at scala.collection.mutable.ArrayOps.foreach(ArrayOps.scala:38)
at shark.execution.SparkTask.execute(SparkTask.scala:55)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:134)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1326)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1118)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:951)
at shark.SharkContext.sql(SharkContext.scala:58)
at shark.SharkContext.sql2console(SharkContext.scala:84)
at Test$delayedInit$body.apply(Test.scala:20)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:60)
at scala.App$$anonfun$main$1.apply(App.scala:60)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.immutable.List.foreach(List.scala:76)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:30)
at scala.App$class.main(App.scala:60)
at Test$.main(Test.scala:4)
at Test.main(Test.scala)
FAILED: Execution Error, return code -101 from shark.execution.SparkTask
13/01/06 17:33:20 ERROR ql.Driver: FAILED: Execution Error, return code -101 from shark.execution.SparkTask
13/01/06 17:33:20 INFO ql.Driver: </PERFLOG method=Driver.execute start=1357511600030 end=1357511600054 duration=24>
13/01/06 17:33:20 INFO ql.Driver: <PERFLOG method=releaseLocks>
13/01/06 17:33:20 INFO ql.Driver: </PERFLOG method=releaseLocks start=1357511600054 end=1357511600054 duration=0>
However, I can query the src table by typing select * from src within the shell invoked by bin/shark-withinfo.
You might ask about trying that SQL in the shell triggered by "bin/shark-shell". Well, I cannot get into that shell. Here is the error I came across:
https://groups.google.com/forum/?fromgroups=#!topic/shark-users/glZzrUfabGc
[EDIT 1]: this NPE seems to result from SharkEnv.sc not having been set, so I added
shark.SharkEnv.sc = sc
right before any sql2console operations are executed. It then complained about a ClassNotFoundException for scala.tools.nsc, so I manually put scala-compiler on the classpath. After that, the code complained about another ClassNotFoundException, which I cannot figure out how to fix, since I did put the shark jar on the classpath.
13/01/06 18:09:34 INFO cluster.TaskSetManager: Lost TID 1 (task 1.0:1)
13/01/06 18:09:34 INFO cluster.TaskSetManager: Loss was due to java.lang.ClassNotFoundException: shark.execution.TableScanOperator$$anonfun$preprocessRdd$3
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
[EDIT 2]: OK, I figured out another piece of code that does what I want, by following exactly how shark's source code initializes the interactive REPL.
System.setProperty("MASTER", "spark://localhost.localdomain:8084")
System.setProperty("SPARK_MEM", "1g")
System.setProperty("SPARK_CLASSPATH", "")
System.setProperty("HADOOP_HOME", "/home/shengc/Downloads/software/hadoop-0.20.205.0")
System.setProperty("JAVA_HOME", "/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64")
System.setProperty("HIVE_HOME", "/home/shengc/Downloads/software/hive-0.9.0-bin")
System.setProperty("SCALA_HOME", "/home/shengc/Downloads/software/scala-2.9.2")
shark.SharkEnv.initWithSharkContext("scratch")
val sc = shark.SharkEnv.sc.asInstanceOf[shark.SharkContext]
sc.sql2console("select * from src")
This is ugly, but at least it works. Any comments on how to write a more robust piece of code are welcome!
For whoever wishes to operate on shark programmatically, please note that all hive and shark jars must be in your CLASSPATH, and the scala compiler has to be in your classpath too. The other important thing is that hadoop's conf should be in the classpath too.
I believe the issue is your SharkEnv is not initialized.
I'm using shark 0.9.0 (but I believe you have to initialize SharkEnv in 0.6.1 too), and my SharkEnv is initialized in the following way:
// SharkContext
val sc = new SharkContext(master,
  jobName,
  System.getenv("SPARK_HOME"),
  Nil,
  executorEnvVar)
// Initialize SharkEnv
SharkEnv.sc = sc
// create and populate table
sc.runSql("CREATE TABLE src(key INT, value STRING)")
sc.runSql("LOAD DATA LOCAL INPATH '${env:HIVE_HOME}/examples/files/kv1.txt' INTO TABLE src")
// print result to stdout
println(sc.runSql("select * from src"))
println(sc.runSql("select count(*) from src"))
Also, try querying data from the src table without aggregate functions (comment out the line with "select count(*) ..."). I had a similar issue where a plain data query was fine but count(*) threw an exception; in my case it was fixed by adding mysql-connector-java.jar to yarn.application.classpath.