How to resolve the "Spark-Submit Error: Failed to load class" issue? - scala

I am new to Spark and Scala. I tried to execute a sample Scala program and am facing an issue with it. These are the steps I have tried so far:
I created a Scala object Sivam under the package com.dhana.MyScalaDemo.
Step 1: Started the master.
Step 2: Started the worker with the above master URL (both the master and the worker node are visible in the localhost:8080 UI).
Step 3: Submitted the job with spark-submit --class com.dhana.MyScalaDemo.Sivam --master spark://10.xxx.xx.xxx:7077 file:///C:/Users/DHANABALAN/workspace/MyScalaDemo/target/MyScalaDemo-0.0.1-SNAPSHOT.jar
I get the following error:
Error: Failed to load class com.dhana.MyScalaDemo.Sivam.
Could you please assist me with this?
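For reference, here is a minimal sketch of what spark-submit expects to find under that --class name (the package and object names are taken from the question; the main method body is only an assumed placeholder). The fully qualified name in the source must match com.dhana.MyScalaDemo.Sivam exactly, and the compiled class must actually be inside the JAR, which you can check by listing it with jar tf MyScalaDemo-0.0.1-SNAPSHOT.jar and looking for com/dhana/MyScalaDemo/Sivam.class.

package com.dhana.MyScalaDemo

import org.apache.spark.sql.SparkSession

object Sivam {
  def main(args: Array[String]): Unit = {
    // The master URL is supplied by spark-submit, so it is not hard-coded here
    val spark = SparkSession.builder().appName("Sivam").getOrCreate()
    spark.sparkContext.parallelize(1 to 10).foreach(println)
    spark.stop()
  }
}

If the class is missing from the JAR, the usual causes are a package declaration that does not match the --class path, or a build that did not package the compiled classes.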

Related

Apache Beam Spark Portable Runner

I am running a sample pipeline, and this is my setup:
python "SaiStudy - Apache-Beam-Spark.py" --runner=PortableRunner --job_endpoint=192.168.99.102:8099
My Spark is running on a Docker Container and I can see that the JobService is running at 8099.
I am getting the following error:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"#1603539936.536000000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4090,"referenced_errors":[{"created":"#1603539936.536000000","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
When I curl ip:port, I see the following error in the Docker logs:
Oct 24, 2020 11:34:50 AM org.apache.beam.vendor.grpc.v1p26p0.io.grpc.netty.NettyServerTransport notifyTerminated
INFO: Transport failed
org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.http2.Http2Exception: Unexpected HTTP/1.x request: GET /
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.http2.Http2Exception.connectionError(Http2Exception.java:103)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.http2.Http2ConnectionHandler$PrefaceDecoder.readClientPrefaceString(Http2ConnectionHandler.java:302)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.http2.Http2ConnectionHandler$PrefaceDecoder.decode(Http2ConnectionHandler.java:239)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:438)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:505)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Help Please.
Please find instructions here on how to set up the PortableRunner for Spark:
https://beam.apache.org/documentation/runners/spark/
Basically, you need to set up an additional Docker container (as described) which acts as the job server (runner) between Beam (in any language) and Spark.
You connect Beam to the job server, and the job server to Spark.
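As a rough sketch of that flow, roughly as described on the Beam Spark runner page linked above (check that page for the exact image name and flags; the localhost addresses here are placeholders for your own setup):

# Start the Beam Spark job server container, pointing it at your Spark master
docker run --net=host apache/beam_spark_job_server:latest --spark-master-url=spark://localhost:7077

# Then submit the Beam pipeline against the job server's endpoint (8099 by default)
python "SaiStudy - Apache-Beam-Spark.py" --runner=PortableRunner --job_endpoint=localhost:8099

Note that the HTTP/1.x error you see when curling the port is expected: the job endpoint speaks gRPC (HTTP/2), so a plain curl GET will always be rejected, and that by itself does not indicate a problem.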

Scala Spark : (org.apache.spark.repl.ExecutorClassLoader) Failed to check existence of class org on REPL class server at path

Running a basic df.show() after installing spark-notebook.
I am getting the following error when running Scala Spark code on spark-notebook. Any idea why this occurs and how to avoid it?
[org.apache.spark.repl.ExecutorClassLoader] Failed to check existence of class org.apache.spark.sql.catalyst.expressions.Object on REPL class server at spark://192.168.10.194:50935/classes
[org.apache.spark.util.Utils] Aborting task
[org.apache.spark.repl.ExecutorClassLoader] Failed to check existence of class org on REPL class server at spark://192.168.10.194:50935/classes
[org.apache.spark.util.Utils] Aborting task
[org.apache.spark.repl.ExecutorClassLoader] Failed to check existence of class
I installed Spark locally, and the following code gave me the same error:
spark.read.format("json").load("Downloads/test.json")
I think the issue was that it was trying to find a master node and picking up some random or default IP. I specified the mode and provided the IP as 127.0.0.1, which resolved my issue.
Solution
Run Spark using a local master:
/usr/local/bin/spark-shell --master "local[4]" --conf spark.driver.host=127.0.0.1
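If you are creating the session in code rather than through spark-shell, the same settings can be passed through the SparkSession builder; a minimal sketch (the app name is just a placeholder, and the JSON path is the one from the snippet above):

import org.apache.spark.sql.SparkSession

// Local master plus an explicit driver host, mirroring the spark-shell flags above
val spark = SparkSession.builder()
  .master("local[4]")
  .config("spark.driver.host", "127.0.0.1")
  .appName("local-test")
  .getOrCreate()

spark.read.format("json").load("Downloads/test.json").show()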

The initialization of the DataSource's outputs caused an error: The UDF class is not a proper subclass

I have this issue:
The initialization of the DataSource's outputs caused an error: The UDF class is not a proper subclass of org.apache.flink.api.common.functions.MapFunction
generated by this code:
val probes: DataSet[Probe] = env.createInput[InputProbe](new ProbesInputFormat).map { i =>
  new Probe(
    i.rssi,
    0,
    i.macHash,
    i.deviceId,
    0,
    i.timeStamp)
}
I'm using Scala 2.11 on Flink 1.4.0 with IDEA.
On my dev machine there is no issue and the job runs properly, while on a standalone Flink cluster of 3 nodes I encounter the above error.
Can you help me, please? ;(
UPDATE:
I resolved it by implementing a class that extends RichMapFunction; I don't know why, but it seems that lambda functions (=>) are not supported properly.
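For illustration, a sketch of that change (ProbeMapper is a placeholder name; Probe, InputProbe and ProbesInputFormat are the classes from the snippet above):

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala._

// An explicit RichMapFunction instead of the Scala lambda passed to .map { ... }
class ProbeMapper extends RichMapFunction[InputProbe, Probe] {
  override def map(i: InputProbe): Probe =
    new Probe(i.rssi, 0, i.macHash, i.deviceId, 0, i.timeStamp)
}

val probes: DataSet[Probe] =
  env.createInput[InputProbe](new ProbesInputFormat).map(new ProbeMapper)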
Now I have a new issue:
java.lang.ClassCastException: hk.huko.aps2.entities.Registry cannot be cast to scala.Product
Should I open a new post?
I resolved the issue. It happened because Flink loads the job JAR many times (classloader), and somehow that produced the error.
The solution is not to create a fat JAR that includes all external dependencies, but to copy those libraries, plus your job JAR, into the flink/lib folder.

Mappers fail when using Pig to insert data into MongoDB

I am trying to import a file from HDFS into MongoDB using MongoInsertStorage with Pig. The files are large, around 5 GB. The script runs fine when I run it in local mode with
pig -x local example.pig
However, if I run it in MapReduce mode, most of the mappers fail with the following error:
Error: com.mongodb.ConnectionString.getReadConcern()Lcom/mongodb/ReadConcern;
Container killed by the ApplicationMaster.
Container killed on request.
Exit code is 143 Container exited with a non-zero exit code 143
Can someone help me solve this issue? I also increased the memory allocated to the YARN containers, but that hasn't helped.
Some mappers are also timing out after 300 seconds.
The Pig script is as follows:
REGISTER mongo-java-driver-3.2.2.jar
REGISTER mongo-hadoop-core-1.4.0.jar
REGISTER mongo-hadoop-pig-1.4.0.jar
REGISTER mongodb-driver-3.2.2.jar
DEFINE MongoInsertStorage com.mongodb.hadoop.pig.MongoInsertStorage();
SET mapreduce.reduce.speculative true
BIG_DATA = LOAD 'hdfs://example.com:8020/user/someuser/sample.csv' using PigStorage(',') As (a:chararray,b:chararray,c:chararray);
STORE BIG_DATA INTO 'mongodb://insert.some.ip.here:27017/test.samplecollection' USING MongoInsertStorage('', '')
Found a solution.
For the error
Error: com.mongodb.ConnectionString.getReadConcern()Lcom/mongodb/ReadConcern;
Container killed by the ApplicationMaster.
Container killed on request.
Exit code is 143 Container exited with a non-zero exit code 143
I changed the JAR versions: mongo-hadoop-core and mongo-hadoop-pig from 1.4.0 to 2.0.2, and the MongoDB Java driver from 3.2.2 to 3.4.2. This eliminated the ReadConcern error on the mappers!
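For clarity, the updated registrations would look roughly like this (exact JAR file names may differ depending on how the artifacts were downloaded):
REGISTER mongo-java-driver-3.4.2.jar
REGISTER mongodb-driver-3.4.2.jar
REGISTER mongo-hadoop-core-2.0.2.jar
REGISTER mongo-hadoop-pig-2.0.2.jar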
For the timeout, I added this after registering the jars:
SET mapreduce.task.timeout 1800000
I had been using SET mapred.task.timeout, which didn't work.
Hope this helps anyone who has a similar issue!

GCS Connector Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found

We are trying to run Hive queries on HDP 2.1 using the GCS connector. It was working fine until yesterday, but since this morning our jobs have randomly started failing. When we restart them manually, they work fine. I suspect it has something to do with the number of parallel Hive jobs running at a given point in time.
Below is the error message:
vertexId=vertex_1407434664593_37527_2_00, diagnostics=[Vertex Input: audience_history initializer failed., java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found]
DAG failed due to vertex failure. failedVertices:1 killedVertices:0
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
Any help will be highly appreciated.
Thanks!