Apache Beam Spark Portable Runner - apache-beam

I am running a sample pipeline and my environment is this.
python "SaiStudy - Apache-Beam-Spark.py" --runner=PortableRunner --job_endpoint=192.168.99.102:8099
My Spark is running on a Docker Container and I can see that the JobService is running at 8099.
I am getting the following error:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"#1603539936.536000000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_chann
el.cc","file_line":4090,"referenced_errors":[{"created":"#1603539936.536000000","description":"failed to connect to all addresses","file":"src/core/ext/filters/cli
ent_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
When I curl to ip:port, I can see the following error from the docker logs
Oct 24, 2020 11:34:50 AM org.apache.beam.vendor.grpc.v1p26p0.io.grpc.netty.NettyServerTransport notifyTerminated
INFO: Transport failed
org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.http2.Http2Exception: Unexpected HTTP/1.x request: GET /
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.http2.Http2Exception.connectionError(Http2Exception.java:103)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.http2.Http2ConnectionHandler$PrefaceDecoder.readClientPrefaceString(Http2ConnectionHandler.java:302)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.http2.Http2ConnectionHandler$PrefaceDecoder.decode(Http2ConnectionHandler.java:239)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:438)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:505)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Help Please.

Please find instruction here how to setup PortableRunner for Spark:
https://beam.apache.org/documentation/runners/spark/
Basically you need to setup additional Docker container (as described) which acts as runner between Beam (in any language) and Spark.
You connect Beam to runner, and runner to the Spark.

Related

localstack v0.11.5 and kcl v1.13.3

I am using kcl v1.13.3 with the latest localstack v0.11.5
The kcl client now uses edge service port 4566.
Are the kcl and localstack versions compatible?
I keep getting the following error:
com.amazonaws.SdkClientException: Unable to execute HTTP request: The target server failed to respond
Caused by: org.apache.http.NoHttpResponseException: The target server failed to respond
com.amazonaws.SdkClientException: Unable to execute HTTP request: The target server failed to respond
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException (AmazonHttpClient.java:1163)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper (AmazonHttpClient.java:1109)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute (AmazonHttpClient.java:758)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer (AmazonHttpClient.java:732)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute (AmazonHttpClient.java:714)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500 (AmazonHttpClient.java:674)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute (AmazonHttpClient.java:656)
at com.amazonaws.http.AmazonHttpClient.execute (AmazonHttpClient.java:520)
at com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke (AmazonKinesisClient.java:2782)
Caused by: org.apache.http.NoHttpResponseException: The target server failed to respond
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead (DefaultHttpResponseParser.java:141)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead (DefaultHttpResponseParser.java:56)
at org.apache.http.impl.io.AbstractMessageParser.parse (AbstractMessageParser.java:259)
at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader (DefaultBHttpClientConnection.java:163)
at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader (CPoolProxy.java:165)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse (HttpRequestExecutor.java:273)
at com.amazonaws.http.protocol.SdkHttpRequestExecutor.doReceiveResponse (SdkHttpRequestExecutor.java:82)
at org.apache.http.protocol.HttpRequestExecutor.execute (HttpRequestExecutor.java:125)
at org.apache.http.impl.execchain.MainClientExec.execute (MainClientExec.java:272)
at org.apache.http.impl.execchain.ProtocolExec.execute (ProtocolExec.java:185)
at org.apache.http.impl.client.InternalHttpClient.doExecute (InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute (CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute (CloseableHttpClient.java:56)
at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute (SdkHttpClient.java:72)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest (AmazonHttpClient.java:1285)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper (AmazonHttpClient.java:1101)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute (AmazonHttpClient.java:758)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer (AmazonHttpClient.java:732)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute (AmazonHttpClient.java:714)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500 (AmazonHttpClient.java:674)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute (AmazonHttpClient.java:656)
at com.amazonaws.http.AmazonHttpClient.execute (AmazonHttpClient.java:520)
Did you already confirm that localstack endpoint is functional on that port? For example:
aws kinesis list-streams --endpoint http://localhost:4566
(If you do not have and want aws cli installed, there's always the option of using docker)
Moreover, it might be helpful for you to share how you are boostrapping the AWS client. It should be something along the lines of:
AwsClientBuilder.EndpointConfiguration endpointConfig = new AwsClientBuilder.EndpointConfiguration("http://localhost:4566",
Regions.EU_WEST_1.getName());
return AmazonDynamoDBClientBuilder.standard()
.withEndpointConfiguration(endpointConfig)
.build();
Note that if you are running your kcl app inside another docker container, then you might want to change from "http://localhost:4566" to "http://localstack:4566".

How to resolve "Spark-Submit Error: Failed to load class" issue?

I am new to Spark / Scala, I tried to execute some sample scala program and facing some issue on this. Please find the steps I tried so far,
I have created the scala class (object) Sivam under the package com.dhana.MyScalaDemo
Step 1: Started Master
Step 2: Started Slave with above Master URL (Able to see both the Master & Worker node in localhost:8080 UI)
Step 3: Submitted to Spark as spark-submit --class com.dhana.MyScalaDemo.Sivam --master spark://10.xxx.xx.xxx:7077 file:///C:/Users/DHANABALAN/workspace/MyScalaDemo/target/MyScalaDemo-0.0.1-SNAPSHOT.jar
Getting the error as below,
Error: Failed to load class com.dhana.MyScalaDemo.Sivam.
Could you please assist me with this?

HADOOPFS - Could not verify the base directory in streamsets

I am having issues running the Pipeline with in streamsets, I can see the following error is :
HADOOPFS_44 - Could not verify the base directory: 'java.net.ConnectException: Call From SDC/...... to ......failed on connection exception: java.net.ConnectException: Connection refused;
For more details see:
https://cwiki.apache.org/confluence/display/HADOOP2/ConnectionRefused
You should follow the steps given in the Hadoop wiki. The machine running StreamSets Data Collector cannot connect to the Hadoop cluster for some reason.

Scala Spark : (org.apache.spark.repl.ExecutorClassLoader) Failed to check existence of class org on REPL class server at path

Running basic df.show() post spark notebook installation
I am getting the following error when running scala - spark code on spark-notebook. Any idea when this occurs and how to avoid?
[org.apache.spark.repl.ExecutorClassLoader] Failed to check existence of class org.apache.spark.sql.catalyst.expressions.Object on REPL class server at spark://192.168.10.194:50935/classes
[org.apache.spark.util.Utils] Aborting task
[org.apache.spark.repl.ExecutorClassLoader] Failed to check existence of class org on REPL class server at spark://192.168.10.194:50935/classes
[org.apache.spark.util.Utils] Aborting task
[org.apache.spark.repl.ExecutorClassLoader] Failed to check existence of class
I installed the spark on local, and when I was using following code it was giving me the same error.
spark.read.format("json").load("Downloads/test.json")
I think the issue was, it was trying to find some master node and taking some random or default IP. I specified the mode and then provided the IP as 127.0.0.1 and it resolved my issue.
Solution
Run the spark using local master
usr/local/bin/spark-shell --master "local[4]" --conf spark.driver.host=127.0.0.1'

Failed to open native connection to Cassandra from jar produced by intellij

I've developed a spark-based application that gets data from kafka and saves it in cassandra DB, using intellij.
connection code in scala:
val cluster = Cluster.builder().addContactPoint("192.168.0.253").withPort(9042).build();
val session = cluster.connect()
the code work fine when I run it from intellij, but I get this error when I try to run it from jar using command line:
Exception in thread "main" java.io.IOException:
Failed to open native connection to Cassandra at {192.168.0.253}:9042 at .... ....
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException:
All host(s) tried for query failed
(tried: /192.168.0.253:9042 (com.datastax.driver.core.exceptions.TransportException:
[/192.168.0.253] Error writing)) at
com.datastax.driver.core.ControlConnection.reconnectInternal(
ControlConnection.java:233) at
com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:79)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1424)
at com.datastax.driver.core.Cluster.getMetadata(Cluster.java:403)
at com.datastax.spark.connector.cql.CassandraConnector$.
com$datastax$spark$connector$cql$CassandraConnector$$createSession(
CassandraConnector.scala:155) ... 13 more
I've produced a jar file from intellij, and I built the jar with dependencies, using [ copy to the output directory and link via manifist ] option.
cassandra.yaml file:
#Whether to start the native transport server.
start_native_transport: true
#port for the CQL native transport to listen for clients on
native_transport_port: 9042
Why is this error raised & and how can I fix it?
Thanks in advance