MS Spark utilities - file move error using mssparkutils.fs.mv - PySpark

I am in a Synapse notebook, using PySpark to move a file with mssparkutils.fs.mv(src, dest, True).
Link to ms doc: https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/microsoft-spark-utilities?pivots=programming-language-python#move-file
Code:
filepath = "abfss://raw#xxxxdev001.blob.core.windows.net/SASDatFiles/test_sep22.sas7bdat "
movepath = "abfss://raw#xxxxdev001.blob.core.windows.net/SASDatFiles/Processed/test_sep22.sas7bdat"
mssparkutils.fs.mv(filepath,movepath, True)
Error:
Py4JJavaError: An error occurred while calling z:mssparkutils.fs.mv.
: Operation failed: "An HTTP header that's mandatory for this request is not specified.", 400, PUT, https://xxxxdev001.blob.core.windows.net/raw/SASDatFiles/Processed/test_sep22.sas7bdat?timeout=90, , ""
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:199)
at org.apache.hadoop.fs.azurebfs.services.AbfsClient.renamePath(AbfsClient.java:337)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.rename(AzureBlobFileSystemStore.java:774)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.rename(AzureBlobFileSystem.java:354)
at com.microsoft.spark.notebook.msutils.impl.MSFsUtilsImpl.mvWithinFileSystem(MSFsUtilsImpl.scala:128)
at com.microsoft.spark.notebook.msutils.impl.MSFsUtilsImpl.mv(MSFsUtilsImpl.scala:259)
at mssparkutils.fs$.mv(fs.scala:22)
at mssparkutils.fs.mv(fs.scala)
at sun.reflect.GeneratedMethodAccessor30.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
The notebook is running under my credentials, and I have the Owner / Blob Contributor roles assigned on the Azure Data Lake Gen2 account.
I can move files with my credentials in Storage Explorer with no issues.
Any clues about the error?

I tried to reproduce the same thing in my environment and got the expected output.
Configure your storage account access key using the syntax below:
spark.conf.set("fs.azure.account.key.<storage_account_name>.blob.core.windows.net","<Access_key>")
mssparkutils.fs.mv("abfss://<container_name>#<storage_account>.dfs.core.windows.net/vamsiba.sas7bdat","abfss://<container_name>#<storage_account>.dfs.core.windows.net/<folder>")
Or
If you want to copy the data instead, use mssparkutils.fs.cp as shown in the code below:
mssparkutils.fs.cp("abfss://<container_name>@<storage_account>.dfs.core.windows.net/vamsiba.sas7bdat","abfss://<container_name>@<storage_account>.dfs.core.windows.net/<folder>")
Note:
Source location: abfss://<container_name>@<storage_account>.dfs.core.windows.net/vamsiba.sas7bdat
Destination location: abfss://<container_name>@<storage_account>.dfs.core.windows.net/<folder>
Before running the code (i.e., moving the file), make sure the destination folder does not already contain a file named vamsiba.sas7bdat; otherwise you will get an error. A minimal guard for this is sketched below.
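For example, a minimal PySpark sketch (reusing the placeholder container, account, and folder names above; not part of the original answer) that lists the destination folder first and only moves the file when it is not already there:

# Placeholders; adjust to your storage account.
src = "abfss://<container_name>@<storage_account>.dfs.core.windows.net/vamsiba.sas7bdat"
dest_dir = "abfss://<container_name>@<storage_account>.dfs.core.windows.net/<folder>"

# mssparkutils.fs.ls returns FileInfo objects that expose a .name attribute.
existing = [f.name for f in mssparkutils.fs.ls(dest_dir)]

if "vamsiba.sas7bdat" not in existing:
    mssparkutils.fs.mv(src, dest_dir, True)
else:
    print("Destination already contains vamsiba.sas7bdat; skipping the move.")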

Related

How to use the Spark BigQuery connector in an existing Spark environment (not with Google Dataproc)

I am working on a data pipeline that takes data from various conventional databases and CSV files. It uses PySpark for preprocessing and then writes the resulting DataFrame into a BigQuery table.
I have copied the jar file gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.23.1.jar to the $SPARK_HOME/jars folder, and while creating the Spark session I specified the same package in spark.jars.packages.
Please give some suggestions, as all the blog posts I can find use Dataproc as the example.
spark = SparkSession.builder \
    .enableHiveSupport() \
    .appName('data_load') \
    .master('yarn') \
    .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.1') \
    .getOrCreate()
Scala Version = 2.12.15
While writing to BigQuery, I am getting an error related to authentication.
The BigQuery Connection API is enabled.
df.write.format('bigquery') \
.option('table',big_table_id) \
.save()
ERROR:
Py4JJavaError: An error occurred while calling o59.save.
: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Request had insufficient authentication scopes.
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:115)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getTable(HttpBigQueryRpc.java:286)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl$18.call(BigQueryImpl.java:746)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl$18.call(BigQueryImpl.java:743)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.RetryHelper.run(RetryHelper.java:76)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl.getTable(BigQueryImpl.java:742)
at com.google.cloud.bigquery.connector.common.BigQueryClient.getTable(BigQueryClient.java:107)
at com.google.cloud.spark.bigquery.BigQueryInsertableRelation.getTable$lzycompute(BigQueryInsertableRelation.scala:67)
at com.google.cloud.spark.bigquery.BigQueryInsertableRelation.getTable(BigQueryInsertableRelation.scala:67)
at com.google.cloud.spark.bigquery.BigQueryInsertableRelation.exists$lzycompute(BigQueryInsertableRelation.scala:53)
at com.google.cloud.spark.bigquery.BigQueryInsertableRelation.exists(BigQueryInsertableRelation.scala:49)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:114)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:128)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:247)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
GET https://www.googleapis.com/bigquery/v2/projects/project212/datasets/NYSE_datasets/tables/NYSE_table?prettyPrint=false
{
"code" : 403,
"details" : [ {
"#type" : "type.googleapis.com/google.rpc.ErrorInfo",
"reason" : "ACCESS_TOKEN_SCOPE_INSUFFICIENT"
} ],
"errors" : [ {
"domain" : "global",
"message" : "Insufficient Permission",
"reason" : "insufficientPermissions"
} ],
"message" : "Request had insufficient authentication scopes.",
"status" : "PERMISSION_DENIED"
}
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:118)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:37)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:428)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1111)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getTable(HttpBigQueryRpc.java:284)
You have to use a service account to authenticate outside Dataproc, as described in the spark-bigquery-connector documentation:
Use a service account JSON key and GOOGLE_APPLICATION_CREDENTIALS as described here.
Credentials can also be provided explicitly, either as a parameter or from the Spark runtime configuration. They can be passed in as a base64-encoded string directly, or as a file path that contains the credentials (but not both).
Example: spark.read.format("bigquery").option("credentials", "<SERVICE_ACCOUNT_JSON_IN_BASE64>")
or
spark.conf.set("credentials", "<SERVICE_ACCOUNT_JSON_IN_BASE64>")
Alternatively, specify the credentials file name.
spark.read.format("bigquery").option("credentialsFile","</path/to/key/file>")
or
spark.conf.set("credentialsFile","</path/to/key/file>")
Another alternative to passing the credentials is to pass the access token used for authenticating the API calls to the Google Cloud Platform APIs. You can get the access token by running gcloud auth application-default print-access-token.
spark.read.format("bigquery").option("gcpAccessToken","<access token>")
or
spark.conf.set("gcpAccessToken","<access-token>")

AWS GLUE: Cassandra connection using SSL is not working

I want to connect to Cassandra using Spark. Connecting to Cassandra on the default port works, but when I try to access it via SSL the job fails. Below is the code:
val spark: SparkSession = SparkSession.builder()
.config("spark.cassandra.connection.host","server.abc")
.config("spark.cassandra.connection.port","9142")
.config("spark.cassandra.connection.ssl.enabled",true)
.config("spark.cassandra.connection.ssl.trustStore.path","s3:/dev-code/certs/trust.jks")
.config("spark.cassandra.connection.ssl.trustStore.password","mypass")
.config("spark.cassandra.auth.username","myuser")
.config("spark.cassandra.auth.password","userpass")
.appName("CassandraIntegration").getOrCreate()
FYI: the job has access to the S3 bucket; I am able to read a CSV file from the same location. Also, both ports 9042 and 9142 are enabled. I closed 9042 and kept only 9142, but the error persists.
Below is the error:
ERROR [main] glue.ProcessLauncher (Logging.scala:logError(94)): Exception in User Class
java.io.IOException: Failed to open native connection to Cassandra at {server.abc:9142} :: Error instantiating class com.datastax.oss.driver.internal.core.ssl.DefaultSslEngineFactory (specified by advanced.ssl-engine-factory.class): Cannot initialize SSL Context
at com.datastax.spark.connector.cql.CassandraConnector$.createSession(CassandraConnector.scala:173)
at com.datastax.spark.connector.cql.CassandraConnector$.$anonfun$sessionCache$1(CassandraConnector.scala:161)
at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:32)
at com.datastax.spark.connector.cql.RefCountedCache.syncAcquire(RefCountedCache.scala:69)
at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:57)
at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:81)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
at com.datastax.spark.connector.datasource.CassandraCatalog$.com$datastax$spark$connector$datasource$CassandraCatalog$$getMetadata(CassandraCatalog.scala:455)
at com.datastax.spark.connector.datasource.CassandraCatalog$.getTableMetaData(CassandraCatalog.scala:421)
at org.apache.spark.sql.cassandra.DefaultSource.getTable(DefaultSource.scala:68)
at org.apache.spark.sql.cassandra.DefaultSource.inferSchema(DefaultSource.scala:72)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:296)
at scala.Option.map(Option.scala:230)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:266)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:226)
at MyCsvToCassandrsJob$.main(csv-to-cassanra-job:63)
at MyCsvToCassandrsJob.main(csv-to-cassanra-job-job)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke(ProcessLauncher.scala:47)
at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke$(ProcessLauncher.scala:47)
at com.amazonaws.services.glue.ProcessLauncher$$anon$1.invoke(ProcessLauncher.scala:75)
at com.amazonaws.services.glue.ProcessLauncher.launch(ProcessLauncher.scala:123)
at com.amazonaws.services.glue.ProcessLauncher$.main(ProcessLauncher.scala:29)
at com.amazonaws.services.glue.ProcessLauncher.main(ProcessLauncher.scala)
Caused by: java.lang.IllegalArgumentException: Error instantiating class com.datastax.oss.driver.internal.core.ssl.DefaultSslEngineFactory (specified by advanced.ssl-engine-factory.class): Cannot initialize SSL Context
at com.datastax.oss.driver.internal.core.util.Reflection.buildFromConfig(Reflection.java:253)
at com.datastax.oss.driver.internal.core.util.Reflection.buildFromConfig(Reflection.java:108)
at com.datastax.oss.driver.internal.core.context.DefaultDriverContext.buildSslEngineFactory(DefaultDriverContext.java:414)
at com.datastax.oss.driver.internal.core.context.DefaultDriverContext.lambda$new$4(DefaultDriverContext.java:279)
at com.datastax.oss.driver.internal.core.util.concurrent.LazyReference.get(LazyReference.java:55)
at com.datastax.oss.driver.internal.core.context.DefaultDriverContext.getSslEngineFactory(DefaultDriverContext.java:733)
at com.datastax.oss.driver.internal.core.context.DefaultDriverContext.buildSslHandlerFactory(DefaultDriverContext.java:470)
at com.datastax.oss.driver.internal.core.util.concurrent.LazyReference.get(LazyReference.java:55)
at com.datastax.oss.driver.internal.core.context.DefaultDriverContext.getSslHandlerFactory(DefaultDriverContext.java:799)
at com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded.init(DefaultSession.java:348)
at com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded.access$1100(DefaultSession.java:300)
at com.datastax.oss.driver.internal.core.session.DefaultSession.lambda$init$0(DefaultSession.java:146)
at com.datastax.oss.driver.shaded.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at com.datastax.oss.driver.shaded.netty.util.concurrent.PromiseTask.run(PromiseTask.java:106)
at com.datastax.oss.driver.shaded.netty.channel.DefaultEventLoop.run(DefaultEventLoop.java:54)
at com.datastax.oss.driver.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at com.datastax.oss.driver.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Cannot initialize SSL Context
at com.datastax.oss.driver.internal.core.ssl.DefaultSslEngineFactory.<init>(DefaultSslEngineFactory.java:74)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.datastax.oss.driver.internal.core.util.Reflection.buildFromConfig(Reflection.java:246)
... 18 more
Caused by: java.nio.file.NoSuchFileException: s3:/dev-code/certs/trust.jks
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
at java.nio.file.Files.newInputStream(Files.java:152)
at com.datastax.oss.driver.internal.core.ssl.DefaultSslEngineFactory.buildContext(DefaultSslEngineFactory.java:119)
at com.datastax.oss.driver.internal.core.ssl.DefaultSslEngineFactory.<init>(DefaultSslEngineFactory.java:72)
... 23 more
Any workaround for this problem would be a big help.
At the bottom of your error message, I see this:
NoSuchFileException: s3:/dev-code/certs/trust.jks
Alex is right, in that you need to provide a path to that file that the Spark connector can actually get to. From the looks of it, S3 won't work here.
I added the .jks file from S3 to the "Referenced files path" of the Glue job, and then accessed it by providing just the file name, since the file is automatically placed under the /tmp folder. But that alone still did not solve the issue.
From this website, I understood that we also need to provide all of the default SSL settings.
Below is my final code:
val spark: SparkSession = SparkSession.builder()
.config("spark.cassandra.connection.host","server.abc")
.config("spark.cassandra.connection.port","9142")
.config("spark.cassandra.connection.ssl.enabled",true)
.config("spark.cassandra.connection.ssl.enabledAlgorithms", "TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA")
.config("spark.cassandra.connection.ssl.trustStore.path","trust.jks")
.config("spark.cassandra.connection.ssl.trustStore.password","mypass")
.config("spark.cassandra.connection.ssl.trustStore.type","JKS")
.config("spark.cassandra.connection.ssl.protocol","TLS")
.config("spark.cassandra.auth.username","myuser")
.config("spark.cassandra.auth.password","userpass")
.appName("CassandraIntegration").getOrCreate()

Problems with Kafka Source initialization in Siddhi

I can't create a stream from a Kafka topic using Siddhi, even if I create the stream with Design View.
I copied all required jars to the lib and bundle folders, and even started Kafka with ZooKeeper locally (I don't know why I need it locally, but never mind).
On tooling.sh startup I see the following error:
[2020-02-26 22:15:43,041] WARNING {org.wso2.carbon.launcher.extensions.OSGiLibBundleDeployerUtils lambda$getBundlesInfo$1} - Error when loading the OSGi bundle information from /home/Hed/StreamProcessor/siddhi-tooling-5.1.2/lib/kafka-clients-2.3.0.jar
java.io.IOException: Required bundle manifest headers do not exist
at org.wso2.carbon.launcher.extensions.OSGiLibBundleDeployerUtils.getBundleInfo(OSGiLibBundleDeployerUtils.java:183)
at org.wso2.carbon.launcher.extensions.OSGiLibBundleDeployerUtils.lambda$getBundlesInfo$1(OSGiLibBundleDeployerUtils.java:135)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:313)
at java.util.stream.StreamSpliterators$DistinctSpliterator.forEachRemaining(StreamSpliterators.java:1291)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
at java.util.stream.AbstractTask.compute(AbstractTask.java:327)
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
For this script:
#App:name("HelloKafka")
#App:description('Consume events from a Kafka Topic and publish to a different Kafka Topic')
#source(type='kafka',
topic.list='kafka_topic',
partition.no.list='0',
threading.option='single.thread',
group.id="group",
bootstrap.servers='localhost:9092',
#map(type='json'))
define stream SweetProductionStream (name string, amount double);
I see this error on the Run command:
io.siddhi.core.exception.SiddhiAppCreationException: Error on 'HelloKafka' # Line: 10. Position: 26, near '#source(type='kafka',
topic.list='kafka_topic',
partition.no.list='0',
threading.option='single.thread',
group.id="group",
bootstrap.servers='localhost:9092',
#map(type='json'))'. org/apache/kafka/clients/producer/Producer
at io.siddhi.core.util.ExceptionUtil.populateQueryContext(ExceptionUtil.java:43)
at io.siddhi.core.util.parser.helper.DefinitionParserHelper.addEventSource(DefinitionParserHelper.java:388)
at io.siddhi.core.util.SiddhiAppRuntimeBuilder.defineStream(SiddhiAppRuntimeBuilder.java:117)
at io.siddhi.core.util.parser.SiddhiAppParser.defineStreamDefinitions(SiddhiAppParser.java:374)
at io.siddhi.core.util.parser.SiddhiAppParser.parse(SiddhiAppParser.java:230)
at io.siddhi.core.SiddhiManager.createSiddhiAppRuntime(SiddhiManager.java:85)
at io.siddhi.core.SiddhiManager.createSiddhiAppRuntime(SiddhiManager.java:95)
at io.siddhi.distribution.editor.core.internal.DebugRuntime.createRuntime(DebugRuntime.java:201)
at io.siddhi.distribution.editor.core.internal.DebugRuntime.(DebugRuntime.java:56)
at io.siddhi.distribution.editor.core.internal.DebugProcessorService.start(DebugProcessorService.java:38)
at io.siddhi.distribution.editor.core.internal.EditorMicroservice.start(EditorMicroservice.java:761)
at io.siddhi.distribution.editor.core.internal.EditorMicroservice.startWithVariables(EditorMicroservice.java:781)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.wso2.msf4j.internal.router.HttpMethodInfo.invokeResource(HttpMethodInfo.java:187)
at org.wso2.msf4j.internal.router.HttpMethodInfo.invoke(HttpMethodInfo.java:143)
at org.wso2.msf4j.internal.MSF4JHttpConnectorListener.dispatchMethod(MSF4JHttpConnectorListener.java:218)
at org.wso2.msf4j.internal.MSF4JHttpConnectorListener.lambda$onMessage$58(MSF4JHttpConnectorListener.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/kafka/clients/producer/Producer
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
at java.lang.Class.getConstructor0(Class.java:3075)
at java.lang.Class.newInstance(Class.java:412)
at io.siddhi.core.util.SiddhiClassLoader.loadClass(SiddhiClassLoader.java:32)
at io.siddhi.core.util.SiddhiClassLoader.loadExtensionImplementation(SiddhiClassLoader.java:48)
at io.siddhi.core.util.parser.helper.DefinitionParserHelper.addEventSource(DefinitionParserHelper.java:346)
... 21 more
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.clients.producer.Producer cannot be found by siddhi-io-kafka_5.0.7
at org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:448)
at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:361)
at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:353)
at org.eclipse.osgi.internal.loader.ModuleClassLoader.loadClass(ModuleClassLoader.java:161)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
... 28 more
Can somebody tell me what I am doing wrong? :(
Please make sure you have added the OSGi-converted jars to "C:\Program Files\WSO2\Enterprise Integrator\7.0.2\streaming-integrator\lib".
The OSGi-converted jar list:
kafka_2.12_2.3.0_1.0.0
kafka_clients_2.3.0_1.0.0
metrics_core_2.2.0_1.0.0
scala_library_2.12.8_1.0.0
zkclient_0.11_1.0.0
zookeeper_3.4.14_1.0.0
Then, copy the original jars to "C:\Program Files\WSO2\Enterprise Integrator\7.0.2\streaming-integrator\samples\sample-clients\lib".
The list of original jars:
kafka_2.12-2.3.0
kafka-clients-2.3.0
metrics-core-2.2.0
scala-library-2.12.8
zkclient-0.11
zookeeper-3.4.14
In order to generate the OSGi-converted jars, copy all original jars to a folder called "source" and create an empty folder called "destination". Then run the following command in the terminal:
From a MINGW32 shell in /c/Program Files/WSO2/Enterprise Integrator/7.0.2/streaming-integrator/bin:
./jartobundle.sh C:/DevTools/source C:/DevTools/destination
Finally, distribute the OSGi-converted and original jars into the directories listed above.
PS1: in my case I am using kafka_2.12-2.4.1, but the base names of the jars do not change.
PS2: adapt the directories to your installation path.
For more details, check the WSO2 documentation: Kafka transport.

Unable to connect from Dataflow job to Schema Registry when Schema Registry requires TLS client authentication

I am developing a GCP Cloud Dataflow job that uses a Kafka broker and Schema Registry.
Our Kafka broker and Schema Registry require a TLS client certificate.
I am facing a connection issue with the Schema Registry on deployment.
Any suggestion is highly welcomed.
Here is what I do for the Dataflow job.
I create Consumer Properties for TLS configurations.
props.put("security.protocol", "SSL");
props.put("ssl.truststore.password", "aaa");
props.put("ssl.keystore.password", "bbb");
props.put("ssl.key.password", "ccc"));
props.put("schema.registry.url", "https://host:port")
props.put("specific.avro.reader", true);
And I update the consumer properties via updateConsumerProperties.
Pipeline p = Pipeline.create(options)
...
.updateConsumerProperties(properties)
...
As the Stack Overflow answer Truststore and Google Cloud Dataflow suggests, I also download the keyStore and trustStore to a local directory and specify the trustStore / keyStore locations in the consumer properties inside the ConsumerFactory.
Pipeline p = Pipeline.create(options)
...
.withConsumerFactoryFn(new MyConsumerFactory(...))
...
In ConsumerFactory:
public Consumer<byte[], byte[]> apply(Map<String, Object> config) {
    // download keyStore and trustStore from GCS bucket
    config.put("ssl.truststore.location", (Object)localTrustStoreFilePath);
    config.put("ssl.keystore.location", (Object)localKeyStoreFilePath);
    return new KafkaConsumer<byte[], byte[]>(config);
}
With this code the deployment succeeded, but the Dataflow job got a TLS server certificate verification error:
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:387)
sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:292)
sun.security.validator.Validator.validate(Validator.java:260)
sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1513)
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
io.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:208)
io.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:252)
io.confluent.kafka.schemaregistry.client.rest.RestService.getId(RestService.java:482)
io.confluent.kafka.schemaregistry.client.rest.RestService.getId(RestService.java:475)
io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getSchemaByIdFromRegistry(CachedSchemaRegistryClient.java:151)
io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getBySubjectAndId(CachedSchemaRegistryClient.java:230)
io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getById(CachedSchemaRegistryClient.java:209)
io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserialize(AbstractKafkaAvroDeserializer.java:116)
io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserialize(AbstractKafkaAvroDeserializer.java:88)
org.fastretailing.rfid.store.siv.EPCTransactionKafkaAvroDeserializer.deserialize(EPCTransactionKafkaAvroDeserializer.scala:14)
org.fastretailing.rfid.store.siv.EPCTransactionKafkaAvroDeserializer.deserialize(EPCTransactionKafkaAvroDeserializer.scala:7)
org.apache.beam.sdk.io.kafka.KafkaUnboundedReader.advance(KafkaUnboundedReader.java:234)
org.apache.beam.sdk.io.kafka.KafkaUnboundedReader.start(KafkaUnboundedReader.java:176)
org.apache.beam.runners.dataflow.worker.WorkerCustomSources$UnboundedReaderIterator.start(WorkerCustomSources.java:779)
org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:361)
org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:194)
org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:76)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1228)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:143)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:967)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Then I found that the Schema Registry client loads its TLS configuration from system properties.
https://github.com/confluentinc/schema-registry/issues/943
I tested a Kafka consumer with the same configuration and confirmed that it works fine.
props.put("schema.registry.url", "https://host:port");
props.put("specific.avro.reader", true);
props.put("ssl.truststore.location", System.getProperty("javax.net.ssl.trustStore"));
props.put("ssl.truststore.password", System.getProperty("javax.net.ssl.keyStore"));
props.put("ssl.keystore.location", System.getProperty("javax.net.ssl.keyStore"));
props.put("ssl.keystore.password", System.getProperty("javax.net.ssl.keyStorePassword"));
props.put("ssl.key.password", System.getProperty("javax.net.ssl.key.password"));
Next I applied the same approach to the Dataflow job code, i.e. setting the same TLS configuration in both the system properties and the consumer properties.
I specified the passwords via system properties when executing the application.
-Djavax.net.ssl.keyStorePassword=aaa \
-Djavax.net.ssl.key.password=bbb \
-Djavax.net.ssl.trustStorePassword=ccc \
Note: I set the system properties for the trustStore and keyStore locations in the ConsumerFactory, since those files are downloaded to a local temp directory.
config.put("ssl.truststore.location", (Object)localTrustStoreFilePath);
config.put("ssl.keystore.location", (Object)localKeyStoreFilePath);
System.setProperty("javax.net.ssl.trustStore", localTrustStoreFilePath);
System.setProperty("javax.net.ssl.keyStore", localKeyStoreFilePath);
But even the deployment failed, with a timeout error:
Exception in thread "main" java.lang.RuntimeException: Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:224)
...
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
...
Caused by: java.lang.IllegalArgumentException: DataflowRunner requires gcpTempLocation, but failed to retrieve a value from PipelineOptions
at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions(DataflowRunner.java:246)
Caused by: java.lang.IllegalArgumentException: Error constructing default value for gcpTempLocation: tempLocation is not a valid GCS path, gs://dev-k8s-rfid-store-dataflow/rfid-store-siv-epc-transactions-to-bq/tmp.
at org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create(GcpOptions.java:255)
...
Caused by: java.lang.RuntimeException: Unable to verify that GCS bucket gs://dev-k8s-rfid-store-dataflow exists.
at org.apache.beam.sdk.extensions.gcp.storage.GcsPathValidator.verifyPathIsAccessible(GcsPathValidator.java:86)
...
Caused by: java.io.IOException: Error getting access token for service account: java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: Default, provider: SunJSSE, class: sun.security.ssl.SSLContextImpl$DefaultSSLContext)
at com.google.auth.oauth2.ServiceAccountCredentials.refreshAccessToken(ServiceAccountCredentials.java:401)
...
Caused by: java.net.SocketException: java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: Default, provider: SunJSSE, class: sun.security.ssl.SSLContextImpl$DefaultSSLContext)
at javax.net.ssl.DefaultSSLSocketFactory.throwException(SSLSocketFactory.java:248)
...
Caused by: java.security.NoSuchAlgorithmException: Error constructing implementation (algorithm: Default, provider: SunJSSE, class: sun.security.ssl.SSLContextImpl$DefaultSSLContext)
at java.security.Provider$Service.newInstance(Provider.java:1617)
...
Caused by: java.io.IOException: Keystore was tampered with, or password was incorrect
at sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:780)
Caused by: java.security.UnrecoverableKeyException: Password verification failed
at sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:778)
Am I missing something?
In the ConsumerFactoryFn, you need to copy the certificate from some location (such as GCS) to a local file path on the machine.
In Truststore and Google Cloud Dataflow, the ConsumerFnFactory that the user writes has this snippet of code which fetches the truststore from GCS:
Storage storage = StorageOptions.newBuilder()
.setProjectId("prj-id-of-your-bucket")
.setCredentials(GoogleCredentials.getApplicationDefault())
.build()
.getService();
Blob blob = storage.get("your-bucket-name", "pth.to.your.kafka.client.truststore.jks");
ReadChannel readChannel = blob.reader();
FileOutputStream fileOutputStream = new FileOutputStream("/tmp/kafka.client.truststore.jks"); // path where the jks file will be stored
fileOutputStream.getChannel().transferFrom(readChannel, 0, Long.MAX_VALUE);
fileOutputStream.close();
File f = new File("/tmp/kafka.client.truststore.jks"); //assuring the store file exists
if (f.exists())
{
LOG.debug("key exists");
}
else
{
LOG.error("key does not exist");
}
You'll need to do something similar (it doesn't have to be GCS but it does need to be accessible from all VMs executing your pipeline on Google Cloud Dataflow).
I got a reply from GCP support. It seems that Apache Beam does not support Schema Registry.
Hello, the Dataflow specialist has reached back to me. I will now relay what they have told me.
The answer to your question is no, Apache Beam does not support Schema Registry. However, they have told me that you could implement the calls to Schema Registry yourself, as Beam only consumes raw messages and it is the user's responsibility to do whatever they need with the data.
This is based on our understanding of the case: that you want to publish messages to Kafka and have Dataflow consume those messages, parsing them based on the schema from the registry.
I hope this information can be useful to you; let me know if I can be of further help.
But the Dataflow job can still receive Avro-format binary messages, so you can call the Schema Registry REST API yourself, as in the answer linked below.
https://stackoverflow.com/a/55917157
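For illustration only (a Python sketch, not the code from the linked answer), a manual call to the Schema Registry REST API with a TLS client certificate could look roughly like this; the registry URL and certificate paths are placeholders:

import requests

# Hypothetical placeholders; adjust to your registry and certificate material.
REGISTRY_URL = "https://host:port"
CLIENT_CERT = ("/tmp/client.crt", "/tmp/client.key")  # client certificate and private key
CA_BUNDLE = "/tmp/ca.pem"                             # CA bundle used to verify the registry

def fetch_schema(schema_id):
    # GET /schemas/ids/{id} returns a JSON object with the Avro schema as a string.
    resp = requests.get(
        "{}/schemas/ids/{}".format(REGISTRY_URL, schema_id),
        cert=CLIENT_CERT,
        verify=CA_BUNDLE,
    )
    resp.raise_for_status()
    return resp.json()["schema"]

# Example: avro_schema_str = fetch_schema(1)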

java.lang.AbstractMethodError: com/ibm/stocator/fs/common/IStoreClient.setStocatorPath(Lcom/ibm/stocator/fs/common/StocatorPath;)V

I'm trying to access data on IBM COS from Data Science Experience based on this blog post.
First, I select the 1.0.8 version of Stocator:
!pip install --user --upgrade pixiedust
import pixiedust
pixiedust.installPackage("com.ibm.stocator:stocator:1.0.8")
I restarted the kernel, then:
access_key = 'xxxx'
secret_key = 'xxxx'
bucket = 'xxxx'
host = 'lon.ibmselect.objstor.com'
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.s3d.service.endpoint", "http://" + host)
hconf.set("fs.s3d.service.access.key", access_key)
hconf.set("fs.s3d.service.secret.key", secret_key)
file = 'mydata_file.tsv.gz'
inputDataset = "s3d://{}.service/{}".format(bucket, file)
lines = sc.textFile(inputDataset, 1)
lines.count()
However, that results in the following error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.AbstractMethodError: com/ibm/stocator/fs/common/IStoreClient.setStocatorPath(Lcom/ibm/stocator/fs/common/StocatorPath;)V
at com.ibm.stocator.fs.ObjectStoreFileSystem.initialize(ObjectStoreFileSystem.java:104)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:249)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:249)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:249)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:249)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:53)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:249)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:249)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1927)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:932)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:378)
at org.apache.spark.rdd.RDD.collect(RDD.scala:931)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:95)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
at java.lang.reflect.Method.invoke(Method.java:507)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:785)
Note: My first attempt at connecting to IBM COS resulted in a different error. That attempt is captured here: No FileSystem for scheme: cos
No need to install Stocator; it is already there. As Roland mentioned, a new installation would most likely collide with the pre-installed one and cause conflicts.
Try ibmos2spark:
https://stackoverflow.com/a/46035893/8558372
Let me know if you are still facing problems.
Don't force-install a new Stocator unless you have a really good reason.
I highly recommend the Spark aaS documentation at:
https://console.bluemix.net/docs/services/AnalyticsforApacheSpark/index-gentopic1.html#genTopProcId2
Please choose the correct COS endpoints from:
https://ibm-public-cos.github.io/crs-docs/endpoints
and PLEASE use the private endpoints if you're working from within the IBM Cloud; it will be much faster and cheaper.
The documentation has examples of how to access COS data using all the nice helpers. It boils down to:
import ibmos2spark
credentials = {
    'endpoint': 's3-api.us-geo.objectstorage.service.networklayer.com',  # just an example; your URL might be different
    'access_key': 'my access key',
    'secret_key': 'my secret key'
}
bucket_name = 'my bucket name'
object_name = 'mydata_file.tsv.gz'
cos = ibmos2spark.CloudObjectStorage(sc, credentials)
lines = sc.textFile(cos.url(object_name, bucket_name),1)
lines.count()
Chris, I usually don't use the 'http://' in the endpoint and that works for me. Not sure if that is the problem here.
Here is how I access the COS objects from DSX notebooks:
endpoint = "s3-api.dal-us-geo.objectstorage.softlayer.net"
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.s3d.service.endpoint",endpoint)
hconf.set("fs.s3d.service.access.key",Access_Key_ID)
hconf.set("fs.s3d.service.secret.key",Secret_Access_Key)
inputObject = "s3d://<bucket>.service/<file>"
myRDD = sc.textFile(inputObject,1)
DSX has a version of stocator on the classpath for Spark 2.0 and Spark 2.1 kernels. The one you installed in your instance is likely to get into conflict with the pre-installed version.