How to validate connection strings of Azure IoT Hub in Spark Streaming? - Scala

I'm new to Azure IoT Hub. I'm trying to pull messages from Azure IoT Hub using Spark Streaming, and I get an error when I execute the code; I gather there is some problem with the connection strings. Is there a specific way to validate the connection string in Spark, and is the format I specified correct?
My sample code:
import org.apache.spark.eventhubs._
val eventHubName = "xyztest.azure-devices.net"
val eventHubNSConnStr = "Endpoint=sb://testname.servicebus.windows.net/;SharedAccessKeyName=primary;SharedAccessKey=abcedfgrdxyeurjrsdfyasdf="
val connStr = ConnectionStringBuilder(eventHubNSConnStr).setEventHubName(eventHubName).build
val customEventhubParameters = EventHubsConf(connStr).setMaxEventsPerTrigger(5)
val incomingStream = spark.readStream.format("eventhubs").options(customEventhubParameters.toMap).load()
incomingStream.writeStream.outputMode("append").format("console").option("truncate", false).start().awaitTermination()
Error:
java.util.concurrent.ExecutionException: com.microsoft.azure.eventhubs.IllegalEntityException: The messaging entity 'sb://testname.servicebus.windows.net/xyztest' could not be found.
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
at org.apache.spark.eventhubs.client.EventHubsClient.partitionCount(EventHubsClient.scala:169)

val eventHubName = "xyztest.azure-devices.net"
It seems you set the wrong event hub name. "xyztest.azure-devices.net" is your Azure IoT Hub hostname, not the event hub name.
To find the event hub name, go to your IoT Hub -> Built-in endpoints -> Events and copy the value of "Event Hub-compatible name".
In the end, the event hub connection string will have the following format:
Endpoint=sb://SAMPLE;SharedAccessKeyName=KEY_NAME;SharedAccessKey=KEY;EntityPath=EVENTHUB_NAME
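Applied to the code in the question, a minimal sketch of how the pieces fit together (the endpoint, key and Event Hub-compatible name below are placeholders, not real values; copy yours from the Built-in endpoints blade):
import org.apache.spark.eventhubs._
// Placeholders: copy the real values from IoT Hub -> Built-in endpoints -> Events.
val eventHubCompatibleName = "iothub-ehub-xyztest-1234567-abcdefgh"   // NOT the *.azure-devices.net hostname
val eventHubCompatibleEndpoint = "Endpoint=sb://ihsuprodxxyyzz.servicebus.windows.net/;SharedAccessKeyName=iothubowner;SharedAccessKey=KEY"
val connStr = ConnectionStringBuilder(eventHubCompatibleEndpoint)
  .setEventHubName(eventHubCompatibleName)
  .build
val customEventhubParameters = EventHubsConf(connStr).setMaxEventsPerTrigger(5)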

Related

How to create a bucket using the Python SDK?

I'm trying to create a bucket in Cloud Object Storage using Python. I have followed the instructions in the API docs.
This is the code I'm using:
import ibm_boto3
from ibm_botocore.client import Config

COS_ENDPOINT = "https://control.cloud-object-storage.cloud.ibm.com/v2/endpoints"

# Create client
cos = ibm_boto3.client("s3",
    ibm_api_key_id=COS_API_KEY_ID,
    ibm_service_instance_id=COS_INSTANCE_CRN,
    config=Config(signature_version="oauth"),
    endpoint_url=COS_ENDPOINT
)

s3 = ibm_boto3.resource('s3')

def create_bucket(bucket_name):
    print("Creating new bucket: {0}".format(bucket_name))
    s3.Bucket(bucket_name).create()
    return

bucket_name = 'test_bucket_442332'
create_bucket(bucket_name)
I'm getting this error. I tried setting CreateBucketConfiguration={"LocationConstraint":"us-south"}, but it doesn't seem to work:
"ClientError: An error occurred (IllegalLocationConstraintException) when calling the CreateBucket operation: The unspecified location constraint is incompatible for the region specific endpoint this request was sent to."
Resolved by going to https://cloud.ibm.com/docs/cloud-object-storage?topic=cloud-object-storage-endpoints#endpoints and choosing the endpoint specific to the region I need. The "Endpoint" provided with the credentials is not the actual service endpoint.

How to download from Google Cloud Storage with Alpakka-gcs without providing a secret key?

I'm using Alpakka-gcs to connect to GCS from a Google Compute Engine instance, and it works perfectly if I provide the GCS secret key in application.conf like below:
alpakka.google.cloud.storage {
  project-id = "project_id"
  client-email = "client_email"
  private-key = "************gcs-secret-key************"
  base-url = "https://www.googleapis.com/" // default
  base-path = "/storage/v1" // default
  token-url = "https://www.googleapis.com/oauth2/v4/token" // default
  token-scope = "https://www.googleapis.com/auth/devstorage.read_write" // default
}
My question is: how can I connect from a Compute Engine instance that already has credentials, without providing a secret key to Alpakka?
The code sample below works fine, but I want to know the Alpakka way.
import com.google.auth.oauth2.{ComputeEngineCredentials, GoogleCredentials}
import com.google.cloud.storage.{BlobId, StorageOptions}
import java.nio.file.Paths

def downloadObject(objectName: String, destFilePath: String): Unit = {
  // Picks up the instance's own service account from the metadata server, no key file needed.
  def credential: GoogleCredentials = ComputeEngineCredentials.create()
  val storage = StorageOptions.newBuilder.setCredentials(credential).setProjectId(projectId).build.getService
  val blob = storage.get(BlobId.of(bucketName, objectName))
  blob.downloadTo(Paths.get(destFilePath))
}
If you look into the Alpakka sources, you can see an accessToken creation. Sadly, this version only supports an internal call to GoogleTokenApi, an Alpakka-made client for requesting tokens from Google Cloud, and it is based only on the private key, not on the metadata server or the GOOGLE_APPLICATION_CREDENTIALS environment variable.
You can propose a change to the project, or even develop it and contribute it to the project, using the Google Cloud OAuth client library.
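As a starting point, here is a minimal sketch (my own illustration, not an existing Alpakka API) of how such a change could obtain a token from the Compute Engine metadata server via the google-auth-library, without any private key:
import com.google.auth.oauth2.ComputeEngineCredentials

def fetchAccessToken(): String = {
  // Asks the Compute Engine metadata server for a short-lived OAuth token
  // for the instance's service account; no key file is involved.
  val credentials = ComputeEngineCredentials.create()
  credentials.refreshAccessToken().getTokenValue
}
Such a token could then be plugged into the Authorization header that Alpakka's storage requests need.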

Could not instantiate EventHubSourceProvider for Azure Databricks

Using the steps documented in structured streaming pyspark, I'm unable to create a DataFrame in PySpark from the Azure Event Hub I have set up in order to read the stream data.
Error message is:
java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.eventhubs.EventHubsSourceProvider could not be instantiated
I have installed the Maven libraries (com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.12 is unavailable) but none appear to work:
com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15
com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.6
I have also tried ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString), but the error message returned is:
java.lang.NoSuchMethodError: org.apache.spark.internal.Logging.$init$(Lorg/apache/spark/internal/Logging;)V
The connection string is correct as it is also used in a console application that writes to the Azure Event Hub and that works.
Can someone point me in the right direction, please? The code in use is as follows:
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Event Hub Namespace Name
NAMESPACE_NAME = "*myEventHub*"
KEY_NAME = "*MyPolicyName*"
KEY_VALUE = "*MySharedAccessKey*"
# The connection string to your Event Hubs Namespace
connectionString = "Endpoint=sb://{0}.servicebus.windows.net/;SharedAccessKeyName={1};SharedAccessKey={2};EntityPath=ingestion".format(NAMESPACE_NAME, KEY_NAME, KEY_VALUE)
ehConf = {}
ehConf['eventhubs.connectionString'] = connectionString
# For 2.3.15 version and above, the configuration dictionary requires that connection string be encrypted.
# ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
df = spark \
    .readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()
To resolve the issue, I did the following:
Uninstall the existing Azure Event Hubs library versions
Install the com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15 library version from Maven Central
Restart the cluster
Validate by re-running the code provided in the question
I received this same error when installing libraries with the version number com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.* on a Spark cluster running Spark 3.0 with Scala 2.12.
For anyone else finding this via Google: check that you have the correct Scala library version. In my case, my cluster is Spark 3 with Scala 2.12.
Changing the "2.11" in the library version from the tutorial I was using to "2.12", so that it matches my cluster runtime version, fixed the issue.
I had to take this a step further: in the format method I had to specify the full provider class, .format("org.apache.spark.sql.eventhubs.EventHubsSourceProvider"), directly.
Check the cluster Scala version and the library version. Uninstall the older libraries and install:
com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.17
in the shared workspace (right click and install library) and also on the cluster.

How to query Flink's queryable state

I am using Flink 1.8.0 and I am trying to query my job's state.
val descriptor = new ValueStateDescriptor("myState", Types.CASE_CLASS[Foo])
descriptor.setQueryable("my-queryable-State")
I used port 9067, which is the default port according to this. My client:
val client = new QueryableStateClient("127.0.0.1", 9067)
val jobId = JobID.fromHexString("d48a6c980d1a147e0622565700158d9e")
val execConfig = new ExecutionConfig
val descriptor = new ValueStateDescriptor("my-queryable-State", Types.CASE_CLASS[Foo])
val res: Future[ValueState[Foo]] = client.getKvState(jobId, "my-queryable-State","a", BasicTypeInfo.STRING_TYPE_INFO, descriptor)
res.map(_.toString).pipeTo(sender)
but I am getting:
[ERROR] [06/25/2019 20:37:05.499] [bvAkkaHttpServer-akka.actor.default-dispatcher-5] [akka.actor.ActorSystemImpl(bvAkkaHttpServer)] Error during processing of request: 'org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9067'. Completing with 500 Internal Server Error response. To change default exception handling behavior, provide a custom ExceptionHandler.
java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9067
What am I doing wrong?
And how and where should I define QueryableStateOptions?
So, if you want to use Queryable State, you need to add the proper jar to your Flink setup. The jar is flink-queryable-state-runtime; it can be found in the opt folder of your Flink distribution, and you should move it to the lib folder.
As for the second question, QueryableStateOptions is just a class that is used to create static ConfigOption definitions. The definitions are then used to read the configuration from the flink-conf.yaml file. So currently the only way to configure Queryable State is through the flink-conf.yaml file in the Flink distribution.
EDIT: Also, try reading this; it provides more info on how Queryable State works. You shouldn't connect directly to the server port; rather, you should use the proxy port, which by default is 9069.
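For illustration, a minimal sketch of the client side under that setup (Foo, Types.CASE_CLASS and the job id are the placeholders from the question; the only real change is pointing at the proxy port, 9069 by default, which is configured with queryable-state.proxy.ports in flink-conf.yaml):
import org.apache.flink.api.common.JobID
import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.queryablestate.client.QueryableStateClient

// Connect to the queryable state proxy (default 9069), not the state server port 9067.
val client = new QueryableStateClient("127.0.0.1", 9069)
val jobId = JobID.fromHexString("d48a6c980d1a147e0622565700158d9e")
val descriptor = new ValueStateDescriptor("my-queryable-State", Types.CASE_CLASS[Foo])  // as in the question
val resultFuture = client.getKvState(jobId, "my-queryable-State", "a", BasicTypeInfo.STRING_TYPE_INFO, descriptor)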

Reset connection to URL/port if no data has been received in the past N minutes

I have a Flink application that connects to a URL/port. I see that the restart strategy allows checking whether the connection is still open or not.
My question is: if the connection is open but no data has been received in the past 'N' minutes, I want to RESET the connection.
Currently I set up the connection using the basic Flink tutorials:
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._

// set up the streaming execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
val data_stream = env.socketTextStream(url, port, socket_stream_deliminator, socket_connection_retries)
  .map(x => printInput(x))
  .writeToSocket(url, port, new SimpleStringSchema())
// execute program
env.execute("Flink Streaming Scala API Skeleton")
Is there a function call or some built-in operator I can use to check whether the connection has sent data in the past 'N' minutes and, if not, reconnect?
How would I go about doing this?