I have a Glue job that reads directly from Redshift, and to do that it has to provide connection credentials. I have created an embedded Glue connection and can extract the credentials with the following PySpark code. Is there a way to do this in Scala?
glue = boto3.client('glue', region_name='us-east-1')
response = glue.get_connection(
Name='name-of-embedded-connection',
HidePassword=False
)
table = spark.read.format(
'com.databricks.spark.redshift'
).option(
'url',
'jdbc:redshift://prod.us-east-1.redshift.amazonaws.com:5439/db'
).option(
'user',
response['Connection']['ConnectionProperties']['USERNAME']
).option(
'password',
response['Connection']['ConnectionProperties']['PASSWORD']
).option(
'dbtable',
'db.table'
).option(
'tempdir',
's3://config/glue/temp/redshift/'
).option(
'forward_spark_s3_credentials', 'true'
).load()
AWS does not provide a Scala equivalent for this API call, but you can use the Java SDK from Scala, as mentioned in this answer.
This is the Java SDK call for getConnection. If you don't want to do that, you can follow the approach below:
Create an AWS Glue Python shell job and retrieve the connection information.
Once you have the values, start the Scala Glue job from the Python shell job and pass them as arguments, as shown below:
import boto3

glue = boto3.client('glue', region_name='us-east-1')
# Fetch the connection, including the password
connection = glue.get_connection(
    Name='name-of-embedded-connection',
    HidePassword=False
)
props = connection['Connection']['ConnectionProperties']
# Start the Scala job and pass the credentials as job arguments
response = glue.start_job_run(
    JobName='my_scala_Job',
    Arguments={
        '--username': props['USERNAME'],
        '--password': props['PASSWORD']
    }
)
Then read these parameters inside your Scala job using getResolvedOptions, as shown below:
import com.amazonaws.services.glue.util.GlueArgParser
val args = GlueArgParser.getResolvedOptions(
sysArgs, Array(
"username",
"password")
)
val user = args("username")
val pwd = args("password")
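The hand-off in the Python shell job can be sketched as a small helper that turns a get_connection response into the Arguments mapping for start_job_run. This is only an illustration: build_job_arguments is a hypothetical name, and the fake response stands in for a real API call. Keep in mind that job arguments are stored with the job run, so anyone who can view the run can likely read the password.

```python
def build_job_arguments(response):
    """Map a glue.get_connection() response to Glue job arguments.

    Hypothetical helper; the key names follow the answer above.
    """
    props = response['Connection']['ConnectionProperties']
    return {
        '--username': props['USERNAME'],
        '--password': props['PASSWORD'],
    }


# Stand-in for a real get_connection response (illustrative values only).
fake_response = {
    'Connection': {
        'ConnectionProperties': {'USERNAME': 'etl_user', 'PASSWORD': 'secret'}
    }
}
job_arguments = build_job_arguments(fake_response)
```

The resulting dictionary is what would be passed as Arguments when starting the Scala job.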
I have a variable whose value I'd like to push to Airflow so that I can use it as input for the next task. I know that I must use XComs, but I haven't figured out how to push from the Spark task to Airflow.
def c_count():
return spark_task(
name='c_count',
script='c_count.py',
dag=dag,
table=None,
host=Variable.get('host'),
trigger_rule="all_done",
provide_context=True,
xcom_push = True
)
def c_int():
return spark_task(
name='c_in',
script='another_test.py',
dag=dag,
table=None,
host=Variable.get('host'),
trigger_rule="all_done",
counts="{{ task_instance.xcom_pull(task_ids='c_count') }}"
)
EDIT:
The spark task is the following:
def spark_task_sapbw(name, script, dag, table, host, **kwargs):
spark_cmd = 'spark-submit'
if Variable.get('spark_master_uri', None):
spark_cmd += ' --master {}'.format(Variable.get('spark_master_uri'))
.
.
.
task = BashOperator(
task_id=name,
bash_command=spark_cmd,
dag=dag,
**kwargs
)
return task
The problem is that what I get back is the last line printed to the Airflow log. Is there any way to get a specific value from the Spark script? Thank you!
You cannot make Spark and Airflow communicate directly. You have to use Python in between: collect the values you need and push them to Airflow with XComs.
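One way to wire this up, sketched here, is to have the Spark script print the value on a marked line and let a Python callable extract it from the captured output. With a PythonOperator, the callable's return value is pushed to XCom; with a BashOperator and xcom_push=True, only the last line of stdout is pushed, which matches what the question observes. The RESULT= marker and the function name are assumptions made for this illustration, not Airflow conventions.

```python
def extract_result(stdout, marker='RESULT='):
    """Return the value from the last line that starts with `marker`.

    Returns None when no such line exists.
    """
    value = None
    for line in stdout.splitlines():
        line = line.strip()
        if line.startswith(marker):
            value = line[len(marker):]
    return value


# Example: output captured from spark-submit (illustrative text).
captured = "INFO starting job\nRESULT=1234\nINFO shutting down\n"
count = extract_result(captured)
```

A callable that runs spark-submit, passes its output through such a function, and returns the result would make the value available to xcom_pull in the next task.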
I created a simple spark streaming application to consume data from Flume using Pull-based approach.
Spark version: 2.2.0
Flume version: 1.7.0
It works well when I run the program from my PC in Eclipse (Run As - Scala Application). But after packaging it into a jar and submitting the app via spark-submit, it does not receive any data from Flume. Here's my code:
def main(args: Array[String]){
val conf = new SparkConf().setAppName("twitter").set("spark.streaming.stopGracefullyOnShutdown", "true")
val ssc = new StreamingContext(conf, Seconds(30))
val flumeStream = FlumeUtils.createPollingStream(ssc, "172.31.190.31", 9999)
val tweets = flumeStream.map(e => new String(e.event.getBody.array()))
tweets.print()
tweets.foreachRDD(rdd=>{
rdd.saveAsTextFile("/warehouse/raw/twitter/data")
})
ssc.start()
ssc.awaitTermination()
}
I build the program via Right Click the project - Run As - Maven Build - Goals=package - Run.
And here's how I submit the app:
spark-submit --master local[*] --deploy-mode client --class co.id.linknet.general.StreamingFlume ./spark/lib/linknet-general-1.0.1.jar
Flume config:
TwitterAgent01.sources = Twitter
TwitterAgent01.channels = MemoryChannel01
TwitterAgent01.sinks = HDFS
TwitterAgent01.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent01.sources.Twitter.channels = MemoryChannel01
TwitterAgent01.sources.Twitter.consumerKey = xxx
TwitterAgent01.sources.Twitter.consumerSecret = xxx
TwitterAgent01.sources.Twitter.accessToken = xxx
TwitterAgent01.sources.Twitter.accessTokenSecret = xxx
TwitterAgent01.sources.Twitter.keywords = keyword1, keyword2, keywordN
TwitterAgent01.sinks = sparkStream
TwitterAgent01.sinks.sparkStream.type = org.apache.spark.streaming.flume.sink.SparkSink
TwitterAgent01.sinks.sparkStream.hostname = edge01
TwitterAgent01.sinks.sparkStream.port = 9999
TwitterAgent01.sinks.sparkStream.channel = MemoryChannel01
TwitterAgent01.channels.MemoryChannel01.type = memory
TwitterAgent01.channels.MemoryChannel01.capacity = 10000
TwitterAgent01.channels.MemoryChannel01.transactionCapacity = 10000
Flume and the Spark submission run on the same server, and I'm able to telnet to port 9999 from that server.
For additional info, I've added some required libraries to the Flume and Spark directories:
$FLUME_HOME/lib
spark-streaming-flume_2.11-2.2.0.jar
spark-streaming-flume-sink_2.11-2.2.0.jar
scala-library-2.11.8.jar
commons-lang3-3.5.jar
$SPARK_HOME/jars
spark-streaming-flume_2.11-2.2.0.jar
spark-streaming-flume-sink_2.11-2.2.0.jar
scala-library-2.11.8.jar
commons-lang-2.5.jar
commons-lang3-3.5.jar
Did I miss something?
I'm trying to connect to IBM Cloud Object Storage from IBM Data Science Experience:
access_key = 'XXX'
secret_key = 'XXX'
bucket = 'mybucket'
host = 'lon.ibmselect.objstor.com'
service = 'mycos'
sqlCxt = SQLContext(sc)
hconf = sc._jsc.hadoopConfiguration()
hconf.set('fs.cos.myCos.access.key', access_key)
hconf.set('fs.cos.myCos.endpoint', 'http://' + host)
hconf.set('fs.cose.myCos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
obj = 'mydata.tsv.gz'
rdd = sc.textFile('cos://{0}.{1}/{2}'.format(bucket, service, obj))
print(rdd.count())
This returns:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: No FileSystem for scheme: cos
I'm guessing I need to use the 'cos' scheme based on the stocator docs. However, the error suggests stocator isn't available or is an old version?
Any ideas?
Update 1:
I have also tried the following:
sqlCxt = SQLContext(sc)
hconf = sc._jsc.hadoopConfiguration()
hconf.set('fs.cos.impl', 'com.ibm.stocator.fs.ObjectStoreFileSystem')
hconf.set('fs.stocator.scheme.list', 'cos')
hconf.set('fs.stocator.cos.impl', 'com.ibm.stocator.fs.cos.COSAPIClient')
hconf.set('fs.stocator.cos.scheme', 'cos')
hconf.set('fs.cos.mycos.access.key', access_key)
hconf.set('fs.cos.mycos.endpoint', 'http://' + host)
hconf.set('fs.cos.mycos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
service = 'mycos'
obj = 'mydata.tsv.gz'
rdd = sc.textFile('cos://{0}.{1}/{2}'.format(bucket, service, obj))
print(rdd.count())
However, this time the response was:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: No object store for: cos
at com.ibm.stocator.fs.ObjectStoreVisitor.getStoreClient(ObjectStoreVisitor.java:121)
...
Caused by: java.lang.ClassNotFoundException: com.ibm.stocator.fs.cos.COSAPIClient
The latest version of Stocator (v1.0.9), which supports the fs.cos scheme, is not yet deployed on Spark as a Service (it will be soon). Please use the Stocator scheme "fs.s3d" to connect to your COS.
Example:
endpoint = 'endpointXXX'
access_key = 'XXX'
secret_key = 'XXX'
prefix = "fs.s3d.service"
hconf = sc._jsc.hadoopConfiguration()
hconf.set(prefix + ".endpoint", endpoint)
hconf.set(prefix + ".access.key", access_key)
hconf.set(prefix + ".secret.key", secret_key)
bucket = 'mybucket'
obj = 'mydata.tsv.gz'
rdd = sc.textFile('s3d://{0}.service/{1}'.format(bucket, obj))
rdd.count()
Alternatively, you can use ibmos2spark. The lib is already installed on our service. Example:
import ibmos2spark
credentials = {
'endpoint': 'endpointXXXX',
'access_key': 'XXXX',
'secret_key': 'XXXX'
}
configuration_name = 'os_configs' # any string you want
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name)
bucket = 'mybucket'
obj = 'mydata.tsv.gz'
rdd = sc.textFile(cos.url(obj, bucket))
rdd.count()
Stocator is on the classpath for Spark 2.0 and 2.1 kernels, but the cos scheme is not configured. You can access the config by executing the following in a Python notebook:
!cat $SPARK_CONF_DIR/core-site.xml
Look for the property fs.stocator.scheme.list. What I currently see is:
<property>
<name>fs.stocator.scheme.list</name>
<value>swift2d,swift,s3d</value>
</property>
I recommend that you raise a feature request against DSX to support the cos scheme.
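The check described above can also be done programmatically. The following is a minimal sketch that parses the contents of core-site.xml and lists the configured Stocator schemes. The function name is hypothetical, and the sample string assumes the usual Hadoop layout with a configuration root element around the property shown above.

```python
import xml.etree.ElementTree as ET


def stocator_schemes(core_site_xml):
    """Return the schemes listed under fs.stocator.scheme.list."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter('property'):
        if prop.findtext('name') == 'fs.stocator.scheme.list':
            value = prop.findtext('value') or ''
            return [s.strip() for s in value.split(',') if s.strip()]
    return []


# Sample input mirroring the property shown above.
sample = """
<configuration>
  <property>
    <name>fs.stocator.scheme.list</name>
    <value>swift2d,swift,s3d</value>
  </property>
</configuration>
"""
schemes = stocator_schemes(sample)
cos_supported = 'cos' in schemes
```

In a notebook, the file contents could be read from the path under $SPARK_CONF_DIR and passed to the function.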
It looks like the cos driver is not properly initialized. Try this configuration:
hconf.set('fs.cos.impl', 'com.ibm.stocator.fs.ObjectStoreFileSystem')
hconf.set('fs.stocator.scheme.list', 'cos')
hconf.set('fs.stocator.cos.impl', 'com.ibm.stocator.fs.cos.COSAPIClient')
hconf.set('fs.stocator.cos.scheme', 'cos')
hconf.set('fs.cos.mycos.access.key', access_key)
hconf.set('fs.cos.mycos.endpoint', 'http://' + host)
hconf.set('fs.cos.mycos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
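Because the service name appears in several keys and again in the cos:// URL, it is easy to mistype one of them (the question's first attempt mixes myCos, mycos and fs.cose). A small helper, sketched here under a hypothetical name, builds the same settings from a single service string so the keys stay consistent.

```python
def cos_hadoop_settings(service, access_key, secret_key, endpoint):
    """Build the Hadoop settings for Stocator's cos scheme as a dict."""
    prefix = 'fs.cos.{0}'.format(service)
    return {
        'fs.cos.impl': 'com.ibm.stocator.fs.ObjectStoreFileSystem',
        'fs.stocator.scheme.list': 'cos',
        'fs.stocator.cos.impl': 'com.ibm.stocator.fs.cos.COSAPIClient',
        'fs.stocator.cos.scheme': 'cos',
        prefix + '.access.key': access_key,
        prefix + '.endpoint': endpoint,
        prefix + '.secret.key': secret_key,
        'fs.cos.service.v2.signer.type': 'false',
    }


settings = cos_hadoop_settings(
    'mycos', 'XXX', 'XXX', 'http://lon.ibmselect.objstor.com')
# In a notebook with a SparkContext `sc`, these could be applied with:
#   hconf = sc._jsc.hadoopConfiguration()
#   for key, value in settings.items():
#       hconf.set(key, value)
```

The same service string should then be used when building the cos://bucket.service/object URL.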
UPDATE 1:
You also need to ensure that the Stocator classes are on the classpath. You can use the packages system by executing pyspark in the following way:
./bin/pyspark --packages com.ibm.stocator:stocator:1.0.24
This works with the swift2d and cos schemes.
UPDATE 2:
Just follow the Stocator documentation (https://github.com/CODAIT/stocator). It contains all the details on how to install it, which branch to use, etc.
I ran into the same issue, and solved it by changing the environment:
Within IBM Watson Studio, if you start a Jupyter notebook in an environment without a pre-configured Spark cluster, then you get that error. Installing PySpark is not enough.
If you instead start a notebook with the Spark cluster available, it works.
You have to set .config("spark.hadoop.fs.stocator.scheme.list", "cos") along with some other fs.cos... configurations.
Here's an end-to-end example that works (tested with pyspark==2.3.2 and Python 3.7.3):
from pyspark.sql import SparkSession
stocator_jar = '/path/to/stocator-1.1.2-SNAPSHOT-IBM-SDK.jar'
cos_instance_name = '<myCosInstanceName>'
bucket_name = '<bucketName>'
s3_region = '<region>'
cos_iam_api_key = '*******'
iam_service_id = 'crn:v1:bluemix:public:iam-identity::<****************>'
spark_builder = (
SparkSession
.builder
.appName('test_app'))
spark_builder.config('spark.driver.extraClassPath', stocator_jar)
spark_builder.config('spark.executor.extraClassPath', stocator_jar)
spark_builder.config(f"fs.cos.{cos_instance_name}.iam.api.key", cos_iam_api_key)
spark_builder.config(f"fs.cos.{cos_instance_name}.endpoint", f"s3.{s3_region}.cloud-object-storage.appdomain.cloud")
spark_builder.config(f"fs.cos.{cos_instance_name}.iam.service.id", iam_service_id)
spark_builder.config("spark.hadoop.fs.stocator.scheme.list", "cos")
spark_builder.config("spark.hadoop.fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
spark_builder.config("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
spark_builder.config("fs.stocator.cos.scheme", "cos")
spark_sess = spark_builder.getOrCreate()
dataset = spark_sess.range(1, 10)
dataset = dataset.withColumnRenamed('id', 'user_idx')
dataset.repartition(1).write.csv(
f'cos://{bucket_name}.{cos_instance_name}/test.csv',
mode='overwrite',
header=True)
spark_sess.stop()
print('done!')
An InvalidSignatureException occurs when trying to add a user record using the Kinesis Producer Library.
AWS_JAVA_SDK_VERSION=1.10.26
AWS_KINESIS_PRODUCER_VERSION=0.10.1
ERROR:
PutRecords failed: {"__type":"InvalidSignatureException","message":"The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method.
SCALA KINESIS PRODUCER CODE
private val configuration: KinesisProducerConfiguration = new KinesisProducerConfiguration
val credentialsProvider: AWSCredentialsProvider = AwsUtil.getAwsCredentials(config.awsAccessKey, config.awsSecretKey)
configuration.setCredentialsProvider(credentialsProvider)
configuration.setRecordMaxBufferedTime(config.timeLimit)
configuration.setAggregationMaxCount(1)
configuration.setRegion(config.streamRegion)
configuration.setMetricsLevel("none")
private val kinesisProducer = new KinesisProducer(configuration)
kinesisProducer.addUserRecord(streamName, key, eventBytes)
The above code does not work. However, I can add records to the Kinesis stream through the AWS CLI from a terminal, and through the KinesisClient in the code below.
private def createKinesisClient = {
val accessKey = config.awsAccessKey
val secretKey = config.awsSecretKey
val credentialsProvider: AWSCredentialsProvider = AwsUtil.getAwsCredentials(accessKey, secretKey)
val client = new AmazonKinesisClient(credentialsProvider)
client.setEndpoint(config.streamEndpoint)
client
}
This can happen because the clock on your VM/PC/server is skewed.
If you're running Ubuntu, try updating your system time:
sudo ntpdate ntp.ubuntu.com
If you are using docker-machine on Mac, you can resolve with this command:
docker-machine ssh default 'sudo ntpclient -s -h pool.ntp.org'
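To confirm that clock skew is the cause before changing the system time, the local clock can be compared with the Date header of an HTTP response from a trusted server. The sketch below only does the arithmetic. The function name is hypothetical, and how the header is fetched and how much skew is tolerated are left open because the thread does not state them.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def clock_skew_seconds(date_header, local_now=None):
    """Return local time minus server time, in seconds.

    `date_header` is an HTTP Date header value such as
    'Tue, 15 Nov 1994 08:12:31 GMT'.
    """
    server_time = parsedate_to_datetime(date_header)
    if local_now is None:
        local_now = datetime.now(timezone.utc)
    return (local_now - server_time).total_seconds()


# Illustrative values: the local clock is ten minutes ahead of the server.
local = datetime(1994, 11, 15, 8, 22, 31, tzinfo=timezone.utc)
skew = clock_skew_seconds('Tue, 15 Nov 1994 08:12:31 GMT', local)
```

A result far from zero suggests that resynchronizing the clock, as in the commands above, is worth trying.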