EMR - Airflow to run scala jar file airflow.exceptions.AirflowException - postgresql

I am trying to run a Scala jar file from Airflow on an EMR cluster; the jar is designed to read from both MSSQL (mssql-jdbc) and PostgreSQL.
From Airflow, I'm able to create the cluster.
My SPARK_STEPS looks like this:
SPARK_STEPS = [
    {
        'Name': 'Trigger_Source_Target',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit',
                     '--master', 'yarn',
                     '--jars', '/mnt/MyScalaImport.jar',
                     '--class', 'org.classname',
                     's3://path/SNAPSHOT.jar',
                     'SQL_Pwd', 'PostgreSQL_PWD', 'loadtype'],
        }
    }
]
After this I have JOB_FLOW_OVERRIDES defined:
JOB_FLOW_OVERRIDES = {
    "Name": "pfdt-cluster-airflow",
    "LogUri": "s3://path/elasticmapreduce/",
    "ReleaseLabel": "emr-6.4.0",
    "Applications": [
        {"Name": "Spark"},
    ],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Master nodes",
                "Market": "ON_DEMAND",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            }
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
        'Ec2KeyName': 'pem_file_name',
        "Ec2SubnetId": "subnet-123"
    },
    'BootstrapActions': [
        {
            'Name': 'import custom Jars',
            'ScriptBootstrapAction': {
                'Path': 's3://path/subpath/copytoolsjar.sh',
                'Args': []
            }
        }
    ],
    'Configurations': [
        {
            'Classification': 'spark-defaults',
            'Properties': {
                'spark.jars': 's3://jar_path/mssql-jdbc-8.4.1.jre8.jar'
            }
        }
    ],
    "VisibleToAllUsers": True,
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
    "Tags": [
        {"Key": "Environment", "Value": "Development"},
    ],
}
To copy the Scala .jar file from S3 to the local filesystem, I have a shell script which does the work (path: s3://path/subpath/copytoolsjar.sh):
aws s3 cp s3://path/SNAPSHOT.jar /mnt/MyScalaImport.jar
On triggering the Airflow DAG, it fails at the watch_step node.
The errors I'm getting are:
stdout.gz =>
stderr.gz =>
22/04/08 13:38:23 INFO CodeGenerator: Code generated in 25.5907 ms
Exception in thread "main" java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$2(JDBCOptions.scala:108)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:108)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:38)
How do I resolve this issue?
I have my jars at:
s3://path/subpath/mssql-jdbc-8.4.1.jre8.jar
s3://path/subpath/postgresql-42.2.24.jar

To upload the jar files (mssql-jdbc-8.4.1.jre8.jar, postgresql-42.2.24.jar) to the cluster's local filesystem:
In the bootstrap step:
'BootstrapActions': [
    {
        'Name': 'import custom Jars',
        'ScriptBootstrapAction': {
            'Path': 's3://path/subpath/copytoolsjar.sh',
            'Args': []
        }
    }
]
In the copytoolsjar.sh file, write the command as:
aws s3 cp s3://path/SNAPSHOT.jar /mnt/MyScalaImport.jar && bash -c "sudo aws s3 cp s3://path/subpath/mssql-jdbc-8.4.1.jre8.jar /usr/lib/spark/jars/" && bash -c "sudo aws s3 cp s3://path/subpath/postgresql-42.2.24.jar /usr/lib/spark/jars/"
That will do the job.
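For readability, the same bootstrap action can also be written as a plain script instead of a one-liner; a minimal sketch using the same S3 paths and destinations as above (assuming the cluster's instance profile can read those buckets):

#!/bin/bash
# copytoolsjar.sh - bootstrap action, runs on each node while the cluster starts up
set -e

# Application jar, referenced later by --jars /mnt/MyScalaImport.jar in SPARK_STEPS
aws s3 cp s3://path/SNAPSHOT.jar /mnt/MyScalaImport.jar

# Put both JDBC drivers on Spark's default classpath so java.sql.DriverManager
# can find a suitable driver for MSSQL and PostgreSQL
sudo aws s3 cp s3://path/subpath/mssql-jdbc-8.4.1.jre8.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://path/subpath/postgresql-42.2.24.jar /usr/lib/spark/jars/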

Related

Exception in thread "main" org.apache.spark.SparkException: Detected yarn cluster mode, but isn't running on a cluster

I am trying to run a spark-submit job from Managed Workflows for Apache Airflow (MWAA). I am able to spin up the cluster but not able to trigger the Spark job.
Error:
Exception in thread "main" org.apache.spark.SparkException: Detected yarn cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.
Below is my config:
SPARK_STEPS = [  # Note the params values are supplied to the operator
    {
        "Name": "Run Spark Job",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "s3://eyx-dataplatform-staging/reporting/Airflow/Jars/Reporting_EMR1.jar",
            'MainClass': 'com.eyxdp.spark.Test',
            "Args": ['spark-submit ',
                     '--packages',
                     "com.typesafe:config:1.3.3",
                     '--master',
                     "yarn-cluster",
                     '--deploy-mode',
                     "yarn",
                     '--class',
                     "com.eyxdp.spark.Test",
                     '--jars',
                     "s3://eyx-dataplatform-staging/reporting/Airflow/Jars/Reporting_EMR1.jar"
                     ],
        },
    },
]

JOB_FLOW_OVERRIDES = {
"Name": "Spark Job Runner",
"ReleaseLabel": "emr-5.36.0",
"Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}], # We want our EMR cluster to have HDFS and Spark
"Configurations": [
{
"Classification": "spark-env",
"Configurations": [
{
"Classification": "export",
"Properties": {"PYSPARK_PYTHON": "/usr/bin/python3"}, # by default EMR uses py2, change it to py3
}
],
}
],
"Instances": {
"InstanceGroups": [
{
"Name": "Master node",
"Market": "SPOT",
"InstanceRole": "MASTER",
"InstanceType": "m5.xlarge",
"InstanceCount": 1,
},
{
"Name": "Core - 2",
"Market": "SPOT", # Spot instances are a "use as available" instances
"InstanceRole": "CORE",
"InstanceType": "m5.xlarge",
"InstanceCount": 2,
},
],
"KeepJobFlowAliveWhenNoSteps": True,
"TerminationProtected": False, # this lets us programmatically terminate the cluster
},
"JobFlowRole": "EMR_EC2_DefaultRole",
"ServiceRole": "EMR_DefaultRole",
}

kubernetes - volume mapping via command

I need to map a volume while starting a container. I am able to do so with a YAML file.
Is there a way volume mapping can be done via the command line without using a YAML file, just like the -v option in Docker?
"without using yaml file"
Technically, yes: you would need a json file, as illustrated in "Create kubernetes pod with volume using kubectl run"
See kubectl run.
kubectl run -i --rm --tty ubuntu --overrides='
{
  "apiVersion": "batch/v1",
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "name": "ubuntu",
            "image": "ubuntu:14.04",
            "args": [
              "bash"
            ],
            "stdin": true,
            "stdinOnce": true,
            "tty": true,
            "volumeMounts": [{
              "mountPath": "/home/store",
              "name": "store"
            }]
          }
        ],
        "volumes": [{
          "name": "store",
          "emptyDir": {}
        }]
      }
    }
  }
}
' --image=ubuntu:14.04 --restart=Never -- bash
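If the goal is specifically Docker's -v host-path:container-path behaviour, the emptyDir above can be swapped for a hostPath volume in the same override; a sketch (the host path /data/store is illustrative, and hostPath exposes the node's filesystem to the pod):

"volumes": [{
  "name": "store",
  "hostPath": {
    "path": "/data/store"
  }
}]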

Specify ECR image instead of S3 file in Cloud Formation Elastic Beanstalk template

I'd like to reference an EC2 Container Registry image in the Elastic Beanstalk section of my CloudFormation template. The sample file references an S3 bucket for the source bundle:
"applicationVersion": {
"Type": "AWS::ElasticBeanstalk::ApplicationVersion",
"Properties": {
"ApplicationName": { "Ref": "application" },
"SourceBundle": {
"S3Bucket": { "Fn::Join": [ "-", [ "elasticbeanstalk-samples", { "Ref": "AWS::Region" } ] ] },
"S3Key": "php-sample.zip"
}
}
}
Is there any way to reference an EC2 Container Registry image instead? Something like what is available in the EC2 Container Service TaskDefinition?
Upload a Dockerrun file to S3 in order to do this. Here's an example Dockerrun:
{
  "AWSEBDockerrunVersion": "1",
  "Authentication": {
    "Bucket": "my-bucket",
    "Key": "mydockercfg"
  },
  "Image": {
    "Name": "quay.io/johndoe/private-image",
    "Update": "true"
  },
  "Ports": [
    {
      "ContainerPort": "8080:80"
    }
  ],
  "Volumes": [
    {
      "HostDirectory": "/var/app/mydb",
      "ContainerDirectory": "/etc/mysql"
    }
  ],
  "Logging": "/var/log/nginx"
}
Use this file as the S3 key. More info is available here.
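In the CloudFormation template, the application version's SourceBundle then points at that Dockerrun file instead of a zip; a sketch with placeholder bucket and key names:

"applicationVersion": {
  "Type": "AWS::ElasticBeanstalk::ApplicationVersion",
  "Properties": {
    "ApplicationName": { "Ref": "application" },
    "SourceBundle": {
      "S3Bucket": "my-bucket",
      "S3Key": "Dockerrun.aws.json"
    }
  }
}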

AWSCloudFormation - cfn-init failed to run command

I am using CloudFormation to install Elasticsearch.
I am downloading and extracting a tar.gz.
The following is my EC2 instance section:
"masterinstance": {
"Type": "AWS: : EC2: : Instance",
"Metadata": {
"AWS: : CloudFormation: : Init": {
"configSets" : {
"ascending" : [ "config1" , "config2" ]
},
"config1": {
"sources": {
"/home/ubuntu/": "https: //s3.amazonaws.com/xxxxxxxx/elasticsearch.tar.gz"
},
"files": {
"/home/ubuntu/elasticsearch/config/elasticsearch.yml": {
"content": {
"Fn: : Join": [
"",
[
xxxxxxxx
]
]
}
}
}
},
"config2" : {
"commands": {
"runservice": {
"command": "~/elasticsearch/bin/elasticsearch",
"cwd" : "~",
"test" : "~/elasticsearch/bin/elasticsearch > test.txt",
"ignoreErrors" : "false"
}
}
}
}
},
"Properties": {
"ImageId": "ami-xxxxxxxxxx",
"InstanceType": {
"Ref": "InstanceTypeParameter"
},
"Tags": [
xxxxxxxx
],
"KeyName": "everybody",
"NetworkInterfaces": [
{
"GroupSet": [
{
"Ref": "newSecurity"
}
],
"AssociatePublicIpAddress": "true",
"DeviceIndex": "0",
"SubnetId": {
"Ref": "oneSubnet"
}
}
],
"UserData": {
"Fn: : Base64": {
"Fn: : Join": [
"",
[
"#!/bin/bash\n",
"sudo add-apt-repository-yppa: webupd8team/java\n",
"sudo apt-get update\n",
"echo'oracle-java8-installershared/accepted-oracle-license-v1-1selecttrue'|sudo debconf-set-selections\n",
"sudo apt-getinstall-yoracle-java8-installer\n",
"apt-get update\n",
"apt-get-y installpython-setuptools\n",
"easy_installhttps: //s3.amazonaws.com/cloudformation-examples/aws-cfn-bootstrap-latest.tar.gz\n",
"/usr/local/bin/cfn-init",
"--stack Elasticsearch",
"--resource masterinstance",
"--configsets ascending",
"-v\n"
]
]
}
}
}
}
I am using AWS::CloudFormation::Init for configuration and other settings.
After extracting the tar, I want to start Elasticsearch, which I am doing through the commands section in AWS::CloudFormation::Init, but after the stack has been completely created, when I ssh into my instance I am not able to see my Elasticsearch service running.
All other things, like extracting the tar and creating the file, are working correctly.
I have gone through cfn-init.log; it gives me the following information:
2016-07-19 05:53:15,776 P2745 [INFO] Test for Command runservice
2016-07-19 05:53:15,778 P2745 [INFO] -----------------------Command Output-----------------------
2016-07-19 05:53:15,778 P2745 [INFO] /bin/sh: 1: ~/elasticsearch/bin/elasticsearch: not found
2016-07-19 05:53:15,778 P2745 [INFO] ------------------------------------------------------------
2016-07-19 05:53:15,779 P2745 [ERROR] Exited with error code 127
If I run the above command, ~/elasticsearch/bin/elasticsearch, directly on my instance, it works perfectly.
What am I doing wrong here?
Thank you.
I'm guessing that the home directory (~) is evaluating to a different user (not Ubuntu) when trying to run ES. I think CFN-Init runs as the root user instead of as ubuntu/ec2-user. Try to change the paths in the config2 command block to fully qualified paths (/home/ubuntu/elasticsearch).
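A sketch of the config2 block with fully qualified paths, per that suggestion (the test hook is dropped here; keep it if you need it, with the same full path):

"config2": {
  "commands": {
    "runservice": {
      "command": "/home/ubuntu/elasticsearch/bin/elasticsearch",
      "cwd": "/home/ubuntu",
      "ignoreErrors": "false"
    }
  }
}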

Chef: Trying to get build-essential to install on our node before Postgres

Here's our node configuration:
{
  "run_list": [
    "recipe[apt]",
    "recipe[build-essential]",
    [
      "rackbox"
    ]
  ],
  "rackbox": {
    "jenkins": {
      "job": "job1",
      "git_repo": "https://github.com/hayesmp/railsgirls-app.git",
      "command": "bundle exec rake",
      "ip_address": "192.237.181.154",
      "host": "subocean-southerner"
    },
    "ruby": {
      "versions": [
        "2.0.0-p247"
      ],
      "global_version": "2.0.0-p247"
    },
    "apps": {
      "unicorn": [
        {
          "appname": "app1",
          "hostname": "app1"
        }
      ]
    },
    "db_root_password": "iloverandompasswordsbutthiswilldo",
    "databases": {
      "postgresql": [
        {
          "database_name": "app1_production",
          "username": "app1",
          "password": "app1_pass"
        }
      ]
    }
  }
}
I'm just not sure where to insert the build-essential compiletime = true attribute in my configuration.
This is the sample code from this Stack Overflow post: Chef: Why are resources in an "include_recipe" step being skipped?
name "myapp"
run_list(
"recipe[build-essential]",
"recipe[myapp]"
)
default_attributes(
"build_essential" => {
"compiletime" => true
}
)
Paste this into your node configuration:
"build_essential": {
"compiletime": true
}
BTW: you should use recipe[rackbox] instead of [rackbox] in your run_list.
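Put together, a sketch of the node configuration with both changes applied (only the keys that change are shown):

{
  "run_list": [
    "recipe[apt]",
    "recipe[build-essential]",
    "recipe[rackbox]"
  ],
  "build_essential": {
    "compiletime": true
  }
}

Keep the existing "rackbox" attribute block alongside these keys.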