AWS Glue CDK - create job type Spark (Glue 2.0) - pyspark

I could not find any documentation on how to create a Glue job with the type Spark. The way the examples and documentation suggest creates the Python shell type. Example:
glueETLJob = _glue.CfnJob(
    self,
    "glue_CDK_job",
    command=_glue.CfnJob.JobCommandProperty(
        name="glue_CDK_job",
        python_version='3',
        script_location=bucket + "/code/glue_CDK_job.py"
    ),
    role=glueRole.role_arn,
    max_retries=0,
    name="glue_CDK_job",
    timeout=30,
    glue_version="1.0"
)
This creates a Python shell job with Glue version 1.0, but I cannot set glue_version="2.0" because that only seems to exist for the Spark type.
Does anyone know how to create a Glue job with type Spark and Glue version 2.0 with CDK?
Thanks

Turns out the name in the JobCommandProperty is not an ID like the others, but the job type I was looking for. So if anybody has the same issue, it should look like:
glueETLJob = _glue.CfnJob(
    self,
    "glue_CDK_job",
    command=_glue.CfnJob.JobCommandProperty(
        name="glueetl",
        python_version='3',
        script_location=bucket + "/code/glue_CDK_job.py"
    ),
    role=glueRole.role_arn,
    max_retries=0,
    name="glue_CDK_job",
    timeout=30,
    glue_version="2.0"
)
Got the answer from: https://github.com/aws/aws-cdk/issues/4480

I got the same issue, but I realized that it was caused by the name in JobCommandProperty. Changing glue_CDK_job to glueetl will make it work.
You can check the AWS CloudFormation documentation:
https://github.com/awsdocs/aws-cloudformation-user-guide/blob/main/doc_source/aws-resource-glue-job.md
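To make the valid values easier to spot, here is a minimal sketch of just the command property in the TypeScript CDK; the bucketName variable and the script path are placeholders, and the comments list the command names documented for AWS::Glue::Job:
// Sketch only: the command name selects the Glue job type
// (per the AWS::Glue::Job JobCommand documentation):
//   'glueetl'       -> Spark ETL job
//   'pythonshell'   -> Python shell job
//   'gluestreaming' -> Spark streaming job
const sparkCommand: glue.CfnJob.JobCommandProperty = {
  name: 'glueetl',
  pythonVersion: '3',
  scriptLocation: 's3://' + bucketName + '/code/glue_CDK_job.py', // placeholder path
};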

const processFifaDataJobName = 'process-data-fifa';
const PYTHON_VERSION = "3";
const GLUE_VERSION = "1.0";
const COMMAND_NAME = "glueetl";
const glueJobProcessFifaData = new glue.CfnJob(this, processFifaDataJobName, {
  name: processFifaDataJobName,
  role: role.roleArn,
  command: {
    name: COMMAND_NAME,
    pythonVersion: PYTHON_VERSION,
    scriptLocation: 's3://' + bucketName + '/Scripts/process-data.py'
  },
  glueVersion: GLUE_VERSION
});
The above code worked for me. It is very similar to what you are trying to do, but it created a job with type Spark.

Related

How to deploy the kinesis-video-producer Docker image from AWS's own ECR to Fargate using CDK in TypeScript?

I'm trying to stand up a proof of concept that ingests an RTSP video stream into Kinesis Video. The provided documentation has a docker image all set up that seems to have everything I need to do this, hosted by AWS on 546150905175.dkr.ecr.us-west-2.amazonaws.com. What I am having trouble with, though, is getting that deployment (via an Amplify Custom category, in TypeScript CDK) to work.
I've tried different variations on
import * as iam from "@aws-cdk/aws-iam";
import * as ecs from "@aws-cdk/aws-ecs";
import * as ec2 from "@aws-cdk/aws-ec2";

const kinesisUserAccessKey = new iam.AccessKey(this, 'KinesisStreamUserAccessKey', {
  user: kinesisStreamUser,
})

const servicePrincipal = new iam.ServicePrincipal('ecs-tasks.amazonaws.com');
const executionRole = new iam.Role(this, 'IngestVideoTaskDefExecutionRole', {
  assumedBy: servicePrincipal,
  managedPolicies: [
    iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonECSTaskExecutionRolePolicy'),
  ]
});

const taskDefinition = new ecs.FargateTaskDefinition(this, 'IngestVideoTaskDef', {
  cpu: 512,
  memoryLimitMiB: 1024,
  executionRole,
})

const image = ecs.ContainerImage.fromRegistry('546150905175.dkr.ecr.us-west-2.amazonaws.com/kinesis-video-producer-sdk-cpp-amazon-linux:latest');

taskDefinition.addContainer('IngestVideoContainer', {
  command: [
    'gst-launch-1.0',
    'rtspsrc',
    `location="${locationParam.secretValue.toString()}"`,
    'short-header=TRUE',
    '!',
    'rtph264depay',
    '!',
    'video/x-h264,',
    'format=avc,alignment=au',
    '!',
    'kvssink',
    `stream-name="${cfnStream.name}"`,
    'storage-size=512',
    `access-key="${kinesisUserAccessKey.accessKeyId}"`,
    `secret-key="${kinesisUserAccessKey.secretAccessKey.toString()}"`,
    `aws-region="${REGION}"`,
    // `aws-region="${cdk.Aws.REGION}"`,
  ],
  image,
  logging: new ecs.AwsLogDriver({
    streamPrefix: 'IngestVideoContainer',
  }),
})

const service = new ecs.FargateService(this, 'IngestVideoService', {
  cluster,
  taskDefinition,
  desiredCount: 1,
  securityGroups: [
    ec2.SecurityGroup.fromSecurityGroupId(this, 'DefaultSecurityGroup', SECURITY_GROUP_ID)
  ],
  vpcSubnets: {
    subnets: SUBNET_IDS.map(subnetId => ec2.Subnet.fromSubnetId(this, subnetId, subnetId)),
  }
})
But it seems like regardless of what I do, an amplify push just stays "in progress" for about an hour until I go into the CloudFormation console and cancel the stack update. Digging through the ECS console, I managed to find an actual error message:
ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError: send request failed caused by: Post "https://api.ecr.us-west-2.amazonaws.com/": dial tcp 52.94.177.118:443: i/o timeout
It seems to be some kind of networking issue, but I'm not sure how to proceed. Any assistance you can provide would be wonderful. Cheers!
Figured it out. For those stuck with similar issues: you have to give the task an execution role with AmazonECSTaskExecutionRolePolicy (which I already edited into the code above) and set assignPublicIp: true on the service.
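For anyone who wants to see where that flag goes, here is a sketch of the service definition with assignPublicIp set, assuming the same cluster, taskDefinition, SECURITY_GROUP_ID, and SUBNET_IDS variables as in the snippet above:
const service = new ecs.FargateService(this, 'IngestVideoService', {
  cluster,
  taskDefinition,
  desiredCount: 1,
  // Without a NAT gateway in the chosen subnets, the task needs a public IP
  // to reach the ECR endpoints when pulling the image and registry auth.
  assignPublicIp: true,
  securityGroups: [
    ec2.SecurityGroup.fromSecurityGroupId(this, 'DefaultSecurityGroup', SECURITY_GROUP_ID),
  ],
  vpcSubnets: {
    subnets: SUBNET_IDS.map(subnetId => ec2.Subnet.fromSubnetId(this, subnetId, subnetId)),
  },
})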

Azure Database for PostgreSQL flexible server deployment fails with databaseName param error

I'm trying to deploy the PostgreSQL managed service with Bicep, and in most cases I get an error:
"code": "InvalidParameterValue",
"message": "Invalid value given for parameter databaseName. Specify a valid parameter value."
I've tried various names for the DB; even in the last version of the script I add a random suffix to make it unique. Anyway, it finishes with an error, but it looks like the service is working. Another unexplainable thing is that sometimes the script finishes without an error... It's part of my IaC scenario, and I need to be able to rerun it many times...
bicep code:
param location string

@secure()
param sqlserverLoginPassword string

param rand string = uniqueString(resourceGroup().id) // Generate unique String
param sqlserverName string = toLower('invivopsql-${rand}')
param sqlserverAdminName string = 'invivoadmin'
param psqlDatabaseName string = 'postgres'

resource flexibleServer 'Microsoft.DBforPostgreSQL/flexibleServers@2021-06-01' = {
  name: sqlserverName
  location: location
  sku: {
    name: 'Standard_B1ms'
    tier: 'Burstable'
  }
  properties: {
    createMode: 'Default'
    version: '13'
    administratorLogin: sqlserverAdminName
    administratorLoginPassword: sqlserverLoginPassword
    availabilityZone: '1'
    storage: {
      storageSizeGB: 32
    }
    backup: {
      backupRetentionDays: 7
      geoRedundantBackup: 'Disabled'
    }
  }
}
Please follow this git issue for a similar error; it might help you fix your problem.

mock outputs in Terragrunt dependency

I want to use Terragrunt to deploy this example: https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/main/examples/complete-kubernetes-addons/main.tf
So far, I was able to create the VPC/EKS resource without a problem, I separated each module into a different module directory, and everything worked as expected.
When I tried to do the same for the kubernetes-addons module, I faced an issue with the data source trying to call the cluster and failing, since the cluster hadn't been created at that point.
Here's my terragrunt.hcl which I'm trying to execute for this specific module:
...
terraform {
  source = "git::git@github.com:aws-ia/terraform-aws-eks-blueprints.git//modules/kubernetes-addons?ref=v4.6.1"
}

locals {
  # Extract needed variables for reuse
  cluster_version = "${include.envcommon.locals.cluster_version}"
  name            = "${include.envcommon.locals.name}"
}

dependency "eks" {
  config_path = "../eks"

  mock_outputs = {
    eks_cluster_endpoint = "https://000000000000.gr7.eu-west-3.eks.amazonaws.com"
    eks_oidc_provider    = "something"
    eks_cluster_id       = "something"
  }
}

inputs = {
  eks_cluster_id       = dependency.eks.outputs.cluster_id
  eks_cluster_endpoint = dependency.eks.outputs.eks_cluster_endpoint
  eks_oidc_provider    = dependency.eks.outputs.eks_oidc_provider
  eks_cluster_version  = local.cluster_version
  ...
}
The error that I'm getting here:
INFO[0035]
Error: error reading EKS Cluster (something): couldn't find resource

  with data.aws_eks_cluster.eks_cluster,
  on data.tf line 7, in data "aws_eks_cluster" "eks_cluster":
   7: data "aws_eks_cluster" "eks_cluster" {
The kubernetes-addons module deploys addons into an existing Kubernetes cluster. If you don't have a cluster running (and apparently you don't, since you're mocking the cluster_id variable), you get this error because the aws_eks_cluster data source can't find the cluster.
You need to create the K8s cluster first, before you can start deploying the addons.

Referencing a loop object

I am currently checking out Tanka + Jsonnet, but every time I think I understand it... something new trips me up. Can somebody help me understand how to do a loop reference? (Or is there a generally better solution?)
I'm trying to create multiple deployments, each with a corresponding configMapVolumeMount, and I am not sure how to reference the corresponding configMap object here.
(Using a configVolumeMount it works, since it refers to the name, not the object.)
deployment: [
  deploy.new(
    name='demo-' + instance.name,
  )
  + deploy.configMapVolumeMount('config-' + instance.name, '/config.yml', k.core.v1.volumeMount.withSubPath('config.yml'))
  for instance in $._config.demo.instances
],
configMap: [
  configMap.new('config-' + instance.name, {
    'config.yml': (importstr 'files/config.yml') % {
      name: instance.name,
      ....
    },
  })
  for instance in $._config.demo.instances
]
regards
Great to read that you're making progress with Tanka, it's an awesome tool (once you learn how to ride it, heh).
Find below a possible answer; see the inline comments in the code, in particular how we (ab)use Tanka's layout flexibility to "populate" the deploys: [...] array with Jsonnet objects, each containing a paired deploy + configMap.
config.jsonnet
{
  demo: {
    instances: ['foo', 'bar'],
    image: 'nginx',  // just as example
  },
}
main.jsonnet
local config = import 'config.jsonnet';
local k = import 'github.com/grafana/jsonnet-libs/ksonnet-util/kausal.libsonnet';

{
  local deployment = k.apps.v1.deployment,
  local configMap = k.core.v1.configMap,

  _config:: import 'config.jsonnet',

  // my_deploy(name) will return a name-d deploy+configMap object
  my_deploy(name):: {
    local this = self,
    deployment:
      deployment.new(
        name='deploy-%s' % name,
        replicas=1,
        containers=[
          k.core.v1.container.new('demo-%s' % name, $._config.demo.image),
        ],
      )
      + deployment.configMapVolumeMount(
        this.configMap,
        '/config.yml',
        k.core.v1.volumeMount.withSubPath('config.yml')
      ),
    configMap:
      configMap.new('config-%s' % name)
      + configMap.withData({
        // NB: replacing `importstr 'files/config.yml';` by
        // a simple YAML multi-line string, just for the sake of having
        // a simple yet complete/usable example.
        'config.yml': |||
          name: %(name)s
          other: value
        ||| % { name: name },
      }),
  },

  // Tanka is pretty flexible with the "layout" of the Kubernetes objects
  // in the Environment (can be arrays, objects, etc); below we use an array
  // for simplicity (built via a loop/comprehension).
  deploys: [$.my_deploy(name) for name in $._config.demo.instances],
}
output
$ tk init
[...]
## NOTE: using https://kind.sigs.k8s.io/ local Kubernetes cluster
$ tk env set --server-from-context kind-kind environments/default
[... save main.jsonnet, config.jsonnet to ./environments/default/]
$ tk apply --dry-run=server environments/default
[...]
configmap/config-bar created (server dry run)
configmap/config-foo created (server dry run)
deployment.apps/deploy-bar created (server dry run)
deployment.apps/deploy-foo created (server dry run)

How to hide the password from the log and rendered template when passing another Airflow connection to the Airflow SSH Operator

Summary of my DAG:
I am using the SSH Operator to SSH to an EC2 instance and run a JAR file which connects to multiple DBs. I've declared the Airflow connections in my DAG file and am able to pass the variables into the EC2 instance. As you can see below, I'm passing the properties into the Java command.
Airflow version - airflow-1-10.7
Package installed - apache-airflow[crypto]
from airflow import DAG
from datetime import datetime, timedelta
from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.hooks.base_hook import BaseHook
from airflow.models.connection import Connection

ssh_hook = SSHHook(ssh_conn_id='ssh_to_ec2')
ssh_hook.no_host_key_check = True

redshift_connection = BaseHook.get_connection("my_redshift")
rs_user = redshift_connection.login
rs_password = redshift_connection.password

mongo_connection = BaseHook.get_connection("my_mongo")
mongo_user = mongo_connection.login
mongo_password = mongo_connection.password

default_args = {
    'owner': 'AIRFLOW',
    'start_date': datetime(2020, 4, 1, 0, 0),
    'email': [],
    'retries': 1,
}

dag = DAG('connect_to_redshift', default_args=default_args)

t00_00 = SSHOperator(
    task_id='ssh_and_connect_db',
    ssh_hook=ssh_hook,
    command="java "
            "-Drs_user={rs_user} -Drs_pass={rs_pass} "
            "-Dmongo_user={mongo_user} -Dmongo_pass={mongo_pass} "
            "-jar /home/airflow/root.jar".format(rs_user=rs_user,
                                                 rs_pass=rs_password,
                                                 mongo_user=mongo_user,
                                                 mongo_pass=mongo_password),
    dag=dag)

t00_00
Problem
The values for rs_pass and mongo_pass will be exposed in the Rendered Template/Airflow log, which is not good, and I would like a solution that can hide all of this sensitive information from the log and rendered template with the SSH Operator.
So far I've tried reducing the log verbosity to ERROR in airflow.cfg, but it still shows up in the Rendered Template.
Please enlighten me.
Thanks