I am trying to deploy the CDK stack below:
class MyCdkStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        vpc = ec2.Vpc.from_lookup(self, "VPC", vpc_id=EXISTING_VPC_ID)
        amzn_linux = ec2.MachineImage.latest_amazon_linux(
            generation=ec2.AmazonLinuxGeneration.AMAZON_LINUX_2
        )
        role = iam.Role(
            self, "Role", assumed_by=iam.ServicePrincipal("ec2.amazonaws.com")
        )
        role.add_managed_policy(
            iam.ManagedPolicy.from_aws_managed_policy_name(
                "AmazonSSMManagedInstanceCore"
            )
        )
        instance = ec2.Instance(
            self,
            "Instance",
            instance_type=ec2.InstanceType("t3.micro"),
            machine_image=amzn_linux,
            vpc=vpc,
            vpc_subnets=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PUBLIC),
            role=role,
            init=ec2.CloudFormationInit.from_elements(
                ec2.InitPackage.yum("docker"),
            ),
            init_options=ec2.ApplyCloudFormationInitOptions(
                timeout=Duration.minutes(5),
                ignore_failures=True,
            ),
        )
        # Allow SSH connections from anywhere
        instance.connections.allow_from_any_ipv4(ec2.Port.tcp(22))
        # Elastic IP
        eip = ec2.CfnEIP(self, "EIP", instance_id=instance.instance_id)
        # Outputs
        CfnOutput(self, "EIP Address", value=eip.ref)
The deployment fails after 5 minutes and rolls back with the following error message:
Failed to receive 1 resource signal(s) within the specified duration
Here are the possible problems I have considered:
- The server might not have outbound internet access (but I have put it on a public subnet).
- The AMI might be the problem (but I've tried using an Amazon Linux 2022 AMI instead).
- The 5-minute timeout might not be sufficient (but I have tried increasing it to 15 minutes, to no avail).
- There is something else wrong with my setup (but without the CloudFormationInit stuff the server is created as expected).
- Yum-installing Docker might be impossible (but if I create the server without the CloudFormationInit stuff, I can SSH into the instance and sudo yum install docker works).
- The server is not allowed to send cfn signals (but the raw CloudFormation template created by CDK seems to include the relevant auto-generated user data and permissions, see below):
// Excerpts from the autogenerated CDK template JSON
"UserData": {
  "Fn::Base64": {
    "Fn::Join": [
      "",
      [
        "#!/bin/bash\n# fingerprint: 7d8f48713aedxxxx\n(\n set +e\n /opt/aws/bin/cfn-init -v --region ",
        { "Ref": "AWS::Region" },
        " --stack ",
        { "Ref": "AWS::StackName" },
        " --resource Instance5FFEF8E4e0ce835dd5aaxxxx -c default\n /opt/aws/bin/cfn-signal -e 0 --region ",
        { "Ref": "AWS::Region" },
        " --stack ",
        { "Ref": "AWS::StackName" },
        " --resource Instance5FFEF8E4e0ce835dd5aaxxxx\n cat /var/log/cfn-init.log >&2\n)"
      ]
    ]
  }
},
// -----
"RoleDefaultPolicy5FFBxxx": {
  "Type": "AWS::IAM::Policy",
  "Properties": {
    "PolicyDocument": {
      "Statement": [
        {
          "Action": [
            "cloudformation:DescribeStackResource",
            "cloudformation:SignalResource"
          ],
          "Effect": "Allow",
          "Resource": { "Ref": "AWS::StackId" }
        }
      ],
      "Version": "2012-10-17"
    },
    "PolicyName": "RoleDefaultPolicy5FFB7xxx",
    "Roles": [
      { "Ref": "Role1ABCxxxx" }
    ]
  },
  "Metadata": {
    "aws:cdk:path": "xxx/Role/DefaultPolicy/Resource"
  }
},
Wondering what else is left for me to try! Any help would be greatly appreciated. I have that sinking feeling that I've overlooked something obvious...
Edit:
In response to Paolo's comment, here is the full output from cdk synth with identifiers obfuscated.
Resources:
  Role1ABCXXXX:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Action: sts:AssumeRole
            Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
        Version: "2012-10-17"
      ManagedPolicyArns:
        - Fn::Join:
            - ""
            - - "arn:"
              - Ref: AWS::Partition
              - :iam::aws:policy/AmazonSSMManagedInstanceCore
    Metadata:
      aws:cdk:path: MyCDK/Role/Resource
  RoleDefaultPolicy5FFBXXXX:
    Type: AWS::IAM::Policy
    Properties:
      PolicyDocument:
        Statement:
          - Action:
              - cloudformation:DescribeStackResource
              - cloudformation:SignalResource
            Effect: Allow
            Resource:
              Ref: AWS::StackId
        Version: "2012-10-17"
      PolicyName: RoleDefaultPolicy5FFBXXXX
      Roles:
        - Ref: Role1ABCXXXX
    Metadata:
      aws:cdk:path: MyCDK/Role/DefaultPolicy/Resource
  InstanceInstanceSecurityGroup698618EC:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: MyCDK/Instance/InstanceSecurityGroup
      SecurityGroupEgress:
        - CidrIp: 0.0.0.0/0
          Description: Allow all outbound traffic by default
          IpProtocol: "-1"
      SecurityGroupIngress:
        - CidrIp: 0.0.0.0/0
          Description: from 0.0.0.0/0:22
          FromPort: 22
          IpProtocol: tcp
          ToPort: 22
      VpcId: vpc-07848d9441fddea14
    Metadata:
      aws:cdk:path: MyCDK/Instance/InstanceSecurityGroup/Resource
  InstanceInstanceProfile01ECXXXX:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Roles:
        - Ref: Role1ABCXXXX
    Metadata:
      aws:cdk:path: MyCDK/Instance/InstanceProfile
  Instance5FFEF8E47f468d710e75XXXX:
    Type: AWS::EC2::Instance
    Properties:
      AvailabilityZone: eu-central-1a
      IamInstanceProfile:
        Ref: InstanceInstanceProfile01ECXXXX
      ImageId:
        Ref: SsmParameterValueawsserviceamiamazonlinuxlatestamzn2amihvmx8664gp2C96584B6F00A464EAD1953AFF4B05118Parameter
      InstanceType: t3.micro
      SecurityGroupIds:
        - Fn::GetAtt:
            - InstanceInstanceSecurityGroup698618EC
            - GroupId
      SubnetId: subnet-079be82ff7754XXXX
      UserData:
        Fn::Base64:
          Fn::Join:
            - ""
            - - |-
                #!/bin/bash
                # fingerprint: 5af534616771e4af
                (
                 set +e
                 /opt/aws/bin/cfn-init -v --region 
              - Ref: AWS::Region
              - " --stack "
              - Ref: AWS::StackName
              - |-2
                 --resource Instance5FFEF8E47f468d710e75XXXX -c default
                 /opt/aws/bin/cfn-signal -e 0 --region 
              - Ref: AWS::Region
              - " --stack "
              - Ref: AWS::StackName
              - |-2
                 --resource Instance5FFEF8E47f468d710e75XXXX
                 cat /var/log/cfn-init.log >&2
                )
    DependsOn:
      - RoleDefaultPolicy5FFBXXXX
      - Role1ABCXXXX
    CreationPolicy:
      ResourceSignal:
        Count: 1
        Timeout: PT5M
    Metadata:
      aws:cdk:path: MyCDK/Instance/Resource
      AWS::CloudFormation::Init:
        configSets:
          default:
            - config
        config:
          packages:
            yum:
              docker: []
  EIP:
    Type: AWS::EC2::EIP
    Properties:
      InstanceId:
        Ref: Instance5FFEF8E47f468d710e75XXXX
    Metadata:
      aws:cdk:path: MyCDK/EIP
  CDKMetadata:
    Type: AWS::CDK::Metadata
    Properties:
      Analytics: v2:deflate64:H4sIAAAAAAAA/2VOyQ6CMBD9Fu5lFDwYz8YYTjbwAabWIY6UlnSJIU3/XcDt4OmteXklFFtYZ+LhcnntckUXiI0XsmM1OhOsRDZl50iih1gbhWzf6gW5USTHWf5YpZ0XWiK3piWFiaEsIX5c1qAMlvx4tXXXX//P+FYnfqh4Ssu+sKJHj3YWp+CH4JcX74OJ8dHfjF5tYAdFmd0dUW6D9tQj1C98AstX0JrnXXXX
    Metadata:
      aws:cdk:path: MyCDK/CDKMetadata/Default
Parameters:
  SsmParameterValueawsserviceamiamazonlinuxlatestamzn2amihvmx8664gp2C96584B6F00A464EAD1953AFF4B05118Parameter:
    Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
    Default: /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2
  BootstrapVersion:
    Type: AWS::SSM::Parameter::Value<String>
    Default: /cdk-bootstrap/hnb659fds/version
    Description: Version of the CDK Bootstrap resources in this environment, automatically retrieved from SSM Parameter Store. [cdk:skip]
Outputs:
  EIPAddress:
    Value:
      Ref: EIP
Rules:
  CheckBootstrapVersion:
    Assertions:
      - Assert:
          Fn::Not:
            - Fn::Contains:
                - - "1"
                  - "2"
                  - "3"
                  - "4"
                  - "5"
                - Ref: BootstrapVersion
        AssertDescription: CDK bootstrap stack version 6 required. Please run 'cdk bootstrap' with a recent version of the CDK CLI.
Edit 2: Here is the cloud-init-output.log.
Cloud-init v. 19.3-45.amzn2 running 'init-local' at Mon, 30 May 2022 10:42:35 +0000. Up 6.48 seconds.
Cloud-init v. 19.3-45.amzn2 running 'init' at Mon, 30 May 2022 10:42:37 +0000. Up 7.60 seconds.
ci-info: ++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++
ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
ci-info: | eth0 | True | 10.0.0.156 | 255.255.255.0 | global | 02:6c:e8:e3:39:84 |
ci-info: | eth0 | True | fe80::6c:e8ff:fee3:3984/64 | . | link | 02:6c:e8:e3:39:84 |
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
ci-info: | lo | True | ::1/128 | . | host | . |
ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
ci-info: ++++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++++
ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
ci-info: | 0 | 0.0.0.0 | 10.0.0.1 | 0.0.0.0 | eth0 | UG |
ci-info: | 1 | 10.0.0.0 | 0.0.0.0 | 255.255.255.0 | eth0 | U |
ci-info: | 2 | 169.254.169.254 | 0.0.0.0 | 255.255.255.255 | eth0 | UH |
ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | 9 | fe80::/64 | :: | eth0 | U |
ci-info: | 11 | local | :: | eth0 | U |
ci-info: | 12 | ff00::/8 | :: | eth0 | U |
ci-info: +-------+-------------+---------+-----------+-------+
Cloud-init v. 19.3-45.amzn2 running 'modules:config' at Mon, 30 May 2022 10:42:38 +0000. Up 9.21 seconds.
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
One of the configured repositories failed (Unknown),
and yum doesn't have enough cached data to continue. At this point the only
safe thing yum can do is fail. There are a few ways to work "fix" this:
1. Contact the upstream for the repository and get them to fix the problem.
2. Reconfigure the baseurl/etc. for the repository, to point to a working
upstream. This is most often useful if you are using a newer
distribution release than is supported by the repository (and the
packages for the previous distribution release still work).
3. Run the command with the repository temporarily disabled
yum --disablerepo=<repoid> ...
4. Disable the repository permanently, so yum won't use it by default. Yum
will then just ignore the repository until you permanently enable it
again or use --enablerepo for temporary usage:
yum-config-manager --disable <repoid>
or
subscription-manager repos --disable=<repoid>
5. Configure the failing repository to be skipped, if it is unavailable.
Note that yum will try to contact the repo. when it runs most commands,
so will have to try and fail each time (and thus. yum will be be much
slower). If it is a very temporary problem though, this is often a nice
compromise:
yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true
Cannot find a valid baseurl for repo: amzn2-core/2/x86_64
Could not retrieve mirrorlist https://amazonlinux-2-repos-eu-central-1.s3.dualstack.eu-central-1.amazonaws.com/2/core/latest/x86_64/mirror.list error was
12: Timeout on https://amazonlinux-2-repos-eu-central-1.s3.dualstack.eu-central-1.amazonaws.com/2/core/latest/x86_64/mirror.list: (28, 'Failed to connect to amazonlinux-2-repos-eu-central-1.s3.dualstack.eu-central-1.amazonaws.com port 443 after 2702 ms: Connection timed out')
May 30 10:42:58 cloud-init[2199]: util.py[WARNING]: Package upgrade failed
May 30 10:42:58 cloud-init[2199]: cc_package_update_upgrade_install.py[WARNING]: 1 failed with exceptions, re-raising the last one
May 30 10:42:58 cloud-init[2199]: util.py[WARNING]: Running module package-update-upgrade-install (<module 'cloudinit.config.cc_package_update_upgrade_install' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_package_update_upgrade_install.pyc'>) failed
Cloud-init v. 19.3-45.amzn2 running 'modules:final' at Mon, 30 May 2022 10:42:59 +0000. Up 29.98 seconds.
Unknown error retrieving Instance5FFEF8E4e0ce835dd5aaXXXX
ValidationError: Stack arn:aws:cloudformation:eu-central-1:ACCOUNT_ID:stack/MyCDK/d1772460-e004-11ec-b341-29280531XXXX is in CREATE_FAILED state and cannot be signaled
2022-05-30 10:43:00,475 [DEBUG] CloudFormation client initialized with endpoint https://cloudformation.eu-central-1.amazonaws.com
2022-05-30 10:43:00,476 [DEBUG] Describing resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK
2022-05-30 10:44:00,476 [WARNING] Timeout of 60 seconds breached
2022-05-30 10:44:00,476 [ERROR] Client-side timeout
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 189, in _retry
return f(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 263, in _timeout
"Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2022-05-30 10:44:00,478 [DEBUG] Sleeping for 0.648091 seconds before retrying
2022-05-30 10:44:01,128 [DEBUG] Describing resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK
2022-05-30 10:45:01,128 [WARNING] Timeout of 60 seconds breached
2022-05-30 10:45:01,128 [ERROR] Client-side timeout
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 189, in _retry
return f(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 263, in _timeout
"Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2022-05-30 10:45:01,129 [DEBUG] Sleeping for 2.585657 seconds before retrying
2022-05-30 10:45:03,717 [DEBUG] Describing resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK
2022-05-30 10:46:03,717 [WARNING] Timeout of 60 seconds breached
2022-05-30 10:46:03,718 [ERROR] Client-side timeout
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 189, in _retry
return f(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 263, in _timeout
"Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2022-05-30 10:46:03,718 [DEBUG] Sleeping for 4.082728 seconds before retrying
2022-05-30 10:46:07,805 [DEBUG] Describing resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK
2022-05-30 10:47:07,805 [WARNING] Timeout of 60 seconds breached
2022-05-30 10:47:07,806 [ERROR] Client-side timeout
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 189, in _retry
return f(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 263, in _timeout
"Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2022-05-30 10:47:07,806 [DEBUG] Sleeping for 11.379097 seconds before retrying
2022-05-30 10:47:19,197 [DEBUG] Describing resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK
2022-05-30 10:48:19,197 [WARNING] Timeout of 60 seconds breached
2022-05-30 10:48:19,197 [ERROR] Client-side timeout
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 189, in _retry
return f(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 263, in _timeout
"Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2022-05-30 10:48:19,521 [DEBUG] CloudFormation client initialized with endpoint https://cloudformation.eu-central-1.amazonaws.com
2022-05-30 10:48:19,523 [DEBUG] Signaling resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK with unique ID i-0b3eb81ec6a111218 and status SUCCESS
2022-05-30 10:49:19,524 [WARNING] Timeout of 60 seconds breached
2022-05-30 10:49:19,524 [ERROR] Client-side timeout
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 189, in _retry
return f(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 263, in _timeout
"Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2022-05-30 10:49:19,525 [DEBUG] Sleeping for 0.292454 seconds before retrying
2022-05-30 10:49:19,818 [DEBUG] Signaling resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK with unique ID i-0b3eb81ec6a111218 and status SUCCESS
2022-05-30 10:50:19,818 [WARNING] Timeout of 60 seconds breached
2022-05-30 10:50:19,818 [ERROR] Client-side timeout
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 189, in _retry
return f(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 263, in _timeout
"Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2022-05-30 10:50:19,819 [DEBUG] Sleeping for 1.337550 seconds before retrying
2022-05-30 10:50:21,158 [DEBUG] Signaling resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK with unique ID i-0b3eb81ec6a111218 and status SUCCESS
2022-05-30 10:51:21,158 [WARNING] Timeout of 60 seconds breached
2022-05-30 10:51:21,158 [ERROR] Client-side timeout
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 189, in _retry
return f(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 263, in _timeout
"Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2022-05-30 10:51:21,159 [DEBUG] Sleeping for 6.997329 seconds before retrying
2022-05-30 10:51:28,163 [DEBUG] Signaling resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK with unique ID i-0b3eb81ec6a111218 and status SUCCESS
2022-05-30 10:52:28,164 [WARNING] Timeout of 60 seconds breached
2022-05-30 10:52:28,164 [ERROR] Client-side timeout
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 189, in _retry
return f(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 263, in _timeout
"Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2022-05-30 10:52:28,164 [DEBUG] Sleeping for 5.279977 seconds before retrying
2022-05-30 10:52:33,450 [DEBUG] Signaling resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK with unique ID i-0b3eb81ec6a111218 and status SUCCESS
ci-info: no authorized ssh keys fingerprints found for user ec2-user.
Cloud-init v. 19.3-45.amzn2 finished at Mon, 30 May 2022 10:52:33 +0000. Datasource DataSourceEc2. Up 604.40 seconds
The problem was that the instance didn't have internet access (despite being on a public subnet).
The reason for this was that the VPC is not our default VPC, and therefore the public subnet we created did not have Auto-assign public IPv4 address enabled. Enabling this setting fixed the problem.
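For anyone hitting the same wall, here is a minimal sketch of the fix in CDK terms, assuming a recent aws-cdk-lib. Both flags below (map_public_ip_on_launch and associate_public_ip_address) are my additions on top of the stack above, not code from the original question:

# Option 1: if you create the VPC yourself, enable auto-assign on the
# public subnets (not applicable verbatim to a looked-up VPC).
vpc = ec2.Vpc(
    self,
    "VPC",
    subnet_configuration=[
        ec2.SubnetConfiguration(
            name="public",
            subnet_type=ec2.SubnetType.PUBLIC,
            map_public_ip_on_launch=True,  # the setting that was missing
        )
    ],
)

# Option 2: request a public IP on the instance itself, regardless of
# the subnet default.
instance = ec2.Instance(
    self,
    "Instance",
    instance_type=ec2.InstanceType("t3.micro"),
    machine_image=amzn_linux,
    vpc=vpc,
    vpc_subnets=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PUBLIC),
    role=role,
    associate_public_ip_address=True,
)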
Phew!
In our K8s cluster, we use an HAProxy app for connecting to the Galera cluster.
Our haproxy.cfg file looks like this:
global
    maxconn 2048
    external-check
    stats socket /var/run/haproxy.sock mode 600 expose-fd listeners level user
    user haproxy
    group haproxy

defaults
    log global
    mode tcp
    retries 10
    timeout client 30000
    timeout connect 100500
    timeout server 30000

frontend mysql-router-service
    bind *:6446
    mode tcp
    option tcplog
    default_backend galera_cluster_backend

# MySQL Cluster BE configuration
backend galera_cluster_backend
    mode tcp
    option tcpka
    option mysql-check user haproxy
    balance source
    server pitipana-opsdb1 192.168.144.82:3306 check weight 1
    server pitipana-opsdb2 192.168.144.83:3306 check weight 1
    server pitipana-opsdb3 192.168.144.84:3306 check weight 1
Dockerfile for creating the HAProxy image:
FROM haproxy:2.3
COPY haproxy.cfg /usr/local/etc/haproxy/haproxy.cfg
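For reference, the image is then built with a plain docker build (the tag name here is an assumption, not from our setup):

docker build -t galera-haproxy .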
On my Galera nodes, I get constant warnings in /var/log/mysql/error.log:
2021-12-20 21:16:47 5942 [Warning] Aborted connection 5942 to db: 'ourdb' user: 'ouruser' host: '192.168.1.2' (Got an error reading communication packets)
2021-12-20 21:16:47 5943 [Warning] Aborted connection 5943 to db: 'ourdb' user: 'ouruser' host: '192.168.1.2' (Got an error reading communication packets)
2021-12-20 21:16:47 5944 [Warning] Aborted connection 5944 to db: 'ourdb' user: 'ouruser' host: '192.168.1.2' (Got an error reading communication packets)
I had already increased max_allowed_packet to 64MB and max_connections to 1000.
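For reference, that corresponds to something like the following on the server side (the file path and section are assumptions; the values are the ones from above):

# e.g. in /etc/mysql/my.cnf on each Galera node
[mysqld]
max_allowed_packet = 64M
max_connections    = 1000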
When I take a tcpdump on a Galera node:
Frame 16: 106 bytes on wire (848 bits), 106 bytes captured (848 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 192.168.1.2, Dst: 192.168.10.3
Transmission Control Protocol, Src Port: 62495, Dst Port: 3306, Seq: 1, Ack: 1, Len: 50
Source Port: 62495
Destination Port: 3306
[Stream index: 2]
[TCP Segment Len: 50]
Sequence number: 1 (relative sequence number)
[Next sequence number: 51 (relative sequence number)]
Acknowledgment number: 1 (relative ack number)
0101 .... = Header Length: 20 bytes (5)
Flags: 0x018 (PSH, ACK)
000. .... .... = Reserved: Not set
...0 .... .... = Nonce: Not set
.... 0... .... = Congestion Window Reduced (CWR): Not set
.... .0.. .... = ECN-Echo: Not set
.... ..0. .... = Urgent: Not set
.... ...1 .... = Acknowledgment: Set
.... .... 1... = Push: Set
.... .... .0.. = Reset: Not set
.... .... ..0. = Syn: Not set
.... .... ...0 = Fin: Not set
[TCP Flags: ·······AP···]
Window size value: 507
[Calculated window size: 64896]
[Window size scaling factor: 128]
Checksum: 0x3cec [unverified]
[Checksum Status: Unverified]
Urgent pointer: 0
[SEQ/ACK analysis]
[Timestamps]
TCP payload (50 bytes)
[PDU Size: 45]
[PDU Size: 5]
MySQL Protocol
Packet Length: 41
Packet Number: 1
Request Command SLEEP
Command: SLEEP (0)
Payload: 820000008000012100000000000000000000000000000000...
[Expert Info (Warning/Protocol): Unknown/invalid command code]
[Unknown/invalid command code]
[Severity level: Warning]
[Group: Protocol]
MySQL Protocol
Packet Length: 1
Packet Number: 0
Request Command Quit
Command: Quit (1)
Here 192.168.1.2 is a K8s worker node and 192.168.10.3 is the Galera node.
When I connect to our applications in K8s, we can access the applications, but when we try to edit anything, we get stuck.
Any suggestions to fix this?
In Apache Spark I do a join between a rate stream and a CSV-file streaming read; both are needed for very-low-intensity data generation. The rate source produces increasing ids and limits the generation speed, while the CSV reader tends to load all the data with no rate limit, so joining the streams should help limit the CSV data.
readFromCSVFile(tmpPath.toString).as("csv").join(rate.as("counter")).where("csv.id == counter.value")
Unfortunately, the join uses HDFS under the hood, so I'm getting a large error stack:
2021-10-15 14:18:02 ERROR Inbox:94 - Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@137b9386 rejected from java.util.concurrent.ThreadPoolExecutor@4cd6112[Shutting down, pool size = 7, active threads = 7, queued tasks = 0, completed tasks = 30]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:230)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:74)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:835)
2021-10-15 14:18:02 ERROR WriteToDataSourceV2Exec:73 - Data source write support org.apache.spark.sql.execution.streaming.sources.MicroBatchWrite@2d76d5c5 is aborting.
2021-10-15 14:18:02 ERROR WriteToDataSourceV2Exec:73 - Data source write support org.apache.spark.sql.execution.streaming.sources.MicroBatchWrite@2d76d5c5 aborted.
2021-10-15 14:18:02 ERROR MicroBatchExecution:94 - Query kafkaDataGenerator [id = 23a9869d-913f-4cf6-b0ed-e8149ed149e6, runId = d98459c0-ae1f-44e8-a610-7ee413740880] terminated with error
org.apache.spark.SparkException: Writing job aborted.
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:413)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:361)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.writeWithV2(WriteToDataSourceV2Exec.scala:322)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.run(WriteToDataSourceV2Exec.scala:329)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:39)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:39)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:45)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3627)
at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2940)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2940)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:575)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$15(MicroBatchExecution.scala:570)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:352)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:350)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:570)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:223)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:352)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:350)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:191)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:185)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:334)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:245)
Caused by: org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1(DAGScheduler.scala:979)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1$adapted(DAGScheduler.scala:977)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:977)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2257)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2170)
at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:1973)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1973)
at org.apache.spark.SparkContext.$anonfun$new$35(SparkContext.scala:631)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932)
at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:835)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:382)
... 37 more
2021-10-15 14:18:03 ERROR Utils:94 - Aborting task
java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$7(WriteToDataSourceV2Exec.scala:445)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:477)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:385)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:835)
2021-10-15 14:18:03 ERROR DataWritingSparkTask:73 - Aborting commit for partition 30 (task 31, attempt 0, stage 2.0)
and the most important part is here:
java.lang.IllegalStateException: Error committing version 1 into HDFSStateStore[id=(op=0,part=33),dir=file:/C:/Users/eljah32/AppData/Local/Temp/spark-cb8ca918-43cc-43d6-8f36-3bf80d1e7852/kafkaDataGenerator/state/0/33/left-keyWithIndexToValue]
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:139)
This means that the join operation requires HDFS to be used. HDFSBackedStateStoreProvider is the only built-in implementation; the other known one is based on RocksDB. I haven't found a way to disable the StateStoreProvider for the join operation when the data volume is small enough that we could rely on in-memory operations for this particular job. Is there some option to disable StateStoreProvider usage, given that there is no pure in-memory implementation?
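For what it's worth, the provider is pluggable rather than switchable off. A hedged sketch of the only related knob I know of, the spark.sql.streaming.stateStore.providerClass setting (shown in PySpark for brevity even though the code above is Scala; the RocksDB provider assumes Spark 3.2+, and this is not a confirmed fix for the error above):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kafkaDataGenerator")
    # Replace the default HDFSBackedStateStoreProvider; any class
    # implementing StateStoreProvider can be named here.
    .config(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    )
    .getOrCreate()
)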
I am trying to create an NFS Kerberos configuration with includedir. The context is the following:
- the default realm points to TEST.REALM.COM (Hadoop installation)
- the NAS/NFS realm points to NFS.ANOTHER.REALM.COM
When I put all realms and domain realms in the krb5.conf file, I am able to mount my NFS share. When I use the includedir tag, things don't work out.
Here is my krb5.conf:
includedir /etc/krb5.conf.d/

[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 dns_lookup_realm = false
 dns_lookup_kdc = false
 forwardable = true
 allow_weak_crypto = false
Here is the config file for the default Hadoop realm:
[libdefaults]
 default_realm = TEST.REALM.COM
 TEST.REALM.COM = {
   ticket_lifetime = 1d
   renew_lifetime = 14d
 }

[realms]
 TEST.REALM.COM = {
   kdc = admhadoop1.realm.com
   kdc = admhadoop1.realm.com
   admin_server = admhadoop1.realm.com
 }

[domain_realm]
 .realm.com = TEST.REALM.COM
 realm.com = TEST.REALM.COM
Here is the config for the NFS realm:
[libdefaults]
 NFS.ANOTHER.REALM.COM = {
   ticket_lifetime = 14d
   renew_lifetime = 180d
 }

[realms]
 NFS.ANOTHER.REALM.COM = {
   kdc = admnfs1.realm.com
   kdc = admnfs2.realm.com
   admin_server = admnfs1.realm.com
 }

[domain_realm]
 nfs01.realm.com = NFS.ANOTHER.REALM.COM
The /etc/krb5.keytab only contains the principals host, nfs and root for the test01 server.
With this configuration, when I try to mount a share from nfs01.realm.com I get this kind of error:
rpc.gssd[7078]: dir_notify_handler: sig 37 si 0x7fff55a1d130 data 0x7fff55a1d000
rpc.gssd[7078]: dir_notify_handler: sig 37 si 0x7fff55a185b0 data 0x7fff55a18480
rpc.gssd[7078]: dir_notify_handler: sig 37 si 0x7fff55a1d130 data 0x7fff55a1d000
rpc.gssd[7078]: handling gssd upcall (/var/lib/nfs/rpc_pipefs/nfs/clnt16)
rpc.gssd[7078]: handle_gssd_upcall: 'mech=krb5 uid=0 enctypes=18,17,16,23,3,1,2 '
rpc.gssd[7078]: handling krb5 upcall (/var/lib/nfs/rpc_pipefs/nfs/clnt16)
rpc.gssd[7078]: process_krb5_upcall: service is '<null>'
rpc.gssd[7078]: Full hostname for 'nfs01.realm.com' is 'nfs01.realm.com'
rpc.gssd[7078]: Full hostname for 'test01.realm.com' is 'test01.realm.com'
rpc.gssd[7078]: No key table entry found for TEST01$@TEST.REALM.COM while getting keytab entry for 'TEST01$@TEST.REALM.COM'
rpc.gssd[7078]: No key table entry found for root/test01.realm.com@TEST.REALM.COM while getting keytab entry for 'root/test01.realm.com@TEST.REALM.COM'
rpc.gssd[7078]: No key table entry found for nfs/test01.realm.com@TEST.REALM.COM while getting keytab entry for 'nfs/test01.realm.com@TEST.REALM.COM'
rpc.gssd[7078]: No key table entry found for host/test01.realm.com@TEST.REALM.COM while getting keytab entry for 'host/test01.realm.com@TEST.REALM.COM'
rpc.gssd[7078]: ERROR: gssd_refresh_krb5_machine_credential: no usable keytab entry found in keytab /etc/krb5.keytab for connection with host nfs01.realm.com
rpc.gssd[7078]: ERROR: No credentials found for connection to server nfs01.realm.com
rpc.gssd[7078]: doing error downcall
rpc.gssd[7078]: dir_notify_handler: sig 37 si 0x7fff55a1d130 data 0x7fff55a1d000
rpc.gssd[7078]: dir_notify_handler: sig 37 si 0x7fff55a1d130 data 0x7fff55a1d000
rpc.gssd[7078]: destroying client /var/lib/nfs/rpc_pipefs/nfs/clnt17
rpc.gssd[7078]: dir_notify_handler: sig 37 si 0x7fff55a1d130 data 0x7fff55a1d000
rpc.gssd[7078]: dir_notify_handler: sig 37 si 0x7fff55a1d130 data 0x7fff55a1d000
rpc.gssd[7078]: dir_notify_handler: sig 37 si 0x7fff55a1d130 data 0x7fff55a1d000
rpc.gssd[7078]: dir_notify_handler: sig 37 si 0x7fff55a1d130 data 0x7fff55a1d000
rpc.gssd[7078]: dir_notify_handler: sig 37 si 0x7fff55a1d130 data 0x7fff55a1d000
It looks like the NFS daemon does not work with the includedir tag.
What do you think?
The problem was that the files in the included directory may only have alphanumeric names (plus "-" and "_") and must not contain a ".", as mine did.
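To illustrate with hypothetical file names (libkrb5 silently skips include files whose names contain anything other than alphanumerics, "-" and "_"):

# hadoop.conf and nfs.conf are ignored because of the dot in the name;
# renaming them makes the includedir take effect
mv /etc/krb5.conf.d/hadoop.conf /etc/krb5.conf.d/hadoop_realm
mv /etc/krb5.conf.d/nfs.conf /etc/krb5.conf.d/nfs_realm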
OK, I have a lab setup: a FreeNAS server with iSCSI, using CHAP for discovery and mutual CHAP for targets.
Here are the requirements:
Implement CHAP security
One-way CHAP for discovery
Two-way (Mutual) CHAP for targets
I can connect and discover successfully with two ESXi servers, Windows 7, Windows 2003, 2008, and 2012.
CentOS can see the discovery list, but when trying to connect with:
iscsiadm --mode node --targetname iqn.2015.lab.com:centos --portal 192.168.1.60:3260 --login
the terminal outputs:
no records found
Here is my iscsid.conf. I left the comments in for the CHAP section but removed them from the rest, as the file is just so large:
iscsid.startup = /etc/rc.d/init.d/iscsid force-start
node.startup = automatic
node.leading_login = No
# *************
# CHAP Settings
# *************
# To enable CHAP authentication set node.session.auth.authmethod
# to CHAP. The default is None.
node.session.auth.authmethod = CHAP
# To set a CHAP username and password for initiator
# authentication by the target(s), uncomment the following lines:
#node.session.auth.username = group7
#node.session.auth.password = passwordpassword
# To set a CHAP username and password for target(s)
# authentication by the initiator, uncomment the following lines:
node.session.auth.username_in = group7
node.session.auth.password_in = passwordpassword
# To enable CHAP authentication for a discovery session to the target
# set discovery.sendtargets.auth.authmethod to CHAP. The default is None.
discovery.sendtargets.auth.authmethod = CHAP
# To set a discovery session CHAP username and password for the initiator
# authentication by the target(s), uncomment the following lines:
discovery.sendtargets.auth.username = group7
discovery.sendtargets.auth.password = passwordpassword
# To set a discovery session CHAP username and password for target(s)
# authentication by the initiator, uncomment the following lines:
#discovery.sendtargets.auth.username_in = group7
#discovery.sendtargets.auth.password_in = passwordpassword
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.initial_login_retry_max = 8
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.xmit_thread_priority = -20
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
node.conn[0].iscsi.HeaderDigest = None
node.session.nr_sessions = 1
node.session.iscsi.FastAbort = Yes
Any help is appreciated. Thank you.
You want mutual CHAP for session setup, but in your configuration file you have commented out the lines that define the login from the initiator to the target:
# To set a CHAP username and password for initiator
# authentication by the target(s), uncomment the following lines:
#node.session.auth.username = group7
#node.session.auth.password = passwordpassword
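Uncommented, with the values already present in your file, those two lines should read:

node.session.auth.username = group7
node.session.auth.password = passwordpassword

With them in place, the target can authenticate the initiator (one direction of the mutual CHAP), while your existing username_in/password_in pair covers the other direction.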