I am trying to deploy the CDK stack below:
from aws_cdk import CfnOutput, Duration, Stack
from aws_cdk import aws_ec2 as ec2, aws_iam as iam
from constructs import Construct


class MyCdkStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        vpc = ec2.Vpc.from_lookup(self, "VPC", vpc_id=EXISTING_VPC_ID)

        amzn_linux = ec2.MachineImage.latest_amazon_linux(
            generation=ec2.AmazonLinuxGeneration.AMAZON_LINUX_2
        )

        role = iam.Role(
            self, "Role", assumed_by=iam.ServicePrincipal("ec2.amazonaws.com")
        )
        role.add_managed_policy(
            iam.ManagedPolicy.from_aws_managed_policy_name(
                "AmazonSSMManagedInstanceCore"
            )
        )

        instance = ec2.Instance(
            self,
            "Instance",
            instance_type=ec2.InstanceType("t3.micro"),
            machine_image=amzn_linux,
            vpc=vpc,
            vpc_subnets=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PUBLIC),
            role=role,
            init=ec2.CloudFormationInit.from_elements(
                ec2.InitPackage.yum("docker"),
            ),
            init_options=ec2.ApplyCloudFormationInitOptions(
                timeout=Duration.minutes(5),
                ignore_failures=True,
            ),
        )

        # Allow ssh connections from anywhere
        instance.connections.allow_from_any_ipv4(ec2.Port.tcp(22))

        # Elastic IP
        eip = ec2.CfnEIP(self, "EIP", instance_id=instance.instance_id)

        # Outputs
        CfnOutput(self, "EIP Address", value=eip.ref)
The deployment fails after 5 minutes and rolls back with the following error message:
Failed to receive 1 resource signal(s) within the specified duration
Here are possible problems I have considered:
The server might not have outbound internet access (but I have put it on a public subnet).
I've also tried using an Amazon Linux 2022 AMI instead.
The 5-minute timeout might not be sufficient (but I have tried increasing it to 15 minutes, to no avail).
There is something else wrong with my setup (but without the CloudFormationInit stuff the server is created as expected).
Yum installing docker might be impossible (but if I create the server without the CloudFormationInit stuff, I can SSH into the instance and then sudo yum install docker works).
The server is not allowed to send cfn signals (but the raw CloudFormation template created by CDK seems to include the relevant auto-generated user data and permissions, see below):
// Excerpts from autogenerated CDK template json
"UserData": {
"Fn::Base64": {
"Fn::Join": [
"",
[
"#!/bin/bash\n# fingerprint: 7d8f48713aedxxxx\n(\n set +e\n /opt/aws/bin/cfn-init -v --region ",
{
"Ref": "AWS::Region"
},
" --stack ",
{
"Ref": "AWS::StackName"
},
" --resource Instance5FFEF8E4e0ce835dd5aaxxxx -c default\n /opt/aws/bin/cfn-signal -e 0 --region ",
{
"Ref": "AWS::Region"
},
" --stack ",
{
"Ref": "AWS::StackName"
},
" --resource Instance5FFEF8E4e0ce835dd5aaxxxx\n cat /var/log/cfn-init.log >&2\n)"
]
]
}
}
// -----
"RoleDefaultPolicy5FFBxxx": {
"Type": "AWS::IAM::Policy",
"Properties": {
"PolicyDocument": {
"Statement": [
{
"Action": [
"cloudformation:DescribeStackResource",
"cloudformation:SignalResource"
],
"Effect": "Allow",
"Resource": {
"Ref": "AWS::StackId"
}
}
],
"Version": "2012-10-17"
},
"PolicyName": "RoleDefaultPolicy5FFB7xxx",
"Roles": [
{
"Ref": "Role1ABCxxxx"
}
]
},
"Metadata": {
"aws:cdk:path": "xxx/Role/DefaultPolicy/Resource"
}
},
Wondering what else there is left for me to try! Any help would be greatly appreciated. I have that sinking feeling that I've overlooked something obvious...
Edit:
In response to Paolo's comment, here is the full output from cdk synth with identifiers obfuscated.
Resources:
Role1ABCXXXX:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Statement:
- Action: sts:AssumeRole
Effect: Allow
Principal:
Service: ec2.amazonaws.com
Version: "2012-10-17"
ManagedPolicyArns:
- Fn::Join:
- ""
- - "arn:"
- Ref: AWS::Partition
- :iam::aws:policy/AmazonSSMManagedInstanceCore
Metadata:
aws:cdk:path: MyCDK/Role/Resource
RoleDefaultPolicy5FFBXXXX:
Type: AWS::IAM::Policy
Properties:
PolicyDocument:
Statement:
- Action:
- cloudformation:DescribeStackResource
- cloudformation:SignalResource
Effect: Allow
Resource:
Ref: AWS::StackId
Version: "2012-10-17"
PolicyName: RoleDefaultPolicy5FFBXXXX
Roles:
- Ref: Role1ABCXXXX
Metadata:
aws:cdk:path: MyCDK/Role/DefaultPolicy/Resource
InstanceInstanceSecurityGroup698618EC:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: MyCDK/Instance/InstanceSecurityGroup
SecurityGroupEgress:
- CidrIp: 0.0.0.0/0
Description: Allow all outbound traffic by default
IpProtocol: "-1"
SecurityGroupIngress:
- CidrIp: 0.0.0.0/0
Description: from 0.0.0.0/0:22
FromPort: 22
IpProtocol: tcp
ToPort: 22
VpcId: vpc-07848d9441fddea14
Metadata:
aws:cdk:path: MyCDK/Instance/InstanceSecurityGroup/Resource
InstanceInstanceProfile01ECXXXX:
Type: AWS::IAM::InstanceProfile
Properties:
Roles:
- Ref: Role1ABCXXXX
Metadata:
aws:cdk:path: MyCDK/Instance/InstanceProfile
Instance5FFEF8E47f468d710e75XXXX:
Type: AWS::EC2::Instance
Properties:
AvailabilityZone: eu-central-1a
IamInstanceProfile:
Ref: InstanceInstanceProfile01ECXXXX
ImageId:
Ref: SsmParameterValueawsserviceamiamazonlinuxlatestamzn2amihvmx8664gp2C96584B6F00A464EAD1953AFF4B05118Parameter
InstanceType: t3.micro
SecurityGroupIds:
- Fn::GetAtt:
- InstanceInstanceSecurityGroup698618EC
- GroupId
SubnetId: subnet-079be82ff7754XXXX
UserData:
Fn::Base64:
Fn::Join:
- ""
- - |-
#!/bin/bash
# fingerprint: 5af534616771e4af
(
set +e
/opt/aws/bin/cfn-init -v --region
- Ref: AWS::Region
- " --stack "
- Ref: AWS::StackName
- |-2
--resource Instance5FFEF8E47f468d710e75XXXX -c default
/opt/aws/bin/cfn-signal -e 0 --region
- Ref: AWS::Region
- " --stack "
- Ref: AWS::StackName
- |-2
--resource Instance5FFEF8E47f468d710e75XXXX
cat /var/log/cfn-init.log >&2
)
DependsOn:
- RoleDefaultPolicy5FFBXXXX
- Role1ABCXXXX
CreationPolicy:
ResourceSignal:
Count: 1
Timeout: PT5M
Metadata:
aws:cdk:path: MyCDK/Instance/Resource
AWS::CloudFormation::Init:
configSets:
default:
- config
config:
packages:
yum:
docker: []
EIP:
Type: AWS::EC2::EIP
Properties:
InstanceId:
Ref: Instance5FFEF8E47f468d710e75XXXX
Metadata:
aws:cdk:path: MyCDK/EIP
CDKMetadata:
Type: AWS::CDK::Metadata
Properties:
Analytics: v2:deflate64:H4sIAAAAAAAA/2VOyQ6CMBD9Fu5lFDwYz8YYTjbwAabWIY6UlnSJIU3/XcDt4OmteXklFFtYZ+LhcnntckUXiI0XsmM1OhOsRDZl50iih1gbhWzf6gW5USTHWf5YpZ0XWiK3piWFiaEsIX5c1qAMlvx4tXXXX//P+FYnfqh4Ssu+sKJHj3YWp+CH4JcX74OJ8dHfjF5tYAdFmd0dUW6D9tQj1C98AstX0JrnXXXX
Metadata:
aws:cdk:path: MyCDK/CDKMetadata/Default
Parameters:
SsmParameterValueawsserviceamiamazonlinuxlatestamzn2amihvmx8664gp2C96584B6F00A464EAD1953AFF4B05118Parameter:
Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
Default: /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2
BootstrapVersion:
Type: AWS::SSM::Parameter::Value<String>
Default: /cdk-bootstrap/hnb659fds/version
Description: Version of the CDK Bootstrap resources in this environment, automatically retrieved from SSM Parameter Store. [cdk:skip]
Outputs:
EIPAddress:
Value:
Ref: EIP
Rules:
CheckBootstrapVersion:
Assertions:
- Assert:
Fn::Not:
- Fn::Contains:
- - "1"
- "2"
- "3"
- "4"
- "5"
- Ref: BootstrapVersion
AssertDescription: CDK bootstrap stack version 6 required. Please run 'cdk bootstrap' with a recent version of the CDK CLI.
Edit 2: Here is the cloud-init-output.log.
Cloud-init v. 19.3-45.amzn2 running 'init-local' at Mon, 30 May 2022 10:42:35 +0000. Up 6.48 seconds.
Cloud-init v. 19.3-45.amzn2 running 'init' at Mon, 30 May 2022 10:42:37 +0000. Up 7.60 seconds.
ci-info: ++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++
ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
ci-info: | eth0 | True | 10.0.0.156 | 255.255.255.0 | global | 02:6c:e8:e3:39:84 |
ci-info: | eth0 | True | fe80::6c:e8ff:fee3:3984/64 | . | link | 02:6c:e8:e3:39:84 |
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
ci-info: | lo | True | ::1/128 | . | host | . |
ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
ci-info: ++++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++++
ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
ci-info: | 0 | 0.0.0.0 | 10.0.0.1 | 0.0.0.0 | eth0 | UG |
ci-info: | 1 | 10.0.0.0 | 0.0.0.0 | 255.255.255.0 | eth0 | U |
ci-info: | 2 | 169.254.169.254 | 0.0.0.0 | 255.255.255.255 | eth0 | UH |
ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | 9 | fe80::/64 | :: | eth0 | U |
ci-info: | 11 | local | :: | eth0 | U |
ci-info: | 12 | ff00::/8 | :: | eth0 | U |
ci-info: +-------+-------------+---------+-----------+-------+
Cloud-init v. 19.3-45.amzn2 running 'modules:config' at Mon, 30 May 2022 10:42:38 +0000. Up 9.21 seconds.
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
One of the configured repositories failed (Unknown),
and yum doesn't have enough cached data to continue. At this point the only
safe thing yum can do is fail. There are a few ways to work "fix" this:
1. Contact the upstream for the repository and get them to fix the problem.
2. Reconfigure the baseurl/etc. for the repository, to point to a working
upstream. This is most often useful if you are using a newer
distribution release than is supported by the repository (and the
packages for the previous distribution release still work).
3. Run the command with the repository temporarily disabled
yum --disablerepo=<repoid> ...
4. Disable the repository permanently, so yum won't use it by default. Yum
will then just ignore the repository until you permanently enable it
again or use --enablerepo for temporary usage:
yum-config-manager --disable <repoid>
or
subscription-manager repos --disable=<repoid>
5. Configure the failing repository to be skipped, if it is unavailable.
Note that yum will try to contact the repo. when it runs most commands,
so will have to try and fail each time (and thus. yum will be be much
slower). If it is a very temporary problem though, this is often a nice
compromise:
yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true
Cannot find a valid baseurl for repo: amzn2-core/2/x86_64
Could not retrieve mirrorlist https://amazonlinux-2-repos-eu-central-1.s3.dualstack.eu-central-1.amazonaws.com/2/core/latest/x86_64/mirror.list error was
12: Timeout on https://amazonlinux-2-repos-eu-central-1.s3.dualstack.eu-central-1.amazonaws.com/2/core/latest/x86_64/mirror.list: (28, 'Failed to connect to amazonlinux-2-repos-eu-central-1.s3.dualstack.eu-central-1.amazonaws.com port 443 after 2702 ms: Connection timed out')
May 30 10:42:58 cloud-init[2199]: util.py[WARNING]: Package upgrade failed
May 30 10:42:58 cloud-init[2199]: cc_package_update_upgrade_install.py[WARNING]: 1 failed with exceptions, re-raising the last one
May 30 10:42:58 cloud-init[2199]: util.py[WARNING]: Running module package-update-upgrade-install (<module 'cloudinit.config.cc_package_update_upgrade_install' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_package_update_upgrade_install.pyc'>) failed
Cloud-init v. 19.3-45.amzn2 running 'modules:final' at Mon, 30 May 2022 10:42:59 +0000. Up 29.98 seconds.
Unknown error retrieving Instance5FFEF8E4e0ce835dd5aaXXXX
ValidationError: Stack arn:aws:cloudformation:eu-central-1:ACCOUNT_ID:stack/MyCDK/d1772460-e004-11ec-b341-29280531XXXX is in CREATE_FAILED state and cannot be signaled
2022-05-30 10:43:00,475 [DEBUG] CloudFormation client initialized with endpoint https://cloudformation.eu-central-1.amazonaws.com
2022-05-30 10:43:00,476 [DEBUG] Describing resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK
2022-05-30 10:44:00,476 [WARNING] Timeout of 60 seconds breached
2022-05-30 10:44:00,476 [ERROR] Client-side timeout
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 189, in _retry
return f(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 263, in _timeout
"Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2022-05-30 10:44:00,478 [DEBUG] Sleeping for 0.648091 seconds before retrying
2022-05-30 10:44:01,128 [DEBUG] Describing resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK
2022-05-30 10:45:01,128 [WARNING] Timeout of 60 seconds breached
2022-05-30 10:45:01,128 [ERROR] Client-side timeout
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 189, in _retry
return f(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 263, in _timeout
"Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2022-05-30 10:45:01,129 [DEBUG] Sleeping for 2.585657 seconds before retrying
2022-05-30 10:45:03,717 [DEBUG] Describing resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK
2022-05-30 10:46:03,717 [WARNING] Timeout of 60 seconds breached
2022-05-30 10:46:03,718 [ERROR] Client-side timeout
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 189, in _retry
return f(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 263, in _timeout
"Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2022-05-30 10:46:03,718 [DEBUG] Sleeping for 4.082728 seconds before retrying
2022-05-30 10:46:07,805 [DEBUG] Describing resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK
2022-05-30 10:47:07,805 [WARNING] Timeout of 60 seconds breached
2022-05-30 10:47:07,806 [ERROR] Client-side timeout
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 189, in _retry
return f(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 263, in _timeout
"Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2022-05-30 10:47:07,806 [DEBUG] Sleeping for 11.379097 seconds before retrying
2022-05-30 10:47:19,197 [DEBUG] Describing resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK
2022-05-30 10:48:19,197 [WARNING] Timeout of 60 seconds breached
2022-05-30 10:48:19,197 [ERROR] Client-side timeout
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 189, in _retry
return f(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 263, in _timeout
"Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2022-05-30 10:48:19,521 [DEBUG] CloudFormation client initialized with endpoint https://cloudformation.eu-central-1.amazonaws.com
2022-05-30 10:48:19,523 [DEBUG] Signaling resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK with unique ID i-0b3eb81ec6a111218 and status SUCCESS
2022-05-30 10:49:19,524 [WARNING] Timeout of 60 seconds breached
2022-05-30 10:49:19,524 [ERROR] Client-side timeout
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 189, in _retry
return f(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 263, in _timeout
"Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2022-05-30 10:49:19,525 [DEBUG] Sleeping for 0.292454 seconds before retrying
2022-05-30 10:49:19,818 [DEBUG] Signaling resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK with unique ID i-0b3eb81ec6a111218 and status SUCCESS
2022-05-30 10:50:19,818 [WARNING] Timeout of 60 seconds breached
2022-05-30 10:50:19,818 [ERROR] Client-side timeout
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 189, in _retry
return f(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 263, in _timeout
"Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2022-05-30 10:50:19,819 [DEBUG] Sleeping for 1.337550 seconds before retrying
2022-05-30 10:50:21,158 [DEBUG] Signaling resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK with unique ID i-0b3eb81ec6a111218 and status SUCCESS
2022-05-30 10:51:21,158 [WARNING] Timeout of 60 seconds breached
2022-05-30 10:51:21,158 [ERROR] Client-side timeout
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 189, in _retry
return f(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 263, in _timeout
"Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2022-05-30 10:51:21,159 [DEBUG] Sleeping for 6.997329 seconds before retrying
2022-05-30 10:51:28,163 [DEBUG] Signaling resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK with unique ID i-0b3eb81ec6a111218 and status SUCCESS
2022-05-30 10:52:28,164 [WARNING] Timeout of 60 seconds breached
2022-05-30 10:52:28,164 [ERROR] Client-side timeout
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 189, in _retry
return f(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 263, in _timeout
"Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2022-05-30 10:52:28,164 [DEBUG] Sleeping for 5.279977 seconds before retrying
2022-05-30 10:52:33,450 [DEBUG] Signaling resource Instance5FFEF8E4e0ce835dd5aaXXXX in stack MyCDK with unique ID i-0b3eb81ec6a111218 and status SUCCESS
ci-info: no authorized ssh keys fingerprints found for user ec2-user.
Cloud-init v. 19.3-45.amzn2 finished at Mon, 30 May 2022 10:52:33 +0000. Datasource DataSourceEc2. Up 604.40 seconds
The problem was that the instance didn't have internet access (despite being on a public subnet), which is why both the yum repository lookup and the cfn-signal calls in the log above time out.
The reason for this was that the VPC is not our default VPC, and therefore the public subnet we created did not have "Auto-assign public IPv4 address" enabled, so the instance never received a public IP. Enabling this setting fixed the problem.
Phew!
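For completeness, a minimal sketch of how the same fix can be expressed in CDK (Python). The map_public_ip_on_launch option on ec2.SubnetConfiguration only applies when the VPC is defined in the stack, and the associate_public_ip_address parameter on ec2.Instance exists only in newer aws-cdk-lib releases, so check your version; amzn_linux and role refer to the variables from the stack above, and the construct IDs are illustrative.

# (a) If the VPC and its subnets are created by this stack, have the public
#     subnet auto-assign public IPv4 addresses on launch.
vpc = ec2.Vpc(
    self,
    "Vpc",
    subnet_configuration=[
        ec2.SubnetConfiguration(
            name="public",
            subnet_type=ec2.SubnetType.PUBLIC,
            map_public_ip_on_launch=True,
        )
    ],
)

# (b) Or, on newer CDK releases, request a public IP on the instance itself,
#     which works even when the subnet does not auto-assign one.
instance = ec2.Instance(
    self,
    "Instance",
    instance_type=ec2.InstanceType("t3.micro"),
    machine_image=amzn_linux,
    vpc=vpc,
    vpc_subnets=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PUBLIC),
    role=role,
    associate_public_ip_address=True,
)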
I have some breakpoint "pairs," and I'd like to measure the time in between when they are hit.
The simplest thing that would allow me to do this is to include some sort of timestamp (even if it's just clock ticks or something) in the .printf I use when the breakpoint is hit.
I could use the pseudo-registers $tid and $dbgtime in the breakpoint code, but when I do, performance really suffers.
bp1000 ucrtbase!malloc ".printf \"[0x%08x] [ucrtbase] [0x%04x] [0x%08x] malloc(%d): \", $dbgtime, $tid, dwo(@esp), dwo(@esp+4); gc "
When the same code is used (without using meaningful values for timestamp and thread id), things work much better.
bp1000 ucrtbase!malloc ".printf \"[0x%08x] [ucrtbase] [0x%04x] [0x%08x] malloc(%d): \", 0, 0, dwo(@esp), dwo(@esp+4); gc "
Is there some other (high-performance) way to get this information? The current time is more valuable than the thread ID. I can always make the breakpoint only apply to a specific thread so that emitting the ID is only sugar.
try this
0:000> bp ucrtbase!malloc "~# ; .echotime ; dd @$csp l2 ; gc ;"
0:000> bl
0 e 00007ff8`ab61c9e0 0001 (0001) 0:**** ucrtbase!malloc "~# ; .echotime ; dd @$csp l2 ; gc ;"
0:000> g
. 0 Id: 1a84.1f14 Suspend: 1 Teb: 00000018`f49d1000 Unfrozen
Start: cdb!wmainCRTStartup (00007ff6`efd2bbf0)
Priority: 0 Priority class: 32 Affinity: f
Debugger (not debuggee) time: Wed Aug 7 22:17:44.992 2019
00000018`f47eeb58 ab622762 00007ff8
. 0 Id: 1a84.1f14 Suspend: 1 Teb: 00000018`f49d1000 Unfrozen
Start: cdb!wmainCRTStartup (00007ff6`efd2bbf0)
Priority: 0 Priority class: 32 Affinity: f
Debugger (not debuggee) time: Wed Aug 7 22:17:44.992 2019 (UTC + 5:30)
00000018`f47eeb08 ab622762 00007ff8
I have some tasks with manually configured routes and 3 workers configured to consume tasks from specific queues. But only one worker is consuming all of the tasks, and I have no idea how to fix this issue.
My celeryconfig.py
class CeleryConfig:
enable_utc = True
timezone = 'UTC'
imports = ('events.tasks')
broker_url = Config.BROKER_URL
broker_transport_options = {'visibility_timeout': 10800} # 3H
worker_hijack_root_logger = False
task_protocol = 2
task_ignore_result = True
task_publish_retry_policy = {'max_retries': 3, 'interval_start': 0, 'interval_step': 0.2, 'interval_max': 0.2}
task_time_limit = 30 # sec
task_soft_time_limit = 15 # sec
task_default_queue = 'low'
task_default_exchange = 'low'
task_default_routing_key = 'low'
task_queues = (
Queue('daily', Exchange('daily'), routing_key='daily'),
Queue('high', Exchange('high'), routing_key='high'),
Queue('normal', Exchange('normal'), routing_key='normal'),
Queue('low', Exchange('low'), routing_key='low'),
Queue('service', Exchange('service'), routing_key='service'),
Queue('award', Exchange('award'), routing_key='award'),
)
task_route = {
# -- SCHEDULE QUEUE --
base_path.format(task='refresh_rank'): {'queue': 'daily'},
# -- HIGH QUEUE --
base_path.format(task='execute_order'): {'queue': 'high'},
# -- NORMAL QUEUE --
base_path.format(task='calculate_cost'): {'queue': 'normal'},
# -- SERVICE QUEUE --
base_path.format(task='send_pin'): {'queue': 'service'},
# -- LOW QUEUE
base_path.format(task='invite_to_tournament'): {'queue': 'low'},
# -- AWARD QUEUE
base_path.format(task='get_lesson_award'): {'queue': 'award'},
# -- TEST TASK
}
worker_concurrency = multiprocessing.cpu_count() * 2 + 1
worker_prefetch_multiplier = 1 #
worker_max_tasks_per_child = 1
worker_max_memory_per_child = 90000 # 90MB
beat_max_loop_interval = 60 * 5 # 5 min
I run workers in a docker, part of my stack.yml
version: "3.7"
services:
worker_high:
command: celery worker -l debug -A runcelery.celery -Q high -n worker.high@%h
worker_normal:
command: celery worker -l debug -A runcelery.celery -Q normal,award,service,low -n worker.normal@%h
worker_schedule:
command: celery worker -l debug -A runcelery.celery -Q daily -n worker.schedule@%h
beat:
command: celery beat -l debug -A runcelery.celery
flower:
command: flower -l debug -A runcelery.celery --port=5555
broker:
image: redis:5.0-alpine
I thought that my config was right and the run commands were correct too, but the docker logs and flower show that only worker.normal consumes all the tasks.
Update
Here is part of task.py:
def refresh_rank_in_tournaments():
logger.debug(f'Start task refresh_rank_in_tournaments')
return AnalyticBackgroundManager.refresh_tournaments_rank()
base_path is a shortcut for the full task path:
base_path = 'events.tasks.{task}'
execute_order task code:
@celery.task(bind=True, default_retry_delay=5)
def execute_order(self, private_id, **kwargs):
try:
return OrderBackgroundManager.execute_order(private_id, **kwargs)
except IEXException as exc:
raise self.retry(exc=exc)
This task is called in a view as tasks.execute_order.delay(id).
Your worker.normal is subscribed to the normal, award, service, and low queues. Furthermore, the low queue is the default one, so every task that does not have an explicitly set queue will be executed by worker.normal.
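One more thing worth double-checking: in Celery 4+ the routing setting is spelled task_routes (plural). If it is really named task_route as in the config above, the routes will not be applied, and every task falls back to the default low queue, which worker.normal consumes. Below is a minimal sketch of the two usual ways to pin a task to a queue, reusing the task paths from the question:

# celeryconfig.py -- note the plural setting name
task_routes = {
    'events.tasks.execute_order': {'queue': 'high'},
    'events.tasks.calculate_cost': {'queue': 'normal'},
    'events.tasks.refresh_rank': {'queue': 'daily'},
}

# Or route explicitly at call time (apply_async accepts a queue option,
# which overrides the default queue; .delay() does not):
tasks.execute_order.apply_async(args=[private_id], queue='high')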
Currently I am using WorkManager 1.0.0-alpha05. I set up a periodic work request using the code below.
On an Oppo Realme (Android version 8.1.0, ColorOS V5.0), when the interval is below 1 hr, the job executes at 1 hr; when the interval is greater than 1 hr, the job executes at the exact time.
Please let me know if any other logs or information are required.
Code for scheduling the periodic job:
PeriodicWorkRequest uploadWork = new PeriodicWorkRequest.
Builder(LocationUpdatesJobService.class ,interval, TimeUnit.MILLISECONDS)
.addTag(Constants.Location.TAG_BACKGROUND_LOCATION_PERIODIC)
.setConstraints(constraints).build();
WorkManager.getInstance().enqueueUniquePeriodicWork(
Constants.Location.TAG_BACKGROUND_LOCATION_PERIODIC,
ExistingPeriodicWorkPolicy.REPLACE, uploadWork);
On all other devices the periodic work request interval is correct; only on the Oppo Realme 1 does the work execute at 1 hr.
Oppo Realme 1: Interval 15 Min
I debugged the job scheduler using the command below:
adb shell dumpsys jobscheduler
JOB #u0a249/18: cc2fc59 com.cygneto.field_sales/androidx.work.impl.background.systemjob.SystemJobService
u0a249 tag=job/com.cygneto.field_sales/androidx.work.impl.background.systemjob.SystemJobService
Source: uid=u0a249 user=0 pkg=com.cygneto.field_sales
JobInfo:
Service: com.cygneto.field_sales/androidx.work.impl.background.systemjob.SystemJobService
PERIODIC: interval=+1h0m0s0ms flex=+21m0s0ms
Requires: charging=false batteryNotLow=false deviceIdle=false
Extras: mParcelledData.dataSize=180
Backoff: policy=1 initial=+30s0ms
Has early constraint
Has late constraint
Required constraints: TIMING_DELAY DEADLINE
Satisfied constraints: APP_NOT_IDLE DEVICE_NOT_DOZING
Unsatisfied constraints: TIMING_DELAY DEADLINE
Doze whitelisted: true
Tracking: TIME
Enqueue time: -9m4s617ms
Run time: earliest=+29m55s383ms, latest=+50m55s383ms
Ready: false (job=false user=true !pending=true !active=true !backingup=true comp=true)
Oppo Realme 1: Interval 1hr 10 Min
Log:
JobInfo:
Service: com.cygneto.field_sales/androidx.work.impl.background.systemjob.SystemJobService
PERIODIC: interval=+1h10m0s0ms flex=+1h10m0s0ms
Requires: charging=false batteryNotLow=false deviceIdle=false
Extras: mParcelledData.dataSize=180
Doze whitelisted: true
Tracking: TIME
Enqueue time: -4m19s846ms
Run time: earliest=+1h5m39s833ms, latest=+2h15m39s833ms
Last successful run: 2018-07-25 17:01:23
Ready: false (job=false user=true !pending=true !active=true !backingup=true comp=true)
Other Device :
Log :
JobInfo:
Service:com.cygneto.field_sales/androidx.work.impl.background.systemjob.SystemJobService
PERIODIC: interval=+15m0s0ms flex=+15m0s0ms
Requires: charging=false batteryNotLow=false deviceIdle=false
Tracking: TIME
Enqueue time: -29s237ms
Run time: earliest=+14m30s690ms, latest=+29m30s690ms
Last successful run: 2018-07-25 17:29:19
Ready: false (job=false user=true !pending=true !active=true !backingup=true comp=true)
I also tried using different libraries. I found the same behavior with JobScheduler and Android-Job: the job period length is 15 min but it executes at 1 hr. When I tried Firebase JobDispatcher, the job executed at the correct 15 min interval.
I debugged JobScheduler and Android-Job using the command below:
adb shell dumpsys jobscheduler
Job Scheduler:
Interval : 15 Min
Output : 1 hr
Log:
JOB #u0a266/1: a0dd846 com.jobscheduler_periodic/com.periodic.JobSchedulerService
u0a266 tag=*job*/com.jobscheduler_periodic/com.periodic.JobSchedulerService
Source: uid=u0a266 user=0 pkg=com.jobscheduler_periodic
JobInfo:
Service: com.jobscheduler_periodic/com.periodic.JobSchedulerService
PERIODIC: interval=+1h0m0s0ms flex=+15m0s0ms
Android-Job:
Interval : 15 Min
Output : 1 hr:
Log:
JOB #u0a266/3: 10c0d65 com.jobscheduler_periodic/com.evernote.android.job.v21.PlatformJobService
u0a266 tag=*job*/com.jobscheduler_periodic/com.evernote.android.job.v21.PlatformJobService
Source: uid=u0a266 user=0 pkg=com.jobscheduler_periodic
JobInfo:
Service: com.jobscheduler_periodic/com.evernote.android.job.v21.PlatformJobService
PERIODIC: interval=+1h0m0s0ms flex=+5m0s0ms
Firebase Job Dispatcher:
I debugged Firebase JobDispatcher using the command below:
adb shell "dumpsys activity service GcmService | grep com.jobscheduler_periodic"
Interval : 15 Min
Output : 15 min
Log:
u0|com.jobscheduler_periodic: 1
(scheduled) com.jobscheduler_periodic/com.firebase.jobdispatcher.GooglePlayReceiver{u=0 tag="MyJobService" trigger=window{s
tart=720s,end=900s,earliest=46s,latest=226s} requirements=[NET_CONNECTED,CHARGING] attributes=[RECURRING] scheduled=-673s last_
run=N/A jid=N/A status=PENDING retries=0 client_lib=FIREBASE_JOB_DISPATCHER-1}
This happens to be an OEM bug. Unfortunately, it is very hard to work around these kinds of bugs in a battery-efficient way. If you want a period of 15 mins, I suggest the following workaround:
Use a OneTimeWorkRequest instead of a periodic work request, and upon execution of the first request, schedule a second one from inside the worker with an initialDelay of 15 mins. That will essentially give you what you want.
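For illustration, here is a minimal sketch of that chaining approach in Java, reusing the LocationUpdatesJobService worker class and tag constant from the question. It is written against the current androidx.work API (Worker constructor, Result.success(), setInitialDelay()), which may differ slightly from 1.0.0-alpha05, so treat it as a sketch rather than drop-in code:

import android.content.Context;
import androidx.annotation.NonNull;
import androidx.work.OneTimeWorkRequest;
import androidx.work.WorkManager;
import androidx.work.Worker;
import androidx.work.WorkerParameters;
import java.util.concurrent.TimeUnit;

public class LocationUpdatesJobService extends Worker {

    public LocationUpdatesJobService(@NonNull Context context, @NonNull WorkerParameters params) {
        super(context, params);
    }

    @NonNull
    @Override
    public Result doWork() {
        // ... the existing location work goes here ...

        // Re-enqueue this worker with a 15 min delay instead of relying on
        // PeriodicWorkRequest, which the OEM clamps to a 1 hour minimum.
        OneTimeWorkRequest next = new OneTimeWorkRequest.Builder(LocationUpdatesJobService.class)
                .setInitialDelay(15, TimeUnit.MINUTES)
                .addTag(Constants.Location.TAG_BACKGROUND_LOCATION_PERIODIC)
                .build();
        WorkManager.getInstance().enqueue(next);

        return Result.success();
    }
}

The first request is enqueued once (for example with enqueueUniqueWork and ExistingWorkPolicy.KEEP), and the worker then keeps re-scheduling itself.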
I have TORQUE installed on Ubuntu 16.04, and I am having trouble because my jobs hang. I have a test script test.pbs:
#PBS -N test
#PBS -l nodes=1:ppn=1
#PBS -l walltime=0:01:00
cd $PBS_O_WORKDIR
touch done.txt
echo "done"
And I run it with
qsub test.pbs
The job writes done.txt and echoes "done" just fine, but the job hangs in the C state.
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
46.localhost test wlandau 00:00:00 C batch
Edit: some diagnostic info on another job from qstat -f 55
qstat -f 55
Job Id: 55.localhost
Job_Name = test
Job_Owner = wlandau@localhost
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
job_state = C
queue = batch
server = haggunenon
Checkpoint = u
ctime = Mon Oct 30 07:35:00 2017
Error_Path = localhost:/home/wlandau/Desktop/test.e55
exec_host = localhost/2
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon Oct 30 07:35:00 2017
Output_Path = localhost:/home/wlandau/Desktop/test.o55
Priority = 0
qtime = Mon Oct 30 07:35:00 2017
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=1
Resource_List.walltime = 00:01:00
session_id = 5115
Variable_List = PBS_O_QUEUE=batch,PBS_O_HOST=localhost,
PBS_O_HOME=/home/wlandau,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=wlandau,
PBS_O_PATH=/home/wlandau/bin:/home/wlandau/.local/bin:/usr/local/sbin
:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/ga
mes:/snap/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=localhost,
PBS_O_WORKDIR=/home/wlandau/Desktop
comment = Job started on Mon Oct 30 at 07:35
etime = Mon Oct 30 07:35:00 2017
exit_status = 0
submit_args = test.pbs
start_time = Mon Oct 30 07:35:00 2017
Walltime.Remaining = 60
start_count = 1
fault_tolerant = False
comp_time = Mon Oct 30 07:35:00 2017
And a similar tracejob -n2 62:
/var/spool/torque/server_priv/accounting/20171029: No matching job records located
/var/spool/torque/server_logs/20171029: No matching job records located
/var/spool/torque/mom_logs/20171029: No matching job records located
/var/spool/torque/sched_logs/20171029: No matching job records located
Job: 62.localhost
10/30/2017 17:20:25 S enqueuing into batch, state 1 hop 1
10/30/2017 17:20:25 S Job Queued at request of wlandau@localhost, owner =
wlandau@localhost, job name = jobe945093c2e029c5de5619d6bf7922071,
queue = batch
10/30/2017 17:20:25 S Job Modified at request of Scheduler@Haggunenon
10/30/2017 17:20:25 S Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb
resources_used.vmem=0kb resources_used.walltime=00:00:00
10/30/2017 17:20:25 L Job Run
10/30/2017 17:20:25 S Job Run at request of Scheduler@Haggunenon
10/30/2017 17:20:25 S Not sending email: User does not want mail of this type.
10/30/2017 17:20:25 S Not sending email: User does not want mail of this type.
10/30/2017 17:20:25 M job was terminated
10/30/2017 17:20:25 M obit sent to server
10/30/2017 17:20:25 A queue=batch
10/30/2017 17:20:25 M scan_for_terminated: job 62.localhost task 1 terminated, sid=17917
10/30/2017 17:20:25 A user=wlandau group=wlandau
jobname=jobe945093c2e029c5de5619d6bf7922071 queue=batch
ctime=1509398425 qtime=1509398425 etime=1509398425 start=1509398425
owner=wlandau@localhost exec_host=localhost/0 Resource_List.ncpus=1
Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1 Resource_List.walltime=01:00:00
10/30/2017 17:20:25 A user=wlandau group=wlandau
jobname=jobe945093c2e029c5de5619d6bf7922071 queue=batch
ctime=1509398425 qtime=1509398425 etime=1509398425 start=1509398425
owner=wlandau@localhost exec_host=localhost/0 Resource_List.ncpus=1
Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1 Resource_List.walltime=01:00:00 session=17917
end=1509398425 Exit_status=0 resources_used.cput=00:00:00
resources_used.mem=0kb resources_used.vmem=0kb
resources_used.walltime=00:00:00
EDIT: jobs now hanging in the E state
After some tinkering, I am now using these settings. I have moved on to a small pipeline workflow in which some TORQUE jobs wait for other TORQUE jobs to finish. Unfortunately, all the jobs hang in the E state, and any jobs beyond the first 4 just stay queued. To keep things from hanging indefinitely, I have to sudo qdel -p each one, which I think is causing real problems with the project's file system, as well as being an inconvenience.
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
113.localhost ...b73ec2cda6dca wlandau 00:00:00 E batch
114.localhost ...b6c8e6da05983 wlandau 00:00:00 E batch
115.localhost ...9123b8e20850b wlandau 00:00:00 E batch
116.localhost ...e6d49a3d7d822 wlandau 00:00:00 E batch
117.localhost ...8c3f6cb68927b wlandau 0 Q batch
118.localhost ...40b1d0cab6400 wlandau 0 Q batch
qmgr -c "list server" shows
Server haggunenon
server_state = Active
scheduling = True
max_running = 300
total_jobs = 5
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:1 Exiting:3
acl_hosts = localhost
managers = root@localhost
operators = root@localhost
default_queue = batch
log_events = 511
mail_from = adm
query_other_jobs = True
resources_assigned.ncpus = 4
resources_assigned.nodect = 4
scheduler_iteration = 600
node_check_rate = 150
tcp_timeout = 6
mom_job_sync = True
pbs_version = 2.4.16
keep_completed = 0
submit_hosts = SERVER
allow_node_submit = True
next_job_number = 119
net_counter = 118 94 93
And qmgr -c "list queue batch"
Queue batch
queue_type = Execution
total_jobs = 5
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:4
max_running = 300
resources_max.ncpus = 4
resources_max.nodes = 2
resources_min.ncpus = 1
resources_default.ncpus = 1
resources_default.nodect = 1
resources_default.nodes = 1
resources_default.walltime = 01:00:00
mtime = Wed Nov 1 07:40:45 2017
resources_assigned.ncpus = 4
resources_assigned.nodect = 4
keep_completed = 0
enabled = True
started = True
The C state means the job has completed and its status is kept in the system. Usually the status is kept after job completion for a period of time specified by the keep_completed parameter. However, certain types of failure may result in the job being kept in this state to provide the information necessary to examine the cause of the failure.
Check the output of qstat -f 46 to see if there is anything indicating an error.
To tune the keep_completed parameter, you can execute the following command to check its current value on your system:
qmgr -c "print queue batch keep_completed"
If you have administrative privileges on the TORQUE server, you can also change this value with
qmgr -c "set queue batch keep_completed=120"
to keep jobs in the completed state for 2 minutes after completion.
In general having keep_completed set is a useful feature. Advanced schedulers use the information on completed jobs to schedule around failures.