The purpose of this exercise was to enable the AWS ECS agent to run in External mode on 32-bit hardware such as the ARMv7 Raspberry Pi. The changes work, and I have decided to publish my technical notes from the process of debugging and adjusting the software.
This article is heavily technical and assumes very good knowledge of AWS and Linux.
I started the process by analyzing the relationship between amazon-ssm-agent and amazon-ecs-agent.
The amazon-ecs-agent must periodically reload rotating AWS credentials. To do so it must use RotatingSharedCredentialsProvider
(https://github.com/aws/amazon-ecs-agent/blob/master/agent/credentials/providers/rotating_shared_credentials_provider.go)
which depends on the file /rotatingcreds/credentials
(https://github.com/aws/amazon-ecs-agent/blob/master/agent/credentials/providers/credentials_filename_linux.go)
The amazon-ssm-agent is able to write this file when told to do so (https://github.com/aws/amazon-ssm-agent/blob/8d191ace385c67d43303d79e23e977aa6da68412/agent/managedInstances/sharedCredentials/shared_Credentials.go)
The key to that functionality is the AWS_SHARED_CREDENTIALS_FILE
env var, which must be set to the default path from the ECS config:
/rotatingcreds/credentials
The easiest way to do that is to run
sudo systemctl edit amazon-ssm-agent
with
[Service]
Environment="AWS_SHARED_CREDENTIALS_FILE=/rotatingcreds/credentials"
in the override file. This makes the amazon-ssm-agent
service start with the env var set to the shared credentials location. Start the service with
sudo systemctl start amazon-ssm-agent
and for verification
sudo systemctl show amazon-ssm-agent | grep Environment
must print
Environment=AWS_SHARED_CREDENTIALS_FILE=/rotatingcreds/credentials
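Once the service is running, it can be useful to confirm that the file it writes is readable and complete from the point of view of the user that will run the ECS agent. The following Go program is a minimal sketch of such a check, not part of either agent. It assumes the standard AWS shared credentials INI layout (profile sections containing aws_access_key_id, aws_secret_access_key and aws_session_token) and assumes a session token is present because rotating credentials are temporary. The helper names parseProfiles and missingKeys are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// requiredKeys lists the keys this sketch expects in every profile.
// Assumption: rotating credentials are temporary and carry a session token.
var requiredKeys = []string{"aws_access_key_id", "aws_secret_access_key", "aws_session_token"}

// parseProfiles reads INI-style text and returns profile -> key -> value.
// It is a hypothetical helper that handles only the simple "key = value" form.
func parseProfiles(text string) map[string]map[string]string {
	profiles := map[string]map[string]string{}
	current := ""
	scanner := bufio.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		switch {
		case line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";"):
			continue // skip blank lines and comments
		case strings.HasPrefix(line, "[") && strings.HasSuffix(line, "]"):
			current = strings.TrimSpace(line[1 : len(line)-1])
			profiles[current] = map[string]string{}
		default:
			key, value, found := strings.Cut(line, "=")
			if found && current != "" {
				profiles[current][strings.TrimSpace(key)] = strings.TrimSpace(value)
			}
		}
	}
	return profiles
}

// missingKeys returns the required keys that are absent or empty in a profile.
func missingKeys(profile map[string]string) []string {
	var missing []string
	for _, key := range requiredKeys {
		if profile[key] == "" {
			missing = append(missing, key)
		}
	}
	return missing
}

func main() {
	path := os.Getenv("AWS_SHARED_CREDENTIALS_FILE")
	if path == "" {
		path = "/rotatingcreds/credentials"
	}
	data, err := os.ReadFile(path)
	if err != nil {
		// A permission error here means the current user cannot read the file.
		fmt.Println("cannot read credentials file:", err)
		return
	}
	for name, keys := range parseProfiles(string(data)) {
		// Only key names are printed, never the secret values.
		fmt.Printf("profile %q: missing keys %v\n", name, missingKeys(keys))
	}
}
```

Running it as the user that will own the ECS agent process shows quickly whether the problem is file access or file content.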
The ECS agent itself is started with its configuration passed in env vars:
ECS_LOGFILE=/mnt/data/ecs/log/ecs-agent.log ECS_LOGLEVEL=debug ECS_DATADIR=/mnt/data/ecs AWS_DEFAULT_REGION=us-east-1 ECS_EXTERNAL=true ECS_CLUSTER=xnet out/amazon-ecs-agent
Unfortunately, both the amazon-ssm-agent and the amazon-ecs-agent are designed to be run as root. The credentials in
/rotatingcreds/credentials
are not really shared: they belong to root, and
no application can access them without root permissions.
The amazon-ssm-agent doesn’t even start without root permissions because it internally performs filesystem hardening and you simply cannot perform chown without having root privileges.
I found out about the filesystem hardening by inspecting the log and running strace on the process. I’m jailing the AWS-related processes in their own system user and group, so I had to develop the following commands to be able to debug the problem.
WARN [OnPremIdentity] error while loading server info%!(EXTRA *errors.errorString=Failed to load instance info from vault. Failed to set permission for vault folder or its content. chown /var/lib/amazon/ssm/Vault: operation not permitted)
The concern about files being accessible is understandable, but it should not be the app’s job to harden its own resources and to require root permissions to perform the hardening.
The project contains OS specific hardening code, so I went in there and adjusted the code so that the ownership changes only happen when the process has root privileges:
This change does not disable permissions hardening, but ownership changes are now performed on a best-effort basis, which prevents a hard error.
ERROR failed to find identity, retrying: failed to find agent identity
fstatat64(AT_FDCWD, "/var/lib/amazon/ssm/Vault/Manifest", 0xd821d8, 0) = -1 EACCES (Permission denied)
The result is that the process cannot access the file.
This is caused by the fact that the hardening code does not distinguish between a file and a directory. In Linux you need the execute permission to enter a directory. The permissions applied to everything, files and directories alike, were 600, which means read and write for the owner only. The fix is code that recognizes a directory and uses permissions 700 for it instead.
Long story short, this is the change to the hardening method.
https://github.com/kixorz/amazon-ssm-agent/commit/4cef8838ff4f75f2325c2da7e0c8761a672a53ca
Hardening is a good practice, but it should not get in the way. The owner of a directory should not be prevented from accessing it.
The ECS agent now starts and registers itself with the AWS ECS cluster.
Let’s create a test container and upload the image to ECR so that ECS can pull it and run it.
Now we can use CloudFormation to create a task definition and run the task on the cluster via the console.
Inspect the task:
CgroupError: Agent could not create task's platform resources
sudo su - aws -s /bin/bash -c "ECS_LOGFILE=/mnt/data/ecs/log/ecs-agent.log ECS_LOGLEVEL=debug ECS_DATADIR=/mnt/data/ecs AWS_DEFAULT_REGION=us-east-1 ECS_EXTERNAL=true ECS_CLUSTER=xnet strace -f -t -e trace=file out/amazon-ecs-agent"
252f3aa8a9d cgroupPath=/ecs/cd1d26daa4d542efb68e9252f3aa8a9d cgroupV2=false err=cgroup create: unable to create controller: v1: mkdir /sys/fs/cgroup/systemd/ecs/cd1d26daa4d542efb68e9252f3aa8a9d: permission denied" task="cd1d26daa4d542efb68e9252f3aa8a9d"
The cgroup permissions needed to be adjusted:
Testing the task definition fails once a TaskRoleArn is added:
aws ecs run-task --region us-east-1 --cluster xnet --launch-type EXTERNAL --task-definition arn:aws:ecs:us-east-1:000000000000:task-definition/xnet-TD2-Mhmo4LbdpJlB:1
The task definition with TaskRoleArn requires the attribute:
com.amazonaws.ecs.capability.task-iam-role
Now we can inspect the container instances again:
aws ecs describe-container-instances --container-instances=ff66f2fa778640b684b4ac8cce5bcc77 --cluster=xnet --region=us-east-1
The provided branch in my ECS agent fork can be built, and the packaged version of the agent works on the officially unsupported 32-bit hardware. It was quite a bit of fun to hack around in AWS Systems Manager and the AWS ECS agent and simply figure out what it would take to make the software work on my hardware.
Here’s the summary of my changes:
Cracking the problem was a nice technical challenge.