ECS Anywhere on 32-bit systems

The goal of this exercise was to run the AWS ECS agent in External mode on 32-bit hardware such as the armv7 Raspberry Pi. The changes work, so I decided to publish my technical notes from debugging and adjusting the programs.

This article is heavily technical and assumes very good knowledge of AWS and Linux.

I started by analyzing the relationship between amazon-ssm-agent and amazon-ecs-agent. The amazon-ecs-agent must periodically reload rotating AWS credentials. To do so it uses RotatingSharedCredentialsProvider (https://github.com/aws/amazon-ecs-agent/blob/master/agent/credentials/providers/rotating_shared_credentials_provider.go), which depends on the file /rotatingcreds/credentials (https://github.com/aws/amazon-ecs-agent/blob/master/agent/credentials/providers/credentials_filename_linux.go).
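To make the idea of reloading a rotating credentials file concrete, here is a minimal sketch in Go. It is not the agent's RotatingSharedCredentialsProvider: the names credentials, parseCredentials and rotatingReader are hypothetical, and reloading on a changed modification time is my own simplification. The only thing taken as given is the usual shared credentials layout, an INI profile with aws_access_key_id, aws_secret_access_key and aws_session_token.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// credentials holds the three values of one shared credentials profile.
type credentials struct {
	AccessKeyID     string
	SecretAccessKey string
	SessionToken    string
}

// parseCredentials extracts the keys of the given profile from INI text.
func parseCredentials(text, profile string) credentials {
	var c credentials
	inProfile := false
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, "[") && strings.HasSuffix(line, "]") {
			inProfile = strings.TrimSpace(line[1:len(line)-1]) == profile
			continue
		}
		// Split on the first "=" only, tokens may contain "=" themselves.
		key, value, ok := strings.Cut(line, "=")
		if !inProfile || !ok {
			continue
		}
		switch strings.TrimSpace(key) {
		case "aws_access_key_id":
			c.AccessKeyID = strings.TrimSpace(value)
		case "aws_secret_access_key":
			c.SecretAccessKey = strings.TrimSpace(value)
		case "aws_session_token":
			c.SessionToken = strings.TrimSpace(value)
		}
	}
	return c
}

// rotatingReader (hypothetical) caches credentials until the file changes.
type rotatingReader struct {
	path    string
	modTime time.Time
	cached  credentials
}

// get re-reads the file only when its modification time has changed.
func (r *rotatingReader) get() (credentials, error) {
	fi, err := os.Stat(r.path)
	if err != nil {
		return credentials{}, err
	}
	if !fi.ModTime().Equal(r.modTime) {
		data, err := os.ReadFile(r.path)
		if err != nil {
			return credentials{}, err
		}
		r.cached = parseCredentials(string(data), "default")
		r.modTime = fi.ModTime()
	}
	return r.cached, nil
}

func main() {
	path := os.Getenv("AWS_SHARED_CREDENTIALS_FILE")
	if path == "" {
		path = "/rotatingcreds/credentials" // default path named in the text
	}
	r := &rotatingReader{path: path}
	c, err := r.get()
	if err != nil {
		fmt.Println("cannot read credentials:", err)
		return
	}
	fmt.Println("loaded access key of length", len(c.AccessKeyID))
}
```

Run as a user who cannot read the file, the sketch reports a permission error, which is the access problem these notes run into further down.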

The amazon-ssm-agent can be told to write this file and it has the ability to do so (https://github.com/aws/amazon-ssm-agent/blob/8d191ace385c67d43303d79e23e977aa6da68412/agent/managedInstances/sharedCredentials/shared_Credentials.go)

The key to that functionality is the AWS_SHARED_CREDENTIALS_FILE environment variable, which must be set to the ECS agent's default path, /rotatingcreds/credentials.

The easiest way to do that is to perform sudo systemctl edit amazon-ssm-agent with

[Service]
Environment="AWS_SHARED_CREDENTIALS_FILE=/rotatingcreds/credentials"

in the configuration file. This makes the amazon-ssm-agent service start with the environment variable pointing at the shared credentials location. Start the service:

sudo systemctl start amazon-ssm-agent

To verify, run:

sudo systemctl show amazon-ssm-agent | grep Environment

which must print:

Environment=AWS_SHARED_CREDENTIALS_FILE=/rotatingcreds/credentials
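The verification step can also be scripted. The following Go sketch is an assumption-laden helper, not part of either agent: hasEnvironment is a hypothetical function that scans the output of systemctl show for the expected entry, and it does not handle quoted values containing spaces.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// hasEnvironment reports whether `systemctl show` output lists name=value
// in its Environment= property. Quoted values with spaces are not handled.
func hasEnvironment(showOutput, name, value string) bool {
	want := name + "=" + value
	for _, line := range strings.Split(showOutput, "\n") {
		if !strings.HasPrefix(line, "Environment=") {
			continue
		}
		rest := strings.TrimPrefix(line, "Environment=")
		// Multiple variables are separated by spaces on a single line.
		for _, entry := range strings.Fields(rest) {
			if entry == want {
				return true
			}
		}
	}
	return false
}

func main() {
	out, err := exec.Command("systemctl", "show", "amazon-ssm-agent", "--property=Environment").Output()
	if err != nil {
		fmt.Println("systemctl failed:", err)
		return
	}
	ok := hasEnvironment(string(out), "AWS_SHARED_CREDENTIALS_FILE", "/rotatingcreds/credentials")
	fmt.Println("override active:", ok)
}
```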

With the credentials file in place, the ECS agent can be started in external mode:

ECS_LOGFILE=/mnt/data/ecs/log/ecs-agent.log ECS_LOGLEVEL=debug ECS_DATADIR=/mnt/data/ecs AWS_DEFAULT_REGION=us-east-1 ECS_EXTERNAL=true ECS_CLUSTER=xnet out/amazon-ecs-agent

Unfortunately, the amazon-ssm-agent and amazon-ecs-agent are designed to be run as root. The credentials in /rotatingcreds/credentials are not really shared: they belong to root, and no application can access them without root permissions.

The amazon-ssm-agent doesn’t even start without root permissions because it internally performs filesystem hardening and you simply cannot perform chown without having root privileges.

I found out about the filesystem hardening by inspecting the log and running strace on the process. I’m jailing the aws-related processes in their own system user and group so I had to develop the following commands to be able to debug the problem.

su - aws -s /bin/bash -c "strace -f -t -e trace=file amazon-ssm-agent"

WARN [OnPremIdentity] error while loading server info%!(EXTRA *errors.errorString=Failed to load instance info from vault. Failed to set permission for vault folder or its content. chown /var/lib/amazon/ssm/Vault: operation not permitted)

The concern about files being accessible is understandable, but it should not be the app’s job to harden its own resources and require root permissions to do so.

The project contains OS specific hardening code, so I went in there and adjusted the code so that the ownership changes only happen when the process has root privileges:

func Harden(path string) (err error) {
    // code setting permissions
    // ...

    // Only root can change ownership, so skip that step for other users.
    if os.Getuid() != 0 {
        return
    }

    // code changing ownership
    // ...
}

This change does not disable permissions hardening. Ownership changes are simply skipped when the process is not running as root, which prevents the hard error.

ERROR failed to find identity, retrying: failed to find agent identity
fstatat64(AT_FDCWD, "/var/lib/amazon/ssm/Vault/Manifest", 0xd821d8, 0) = -1 EACCES (Permission denied)

The result is that the process cannot access the Manifest file inside the Vault directory.

This is caused by the hardening code not distinguishing between files and directories. On Linux you need the execute permission to enter a directory, but the code applied mode 600 (read and write for the owner) to everything, files and directories alike. The fix is to recognize directories and use mode 700 for them instead.

Long story short, this is the change to the hardening method.

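func Harden(path string) (err error) {
	// ...
	// Files get read/write for the owner only.
	var permission os.FileMode = RWFilePermission

	// Directories also need the execute bit so the owner can enter them.
	if fi.IsDir() {
		permission = RWXDirPermission
	}

	// Change the mode only when it differs from the target.
	if fi.Mode()&permissionMask != permission {
		if err = os.Chmod(path, permission); err != nil {
			return
		}
	}
	// ...
}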

https://github.com/kixorz/amazon-ssm-agent/commit/4cef8838ff4f75f2325c2da7e0c8761a672a53ca

Hardening is a good practice, but it should not get in the way. The owner of a directory should not be prevented from accessing it.

ECS Agent now starts and registers itself against the AWS ECS cluster.

Let’s create a test container and upload the image to ECR so that ECS can pull and run it.

FROM alpine
CMD ["/bin/sh", "-c", "echo 'It works!'"]

docker build -t test-container .
docker images

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <account>.dkr.ecr.<region>.amazonaws.com
docker tag <image> <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
docker push <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
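The tag and push commands share the same registry reference, so it can help to build it in one place. This Go sketch only does string formatting; ecrImageURI is a hypothetical helper, and the repository name and tag used in main are assumptions for illustration.

```go
package main

import "fmt"

// ecrImageURI builds <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>.
func ecrImageURI(account, region, repo, tag string) string {
	return fmt.Sprintf("%s.dkr.ecr.%s.amazonaws.com/%s:%s", account, region, repo, tag)
}

func main() {
	// Placeholder account ID as used elsewhere in these notes; repository
	// name and tag are assumptions.
	uri := ecrImageURI("000000000000", "us-east-1", "test-container", "latest")
	fmt.Println("docker tag test-container", uri)
	fmt.Println("docker push", uri)
}
```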

Now we can use CloudFormation to create a task definition and run the task on the cluster via console.

Inspecting the task shows an error: CgroupError: Agent could not create task's platform resources

sudo su - aws -s /bin/bash -c "ECS_LOGFILE=/mnt/data/ecs/log/ecs-agent.log ECS_LOGLEVEL=debug ECS_DATADIR=/mnt/data/ecs AWS_DEFAULT_REGION=us-east-1 ECS_EXTERNAL=true ECS_CLUSTER=xnet strace -f -t -e trace=file out/amazon-ecs-agent"

252f3aa8a9d cgroupPath=/ecs/cd1d26daa4d542efb68e9252f3aa8a9d cgroupV2=false err=cgroup create: unable to create controller: v1: mkdir /sys/fs/cgroup/systemd/ecs/cd1d26daa4d542efb68e9252f3aa8a9d: permission denied" task="cd1d26daa4d542efb68e9252f3aa8a9d"

sudo chown -R aws:aws /sys/fs/cgroup/systemd/ecs

The cgroup permissions needed to be adjusted:

dir=ecs
user=aws
group=aws
for i in blkio cpu cpuacct cpuset devices freezer memory net_cls net_prio perf_event pids systemd unified; do
    mkdir -p /sys/fs/cgroup/${i}/${dir};
    chown -R ${user}:${group} /sys/fs/cgroup/${i}/${dir};
done

Running the task fails once a TaskRoleArn is added to the task definition:

aws ecs run-task --region us-east-1 --cluster xnet --launch-type EXTERNAL --task-definition arn:aws:ecs:us-east-1:000000000000:task-definition/xnet-TD2-Mhmo4LbdpJlB:1

{
  "tasks": [],
  "failures": [{
    "arn": "arn:aws:ecs:us-east-1:000000000000:container-instance/ff66f2fa778640b684b4ac8cce5bcc77",
    "reason": "ATTRIBUTE"
  }]
}
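When scripting around run-task, the failures array is the part worth checking. The Go sketch below decodes the response shown above using only encoding/json; the type and function names are hypothetical, and the wording of the explanation for the ATTRIBUTE reason is my own paraphrase.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// failure mirrors one entry of the "failures" array in the response.
type failure struct {
	Arn    string `json:"arn"`
	Reason string `json:"reason"`
}

type runTaskResponse struct {
	Tasks    []json.RawMessage `json:"tasks"`
	Failures []failure         `json:"failures"`
}

// parseFailures returns the placement failures reported by run-task.
func parseFailures(raw []byte) ([]failure, error) {
	var resp runTaskResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return nil, err
	}
	return resp.Failures, nil
}

// explain turns a failure into a short human-readable hint.
func explain(f failure) string {
	if f.Reason == "ATTRIBUTE" {
		return f.Arn + ": the instance lacks an attribute the task definition requires"
	}
	return f.Arn + ": " + f.Reason
}

// sample is the response from the text above.
const sample = `{
  "tasks": [],
  "failures": [{
    "arn": "arn:aws:ecs:us-east-1:000000000000:container-instance/ff66f2fa778640b684b4ac8cce5bcc77",
    "reason": "ATTRIBUTE"
  }]
}`

func main() {
	failures, err := parseFailures([]byte(sample))
	if err != nil {
		fmt.Println("cannot parse response:", err)
		return
	}
	for _, f := range failures {
		fmt.Println(explain(f))
	}
}
```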

A task definition with a TaskRoleArn requires the container instance to have the attribute:

com.amazonaws.ecs.capability.task-iam-role

Now we can inspect the container instances again:

aws ecs describe-container-instances --container-instances=ff66f2fa778640b684b4ac8cce5bcc77 --cluster=xnet --region=us-east-1

{
  "containerInstances": [
    {
      "containerInstanceArn": "arn:aws:ecs:us-east-1:000000000000:container-instance/xnet/ff66f2fa778640b684b4ac8cce5bcc77",
      "ec2InstanceId": "mi-039e1be1b2212f5aa",
      "version": 2052,
      "versionInfo": {
        "agentVersion": "1.64.0",
        "agentHash": "*2e05d8f1",
        "dockerVersion": "DockerVersion: 20.10.12"
      },
      "remainingResources": [
        {
          "name": "CPU",
          "type": "INTEGER",
          "doubleValue": 0,
          "longValue": 0,
          "integerValue": 8192
        },
        {
          "name": "MEMORY",
          "type": "INTEGER",
          "doubleValue": 0,
          "longValue": 0,
          "integerValue": 1990
        },
        {
          "name": "PORTS",
          "type": "STRINGSET",
          "doubleValue": 0,
          "longValue": 0,
          "integerValue": 0,
          "stringSetValue": [
            "22",
            "2376",
            "2375",
            "51678",
            "51679"
          ]
        },
        {
          "name": "PORTS_UDP",
          "type": "STRINGSET",
          "doubleValue": 0,
          "longValue": 0,
          "integerValue": 0,
          "stringSetValue": []
        }
      ],
      "registeredResources": [
        {
          "name": "CPU",
          "type": "INTEGER",
          "doubleValue": 0,
          "longValue": 0,
          "integerValue": 8192
        },
        {
          "name": "MEMORY",
          "type": "INTEGER",
          "doubleValue": 0,
          "longValue": 0,
          "integerValue": 1990
        },
        {
          "name": "PORTS",
          "type": "STRINGSET",
          "doubleValue": 0,
          "longValue": 0,
          "integerValue": 0,
          "stringSetValue": [
            "22",
            "2376",
            "2375",
            "51678",
            "51679"
          ]
        },
        {
          "name": "PORTS_UDP",
          "type": "STRINGSET",
          "doubleValue": 0,
          "longValue": 0,
          "integerValue": 0,
          "stringSetValue": []
        }
      ],
      "status": "ACTIVE",
      "agentConnected": true,
      "runningTasksCount": 0,
      "pendingTasksCount": 0,
      "attributes": [
        {
          "name": "ecs.capability.external"
        },
        {
          "name": "ecs.capability.secrets.asm.environment-variables"
        },
        {
          "name": "com.amazonaws.ecs.capability.logging-driver.awsfirelens"
        },
        {
          "name": "ecs.capability.secrets.asm.bootstrap.log-driver"
        },
        {
          "name": "ecs.capability.firelens.options.config.s3"
        },
        {
          "name": "com.amazonaws.ecs.capability.logging-driver.none"
        },
        {
          "name": "ecs.capability.ecr-endpoint"
        },
        {
          "name": "com.amazonaws.ecs.capability.logging-driver.json-file"
        },
        {
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.17"
        },
        {
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
        },
        {
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
        },
        {
          "name": "ecs.capability.docker-plugin.local"
        },
        {
          "name": "ecs.capability.task-cpu-mem-limit"
        },
        {
          "name": "ecs.capability.secrets.ssm.bootstrap.log-driver"
        },
        {
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.30"
        },
        {
          "name": "ecs.capability.full-sync"
        },
        {
          "name": "ecs.capability.firelens.fluentd"
        },
        {
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.31"
        },
        {
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.32"
        },
        {
          "name": "ecs.capability.efs"
        },
        {
          "name": "ecs.capability.execution-role-ecr-pull"
        },
        {
          "name": "ecs.capability.firelens.options.config.file"
        },
        {
          "name": "ecs.capability.container-health-check"
        },
        {
          "name": "ecs.os-family",
          "value": "LINUX"
        },
        {
          "name": "ecs.capability.logging-driver.awsfirelens.log-driver-buffer-limit"
        },
        {
          "name": "ecs.capability.increased-task-cpu-limit"
        },
        {
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.24"
        },
        {
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.25"
        },
        {
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.26"
        },
        {
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.27"
        },
        {
          "name": "ecs.capability.container-ordering"
        },
        {
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.28"
        },
        {
          "name": "com.amazonaws.ecs.capability.privileged-container"
        },
        {
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.29"
        },
        {
          "name": "ecs.cpu-architecture",
          "value": "arm"
        },
        {
          "name": "ecs.capability.env-files.s3"
        },
        {
          "name": "ecs.capability.secrets.ssm.environment-variables"
        },
        {
          "name": "com.amazonaws.ecs.capability.ecr-auth"
        },
        {
          "name": "ecs.capability.pid-ipc-namespace-sharing"
        },
        {
          "name": "ecs.capability.firelens.fluentbit"
        },
        {
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.20"
        },
        {
          "name": "ecs.os-type",
          "value": "linux"
        },
        {
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.21"
        },
        {
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.22"
        },
        {
          "name": "ecs.capability.private-registry-authentication.secretsmanager"
        },
        {
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.23"
        }
      ],
      "registeredAt": "2023-02-30T11:56:36.971000-06:00",
      "attachments": [],
      "tags": []
    }
  ],
  "failures": []
}
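The attribute list above does not contain com.amazonaws.ecs.capability.task-iam-role, which matches the ATTRIBUTE failure. Scanning a list this long by eye is error-prone, so here is a Go sketch that checks for it. The type and function names are hypothetical, and the embedded sample is a shortened excerpt of the output above, not the full response.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type attribute struct {
	Name  string `json:"name"`
	Value string `json:"value,omitempty"`
}

type containerInstance struct {
	Arn        string      `json:"containerInstanceArn"`
	Attributes []attribute `json:"attributes"`
}

type describeResponse struct {
	ContainerInstances []containerInstance `json:"containerInstances"`
}

// parseInstances decodes describe-container-instances output.
func parseInstances(raw []byte) ([]containerInstance, error) {
	var resp describeResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return nil, err
	}
	return resp.ContainerInstances, nil
}

// hasAttribute reports whether the instance advertises the named attribute.
func hasAttribute(ci containerInstance, name string) bool {
	for _, a := range ci.Attributes {
		if a.Name == name {
			return true
		}
	}
	return false
}

// sample is a shortened excerpt of the output shown above.
const sample = `{
  "containerInstances": [{
    "containerInstanceArn": "arn:aws:ecs:us-east-1:000000000000:container-instance/xnet/ff66f2fa778640b684b4ac8cce5bcc77",
    "attributes": [
      {"name": "ecs.capability.external"},
      {"name": "ecs.cpu-architecture", "value": "arm"},
      {"name": "com.amazonaws.ecs.capability.ecr-auth"}
    ]
  }],
  "failures": []
}`

func main() {
	instances, err := parseInstances([]byte(sample))
	if err != nil {
		fmt.Println("cannot parse response:", err)
		return
	}
	const required = "com.amazonaws.ecs.capability.task-iam-role"
	for _, ci := range instances {
		fmt.Printf("%s has %s: %v\n", ci.Arn, required, hasAttribute(ci, required))
	}
}
```

To check a live instance, the sample constant could be replaced by the real command output read from standard input.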

The provided branch in my ECS agent fork can be built, and the packaged version of the agent works on the officially unsupported 32-bit hardware. It was quite a bit of fun to hack around in AWS Systems Manager and the AWS ECS Agent and figure out what it would take to make the software work on my hardware.

Here’s the summary of my changes:

SSM agent repository fork here.
ECS agent repository fork here.

Cracking this problem was a nice technical challenge.