AWS Tips 07/2024

Troubleshooting AWS SQS with unknown subscribers.

Why is this a topic?

If you’re using AWS SQS in your systems, you may have multiple subscribers polling the same queue. This is fine when you know who the subscribers are: when you submit a message, one of them picks it up and starts processing it.

Imagine a different situation where you don’t know who all the subscribers are. This can happen for a number of reasons: an abandoned system may still be polling your queue programmatically, or another resource, such as an AWS Lambda function, may be subscribed to it. Now you have a new problem: you enqueue a message and expect your poller to pick it up, but the pickup never happens. The unknown subscriber received the message first, deleted it from the queue and potentially discarded it.

And now what? You have a system losing important messages that are not getting processed correctly.

Low-level AWS troubleshooting

SQS itself cannot tell you who polled a message from a queue, because the service keeps no state about the client that receives a message. Put simply, the client calls an SQS API endpoint via HTTP and the response either contains messages or is empty. This means there is no way to see who is actively polling the queue in the service dashboard.

Fortunately, AWS provides insight into the API calls made to its services. This functionality is exposed as a service called CloudTrail. The typical use case for CloudTrail is security: security teams use it to capture low-level calls in the cloud environment for analysis, reporting and compliance.

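In our case CloudTrail is very useful for troubleshooting, because we can set up a new trail just for the problematic SQS resources. ReceiveMessage is logged as a data event, which trails do not record by default, so the trail needs data event logging enabled for those queues. The trail then captures the service API calls and lets us find out who the caller was, which identifies the queue poller or subscriber.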
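As a rough illustration of scoping a trail to one queue, here is a minimal Python sketch that builds CloudTrail advanced event selectors for SQS data events and applies them with boto3. The function names, the selector name and the idea of targeting a single queue ARN are assumptions made for this example. It also assumes the trail already exists and that boto3 and credentials are available. Note that put_event_selectors replaces the selectors already configured on the trail.

```python
from typing import Any, Dict, List


def build_sqs_selector(queue_arn: str) -> List[Dict[str, Any]]:
    """Build advanced event selectors that log data events for one SQS queue."""
    return [
        {
            # The name is a free-form label chosen for this example.
            "Name": "SQS data events for the problematic queue",
            "FieldSelectors": [
                {"Field": "eventCategory", "Equals": ["Data"]},
                {"Field": "resources.type", "Equals": ["AWS::SQS::Queue"]},
                {"Field": "resources.ARN", "Equals": [queue_arn]},
            ],
        }
    ]


def enable_sqs_data_events(trail_name: str, queue_arn: str) -> None:
    """Apply the selectors to an existing trail (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    cloudtrail = boto3.client("cloudtrail")
    # This call overwrites any event selectors already set on the trail.
    cloudtrail.put_event_selectors(
        TrailName=trail_name,
        AdvancedEventSelectors=build_sqs_selector(queue_arn),
    )
```

Keeping the selector limited to a single queue ARN keeps the volume of logged data events, and therefore the cost, small while you troubleshoot.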

Once the trail for the SQS resources is up and capturing events, we can query the trail log to find out who is making calls. It is typically easiest to query the logs for the impacted resource ARN or the eventName. The results show the details of each caller, so you can quickly figure out the next step to fix the behavior.
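The query described above can be sketched in a few lines of Python, assuming the trail delivers its usual gzipped JSON log files (each with a top-level "Records" list) and that you have downloaded them locally. The helper names here are hypothetical and exist only for this example. The sketch filters events by eventName and resource ARN and counts how often each caller ARN appears.

```python
import gzip
import json
from collections import Counter
from typing import Any, Dict, Iterable, Iterator


def read_records(path: str) -> Iterator[Dict[str, Any]]:
    """Yield events from one gzipped CloudTrail log file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from json.load(handle).get("Records", [])


def matches(event: Dict[str, Any], queue_arn: str, event_name: str) -> bool:
    """Return True if the event has the given name and touches the queue."""
    if event.get("eventName") != event_name:
        return False
    resources = event.get("resources") or []
    return any(resource.get("ARN") == queue_arn for resource in resources)


def count_callers(
    events: Iterable[Dict[str, Any]],
    queue_arn: str,
    event_name: str = "ReceiveMessage",
) -> Counter:
    """Count matching events per caller ARN."""
    callers: Counter = Counter()
    for event in events:
        if matches(event, queue_arn, event_name):
            identity = event.get("userIdentity") or {}
            callers[identity.get("arn", "unknown")] += 1
    return callers
```

A typical use would be to pass `read_records("some-log-file.json.gz")` to `count_callers` together with the queue ARN. Any caller ARN you do not recognise is a candidate for the unknown subscriber.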

If you’re curious, here’s an example of an SQS ReceiveMessage CloudTrail event:

{
  "eventVersion": "1.09",
  "userIdentity": {
    "type": "AssumedRole",
    "principalId": "EXAMPLE_PRINCIPAL_ID",
    "arn": "arn:aws:sts::123456789012:assumed-role/RoleToBeAssumed/SessionName",
    "accountId": "123456789012",
    "accessKeyId": "ACCESS_KEY_ID",
    "sessionContext": {
      "sessionIssuer": {
        "type": "Role",
        "principalId": "AKIAI44QH8DHBEXAMPLE",
        "arn": "arn:aws:sts::123456789012:assumed-role/RoleToBeAssumed",
        "accountId": "123456789012",
        "userName": "RoleToBeAssumed"
      },
      "attributes": {
        "creationDate": "2023-11-07T22:13:06Z",
        "mfaAuthenticated": "false"
      }
    }
  },
  "eventTime": "2023-11-07T23:59:24Z",
  "eventSource": "sqs.amazonaws.com",
  "eventName": "ReceiveMessage",
  "awsRegion": "ap-southeast-4",
  "sourceIPAddress": "10.0.118.80",
  "userAgent": "aws-cli/1.29.16 md/Botocore#1.31.16 ua/2.0 os/linux#5.4.250-173.369.amzn2int.x86_64 md/arch#x86_64 lang/python#3.8.17 md/pyimpl#CPython cfg/retry-mode#legacy botocore/1.31.16",
  "requestParameters": {
    "queueUrl": "https://sqs.ap-southeast-4.amazonaws.com/123456789012/MyQueue",
    "maxNumberOfMessages": 10
  },
  "responseElements": null,
  "requestID": "8b4d4643-8f49-52cd-a6e8-1b875ed54b99",
  "eventID": "f3f23ab7-b0a4-4b71-afc0-141209c49206",
  "readOnly": true,
  "resources": [
    {
      "accountId": "123456789012",
      "type": "AWS::SQS::Queue",
      "ARN": "arn:aws:sqs:ap-southeast-4:123456789012:MyQueue"
    }
  ],
  "eventType": "AwsApiCall",
  "managementEvent": false,
  "recipientAccountId": "123456789012",
  "eventCategory": "Data",
  "tlsDetails": {
    "tlsVersion": "TLSv1.2",
    "cipherSuite": "ECDHE-RSA-AES128-GCM-SHA256",
    "clientProvidedHostHeader": "sqs.ap-southeast-4.amazonaws.com"
  }
}
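When scanning an event like the one above, only a handful of fields are usually needed to identify the poller. The following Python sketch pulls them into a small summary. The function name and the choice of fields are assumptions made for illustration, and the code expects the event to be already parsed into a dictionary.

```python
from typing import Any, Dict


def describe_caller(event: Dict[str, Any]) -> Dict[str, Any]:
    """Summarise who made the call recorded in one CloudTrail event."""
    identity = event.get("userIdentity") or {}
    issuer = (identity.get("sessionContext") or {}).get("sessionIssuer") or {}
    params = event.get("requestParameters") or {}
    return {
        "caller_arn": identity.get("arn"),        # session or user that called
        "identity_type": identity.get("type"),    # e.g. AssumedRole
        "role_name": issuer.get("userName"),      # role behind the session
        "source_ip": event.get("sourceIPAddress"),
        "user_agent": event.get("userAgent"),     # hints at the SDK or tool
        "queue_url": params.get("queueUrl"),
        "event_time": event.get("eventTime"),
    }
```

For the example event, the role name (RoleToBeAssumed), the source IP address and the aws-cli user agent together point at the machine and tooling behind the poller.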

You can find more details on this topic in the official documentation: Logging Amazon SQS API calls using AWS CloudTrail

Conclusion

In this tip, we’ve addressed the issue of finding the unknown subscribers in AWS SQS and discussed how AWS CloudTrail can be effectively used for troubleshooting the problem.

Remember, the same process can be used to troubleshoot other AWS services and resources. CloudTrail lets you capture and query the API calls made to AWS: management events by default, and data events, such as SQS ReceiveMessage, for the services that support them once you enable them on a trail.