What does a Cloud Engineer actually do?

06/09/2025

I often get asked by newcomers to IT what cloud engineering actually is. People tend to have heard of AWS and Azure, and many even know in a very general sense that moving your infrastructure to this far-off remote place can save your company thousands on costs. It is also quite well known that large FAANG companies like Netflix use cloud providers every day to deliver their services to millions of users. But when you try to get down to the more concrete aspects of the job it can get complex quickly. If you're not in the field I think it's normal to be a little hazy on what tasks need to be carried out on a daily basis. After being asked this question so many times I thought it would be worth giving a few examples of what a Cloud Engineer actually does.

Let's jump right in. On any given day a cloud engineer can work on:

Broken Infrastructure-as-Code (IaC) Deployments: IaC is the name of the game in cloud. You're not going to set up an entire infrastructure project using a Graphical User Interface. You're going to work in a terminal or code editor and you'll want reusable, compliant, version-controlled, secure setups. Often your CloudFormation templates or Terraform configurations fail at the moment they're applied. When this happens you need to jump in and troubleshoot them, checking for missing IAM permissions, inconsistent variables or even just typos in your Terraform config files.
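To give a feel for the "inconsistent variables" case, here is a minimal Python sketch that scans a Terraform snippet for variables that are referenced but never declared. The function name and the sample configuration are my own, made up for illustration, and the regex approach is deliberately naive. In practice you would lean on terraform validate, which does this properly.

```python
import re


def undeclared_variables(terraform_source: str) -> set:
    """Return names used as var.<name> that have no matching variable block."""
    # Names declared with: variable "name" { ... }
    declared = set(re.findall(r'variable\s+"([A-Za-z_][\w-]*)"', terraform_source))
    # Names referenced with: var.name
    referenced = set(re.findall(r'\bvar\.([A-Za-z_][\w-]*)', terraform_source))
    return referenced - declared


# Hypothetical configuration: ami_id is referenced but never declared.
SAMPLE = '''
variable "instance_type" {
  default = "t3.micro"
}

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = var.instance_type
}
'''

print(undeclared_variables(SAMPLE))  # {'ami_id'}
```

A check like this is only a quick first pass. It ignores variables declared in other files of the same module, which a real Terraform run would pick up.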

Latency issues: An app may be slow to load, which will cause a bad user experience and impact the company negatively. If this happens you will need to jump into the back end and check multiple layers in the setup. Could it be the network? Is the routing done correctly? Is it a hybrid-cloud setup? If so, has the peering been done correctly? Is there a misconfigured firewall? Once this is all ruled out, you move to the storage layer. Is an EBS volume or RDS instance hitting throughput caps? You may check the external dependencies. Is there an issue with the API endpoints? If the problem isn't here you may need to turn to observability, like checking the CloudWatch metrics. What do RDS query times and EC2 CPU and memory usage look like? The checks can get pretty comprehensive.

Misconfigured IAM Roles or Policies: In your setup you will likely have many moving, interlocking parts. These different parts will need to access specific services, such as a Lambda function or EC2 instance needing to access an S3 bucket. If the "who" (the identity) hasn't been given proper access to the "what" (the resource), data can't flow properly and the infra can start failing. Engineers will regularly review and update IAM roles or troubleshoot issues using IAM Access Analyzer.
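The "who" and "what" idea can be sketched in a few lines of Python. This is a heavily simplified model of how a single identity policy is evaluated, assuming only Effect, Action and Resource are present. The bucket name and the is_allowed helper are hypothetical, and real IAM evaluation also involves conditions, principals, resource policies, permission boundaries and case-insensitive action matching, none of which are handled here.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return value if isinstance(value, list) else [value]


def is_allowed(policy: dict, action: str, resource: str) -> bool:
    """Simplified check: an explicit Deny wins, otherwise an Allow is required."""
    allowed = False
    for statement in _as_list(policy["Statement"]):
        actions = _as_list(statement.get("Action", []))
        resources = _as_list(statement.get("Resource", []))
        matches = any(fnmatchcase(action, a) for a in actions) and any(
            fnmatchcase(resource, r) for r in resources
        )
        if not matches:
            continue
        if statement["Effect"] == "Deny":
            return False  # an explicit deny always overrides an allow
        allowed = True
    return allowed  # stays False when nothing matched (implicit deny)


# Hypothetical policy attached to a Lambda execution role.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-bucket/*",
        }
    ],
}

print(is_allowed(POLICY, "s3:GetObject", "arn:aws:s3:::example-bucket/report.csv"))  # True
print(is_allowed(POLICY, "s3:PutObject", "arn:aws:s3:::example-bucket/report.csv"))  # False
```

The second call is the typical failure you end up debugging: the role can read from the bucket, but nobody granted it permission to write.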

Auto Scaling Groups not scaling: Another important part of the cloud is automatically scaling up and down or in and out (yes, there are different ways of scaling!) as needed. If instances aren’t launching or terminating as expected, an engineer will need to check the scaling policies. Perhaps you will find broken launch templates, or perhaps some health checks are failing. You may even see there's an incorrect reference to an AMI.

DNS / Domain routing issues: Many modern websites are hosted entirely in the cloud, with all resources (HTML, CSS, images, videos, APIs) tied to domain names. DNS plays a critical role in making these resources available. Common issues include subdomains not resolving or pointing to the wrong target (like a misconfigured load balancer), sites working in one region but not another due to routing policies, or DNS changes not appearing right away because of long propagation times. Cloud engineers troubleshoot these problems by reviewing DNS records, checking TTL settings, and validating that domains are correctly mapped to cloud services.

Identifying unused resources: Probably one of the main reasons your company moved to the cloud is to save on costs, right? Therefore you want to only use the services and resources that you need. Often a resource like an EC2 instance may remain up and running despite not being utilized. The same goes for unattached EBS volumes, stale snapshots or old Lambda versions. Identifying these leftovers and stopping or deleting them can save on costs in a meaningful way.
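As a rough illustration, the Python sketch below filters a list of EBS volumes down to the unattached ones. The dictionaries mimic the shape of the Volumes list returned by boto3's describe_volumes call, but the volume IDs and sizes are invented, and no AWS call is made here.

```python
def unattached_volumes(volumes: list) -> list:
    """Keep volumes that are in the 'available' state with no attachments."""
    return [
        v for v in volumes
        if v.get("State") == "available" and not v.get("Attachments")
    ]


def total_size_gib(volumes: list) -> int:
    """Sum the provisioned size (GiB) across the given volumes."""
    return sum(v.get("Size", 0) for v in volumes)


# Hypothetical data shaped like a describe_volumes response.
VOLUMES = [
    {"VolumeId": "vol-0aaa", "State": "in-use", "Size": 100,
     "Attachments": [{"InstanceId": "i-0123"}]},
    {"VolumeId": "vol-0bbb", "State": "available", "Size": 50, "Attachments": []},
    {"VolumeId": "vol-0ccc", "State": "available", "Size": 200, "Attachments": []},
]

orphans = unattached_volumes(VOLUMES)
print([v["VolumeId"] for v in orphans])  # ['vol-0bbb', 'vol-0ccc']
print(total_size_gib(orphans))           # 250
```

Before deleting anything a report like this flags, it is worth confirming with the owning team that the volume really is abandoned, and taking a snapshot if in doubt.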

CloudWatch alarm triggered: Observability is a very big deal in the cloud. Given that everything is digital it is easier to keep an eye on. Dashboards will tell you most of what you need to know about your usage. If you get notifications that something breached a threshold (CPU > 90%, 5xx errors spiking, disk usage critical) it is your job as the engineer to investigate and take the appropriate action. You may scale a service up or down, restart it, delete it altogether or decide to replace it with a more efficient setup.
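To show what "breached a threshold" means in practice, here is a small Python sketch of the "M out of N datapoints" idea that CloudWatch alarms use. The function and the CPU samples are hypothetical, and the real service has more nuance (missing data handling, other comparison operators), so treat this as an approximation of the logic.

```python
def alarm_breached(datapoints, threshold, datapoints_to_alarm, evaluation_periods):
    """Return True when at least M of the last N datapoints exceed the threshold."""
    recent = datapoints[-evaluation_periods:]          # the last N periods
    breaching = sum(1 for value in recent if value > threshold)
    return breaching >= datapoints_to_alarm            # M out of N


# Hypothetical CPU utilisation samples, one per period, oldest first.
cpu = [45, 92, 95, 88, 97]

print(alarm_breached(cpu, threshold=90, datapoints_to_alarm=3, evaluation_periods=5))  # True
print(alarm_breached(cpu, threshold=90, datapoints_to_alarm=4, evaluation_periods=5))  # False
```

Requiring several breaching datapoints rather than one is a common way to avoid being paged for a single short spike.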

Deployment fails in CI/CD Pipeline: This is a big one that happens all the time. CI/CD pipelines are essential for automating builds and deployments. In fact, companies are automating as much as possible. However, there is no guarantee that the execution will go smoothly. Deployments to ECS services, Lambda functions or other cloud resources can fail. CI/CD tools like GitHub Actions or Jenkins can throw errors. Your job is to identify where the blockage is coming from. Did a smoke test not pass? Is there a misconfiguration somewhere in the pipeline?

S3 Bucket or CDN not updating: When running a website with users in different parts of the world, a CDN is often used to deliver content quickly. Caching plays a huge role in giving users a fast and reliable experience. However, one common issue is that changes to a static site may not immediately appear in production. This can happen if the CDN is still serving cached content, or if the S3 bucket configuration (such as cache-control headers or bucket policies) is not set up correctly. As a cloud engineer, troubleshooting often involves checking cache invalidation, reviewing headers, and making sure the bucket policies allow the correct access.
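Reviewing headers often comes down to one question: how long is the CDN allowed to keep this object? The Python sketch below parses a Cache-Control header and returns the lifetime a shared cache would use, preferring s-maxage over max-age as the HTTP caching rules describe. The helper names are mine, and the sketch ignores directives such as no-store as well as any minimum or maximum TTL configured on the CDN itself.

```python
from typing import Optional


def parse_cache_control(header: str) -> dict:
    """Split a Cache-Control header into a {directive: value-or-None} mapping."""
    directives = {}
    for part in header.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value.strip('"') or None
    return directives


def shared_cache_ttl(header: str) -> Optional[int]:
    """Seconds a shared cache (such as a CDN) may reuse the response, if stated."""
    directives = parse_cache_control(header)
    for key in ("s-maxage", "max-age"):  # s-maxage takes priority for shared caches
        value = directives.get(key)
        if value is not None and value.isdigit():
            return int(value)
    return None


print(shared_cache_ttl("public, max-age=86400"))      # 86400
print(shared_cache_ttl("max-age=60, s-maxage=3600"))  # 3600
print(shared_cache_ttl("no-store"))                   # None
```

If the first header is set on your index.html, a change can take up to a day to show up unless you invalidate the cached copy, which is usually the explanation for the "my deploy didn't work" ticket.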

VPC networking problems: Network issues are surprisingly common in cloud infrastructure. There are so many moving parts, layers and services that basic connections can get overlooked. Cloud Engineers often need to check resources in private subnets that can't reach the internet. You may check to see if there is a missing NAT gateway, some incorrect route tables or no internet gateway at all. The same goes for a multi-VPC environment. How is data traversing the VPCs? Are you using a transit gateway or VPC peering? How have they been set up?
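A quick way to picture the "private subnet can't reach the internet" check is to look for a default route. The Python sketch below inspects a route table dictionary shaped like an entry from boto3's describe_route_tables response and reports where 0.0.0.0/0 is sent, if anywhere. The IDs are invented and the function name is hypothetical. It only looks at IPv4 and does not verify that the NAT or internet gateway itself is healthy.

```python
from typing import Optional


def default_route_target(route_table: dict) -> Optional[str]:
    """Return the target of the active 0.0.0.0/0 route, or None if there is none."""
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("State", "active") != "active":
            continue  # e.g. a 'blackhole' route whose target was deleted
        for key in ("NatGatewayId", "GatewayId", "TransitGatewayId"):
            if key in route:
                return route[key]
    return None


# Hypothetical route tables: one with only the local route, one with a NAT route.
BROKEN = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
]}
FIXED = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0123456789abcdef0",
     "State": "active"},
]}

print(default_route_target(BROKEN))  # None
print(default_route_target(FIXED))   # nat-0123456789abcdef0
```

A None result for a private subnet's route table is a strong hint that outbound traffic has nowhere to go.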

I hope these examples give you a better idea of the tasks you can be asked to do in this type of role. Of course, it goes without saying that not all cloud engineering roles are the same. It's a little difficult to pinpoint any one task that every engineer does in the cloud because every company is different, with different needs, different clients, working in different sectors and usually with a whole set of different technologies... and even working in different clouds! "Cloud Engineer" is therefore a broad term, and you will pick up many skills along the way in this position.

I'll be writing more articles in the future. Watch this space if you enjoyed this content!