Cloud cost optimization articles often focus on reserved instances, storage classes, or rightsizing infrastructure.
This story is about something completely different.
A customer experienced a sudden and significant increase in AWS costs. There were no deployments, no infrastructure changes, and no noticeable increase in user traffic.
At first glance, nothing appeared wrong.
Yet costs continued to rise.
What followed became one of the most interesting investigations I’ve worked on, spanning CloudFront, S3, crawler behavior, logging, and media protection.
The First Alert
The first indication of a problem came from billing reports.
AWS costs had increased dramatically over a short period of time.
The environment itself was relatively straightforward:
- Website delivered through Amazon CloudFront
- Media content served through a dedicated CloudFront distribution
- Videos and images stored in Amazon S3
- Historical media archived using lower-cost storage classes
There were no infrastructure modifications.
No application releases.
No increase in legitimate user traffic.
Something else was consuming bandwidth.
The Investigation Begins
As with most cloud investigations, the first step was data collection.
Website logs were reviewed.
Traffic sources were analyzed.
User agents were examined.
Referrers were inspected.
Known crawlers and SEO tools were evaluated.
Thousands of requests were analyzed.
Nothing explained the increase.
The website traffic simply did not justify the bandwidth consumption being reported by AWS.
At this point, a concerning realization emerged:
The activity was not coming through the website.
It was hitting the media CDN directly.
The Problem Nobody Noticed
The media distribution had a critical observability gap.
At the time:
- CloudFront access logging was not enabled
- S3 access logging was not enabled
- CloudTrail data events were not configured
We could see the website.
We could not see the media platform.
Essentially, we were trying to solve a network mystery while blindfolded.
The first incident remained unresolved because there simply wasn’t enough telemetry available to identify the source.
Preparing for the Next Incident
After the initial investigation, additional visibility was introduced.
CloudFront access logging was enabled.
Monitoring was improved.
Billing alerts were tightened.
The goal was simple:
If this happened again, we wanted evidence.
Fortunately, or unfortunately, it did.
The Same Problem Returns
Not long afterward, the same pattern reappeared.
Bandwidth consumption surged again.
This time, however, the logs were available.
Instead of spending days making assumptions, we could follow the data.
Within minutes, the answer became obvious.
A single automated crawler was responsible for the overwhelming majority of media downloads.
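The kind of log analysis that surfaces a dominant client can be sketched in a few lines of Python. This is an illustration, not the tooling used in the investigation. It assumes CloudFront standard log lines (tab-separated, with a `#Fields:` header naming the columns); the sample rows, IP addresses, and the `ExampleMediaBot` user agent are hypothetical.

```python
from collections import defaultdict
from urllib.parse import unquote


def bytes_by_user_agent(lines):
    """Sum sc-bytes, count requests, and collect client IPs per user agent."""
    fields = []
    totals = defaultdict(lambda: {"bytes": 0, "requests": 0, "ips": set()})
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            # The header names the columns, so no positions are hard-coded.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#"):
            continue
        row = dict(zip(fields, line.split("\t")))
        agent = unquote(row["cs(User-Agent)"])  # stored percent-encoded in the log
        entry = totals[agent]
        entry["bytes"] += int(row["sc-bytes"])
        entry["requests"] += 1
        entry["ips"].add(row["c-ip"])
    # Largest consumers of bandwidth first.
    return sorted(totals.items(), key=lambda item: item[1]["bytes"], reverse=True)


# Hypothetical sample, trimmed to the columns this sketch reads.
sample = [
    "#Version: 1.0",
    "#Fields: date time sc-bytes c-ip cs-uri-stem cs(User-Agent)",
    "2024-01-01\t00:00:01\t734003200\t198.51.100.7\t/video-a.mp4\tExampleMediaBot/1.0",
    "2024-01-01\t00:00:01\t524288000\t198.51.100.8\t/video-b.mp4\tExampleMediaBot/1.0",
    "2024-01-01\t00:00:02\t1048576\t203.0.113.5\t/image.jpg\tMozilla%2F5.0",
]

ranked = bytes_by_user_agent(sample)
for agent, stats in ranked:
    print(agent, stats["bytes"], stats["requests"], len(stats["ips"]))
```

Grouping by user agent and counting distinct IPs per group is what makes a crawler spread across many addresses show up as a single line at the top of the list.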
Not All Bots Are Created Equal
When most engineers think about web crawlers, they imagine search engines reading HTML pages and indexing content.
This crawler behaved differently.
Its objective was media analysis.
Its workflow looked something like this:
Visit Website
↓
Extract Media URLs
↓
Download Full Video Files
↓
Analyze Content
↓
Build Media Index
The crawler wasn’t reading metadata.
It was downloading the actual media files.
And it wasn’t downloading one file at a time.
It was downloading many files simultaneously from multiple IP addresses.
What looked like normal crawler activity at the website layer translated into massive bandwidth consumption at the media layer.
Why the Costs Escalated So Quickly
Media files are fundamentally different from HTML pages.
A crawler requesting a web page may consume a few kilobytes.
A crawler requesting a video may consume hundreds of megabytes or even gigabytes.
Now multiply that by:
- Thousands of media objects
- Multiple crawler instances
- Parallel downloads
- Repeated indexing activity
The result is substantial bandwidth usage in a very short period of time.
In this case, a single crawler generated the overwhelming majority of media traffic.
Actual users represented only a small fraction of total transfer.
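To get a feel for how quickly the multiplication above adds up, here is a small Python calculation. Every number in it (object count, average size, passes, crawler instances, per-download throughput) is a hypothetical placeholder, not a figure from this incident.

```python
def crawl_transfer_mb(object_count, avg_object_mb, passes):
    """Data transferred if every object is downloaded in full on each pass."""
    return object_count * avg_object_mb * passes


def hours_to_transfer(total_mb, instances, parallel_per_instance, mb_per_second_each):
    """Wall-clock hours when all downloads run concurrently at a steady rate."""
    streams = instances * parallel_per_instance
    return total_mb / (streams * mb_per_second_each) / 3600


# Hypothetical inputs: 5,000 objects averaging 300 MB, indexed 3 times,
# by 4 crawler instances running 8 parallel downloads at 10 MB/s each.
total_mb = crawl_transfer_mb(5000, 300, 3)
hours = hours_to_transfer(total_mb, 4, 8, 10)

print(round(total_mb / 1024))  # 4395 (GB)
print(round(hours, 1))         # 3.9
```

Under these assumed inputs, several terabytes move in an afternoon, which is why the increase can show up in billing before anyone notices it in website analytics.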
The Hidden Cost Multiplier
The situation became even more interesting because some historical content was stored in archival storage classes optimized for cost efficiency.
Every request for that content that CloudFront had to fetch from S3 generated retrieval charges on top of the standard bandwidth costs.
What initially appeared to be a CloudFront issue was also creating downstream storage retrieval costs.
One request was producing charges across multiple AWS services simultaneously.
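The way one download turns into two line items can be sketched as a simple cost model in Python. It assumes the archived objects sit in a storage class that bills per gigabyte retrieved and that retrieval is only charged when CloudFront misses its cache. The rates below are placeholders chosen for easy arithmetic, not AWS prices.

```python
def incident_cost(transfer_gb, archived_share, cache_miss_ratio,
                  cdn_rate_per_gb, retrieval_rate_per_gb):
    """Split an estimated bill into CDN transfer and storage retrieval."""
    cdn = transfer_gb * cdn_rate_per_gb
    # Only archived objects that miss the CDN cache reach S3 and incur retrieval.
    retrieved_gb = transfer_gb * archived_share * cache_miss_ratio
    retrieval = retrieved_gb * retrieval_rate_per_gb
    return {"cdn_transfer": cdn, "storage_retrieval": retrieval, "total": cdn + retrieval}


# Hypothetical inputs: 4,000 GB served, half of it archived content,
# none of it cached, with placeholder per-GB rates.
estimate = incident_cost(
    transfer_gb=4000,
    archived_share=0.5,
    cache_miss_ratio=1.0,
    cdn_rate_per_gb=0.08,
    retrieval_rate_per_gb=0.01,
)
print(estimate)
```

A crawler walking through rarely viewed historical files is close to the worst case for this model, since those objects are the least likely to be cached already.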
Evaluating Possible Solutions
Several mitigation strategies were considered.
robots.txt
The first idea was updating robots.txt rules.
This helps communicate crawler preferences.
However, robots.txt is not a security control.
It is merely a request.
Compliant crawlers may honor it.
Others may not.
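The check below shows what a compliant crawler does with such a rule, using Python's standard `urllib.robotparser`. The rules and the `ExampleMediaBot` name are hypothetical. Note that robots.txt is read per hostname, so a separate media hostname would need to serve its own file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one named crawler is asked to stay out of /media/.
ROBOTS_TXT = """\
User-agent: ExampleMediaBot
Disallow: /media/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

media_url = "https://cdn.example.com/media/video.mp4"
print(parser.can_fetch("ExampleMediaBot", media_url))  # False
print(parser.can_fetch("SomeBrowser", media_url))      # True
```

The important detail is that this check runs inside the crawler. Nothing on the server side enforces the answer.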
AWS WAF
AWS WAF was evaluated next.
Advantages:
- Immediate protection
- Fast deployment
- Effective against known traffic patterns
Disadvantages:
- Requires ongoing maintenance
- Depends on identifying crawler characteristics
- Does not inherently protect media URLs
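For illustration, a user-agent block rule might look like the Python dictionary below. It follows my understanding of the AWS WAFv2 rule shape as accepted by the API (for CloudFront, web ACLs use the `CLOUDFRONT` scope). The rule name and the `examplemediabot` fragment are hypothetical, and the sketch only builds the structure without calling AWS.

```python
def block_user_agent_rule(name, user_agent_fragment, priority):
    """Build a WAFv2-style rule that blocks requests whose User-Agent contains a fragment."""
    return {
        "Name": name,
        "Priority": priority,
        "Statement": {
            "ByteMatchStatement": {
                # Compared after the LOWERCASE transformation, so store it lowercased.
                "SearchString": user_agent_fragment.lower().encode("utf-8"),
                "FieldToMatch": {"SingleHeader": {"Name": "user-agent"}},
                "TextTransformations": [{"Priority": 0, "Type": "LOWERCASE"}],
                "PositionalConstraint": "CONTAINS",
            }
        },
        "Action": {"Block": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": name,
        },
    }


rule = block_user_agent_rule("BlockExampleMediaBot", "ExampleMediaBot", priority=0)
print(rule["Statement"]["ByteMatchStatement"]["SearchString"])
```

A rule like this illustrates the maintenance burden listed above: it works only as long as the crawler keeps sending the same user agent.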
CloudFront Signed URLs
The most effective long-term solution turned out to be CloudFront Signed URLs.
Instead of exposing media objects directly:
https://cdn.example.com/video.mp4
the application generates time-limited signed URLs:
https://cdn.example.com/video.mp4?Expires=...&Signature=...&Key-Pair-Id=...
CloudFront validates the signature before serving the file.
Without a valid signature:
HTTP 403 Forbidden
This approach moves enforcement from the crawler's goodwill to the platform itself.
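The structure of such a URL can be sketched in Python. This follows the canned-policy scheme as I understand it: a compact JSON policy naming the resource and expiry is signed, and the signature is base64-encoded with `+`, `=`, and `/` swapped for `-`, `_`, and `~`. The key pair ID is hypothetical and the signer here is a stand-in hash, not a real signature. A real signer would sign the policy with the private key registered with CloudFront, for example through botocore's `CloudFrontSigner`.

```python
import base64
import hashlib
import json


def cloudfront_b64(data: bytes) -> str:
    """Base64-encode, then replace the characters that are unsafe in a query string."""
    encoded = base64.b64encode(data).decode("ascii")
    return encoded.replace("+", "-").replace("=", "_").replace("/", "~")


def canned_policy(url: str, expires_epoch: int) -> str:
    """Compact JSON policy: this exact resource, valid until the given epoch time."""
    policy = {
        "Statement": [
            {"Resource": url,
             "Condition": {"DateLessThan": {"AWS:EpochTime": expires_epoch}}}
        ]
    }
    return json.dumps(policy, separators=(",", ":"))


def signed_url(url, expires_epoch, key_pair_id, sign):
    """Append Expires, Signature, and Key-Pair-Id; `sign` maps bytes to signature bytes."""
    signature = sign(canned_policy(url, expires_epoch).encode("utf-8"))
    separator = "&" if "?" in url else "?"
    return (f"{url}{separator}Expires={expires_epoch}"
            f"&Signature={cloudfront_b64(signature)}&Key-Pair-Id={key_pair_id}")


def stand_in_signer(message: bytes) -> bytes:
    # Placeholder only: CloudFront would reject this. It keeps the sketch runnable.
    return hashlib.sha1(message).digest()


example = signed_url("https://cdn.example.com/video.mp4", 1767225600,
                     "K2EXAMPLEKEYID", stand_in_signer)
print(example)
```

Because the resource URL is part of what gets signed, the URL the client requests has to match it exactly.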
The Unexpected Challenge
Like many production changes, implementation introduced its own lesson.
Some media files contained spaces and special characters in their filenames.
Initially, signed URLs appeared correct but CloudFront continued rejecting requests.
After detailed testing, the root cause was discovered.
Browsers automatically URL-encode certain characters.
CloudFront validates the signature against the URL as it is actually requested, which is the encoded form.
The application was signing one version of the URL while CloudFront was validating another.
A single encoded space character was enough to break signature validation.
The fix was to percent-encode the filename before building and signing the URL:
rawurlencode($filename)
But finding it required testing, patience, and understanding exactly how CloudFront performs signature validation.
It was a reminder that the smallest implementation details often consume the most troubleshooting time.
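The mismatch is easy to reproduce outside PHP. The Python sketch below uses `urllib.parse.quote` with `safe=""`, which percent-encodes everything except unreserved characters and so behaves like `rawurlencode` for a single path segment. The filename and hostname are hypothetical.

```python
from urllib.parse import quote


def media_url(base, filename):
    """Build the URL from a percent-encoded filename, ready to be signed."""
    return f"{base}/{quote(filename, safe='')}"


filename = "Summer Launch (final).mp4"  # hypothetical file with spaces and parentheses

raw = f"https://cdn.example.com/{filename}"
encoded = media_url("https://cdn.example.com", filename)

print(encoded)         # https://cdn.example.com/Summer%20Launch%20%28final%29.mp4
print(raw == encoded)  # False
```

Signing `raw` while the client requests `encoded` produces exactly the failure described above. With `safe=""` a slash is encoded too, so a path with subdirectories would need each segment encoded separately.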
The Final Architecture
The final solution combined multiple layers.
Layer 1: CloudFront Signed URLs
Only URLs generated by the application carry a valid signature.
Direct, unsigned requests for media are rejected.
Layer 2: AWS WAF
Known crawler traffic is filtered at the CloudFront edge before reaching the origin.
Layer 3: Improved Observability
Logging was enabled across critical services to ensure future investigations would start with evidence instead of assumptions.
The Most Valuable Lesson
The biggest takeaway from this project was not related to CloudFront, WAF, or even AWS costs.
It was visibility.
The first incident consumed significant investigation time because the necessary logs did not exist.
The second incident was resolved quickly because the right telemetry was available.
The technical solution was important.
The observability improvements were even more important.
Recommendations for Every AWS Environment
Based on this experience, I strongly recommend:
- Enable CloudFront access logging for every distribution
- Enable S3 access logging for critical buckets
- Enable CloudTrail data events for sensitive object access
- Configure billing alerts early
- Monitor bandwidth anomalies proactively
- Treat robots.txt as guidance, not protection
- Protect expensive media assets using signed access mechanisms
- Test thoroughly with real-world filenames and edge cases
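As one way to act on the bandwidth-monitoring recommendation, here is a minimal Python check that flags a day whose transfer is far above the recent median. The threshold, the history length, and the daily figures are hypothetical. In practice the inputs would come from whatever metrics source is already in place.

```python
from statistics import median


def is_bandwidth_anomaly(daily_gb, threshold=3.0, min_history=7):
    """Return True when the latest day exceeds `threshold` times the median of prior days."""
    if len(daily_gb) <= min_history:
        return False  # not enough history to form a baseline
    *history, latest = daily_gb
    baseline = median(history)
    return baseline > 0 and latest > threshold * baseline


# Hypothetical daily transfer in GB: a steady week, then a spike.
normal_week = [40, 42, 38, 41, 39, 43, 40]
print(is_bandwidth_anomaly(normal_week + [45]))   # False
print(is_bandwidth_anomaly(normal_week + [400]))  # True
```

A median baseline is used here because a single earlier spike does not drag it upward the way an average would.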
Final Thoughts
Cloud environments are incredibly efficient when everything behaves as expected.
The challenge is that not everything behaves as expected.
Automated crawlers, indexing systems, bots, and third-party services constantly interact with public content in ways that are easy to overlook.
To them, media files are data.
To AWS, media files are bandwidth.
And to your monthly invoice, bandwidth has a cost.
The most effective optimization in this entire project wasn’t a new service or a complex architecture change.
It was visibility.
Because once you can see the problem clearly, solving it becomes much easier.