AWS FinOps Skill for Opsy

Step 1: Cost Explorer First

Start with Cost Explorer. It covers all regions and services, so three grouped views give the full picture:

Spend by service — identifies top cost drivers
Spend by region — shows where resources live
Daily trend — spots anomalies

Focus on services representing >5% of spend.
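
As a rough sketch of the 5% filter, assuming a response shaped like Cost Explorer's `GetCostAndUsage` output grouped by the `SERVICE` dimension with the `UnblendedCost` metric. The helper name `top_cost_drivers` and the sample amounts are illustrative and not part of the skill:

```python
from collections import defaultdict


def top_cost_drivers(response, threshold=0.05, metric="UnblendedCost"):
    """Return (service, cost, share) for services above `threshold` of total spend.

    `response` is assumed to follow the GetCostAndUsage shape when grouped
    by SERVICE. The function name is hypothetical.
    """
    totals = defaultdict(float)
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            totals[service] += float(group["Metrics"][metric]["Amount"])

    grand_total = sum(totals.values())
    if grand_total <= 0:
        # Nothing to rank, for example when credits net spend down to $0.
        return []

    drivers = [
        (service, round(cost, 2), round(cost / grand_total, 4))
        for service, cost in totals.items()
        if cost / grand_total > threshold
    ]
    return sorted(drivers, key=lambda item: item[1], reverse=True)


# Illustrative response: the amounts are made up for the example.
sample_response = {
    "ResultsByTime": [
        {
            "Groups": [
                {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
                 "Metrics": {"UnblendedCost": {"Amount": "700.00", "Unit": "USD"}}},
                {"Keys": ["Amazon Relational Database Service"],
                 "Metrics": {"UnblendedCost": {"Amount": "260.00", "Unit": "USD"}}},
                {"Keys": ["AWS Secrets Manager"],
                 "Metrics": {"UnblendedCost": {"Amount": "40.00", "Unit": "USD"}}},
            ]
        }
    ]
}

for service, cost, share in top_cost_drivers(sample_response):
    print(f"{service}: ${cost:.2f} ({share:.0%})")
```

In this sample only the first two services pass the filter, since the third accounts for 4% of the total.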

If Credits Mask Costs ($0 spend)

Check if Resource Explorer is enabled:

aws resource-explorer-2 list-indexes --region us-east-1

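If enabled, use it. When the region holds the aggregator index, one search spans every indexed region (results are capped at 1,000 per query, so narrow the query string for large accounts):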

aws resource-explorer-2 search --query-string "*" --region us-east-1

If NOT enabled, use resourcegroupstaggingapi to find tagged resources. The API is regional, so repeat the call in each region you need to cover:

aws resourcegroupstaggingapi get-resources --region us-east-1

Then query each active region for core services: EC2, RDS, EBS, Lambda, S3, ECS, EKS, NAT Gateways, Load Balancers.
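
One way to decide which regions count as active is to group the ARNs returned by either inventory call by service and region. A minimal sketch, assuming standard `arn:partition:service:region:account-id:resource` ARNs. The helper `group_arns` and the sample ARNs are hypothetical:

```python
from collections import Counter


def group_arns(arns):
    """Count resources per (service, region) from a list of ARN strings."""
    counts = Counter()
    for arn in arns:
        # ARN layout: arn:partition:service:region:account-id:resource
        parts = arn.split(":", 5)
        if len(parts) < 6 or parts[0] != "arn":
            continue  # skip anything that is not a well-formed ARN
        service = parts[2]
        # Some ARNs (S3 buckets, IAM) leave the region field empty.
        region = parts[3] or "unspecified"
        counts[(service, region)] += 1
    return counts


sample_arns = [
    "arn:aws:ec2:us-east-1:123456789012:instance/i-0abc123def456",
    "arn:aws:ec2:us-east-1:123456789012:volume/vol-0abc123def456",
    "arn:aws:rds:eu-west-1:123456789012:db:orders",
    "arn:aws:s3:::example-bucket",
]

counts = group_arns(sample_arns)
active_regions = sorted({region for _, region in counts if region != "unspecified"})
print(active_regions)
```

The resulting region list can then drive the per-region queries for the core services.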

Step 2: Deep Dive Each Resource

For every resource found, gather full details:

EC2: Instance type, state, launch time, CloudWatch CPU (memory only if the CloudWatch agent publishes it)
RDS: Instance class, connections (14d), storage, Multi-AZ, engine
EBS: Attachment status, volume type, size, snapshots
S3: Lifecycle policies, storage class, versioning
Lambda: Invocations (30d), memory, runtime, provisioned concurrency
ECS/EKS: Task definitions, service counts, cluster utilization
ECR: Repositories, image count, lifecycle policies
Load Balancers: Request count (14d), target groups
NAT Gateway: Data processed
Elastic IPs: Association status
CloudWatch Logs: Retention settings
Secrets Manager: Secret count

Check EVERY resource for optimization opportunities. Don't skip services.

Step 3: Check Commitment Coverage

Savings Plans utilization
Reserved Instance coverage gaps
Expiring commitments (next 30 days)
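
The 30-day expiry check can be sketched as a simple date filter. This assumes each Savings Plan or Reserved Instance has already been reduced to a record with an identifier and an ISO 8601 end timestamp. The keys `id` and `end`, the function name, and the sample records are hypothetical:

```python
from datetime import datetime, timedelta, timezone


def expiring_commitments(commitments, now, window_days=30):
    """Return commitments whose end date falls within `window_days` of `now`.

    Each commitment is a dict with hypothetical keys "id" and "end"
    (an ISO 8601 timestamp with a UTC offset).
    """
    cutoff = now + timedelta(days=window_days)
    expiring = []
    for item in commitments:
        end = datetime.fromisoformat(item["end"])
        if now <= end <= cutoff:
            expiring.append({**item, "days_left": (end - now).days})
    return sorted(expiring, key=lambda item: item["days_left"])


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
commitments = [
    {"id": "sp-example-1", "end": "2025-01-20T00:00:00+00:00"},
    {"id": "ri-example-2", "end": "2025-06-30T00:00:00+00:00"},
    {"id": "ri-example-3", "end": "2024-12-15T00:00:00+00:00"},  # already expired
]

for item in expiring_commitments(commitments, now):
    print(f'{item["id"]} expires in {item["days_left"]} days')
```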

Safety Guardrails

Report findings with evidence, suggest investigation — not direct actions:

"Instance i-xxx averaged 3% CPU over 30 days — rightsizing candidate"
"Volume vol-xxx unattached since [date] — verify before removing"
"RDS db-xxx had 0 connections for 14 days — confirm if still needed"

Thresholds:

Idle: ~0% utilization for 14+ days
Underutilized: <10% average for 14+ days
Rightsizing candidate: <30% average
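
A minimal sketch of how these thresholds could be applied to a single averaged metric. It assumes "~0%" means below 1%, and the function name and the `OK` label for everything else are illustrative choices:

```python
def classify_utilization(avg_percent, days_observed):
    """Map an average utilization figure to the threshold labels above.

    "~0%" is interpreted here as below 1%, which is an assumption.
    """
    if days_observed >= 14:
        if avg_percent < 1.0:
            return "Idle"
        if avg_percent < 10.0:
            return "Underutilized"
    if avg_percent < 30.0:
        return "Rightsizing candidate"
    return "OK"  # hypothetical label for resources with no finding


for avg, days in [(0.2, 14), (8.0, 30), (25.0, 30), (55.0, 30), (3.0, 7)]:
    print(avg, days, classify_utilization(avg, days))
```

With fewer than 14 days of data the sketch never reports `Idle` or `Underutilized`, since both thresholds require a 14-day window.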

Smart Recommendation Rules

Only flag when action is possible:

Situation	Action
Minimum size + in use (db.t3.micro with connections)	Skip — already right-sized
Minimum size + idle (db.t3.micro, 0 connections)	Flag as idle
Larger size + low utilization	Flag for rightsizing with specific target
Tagged `FinOps:Skip=true`	Skip
Dev/staging with `Environment=dev`	Skip low utilization (expected)

Before flagging, verify:

Is this the minimum size?
Is it actually in use? (connections/invocations/requests)
Is there a smaller option?
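
The rule table and checks above can be read as a short decision function. The sketch below is one possible encoding. The `Resource` fields, the returned strings, and the order in which the rules are evaluated are assumptions, since the table does not state a precedence:

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    # All field names are illustrative, not an AWS data model.
    resource_id: str
    is_minimum_size: bool
    in_use: bool            # connections, invocations, or requests observed
    low_utilization: bool
    tags: dict = field(default_factory=dict)


def recommend(resource):
    """Apply the rule table top-down and return an action string."""
    if resource.tags.get("FinOps:Skip") == "true":
        return "skip: tagged FinOps:Skip=true"
    if resource.is_minimum_size:
        if resource.in_use:
            return "skip: already right-sized"
        return "flag: idle"
    if resource.low_utilization:
        if resource.tags.get("Environment") == "dev":
            return "skip: low utilization expected in dev"
        return "flag: rightsizing"
    return "skip: no finding"


examples = [
    Resource("db-a", is_minimum_size=True, in_use=True, low_utilization=True),
    Resource("db-b", is_minimum_size=True, in_use=False, low_utilization=True),
    Resource("i-c", is_minimum_size=False, in_use=True, low_utilization=True),
    Resource("i-d", is_minimum_size=False, in_use=True, low_utilization=True,
             tags={"Environment": "dev"}),
    Resource("i-e", is_minimum_size=False, in_use=True, low_utilization=True,
             tags={"FinOps:Skip": "true"}),
]

for res in examples:
    print(res.resource_id, "->", recommend(res))
```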

Service Checklists

EC2: Utilization, stopped instances (EBS cost), previous-gen types, On-Demand 24/7 → SP/RI

Lambda: Zero invocations (30d), memory vs duration tradeoff, provisioned concurrency

ECS/EKS: Fargate vs EC2, resource requests vs usage, Spot for fault-tolerant

ECR: Lifecycle policies, image count, total size — old images accumulate

RDS: Connection count, Multi-AZ in dev, instance class utilization, storage, previous-gen

DynamoDB: Provisioned vs On-Demand fit, auto-scaling, TTL

ElastiCache/OpenSearch: Node utilization, reserved coverage

S3: Lifecycle policies, storage class, Intelligent-Tiering, incomplete multipart uploads

EBS: Unattached volumes, gp2→gp3, snapshot retention, IOPS necessity

Networking: Cross-AZ transfer, NAT Gateway → VPC endpoints, CloudFront caching

Load Balancers: Zero requests = orphaned, Classic→ALB/NLB

Elastic IPs: Unassociated = about $3.60/month each ($0.005/hour); since February 2024 in-use public IPv4 addresses are billed at the same rate

CloudWatch: Log retention (default infinite), high-res metrics necessity

Secrets Manager: $0.40 per secret per month vs free Parameter Store standard parameters

API Gateway: HTTP APIs cost roughly 70% less per request than REST APIs

Output Requirements

CSV (Required)

account_id,resource_name,status,recommendation_type,potential_savings_monthly,resource_id,region,resource_type,tags,description
123456789012,web-server-prod,Underutilized,Rightsizing to t3.small,45.00,i-0abc123def456,us-east-1,EC2 Instance,"Environment=prod,Team=platform","Avg CPU 8% over 30 days. Current: t3.large"
123456789012,,Unattached,Verify before removing,12.50,vol-0xyz789,us-east-1,EBS Volume,,"100GB gp2 volume unattached since 2024-12-01"
123456789012,raspberry,No-Lifecycle,Add ECR lifecycle policy,2.00,raspberry,us-east-1,ECR Repository,,"47 images totaling 12GB. No lifecycle policy configured"
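
Because the `tags` column contains commas, it helps to generate the file with a CSV library rather than string concatenation. A minimal sketch using Python's standard `csv` module. The function name `findings_to_csv` and the shape of the finding dicts are assumptions:

```python
import csv
import io

COLUMNS = [
    "account_id", "resource_name", "status", "recommendation_type",
    "potential_savings_monthly", "resource_id", "region", "resource_type",
    "tags", "description",
]


def findings_to_csv(findings):
    """Serialize finding dicts to CSV text in the column order above."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for finding in findings:
        row = dict(finding)
        # Flatten the tag dict into the "Key=Value,Key=Value" form used above.
        tags = row.get("tags") or {}
        row["tags"] = ",".join(f"{key}={value}" for key, value in tags.items())
        row["potential_savings_monthly"] = f'{row["potential_savings_monthly"]:.2f}'
        writer.writerow(row)  # missing keys are written as empty fields
    return buffer.getvalue()


findings = [
    {
        "account_id": "123456789012",
        "resource_name": "web-server-prod",
        "status": "Underutilized",
        "recommendation_type": "Rightsizing to t3.small",
        "potential_savings_monthly": 45.0,
        "resource_id": "i-0abc123def456",
        "region": "us-east-1",
        "resource_type": "EC2 Instance",
        "tags": {"Environment": "prod", "Team": "platform"},
        "description": "Avg CPU 8% over 30 days. Current: t3.large",
    },
]

print(findings_to_csv(findings))
```

The writer quotes only the fields that need it, so the `tags` value is wrapped in double quotes while plain fields are left bare.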

Column	Description
`account_id`	AWS account ID
`resource_name`	Name tag value (empty if untagged)
`status`	`Idle`, `Underutilized`, `Overs

...
