JobSpy
An event-driven AWS pipeline that scrapes, filters, and AI-scores LinkedIn job postings — surfacing 40+ qualified roles to my inbox each week.
Phase 02 / Architecture
An event-driven cascade.
01 / The problem
Job hunting on LinkedIn is a volume problem disguised as a quality problem. My existing setup, a single n8n container doing scraping, filtering, and notification sequentially, had no observability and no retry semantics, and was one container restart away from missing a day's postings.
02 / The solution
Four Lambdas connected by SQS queues, each with its own DLQ. EventBridge fires every four hours. A three-gate cascade discards obvious rejects on free signals before any paid AI call. Top matches land in my inbox. Fully Terraformed, deployed via GitHub Actions OIDC — no long-lived AWS credentials anywhere.
03 / The cascade
Title filter
Drops irrelevant titles before any further work happens.
Description filter
Reads the full description and drops the obvious mismatches.
AI scoring
AI scores what's left and ranks the best fits for me.
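A minimal sketch of how a three-gate cascade like this could be ordered, cheapest check first. The keyword lists, the score threshold, and the `fake_ai_score` stand-in are all hypothetical, not the pipeline's actual rules; the point is only that the paid call runs on whatever survives the two free gates.

```python
from dataclasses import dataclass

# Hypothetical filter settings, for illustration only.
TITLE_BLOCKLIST = ("intern", "sales", "recruiter")
REQUIRED_TERMS = ("aws", "python")
MIN_SCORE = 70


@dataclass
class Job:
    title: str
    description: str


def passes_title_gate(job: Job) -> bool:
    """Gate 1: free check on the title alone (whole-word match)."""
    words = job.title.lower().split()
    return not any(word in words for word in TITLE_BLOCKLIST)


def passes_description_gate(job: Job) -> bool:
    """Gate 2: free check on the full description."""
    text = job.description.lower()
    return any(term in text for term in REQUIRED_TERMS)


def fake_ai_score(job: Job) -> int:
    """Stand-in for the paid model call; counts matched terms instead."""
    text = job.description.lower()
    return 50 + 25 * sum(term in text for term in REQUIRED_TERMS)


def run_cascade(jobs, score_fn=fake_ai_score):
    """Return (job, score) pairs that clear all three gates, best first."""
    ranked = []
    for job in jobs:
        if not passes_title_gate(job):
            continue
        if not passes_description_gate(job):
            continue
        score = score_fn(job)  # Gate 3: the only step that costs money.
        if score >= MIN_SCORE:
            ranked.append((job, score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

Passing a counting wrapper as `score_fn` is an easy way to check that rejected postings never reach the scoring step.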
04 / Trade-offs
Lambda over Fargate
Bursty traffic. No idle compute between EventBridge ticks.
Single NAT gateway
Saves ~$32/mo. Acceptable for a personal pipeline.
No Multi-AZ on RDS
If the AZ goes down I'll get fewer job alerts for an hour.
Reserved concurrency on SQS consumers
Caps concurrent Lambda invocations, which rate-limits requests to LinkedIn at the infra layer instead of in code.
05 / War stories
Bedrock → OpenAI pivot
Submitted a token-quota increase request for Claude on Bedrock. Got back a default token quota of zero: account-wide, no override. Pivoted the scoring Lambda to OpenAI in an afternoon. The lesson: AWS account-trust gating is a real category of risk, not a paperwork formality.
Docker manifest format
Built the scraper image on my Mac and pushed it. Lambda refused to pull it. Turned out the local Docker build was emitting an OCI image index (a manifest list), and Lambda only accepts a single-image manifest. Fix was a single buildx flag. Cost an evening.
Lambda concurrency quota
Assumed the default of 1,000 concurrent executions per account. The real limit in my fresh account was 10. The pipeline silently throttled at peak until I noticed alerts were skipping batches. Service quota request, three days, fixed.
06 / What I learned
Terraform apply lifecycle, and how to break it apart across stacks safely.
Hit real errors mid-apply (password conflicts, drift on the rotation stack) and learned the hard way that the order Terraform creates things in matters. Some resources need lifecycle rules to stop Terraform from destroying and recreating them every run.
Terraform's depends_on meta-argument for forcing build order.
Terraform usually figures out dependencies on its own by tracing resource references — if resource A uses resource_b.id, it knows B comes first. But sometimes the dependency is invisible to Terraform, like a NAT Gateway needing the Internet Gateway to exist first even though it doesn't reference it directly. depends_on makes the order explicit so Terraform doesn't try to build things in parallel and fail.
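As a toy model of the ordering problem (Python rather than Terraform, and not a description of Terraform's internals), the standard-library `graphlib` module can show what an explicit edge changes. The resource names and the `build_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def build_order(implicit_refs, explicit_depends_on=None):
    """Return a valid creation order for a resource graph.

    Both arguments map a resource name to the set of resources it must
    wait for. implicit_refs models references found in the config;
    explicit_depends_on models hand-written depends_on entries.
    """
    graph = {name: set(deps) for name, deps in implicit_refs.items()}
    for name, deps in (explicit_depends_on or {}).items():
        graph.setdefault(name, set()).update(deps)
    return list(TopologicalSorter(graph).static_order())


# Hypothetical resources: the NAT gateway references its subnet and EIP,
# but nothing in its arguments mentions the internet gateway.
refs = {
    "vpc": set(),
    "subnet": {"vpc"},
    "eip": set(),
    "internet_gateway": {"vpc"},
    "nat_gateway": {"subnet", "eip"},
}

# The explicit edge guarantees the internet gateway is ordered first.
order = build_order(refs, {"nat_gateway": {"internet_gateway"}})
```

Without the explicit entry, nothing in the graph ties the two gateways together, so either could legitimately be scheduled first.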
VPC endpoints and how aggressively they cut NAT data-processing charges.
A NAT gateway charges $0.045 per GB of data it processes. VPC endpoints let your Lambdas talk to AWS services (S3, Secrets Manager) without going through NAT at all. S3 gateway endpoints are free; interface endpoints for the other services cost ~$7/month each per AZ but pay for themselves fast at scale.
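A rough break-even calculation for one paid endpoint, using the two figures quoted above. The per-GB processing charge on the endpoint itself is an assumption added here (it is not in the text), and all prices vary by region, so treat the output as an estimate.

```python
NAT_PER_GB = 0.045        # from the text
ENDPOINT_MONTHLY = 7.00   # from the text (approximate)
ENDPOINT_PER_GB = 0.01    # assumption: per-GB charge on a paid endpoint


def monthly_saving(gb_per_month, endpoint_per_gb=ENDPOINT_PER_GB):
    """Dollars saved per month by moving gb_per_month from NAT to an endpoint."""
    nat_cost = gb_per_month * NAT_PER_GB
    endpoint_cost = ENDPOINT_MONTHLY + gb_per_month * endpoint_per_gb
    return nat_cost - endpoint_cost


def break_even_gb(endpoint_per_gb=ENDPOINT_PER_GB):
    """Monthly volume at which the endpoint's fixed fee is recovered."""
    return ENDPOINT_MONTHLY / (NAT_PER_GB - endpoint_per_gb)
```

Under these assumptions the fixed fee is recovered at about 200 GB a month to that one service; below that, the endpoint costs more than the NAT traffic it replaces.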
Secrets Manager rotation as a state machine, not a cron job.
Rotation isn't just 'generate a new password every 30 days.' It's four steps: make a new password, apply it to the database, test it works, then promote it. The old password sticks around as a backup in case the new one breaks something.
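The four steps can be sketched as a small in-memory simulation. The step names (`createSecret`, `setSecret`, `testSecret`, `finishSecret`) and staging labels (`AWSCURRENT`, `AWSPENDING`, `AWSPREVIOUS`) are the ones Secrets Manager uses; `FakeDatabase` and `rotate` are hypothetical stand-ins, and a real rotation Lambda is invoked once per step rather than looping.

```python
import secrets

STEPS = ("createSecret", "setSecret", "testSecret", "finishSecret")


class FakeDatabase:
    """In-memory stand-in for the real database user."""

    def __init__(self, password):
        self.password = password

    def login(self, password):
        return password == self.password


def rotate(stages, db):
    """Run the four rotation steps in order against in-memory state.

    stages maps a staging label to a password. Returns the list of
    steps that completed.
    """
    done = []
    for step in STEPS:
        if step == "createSecret":
            # Generate the candidate, but do not use it yet.
            stages.setdefault("AWSPENDING", secrets.token_urlsafe(24))
        elif step == "setSecret":
            # Apply the candidate to the database.
            db.password = stages["AWSPENDING"]
        elif step == "testSecret":
            # Prove the candidate works before promoting it.
            if not db.login(stages["AWSPENDING"]):
                raise RuntimeError("pending password failed login test")
        elif step == "finishSecret":
            # Promote: the old current value becomes the fallback.
            stages["AWSPREVIOUS"] = stages["AWSCURRENT"]
            stages["AWSCURRENT"] = stages.pop("AWSPENDING")
        done.append(step)
    return done
```

Because promotion only happens in the last step, a failure in any earlier step leaves `AWSCURRENT` untouched, which is what makes this a state machine rather than a timer.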
AWS account-trust gating as a real, planning-level problem category.
Some AWS services aren't click-to-enable. Bedrock model access, SES production mode, and quota increases all need approval that can take hours or days. You have to request them early or they block the whole build.
Longest prefix match is how every routing decision works.
When a packet has multiple routes it could take, the most specific one wins. That's why traffic to 10.0.5.42 stays inside the VPC instead of going to NAT — the /16 local route is more specific than the /0 default. This rule solves most 'why did my traffic go through NAT?' mysteries.
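The rule is small enough to show with Python's standard `ipaddress` module. The two-entry route table and the target names are hypothetical, chosen to mirror the example above.

```python
import ipaddress

# Hypothetical route table: destination CIDR -> target.
ROUTES = {
    "10.0.0.0/16": "local",
    "0.0.0.0/0": "nat-gateway",
}


def pick_route(dest_ip, routes=ROUTES):
    """Return the target of the most specific route containing dest_ip."""
    addr = ipaddress.ip_address(dest_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes.items()
        if addr in ipaddress.ip_network(cidr)
    ]
    if not matches:
        raise LookupError(f"no route to {dest_ip}")
    # The longest prefix (largest prefixlen) is the most specific match.
    network, target = max(matches, key=lambda m: m[0].prefixlen)
    return target
```

Here `pick_route("10.0.5.42")` returns `"local"` because both routes match and /16 beats /0, while an address outside 10.0.0.0/16 matches only the default route and returns `"nat-gateway"`.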
Cross-AZ data transfer is the silent line item on AWS bills.
Anytime data crosses between availability zones — Lambda talking to RDS in another AZ, or Lambdas in AZ-b using a NAT in AZ-a — AWS charges $0.01 per GB each way. It's small at low volume but a major cost driver at scale. Production setups run a NAT per AZ to kill it; I'm running single NAT because my volume is too low to matter.
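A back-of-envelope check on the single-NAT decision, using the $0.01 per GB figure above and the ~$32/month from the trade-offs section. Treating that $32 as the fixed cost of one additional NAT gateway is my reading of the text, and the function names are hypothetical.

```python
CROSS_AZ_PER_GB = 0.01     # from the text, charged in each direction
EXTRA_NAT_MONTHLY = 32.00  # from the text (approximate), assumed cost of one more NAT


def cross_az_cost(gb_sent, gb_received):
    """Monthly cross-AZ transfer charge for traffic in both directions."""
    return (gb_sent + gb_received) * CROSS_AZ_PER_GB


def second_nat_pays_off(gb_sent, gb_received):
    """True when the avoided cross-AZ charge exceeds the extra NAT's fixed cost."""
    return cross_az_cost(gb_sent, gb_received) > EXTRA_NAT_MONTHLY
```

With these numbers, a second NAT only wins once more than roughly 3,200 GB a month crosses AZs on its way to the NAT, far above what a personal pipeline moves.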
© 2026 Jeff Lubin