Multi-Cloud DevOps Engineer crafting scalable infrastructure, CI/CD pipelines, and container orchestration across AWS · Azure · GCP
I'm a motivated Multi-Cloud DevOps Engineer with hands-on experience across AWS, Azure and GCP. My work sits at the intersection of infrastructure automation and developer experience: making deployments faster, systems more resilient, and operations invisible.
Skilled in Linux administration, CI/CD automation, Docker & Kubernetes orchestration, Infrastructure as Code using Terraform, and observability stacks using Prometheus & Grafana. I believe infrastructure should be code — versioned, tested, and reviewed like any software.
Designed and implemented a complete end-to-end DevOps pipeline that takes code from a developer's commit all the way to a live Kubernetes deployment without any manual intervention. The pipeline integrates GitHub and GitLab webhooks with Jenkins to automatically trigger Maven builds the moment code is pushed. Every build runs through SonarQube static analysis with quality gates configured — a failed gate stops the pipeline before a bad image ever gets created.
Once a build passes, Docker packages the application and the versioned image is pushed to Amazon ECR. Jenkins then applies the updated Kubernetes manifests to an Amazon EKS cluster using rolling update strategies, ensuring zero downtime during releases. Horizontal Pod Autoscaling is configured against CPU and memory thresholds so the cluster scales workloads automatically as traffic changes. Every AWS resource — VPC, subnets, EKS node groups, IAM roles, ECR repositories — is provisioned via modular Terraform, meaning dev, staging and production environments can be spun up from the same codebase with environment-specific variable files.
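To make the autoscaling behaviour concrete, here is a rough Python sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas x current metric / target metric). The function name, the replica bounds and the sample numbers are illustrative assumptions, not values taken from this cluster; the 10% tolerance mirrors the documented default.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA scaling rule for a single metric."""
    ratio = current_utilization / target_utilization
    # Close enough to the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured minimum and maximum (assumed bounds).
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 4 pods at 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

When both CPU and memory targets are set, the autoscaler evaluates each metric separately and acts on the largest resulting replica count.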
Architected a production-grade three-tier AWS environment with strict network segmentation designed around the principle of least privilege at every layer. The public subnets host only the Application Load Balancer and NAT Gateway — nothing else is directly internet-reachable. EC2 application servers live in private subnets with no public IP addresses, accessing the internet only through the NAT Gateway for package updates and outbound calls. The database tier runs Amazon RDS MySQL in Multi-AZ configuration across two Availability Zones with automated failover, also in private subnets with no route to the internet.
S3 is used for static asset delivery and backup storage, secured with bucket policies that deny everything except explicitly allowed access and with server-side AES-256 encryption. The Application Load Balancer terminates TLS using ACM-managed certificates and routes traffic using path-based rules, so a single ALB can serve multiple application endpoints. Security groups between each tier are tightly scoped: the web tier only accepts HTTPS from the internet; the app tier only accepts connections from the web tier's security group; the database tier only accepts traffic on the MySQL port from the app tier's security group.
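One way to express a deny-all-except-allowed bucket policy is sketched below in Python, building the policy document as a dictionary. The bucket name, the role ARN and the statement IDs are hypothetical, and the second statement assumes the AES-256 encryption is requested through the `AES256` server-side-encryption header; the real policy may be structured differently.

```python
import json


def build_bucket_policy(bucket: str, allowed_principal_arns: list) -> dict:
    """Build a bucket policy that denies every caller not explicitly listed."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    object_arn = f"{bucket_arn}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Deny any principal whose ARN is not on the allow list.
                "Sid": "DenyAllExceptAllowedPrincipals",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, object_arn],
                "Condition": {
                    "StringNotEquals": {"aws:PrincipalArn": allowed_principal_arns}
                },
            },
            {
                # Deny uploads that do not request AES-256 server-side encryption.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": object_arn,
                "Condition": {
                    "StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}
                },
            },
        ],
    }


if __name__ == "__main__":
    policy = build_bucket_policy(
        "example-assets-bucket",                       # hypothetical bucket
        ["arn:aws:iam::123456789012:role/app-tier"],   # hypothetical role
    )
    print(json.dumps(policy, indent=2))
```

An explicit Deny overrides any Allow elsewhere, so the allow list has to include every role that legitimately needs the bucket, including the one used to administer it.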
Solved a fundamental problem with ephemeral infrastructure: when new EC2 instances spin up inside an Auto Scaling Group during a traffic spike, they need Node Exporter installed and Prometheus needs to know they exist — all without anyone manually running commands. This project automates that entire workflow so monitoring coverage is maintained regardless of how the fleet scales.
Ansible playbooks install and configure Node Exporter across Linux servers, with custom labels embedded directly in the playbook so each instance is correctly identified in Prometheus scrape targets. EC2 tags on the Auto Scaling Group are configured so every new instance inherits consistent naming conventions at launch time. An AWS Lambda function is subscribed through CloudWatch Events (now Amazon EventBridge) to the Auto Scaling "EC2 Instance Launch Successful" event. The moment a new node comes up, Lambda fires, SSM Run Command triggers the Ansible playbook on that instance, and within minutes Node Exporter is running and Prometheus has picked it up via dynamic service discovery. Grafana dashboards aggregate metrics from the entire fleet in real time, so during scale-out events operators have full visibility without any manual steps.
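The launch-event handler could look roughly like the Python sketch below. It assumes the playbook is started through the stock `AWS-RunShellScript` SSM document; the playbook path and the returned dictionary shape are hypothetical, and the optional `ssm_client` argument exists only so the function can be exercised without AWS credentials.

```python
# Hypothetical command; the real playbook location is not stated in this write-up.
PLAYBOOK_COMMAND = "ansible-playbook /opt/monitoring/node_exporter.yml"


def handler(event, context, ssm_client=None):
    """Run the Node Exporter playbook on a freshly launched ASG instance."""
    instance_id = event.get("detail", {}).get("EC2InstanceId")
    is_launch = event.get("detail-type") == "EC2 Instance Launch Successful"
    if not is_launch or not instance_id:
        # Not an Auto Scaling launch event: nothing to do.
        return {"status": "ignored"}

    if ssm_client is None:
        import boto3  # imported lazily so the sketch can be tested offline
        ssm_client = boto3.client("ssm")

    # Ask SSM Run Command to execute the playbook on the new instance.
    response = ssm_client.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [PLAYBOOK_COMMAND]},
        Comment="Install Node Exporter on new ASG instance",
    )
    return {
        "status": "sent",
        "instance_id": instance_id,
        "command_id": response["Command"]["CommandId"],
    }
```

A new instance may not have registered with SSM by the time the event arrives, so a production version would likely need a retry or a short wait before calling `send_command`.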
Built a fully serverless ETL pipeline where S3 event notifications act as the trigger mechanism — the moment a file lands in the input bucket, an AWS Lambda function is invoked automatically with no polling, no queues to manage, and no servers to maintain. The Lambda functions, written in Python, handle the full processing lifecycle: reading the incoming file, validating the data structure, applying transformation logic, and writing the processed output to a separate S3 bucket.
Every Lambda execution emits structured JSON logs to CloudWatch Logs, making debugging and audit trails straightforward. CloudWatch Alarms are configured on Lambda error rates, throttle counts and p99 duration so any degradation triggers an alert before it becomes user-facing. All IAM execution roles follow resource-level least-privilege — each function's policy specifies exact S3 ARNs for read or write, with no wildcard actions or resources anywhere in the permission set. The result is a zero-maintenance, auto-scaling data pipeline that handles variable file volumes without provisioning or managing a single EC2 instance.
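As an illustration of that flow, the Python sketch below reads the object named in the S3 event, validates and transforms it, writes the result to a second bucket and emits one JSON log line per file. The CSV schema, the output bucket name, the `processed/` key prefix and the helper names are all assumptions made for the example, not details of the actual pipeline.

```python
import csv
import io
import json
import urllib.parse

REQUIRED_COLUMNS = {"id", "amount"}          # hypothetical input schema
OUTPUT_BUCKET = "example-processed-bucket"   # hypothetical bucket name


def transform(raw_csv: str) -> str:
    """Validate the CSV header, then convert rows to JSON Lines."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = [
        {"id": row["id"].strip(), "amount": round(float(row["amount"]), 2)}
        for row in reader
    ]
    return "\n".join(json.dumps(row) for row in rows)


def log(level: str, message: str, **fields) -> None:
    """Print one JSON object per line; Lambda forwards stdout to CloudWatch Logs."""
    print(json.dumps({"level": level, "message": message, **fields}))


def handler(event, context, s3_client=None):
    if s3_client is None:
        import boto3  # imported lazily so the sketch can be tested offline
        s3_client = boto3.client("s3")

    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        output = transform(body.decode("utf-8"))
        s3_client.put_object(
            Bucket=OUTPUT_BUCKET,
            Key=f"processed/{key}.jsonl",
            Body=output.encode("utf-8"),
        )
        log("INFO", "file processed", source_bucket=bucket, key=key)
        processed.append(key)
    return {"processed": processed}
```

Writing the output to a different bucket from the one that triggers the function avoids the function re-invoking itself on its own results.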
Built a reusable Terraform module library covering VPC, EC2, IAM and RDS that provisions three fully isolated environments — dev, staging and production — from a single codebase. Each environment is parameterised through Terraform variable files, so the same module code creates a small t3.micro dev setup and a production-grade Multi-AZ deployment without any code duplication. Remote state is stored in S3 with DynamoDB state locking, which prevents two engineers from running terraform apply simultaneously and corrupting state.
GitLab CI/CD is wired into the Terraform workflow: every merge request automatically runs terraform plan and posts the output as a pipeline artifact so reviewers can see exactly what infrastructure will change before merging. Merging to main runs terraform apply behind a manual approval gate — a human has to click approve before any infrastructure is actually modified. IAM roles are scoped per environment, meaning a credential leak in dev cannot affect staging or production resources.
Deployed a web application designed around the assumption that any single component can fail at any time while the system keeps running without user-visible impact. Traffic enters through an Elastic Load Balancer spread across two Availability Zones, distributing requests to an Auto Scaling Group of EC2 instances. The ASG uses ELB health checks against an application health endpoint rather than only EC2 status checks, so an instance that is running but returning 5xx errors is replaced automatically instead of silently serving errors.
The database layer uses Amazon RDS Multi-AZ, which keeps a synchronous standby replica in a second Availability Zone. If the primary fails, whether from a hardware issue or an AZ outage, RDS promotes the standby and repoints the DNS endpoint, typically within one to two minutes, with no application configuration changes required. CloudWatch dashboards give real-time visibility into ELB request rates, 5xx error percentages and RDS replication lag. SNS notification topics are wired to CloudWatch Alarms so on-call engineers receive immediate alerts when error rates spike or replication lag climbs, well before users are impacted.
Containerised a full multi-service application stack (Nginx frontend, backend REST API and MySQL database) so that local development runs the same images that run in production. Docker Compose handled local orchestration with service networking, named volumes and environment variable injection. Each service was packaged into a versioned image and stored in a private registry, ensuring deployments reference immutable image digests rather than mutable tags.
The migration to Kubernetes used hand-crafted manifests rather than auto-generated YAML: Deployments with carefully tuned resource requests and limits, Services for intra-cluster networking, ConfigMaps for application configuration, Secrets for credentials, and PersistentVolumeClaims for the MySQL data directory. Helm charts were authored on top of those manifests to support parameterised releases — the same chart deploys to a test namespace with minimal resources and to production with full replica counts by swapping value files. Nginx acts as the Ingress controller handling path-based routing and SSL passthrough. Prometheus scrapes pod-level metrics via ServiceMonitor custom resources and feeds Grafana dashboards tracking container CPU, memory usage, HTTP request rates and error budgets in real time.
Established a fully GitOps-driven delivery model where the Git repository is the authoritative source of truth for all cluster state — no kubectl commands, no manual deployments, no configuration drift. The workflow is split cleanly between CI and CD: GitHub Actions owns the CI phase, running Maven builds, SonarQube quality gates and Docker image builds. On a successful build, the CI pipeline commits the new image tag back to the Helm chart in the configuration repository rather than applying it directly to the cluster.
ArgoCD monitors the configuration repository and detects when the desired state diverges from what's running in the cluster. It automatically syncs the new image tag and applies the updated Helm chart to Kubernetes. Self-healing is enabled on the ArgoCD Application CRD, so any out-of-band manual change to the cluster — an engineer editing a Deployment directly with kubectl — is automatically reverted to match what's in Git within seconds. Ansible playbooks handle the day-zero Kubernetes node bootstrapping: installing the container runtime, applying kernel parameters and joining nodes to the cluster with no manual SSH steps. Rollbacks are a one-line Git revert — the history is auditable, the recovery is deterministic.
Walk-in interviews are the dominant hiring format for freshers and mid-level roles in India — and the experience is broken on both sides. Candidates show up at a venue with no idea about slot timings, wait in queues for hours, and often leave without getting interviewed at all. Companies have no way to predict candidate volume, end up with empty slots or severe overcrowding, and collect zero data on actual attendance.
HireWalk digitises the entire process. Companies post walk-in drives with a job description, package, date and time slots with defined capacity limits. Candidates browse open drives, pick a slot that fits their schedule, and receive an automated confirmation email instantly. A chaotic physical queue becomes a structured, pre-booked system — while keeping the open "anyone can apply" nature of walk-in hiring intact.
Two distinct user types — companies and job seekers — each have their own authenticated dashboard. Companies register, post drives, define multiple time slots with per-slot capacity, and track exactly who has booked which slot. Candidates register, browse active drives by company or role, book a slot, and receive an email confirmation automatically. The platform also includes an AI mock interview feature powered by Groq's LLaMA model: candidates can practice before their actual interview with a conversational AI that evaluates each response and provides structured feedback.
OTP verification breaking under Gunicorn workers. OTPs were initially stored in an in-memory Python dictionary, which worked in development but broke in production because each Gunicorn worker process has its own isolated memory. An OTP generated on worker 1 could not be verified when the follow-up request landed on worker 2. The fix was moving OTP storage to a dedicated otp_store database table with a timestamp column, so any worker can read the same record. A 10-minute expiry is enforced at verification time by comparing the stored timestamp with the current time.
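A minimal sketch of the database-backed approach follows, using Python's built-in sqlite3 module so it runs anywhere. Only the otp_store table name, the timestamp column and the 10-minute expiry come from the description above; the column names, the function names and the single-use deletion are assumptions.

```python
import sqlite3
import time

OTP_TTL_SECONDS = 600  # 10-minute expiry


def init_store(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS otp_store ("
        "email TEXT PRIMARY KEY, otp TEXT NOT NULL, created_at REAL NOT NULL)"
    )
    conn.commit()


def save_otp(conn, email, otp, now=None):
    """Store (or overwrite) the OTP for this email with its creation time."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO otp_store (email, otp, created_at) VALUES (?, ?, ?)",
        (email, otp, now),
    )
    conn.commit()


def verify_otp(conn, email, otp, now=None):
    """Return True only for a matching, unexpired OTP; a used OTP is deleted."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT otp, created_at FROM otp_store WHERE email = ?", (email,)
    ).fetchone()
    if row is None:
        return False
    stored_otp, created_at = row
    if now - created_at > OTP_TTL_SECONDS or stored_otp != otp:
        return False
    # Assumed behaviour: an OTP can be used once.
    conn.execute("DELETE FROM otp_store WHERE email = ?", (email,))
    conn.commit()
    return True
```

Because the record lives in the database rather than in process memory, it does not matter which worker handles the verification request.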
Slot double-booking race condition. When two candidates simultaneously tried to book the last available slot, both would pass the capacity check and both bookings would commit, exceeding the slot limit. This was resolved using a SELECT ... FOR UPDATE row lock on the time slot record inside a database transaction. The second request waits at the lock until the first either commits or rolls back before it can read the capacity count — guaranteeing slot limits are never exceeded regardless of concurrent traffic.
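The check-then-insert pattern can be sketched in Python as below. SQLite has no SELECT ... FOR UPDATE, so this runnable example swaps in SQLite's BEGIN IMMEDIATE, which takes the write lock before the capacity is read; on MySQL the equivalent step would be a SELECT ... FOR UPDATE on the slot row inside the transaction. The table names, column names and function names are hypothetical.

```python
import sqlite3


def open_db(path=":memory:"):
    # isolation_level=None lets the code issue BEGIN/COMMIT explicitly.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS time_slots ("
        "id INTEGER PRIMARY KEY, capacity INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bookings ("
        "slot_id INTEGER NOT NULL, candidate_id INTEGER NOT NULL, "
        "UNIQUE (slot_id, candidate_id))"
    )
    return conn


def book_slot(conn, slot_id, candidate_id):
    """Return True if the booking was stored, False if the slot is full or unknown."""
    # Take the write lock first, so no other writer can commit a booking
    # between the capacity check and the insert.
    conn.execute("BEGIN IMMEDIATE")
    try:
        slot = conn.execute(
            "SELECT capacity FROM time_slots WHERE id = ?", (slot_id,)
        ).fetchone()
        booked = conn.execute(
            "SELECT COUNT(*) FROM bookings WHERE slot_id = ?", (slot_id,)
        ).fetchone()[0]
        if slot is None or booked >= slot[0]:
            conn.execute("ROLLBACK")
            return False
        conn.execute(
            "INSERT INTO bookings (slot_id, candidate_id) VALUES (?, ?)",
            (slot_id, candidate_id),
        )
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

The key point is the ordering: the lock is acquired before the count is read, so a second request cannot see a stale count.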
Terraform deployment failing on CloudFront ACM certificate. The CloudFront distribution was failing because the ACM certificate provider alias was incorrectly set to us-west-2. CloudFront is a global service that requires SSL certificates to exist specifically in us-east-1 — this is a hard AWS constraint with no workaround. The fix was correcting the provider alias in versions.tf from us_west_2 to us_east_1.
All infrastructure is provisioned with Terraform: the CloudFront distribution, ALB, EC2 Auto Scaling Groups, RDS MySQL, SSM Parameter Store and CloudWatch alarms are defined as code and deployed via GitHub Actions CI/CD. CloudWatch alarms on both ASGs trigger scale-out above 70% CPU and scale-in below 30%. RDS alarms fire when CPU exceeds 80% or free storage drops below 5 GB. The CloudWatch Agent runs on the EC2 instances with permissions granted by an IAM policy attached to the instance role, streaming system logs and metrics automatically. RDS exports slow query, error and general logs to CloudWatch Logs for query-level debugging.
Open to full-time roles, freelance infrastructure projects, and consulting. Drop me a message — I respond within 24 hours.