10+ years of engineering resilient, scalable infrastructure, from bare-metal NOC operations to large-scale Kubernetes orchestration on GCP & AWS.
I'm Duc Dinh, a DevOps & Site Reliability Engineer based in Ho Chi Minh City, Vietnam. I specialize in designing and operating cloud-native infrastructure that's built to scale and built to last.
My journey started in the NOC — monitoring screens at 3AM, handling incidents by runbook, learning what "reliability" really means at the ground level. That foundation shaped everything I build today: observable, automated, and resilient systems.
Currently at Inspectorio, I operate the GCP/GKE platform behind an AI-powered supply-chain SaaS serving 12,000+ global brands & retailers — and I'm building OpsRAG, an AI-powered SRE investigation assistant that turns incident context, runbooks, and infra topology into answers.
Operate the GCP/GKE platform behind an AI-powered supply-chain SaaS used by 12,000+ global brands, retailers & multi-tier suppliers. Currently building OpsRAG, a GraphRAG-based SRE investigation assistant. Led the Istio native-sidecar migration and designed Cloud SQL IAM auth + JIT access; own incident response and the DevOps tooling roadmap across four environments.
Core/digital banking modernization for a Tier-1 bank, delivered by GFT (12,000+ engineers across 20+ countries). Member of the Data Foundation Squad, owning Kafka, Airflow, databases & Redis. Executed 3–5 TB cross-account / cross-region database migrations with a zero-data-loss SLO; migrated a legacy Chef golden-image pipeline to Packer + AWS CDK.
Global software engineering & digital consultancy. Built and administered CI/CD and centralized logging across multiple international client engagements; cut release lead-time through pipeline standardization and reusable templates, and introduced containers, service mesh & IaC modules to the practice.
Cloud infrastructure & international money-transfer (remittance) services. Started in network operations and 3AM incident monitoring, then grew into IaC, CI/CD, and cloud architecture — building the foundation of incident management and infrastructure automation that shapes how I work today.
A GraphRAG-based SRE investigation platform. A LangGraph agent with hypothesis-tree reasoning, an MCP gateway exposing tools for Rootly, Datadog, GKE & Elasticsearch, semantic Q&A caching, and a DeepEval evaluation pipeline with a custom Vertex Gemini judge — surfaced to engineers through a Slack bot.
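As a rough illustration of what semantic Q&A caching means in general (this is not OpsRAG's actual code), the sketch below returns a stored answer when a new question is close enough to one already answered. The names `toy_embed`, `SemanticCache`, and the 0.8 threshold are assumptions made for this example; the bag-of-words "embedding" is only a stand-in for a real embedding model.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    # Stand-in embedding (assumption): lowercase word counts.
    # A real system would call an embedding model here.
    return {word: float(n) for word, n in Counter(text.lower().split()).items()}


def cosine(a: Vector, b: Vector) -> float:
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical cache: look up answers by question similarity, not exact text."""

    def __init__(self, embed: Callable[[str], Vector] = toy_embed,
                 threshold: float = 0.8) -> None:
        self._embed = embed
        self._threshold = threshold
        self._entries: List[Tuple[Vector, str]] = []

    def put(self, question: str, answer: str) -> None:
        self._entries.append((self._embed(question), answer))

    def get(self, question: str) -> Optional[str]:
        # Linear scan for the most similar stored question.
        query = self._embed(question)
        best_score, best_answer = 0.0, None
        for vector, answer in self._entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self._threshold else None


cache = SemanticCache()
cache.put("why is checkout pod crashlooping", "OOMKilled; raise the memory limit")
hit = cache.get("why is the checkout pod crashlooping")   # near-duplicate wording
miss = cache.get("how do I rotate database credentials")  # unrelated question
```

The threshold trades recall for safety: set it too low and the cache may answer a different question than the one asked.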
A service-mesh topology analyzer & visualizer. Parses Kong & Istio config specs and renders a self-contained HTML report — drop in a URL like domain.com/path and it points you to the matching ingress, route, and upstream service. Custom label-based classification and distribution charts across providers and routes.
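The URL lookup described above can be pictured as a longest-prefix match over parsed routes. The following is a minimal sketch under stated assumptions, not the analyzer's real implementation: the `Route` fields, the `resolve` function, and the sample ingress and service names are all hypothetical, and real Kong or Istio matching has more rules (regex paths, methods, headers) than shown here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Route:
    # Hypothetical flattened view of one route taken from parsed config.
    ingress: str
    name: str
    host: str
    path_prefix: str
    upstream: str


def split_url(url: str) -> Tuple[str, str]:
    # Accept scheme-less input such as "domain.com/path".
    if "://" not in url:
        url = "//" + url
    parts = urlsplit(url)
    return parts.hostname or "", parts.path or "/"


def prefix_matches(prefix: str, path: str) -> bool:
    # Match on whole path segments so "/orders" does not match "/ordersx".
    if prefix == "/":
        return True
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


def resolve(url: str, routes: List[Route]) -> Optional[Route]:
    # Return the route with the longest matching prefix for the URL's host.
    host, path = split_url(url)
    candidates = [
        r for r in routes
        if r.host == host and prefix_matches(r.path_prefix, path)
    ]
    return max(candidates, key=lambda r: len(r.path_prefix), default=None)


routes = [
    Route("kong-public", "web-root", "domain.com", "/", "web"),
    Route("istio-gateway", "orders", "domain.com", "/orders", "orders-svc"),
]
match = resolve("domain.com/orders/42", routes)
```

Here `match` is the `orders` route, so the report would point at the `istio-gateway` ingress and the `orders-svc` upstream in this made-up data.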
Open for freelance, consulting, and interesting infrastructure challenges.