Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
129 changes: 129 additions & 0 deletions updates/build-log/navigate/on-platform-ops.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,129 @@
# PlatformOps & Platform Engineering

## Introduction: Why PlatformOps Matters to Us

Hey there, fellow Dwarves! In today’s fast-paced software world, we’re all about crafting solutions that are quick, reliable, and secure. But let’s be real: as we scale, the complexity of juggling cloud infrastructure, container orchestration, CI/CD pipelines, and security protocols can bog down even the sharpest of us. Developers shouldn’t have to become jacks-of-all-trades just to ship code. That’s where PlatformOps, powered by the discipline of Platform Engineering, comes in as our strategic hammer to smash through these challenges.

PlatformOps isn’t just a buzzword in our woodland; it’s our way of enabling developer productivity and operational excellence at scale. Think of it as building a sturdy forge where our developers can craft their best work without worrying about the furnace breaking down. This guide is your map to understanding PlatformOps, our roadmap for building it, and the skills you’ll need to wield to make it happen.

## What is PlatformOps?

### The Heart of PlatformOps

At its core, PlatformOps is about designing, building, and running a high-automation, self-service Internal Developer Platform (IDP). We treat this IDP like a first-class product, and our customers? That’s you, our own Dwarves developers. It’s a set of standardized tools and automated workflows, or “golden paths,” that let you build, deploy, and manage applications with minimal friction and maximum independence.

### Why We Do It

Our mission with PlatformOps is simple: cut down the mental load on you, our developers, and crank up the speed of our engineering. We abstract away the messy underbelly of infrastructure and ops, so you can focus on what you love, crafting killer application code. Here’s what we’re aiming for:

- **Speed Up Delivery**: Get your code from commit to production in record time.
- **Boost Reliability**: Build on a secure, standardized, and observable foundation.
- **Embed Governance & Security**: Weave compliance and cost controls right into the platform.
- **Elevate Developer Experience (DevEx)**: Make shipping software smooth, fun, and empowering.

### How It Fits with DevOps and SRE

Wondering how PlatformOps stacks up against other terms you’ve heard? Here’s the breakdown:

- **DevOps**: A cultural mindset of collaboration between dev and ops to drive automation and speed.
- **PlatformOps**: The practical, scaled-up version of DevOps, delivered through a dedicated internal platform.
- **SRE (Site Reliability Engineering)**: A focus on using software engineering to ensure rock-solid reliability.

We see PlatformOps as the bridge that turns DevOps philosophy into actionable, everyday tools for our team.

## Our PlatformOps Roadmap

### Treating the Platform as a Product

Just like we craft software with care, our IDP is a product with its own roadmap, ownership, and feedback loop. We’re not just throwing tools together; we’re building something with purpose. Our roadmap starts with identifying pain points in your workflow, then prioritizing features that tackle those first. For example, if provisioning environments takes forever, that’s step one on our list.

### Paving the Golden Path

We’re all about making the right choices easy. Our “golden paths” are pre-built templates and workflows for common tasks, like spinning up a microservice or setting up a CI/CD pipeline. Think of it as a well-trodden trail through the forest: you can stray if you need to, but the path is there to guide you. Right now, we’re working on a golden path for deploying to Kubernetes with secure defaults baked in.

### Building for Self-Service

You shouldn’t have to wait on another team to get resources or deploy code. Our goal is to let you provision what you need through a self-service portal. We’re rolling out tools like Backstage.io to make this happen, so you can click a few buttons and have a new environment ready to go.

### Staying Opinionated but Flexible

We provide sensible defaults (think Helm charts with pre-configured security settings), but we’re not here to lock you in. If you’ve got a unique use case, we’ve got documented “escape hatches” to let you customize. For instance, our default CI/CD pipeline uses GitHub Actions, but you can switch to GitLab CI if that’s your jam.

### Automating Everything

If a task repeats, it gets automated. From infrastructure provisioning with Terraform to auto-scaling with Kubernetes, we’re embedding automation into the platform. One of our first wins was automating off-hours shutdowns for non-production environments, saving us bucks and headaches.

## Skills of a Platform Engineer at Dwarves

To forge a world-class platform, you need a mix of broad knowledge and deep expertise. We call this the T-shaped skill set: wide across many areas, with sharp depth in a few. Whether you’re a seasoned engineer or just stepping into this realm, here’s what we value and where we’re growing.

### Core DevOps & Infrastructure Skills

These are the building blocks of any platform engineer in our woodland:

- **Cloud Platforms**: Familiarity with AWS, GCP, and Azure. You don’t need to know every nook, but you should be comfy spinning up resources.
- **Infrastructure as Code (IaC)**: Tools like Terraform or Pulumi to define infrastructure declaratively. For example, we use Terraform to manage our Kubernetes clusters.
- **Containers & Orchestration**: Docker for containerization and Kubernetes for orchestration. We’ve got a cluster running most of our services, and knowing how to debug a pod is key.
- **CI/CD**: Experience with GitHub Actions, GitLab CI, or Jenkins. Setting up a pipeline to auto-deploy on merge is a common task here.

### DataOps & Reliability

Keeping things running smoothly means understanding data and observability:

- **Observability Tools**: Prometheus for metrics, Grafana for dashboards, and Jaeger for tracing. We’ve set up dashboards to track app performance, and you’ll help maintain them.
- **Data Pipelines**: Tools like Kafka or Airflow. If you’ve ever streamed data for real-time analytics, you’re ahead of the game.
- **Databases**: Hands-on with PostgreSQL or Redis. We rely on these for most projects, so tuning performance is a handy skill.

### Security & SecOps

Security isn’t an afterthought; it’s baked into our platform:

- **DevSecOps Tools**: Familiarity with Snyk for vulnerability scanning or Trivy for container security. We scan every image before deployment.
- **Secrets Management**: Using Vault or Sealed Secrets to keep sensitive data safe. We’ve got a setup for storing API keys securely, and you’ll help manage it.
- **Hardening Practices**: Knowledge of Kubernetes and cloud IAM best practices. Locking down access is something we take seriously.

### MLOps (The New Frontier)

As we dive into machine learning, these skills are becoming crucial:

- **ML Serving**: Tools like KServe or Seldon Core to deploy models. We’ve got a project serving predictions, and it’s a growing area.
- **Experiment Tracking**: Using MLflow to log experiments. If you’ve tracked model versions, you’re already on track.
- **Vector Databases**: Pinecone or Milvus for AI workloads. These are niche but powerful for our future projects.

### Developer Experience & Quality

Making life easier for our devs is at the heart of what we do:

- **SRE Basics**: Understanding SLIs, SLOs, and error budgets. We’ve got SLOs for app uptime, and you’ll help track them.
- **Testing Infra**: Tools like Terratest to validate infrastructure. We test our Terraform configs before applying them.
- **Documentation**: Writing clear, discoverable guides. If you’ve ever documented a process for your team, that’s the vibe we’re after.

### FinOps & Cost Awareness

We don’t just build platforms; we make sure they’re sustainable:

- **Cost Visibility**: Using tools like AWS Cost Explorer to track spending. We tag everything to know who’s spending what.
- **Optimization**: Right-sizing resources and picking cost-effective pricing models (like Spot Instances). We saved 30% last quarter by switching some workloads to Spot.
- **Governance**: Setting budget alerts and guardrails with Policy as Code. We’ve got alerts to catch runaway costs before they hit.

## Measuring Our Success

How do we know PlatformOps is working? We track metrics that matter to both the business and you, our developers:

- **DORA Metrics**: Lead time for changes, deployment frequency, mean time to recovery (MTTR), and change failure rate. We’re aiming to deploy multiple times a day with minimal hiccups.
- **Developer Experience (DevEx)**: We run NPS-style surveys to gauge how you feel about the platform. If it’s not making your life easier, we’re not done.
- **Reliability**: Tracking SLO compliance and error budgets. We’ve got a 99.9% uptime target for our core services.
- **Cost Efficiency**: Measuring savings from automation and right-sizing. Last month, automation cut our idle resource costs by 15%.

## The Future of PlatformOps in Our Woodland

We’re not just building for today; we’re crafting for tomorrow. Here’s what’s on the horizon:

- **AI-Assisted Engineering**: Using AI to automate IaC and CI/CD setups. Imagine a tool suggesting pipeline optimizations on the fly.
- **Integrated FinOps**: Bringing real-time cost insights into our IDP, so you see the price tag before you deploy.
- **Beyond Kubernetes**: Moving toward code-to-URL abstractions with serverless tech, simplifying deployments even further.
- **Edge Platforms**: Extending our platform to support IoT and distributed systems as we explore new territories.

## Conclusion: Forging Ahead Together

PlatformOps isn’t just a technical discipline for us at Dwarves Foundation; it’s a strategic enabler that lets us scale innovation, speed up delivery, and keep governance, reliability, and cost efficiency in check. By treating our platform as a product and putting your developer experience first, we’re building a forge where every Dwarf can craft their best work. Got ideas or pain points? Let’s chat. Together, we’ll keep sharpening this tool to carve out a brighter future in our woodland.