Abstraction, not syntax

written by Ruud van Asseldonk
published

The world is growing tired of yaml. Alternative configuration formats are making the rounds. Toml has steadily been gaining traction, in part due to tools like Cargo and adoption in the Python standard library. Json supersets (with comments, commas, and the digit 5) are flourishing, while KDL, kson and now maml promise to hit the sweet spot between friendly and simple.

While I do believe that yaml is harmful, all of the simpler formats are basically fine, and their differences are mostly superficial. The one real difference is in their data models. Most formats adopt the json data model of objects and arrays, while KDL, HCL, and e.g. Nginx adopt the XML data model of named nodes with attributes and children. The rest is just syntax. And yes, syntax does matter, but line noise is not the real problem here!

Syntax is superficial

Let’s look at an example: suppose we need to define cloud storage buckets to store backups. We want to back up two databases: Alpha and Bravo. For each we need three buckets: one each for hourly, daily, and monthly backups. Each bucket should have a lifecycle policy that deletes backups after 4, 30, and 365 days respectively. We don’t want to click around, so we’ll set this up with an infrastructure-as-code tool, using the following hypothetical configuration file:

{
  "buckets": [
    {
      "name": "alpha-hourly",
      "region": "eu-west",
      "lifecycle_policy": { "delete_after_seconds": 345600 }
    },
    {
      "name": "alpha-daily",
      "region": "eu-west",
      "lifecycle_policy": { "delete_after_seconds": 2592000 }
    },
    {
      "name": "alpha-monthly",
      "region": "eu-west",
      "lifecycle_policy": { "delete_after_seconds": 31536000 }
    },
    {
      "name": "bravo-hourly",
      "region": "us-west",
      "lifecycle_policy": { "delete_after_seconds": 345600 }
    },
    {
      "name": "bravo-daily",
      "region": "eu-west",
      "lifecycle_policy": { "delete_after_seconds": 259200 }
    },
    {
      "name": "bravo-monthly",
      "region": "eu-west",
      "lifecycle_policy": { "delete_after_seconds": 31536000 }
    }
  ]
}

Yes, this file would look friendlier in a different format. But the file also contains two bugs, and switching formats is not going to fix those. Can you spot them? To avoid spoilers, here’s a bit of yaml to pad the page. I’ll even throw in some comments for clarity:

buckets:
  - name: "alpha-hourly"
    region: "eu-west"
    lifecycle_policy:
      delete_after_seconds: 345600  # 4 days
  - name: "alpha-daily"
    region: "eu-west"
    lifecycle_policy:
      delete_after_seconds: 2592000  # 30 days
  - name: "alpha-monthly"
    region: "eu-west"
    lifecycle_policy:
      delete_after_seconds: 31536000  # 365 days
  - name: "bravo-hourly"
    region: "us-west"
    lifecycle_policy:
      delete_after_seconds: 345600  # 4 days
  - name: "bravo-daily"
    region: "eu-west"
    lifecycle_policy:
      delete_after_seconds: 259200  # 30 days
  - name: "bravo-monthly"
    region: "eu-west"
    lifecycle_policy:
      delete_after_seconds: 31536000  # 365 days

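What’s wrong? The bravo-hourly bucket is in us-west while every other bucket is in eu-west, and bravo-daily is missing a zero: 259200 seconds is 3 days, not 30. The yaml comment next to it still says 30 days.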

Would you have caught those in review? Now suppose we need to add a third database, Charlie. We copy-paste the three stanzas, and change bravo to charlie. Congrats, we now copied the bugs. And now that LLMs are catching on, we don’t even have to copy the bugs manually any more!

Adopting a different format might make the file easier on the eye, but it doesn’t reduce repetition, and therefore it doesn’t address the real problem.
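To make this concrete, here is a minimal sketch in Python that prints the same data in more natural units. The six buckets are copied from the json above into a list of tuples; the helper name `summarize` is made up for this illustration.

```python
from collections import Counter

# (name, region, delete_after_seconds), copied verbatim from the config above.
BUCKETS = [
    ("alpha-hourly", "eu-west", 345600),
    ("alpha-daily", "eu-west", 2592000),
    ("alpha-monthly", "eu-west", 31536000),
    ("bravo-hourly", "us-west", 345600),
    ("bravo-daily", "eu-west", 259200),
    ("bravo-monthly", "eu-west", 31536000),
]

SECONDS_PER_DAY = 24 * 3600


def summarize(buckets):
    """Return (retention in days per bucket, number of buckets per region)."""
    days = {name: seconds / SECONDS_PER_DAY for name, _, seconds in buckets}
    regions = Counter(region for _, region, _ in buckets)
    return days, regions


days, regions = summarize(BUCKETS)
for name, d in days.items():
    print(f"{name:14} {d:6.1f} days")
print(dict(regions))
```

In this output bravo-daily shows up as 3.0 days rather than 30, and the region count is five for eu-west against one for us-west. A check like this only surfaces the symptoms after the fact. The repetition that let them in is still there.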

Abstraction

While reducing line noise is a noble goal, enabling abstraction is a more impactful feature. We can bikeshed about quote styles and trailing commas, but what we really need is a for loop. This is what that same configuration looks like in RCL:

{
  buckets = [
    let period_retention_days = {
      hourly = 4,
      daily = 30,
      monthly = 365,
    };
    for database in ["alpha", "bravo"]:
    for period, days in period_retention_days:
    {
      name = f"{database}-{period}",
      region = "eu-west",
      lifecycle_policy = { delete_after_seconds = days * 24 * 3600 },
    }
  ],
}

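It’s a bit more to take in at first, but if you’ve ever seen Python, Rust or TypeScript, you can read this file. (For a more gentle introduction, check out the tutorial.) Note how we eliminated two categories of bugs, and made the file more maintainable. The region is written once, so buckets can no longer silently disagree about it. The retention is computed from a number of days, so there is no long number of seconds to mistype. And adding a third database means adding one string to a list, not copy-pasting three stanzas.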

This is where the real win is!
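The loop itself is not specific to RCL. As a rough sketch, the same generator could be written in plain Python that prints json. The function name `make_buckets` and its `region` parameter are choices made for this illustration, not part of any tool.

```python
import json

DATABASES = ["alpha", "bravo"]
PERIOD_RETENTION_DAYS = {"hourly": 4, "daily": 30, "monthly": 365}


def make_buckets(databases, period_retention_days, region="eu-west"):
    """Build one bucket per (database, period) pair."""
    return [
        {
            "name": f"{database}-{period}",
            "region": region,
            # Computed from days, so nobody has to type the seconds by hand.
            "lifecycle_policy": {"delete_after_seconds": days * 24 * 3600},
        }
        for database in databases
        for period, days in period_retention_days.items()
    ]


config = {"buckets": make_buckets(DATABASES, PERIOD_RETENTION_DAYS)}
print(json.dumps(config, indent=2))
```

Adding Charlie is then a one-word change to `DATABASES`, and the region and the retention arithmetic each live in exactly one place.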

Trade-offs

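While generating configuration solves problems, it also creates new ones. A name like alpha-hourly no longer appears literally in the source, so searching for it turns up nothing, and a small change in the generator can affect many outputs in ways that are hard to judge from the generator’s diff alone.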

When we have configuration files under source control, we can mitigate these points somewhat by checking in the generated files. (Heresy, I know!) This restores the ability to search, and we can now review changes from both sides: we can see the diff in the generator, but also how it impacts generated files.
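Checked-in output is only useful if it cannot drift from the generator, so it helps to have a check that fails when the two disagree. A minimal sketch in Python, assuming the generated file is json; the names `render` and `is_up_to_date` and the file name `buckets.json` are hypothetical, and the temporary directory stands in for the repository.

```python
import json
import tempfile
from pathlib import Path


def render(config: dict) -> str:
    """Serialize deterministically, so equal configs give identical text."""
    return json.dumps(config, indent=2, sort_keys=True) + "\n"


def is_up_to_date(config: dict, path: Path) -> bool:
    """True if the checked-in file at `path` matches the generator's output."""
    return path.is_file() and path.read_text(encoding="utf-8") == render(config)


# Demo in a temporary directory standing in for the repository.
with tempfile.TemporaryDirectory() as tmp:
    checked_in = Path(tmp) / "buckets.json"
    config = {"buckets": [{"name": "alpha-hourly", "region": "eu-west"}]}
    checked_in.write_text(render(config), encoding="utf-8")
    fresh = is_up_to_date(config, checked_in)

    # Change the generator's input without regenerating the file.
    config["buckets"][0]["region"] = "us-west"
    stale = not is_up_to_date(config, checked_in)

print(fresh, stale)
```

Run in CI, a check along these lines would reject a commit that changes the generator but forgets to regenerate the files, which keeps both sides of the diff honest.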

Tools

Configuration languages like Cue, Dhall, Jsonnet, or RCL are designed specifically to eliminate boilerplate in repetitive configuration, but you don’t necessarily have to introduce new tools. A bit of Python or Nix that outputs json or toml can go a long way. Remember that yaml is a json superset, so anything that takes yaml accepts json! Deduplication beats copy-pasting, and manipulating data structures is safer than string templating.
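To illustrate that last point, here is a small Python sketch that renders the same value twice: once by pasting it into a string template, and once by building a data structure and serializing it. Both function names and the sample value are made up for this example.

```python
import json


def render_with_template(name: str) -> str:
    # String templating: the value is pasted into the text without escaping.
    return '{"name": "%s"}' % name


def render_with_data(name: str) -> str:
    # Data structure first, serialization last: the serializer does the escaping.
    return json.dumps({"name": name})


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


tricky = 'backup "alpha"'  # a value that happens to contain quotes
print(is_valid_json(render_with_template(tricky)))
print(is_valid_json(render_with_data(tricky)))
```

The templated version produces text that no longer parses as json once the value contains a quote, while the serialized version escapes it and round-trips to the original string.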

Conclusion

The world is growing tired of yaml, and alternative configuration formats are making the rounds. I applaud replacing yaml with simpler formats like toml, and I sure prefer working with pretty code over working with ugly code. However, I also think that arguing over which multi-line string syntax is superior misses a deeper issue.

When configurations grow more complex, what we really need is abstraction to eliminate duplication. Real abstraction over data structures, not string templating. That means turning configuration, at least to some extent, into code. How to navigate the balance between code and data remains a matter of applying good judgment, but for large enough configurations, the right balance is unlikely to be at 100% data.
