The DSC Configuration Data Problem

I’m writing and using a PowerShell module called “Datum” to manage DSC Configuration Data, and raise cattle at work. This write-up explains the problem, and why I’m writing the module.

I assume you already know a bit about Desired State Configuration, but if it’s new to you or vague, go grab yourself The DSC Book by Don Jones & Missy Januszko, which packs all you need to know about out-of-the-box DSC, and more.

This post tries to explain the DSC Configuration Data composition challenge that I attempt to solve with Datum. It does not include the solution and how to use Datum to solve those problems: you can look on the GitHub repository and see an example of a Control Repository, or come to the PowerShell Summit 2018, where I’ll be presenting Datum used in a DSC Infrastructure.

Raise cattle, not pets!

DSC is a platform with a nice way of using a declarative syntax to define the state required for an infrastructure, offloading the process of making that transformation happen (the actual changes to the systems) to lower levels of abstraction; mainly the (theoretically) idempotent DSC Resources enacted by the Local Configuration Manager (aka LCM).

The common challenge people face can be summarised like so: How do you avoid pets with DSC, while staying away from Partial Configurations, with their risks and quirks?

What does it take to raise cattle?

When we talk about raising cattle, no pets, or use the snowflake analogy, we mean we don’t want our nodes to be uniquely crafted, shaped over time into something that can’t be concisely defined. We need to build a mould, so that what’s moulded from it can be reproduced consistently and at low cost.

In terms of configuration management, that means we need a definition that can be applied to any number of Nodes and give us the same result. In other words, the configuration is generic across multiple nodes.

With DSC, one way to do this could be to create a named configuration, and apply that configuration to a number of Nodes. That can work, but it comes with several assumptions and limitations:

  • You have no unique piece of data specific to a node in your configuration (e.g. hostname, certificate, machine-specific metadata)
  • You can’t do end-to-end system configuration with one DSC configuration (unique data is usually required for this, like hostname, VM ID…)
  • All nodes in a ‘role’ must be configured identically. They can’t, for example, have a node-specific certificate for Credential Encryption.

I know, those three points say roughly the same thing, but illustrate different impacts when managing infrastructure from code.

Another way to do this is to compose the Configuration Data in roles or components in a way that can be re-used across different nodes, while still allowing node-specific information to avoid the limitations of generic named configurations. This is known as separating the Node data from the NonNodeData.
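A minimal sketch of what such a layout can look like is shown here. The node names, the Role property and the DHCP value are made up for illustration; only the AllNodes key is required by DSC, and any other top-level key (DHCPData here) acts as non-node data.

```powershell
# Illustrative data only: node names, roles and DHCP values are assumptions.
$ConfigurationData = @{
    # Node data: one hashtable per node, exposed as $Node inside a Configuration.
    AllNodes = @(
        @{
            NodeName = 'DHCP01'
            Role     = 'DHCPServer'
        },
        @{
            NodeName = 'DHCP02'
            Role     = 'DHCPServer'
        }
    )

    # Non-node data: shared by every node, read as $ConfigurationData.DHCPData.
    DHCPData = @{
        LeaseDurationDays = 8
    }
}
```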

To my knowledge, this is the state of the art of DSC Configuration Data possible with DSC out-of-the-box (that is, without tooling). This approach has been brilliantly explained and documented by Missy Januszko in her DSC Configuration Data layout tips and tricks article.

Why is Non-Node data not enough?

The problem with the approach above is it only provides access to two types of data that can hardly be mixed together dynamically:

  • The Node Specific Data: e.g. $Node.nodename
  • The Role Specific Data: e.g. $ConfigurationData.DHCPData

The problem here is that when accessing a piece of data (the property $ConfigurationData.DHCPData.FooBar), it will always provide the same result for each and every $Node.

Hear me out.

Yes it’s good for cattle, but it’s not flexible enough for most real-life infrastructure, where the generic information can vary slightly depending on, for instance, the location of the Node, the Environment it’s running in, the customer it’s targeting, or whatever makes sense in your business context.

You can blend the data in DSC Configurations and DSC Composite Resources, but that’s code, and most of the time it creates tight coupling between the DSC Composite Resource and the infrastructure it targets.

This could look like the following, in a DSC Composite resource:
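As a hypothetical sketch of that kind of coupling (the configuration name, the Customer parameter and the NTP host names are illustrative assumptions, not code from a real module):

```powershell
# Hypothetical composite resource: customer-specific logic lives in the code.
Configuration TimeSettings {
    param(
        [Parameter(Mandatory)]
        [string] $Customer
    )

    Import-DscResource -ModuleName PSDesiredStateConfiguration

    # The data decision is made here, below the Configuration Data layer.
    if ($Customer -eq 'CustomerA') {
        $ntpServer = 'ntp.customer-a.example'
    }
    else {
        $ntpServer = 'ntp.default.example'
    }

    # Built-in Registry resource, used only to give the value somewhere to land.
    Registry NtpServer {
        Key       = 'HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Parameters'
        ValueName = 'NtpServer'
        ValueData = $ntpServer
        ValueType = 'String'
        Ensure    = 'Present'
    }
}
```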

In this case you can see that the Configuration is tightly coupled with the need for CustomerA, and would limit its usefulness when shared publicly. Another, bigger, problem is that the logic of the Configuration Data is already passed down to a lower layer of abstraction.

There are many other ways to achieve this goal, but with native DSC, it’ll most likely be a variant of the above.

Building Configuration Data Dynamically

Many of those not afraid to go beyond the out-of-the-box DSC have found ways to build the Configuration Data dynamically, from one or several sources.

What I’ve seen a few times in the wild is the $ConfigurationData hashtable being composed before compiling the MOF, usually by custom scripts pulling data from different systems, typically databases or REST-like services.

Databases, or other applications, can be bad if they are the human interface to the configuration, and where users make changes.

The principle works well, but tends to be tied to the infrastructure it’s used in (the composition being done in the script or database schema, means the rules are hidden in that code, and specific to the business context).

The main benefit of this approach is that relational data is easy to query. It works great for reporting, monitoring and aggregating with different sources. What it’s really bad at, however, is making changes self-documenting, frictionless, versioned and reproducible locally, with an editing workflow that does not need a custom solution.

In other words, databases are a nice querying platform, but not a good user interface for reading and editing configuration data. It also relies on technologies that introduce more complexity to the system (i.e. SQL or NoSQL databases). Yes, you could build interfaces on top of the database, but that’s overly complex, and re-inventing the wheel.

Luckily, there are tested and mature solutions with great user interfaces to display and manipulate configuration data: Files and Version Control Systems, such as git!

…Dynamically yes, but from files!

Despite the risk of losing some of the data querying capabilities (for now), you could move that data into structured files and folders, and use a version control system to manage the changes and enable collaborative work. Use git locally and with a git server such as GitHub, VSTS, GitLab, Bitbucket or the one of your choice, and you have a great collaboration and documentation tool.

Technically, the Nodes may not be best managed in files, when you have more than a couple of hundreds of them (Datum can be easily extended to support other storage technologies).

Make sure all changes come from this central repository, and you have a single source of truth and documentation you can rely on and enforce from.

Add tests (to enforce logic, security, business constraints), gates (approvals), controls (change reviews), workflow around changes/releases (branching, pull requests), connect to your favourite CI tool (to automate those tasks on changes), and you can continuously improve quality, repeatability and security, building confidence in your system and team over time, while reducing re-work.

Then assemble those files into the $ConfigurationData hashtable you need and you have it: An Infrastructure definition composed by Configuration Data, and the DSC Code Constructs (Configuration, Composite and Resource).
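A minimal sketch of that assembly step follows. It is not how Datum does it: the Get-FileConfigurationData function and the AllNodes/NonNodeData folder layout are assumptions made for illustration, with one psd1 file per node and one per shared data set.

```powershell
function Get-FileConfigurationData {
    # Hypothetical helper, not part of Datum: builds $ConfigurationData from psd1 files.
    param(
        [Parameter(Mandatory)]
        [string] $Path
    )

    # One file per node under <Path>/AllNodes; each file holds a single hashtable.
    $nodeFiles = Get-ChildItem -Path (Join-Path $Path 'AllNodes') -Filter '*.psd1' |
        Sort-Object -Property Name
    $allNodes = @(
        foreach ($file in $nodeFiles) {
            Import-PowerShellDataFile -Path $file.FullName
        }
    )

    $configurationData = @{ AllNodes = $allNodes }

    # One file per shared data set under <Path>/NonNodeData; the file name becomes the key.
    $sharedFiles = Get-ChildItem -Path (Join-Path $Path 'NonNodeData') -Filter '*.psd1'
    foreach ($file in $sharedFiles) {
        $configurationData[$file.BaseName] = Import-PowerShellDataFile -Path $file.FullName
    }

    $configurationData
}
```

The resulting hashtable could then be passed to a compiled configuration through its -ConfigurationData parameter.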

This is the start of a Policy-Driven Infrastructure, better known by the less accurate, broader, but more popular term Infrastructure as Code (also a great book, by Kief Morris).

This principle is what’s recommended in The Release Pipeline Model whitepaper, written and presented by Michael Greene and Steven Murawski, but applied to Configuration Data!

As a side note, if you hear about Test-Driven Infrastructure (TDI), it is, in my opinion, the result of the Policy-Driven infrastructure concept, applied with the Test-Driven Development (TDD) best practice.

Is that all? What’s new?

There’s more to it, like how to structure your files by optimizing on change scope, and flexibility while ensuring you’re not drifting towards a pet factory, but that’s out of scope for now.

No, it’s not new at all. It’s been done in other Configuration Management Solutions, and back in 2014, Steve Murawski, then working for Stack Exchange, led the way in DSC by implementing some tooling and open-sourcing it on the PowerShell.Org GitHub. This work has been enhanced by Dave Wyatt’s contributions, mainly around the Credential store. After these two main contributors moved on from DSC and Pull Server mode, the project stalled (in the Dev branch), despite its unique value.

I refreshed this to be more geared for PowerShell 5, and updated the dependencies, as some projects had evolved and moved to different maintainers, locations, and names.

As I was re-writing it, I found that that version offered a very good way to manage configuration data, but in a prescriptive way that was lacking a bit of flexibility for some much needed customisation (layers and ordering). Steve also pointed me to Chef’s Databags, and later I discovered Puppet’s Hiera, which is where I get most of my inspiration for Datum. Worth noting that Trond Hindenes has introduced me to the Ansible approach to Roles and Playbooks, which is also similar (or seems to be, from the little I know about it).

Conclusion: why Datum?

Datum is a PowerShell Module that enables you to manage a Policy-Driven Infrastructure using Desired State Configuration (DSC), by letting you organise the Configuration Data in files arranged in a hierarchy adapted to your business context, and injecting it into Configurations based on the Nodes and the Roles they implement.

This (opinionated) approach lets you raise cattle instead of pets, while facilitating the management of Configuration Data (the Policy for your infrastructure) and providing defaults with the flexibility of specific overrides, per layer, based on your environment.
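To make the idea of defaults with per-layer overrides concrete, here is a minimal sketch. It is not Datum’s implementation or API: the Resolve-LayeredValue function, the layer names and all values are assumptions made for illustration, and the rule used is simply that the most specific layer defining a key wins.

```powershell
function Resolve-LayeredValue {
    # Hypothetical helper, not Datum's API: the first layer that defines the key wins.
    param(
        [Parameter(Mandatory)]
        [hashtable[]] $Layers,   # ordered from most specific to most generic

        [Parameter(Mandatory)]
        [string] $Key
    )

    foreach ($layer in $Layers) {
        if ($layer.ContainsKey($Key)) {
            return $layer[$Key]
        }
    }
    # No layer defines the key: nothing is returned.
}

# Illustrative layers: a node, its site and its role (all values are made up).
$nodeLayer = @{ NodeName = 'SRV01' }
$siteLayer = @{ NtpServer = 'ntp.london.example' }
$roleLayer = @{ NtpServer = 'ntp.default.example'; LeaseDurationDays = 8 }

$layers = @($nodeLayer, $siteLayer, $roleLayer)

# The site layer overrides the role default for NtpServer.
$ntpServer = Resolve-LayeredValue -Layers $layers -Key 'NtpServer'

# Only the role layer defines LeaseDurationDays, so its default is used.
$leaseDuration = Resolve-LayeredValue -Layers $layers -Key 'LeaseDurationDays'
```

Changing the order of the layers changes which value wins, which is the kind of customisation (layers and ordering) the hierarchy is meant to allow.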

The Configuration Data is composed in a customisable hierarchy, where the storage can be a file system and the format YAML, JSON or PSD1, all of which work well with version control systems such as git.

Now if you want to learn how, the best place to start might be the Datum project, or the example control repository, DscInfraSample.

You could also follow me on twitter or WordPress to get notified of my next post.
