How to take the pain out of patching Linux and Windows systems at scale

June 24, 2019

In all my years of being a sysadmin (I’ve been a UNIX systems administrator for over 25 years), patching was always the bane of everyone’s existence. Patching is manually intensive. Centralised information almost never exists. It requires a lot of coordination to arrange downtime, and the processes vary wildly between different operating systems, distributions and even releases of distributions.

There are lots of different interpretations out there, but for our purposes, this is what patching means: Applying changes to computer software with the intention of resolving functional or security bugs, improving usability, reliability or performance.

Pain points

Unpatched vulnerabilities are known to be a huge concern and put organisations at significant risk. Even if you know about a vulnerability, tracking down affected servers either requires manual effort, specialised tools, custom tool development, or a combination of the above.

Once you have a list of affected servers, how do you go about actually applying the updates? Centralised control of the process by an IT team is a common approach and this can work well; however, there are some areas where this falls short: picking patching windows, preventing a particular server from being patched, and controlling whether all patches or just the security ones should be applied.

Self-service options solve some of those issues, but open up others: training, access control and enforcement of standards come to mind.

Once patching is complete, you need to validate success and ensure your reporting systems are updated with the new state.

You also need to ensure all stakeholders have access to the patch state of the servers, and that the data is timely and accurate.

If it is such a risk, why is it still so hard?

… wait a minute, I have an idea!

As a Puppet user, I already have a location where I can centralise data, a way of keeping data accurate, an RBAC system and the ability to trigger ad-hoc work on nodes. Not only that, most of what I need can already be driven by and reported on through an API and a web console. So, what was I waiting for?


I immediately started work on what eventually became the os_patching module. The module is fully functional on Linux (RedHat, Debian and Suse) and Windows support has been added as of V0.11.0. Support for other OS types is being actively worked on.

What did it need to do?

  • Report the patch state on a server, via custom facts, back into PuppetDB
  • Show which updates are related to security (if possible)
  • Allow the servers to be assigned to a ‘patch window’ to allow simple scheduling
  • Allow blackout times to be set for servers which would prevent any patching activity
  • Control whether a post-patching reboot is performed
  • Enable the execution of a ‘patch run’ on a defined group of servers, including:
    • being able to clean package caches
    • restricting to security patching
    • supplying overrides for the arguments to the OS package commands
    • being able to trigger the patch run from the command line, console or through an API
    • having control over who can execute a patch run
  • Store the canonical patching state data on the node

The last item was one of the most important. No matter what happened, I wanted to be sure that the facts on the node were the source of truth for patching information and everything else was fed from there.

How does it do it?

To get started with the module, you simply need to classify your nodes with the os_patching module. It will set up a scheduled task to refresh the patch information and give you access to the tasks that carry out the patching.
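
For example, a minimal classification could be nothing more than including the class from a profile (the profile name here is just an illustration):

# A minimal profile that applies os_patching to a node
class profile::patching {
  include os_patching
}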

“Just the facts ma’am”

Custom facts underpin the entire module, so making sure we have a good structure and management system for them was key.

The custom facts pull their values from cached data; some of those values are generated by a scheduled script or task and others by the os_patching class. We cache the values because having hundreds of nodes hitting your apt/yum servers every time they run Facter would cause some pretty big issues. The scheduled script refreshes the cache data every hour by default, but this can be overridden through classification.
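
A sketch of what that override could look like is below; the cron-style parameter names are assumptions from my reading of the module, so verify them against the module's reference documentation before using them:

class { 'os_patching':
  # Assumed parameter names - refresh the cached patch data every four hours
  # instead of every hour, at a per-node randomised minute
  patch_cron_hour => '*/4',
  patch_cron_min  => fqdn_rand(60),
}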

The facts are broken down into a couple of sections:

State facts

These facts show the state of the node.

  • Are there patches to apply?
  • If so, how many?
  • Are they security related?
  • Does the node need to be rebooted?
  • Do apps need to be restarted?
  • Is patching blocked?

Control facts

These facts control the execution of patching.

  • Do we have blackout windows defined?
  • Is the node allocated to a patching window?
  • Should this node override the reboot parameter?

The facts give us access to all the information we need to audit the fleet and control the patch runs.
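
Pulling together the names used throughout this post, the os_patching fact on a node looks roughly like this (a partial, illustrative sketch with made-up values, not a full dump):

{
  package_update_count => 4,
  security_package_updates => ["openssl", "kernel"],
  blocked => false,
  patch_window => "Week3",
  reboot_override => false
}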

Nodes can be assigned to a patch_window to group them (think “Group 1”, “Week 4”). Blackout windows can be defined for change freezes or for nodes which cannot be patched.
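
Assigning a node to a window can be as simple as passing the parameter when the class is declared (a sketch; the same value could equally come from Hiera or your console classifier):

class { 'os_patching':
  # Group this node with everything else patched in week three
  patch_window => 'Week3',
}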

When you combine these facts, you can write queries such as:

puppet task run os_patching::patch_server --query='inventory[certname] { facts.os_patching.patch_window = "Week3" and facts.os_patching.package_update_count > 0 and facts.os_patching.blocked = false }'

This task would patch any node that is assigned to the patch window “Week3”, is not blocked, and that has patches waiting to apply.

Controlling the facts

There are a number of other settings you can configure if you’d like:

  • patch_window: a string descriptor used to “tag” a group of machines, i.e. Week3 or Group2
  • blackout_windows: a hash of datetime start/end dates during which updates are blocked
  • security_only: boolean; when enabled, only the packages listed in security_package_updates (and their dependencies) are updated
  • reboot_override: boolean, overrides the task’s reboot flag (default: false)
  • dpkg_options/yum_options: a string of additional flags/options to dpkg or yum, respectively

You can set these in Hiera. For instance, a global configuration might carry blackout windows covering end-of-year change freezes for the next few years; a sketch of what that could look like (the dates are illustrative) is:
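
# Illustrative dates; key structure follows the module's documented Hiera format
os_patching::blackout_windows:
  'End of year change freeze 2019':
    start: '2019-12-15T00:00:00+10:00'
    end:   '2020-01-10T23:59:59+10:00'
  'End of year change freeze 2020':
    start: '2020-12-15T00:00:00+10:00'
    end:   '2021-01-10T23:59:59+10:00'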

Actually do the patching

Since I was going to be using tasks, I didn’t have to worry about how to implement RBAC or how to trigger the patch run.

I set up 3 tasks (example invocations are shown after the list):

  • clean_cache: cleans the package cache on the nodes (yum clean all for example)
  • refresh_fact: forces a regeneration of the patching cache data
  • patch_server: actually runs the patching
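
Each of these is a standard Puppet task, so they can be run directly; for example (the node name here is just an illustration):

puppet task run os_patching::clean_cache --nodes centos.example.com
puppet task run os_patching::refresh_fact --nodes centos.example.com
puppet task run os_patching::patch_server --nodes centos.example.com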

The patch_server task is what we’ll look at in more detail now.

When triggered, the task will first check the value of the fact os_patching.blocked. If it is set to true, the task exits as there is a reason that patching cannot continue. This would usually mean that the node is within a blackout window.

Provided there are patches to apply, the task then kicks off the OS patching command under a timeout value (3600 seconds by default). It waits for completion and then pulls back the job information for reporting.
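
If you wanted, say, a security-only run with a longer window, the invocation might look like this; the security_only and timeout task parameter names are assumptions based on the settings described above, so check them against the task metadata:

# Parameter names assumed; verify against the task metadata
puppet task run os_patching::patch_server security_only=true timeout=7200 --nodes centos.example.com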

The facts are then refreshed and pushed up to the Puppet server (puppet fact upload), and then we enter Reboot Town.

To reboot or not to reboot

Remember, patching the node does no good unless you restart the processes that were using the affected packages. This could be as simple as an application restart or as invasive as a full reboot. So how do we control that?


You can control what reboot action the task will take by using the reboot parameter. It accepts the following values:

  • always: Irrespective of what happened during the task, reboot the node. This will ALWAYS trigger a reboot
  • never: Irrespective of what happened during the task, do not reboot the node. This will NEVER trigger a reboot
  • patched: Trigger a reboot if any patches were applied
  • smart: Use the OS tools (needs-restarting on RedHat, /var/run/reboot-required on Debian) to determine if a reboot is required after patching. This will only trigger a reboot if the kernel or core libraries have been updated.

You can also use the fact os_patching.reboot_override to customise behaviour at a more granular level, for example setting all nodes to reboot except three which are set to never because you know they will be rebooted manually at a later date.
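
For example, a run that lets the OS tooling decide whether to restart is just a matter of passing the parameter (node name illustrative):

puppet task run os_patching::patch_server reboot=smart --nodes centos.example.com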

Flowchart

This is all a bit complex; it might be easier to follow as a flow chart.

Output

The task output is visible from the command line, through the console, or through the API. An illustrative sketch of what it contains (the field names are described below; the values here are made up) looks something like this:
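
{
  "pinned_packages": [],
  "security": false,
  "start_time": "2019-06-24T10:00:00+10:00",
  "end_time": "2019-06-24T10:04:32+10:00",
  "reboot": "smart",
  "packages_updated": ["openssl-1.0.2k.el7", "kernel-3.10.0-957.el7"],
  "job_id": 42,
  "message": "Patching complete",
  "debug": "..."
}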

The entries are:

  • pinned_packages: any packages version locked/pinned at the OS layer
  • debug: full output from the patching command
  • start_time/end_time: when the task started/stopped
  • reboot: the reboot parameter used
  • packages_updated: which packages were affected
  • security: the security parameter used
  • job_id: the yum job ID (only populated on RedHat family nodes)
  • message: status info

TL;DR

There is a lot of info above, but you might just want to get started with using the os_patching module, so here are the steps.

  • Add mod 'albatrossflavour-os_patching', '0.11.0' to your Puppetfile and deploy your control repo
  • Classify the nodes you wish to be able to patch with the os_patching module
  • Run puppet on these nodes and expect the following changes:
    • The file /usr/local/bin/os_patching_fact_generation.sh will be installed (C:\ProgramData\os_patching\os_patching_fact_generation.ps1 on Windows)
    • Cron jobs will be set up to run the script every hour (using fqdn_rand) and at reboot
    • The directory /var/cache/os_patching will be created
    • /usr/local/bin/os_patching_fact_generation.sh will run and will populate files into /var/cache/os_patching
    • A new fact (os_patching) will be available
  • View the contents of the os_patching fact on the nodes you classified:
    • facter -p os_patching
    • puppet task run facter_task fact=os_patching --nodes centos.example.com
    • Use the console to view the fact
  • Execute a patch run on these nodes:
    • puppet task run os_patching::patch_server --query='inventory[certname] { facts.os_patching.package_update_count > 0 and facts.os_patching.blocked = false }'
    • Run the task through the console


Source: JAXenter