Sensu @ Blue Box

Scaling Sensu from dozens to thousands of servers

https://github.com/paulczar/sensu-at-bluebox-sensuconf-2017

12 years of Blue Box

in 5 dot points
  1. Jesse Proudman builds a website for a dentist
  2. White-glove managed hosting
  3. OpenVZ-based public cloud
  4. Managed private cloud (OpenStack)
  5. SRE Operations Platform (SiteController)

$ ansible-playbook -i ../envs/dc1 site.yml

random thoughts...

  • Automate! Automate! Automate!
  • flapjack handler == poor concurrency
  • poorly written checks are bad
  • Metrics with Sensu...
  • NOBODY EVER LOOKS AT DASHBOARDS :sadface:

Cuttle (SiteController)

https://github.com/IBM/cuttle

Blue Box / IBM SRE Operations Platform

Large monolithic ansible repo responsible for deploying:

  • Sensu / Graphite / Uchiwa / Grafana
  • Elasticsearch / Logstash / Kibana
  • Mirrors - apt, yum, pypi, rubygems
  • 2FA (yubikey or TOTP) Bastion
  • much much more...

Data Driven Infrastructure

  1. $ vim site/template.yml
  2. $ ansible-playbook -i site/ generate.yml
  3. generates - site/{hosts,ssh_config,group_vars,host_vars,etc}
  4. $ ansible-playbook -i site/hosts site.yml

Data Driven Checks

check scripts

  • all sensu check scripts in a git repo
  • jenkins builds .deb files on commit to git
  • ansible does `apt-get install sensu-checks`
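
A check script is just a program that prints one line and exits with a status Sensu understands: 0 for OK, 1 for WARNING, 2 for CRITICAL, anything else for UNKNOWN. The sketch below is not one of the packaged checks; `evaluate` and `run_check` are hypothetical names, and the value being checked is passed in rather than read from a real service.

```python
# Exit statuses that Sensu (like Nagios) expects from a check script.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(value, warning, critical):
    """Return (status, message) for a metric where higher is worse."""
    if value >= critical:
        return CRITICAL, f"CRITICAL: {value} >= {critical}"
    if value >= warning:
        return WARNING, f"WARNING: {value} >= {warning}"
    return OK, f"OK: {value} < {warning}"


def run_check(args):
    """Parse VALUE WARNING CRITICAL, print one line, return the exit status."""
    try:
        value, warning, critical = (int(arg) for arg in args)
    except ValueError:
        # Bad or missing arguments are UNKNOWN, never a silent OK.
        print("UNKNOWN: usage: VALUE WARNING CRITICAL")
        return UNKNOWN
    status, message = evaluate(value, warning, critical)
    print(message)
    return status
```

A real script would finish with `sys.exit(run_check(sys.argv[1:]))` so the status reaches sensu-client.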

Data Driven Checks

Ansible variable


sensu_checks:
  rabbitmq:
    check_rabbitmq_messages:
      handler: default
      notification: "too many queued rabbitmq messages"
      interval: 120
      occurrences: 5
      standalone: true
      command: "check-rabbitmq-messages.rb -w 10000 -c 50000 \
        --user {{ sensu.server.rabbitmq.username }} \
        --password {{ sensu.server.rabbitmq.password }}"
      service_owner: "{{ monitoring_common.service_owner }}"

Data Driven Checks

Ansible task


- name: install sensu checks
  sensu_check_dict: name="{{ item.name }}" check="{{ item.check }}"
  with_items:
    - name: check-rabbitmq-messages
      check: "{{ sensu_checks.rabbitmq.check_rabbitmq_messages }}"
  notify: restart sensu-client

Data Driven Checks

resultant check file


$ cat /etc/sensu/conf.d/checks/check-rabbitmq-messages.json
{
    "checks": {
        "check-rabbitmq-messages": {
            "standalone": true,
            "notification": "too many queued rabbitmq messages",
            "interval": 120,
            "service_owner": "default",
            "handler": "default",
            "command": "check-rabbitmq-messages.rb --user sensu --password sensu -w 10000 -c 50000",
            "occurrences": 5
        }
    }
}
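
The step from Ansible variable to check file is mostly "wrap the dict and dump it as JSON". The slides do not show how `sensu_check_dict` is implemented, so the sketch below is only a guess at the idea; `render_check` and `check_path` are hypothetical helpers, and the dict holds the already-templated values from the file above.

```python
import json


def render_check(name, check):
    """Wrap one check definition in the {"checks": {name: ...}} envelope."""
    return json.dumps({"checks": {name: check}}, indent=4, sort_keys=True)


def check_path(name, conf_dir="/etc/sensu/conf.d/checks"):
    """One JSON file per check, named after the check."""
    return f"{conf_dir}/{name}.json"


# Values as they look once Ansible has rendered the {{ ... }} templates.
check = {
    "handler": "default",
    "notification": "too many queued rabbitmq messages",
    "interval": 120,
    "occurrences": 5,
    "standalone": True,
    "command": "check-rabbitmq-messages.rb --user sensu --password sensu -w 10000 -c 50000",
    "service_owner": "default",
}

rendered = render_check("check-rabbitmq-messages", check)
```

Writing `rendered` to `check_path("check-rabbitmq-messages")` and restarting sensu-client would complete the loop.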
			

Data Driven Checks

quick fix a thing!


$ ansible-playbook -e \
  "sensu_checks.rabbitmq.check_rabbitmq_messages.interval=60" \
  -i site/hosts site.yml --tags=sensu-checks
...
...
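
The override addresses a single key deep inside the `sensu_checks` dictionary and leaves everything else alone. A minimal sketch of that idea in plain Python follows; it is not how Ansible itself parses `-e`, and `parse_override` and `set_by_path` are hypothetical helpers.

```python
import copy


def parse_override(expression):
    """Split 'a.b.c=60' into ('a.b.c', 60); non-numeric values stay strings."""
    dotted_key, _, raw = expression.partition("=")
    value = int(raw) if raw.lstrip("-").isdigit() else raw
    return dotted_key, value


def set_by_path(data, dotted_key, value):
    """Return a copy of data with the nested key at dotted_key set to value."""
    result = copy.deepcopy(data)
    *parents, leaf = dotted_key.split(".")
    node = result
    for key in parents:
        # Create intermediate dicts if the path does not exist yet.
        node = node.setdefault(key, {})
    node[leaf] = value
    return result


variables = {
    "sensu_checks": {
        "rabbitmq": {
            "check_rabbitmq_messages": {"interval": 120, "occurrences": 5},
        },
    },
}

key, value = parse_override(
    "sensu_checks.rabbitmq.check_rabbitmq_messages.interval=60"
)
updated = set_by_path(variables, key, value)
```

Only `interval` changes in `updated`; `occurrences` and the original `variables` are untouched.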
			

sensu driven compliance

  • compliance rules written in serverspec or InSpec
  • sensu runs the specs and alerts on any failure
  • runs hourly, so we know within an hour when a machine falls out of compliance
  • sensu-client.log goes to ELK, so we get a FREE compliance audit report

Links

Thank You!