Sensu @ Bluebox
Scaling Sensu from dozens to thousands of servers
https://github.com/paulczar/sensu-at-bluebox-sensuconf-2017
12 years of bluebox
in 5 dot points
- Jesse Proudman builds website for a dentist
- white glove managed hosting
- openvz based public cloud
- managed private cloud (openstack)
- SRE Operations Platform (SiteController)
$ ansible-playbook -i ../envs/dc1 site.yml
random thoughts...
- Automate! Automate! Automate!
- flapjack handler == poor concurrency
- poorly written checks are bad
- Metrics with Sensu...
- NOBODY EVER LOOKS AT DASHBOARDS :sadface:
Cuttle (sitecontroller)
https://github.com/IBM/cuttle
Blue Box / IBM SRE Operations Platform
Large monolithic ansible repo responsible for deploying:
- Sensu / Graphite / Uchiwa / Grafana
- Elasticsearch / Logstash / Kibana
- Mirrors - apt, yum, pypi, rubygems
- 2FA (yubikey or TOTP) Bastion
- much much more...
Data Driven Infrastructure
- $ vim site/template.yml
- $ ansible-playbook -i site/ generate.yml
- generates - site/{hosts,ssh_config,group_vars,host_vars,etc}
- $ ansible-playbook -i site/hosts site.yml
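A minimal sketch of the "one data file in, generated inventory out" idea. The slides do not show the real site/template.yml schema or what generate.yml does internally, so the `SITE` dict, the group names, the hostnames, and `render_inventory` are all hypothetical. The only thing taken as given is that Ansible accepts an INI-style hosts file.

```python
# Hypothetical site description. The real site/template.yml schema is not
# shown in the talk; this dict of group -> hostnames is an assumption.
SITE = {
    "monitoring": ["sensu01.dc1", "sensu02.dc1"],
    "logging": ["elk01.dc1"],
}


def render_inventory(site):
    """Render a dict of group -> hostnames as an INI-style Ansible inventory."""
    lines = []
    for group in sorted(site):
        lines.append(f"[{group}]")
        lines.extend(site[group])
        lines.append("")  # blank line between groups
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    # A real generator would also write ssh_config, group_vars, host_vars, etc.
    print(render_inventory(SITE), end="")
```

The point of the design: humans edit one file, everything under site/ is derived from it and can be regenerated at any time.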
Data Driven Checks
check scripts
- all sensu check scripts in a git repo
- jenkins builds .deb files on commit to git
- ansible does `apt-get install sensu-checks`
Data Driven Checks
Ansible variable
sensu_checks:
  rabbitmq:
    check_rabbitmq_messages:
      handler: default
      notification: "too many queued rabbitmq messages"
      interval: 120
      occurrences: 5
      standalone: true
      command: "check-rabbitmq-messages.rb -w 10000 -c 50000 \
                --user {{ sensu.server.rabbitmq.username }} \
                --password {{ sensu.server.rabbitmq.password }}"
      service_owner: "{{ monitoring_common.service_owner }}"
Data Driven Checks
Ansible task
- name: install sensu checks
  sensu_check_dict: name="{{ item.name }}" check="{{ item.check }}"
  with_items:
    - name: check-rabbitmq-messages
      check: "{{ sensu_checks.rabbitmq.check_rabbitmq_messages }}"
  notify: restart sensu-client
Data Driven Checks
resultant check file
$ cat /etc/sensu/conf.d/checks/check-rabbitmq-messages.json
{
  "checks": {
    "check-rabbitmq-messages": {
      "standalone": true,
      "notification": "too many queued rabbitmq messages",
      "interval": 120,
      "service_owner": "default",
      "handler": "default",
      "command": "check-rabbitmq-messages.rb --user sensu --password sensu -w 10000 -c 50000",
      "occurrences": 5
    }
  }
}
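Roughly what happens between the Ansible variable and the file on disk, as a sketch. The source of the `sensu_check_dict` module is not shown in the slides, so `render_check_file` is a hypothetical stand-in, not the module's actual code. It assumes the module simply wraps the already-templated check dict in the `{"checks": {name: ...}}` envelope seen in the check file above.

```python
import json


def render_check_file(name, check):
    """Wrap one check definition in the {"checks": {name: ...}} envelope."""
    return json.dumps({"checks": {name: check}}, indent=2, sort_keys=True) + "\n"


# Values as they appear in the resultant check file, after Ansible has
# already resolved the {{ ... }} templates.
check = {
    "handler": "default",
    "notification": "too many queued rabbitmq messages",
    "interval": 120,
    "occurrences": 5,
    "standalone": True,
    "command": "check-rabbitmq-messages.rb --user sensu --password sensu -w 10000 -c 50000",
    "service_owner": "default",
}

if __name__ == "__main__":
    # A real module would write this to /etc/sensu/conf.d/checks/<name>.json
    # and report "changed" only when the content differs.
    print(render_check_file("check-rabbitmq-messages", check), end="")
```

Sorting keys is a choice made here so that repeated runs produce byte-identical files, which keeps the restart handler from firing unnecessarily.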
Data Driven Checks
quick fix a thing!
$ ansible-playbook -e \
    "sensu_checks.rabbitmq.check_rabbitmq_messages.interval=60" \
    -i site/hosts site.yml --tags=sensu-checks
...
...
sensu driven compliance
- compliance rules written in [server|in]spec
- sensu runs [server|in]spec and alerts on failures
- Runs hourly, so we know within an hour when a machine falls out of compliance
- sensu-client.log goes to ELK, so we get a FREE compliance audit report
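A sketch of a check wrapper for the compliance run. The actual Blue Box check is not shown in the slides, so `run_compliance`, its timeout, and its messages are hypothetical. It assumes the spec tool exits 0 when every rule passes and non-zero otherwise, and it relies on the Sensu check convention of exit status 0 = OK, 2 = CRITICAL, 3 = UNKNOWN.

```python
import subprocess

OK, CRITICAL, UNKNOWN = 0, 2, 3  # Sensu check exit-status convention


def run_compliance(cmd, timeout=300):
    """Run a spec command and return (sensu_status, one-line message)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The suite could not be started or hung: that is not a compliance
        # failure, so report UNKNOWN rather than CRITICAL.
        return UNKNOWN, f"UNKNOWN: could not run compliance suite: {exc}"
    if proc.returncode == 0:
        return OK, "OK: all compliance rules passed"
    return CRITICAL, f"CRITICAL: compliance suite exited {proc.returncode}"


def main(argv):
    """Print the one-line result and return the status for sys.exit()."""
    status, message = run_compliance(argv)
    print(message)
    return status
```

A check script built on this would end with `sys.exit(main(sys.argv[1:]))` and be given the serverspec or inspec command line as its arguments. With `interval` set to 3600 in the check definition, that matches the hourly run described above.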
Thank You!