New Salt version breakage

I started getting these error messages last week and discovered they were actually caused by an incompatibility between the new 2014.1 version and an old server still on 0.17. In general, if you’re seeing strange errors, particularly with configuration that used to work, double-check that your version numbers match. You can always run a ‘salt machine test.version’ to check the client’s version.

salt machine state.highstate
machine:
----------
    State: - no
    Name:      states
    Function:  None
        Result:    False
        Comment:   No Top file or external nodes data matches found
        Changes:

Or via salt-call state.highstate

[ERROR   ] Got a bad pillar from master, type bool, expecting dict: False
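For reference, a quick way to compare the two sides is to check the version the master itself is running and then ask the minion (here named ‘machine’, as in the example above) for its version:

salt --version
salt machine test.version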

LXC Networking

My infrequently used laptop appears to have developed some issues with the networking on new LXC containers (Ubuntu 13.10), so I finally started to dig into how it’s set up. I first blew away the caches in /var/cache/lxc, then tried installing a fresh box with a completely stock template. The network interface didn’t acquire one of the usual 10.0.3.0 network IP addresses. A look in the logs suggested the DHCP server was receiving a DHCP request and sending back a lease, but the client wasn’t picking it up.
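Roughly what that looked like, as a sketch; the container name box1 is just a placeholder:

sudo rm -rf /var/cache/lxc/*
sudo lxc-create -t ubuntu -n box1
sudo lxc-start -n box1 -d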

tail /var/log/syslog
dnsmasq-dhcp[1261]: DHCPDISCOVER(lxcbr0) 00:22:3e:ea:aa:fa 
dnsmasq-dhcp[1261]: DHCPOFFER(lxcbr0) 10.0.3.231 00:22:3e:ea:aa:fa 
dnsmasq-dhcp[1261]: DHCPDISCOVER(lxcbr0) 00:22:3e:ea:aa:fa 
dnsmasq-dhcp[1261]: DHCPOFFER(lxcbr0) 10.0.3.231 00:22:3e:ea:aa:fa 
dnsmasq-dhcp[1261]: DHCPDISCOVER(lxcbr0) 00:22:3e:ea:aa:fa 
dnsmasq-dhcp[1261]: DHCPOFFER(lxcbr0) 10.0.3.231 00:22:3e:ea:aa:fa 

A Google search suggested my problem was the UDP checksums – bug 1204069. A look using tcpdump confirmed that my checksums were indeed corrupt; note the ‘bad udp cksum’ on the second packet.

sudo tcpdump -vvv -i lxcbr0
21:36:34.009788 IP (tos 0x10, ttl 128, id 0, offset 0, flags [none], proto UDP (17), length 328)
    0.0.0.0.bootpc > 255.255.255.255.bootps: [udp sum ok] BOOTP/DHCP, Request from 00:22:3e:ea:aa:fa (oui Unknown), length 300, xid 0xb407cc75, secs 49, Flags [none] (0x0000)
	  Client-Ethernet-Address 00:22:3e:ea:aa:fa (oui Unknown)
	  Vendor-rfc1048 Extensions
	    DHCP-Message Option 53, length 1: Discover
	    Parameter-Request Option 55, length 13: 
	      Subnet-Mask, BR, Time-Zone, Default-Gateway
	      Domain-Name, Domain-Name-Server, Option 119, Hostname
	      Netbios-Name-Server, Netbios-Scope, MTU, Classless-Static-Route
	      NTP
	    END Option 255, length 0
	    PAD Option 0, length 0, occurs 41
21:36:34.010027 IP (tos 0xc0, ttl 64, id 44699, offset 0, flags [none], proto UDP (17), length 328)
    10.0.3.1.bootps > 10.0.3.231.bootpc: [bad udp cksum 0x1c2d -> 0x1a4c!] BOOTP/DHCP, Reply, length 300, xid 0xb407cc75, secs 49, Flags [none] (0x0000)
	  Your-IP 10.0.3.231
	  Server-IP 10.0.3.1
	  Client-Ethernet-Address 00:22:3e:ea:aa:fa (oui Unknown)
	  Vendor-rfc1048 Extensions
	    DHCP-Message Option 53, length 1: Offer
	    Server-ID Option 54, length 4: 10.0.3.1
	    Lease-Time Option 51, length 4: 3600
	    RN Option 58, length 4: 1800
	    RB Option 59, length 4: 3150
	    Subnet-Mask Option 1, length 4: 255.255.255.0
	    BR Option 28, length 4: 10.0.3.255
	    Default-Gateway Option 3, length 4: 10.0.3.1
	    Domain-Name-Server Option 6, length 4: 10.0.3.1
	    END Option 255, length 0
	    PAD Option 0, length 0, occurs 8

The bug suggests that should no longer be a problem, so I carried on googling, turning my search criteria to DHCP UDP problems. That turned up bug 930962, which gave a potential workaround and suggested the issue should already be fixed. Since it said there should already be a firewall fix in the config, I decided to take a look at the configuration. To figure out where that was I used dpkg -L.

dpkg -L lxc
…
/etc/init/lxc-net.conf
…

After a little digging I found the network configuration in /etc/init/lxc-net.conf, where all the out-of-the-box network setup that I so like about Ubuntu lives. There was no mangle line as suggested in the bug report. At that point I checked the version I was running against the version on the bug report and realised the fix was pretty recent and I simply didn’t have it. Rather than figure out where the fixed package was, I figured I may as well just fix it myself for now, since I’d come so far.

sudo iptables -t mangle -A POSTROUTING -o lxcbr0 -p udp --dport bootpc -j CHECKSUM --checksum-fill

After that the networking came up, or more specifically the server obtained an IP address and I was able to SSH to it. I added pretty much that line into my /etc/init/lxc-net.conf and it now works out of the box again, and I have a better understanding of how LXC sets up its network. I’ve also finally solved a case of checksum offload issues for myself rather than just hearing about them. Possibly in the strangest way, i.e. by not turning off the checksum offloading on the network interface.
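For the curious, the change was essentially dropping that same rule into lxc-net.conf next to the other iptables setup; a rough sketch, with the rest of the upstart job omitted:

# /etc/init/lxc-net.conf (excerpt, heavily trimmed)
pre-start script
    # ... existing bridge, dnsmasq and NAT setup ...
    iptables -t mangle -A POSTROUTING -o lxcbr0 -p udp --dport bootpc -j CHECKSUM --checksum-fill
end script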

Of course the adventure didn’t end there. I then found that salt wasn’t working: when I looked for salt keys I didn’t see any requests from the new servers. A tcpdump showed a similar checksum issue with the salt traffic. Another quick hack with iptables solved that problem too.

sudo iptables -t mangle -A POSTROUTING -o lxcbr0 -p udp --dport 4505 -j CHECKSUM --checksum-fill
sudo iptables -t mangle -A POSTROUTING -o lxcbr0 -p udp --sport 4505 -j CHECKSUM --checksum-fill
sudo iptables -t mangle -A POSTROUTING -o lxcbr0 -p udp --dport 4506 -j CHECKSUM --checksum-fill
sudo iptables -t mangle -A POSTROUTING -o lxcbr0 -p udp --sport 4506 -j CHECKSUM --checksum-fill

Getting started with Salt

I’ve finally started to use Salt in anger, and I thought I’d make some quick notes on the things I was slow to pick up on.

Specifying that a box needs to have certain services installed is all done in the configuration. I understood that there are two aspects to salt: one is running things via the command line, the second is configuration management. I thought I’d be able to do something like ‘salt provision box-a service-b’. Instead, you generally put the machine names and what they should have in top.sls in the /srv/salt directory, along with all the other config.
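For example, a minimal /srv/salt/top.sls mapping a machine to a state might look like the following, where box-a and service-b are just the placeholder names from above and service-b.sls would sit alongside it in /srv/salt:

base:
  'box-a':
    - service-b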

Pillar appears to be intended to provide the site-specific data. In general the salt .sls files are config that can be used everywhere, sometimes templated, and the pillar data can be used in those templates to insert data specific to a site or machine. This means you should be able to use the same basic salt setup on multiple salt-masters, and simply change the data on them to generate machines with different users and setups.
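A minimal sketch of how that fits together; the file names and the admin_user key here are made up for illustration:

# /srv/pillar/top.sls
base:
  '*':
    - site

# /srv/pillar/site.sls
admin_user: alice

# then referenced from a jinja-templated .sls or managed file as
# {{ pillar['admin_user'] }}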

Salt appears to be very opaque to start with. It took me too long to realise that, like most things, it logs. In fact it logs very well, so you can turn the log level up very high when you’re troubleshooting. It can be interesting to turn it up if you want to see what it’s doing in real time; otherwise you just see problems logged.
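For example, running a highstate in the foreground on a minion with the log level turned up shows each command as it runs:

salt-call -l debug state.highstate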

It’s well worth taking a look at the state documentation as you’re looking at the tutorials. I found the examples in the tutorial relatively hard to pick apart until I was able to see the references for the various states.

http://docs.saltstack.com/ref/states/all

The file state is particularly worth taking a look at. A lot of deployment is pushing files about and modifying them.

http://docs.saltstack.com/ref/states/all/salt.states.file.html#module-salt.states.file
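A sketch of a typical managed file, in the same style as the other states in this post; the paths and source here are placeholders:

/etc/myapp/myapp.conf:
  file:
    - managed
    - source: salt://myapp/myapp.conf
    - user: root
    - group: root
    - mode: 644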

Make sure you run the latest versions from Salt’s official repositories if possible. The ones packaged with your OS are likely to be old. And don’t run inconsistent versions between your masters and minions; that way leads to fail.
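On Ubuntu, at the time of writing, that meant pulling salt from the saltstack PPA rather than the stock archive; the repository location may well have changed since:

sudo add-apt-repository ppa:saltstack/salt
sudo apt-get update
sudo apt-get install salt-minion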

If something obvious doesn’t appear to work, try it with a different package. I found my simple couchdb installation didn’t work quite right out of the box. I tried exactly the same thing with memcached and it did. This turned out to be a bug.

I’m still early on my journey into salt, but so far it appears to be very useful.

Salt service running

If you’re using salt to ensure your services are running and you constantly see your service being started, it might be because the service doesn’t support the status command.

----------
    State: - service
    Name:      openerp
    Function:  running
        Result:    True
        Comment:   Started Service openerp
        Changes:   openerp: True

If you tweak the salt minion to log trace messages you’ll see that salt calls ‘service openerp status’.

/etc/salt/minion:
log_level: trace

# service salt-minion restart

/var/log/salt/minion
2013-10-02 19:26:32,602 [salt.loaded.int.module.cmdmod][INFO    ] Executing command 'service openerp status' in directory '/root'
2013-10-02 19:26:32,611 [salt.loaded.int.module.cmdmod][INFO    ] Executing command 'service openerp start' in directory '/root'

If we try that on the command line we’ll see that the openerp service doesn’t support the status option.

root@openerp-aq:~# service openerp status
Usage: openerp-server {start|stop|restart|force-reload}

That means salt gets an error, assumes the service isn’t running, and starts it.

The service state has an alternative way to check whether the service is running, by looking at ps and grepping for the process. This is done by specifying the sig key.

openerp:
  service:
    - running
    - sig: openerp
    - require:
      - pkg: openerp

Now we finally see what we want, salt leaving the service running.

----------
    State: - service
    Name:      openerp
    Function:  running
        Result:    True
        Comment:   The service openerp is already running
        Changes:   

In the log we now see,

2013-10-02 19:31:15,371 [salt.loaded.int.module.cmdmod][INFO    ] Executing command "ps -efH | grep 'openerp' | grep -v grep | awk '{print $2}'" in directory '/root'