sshuttle and systemd-resolved DNS resolution inside a NetworkManager VPN

Recently we had a new starter on our team (yay!) and I was helping him get set up with VPN access and lab access. Now our setup is somewhat intricate: you use an OpenConnect VPN to get to the company, and then you need an extra step to get to the isolated lab network. There are certain DNS names that only resolve within the company, and then there are DNS names that resolve only within the lab.

My connection strategy looks like this: I have Ubuntu on my laptop, I connect to the VPN with a terrifying openconnect commandline, and then I do sshuttle --dns -r lab-gateway.corporation.com LAB_IPS/SUBNET. This all works, I can access corporate DNS, lab DNS, corporate IPs and lab IPs.

Our intrepid new starter set up Fedora on his laptop, connected to the corporate VPN with NetworkManager, and then attempted to get to the lab network with sshuttle. However, he could not resolve lab DNS names through the browser, even though tools like dig and nslookup could resolve them.

A bit of monkeying around with tcpdump and watching the output of sshuttle -v revealed that DNS requests from the browser simply weren’t hitting sshuttle’s DNS redirection.

Eventually it clicked: while sshuttle intercepts and proxies requests that go to servers in resolv.conf, systemd provides a different interface for making DNS queries. We could reproduce the issue by asking systemd-resolve or resolvectl to resolve a name inside the lab network: they failed to resolve the name, and no traffic hit sshuttle. If you do strace resolvectl query google.com, you can see that it does the query over dbus, rather than by sending a query to something listed in resolv.conf.

The dbus query goes to systemd-resolved, and systemd-resolved queries its own list of nameservers. To resolve this, we need to tell sshuttle to also intercept and proxy DNS requests that would go to DNS servers that systemd-resolved knows about for the VPN. Assuming your VPN is managed by NetworkManager and the interface is called vpn0, you can do something like this:

sshuttle  -r lab-gateway.corporation.com LAB_IPS/SUBNET -v --dns --ns-hosts $(resolvectl dns vpn0 | awk '{print $4","$5}')
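
For reference, resolvectl dns vpn0 prints the link’s DNS servers on a single line, which is what the awk is picking apart – on my machine the output looks something like this (addresses invented):

resolvectl dns vpn0
Link 5 (vpn0): 10.66.0.1 10.66.0.2

Fields 4 and 5 are the two nameservers. If your VPN only pushes one nameserver, adjust the awk accordingly.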

With that, sshuttle captures and redirects DNS queries sent to both:

  • nameservers listed in resolv.conf – so it will capture queries sent to the 127.0.0.53 stub resolver that regular programs use; and
  • queries that systemd-resolved sends directly to the nameservers from the VPN – so it will capture requests that come over dbus.

This seems to resolve my coworker’s issue!

ffmpeg incantations for screen recording

I’ve been experimenting with recording my screen while coding. The idea is to go back to the recordings later and see what went well and where things could have gone faster/better. I’m not a sports person, but I’m led to believe that sports teams watch replays of their games to see where they could improve for next time – same sort of idea here.

Anyway, screen recording turned out to be a bit fraught. I tried an open-source screen recorder but it only recorded the top left corner of my screen. So now I use the following ffmpeg invocation when using my laptop screen:

ffmpeg -f x11grab -s 2560x1440 -r 10 -hwaccel auto -i :0.0 -vcodec libx264 -preset ultrafast -tune stillimage  sr.mkv

This seems to be able to keep up, CPU-wise (-preset ultrafast was required for that) while not generating truly outrageous file sizes. I don’t think -hwaccel auto does anything on my system but I’m not 100% sure.

If I’m using my external monitor, I need to record a bigger screen, and I downscale it a bit:

ffmpeg -f x11grab -s 3840x2160 -r 10 -hwaccel auto -i :0.0 -vcodec libx264 -preset ultrafast -tune stillimage -s 1920x1080 sr.mkv
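
If you don’t want to hard-code the capture size, you can pull it out of X at recording time. A rough sketch (assuming xdpyinfo is installed; untested beyond my own setup):

RES=$(xdpyinfo | awk '/dimensions:/{print $2}')
ffmpeg -f x11grab -s "$RES" -r 10 -i :0.0 -vcodec libx264 -preset ultrafast -tune stillimage sr.mkv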

I haven’t reviewed the screen captures yet, but being conscious of the recording was helpful in and of itself: for example, I realised I needed desktop notifications (like notify-send) but from remote machines, and hopefully I’ll blog about that soon.

Drive UUIDs in qemu guests

VMs created via libvirt/virt-manager and friends have disk drive UUIDs visible to Linux.

VMs created directly with qemu, attaching the disk drives via something like -hdd, do not seem to have them. This is a real issue if your OS tries to mount disks by UUID!
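
A quick way to check what a guest actually sees – nothing qemu-specific, just what udev populated inside the guest:

ls -l /dev/disk/by-uuid/ /dev/disk/by-id/

If the drive doesn’t show up there, mounting by UUID isn’t going to work.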

This qemu invocation does have UUIDs visible to Linux:

sudo ./qemu/build/ppc64-softmmu/qemu-system-ppc64 \
   -m 2G -M pseries-2.12,accel=kvm -nographic -vga none -smp 4 \
   -blockdev filename=/scratch/libvirt/sle15sp2.qcow2,node-name=storage,driver=file \
   -blockdev driver=qcow2,file=storage,node-name=disk \
   -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=disk,id=virtio-disk0,bootindex=1

Sadly I can’t tell you why this works!

See also my answer on ServerFault.

PowerVM: add an install ISO as a virtual CD

Usually I work with KVM. Sometimes I don’t. Then I have to figure out how to install a distro on an LPAR. The hardest part of this process is getting the install media set up in such a way that the HMC can see it and attach it to my partition. Here’s a set of steps and references for future me, and maybe for you too.

Almost all of this is the same as this Tips4AIX post, but I only found that part-way through, and it’s not wildly easy to find with your Search Engine Of Choice.

  1. Log on to the VIOS partition. User is probably padmin. Password probably follows one of the common patterns for not-really-secret passwords.
  2. If necessary, clear out some old images to make space: lsrep, then if the image is attached to a vopt device you can either rmvdev -vtd vtoptN or unloadopt -vtd vtoptN. Then rmvopt -name <name>. (See Step 10 in these AIX instructions.)
  3. scp your image to the VIOS partition.
  4. Following step 6 of these PowerVC instructions for HMC/VIOS: mkvopt -name NAME -file ISO -ro to load it into storage. (Thanks to the Tips4AIX post for the -ro tip.)
  5. From here I think you can probably do the rest in the web UI, but given you’re already on the VIOS, you can also do it from the command line (again following the PowerVC instructions; there’s a consolidated transcript after this list):
    1. Determine your partition number, either via the web UI or with lssyscfg -r lpar on the HMC
    2. Run lsmap -all to determine the vhost that corresponds to your LPAR.
    3. If you need to create a new virtual optical drive to attach your ISO, you can do that with mkvdev (not mkdev as the PowerVC instructions say!): mkvdev -fbo -vadapter vhostN. This gives you a vtoptN.
    4. Load the CD into the virtual drive: loadopt -disk NAME -vtd vtoptN.
  6. Boot your partition and use PFW to boot from the virtual CD.

Update: to unload the image again later, also consider unloadopt -vtd vtoptN.
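
Putting the VIOS side together, the whole flow ends up looking roughly like this – vhost0, vtopt0, NAME and the ISO path are placeholders; take the real names from lsrep and lsmap -all:

lsrep
mkvopt -name NAME -file /home/padmin/your-install.iso -ro
lsmap -all
mkvdev -fbo -vadapter vhost0
loadopt -disk NAME -vtd vtopt0
unloadopt -vtd vtopt0

(The mkvdev step is only needed if the LPAR doesn’t already have a virtual optical device, and the final unloadopt is for ejecting the image again afterwards.)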

Annotated source for grub’s memory management code: region merging

grub2 is a very common bootloader for Linux. Recently I was asked to explain what a particular part of the grub memory management code did. I thought I’d post it here in case it benefited anyone else. Eventually I’ll tidy it up into some sort of upstream post.

Grub deals with memory in ‘regions’, contiguous areas of memory that we have claimed from firmware. This code is part of the function that adds a new region to grub’s internal data structures – specifically, the part that tries to merge the new region into an existing region. We want to be able to do this because a single merged region may be able to satisfy larger allocations than two separate regions could, as an allocation block may not cross regions.

This code only detects merging with regions after the region being added – that is, you add your regions from high address down. If you want to merge regions and you add from low address up, see my patch.

This is from grub_mm_init_region in grub-core/kern/mm.c. You can find it in context here. Some data structures are documented in this patch which you may find helpful.

Apologies for the lack of syntax highlighting, I’m not sure how to bend WordPress dot com to my will here without shelling out for the Business plan.

// for every existing region we know about:
//  p = ptr ptr to this region
//  q = ptr to region
for (p = &grub_mm_base, q = *p; q; p = &(q->next), q = *p)
  // if the address of the memory we're adding now (addr) + size 
  // of the memory we're adding (size) + the bytes we couldn't use
  // at the start of the region we're considering (q->pre_size)
  // is the address of q: that is, if the memory looks like this:
  //
  // addr                          q
  //   |----size-----|-q->pre_size-|<q region>| 
  if ((grub_uint8_t *) addr + size + q->pre_size == (grub_uint8_t *) q)
    {
      // r is a new region. Its address is the first
      // GRUB_MM_ALIGN-aligned address above addr
      r = (grub_mm_region_t) ALIGN_UP ((grub_addr_t) addr, GRUB_MM_ALIGN);
      // copy the region data from the existing region
      // we're examining to the new region
      *r = *q; // (this is a struct assignment)
      // consider the size of the region we're adding to
      // be part of the unused pre-region area 
      r->pre_size += size;
      // if we have enough extra space in the pre-region
      // area for an mm block, then
      if (r->pre_size >> GRUB_MM_ALIGN_LOG2)
	 {
	  // create a new mm block immediately following
	  // the new header
	  h = (grub_mm_header_t) (r + 1);
	  // the size of the block is based on the size of
	  // the pre-region area (the original pre-region
	  // size + size being added) shifted down to have
	  // units of mm blocks rather than bytes
	  h->size = (r->pre_size >> GRUB_MM_ALIGN_LOG2);
	  // treat the new allocation as allocated
	  h->magic = GRUB_MM_ALLOC_MAGIC;
	  // the region size will grow by the size of this
	  // block (shift back to bytes)
	  r->size += h->size << GRUB_MM_ALIGN_LOG2;
	  // correct the new pre-size 
	  r->pre_size &= (GRUB_MM_ALIGN - 1);
	  // replace the old region (q) in the ring with
	  // the new region (r)
	  *p = r;
	  // "free" the block. This will put it into the
	  // free lists properly this works because the
	  // allocated blocks don't keep any metadata
	  // about the state of the ring they're inserted
	  // into the free ring by grub_free.
	  grub_free (h + 1);
	}
      // replace the old region with the new region
      // (somewhat duplicative!)
      *p = r;
      // don't proceed to add this as a standalone region
      return;
    }

Building SLES kernels

I’m trying a new thing where I post short things – almost micro-blog length – about little tricks, tools and hacks that I use to get through the day.

I find myself having to build SLES kernels to test Things.

I’ve been building with make binrpm-pkg based on a kernel source from SUSE’s kernel GitHub repo and a config file from the config directory in their kernel-source repo. Here are the various hacks I’ve needed to get things working:

Firstly, set modprobe’s allow_unsupported_modules to 1:

dja@dyn438:~> tail -n1 /etc/modprobe.d/10-unsupported-modules.conf
allow_unsupported_modules 1

(Don’t forget to regenerate your initrd!)
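
On SLES that’s usually a dracut invocation – something like this for the currently running kernel (pass --kver for a different one):

sudo dracut -f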

Secondly, sign kernel modules. This is only necessary if you want to enforce signatures through some mechanism like lockdown or IMA. This is set in menuconfig under Enable loadable module support -> Automatically sign all modules.
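
If you’d rather not go through menuconfig, the kernel tree’s scripts/config helper can flip the same options against the .config in your tree – this is just my shorthand for the menuconfig path above:

./scripts/config --enable MODULE_SIG --enable MODULE_SIG_ALL
make olddefconfig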

See also Building RPMs more quickly by not compressing them.

Building RPMs more quickly by not compressing them

I’m trying a new thing where I post short things – almost micro-blog length – about little tricks, tools and hacks that I use to get through the day.

I’m building a bunch of kernel RPMs. The default compression takes ages and ages, much longer than building the kernel itself. Disk space isn’t a limiting factor at the moment, so let’s just disable compression entirely:

dja@dja-guest:~$ cat ~/.rpmmacros 
#       Compression type and level for source/binary package payloads.
#               "w9.gzdio"      gzip level 9 (default).
#               "w9.bzdio"      bzip2 level 9.
#               "w6.xzdio"      xz level 6, xz's default.
#               "w7T16.xzdio"   xz level 7 using 16 thread (xz only)
#               "w6.lzdio"      lzma-alone level 6, lzma's default
#               "w3.zstdio"     zstd level 3, zstd's default
#               "w.ufdio"       uncompressed
#
%_source_payload       w.ufdio
%_binary_payload       w.ufdio
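
As a sanity check, rpm can report the payload compressor of a built package – handy for confirming it no longer says xz or zstd (PACKAGE.rpm is whatever your build produced):

rpm -qp --qf '%{PAYLOADCOMPRESSOR}\n' PACKAGE.rpm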

On kernel patch review

I’m trying to do more patch review these days, and I’ve started trying to do the review in a group setting, or to stream it, in the hope that more people feel empowered to do patch review in future.

I’m also trying to figure out and attempt to document what I personally do in a patch review. I’m not claiming to be a particular expert here, but I’m trying to be a bit more systematic than “Yeah, I guess that seems good to me”. This is still very much a work in progress – I hope to revise and extend it over time.

What do maintainers want?

I have 2 references from maintainers about what is and isn’t helpful, which inform this:

  • From talking to Michael Ellerman, the powerpc maintainer, he prefers reviews that aren’t just one line “Reviewed-by: Your Name <email@address>” but include some content about what has been reviewed.
  • Dave Chinner of xfs posted about reviews and I’ve cribbed from his thoughts pretty heavily here.

A brief outline of what I try to look for

  • Can I understand the cover letter?
    • (Not usually something I cover explicitly but) Is this a good idea?
  • Can I understand the commit message?
  • Does the code do what the commit message says it does?
  • Does it pass automatic reviews?
  • Is the change consistent with the surrounding code?
    • Patches should do 1 thing at a time: don’t clean up and change at the same time.
    • Churn must add commensurate value – git blame is a useful tool and we don’t want to make it harder to go through the history without good reason.
  • If code is being reorganised without intending to change the behaviour, does behaviour in fact stay the same?
    • If I follow the various conditions checked, do we go along the same paths/do the same things?
  • Are errors caught and resources freed in error paths?
  • Are the ‘little details’ correct?
    • are loops correctly terminated?
    • are the right variable types used?
    • are things signed or unsigned as appropriate?
    • will any arithmetic operations overflow?
  • Do I notice any violations of the coding style not picked up by checkpatch?

Final thoughts

From Dave Chinner’s email:

IOWs, you don’t need to know anything about the subsystem to perform such a useful review, and a lot of the time you won’t find a problem. But it’s still a very useful review to perform, and in doing so you’ve validated, to the best of your ability, that the change is sound. Put simply:

“I’ve checked <all these things> and it looks good to me.

Reviewed-by: Joe Bloggs <joe@blogg.com>”

This is a very useful, valid review, regardless of whether you find anything. It’s also a method of review that you can use when you have limited time – rather than trying to check everything and spending hours on a patchset, pick one thing and get the entire review done in 15 minutes. Then do the same thing for the next patch set. You’ll be surprised how many things you notice that aren’t what you are looking for when you do this.

https://lore.kernel.org/linux-fsdevel/20200202214620.GA20628@dread.disaster.area/

Running an Ubuntu ppc64le virtual machine on x86

For an issue I’m debugging, I need a 64-bit little-endian powerpc machine. I don’t have a physical one handy, so I wanted to run a virtual one on an x86 system.

We can use qemu and libvirt for this. Unfortunately, it will be very slow – we have to use qemu’s TCG mode rather than hardware assisted virtualisation because the guest (POWER) and host (amd64) are different. Fortunately, libvirt can work with these VMs like normal, which I find very helpful as it allows me to use tools like virsh and virt-manager, rather than dealing with qemu’s inscrutable command line arguments.

So I grabbed an Ubuntu ppc64el install ISO (Xenial, Bionic) and set to work.

Here’s the virt-install incantation derived from much accumulated wisdom amongst my former colleagues at OzLabs:

virt-install --arch=ppc64le --name=GUESTVMNAME \
  --os-variant=linux --boot menu=on \
  --disk path=GUESTDISK.qcow2,size=20,format=qcow2,bus=virtio,cache=none,sparse=true \
  --memory=4096  --vcpu=2  --wait 0  --graphics none \
  --cdrom=/home/ubuntu/ubuntu-18.04.1-server-ppc64el.iso

If you run this on an x86 Xenial host, the installation will begin and you can interact with it by running virsh console (or use virt-manager).
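
That is, something like this (Ctrl+] gets you back out of the console):

virsh console GUESTVMNAME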

Sadly, it does not work; the installer kernel panics. Removing ‘quiet’ from the boot commands in grub shows messages like the following:

modprobe[98]: unhandled signal 4 at 00007d96d2364fbc nip 00007d96d2364fbc lr 00007d96d2332ab8 code 1

Mikey Neuling pointed out that this was a SIGILL – illegal instruction – which reminded me that while Ubuntu only supports POWER8 for ppc64el, qemu will emulate a lot further back – and by default on Xenial I guess it uses an older ISA.

So I manually set the machine type to POWER8. I used virt-manager, which generated this snippet of libvirt xml under the domain tag:

  <cpu mode='custom' match='exact' check='none'>
    <model fallback='allow'>POWER8</model>
  </cpu>

This, however, won’t start – virsh start and virt-manager report:

qemu-system-ppc64le: No Transactional Memory support in TCG, try cap-htm=off

Suraj Jitindar Singh explained some of the backstory here, and suggested that I try specifying a particular “machine type” to qemu that would default to having hardware transactional memory off. Fortunately, his suggestion – pseries-2.12 – worked: the machine started, the installer is installing, and I’m much happier.

I used virsh edit to do this – the relevant snippet is:

    <type arch='ppc64le' machine='pseries-2.12'>hvm</type>

This lives under os, under domain.

The story on Bionic

Fortunately Bionic does a better job of all this.

The virt-install command fails straight out of the box with the cap-htm error. Fortunately, it’s easy to specify the machine type for virt-install: just append --machine=pseries-2.12:

virt-install --arch=ppc64le --name=GUESTVMNAME --machine=pseries-2.12 \
  --os-variant=linux --boot menu=on \
  --disk path=GUESTDISK.qcow2,size=20,format=qcow2,bus=virtio,cache=none,sparse=true \
  --memory=4096  --vcpu=2  --wait 0  --graphics none \
  --cdrom=/home/ubuntu/ubuntu-18.04.1-server-ppc64el.iso

This works! I suppose the machine type must imply a POWER8 cpu, because I didn’t need to specify it.
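
If you’re curious what CPU definition the guest actually ended up with, virsh can dump the generated XML so you can look at the cpu element:

virsh dumpxml GUESTVMNAME | grep -A 2 '<cpu'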

Update

With a bionic guest, the installation worked on the serial console, but systemd failed to spawn a terminal on it for some reason. At least, it’s probably systemd’s fault, and it’s convenient to blame it! I haven’t actually confirmed this: I added a display in virt-manager and installed an openssh server so I can now access it in 2 different ways.

See also

  • uvtool: I normally use this for my VMs, but it has some issues starting ppc64 guests. It flat-out fails on Xenial (it starts it as an x86 guest!) and on Bionic it fails too – it seems to assume that it will be running under KVM. I haven’t dug into this any further.
  • The very helpful instructions from Michael Ellerman on the linuxppc/linux wiki: Booting with Qemu. If I didn’t need a pretty expansive user-space I would totally have just used this.

Netplan FAQs

Looking at search traffic, it seems the following netplan issues trip people up frequently.

No tabs in YAML

Problem:
netplan prints an error message like this:
Invalid YAML at //etc/netplan/10-bad.yaml line 6 column 0: found character that cannot start any token
Solution:
You have a tab in your YAML file. Remove the tab and try again.
There will eventually be better error messages for this – PR#12 or PR#18, but they haven’t landed as of June 2018.
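
A quick way to find the offending tab, assuming GNU grep:

grep -Pn '\t' /etc/netplan/*.yaml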

mapping values are not allowed in this context

Problem:
netplan prints an error message like this:
Invalid YAML at //etc/netplan/10-bad.yaml line 3 column 12: mapping values are not allowed in this context
Cause:
Netplan parses YAML with libyaml, which has some rather obtuse error messages.
As far as I can tell, this is caused when you have a YAML file where:
  • you have begun a mapping. As of Bionic, netplan recognises networks, match, nameservers, routes, routing-policy, access-points, parameters and SSIDs as mappings (see the netplan man page)
  • you have an item in the mapping that isn’t a key-value pair, it’s just a bare value.
This is easy to do if you forget a colon somewhere. For example, here there should be a colon after the ens3:
network:
  version: 2
  ethernets:
    ens3
      dhcp4: true
(If you have another YAML file that demonstrates this, please let me know on Twitter (@daxtens) or by posting a comment!)
Solution:
Check your YAML file carefully for syntax errors. Check all mapping stanzas, checking that each item has a colon in the correct place.
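
For completeness, the fixed version of the example above – ens3 now has its colon – would be:

network:
  version: 2
  ethernets:
    ens3:
      dhcp4: true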

Cannot set an MTU!

Netplan (well really the combination of udev and systemd-networkd) has some weird quirks when setting MTUs. As far as I can tell, if you are trying to set an MTU and it isn’t working, make sure the match stanzas are matching on MAC address. This seems to be the most reliable; otherwise networkd throws a fit about devices that are renamed in udev. See LP: #1724895 and my previous post.
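
A minimal sketch of what I mean (the MAC address is made up – substitute your own, and the eth0 key here is just an identifier):

network:
  version: 2
  ethernets:
    eth0:
      match:
        macaddress: "52:54:00:12:34:56"
      mtu: 9000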

Other issues?

We can probably help!

  • Let me know in the comments;
  • Ask me (@daxtens) on Twitter;
  • Ask on Ask Ubuntu;
  • Try #netplan on freenode; or,
  • Open an issue on Launchpad.