Setting up a new CPAN Mini Mirror

It’s time for me to set up a new dev laptop, and I wanted a CPAN mirror on it quickly. Using a mirror is something I’ve done for a long time now; it’s great for coding without internet access. Unless you’re doing lots of module installs a mirror is likely to use more bandwidth than downloading on an ad hoc basis, but the tradeoff is that I can install almost any module I realise I need regardless of whether I have connectivity. The only limitations come down to external dependencies on things like operating system packages and tools.

I have previously set up mirrors using LXC and Salt to provision the machine, but this time I decided to convert that setup to Docker. I also simplified the setup to just the mirror, as I didn’t need to inject modules into the CPAN server any more. In truth I never really did on my laptops; I simply did that because it was useful for work at the time. I created the Salt configuration to make it easier to re-provision new servers for a work setup that allowed for a private mirror carrying work modules as well, giving a full CPAN-style deployment process for both public and private code.

To do this I’ve used docker-compose, which is almost always the best option for a laptop setup. Even with a single machine it generally makes life simpler: you can encode all the configuration in the file so that you have a few simple, consistent commands to build, set up, and run your containers. I’ve also set it to the highest version number currently available, partly just to see what’s available and partly because I want to make use of some of the newer volume features. It doesn’t appear that I can express everything I want perfectly in the docker-compose file, but I suspect that they aren’t really targeting my situation. Docker is popular partly because it works really well on developer laptops, where cached file system layers and lightweight machines suit a constrained environment, but fundamentally Docker is aiming for servers, so they’re trying to deal with sharing resources across multiple machines rather than working on a single one.

What I really wanted was to be able to specify the exact details of the volume shared between the containers, from its location down to the user IDs that own it. If that’s currently fully possible I haven’t seen a way to do it.

This setup is not designed for the general internet or with security in mind (not that a simple mirror should really face many threats). I don’t even expose the ports; I just print out the URL for use when running cpanm. I also update the server manually rather than setting up a cron job for it, as I don’t really want to spend that much bandwidth on this. I don’t use the mirror that often, but when I do it’s really valuable, even if it doesn’t have the latest and greatest versions of every module.

Having all the modules locally can also be great when you want to do some analysis of what existing modules are doing. It’s reasonably easy to write scripts to, say, find all the XS modules and then extract their C code to see which call a particular function.
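That kind of analysis can be sketched roughly like this (illustrative Python, assuming the mirror’s distributions have already been unpacked under a directory; the helper name and layout are my own invention, not part of any CPAN tooling):

```python
import os
import re

def find_xs_calls(root, function_name):
    # Walk an unpacked tree of distributions looking for XS/C source
    # files that appear to call a particular C function.
    pattern = re.compile(r'\b%s\s*\(' % re.escape(function_name))
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(('.xs', '.c')):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors='replace') as fh:
                if pattern.search(fh.read()):
                    hits.append(path)
    return sorted(hits)
```

Pointing it at a directory of unpacked distributions and a function like sv_setpv gives you the list of files making that call, which is usually enough to start digging.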

The configuration is on GitHub. The modules downloaded are kept in a volume outside the containers, so updating and removing them should be easy. In theory it should even be possible to wrap the setup around an existing mirror if you already have the files.

An Operations guide to Catalyst configuration in Docker

We use the Perl Catalyst web framework at $work.  A lot.  It’s got most of the stuff you want for web sites and services, and it’s solid, lovely, and stable.

As with most systems it has a well established method of configuration, and it allows you to use configuration files.  These are handy for all sorts of reasons, but in a Docker environment most of the file system generally wants to be essentially static.  If you need to change config files on every different deployment of your container you’re probably going to need to do something ugly.  Docker is much happier with you pushing in settings via environment variables instead.  This fits much more neatly into the 12 factor app methodology that using Docker itself fits neatly into.

Of course Catalyst has a solution to that which doesn’t require wholesale change: just add the Catalyst::Plugin::ConfigLoader::Environment plugin to your application.  That allows you to override parts of your configuration from the environment.

Perl applications generally have names like My::App, often (but not always) with capitalized words and :: between them.  The corresponding Catalyst configuration file for that application would be named my_app.conf.  It is simply a lowercase version of the app name, with :: replaced by _.

The environment variables you override the config with need to start with an all uppercase version of the app name, again with :: replaced by _.  So it would start MY_APP.  Then you put another _ after that, and then you specify the configuration section you want to override.
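The two naming transformations can be sketched like this (illustrative Python, just to pin down the string manipulation; the helper names are mine, not part of Catalyst):

```python
def config_filename(app_name):
    # my_app.conf: lowercase, with '::' replaced by '_'
    return app_name.replace("::", "_").lower() + ".conf"

def env_var(app_name, key):
    # MY_APP, an underscore, then the section or setting to override
    return app_name.replace("::", "_").upper() + "_" + key

print(config_filename("My::App"))                      # my_app.conf
print(env_var("My::App", "using_frontend_proxy"))      # MY_APP_using_frontend_proxy
```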

Let’s take a look at an example config,

using_frontend_proxy 1

<Model::Cabbages>
    connect_info dbi:Pg:host=db;dbname=patch
    connect_info someuser
    connect_info averysecurepassword
    <connect_info>
        quote_char "\""
        name_sep .
        pg_enable_utf8 1
    </connect_info>
</Model::Cabbages>

Note that Catalyst config files can generally be in lots of different formats.  This one is in an Apache-like config style, but it could also be YAML or various other formats.  The configuration loader Catalyst apps commonly use is very flexible.

If we want to change the using_frontend_proxy setting we can set the environment variable MY_APP_using_frontend_proxy to 0.

To set more complex data structures we can set a json value.  This will get merged into the config, so if something exists in the hash/dictionary but isn’t overwritten then it will generally be left alone.
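That merge behaviour can be sketched roughly like this (illustrative Python approximating what the config loader does, not Catalyst’s actual code):

```python
def merge(config, override):
    # Keys in the override win; nested dictionaries are merged
    # recursively, so anything not overridden is left alone.
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(config.get(key), dict):
            merge(config[key], value)
        else:
            config[key] = value
    return config
```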

The configuration file above roughly translates to a dictionary looking like this in code (expressed as JSON),

{
    "Model::Cabbages": {
        "connect_info": [
            "dbi:Pg:host=db;dbname=patch",
            "someuser",
            "averysecurepassword",
            {
                "name_sep": ".",
                "pg_enable_utf8": "1",
                "quote_char": "\""
            }
        ]
    },
    "using_frontend_proxy": "1"
}

Note how the repeated connect_info lines were translated into an array, ending with a dictionary.  This may seem strange, but the connect_info block is a very common structure in the Perl world, as most if not all of it will be passed straight on to DBI, the workhorse doing most database work.

So to change the password in the database configuration we do,

MY_APP_Model__Cabbages='{"connect_info":["dbi:Pg:host=db;dbname=patch","someuser","anewsecurepassword",{"name_sep":".","pg_enable_utf8":"1","quote_char":"\""}]}'
Since that’s an array of configuration settings we end up having to specify the whole lot, otherwise we lose some of the configuration.  Also note that we turned the :: into __.  That is generally a good thing when working on the command line, as putting :: into environment variable names via bash is tricky (if possible at all).

The one oddity is a configuration key with more than one set of double colons, e.g. Model::Payment::Client.  There you just need to live with the ::, otherwise your override will be interpreted in an odd fashion.

Perl and docker-compose configurations can set environment variables containing :: easily though, so this generally isn’t a big deal.  env and set can confuse matters, however, as one will show environment variables with :: in their names and the other won’t.


Take the configuration filename, drop the extension, upper case it.

Add an underscore.

Take the first section or key, and append that to the environment variable.

That’s the variable that will target that config section.

Now set its value to either a string (if it’s a simple string value), or a JSON structure that matches the configuration’s effective structure.

Now you discover whether someone’s done something confusing and eschewed convention. Hopefully not.

There is something missing here, and it’s how to test changes, and generate the correct json easily. I have the tooling to make it reasonably simple, I just need to open source it.

PAUSE permissions code outline

This is a quick brain dump from my brief coding session on PAUSE working on the permissions. This won’t make sense without reading the code yourself. Unless you’re interested in modifying PAUSE or reviewing my changes I would ignore this blog post. I’m writing this for my future self as much as anyone. The behaviour I modified in my pull request is in italics.

The code for setting the permissions when a new distribution is indexed is largely in PAUSE::package. It’s triggered by ./cron/mldistwatch and PAUSE::mldistwatch, but the code that actually sets the permissions is in PAUSE::package.

perm_check checks that the current user has permission to be uploading this distro and bails if not. If this is a new package (and they do have permission) then it also adds the user to the perms table as they are effectively first come.

give_regdowner_perms adds people to the perms table where necessary. This supported the module permissions amongst other things. It is now also where we copy the permissions from the main module.

The checkin method as well as adding the record to the packages table also adds to the primeur table via the checkin_into_primeur function. If the x_authority flag is used then that user is added, otherwise the main module is looked up, or failing that the uploading user.

Note that the first come (primeur) users appear in both the perms and primeur tables.

There are special checks for Perl distributions in the permissions code that will change the behaviour of the permissions. I am purposely not mentioning them as I never took the time to understand them.

A quick note on testing. As well as the tests, which work off the dummy distributions in the corpus directory, there is a test utility. To get a look at the permissions database in practice, use the build-fresh-cpan script. This will build a one-off CPAN environment that you can examine. Just pass it a package to index and then you can check the permissions database.

$ perl one-off-utils/build-fresh-cpan corpus/mld/009/authors/O/OO/OOOPPP/Jenkins-Hack-0.14.tar.gz
Initialized empty Git repository in /tmp/h4L3GU6Awi/git/.git/
$ ls
cpan  db  git  Maildir  pause.log  run
$ cd db
$ ls
authen.sqlite  mod.sqlite
$ sqlite3 mod.sqlite 
SQLite version 3.8.7.1 2014-10-29 13:59:56
Enter ".help" for usage hints.
sqlite> select * from primeur;
sqlite> select * from perms;

Docker logging and buffering

When you start using Docker, one of the things you’ve quite possibly hit is output buffering. Docker encourages you to log to stdout/stderr from your program, and then use Docker to feed your regular logging solution.

This generally exhibits itself as you seeing no logging; then, after the program has been running for a while, you come back to the logs and find they are all there. It doesn’t get as much press as caching, but it can be just as head-scratching.

With Perl, for example, the buffering kicks in when output is not connected to a terminal. For that this somewhat idiomatic snippet can be useful,

# enable autoflush ($|) on STDERR and STDOUT without leaving
# either selected as the default output filehandle
select( ( select(\*STDERR), $|=1 )[0] );
select( ( select(\*STDOUT), $|=1 )[0] );

As ever with Perl there is more than one way to do it of course…
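The same issue bites in other languages too. In Python, for instance, stdout is block-buffered when it isn’t a terminal; a minimal sketch of the fix is to flush on every write (the log helper here is my own illustration, not a standard API):

```python
import sys

def log(message, stream=sys.stdout):
    # Write and flush straight away so Docker (or any pipe reading
    # our output) sees the line immediately, instead of waiting for
    # the block buffer to fill.
    stream.write(message + "\n")
    stream.flush()

log("service started")
```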

Note that you can experience this sort of thing with other logging setups too. If you pipe to logger to feed rsyslog, for example, you can hit the same issues.

Debugging Web API traffic

Note that this blog post mostly assumes you’re operating on Linux. If you’re using Windows, just use Fiddler; it probably does everything you need. Actually, having just looked at their page, it may well work in lots of places other than Windows too, so it might be a good option generally.

When developing programs that consume APIs that use HTTP at some level, it’s often useful to check what is actually going over the wire. Note that this is talking about unencrypted traffic. If you’re talking to a server over HTTPS you will need to MITM your connection or capture the session keys using the SSLKEYLOGFILE environment variable.

The simplest way to capture traffic is generally to use tcpdump. Wireshark is often a good tool for looking at network traces, but for lots of HTTP requests it tends to feel clunky. This is where I turn to a Python utility named pcap2har, which converts a packet capture to a HAR file. A HAR file is essentially a JSON file containing the HTTP requests: an array of the requests, with each one noting the headers and content of the request and response. The HAR file format is documented here.
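As a sketch of how easy the format is to consume (illustrative Python; summarise_har is my own helper, assuming the standard HAR layout of log, entries, request, and response):

```python
def summarise_har(har):
    # A HAR document is {"log": {"entries": [...]}}, with each entry
    # holding a "request" and a "response" dictionary.
    lines = []
    for entry in har["log"]["entries"]:
        req = entry["request"]
        resp = entry["response"]
        lines.append("%s %s -> %s" % (req["method"], req["url"], resp["status"]))
    return lines
```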

You will actually find that Google Chrome allows you to export a set of requests from the Network tab of its Developer Tools as a HAR file too.

The pcap2har utility isn’t packaged in a particularly pythonic way, and it doesn’t actually extract the request body, so I created a minor tweak to it on a branch. This branch does extract the request body, which is often used in API calls. You’ll need to pip install dpkt; the rest of the dependencies are bundled in the repo. Then you run it like this,

git clone --branch request_body
cd pcap2har
sudo pip install dpkt
tcpdump port 8069 -w packets.dump
pcap2har packets.dump traffic.har

Having said that HAR is a lot easier to consume, there are viewers, but I’ve not found one that I particularly liked. I tend to either pretty print the JSON and look at it in a text editor, or use grep or code to extract the information I want.

For OpenERP/Odoo API calls I created a quick script to explode the contents of the API calls to make them easier to read. It expands the XML/JSON contents within the requests/responses out to JSON at the same level as the rest of the HAR data, rather than having encoded data nested within JSON.

Perl QA Hackathon report (#perlqah2016)

Thank you to all the people who sponsored the Perl QAH hackathon and all those that provided their time. It was a very productive environment for a lot of projects.

I worked on 2 primary things relating to PAUSE permissions. PAUSE (indexing) itself and a new module for testing that permissions are consistent named Test::PAUSE::ConsistentPermissions.

The permissions of distributions on CPAN are generally something you don’t think about until you become a co-maintainer of a distribution and you see red flagged uploads on the CPAN web interfaces. Then you discover that there is a system by which each module in your distribution is effectively owned by somebody, and that some others might also have been granted permission to upload a file.

Let’s take Test::DBIx::Class as an example. I’m a co-maintainer, and JJNAPIORK is the owner. There is also another co-maintainer, PHAYLON.

Using the script from Test::PAUSE::ConsistentPermissions I can look at the permissions currently assigned.

$ pause-check-distro-perms Test-DBIx-Class Test::DBIx::Class
Distribution: Test-DBIx-Class
Module: Test::DBIx::Class

When I upload a new release of Test::DBIx::Class to PAUSE any new modules added will be given to me. So I will be owner and no-one will be Comaint. I then need to grant PHAYLON comaint (which I can do as owner), and then pass ownership back to JJNAPIORK.

There is an alternative mechanism involving a bit of metadata named ‘x_authority’. With that we could ensure that JJNAPIORK retains ownership of all the modules within the distribution. The downside of that is that while I would also gain comaint on those new modules as the uploader, PHAYLON would not. Since I wasn’t the owner either, I wouldn’t be able to assign him Comaint, and I would have to ask JJNAPIORK to do that instead.

I believe there were historically alternative methods for managing this, but PAUSE has been through some rationalisation and simplification of some features and those features don’t exist anymore.

I came to this hackathon wanting to work on this problem having encountered Karen Etheridge (ETHER)’s prototype Dist::Zilla::Plugin::AuthorityFromModule which suggested a potential solution.

We had a meeting with the interested parties at the hackathon about how we could deal with this scenario better. Ricardo Signes (RJBS) suggested that we could make use of the fact that there is a designated ‘main module’ for permissions, and use that for the defaults. This is one step better than the previously suggested solution, as it builds on the earlier rationalisation of PAUSE permissions and won’t require authors to provide extra metadata. This should mean that permissions are much more likely to just work out of the box.

With that potential solution suggested RJBS gave me some assistance to get working on the PAUSE indexer. I started with adding tests, then wrote the code.

The change made so far is very minimal, only affecting the indexing of new packages. No changes have been made to the user interface, or to the complex permissions you can currently end up with. The pull request is here; note that it also benefits from Peter Sergeant (Perl Careers)’s work hooking the PAUSE tests into Travis, giving it a green tick to indicate the tests passed.

The other thing I worked on was Test::PAUSE::ConsistentPermissions, which allows us to check that a distribution has consistent permissions. I created a script for checking a distro on PAUSE (not too dissimilar to Neil Bowers’ App::PAUSE::CheckPerms module) and a test function for use in a distro’s release tests. This is a bit like Test::PAUSE::Permissions, but rather than checking whether you have permission to upload the distribution, it checks whether the permissions are consistent. These two properties don’t necessarily coincide.

During the event I was also able to create a couple of minor pull requests.

Here’s the obligatory thank you to the full list of sponsors. Thank you all.

The sponsors for the Perl QA Hackathon 2016,

Hand Coding SQL with psycopg2 for Odoo

I thought it would be a good idea to write down some guidelines on writing hand coded SQL for Odoo/OpenERP. One of the general themes should be ‘avoid SQL injection’, and Odoo themselves provide some documentation explaining that. They also make the point that you should stick with the ORM as much as possible. Rolling your own SQL bypasses all the access controls on records, so if you are writing some SQL, always consider whether the user is supposed to have access to the data extracted or manipulated by it.

Assuming you’re still set on writing SQL, it’s worth noting that Odoo uses psycopg2, a mature and respectable library for accessing Postgres from Python. Dealing with values in SQL clauses is simple, as it allows you to pass them through as parameters, just as you’d hope.

cr.execute('select * from res_partner where name = %s', (partner_name,))

Unfortunately it’s too simplistic to say never use string concatenation. In practice you often want to stitch together a complex query based on some data. You might want to vary the search criteria based on user input.

Here are some concrete scenarios to demonstrate more complex situations. Read the psycopg2 documentation for how to pass in values, and how it converts the various types first.

In clause

This is the most common place people give up on parameterisation. It’s clearly documented in psycopg2, but since people often don’t read the docs, they don’t realise %s works here too and end up doing something involving join and their own format strings. The syntax is really simple; note that psycopg2 wants a tuple (not a list) for an in clause,

cr.execute('select * from res_partner where name in %s', (('Partner 1', 'Partner 2'),))

Optional clauses

The trick with building up optional parts is to maintain lists of parameters alongside lists of the bits of SQL you need. The exact mechanism for appending the strings isn’t terribly important, and in fact you might find it useful to write yourself a nice class to make all that string construction simple and reliable. The key thing is to never mix untrusted data into the SQL you are generating.

clauses = []
params = []
if partner_name:
    clauses.append('name = %s')
    params.append(partner_name)
if website:
    clauses.append('website = %s')
    params.append(website)
where_clause = ""
if len(clauses) > 0:
    where_clause = "where " + ' and '.join(clauses)
sql = "select id, name from res_partner " + where_clause
cr.execute(sql, params)
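That “nice class” for keeping SQL fragments and their parameters together might look something like this (an illustrative sketch of my own, not Odoo code):

```python
class SqlBuilder(object):
    # Collects clause fragments and their parameters together so the
    # two can never get out of step while building a query.
    def __init__(self):
        self.clauses = []
        self.params = []

    def add(self, clause, *params):
        self.clauses.append(clause)
        self.params.extend(params)

    def where(self):
        # Returns the assembled where clause and its parameter list.
        if not self.clauses:
            return "", []
        return "where " + " and ".join(self.clauses), self.params
```

Usage would mirror the snippet above: call add('name = %s', partner_name) for each optional condition, then pass the result of where() on to cr.execute.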

Refactoring out chunks of SQL generation

If you find yourself requiring reuse with snippets of SQL or subqueries it’s quite common to refactor that out into a method. This is another common place for mistakes. Return SQL and a list of parameters whenever you do this, rather than just a block of SQL.

def stock_level(product_ids):
    # product_ids should be a tuple so psycopg2 renders the IN clause
    sql = """
    select location_dest_id as location_id, product_uom, sum(product_qty) as qty
    from stock_move
    where product_id IN %s
    and state = 'done'
    group by location_dest_id, product_uom
    union all
    select location_id, product_uom, 0 - sum(product_qty) as qty
    from stock_move
    where product_id IN %s
    and state = 'done'
    group by location_id, product_uom
    """
    return (sql, [product_ids, product_ids])

snippet, params = stock_level(product_ids)
sql = """
select location_id, product_uom, sum(qty) as stock_level
from (
%s
) as stock
group by location_id, product_uom
having sum(qty) > 0 ;
""" % snippet
cr.execute(sql, params)

Using the same parameter twice

The previous example uses the same parameter twice in the query, and passes it in twice. This can be avoided by using a named arguments like %(name)s. This can also be useful when you have a very large query requiring lots of parameters as it allows you to pass in a dictionary and not worry about the ordering of the arguments you pass it.

def stock_level(product_ids):
    sql = """
    select location_dest_id as location_id, product_uom, sum(product_qty) as qty
    from stock_move
    where product_id IN %(product_id)s
    and state = 'done'
    group by location_dest_id, product_uom
    union all
    select location_id, product_uom, 0 - sum(product_qty) as qty
    from stock_move
    where product_id IN %(product_id)s
    and state = 'done'
    group by location_id, product_uom
    """
    return (sql, {"product_id": product_ids})

snippet, params = stock_level(product_ids)
sql = """
select location_id, product_uom, sum(qty) as stock_level
from (
%s
) as stock
group by location_id, product_uom
having sum(qty) > 0 ;
""" % snippet
cr.execute(sql, params)

Fixing badly refactored code

If you do find yourself with a function that returns SQL only, and you can’t change the function prototype because external code depends on it, psycopg2’s cursor provides the mogrify method, which can help. It does require the cursor to be in scope.

def stock_level(cr, uid, product_ids):
    sql = """
    select location_dest_id as location_id, product_uom, sum(product_qty) as qty
    from stock_move
    where product_id IN %s
    and state = 'done'
    group by location_dest_id, product_uom
    union all
    select location_id, product_uom, 0 - sum(product_qty) as qty
    from stock_move
    where product_id IN %s
    and state = 'done'
    group by location_id, product_uom
    """
    return cr.mogrify(sql, [product_ids, product_ids])

snippet = stock_level(cr, uid, product_ids)
sql = """
select location_id, product_uom, sum(qty) as stock_level
from (
%s
) as stock
group by location_id, product_uom
having sum(qty) > 0 ;
""" % snippet

Allowing the caller to specify fields/tables

This is something where the libraries generally don’t help; you need to sanitize the input yourself. In general, try to avoid fully untrusted input. It’s better to have a whitelist of what you expect, and allow callers to select from those known good tables/fields. Also remember that you can query the database for its structure if you don’t know it when writing the code. If you really can’t check either, sanitize the input heavily: allow the absolute minimum of characters within identifiers, and also consider using an extra qualifier so that you can identify programmatically created fields/tables and prevent name clashes with core database functions.
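A minimal sketch of the whitelist approach (illustrative only; the field list is hypothetical):

```python
ALLOWED_SORT_FIELDS = {"id", "name", "create_date"}  # hypothetical whitelist

def order_by_clause(field):
    # Identifiers can't be passed as query parameters, so check the
    # requested field against known good names before it reaches the SQL.
    if field not in ALLOWED_SORT_FIELDS:
        raise ValueError("unexpected sort field: %r" % (field,))
    return "order by %s" % field
```

Anything outside the whitelist is rejected outright, so attacker-controlled input never makes it into the generated SQL.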

In summary

Keep your string manipulation safe and never append untrusted input (note that data you retrieved from the database should generally itself be treated as untrusted). If you are doing it right, any query without parameters should start to look suspicious. There are of course simple queries that take no parameters, but they should be easy to verify as safe.

Note that the code is not from a live system and may not actually work (it was typed up for this post and not tested). It’s designed to illustrate the principles rather than be blindly copy pasted.

DBIx::Class join pruning

DBIx::Class has a join optimiser to prune out any unnecessary joins from the queries it generates.  Occasionally this can cause code that looks good to fail unexpectedly.  The most obvious time this normally happens is when using literal SQL.

The most common example of this is when doing an aggregate operation with a group by.  Here I’m trying to sum and count some values on a joined table.  This looks like pretty standard DBIC code, but it fails.  I don’t use the joined table in the group by, just in the columns, where it’s my own raw SQL.

sub total {
    my $self = shift;
    my @columns = $self->result_source->columns;
    my $me = $self->current_source_alias;
    return $self->search(undef, {
        join => ['items'],
        '+columns' => [
            { total => \'sum(amount) as total' },
            { lines => \'count(amount) as lines' },
        ],
        group_by => [map {"$me.$_"} @columns],
    });
}

$rs->total->search({
    total => { '>' => 10 },
}, {
    '+columns' => ['total', 'lines'],
});


DBI Exception: DBD::Pg::st execute failed: ERROR:  column "amount" does not exist
LINE 1: ... (SELECT "me"."id", "me"."name", "me"."type", sum(amount) as...
                                                             ^ [for Statement "SELECT "me"."id", "me"."name", "me"."type", "me"."total", "me"."lines" FROM (SELECT "me"."id", "me"."name", "me"."type", sum(amount) as total, count(amount) as lines FROM "orders" "me" GROUP BY "me"."id", "me"."name", "me"."type") "me" WHERE ( "total" > ? )" with ParamValues: 1='10']

Looking closely at the SQL generated, the items alias isn’t joined in.  The optimiser has pruned that join because it couldn’t see any reference to the column.  It turns out that DBIC does really try to do the right thing; it’s just that we didn’t give it enough information to work with.

Adding the ‘items.’ qualifier to the column names in the SQL allows DBIC to realise that that join is necessary.

    return $self->search(undef, {
        join => ['items'],
        '+columns' => [
            # NOTE: fully qualified names,
            # even though they are not ambiguous.
            { total => \'sum(items.amount) as total' },
            { lines => \'count(items.amount) as lines' },
        ],
        group_by => [map {"$me.$_"} @columns],
    });

With that table alias, not strictly necessary for the SQL since those column names are unambiguous, DBIC realises that it needs the join.

SELECT "me"."id", "me"."name", "me"."type", "me"."total", "me"."lines"
  FROM (
    SELECT "me"."id", "me"."name", "me"."type", sum( items.amount ) AS total, count( items.amount ) AS lines
      FROM "orders" "me"
      LEFT JOIN "order_items" "items"
        ON "items"."order_id" = "me"."id"
    GROUP BY "me"."id", "me"."name", "me"."type"
   ) "me"
WHERE "total" > '10'

Querying debian/ubuntu packages

The debian package management system is something I’ve been using for years, and it’s served me well for all those years. For the first decade I rarely needed to use more than the common apt-get install or dpkg -i package.deb. Listing the installed packages with dpkg -l has been useful too, but apart from apt-cache search those were all the commands I needed most of the time.

Over the past 5 years I’ve been using slightly more complex queries to figure out information about packages I’ve installed. Working at a company where we are deploying and supporting Debian servers on a regular basis has made it useful to poke about the .deb infrastructure a little more. In truth most of the time I just use 2 commands, but they seem to get me through most of what I want to do.

If I don’t know which package a program comes from I can use dpkg-query with the -S flag.

$ dpkg-query -S /bin/ss
iproute2: /bin/ss

To figure out what is provided by a package I can use the -L flag.

$ dpkg-query -L iproute2 | grep etc

That’s also useful for checking a package is still fully installed,

for f in $(dpkg-query -L python-lxml); do if [ ! -e "$f" ]; then echo "Missing $f"; fi; done

Of course there are other commands that I use periodically, but these are the ones I’ve been using frequently enough to start remembering by heart.

Vim word breaks and perl

I use vim for coding my Perl, and in the most recent versions of Ubuntu I’ve not been keen on one of the changes to the syntax defaults.  The word definition has started to include :, which means that class names like Blah::Thing count as one word, and that isn’t how I’m used to working.  Luckily vim is very configurable, and the vim IRC channel is also really helpful, so I was able to find where the configuration was set and how to override it quickly.

Finding out that it’s related to the iskeyword configuration was pretty simple, but changing that didn’t have any effect.  It turns out it is set in a Perl syntax specific configuration so overriding it is a little more tricky.

Figuring out where a configuration setting was set is done using :verb set ?

:verb set isk?
Last set from /usr/share/vim/vim74/ftplugin/perl.vim

Knowing where that was set I could then override it with an ‘after’ configuration file.

" ~/.vim/after/ftplugin/perl.vim
set iskeyword-=:

Then when I check in vim,

:verb set isk?
Last set from ~/.vim/after/ftplugin/perl.vim

For more of a discussion of the language specific configuration and the after files, see this vim wiki entry.