XPath and docx documents

I’ve just been writing a quick script to rip a docx document to POD format for something internal, and it’s made me extend my XPath skills a bit more.  The XPath axes come in handy when you want something more subtle than // but you aren’t entirely sure how deep a child element is.  You can then do something like “descendant::a:blip”, which will find every a:blip element anywhere down the tree from your current element.
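Here’s a minimal sketch of that axis in use.  It assumes XML::LibXML is installed, and the XML is a made-up, heavily simplified fragment rather than real Word output (a real document nests a:blip deeper still).  It isn’t necessarily how my module does it.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Simplified, hypothetical stand-in for word/document.xml.
my $xml = <<'XML';
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
            xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <w:body>
    <w:p><w:r><w:drawing><a:graphic><a:graphicData>
      <a:blip r:embed="rId5"/>
    </a:graphicData></a:graphic></w:drawing></w:r></w:p>
    <w:p><w:r><w:t>No image here</w:t></w:r></w:p>
  </w:body>
</w:document>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

# Prefixes used in XPath expressions have to be registered explicitly.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( w => 'http://schemas.openxmlformats.org/wordprocessingml/2006/main' );
$xpc->registerNs( a => 'http://schemas.openxmlformats.org/drawingml/2006/main' );

my @counts;
for my $para ( $xpc->findnodes('//w:p') ) {
    # descendant:: searches below $para only, at whatever depth.
    my @blips = $xpc->findnodes( 'descendant::a:blip', $para );
    push @counts, scalar @blips;
    print "paragraph: ", scalar(@blips), " image reference(s)\n";
}
```

The difference from // is that an expression starting with // searches from the document root, whereas descendant:: is relative to the node you pass in as the context.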

I’m putting the module I’ve just written on GitHub for now.  It’s definitely not good enough for CPAN but it might be useful for reference, so I figure it’s worth sharing.  It does a very basic rip of the XML from a docx file (which is basically a zip containing XML and other resources), and there is a script to turn that into simple POD.

https://github.com/colinnewell/WordDocxScraper
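To illustrate the “it’s basically a zip” point, here’s a rough sketch that pulls the main document XML out of an archive using only core modules.  The docx_body_xml helper is a name I’ve made up for this example, the archive is a tiny stand-in built in memory so the snippet runs without a real file, and none of this is taken from the module above.

```perl
use strict;
use warnings;
use IO::Compress::Zip     qw(zip $ZipError);
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Hypothetical helper: return the main document XML from a .docx.
# Accepts a file name or a reference to a buffer holding the archive.
sub docx_body_xml {
    my ($docx) = @_;
    my $xml;
    unzip $docx => \$xml, Name => 'word/document.xml'
        or die "unzip failed: $UnzipError\n";
    return $xml;
}

# Stand-in archive with a single member, built in memory.
my $fake_xml = '<document><body>Hello</body></document>';
zip \$fake_xml => \my $fake_docx, Name => 'word/document.xml'
    or die "zip failed: $ZipError\n";

my $body = docx_body_xml( \$fake_docx );
print length($body), " bytes of XML extracted\n";
```

With a real file you’d pass the path instead, for example docx_body_xml('report.docx'), where report.docx is just a placeholder name.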

Limit and Order by

Remember to make sure your order by is precise when using a limit (or whatever equivalent your SQL dialect wants you to use).

Since I normally only screw this up at the start of a project, this is only the second time I’ve done it, but it’s an odd one.  You’d think the fundamental nature of the limit clause would mean you need a consistent order for it to be useful.  If you run a query that asks for the first 10 records and then an identical one that asks for the next 10, you probably assume that’s what you’ll get.  But this is SQL, and results are returned un-ordered unless you specify otherwise.  If your order by is not specific enough, rows that tie can come back in any order, so you can get a different set of results each time.  Typically the exact same query, even un-ordered, returns the same results, but once you change the limit clause it’s no longer an identical query, so the query plan may be subtly different and you can’t be sure which records you’ll get.  Your order by needs to determine the position of every single row to give consistent results.  An order by on a field that’s not distinct still allows some degree of unpredictability and probably defeats the point of the limit on your query.
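The usual fix is to add a unique column as a tie-breaker at the end of the order by.  Here’s a small sketch, assuming DBI and DBD::SQLite are installed; the items table, its columns and the page helper are all invented for the example.  SQLite may well behave consistently without the tie-breaker, so this only shows the safe form rather than reproducing the problem.

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1 } );

# Hypothetical table: category is deliberately not unique.
$dbh->do('CREATE TABLE items (id INTEGER PRIMARY KEY, category TEXT NOT NULL)');
my $insert = $dbh->prepare('INSERT INTO items (category) VALUES (?)');
$insert->execute( $_ % 3 == 0 ? 'a' : 'b' ) for 1 .. 30;

# Ordering by category alone leaves ties; adding the primary key
# gives every row exactly one position, so pages cannot overlap.
sub page {
    my ( $limit, $offset ) = @_;
    return $dbh->selectcol_arrayref(
        'SELECT id FROM items ORDER BY category, id LIMIT ? OFFSET ?',
        undef, $limit, $offset );
}

my $first  = page( 10, 0 );
my $second = page( 10, 10 );
print "first page:  @$first\n";
print "second page: @$second\n";
```

Any column, or combination of columns, that is guaranteed unique will do as the tie-breaker; the primary key is just the easiest choice.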

Heisenbugs

Argh.  If you get an exception like the one below from the Iterator module and you tend to run with Carp::Always, it’s probably a heisenbug.  Try turning Carp::Always off and see what happens.  I’m not sure whether this is a bug in a library or a bug in my code, but I’ve just wasted a couple of hours because I accidentally left Carp::Always on.

Exception::Class::Base::throw('Iterator::X::User_Code_Error', 'message', ' at lib/site_perl...', 'eval_error', 'Iterator::X::User_Code_Error=HASH(0xa0d3f08)') called at lib/site_perl/5.10.1/Iterator.pm line 236

I should note that both modules are very useful.  It’s just a freak combination of the two in my code that’s caused me a brief issue.  I figured I’d better document it so that hopefully I won’t spend so long on it the next time it happens.  This code is all running in a Catalyst app, which might be another factor affecting things.

SSL host checking and LWP::UserAgent

I needed to turn on host validation and coincidentally the new major release of LWP::UserAgent does that by default now!  The one problem I had was that a root certificate I needed wasn’t included in its standard bundle (well, the Mozilla one in the Mozilla::CA dist).  I already suspected that would be the case since I’d noticed Firefox didn’t like the site’s certificate, so I just had to figure out how to authenticate this site.

Looking at the certificate I could see that it was provided by ‘GlobalSign’ so I had a look on their website for the root certificates I’d need.  They provided it in what they call their ‘Domain root validation bundle’.

These normally seem to be in the format of base64 text encapsulated by ‘-----BEGIN CERTIFICATE-----’ and ‘-----END CERTIFICATE-----’ headers.  This seems to be the format of the .pem files LWP::UserAgent wants to use.  There can be multiple certificates in a single file, as there are in the bundle from GlobalSign.

If you save them to bundle.pem, this little snippet will check that you can download a URL from the site correctly.

use strict;
use warnings;
use LWP::UserAgent;

# Trust the CA certificates in bundle.pem when verifying the server.
my $ua = LWP::UserAgent->new;
$ua->ssl_opts( SSL_ca_file => './bundle.pem' );
my $response = $ua->get('https://somesite/');
print "URL check ", $response->is_success ? 'succeeded' : 'failed', "\n";

If you’re using your own certificate you should be able to use your own *cert.pem file in place of bundle.pem.  In the case of metabase.cpantesters.org, for example, I was able to download the cert by exporting it via Chrome and then point to it in the same manner.

That solved my problem, although having investigated some more I must confess this doesn’t look like the whole story.  A look at the GlobalSign site suggests they are supported by Mozilla (and Firefox), which confuses things a little.  It appears they issue different types of certificate, and perhaps the ones I’m having to deal with aren’t covered by that support?

One additional note: if your LWP::UserAgent is deeply embedded in something else and you can’t get to it to set the config, there are environment variables you can use instead.  Just check the documentation for LWP::UserAgent.
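As far as I can tell from that documentation, the relevant variables are PERL_LWP_SSL_CA_FILE (a CA file, like SSL_ca_file) and PERL_LWP_SSL_VERIFY_HOSTNAME (set it to 0 to turn hostname checking off).  Here’s a small sketch that sets the first one from inside Perl and checks that a freshly created user agent has picked it up; it assumes the variable is read when the object is constructed, so it has to be set before new is called.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Must be set before the user agent is created (assumption: the
# constructor is where LWP::UserAgent reads it).
$ENV{PERL_LWP_SSL_CA_FILE} = './bundle.pem';

# Stands in for a user agent created somewhere you can't reach.
my $ua = LWP::UserAgent->new;

# Read the option back to confirm the variable was honoured.
my $ca_file = $ua->ssl_opts('SSL_ca_file');
print "CA file in use: ", ( defined $ca_file ? $ca_file : '(none)' ), "\n";
```

More usually you’d set it in the shell that launches the program, for example PERL_LWP_SSL_CA_FILE=./bundle.pem perl yourscript.pl, where yourscript.pl is a placeholder.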