Hand Coding SQL with psycopg2 for Odoo

I thought it would be a good idea to write down some guidelines on writing hand coded SQL for Odoo/OpenERP. One of the general themes should be ‘avoid SQL injection’, and Odoo themselves provide some documentation explaining that. They also make the point before that you should try to stick with the ORM as much as possible too. Rolling your own SQL bypasses all the access controls on records, so if you are writing some SQL, always consider whether the user is supposed to have access to the data extracted or manipulated by the SQL.

Assuming you’re still set on writing SQL it’s worth noting that Odoo uses psycopg2, a mature and respectable library for accessing Postgres from Python. Dealing with parameters in SQL clauses is simple as it allows you to pass them through as parameters, just as you’d hope.

cr.execute('select * from res_partner where name = %s', (partner_name,))

Unfortunately it’s too simplistic to say never use string concatenation. In practice you often want to stitch together a complex query based on some data. You might want to vary the search criteria based on user input.

Here are some concrete scenarios to demonstrate more complex situations. Read the psycopg2 documentation for how to pass in values, and how it converts the various types first.

In clause

This is the most common place people give up with parameterisation. It’s clearly documented in Psycopg2, but since people often don’t read the docs, they don’t realise %s works and so they end up doing something involving join and their own format strings. The syntax is really simple,

cr.execute('select * from res_partner where name in %s', (['Partner 1', 'Partner 2'],))

Optional clauses

The trick with building up optional parts is to maintain lists of parameters alongside lists of the bits of SQL you need. The exact mechanism for appending the strings isn’t terribly important, and in fact you might find it useful to write yourself a nice class to make all that string construction simple and reliable. The key thing is to never mix untrusted data into the SQL you are generating.

clauses = []
params = []
if partner_name:
    clauses.append('name = %s')
    params.append(partner_name)
if website:
    clauses.append('website = %s')
    params.append(website)
where_clause = ""
if len(clauses) > 0:
    where_clause = "where " + ' and '.join(clauses)
   
sql = "select id, name from res_partner " + where_clause
cr.execute(sql, params)

Refactoring out chunks of SQL generation

If you find yourself requiring reuse with snippets of SQL or subqueries it’s quite common to refactor that out into a method. This is another common place for mistakes. Return SQL and a list of parameters whenever you do this, rather than just a block of SQL.

def stock_level(product_ids):
    sql = """
    select location_dest_id as location_id, product_uom, sum(product_qty) as qty
    from stock_move 
    where product_id IN %s
    and state = 'done'
    group by  location_dest_id, product_uom
    union all 
    select location_id, product_uom, 0 - sum(product_qty) as qty
    from stock_move 
    where product_id IN %s
    and state = 'done'
    group by  location_id, product_uom
    """
    return (sql, [product_ids, product_ids])

snippet, params = stock_level(product_ids)
sql = """
select location_id, product_uom, sum(qty) as stock_level
from (
    %s
) as stock
group by location_id, product_uom
having sum(qty) > 0 ;
""" % snippet
cr.execute(sql, params)

Using the same parameter twice

The previous example uses the same parameter twice in the query, and passes it in twice. This can be avoided by using a named arguments like %(name)s. This can also be useful when you have a very large query requiring lots of parameters as it allows you to pass in a dictionary and not worry about the ordering of the arguments you pass it.

def stock_level(product_ids):
    sql = """
    select location_dest_id as location_id, product_uom, sum(product_qty) as qty
    from stock_move 
    where product_id IN %(product_id)s
    and state = 'done'
    group by  location_dest_id, product_uom
    union all 
    select location_id, product_uom, 0 - sum(product_qty) as qty
    from stock_move 
    where product_id IN %(product_id)s
    and state = 'done'
    group by  location_id, product_uom
    """
    return (sql, {"product_id": product_ids})

snippet, params = stock_level(product_ids)
sql = """
select location_id, product_uom, sum(qty) as stock_level
from (
    %s
) as stock
group by location_id, product_uom
having sum(qty) > 0 ;
""" % snippet
cr.execute(sql, params)

Fixing badly refactored code

If you do find yourself with a function that is returning SQL only, and you can’t change the function prototype because external code is depending on it, psycopg2 provides the mogrify function that can help. It does require the cursor to be in scope.

def stock_level(cr, uid, product_ids):
    sql = """
    select location_dest_id as location_id, product_uom, sum(product_qty) as qty
    from stock_move 
    where product_id IN %s
    and state = 'done'
    group by  location_dest_id, product_uom
    union all 
    select location_id, product_uom, 0 - sum(product_qty) as qty
    from stock_move 
    where product_id IN %s
    and state = 'done'
    group by  location_id, product_uom
    """
    return cr.mogrify(sql, [product_ids, product_ids])

snippet = stock_level(cr, uid, product_ids)
sql = """
select location_id, product_uom, sum(qty) as stock_level
from (
    %s
) as stock
group by location_id, product_uom
having sum(qty) > 0 ;
""" % snippet
cr.execute(sql)

Allowing the caller to specify fields/tables

This is something where the libraries generally don’t help. You need to sanitize this code yourself. In general try to avoid fully untrusted output. It’s better to have a whitelist of what you expect to allow, and allow them to select from those known good tables/fields. Also remember that you can query the database for what the structure of the database is if you don’t know it when writing the code. If you really can’t check either, sanitize the input heavily. Try to allow the absolute minimum characters within identifiers and also consider using an extra qualifier so that you can identify programmatically created fields/tables and prevent name clashes with core database functions.

In summary

Keep your string manipulation safe, never append untrusted input (note that data you retrieved from the database should generally be treated as untrusted). Assuming you are doing it right, any query without parameters should look start to look suspicious. There are of course simple queries that take no parameters, but they should be easy to verify as safe.

Note that the code is not from a live system and may not actually work (it was typed up for this post and not tested). It’s designed to illustrate the principles rather than be blindly copy pasted.

Limit and Order by

Remember to make sure your ordering is precise when doing a limit or whatever your SQL dialect wants you to do. 

Since I normally only screw this up at the start of a project this was the second time I’ve done it but it’s an odd one.  You’d kind of think that the fundamental nature of the limit clause would mean you’d need a consistent order for it to be useful.  If you do a query that asks for the first 10 records and then another that’s identical but asks for the next 10 you probably assume that’s what you’d get.  You have to remember this is SQL that the results returned are un-ordered unless specified.  If your order by is not specific enough you can get a random set of results each time.  Typically the same exact same query, even un-ordered returns the same results, but if you’re changing the limit clause it’s not an identical query and so the query plan may be subtly different and so you can’t be sure which records you’ll get.  In fact your order by needs to specify the ordering predictably for every single row to get consistent results.  An order by on a field that’s not distinct will still allow some degree of unpredictability and probably defeat the point of your limit on your query.

DBF, Dates and Advantage speed

I’ve optimized a query to work better with dates on Advantage Database Server using DBF’s and I figured I ought to make a note of what I did so that I don’t forget the next time I need to.  Of course I didn’t realise it was the dates that was causing the problems when I started, but that certainly appears to have been the case.

With DBF’s the date fields are stored as text so the trick when pulling the dates into another table is to convert them to text so that there’s no conversion take place.  My original query to take a subset of a table was like this,

select top 30 rowid as row, [date], [time] into #records from whaudit order by [date], [time]

This took 1 minute with a fairly large dataset, even with an index of DToS(Date) + Time.  By changing that to,

select top 30 rowid as row, convert([date], SQL_CHAR) as dt, [time] into #records from whaudit order by [date], [time]

It goes down to 3 seconds to execute on the same table.  The use of the temporary table is to implement a paged query so that I can take a subset of records, and the rowid in ADS SQL is particularly useful for selecting the exact records.

The one weird thing is that if you alias the converted field [date] the query goes back to executing in a minute.  I can’t quite figure that out, whether it’s because it’s because it’s the same column name as in the order by or because it’s a keyword but it gets a different execution plan and is really slow.  This might not happen to all tables, but it certainly happens to ours with it’s specific index setup.

select top 30 rowid as row, convert([date], SQL_CHAR) as [date], [time] into #records from whaudit order by [date], [time]