CodeQL

In this post, we explore how to use CodeQL and apply it to a mature software project (V8). I try to emphasize the usage of the command line tool instead of the VSCode extension (which the official documentation favors).

Installation ¶

First, we need to install CodeQL. Although this turned out to be quite easy, I found the concrete steps to be somewhat lost in the vast documentation. Since I prefer to work in containerized environments, I prepared a Dockerfile for installing CodeQL with V8.

Installing CodeQL is done by downloading and extracing a tar.gz file. Make sure to download the codeql-bundle and not the codeql-cli-binaries like I accidentally did. Add the path to the extracted directory to the PATH environment variable to enable CodeQL in your terminal. You can test whether the setup was successful by running codeql resolve qlpacks. If this results in a list of installed languages packages (and not just legacy-upgrades), you installed the correct version.

Overall, the installation looks like this:

$ curl -L0 \
    https://github.com/github/codeql-action/releases/download/codeql-bundle-v2.16.1/codeql-bundle-linux64.tar.gz \
    --output codeql-bundle-linux64.tar.gz
$ tar -xzf codeql-bundle-linux64.tar.gz
$ export PATH=$PATH:$(pwd)/codeql # or add it to .bashrc / .zshrc for permanence
$ codeql resolve qlpacks
codeql/cpp-all is found in 3 same-priority locations, so attempts to resolve it will fail:
/tools/codeql/qlpacks/codeql/cpp-examples/0.0.0/.codeql/libraries/codeql/cpp-all/0.12.4
/tools/codeql/qlpacks/codeql/cpp-all/0.12.4
/tools/codeql/qlpacks/codeql/cpp-queries/0.9.3/.codeql/libraries/codeql/cpp-all/0.12.4
...

Setting up a project ¶

Before being able to run CodeQL queries, we need to create a database for our target project and initialize a CodeQL project. For V8, the Dockerfile creates a database during the build phase:

/app/v8 $ codeql database create v8.db --language=cpp --command="tools/dev/gm.py x64.debug"

Even if you want to just test out the CodeQL language (without analyzing any source code) you need a database to run queries (AFAIK). For this, I found an empty C file to be sufficient:

$ echo "int main(int c, char** v) {}" > main.c
$ codeql database create test.db --language=c --command="make main"

Note, that if your project is more sophisticated and you want to rebuild your database, you need to provide the --overwrite flag and a “clean” build command as stated in a warning during the build:

For [compiled] languages, the --command must specify a “clean” build which compiles all the source code files without reusing existing build artefacts.

Finally, we need to initialize the project folder where our CodeQL query files will be stored and add the cpp dependency to be able to anaylze cpp source code.

$ codeql pack init example-package
$ codeql pack add codeql/cpp-all

After the setup, you should have a test.db folder containing information about our (empty) main file and a project folder example-package containing a qlpack.yml file.

Query Language ¶

CodeQL is a declarative language based on Datalog (itself based on Prolog). Thus, instead of describing how the query should be executed, we describe what the query should return. To get started with CodeQL, I recommend looking at the CodeQL tutorials, the QL language specification and the existing queries in the qlpacks, e.g., for cpp these are located in codeql/qlpacks/codeql/cpp-examples and codeql/qlpacks/codeql/cpp-queries.

The basis of a CodeQL query is the select clause that consists of the select keyword and optionally the from and where keyword. Although the later two are optional, we want to discuss the from keyword first, since it forms the basis of our query and declares the variables that are used in our select clause. For example, the snippet from Stmt stmt would “collect” every statement that is present in the target code. Next, the select keyword defines the columns of the query's results (similar to its SQL counterpart). To select all statements and their location in the source code, we could thus write select stmt, stmt.getLocation(). The last keyword, where, defines conditions that elements have to satisfy to be included in the result set. For example, we could only select statements that represent for loops with where stmt instanceof ForStmt (instead of using the where keyword here, we could have also used from ForStmt stmt to limit the query to elements of type ForStmt). In more complex queries, these can be defined using predicates.

If we want to analyze C++ code in our query, we additionally need to import the cpp package to enable C++ support in CodeQL. Overall, this results in the following query

import cpp

from Stmt stmt
where stmt instanceof ForStmt
select stmt, stmt.getLocation()

which can be executed with the query run command of the CodeQL CLI:

$ codeql query run --database test.db example-package/first-query.ql

In general, I found it helpful to think of each aspect of CodeQL (except for the select statement which defines how the final output is organized) as a filtering step: the from statement defines the initial set of objects by their type, the where statement filters this set by applying predicates. Each predicate filters the set of objects via properties (e.g. the length of the string). Even casting in CodeQL is just a constraint on the type of the object. Whether an object belongs to a class just comes down to whether it fits the object's properties. For example, the integer '2' belongs to both of the (fictional) classes OneOrTwo and TwoOrThree.

Example: Index Loop ¶

Let's have a look at a more complex example: In this section, we create a CodeQL class that describes simple for loops that use an index variable to iterate from a starting value to an end value. We want to find loops of the form for (int i = 0; i < n; i++) which iterates the variable i from 0 to n. Here, we do not care how the index variable or the boundary variable is named. We explore the code step by step; the full CodeQL code can be found here.

First, we define our new class called IndexLoopStmt that extends ForStmt. If the function isIndexLoop in the constructor evaluates to true, the respective statement ForStmt is (also) an IndexLoopStmt. By defining a separate class we can later simply use from IndexLoopStmt s to get all index-based for loops.

class IndexLoopStmt extends ForStmt {
  IndexLoopStmt() {
    isIndexLoop(this)
  }

  /* [...] */
}

Each for loop consists of three components: the initializer, the condition and the update. We want the initializer to only consists of a single assignment to an index variable (e.g., int i = 0), the condition to be a comparision between the index variable and an upper bound (e.g., i < n), and the update to be an increment of the index variable (e.g., i++).

The first thing to note (that I obviously noticed last during my first attempt to write the code) is that we need to make sure that all three components use the same index variable. To get the name of the variable, we need a series of calls that I found by a mixture of browsing the code and trial-and-error. To get the index variable, we first call getInitialization on our ForStmt to get the initialization statement. Next comes an interesting feature of CodeQL: By casting to a DeclStmt, we only inspect elements that correspond to this type. In contrast to languages like C, the cast is not a “dangerous” operation if applied to the wrong element. Instead, it acts more like a filter; thus, every element that is not a DeclStmt is no longer addressed by this statement. We can then call getADeclaration on the object and cast it to the type Variable to get our index variable. Note we use a Variable instead of an AssignExpr here, because the code documentation explicitly mentions this.

Now, we can retrieve the index variable from the ForStmt and can compare it with the variable in the loop condition and loop update; we do this to ensure that the index variable in the initialization is the same as the variables used in the condition and the update. On a side note, we could use the same structure to catch the common error of a false update statement in two nested loops (i.e., two for loops with the index variables i and j where the nested loop uses i++ instead of j++ in the update). We create three auxiliary functions to define the conditions for the initialization, condition, and update statement that have to be met to be considered an index-based loop. We combine these pieces with the exists keyword. This defines a set of variables and evaluates to true if the formulars evaluate to true for one set of variables. We use this construct to retrieve the index variable Variable v:

predicate isIndexLoop(ForStmt loop) {
  exists(Variable v |
    v = loop.getInitialization().(DeclStmt).getADeclaration().(Variable) |
    isIndexInit(v) and isIndexCond(loop, v) and isIndexInc(loop, v)
  )
}

Originally, I passed the ForStmt to each auxiliary function which made isIndexLoop a one-liner. However, because I had to retrieve the index variable in both isIndexCond and isIndexInc separately, I decided to move this functionality up into the isIndexLoop function and create an auxiliary variable v.

We then define three functions that verify the properties of each component of our for loop:

isIndexInit: the initialization should be a single statement that assigns a single value (Literal) to the index variable.
isIndexCond: the condition should compare this variable against an upper bound value using a less than comparison (LTExpr). Here, we use a simple string comparison between the left operand of the lower than comparison and the index variable.
isIndexInc: the update should increase the index variable using an increment operation (i.e., prefix or postfix increment). Here, we again compare the variable from the IncrementOperation with our index variable using string comparison.

In the latter two functions we again use casts to limit our predicate to elements of that type. I am not sure whether there exists a better way to compare the variables for equality (string comparison seems to be somewhat hacky), but ultimately we want both variables to have the same identifier. I again used explored the CodeQL source code and used trial-and-error to find the correct functions.

predicate isIndexInit(Variable v) {
  v.getInitializer().getExpr() instanceof Literal
}

predicate isIndexCond(ForStmt loop, Variable v) {
  loop.getCondition().(LTExpr).getLeftOperand().toString() = v.toString()
}

predicate isIndexInc(ForStmt loop, Variable v) {
  loop.getUpdate().(IncrementOperation).getOperand().toString() = v.toString()
}

For convenience, we can extend our IndexLoopStmt class to include a function to retrieve the index variable and the upper bound of the comparison operation. Unfortunately, this results in a small amount of duplicate code, but because of recursion, we can not combine the class method and the predicate (AFAIK).

class IndexLoopStmt extends ForStmt {
  IndexLoopStmt() {
    isIndexLoop(this)
  }

  Variable index() {
    result = this.getInitialization().(DeclStmt).getADeclaration().(Variable)
  }

  Expr bound() {
    result = this.getCondition().(LTExpr).getRightOperand()
  }
}

Finally, we can combine everything in a neat select clause that retrieves the index variable and the upper bound of index-based for loops. We can also add a where clause to filter for “simple” bounds (e.g., literals).

from IndexLoopStmt loop
where loop.bound().getNumChild() = 0
select loop, loop.index(), loop.bound()

Installation ¶

Setting up a project ¶

Query Language ¶

Example: Index Loop ¶

See also ¶