CodeQL
In this post, we explore how to use CodeQL and apply it to a mature software project (V8). I try to emphasize the usage of the command line tool instead of the VSCode extension (which the official documentation favors).
Installation ¶
First, we need to install CodeQL. Although this turned out to be quite easy, I found the concrete steps to be somewhat lost in the vast documentation. Since I prefer to work in containerized environments, I prepared a Dockerfile for installing CodeQL with V8.
Installing CodeQL is done by downloading and extracing a tar.gz file. Make sure
to download the codeql-bundle
and not the
codeql-cli-binaries
like I accidentally did.
Add the path to the extracted directory to the PATH
environment variable to enable CodeQL in your terminal. You can test whether
the setup was successful by running codeql resolve qlpacks
.
If this results in a list of installed languages packages (and not just
legacy-upgrades
), you installed the correct version.
Overall, the installation looks like this:
$ curl -L0 \
https://github.com/github/codeql-action/releases/download/codeql-bundle-v2.16.1/codeql-bundle-linux64.tar.gz \
--output codeql-bundle-linux64.tar.gz
$ tar -xzf codeql-bundle-linux64.tar.gz
$ export PATH=$PATH:$(pwd)/codeql # or add it to .bashrc / .zshrc for permanence
$ codeql resolve qlpacks
codeql/cpp-all is found in 3 same-priority locations, so attempts to resolve it will fail:
/tools/codeql/qlpacks/codeql/cpp-examples/0.0.0/.codeql/libraries/codeql/cpp-all/0.12.4
/tools/codeql/qlpacks/codeql/cpp-all/0.12.4
/tools/codeql/qlpacks/codeql/cpp-queries/0.9.3/.codeql/libraries/codeql/cpp-all/0.12.4
...
Setting up a project ¶
Before being able to run CodeQL queries, we need to create a database for our target project and initialize a CodeQL project. For V8, the Dockerfile creates a database during the build phase:
/app/v8 $ codeql database create v8.db --language=cpp --command="tools/dev/gm.py x64.debug"
Even if you want to just test out the CodeQL language (without analyzing any source code) you need a database to run queries (AFAIK). For this, I found an empty C file to be sufficient:
$ echo "int main(int c, char** v) {}" > main.c
$ codeql database create test.db --language=c --command="make main"
Note, that if your project is more sophisticated and you want to rebuild
your database, you need to provide the --overwrite
flag and a “clean” build command as stated in a warning during the build:
For [compiled] languages, the --command must specify a “clean” build which compiles all the source code files without reusing existing build artefacts.
Finally, we need to initialize the project folder where our CodeQL query files will be stored and add the cpp dependency to be able to anaylze cpp source code.
$ codeql pack init example-package
$ codeql pack add codeql/cpp-all
After the setup, you should have a test.db
folder containing information about our (empty) main file and a project
folder example-package
containing a
qlpack.yml
file.
Query Language ¶
CodeQL is a declarative language based on
Datalog
(itself based on Prolog). Thus, instead of describing how the query
should be executed, we describe what the query should return. To get started with
CodeQL, I recommend looking at the
CodeQL tutorials,
the QL language specification
and the existing queries in the qlpacks, e.g., for cpp these are located in
codeql/qlpacks/codeql/cpp-examples
and
codeql/qlpacks/codeql/cpp-queries
.
The basis of a CodeQL query is the
select clause
that consists of the select
keyword and
optionally the from
and
where
keyword. Although the later two are
optional, we want to discuss the from
keyword first, since it forms the basis of our query and declares
the variables that are used in our select clause. For example, the
snippet from Stmt stmt
would
“collect” every
statement
that is present in the target code. Next, the
select
keyword defines the columns
of the query's results (similar to its SQL counterpart). To select
all statements and their location in the source code, we could thus write
select stmt, stmt.getLocation()
.
The last keyword, where
, defines
conditions that elements have to satisfy to be included in the
result set. For example, we could only select statements that
represent for loops with
where stmt instanceof ForStmt
(instead of using the where
keyword here,
we could have also used from ForStmt stmt
to limit the query to elements of type ForStmt
).
In more complex queries, these can be defined using
predicates.
If we want to analyze C++ code in our query, we additionally need to
import
the cpp package to enable C++
support in CodeQL. Overall, this results in the following query
import cpp
from Stmt stmt
where stmt instanceof ForStmt
select stmt, stmt.getLocation()
which can be executed with the query run
command
of the CodeQL CLI:
$ codeql query run --database test.db example-package/first-query.ql
In general, I found it helpful to think of each aspect of CodeQL
(except for the select
statement
which defines how the final output is organized) as a
filtering step: the from
statement defines
the initial set of objects by their type, the
where
statement filters this set by applying
predicates. Each predicate filters the set of objects via properties
(e.g. the length of the string). Even casting in CodeQL is just a
constraint on the type of the object. Whether an object belongs to
a class just comes down to whether it fits the object's properties.
For example, the integer '2' belongs to both of the (fictional)
classes OneOrTwo
and
TwoOrThree
.
Example: Index Loop ¶
Let's have a look at a more complex example:
In this section, we create a CodeQL class that describes simple
for loops that use an index variable to iterate from a starting value
to an end value. We want to find loops of the form
for (int i = 0; i < n; i++)
which iterates the variable i
from
0
to n
. Here,
we do not care how the index variable or the boundary variable is named.
We explore the code step by step; the full CodeQL code can be found
here.
First, we define our new
class
called IndexLoopStmt
that extends ForStmt
. If the function
isIndexLoop
in the constructor evaluates
to true, the respective statement
ForStmt
is (also) an
IndexLoopStmt
. By defining a separate class
we can later simply use from IndexLoopStmt s
to get all index-based for loops.
class IndexLoopStmt extends ForStmt {
IndexLoopStmt() {
isIndexLoop(this)
}
/* [...] */
}
Each for loop consists of three components: the initializer, the
condition and the update. We want the initializer to only consists of
a single assignment to an index variable (e.g.,
int i = 0
), the condition to be a
comparision between the index variable and an upper bound (e.g.,
i < n
), and the
update to be an increment of the index variable (e.g.,
i++
).
The first thing to note (that I obviously noticed last during my
first attempt to write the code) is that we need to make sure that
all three components use the same index variable. To get the name of
the variable, we need a series of calls that I found by a mixture of
browsing the code
and trial-and-error. To get the index variable, we first call
getInitialization
on our
ForStmt
to get the initialization
statement. Next comes an interesting feature of CodeQL: By
casting
to a DeclStmt
, we only inspect elements
that correspond to this type. In contrast to languages like C, the
cast is not a “dangerous” operation if applied to the wrong element.
Instead, it acts more like a filter; thus, every element that is not
a DeclStmt
is no longer addressed by this
statement. We can then call getADeclaration
on the object and cast it to the type Variable
to get our index variable. Note we use a Variable
instead of an AssignExpr
here, because
the
code documentation explicitly mentions this.
Now, we can retrieve the index variable from the
ForStmt
and can compare it with the
variable in the loop condition and loop update; we do this to ensure
that the index variable in the initialization is the same as the
variables used in the condition and the update. On a side note, we
could use the same structure to catch the common error of a false update
statement in two nested loops (i.e., two for loops with the index variables
i
and j
where
the nested loop uses i++
instead of
j++
in the update). We create three
auxiliary functions to define the conditions for the initialization,
condition, and update statement that have to be met to be considered
an index-based loop. We combine these pieces with the exists
keyword. This defines a set of variables and evaluates to true if
the formulars evaluate to true for one set of variables. We use this
construct to retrieve the index variable Variable v
:
predicate isIndexLoop(ForStmt loop) {
exists(Variable v |
v = loop.getInitialization().(DeclStmt).getADeclaration().(Variable) |
isIndexInit(v) and isIndexCond(loop, v) and isIndexInc(loop, v)
)
}
Originally, I passed the ForStmt
to each
auxiliary function which made isIndexLoop
a
one-liner. However, because I had to retrieve the index variable in
both isIndexCond
and
isIndexInc
separately, I decided to move
this functionality up into the isIndexLoop
function and create an auxiliary variable v
.
We then define three functions that verify the properties of each component of our for loop:
-
isIndexInit
: the initialization should be a single statement that assigns a single value (Literal
) to the index variable. -
isIndexCond
: the condition should compare this variable against an upper bound value using a less than comparison (LTExpr
). Here, we use a simple string comparison between the left operand of the lower than comparison and the index variable. -
isIndexInc
: the update should increase the index variable using an increment operation (i.e., prefix or postfix increment). Here, we again compare the variable from theIncrementOperation
with our index variable using string comparison.
predicate isIndexInit(Variable v) {
v.getInitializer().getExpr() instanceof Literal
}
predicate isIndexCond(ForStmt loop, Variable v) {
loop.getCondition().(LTExpr).getLeftOperand().toString() = v.toString()
}
predicate isIndexInc(ForStmt loop, Variable v) {
loop.getUpdate().(IncrementOperation).getOperand().toString() = v.toString()
}
For convenience, we can extend our IndexLoopStmt
class to include a function to retrieve the index variable and the
upper bound of the comparison operation. Unfortunately, this results
in a small amount of duplicate code, but because of recursion, we can
not combine the class method and the predicate (AFAIK).
class IndexLoopStmt extends ForStmt {
IndexLoopStmt() {
isIndexLoop(this)
}
Variable index() {
result = this.getInitialization().(DeclStmt).getADeclaration().(Variable)
}
Expr bound() {
result = this.getCondition().(LTExpr).getRightOperand()
}
}
Finally, we can combine everything in a neat select clause that retrieves
the index variable and the upper bound of index-based for loops. We
can also add a where
clause to filter for
“simple” bounds (e.g., literals).
from IndexLoopStmt loop
where loop.bound().getNumChild() = 0
select loop, loop.index(), loop.bound()