Finding string formatting vulnerabilities in C/C++
==================================================

.. rst-class:: setup
For this example you need to set up CodeQL for Visual Studio Code and download the CodeQL database for dotnet/coreclr from GitHub.
.. rst-class:: agenda
- Non-constant format string
- Data flow
- Modules and libraries
- Local data flow
- Local taint tracking
Let’s write a query to identify instances of CWE-134 Use of externally controlled format string.

.. code-block:: cpp

   printf(userControlledString, arg1);

Goal: Find uses of printf (or similar) where the format string can be controlled by an attacker.

Note
Formatting functions allow the programmer to construct a string output using a format string and an optional set of arguments. The format string is specified using a simple template language, where the output string is constructed by processing the format string to find format specifiers, and inserting values provided as arguments. For example:

.. code-block:: cpp

   printf("Name: %s, Age: %d", "Freddie", 2);

would produce the output "Name: Freddie, Age: 2". So far, so good. However, problems arise if there is a mismatch between the number of formatting specifiers, and the number of arguments. For example:

.. code-block:: cpp

   printf("Name: %s, Age: %d", "Freddie");

In this case, we have one more format specifier than we have arguments. In a managed language such as Java or C#, this simply leads to a runtime exception. However, in C/C++, the formatting functions are typically implemented by reading values from the stack without any validation of the number of arguments. This means a mismatch in the number of format specifiers and format arguments can lead to information disclosure.

Of course, in practice this happens rarely with constant formatting strings. Instead, it’s most problematic when the formatting string can be specified by the user, allowing an attacker to provide a formatting string with the wrong number of format specifiers. Furthermore, if an attacker can control the format string, they may be able to provide the %n format specifier, which causes printf to write the number of characters written so far to a location given by the corresponding argument.
See https://en.wikipedia.org/wiki/Uncontrolled_format_string for more background.
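
To make the effect of %n concrete, here is a minimal sketch that uses it with a literal format string, where it is under the programmer's control. The helper name ``chars_before_marker`` is hypothetical, and the sketch assumes a C library that supports %n (some platforms disable it by default).

```c
#include <stdio.h>

/* Hypothetical helper, for illustration only.
   Writes "Name: <name>!" into buf and returns the value stored by %n:
   the number of characters produced before the %n specifier was reached.
   The format string is a literal, so the write through %n targets a
   variable the programmer chose, not one chosen by an attacker. */
int chars_before_marker(char *buf, size_t size, const char *name) {
    int count = -1;
    snprintf(buf, size, "Name: %s%n!", name, &count);
    return count;
}
```

Called with the name "Freddie", the buffer receives "Name: Freddie!" and the function returns 13. If an attacker supplies the format string instead, they choose where %n appears, which is what turns an information leak into a memory write.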
Write a query that flags printf calls where the format argument is not a StringLiteral.
Hint: Import semmle.code.cpp.commons.Printf and use class FormattingFunction and getFormatParameterIndex().
.. rst-class:: build
.. literalinclude:: ../query-examples/cpp/data-flow-cpp-1.ql
   :language: ql

Note
This first query is about finding places where the format specifier is not a constant string. In the CodeQL libraries for C/C++, constant strings are modeled as StringLiteral nodes, so we are looking for calls to format functions where the format specifier argument is not a string literal.
The C/C++ standard libraries include many different formatting functions that may be vulnerable to this particular attack, including printf, snprintf, and others. Furthermore, each of these formatting functions may take the format string at a different position in its argument list. Instead of laboriously listing all these variants, we can make use of the standard CodeQL class FormattingFunction, which provides an interface that models common formatting functions in C/C++.
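
As a rough illustration of why a single hard-coded argument position is not enough, the sketch below calls three standard formatting functions. The function name ``format_positions`` is hypothetical; the positions in the comments are zero-based argument indices at the call site.

```c
#include <stdio.h>

/* Three standard formatting functions; the format string occupies a
   different (zero-based) argument position in each call. */
int format_positions(FILE *stream, char *buf, size_t size) {
    printf("%d\n", 1);                    /* format is argument 0 */
    fprintf(stream, "%d\n", 2);           /* format is argument 1 */
    return snprintf(buf, size, "%d", 3);  /* format is argument 2 */
}
```

A query that looks only at the first argument would miss the fprintf and snprintf calls. Asking the library model for the format position avoids that.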
Results are unsatisfactory:
- Query flags cases where the format string is a symbolic constant.
- Query flags cases where the format string is itself a format argument.
- Query doesn't recognize wrapper functions around printf-like functions.
We need something better.
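
Two of the problem shapes listed above can be sketched in C as follows. All names here (``GREETING_FORMAT``, ``log_to_buffer``, ``greet``) are hypothetical and chosen only for illustration; they do not come from the coreclr code base.

```c
#include <stdarg.h>
#include <stdio.h>

/* A symbolic constant: at the call site this is a variable access,
   not a string literal, even though its value never changes. */
static const char *const GREETING_FORMAT = "Hello, %s!";

/* Hypothetical wrapper that forwards its own format parameter to
   vsnprintf. The format used inside is a parameter, so the interesting
   question is what the callers of log_to_buffer pass as fmt. */
int log_to_buffer(char *buf, size_t size, const char *fmt, ...) {
    va_list args;
    va_start(args, fmt);
    int written = vsnprintf(buf, size, fmt, args);
    va_end(args);
    return written;
}

int greet(char *buf, size_t size, const char *name) {
    /* The format argument is GREETING_FORMAT, not a literal. */
    return log_to_buffer(buf, size, GREETING_FORMAT, name);
}
```

Calling ``greet`` with "Freddie" produces "Hello, Freddie!". Nothing here is attacker-controlled, yet a check that only asks whether the format argument is a string literal cannot tell that.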
Note
For example, consider the results which appear in /src/ToolBox/SOS/Strike/util.h, between lines 965 and 970:

.. code-block:: cpp

   const char *format = align == AlignLeft ? "%-*.*s" : "%*.*s";

   if (IsDMLEnabled())
       DMLOut(format, width, precision, mValue);
   else
       ExtOut(format, width, precision, mValue);

Here, DMLOut and ExtOut are macros that expand to formatting calls. The format specifier is not constant, in the sense that the format argument is not a string literal. However, it is clearly one of two possible constants, both with the same number of format specifiers.

What we need is a way to determine whether the format argument is ever set to something that is not constant.
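
The same pattern can be reduced to a self-contained sketch. The function ``format_aligned`` and its fixed width and precision are assumptions made for illustration. Only the choice between the two literal format strings mirrors the excerpt above.

```c
#include <stdio.h>

/* Hypothetical reduction of the pattern: the format is a local variable,
   but every value it can hold is a literal with the same specifiers
   (width, precision, string). */
int format_aligned(char *buf, size_t size, int align_left, const char *value) {
    const char *format = align_left ? "%-*.*s" : "%*.*s";
    /* Width 6 and precision 3 are arbitrary illustrative values. */
    return snprintf(buf, size, format, 6, 3, value);
}
```

With the value "abcdef", left alignment yields "abc" followed by three spaces, and right alignment yields three spaces followed by "abc". Whichever branch is taken, the format string traces back to a literal, which is the property a data flow query can check.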
Define a subclass of DataFlow::Node representing “source” nodes, that is, nodes without a (local) data flow predecessor.
Hint: use not exists().
.. rst-class:: build

.. code-block:: ql

   class SourceNode extends DataFlow::Node {
     SourceNode() {
       not DataFlow::localFlowStep(_, this)
     }
   }

Note
Note the scoping of the don’t-care variable “_” in this example: the body of the characteristic predicate is equivalent to:

.. code-block:: ql

   not exists(DataFlow::Node pred | DataFlow::localFlowStep(pred, this))

which is not the same as:

.. code-block:: ql

   exists(DataFlow::Node pred | not DataFlow::localFlowStep(pred, this))

Refine the query to find calls to printf-like functions where the format argument derives from a local source that is not a constant string.

.. rst-class:: build

.. code-block:: ql

   import cpp
   import semmle.code.cpp.dataflow.DataFlow
   import semmle.code.cpp.commons.Printf

   class SourceNode extends DataFlow::Node { ... }

   from FormattingFunction f, Call c, SourceNode src, DataFlow::Node arg
   where c.getTarget() = f and
         arg.asExpr() = c.getArgument(f.getFormatParameterIndex()) and
         DataFlow::localFlow(src, arg) and
         not src.asExpr() instanceof StringLiteral
   select arg, "Non-constant format string."

Audit the results and apply any refinements you deem necessary.

Suggestions:

- Replace DataFlow::localFlowStep with a custom predicate that includes steps through global variable definitions.

  Hint: Use class GlobalVariable and its member predicates getAnAssignedValue() and getAnAccess().

- Exclude calls in wrapper functions that just forward their format argument to another printf-like function; instead, flag calls to those functions.

- Results are still underwhelming.
- Dealing with parameter passing becomes cumbersome.
- Instead, let’s turn the problem around and find user-controlled data that flows into a printf format argument, potentially through calls.
- This needs :doc:`global data flow <global-data-flow-cpp>`.