Datalog for Datalegends: How quickly can we do program analysis?

29 Aug, 2024

One way hackers attack programs is through a method called control flow hijacking. By exploiting design errors, attackers are able redirect program execution to malicious code that they have created. One way of stopping these attacks is by enforcing a control flow graph (CFG). Unfortunately, constructing a CFG comes with its own difficulties. Mainly determining the target of indirect calls. If we want to determine these targets ahead of time, we can perform a points-to analysis. We’ve elected to do our analysis in the logical programming language Datalog. This article goes over the basic syntax of Datalog as well as what I believe to be the best engine for program analysis.

Why Datalog?

Datalog is a relatively niche programming language, I was completely unaware of it when I started this project. In light of that, here is a brief introduction into what makes up a Datalog program: Datalog is a logical programming language where programs are made up of a combination of “facts” and “rules”. Here’s an example program demonstrating how to find connected nodes in a graph.

edge(1, 2).
edge(2, 3).
path(x, y) :- edge(x, y).
path(x, y) :- path(x, z), edge(z, y).

Lines 1-2 describe facts. They indicate that there is an edge between 1-2 and 2-3. Lines 3-4 show a non-recursive rule. It means that there is a path between two nodes if there is an edge between two nodes. If we were to run the program using only this rule we would get result of path(1, 2) and path(2, 3) because of our two facts. Lines 5-7 add onto the path rule with recursion. In our example, nodes 1 and 3 do not share an edge. However, they each have an edge with node 2. Our recursive path rule would see this and determine that there is a path between 1 and 3.

You can also set up Datalog programs to use .csv files as input/output. In the example below, edge facts will be in and input file and path will be directed to an output file.

.decl edge(x:number, y:number)
.input edge
.decl path(x:number, y:number)
.output path

path(x, y) :- edge(x, y).
path(x, y) :- edge(x, z), path(z, y).

Why would you want to do points-to analysis in Datalog? Points-to analysis can be classified as a problem with high density. In large programs there is typically a high number of potential objects that pointers can reference. You end up with a complex, interconnected web of connections. Modeling this in a traditional language can become complex and Datalog’s syntax makes it easier to model the relationships between objects. Furthermore, Datalog has multiple execution engines designed with program analysis in mind. This means that Datalog can often outperform popular languages for program analysis problems.

Engines

Below I’ll go over some of the popular Datalog implementations optimized for program analysis before giving my opinion as to which implementation should be used.

Recstep

RecStep’s primary optimization is a specialized data structure for handling problems of high density. They present the bit matrix, which represents a rule R(A, B) with options 1-n for each element in the rule in a matrix structure where each intersection can simply be flipped from 0 to 1 if a rule is satisfied. An example is below:

This simplicity makes the data structure very efficient in handling dense problems. Also, the structure can be parallelized very easily. All of which makes RecStep perfect for handling program analysis problems. However, this is still a research prototype. There are dependency issues, and the implementation has only been tested with integers.

Formulog

The Formulog engine takes advantage of the fact that many Datalog programs can be run using SMT solvers after translating the syntax. Formulog can also be compiled which further speeds up the execution. Unfortunately, due to the structure of program analysis problems, they cannot make use of SMT solver optimizations.

Souffle

Souffle, like RecStep, makes use of a specialized data structure to optimize for program analysis. This structure, called a brie, is able to very efficiently handle dense sets of data due to its binary encoding and fast searching algorithms. Souffle, like Formulog can be compiled to speed up runtime. Unlike Formulog and RecStep, Souffle is heavily supported by industry and government grants and sees consistent updates. Because of these reasons, we decided to use compiled Souffle Datalog as our language of choice.

Conclusion

This post is just an introduction into the heavy theory behind what exactly makes Datalog so applicable to program analysis and what kinds of optimizations are done in order to speed up execution. If you want to learn more about any of the engines, I have linked papers below. I also want to say that my opinion in no way invalidates any of the work done on the RecStep and Formulog engines, which actually outperform Souffle for certain kinds of problems. Finally, I want to thank Professor Tan and Monika Santra of Penn State’s SOS group, for introducing me to Datalog and allowing me to implement these findings on their control flow graph generator, BPA.

Sources

RecStep
Making Formulog Fast
Brie: A Specialized Trie for Datalog
Control Flow Generator