GSoC 2020: SLEIGH Disassembler Backend
Introduction
Hello all, I’m Jiaxiang Zhou from China. I was lucky to be selected as a participant in the Radare2 project this year. My main work was to integrate SLEIGH as a disassembly backend into Radare2. r2ghidra-dec, which aims to deliver Ghidra’s decompiler to Radare2, was my main working repository. It could be renamed to r2ghidra, since after this project it would become not only a decompiler but a complete bridge between Radare2 and Ghidra.
Special thanks go to my mentors, Florian Märkl, Giovanni, and Anton. Your patience and guidance are much appreciated. I couldn’t have completed this project without your support.
Here are the slides made for r2con2020.
RAsm plugin
The SLEIGH disassembler is deeply embedded in the C++ codebase of Ghidra’s decompiler. So the solution is clear:
Isolate SLEIGH from the C++ codebase of Ghidra’s decompiler
To get full access to Sleigh and to the low-level spec files and interfaces, I implemented a class (SleighAsm) that works like a lite version of Architecture. This class exports the P-codes and the register info parsed from the spec file. It enables us to disassemble all valid instructions on demand.
RAnal plugin
SLEIGH emits P-codes as an IR to describe what an instruction does. So I had to analyze the P-codes to extract control flow and type info. When it came to ESIL, things got trickier because P-code’s model and ESIL’s are different. What’s more, P-code supports floating-point operations, which ESIL doesn’t.
Port SleighInstructionPrototype
Ghidra’s C++ codebase concentrates on the decompiler, so it focuses on function-level analysis. There are classes like Funcdata to analyze intra-function flow. But the instruction-level analysis tools only exist in the Java codebase. So I had to port SleighInstructionPrototype from Java to C++. This enables control flow info extraction on Constructors (a lower level than P-code). The hard part of this port was minimizing the changes to the original Ghidra codebase. And I eventually managed to port the whole SleighInstructionPrototype with only two private fields exported!
diff --git a/Ghidra/Features/Decompiler/src/decompile/cpp/context.hh b/Ghidra/Features/Decompiler/src/decompile/cpp/context.hh
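To make the idea of instruction-level flow extraction more concrete, here is a minimal C++ sketch that derives a flow kind and a target from the P-code ops of one instruction. It is a simplification that works on already emitted P-code rather than on Constructors, and PcodeKind, PcodeOp, FlowInfo and extractFlow are hypothetical names chosen for illustration, not Ghidra or r2ghidra APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for P-code ops; not Ghidra's real classes.
enum class PcodeKind { Copy, IntAdd, Load, Store, Branch, CBranch, BranchInd, Call, CallInd, Return };

struct PcodeOp {
    PcodeKind kind;
    bool hasTarget;        // true when the destination is a known constant address
    std::uint64_t target;  // meaningful only if hasTarget is true
};

enum class FlowKind { Fallthrough, Jump, CondJump, IndirectJump, Call, IndirectCall, Return };

struct FlowInfo {
    FlowKind kind;
    bool hasTarget;
    std::uint64_t target;
};

// Scan one instruction's P-code and report the last flow-changing op found.
FlowInfo extractFlow(const std::vector<PcodeOp>& ops) {
    FlowInfo info = {FlowKind::Fallthrough, false, 0};
    for (const PcodeOp& op : ops) {
        FlowKind kind;
        switch (op.kind) {
        case PcodeKind::Branch:    kind = FlowKind::Jump; break;
        case PcodeKind::CBranch:   kind = FlowKind::CondJump; break;
        case PcodeKind::BranchInd: kind = FlowKind::IndirectJump; break;
        case PcodeKind::Call:      kind = FlowKind::Call; break;
        case PcodeKind::CallInd:   kind = FlowKind::IndirectCall; break;
        case PcodeKind::Return:    kind = FlowKind::Return; break;
        default: continue;  // data-only op: no effect on control flow
        }
        info.kind = kind;
        info.hasTarget = op.hasTarget;
        info.target = op.target;
    }
    return info;
}
```

If an instruction contains several flow ops, this sketch simply keeps the last one. A real implementation would have to treat such cases, as well as branches that are relative within the P-code of a single instruction, with more care.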
And this is the control flow extracted from the P-code results alone:
P-code to ESIL translation
Ghidra’s emulation system, which is based on P-code, is quite different from ESIL. First, it’s not stack-based, and it can randomly access any intermediate variable.
Here’s my design:
Add PICK to take values from the stack to the stack top. I intentionally leave intermediate variables on the stack and retrieve them when they are needed.
Add GET to retrieve a register’s value to the stack. Sometimes a register’s name is just one character, and ESIL may confuse it with a number. Most importantly, I need GET to retrieve float numbers to the stack without changing ESIL’s original code in Radare2.
Override = to store a value from the stack back to a register. This pairs with GET for float number handling. You will notice that every element except the register (used as the destination) is an immediate value.
Override [4] and [8] to read float numbers from memory. When a float number is written into a register or memory location, I record its register name or memory address and track it until it’s overwritten by anything other than a float number.
Override =[4] and =[8] to store float numbers to memory. As above.
Add a series of float operations.
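The float tracking described above can be pictured as a small set of locations that is updated on every write. The following is a minimal sketch under that assumption. FloatTracker and its methods are hypothetical names, and a location is modeled as a plain string (a register name or a formatted address) purely for illustration.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical helper: remembers which locations currently hold a float.
class FloatTracker {
public:
    // Call on every write; isFloat says whether the written value is a float.
    void recordWrite(const std::string& location, bool isFloat) {
        assert(!location.empty());
        if (isFloat) {
            floats_.insert(location);  // start (or keep) tracking this location
        } else {
            floats_.erase(location);   // a non-float write ends the tracking
        }
    }

    // True if the most recent write to this location was a float.
    bool holdsFloat(const std::string& location) const {
        return floats_.count(location) != 0;
    }

private:
    std::set<std::string> floats_;
};
```

With something like this, a translator could ask holdsFloat before deciding whether a read should go through the float-aware handlers or the ordinary integer ones.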
During the actual translation, the plugin uses a stack to emulate the intermediate variables left on the ESIL stack. This helps calculate the offsets of the intermediate variables (unique VarnodeData) used as arguments of the current P-code.
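The bookkeeping described above can be sketched as a model of the ESIL stack that only remembers which slot holds which unique varnode. This is a minimal illustration, not the plugin's code: StackModel and its methods are hypothetical, and it assumes that PICK takes a 1-based depth (1 being the current top) and pushes a copy of that slot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of what the generated ESIL leaves on the ESIL stack.
class StackModel {
public:
    // The expression just emitted left the unique varnode `id` on top.
    void pushUnique(std::uint64_t id) { slots_.push_back(Slot{true, id}); }

    // An anonymous value (an immediate, or a copy made by PICK) is on top.
    void pushTemp() { slots_.push_back(Slot{false, 0}); }

    // The emitted ESIL consumed the top value.
    void pop() {
        assert(!slots_.empty());
        slots_.pop_back();
    }

    std::size_t size() const { return slots_.size(); }

    // Return ESIL text that copies unique `id` to the top, or "" if untracked.
    std::string pick(std::uint64_t id) {
        for (std::size_t i = slots_.size(); i-- > 0;) {
            if (slots_[i].isUnique && slots_[i].id == id) {
                std::size_t depth = slots_.size() - i;  // 1 == current top
                pushTemp();  // the copy now occupies the top slot
                return std::to_string(depth) + ",PICK";
            }
        }
        return "";
    }

private:
    struct Slot {
        bool isUnique;
        std::uint64_t id;
    };
    std::vector<Slot> slots_;
};
```

Note that every pick pushes a copy, so the depth of the remaining uniques grows by one. Keeping the model in sync with each emitted ESIL token is what makes the computed offsets stay valid.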
Here’s an example:
Pattern match on P-code to analysis type of instruction
SLEIGH only provides P-codes. But P-codes don’t tell what the instruction is. And SLEIGH’s multi-arch support makes things even more complex.
I surveyed all the R_ANAL_OP_TYPE_* values and summarized their patterns, hoping to pattern-match on the P-codes to learn what type an instruction is. It sounds crazy, but I not only typed instructions successfully, I also managed to recover the arguments of the associated instructions!
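As a rough illustration of the pattern-matching idea, the sketch below classifies an instruction by checking whether a known opcode pattern appears, in order, within its P-code. The rules, the OpType values (which only loosely mirror a few R_ANAL_OP_TYPE_* names) and the functions matches and classify are all assumptions made up for this example. The real plugin covers many more opcodes and also looks at operands.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, heavily reduced opcode and instruction-type sets.
enum class Pcode { Copy, Load, Store, IntAdd, IntSub };
enum class OpType { Unknown, Mov, Load, Store, Add, Sub, Push };

struct Rule {
    std::vector<Pcode> pattern;  // opcodes that must appear in this order
    OpType type;
};

// True if `pattern` occurs in `ops` as an ordered, possibly non-contiguous subsequence.
bool matches(const std::vector<Pcode>& ops, const std::vector<Pcode>& pattern) {
    std::size_t next = 0;
    for (Pcode op : ops) {
        if (next < pattern.size() && op == pattern[next]) {
            ++next;
        }
    }
    return next == pattern.size();
}

OpType classify(const std::vector<Pcode>& ops) {
    // Illustrative rules only, ordered from most to least specific.
    static const std::vector<Rule> rules = {
        {{Pcode::IntSub, Pcode::Store}, OpType::Push},
        {{Pcode::Store}, OpType::Store},
        {{Pcode::Load}, OpType::Load},
        {{Pcode::IntAdd}, OpType::Add},
        {{Pcode::IntSub}, OpType::Sub},
        {{Pcode::Copy}, OpType::Mov},
    };
    for (const Rule& rule : rules) {
        assert(!rule.pattern.empty());  // an empty pattern would match everything
        if (matches(ops, rule.pattern)) {
            return rule.type;
        }
    }
    return OpType::Unknown;
}
```

Rule order matters here: a stack push contains a store, so the more specific push pattern has to be tried before the plain store pattern.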
Here’s an example:
Summary
Both the RAsm and RAnal plugins are working and ready to provide the information recovered from Ghidra’s SLEIGH disassembler.
Related commits:
https://github.com/thestr4ng3r/ghidra/commit/50acb42da120072ce1c02a15004bc7e74a682599
https://github.com/thestr4ng3r/ghidra/commit/7bde3b54b43230601363f89b0214ab4bdba8bf6f
https://github.com/radareorg/r2ghidra-dec/commit/59473e5c53d8724cfee489968e0990a31396ef90
https://github.com/radareorg/r2ghidra/commit/e083b183c9954c899fd4f945547683689c9d00a4