How to detect instruction boundaries when designing a disassembler?

Question

If I want to write a disassembler, how can I determine the boundaries of instructions? By "boundaries," I mean the rule for splitting a sequence of bytes into individual instructions.

For example, given the bytes FF 25 10 00 FF FF 38 11 FF 0A (I made this up), how can we know whether it starts with the two instructions FF 25 10 00 FF and FF 38 11 FF 0A, or with the single instruction FF 25 10 00 FF FF?

The most direct implementation that comes to mind is building an opcode table. When we encounter a byte, we look it up in the opcode table to find all possible matching instructions (e.g., using a `switch` statement). For example, FF can match CALL or JMP instructions. We then narrow it down based on the next byte, which is the same procedure I follow when decoding machine code by hand.
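A minimal sketch of the table-driven idea described above, assuming 32/64-bit x86 addressing and a tiny hand-picked subset of opcodes. The names `OPCODES`, `modrm_extra`, `instruction_length`, and `split_instructions` are hypothetical, and prefixes, two-byte opcodes, and 16-bit addressing are deliberately ignored, so this is an illustration rather than a usable decoder.

```python
# Toy length decoder for a tiny subset of 32/64-bit x86 (illustrative only).
# Assumption: no prefixes, no two-byte opcodes, no 16-bit addressing.

# opcode -> (has a ModRM byte?, number of immediate/relative bytes that follow)
OPCODES = {
    0x38: (True, 0),   # CMP r/m8, r8
    0x90: (False, 0),  # NOP
    0xC3: (False, 0),  # RET
    0xE8: (False, 4),  # CALL rel32
    0xE9: (False, 4),  # JMP rel32
    0xEB: (False, 1),  # JMP rel8
    0xFF: (True, 0),   # group 5 (INC/DEC/CALL/JMP/PUSH r/m), chosen by ModRM.reg
}


def modrm_extra(code, pos):
    """Count the bytes after the ModRM byte at code[pos]: optional SIB + displacement."""
    modrm = code[pos]
    mod, rm = modrm >> 6, modrm & 7
    if mod == 3:                      # register operand, nothing follows
        return 0
    extra = 0
    if rm == 4:                       # a SIB byte follows
        extra += 1
        if mod == 0 and (code[pos + 1] & 7) == 5:
            extra += 4                # SIB base=101 with mod=00: disp32, no base
    if mod == 0 and rm == 5:
        extra += 4                    # disp32 (RIP-relative in 64-bit mode)
    elif mod == 1:
        extra += 1                    # disp8
    elif mod == 2:
        extra += 4                    # disp32
    return extra


def instruction_length(code, pos):
    """Return the length of the instruction starting at code[pos]."""
    opcode = code[pos]
    if opcode not in OPCODES:
        raise ValueError(f"opcode {opcode:#04x} at offset {pos} is not in the toy table")
    has_modrm, imm_size = OPCODES[opcode]
    length = 1
    try:
        if has_modrm:
            length += 1 + modrm_extra(code, pos + 1)
    except IndexError:
        raise ValueError(f"truncated instruction at offset {pos}") from None
    length += imm_size
    if pos + length > len(code):
        raise ValueError(f"truncated instruction at offset {pos}")
    return length


def split_instructions(code):
    """Split code into instructions, decoding forwards from offset 0."""
    pos, chunks = 0, []
    while pos < len(code):
        n = instruction_length(code, pos)
        chunks.append(code[pos:pos + n])
        pos += n
    return chunks


code = bytes.fromhex("FF 25 10 00 FF FF 38 11 FF 0A")
for chunk in split_instructions(code):
    print(chunk.hex(" "))  # bytes.hex(sep) needs Python 3.8+
# ff 25 10 00 ff ff
# 38 11
# ff 0a
```

Under these assumptions the made-up bytes split into three instructions of 6, 2, and 2 bytes: the ModRM byte `25` (mod=00, rm=101) tells the decoder that a 4-byte displacement follows `FF 25`, so the first instruction cannot end after five bytes.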

Are there any other smarter ways to achieve this? If so, please let me know.

On x86, you won't get around decoding the instructions to find their lengths. — fuz, Apr 18 '23 at 11:46
`FF 25` is the opcode+modrm for `jmp [rip+rel32]`, so it consumes another 4 bytes of machine code as the rest of the instruction. There are some design features that make length decoding (given a starting point) cheaper, e.g. you don't need to account for all the REX bits to decode length from opcode+modrm, since R12 and R13 have some of the same restrictions as RSP and RBP. ([rbp not allowed as SIB base?](https://stackoverflow.com/q/52522544)). — Peter Cordes, Apr 18 '23 at 11:54
But x86 machine code is a byte stream that's **not self-synchronizing**. There's no way to look at a sequence of bytes and know which one starts a new instruction, other than decoding forwards from a known start point. (And following relative branches, for disassemblers like IDA that attempt to be somewhat resistant to obfuscations, not just designed for simple compiler output like `objdump` is.) [Is it possible to decode x86-64 instructions in reverse?](https://stackoverflow.com/q/52415761) — Peter Cordes, Apr 18 '23 at 11:57
[Looking at x86 machine code, how do I determine the starting location of the next instruction?](https://stackoverflow.com/q/64370582) / [What happens if \`objdump -d --start-address\` starts printing from the middle of an x86 instruction?](https://stackoverflow.com/q/75262864) — Peter Cordes, Apr 18 '23 at 11:58
I noticed the whole "not self-synchronizing" thing @PeterCordes mentioned when looking at a Game Boy debugger: every time I scrolled up the disassembly, the top instruction would change, because the program was getting new data and refreshed the disassembly from a new starting point. That being said, I feel like there would be a way to figure out the odds of a given interpretation being correct based on context clues. Sometimes you can see something that just makes no sense within the context of the other instructions. — puppydrum64, Jun 30 '23 at 12:30
@puppydrum64: Yeah, you could probably train a model (perhaps using AI techniques) to predict whether an instruction sequence was nonsense or not. In code that isn't intentionally obfuscated, it should work pretty well. — Peter Cordes, Jun 30 '23 at 12:38
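The "not self-synchronizing" point from the comments can be illustrated with a small sketch: the same bytes decode into different, equally valid instruction sequences depending on where decoding starts. The `LENGTHS` table and the `boundaries` function are hypothetical and cover only three single-byte-opcode x86 instructions.

```python
# Hypothetical fixed-length table: opcode -> total instruction length in bytes.
LENGTHS = {
    0x90: 1,  # NOP
    0xC3: 1,  # RET
    0xB8: 5,  # MOV EAX, imm32 (opcode + 4 immediate bytes)
}


def boundaries(code, start):
    """Return instruction start offsets found by decoding forwards from `start`.

    Stops at the end of the buffer, at an opcode missing from the toy table,
    or at an instruction that would run past the end of the buffer.
    """
    offsets = []
    pos = start
    while pos < len(code):
        length = LENGTHS.get(code[pos])
        if length is None or pos + length > len(code):
            break
        offsets.append(pos)
        pos += length
    return offsets


code = bytes.fromhex("B8 90 90 90 C3 90 C3")

# Starting at offset 0: mov eax, 0xC3909090 ; nop ; ret
print(boundaries(code, 0))  # [0, 5, 6]

# Starting at offset 1, inside the immediate: nop ; nop ; nop ; ret ; nop ; ret
print(boundaries(code, 1))  # [1, 2, 3, 4, 5, 6]
```

Both decodings are valid on their own, and nothing in the bytes marks offset 1 as the middle of an instruction. Here the two interpretations happen to agree again from offset 5 onward, but that resynchronization is a coincidence of this byte sequence, not a property a disassembler can rely on.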
