2

How can I determine what an array of bytes will translate into in machine code?

I understand that if I see 0f at the start it's a 2 byte instruction, but I see other prefixes and in some disassembly in my x64 debugger I see weird interactions like 48 83 C4 38 and I can see on the opcode reference that 48 says the operand is 64 bytes.

But 83 says it can be 7 different instructions depending on a field called "register/opcode field" ..what?

Can someone please explain the logic behind how the processor uses these bytes to determine:

  1. What instruction is run
  2. Which register(s) and/or address(es) the instruction uses (if any)
Peter Cordes
  • 2
    mostly a duplicate of [How to read the Intel Opcode notation](https://stackoverflow.com/q/15017659), but the answers there don't pick this apart in much detail. – Peter Cordes Dec 25 '18 at 18:18
  • 1
    [How to interpret the opcode manually?](https://stackoverflow.com/q/6019843) covers decoding ModRM. [what does opcode FF350E204000 do?](https://stackoverflow.com/q/15209993) is another duplicate of this, using `FF /6` `push r/m64` as the example. [x86 OpCode Instruction Decoding](https://stackoverflow.com/q/26607462) is yet another duplicate. I found all these googling on `site:stackoverflow.com x86-64 register/opcode field` – Peter Cordes Dec 25 '18 at 18:24
  • 2
    The internals of some processors have the equivalent of multiple "lookup tables" to handle opcode, r/m field, rex prefix, ... . – rcgldr Dec 25 '18 at 19:10
  • 1
    @peter cordes After some learning I have come to an understanding that my question is actually 100% a duplicate of [How to read the Intel Opcode notation](https://stackoverflow.com/q/15017659), I don't really respect stackoverflow enough to look into what I should do from this point though, but since I see you're on frequently and will probably see this I ask for your advice on what I'm supposed to do (as in, do I mark this question as a duplicate and close it or what?) – bcvdgfdag fewafdsaf Dec 27 '18 at 14:57
  • @bcvdgfdagfewafdsaf: I'll move my answer over there and close this as a duplicate. Thanks for letting me know that you agree it's a duplicate, rather than broadening this question to encompass more of what old_timer answered. – Peter Cordes Dec 27 '18 at 17:36

3 Answers

3

0x48 is a REX prefix, with the W field set to 1, implying 64-bit operand size. (not 64-byte).
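
The bit layout of a REX prefix is 0100WRXB, so the W bit can be read straight off the byte. A minimal Python sketch of that split (the function name is made up for illustration, and it assumes 64-bit mode, since outside it 0x40..0x4F are not prefixes):

```python
def decode_rex(byte):
    """Split a REX prefix byte (0100WRXB) into its four flag bits."""
    if byte & 0xF0 != 0x40:
        raise ValueError(f"{byte:#04x} is not a REX prefix")
    return {
        "W": (byte >> 3) & 1,  # 1 = 64-bit operand size
        "R": (byte >> 2) & 1,  # extra high bit for ModRM.reg
        "X": (byte >> 1) & 1,  # extra high bit for SIB.index
        "B": byte & 1,         # extra high bit for ModRM.rm / SIB.base / opcode register
    }

print(decode_rex(0x48))  # {'W': 1, 'R': 0, 'X': 0, 'B': 0}
```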

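Many opcodes for immediate versions of instructions, including 83, use the 3-bit reg field of the ModR/M byte (the "register/opcode field" you saw in the reference) as 3 extra opcode bits. Intel writes this as /0 through /7 after the opcode byte. Intel's vol.2 manual documents this, and the opcode table in an appendix includes it, I think.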
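
The instructions that opcode 83 selects among can be written as a lookup on those three bits. A minimal Python sketch; the table order is the usual one for this opcode group (ADD, OR, ADC, SBB, AND, SUB, XOR, CMP for /0 through /7), and the function name is only for illustration:

```python
# Mnemonics selected by ModRM bits 5..3 for opcode 0x83 (/0 .. /7).
GROUP1 = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]

def group1_mnemonic(modrm):
    """Return the mnemonic chosen by the reg field of a ModRM byte."""
    reg = (modrm >> 3) & 0b111
    return GROUP1[reg]

print(group1_mnemonic(0xC4))  # 0xC4 = 11 000 100 -> add
print(group1_mnemonic(0xEC))  # 0xEC = 11 101 100 -> sub
```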

This is why most original-8086 immediate instructions, like and r/m, imm, still only allow 2 operands, unlike shrd eax, edx, 4 or imul edx, [rdi], 12345 where both ModRM fields are used to encode operands, as well as the immediate operand implied by the opcode. SHRD/SHLD were added with 386, and imul-immediate was added with 186. It's maybe unfortunate that copy-and-AND (and eax, edx, 0xf) isn't encodable, but at least x86 can use LEA for copy-and-add/sub.


Each instruction's own docs, e.g. add (html extract of the vol2 manual), show encodings like
REX.W + 83 /0 ib for ADD r/m64, imm8, which is what you have.

diagram of the ModRM bit fields from wiki.osdev.org

  7                           0
+---+---+---+---+---+---+---+---+
|  mod  |    reg    |     rm    |
+---+---+---+---+---+---+---+---+

0xc4 = 0b11000100, so the reg field = 0. Thus our opcode is 83 /0, in Intel's notation.

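The rest of the ModRM fields are:

mod = 0b11: the r/m operand is a register, not a memory addressing mode.
rm = 0b100: register number 4, which is RSP (REX.B is 0 in 0x48, so the register number is not extended to r12).

The final byte, 0x38, is the ib: an 8-bit immediate that is sign-extended to the operand size.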

So the instruction is add rsp, 0x38
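
The same steps can be written out as a tiny decoder. This is only a sketch in Python for the one encoding discussed here (REX.W + 83 /digit ib with a register destination); the function and table names are invented for illustration and it is nowhere near a general x86 decoder:

```python
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
          "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]
GROUP1 = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]

def decode_rexw_83(code):
    """Decode only REX.W + 83 /digit ib with mod = 11 (register operand)."""
    rex, opcode, modrm, imm8 = code
    if rex & 0xF8 != 0x48 or opcode != 0x83:
        raise ValueError("only REX.W + 83 is handled in this sketch")
    mod = modrm >> 6            # bits 7..6
    reg = (modrm >> 3) & 7      # bits 5..3: opcode extension (/digit)
    rm = modrm & 7              # bits 2..0: register number
    if mod != 0b11:
        raise ValueError("memory operands are not handled in this sketch")
    rm |= (rex & 1) << 3        # REX.B supplies bit 3 of the register number
    imm = imm8 - 256 if imm8 & 0x80 else imm8   # ib is sign-extended
    return f"{GROUP1[reg]} {REGS64[rm]}, {imm:#x}"

print(decode_rexw_83(bytes([0x48, 0x83, 0xC4, 0x38])))  # add rsp, 0x38
```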

ndisasm -b64 agrees:

$ cat > foo.asm
db 0x48, 0x83, 0xC4, 0x38
$ nasm foo.asm     # create a flat binary with those bytes, not an object file
$ ndisasm -b64 foo
00000000  4883C438          add rsp,byte +0x38
ecm
Peter Cordes
  • You really went the extra mile! +1 – kabanus Dec 25 '18 at 19:57
  • Reposted with minor expansion of the intro section on [How to read the Intel Opcode notation](https://stackoverflow.com/a/53976236), and closed this question as a duplicate. I guess I'll leave this answer here instead of deleting, because it still answers this question. – Peter Cordes Dec 30 '18 at 08:38
2

I see letters on a page, the letter a, this could be many different words, the letter after it is an n. This could be an, and, answer, any number of words, so I continue.

x86 and other machine code from that era worked this way, in particular the instruction sets that it was directly derived from.

First off, and most important: if you just take all the bytes of a program and jump into the middle, it will not make any sense. It is very, very easy to get off on the wrong foot: "the quick brown fox", "thequickbrownfox", "ickbrow", what is that? The processor starts and continues based on the rules of the instruction set. The processor is fairly stupid; it follows the rules as defined, or at least as documented, in the processor manuals. So long as the programmer and tools have created a properly constructed program it will not get lost; if it does, that is the fault of the programmer or tools, not the processor. The processor starts by decoding the opcode byte as the opcode byte. Depending on its value, that byte could be a whole instruction or just a fraction of one. If a fraction, then the first byte plus the byte that follows it may determine the whole instruction, or may still be only a fraction.

In CISC in particular, the opcode bytes themselves, and in part the bytes that follow, may or may not contain bit fields that mean something on their own. In a RISC like mips or arm or others, 0000 in a specific place means register 0, 0001 means register 1 and so on. But in some, if not many, CISC instructions there isn't a bit field that distinguishes register x from register y, register a from register b. The whole of the opcode has to be looked up in a table to know what it means.

x86 is a variable-length instruction set: some instructions are one byte with no other operands, others need more bytes and then maybe an immediate after that. Say you want to move the immediate value 0x12345678 into register EAX. Without looking at any documentation, I am going to say that is either a 5 or a 6 byte instruction: either one opcode byte that says load immediate into eax, or a byte that says load immediate and another byte that says this is eax, followed by the four bytes of the immediate.

mov eax,0x12345678
mov ebx,0x12345678
mov ecx,0x12345678
mov edx,0x12345678

Disassembly of section .text:

00000000 <.text>:
   0:   b8 78 56 34 12          mov    eax,0x12345678
   5:   bb 78 56 34 12          mov    ebx,0x12345678
   a:   b9 78 56 34 12          mov    ecx,0x12345678
   f:   ba 78 56 34 12          mov    edx,0x12345678

turns out to be 5 bytes. While possible that the bits of those bytes might decode directly into one of the four registers it is unlikely as that is not how those instruction sets were designed.
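
That guess can be tested against the four encodings above by splitting each opcode byte into its high five and low three bits. A minimal Python sketch (the helper name is made up, the register table is the standard x86 numbering, and a 32-bit operand size with no prefixes is assumed). Intel documents this form as B8+rd, so for this opcode family the low three bits do select the register:

```python
REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_mov_imm32(code):
    """Decode only the 5-byte form B8+rd id (mov r32, imm32)."""
    opcode = code[0]
    if len(code) != 5 or opcode & 0xF8 != 0xB8:
        raise ValueError("only B8+rd with a 4-byte immediate is handled here")
    reg = opcode & 7                           # low 3 bits: register number
    imm = int.from_bytes(code[1:5], "little")  # immediate is stored little-endian
    return f"mov {REGS32[reg]},{imm:#x}"

for line in ("b8 78 56 34 12", "bb 78 56 34 12",
             "b9 78 56 34 12", "ba 78 56 34 12"):
    print(decode_mov_imm32(bytes.fromhex(line)))
```

This prints the same four mov lines as the disassembly above.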

You may be overcomplicating this, and sadly the Intel and other x86 docs are not as good as some other vendors'. But it's really just a flow chart and fairly easy to decode: the first byte tells you, by its definition, whether you need to look at another byte or not, the next byte indicates if you need to look further, and so on. You do not decode x86 like you decode mips or arm or others that are designed differently. All of them have a decode that says look at these bits and determine the instruction, or determine if I need more bits, but x86 does it one way, mips does it another, arm does it another. There are pros and cons to each.

CISC like x86, though, is more of a flow chart: the first byte tells you to go to page X, and that page either has the whole answer or says get the next byte and based on that go to page Y in appendix X.

Some houses have one occupant: the address/location takes you to one person. Some have more than one, and once you get to the house based on the address you need further information to determine which person or pet is of interest to you. The first piece of information, the street address, conforms to a common standard, but the information to isolate the person/pet within that house conforms to a standard for that house. The first byte of an instruction is the opcode. But if there are additional bytes, what those bytes mean is opcode specific, as we saw above: in b8 78 56 34 12, for 0xB8 the second byte is part of the immediate value. There are many you can look up where the second byte is further decoding of the instruction:

mov eax,eax
mov eax,ebx
mov eax,ecx
mov eax,edx


   0:   89 c0                   mov eax,eax
   2:   89 d8                   mov eax,ebx
   4:   89 c8                   mov eax,ecx
   6:   89 d0                   mov eax,edx

For the 0x89 opcode the second byte is not data in these cases; it further defines the instruction.

It is true that the second byte's decoding is not unique to that opcode: many instructions share the same decoding of those bits to, for example, select ah, al, ax, eax, bh, bl, bx... etc. That is documented in the Intel documentation as well as countless other books and websites.
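
As an illustration of that shared second-byte decoding, here is a small Python sketch that splits the byte into its three fields and applies them to the four 89 xx encodings above. It only handles the register-to-register case, and the names are made up for the example:

```python
REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def split_modrm(modrm):
    """Return the (mod, reg, rm) fields of a ModRM byte."""
    return modrm >> 6, (modrm >> 3) & 7, modrm & 7

def decode_mov_89(code):
    """Decode only 89 /r with mod = 11 (mov r/m32, r32 between registers)."""
    opcode, modrm = code
    if opcode != 0x89:
        raise ValueError("only opcode 0x89 is handled in this sketch")
    mod, reg, rm = split_modrm(modrm)
    if mod != 0b11:
        raise ValueError("memory operands are not handled in this sketch")
    # For 89 /r the rm field is the destination and the reg field is the source.
    return f"mov {REGS32[rm]},{REGS32[reg]}"

for line in ("89 c0", "89 d8", "89 c8", "89 d0"):
    print(line, "->", decode_mov_89(bytes.fromhex(line)))
```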

The true documentation is the source code to the chip itself. Since we rarely have access to that, we get documentation which usually isn't written by the author of the logic and is then maybe polished off by a technical writer; at each step some info may be lost or left confusing. Some vendors are better than others, some versions of their documentation are better than others.

x86 is pretty much the last instruction set you want to learn. Having one is not a valid reason: for every x86 you have, just inside that box there are many non-x86 processors, plus for every x86 you own you also own quite a few, dozens, of non-x86 devices. And if education and learning is the goal, you want to start with a simulator anyway; it greatly improves your chances of success, and crashes don't hurt nearly as much. There are much better instruction sets to start with, like msp430 and the pdp11 which clearly influenced it. Then arm and thumb, later getting into mips and its nuances. Of the 8-bit ones I wouldn't start with x86 either; I would go with something else, 6502 or others.

Then maybe, if curious, 8088/8086 using an emulator and the old docs on the Internet Archive's Wayback Machine, then lastly x86 as in 80386, 80486, and x86-64. Diving into x86-64 first has got to be all about pain, truly for folks into self abuse. If you still feel you have to do this, the less painful version of this painful path is to start with 8088/8086 using old manuals and dosbox or bochs or a number of other emulators. Once you have the foundation, what they added in the step to 32 bit and then 64 bit may make more sense, and you don't have to be confused by the massive amount of protection added over time; you can start clean and pure.

Disassembly of variable-length instruction sets is a huge problem to solve, and nobody has solved it completely because it is not possible. I used to learn all new instruction sets by starting with a disassembler. These days I would probably do a simulator instead. The only way to have half a chance of success is to start at the valid entry point(s) and decode in execution order, not linearly through the binary. That will only expose some of the code. The rest, if any, is data based (which code runs depends on data values); you can try to emulate, but that won't be perfect either. For one thing, the data you see at disassembly time may change at run time. You could even emulate the program and run it for days/weeks to discover the various data values that a specific instruction looks at, and still not truly know all the possibilities. So some disassemblers simply get it wrong but show it to you as if it were right, and others correctly say I don't know what this is...
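
To make the difference between linear and execution-order decoding concrete, here is a sketch in Python on a made-up five-opcode instruction set (the opcodes, lengths, and function names are all hypothetical, chosen only for the example). The ROM has one data byte right after an unconditional jump:

```python
# Hypothetical toy ISA: opcode -> instruction length in bytes.
LENGTH = {0x00: 1,   # nop
          0x01: 2,   # load imm8
          0x02: 2,   # jmp addr8
          0x03: 2,   # beq addr8 (conditional branch)
          0xFF: 1}   # halt

def linear(code):
    """Decode straight through the bytes, treating everything as code."""
    pc, starts = 0, []
    while pc < len(code):
        starts.append(pc)
        pc += LENGTH.get(code[pc], 1)   # unknown byte: skip one and carry on
    return starts

def trace(code, entry=0):
    """Follow execution order from the entry point, collecting instruction starts."""
    seen, work = set(), [entry]
    while work:
        pc = work.pop()
        if pc in seen or pc >= len(code) or code[pc] not in LENGTH:
            continue
        op = code[pc]
        nxt = pc + LENGTH[op]
        if nxt > len(code):
            continue                        # truncated instruction
        seen.add(pc)
        if op == 0x02:
            work.append(code[pc + 1])       # jmp: only the target is reachable
        elif op == 0x03:
            work += [code[pc + 1], nxt]     # beq: target and fall-through
        elif op != 0xFF:
            work.append(nxt)                # ordinary instruction: fall through
    return sorted(seen)

# jmp 3; one data byte (0x01); load 7; halt
rom = bytes([0x02, 0x03, 0x01, 0x01, 0x07, 0xFF])
print(linear(rom))  # [0, 2, 4, 5]  (the data byte is decoded as an opcode)
print(trace(rom))   # [0, 3, 5]
```

The linear pass swallows the data byte as an opcode and loses sync, while the traced pass only reports offsets that execution can actually reach. Anything reached through a computed jump would still be missed.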

Today the vast majority of binaries are compiled, so the data paths are mostly sane and complete. But go get some ROMs from the stand-up video game days, Asteroids for example. You will see something that looks like this pseudocode:

a = 0
if(a == 0) goto somewhere
b = 7

We can easily see that the conditional branch is actually unconditional, but in disassembly we would still need to treat the instruction after the conditional branch as a possible execution path. But then what you find in that ROM is that what follows is actually data, and then an instruction. In the listing below a 1 represents the opcode byte, and a 2 and 3 represent additional bytes for that instruction. More pseudocode:

1 a = 0;
2
1 if(a == 0) goto somewhere
2
3
1 b = 7.
2
3
1
2
3

But when we continue to decode all the supposedly valid execution paths we find that

1 b = 7.
2 
3  <--- is a branch destination
1
2
3

That destination is an opcode byte, not one of the later bytes of an instruction, so now there is a conflict. A good disassembler will tell you this. Then the human has to examine these paths and determine which one was valid, the a = 0... path or the b = 7 one. Assuming a = 0 and the conditional branch that follows were part of a valid disassembly, it would appear that this really is an unconditional branch, that there are a couple of data bytes or fill or whatever after it, and that some code follows later on. This could have been intentional, as it was more common in those days to deliberately throw off a disassembler, or it could have been the result of hand hacking the binary rather than re-building a whole project and burning the ROMs. (Read up on, I think it was Defender, hacking the binary in the hotel room the night before the trade show the next day.)
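
The check a disassembler can make here is simple: flag any branch destination that falls strictly inside an instruction it has already decoded. A rough Python sketch, using offsets that mirror the 1/2/3 listing above (a = 0 as two bytes at offset 0, the branch as three bytes at offset 2, b = 7 as three bytes at offset 5); the function name and data layout are assumptions for illustration:

```python
def find_conflicts(instructions, targets):
    """Report branch targets that land inside an already-decoded instruction.

    instructions: list of (start_offset, length) pairs
    targets: iterable of branch destination offsets
    Returns (target, start_of_overlapped_instruction) pairs.
    """
    conflicts = []
    for target in targets:
        for start, length in instructions:
            if start < target < start + length:   # inside, but not at the start
                conflicts.append((target, start))
    return conflicts

decoded = [(0, 2), (2, 3), (5, 3), (8, 3)]
print(find_conflicts(decoded, [7]))  # [(7, 5)]
```

Here the destination at offset 7 is the third byte of the instruction decoded at offset 5, which is the conflict described above. Deciding which of the two decodings is real is still left to the human.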

Those bytes might have been other instructions that were hand modified to bypass a bug. If you want to write a disassembler, the 6502 and a number of those game ROMs are a good starting place; there are not as many instructions as in, say, a z80 or 8088/8086, which by using second bytes multiplied the original potential of 256 instructions into a longer list. Early PIC or msp430 would be far easier as a first disassembler, as they only have a dozen or two instructions. Msp430 has a debugged/supported gnu backend (the llvm one is not debugged nor supported, so avoid it), so you have easy-to-get-at tools if learning instruction sets is of interest.

When you have a fixed instruction length, like mips when the 16-bit encoding is not used or arm when the 16-bit thumb is not used (AND the instruction set says the instructions have to be aligned, so not risc-v), you can linearly disassemble through memory. Some of the "instructions" you find make no sense or are undefined, but you just grind through; the human will later see those as data, not instructions, while the ones that are instructions will make sense. Unfortunately mips and arm have secondary instruction sets that decode completely differently and have different rules, so you cannot simply disassemble an arm binary linearly either. For something compiler generated today you need to do it in execution order as well; you are far more likely to get most of the instructions decoded, but there will be some jump tables that dead-end your efforts, leaving chunks of code not properly disassembled.
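
For the fixed-length case, the first step of a linear disassembler is just cutting the bytes into aligned words. A minimal Python sketch, assuming 4-byte little-endian instructions; the function name is invented, and the two sample words are the classic ARM encodings of mov r0, #0 and bx lr:

```python
def split_words(code, width=4, byteorder="little"):
    """Cut a byte string into fixed-width instruction words."""
    if len(code) % width:
        raise ValueError("length is not a multiple of the instruction width")
    return [int.from_bytes(code[i:i + width], byteorder)
            for i in range(0, len(code), width)]

blob = bytes.fromhex("0000a0e3 1eff2fe1")
print([hex(word) for word in split_words(blob)])  # ['0xe3a00000', '0xe12fff1e']
```

Every word then goes through the same field decode, whether it is really code or data, which is why the human still has to sort out which results make sense.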

So while wordy, the short answer is only trust the disassembler as far as you can throw it. And the instructions are pretty easy to decode if you go in execution order from a known to be valid entry point and look at the documentation for the processor.

halfer
old_timer
  • 1
    Did you start writing this diatribe before I edited the question to make it more specific, and emphasize that it was asking specifically about the "opcode /r" field? The old title was garbage, but it still looked like a fairly specific question. (And I don't see an answer to it in this rambling rant about x86 in general.) Nothing in the question mentioned the problem of finding instruction boundaries via static analysis, or any of the other issues caused by a variable-length instruction set. This just looks like a non-answer rant. – Peter Cordes Dec 25 '18 at 22:59
  • 1
    *While possible that the bits of those bytes might decode directly into one of the four registers it is unlikely as that is not how those instruction sets were designed.* Unlike 8080 and earlier ISAs with only a couple registers and different versions of the same instruction with a different implicit register, when x86 has a single-byte opcode that includes an explicit register, it's the low 3 bits of the opcode that encode the register. e.g. [`mov r32, imm32` is `b8 + rd`](https://www.felixcloutier.com/x86/mov), where rd is the destination register code. – Peter Cordes Dec 25 '18 at 23:03
  • You get an upvote from me just for the effort of writing this wall of text. – fuz Dec 26 '18 at 00:11
  • 1
    @Peter Cordes so far this is the best answer and answers what I was originally questioning, I haven't finished reading it because of how long it is but it is specifying exactly what I'm looking for, how bytecode is parsed. I didn't know exactly how to ask the question when I originally wrote it, unfortunately it's a problem I have with a lot of questions on stackoverflow (not knowing how to write my question properly, that is.) But my original question might be better phrased as "how is bytecode parsed into opcodes by correct disassemblers" – bcvdgfdag fewafdsaf Dec 26 '18 at 05:11
  • @bcvdgfdagfewafdsaf: you commented on @kabanus's answer that it didn't give you any new information, so I assumed you already understood the basics of x86-64 instruction format shown in that diagram and were just asking about using the `/r` field as extra opcode bits, if you'd already gotten to the point of looking up opcode bytes. That's the only thing your question focused on. So yeah, if this answer was helpful then great, and yeah the question you wrote didn't clearly express what you wanted to know. (It would've been a dup of other questions that explain instruction-length decoding.) – Peter Cordes Dec 26 '18 at 05:34
1

This depends on the specific architecture, not just x86-64, but the actual chip provider. You can check for example intel's guide for architecture software developers.

This has a whole chapter dedicated just to the syntax of the commands in bytecode, and then another on each command available. Here is figure 2.1, to give you an idea:

[Figure: Intel architecture format]

taken from the above manual. This would change if you use ARM for example.

This is something people can take years to study to be able to "fluently read" byte code, so just skimming this can only give you a rough idea of the syntax or a good resource for locating a specific thing.

kabanus
  • This is not useful, it hasn't given me any new information or actually answered my questions. – bcvdgfdag fewafdsaf Dec 25 '18 at 16:21
  • 1
    @bcvdgfdagfewafdsaf It does! See the elaboration of the ModR/M byte where bits 5–3 are labelled “Reg/Opcode,” i.e. this field can either be a register operand or another 3 bits of opcode. – fuz Dec 26 '18 at 00:12