Implement virtualization by using emulation

7 minutes

Now that we understand the conditions for virtualizing ISAs and the two main classes of CPU virtualization, full virtualization and paravirtualization, we move to discussing emulation as a technique used to implement full virtualization and process VMs. Recall that emulation is the process of allowing the interfaces and functionalities of one system (the source) to be implemented on a system with different interfaces and functionalities (the target). Emulation is the only CPU virtualization mechanism available when the guest and host ISAs are different. If the guest and host ISAs are identical, direct native execution can (possibly) be applied.

Emulation is carried out via either interpretation or binary translation. With interpretation, source instructions are converted, one instruction at a time, to relevant target instructions. Interpretation is relatively slow because of the one-by-one emulation of instructions and the lack of any optimization technique (e.g., precluding the interpretation of an already encountered and interpreted instruction). Binary translation optimizes interpretation via converting blocks of source instructions to target instructions and caching generated blocks for repeated use. Typically, blocks of instructions are more amenable to optimizations than single instructions are. Compared to interpretation, binary translation is much faster because of the application of block caching as well as code optimizations over blocks. In the following video, we discuss three major interpretation schemes: decode and dispatch, indirect threaded, and direct threaded.

As explained in the video, a basic interpreter should read through the source code instruction by instruction, analyze each instruction, and call relevant routines to generate the target code. An interpreter called decode and dispatch applies basic interpretation but results in a number of branch (or jump) instructions, both direct and indirect, which leads to poor execution times. As an optimization for decode and dispatch, an interpreter called indirect threaded attempts to release some of the decode-and-dispatch branches via appending (or threading) a portion of the dispatch code to the end of each interpreter routine. Last, a more advanced interpreter, called direct threaded, improves indirect threaded by attempting to interpret a repeated operation only once. Although the direct-threaded interpreter improves the indirect-threaded and decode-and-dispatch branches, it still suffers from major drawbacks, such as vast memory image and limited portability. In the next video, we discuss binary translation, which essentially targets the limitations of direct-threaded interpretation.

As presented in this video, binary translation tries to amortize the fetch and analysis costs caused by the direct-threaded interpreter through: (1) translating a block of source instructions (rather than a single instruction) to a block of target instructions and (2) caching the translated code in an attempt to save interpreting source instructions more than once. The following table compares binary translation, decode-and-dispatch, indirect-threaded, and direct-threaded emulation techniques in terms of four metrics: memory requirements, startup performance, steady-state performance, and code portability. For instance, the decode-and-dispatch interpreter row reads as follows: first, with decode and dispatch, memory requirements remain low because of having only one interpreter routine for each instruction type in the target ISA. Furthermore, decode and dispatch avoids threading the dispatch code at the end of each routine. Second, startup performance is fast because neither predecoding nor translation of the source binary is applied. Third, steady-state performance (i.e., the performance after starting up the interpreter) is slow because of the high number of branches and the interpretation of every instruction for each appearance. Finally, code portability is good because predecoding with addresses of interpreter routines is not applied by decode-and-dispatch interpreters (in contrast to direct-threaded interpreters).

	Memory requirements	Startup performance	Steady-state performance	Code portability
Decode-and-dispatch interpreter	Low	Fast	Slow	Good
Indirect-threaded interpreter	Low	Fast	Slow	Good
Direct-threaded interpreter	High	Slow	Medium	Medium
Binary translation	High	Very slow	Fast	Poor

Check your knowledge

When the guest and host ISAs are different, the only CPU virtualization mechanism that can be used is:

Direct native execution

Interpretation

Binary translation

Emulation

The process of decoding an instruction only once, saving the extracted information in an intermediate form, and reusing the instruction each time it is encountered is called:

Threaded interpretation

Indirect-threaded interpretation

Direct-threaded interpretation

Predecoding

A major limitation of indirect-threaded interpretation is the requirement of looking up a table on each dispatch. To get rid of such a level of indirection, it was suggested in 1973 to replace the instruction opcodes contained in the intermediate form with the actual addresses of the interpreter routines. What is this method called?

Direct-threaded interpretation

Predecoding

Indirect-threaded interpretation

Decode-and-dispatch interpretation

With predecoding, all the source instructions of the same types are extracted with the same interpretation routine. Performance can be significantly enhanced by mapping each individual source binary instruction to its own customized target code. This process of converting the source binary program into a target binary program is called:

Decode-and-dispatch interpreter

Direct-threaded interpreter

Binary translation

Predecoding

Indirect-threaded interpreter

Consider the following snippet of an interpreter code that steps through the source program, decodes each instruction (that is, the extract() function), and calls a corresponding routine depending on the observed opcode. (All routines are omitted, except the first one, for the brevity of the presentation.)

While (!halt && !interrupt)
{
    inst = code[PC];
    opcode = extract(inst, 31, 6);

    switch(opcode)
        {
            case LoadWordAndZero: LoadWordAndZeroRoutine(inst);
            case ALU: ALUrouting(inst);
            case Branch: BranchRoutine(inst);
            ...
        }
}

LoadWordAndZeroRoutine(inst)
{
    RT = extract(inst, 25, 5);
    RA = extract(inst, 20, 5);
    Displacement = extract(inst, 15, 16);
    
    if (RA == 0)
        Source = 0;
    else
        Source = reys[RA];

    address = source + displacement;
    reys[RT] = (data[address] << 32) >> 32;
    PC = PC + 4;
}

Which of the following statements best describes the above interpreter?

This interpreter is a threaded one because it is structured around a central loop, a decoder, and a dispatcher.

This interpreter is a direct-and-dispatch one because it is structured around a central loop, a decoder, and a dispatcher.

In addition to the test for a halt or an interrupt at the top of the loop, this interpreter exhibits one branch.

In addition to the test for a halt or an interrupt at the top of the loop, this interpreter exhibits three branches.

In addition to the test for a halt or an interrupt at the top cite loop, this interpreter exhibits four branches.

This interpreter is a direct-and-dispatch one because it is structured around a central loop, a decoder, and a dispatcher. In addition to the test for a halt or an interrupt at the top of the loop, this interpreter exhibits four branches.

Consider the following snippet of an interpreter code.

inst = code[PC];
opcode = extract(inst, 31, 6);

switch(opcode)
    {
        case LoadWordAndZero: LoadWordAndZeroRoutine(inst);
        case ALU: ALUrouting(inst);
        case Branch: BranchRoutine(inst);
    }

LoadWordAndZeroRoutine(inst)
{
    RT = extract(inst, 25, 5);
    RA = extract(inst, 20, 5);
    Displacement = extract(inst, 15, 16);
    
    if (RA == 0)
        Source = 0;
    else
        Source = reys[RA];
    address = source + displacement;
    reys[RT] = (data[address] << 32) >> 32;
    PC = PC + 4;
    
    if (halt || interrupt)
        goto emit;
    
    inst = code[PC];
    opcode = extract(inst, 31, 6);
    extended_opcode = extract(inst, 10, 10);
    routine = displacement[opcode, extended_opcode];
    
    goto *routine
}
...

Which of the following statements best describes the above interpreter?

This interpreter is a decode-and-dispatch interpreter.

This interpreter differs from a decode-and-dispatch interpreter as it appends a portion of the dispatch code to the end of each interpreter routine.

As compared to a decode-and-dispatch interpreter, this interpreter removes tree branches after calling the first interpreter routine. Thus, this interpreter performs faster than a decode-and-dispatch interpreter.

As compared to a decode-and-dispatch interpreter, this interpreter removes all branches after calling the first interpreter routine. Thus, this interpreter performs faster than a decode-and-dispatch interpreter.

This interpreter differs from a decode-and-dispatch interpreter as it appends a portion of the dispatch code to the end of each interpreter routine. As compared to a decode-and-dispatch interpreter, this interpreter removes tree branches after calling the first interpreter routine. Thus, this interpreter performs faster than a decode-and-dispatch interpreter.

You must answer all questions before checking your work.

Continue

Check your knowledge

Feedback