Object Code Translation Home Page


Welcome to my page on object code translation (OCT). Object code translation means the automated conversion of binary or assembly code to some other representation. It is sometimes also called binary translation (BT). I am working on a Ph.D. on the subject, and I created this page with the goal of collecting all the related material and links I am aware of. However, I have not yet compiled everything I have into this page, so it is still under construction (and maybe it always will be; after all, it's a Ph.D.). I use *** and $$$ to indicate that the material must be reviewed or extended.

On this page, you will find a lot of information and material concerning the subject. The page gives you both an overview of and an introduction to OCT. Even more important, it contains many links to related pages, literature, tools, products, and people working on OCT. The page is an ideal starting point if you want to dig deep into the subject. It contains (will contain) links and references to all the related material I am aware of. However, OCT is a large and fast-changing field, and I greatly appreciate any pointer to additional or new material. If you have done any work that should be referenced on this page, or if you are aware of such work, please let me know. I am also very much interested in any comments concerning this page or the way I understand OCT. If you think that something should be changed, or if you find an error, an inaccuracy or an outdated link, please email me. I respond to any mail as soon as possible.

I hope that this page will be of great help to you.

Markus Pilz,
Dept. of Computer Science,
University of Zurich

email: pilz@ifi.unizh.ch
phone: +41-1 / 635 67 12
fax : +41-1 / 635 68 09


Table of Contents

Introduction
Terminology
Resources
Interpreters, Tools and Products
Projects
Languages, Notations and Formalisms
Object File Formats
People
Papers and Books
Places and Names
Acknowledgments
Abbreviations


Introduction

What is Object Code Translation?
Object code translation means the conversion of some binary or assembly code to some other representation. Object code translation is a special form of compilation where the source language is a machine language and not a high-level language. It is convenient to distinguish two phases in the translation process. In the first, an internal representation is built by disassembling the object code or by parsing the assembly program, respectively. In the second, this representation is optimized and new code is generated. The output can be binary code, assembly code, or even some high-level language representation of the input program. The first phase is called the disassembly phase and the second the code generation phase.
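The two phases can be sketched in a few lines. This is only an illustration under stated assumptions: the toy stack machine, its opcode values and all function names are made up for this example and do not correspond to any real instruction set or tool. Phase one decodes raw bytes into a list of tuples, a small optimization folds constants on that representation, and phase two emits Python source as the "high-level language representation".

```python
# Hypothetical toy stack machine: opcode -> (mnemonic, number of operand bytes).
OPCODES = {0x01: ("push", 1), 0x02: ("add", 0), 0x03: ("mul", 0), 0x04: ("ret", 0)}

def disassemble(code: bytes):
    """Phase 1: decode raw bytes into an internal representation (list of tuples)."""
    ir, pc = [], 0
    while pc < len(code):
        mnemonic, n_operands = OPCODES[code[pc]]
        operands = tuple(code[pc + 1 : pc + 1 + n_operands])
        ir.append((mnemonic, *operands))
        pc += 1 + n_operands
    return ir

def fold_constants(ir):
    """Optimization on the IR: fold 'push a; push b; add/mul' into a single push."""
    out = []
    for ins in ir:
        foldable = (ins[0] in ("add", "mul") and len(out) >= 2
                    and out[-1][0] == "push" and out[-2][0] == "push")
        if foldable:
            b, a = out.pop()[1], out.pop()[1]
            out.append(("push", a + b if ins[0] == "add" else a * b))
        else:
            out.append(ins)
    return out

def generate(ir, name="translated"):
    """Phase 2: emit Python source text from the IR."""
    lines = [f"def {name}():", "    stack = []"]
    for ins in ir:
        if ins[0] == "push":
            lines.append(f"    stack.append({ins[1]})")
        elif ins[0] in ("add", "mul"):
            op = "+" if ins[0] == "add" else "*"
            lines.append(f"    b = stack.pop(); a = stack.pop(); stack.append(a {op} b)")
        elif ins[0] == "ret":
            lines.append("    return stack.pop()")
    return "\n".join(lines)

program = bytes([0x01, 2, 0x01, 3, 0x02, 0x01, 4, 0x03, 0x04])  # (2 + 3) * 4
ir = disassemble(program)
source = generate(fold_constants(ir))
namespace = {}
exec(source, namespace)          # run the generated code
result = namespace["translated"]()
```

A real translator would of course emit code for the target machine rather than Python text, but the division of labour between the two phases is the same.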

Object code translation has many applications. By generating code for a different architecture, binary code can be executed quickly on that architecture. In the same way, one gains portability for assembly and binary code. And given the disassembly phase, object code translation allows a program to be changed even if its source code is not available. Instruction set simulation, virtual machine implementation, just-in-time compilation, software migration, cross assembly, CASE tools for assembly code, executable editing, program tracing and code instrumentation are examples of application areas where object code translation is a key technology.

Object code translation is a rather old technique and various examples of its successful application exist. But besides the numerous examples, there is neither a general understanding of the underlying principles nor a widely accepted terminology. Furthermore, disassembling binary or assembly code raises problems that are carefully avoided in high-level language programs: not all of the code may be available at compilation time (e.g. self-modifying code), and the flow of control cannot always be reconstructed completely by static analysis (e.g. indirect jumps). There are various translation schemes that solve these problems. They all handle some code at runtime, but they differ in how much is translated statically. If any translation is done statically, checks are inserted into the code that trigger a fallback to interpretation or dynamic translation whenever some assertion is violated or whenever code that has not yet been translated is detected.
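The fallback idea can be illustrated with a small sketch. Everything in it is an assumption for the sake of the example: the toy instruction tuples, the register names, and the use of a Python closure as a stand-in for generated code. The static pass follows fall-through and direct jumps only; an indirect jump ends the trace, so code reachable only through it stays untranslated and is handled by the interpreter at runtime.

```python
def interpret(ins, regs, pc):
    """Execute one instruction directly; return the next pc (None means halt)."""
    op = ins[0]
    if op == "set":
        regs[ins[1]] = ins[2]
    elif op == "add":
        regs[ins[1]] += ins[2]
    elif op == "jmp":
        return ins[1]            # direct jump: target is in the instruction
    elif op == "jmpr":
        return regs[ins[1]]      # indirect jump: target is only known at runtime
    elif op == "halt":
        return None
    return pc + 1

def discover(program, entry=0):
    """Static pass: collect the addresses reachable without knowing register values."""
    seen, work = set(), [entry]
    while work:
        pc = work.pop()
        if pc in seen or not 0 <= pc < len(program):
            continue
        seen.add(pc)
        op = program[pc][0]
        if op == "jmp":
            work.append(program[pc][1])
        elif op not in ("jmpr", "halt"):
            work.append(pc + 1)
    return seen

def translate(ins, pc):
    """Stand-in for code generation: bind the instruction into a closure."""
    return lambda regs: interpret(ins, regs, pc)

def run(program, entry=0):
    translated = {pc: translate(program[pc], pc) for pc in discover(program, entry)}
    regs, pc, fallbacks = {}, entry, 0
    while pc is not None:
        if pc in translated:     # the runtime check
            pc = translated[pc](regs)
        else:                    # address never seen statically: fall back
            fallbacks += 1
            pc = interpret(program[pc], regs, pc)
    return regs, fallbacks, sorted(translated)

program = [
    ("set", "r0", 4),    # 0
    ("jmpr", "r0"),      # 1: indirect jump, target unknown statically
    ("set", "r1", 99),   # 2: never executed
    ("halt",),           # 3
    ("set", "r1", 7),    # 4: reached only through the indirect jump
    ("halt",),           # 5
]
regs, fallbacks, static_pcs = run(program)
```

In this run only addresses 0 and 1 are found statically; the two instructions behind the indirect jump are executed by the fallback path. A dynamic translator would translate them at that point instead of interpreting them.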

It turns out that how much code can be detected and translated prior to runtime depends only on the source machine's instruction set. We are developing a system-level design language that allows us to describe all aspects of an instruction set architecture that are relevant to object code translation. The primary goal of the description is to make it possible to infer certain properties of the instruction set and to deduce which translation scheme is best to use. As a secondary goal, the description can be used for the parameterization or the generation of parts of an object code translator, e.g. a disassembler, an assembler or a code generator.
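To give a rough feeling for both goals, here is a minimal sketch. It does not show the design language itself; the dictionary-based description, the 8-bit encodings, the attribute names and both functions are hypothetical and invented for this example. One function derives a disassembler from the description, the other infers properties that make purely static translation unsafe.

```python
# Hypothetical 8-bit instruction set: bit pattern plus attributes per instruction.
# Each field is given as (shift, width) within the instruction byte.
ISA = [
    {"name": "ldi",  "mask": 0xC0, "match": 0x00, "fields": {"imm": (0, 6)},  "flow": "next"},
    {"name": "jmp",  "mask": 0xC0, "match": 0x40, "fields": {"addr": (0, 6)}, "flow": "direct"},
    {"name": "jmpr", "mask": 0xFC, "match": 0x80, "fields": {"reg": (0, 2)},  "flow": "indirect"},
    {"name": "st",   "mask": 0xFC, "match": 0x84, "fields": {"reg": (0, 2)},  "flow": "next",
     "writes_memory": True},
]

def make_disassembler(isa):
    """Secondary goal: generate a decoder from the description."""
    def disassemble(byte):
        for ins in isa:
            if (byte & ins["mask"]) == ins["match"]:
                operands = {name: (byte >> shift) & ((1 << width) - 1)
                            for name, (shift, width) in ins["fields"].items()}
                return ins["name"], operands
        raise ValueError(f"undefined opcode {byte:#04x}")
    return disassemble

def static_translation_hazards(isa):
    """Primary goal: infer properties that rule out purely static translation."""
    hazards = []
    if any(ins["flow"] == "indirect" for ins in isa):
        hazards.append("indirect jumps")
    if any(ins.get("writes_memory") for ins in isa):
        hazards.append("stores (possible self-modifying code)")
    return hazards

disassemble = make_disassembler(ISA)
decoded = [disassemble(b) for b in (0x45, 0x82, 0x86)]
hazards = static_translation_hazards(ISA)
```

Whether a store can really modify code depends on the memory model of the machine; the sketch simply assumes a single address space for code and data.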


Terminology

There is a lot of confusion about the meaning of the terms used in object code translation. Some terms (e.g. interpretation) originate from other fields but are used in OCT with a different or more precise meaning. Other terms (e.g. just-in-time compiler) are widely used but their meaning is vague. Still other terms (e.g. host) have various meanings depending on the context they are used in or the community that uses them.

The goal of this section is to clearly define the meaning I give to these terms. I will be very careful throughout all of these pages and throughout my thesis to use them only with the meaning defined hereafter. The main goal of a definition given here is thus to distinguish a term from a similar or related one. It is not my intention to explain the term in all its nuances, but I might put in some links to more precise definitions and related material.

Binary File

see also:

Binary Program

Binary Translation, Binary Translator

synonym:

Compilation, Compiler

see also:

Decompilation, Decompiler
The term decompilation indicates the inversion of the compilation process. It stands for the detranslation of some binary code back to source code represented in a high level language, whereas the term disassembly is used for the reconstruction of assembly language source code.

Detranslation, Detranslator
Detranslation refers to the general task of reconstructing source code from its binary code (absolute binary or relocatable binary, executable, object file or library). The term subsumes both decompilation and disassembly.

The compilation process is not reversible in general, however. During compilation, a lot of information is lost (variable names, procedure names, comments, position information etc.). $$$quot The task of detranslating a program does not produce a unique result and, in general, is impossible. There could be infinitely many different source programs that compile or assemble to yield the same binary code. The differences between such equivalent programs can go beyond the use of different labels and variable names. With programs, there may be several different constructs (e.g. an instruction, an integer and a floating point number) that would all translate to yield the same bit pattern in memory. $$$quot
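The ambiguity of a bit pattern is easy to demonstrate. In the following sketch the same four bytes are read as an integer, as a floating point number and as an instruction. The integer and IEEE 754 readings are standard; the instruction format (an 8-bit opcode followed by three 8-bit operand fields) is a hypothetical one chosen only for illustration.

```python
import struct

word = bytes.fromhex("3f800000")   # four bytes as they might appear in a binary

as_int = struct.unpack(">I", word)[0]     # big-endian unsigned 32-bit integer
as_float = struct.unpack(">f", word)[0]   # IEEE 754 single precision number
# Hypothetical instruction format: opcode byte plus three operand bytes.
opcode, op_a, op_b, op_c = struct.unpack(">BBBB", word)

print(as_int)                      # 1065353216
print(as_float)                    # 1.0
print(opcode, op_a, op_b, op_c)    # 63 128 0 0
```

Nothing in the bytes themselves says which reading the original programmer intended; a detranslator has to guess from context, for instance from how the location is used by the surrounding code.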

Disassembly, Disassembler
Disassembly is the inverse process to assembly. It means the detranslation of some binary code back to source code represented in a symbolic assembly language, whereas decompilation refers to the regeneration of the source code in a high-level language.

Dynamic Compilation

synonym:

see also:

Emulation, Emulator

see also:

Executable

Function
A function is a subroutine that returns a value.

see also:

Host Machine

synonym:

Interpretation, Interpreter

Just-in-Time Compilation

synonym:

Machine Code

Mobile Code

Object File

Object Code Translation, Object Code Translator

synonym:

Procedure
A procedure is a subroutine that does not return a value.

see also:

Simulation, Simulator

see also:

Static Compilation

see also:

Subroutine
Subroutine is a generic term to denote both a function and a procedure.

Source Machine

see also:

Target Machine

synonym:

see also:


Resources


Interpreters, Tools and Products

Contents

BFD: the Binary File Descriptor Library
The Binary File Descriptor Library is a package which allows applications to use the same routines to operate on object files whatever the object file format. A new object file format can be supported simply by creating a new BFD back end and adding it to the library.

($$$gnu, Cygnus)

dcc :
The dcc decompiler was C. Cifuentes' Ph.D. project. It decompiles .exe files from the (i386, DOS) platform to C programs. The final C program contains assembler code for any subroutines that cannot be decompiled to a higher level than assembler.

The analysis performed by dcc is based on traditional compiler optimization techniques and graph theory. The former is capable of eliminating registers and intermediate instructions to reconstruct high-level statements; the latter is capable of determining the control structures in each subroutine.

DCG
Dynamic Code Generator. DCG has been superseded by vcode.

D. Engler, T. Proebsting

Fabius
Fabius is a Standard ML compiler written by M. Leone that automatically defers certain aspects of optimization and code generation to run time.

NJMC: New Jersey Machine-Code Toolkit
The New Jersey Machine-Code Toolkit helps programmers write applications that process machine code - assemblers, disassemblers, code generators, tracers, profilers, and debuggers. The toolkit lets programmers encode and decode machine instructions symbolically. Encoding and decoding are automated based on compact specifications.

The toolkit was written and is maintained by M. Fernández and N. Ramsey.

STonX
STonX is an Atari ST Emulator intended primarily for use with Unix and the X Window System. Ports to MS-DOS and Microsoft Windows 95 have been released by other people or are being worked on. STonX is distributed under the GNU License, meaning that source code is available as well - and, of course, it's free software.

There is also an unofficial STonX home page maintained by T. Smolar. He also maintains a STonX FAQ.

M. Yannikos, M. Griffiths

vcode
vcode is a portable, extensible, fast dynamic code generation system. An important feature of vcode is that it generates machine code in-place without the use of intermediate data structures. Eliminating the need to construct and consume an intermediate representation at runtime makes vcode both efficient and extensible. vcode dynamically generates code at an approximate cost of six to ten instructions per generated instruction.
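The idea of generating code in place can be sketched as follows. This is not vcode's interface: the Emitter class, its method names and the one-byte-opcode stack machine encoding are all hypothetical and serve only to show the principle. Each call appends the encoded instruction directly to the output buffer, so no intermediate data structure is built and later consumed.

```python
class Emitter:
    """Writes encoded instructions straight into a buffer; no IR is ever built."""

    def __init__(self):
        self.code = bytearray()

    def push(self, value):
        # Hypothetical encoding: opcode 0x01 followed by an 8-bit immediate.
        self.code += bytes([0x01, value & 0xFF])

    def add(self):
        self.code.append(0x02)   # hypothetical opcode for 'add'

    def ret(self):
        self.code.append(0x04)   # hypothetical opcode for 'return'

def emit_sum(values):
    """Generate, at run time, code that adds up a list known only now."""
    e = Emitter()
    e.push(values[0])
    for v in values[1:]:
        e.push(v)
        e.add()
    e.ret()
    return bytes(e.code)

code = emit_sum([1, 2, 3])
print(code.hex())   # 010101020201030204
```

The price of this design is that optimizations needing a view of the whole generated program are not available, since each instruction is final as soon as it is emitted.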

D. Engler

Toba
Toba translates Java class files into C source code. This allows the construction of directly executable programs that avoid the overhead of interpretation. If the Java class file is seen as the object file of the JavaVM, then Toba is an object code translator.

Sumatra Project, T. Proebsting


Projects

Contents

DIAMONDS
The goal of the DIAMONDS project is to design a new 8/16-bit microcontroller on a RISC basis, assuring software compatibility with an existing CISC family. Our part of the project is to explore a way to guarantee this compatibility. We study the various migration paths, clearly favouring translation over interpretive approaches.

Markus Pilz

FermaT
M. Ward

Sumatra
The Sumatra project's goal is to create a research infrastructure for experimenting with mobile code. Like many people these days, they adopted Java byte code as the basis for transmitting code, and they built Toba, a bytecode-to-C binary translator.

T. Proebsting

TIBBIT: Timing Insensitive Binary To Binary Translation
The TIBBIT project focuses on performing automated binary-code translation and migration of applications across widely different architectures, and automatically compensating for implicit time-based dependencies between the application and the speed of the underlying hardware. The TIBBIT project is primarily concerned with the automated migration of embedded real-time applications.

Bryce Cogswell, Zary Segall


Languages, Notations and Formalisms

`C language
`C is an extension of ANSI C that provides language support for dynamic code generation.

D. Engler

WSL: Wide Spectrum Language
M. Ward

Object File Formats

Contents

68k COFF
see COFF

a.out: assembler and link editor output format
On UNIX boxes, a.out is the default output format of the system assembler as(1) and the link editor ld(1). The link editor makes a.out executable files.

A file in a.out format consists of: a header, the program text, program data, text and data relocation information, a symbol table, and a string table (in that order). In the header, the size of each section is given in bytes. The last three sections may be absent if the program was loaded with the -s option of ld or if the symbols and relocation have been removed by strip(1).
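Because the sizes are in the header and the parts follow in a fixed order, the position of each part can be computed by simple addition. The sketch below assumes the classic BSD header layout (eight 32-bit words: a_magic, a_text, a_data, a_bss, a_syms, a_entry, a_trsize, a_drsize) and a file in which the text starts directly after the header. Both are assumptions; header layout, byte order and text alignment differ between systems and magic numbers, so check a.out(5) on your machine. The function name and the synthetic test file are made up for the example.

```python
import struct

FIELDS = ("a_magic", "a_text", "a_data", "a_bss",
          "a_syms", "a_entry", "a_trsize", "a_drsize")

def aout_layout(image: bytes, byte_order: str = "<"):
    """Return the header fields and the (offset, size) of each part of the file."""
    header = struct.Struct(byte_order + "8I")    # assumed: eight 32-bit words
    hdr = dict(zip(FIELDS, header.unpack_from(image)))
    parts = [("text", hdr["a_text"]), ("data", hdr["a_data"]),
             ("text relocation", hdr["a_trsize"]),
             ("data relocation", hdr["a_drsize"]),
             ("symbol table", hdr["a_syms"])]
    layout, pos = {}, header.size                # assumed: text follows the header
    for name, size in parts:
        layout[name] = (pos, size)
        pos += size
    layout["string table"] = (pos, len(image) - pos)   # whatever remains in the file
    return hdr, layout

# Synthetic file for demonstration only; not produced by a real assembler.
sizes = dict(a_magic=0o407, a_text=16, a_data=8, a_bss=0,
             a_syms=12, a_entry=0, a_trsize=8, a_drsize=0)
image = struct.pack("<8I", *(sizes[f] for f in FIELDS)) + bytes(16 + 8 + 8 + 0 + 12 + 4)
hdr, layout = aout_layout(image)
for name, (offset, size) in layout.items():
    print(f"{name:16} offset {offset:3} size {size}")
```

Note that a_bss does not appear in the layout: the bss segment occupies no space in the file and is only allocated when the program is loaded.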

b.out
$$$(gnu unix?)

COFF: Common Object File Format
COFF is a portable format for binary applications on UNIX System V. Accordingly, you can find a description of the COFF file format in the manual of a System V Unix box.

Click for the SunOS 4.1 man page COFF(5).

COFF is used as object file format by

  • Motorola MCUasm Assembly Language toolset
  • Sun 386i systems running a SunOS 4.0.x release or earlier

ELF: Executable and Linking Format
  • $$$If you're on a Sun Solaris machine as I am: just type in man elf and enjoy.

  • ELF Spec as included in TIS

IEEE-695
IEEE-695 is used on a variety of native and cross-development platforms:
  • Motorola 68000 (Microtec Research)
  • Motorola 68HC08 (BSO/Tasking)
  • Hitachi processors
  • Zilog processors

Hewlett-Packard's IEEE-695 Developer's Package includes the written IEEE-695 specification and a couple of tools. You can get it by ftp.

[Gray97a]

Microsoft Symbol and Type Information (Windows)
$$$TIS

Oasys (Oasys operating system?)
$$$

OMF: Relocatable Object Module Format (Windows)
$$$TIS

PE: Portable Executable Format (Windows)
$$$TIS

S-records
$$$

TIS: Tool Interface Standards (Windows)
$$$Formats Specifications for Windows

People

It's obvious that this list will always be incomplete. But if you think you should be mentioned on it, drop me a mail; I will be glad to add you to the list!

Contents

Cifuentes, Cristina
homepage

dcc, M. van Emmerik

Cogswell, Bryce
homepage

TIBBIT, Zary Segall

Eggers, Susan
homepage

Emmerik, Mike van
dcc, C. Cifuentes

Engler, Dawson R.
homepage

vcode, `C language, DCG, T. Proebsting

Fernández, Mary
homepage

NJMC, N. Ramsey

Griffiths, Martin D.
STonX,

Hughes, Kevin
Kevin Hughes is at ICL. He works/worked with A. Rawsthorne.

Lee, Peter
homepage

M. Leone, Fabius

Leone, Mark
Mark Leone's work centers around the use of compile-time specialization to reduce the cost of run-time code generation. As a proof of concept, he has implemented a compiler for a subset of Standard ML, called Fabius, that automatically creates programs that generate native code at run time with extremely low overhead (approximately six cycles per generated instruction).

P. Lee

Massalin, Henry

Keppel, David
David Keppel is Pardo!

Pilz, Markus
Markus Pilz, that's me!

I have worked on the DIAMONDS project and I am currently working on gaggia, a Java bytecode optimizer and native code compiler. Java bytecode compilation can be seen as a special case of binary translation.

Proebsting, Todd A.
Todd Proebsting has done a lot of interesting work, but much of it is more related to compiler construction (lcc, need I say more) and retargetable code generation than to binary translation. However, DCG is a dynamic code generator, and Toba is a binary translator.

D. Engler

Ramsey, Norman
homepage

NJMC, M. Fernández

Rawsthorne, Alasdair
Alasdair Rawsthorne was at ICL, is aware of the work of K. Hughes and has some student projects open in binary translation.

Segall, Zary
TIBBIT, Bryce Cogswell

Smolar, Toni
homepage

STonX

Wall, David
David Wall was at Digital's Western Research Lab (WRL) and is now at MIPS Technologies, Inc. He worked on binary translation and on analyzing object files and has written quite a few papers.

Ward, Martin
Martin Ward works on the theory and application of program transformations. He has developed a Wide Spectrum Language (WSL) in which both the source and the target program can be described. It is then possible to show, by repeated transformations, that the target program corresponds to the source program. He uses this approach in the FermaT project at SML to reverse engineer IBM 370 Assembler code to C or COBOL code.

Yannikos, Marinos
Marinos "nino" Yannikos has written STonX and has a page with links to Software Emulation and Binary Translation.

Papers and Books

UNDER CONSTRUCTION!!!
[FMWe84]
Christopher W. Fraser, Eugene W. Myers and Alan L. Wendt. Analyzing and Compressing Assembly Code. In Proceedings of the ACM SIGPLAN 1984 Symposium on Compiler Construction, SIGPLAN Notices, 19(6), June 1984, 117-121.
[Gray97a]
Rand Gray and Deepak Muchandani. Object File Formats. Dr. Dobb's Journal, 22(5), May 1997, 47-52, 75, 76.

Places and Names

ICL
homepage

WRL: Digital's Western Research Lab
homepage

MIPS Technologies, Inc.
homepage

SML, Ltd.
Software Migrations Ltd.,
Mountjoy Research Center,
Stockton Rd,
Durham DH1 3SW, UK
Tel: 44+(0)191 386 0420, Fax: 44+(0)191 383 1243

Acknowledgments

Kevin Hughes sent me a long list of dynamic code generation projects and of people working on them. Many thanks.

Abbreviations

BT   - Binary Translation 
HDL  - Hardware Description Language 
HLL  - High-Level Language 
HM   - Host Machine 
HW   - HardWare 
IR   - Intermediate program Representation 
ISA  - Instruction-Set Architecture 
ISP  - Instruction-Set Processor description language 
ISS  - Instruction-Set Simulator 
JIT  - Just In Time compilation 
OCT  - Object Code Translation 
OS   - Operating System 
PC   - Program Counter 
RISC - Reduced Instruction Set Computer 
RTL  - Register Transfer Level 
SM   - Source Machine 
SW   - SoftWare 

pilz@ifi.unizh.ch / September 15, 1998