This software can be used to analyze, display and disassemble Enterprise-64 and 128 EXOS structured files / ROM images / programs on PC (or on the web, see later).
The most recent version of this document (probably more up to date than this copy, so read it there if you are not there already) can be found at: epbas.lgb.hu/readme.html
Project page: epbas.lgb.hu/
(The name "EPBAS" comes from the original functionality: listing IS-BASIC programs, and nothing more. Currently the functionality of the project is much wider than that.)
Currently, IS-FORTH (type-1), IS-BASIC programs (type-4), editor documents (type-8, ie: WP files), ML user programs (type-5) and raw binary images are supported (together with EXOS_ROM images). With ML/raw mode, disassembly can be requested as well.
The loaded file is parsed for multiple headers and, as with EXOS itself, the final type-10 header causes the converter to stop. The converter also takes care to create valid UTF-8 output, which also means that you may need to specify the EP character set used. The output can be plain TEXT or HTML, with/without "info mode" and "debug hex mode".
For more information about the development please check out the section "CHANGELOG" at the end of this file.
Additionally, the converter can be used to create "nice" HTML pages about the known character sets, Z80 opcodes, etc. It helps to fix bugs :) The internal DBs used to describe various ops can be dumped as well.
Please note that this file is a very simply re-formatted HTML variant of the text-only README file, which can be found in the downloadable version as well.
This program can be used/distributed/modified according to the GNU/GPL v3 or later. Visit www.gnu.org/licenses/gpl.html for more information. In a nutshell (described in a very amateurish and not so correct way): you can use/redistribute this program without restrictions, other than that you must not change the copyright / license. You can also modify it, or even create other projects using this work, if your work is also covered by this license and its source is made freely available.
This program is written by ©2012,2013 Gábor Lénárt ("LGB"), mail address is (lgb at-sign lgb dot hu).
Any - constructive - feedback/help is welcome.
Special thanks to the Hungarian Enterprise Forever forum, especially EP-Jedi Master Zozo. :-)
If you don't want to install this software (or its requirements, see below), you can even try it out online: epbas.lgb.hu/tryit/
If you choose to install it: epbas.lgb.hu/epbas.zip
Note: by downloading the software you get the "off-line" version, which is a command line tool. The on-line mode uses a little wrapper which is _not_ part of the downloadable version, but it merely calls the very same software anyway.
The converter is written in Python. On an average Linux (and maybe UNIX) system, it should be installed, so you can directly run the script (epbas.py) as an "executable" (you may need to give executable permission to the file though).
Please note that multiple .py files belong to this program, and you need all of them!
On Windows, I haven't got too much idea, as I never have had Windows. For sure, Python runs on Windows, but it's up to you to figure out how to do it (www.python.org). As far as I know, it's possible on Windows to "assign" *.py extension to the python.exe interpreter somehow, in that case you can "launch" the .py file "directly".
Note about Python: Python2 and Python3 are quite different "beasts". :) Though I test my converter with both major versions, I mainly use Python2. It's also important to note that older versions within the v2 branch may not know about newer constructs I use (like bytearray). So in a nutshell: I recommend Python 2.7, it should work. Earlier or later versions can cause problems.
Without any parameter, you'll get a summary on the syntax. Note, that some not-yet-finished switches are also shown which won't work.
There are some parameters with special modes which can be used only WITHOUT any other file names/switches. They are:
Besides these, the normal usage of the converter requires zero or more switches and an input file name (with or without path). Note: the LAST parameter of the command line is ALWAYS treated as the input file name/path! Before the file name, these switches can be used:
-info unless if you're only interested in the clean listing and nothing more.
-db mode (see above) to get some idea about known EP character sets. You can also use the -chset special mode to create a detailed character set map in the form of an HTML file.
-hex and/or -info mode. In HTML mode some kind of syntax-highlighting is used as well. HTML mode is recommended if it's acceptable for you, because of the help given by syntax highlighting and also the ability to follow code with the link/anchor mode. Of course an HTML page needs a web browser, so if you need a text file to be further processed, you may not want to use it anyway.
-dasm is also given (START address is only used then). Can be used to eg analyze a ROM image. You can even give more hex numbers (again separated by commas) to give the disassembler more hints about starting points for code analysis (this does not make any sense unless you use the -dasm switch too). Read the sections BINARY INPUT MODE and DISASSEMBLER to learn more on this topic.
-html mode. By default, HTML mode uses anchors/links to allow following the program/data structure where it's available (for more information see the HTML OUTPUT section of this document). With this switch you can disable it.
-bin switch) it will dump it in "standard" hex dump format (even without -hex given!). If you specify this option as well, the converter tries to disassemble instead. It's quite a complex topic, so read the "DISASSEMBLER" section about the details. Please note that this is considered a HIGHLY EXPERIMENTAL feature, it MAY EVEN NOT WORK AT ALL. It can also be used with the -bin switch.
DOES NOT WORK YET: you can specify a file after -dasm (ie: -dasm=source.asm) which should be the source from the previous run (for the same file!) but with user modifications. This will be processed to give "hints" and also to include user comments.
-html mode (not in text!). With this option, you can request to save the file instead of just displaying it. If you don't specify a parameter for -savepic (without the =filename part), the converter tries to display the file interactively in a window, but this is Python PIL stuff, and may work only on Linux, I have no idea.
Most switches can be used together without problem, however there are some exceptions. The program will warn you if this is the case, so you don't need to worry about this issue.
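The rule that the LAST parameter is always the input file can be sketched as below. This is only an illustration and not the converter's real argument handling: the function name split_command_line and the file name game.com are made up for the example, and the special single-parameter modes are not handled.

```python
def split_command_line(argv):
    """Split a parameter list into (switches, input_file).

    Hypothetical helper: it only shows that the LAST parameter is
    taken as the input file name, whatever it looks like.
    """
    if not argv:
        # No parameters at all: the converter prints its syntax summary.
        return [], None
    return list(argv[:-1]), argv[-1]


# The hex numbers of -bin are a separate parameter, as in "-bin F0,100 -dasm".
switches, input_file = split_command_line(
    ["-html", "-bin", "F0,100", "-dasm", "game.com"])
print(switches)
print(input_file)
```

Because the split is purely positional, a switch accidentally placed last would be taken as the file name.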
If a format contains a viewable image, it can be displayed, but only in HTML mode (see below, section "HTML OUTPUT"). However, it's possible to save the image as a file with the -savepic=filename switch, see above. In text mode (without -html), only a text message is displayed, saying that the image cannot be rendered.
HTML output mode can be requested with the -html switch.
One advantage of HTML mode is syntax highlighting. Another one is the anchor/link scheme: you can click on branches (GOTO numbers etc) to follow the execution of the program. It can help to understand complex programs.
This link/anchor mode can be disabled with the -nolinks switch.
With link/anchor not disabled, HTML anchors are generated prefixed with L0_ and similar prefixes. The number is the EXOS header "number" (eg: first header is 0, second is 1). It's needed to handle the situation to have more "items" in one file, so links won't be conflicting between modules.
This is the default mode, unless binary mode (-bin switch, see later) is specified.
In this mode the input file is treated as an EXOS file, unless the EXOS_ROM string is found, in which case it's treated as an EXOS ROM as a whole. If the input file is not an EXOS_ROM and does not seem to have a valid EXOS header either, the result is an error.
With the -bin switch, you can specify binary input. In this case, no EXOS header is examined, and the input file is treated as a block of raw ML program. The output of the converter in this case is similar to that for EXOS type=5 files: a hex dump of the file, or disassembled source if -dasm is specified as well.
Note that in theory you can use the -bin switch on an EXOS type=5 file too. This is possible because a type=5 EXOS file is simply the EXOS header followed by the program stream, with no ending header (no type=10 header at the end). Thus you can disassemble a type=5 EXOS file in binary input mode as well, with the following switches:
-bin F0,100 -dasm
The load address of 0xF0 is needed because the file starts with the EXOS header, and the "real" program should be placed at 0x100 (the EXOS header is 16 bytes long: 0xF0 + 16 = 0x100).
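The address arithmetic can be checked with a few lines of Python. The helper name bin_load_address is made up for this illustration; it is not part of the converter.

```python
EXOS_HEADER_SIZE = 16  # an EXOS header is 16 bytes long


def bin_load_address(program_address, header_size=EXOS_HEADER_SIZE):
    """Load address which puts the first byte AFTER the header at program_address."""
    return program_address - header_size


load = bin_load_address(0x100)
print("-bin %X,%X -dasm" % (load, 0x100))  # prints: -bin F0,100 -dasm
```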
One advantage of this trick is being able to use the binary code point sync hinting mode, which is not available in the default EXOS input mode. Binary input mode can also be useful to make a hex dump of an EP file of unknown format (no -dasm is needed then of course, and probably -bin 0,0 is a good idea, even though the "starting address" is not used much with a hex dump!). One disadvantage of using -bin with files having EXOS headers is that the disassembler won't parse the EXOS header, and won't express the length information there with labels. However it's not a big price, and you can modify the source to do so, if you really want to re-assemble the source later.
Note that the disassembler (see the DISASSEMBLER section) does not know about changes of the internal memory layout, so it's better to disassemble smaller parts at once, like page 0 of EXOS. As it's mapped from C000 after startup (which is not handled by the disassembler), it's better to give the converter that address directly. To really try that, you need a file containing only page 0 of EXOS. On a UNIX (eg Linux) system, it's quite easy:
dd if=name_of_the_original_image of=exos-page0.rom bs=16384 count=1
Of course you should modify the name after if= :) You'll get your page 0 with the name after of= with this command.
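If dd is not available (eg on Windows), the same cut can be done with a few lines of Python. This is just a sketch of an equivalent; the function name extract_first_page is made up, and the file names in the comment are the same placeholders as in the dd command.

```python
def extract_first_page(source_path, target_path, page_size=16384):
    """Copy the first page_size bytes of source_path into target_path."""
    with open(source_path, "rb") as source:
        page = source.read(page_size)
    with open(target_path, "wb") as target:
        target.write(page)
    return len(page)  # less than page_size if the source file is shorter


# Example call (replace the first name with your real image file):
# extract_first_page("name_of_the_original_image", "exos-page0.rom")
```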
Note that you can give more hex numbers with -bin, but two are compulsory, as we know: the first is the load address, the second is the start address (program entry point). Without -dasm the start address is quite meaningless, but you must specify it anyway, even if you only want a "nice" hex dump of the file. If you specify more hex numbers, they are given to the code analyzer of the disassembler as hints for code entry points. To learn more about this, read the section DISASSEMBLER.
In general: using binary input mode, you can improve the quality of the result from the disassembler if you specify more addresses, especially if you see that some parts of the code are not recognized as code, but dumped as data instead.
The original target of the project, thus the name of the "EPBAS". IS-BASIC programs are dumped however currently there is no perfect match with "real" printing (on the EP) as spaces used are different. It's on my to-do list.
"Multiple" BASIC programs type is not supported as I don't know what the hell they are, and also I haven't got any example to work with :(
IS-FORTH mode is currently experimental. Also, this was not tested too much yet.
If you examine an IS-FORTH program with syntax highlighting more closely, you will notice that almost everything is "red". That colour is used to show words which are part of the VLIST. It's not a mistake that eg the number 3 is red, but eg 99 is blue (numeric constant): it's because 3 is defined as a FORTH word ... The standard words were extracted from IS-FORTH directly. Defined (or redefined) words are also red, as they are tracked. Do not be surprised: in FORTH almost everything is just a defined word.
Also, the encoding of strings seems "odd" at first, ie a string needs a space at the beginning which is not part of the string. This is not a mistake either: the quotation mark is a FORTH word, and you need a separator so that FORTH can recognize it.
Absolute system extensions type (type code 6) is handled, however the current support is the very same as with "new application program" (type 5), only the load/start address is different (0xC00A instead of 0x0100). The very same rules apply, ie hex dump without -dasm, and disassembly mode when used with that switch.
The disassembler tries to be intelligent: it's an iterating, two pass disassembler written by me. In the first pass it tries to follow the code flow, iterating at every point where the program flow can result in multiple choices for the next PC value, ie conditional jumps or RETs. The result of the first pass is stored only as "code hints", which create a map in the in-core memory image of the memory locations containing opcodes. All other memory locations are then treated as data. In the second pass the actual output is generated by walking through the code hint points array. In case of a hit, the actual opcode is disassembled. Other locations are presented with data declarations. Another feature is the data hint array: for each opcode which would read/write a memory location, the data hint is set for that address. In data mode dump, more data is dumped in one line if there is no further data hint hit inside it. If the data is detected to be STD ASCII, an ASCII mode dump is done.
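The two passes can be illustrated with a heavily simplified sketch. It is NOT the code of the converter: the opcode table knows only five opcodes, the names OPCODES, find_code and listing are made up for this example, and data hints, fallback mode and ASCII dumps are left out entirely.

```python
# Tiny opcode subset, enough for the example only (NOT a real Z80 table):
# opcode -> (text template, length in bytes, flow continues after it?)
OPCODES = {
    0x00: ("NOP", 1, True),
    0xC9: ("RET", 1, False),
    0xC3: ("JP %04Xh", 3, False),
    0xC2: ("JP NZ,%04Xh", 3, True),
    0xCD: ("CALL %04Xh", 3, True),
}


def operand(mem, load, pc):
    """Read the 16 bit operand (low byte first) of the opcode at pc."""
    return mem[pc - load + 1] | (mem[pc - load + 2] << 8)


def find_code(mem, load, entries):
    """Pass 1: follow the code flow, return the set of opcode addresses."""
    end = load + len(mem)
    code = set()
    todo = list(entries)
    while todo:
        pc = todo.pop()
        while load <= pc < end and pc not in code:
            entry = OPCODES.get(mem[pc - load])
            if entry is None or pc + entry[1] > end:
                break               # unknown or truncated: give up this path
            code.add(pc)
            if entry[1] == 3:
                todo.append(operand(mem, load, pc))   # jump/call target
            if not entry[2]:
                break               # RET or unconditional JP ends the path
            pc += entry[1]
    return code


def listing(mem, load, code):
    """Pass 2: opcodes where pass 1 found code, data declarations elsewhere."""
    lines = []
    pc = load
    while pc < load + len(mem):
        if pc in code:
            template, length, _ = OPCODES[mem[pc - load]]
            text = template % operand(mem, load, pc) if length == 3 else template
        else:
            text, length = "DB %02Xh" % mem[pc - load], 1
        lines.append("%04X  %s" % (pc, text))
        pc += length
    return lines


# JP 0105h / two data bytes / CALL 0109h / RET / NOP / RET, loaded at 0100h
image = bytearray([0xC3, 0x05, 0x01, 0x48, 0x49, 0xCD, 0x09, 0x01,
                   0xC9, 0x00, 0xC9])
for line in listing(image, 0x100, find_code(image, 0x100, [0x100])):
    print(line)
```

The two bytes after the first JP are never reached by the flow analysis, so they come out as DB lines. Extra entry points in the entries list play the role of the additional hex numbers given with -bin.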
The current hack is the ability to try to disassemble sections which cannot be reached by constant jumps/calls. This is called "fallback mode". It's not an ideal solution, as these sections can be data rather than code. To minimize these cases, a linear part of the code is assigned as data as soon as a data label is found referencing that area. Of course it's also not perfect.
If HTML mode (-html) is requested, anchors/links are used as with BASIC, so eg JPs can be followed by clicking on the addresses as well.
Please note that there is a major problem with the ASCII mode data dump: the purpose of the whole converter project is to have a clean, UTF-8 representation of the EP encoding. However, the purpose of the disassembler is to create source which can be assembled by SJasm, which would not tolerate UTF-8 sequences too well (as it does not know what kind of EP chars should be generated from them). For this reason, the converter analyzes the text conversion tables by inspecting the used EP charset (-cset=...): only those bytes are treated as "character data" whose UNICODE position is the same as the EP-ASCII code of the given character. Also, the disassembler emits "ASCII" data parts only if at least 5 characters of continuous data are found, where every byte is in the interval of valid ASCII codes and has the very same ASCII code and Unicode position.
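The "at least 5 characters" rule can be sketched like this. It is an illustration only: the name ascii_runs is made up, and the set of "safe" byte values is a stand-in (plain printable ASCII), as the real set depends on the EP charset selected.

```python
MIN_RUN = 5  # at least 5 characters of continuous data


def ascii_runs(data, safe_codes, min_run=MIN_RUN):
    """Return (start, end) offsets of byte runs which may be dumped as text."""
    runs = []
    start = None
    for offset, byte in enumerate(bytearray(data)):
        if byte in safe_codes:
            if start is None:
                start = offset          # a new candidate run begins here
        else:
            if start is not None and offset - start >= min_run:
                runs.append((start, offset))
            start = None
    if start is not None and len(data) - start >= min_run:
        runs.append((start, len(data)))  # run lasting until the end of data
    return runs


# ASSUMPTION: printable ASCII stands in for "EP code equals Unicode position".
SAFE = set(range(0x20, 0x7F))
print(ascii_runs(b"\x00\x01HELLO\x00HI\x00", SAFE))  # prints: [(2, 7)]
```

"HELLO" is long enough to become a text run, while "HI" stays ordinary data bytes.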
Since the disassembler is picky and treats something as code only if it can reach it by following the code, it can fail to disassemble non-reachable parts, or code paths that can be reached only by register jumps, jump tables etc, which can't be discovered by simple static code analysis. To help the disassembler you can give manual "code hinting points" (BUT only if you use the -bin switch!). To learn more about this topic, please check out the reference of the switches and the BINARY INPUT MODE section.
Technical notes about the disassembler:
-info switch.
Still about the Z80 disassembler. The "standard" Z80 assembly syntax is somewhat broken sometimes. For example check out this:
JP (HL)
In reality there is NO instruction like this! (HL) would normally mean that the CPU should read the byte (or maybe word in this case ...) at the address in HL, and use that as the address to jump to. However this is not true at all, as this op simply jumps to the address in HL, so PC := HL.
So in reality, the legal op should be written as:
JP HL
And yes, SJasm even supports this, surprise :) Maybe it was a mistake that the JP (HL) format is used so widely ...
Another fact which shows that JP HL is the correct form is the DD/FD prefix. As we know, if an instruction uses (HL), the DD/FD prefix causes it to use (IX+d) or (IY+d) instead. If an instruction does not use (HL) but uses HL, H or L, it's converted to IX/IY, IXH/IYH or IXL/IYL. If you find a Z80 opcode matrix and search for "DD E9" (E9 is the opcode of JP HL) you will see this:
JP IX
It means that the original (unprefixed) E9 opcode must have been JP HL, and _not_ JP (HL), as in that case the prefixed version would be JP (IX+d) and not JP IX.
I had to mention this topic, as it was reported as a bug. It is not; the bug is people using the JP (HL) format. As my disassembler is not based on generated tables, but parses opcodes on byte level, the logical way - JP HL - is used. If you tried to modify this, the prefixed version would turn out to be JP (IX+d), which would be incorrect.
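A byte level decoder makes the argument visible: the prefix only swaps the register name, nothing else. The sketch below handles the single opcode E9 and is not taken from the converter; the names PREFIX_REGISTER and decode_e9 are made up for it.

```python
PREFIX_REGISTER = {0xDD: "IX", 0xFD: "IY"}


def decode_e9(code):
    """Decode opcode E9, with or without a DD/FD prefix, on byte level."""
    code = bytearray(code)
    register = "HL"
    offset = 0
    if code[0] in PREFIX_REGISTER:
        # The prefix replaces HL itself: no (..) and no displacement byte.
        register = PREFIX_REGISTER[code[0]]
        offset = 1
    if code[offset] != 0xE9:
        raise ValueError("this sketch only knows opcode E9")
    return "JP " + register


print(decode_e9(b"\xE9"))      # prints: JP HL
print(decode_e9(b"\xDD\xE9"))  # prints: JP IX
print(decode_e9(b"\xFD\xE9"))  # prints: JP IY
```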
Another similar anomaly is that the "ALU" group of opcodes has "A" as the first parameter for some opcodes, but not for others. From the viewpoint of logic it's totally meaningless, so my disassembler never generates "A": no ADD A,B but only ADD B. Yes, SJasm supports this.
There is some oddity in how Z80 assembly syntax expresses some situations. Consider this:
LD somereg, label
LD somereg, (label)
It's clear what the difference is, but it's easy to mess this up by mistake. Some Z80 assemblers support syntax like:
LD somereg, #data
to signal with '#' that this is immediate data (and not a memory reference). However I'm using the "standard" way.
The SJasm syntax which allows [...] instead of (...) is not supported. The purpose (I guess) of SJasm's [...] is to avoid situations where (...) is used mathematically and is not meant to signal a memory reference ... ?
I should start to think about a better name for the project, because the original intent of the converter was a simple "list EP BASIC programs as text", but now it handles multiple header types, it has some kind of intelligent disassembler, it does character set conversions, and it helps to create HTML references of charsets and of the internal program flow for both BASIC and ML parts. Now, even IS-FORTH and WP are supported.
More and more "smart" disassembler functionality is to be implemented, with user controlled hinting (label names, "phase blocks", data/code selection, comments, etc), in a way that a newly generated disasm list contains all the user submitted changes, while allowing easy interfacing with an external GUI and/or web frontend for the user to do this. I may leave this work until after version 1.0.
Another minor change can be handling the quite regular case when a program copies itself to another location with a simple LDIR opcode. As we can (hopefully) know the register values before the LDIR, we can "fake" the result of the operation as well!
For 1.0, I'd like to clean the code up at many places (even reconstructing the whole program), fix bugs, and introduce clean python2/python3 compatibility without the current "hacks". Minor feature improvements (as with IS-FORTH support) can come meanwhile though. The new and clean code base will make it possible to introduce bigger changes then, like even more advanced disassembler features.
Short term goals:
Longer term goals:
In my opinion, there is no "complete" documentation on files handled by EXOS. Information can be gathered from various places though, or experimenting with files & checking them in a hex editor. I try to summarize information I know. This is _FAR_ from being complete, so if you have any suggestion/help, please tell me. Zozo already was a great source :)
Enterprise's OS (EXOS) uses a well-structured and nice scheme to manage even multiple modules inside a single file. EXOS files consist of one or more modules, each with a 16 byte long header. The first byte of the header must be zero, otherwise it's not an EXOS file. The next byte signals the type. The remaining bytes are usually zero, unless they are used by the specific type; their meaning then depends on that type. After the header (except for end-of-file) data follows; it's up to the specific format to tell how many bytes (the type also defines how to interpret the data bytes, of course).
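Reading such a header takes only a few lines. The following is a sketch based on the description above, not the converter's own code; the function name parse_exos_header is made up.

```python
HEADER_SIZE = 16  # every EXOS module header is 16 bytes long


def parse_exos_header(data, offset=0):
    """Return (type_code, remaining_14_bytes) of the header at offset."""
    header = bytearray(data[offset:offset + HEADER_SIZE])
    if len(header) < HEADER_SIZE:
        raise ValueError("truncated EXOS header")
    if header[0] != 0:
        raise ValueError("not an EXOS file: first header byte is not zero")
    return header[1], header[2:]


# A made-up type-5 header: zero byte, type code, then 14 type specific bytes.
example = bytearray([0, 5, 0x34, 0x12]) + bytearray(12)
type_code, rest = parse_exos_header(example)
print(type_code)  # prints: 5
```

How many data bytes follow the header cannot be decided here, as it depends on the type.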
Type codes:
The end-of-file header is special: it signals the end of the file, and there is no more data after the header. One thing I can't really understand: some of the types above seem to be "terminating", with no other headers/modules following even without the end-of-file type, while some of them need the end-of-file. It seems to depend on the behaviour of the type: some module types cause control to be passed to the handler, so there is no point in putting more headers in the file, not even the end-of-file. For example this is the case with type-5.
It's important to note that these files can contain ASCII data. In this case, the bytes should be interpreted using the character set map used by the Enterprise, and there are even multiple - different - character sets. That's why my converter needs this information, and it maps EP chars into UTF-8 sequences based on the selected character set information. Character set tables used by my program can be seen here: epbas.lgb.hu/result-chset.html It seems (outside of Hungary) two main tables are used: the UK and the BRD (German).
IS-FORTH programs are simple: basically they are all text. An IS-FORTH program consists of one or more "buffers". The byte in the EXOS header after the type code gives the number of buffers. A buffer is always 2+1024 bytes long. The first two bytes form a word, which tells the buffer number ("sequence"). Please note that there is no need for strict ordering of the buffer numbers. The next 1024 bytes form the buffer itself. The unused bytes at the end of the buffer are filled with space characters (ASCII 0x20). The used area is divided into lines separated by standard CRLF sequences. It seems IS-FORTH programs lack the end-of-file header. Of course character conversion applies!
Decoding IS-FORTH is really easy compared to eg IS-BASIC, as the program "stream" within the buffer is only text. However, doing syntax highlighting etc is harder, as IS-BASIC programs carry structural information while IS-FORTH programs do not. I try to "tokenize" the buffer content using separator characters like space, filling the vlist array as well, which is used for links to word definitions in HTML mode. Comments are also more-or-less recognized, together with the built-in list of words.
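The buffer layout can be walked as in the sketch below. It is an illustration only: the name split_forth_buffers is made up, the byte order of the buffer number word is an ASSUMPTION (low byte first, as usual on the Z80), and character set conversion is left out.

```python
BUFFER_TEXT_SIZE = 1024
BUFFER_SIZE = 2 + BUFFER_TEXT_SIZE  # number word + the buffer itself


def split_forth_buffers(body, count):
    """Return a list of (buffer_number, lines) for the data after the header."""
    buffers = []
    for index in range(count):
        chunk = bytearray(body[index * BUFFER_SIZE:(index + 1) * BUFFER_SIZE])
        if len(chunk) < BUFFER_SIZE:
            raise ValueError("truncated buffer %d" % index)
        number = chunk[0] | (chunk[1] << 8)      # ASSUMPTION: low byte first
        text = bytes(chunk[2:]).rstrip(b" ")     # drop the space padding
        buffers.append((number, text.split(b"\r\n")))
    return buffers


# One made-up buffer with number 3 and two lines of text.
body = bytearray([3, 0]) + b": HI ;\r\n: YO ;".ljust(BUFFER_TEXT_SIZE, b" ")
print(split_forth_buffers(body, 1))
```

The lines are still raw EP bytes here; a real listing would convert them to UTF-8 with the selected character set.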
WP documents (technically they are called "saved editor buffer documents" or such, so maybe not only WP can emit this kind of information). I don't know too much about this format, other than that it consists of character lines having 3 bytes of information at the beginning (as far as I can tell, pointers: editor buffers in memory are linked lists), which I skip, and a trailing byte. What I do is to skip the 3 bytes at the beginning, print the line as long as the byte value is equal to or greater than 32 (space), then skip a single byte again, and finally continue with the next line. I don't say it's the correct solution :) Of course, character conversion applies!
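The skipping scheme just described looks like this in code. It is a sketch of the description only (which, as said, may not be the correct way to read the format); the name wp_lines is made up and character conversion is omitted.

```python
def wp_lines(body):
    """Split editor buffer data into raw text lines (still EP encoded)."""
    data = bytearray(body)
    lines = []
    pos = 0
    # A line needs at least 3 leading bytes and 1 trailing byte.
    while pos + 4 <= len(data):
        pos += 3                      # skip the 3 bytes before the text
        start = pos
        while pos < len(data) and data[pos] >= 32:
            pos += 1                  # text lasts while bytes are >= 32
        lines.append(bytes(data[start:pos]))
        pos += 1                      # skip the single trailing byte
    return lines


# Made-up data: two lines, each with 3 leading bytes and a trailing zero.
sample = b"\x01\x02\x03Hello\x00" + b"\x04\x05\x06World\x00"
print(wp_lines(sample))
```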
User ML programs are "machine language" binary streams. They're always loaded at offset 0x100 in memory. There is not much structure in this file, however the EXOS header contains the length of the program (as the low then high bytes of a word) after the type byte. If you want to disassemble the program or even do a hex dump, you must be careful with the bytes used to display/represent strings, as they are subject to character conversion. However, unlike with interpreted and "structured" files (like IS-BASIC), you can't be sure which bytes are data and which are not, of course ...
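Getting the program bytes out of a type-5 file can be sketched as follows. The function name user_program is made up and the code is an illustration of the layout described above, not the converter's implementation.

```python
HEADER_SIZE = 16
LOAD_ADDRESS = 0x100  # user ML programs are always loaded here


def user_program(data):
    """Return (load_address, program_bytes) of a type-5 EXOS file."""
    data = bytearray(data)
    if len(data) < HEADER_SIZE or data[0] != 0 or data[1] != 5:
        raise ValueError("not a type-5 EXOS file")
    length = data[2] | (data[3] << 8)   # low byte, then high byte
    program = data[HEADER_SIZE:HEADER_SIZE + length]
    if len(program) < length:
        raise ValueError("file is shorter than the length in the header")
    return LOAD_ADDRESS, program


# Made-up file: the header says 3 bytes of program, then NOP, NOP, RET follow.
sample_file = bytearray([0, 5, 3, 0]) + bytearray(12) + bytearray([0, 0, 0xC9])
address, program = user_program(sample_file)
print("%04X: %d bytes" % (address, len(program)))  # prints: 0100: 3 bytes
```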
IS-BASIC programs are "complicated" because they use tokenization, custom number representation, etc (but note: it's said IS-BASIC can load - and also save - programs as pure text files too). An IS-BASIC program consists of lines. A line begins with a single byte telling the length of the line. If it is zero, it signals the end of the program, and you should stop parsing there. The next two bytes represent the line number (standard low then high byte). The end of the line is signaled by a zero byte (though it's redundant in my opinion, as the length byte shows the line length anyway).
The line itself consists of "marker" bytes (note, it's only my name for these entities) and possibly other information after the specific markers. If the marker byte is below 0x20, the "special sign" table is used to display some character. If the byte is below 0x60 (but not below 0x20, of course), the marker's lower 5 bits give the length of a string which follows the marker and is decoded as a name (in EP charset). Byte 0x80 marks a BASIC string: the next byte is the length, then that number of bytes follows. The string must be decoded (in EP charset) and surrounded by quotation marks. Marker byte 0x60 marks a tokenized BASIC keyword: the next byte is an index into the token table. Some of the BASIC keywords ("untok_left") are special in that the rest of the line should be printed as-is (with EP charset conversion though, of course), without marker byte interpretation, etc. Marker bytes 0xA2 and 0xC2 mean a two byte integer constant, which follows the marker byte as low then high bytes. There are two different markers because 0xA2 is used for BASIC line references (GOTO, etc). 0xC6 signals a float number.
Unlike other BASIC dialects, IS-BASIC does not use "standard" floating point math but a (packed) BCD encoded scheme. This reduces the number space somewhat (and is said to be slower); however, compared to base-2 math, it's accurate from the viewpoint of the base-10 math humans are used to working with (eg 0.1 cannot be stored precisely in base-2, which can cause funny surprises even with simple FOR-NEXT loops). To really understand the BCD encoded floats, it's better to look at the source code: it's harder to explain than to read the code. The special sign table and the BASIC token table can be seen in the source too, or can be viewed by specifying the -db command line switch, or by visiting this page (which is the output of a call with -db):
epbas.lgb.hu/result-dist-db.txt
As far as I can tell, other values of the marker byte are invalid; at least I throw an error for other values.
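The marker byte rules can be summarized as a small classifier. This is a sketch of the rules as written in this section, not the converter's real code: the name classify_marker and the returned labels are made up, and the data following each marker (including the float bytes) is not decoded.

```python
def classify_marker(marker):
    """Return (kind, detail) for an IS-BASIC marker byte.

    detail is the name length for names, the operand size in bytes for
    integer constants, the marker itself for special signs, else None.
    """
    if marker < 0x20:
        return ("special sign", marker)       # index for the special sign table
    if marker < 0x60:
        return ("name", marker & 0x1F)        # lower 5 bits: length of the name
    if marker == 0x60:
        return ("keyword token", None)        # next byte: token table index
    if marker == 0x80:
        return ("string", None)               # next byte: string length
    if marker == 0xA2:
        return ("line reference", 2)          # eg the target of a GOTO
    if marker == 0xC2:
        return ("integer", 2)
    if marker == 0xC6:
        return ("float", None)                # BCD encoded, see the source
    raise ValueError("invalid marker byte 0x%02X" % marker)


print(classify_marker(0x25))  # prints: ('name', 5)
print(classify_marker(0xA2))  # prints: ('line reference', 2)
```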
-bin switch!) with/without -dasm switch. Currently, only 16K images are supported.
... it was too early to write changelog at the beginning ...