Since you're taking feedback and this hex editor is geared towards reverse engineering, I'd like to offer a suggestion. Game modders need a hex editor that supports non-standard text encodings, but hardly any viable tools satisfy this.
Text Encoding?
You know, how byte 0x41 maps to the text character "A". However, this assumes this data structure abides by the ASCII standard.
All hex editors support this in some way.
There are many different ways to interpret bytes after 0x7F, depending on the encoding (the single-byte ones are often called code pages). Some of these encodings even use 2 bytes per character (and then up to 4 once emoji and the like were added) because 256 characters just ain't enough. One of these encodings is UTF-8. For the sake of the argument, I'll call these standard character encodings.
Many hex editors support these or a smaller portion of them out of the box.
Then there are the completely arbitrary character encodings, seen in Teletext transmissions, some early printer formats, video games, and East Asian computers. They don't abide by ASCII or any other established encoding, because the designers prioritized something else, like using a partial font (with rarely used symbols thrown out) that fits into limited video memory, then numbering the characters from 00 in whatever order the font stores them. This practice lives on to this day. I'll call these the user-defined character encodings, and they're the main subject of my feature request.
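To make the "count from the first character" idea concrete, here's a minimal Python sketch. FONT_ORDER is a made-up font layout and decode_by_font_order is a name I picked for illustration; no specific game is implied.

```python
# Hypothetical font layout: a glyph's position in the font IS its byte value,
# so "A" is 0x00, "B" is 0x01, and so on. ASCII plays no role here.
FONT_ORDER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ .!?"


def decode_by_font_order(data: bytes, font_order: str = FONT_ORDER,
                         placeholder: str = "\ufffd") -> str:
    """Map every byte to the glyph stored at that index in the font."""
    return "".join(
        font_order[b] if b < len(font_order) else placeholder
        for b in data
    )
```

With this layout, the bytes 07 04 0B 0B 0E read as HELLO, while an ASCII view of the same bytes would only show control characters.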
Existing approaches to user-defined character encodings
Character Sets (010 Editor)
010 Editor has a bunch of standard character encodings, both single byte (which it calls "simple") and double byte ("complex", which are hardcoded). For single byte encodings only, it allows editing, exporting and importing the character sets as CSV files.
The CSV of a 010 Editor Character Set is a comma separated 16*16 array that represents all 256 possible byte values from 0x00 to 0xFF. Each cell holds the Unicode code point of the character that represents that byte value, written in the 0xABCD format. For example, a customizable character set that assigns the euro symbol (U+20AC) to 0x80 will have 0x20AC in the first cell of the ninth line.
A cell can have -1 as its value to tell 010 Editor to leave it blank and/or not overwrite the last used character there.
0x0000,0x0001,0x0002,0x0003,0x0004,0x0005,0x0006,0x0007,0x0008,0x0009,0x000A,0x000B,0x000C,0x000D,0x000E,0x000F,
0x0010,0x0011,0x0012,0x0013,0x0014,0x0015,0x0016,0x0017,0x0018,0x0019,0x001A,0x001B,0x001C,0x001D,0x001E,0x001F,
0x0020,0x0021,0x0022,0x0023,0x0024,0x0025,0x0026,0x0027,0x0028,0x0029,0x002A,0x002B,0x002C,0x002D,0x002E,0x002F,
0x0030,0x0031,0x0032,0x0033,0x0034,0x0035,0x0036,0x0037,0x0038,0x0039,0x003A,0x003B,0x003C,0x003D,0x003E,0x003F,
0x0040,0x0041,0x0042,0x0043,0x0044,0x0045,0x0046,0x0047,0x0048,0x0049,0x004A,0x004B,0x004C,0x004D,0x004E,0x004F,
0x0050,0x0051,0x0052,0x0053,0x0054,0x0055,0x0056,0x0057,0x0058,0x0059,0x005A,0x005B,0x005C,0x005D,0x005E,0x005F,
0x0060,0x0061,0x0062,0x0063,0x0064,0x0065,0x0066,0x0067,0x0068,0x0069,0x006A,0x006B,0x006C,0x006D,0x006E,0x006F,
0x0070,0x0071,0x0072,0x0073,0x0074,0x0075,0x0076,0x0077,0x0078,0x0079,0x007A,0x007B,0x007C,0x007D,0x007E,0x007F,
0x20AC,0xFFFD,0x201A,0x0192,0x201E,0x2026,0x2020,0x2021,0x02C6,0x2030,0x0160,0x2039,0x0152,0xFFFD,0x017D,0xFFFD,
0xFFFD,0x2018,0x2019,0x201C,0x201D,0x2022,0x2013,0x2014,0x02DC,0x2122,0x0161,0x203A,0x0153,0xFFFD,0x017E,0x0178,
0x00A0,0x00A1,0x00A2,0x00A3,0x00A4,0x00A5,0x00A6,0x00A7,0x00A8,0x00A9,0x00AA,0x00AB,0x00AC,0x00AD,0x00AE,0x00AF,
0x00B0,0x00B1,0x00B2,0x00B3,0x00B4,0x00B5,0x00B6,0x00B7,0x00B8,0x00B9,0x00BA,0x00BB,0x00BC,0x00BD,0x00BE,0x00BF,
0x00C0,0x00C1,0x00C2,0x00C3,0x00C4,0x00C5,0x00C6,0x00C7,0x00C8,0x00C9,0x00CA,0x00CB,0x00CC,0x00CD,0x00CE,0x00CF,
0x00D0,0x00D1,0x00D2,0x00D3,0x00D4,0x00D5,0x00D6,0x00D7,0x00D8,0x00D9,0x00DA,0x00DB,0x00DC,0x00DD,0x00DE,0x00DF,
0x00E0,0x00E1,0x00E2,0x00E3,0x00E4,0x00E5,0x00E6,0x00E7,0x00E8,0x00E9,0x00EA,0x00EB,0x00EC,0x00ED,0x00EE,0x00EF,
0x00F0,0x00F1,0x00F2,0x00F3,0x00F4,0x00F5,0x00F6,0x00F7,0x00F8,0x00F9,0x00FA,0x00FB,0x00FC,0x00FD,0x00FE,0x00FF,
It's obviously limited because of the lack of support for encodings that use more than 1 byte per character.
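For reference, reading such a CSV takes very little code. This is only a sketch based on the format as I described it above (256 cells, 0xABCD values, -1 for "leave blank"), not on 010 Editor's actual importer; parse_charset_csv and decode_single_byte are names I made up.

```python
def parse_charset_csv(text: str) -> list:
    """Turn a 16*16 character set CSV into a 256-entry lookup list.

    Entry i is the character for byte value i, or None where the cell is -1.
    """
    # Rows end with a trailing comma, so empty cells are simply dropped.
    cells = [c.strip() for c in text.replace("\n", ",").split(",") if c.strip()]
    if len(cells) != 256:
        raise ValueError(f"expected 256 cells, found {len(cells)}")
    table = []
    for cell in cells:
        value = int(cell, 0)  # base 0 accepts both "0x20AC" and "-1"
        table.append(None if value < 0 else chr(value))
    return table


def decode_single_byte(data: bytes, table: list, placeholder: str = ".") -> str:
    """Render bytes through the table, using a placeholder for blank cells."""
    return "".join(
        table[b] if table[b] is not None else placeholder for b in data
    )
```

Since the lookup is indexed by a single byte, the one-byte-per-character limit is built into the format itself.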
Thingy Table Files (Thingy)
These were used by romhacking communities to deal with retro game modding.
They consist of text files that contain entries separated by newlines for each character.
Each entry is the hex value (not case sensitive) of the byte(s) in big endian, then an equal sign, then the actual character(s), and finally a newline. Any entry that doesn't follow this is ignored. The equal sign itself can be represented with an entry such as 03== (byte 03, the separator, then the character).
Some custom versions of this used a tab character instead of the equal sign as a separator, but they weren't used as much.
Table files have the extension .tbl and are supported by some "special" romhacking-focused hex editors such as Tinke, WindHex and Thingy. The problem is that most of these are obsolete, closed source and not under active development anymore.
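The entry format described above is simple enough to parse in a few lines. A minimal Python sketch, assuming the file has already been decoded to a string; parse_tbl is my own name, and skipping entries with an empty right-hand side is my assumption, not something the format spells out.

```python
def parse_tbl(text: str, separator: str = "=") -> dict:
    """Parse Thingy-style table text into {byte sequence: display string}."""
    table = {}
    for line in text.splitlines():
        # Split on the FIRST separator only, so "03==" maps byte 03 to "=".
        key, sep, value = line.partition(separator)
        if not sep or not value:
            continue  # no separator or nothing after it: ignore the line
        try:
            raw = bytes.fromhex(key)  # accepts upper and lower case hex
        except ValueError:
            continue  # left-hand side is not hex: ignore the line
        if raw:
            table[raw] = value
    return table
```

Passing separator="\t" would cover the tab-separated variant.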
Occasionally, games resort to compression techniques for the text, such as byte pair encoding, which romhacking circles often call "dual tile encoding" (DTE) or "multi tile encoding" (MTE). The terminology is imprecise, but DTE is used exclusively when two characters are encoded with one byte, and for the lowest layer if the scheme is recursive.
There are also so-called control codes: markup tags in binary form that tell the game where to stop parsing text (the "end" control code), or that add newlines, waits for user prompts, timers, sound effects, instructions for different text colors, fonts and sizes...
This can also be represented with tables.
# this is a comment
# byte pair encoding (mte)
8500=Hello
8501=You got
8502=ight
# byte pair encoding (dte)
A0=ed
A1=ng
A2=..
# control codes
FE=\n
FF=\e
FC00=[color:default]
FC01=[color:red]
FC02=[color:yellow]
FA=[wait_user]
One main flaw in previous hex editors with Thingy TBL support is that, when filling the 16*16 array with character values, any byte without a table entry is shown as a placeholder character that can't be customized, such as a period or a question mark.
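Here's a rough sketch of how decoding with such a table could work, including a placeholder the user can change. It assumes longest-match-first lookup (so 8500 wins over a plain 85 entry), which is my guess at sensible behaviour rather than what any particular editor does. EXAMPLE_TABLE and decode_with_table are hypothetical.

```python
# Hypothetical table in the parsed form {byte sequence: display string}.
EXAMPLE_TABLE = {
    b"\x85\x00": "Hello",  # MTE entry
    b"\xa0": "ed",         # DTE entry
    b"\x00": "A",
    b"\x01": "B",
    b"\xfe": "\n",         # control code
}


def decode_with_table(data: bytes, table: dict,
                      placeholder: str = "<${:02X}>") -> str:
    """Decode bytes with a table, trying the longest entries first.

    `placeholder` is a format string that receives the unmapped byte value,
    so "." gives the classic dot and "<${:02X}>" shows the raw hex instead.
    """
    max_len = max(map(len, table), default=1)
    out = []
    i = 0
    while i < len(data):
        for n in range(min(max_len, len(data) - i), 0, -1):
            chunk = data[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            # No entry matched at this offset: emit the placeholder.
            out.append(placeholder.format(data[i]))
            i += 1
    return "".join(out)
```

Showing the raw value in the placeholder keeps unmapped bytes distinguishable from each other, which a fixed period can't do.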
Some notes about previous hex editors with support for this:
- Since the Thingy TBL format never specified a text encoding, and it's a plain text file, users saved their tables in whatever code page they had. Some hex editors are hardcoded to accept only specific encodings (like ASCII or Shift-JIS), but a couple have the more welcome option to pick which encoding to parse the table as, similar to LibreOffice's handling of CSV files.
- Some hex editors with support for standard character encodings expose the user-defined character encoding as a toggle ("Use TBL") rather than enabling it by default.
- As for byte pair encoding and keyboard input in the text field, many of these hex editors just type the "uncompressed" characters rather than using the DTE/MTE bytes, unless a "DTE mode" is toggled on. That mode behaves like a Japanese IME: you write the text outside the editor and press return, and only then is it inserted into the text field (overwriting, as always) with DTE/MTE applied. It may also accept pasted text.
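The "DTE mode" insertion step could look something like the sketch below: take the finished string and greedily pick the longest table entry at each position. Greedy matching is my assumption; it isn't guaranteed to give the shortest possible output, and real editors may do it differently. DTE_TABLE and encode_with_table are made-up names.

```python
# Hypothetical table in the parsed form {byte sequence: display string}.
DTE_TABLE = {
    b"\x00": "s", b"\x01": "i", b"\x02": "n",
    b"\x03": "g", b"\x04": "e", b"\x05": "d",
    b"\xa0": "ed",  # DTE entries: two characters in one byte
    b"\xa1": "ng",
}


def encode_with_table(text: str, table: dict) -> bytes:
    """Encode text to bytes, preferring the longest matching table string."""
    # Invert the table; if two entries share a string, keep the shorter bytes.
    reverse = {}
    for raw, shown in table.items():
        if shown and (shown not in reverse or len(raw) < len(reverse[shown])):
            reverse[shown] = raw
    max_len = max(map(len, reverse), default=0)
    out = bytearray()
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in reverse:
                out += reverse[piece]
                i += n
                break
        else:
            raise ValueError(f"no table entry for {text[i]!r} at index {i}")
    return bytes(out)
```

With this table, "singed" becomes 4 bytes (00 01 A1 A0) instead of the 6 that typing it character by character would produce.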
abcde Table Files
A refinement of the Thingy Table File format that standardizes it and adds some more esoteric features. It's meant for text dumping utilities (which extract and reinsert the text into the game binary following a table file in this newer format, plus a configuration file for the text data location and pointer formats), but some additions are worth mentioning. Note that abcde comes with a readme that goes into this in much more detail than I ever could.
- It's mostly backward compatible with Thingy Table Files.
- HAS TO BE UTF-8. Of course, the Byte Order Mark (which is optional) will be ignored when parsing the abcde TBL file.
- Comments now have to be prefixed with #
- The characters < and > are reserved and can no longer be included, because of the newly added support for multi-table files. Maybe this can be solved by adding escape characters such as \< or by turning off multi-table support?
Its new concepts:
A standardized way to mark the end control code character meant to stop the parsing of the current string.
# FF is the end control code
/FF=[END]
Support for binary values.
Useful when dealing with games that use fewer than 8 bits per character, either of fixed length (like Battle of Olympus on NES, which used 5 bits for all characters; abcde includes examples based on it) or of variable length (Huffman compression is one of the better known applications, and an implementation of this would display Huffman-encoded text as plain text).
It requires an anchor point offset to start interpreting text from, and an explicit end control code in binary.
Might be challenging to implement in a hex editor, though I saw some unpublished attempts by some romhackers to program hex editors which display the hex field using BINARY rather than hexadecimal. Maybe the text output could be added to the pattern data viewer instead?
# binary values are prefixed by %
# huffman coding wikipedia page example used here
%111=
%010=A
%000=E
%1101=F
%1010=H
%1000=I
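To give an idea of what the decoder has to do, here's a small Python sketch of bit-level matching against a prefix-free table like the one above. Several things are my assumptions: bits are read most significant first, %111 stands for a space (as in the Wikipedia example), and END_CODE is a made-up code, since the excerpt above doesn't include one.

```python
# Codes modelled on the table above; %111 is ASSUMED to be a space.
CODES = {
    "111": " ", "010": "A", "000": "E",
    "1101": "F", "1010": "H", "1000": "I",
}
END_CODE = "0011"  # hypothetical end control code, prefix-free with CODES


def iter_bits(data: bytes, start_bit: int = 0):
    """Yield bits from data, most significant bit first (an assumption)."""
    for pos in range(start_bit, len(data) * 8):
        yield (data[pos // 8] >> (7 - pos % 8)) & 1


def decode_bits(data: bytes, codes: dict, end_code: str,
                start_bit: int = 0) -> str:
    """Decode variable-length codes from the anchor bit up to the end code."""
    out = []
    current = ""
    for bit in iter_bits(data, start_bit):
        current += "01"[bit]
        if current == end_code:
            return "".join(out)
        if current in codes:
            out.append(codes[current])
            current = ""
    raise ValueError("ran out of data before reaching the end code")
```

For example, the bytes A8 FA 01 80 decode to "HI FEE" with this table. The start_bit parameter plays the role of the anchor point: start one bit off and the output is garbage, which is why the offset has to be explicit.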
Support for multiple tables.
Some games, instead of assigning different values to upper-case and lower-case characters, had them share the same value and added a "table switch control code" to change between them. This was used quite often for Latin characters (A/a) or Japanese kana (あ/ア), but also to pick between 2 sets of punctuation and symbols. To date, no hex editor can handle this use case. Here's an example of a hypothetical game implementation:
Internal Data:
h(§2)ello! (§1)m(§2)y name is (§1)erdrick t(§2)he(\n)
d(§2)ragonmaster.(\e)
Output:
Hello! My name is ERDRICK The
Dragonmaster.
Some key observations here:
- The game has two sets: the first one is for all caps, the second one is for lowercase characters. In abcde tables, these sets are defined just like normal tables, but as 2 tables in succession, each preceded by a line with its name: @FirstSet and @SecondSet.
- The game has a default set, the first. It's used in the very beginning, or after some characters/control codes like the line break here. In abcde tables, this will be the first set present in the table file, so FirstSet.
- These sets can be explicitly called with control codes. (§2) forces lower case characters. These are the table switch control codes. In abcde tables, they are defined with a line prefixed by ! that includes a "table ID" (which set to use) and a "match type" (when to go back to the default set).
In our example, assuming (§2) uses the byte value 02 in hexadecimal, and likewise (§1) is 01, and the line break is 0A, the abcde table will be something like:
# switch to SecondSet and never switch back unless it hits §1, or unless it hits a line break
!02=,<@SecondSet>:$01,<@SecondSet>:$0A
or, alternatively:
# switch to FirstSet and never switch back (condition 0)
!01=,<@FirstSet>,0
# switch to SecondSet and never switch back unless it hits a line break
!02=,<@SecondSet>:$0A
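The behaviour a hex editor would need for this example can be sketched as a tiny state machine. This only models the decoding, not the abcde syntax itself. The switch codes 01 and 02 and the line break 0A come from the example; the letter and punctuation byte values and the FF end code are invented for illustration.

```python
import string

# Both sets share byte values; only the glyphs differ (byte values invented).
FIRST_SET = {0x10 + i: c.upper() for i, c in enumerate(string.ascii_lowercase)}
SECOND_SET = {0x10 + i: c for i, c in enumerate(string.ascii_lowercase)}
for shared in (FIRST_SET, SECOND_SET):
    shared.update({0x30: " ", 0x31: "!", 0x32: "."})

TABLES = {"FirstSet": FIRST_SET, "SecondSet": SECOND_SET}
SWITCHES = {0x01: "FirstSet", 0x02: "SecondSet"}  # table switch control codes
LINE_BREAK = 0x0A  # also falls back to the default set
END = 0xFF         # hypothetical end control code


def decode_multi_table(data: bytes, default: str = "FirstSet") -> str:
    """Decode bytes while tracking which set is currently active."""
    current = default
    out = []
    for b in data:
        if b == END:
            break
        if b in SWITCHES:
            current = SWITCHES[b]
        elif b == LINE_BREAK:
            out.append("\n")
            current = default
        else:
            out.append(TABLES[current].get(b, f"<${b:02X}>"))
    return "".join(out)
```

The catch for a hex editor is that the meaning of a byte now depends on everything before it, so the text column can't be rendered from an arbitrary offset without knowing (or guessing) the active set.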
Some games have kanji characters they call from an additional font. For example 8200 would call the first character (ID: 00), and would display 今 so this could be represented in two different ways:
- using the features of a normal thingy table, so
8200=今
OR
- add a new set to the table, called @kanji1 (of course taking care to name the main set, @main or something)
- include under it
00=今
- include in the @main table the following
!82=,<@kanji1>:1
The value 1 means "make exactly 1 match in the kanji1 table, then fall back". Why bother? Some games, like the Japanese version of Secret of Mana, have similar control codes that read a fixed number of characters, from 1 to 5 depending on the control code, so the value would range from 1 to 5 there.
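A sketch of the "N matches, then fall back" behaviour, again modelling only the decoding and not abcde's parser. MAIN and the second kanji entry are made up; only 82 and 00=今 come from the example above.

```python
MAIN = {0x00: "A", 0x01: "B"}      # hypothetical main table
KANJI1 = {0x00: "今", 0x01: "日"}   # 01=日 is an invented second entry
# Control code -> (table to switch to, number of matches before falling back)
COUNTED_SWITCHES = {0x82: (KANJI1, 1)}


def decode_counted(data: bytes) -> str:
    """Decode bytes, honouring switches that last for a fixed match count."""
    out = []
    table = MAIN
    remaining = 0  # matches left in the temporary table; 0 means "in MAIN"
    for b in data:
        if remaining == 0 and b in COUNTED_SWITCHES:
            table, remaining = COUNTED_SWITCHES[b]
            continue
        out.append(table.get(b, f"<${b:02X}>"))
        if remaining:
            remaining -= 1
            if remaining == 0:
                table = MAIN  # count exhausted: fall back to the main table
    return "".join(out)
```

A control code that reads 5 characters would just use (KANJI1, 5) in the same dictionary.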
This multiple table support is also very useful for variables.
There are all kinds of control codes that tell the game to wait X frames, display character portrait X out of a hundred or so, play sound effect X, and so on. With the older Thingy table format, you'd have to write 256 different entries just to catch all possible values for EACH control code.
FA00=[Sound_00]
FA01=[Sound_01]
FA02=[Sound_02]
# ...
FAFF=[Sound_255]
Whereas you could use the syntax this way:
00=A
01=B
02=C
!FA=<[Sound]>,1
# Output for FA00 will be: [Sound]($00), NOT [Sound]A
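In decoder terms, this means a control code can declare how many parameter bytes follow it, and those bytes get dumped as raw hex instead of being looked up in the table. A minimal sketch, with PARAM_CODES and decode_with_params as made-up names:

```python
CHAR_TABLE = {0x00: "A", 0x01: "B", 0x02: "C"}
# Control code -> (label, number of raw parameter bytes that follow it)
PARAM_CODES = {0xFA: ("[Sound]", 1)}


def decode_with_params(data: bytes) -> str:
    """Decode bytes, printing control code parameters as raw hex values."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b in PARAM_CODES:
            label, count = PARAM_CODES[b]
            params = data[i:i + count]
            i += count
            # Parameters are NOT looked up in CHAR_TABLE.
            out.append(label + "".join(f"(${p:02X})" for p in params))
        else:
            out.append(CHAR_TABLE.get(b, f"<${b:02X}>"))
    return "".join(out)
```

One dictionary entry replaces the 256 table lines, and FA00 comes out as [Sound]($00) rather than [Sound]A.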
Search feature for user-defined character encodings
This is actually implemented in WindHex32 as "Relative Search". It's a way to quickly analyze a binary file and try and detect text data that might be using a custom encoding, as long as:
- Its characters in this unknown character set are ordered with no gaps
- Search query only includes the characters [A-Za-z0-9] or wildcard characters that don't assume the character's nature (whitespace, letter, number, capital, etc)
- Search query is 3 characters or more
The standalone utility monkeymoore adds the ability to supply the character set as user input (a list of the characters as they appear in the font, and likely in the encoding). This allows searching for text in languages other than English, such as other European languages and Japanese, as long as the characters in the search query are included in order in the set and no character appears twice.
The result of a relative search is a list of possible matches and their custom encodings. The user can then confirm which one is correct and save it as a user-defined character encoding (a table or character set) to view the rest of the file with. It will be incomplete, especially for punctuation and whitespace, until the user manually adds those to the definition.
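The core idea, as I understand it, is to compare the differences between neighbouring characters instead of the byte values themselves. Below is a naive sketch of that idea, not WindHex32's or monkeymoore's actual algorithm: no wildcards, single-byte characters only, and without a charset it assumes upper and lower case sit as far apart as they do in ASCII.

```python
def relative_search(data: bytes, query: str, charset: str = None) -> list:
    """Find places where the byte-to-byte differences match the query's.

    Returns (file offset, encoding offset) pairs, where a character's byte
    value equals its rank plus the encoding offset. The rank is the position
    in `charset` if one is given, else the character's code point.
    """
    if len(query) < 3:
        raise ValueError("query must be 3 characters or more")
    if charset is None:
        ranks = [ord(c) for c in query]
    else:
        ranks = [charset.index(c) for c in query]
    deltas = [b - a for a, b in zip(ranks, ranks[1:])]

    hits = []
    for start in range(len(data) - len(query) + 1):
        window = data[start:start + len(query)]
        if all(window[k + 1] - window[k] == d for k, d in enumerate(deltas)):
            hits.append((start, window[0] - ranks[0]))
    return hits
```

Each hit gives the offset to apply to the whole character set, which is enough to build a first draft of the table for the letters and digits.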
Closing words
As a feature, it's in high demand, and I'm surprised how many attempts were made that went nowhere. Seeing how much reinventing the wheel there was compelled me to write this wall of text.
Even bare minimum Thingy TBL support with no DTE/MTE is very rare, even among hex editors that pride themselves on being as comprehensive as possible when it comes to encodings, so the barrier to instantly becoming the best hex editor for game modding is low (it's already up there with the pattern viewer).
Going the extra mile for the latter details will be very interesting for anyone interested in Huffman.
I hope it sparks your interest and I'd be very grateful if it came to happen. Thanks a lot for reading all of this.
P.S. Imgui doesn't seem to support non-English languages all that well, so you might be interested in Unifont as a fallback font to cover about everything else. I'll look for other attractive pixel font alternatives.