HRMc optimizations: jump compression

tl;dr: this post explains a small optimization, that I call "jump compression", that is implemented in the HRM compiler.

the issue

When I started using the HRMc compiler, I noticed that there were several useless jumps and labels all over the place, sometimes leading to huge chains of jumps that could just be removed in the generated code.

Let’s start with this simple program that puts in the output tape every non-zero item from the input tape.

# example.hrm
start:
# read from inbox
emp = inbox
# put in the output queue if emp is non-zero
if nz then
    outbox
endif
# start again
jmp start

The program is then compiled (without optimizations!) to this “assembly” snippet. As you can see, the compiler here generates some ad-hoc labels to implement the if operation with jumps.

The little ASCII diagram of the jumps lets us understand two things about the flow of the program

  1. the jumps both reach the same instruction (the last jump), because it’s tagged with two labels (_hrm_1, _hrm_endif_1)
  2. the last jump goes back to the first instruction, tagged start
# hrmc example.hrm --no-unreachable --no-jump-compression
start:        inbox      <----(2)------|
              jez _hrm_1 ----------|   |
              outbox               |   |
              jmp _hrm_endif_1 --| |   |
_hrm_1,                          | |   |
_hrm_endif_1: jmp start  <--(1)--|-|   |
                     |-----------------|

The problem with this code here is that the game CPU would spend nearly a third of its time following jumps. Running it with hrm-interpreter, with random inputs we’d notice that jumps are more than half of all instructions executed from the CPU. Isn’t it a waste of time? What if we could avoid the “jump chain” and go directly to the right instruction?

the algorithm

Let’s define a “jump chain”: if a jump A, either conditional or unconditional, refers to a label that tags another jump instruction B, then we have a jump chain. An example of a jump chain is (1) in the previous example: it’s safe to replace jmp _hrm_endif_1 with jmp start, as they’re clearly equivalent in that context.

The idea is then to shrink/compress the jump chain: every time we find a jump, either conditional or unconditional, we find the first non-jump operation directly reachable from that node in the AST, then take the label associated with that instruction and replace the original jump’s label with it.

Let’s then look at a commented version of the algorithm, as implemented in conversion.py of hrm-compiler

# compress_jumps: Instruction[] -> Instruction[]
def compress_jumps(ast):
    # labels_in_ast : Instruction[] -> Map<label: string, position: number>
    # labels_in_ast maps every label to the position of the instruction tagged by the label
    labels_positions = labels_in_ast(ast)

    compressed_ast = []

    for index, ast_item in enumerate(ast):
        if type(ast_item) in [p.JumpOp, p.JumpCondOp]:
            jump_going_nowhere = False
            try:
                # get the first position executable by `_label`
                _label = ast_item.label_name
                next_pos = labels_positions[_label]
            except KeyError:
                # `_label` does not tag any instruction.
                # save the instruction: removing jumps, either conditional or
                # unconditional, modified the semantics of the source program
                compressed_ast.append(ast_item)
                # go to the next instruction
                continue

            # let's visit the jump chain:
            # a jump chain stops when
            # (1) the current instruction is not an unconditional jump,
            # (2) the instruction was already visited (countermeasure to loops)
            # (3) the jump instruction refers to a label that does not tag any instruction.
            # the algorithm updates `_label` until the jump chain stops.
            visited = [False for i in ast]
            while type(ast[next_pos]) == p.JumpOp and \
                not visited[next_pos] and \
                not jump_going_nowhere:
                visited[next_pos] = True
                _label = ast[next_pos].label_name
                try:
                    next_pos = labels_positions[_label]
                except KeyError:
                    jump_going_nowhere = True

            # no matter why the previous loop stopped,
            # _label is guaranteed to contain the label that is located
            # nearest to the first non-`jmp` instruction.
            # let's then create add the new jump instruction to compressed_ast
            if type(ast_item) == p.JumpOp:
                compressed_ast.append(p.JumpOp(_label))
            else:
                compressed_ast.append(p.JumpCondOp(_label, ast_item.condition))
        else:
            compressed_ast.append(ast_item)

    return compressed_ast

As hinted by the previous ASCII diagram, the algorithm exploits the graph-like features of the instruction array (named AST here, but it's more similar to a compiler-specific Intermediate Representation, or IR): the algorithm does a depth-first visit of the graph defined by the IR: the visit stops when one of three conditions become true:

  1. the currently visited instruction is not a conditional jump
  2. the instruction was already visited
  3. it is not possible to visit the next instruction

When the visit stops, _label is the name of the label that is nearest to the non-jmp instruction that the CPU is going to execute at the end of the jump chain.

There are two possible cases:

  1. we reached the non-jmp instruction: _label is the name of the label tagging that instruction
  2. we couldn't reach the non-jmp instruction: the algorithm stopped at the jmp instruction, and we can get the name of the label from it

the result

If we enable this optimization in the compiler, it will generate the following code:

# hrmc example.hrm --no-unreachable # jump-compression is enabled
start:        inbox     <------|
              jez start  ---->-|
              outbox           |
              jmp start  ---->-|
_hrm_1,                        |
_hrm_endif_1: jmp start  ---->-|

Now we see that every jump instruction now refers to the right label: it means we are not going to waste time jumping around!

Conclusions

In this article, I tried to explain a simple optimization I've implemented in HRMc, a simple compiler for the Human Resource Machine game's assembly language. Thank you for spending your time reading these notes.

Did you enjoy this explanation? Does this optimization have a real/academic name? Feel free to tell me on Twitter ! If you liked this article, please consider offering me a Ko-fi !

HRM compiler (or: Winter wants to build a compiler)

HRM Compiler on github

I really like programming languages as tools to express thoughts and intention. It also means I'd love to write a compiler, but didn't want to create it just for the sake of it.

The need for a compiler arose when a friend of mine suggested to try Human Resource Machine, a simple game when you have to program a... human employee to get things from an inbox tape, do something useful with them and put the result on an outbox tape. Human Resource Machine (HRM) is fun, but I don't like the fact you have to use a visual tool to write some sort of simplified assembly language. I don't really like assembly, and I feel lost when the execution flow starts to jump here and there. I needed something to let me write the algorithm I have in mind and then convert it to code. "So" - I thought - "why don't I create a stupid simple language and compile to this stupid version of assembly?"... and I started creating this little compiler. The result -still a work in progress!- is a simple "compiler" that I used to solve some of the problems/exercises in the game (checkout the examples folder!).

How does it work?

The source language is a superset of the target language, so the translation is easy. The translation from the source to the target language is done in multiple passes, and each pass removes a part of the superset language or rewrites part of the AST to make the next pass' work simpler. At the end of the process, the AST that the "code generation" routines work on is only made of target language's instructions and statements, making the "code generation" phase very simple and straightforward.`

Let me explain what I mean by looking at this simple example: put in the OUTBOX only the items in the INBOX that are not ZERO, and keep a counter of non-zero items.

# let's define an alias for a memory cell, to make it easier
# to understand what the code does. The cell is already
# set to zero by the game.
alias 1 counter
pickone:
# retrieve an item from the INBOX
emp = inbox
# if `emp` is not ZERO, then put the value stored
# inside `emp` into the OUTBOX
if nz then
    incr counter
    outbox
endif
jmp pickone

The compiler would parse it, generate the IR and perform two conversions. The first one, in our case, changes all the if nz (if nonzero value in emp) to if ez (if emp contains a zero value), to normalize the IR for the second pass.

alias 1 counter
pickone:
emp = inbox
# transformation here!
if ez then
    # nothing to do here!
else
    incr counter
    outbox
endif
jmp pickone

The second conversion, well, converts the if to target-language structures: conditional and unconditional jumps. The previous pass makes the work of this one easier, making the code simpler, as it has to handle just this case.

alias 1 counter
pickone:
emp = inbox
# if -> jump here! simplify code generation!
jez _hrm_if_1
incr counter
outbox
jmp _hrm_endif_1
_hrm_if_1:
_hrm_endif_1:
# the `if` is fully transformed, and ends here.
jmp pickone

After that, the code generation phase converts our IR to the target language. Again, the code generation' implementation is simpler because we translated higher level IR into equivalent IR that is nearer to the target language and thus easier to translate.

As it is a tool to help me think about the problem I'm trying to solve, this language is far from being a real, complete language with all the usual things you may be used to find in other languages. If you look at the code in the repository, you may discover I didn't implement a while construct: it's not a problem, you can build it with if and jmp, or conditional jumps. Instead, I preferred to create some helpers to let me think about a problem without worrying about "oh I don't have jnz, just jez, so I have to write this thing differently", I can just throw a if nz and let the compiler handle that.

Things to work on

The compiler can generate better code, to fix some problems due to the new instructions and operations added to the base language. After this post, I wrote a couple of optimizations, such as jump compression (to remove useless unconditional jumps) and unreachable code removal (to remove instructions that will be never executed and make the generated code smaller), but I'm sure there are other optimizations I can implement.

Another thing to consider is that the "emulated processor" the gamer is programming in Human Resource Machine is very limited: it only has one register, and its instruction set is very small. Those limitations make it difficult to create a richer language, but I think it'll be fun to overcome these challenges.

Do you like this project? Whisper (or shout!) your opinions via Twitter, or consider offering me a coffee on Ko-fi !

Experimenting with Rust and CloudFoundry buildpacks

In this post I'll write some notes about CloudFoundry buildpacks, which files are needed to make them work, and in general how I used an Heroku buildpack to deploy a simple Rust application to GE Predix (that uses CF).

Note: for those who did not heard of CloudFoundry, it is an open source platform that you can deploy on your own servers, AWS or other IaaS services to run your services. Using it lets you migrate applications from Predix to Bluemix or other PaaSes easily.

So, let's start with the problem I wanted to solve: I needed a simple, hackable backend to create some random data for a single-page Javascript application I'm building. Wouldn't it be the perfect time to test how Rust (and its web stack) performs in webdev?

Unfortunately, Rust is not supported "out of the box" in CloudFoundry. CF has some so-called system buildpacks for Java, Golang, Javascript single-page applications and other languages, but there is not one of them for Rust: it means we need to look at some custom buildpacks.

Before we go further... what is a buildpack?

A buildpack is a collection of scripts that prepares a virtual machine (dyno in Heroku slang) to build and run your application in the cloud. In this article we're going to see how to build one of these.

Ok, let's see...

The documentation about custom buildpack explains how a buildpack works, and which files are necessary to successfully build and run a project in the cloud (foundry).

Custom buildpacks need three (executable) files:

  • bin/detect

    This script determines whethever or not to apply the buildpack to an application. Basically it checks if you are using the wrong buildpack for your project, and stops you before you try to use the Java buildpack for a Rust application. Buildpacks can also be "framework-specific", so it makes sense to check if you're trying to run a Gradle buildpack to build an application meant to be built by Maven.

    It is run with only one parameter: the path to the build directory, where the source code and the files loaded during cf push are loaded. In a Rust application, it will contain your source code.

    CloudFoundry expects it to return 0 if everything is ok and the buildpack can continue, 1 otherwise.

  • bin/compile

    This script is in charge of configuring the environment and generating a runnable binary from sources. A Rust buildpack would download rustup, install it and use cargo build to build your application.

    This script is run with two parameters: the build directory and the cache directory. The last one will contain the assets created during the build process. A Rust buildpack we will speak about later used to store the compiled libraries there, in order to "cache" dependencies and make deploys faster.

  • bin/release

    This script must generate a .yml file that describes how the application should be executed. An example of such a .yml follows:

    default_process_types:
        web: ./target/release/YOUR_APPLICATION_NAME # in Rust, this suffices.
    

    This script only takes one parameter, the build directory. Please note: some developers use this script as a simple template, in order to automatically generate the correct .yml file during the execution of bin/compile.

An example of a buildpack that has these three scripts is the Rust buildpack built by Aaron Santavicca. If we look at it, we can see that:

  • bin/detect checks if our project has a Cargo.toml file. If it doesn't, it cannot be built with cargo.
  • bin/compile installs rustup, builds your application with cargo and generates the real bin/release script.
  • bin/release is a template to be modified when running bin/compile.

...and that's it. CloudFoundry runs these scripts, reads the .yml file and runs everything as requested. Yay!

Hey, I'm used to Heroku, do I need to change my trusted buildpack?

No, you don't need it.

CloudFoundry buildpacks are compatible with Heroku ones: the good guy HoverBear provides a working buildpack for Rust that relies on multirust and caches build artifacts (mostly dependencies), so you don't have to lose five minutes to build the world every time you launch cf push (at least, it used to do that before RustFest, to brutally fix an unexpected problem with removed dependencies and broken slug size limits).

However, making them run is a bit trickier. When I was evaluating Rust buildpacks, I looked at the HoverBear one, and... bin/release script wasn't there. What kind of magic was going here? How did it even work?

After some testing (and reading the docs), I found out that there are... two ways to make buildpacks missing bin/release work.

The first one is adding the command attribute to the manifest.yml file you wrote for your application. A command: ./target/release/YOUR_APPLICATION_NAME should suffice.

The second one is to add a Procfile to the project. A Procfile is a file that tells CloudFoundry (or Heroku) which command has to be executed to start your application. A simple Procfile for a Rust app may just contain the line

web: ./target/release/YOUR_APPLICATION_NAME

Conclusion

My team and I are developing web apps on Predix (a CloudFoundry instance managed by GE), and I use Rust to build some quick-and-dirty test backends to generate random responses for our frontend. To run Rust on CF, I needed to understand how to build a new application on it, and I spent some time to fully understand how things work. I hope these notes can be useful for you too!

__attribute__((packed)) on Windows is ignored (with MinGW)!

tl;dr by default, MinGW on win32 behaves as if it ignores __attribute__((packed)), you need to add -mno-ms-bitfields to make it work as expected.

Some days ago, I was tasked with a port of a simple networked application to Windows 7. There is a small issue, though: to serialize the messages, the code filled a packed structure, that is then memcpy-ed into a buffer and then thrown into the TCP stack. Don't blame me, I trusted a bad advice, I was young and stupid!

We may get away with it (it works, somehow!). The big problem here is that everything falls apart if the structure layout is not like we expect it to be! In the best case, the application segfaults because it tried to access some memory we should not access, or some internal checks fire because the values are so wrong they simply are out of range.

Let's say I'm passing this packed structure around...

typedef struct _mypackedstruct {
  uint32_t magic_number;
  uint8_t  version;
  uint32_t interesting_property;
  uint64_t creation_time_sec;
  uint64_t creation_time_usec;
  uint32_t first_info_no;
  uint32_t second_info_no;
} __attribute__((packed)) MyPackedStructure;

... and the code receives and parses the serialized structure, as in the following pseudo-C++ example

// a very simplified sample of the code that highlighted the problem
MyPackedStructure destination;
char* buffer = new char[sizeof(destination)];

// read some data from a previously created socket
// and push them into a buffer
socket.recv_message(buffer);

// push the unparsed data into destination
// to perform an automatic "deserialization".
// Don't try this at home. Please, use serialization libraries.
memcpy(&destination, buffer, sizeof(destination));

// perform validity check on the structure's content.
assert(is_packet_ok(destination));

On various Linux distros, everything works fine. Once I built the project on Windows, the is_packet_ok assert fired. Why?

After looking at the code, trying to understand the problem, I had the idea of checking the size of the structure. On Windows (using MinGW), sizeof(MyPackedStructure) returns 40, while on Linux (using three different compilers: gcc 4.6.3, gcc 4.9.2, clang/llvm 3.8) the same expression returns 33. Seven bits sure do a big difference! I want you to focus your attention on that 40. It's easy to recognize that 40 mod 8 = 0: the structure was aligned to 8 bytes. Why was the packing attribute ignored?

A simple search on the Wired showed me that this strange behavior is known since April 2012, but the ticket is still marked as new. In the comments, you'll find a workaround: set -mno-ms-bitfields.

What is this -mno-ms-bitfiels option? The 6.36.5 i386 Variable Attributes section of the manual says

"If packed is used on a structure, or if bit-fields are used, it may be that the Microsoft ABI lays out the structure differently than the way GCC normally does. Particularly when moving packed data between functions compiled with GCC and the native Microsoft compiler (either via function call or as data in a file), it may be necessary to access either format."

[snip]

"2. Every data object has an alignment requirement. The alignment requirement for all data except structures, unions, and arrays is either the size of the object or the current packing size (specified with either the aligned attribute or the pack pragma), whichever is less. For structures, unions, and arrays, the alignment requirement is the largest alignment requirement of its members. Every object is allocated an offset so that: offset % alignment_requirement == 0"

What does it mean? It means that, by default, MinGW uses the Microsoft algorithm to calculate packed structures requirements, and it doesn't work as we'd like.

The expected behavior (as described in the same link!) is:

packed
The packed attribute specifies that a variable or structure field should have the smallest possible alignment—one byte for a variable, and one bit for a field, unless you specify a larger value with the aligned attribute.

To explain it, I will write a simple program that creates a packed structure and initializes it with known values. The program will be later inspected with GDB in order to check the actual memory layout of the structure.

// file: main.cpp
// MyPackedStructure is the same structure we defined at the start of the blog post
int main(){
  // print the size of the structure
  std::cout << "sizeof(MyPackedStructure) = " << sizeof(MyPackedStructure) << std::endl;

  // prepare a structure variable...
  MyPackedStructure obj;
  obj.magic_number         = 0xa5a61ff5;
  obj.version              = 2;
  obj.interesting_property = 0x33;
  obj.creation_time_sec    = 1462224873;
  obj.creation_time_usec   = 4141457;
  obj.first_info_no        = 5;
  obj.second_info_no       = 2;

  // ... and put a breakpoint here to inspect its memory!
  return 0;
}

I built and ran the code on ArchLinux (x86, GCC 5.3.0, GDB 7.11) and...

[~/projects/experimental/sizeof_packed_struct]> g++ -o test main.cpp -g # let's build it with the latest GCC on ArchLinux, x86
[winter@timeofeve] [/dev/pts/3] [master]
[~/projects/experimental/sizeof_packed_struct]> gdb ./test
Reading symbols from ./test...done.
(gdb) break main.cpp:29
Breakpoint 1 at 0x804875c: file main.cpp, line 29.
(gdb) r
Starting program: /home/winter/projects/experimental/sizeof_packed_struct/test
sizeof(MyPackedStructure) = 33

Breakpoint 1, main () at main.cpp:29
29        return 0;
(gdb) x /33x &obj  # let's dump 33 bytes of memory from the address of `obj`
0xbffff5af:     0xf5     0x1f    0xa6    0xa5    0x02   0x33    0x00    0x00
0xbffff5b7:     0x00     0xe9    0xc7    0x27    0x57   0x00    0x00    0x00
0xbffff5bf:     0x00     0x91    0x31    0x3f    0x00   0x00    0x00    0x00
0xbffff5c7:     0x00     0x05    0x00    0x00    0x00   0x02    0x00    0x00
0xbffff5cf:     0x00

... the memory is laid out as we'd expect it: we can see the magic number in the first 4 bytes, version occupies only the fifth byte. interesting_property is unaligned: the field is 4 bytes long, but the first 3 bytes are in the first row (0xbffff5af), and the last one lies in the second row (0xbffff5b7). Please note that MyPackedStructure, on ArchLinux is 33 bytes long.

If we run it on Windows 7 (x86_64, mingw 4.9.3, gdb 7.6.1, PowerShell x86), we get...

PS C:\Users\WinterHarrison\Desktop> g++ main.cpp -o test -g
PS C:\Users\WinterHarrison\Desktop> gdb .\test
Reading symbols from C:\Users\WinterHarrison\Desktop\test.exe...done.
(gdb) break main.cpp:29
Breakpoint 1 at 0x401498: file main.cpp, line 29.
(gdb) r
Starting program: C:\Users\WinterHarrison\Desktop/.\test.exe
[New Thread 164.0x5c8]
sizeof(MyPackedStructure) = 40

Breakpoint 1, _fu0___ZSt4cout () at main.cpp:29
29        return 0;
(gdb) x /40x &obj
0x28ff08:        0xf5    0x1f    0xa6    0xa5    0x02   (0xff    0x28    0x00)
0x28ff10:        0x33    0x00    0x00    0x00   (0x53    0x40    0x0e    0x64)
0x28ff18:        0xe9    0xc7    0x27    0x57    0x00    0x00    0x00    0x00
0x28ff20:        0x91    0x31    0x3f    0x00    0x00    0x00    0x00    0x00
0x28ff28:        0x05    0x00    0x00    0x00    0x02    0x00    0x00    0x00

... that MyPackedStructure is now 40 bytes long. The fact that my Arch is 32bit doesn't mean anything, as I'm using fixed-size integers.

Please also note that there are no unaligned fields: interesting_property now starts at the first byte of the second row (0x28ff10).

This fact is the real cause of the problem I described at the start of this post: memcpy writes important (and correctly packed!) data inside the padding bytes, that are then not considered nor accessible. When you try to read from the structure, you will get some junk with no meaning at all. Just a simple example: after the memcpy in the pseudocode at the top of the article, we'd expect obj.interesting_property to be 0x33, but it's 0 on Windows. 0x33 would be where we now can see 0xff (first row, sixth byte), but that particular address is not accessible by the client code. Well, I may access it doing some pointer arithmetics, but isn't the point of using a packed structure to avoid pointer arithmetics at all?

Let's now apply the suggested workaround: tell GCC "whatever, I don't care about this MS bitfields algorithm, please do as you do on Linux" using the -mno-ms-bitfields compiler flag.

PS C:\Users\WinterHarrison\Desktop> g++ main.cpp -o test -g -mno-ms-bitfields
PS C:\Users\WinterHarrison\Desktop> gdb .\test
Reading symbols from C:\Users\WinterHarrison\Desktop\test.exe...done.
(gdb) break main.cpp:29
Breakpoint 1 at 0x401498: file main.cpp, line 29.
(gdb) r
Starting program: C:\Users\WinterHarrison\Desktop/.\test.exe
[New Thread 832.0xaf8]
sizeof(MyPackedStructure) = 33

Breakpoint 1, _fu0___ZSt4cout () at main.cpp:29
29        return 0;
(gdb) x /33x &obj
0x28ff0f:       0xf5    0x1f    0xa6    0xa5    0x02    0x33    0x00    0x00
0x28ff17:       0x00    0xe9    0xc7    0x27    0x57    0x00    0x00    0x00
0x28ff1f:       0x00    0x91    0x31    0x3f    0x00    0x00    0x00    0x00
0x28ff27:       0x00    0x05    0x00    0x00    0x00    0x02    0x00    0x00
0x28ff2f:       0x00

Both the size and the memory dump look like the Linux size and dump. This is exactly what we wanted! Everything works now! Mission accomplished, let's call it a day!


So, what did I learn from this experience?

  • Use -mno-ms-bitfields if you want to use packed structures on Windows (using MinGW)
  • Serialization libraries (cereal, messagepack, protocolbuffers, ...) exist for a reason: to avoid writing similar blog posts and to ensure that messages are serialized and deserialized in the same way on every platform
  • Don't think that your code will run the same everywhere: always have a test suite (and don't make it a pain to build and run on a new target platform!)
  • Read your compiler's documentation, especially if you're using language extensions

That's all for today. Let me know if you found those notes useful - if you like them, please consider offering me a Ko-fi to let me keep writing!