As part two (see previous attempt) of my ongoing series in ‘computational necromancy,’ I’ve spent the last year and a half or so constructing my own 1/10-scale, binary-compatible, cycle-accurate Cray-1. This project falls purely into the “because I can!” category – I was poking around the internet one day looking for a Cray emulator and came up dry, so I decided to do something about it. Luckily, the Cray-1 hardware reference manual turned out to be detailed enough that implementing most of this was pretty straightforward. The Cray-1 is one of those iconic machines that just makes you say “Now that’s a supercomputer!” Sure, your iPhone is 10X faster, and it’s completely useless to own one, but admit it... you really want one, don’t you?
The Cray-1A Architecture
Now, let’s get down to specs – what is this bad boy running? The original machine ran at a blistering 80 MHz, and could use anywhere from 256 to 4096 kilowords of memory (at 8 bytes per 64-bit word, that tops out at 32 megabytes!). It has 12 independent, fully-pipelined execution units, and with the help of clever programming, can peak at 3 floating-point operations per cycle. Here’s a diagram of the overall architecture:
It’s a fairly RISC-y design, with 8 64-bit scalar (S) registers, 8 vector (V) registers of 64 64-bit words each, and 8 24-bit address (A) registers. Rather than a traditional cache, it uses a ‘software-managed’ cache with an additional 64 64-bit words (T registers) and 64 24-bit words (B registers). There are instructions to transfer data between memory and registers, and then register-to-register ‘compute’ instructions.
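To get a feel for how much state that register set adds up to, here is a small Python sketch that tallies it. The `RegisterFile` class and its field names are made up for this illustration; only the counts and widths come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterFile:
    """One group of registers (hypothetical helper, for illustration only)."""
    name: str
    count: int          # number of registers in the group
    bits: int           # width of one element in bits
    elements: int = 1   # elements per register (64 for vector registers)

    @property
    def total_bits(self) -> int:
        return self.count * self.bits * self.elements


# Counts and widths as listed in the paragraph above.
REGISTER_FILES = [
    RegisterFile("S", count=8, bits=64),
    RegisterFile("V", count=8, bits=64, elements=64),
    RegisterFile("A", count=8, bits=24),
    RegisterFile("T", count=64, bits=64),
    RegisterFile("B", count=64, bits=24),
]

total_register_bits = sum(rf.total_bits for rf in REGISTER_FILES)

if __name__ == "__main__":
    for rf in REGISTER_FILES:
        print(f"{rf.name}: {rf.total_bits} bits")
    print(f"total: {total_register_bits} bits ({total_register_bits // 8} bytes)")
```

The vector registers dominate: they account for 32768 of the 39104 bits.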
One of the coolest aspects of this machine is that everything is fully pipelined. This machine was designed to be fast, so if you’re careful, you can actually get one (or more) instruction every cycle. This has some interesting implications – there’s no ‘divide’ instruction, for instance, because it can take a variable amount of time to finish. To perform a divide, you need to first compute the ‘reciprocal approximation’ (something we *can* do in exactly 13 cycles, it turns out) of the denominator value, and then perform a separate multiply of that result with the numerator.
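Here is a rough Python sketch of the divide-by-reciprocal idea. The `reciprocal_approx` function is only a stand-in for the hardware unit: its 30-bit precision is an assumption, and the optional Newton-Raphson refinement step is a standard technique added here for illustration, not something described above.

```python
import math


def reciprocal_approx(x: float, bits: int = 30) -> float:
    """Stand-in for a reciprocal-approximation unit.

    Returns 1/x with the mantissa truncated to roughly `bits` bits.
    The precision is an assumed value chosen for illustration.
    """
    mantissa, exponent = math.frexp(1.0 / x)
    scale = 1 << bits
    return math.ldexp(int(mantissa * scale) / scale, exponent)


def divide(numerator: float, denominator: float, refine: bool = True) -> float:
    """Divide using only a reciprocal approximation and multiplies."""
    r = reciprocal_approx(denominator)
    if refine:
        # One Newton-Raphson step: r' = r * (2 - d * r).
        # This roughly doubles the number of correct bits in r.
        r = r * (2.0 - denominator * r)
    # The "separate multiply" of the reciprocal with the numerator.
    return numerator * r


if __name__ == "__main__":
    print(divide(355.0, 113.0, refine=False))
    print(divide(355.0, 113.0))
    print(355.0 / 113.0)
```

Every step is a fixed-latency operation (an approximation plus a few multiplies and a subtract), which is what lets the whole thing stay pipelined.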
The vector instructions are particularly cool. A vector Add operation might take only 5 cycles to start producing results (remember, each vector can hold 64 values, so it takes 5 + 64 cycles to finish adding). Why wait for it to finish though? We can take the result output from the adder, and “chain” it straight into another vector unit (say a multiplier). And *that* only takes another 10 cycles or so, so we can chain that result into yet another unit (say, reciprocal approximation). Now, rather than waiting for the first operation to finish, we’re computing up to 3 floating-point calculations per cycle. Clever programmers could sustain about 2 floating-point operations per cycle, which at 80 MHz works out to 160 million floating-point operations per second.
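A back-of-the-envelope Python model makes the benefit of chaining concrete. This is a deliberately simplified sketch: it assumes each unit has a fixed startup latency and then delivers one result per cycle, and it ignores issue timing, memory traffic and register conflicts. The 5- and 10-cycle startups are the figures quoted above; reusing 13 cycles for the reciprocal unit is an assumption.

```python
VECTOR_LENGTH = 64        # elements per vector register
CLOCK_HZ = 80_000_000     # original machine's clock rate


def unchained_cycles(startups, n=VECTOR_LENGTH):
    """Each operation waits for the previous one to finish completely."""
    return sum(startup + n for startup in startups)


def chained_cycles(startups, n=VECTOR_LENGTH):
    """Results stream from unit to unit, so only the startups add up."""
    return sum(startups) + n


def flops_per_second(flops_per_cycle, clock_hz=CLOCK_HZ):
    return flops_per_cycle * clock_hz


if __name__ == "__main__":
    startups = [5, 10, 13]  # add, multiply, reciprocal (13 is assumed)
    print("unchained:", unchained_cycles(startups), "cycles")
    print("chained:  ", chained_cycles(startups), "cycles")
    print("peak:     ", flops_per_second(3) / 1e6, "MFLOPS")
    print("sustained:", flops_per_second(2) / 1e6, "MFLOPS")
```

Under these assumptions the three-operation sequence drops from 220 cycles to 92, and once the pipeline is full all three units deliver a result every cycle.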
The actual design was implemented on a Xilinx Spartan-3E 1600 development board. This is basically the biggest FPGA you can buy that doesn’t cost thousands of dollars for a devkit. The Cray occupies about 75% of the logic resources, and all of the block RAM.
This gives us a spiffy Cray-1A running at about 33 MHz, with about 4 kilowords of RAM. The only features currently missing are:
-Exchange Packages (this is how the Cray does ‘context-switching’ – it was intended as a batch-processing machine)
-I/O Channels (I just memory-mapped the UART I added to it).
If I ever find some software for this thing (or just get bored), I’ll probably go ahead and add the missing features. For now, though, everything else works sufficiently well to execute small test programs and such.
When I started building this, I thought “Oh, I’ll just swing by the ol’ Internet and find some groovy 70’s-era software to run on it.” It turns out I was wrong. One of the sad things about pre-internet machines (especially ones that were primarily purchased by 3-letter Government agencies) is that practically no software exists for them.
***** If anyone has any Cray-1 software, please contact me!! If you work at one of the National Labs, please take a look! *****
After searching the internet exhaustively, I contacted the Computer History Museum, and they didn’t have any either. They also informed me that apparently SGI destroyed Cray’s old software archives before spinning them off again in the late 90’s. I filed a couple of FOIA requests with scary government agencies that also came up dry. I wound up e-mailing back and forth with a bunch of former Cray employees and also came up *mostly* dry. My current best hope is a guy I was able to track down who happens to own an 80 MB ‘disk pack’ from a Cray-1 Maintenance Control Unit (the Cray-1 was so complicated, it required a dedicated mini-computer just to boot it!), although it remains to be seen if I’ll actually get a chance to try to recover it.
Without a real software stack (compilers, operating systems, etc.), the machine isn’t terribly useful (not that it would be all that useful if I did have software for it). All of the opcodes and registers for the Cray-1 are described in base-8 (octal), so I did at least write a little script to translate octal machine code into the hexadecimal format that Xilinx’s tools require. All of my programming so far has been in straight octal machine code, assembled in my head. I have started work on re-writing the CAL Assembler, but that may take a while, as it employs some tricky parsing that I’m having to teach myself.
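For the curious, an octal-to-hex translation along these lines might look like the Python sketch below. It is not the actual script: the function name, the packing order (first parcel in the most significant bits) and the one-word-per-line output are all assumptions made for illustration. It relies on Cray-1 instructions being built from 16-bit parcels, four to a 64-bit word.

```python
PARCEL_BITS = 16
PARCELS_PER_WORD = 4  # four 16-bit parcels fill one 64-bit word


def octal_parcels_to_hex_words(parcels):
    """Pack octal parcel strings into 64-bit words, returned as hex strings.

    Assumption: the first parcel of each word lands in the most
    significant 16 bits. A partial last word is padded with zero parcels.
    """
    values = [int(p, 8) for p in parcels]
    for text, value in zip(parcels, values):
        if value >= (1 << PARCEL_BITS):
            raise ValueError(f"parcel {text!r} does not fit in 16 bits")

    # Pad so the parcel count is a multiple of four.
    while len(values) % PARCELS_PER_WORD:
        values.append(0)

    words = []
    for start in range(0, len(values), PARCELS_PER_WORD):
        word = 0
        for value in values[start:start + PARCELS_PER_WORD]:
            word = (word << PARCEL_BITS) | value
        words.append(f"{word:016X}")
    return words


if __name__ == "__main__":
    # Arbitrary bit patterns, not a meaningful program.
    for line in octal_parcels_to_hex_words(["177777", "000001", "000012"]):
        print(line)
```

The real memory-initialization format depends on which Xilinx tool consumes the file, so the final formatting step would need to be adapted.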
Makin’ it look pretty
What’s the point of owning a Cray-1 if it doesn’t *look* like a Cray-1?? Unfortunately, the square-shaped FPGA board isn’t conducive to actually making it the traditional “C” shape, but I think it turned out pretty cool anyway. My friend Pat was nice enough to let me use his CNC milling machine to cut out the base pieces (and help with assembly). It’s a combination of MDF, balsa wood and pine. There was also a healthy dose of blood, sweat and tears (and gorilla glue) involved.
Some random photos from the build process:
The pieces before painting
Finally, Computer Engineer Barbie has an appropriate place to sit down!
This is awesome! How can I build my own?
This is very much a work-in-progress, but if you’d like to join in the fun, feel free! All you need is a copy of the RTL (almost all Verilog-2001) and a Spartan-3E 1600 or equivalent FPGA board. The code is likely riddled with bugs and questionable implementation choices at this point, so on the off-chance anyone actually downloads this, feel free to lend a hand and send me any bug fixes you might make!
I finally had some more time to work on this. The updated code includes faster implementations of the multiplier units (it runs at up to 50 MHz on my Spartan-3E!), as well as support for context-switching (“Exchange Packages”). There is still no support for I/O channels, but the 8-bit memory-mapped UART was replaced with a full 64-bit UART.
As well as improved hardware, this release also includes a lot of progress on the software front. It includes a more-or-less complete implementation of the CAL assembler, re-written in Python, as well as a utility for generating Xilinx-friendly memory initialization files. Writing in CAL is *way* easier than writing in octal machine code. Code is also included for a simple BASIC-like language I’ve started playing around with (basically useless, as it doesn’t output valid Cray-1 assembly yet, but possibly interesting to look at).
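To show why octal is the natural notation here, and roughly what the last step of an assembler has to do, here is a tiny Python sketch that packs instruction fields into one 16-bit parcel. It assumes the usual description of the Cray-1 one-parcel format (a 7-bit opcode field followed by three 3-bit fields i, j and k); the function name is made up, and the opcode value in the example is just a sample bit pattern, not a claim about what that instruction does. This is not code from the released assembler.

```python
def encode_parcel(opcode: int, i: int, j: int, k: int) -> str:
    """Pack a 7-bit opcode and three 3-bit fields into a 16-bit parcel.

    Returns the parcel as a 6-digit octal string. The field layout is an
    assumption based on the commonly described one-parcel format.
    """
    if not 0 <= opcode < (1 << 7):
        raise ValueError("opcode must fit in 7 bits")
    for name, field in (("i", i), ("j", j), ("k", k)):
        if not 0 <= field < 8:
            raise ValueError(f"field {name} must fit in 3 bits")

    parcel = (opcode << 9) | (i << 6) | (j << 3) | k
    return f"{parcel:06o}"


if __name__ == "__main__":
    # Sample values only: opcode 0o060 with i=1, j=2, k=3.
    print(encode_parcel(0o060, 1, 2, 3))
```

Because each 3-bit field lines up with exactly one octal digit, the register numbers can be read straight off the last three digits of the parcel, which is what makes hand-assembling in octal bearable at all.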
Finally, an archaic DOS-compatible Cray X-MP simulator surfaced! It’s a single-processor simulator (the X-MP models were essentially 2-4 processor Cray-1s), so it is effectively just a Cray-1 sim, but it does work pretty well if you just want something to play with.
Get it here! Cray1_r2.zip
The project is now hosted on Google Code, and is morphing into a single-CPU Cray X-MP with a working copy of COS 1.17: The Cray-1X Project