Intro to CPU design [All parts]


Active member
Intro to CPU design [All parts] #1
Alright, my signature is getting too long, so I will make a thread here that contains all of the links to the CPU design series.

Please don't comment here, unless you have a suggestion for going more in-depth about something, comment on the thread to which it pertains.

CPU Design Part 1
CPU Design Part 2
CPU Design Part 3
CPU Design Part 4
CPU Design Part 5
CPU Design Part 6

This will be updated as I write more articles.

All parts (in one post):
Okay, so why don't we make a series of tutorials on CPU design. I will make them complete with drawings, all of the maps, and part numbers. By the end of this series you will have a fully programmable CPU to play around with and hopefully a better understanding of how they work.
For the purpose of simplicity, I will be using a RISC-like architecture. I won't go in-depth on the pipeline used here, just know that there is a very crude implementation of one going on.

Binary and numerical systems
A binary (or base-2) numerical system is a system in which only two values are used:
HIGH - binary 1
LOW - binary 0
This is what we will use at the base level of our CPU. In CPU design we use electron states to determine logic values. We will only use 2 (of the many) electron states, since they are easy to produce. Currently no CPU on the market uses any more than two electron states, quantum computing is still in its early stages.

Counting in binary
It is actually quite simple. Since we will be using a 4-bit CPU, we will include all 4 bits in our counting. 4 different bits means that we will have 16 different values (v(n) = 2^n), so let's count to 16:
0000 = 0
0001 = 1 ;now we do something funky, we have to do a carry, resulting in this:
0010 = 2 ;1 + 1 = 10, remember that, it is important!
0011 = 3
0100 = 4; 11 + 1 = 100, because 1 + 1 = 10, then we have 1 + 1 again.
0101 = 5
0110 = 6
0111 = 7
1000 = 8
1001 = 9
1010 = 10
1011 = 11
1100 = 12
1101 = 13
1110 = 14
1111 = 15
now, why didn't we make it to 16? Simple. We have 16 possible values, that INCLUDES the 0. The maximum count for each set of bits is m(n) = v(n) - 1
therefore m(4) = v(4) - 1
m(4) = (2^4) - 1
m(4) = 16 - 1 = 15

Now, I assume you have counting in binary down, we need to work on addition and subtraction.
lets do 7+6 in binary. It helps if you have that table of binary to decimal memorized, we will go into conversion in a minute.
7 = 0111
6 = 0110
ALWAYS go from right to left. remember the rules:
0+0 = 0
0+1 = 1
1+0 = 1
1+1 = 10
xxx1 (0+1 = 1)

0100 (carry the 1)
xx01 (1 + 1 = 10)

1100 (carry the 1 again)
x101 (1 + 1 = 10, 0 + 1 = 1)

1101 (0 + 0 = 0, 1 + 0 = 1)

Final answer: 1101
Now, let's check that, 1101 in binary is 13. 6+7 in decimal (base 10) is 13. SUCCESS!

Subtraction follows the same rules. I really don't need to teach you that one do I?
The only difference in subtraction is:

0000 - 0001 = 1111
When we have to borrow, and we run out of bits to borrow from, we just assume the next bit to the left (bit 5 in this case) is a 1.

numerical systems: hexadecimal
Hexadecimal is basically just a notation system for our binary, it makes it easier to write it all out. Hexadecimal (or hex) is a base-16 system, so for every 4 bits of binary we will have exactly one digit. Just memorize this table for conversion:
binary hex decimal
0000 = 0 = 0
0001 = 1 = 1
0010 = 2 = 2
0011 = 3 = 3
0100 = 4 = 4
0101 = 5 = 5
0110 = 6 = 6
0111 = 7 = 7
1000 = 8 = 8
1001 = 9 = 9
1010 = a = 10
1011 = b = 11
1100 = c = 12
1101 = d = 13
1110 = e = 14
1111 = f = 15

So, our 0111 + 0110 example, in hex can be written 7 + 6.
For compadibility, we will no longer be using decimal in ANY part of this tutorial. Every number from here out will be either written in binary or hex.

Adding or subtracting hex numbers is best done by converting to binary and using the method I described above, and then using the table to convert it back to hex. If you know the real method, use it.

Alright, so we know now how to deal with binary numbers and hexadecimal numbers. We know that 1+1=10. This tutorial is where I get to confuse the crap out of you.

Boolean algebra works differently than binary number systems. Boolean algebra works bit by bit, and uses logic operations instead of our normal add, subtract, multiply, etc.

Let me simply name the logic operations we will cover in this tutorial. There are MANY more operations, but we will stick to the basics.
  • OR
  • AND
  • NOT
  • NOR
  • NAND
  • XOR
  • XNOR

So, let's begin now:

Or is a logic operation that takes two arguments (A and B) and returns logic HIGH (binary 1) if either A OR B are at logic HIGH.
For example, here is a table:
Seems pretty simple. Cool.

AND is a logic operation much like OR, but will only return HIGH if both A AND B are at logic high. Here is the table:
Oh look, another easy one.

The NOT operation is the easiest one so far (yes, there is one easier than the NOT gate). All it does, is take 1 argument (A) and return the opposite of it.

NOR is NOT OR, literally it is the same as the following operation:
X = A OR B
Y also = A NOR B
Here is the table
X = A OR B
It really isn't too difficult to understand, just take the output from an OR operation, and take the opposite.

NAND is the same concept of NOR, except it is just NOT AND.
Once again, just an AND, and then the opposite (NOT) of that

XOR, or exclusive-or, is one of the tricky ones. It is simply: A OR B, but NOT A AND B. Basically, it will be logic HIGH if A and B are different, and logic LOW if they are the same.
For those of you who are catching on quickly, it actually means
Here is the table:
It might be a little complex, but I know if you read over it a couple times you will understand.

XNOR, or NOT exclusive or, is just a not placed onto an exclusive or, causing it to read logic HIGH when the values are the same, and logic LOW when different:

There you have it, all of the gates we will be using in our CPU. Now, we have to cover notation:
Back to our list.

Now, some of these don't have notation symbols, so let's revise our list, and i will write down the symbols:
OR = +
AND = * (or two variables next to each other, like AB is the same as A*B)
NOT = ' (or a line drawn above the variable name)
XOR = a plus with a circle drawn around it. It can also be written AB'+A'B

Here is an image I found that shows the drawings for each of these gates. The only one missing is the XNOR, which is just the XOR with the little circle after it like the NOR gate has.

[Image: fig-arch-gates.png]

Alright, we are going to take a break from the math and work on our CPU system itself.

Some of these terms may not mean much to you, but part 4 will explain ALL of them. I have all of the drawings for part 4 already done, but I don't want to deal with any more math at 2 in the morning.

So, let's start with our low level aspects. Our CPU is a 4-bit, so we will need to be able to store these numbers somehow. We will use a databank made up of 4-bit registers. Since we can use only 16 different numbers, we will use 16 of those registers, totaling 8 bytes of total memory.
We will also use another one of those 8 byte databanks for our instruction storage, so we will have 8 bytes of user memory, and 8 bytes of program memory.

Next up, we need to decide how many general purpose registers we are going to use, and name them. I don't think we need any more than three, since we aren't doing anything too complex here. Let's use the following
A, B, and X for registers.

A and B are going to be our accumulator registers, and X will be our index register.

Now, we will need one more VERY important register: the instruction pointer (IP). The IP will keep track of what instruction we are executing. It will be updated every time a pipeline stage is completed.

So, we have A, B, X, and IP for registers. Yes, this is a VERY limited design, but it is intended to teach you how to make them, not make a perfect CPU you can use for gaming.

Alright, on to instruction set:
We will need a few instructions:
0 - LDA - load register A (with CONST)
1 - LDB - load register B (with CONST)
2 - MOVA - move register A (to REG)
3 - MOVX - move X (to REG)
4 - XORA - xor A (WITH REG)
5 - XORB - xor B (WITH REG)
6 - XORX - xor X (WITH REG)
7 - ADDA - A = A + (REG)
8 - STOA - store A to memory (AT ADDR)
9 - COM - 2's compliment (REG)
a - INV - inverse (REG)
b - LDMA - load register A (from ADDR)
There you have it, 11 different instructions. Each instruction takes EXACTLY one clock cycle. Each instruction takes up 1 byte (two nibbles).

Now, this can be quite confusing, so how about we make a simple program to do 1+1, and store that in memory at location 0x4:

but, we won't be able to program like that, we will need to do it in machine code, here is the same program (in hex)
Broken down:
01 ;LDA 1
21 ;LDB 1
71 ;ADDA B
84 ;STOA 4

There is something I didn't cover here: how do we encode register names?
0 - A
1 - B
2 - X
3 - IP

Now, a sample program to do the following: 6 - 3 and store it in memory 15

and in machine code:

06 ;LDA 6
13 ;LDB 3
91 ;COM B
71 ;ADDA B
8f ;STOA f



It takes up 5 bytes of storage, which is 63% of our available memory, so programs will need to be short (you can use 8 lines of code total).

Okay, so I still don't really want to do a crapload of math and hope you catch on, so we will spend this tutorial designing and learning the parts for our CPU.

Here is what we will need (this will be explained)
  • Memory block
  • CPU Registers
  • Instruction block
  • ALU
  • MMU
  • Instruction fetcher
  • Instruction decoder
  • Clock generator

Okay, so let's describe what all of those do and why we need them:

Memory block:
On a basic CPU, this isn't needed, but for the sake of learning something new and passing it on, I will include the ability to read and write from memory. Consider this RAM.

CPU registers:
These are the A, B, X, and IP storage units. They are variables (if you know programming or basic algebra). Since everything we do here will be using wires and some transistor gates, we will need to implement a way to store data. We will use a D latch in blocks. My drawings have all of these schematics already drawn out. They will be explained when they are brought up.

Instruction block:
This is known as a L1 cache on modern CPUs, normally the instructions are read from RAM into the cache and executed from there. Since we are limited by space and I wanted to maximize how much memory we can use, we will have 8 bytes of memory, AND 8 bytes of instruction storage. This totals 16 bytes of "memory", plus the four nibble-wide registers, meaning there is 18 bytes of storage on our CPU total.

The ALU (Arithmatic and Logic unit) is a sub-cpu that will handle all of our mathematical operations. We will also have a separate MMU that will handle all of our memory accesses.

The MMU (Memory Management Unit) will be the unit that handles our memory access. It will be separate from the ALU but it will work in tandem. This idea may be scrapped later if it becomes too complicated for this tutorial. Most CPUs have many of these, and it contributes to the modular design of the CPU.

Instruction fetcher:
The instruction fetcher will read from the instruction block and decide which byte to pull and send to the instruction decoder. This will also be responsible for updating the IP.

Instruction decoder:
The decoder will decipher the CPU level instruction and determine if an ALU operation, a MMU operation, or a register operation needs to be done, and convert the instruction into the instruction set of either the ALU or MMU.

Clock generator:
Okay, this is where it will get tricky, we will be using a human-input clock. This means we will wire up a button, each downpress will perform the following pipe stages:
and each release will perform:

Note, the pipe design may change as we continue this project. It might be required that the pipe look more like this:

This way, you will be able to watch the states of the CPU as they are executed stage by stage.

We will also be wiring up 144 LED's so that you can see the memory block, instruction block, A, B, X, and IP
This part is optional, but it is nice to see what is going on.

If you get really good at multiplexers, you can get away with using 7 switches, and 4 LEDs.
While we are at it, what is a multiplexer?

A multiplexer is a function that takes a number of inputs and returns one output.

f(a, b, c, d, s, t) -> q

follows the table below:

There will be a curcuit drawing when we get more into them.

Okay, so we have some basic knowledge here, and we have a rough cut of what we want to implement.

Let's start basic. You know what the operations OR, AND, and NOT are, as well as XOR. So the first thing we will do is draw up our own XOR operation (using the formula xor=AB'+A'B). We will be using a dedicated XOR chip when we build this to save money, but it does the exact same thing.

Once we have that, we will build our first circuit, an adder. It will simply take 2 1-bit numbers, and add them together. This is more specifically known as a half-adder (which we will cover later in this article).

[Image: WLhMzo0.jpg]

Okay, Figure 1 is our exclusive-or (XOR) circuit. I have labeled each step along the way. Notice that A' and B' happen at the same time, just like A'B and AB' happen at the same time. Since we are dealing with electricity, we can do a nearly infinite number of things at the same time. Also notice the lines drawn. Those are wires (if you did not already know). Also notice how some of the wires have waves in them? This is where a wire crosses over another wire but the two are NOT connected. I won't always draw the little bumps in this tutorial, so if you see two lines cross each other, assume they are NOT connected. They are only connected where there is a dot at the intersection point.

Now, FIG-2: this is where it gets a little complicated. There are actually TWO formulas for half-addition:
Now, what are Q and C?
Q: the result bit of our addition. This is our final result
C: carry. In the case that we add two 1-bit number, but yield a 2-bit result, C is that extra bit.

Alright, it looks like you are getting it, let's do some more with addition:

Full adder vs Half adder
A half adder takes two 1-bit numbers and adds them together. As simple as that. But what if we wanted to add say two 2-bit numbers together? We would just link up adders together. Well, with a half-adder we can't do that, there isn't a way to handle the second bit.
This is where a full-adder comes in.
A full adder takes three 1-bit numbers and adds them together. You can make a half-adder from a full adder by wiring the third bit directly to ground (logic LOW, binary 0).
Let's take a look at Figure 3:

[Image: PhbYFGX.jpg]

Now, the formulas for a full-adder and a half-adder are different:
Q=(A XOR B) XOR C[in]
C[out]=((A XOR B) AND C[in]) OR (A AND B)
What is this C[in] and C[out]?

Well, that is what makes a full-adder different from a half-adder. A half-adder only has an output carry bit, while a full-adder has both an in and an out.

Take a look at Figure-4. This is the truth table for our 1-bit full adder. The bits are actually backwards in the table, it should be written like so:
C[out] Q
Just read it like A+B+C[in]=C[out] Q
This might be complicated, but if you look really closely in Figure-3, you will see the half adder, and if you spend some time reviewing your logic, you should be able to understand what is going on with the rest of the circuit.

On to Figure-5:
I don't want to have to draw out Figure-3 every time I use it, so we will make this little square, and call it FADD. every time I use a 1-bit full-adder, I will draw a square and inside it write FADD. We did the same thing with the half-adder and named it ADD.

Now, 1-bit addition is pretty useless for a 4-bit CPU. So let me show you how to use Figure-2 and Figure-3 (more specifically ADD and FADD) to make an adder circuit that adds two 4-bit numbers together.

[Image: ZLfsnPt.jpg]

Now, here we take two numbers, A and B, add them together, and output S and V. You will notice subscripts on A, B, and S. The subscripts explain which bit of the number we are talking about, in this case from 0-3.
What is this V bit in there?
Well, if you recall, we can add two 1-bit numbers and produce a 4-bit result. The same thing happens here. if you add 1111 (hex f) and 0001 (hex 1) the output should be 10000 (hex 10), but we only have a 4-bit CPU, so our output will actually be 0000 (hex 0). Now, f+1 is not 0, so we have an overflow bit. That bit (V) will be set ONLY if an overflow like that has happened, and we should know that the value is probably wrong.

Notice how the adder works by passing the carry from each adder into the third input of the next? This is how we expand our 1-bit adders to work with more and more bits. Remember this skill, we will be using it a lot later.

Now, I took the time to draw out the full circuit design for our 4-bit adder, which is Figure 7:

[Image: d38jPXG.jpg]

This is what you will ultimately be wiring up. The design is a little bit condensed, and won't look the same as FADD or ADD, I did that so I could fit it all on one sheet. Just follow the lines and you should understand it. Notice here, there are no more bumps, but there are dots where wires are connected.

Figure 8:
This is a bit weird to you now, you will understand it more later, but it is simply a usage chart. It is an attempt to figure out exactly how much it will cost to make a single 4-bit adder. We have 7 XOR operations, 7 AND operations, and 3 OR operations. XOR, AND, and OR chips all have 4 gates per chip, meaning we can use say 4 OR operations, and only use a single chip. So we use 2 XOR chips, 2 AND chips, and 1 OR chip. But we have the problem that we still have 1 extra XOR, AND, and OR operation left on one of each chip. This won't be a problem right now, the waste is low. You can always optimize by moving other parts of the CPU around to fill up the waste and lower cost if you are feeling advanced, but this will get handled in a later part to this tutorial.

Figure 9:
Once again, I don't want to draw that diagram up every time I use it, so we named it 4ADD.

Now we have a way to add numbers, but how do we store them? We will ultimately use a register, but in order to understand registers you must first understand registers. I did not go to the basic level on this, so it might be a little confusing, but we will use flip-flops for our register bits. A flip-flop is just a way to store data, by sending the signal back to the circuit.

A S-R flip flop (actually called an SR Latch) has two input bits: set (S) and reset ®. and two outputs: Q and Q'. The formulas are:

Yes, this means it's really
Q = S XOR (R XOR (S XOR (R XOR ...
This is just about the only time you can have recursive electricity (I won't explain why that works, it just does, know that for now).

Well, we need to do read and write operations to these, so we need to keep track of time. That will be done using a clock signal. But if we want to write a bit when the clock is logic HIGH, and we just make an enable circuit for that (google those if you really care that much), we will end up writing that bit so quickly and often we may actually flip the bit and it won't write correctly. So we don't want to write when the clock is HIGH, we want to write when the clock goes from LOW to HIGH. That only happens once.

We will use a form of a schmidt trigger for that. The formula is:
Yes, it is confusing. It actually works because things don't happen instantly, so by the time that the NOT is processed, A may have already changed, and therefore the values will be different. I don't want to go into depth on how that works, that may be a later topic if it needs to be explained.

Alright, now I can show you the figures:

[Image: IMyONvY.jpg]

Take a good look at figure-10, just for amusement.

The real magic is Figure-11, which we also make one of those boxes for and named it DFF. The first part is our enable circuit, basically a schmidt trigger and a data input, and the second part is the SR latch. Both of these combined make what is known as a D-flip-flop. These will make up our 4-bit registers.

We take in 4 bits (D[0]-D[3]), a clock (CLK) signal, and put those into 4 different DFF's. Notice that CLK is shared between the 4 DFF's. Why do we do this? We want a synchronized write cycle. It will write all 4 bits at THE SAME TIME. Obviously there is also Q[0]-Q[3] coming out. These are ALWAYS outputting the value from the DFF. There is no output for Q' on the DFF because we simply won't use it, we will have an instruction later for that.
I named the 4-bit register 4REG for use later.

See, you are getting the hang of this, I know it is a lot of stuff to throw at you all at once, but if you need to take as long as required until you understand each figure.

I also don't want to write so many lines when we do our memory block, so I made a block called D4R, which is identical to the 4REG, except it takes a dark line and a light line. The dark line actually means there are 4 wires there. It outputs a single dark line, which means 4 lines in this case. Here is the diagram: firgure-14

[Image: bjEfRho.jpg]

The block D4R is Figure-15.

Nothing new in this figure, I simply created our 8-byte memory block. Remember there will be two of these. There isn't any point making one of those squares for it just yet, so I won't. Use this as a reference. For fun, the first person who draws out the ENTIRE circuit diagram for the memory block might win something.

EDIT: If you do the extra credit, you are ONLY allowed to use AND, OR, and NOT gates.

I used DS[0]-DS[f], meaning Data Set 0 through Data Set f. These symbolize the dark lines on the D4R. I also used QS[0]-QS[f] meaning Q Set 0 through Q Set f.


[Image: 7iGblJX.jpg]

Now, just for that one guy who asked "what about subtraction", here are the figures (with annotations) for subtracting.

Figure-17[O] (optional):

[Image: uR4O7vj.jpg]

This explains the process. Parts 1-3 create the process of doing so.

The full implementation is Figure-18[O]:

[Image: ChJt5DH.jpg]

Make sure you read the note at the bottom. The bit H must be wired directly to logic HIGH (+5v)

Okay, this is where things get a little bit more advanced. If we want to read and write, we are going to want to keep our cost down and the number of wires we have to run low. How can we read from 64 different bits but only use 4 of them? Simple! We use a multiplexer.

What is a multiplexer? @JzJad posted this below, and I think it describes it pretty well:
[Image: multiplexer.jpg]

Multiplexers take all of the possible inputs, with additional select inputs, and output only the desired outputs.

Alright, let's make a basic 4-bit multiplexer. The formulas get long.
A, B, C, D
S0, S1 (in the order S0S1)

Q (selected bit)

Table: (for the table, G=S0, H=S1)

Seems simple enough.
G=S0, H=S1

It looks pretty complicated, bit it is not. We can do a simple sum of products for this:

G'H' = 00
G'H = 01
GH' = 10
GH = 11

And then, just assign which numbers we want to each select by adding in the inputs:

G'H'A = 00
G'HB = 01
GH'C = 10
GHD = 11

And there you have it. Now simply or them all together and we have

This is close enough to simplest form. Now, let's figure out how much waste we have. Remember, we only use gates that take two inputs, and we have 4 that need 3 and 1 that needs 4. So we need to rewrite it:

Okay, so lets make the tally:
OR  :3
And remember, 4 gates per chip, so the chip usage is:
OR  :1
And the waste is 1 OR gate. That is actually pretty good. We used 4 chips for this multiplexer.

Now, we will be using multiplexers that take a 4-bit select, and take in 64 other bits. How do we do that? Simple. We take n=64/4=16. That is the number of bits each multiplexer has to take in, and we need 4 of those (one for each bit).

a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p
A, B, C, D

Now, let's make a sum of products:
Q =
(A'B'C'D') +
(A'B'C'D) +
(A'B'CD') +
(A'B'CD) +
(A'BC'D') +
(A'BC'D) +
(A'BCD') +
(A'BCD) +
(AB'C'D') +
(AB'C'D) +
(AB'CD') +
(AB'CD) +
(ABC'D') +
(ABC'D) +
(ABCD') +

And then, Add in our inputs:

Q =
(aA'B'C'D') +
(bA'B'C'D) +
(cA'B'CD') +
(dA'B'CD) +
(eA'BC'D') +
(fA'BC'D) +
(gA'BCD') +
(hA'BCD) +
(iAB'C'D') +
(jAB'C'D) +
(kAB'CD') +
(lAB'CD) +
(mABC'D') +
(nABC'D) +
(oABCD') +

Now, that is in NO way simplified... So, we can take out some of the work by first splitting it into two groups: Aset and Aunset. Aunset is from a-h inclusive, and Aset is from i-p inclusive.

We can then factor out an A' from Aset and an A from Aunset:
Q =
  (aB'C'D') +
  (bB'C'D) +
  (cB'CD') +
  (dB'CD) +
  (eBC'D') +
  (fBC'D) +
  (gBCD') +
  (iB'C'D') +
  (jB'C'D) +
  (kB'CD') +
  (lB'CD) +
  (mBC'D') +
  (nBC'D) +
  (oBCD') +
Look, getting better already. Let's go ahead and re-order Aset and Aunset so that we group them both into Bset and Bunset the same way:

Q =
    (aC'D') +
    (bC'D) +
    (cCD') +
    (eC'D') +
    (fC'D) +
    (gCD') +
    (iC'D') +
    (jC'D) +
    (kCD') +
    (mC'D') +
    (nC'D) +
    (oCD') +

And continue, all the way down the line:

Q =
      (aD') +
    ) + C(
      (cD') +
  ) + B(
      (eD') +
    ) + C(
      (gD') +
) + A(
      (iD') +
    ) + C(
      (kD') +
  ) + B(
      (mD') +
    ) + C(
      (oD') +

Okay, but now it is looking pretty messy. Just keep swimming.. Simplifying the equation will save our costs later. At this point though, it cannot be simplified any further. We can now do some equation compression:

Q =
A'(B'(C'((aD') + (bD)) + C((cD') + (dD))) + B(C'((eD') + (fD)) + C((gD') + (hD))))
A(B'(C'((iD') + (jD)) + C((kD') + (lD) )) + B(C'((mD') + (nD)) + C((oD') + (pD))))

Yeah, doesn't that look pretty! But, there you have it, the equations for the multiplexer that we will actually be using. We will name that block 16MUX. And we will use 4 16MUX's in each memory block, and then we will use the 4-bit multiplexer from the beginning of this thread as 4MUX. We might not ever use that, but who knows.

Demultiplexer: theory
Demultiplexers will take the outputs and determine the select lines. I won't go into any more depth on those, we won't be using them as far as I know in this tutorial.

Part 7 will be here when complete