CS4 — ASCII and Unicode
- I can explain why a standard character encoding is needed to store text in a computer
- I can describe ASCII as a 7-bit code representing 128 characters, and identify its limitations
- I can describe Unicode as a standard that supports all world languages and scripts
- I can calculate storage requirements for text using ASCII and Unicode
- I can state that 7-bit ASCII represents 128 characters and 8-bit extended ASCII represents 256
- I can convert between a character's decimal code point and its binary representation
- I can explain two distinct advantages of Unicode over ASCII
- I can calculate that 16-bit Unicode uses 2× more storage than ASCII, and 32-bit uses 4× more
0. followed by which digit? 01000001? (Hint: this answer will matter in today's lesson.)
Key vocabulary
Notes
Why computers need a character encoding
Computers store everything as binary — including text. But a sequence of bits like 01000001 is meaningless on its own. A computer only knows it represents the letter 'A' because both the sender and receiver have agreed on the same character encoding: a shared lookup table that maps bit patterns to characters.
Without a standard encoding, text created on one computer would be unreadable on another. The first widely adopted standard was ASCII, developed in the early 1960s for American English computing. As computing spread globally, its limitations became apparent.
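As a minimal sketch of the lookup-table idea, the snippet below treats the bit pattern 01000001 first as a number and then as a character. The variable names are illustrative only; `int`, `chr`, `ord` and `format` are Python built-ins.

```python
# The same 8 bits, interpreted two ways.
bits = "01000001"
value = int(bits, 2)   # read the pattern as a binary number
char = chr(value)      # look that number up as a character

print(value)           # 65
print(char)            # A

# Going the other way: character -> code point -> 8-bit pattern.
pattern = format(ord("A"), "08b")
print(pattern)         # 01000001
```

The bits never change; only the agreed interpretation turns 65 into 'A'.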
ASCII: the original standard
ASCII (American Standard Code for Information Interchange) was first published in 1963. It uses 7 bits to encode each character, giving 2⁷ = 128 unique code points (numbered 0 to 127). These cover:
- The 26 uppercase English letters (A–Z, code points 65–90)
- The 26 lowercase English letters (a–z, code points 97–122)
- The digits 0–9 (code points 48–57)
- Common punctuation and symbols (!, ", #, $, %, …)
- 33 non-printing control characters (line feed, carriage return, etc.)
In practice, ASCII characters are stored in 1 byte (8 bits) — the extra bit is either set to 0 or used for error detection (parity).
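The ranges listed above can be checked with a short script. The helper `with_even_parity` is hypothetical and assumes an even-parity scheme with the parity bit placed first; the notes do not say which parity convention is meant, so treat it as one possible illustration.

```python
TOTAL_CODE_POINTS = 2 ** 7   # 7 bits give 128 code points

# Code point ranges quoted in the notes.
uppercase = "".join(chr(c) for c in range(65, 91))
lowercase = "".join(chr(c) for c in range(97, 123))
digits = "".join(chr(c) for c in range(48, 58))

# Control characters occupy code points 0-31 and 127.
control_codes = list(range(0, 32)) + [127]

print(TOTAL_CODE_POINTS)     # 128
print(uppercase)             # ABCDEFGHIJKLMNOPQRSTUVWXYZ
print(digits)                # 0123456789
print(len(control_codes))    # 33


def with_even_parity(code: int) -> str:
    """Return 8 bits: an even-parity bit followed by the 7-bit code (hypothetical helper)."""
    seven_bits = format(code, "07b")
    parity_bit = seven_bits.count("1") % 2   # 1 only if the count of 1s is odd
    return str(parity_bit) + seven_bits


print(with_even_parity(ord("A")))   # 01000001
print(with_even_parity(ord("C")))   # 11000011
```

'A' already has an even number of 1s, so its spare bit stays 0; 'C' has three 1s, so its parity bit is set to 1.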
Extended ASCII
As computers spread across Europe, the spare 8th bit was used to add another 128 characters (code points 128–255), creating extended ASCII with 256 characters total. These extra slots were filled with accented letters (é, ü, ñ), symbols such as £ and ©, and line-drawing characters.
The problem was that different manufacturers and countries chose different characters for these extra 128 slots. There was no single extended ASCII standard — a document created on a PC in France might display garbled text when opened on an American Mac. This incompatibility was extended ASCII's fatal flaw.
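The incompatibility can be sketched by decoding one byte from the 128–255 range with three different 8-bit code pages. The three codecs chosen here (`latin-1`, `cp437`, `mac_roman`) are simply examples that ship with Python; they are not named in the notes.

```python
# One byte with value 233, which lies in the extended range 128-255.
raw = bytes([0xE9])

# Decode the same byte using three different 8-bit code pages.
decoded = {codec: raw.decode(codec) for codec in ("latin-1", "cp437", "mac_roman")}

for codec, char in decoded.items():
    print(codec, "->", char)

# All three code pages disagree about what byte 233 means.
print(len(set(decoded.values())))   # 3
```

Only one of the three results is the é the author intended, which is exactly the garbled-text problem described above.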
Limitations of ASCII
Even extended ASCII only covers 256 characters. This is completely inadequate for global computing:
- Arabic has 28 primary letters, each with up to 4 contextual forms — over 100 distinct shapes
- Chinese (Mandarin) uses over 50,000 characters; everyday literacy requires knowing around 3,500
- Japanese uses three separate writing systems simultaneously (hiragana, katakana, kanji)
- Emoji — now essential in global communication — require dedicated code points entirely outside ASCII's range
ASCII was never designed for any of this. A new approach was needed.
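A quick way to see these limits is to test whether a character's code point falls inside ASCII's 0–127 range. The sample characters below are chosen for illustration, one per script mentioned in the list.

```python
samples = ["A", "é", "ع", "中", "😀"]

# A character fits in 7-bit ASCII only if its code point is below 128.
fits_ascii = [ord(ch) < 128 for ch in samples]

for ch, fits in zip(samples, fits_ascii):
    label = "fits in 7-bit ASCII" if fits else "outside ASCII"
    print(ch, ord(ch), label)

# Trying to store a non-ASCII character as ASCII fails outright.
try:
    "中".encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

print(encode_failed)   # True
```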
Same character, different storage
Unicode: a universal solution
Unicode was introduced in 1991 as a single, universal character encoding standard. Its design goal was simple but ambitious: every character in every human writing system should have a unique, permanent code point. Version 15.0 of the Unicode standard (2022) assigns code points to over 149,000 characters across 161 scripts, including historic and constructed scripts.
For the SQA Higher course, you need to know two encoding widths:
- 16-bit Unicode: uses 2 bytes per character → can directly address 2¹⁶ = 65,536 code points
- 32-bit Unicode: uses 4 bytes per character → can directly address over 4 billion code points, more than enough for all current and future needs
The scale difference is dramatic. ASCII's 128 characters occupy a tiny fraction of Unicode's space. The visual proportions above are not to scale: Unicode's full code space of 1,114,112 code points is approximately 8,700 times larger than ASCII's 128.
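The two widths and the scale comparison can be checked with a few lines of arithmetic. As an assumption for illustration, Python's `utf-16-be` and `utf-32-be` codecs stand in for the course's "16-bit Unicode" and "32-bit Unicode"; `sys.maxunicode` gives the highest code point Python supports.

```python
import sys

ASCII_CODE_POINTS = 2 ** 7         # 128
SIXTEEN_BIT_VALUES = 2 ** 16       # 65,536
THIRTY_TWO_BIT_VALUES = 2 ** 32    # 4,294,967,296

# Stand-ins for the course's 16-bit and 32-bit Unicode (illustrative assumption).
utf16 = "A".encode("utf-16-be")
utf32 = "A".encode("utf-32-be")
print(len(utf16), utf16.hex())     # 2 0041
print(len(utf32), utf32.hex())     # 4 00000041

# Unicode code points run from 0 to sys.maxunicode (0x10FFFF).
unicode_code_points = sys.maxunicode + 1
ratio = unicode_code_points / ASCII_CODE_POINTS
print(unicode_code_points)         # 1114112
print(ratio)                       # 8704.0
```

The ratio of 8,704 is where the "approximately 8,700 times larger" figure comes from.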
Unicode and backwards compatibility
Unicode's designers made a crucial decision: the first 128 Unicode code points (0–127) are identical to ASCII. The letter 'A' is code point 65 in both ASCII and in Unicode. This backwards compatibility meant that:
- Existing ASCII files could be read correctly by Unicode software without any conversion
- Software could adopt Unicode gradually without breaking existing workflows
- Billions of ASCII documents already in existence did not need to be re-encoded
This was not an accident — it was a deliberate design choice that made Unicode's adoption feasible. Without it, switching the entire world's computing infrastructure would have required converting or discarding every existing text file.
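The overlap can be verified directly. In this sketch, every possible 7-bit ASCII byte is decoded and compared against the Unicode code point Python reports for the resulting character.

```python
# Every possible 7-bit ASCII value, as raw bytes.
ascii_bytes = bytes(range(128))
decoded = ascii_bytes.decode("ascii")

# Python strings are sequences of Unicode code points, so ord() gives the
# Unicode number. It should match the original ASCII value every time.
all_match = all(ord(ch) == i for i, ch in enumerate(decoded))

print(all_match)                           # True
print(ord("A"), "A".encode("ascii")[0])    # 65 65
```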
Disadvantages of Unicode
Unicode's main disadvantage is increased storage. Because each character requires more bits, files that only contain ASCII text become larger when stored in Unicode format:
| Encoding | Bits per character | Bytes per character | Storage vs ASCII |
|---|---|---|---|
| ASCII | 7 (stored in 8) | 1 | Baseline |
| 16-bit Unicode | 16 | 2 | 2× more storage |
| 32-bit Unicode | 32 | 4 | 4× more storage |
For a document containing only English text, switching from ASCII to 16-bit Unicode doubles the storage requirement — with no practical benefit, since all English characters already fit within ASCII's 128 code points. For 32-bit Unicode, the cost is four times the storage. This overhead was a significant concern when Unicode was introduced and storage was expensive; it remains relevant in memory-constrained systems today.
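The storage calculation is simply the number of characters multiplied by the bytes per character. The function `storage_bytes` and the sample message below are hypothetical, written only to mirror the table.

```python
# Bytes per character, taken from the table in the notes.
BYTES_PER_CHAR = {"ASCII": 1, "16-bit Unicode": 2, "32-bit Unicode": 4}


def storage_bytes(num_chars: int, encoding: str) -> int:
    """Return the bytes needed to store num_chars characters (hypothetical helper)."""
    return num_chars * BYTES_PER_CHAR[encoding]


message = "Higher Computing Science"   # 24 characters, including spaces

for encoding in BYTES_PER_CHAR:
    print(encoding, storage_bytes(len(message), encoding), "bytes")
```

For this 24-character message the results are 24, 48 and 96 bytes, matching the 2× and 4× multipliers.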
Worked examples
128 > 65 → 0 | 64 ≤ 65 → 1, remainder 1
32 > 1 → 0 | 16 > 1 → 0 | 8 > 1 → 0 | 4 > 1 → 0 | 2 > 1 → 0 | 1 = 1 → 1
Result: 01000001
01000001 is how the letter 'A' is stored in memory.
00000000 01000001 (padded to 2 bytes).
Common mistakes
- Vague storage answers: Writing "Unicode uses more memory than ASCII" scores zero. Always quantify: 16-bit Unicode uses twice as much storage (2×); 32-bit uses four times as much (4×).
- Confusing bits and bytes: ASCII stores each character in 1 byte (8 bits), but the actual code uses only 7 of those bits. 16-bit Unicode uses 2 bytes (16 bits) per character. These are not the same thing — be precise.
- Saying ASCII "can't represent any non-English characters": Extended ASCII (8-bit) does cover many Western European accented characters (é, ü, ñ). The correct claim is that standard 7-bit ASCII cannot, and neither standard covers non-Latin scripts like Arabic or Chinese.
- Forgetting backwards compatibility in exam answers: When asked to explain Unicode's advantages, backwards compatibility with ASCII is a distinct, examinable point — don't omit it.
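The place-value method used in the worked conversion of 65 can be sketched as a small function, which also makes the bits-versus-bytes distinction concrete. The name `to_binary` is hypothetical.

```python
def to_binary(value: int, bits: int = 8) -> str:
    """Convert a non-negative integer to binary by testing each place value in turn."""
    if not 0 <= value < 2 ** bits:
        raise ValueError("value does not fit in the requested number of bits")
    digits = []
    remainder = value
    # Work down through the place values: 128, 64, 32, ... for 8 bits.
    for power in range(bits - 1, -1, -1):
        place = 2 ** power
        if place <= remainder:
            digits.append("1")
            remainder -= place
        else:
            digits.append("0")
    return "".join(digits)


print(to_binary(65, 7))     # 1000001  (the 7-bit ASCII code itself)
print(to_binary(65))        # 01000001  (stored in 1 byte)
print(to_binary(65, 16))    # 0000000001000001  (padded to 2 bytes)
```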
"Explain why Unicode is used instead of ASCII" is a frequent 2-mark question. You must give two distinct points to score both marks. Safe, reliable answers are:
- ASCII only has 128 code points, covering mainly the English alphabet — it cannot represent characters from other languages such as Arabic or Chinese.
- Unicode can represent characters from every writing system in the world, using a single consistent standard with over 149,000 code points.
A third useful point (for 3-mark variations): Unicode is backwards compatible with ASCII — the first 128 code points are identical, so existing ASCII files do not need to be converted.
Task Set A
Task Set B
Timing (120 min double):
5 min — warm up (CS3 recap + binary bridge), circulate
5 min — key vocabulary together
10 min — why encoding matters (demo: open a file in wrong encoding, show garbled text if possible)
5 min — ASCII: structure, code points, worked example 1 together
10 min — Unicode: scale diagram, context-clash diagram, backwards compatibility
5 min — storage calculation method (worked example 2 together)
5 min — now you try, then cold call
5 min — common mistakes and exam tip
25 min — tasks
5 min — cold call review on A7/A8 (most exam-relevant)
Watch for: pupils writing "uses more memory" without the multiplier — drill the ×2 and ×4 language. Also watch for confusion between 7-bit code points and 1-byte storage (ASCII characters are stored in 1 byte even though the code itself only uses 7 bits).
Demo idea: Open a terminal and type python3 -c "print(chr(65))" — instantly shows 'A'. Then print(ord('A')) to show 65. Then print(ord('😀')) to show 128512 — dramatically outside ASCII range. Pupils find this concretely convincing.
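If a script is easier to project than one-line commands, the same demo could be saved as a file and run with python3. The variable names are illustrative.

```python
# chr() turns a code point into a character; ord() does the reverse.
letter = chr(65)
letter_code = ord("A")
emoji_code = ord("😀")

print(letter)             # A
print(letter_code)        # 65
print(emoji_code)         # 128512
print(emoji_code > 127)   # True: far outside ASCII's 0-127 range
```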
B4 (surrogate pairs) is genuinely beyond the SQA specification but stretches confident pupils well. It's worth mentioning to the group even if they don't attempt it: "this is what happens when a 16-bit standard tries to cover 1.1 million code points" is a memorable idea.
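For teachers who want to show the idea behind B4, the sketch below splits a code point above 65,535 into two 16-bit values using the UTF-16 surrogate calculation. The function name `to_surrogate_pair` is hypothetical, and this goes beyond what pupils need for the exam.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above 0xFFFF into a UTF-16 high and low surrogate."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(ord("😀"))
print(hex(high), hex(low))              # 0xd83d 0xde00

# Python's own UTF-16 encoder produces the same two 16-bit units.
print("😀".encode("utf-16-be").hex())   # d83dde00
```

One emoji therefore takes 4 bytes even in a "16-bit" encoding, which is the point of the task.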