Computer Systems · Data Representation

CS4 — ASCII and Unicode

📅 Mon 15 Jun 2026 · P1+P2 (double)
~120 minutes
Learning intentions
Success criteria
Warm up — recap from CS3
Answer from memory · check when done
Warm-up 1
In a normalised positive floating-point number, the mantissa always starts with 0. followed by which digit?
Warm-up 2
How many different characters can be represented using 7 bits?
Warm-up 3
What is the denary (decimal) value of the binary number 01000001? (Hint: this answer will matter in today's lesson.)

Key vocabulary

Character encoding
A standard that assigns a unique number (code point) to each character so computers can store and exchange text consistently.
ASCII
American Standard Code for Information Interchange. A 7-bit encoding covering 128 characters: English letters, digits, and basic punctuation.
Extended ASCII
An 8-bit extension of ASCII covering 256 characters. The extra 128 slots were used for accented letters and symbols, but different manufacturers chose different characters — leading to incompatibility.
Unicode
A universal character encoding standard that assigns a unique code point to every character in every human writing system: over 149,000 characters and counting.
Code point
The unique number assigned to each character in a character set. For example, the code point for 'A' is 65 in both ASCII and Unicode.
Bit depth (per character)
The number of bits used to store each character. ASCII uses 7 bits (stored in 1 byte). Unicode uses 16 or 32 bits (2 or 4 bytes) per character.

Notes

Why computers need a character encoding

Computers store everything as binary — including text. But a sequence of bits like 01000001 is meaningless on its own. A computer only knows it represents the letter 'A' because both the sender and receiver have agreed on the same character encoding: a shared lookup table that maps bit patterns to characters.

Without a standard encoding, text created on one computer would be unreadable on another. The first widely adopted standard was ASCII, developed in the early 1960s for American English computing. As computing spread globally, its limitations became apparent.

ASCII: the original standard

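ASCII (American Standard Code for Information Interchange) was first published in 1963. It uses 7 bits to encode each character, giving 2⁷ = 128 unique code points (numbered 0 to 127). These cover English letters, digits, basic punctuation, and non-printing control characters such as newline and tab.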

In practice, ASCII characters are stored in 1 byte (8 bits) — the extra bit is either set to 0 or used for error detection (parity).
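The parity idea can be sketched in a few lines of Python. This is an illustration only: the function name with_even_parity is made up for this example, and it assumes even parity with the parity bit placed on the left (real systems varied).

```python
def with_even_parity(char):
    """Return 8 bits: an even-parity bit followed by the 7-bit ASCII code."""
    code = ord(char)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    seven_bits = format(code, "07b")
    # Even parity (assumed here): the extra bit makes the total number of 1s even.
    parity_bit = seven_bits.count("1") % 2
    return str(parity_bit) + seven_bits


print(2 ** 7)                 # number of different 7-bit patterns
print(with_even_parity("A"))  # 1000001 has two 1s, so the parity bit is 0
print(with_even_parity("C"))  # 1000011 has three 1s, so the parity bit is 1
```

If a single bit is flipped in transmission, the count of 1s becomes odd and the receiver knows something went wrong.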

Extended ASCII

As computers spread across Europe, the spare 8th bit was used to add another 128 characters (code points 128–255), creating extended ASCII with 256 characters in total. These extra slots were filled with accented letters (é, ü, ñ), symbols such as £ and ©, and line-drawing characters.

The problem was that different manufacturers and countries chose different characters for these extra 128 slots. There was no single extended ASCII standard — a document created on a PC in France might display garbled text when opened on an American Mac. This incompatibility was extended ASCII's fatal flaw.
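The incompatibility can be seen by decoding the same byte with different 8-bit code pages. The sketch below uses three code pages that ship with Python's standard codecs (Latin-1, the original IBM PC code page 437, and Mac OS Roman); the choice of byte 233 is just an example.

```python
# One byte with the 8th bit set (decimal 233), as it might appear in a saved file.
raw = bytes([0xE9])

# Three different "extended ASCII" tables available in Python's standard library.
code_pages = ["latin-1", "cp437", "mac_roman"]

# Decode the identical byte using each table in turn.
decoded = {name: raw.decode(name) for name in code_pages}

for name, char in decoded.items():
    print(f"{name:10} byte 233 -> {char!r}")
```

Latin-1 shows é, while the other two tables show different characters for the same byte: exactly the garbled-text problem described above.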

Limitations of ASCII

Even extended ASCII only covers 256 characters. This is completely inadequate for global computing:

ASCII was never designed for any of this. A new approach was needed.

Figure: Character 'A' (code point 65): same character, different storage
ASCII (7-bit, stored in 1 byte): 01000001 (8 bits)
16-bit Unicode (2 bytes): 00000000 01000001 (16 bits)

Unicode: a universal solution

Unicode was introduced in 1991 as a single, universal character encoding standard. Its design goal was simple but ambitious: every character in every human writing system should have a unique, permanent code point. Version 15 of the Unicode standard assigns code points to over 149,000 characters across 161 scripts, including historic and constructed languages.

For the SQA Higher course, you need to know two encoding widths: 16-bit Unicode (2 bytes per character) and 32-bit Unicode (4 bytes per character).

Figure: Character set scale comparison (code point ranges, all starting at 0)
7-bit ASCII: 0 to 127 (128 characters)
Extended ASCII: 0 to 255 (256 characters)
Unicode: 0 to 1,114,111 (over 1.1 million code points)

The scale difference is dramatic. ASCII's 128 characters occupy a tiny fraction of Unicode's space. The visual proportions above are not to scale — Unicode's range is approximately 8,700 times larger than ASCII's.
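The "approximately 8,700 times" figure can be checked with a quick calculation. The variable names below are chosen for this sketch only.

```python
ascii_size = 2 ** 7          # code points 0 to 127
extended_size = 2 ** 8       # code points 0 to 255
unicode_size = 0x10FFFF + 1  # code points 0 to 1,114,111

# How many times larger is Unicode's range than each ASCII variant?
print(unicode_size)                   # 1114112
print(unicode_size // ascii_size)     # 8704
print(unicode_size // extended_size)  # 4352
```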

Unicode and backwards compatibility

Unicode's designers made a crucial decision: the first 128 Unicode code points (0–127) are identical to ASCII. The letter 'A' is code point 65 in both ASCII and in Unicode. This backwards compatibility meant that:

This was not an accident — it was a deliberate design choice that made Unicode's adoption feasible. Without it, switching the entire world's computing infrastructure would have required converting or discarding every existing text file.
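One way to convince yourself of this is to decode every possible ASCII value and compare it with the Unicode code point Python reports. This is a minimal check, and the function name ascii_matches_unicode is made up for the example.

```python
def ascii_matches_unicode():
    """True if each ASCII value 0-127 decodes to the character with the same Unicode code point."""
    return all(
        ord(bytes([value]).decode("ascii")) == value  # ord() gives the Unicode code point
        for value in range(128)
    )


print(ascii_matches_unicode())
print(ord("A"), "A".encode("ascii")[0])  # Unicode code point and ASCII byte value for 'A'
```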

Disadvantages of Unicode

Unicode's main disadvantage is increased storage. Because each character requires more bits, files that only contain ASCII text become larger when stored in Unicode format:

Encoding  |  Bits per character  |  Bytes per character  |  Storage vs ASCII
ASCII  |  7 (stored in 8)  |  1  |  Baseline
16-bit Unicode  |  16  |  2  |  2× more storage
32-bit Unicode  |  32  |  4  |  4× more storage

For a document containing only English text, switching from ASCII to 16-bit Unicode doubles the storage requirement — with no practical benefit, since all English characters already fit within ASCII's 128 code points. For 32-bit Unicode, the cost is four times the storage. This overhead was a significant concern when Unicode was introduced and storage was expensive; it remains relevant in memory-constrained systems today.
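The storage rule is simply number of characters × bytes per character. A small sketch, assuming fixed-width encodings as in the table; the function name storage_bytes and the 1,000-character document are hypothetical.

```python
def storage_bytes(num_chars, bits_per_char):
    """Bytes needed for fixed-width text; each character occupies whole bytes."""
    bytes_per_char = (bits_per_char + 7) // 8  # round up, so 7-bit ASCII takes 1 byte
    return num_chars * bytes_per_char


num_chars = 1000  # hypothetical document length
for name, bits in [("ASCII", 7), ("16-bit Unicode", 16), ("32-bit Unicode", 32)]:
    print(f"{name}: {storage_bytes(num_chars, bits)} bytes")
```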

Worked examples

Example 1 — Binary value of the letter 'A'
1. Look up (or recall) the ASCII code point: 'A' = 65 in decimal. This is the universal value for uppercase A in both ASCII and Unicode.
2. Convert 65 to binary using place values:
   128 > 65 → 0  |  64 ≤ 65 → 1, remainder 1
   32 > 1 → 0  |  16 > 1 → 0  |  8 > 1 → 0  |  4 > 1 → 0  |  2 > 1 → 0  |  1 = 1 → 1
   Result: 01000001
3. Verify: 64 + 1 = 65 ✓. The 8-bit binary 01000001 is how the letter 'A' is stored in memory.
'A' = decimal 65 = binary 01000001. In 16-bit Unicode, this same value is stored as 00000000 01000001 (padded to 2 bytes).
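The place-value method from this example can be written as a short Python function. It is a sketch of the hand method rather than the quickest way to do it in Python, and the name to_binary_8bit is made up for the example.

```python
def to_binary_8bit(value):
    """Convert 0-255 to an 8-bit string by subtracting place values from left to right."""
    if not 0 <= value <= 255:
        raise ValueError("value must fit in 8 bits")
    bits = ""
    remainder = value
    for place in [128, 64, 32, 16, 8, 4, 2, 1]:
        if place <= remainder:
            bits += "1"         # this place value fits, so use it
            remainder -= place
        else:
            bits += "0"         # too big, so skip it
    return bits


print(to_binary_8bit(ord("A")))  # ord("A") is 65
print(format(65, "08b"))         # Python's built-in conversion, for comparison
```

Both lines should print the same pattern as the worked example.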
Example 2 — Storage for "Hello" in ASCII vs Unicode
1. Count the characters: H, e, l, l, o = 5 characters. Each character requires storage according to the encoding used.
2. ASCII: 1 byte per character → 5 × 1 = 5 bytes (= 40 bits)
3. 16-bit Unicode: 2 bytes per character → 5 × 2 = 10 bytes (= 80 bits)
4. 32-bit Unicode: 4 bytes per character → 5 × 4 = 20 bytes (= 160 bits)
16-bit Unicode uses twice the storage of ASCII; 32-bit Unicode uses four times the storage. For simple English words like "Hello", this extra storage provides no practical benefit, since all five letters are within ASCII's range.
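These totals can be checked by actually encoding the string. The sketch below assumes Python's UTF-16 and UTF-32 codecs as stand-ins for the course's "16-bit" and "32-bit Unicode"; the "-be" variants are used because they do not add a byte-order mark, which would otherwise inflate the count.

```python
text = "Hello"

# Encode the same five characters three ways and count the resulting bytes.
sizes = {
    "ASCII": len(text.encode("ascii")),
    "16-bit Unicode": len(text.encode("utf-16-be")),
    "32-bit Unicode": len(text.encode("utf-32-be")),
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes ({size * 8} bits)")
```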
Example 3 — Why Unicode was needed: Arabic and emoji
1. ASCII has 128 code points (0–127). The Arabic letter ب (ba), the second letter of the Arabic alphabet, has no code point in ASCII at all. There is simply no slot available.
2. In Unicode, ب is assigned code point U+0628 (decimal 1576). Arabic letters occupy code points 1536–1791, far beyond ASCII's range of 0–127.
3. The emoji 😀 (grinning face) is Unicode code point U+1F600 (decimal 128,512). ASCII cannot represent it at all, and it is too large for a single 16-bit value (maximum 65,535): it needs either 32-bit Unicode or, in 16-bit Unicode, a special surrogate pair of two 16-bit units.
Unicode is the only encoding standard that can represent all of these characters in a single, consistent system. Any application that handles international text — web browsers, messaging apps, word processors — must use Unicode.
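The three characters in this example can be compared directly in Python. This is a sketch: the function name describe is made up, and UTF-16 is assumed as the 16-bit form of Unicode.

```python
def describe(char):
    """Return (code point, fits in ASCII?, number of 16-bit units needed)."""
    code_point = ord(char)
    try:
        char.encode("ascii")
        fits_ascii = True
    except UnicodeEncodeError:
        fits_ascii = False  # no ASCII slot exists for this character
    units_16bit = len(char.encode("utf-16-be")) // 2  # 2 bytes per 16-bit unit
    return code_point, fits_ascii, units_16bit


for char in ["A", "ب", "😀"]:
    code_point, fits_ascii, units = describe(char)
    print(f"U+{code_point:04X} (decimal {code_point}) ASCII: {fits_ascii}, 16-bit units: {units}")
```

The emoji is the only one of the three that needs two 16-bit units, which is the surrogate pair mentioned in step 3.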
Now you try
What is the 8-bit binary representation of the letter 'B'? Show your working clearly.
⚠️ Common mistakes — examiner feedback
📝 Exam tip

"Explain why Unicode is used instead of ASCII" is a frequent 2-mark question. You must give two distinct points to score both marks. Safe, reliable answers are:

A third useful point (for 3-mark variations): Unicode is backwards compatible with ASCII — the first 128 code points are identical, so existing ASCII files do not need to be converted.

Task Set A

Task Set A — Higher core
Work through all questions. Written questions are self-assessed.
A1
What is the ASCII decimal code point for the uppercase letter 'A'?
A2
What is the 7-bit binary representation of the letter 'Z' (decimal code point 90)?
A3
Which statement about the original 7-bit ASCII standard is correct?
A4
How many bytes are needed to store the string "Hi" using ASCII?
A5
How many bytes are needed to store the same string "Hi" using 16-bit Unicode?
A6
A document contains 500 English characters. How many more bytes does 16-bit Unicode use compared to ASCII?
A7 — past paper style (2 marks)
Explain why ASCII cannot be used to store Arabic text.
A8 — past paper style (2 marks)
State one advantage and one disadvantage of Unicode compared to ASCII.
A9
A text file storing 100 English characters is converted from ASCII to 32-bit Unicode. How many times larger does the file become?
A10 — past paper style (2 marks)
Explain why Unicode is backwards compatible with ASCII and why this was important when Unicode was introduced.
✅ Higher checkpoint — A7, A8, and A10 are the most exam-relevant. Confident explanations with two distinct points on each = ready for the SQA exam.

Task Set B

Task Set B — Extension · Beyond the specification
No auto-check — self-assess using the model answers.
B1
The word "café" contains the accented character 'é'. Calculate the total bytes needed to store it in 16-bit Unicode, and explain why standard 7-bit ASCII cannot store this string at all.
B2
UTF-8 uses 1 byte for ASCII characters and 2–4 bytes for others. Explain one advantage and one disadvantage of UTF-8 compared to fixed-width 16-bit Unicode for a document that is mostly English text.
B3
A school database stores 1,200 pupil names, each averaging 12 characters, currently in ASCII. Calculate the storage in kilobytes for both ASCII and 16-bit Unicode (1 KB = 1,024 bytes). State the difference, and explain whether switching to Unicode is justified.
B4
16-bit Unicode can directly represent 65,536 code points, but the full Unicode standard contains over 149,000 characters. Explain the limitation this creates and describe how it is addressed in practice.
📁 File this in OneNote under:
Higher Computing Science → Computer Systems → CS4
📌 Teacher notes (not for pupils)

Timing (120 min double):
5 min — warm up (CS3 recap + binary bridge), circulate
5 min — key vocabulary together
10 min — why encoding matters (demo: open a file in wrong encoding, show garbled text if possible)
5 min — ASCII: structure, code points, worked example 1 together
10 min — Unicode: scale diagram, context-clash diagram, backwards compatibility
5 min — storage calculation method (worked example 2 together)
5 min — now you try, then cold call
5 min — common mistakes and exam tip
25 min — tasks
5 min — cold call review on A7/A8 (most exam-relevant)

Watch for: pupils writing "uses more memory" without the multiplier — drill the ×2 and ×4 language. Also watch for confusion between 7-bit code points and 1-byte storage (ASCII characters are stored in 1 byte even though the code itself only uses 7 bits).

Demo idea: Open a terminal and type python3 -c "print(chr(65))" — instantly shows 'A'. Then print(ord('A')) to show 65. Then print(ord('😀')) to show 128512 — dramatically outside ASCII range. Pupils find this concretely convincing.

B4 (surrogate pairs) is genuinely beyond the SQA specification but stretches confident pupils well. It's worth mentioning to the group even if they don't attempt it: "this is what happens when a 16-bit standard tries to cover 1.1 million code points" is a memorable idea.