Computer Systems · Data Representation

CS4 — ASCII and Unicode

📅 Mon 15 Jun 2026 · P1+P2 (double)

⏱ ~120 minutes

Learning intentions

I can explain why a standard character encoding is needed to store text in a computer
I can describe ASCII as a 7-bit code representing 128 characters, and identify its limitations
I can describe Unicode as a standard that supports all world languages and scripts
I can calculate storage requirements for text using ASCII and Unicode

Success criteria

I can state that 7-bit ASCII represents 128 characters and 8-bit extended ASCII represents 256
I can convert between a character's decimal code point and its binary representation
I can explain two distinct advantages of Unicode over ASCII
I can calculate that 16-bit Unicode uses 2× the storage of ASCII, and 32-bit uses 4×

Warm up — recap from CS3

Answer from memory · check when done

Warm-up 1

In a normalised positive floating-point number, the mantissa always starts with 0. followed by which digit?

0

Incorrect: a mantissa starting 0.0 is not normalised because it wastes a bit of precision. Normalised form always has 0.1 for positive numbers.

1

Correct — a normalised positive mantissa is of the form 0.1xxx. This maximises precision by ensuring all mantissa bits are being used meaningfully.

Either 0 or 1

Incorrect — only 0.1xxx is normalised for a positive number. 0.0xxx can always be shifted left to give more precision.

It depends on the exponent

Incorrect — the normalisation rule is independent of the exponent. The mantissa must always start 0.1 for a positive normalised value.

Warm-up 2

How many different characters can be represented using 7 bits?

64

Incorrect: 64 = 2⁶. You'd need 6 bits to represent 64 values, not 7.

128

Correct — 2⁷ = 128. With 7 bits you can form 128 unique combinations (0 to 127). This is exactly the size of the original ASCII character set.

256

Incorrect — 256 = 2⁸. You'd need 8 bits for 256 values — that is extended ASCII's character count, not 7-bit ASCII.

512

Incorrect — 512 = 2⁹. This would require 9 bits.

Warm-up 3

What is the denary (decimal) value of the binary number 01000001? (Hint: this answer will matter in today's lesson.)

Key vocabulary

Character encoding

A standard that assigns a unique number (code point) to each character so computers can store and exchange text consistently.

ASCII

American Standard Code for Information Interchange. A 7-bit encoding covering 128 characters: English letters, digits, and basic punctuation.

Extended ASCII

An 8-bit extension of ASCII covering 256 characters. The extra 128 slots were used for accented letters and symbols, but different manufacturers chose different characters — leading to incompatibility.

Unicode

A universal character encoding standard that assigns unique code points to every character in every human writing system — over 149,000 characters and counting.

Code point

The unique number assigned to each character in a character set. For example, the code point for 'A' is 65 in both ASCII and Unicode.

Bit depth (per character)

The number of bits used to store each character. ASCII uses 7 bits (stored in 1 byte). Unicode uses 16 or 32 bits (2 or 4 bytes) per character.

Notes

Why computers need a character encoding

Computers store everything as binary — including text. But a sequence of bits like 01000001 is meaningless on its own. A computer only knows it represents the letter 'A' because both the sender and receiver have agreed on the same character encoding: a shared lookup table that maps bit patterns to characters.
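As a minimal sketch of the shared lookup table idea, the Python snippet below reads the bit pattern from the paragraph above first as a plain number and then as a character. `int`, `chr`, `ord` and `format` are Python built-ins; the variable names are illustrative only.

```python
# The same 8 bits, interpreted two ways.
bits = "01000001"

as_number = int(bits, 2)        # read the pattern as an unsigned binary integer
as_character = chr(as_number)   # look that number up as a code point

print(as_number)      # 65
print(as_character)   # A

# Going the other way: character -> code point -> 8-bit pattern.
round_trip = format(ord(as_character), "08b")
print(round_trip)     # 01000001
```

The bits never change; only the agreed interpretation turns 65 into 'A'.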

Without a standard encoding, text created on one computer would be unreadable on another. The first widely adopted standard was ASCII, developed in the early 1960s for American English computing. As computing spread globally, its limitations became apparent.

ASCII: the original standard

ASCII (American Standard Code for Information Interchange) was first published in 1963 and revised in 1967, when the lowercase letters were added. It uses 7 bits to encode each character, giving 2⁷ = 128 unique code points (numbered 0 to 127). These cover:

The 26 uppercase English letters (A–Z, code points 65–90)
The 26 lowercase English letters (a–z, code points 97–122)
The digits 0–9 (code points 48–57)
Common punctuation and symbols (!, ", #, $, %, …)
33 non-printing control characters (line feed, carriage return, etc.)

In practice, ASCII characters are stored in 1 byte (8 bits) — the extra bit is either set to 0 or used for error detection (parity).
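The difference between the 7-bit code and the 8-bit stored byte can be sketched in Python. The helper names below are hypothetical, and the parity scheme shown (even parity, with the parity bit stored as the leftmost bit) is an assumption chosen for illustration; real systems varied.

```python
def seven_bit_code(char):
    """Return the 7-bit ASCII pattern for one character."""
    code = ord(char)
    if code > 127:
        raise ValueError(f"{char!r} is outside the 7-bit ASCII range")
    return format(code, "07b")


def with_even_parity(char):
    """Prefix the 7-bit code with a parity bit so the byte holds an even number of 1s.

    Assumption: even parity, parity bit stored as the leftmost bit.
    """
    code = seven_bit_code(char)
    parity = code.count("1") % 2   # 1 if the count of 1s is odd, otherwise 0
    return str(parity) + code


for ch in "AC":
    # character, 7-bit code, byte with a leading 0, byte with a parity bit
    print(ch, seven_bit_code(ch), "0" + seven_bit_code(ch), with_even_parity(ch))
```

For 'A' (1000001, two 1s) the parity bit is 0, so both stored forms are 01000001. For 'C' (1000011, three 1s) the parity bit is 1, giving 11000011.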

Extended ASCII

As computers spread across Europe, the spare 8th bit was used to add another 128 characters (code points 128–255), creating extended ASCII with 256 characters total. These extra slots were filled with accented letters (é, ü, ñ), symbols such as £ and ©, and line-drawing characters.

The problem was that different manufacturers and countries chose different characters for these extra 128 slots. There was no single extended ASCII standard — a document created on a PC in France might display garbled text when opened on an American Mac. This incompatibility was extended ASCII's fatal flaw.
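The incompatibility can be demonstrated with a short Python sketch. The three code pages used here (Latin-1, IBM code page 437 and Mac OS Roman) are real 8-bit character sets that ship with Python's codecs, but choosing these three is an assumption made for illustration; the notes above do not name specific code pages.

```python
# One byte value taken from the "extra" range 128-255.
raw = bytes([0xE9])

# Decode the same byte using three different 8-bit code pages.
code_pages = ("latin-1", "cp437", "mac_roman")
decoded = {name: raw.decode(name) for name in code_pages}

for name, char in decoded.items():
    print(f"{name:10} -> {char}")

# The same stored byte produces a different character each time.
print(len(set(decoded.values())))
```

Only one of the three results is the é the author intended, which is the kind of garbled text described above.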

Limitations of ASCII

Even extended ASCII only covers 256 characters. This is completely inadequate for global computing:

Arabic has 28 primary letters, each with up to 4 contextual forms — over 100 distinct shapes
Written Chinese uses over 50,000 characters; everyday literacy requires knowing around 3,500
Japanese uses three separate writing systems simultaneously (hiragana, katakana, kanji)
Emoji — now essential in global communication — require dedicated code points entirely outside ASCII's range

ASCII was never designed for any of this. A new approach was needed.

Figure: Character 'A' (code point 65): same character, different storage

ASCII (7-bit code, stored in 1 byte): 01000001 (8 bits)

16-bit Unicode (2 bytes): 00000000 01000001 (16 bits)

Unicode: a universal solution

Unicode was introduced in 1991 as a single, universal character encoding standard. Its design goals were simple but ambitious: every character in every human writing system should have a unique, permanent code point. Version 15.0 of the Unicode standard (2022) assigns code points to over 149,000 characters across 161 scripts, including historic and constructed scripts, and later versions continue to add more.

For the SQA Higher course, you need to know two encoding widths:

16-bit Unicode: uses 2 bytes per character → can directly address 2¹⁶ = 65,536 code points
32-bit Unicode: uses 4 bytes per character → can directly address over 4 billion code points, more than enough for all current and future needs

Figure: Character set scale comparison (number line starting at 0)

7-bit ASCII: code points 0 to 127 (128 chars)

Extended ASCII: code points 0 to 255 (256 chars)

Unicode: code points 0 to 1,114,111 (over 1.1 million code points)

The scale difference is dramatic. ASCII's 128 characters occupy a tiny fraction of Unicode's space. The visual proportions above are not to scale — Unicode's range is approximately 8,700 times larger than ASCII's.
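The figures quoted in this section can be checked with a few lines of Python. This is only an arithmetic sketch; the variable names are illustrative.

```python
ascii_size = 2 ** 7            # 128 code points (0 to 127)
extended_size = 2 ** 8         # 256 code points (0 to 255)
unicode_16 = 2 ** 16           # 65,536 values in 16 bits
unicode_32 = 2 ** 32           # 4,294,967,296 values in 32 bits
unicode_space = 0x10FFFF + 1   # Unicode code points 0 to 1,114,111

# How many times larger is Unicode's code space than ASCII's?
ratio = unicode_space / ascii_size
print(f"{ratio:,.0f}")         # 8,704
```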

Unicode and backwards compatibility

Unicode's designers made a crucial decision: the first 128 Unicode code points (0–127) are identical to ASCII. The letter 'A' is code point 65 in both ASCII and in Unicode. This backwards compatibility meant that:

Existing ASCII text keeps exactly the same code points in Unicode, so Unicode software could read it correctly without remapping any characters
Software could adopt Unicode gradually without breaking existing workflows
Billions of ASCII documents already in existence did not need to be re-encoded

This was not an accident — it was a deliberate design choice that made Unicode's adoption feasible. Without it, switching the entire world's computing infrastructure would have required converting or discarding every existing text file.
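A short Python check makes the compatibility claim concrete. Python strings use Unicode code points, so `ord` and `chr` apply to both standards. The UTF-8 and UTF-16 encoders used here are specific Unicode encodings that go beyond the core course; they appear only as an illustration.

```python
# Every ASCII byte (0-127) decodes to the same character as ASCII and as UTF-8.
same_in_utf8 = all(
    bytes([code]).decode("ascii") == bytes([code]).decode("utf-8")
    for code in range(128)
)
print(same_in_utf8)              # True

code_point = ord("A")
print(code_point)                # 65, in ASCII and in Unicode

# In a 16-bit encoding the same value is simply padded to 2 bytes.
utf16_hex = "A".encode("utf-16-be").hex()
print(utf16_hex)                 # 0041
```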

Disadvantages of Unicode

Unicode's main disadvantage is increased storage. Because each character requires more bits, files that only contain ASCII text become larger when stored in Unicode format:

Encoding	Bits per character	Bytes per character	Storage vs ASCII
ASCII	7 (stored in 8)	1	Baseline
16-bit Unicode	16	2	2× the storage
32-bit Unicode	32	4	4× the storage

For a document containing only English text, switching from ASCII to 16-bit Unicode doubles the storage requirement — with no practical benefit, since all English characters already fit within ASCII's 128 code points. For 32-bit Unicode, the cost is four times the storage. This overhead was a significant concern when Unicode was introduced and storage was expensive; it remains relevant in memory-constrained systems today.

Worked examples

Example 1 — Binary value of the letter 'A'

Look up (or recall) the ASCII code point: 'A' = 65 in decimal. This is the universal value for uppercase A in both ASCII and Unicode.

Convert 65 to binary using place values:
128 > 65 → 0 | 64 ≤ 65 → 1, remainder 1
32 > 1 → 0 | 16 > 1 → 0 | 8 > 1 → 0 | 4 > 1 → 0 | 2 > 1 → 0 | 1 = 1 → 1
Result: 01000001

Verify: 64 + 1 = 65 ✓. The 8-bit binary 01000001 is how the letter 'A' is stored in memory.

✓

'A' = decimal 65 = binary 01000001. In 16-bit Unicode, this same value is stored as 00000000 01000001 (padded to 2 bytes).
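The place-value method used in this example can be sketched as a small Python function. `to_binary` is a hypothetical helper written for this lesson, not a built-in; Python's own `format(value, "08b")` gives the same result.

```python
def to_binary(value, width=8):
    """Convert a non-negative integer to a binary string using place values."""
    if not 0 <= value < 2 ** width:
        raise ValueError(f"{value} does not fit in {width} bits")
    bits = ""
    for power in range(width - 1, -1, -1):
        place = 2 ** power          # 128, 64, 32, ..., 1 when width is 8
        if place <= value:
            bits += "1"
            value -= place          # carry the remainder on to the next place
        else:
            bits += "0"
    return bits


print(to_binary(ord("A")))   # 01000001
```

Each pass through the loop is one comparison from the working above: if the place value fits, write 1 and subtract it; otherwise write 0.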

Example 2 — Storage for "Hello" in ASCII vs Unicode

Count the characters: H, e, l, l, o = 5 characters. Each character requires storage according to the encoding used.

ASCII: 1 byte per character → 5 × 1 = 5 bytes (= 40 bits)

16-bit Unicode: 2 bytes per character → 5 × 2 = 10 bytes (= 80 bits)

32-bit Unicode: 4 bytes per character → 5 × 4 = 20 bytes (= 160 bits)

✓

16-bit Unicode uses 2× the storage of ASCII; 32-bit Unicode uses 4×. For simple English words like "Hello", this extra storage provides no practical benefit — all five letters are within ASCII's range.
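The same calculation can be sketched in Python, first with the bytes-per-character figures from this lesson and then cross-checked against Python's real encoders. The dictionary name is illustrative. The "-be" encoder variants are used because the plain "utf-16" and "utf-32" encoders add a byte order mark at the start, which would inflate the count.

```python
text = "Hello"

# Fixed bytes-per-character figures used in this lesson.
BYTES_PER_CHAR = {"ASCII": 1, "16-bit Unicode": 2, "32-bit Unicode": 4}

for name, width in BYTES_PER_CHAR.items():
    total_bytes = len(text) * width
    print(f"{name}: {total_bytes} bytes = {total_bytes * 8} bits")

# Cross-check with real encoders (no byte order mark).
ascii_len = len(text.encode("ascii"))       # 5
utf16_len = len(text.encode("utf-16-be"))   # 10
utf32_len = len(text.encode("utf-32-be"))   # 20
print(ascii_len, utf16_len, utf32_len)
```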

Example 3 — Why Unicode was needed: Arabic and emoji

ASCII has 128 code points (0–127). The Arabic letter ب (ba) — the second letter of the Arabic alphabet — has no code point in ASCII at all. There is simply no slot available.

In Unicode, ب is assigned code point U+0628 (decimal 1576). Arabic letters occupy code points 1536–1791 — far beyond ASCII's range of 0–127.

The emoji 😀 (grinning face) is Unicode code point U+1F600 (decimal 128,512). ASCII cannot represent it at all. Its code point is also above 65,535, so it does not fit in a single 16-bit value: it needs 32-bit Unicode, or a special surrogate pair (two 16-bit values) in 16-bit Unicode.

✓

Unicode is the only encoding standard that can represent all of these characters in a single, consistent system. Any application that handles international text — web browsers, messaging apps, word processors — must use Unicode.
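The code points in this example can be checked in Python, which reports Unicode code points through `ord`. This is a sketch only; the characters are written as escape sequences so the script does not depend on the editor's own encoding.

```python
# 'A', the Arabic letter ba, and the grinning face emoji.
samples = ("A", "\u0628", "\U0001F600")

for char in samples:
    code_point = ord(char)
    print(
        f"U+{code_point:04X}",
        code_point,
        "fits in 7 bits:", code_point < 2 ** 7,
        "fits in 16 bits:", code_point < 2 ** 16,
    )

# A code point above 65,535 takes two 16-bit units (a surrogate pair) in UTF-16.
emoji_utf16_bytes = len("\U0001F600".encode("utf-16-be"))   # 4 bytes = two 16-bit units
emoji_utf32_bytes = len("\U0001F600".encode("utf-32-be"))   # 4 bytes = one 32-bit unit
print(emoji_utf16_bytes, emoji_utf32_bytes)
```

Only 'A' fits in 7 bits; the Arabic letter needs 16 bits, and the emoji does not fit in 16 bits at all.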

Now you try

What is the 8-bit binary representation of the letter 'B'? Show your working clearly.

⚠️ Common mistakes — examiner feedback

Vague storage answers: Writing "Unicode uses more memory than ASCII" scores zero. Always quantify — 16-bit Unicode uses twice as much storage (2×); 32-bit uses four times as much (4×).
Confusing bits and bytes: ASCII stores each character in 1 byte (8 bits), but the actual code uses only 7 of those bits. 16-bit Unicode uses 2 bytes (16 bits) per character. These are not the same thing — be precise.
Saying ASCII "can't represent any non-English characters": Extended ASCII (8-bit) does cover many Western European accented characters (é, ü, ñ). The correct claim is that standard 7-bit ASCII cannot, and neither standard covers non-Latin scripts like Arabic or Chinese.
Forgetting backwards compatibility in exam answers: When asked to explain Unicode's advantages, backwards compatibility with ASCII is a distinct, examinable point — don't omit it.

📝 Exam tip

"Explain why Unicode is used instead of ASCII" is a frequent 2-mark question. You must give two distinct points to score both marks. Safe, reliable answers are:

ASCII only has 128 code points, covering mainly the English alphabet — it cannot represent characters from other languages such as Arabic or Chinese.
Unicode can represent characters from every writing system in the world, using a single consistent standard with over 149,000 characters assigned so far.

A third useful point (for 3-mark variations): Unicode is backwards compatible with ASCII — the first 128 code points are identical, so existing ASCII files do not need to be converted.

Task Set A — Higher core

Work through all questions. Written questions are self-assessed.

What is the ASCII decimal code point for the uppercase letter 'A'?

What is the 7-bit binary representation of the letter 'Z' (decimal code point 90)?

Which statement about the original 7-bit ASCII standard is correct?

ASCII uses 8 bits and can represent 256 different characters

Incorrect — this describes extended ASCII (8-bit). The original ASCII standard is 7-bit with 128 characters (2⁷ = 128).

ASCII uses 7 bits and can represent 128 different characters

Correct — 7 bits allows 2⁷ = 128 unique code points (0 to 127), covering uppercase and lowercase English letters, digits, and basic punctuation.

ASCII uses 16 bits and supports all world languages

Incorrect — this describes Unicode (16-bit encoding). ASCII is 7-bit and only supports English characters.

ASCII and Unicode use different code points for the letter 'A'

Incorrect — Unicode is backwards compatible with ASCII. The first 128 Unicode code points are identical to ASCII, so 'A' is code point 65 in both standards.

How many bytes are needed to store the string "Hi" using ASCII?

How many bytes are needed to store the same string "Hi" using 16-bit Unicode?

A document contains 500 English characters. How many more bytes does 16-bit Unicode use compared to ASCII?

A7 — past paper style (2 marks)

Explain why ASCII cannot be used to store Arabic text.

A8 — past paper style (2 marks)

State one advantage and one disadvantage of Unicode compared to ASCII.

A text file storing 100 English characters is converted from ASCII to 32-bit Unicode. How many times larger does the file become?

2 times larger

Incorrect — 2× is the increase for 16-bit Unicode (2 bytes vs 1 byte). 32-bit uses 4 bytes per character.

3 times larger

Incorrect: none of the fixed-width encodings in this course uses 3 bytes per character. 32-bit Unicode uses 4 bytes.

4 times larger

Correct — ASCII uses 1 byte per character; 32-bit Unicode uses 4 bytes per character. 100 chars × 1 = 100 bytes → 100 chars × 4 = 400 bytes. 400 ÷ 100 = 4×.

8 times larger

Incorrect — 8× would mean 8 bytes per character. 32-bit Unicode uses 32 bits = 4 bytes per character.

A10 — past paper style (2 marks)

Explain why Unicode is backwards compatible with ASCII and why this was important when Unicode was introduced.

✅ Higher checkpoint — A7, A8, and A10 are the most exam-relevant. Confident explanations with two distinct points on each = ready for the SQA exam.

Task Set B — Extension · Beyond the specification

No auto-check — self-assess using the model answers.

The word "café" contains the accented character 'é'. Calculate the total bytes needed to store it in 16-bit Unicode, and explain why standard 7-bit ASCII cannot store this string at all.

UTF-8 uses 1 byte for ASCII characters and 2–4 bytes for others. Explain one advantage and one disadvantage of UTF-8 compared to fixed-width 16-bit Unicode for a document that is mostly English text.

A school database stores 1,200 pupil names, each averaging 12 characters, currently in ASCII. Calculate the storage in kilobytes for both ASCII and 16-bit Unicode (1 KB = 1,024 bytes). State the difference, and explain whether switching to Unicode is justified.

16-bit Unicode can directly represent 65,536 code points, but the full Unicode standard contains over 149,000 characters. Explain the limitation this creates and describe how it is addressed in practice.

📁 File this in OneNote under:
Higher Computing Science → Computer Systems → CS4

📌 Teacher notes (not for pupils)

Timing (120 min double):
5 min — warm up (CS3 recap + binary bridge), circulate
5 min — key vocabulary together
10 min — why encoding matters (demo: open a file in wrong encoding, show garbled text if possible)
5 min — ASCII: structure, code points, worked example 1 together
10 min — Unicode: scale diagram, context-clash diagram, backwards compatibility
5 min — storage calculation method (worked example 2 together)
5 min — now you try, then cold call
5 min — common mistakes and exam tip
25 min — tasks
5 min — cold call review on A7/A8 (most exam-relevant)

Watch for: pupils writing "uses more memory" without the multiplier — drill the ×2 and ×4 language. Also watch for confusion between 7-bit code points and 1-byte storage (ASCII characters are stored in 1 byte even though the code itself only uses 7 bits).

Demo idea: Open a terminal and type python3 -c "print(chr(65))" — instantly shows 'A'. Then print(ord('A')) to show 65. Then print(ord('😀')) to show 128512 — dramatically outside ASCII range. Pupils find this concretely convincing.

B4 (surrogate pairs) is genuinely beyond the SQA specification but stretches confident pupils well. It's worth mentioning to the group even if they don't attempt it: "this is what happens when a 16-bit standard tries to cover 1.1 million code points" is a memorable idea.