
Understanding Code Points and Code Pages
A code point is the unique number assigned to each character. For example, the code point 32 in ASCII represents the space character. Standard ASCII defines 128 code points (0 through 127), which was enough for early computing needs. To include national characters, the concept of code pages was introduced, whereby the upper 128 values of an 8-bit byte (128 through 255) were assigned to language-specific characters. However, this method had a significant drawback: the same code point could represent different characters depending on the code page being used. This ambiguity proved problematic for internationalization efforts. The inconsistency of code pages paved the way for a more robust solution in character encoding.
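
The ambiguity is easy to reproduce. The sketch below decodes one byte value from the upper half (233, or 0xE9) under three code pages that ship with Python; the helper name `decode_under` and the choice of code pages are illustrative assumptions, not anything prescribed by the text above.

```python
def decode_under(raw, code_pages):
    """Decode the same bytes under each named code page (hypothetical helper)."""
    return {name: raw.decode(name) for name in code_pages}


# Code point 32 is the space character in ASCII.
assert bytes([32]).decode("ascii") == " "

# One byte from the upper 128 values, read three different ways.
results = decode_under(bytes([0xE9]), ["cp1252", "cp437", "iso8859_7"])
for name, char in results.items():
    print(f"{name}: byte 0xE9 -> {char!r}")
# cp1252 (Western European Windows) gives 'é'
# cp437 (original IBM PC) gives 'Θ'
# iso8859_7 (Greek) gives 'ι'
```

The byte never changes; only the lookup table does, which is why text exchanged without its code page could not be read reliably.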
The Emergence of Unicode
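The solution to these challenges was the Unicode standard, which gives every character its own code point within a single code space of 1,114,112 values (U+0000 through U+10FFFF). Only a fraction of that space, on the order of 150,000 code points in recent versions, is currently assigned to characters. Importantly, the first 128 Unicode code points are identical to the standard ASCII set, and the first 256 match ISO 8859-1 (Latin-1), a widely used Western European code page. This compatibility means that text from legacy ASCII and Latin-1 systems keeps the same code points under Unicode.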
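
A quick way to check this compatibility is with Python's built-in `ord()` and `chr()`. This is a minimal sketch; it assumes the Western European code page referred to above is ISO 8859-1 (Latin-1), and the variable names are made up for illustration.

```python
import sys

# ord() returns a character's Unicode code point; chr() is the inverse.
assert ord(" ") == 32
assert chr(65) == "A"

# Every ASCII byte decodes to the Unicode character with the same number.
ascii_matches = all(bytes([n]).decode("ascii") == chr(n) for n in range(128))

# The same holds for all 256 byte values under Latin-1 (assumed code page).
latin1_matches = all(bytes([n]).decode("latin-1") == chr(n) for n in range(256))

# Size of the Unicode code space: U+0000 through U+10FFFF.
code_space_size = sys.maxunicode + 1

print(ascii_matches, latin1_matches, code_space_size)
```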
UTF-8: The Most Widely Adopted Unicode Encoding
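UTF-8 (Unicode Transformation Format, 8-bit) is the most commonly used encoding for Unicode characters. It is a variable-width encoding that uses one to four 8-bit bytes to represent each code point. This means:
- ASCII characters, including the unaccented Latin letters, use a single byte.
- Accented Latin letters and many non-Latin alphabets, such as Greek, Cyrillic, Hebrew, and Arabic, require two bytes (16 bits).
- Most Chinese, Japanese, and Korean ideographs occupy three bytes (24 bits).
- Code points above U+FFFF, such as emoji and rarer ideographs, take four bytes (32 bits).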
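
The variable width can be observed directly by encoding individual characters. The sample characters and the helper `utf8_length` below are illustrative choices rather than part of any standard API.

```python
def utf8_length(char):
    """Return how many bytes UTF-8 needs for one character (hypothetical helper)."""
    return len(char.encode("utf-8"))


samples = ["A", "é", "я", "中", "😀"]
for char in samples:
    encoded = char.encode("utf-8")
    print(f"U+{ord(char):04X} {char!r}: {len(encoded)} byte(s), hex {encoded.hex()}")
# 'A' takes 1 byte, 'é' and 'я' take 2, '中' takes 3, and '😀' takes 4.
```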

In Python 3, every string is a sequence of Unicode code points and UTF-8 is the default source-file encoding. This simplifies working with multi-language text and is a significant advantage for developers.
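
As a rough illustration, the sketch below moves between Python's `str` type (code points) and `bytes` type (encoded data). The sample text is arbitrary.

```python
text = "naïve café 東京"

# Encoding turns code points into UTF-8 bytes; decoding reverses it.
data = text.encode("utf-8")
assert data.decode("utf-8") == text

# len() counts code points for str and bytes for bytes, so the two differ
# whenever the text contains non-ASCII characters.
print(len(text), "code points,", len(data), "bytes")

# Decoding with the wrong codec fails loudly instead of silently guessing.
try:
    data.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print("decodable as ASCII:", ascii_ok)
```

Keeping text as `str` inside the program and converting to `bytes` only at the boundaries (files, sockets) is a common way to avoid mixing the two.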