Anonymous: Character encoding: ISO ... vs. Unicode

What did you find ridiculous in the posting you replied to?

Your claims, voiced with utter conviction and obviously wrong,

first, that "utter conviction" isn't really there

Which is why you keep contradicting Tim without offering any arguments.

and second, I still don't know what you considered technically wrong.

character set

A group of unique symbols used for display and printing.

What is Unicode?

Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

Conclusion: Unicode is not an encoding but a so-called character set.
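To make the "unique number for every character" point concrete, here is a small sketch in Python. The helper name `describe` and the sample characters are my own choice for illustration, not something taken from the quoted texts.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point (U+XXXX) and the Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Sample characters (chosen for illustration): A, a-umlaut, euro sign.
samples = "A\u00e4\u20ac"

for ch in samples:
    # ord() yields the number Unicode assigns to the character,
    # independent of how the character is later stored as bytes.
    print(ch, ord(ch), describe(ch))
```

Note that no byte representation appears anywhere in this snippet; it only deals with characters and their numbers.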

5.1 The Document Character Set

To promote interoperability, SGML requires that each application (including HTML) specify its document character set. A document character set consists of:
A Repertoire: A set of abstract characters, such as the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.
Code positions: A set of integer references to characters in the repertoire.

Each SGML document (including each HTML document) is a sequence of characters from the repertoire. Computer systems identify each character by its code position; for example, in the ASCII character set, code positions 65, 66, and 67 refer to the characters 'A', 'B', and 'C', respectively.

The ASCII character set is not sufficient for a global information system such as the Web, so HTML uses the much more complete character set called the Universal Character Set (UCS), defined in [ISO10646]. This standard defines a repertoire of thousands of characters used by communities all over the world.

The character set defined in [ISO10646] is character-by-character equivalent to Unicode ([UNICODE]). Both of these standards are updated from time to time with new characters, and the amendments should be consulted at the respective Web sites. In the current specification, "[ISO10646]" is used to refer to the document character set while "[UNICODE]" is reserved for references to the Unicode bidirectional text algorithm.

The document character set, however, does not suffice to allow user agents to correctly interpret HTML documents as they are typically exchanged -- encoded as a sequence of bytes in a file or during a network transmission. User agents must also know the specific character encoding that was used to transform the document character stream into a byte stream.

So we have learned: HTML uses Unicode as its character set, independent of the character encoding.
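A minimal sketch of that distinction, assuming Python's built-in codecs: the character keeps one and the same code position, while the bytes depend on the encoding chosen. The sample character and the three encodings are picked only for illustration.

```python
# One character from the document character set: a-umlaut, code position 228.
char = "\u00e4"

# The same character serialized with three different character encodings.
encoded = {name: char.encode(name) for name in ("iso-8859-1", "utf-8", "utf-16-be")}

for name, raw in encoded.items():
    # The byte sequences differ from encoding to encoding ...
    print(f"{name:10} -> {raw.hex(' ')}")
    # ... but decoding always leads back to the same code position.
    assert ord(raw.decode(name)) == 228
```

The number 228 belongs to the character set; the byte sequences belong to the respective encoding.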

5.2 Character encodings

What this specification calls a character encoding is known by different names in other specifications (which may cause some confusion). However, the concept is largely the same across the Internet. Also, protocol headers, attributes, and parameters referring to character encodings share the same name -- "charset" -- and use the same values from the [IANA] registry (see [CHARSETS] for a complete list).

The "charset" parameter identifies a character encoding, which is a method of converting a sequence of bytes into a sequence of characters. This conversion fits naturally with the scheme of Web activity: servers send HTML documents to user agents as a stream of bytes; user agents interpret them as a sequence of characters. The conversion method can range from simple one-to-one correspondence to complex switching schemes or algorithms.

A simple one-byte-per-character encoding technique is not sufficient for text strings over a character repertoire as large as [ISO10646]. There are several different encodings of parts of [ISO10646] in addition to encodings of the entire character set (such as UCS-4).

So we have learned: we have to declare the character encoding by means of the charset parameter.
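Roughly, a user agent could proceed like this; a sketch using only the Python standard library. The function name `charset_from_content_type`, the fallback to ISO-8859-1 when no charset is given, and the sample header and body are assumptions for the example, not part of the quoted specification.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "iso-8859-1") -> str:
    """Extract the charset parameter from a Content-Type header value.

    The default used when the parameter is missing is an assumption
    made for this sketch.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the parameter in lower case, or None.
    return msg.get_content_charset() or default


# What the server sends: a header plus a stream of bytes.
header = "text/html; charset=ISO-8859-1"
body = b"Gr\xfc\xdfe"

charset = charset_from_content_type(header)
text = body.decode(charset)  # bytes -> characters, as section 5.2 describes
print(charset, text)

# With the wrong label the very same bytes cannot be interpreted correctly.
try:
    body.decode("utf-8")
    wrong_label_failed = False
except UnicodeDecodeError:
    wrong_label_failed = True
```

Without the correct label the receiver has only bytes, not characters.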

5.2.1 Choosing an encoding

Authoring tools (e.g., text editors) may encode HTML documents in the character encoding of their choice, and the choice largely depends on the conventions used by the system software. These tools may employ any convenient encoding that covers most of the characters contained in the document, provided the encoding is correctly labeled. Occasional characters that fall outside this encoding may still be represented by character references. These always refer to the document character set, not the character encoding.

Servers and proxies may change a character encoding (called transcoding) on the fly to meet the requests of user agents (see section 14.2 of [RFC2616], the "Accept-Charset" HTTP request header). Servers and proxies do not have to serve a document in a character encoding that covers the entire document character set.

Commonly used character encodings on the Web include ISO-8859-1 (also referred to as "Latin-1"; usable for most Western European languages), ISO-8859-5 (which supports Cyrillic), SHIFT_JIS (a Japanese encoding), EUC-JP (another Japanese encoding), and UTF-8 (an encoding of ISO 10646 using a different number of bytes for different characters). Names for character encodings are case-insensitive, so that for example "SHIFT_JIS", "Shift_JIS", and "shift_jis" are equivalent.

This specification does not mandate which character encodings a user agent must support.

Conforming user agents must correctly map to ISO 10646 all characters in any character encodings that they recognize (or they must behave as if they did).
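Three statements from the quoted section can be checked with a short sketch, assuming Python's codec registry as a stand-in for a user agent: encoding names are case-insensitive, UTF-8 uses a different number of bytes for different characters, and bytes from different encodings map to the same ISO 10646 code position. The sample characters are chosen for illustration.

```python
import codecs

# 1. Encoding names are case-insensitive: all spellings resolve to one codec.
spellings = ["SHIFT_JIS", "Shift_JIS", "shift_jis"]
canonical = {codecs.lookup(name).name for name in spellings}
print(canonical)

# 2. UTF-8 uses between one and four bytes per character.
#    Samples: A, a-umlaut, euro sign, an emoji outside the BMP.
utf8_lengths = [len(ch.encode("utf-8")) for ch in "A\u00e4\u20ac\U0001F600"]
print(utf8_lengths)

# 3. The Cyrillic letter "I" arrives as different bytes in ISO-8859-5 and
#    UTF-8, but both are mapped to the same code position, U+0418.
from_iso_8859_5 = b"\xb8".decode("iso-8859-5")
from_utf_8 = b"\xd0\x98".decode("utf-8")
print(hex(ord(from_iso_8859_5)), hex(ord(from_utf_8)))
```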

So we have learned: we may use any character encoding. If we want to use a character that lies outside that _character encoding_, we can use numeric character references, which _always_ refer to the _character set_ of the document, which, as we now know, is Unicode. They do _not_ refer to the character encoding. That is why Christian is able to include arbitrary characters in the forum via their Unicode number.
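A small sketch of this, assuming Python's standard library; the sample text is invented. The document is stored as ISO-8859-1, the euro sign does not exist in that encoding, and it therefore travels as a numeric character reference carrying its Unicode number.

```python
import html

# Sample text: u-umlaut and sharp s exist in ISO-8859-1, the euro sign does not.
text = "Gr\u00fc\u00dfe: 5 \u20ac"

# Characters outside the encoding become numeric character references.
raw = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(raw)

# The receiver decodes the bytes with the declared encoding and then resolves
# the reference against the document character set (Unicode), not against
# ISO-8859-1.
restored = html.unescape(raw.decode("iso-8859-1"))
print(restored == text)

# Decimal and hexadecimal references name the same code position.
same = html.unescape("&#8364;") == html.unescape("&#x20AC;") == "\u20ac"
```

The number 8364 in the reference is the Unicode code position of the euro sign; it stays the same regardless of which encoding the document is served in.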

Bottom line: your claim that Unicode is a character encoding was wrong. The Unicode standard does contain several character encodings, but Unicode itself is a character set, not an encoding. You can look it up in the relevant standards.

You have a penchant for pointless postings anyway.

Thanks. Likewise.

My posts are often funny.   ;-)

Not in the way you would like them to be.