From 554fd8c5195424bdbcabf5de30fdc183aba391bd Mon Sep 17 00:00:00 2001 From: upstream source tree Date: Sun, 15 Mar 2015 20:14:05 -0400 Subject: obtained gcc-4.6.4.tar.bz2 from upstream website; verified gcc-4.6.4.tar.bz2.sig; imported gcc-4.6.4 source tree from verified upstream tarball. downloading a git-generated archive based on the 'upstream' tag should provide you with a source tree that is binary identical to the one extracted from the above tarball. if you have obtained the source via the command 'git clone', however, do note that line-endings of files in your working directory might differ from line-endings of the respective files in the upstream repository. --- libstdc++-v3/doc/html/manual/facets.html | 728 +++++++++++++++++++++++++++++++ 1 file changed, 728 insertions(+) create mode 100644 libstdc++-v3/doc/html/manual/facets.html (limited to 'libstdc++-v3/doc/html/manual/facets.html') diff --git a/libstdc++-v3/doc/html/manual/facets.html b/libstdc++-v3/doc/html/manual/facets.html new file mode 100644 index 000000000..cfe89bc0d --- /dev/null +++ b/libstdc++-v3/doc/html/manual/facets.html @@ -0,0 +1,728 @@ + + +Facets

+The standard class codecvt attempts to address conversions between +different character encoding schemes. In particular, the standard +attempts to detail conversions between the implementation-defined wide +characters (hereafter referred to as wchar_t) and the standard type +char that is so beloved in classic C (which can now be +referred to as narrow characters.) This document attempts to describe +how the GNU libstdc++ implementation deals with the conversion between +wide and narrow characters, and also presents a framework for dealing +with the huge number of other encodings that iconv can convert, +including Unicode and UTF8. Design issues and requirements are +addressed, and examples of correct usage for both the required +specializations for wide and narrow characters and the +implementation-provided extended functionality are given. +

+Around page 425 of the C++ Standard, this charming heading comes into view: +

+The text around the codecvt definition gives some clues: +

+Hmm. So, in some unspecified way, Unicode encodings and +translations between other character sets should be handled by this +class. +

+Ah ha! Another clue... +

+At this point, a couple points become clear: +

+One: The standard clearly implies that attempts to add non-required +(yet useful and widely used) conversions need to do so through the +third template parameter, stateT.

+Two: The required conversions, by specifying mbstate_t as the third +template parameter, imply an implementation strategy that is mostly +(or wholly) based on the underlying C library, and the functions +mcsrtombs and wcsrtombs in particular.

+ Probably the most frequently asked question about code conversion + is: "So dudes, what's the deal with Unicode strings?" + The dude part is optional, but apparently the usefulness of + Unicode strings is pretty widely appreciated. Sadly, this specific + encoding (And other useful encodings like UTF8, UCS4, ISO 8859-10, + etc etc etc) are not mentioned in the C++ standard. +

+ A couple of comments: +

+ The thought that all one needs to convert between two arbitrary + codesets is two types and some kind of state argument is + unfortunate. In particular, encodings may be stateless. The naming + of the third parameter as stateT is unfortunate, as what is really + needed is some kind of generalized type that accounts for the + issues that abstract encodings will need. The minimum information + that is required includes: +

+In addition, multi-threaded and multi-locale environments also impact +the design and requirements for code conversions. In particular, they +affect the required specialization codecvt<wchar_t, char, mbstate_t> +when implemented using standard "C" functions. +

+Three problems arise, one big, one of medium importance, and one small. +

+First, the small: mcsrtombs and wcsrtombs may not be multithread-safe +on all systems required by the GNU tools. For GNU/Linux and glibc, +this is not an issue. +

+Of medium concern, in the grand scope of things, is that the functions +used to implement this specialization work on null-terminated +strings. Buffers, especially file buffers, may not be null-terminated, +thus giving conversions that end prematurely or are otherwise +incorrect. Yikes! +

+The last, and fundamental problem, is the assumption of a global +locale for all the "C" functions referenced above. For something like +C++ iostreams (where codecvt is explicitly used) the notion of +multiple locales is fundamental. In practice, most users may not run +into this limitation. However, as a quality of implementation issue, +the GNU C++ library would like to offer a solution that allows +multiple locales and or simultaneous usage with computationally +correct results. In short, libstdc++ is trying to offer, as an +option, a high-quality implementation, damn the additional complexity! +

+For the required specialization codecvt<wchar_t, char, mbstate_t> , +conversions are made between the internal character set (always UCS4 +on GNU/Linux) and whatever the currently selected locale for the +LC_CTYPE category implements. +

+The two required specializations are implemented as follows: +

+ +codecvt<char, char, mbstate_t> + +

+This is a degenerate (i.e., does nothing) specialization. Implementing +this was a piece of cake. +

+ +codecvt<char, wchar_t, mbstate_t> + +

+This specialization, by specifying all the template parameters, pretty +much ties the hands of implementors. As such, the implementation is +straightforward, involving mcsrtombs for the conversions between char +to wchar_t and wcsrtombs for conversions between wchar_t and char. +

+Neither of these two required specializations deals with Unicode +characters. As such, libstdc++ implements a partial specialization +of the codecvt class with and iconv wrapper class, encoding_state as the +third template parameter. +

+This implementation should be standards conformant. First of all, the +standard explicitly points out that instantiations on the third +template parameter, stateT, are the proper way to implement +non-required conversions. Second of all, the standard says (in Chapter +17) that partial specializations of required classes are a-ok. Third +of all, the requirements for the stateT type elsewhere in the standard +(see 21.1.2 traits typedefs) only indicate that this type be copy +constructible. +

+As such, the type encoding_state is defined as a non-templatized, POD +type to be used as the third type of a codecvt instantiation. This +type is just a wrapper class for iconv, and provides an easy interface +to iconv functionality. +

+There are two constructors for encoding_state: +

+ +encoding_state() : __in_desc(0), __out_desc(0) + +

+This default constructor sets the internal encoding to some default +(currently UCS4) and the external encoding to whatever is returned by +nl_langinfo(CODESET). +

+ +encoding_state(const char* __int, const char* __ext) + +

+This constructor takes as parameters string literals that indicate the +desired internal and external encoding. There are no defaults for +either argument. +

+One of the issues with iconv is that the string literals identifying +conversions are not standardized. Because of this, the thought of +mandating and or enforcing some set of pre-determined valid +identifiers seems iffy: thus, a more practical (and non-migraine +inducing) strategy was implemented: end-users can specify any string +(subject to a pre-determined length qualifier, currently 32 bytes) for +encodings. It is up to the user to make sure that these strings are +valid on the target system. +

+ +void +_M_init() + +

+Strangely enough, this member function attempts to open conversion +descriptors for a given encoding_state object. If the conversion +descriptors are not valid, the conversion descriptors returned will +not be valid and the resulting calls to the codecvt conversion +functions will return error. +

+ +bool +_M_good() + +

+Provides a way to see if the given encoding_state object has been +properly initialized. If the string literals describing the desired +internal and external encoding are not valid, initialization will +fail, and this will return false. If the internal and external +encodings are valid, but iconv_open could not allocate conversion +descriptors, this will also return false. Otherwise, the object is +ready to convert and will return true. +

+ +encoding_state(const encoding_state&) + +

+As iconv allocates memory and sets up conversion descriptors, the copy +constructor can only copy the member data pertaining to the internal +and external code conversions, and not the conversion descriptors +themselves. +

+Definitions for all the required codecvt member functions are provided +for this specialization, and usage of codecvt<internal character type, +external character type, encoding_state> is consistent with other +codecvt usage. +

+The std::messages facet implements message retrieval functionality +equivalent to Java's java.text.MessageFormat .using either GNU gettext +or IEEE 1003.1-200 functions. +

+The std::messages facet is probably the most vaguely defined facet in +the standard library. It's assumed that this facility was built into +the standard library in order to convert string literals from one +locale to the other. For instance, converting the "C" locale's +const char* c = "please" to a German-localized "bitte" +during program execution. +

+This class has three public member functions, which directly +correspond to three protected virtual member functions. +

+The public member functions are: +

+catalog open(const string&, const locale&) const +

+string_type get(catalog, int, int, const string_type&) const +

+void close(catalog) const +

+While the virtual functions are: +

+catalog do_open(const string&, const locale&) const +

+string_type do_get(catalog, int, int, const string_type&) const +

+void do_close(catalog) const +

+A couple of notes on the standard. +

+First, why is messages_base::catalog specified as a typedef +to int? This makes sense for implementations that use +catopen, but not for others. Fortunately, it's not heavily +used and so only a minor irritant. +

+Second, by making the member functions const, it is +impossible to save state in them. Thus, storing away information used +in the 'open' member function for use in 'get' is impossible. This is +unfortunate. +

+The 'open' member function in particular seems to be oddly +designed. The signature seems quite peculiar. Why specify a const +string& argument, for instance, instead of just const +char*? Or, why specify a const locale& argument that is +to be used in the 'get' member function? How, exactly, is this locale +argument useful? What was the intent? It might make sense if a locale +argument was associated with a given default message string in the +'open' member function, for instance. Quite murky and unclear, on +reflection. +

+Lastly, it seems odd that messages, which explicitly require code +conversion, don't use the codecvt facet. Because the messages facet +has only one template parameter, it is assumed that ctype, and not +codecvt, is to be used to convert between character sets. +

+It is implicitly assumed that the locale for the default message +string in 'get' is in the "C" locale. Thus, all source code is assumed +to be written in English, so translations are always from "en_US" to +other, explicitly named locales. +

+ This is a relatively simple class, on the face of it. The standard + specifies very little in concrete terms, so generic + implementations that are conforming yet do very little are the + norm. Adding functionality that would be useful to programmers and + comparable to Java's java.text.MessageFormat takes a bit of work, + and is highly dependent on the capabilities of the underlying + operating system. +

+ Three different mechanisms have been provided, selectable via + configure flags: +

+A new, standards-conformant non-virtual member function signature was +added for 'open' so that a directory could be specified with a given +message catalog. This simplifies calling conventions for the gnu +model. +

-- cgit v1.2.3