Facets

+ +-1- The class codecvt<internT,externT,stateT> is for use when +converting from one codeset to another, such as from wide characters +to multibyte characters, between wide character encodings such as +Unicode and EUC. + +

+Hmm. So, in some unspecified way, Unicode encodings and +translations between other character sets should be handled by this +class. +

+ +-2- The stateT argument selects the pair of codesets being mapped between. + +

+Ah ha! Another clue... +

+ +-3- The instantiations required in the Table ?? +(lib.locale.category), namely codecvt<wchar_t,char,mbstate_t> and +codecvt<char,char,mbstate_t>, convert the implementation-defined +native character set. codecvt<char,char,mbstate_t> implements a +degenerate conversion; it does not convert at +all. codecvt<wchar_t,char,mbstate_t> converts between the native +character sets for tiny and wide characters. Instantiations on +mbstate_t perform conversion between encodings known to the library +implementor. Other encodings can be converted by specializing on a +user-defined stateT type. The stateT object can contain any state that +is useful to communicate to or from the specialized do_convert member. + +
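
The "degenerate conversion" described in paragraph 3 can be observed directly. The following is a minimal sketch, not taken from the library sources: the helper name degenerate_out is hypothetical, and only standard facet members are used. It asks the required codecvt<char,char,mbstate_t> facet of the classic locale to convert a string and reports the result code, which is expected to be noconv.

```cpp
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: run "out" on the required char/char facet and
// report the result code instead of the (untouched) output buffer.
std::codecvt_base::result degenerate_out(const std::string& text)
{
  typedef std::codecvt<char, char, std::mbstate_t> noop_codecvt;
  const noop_codecvt& cvt =
    std::use_facet<noop_codecvt>(std::locale::classic());

  std::mbstate_t state = std::mbstate_t();      // initial conversion state
  std::string buffer(text.size() + 1, '\0');    // room for the "output"
  const char* from_next = 0;
  char* to_next = 0;

  return cvt.out(state, text.data(), text.data() + text.size(), from_next,
                 &buffer[0], &buffer[0] + buffer.size(), to_next);
}
```

A caller that receives noconv should use the input range as-is, since nothing was written to the output buffer.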

+At this point, a couple points become clear: +

+One: The standard clearly implies that attempts to add non-required +(yet useful and widely used) conversions need to do so through the +third template parameter, stateT.

+Two: The required conversions, by specifying mbstate_t as the third +template parameter, imply an implementation strategy that is mostly +(or wholly) based on the underlying C library, and the functions +mbsrtowcs and wcsrtombs in particular.
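
To make the second point concrete, here is a small sketch of the two C library functions used on their own, outside any facet. The helper names to_wide and to_narrow are hypothetical, and the conversion follows whatever C locale is current (the "C" locale unless the program has called setlocale). It shows the kind of call an mbstate_t-based facet could wrap, not how any particular library does it.

```cpp
#include <cassert>
#include <climits>
#include <cstring>
#include <cwchar>
#include <string>

// Hypothetical helper: widen a narrow string with mbsrtowcs.
std::wstring to_wide(const std::string& text)
{
  std::wstring out(text.size(), L'\0');   // at most one wchar_t per byte
  if (out.empty()) return out;

  std::mbstate_t state;
  std::memset(&state, 0, sizeof(state));  // initial conversion state
  const char* src = text.c_str();
  std::size_t n = std::mbsrtowcs(&out[0], &src, out.size(), &state);
  assert(n != static_cast<std::size_t>(-1));  // -1 means invalid input
  out.resize(n);
  return out;
}

// Hypothetical helper: narrow a wide string with wcsrtombs.
std::string to_narrow(const std::wstring& text)
{
  std::string out(text.size() * MB_LEN_MAX, '\0');  // worst-case size
  if (out.empty()) return out;

  std::mbstate_t state;
  std::memset(&state, 0, sizeof(state));
  const wchar_t* src = text.c_str();
  std::size_t n = std::wcsrtombs(&out[0], &src, out.size(), &state);
  assert(n != static_cast<std::size_t>(-1));  // -1 means unconvertible
  out.resize(n);
  return out;
}
```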

Design

+  typedef codecvt_base::result                  result;
+  typedef unsigned short                        unicode_t;
+  typedef unicode_t                             int_type;
+  typedef char                                  ext_type;
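+  typedef encoding_state                        state_type;
+  typedef codecvt<int_type, ext_type, state_type> unicode_codecvt;
+  // Assumed: relies on the library's generic char_traits template.
+  typedef char_traits<int_type>                 int_traits;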
+
+  const ext_type*       e_lit = "black pearl jasmine tea";
+  int                   size = strlen(e_lit);
+  int_type              i_lit_base[24] =
+  { 25088, 27648, 24832, 25344, 27392, 8192, 28672, 25856, 24832, 29184,
+    27648, 8192, 27136, 24832, 29440, 27904, 26880, 28160, 25856, 8192, 29696,
+    25856, 24832, 2560
+  };
+  const int_type*       i_lit = i_lit_base;
+  const ext_type*       efrom_next;
+  const int_type*       ifrom_next;
+  ext_type*             e_arr = new ext_type[size + 1];
+  ext_type*             eto_next;
+  int_type*             i_arr = new int_type[size + 1];
+  int_type*             ito_next;
+
+  // construct a locale object with the specialized facet.
+  locale                loc(locale::classic(), new unicode_codecvt);
+  // sanity check the constructed locale has the specialized facet.
+  VERIFY( has_facet<unicode_codecvt>(loc) );
+  const unicode_codecvt& cvt = use_facet<unicode_codecvt>(loc);
+  // convert between const char* and unicode strings
+  unicode_codecvt::state_type state01("UNICODE", "ISO_8859-1");
+  initialize_state(state01);
+  result r1 = cvt.in(state01, e_lit, e_lit + size, efrom_next,
+                     i_arr, i_arr + size, ito_next);
+  VERIFY( r1 == codecvt_base::ok );
+  VERIFY( !int_traits::compare(i_arr, i_lit, size) );
+  VERIFY( efrom_next == e_lit + size );
+  VERIFY( ito_next == i_arr + size );
+
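
The test above depends on the encoding_state extension. For comparison, the same calling pattern can be sketched with only the required codecvt<wchar_t,char,mbstate_t> facet. The function name widen_with_facet is hypothetical, and the sketch assumes input that the classic locale can convert (plain ASCII here).

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: widen text through the required wchar_t/char facet.
std::wstring widen_with_facet(const std::string& text)
{
  typedef std::codecvt<wchar_t, char, std::mbstate_t> wide_codecvt;
  const wide_codecvt& cvt =
    std::use_facet<wide_codecvt>(std::locale::classic());

  std::wstring out(text.size(), L'\0');   // at most one wchar_t per byte
  if (text.empty()) return out;

  std::mbstate_t state = std::mbstate_t();
  const char* from_next = 0;
  wchar_t* to_next = 0;
  std::codecvt_base::result r =
    cvt.in(state, text.data(), text.data() + text.size(), from_next,
           &out[0], &out[0] + out.size(), to_next);
  if (r != std::codecvt_base::ok)
    throw std::runtime_error("conversion did not complete");

  out.resize(to_next - &out[0]);          // keep only what was written
  return out;
}
```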

Future

+ a. things that are sketchy, or remain unimplemented: + do_encoding, max_length and length member functions + are only weakly implemented. I have no idea how to do + this correctly, and in a generic manner. Nathan? +
+ b. conversions involving std::string +
+ c. conversions involving std::filebuf and std::ostream +
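
For reference when working on item a, the values these members report for the degenerate codecvt<char,char,mbstate_t> facet can be collected as below. This is only a sketch: the struct and function names are hypothetical, and it says nothing about how the members should behave for user-defined state types.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical record of what the char/char facet reports about itself.
struct codecvt_summary
{
  int  encoding;
  int  max_length;
  bool always_noconv;
  int  length;
};

codecvt_summary summarize_char_codecvt(const std::string& text,
                                       std::size_t max)
{
  typedef std::codecvt<char, char, std::mbstate_t> noop_codecvt;
  const noop_codecvt& cvt =
    std::use_facet<noop_codecvt>(std::locale::classic());

  std::mbstate_t state = std::mbstate_t();
  codecvt_summary s;
  s.encoding      = cvt.encoding();        // external chars per internal char
  s.max_length    = cvt.max_length();      // worst case for one internal char
  s.always_noconv = cvt.always_noconv();   // true when nothing is converted
  s.length        = cvt.length(state, text.data(),
                               text.data() + text.size(), max);
  return s;
}
```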

Bibliography

+ A brief description of Normative Addendum 1. Clive Feather. Extended Character Sets.

+ The Unicode HOWTO. Bruno Haible.

+ UTF-8 and Unicode FAQ for Unix/Linux. Markus Kuhn.

messages

+The std::messages facet implements message retrieval functionality +equivalent to Java's java.text.MessageFormat, using either GNU gettext +or IEEE 1003.1-200x functions. +

Requirements

+The std::messages facet is probably the most vaguely defined facet in +the standard library. It's assumed that this facility was built into +the standard library in order to convert string literals from one +locale to the other. For instance, converting the "C" locale's +const char* c = "please" to a German-localized "bitte" +during program execution. +

+22.2.7.1 - Template class messages [lib.locale.messages] +

+This class has three public member functions, which directly +correspond to three protected virtual member functions. +

+The public member functions are: +

+catalog open(const string&, const locale&) const +

+string_type get(catalog, int, int, const string_type&) const +

+void close(catalog) const +

+While the virtual functions are: +

+catalog do_open(const string&, const locale&) const +

+ +-1- Returns: A value that may be passed to get() to retrieve a +message, from the message catalog identified by the string name +according to an implementation-defined mapping. The result can be used +until it is passed to close(). Returns a value less than 0 if no such +catalog can be opened. + +

+string_type do_get(catalog, int, int, const string_type&) const +

+ +-3- Requires: A catalog cat obtained from open() and not yet closed. +-4- Returns: A message identified by arguments set, msgid, and dfault, +according to an implementation-defined mapping. If no such message can +be found, returns dfault. + +

+void do_close(catalog) const +

+ +-5- Requires: A catalog cat obtained from open() and not yet closed. +-6- Effects: Releases unspecified resources associated with cat. +-7- Notes: The limit on such resources, if any, is implementation-defined. + +
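
The correspondence between the public members and the protected virtuals can be sketched with a small derived facet. Everything here is hypothetical: the class table_messages, its single in-memory catalog named "demo", and the choice to use the default string as the lookup key are illustrative assumptions, not library behavior.

```cpp
#include <cstddef>
#include <locale>
#include <map>
#include <string>

// Hypothetical facet with one in-memory catalog called "demo".
class table_messages : public std::messages<char>
{
public:
  explicit table_messages(std::size_t refs = 0)
  : std::messages<char>(refs)
  {
    table_["please"] = "bitte";
    table_["thank you"] = "danke";
  }

protected:
  // Reached through the public open(); negative means "cannot open".
  virtual catalog
  do_open(const std::basic_string<char>& name, const std::locale&) const
  { return name == "demo" ? 0 : -1; }

  // Reached through the public get(); falls back to dfault.
  virtual string_type
  do_get(catalog cat, int, int, const string_type& dfault) const
  {
    if (cat != 0) return dfault;
    std::map<string_type, string_type>::const_iterator it =
      table_.find(dfault);
    return it == table_.end() ? dfault : it->second;
  }

  // Reached through the public close(); nothing to release here.
  virtual void do_close(catalog) const { }

private:
  std::map<string_type, string_type> table_;
};

// Hypothetical driver: install the facet and go through the public API.
std::string translate(const std::string& catalog_name,
                      const std::string& text)
{
  std::locale loc(std::locale::classic(), new table_messages);
  const std::messages<char>& m =
    std::use_facet<std::messages<char> >(loc);

  std::messages<char>::catalog cat = m.open(catalog_name, loc);
  if (cat < 0) return text;               // no such catalog
  std::string result = m.get(cat, 0, 0, text);
  m.close(cat);
  return result;
}
```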

Design

+A couple of notes on the standard. +

+First, why is messages_base::catalog specified as a typedef +to int? This makes sense for implementations that use +catopen, but not for others. Fortunately, it's not heavily +used and so only a minor irritant. +

+Second, by making the member functions const, it is +impossible to save state in them. Thus, storing away information used +in the 'open' member function for use in 'get' is impossible. This is +unfortunate. +

+The 'open' member function in particular seems to be oddly +designed. The signature seems quite peculiar. Why specify a const +string& argument, for instance, instead of just const +char*? Or, why specify a const locale& argument that is +to be used in the 'get' member function? How, exactly, is this locale +argument useful? What was the intent? It might make sense if a locale +argument was associated with a given default message string in the +'open' member function, for instance. Quite murky and unclear, on +reflection. +

+Lastly, it seems odd that messages, which explicitly require code +conversion, don't use the codecvt facet. Because the messages facet +has only one template parameter, it is assumed that ctype, and not +codecvt, is to be used to convert between character sets. +

+It is implicitly assumed that the default message string passed to +'get' is written for the "C" locale. Thus, all source code is assumed +to be written in English, so translations are always from "en_US" to +other, explicitly named locales. +
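
One practical consequence of this assumption is that the English default string doubles as the fallback. A minimal sketch, assuming a catalog name that does not exist on the system (the name "no-such-catalog" and the helper lookup_or_default are hypothetical): whether 'open' reports failure or not, the caller ends up with the default string.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: look a message up, or hand back the default.
std::string lookup_or_default(const std::locale& loc,
                              const std::string& catalog_name,
                              const std::string& dfault)
{
  const std::messages<char>& m =
    std::use_facet<std::messages<char> >(loc);

  std::messages<char>::catalog cat = m.open(catalog_name, loc);
  if (cat < 0) return dfault;             // open reported failure

  std::string result = m.get(cat, 0, 0, dfault);
  m.close(cat);
  return result;                          // dfault if nothing was found
}
```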

Implementation

Models

+ This is a relatively simple class, on the face of it. The standard + specifies very little in concrete terms, so generic + implementations that are conforming yet do very little are the + norm. Adding functionality that would be useful to programmers and + comparable to Java's java.text.MessageFormat takes a bit of work, + and is highly dependent on the capabilities of the underlying + operating system. +

+ Three different mechanisms have been provided, selectable via + configure flags: +

+ generic +
+ This model does very little, and is what is used by default. +
+ gnu +
+ The gnu model is complete and fully tested. It's based on the GNU gettext package, which is part of glibc. It uses the functions textdomain, bindtextdomain, gettext to implement full functionality. Creating message catalogs is a relatively straightforward process and is lightly documented below, and fully documented in gettext's distributed documentation. +
+ ieee_1003.1-200x +
+ This is a complete, though untested, implementation based on + the IEEE standard. The functions catopen, catgets, + catclose are used to retrieve locale-specific messages + given the appropriate message catalogs that have been + constructed for their use. Note, the script + po2msg.sed that is part of the gettext distribution can + convert gettext catalogs into catalogs that + catopen can use. +
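
For orientation, the three gettext functions named under the gnu model can be exercised directly from C++. This is a sketch of the C API only, assuming a C library that ships libintl.h (glibc does); the helper name lookup_with_gettext is hypothetical, and this is not how the facet itself is implemented.

```cpp
#include <libintl.h>   // assumption: gettext-capable C library (e.g. glibc)
#include <string>

// Hypothetical helper: bind a domain to a directory and look up msgid.
// Note that textdomain() changes process-wide state.
std::string lookup_with_gettext(const char* domain, const char* dir,
                                const char* msgid)
{
  if (bindtextdomain(domain, dir) == 0)   // where the .mo files live
    return msgid;
  if (textdomain(domain) == 0)            // make it the default domain
    return msgid;
  return gettext(msgid);                  // msgid itself if untranslated
}
```

When no catalog matches, or the process is still in the "C" locale, gettext returns its argument unchanged, so the English string is the natural fallback.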

+A new, standards-conformant non-virtual member function signature was +added for 'open' so that a directory could be specified with a given +message catalog. This simplifies calling conventions for the gnu +model. +

The GNU Model

+ The messages facet, because it is retrieving and converting between character sets, depends on the ctype and perhaps the codecvt facet in a given locale. In addition, underlying "C" library locale support is necessary for more than just the LC_MESSAGES mask: LC_CTYPE is also necessary. To avoid any unpleasantness, all bits of the "C" mask (i.e. LC_ALL) are set before retrieving messages. +
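
Setting LC_ALL around a lookup changes process-wide state, so it is worth restoring afterwards. A minimal sketch of that pattern follows, assuming nothing about the library internals: the class c_locale_guard and the helper current_c_locale are hypothetical names.

```cpp
#include <clocale>
#include <string>

// Hypothetical guard: switch every "C" locale category for one scope,
// then restore the previous setting.
class c_locale_guard
{
public:
  explicit c_locale_guard(const char* name)
  {
    const char* current = std::setlocale(LC_ALL, 0);   // query only
    saved_ = current ? current : "C";
    switched_ = std::setlocale(LC_ALL, name) != 0;     // 0 on failure
  }

  ~c_locale_guard() { std::setlocale(LC_ALL, saved_.c_str()); }

  bool switched() const { return switched_; }

private:
  std::string saved_;
  bool switched_;

  c_locale_guard(const c_locale_guard&);               // not copyable
  c_locale_guard& operator=(const c_locale_guard&);
};

// Hypothetical helper: name of the current "C" locale.
std::string current_c_locale()
{
  const char* name = std::setlocale(LC_ALL, 0);
  return name ? name : "";
}
```

Because setlocale affects the whole process, a guard like this does not make the lookup safe in a multithreaded program.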

+ Making the message catalogs can be tricky at first, but becomes quite simple with practice. For complete info, see the gettext documentation. Here's an idea of what is required: +

+ Make a source file with the required string literals that need + to be translated. See intl/string_literals.cc for + an example. +
+ Make initial catalog (see "4 Making the PO Template File" from + the gettext docs).
+ xgettext --c++ --debug string_literals.cc -o libstdc++.pot +
Make language and country-specific locale catalogs.
+ cp libstdc++.pot fr_FR.po +
+ cp libstdc++.pot de_DE.po +
+ Edit localized catalogs in emacs so that strings are + translated. +
+ emacs fr_FR.po +
Make the binary mo files.
+ msgfmt fr_FR.po -o fr_FR.mo +
+ msgfmt de_DE.po -o de_DE.mo +
Copy the binary files into the correct directory structure.
+ cp fr_FR.mo (dir)/fr_FR/LC_MESSAGES/libstdc++.mo +
+ cp de_DE.mo (dir)/de_DE/LC_MESSAGES/libstdc++.mo +
Use the new message catalogs.
+ locale loc_de("de_DE"); +
+ + use_facet<messages<char> >(loc_de).open("libstdc++", locale(), dir); + +

Use

+ A simple example using the GNU model of message conversion. +

+#include <iostream>
+#include <locale>
+#include <string>
+using namespace std;
+
+void test01()
+{
+  typedef messages<char>::catalog catalog;
+  const char* dir =
+  "/mnt/egcs/build/i686-pc-linux-gnu/libstdc++/po/share/locale";
+  const locale loc_de("de_DE");
+  const messages<char>& mssg_de = use_facet<messages<char> >(loc_de);
+
+  catalog cat_de = mssg_de.open("libstdc++", loc_de, dir);
+  string s01 = mssg_de.get(cat_de, 0, 0, "please");
+  string s02 = mssg_de.get(cat_de, 0, 0, "thank you");
+  cout << "please in German: " << s01 << '\n';
+  cout << "thank you in German: " << s02 << '\n';
+  mssg_de.close(cat_de);
+}
+

Future

+ Things that are sketchy, or remain unimplemented: +
- + _M_convert_from_char, _M_convert_to_char are in flux, + depending on how the library ends up doing character set + conversions. It might not be possible to do a real character + set based conversion, due to the fact that the template + parameter for messages is not enough to instantiate the + codecvt facet (1 supplied, need at least 2 but would prefer + 3). +
- + There are issues with gettext needing the global locale set + to extract a message. This dependence on the global locale + makes the current "gnu" model non MT-safe. Future versions + of glibc, i.e. glibc 2.3.x will fix this, and the C++ library + bits are already in place. +
+ Development versions of the GNU "C" library (glibc 2.3) will allow a more efficient, MT-safe implementation of std::messages, and will allow the removal of the _M_name_messages data member. If this is done, it will change the library ABI. The C++ parts to support glibc 2.3 have already been coded, but are not in use: once this version of the "C" library is released, the marked parts of the messages implementation can be switched over to the new "C" library functionality. +
+ At some point in the near future, std::numpunct will probably use + std::messages facilities to implement truename/falsename + correctly. This is currently not done, but entries in + libstdc++.pot have already been made for "true" and "false" string + literals, so all that remains is the std::numpunct coding and the + configure/make hassles to make the installed library search its + own catalog. Currently the libstdc++.mo catalog is only searched + for the testsuite cases involving messages members. +
The following member functions:
+ + catalog + open(const basic_string<char>& __s, const locale& __loc) const + +
+ + catalog + open(const basic_string<char>&, const locale&, const char*) const; + +
+ do not actually return a "value less than 0 if no such catalog can be opened", as the standard requires, in the "gnu" model. As of this writing, it is unknown how to query whether a specified message catalog exists using the gettext package. +
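
Until such a query exists, one possible workaround is to look for the binary catalog on disk before calling 'open'. The sketch below assumes the (dir)/locale/LC_MESSAGES/domain.mo layout shown in the catalog-making steps; the function names are hypothetical, and gettext's real search rules (fallback locale names, environment overrides) are deliberately ignored.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: path of a catalog under the layout used above.
std::string catalog_path(const std::string& dir,
                         const std::string& locale_name,
                         const std::string& domain)
{
  return dir + "/" + locale_name + "/LC_MESSAGES/" + domain + ".mo";
}

// Hypothetical helper: true if the catalog file can be opened for reading.
bool catalog_file_exists(const std::string& dir,
                         const std::string& locale_name,
                         const std::string& domain)
{
  std::ifstream file(catalog_path(dir, locale_name, domain).c_str(),
                     std::ios::in | std::ios::binary);
  return file.is_open();
}
```

A caller could return a negative catalog value itself when this check fails, at the cost of missing catalogs that gettext would have found through its fallback rules.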

Bibliography

+ API Specifications, Java Platform. java.util.Properties, java.text.MessageFormat, java.util.Locale, java.util.ResourceBundle.

+ GNU gettext tools, version 0.10.38, Native Language Support Library and Tools.