This documents investigates the construction of IMEs on
Windows, and as a side effect provides the boilerplate code for
doing so, as well as a useful IME for Unicode.
This document investigates the construction of Input Method
Editors (or IMEs) on recent Windows platforms. It attempts to
complete the documentation provided by Microsoft, and assumes a
minimal understanding of Windows programming.
The IME I have chosen to develop is targeted at people
working with Unicode: it supports the input of arbitrary Unicode
characters, by scalar value or by character name. When a
character is identified, the inserted text can be just the
character itself, or the character and its name.
I have tried to isolate the parts specific to this IME from
the parts found in pretty much all IMEs, so that developing a
new IME should be easier.
Remember that our input method supports the specification of
a character to insert either via its code point or via
its name. Furthermore, regardless of how the character was
selected, we optionally insert its name. Therefore, we need some
data structures to record names and code points.
In this section, we deal with these data structures, and the
computations they support. We will keep an eye on a
generalization of this machinery: while we are primarily
interested in the Unicode names of the characters (and their
various localizations), we may as well build something that
works with arbitrary lists of names.
First, let’s assemble the character names. We take
those from an XML representation of the Unicode Character
Database and process it with a stylesheet that extracts only the
name followed by a semicolon, followed by the scalar value,
followed by a display indication.
We reject the names starting with CJK UNIFIED IDEOGRAPH-
and HANGUL SYLLABLE; there are many such characters, and they
follow a regular organization; we will take care of them
specially. We also reject the control characters, since their
name (<control>) is not very useful and they are unlikely to
be entered (the U+ method allows us to enter them
anyway).
The display indication is used when we insert the name of
the character, following the Unicode convention. In all cases,
we insert a string of the form “U+wxyz <character>
<name>”, where character depends on the display
indication:
0: the character is inserted directly,
e.g. “U+20AC € EURO SIGN”
1: the character is a combining mark and U+25CC
◌ DOTTED CIRCLE is used as a base,
e.g. “U+0302 ◌̂ COMBINING CIRCUMFLEX
ACCENT”
2: the character is a white space character and is
displayed between U+2018 ‘ LEFT SINGLE QUOTATION MARK
and U+2019 ’ RIGHT SINGLE QUOTATION MARK,
e.g. “U+2003 ‘ ’ EM
SPACE”
3: the character is some kind of control character and
is not displayed at all, a simple space separates the code
point and the name, e.g. “U+200D ZERO WIDTH
JOINER”
Before we build our data structures, let‘s
experiment a little bit to understand our data.
The list of names for Unicode 3.2 contains 13,789 names
(after removing the controls, CJK ideographs, and Hangul
syllables), and is about 400,000 characters. There is a fair
amount of redundancy in the names (e.g. LATIN occurs 968
times), and indeed, compressing its serialization in UTF-8
with a common compression utility yields a file less than 100
Kbytes. Of course, this file does not support our operations
very well, but it gives us a good lower bound for our data
structures.
The completion operation, where we use the beginning of a
name to prune the list of candidate names, suggests a tree
organization, where a node represents the prefix of some
names, and its children are the possible continuation for that
prefix. Here is a fragment for the three names ACCOUNT OF,
ACUTE ACCENT and ACUTE ANGLE:
In this graph, the red circle indicates complete
names. Note that it is possible for a non-leaf node to
represent a character, as some names are prefixes of others;
for example, the node representing the isolated A in LATIN
CAPITAL LETTER A WITH ACUTE will have a child labelled space
and represent the character U+0041.
Here is a data structure to represent a tree node.
Node class, to represent a node in the name tree ==
static class Node {
public int codePoint;
public int gc;
public Map children;
public Node parent;
public String key;
public Node (Node parent, String key) {
this.codePoint = -1;
this.parent = parent;
this.children = new TreeMap ();
this.key = key;
this.gc = 0;
}
compiler.tree.methods: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
}
If the node does not represent a character, then the
codePoint field is -1. The children are
indexed by the character they represent, also called their
key, for convenient access. The node that
represents the root of the tree is a little bit special,
because its parent field is
null, as is its key. For all the other
nodes, these fields are non-null
Inserting a new string in the tree is rather
straightforward. s.substring (from, to) is
the string to insert, and codePoint is the
code point for that name:
Node method to insert a string in a tree ==
public void insert (String s, int from, int to, int codePoint, int gc) {
if (from < to) {
String key = s.substring (from, from + 1).intern ();
from = from + 1;
Node n = (Node) children.get (key);
if (n == null) {
n = new Node (this, key);
children.put (key, n); }
n.insert (s, from, to, codePoint, gc); }
else {
this.codePoint = codePoint;
this.gc = gc; }
}
Let's have a few methods to investgate the tree. First
let's count the number of nodes:
public int count () {
int c = 1; // ourselves
for (Iterator it = children.values ().iterator (); it.hasNext (); ) {
c += ((Node) it.next ()).count (); }
return c;
}
Just to make sure that our structure is correct, we can
recreate the data from the tree and compare with the
original:
Let's load our names list and report some numbers:
{ String name;
int charCount = 0;
int stringCount = 0;
System.out.println ("---- loading the names in the tree");
LineNumberReader rd = new LineNumberReader (new FileReader (args[0]));
while ((name = rd.readLine ()) != null) {
int semi = name.indexOf (';');
int semi2 = name.indexOf (';', semi + 1);
int codePoint = Integer.parseInt (name.substring (semi + 1, semi2), 16);
int gc = Integer.parseInt (name.substring (semi2 + 1));
root.insert (name, 0, semi, codePoint, gc);
stringCount++;
charCount += semi; }
System.out.println ("" + stringCount + " names");
System.out.println ("" + charCount + " characters");
int totalNodes = root.count ();
System.out.println ("" + totalNodes + " char nodes"); }
We discover that we have 15,015 names, using 399,396
characters, but only 88,778 nodes. This is not too surprising
since many names have a common prefix (e.g. LATIN), and that
we represent character in those prefixes only once.
Let's restore the names list, to make sure our data
structure is correct:
{ String name = args [0] + ".restore.1";
System.out.println ("---- restoring in " + name);
FileWriter x = new FileWriter (name);
root.restore (x);
x.close (); }
As expected, the restored name list matches with original
names list, modulo the order.
Going back to our picture of the tree, we observe that
many nodes are not leaves and have a single child. Again, this
should not be surprising: after LATIN CAPI, the next
characters are always TAL. The nodes I, T, and A have this
property. Let's count those nodes to see if we can exploit
that property:
public int countSimple () {
int c = 0;
if (codePoint == -1 && children.size () == 1) {
c++; }
for (Iterator it = children.values ().iterator (); it.hasNext (); ) {
c += ((Node) it.next ()).countSimple (); }
return c;
}
{ System.out.println ("---- counting the simple nodes");
int simpleNodes = root.countSimple ();
System.out.println ("" + simpleNodes + " simple nodes"); }
Indeed, 69,387 nodes or about 80% are simple! Representing
the simple nodes are separate nodes does not help us much, but
it consumes quite a bit of space. If we collapse the adjacent
simple nodes, we get the following graph:
Let's do it, and verify that our collapsing is correct:
public void collapse () {
if (codePoint == -1 && children.size () == 1) {
Node n = (Node) children.values ().iterator ().next ();
parent.children.remove (key);
n.key = (key + n.key).intern ();
n.parent = parent;
parent.children.put (n.key, n);
n.collapse (); }
else {
Object [] n = children.values ().toArray ();
for (int i = 0; i < n.length; i++) {
((Node) n[i]).collapse (); }}
}
The astute reader will have noticed that our
restore does not depend on the keys being
single characters, so we can use it here.
{ System.out.println ("---- collapsing the simple nodes");
root.collapse ();
int nodes = root.count ();
System.out.println ("" + nodes + " nodes");
String name = args[0] + ".restore.2";
System.out.println ("---- restoring in " + name);
FileWriter x = new FileWriter (name);
root.restore (x);
x.close (); }
We are left with 19,391 nodes, indeed quite a simplification.
Here is a further step to take a look at the nodes we now
have, but outputing all the keys:
public void collectKeys (Map s) {
if (key != null) {
int n;
if (s.get (key) != null) {
n = ((Integer) s.get (key)).intValue () + 1; }
else {
n = 1; }
s.put (key, new Integer (n)); }
for (Iterator it = children.values ().iterator (); it.hasNext ();) {
((Node) it.next ()).collectKeys (s); }
}
Map keys = new HashMap ();
{ String name = args [0] + ".keys";
System.out.println ("---- collecting the keys in " + name);
root.collectKeys (keys);
System.out.println ("" + keys.size () + " distinct keys");
FileWriter x = new FileWriter (name);
for (Iterator it = keys.keySet ().iterator(); it.hasNext (); ) {
String s = (String) (it.next ());
int n = ((Integer) (keys.get (s))).intValue ();
x.write (Integer.toString (n));
x.write (" ");
x.write (s);
x.write ("\n"); }
x.close (); }
We have 4,310 unique keys (for 19,391 nodes). This indicates
that many keys have multiple occurrences. Not surprisingly,
the most frequent keys are individual letters (920 nodes have a
single A key). We also find other frequent keys: "FINAL FORM"
(172 occurrences), "SOLATED FORM" (116 occurrences), "NITIAL
FORM" (116 occurrences) " WITH" (104 occurrences). The second
and third are interesting: most names that contain ISOLATED
FORM have a corresponding name with INITAL FORM instead, but
the initial I is common between those pairs, so it is its own
single node.
Clearly, storing 172 occurrences of "FINAL FORM" is
redundant. Let's see how many characters we have if we store
each unique key only once.
{ int keyStore = 0;
int uniqueKeyStore = 0;
for (Iterator it = keys.keySet ().iterator(); it.hasNext (); ) {
String key = (String) (it.next ());
int count = ((Integer) (keys.get (key))).intValue ();
keyStore += (key.length () + 1) * count;
uniqueKeyStore += (key.length () + 1); }
System.out.println (keyStore + " characters in keys");
System.out.println (uniqueKeyStore + " character in unique keys"); }
If we do not collapse keys, we have 108,167 characters. If
we collapse, we bring down this number to 43,767, i.e. we save
about 64K characters. Therefore it is useful to represent the
keys outside the tree itself, in a big array, and have the
nodes in the tree point to that data structure.
Clearly, we want our IME to be relatively unobstrusive; in
particular, we want to it to initialize fast, since it may be
loaded and unloaded on demand. The best approach is to craft a
data struture that can be loaded in one big blob of memory, and
can be used right away. This implies in particular that all
logical pointers are represented as offsets in that memory
blob.
Not surprisingly, the code we have accumulated so far to
investigate our data is useful to build this blob. It is not
entirely a coincidence that the name of the class for this
code is Compiler.
Here is a class to manipulate this blob of memory while we
build it.
public static class Store {
Members for Store: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
}
In the Unicode standard, the character names are all made
of printable ASCII characters only. However, the names can be
translated in other languages, and indeed the French version
of ISO 10646 has translated names. Thus, we really want to
store character names made of arbitrary characters, and we
therefore use one of the UTFs to represent to names.
In selecting a UTF, we have two conflicting
approaches. The first observes that the overwhelming majority
of characters in names will be ASCII characters, even in some
translations; this points at UTF-8 as the most compact
form. The second approach observes that the IME has to
interact with other parts of the application platform,
e.g. when it gets characters from the keyboard, when it
displays its status and when it delivers its output; this
points at UTF-16 as the most practical form. After
experimenting a little bit with a UTF-8 representation, we
decided that the cost of UTF-16 was justified by the
corresponding simplification in the code.
Another concern is whether the names are normalized, and
if so, how. At this point, it is difficult to make an
intelligent choice, since the only names list we have is made
of ASCII characters only, and therefore is always in NFC and
NFD. We will actually ignore this aspect until we gain more
experience with other names lists.
As we have noted earlier, we gain significantly if we can
share the keys among the nodes, as multiple nodes have the
same key. To achieve this, we store all the keys in a big
array, and the nodes will simply point in this array.
To avoid storing the size of keys, we separate the keys by
sentinels. Ideally, we would use a noncharacter for our
sentinel; however, because they occupy 3 bytes in UTF-8, and
because we can be sure that no control character is part of
the character names, we use U+0000 instead.
This approach also provides a further opportunity for
space optimization: if key A is a suffix of key B, we only
need to store B, and we can simply point in the middle of B to
represent A. However, this is probably a small optimization
and we do not implement it here.
Here are the pieces to represent and manipulate the key
store inside a store:
protected byte [] keyStore;
protected int keyStoreCount;
protected Map keyStoreMap;
public int SENTINEL_UNIT = 0;
public void initKeyStore () {
keyStoreCount = 0;
keyStore = new byte [1000];
appendUTF16Unit (SENTINEL_UNIT);
keyStoreMap = new HashMap ();
}
In keyStore, the UTF-16 sequences are
serialized in little endian order. Only the positions 0
.. keyStoreCount-1 are
used. keyStoreMap maps a string to its
offset in keyStore.
Finding the offset of a key, and adding it if not
present, is simple:
Retrieving is only complicated by the fact that Java
does not support unsigned bytes:
protected int getKeyByte (int offset) {
int x = keyStore [offset];
if (x < 0) {
x += 256; }
return x;
}
protected char getUTF16Unit (int offset) {
int b1 = getKeyByte (offset);
int b2 = getKeyByte (offset + 1) << 8;
return (char) (b1 | b2);
}
public String stringAt (int offset) {
int nChars = 0;
int o = offset;
while (getUTF16Unit (o) != SENTINEL_UNIT) {
o += 2;
nChars += 1; }
char[] chars = new char [nChars];
for (int i = 0; i < nChars; i++) {
chars [i] = getUTF16Unit (offset);
offset += 2; }
return new String (chars);
}
Let's report our size:
public int keyStoreSize () {
return keyStoreCount;
}
Here is the code that walks all the nodes in our tree and
builds the key store. The keyStoreOffset member remembers the
index (in bytes) in the key store of the key for the current
node:
One aspect we have not dealt with so far is to find the
name of character given its scalar value. Suppose we could
somehow locate the node that corresponds to a character (that
is, the last node to form the name of character), and that we
can walk up in the tree: we can then reassemble the name from
all the keys of those nodes.
To find the node that represents a character, we could
build an explicit map. But here is an alternative that
consumes no space, at the expense of a bit of computation. We
represent each node as a sequence of fixed size units, each
unit being either the pointer to the key, the scalar value or
a pointer to one of the children. Futhermore, each unit
records the type of data it contains, and given a pointer to a
unit in a block, it is easy to find the boundaries of that
block (much like it's easy to find the boundaries of a
character in a UTF code unit sequence). The tree as a whole is
represented by the concatenation of the nodes, and we impose
one more constraint: the nodes that hold characters must
appear in increasing scalar value order. The point of all this
is that we can locate the node for a character by doing a
binary search in tree, viewed as an array of units.
To walk up in the tree, the simplest is to store back
pointers from the child nodes to their parents.
We know that the key store is not too big, of the order of
40K characters, which will turn in roughly 40 Kbytes. So an
offset in the key store will fit comfortably in 24
bits. Similary, our node store is of limited size: 18K nodes,
each node having about three units; again, an offset in the
node store will fit comfortably in 24 bits. Finally, a scalar
value also fits in 24 bits. Thus, we can make our units 32
bits, and use 8 bits to record the type of unit and the
starting and ending units of a node. More precisely:
bit 30 is 1 for the first unit of a node, 0 otherwise.
bit 29 is 1 for the last unit of a node, 0 otherwise
bits 28-27 have the value
00 for a unit that stores an offset in the key
store
01 for a unit that stores the offset of the
parent node
10 for a unit that stores a scalar value
11 for a unit that stores the offset of a child
bits 23-0 store the offset or scalar value
the remaining bits are unused and must be zero
Note that all units are optional: for example, a node that
does not represent a character does not have a character
unit. Also, we do not store explicitly the number of child
nodes; we put all the child units at the end of the node and
the last one (if any) is therefore the last unit of the node,
and marked as such.
Here are the Store members to manipulate the node store:
public static final int UNIT_SIZE = 4;
protected static final byte FIRST_UNIT = 0x10;
protected static final byte LAST_UNIT = 0x20;
protected static final byte FUNCTION_MASK = 0x0F;
protected static final byte PARENT_OFFSET = 0x00;
protected static final byte KEY_OFFSET = 0x01;
protected static final byte USV_VALUE = 0x02;
protected static final byte CHILD_OFFSET = 0x03;
protected byte [] nodeStore;
public void initNodeStore (int size) {
nodeStore = new byte [size];
firstUnit = false;
currentNodeOffset = 0;
}
To enter a node in the node store, the client must start
with a call to startNode, followed by calls
to the append* methods, and finalize by a
call to endNode.
protected boolean firstUnit;
protected int currentNodeOffset;
public void startNode (int offset) {
firstUnit = true;
currentNodeOffset = offset;
}
public void appendParentOffset (int offset) {
appendUnit (PARENT_OFFSET, offset);
}
public void appendKeyOffset (int offset) {
appendUnit (KEY_OFFSET, offset);
}
public void appendUSV (int codePoint, int gc) {
int n = (gc << 21) | codePoint;
appendUnit (USV_VALUE, n);
}
public void appendChildOffset (int offset) {
appendUnit (CHILD_OFFSET, offset);
}
public void endNode () {
nodeStore [currentNodeOffset - 1] |= LAST_UNIT;
}
public void appendUnit (byte control, int value) {
if (firstUnit) {
control |= FIRST_UNIT;
firstUnit = false; }
nodeStore [currentNodeOffset++] = (byte) ((value & 0x000000FF));
nodeStore [currentNodeOffset++] = (byte) ((value & 0x0000FF00) >> 8);
nodeStore [currentNodeOffset++] = (byte) ((value & 0x00FF0000) >> 16);
nodeStore [currentNodeOffset++] = control;
}
Here are a few methods to access a node unit:
protected int getNodeByte (int offset) {
int x = nodeStore [offset];
if (x < 0) {
x += 256; }
return x;
}
public int getNodeUnitControl (int offset) {
return getNodeByte (offset + 3);
}
public int getNodeUnitValue (int offset) {
int b3 = getNodeByte (offset+0) << 0;
int b2 = getNodeByte (offset+1) << 8;
int b1 = getNodeByte (offset+2) << 16;
return (b1 | b2 | b3);
}
public int getNodeUnitUSV (int offset) {
return getNodeUnitValue (offset) & 0x1FFFFF;
}
public int getNodeUnitGc (int offset) {
return (getNodeUnitValue (offset) >> 21) & 0x7;
}
Finally, a method to get the size of the node store.
public int nodeStoreSize () {
return nodeStore.length;
}
To build the node store, we do the following:
we first build a map from scalar values to
nodes
we assign a relative position to each and every
node, by enumerating them starting the nodes in the map,
in scalar value order. At the same time, we compute the
size of each node and therefore find the absolute
position of the next node.
finally, we can build the node store.
protected int nodeStoreOffset;
protected void collectCharNodes (SortedMap m) {
if (codePoint != -1) {
m.put (new Integer (codePoint), this); }
for (Iterator it = children.values ().iterator (); it.hasNext ();) {
((Node) it.next ()).collectCharNodes (m); }
}
protected void populateNodeStore (Store store) {
SortedMap m = new TreeMap ();
collectCharNodes (m);
int offset = 0;
// put ourselves, the root, first
offset = position (offset);
// position the character nodes, in usv order
for (Iterator it = m.values ().iterator (); it.hasNext ();) {
offset = ((Node) it.next ()).position (offset); }
// position the remaining nodes, in any order
for (Iterator it = children.values ().iterator (); it.hasNext ();) {
offset = ((Node) it.next ()).positionAndChildren (offset); }
store.initNodeStore (offset);
fillAndChildren (store);
}
protected int position (int offset) {
this.nodeStoreOffset = offset;
if (key != null) {
offset += Store.UNIT_SIZE; }
if (codePoint > 0) {
offset += Store.UNIT_SIZE; }
if (parent != null) {
offset += Store.UNIT_SIZE; }
offset += children.size () * Store.UNIT_SIZE;
return offset;
}
protected int positionAndChildren (int offset) {
if (this.nodeStoreOffset == 0) {
offset = position (offset); }
for (Iterator it = children.values ().iterator (); it.hasNext ();) {
offset = ((Node) it.next ()).positionAndChildren (offset); }
return offset;
}
protected void fill (Store store) {
store.startNode (nodeStoreOffset);
if (key != null) {
store.appendKeyOffset (keyStoreOffset); }
if (codePoint > 0) {
store.appendUSV (codePoint, gc); }
if (parent != null) {
store.appendParentOffset (parent.nodeStoreOffset); }
for (Iterator it = children.values ().iterator (); it.hasNext (); ) {
store.appendChildOffset (((Node) it.next ()).nodeStoreOffset); }
store.endNode ();
}
protected void fillAndChildren (Store store) {
fill (store);
for (Iterator it = children.values ().iterator (); it.hasNext (); ) {
((Node) it.next ()).fillAndChildren (store); }
}
It turns out to be very convenient to compute some numbers
as we compile the store and saved them in it, so that we can avoid
computing them every time the IME is loaded. In general, those
numbers will serve to determine the size of buffers allocated
in the IME; this allows us to just pass those buffers around,
knowing that they are large enough for the data they will
receive, and therefore avoid all the complexity and cost of dual pass
APIs (call once to compute the needed size, allocate the
buffer, call a second time to actually fill the buffer).
The downside of this approach is that we need to maintain
synchronicity between the code in the compiler and code in the
IME. It is also possible that a different implementation of
the IME would not need the numbers we compute here (no big
deal), or need other numbers; we can add such numbers as
needed. In any case, the cost seems worth the benefits.
The first such number is the maximal length of a character
name, measured in UTF-16 code units.
public int maxNameLength () {
int max = 0;
for (Iterator it = children.values ().iterator (); it.hasNext (); ) {
int m = ((Node) it.next ()).maxNameLength ();
if (m > max) {
max = m; }}
if (key != null) {
max += key.length (); }
return max;
}
The second number deals with completions on partially
specified names. The number we are looking for is the size of
the buffer that will contain all the candidates. One approach
is to compute this length precisely; this would
amount to a dual pass API, having just moved the first pass in
the compiler. Instead, we are going to compute a close upper
bound, which allows us to have a simpler “first
pass”, and in particular gives us a better chance of
proving that it matches the second pass.
If there are few enough candidates that we elect to show
all of them in their entirety, then we know that each one will
be no more than the maximal length of a name, plus at most 10 ASCII
characters for the " U+xxxxxx" suffix and the terminating
U+0000. We simply multiply this by the maximum number of candidates
for this case.
If we elect to show the word completions instead, the same
upper bound will apply just as well. Either a candidate is a
complete name, or it is no bigger than a full name, and the
suffix is "...", which is smaller than what we already accounted
for.
Finally, if we elect to show the individual letters
completions, then the number of candidates corresponding to
children nodes is bound by the
maximum number of children in any node, and each such candidate will
occupy 5 units at most (2 for the individual letter, in case
somebody uses supplementary characters in a character name, 2
for the ellipsis, and 1 for termination). To that, we need to
add the node itself, which is 10 units at most.
public int maxChildren () {
int max = children.size ();
for (Iterator it = children.values ().iterator (); it.hasNext (); ) {
int m = ((Node) it.next ()).maxChildren ();
if (m > max) {
max = m; }}
return max; }
public int maxCandidatesLength () {
int max = 15 * (maxNameLength () + 10);
int letterCompletions = maxChildren () * 5 + 10;
if (letterCompletions > max) {
max = letterCompletions; }
return max;
}
Let's put those in the store:
public int maxNameLength;
public int maxCandidatesLength;
We are now ready to represent our complete store in a
single byte sequence (rather than the two separate sequences
we have used so far.
To prepare for future revision of the format, we use an
organization much like an sfnt. First comes a header, with
version information and a pointer to a table of
content. The table of contents contains offset and size of
the key store and node store. Then we have the node store
and key store as before.
public void serialize (OutputStream f) throws java.io.IOException {
// Header
writeInteger (f, 1); // major
writeInteger (f, 0); // minor
writeInteger (f, 20); // offset to table of content
writeInteger (f, maxNameLength);
writeInteger (f, maxCandidatesLength);
// TOC
writeInteger (f, 44);
writeInteger (f, nodeStoreSize ());
writeInteger (f, 44 + nodeStoreSize ());
writeInteger (f, keyStoreSize ());
buildKeyboardStore ();
writeInteger (f, 44 + nodeStoreSize () + keyStoreSize ());
writeInteger (f, keyboardStoreSize ());
// Node store
for (int i = 0; i < nodeStoreSize (); i++) {
f.write (nodeStore [i]); }
// Key store
for (int i = 0; i < keyStoreSize (); i++) {
f.write (keyStore [i]); }
// Keyboard store
for (int i = 0; i < keyboardStoreSize (); i++) {
f.write (keyboardStore [i]); }
}
protected void writeInteger (OutputStream f, int i)
throws java.io.IOException {
f.write ((i ) & 0xFF);
f.write ((i >> 8) & 0xFF);
f.write ((i >> 16) & 0xFF);
f.write ((i >> 24) & 0xFF);
}
Here is an alternate serialization, in a form that is
acceptable to the Windows RC compiler (this will allow us to put the
store as a resource in the DLL that implements the IME).
Let's start with a method that builds the name of a
character, given an offset to its node.
public String characterName (int offset, String suffix) {
int next = -1;
boolean done = false;
do {
switch (getNodeUnitControl (offset) & FUNCTION_MASK) {
case PARENT_OFFSET: {
next = getNodeUnitValue (offset);
break; }
case KEY_OFFSET: {
suffix = stringAt (getNodeUnitValue (offset)) + suffix;
break; }}
if ((getNodeUnitControl (offset) & LAST_UNIT) != 0) {
done = true; }
offset += UNIT_SIZE; }
while (! done);
if (next != -1) {
return characterName (next, suffix); }
else {
return suffix; }
}
As we described earlier, search for the name of character
is essentially a binary search on the node store:
public String findCharacterName (int usv) {
int first = 0;
int last = nodeStore.length - 1;
while (first <= last) {
int mid = ((first / 4 + last / 4) / 2) * 4;
while ( mid >= first
&& (getNodeUnitControl (mid) & FUNCTION_MASK) != USV_VALUE) {
mid -= UNIT_SIZE; }
if (mid < first) {
mid = ((first / 4 + last / 4) / 2) * 4;
while ( mid <= last
&& (getNodeUnitControl (mid) & FUNCTION_MASK) != USV_VALUE) {
mid += UNIT_SIZE; }
if (mid > last) {
return null; }}
int thisUsv = getNodeUnitUSV (mid);
if (thisUsv < usv) {
first = mid + UNIT_SIZE; }
else if (thisUsv > usv) {
last = mid - UNIT_SIZE; }
else {
while ((getNodeUnitControl (mid) & FIRST_UNIT) == 0) {
mid -= UNIT_SIZE; }
return characterName (mid, ""); }}
return null;
}
Let's see if this really works:
System.out.println ("name of U+20AC is '"
+ store.findCharacterName (0x20ac)
+ "', should be 'EURO SIGN'");
System.out.println ("name of U+effff is '"
+ store.findCharacterName (0xeffff)
+ "', should be 'null'");
System.out.println ("name of U+13FF is '"
+ store.findCharacterName (0x13ff)
+ "', should be 'null'");
The function formatUSV fills the
buffer pointed by a with the four to six
digit representation of usv and returns the
number of characters it produced:
PRIVATE int formatUSV (UTF16CodeUnit *a, int usv) {
int n;
if (usv <= 0xffff) {
n = 4; }
else if (usv < 0xfffff) {
n = 5; }
else {
n = 6; }
{ int i;
for (i = n; i > 0; i--) {
int nibble = (usv >> ((i-1)*4)) & 0xf;
if (nibble < 10) {
*(a++) = '0' + nibble; }
else {
*(a++) = 'A' + (nibble - 10); }}}
return n;
}
The function formatUSVuplus fills the
buffer pointed by a with the U+
representation of usv and returns the
number of characters it produced:
PRIVATE int formatUSVuplus (UTF16CodeUnit *a, int usv) {
int n = 0;
a [n++] = 'U';
a [n++] = '+';
n += formatUSV (a + n, usv);
return n;
}
Here are the declarations for these functions:
PRIVATE int formatUSV (UTF16CodeUnit *buffer, int usv);
PRIVATE int formatUSVuplus (UTF16CodeUnit *buffer, int usv);
Remember that each node unit is a four byte structure,
with the control part of the node unit in byte 3 and the value
part in bytes 0 through
2. getNodeUnitControl returns the control
part of the node unit at offset, and
getNodeUnitValue returns the value
part.
In each function, nodeOffset must point
to the first unit of a node. If there is no key offset (which
should happen for the root node only),
getNodeKeyOffset returns -1. If there is no
usv node unit, getNodeUSV returns
-1.
PRIVATE int getNodeKeyOffset (int nodeOffset) {
int control = getNodeUnitControl (nodeOffset);
while ( (control & FUNCTION_MASK) != KEY_OFFSET
&& (control & LAST_UNIT) == 0) {
nodeOffset += UNIT_SIZE;
control = getNodeUnitControl (nodeOffset); }
if ((control & FUNCTION_MASK) == KEY_OFFSET) {
return getNodeUnitValue (nodeOffset); }
else {
return -1; }
}
PRIVATE USV getNodeUSV (int nodeOffset) {
int control = getNodeUnitControl (nodeOffset);
while ( (control & FUNCTION_MASK) != USV_VALUE
&& (control & LAST_UNIT) == 0) {
nodeOffset += UNIT_SIZE;
control = getNodeUnitControl (nodeOffset); }
if ((control & FUNCTION_MASK) == USV_VALUE) {
return getNodeUnitUSV (nodeOffset); }
else {
return -1; }
}
PRIVATE int getNodeGc (int nodeOffset) {
int control = getNodeUnitControl (nodeOffset);
while ( (control & FUNCTION_MASK) != USV_VALUE
&& (control & LAST_UNIT) == 0) {
nodeOffset += UNIT_SIZE;
control = getNodeUnitControl (nodeOffset); }
if ((control & FUNCTION_MASK) == USV_VALUE) {
return getNodeUnitGc (nodeOffset); }
else {
return -1; }
}
PRIVATE BOOL nodeHasChildren (int nodeOffset) {
int control = getNodeUnitControl (nodeOffset);
while ( (control & FUNCTION_MASK) != CHILD_OFFSET
&& (control & LAST_UNIT) == 0) {
nodeOffset += UNIT_SIZE;
control = getNodeUnitControl (nodeOffset); }
return (control & FUNCTION_MASK) == CHILD_OFFSET;
}
Let's have declarations for those functions
PRIVATE UTF16CodeUnit getKeyChar (int keyOffset);
PRIVATE int getNodeUnitControl (int nodeOffset);
PRIVATE int getNodeUnitValue (int nodeOffset);
PRIVATE int getNodeUnitUSV (int nodeOffset);
PRIVATE int getNodeUnitGc (int nodeOffset);
PRIVATE int getNodeKeyOffset (int nodeOffset);
PRIVATE USV getNodeUSV (int nodeOffset);
PRIVATE USV getNodeGc (int nodeOffset);
PRIVATE BOOL nodeHasChildren (int nodeOffset);
This function assembles the name of a character given a
pointer to its node. buffer is filled with
the character name, terminated by a 0 code unit. The return
value points to that last code unit. We rely on the fact
that the node unit for the parent (if any) appears before the node unit
for the key.
PRIVATE UTF16CodeUnit *characterName (int offset, UTF16CodeUnit* buffer) {
int control;
int o;
o = offset;
do {
control = getNodeUnitControl (o);
if ((control & FUNCTION_MASK) == PARENT_OFFSET) {
buffer = characterName (getNodeUnitValue (o), buffer); }
o += UNIT_SIZE; }
while ((control & LAST_UNIT) == 0);
o = offset;
do {
control = getNodeUnitControl (o);
if ((control & FUNCTION_MASK) == KEY_OFFSET) {
int keyOffset = getNodeUnitValue (o);
while (getKeyChar (keyOffset) != 0) {
*(buffer++) = getKeyChar (keyOffset);
keyOffset += 2; }
buffer [0] = 0; }
o += UNIT_SIZE; }
while ((control & LAST_UNIT) == 0);
return buffer;
}
This function is given a scalar value, finds the node for
that scalar value, and assembles the name. It return true iff
there is a name for the scalar value.
PRIVATE BOOL findCharacterName (int usv, UTF16CodeUnit* buffer, int* gc) {
int first = 0;
int last = nodeStoreLength - 1;
int mid;
int thisUsv;
while (first <= last) {
mid = ((first / 4 + last / 4) / 2) * 4;
while ( mid >= first
&& (getNodeUnitControl (mid) & FUNCTION_MASK) != USV_VALUE) {
mid -= UNIT_SIZE; }
if (mid < first) {
mid = ((first / 4 + last / 4) / 2) * 4;
while ( mid <= last
&& (getNodeUnitControl (mid) & FUNCTION_MASK) != USV_VALUE) {
mid += UNIT_SIZE; }
if (mid > last) {
return FALSE; }}
thisUsv = getNodeUnitUSV (mid);
debugMessage (DBG_CORE, L" usv = 0x%x", thisUsv);
if (thisUsv < usv) {
first = mid + UNIT_SIZE; }
else if (thisUsv > usv) {
last = mid - UNIT_SIZE; }
else {
while ((getNodeUnitControl (mid) & FIRST_UNIT) == 0) {
mid -= UNIT_SIZE; }
characterName (mid, buffer);
*gc = getNodeGc (mid);
return TRUE; }}
return FALSE;
}
Let's start with a procedure that walks the node store
given a string. There are two basic outcomes. First, the
string is not the prefix of any character name; in that case,
the return value points to the first character of string that makes this
happen, and the values of nodeOffset and
keyOffset are undefined. The other outcome
is that the string is the prefix of one or more characters;
the return value is 0, nodeOffset and
keyOffset are the node it leads to, and the
rest of the key to match.
PRIVATE UTF16CodeUnit* followString (UTF16CodeUnit* string, int len,
int *nodeOffset, int *keyOffset) {
*nodeOffset = 0; // the root node
*keyOffset = 0;
return followString1 (string, len, nodeOffset, keyOffset);
}
PRIVATE UTF16CodeUnit* followString1 (UTF16CodeUnit* string, int len,
int *nodeOffset, int *keyOffset) {
if (len == 0) {
return 0; }
else if (getKeyChar (*keyOffset) == 0) {
int control;
int nodeOffsetChild = *nodeOffset;
// look for a correct child
do {
control = getNodeUnitControl (nodeOffsetChild);
if ((control & FUNCTION_MASK) == CHILD_OFFSET) {
int childNodeOffset = getNodeUnitValue (nodeOffsetChild);
int childKeyOffset = getNodeKeyOffset (childNodeOffset);
if (getKeyChar (childKeyOffset) == *string) {
*nodeOffset = childNodeOffset;
*keyOffset = childKeyOffset + 2;
return followString1 (string+1, len-1, nodeOffset, keyOffset); }}
nodeOffsetChild += UNIT_SIZE; }
while ((control & LAST_UNIT) == 0);
// no children match
return string; }
else if (getKeyChar (*keyOffset) != *string) {
return string; }
else {
(*keyOffset) += 2;
return followString1 (string+1, len-1, nodeOffset, keyOffset); }
}
PRIVATE UTF16CodeUnit* followString (UTF16CodeUnit* string, int len,
int *nodeOffset, int *keyOffset);
PRIVATE UTF16CodeUnit* followString1 (UTF16CodeUnit* string, int len,
int *nodeOffset, int *keyOffset);
After the user enters the prefix of one or more names,
he may ask for all the possible completions of that prefix. We
want to present those completions in an somewhat intelligent
way: there is no sense presenting the full list if it is too
long. Consequently, we have three different levels of
completion. For each level, we have a function that counts the
number of completions, and one function that actually builds
them.
The first level is to enumerate all the completions in
their entirety:
PRIVATE int countAllCompletions (int nodeOffset) {
int count = 0;
int control;
do {
control = getNodeUnitControl (nodeOffset);
if ((control & FUNCTION_MASK) == USV_VALUE) {
count++; }
if ((control & FUNCTION_MASK) == CHILD_OFFSET) {
int childNodeOffset = getNodeUnitValue (nodeOffset);
count += countAllCompletions (childNodeOffset); }
nodeOffset += UNIT_SIZE; }
while ((control & LAST_UNIT) == 0);
return count;
}
PRIVATE UTF16CodeUnit* buildAllCompletions (int nodeOffset,
UTF16CodeUnit* names) {
//TODO UTF16CodeUnit *prefix = (UTF16CodeUnit *) malloc (maxNameLength * 2);
UTF16CodeUnit prefix [100];
return buildAllCompletions1 (nodeOffset, names, prefix, 0);
}
PRIVATE UTF16CodeUnit* buildAllCompletions1 (int nodeOffset,
UTF16CodeUnit* names, UTF16CodeUnit* prefix, int prefixLength) {
int control;
do {
control = getNodeUnitControl (nodeOffset);
if ((control & FUNCTION_MASK) == USV_VALUE) {
{ int i;
for (i = 0; i < prefixLength; i++) {
*(names++) = prefix [i]; }}
*(names++) = ' ';
names += formatUSVuplus (names, getNodeUnitUSV (nodeOffset));
*(names++) = 0; }
if ((control & FUNCTION_MASK) == CHILD_OFFSET) {
int childNodeOffset = getNodeUnitValue (nodeOffset);
int o = prefixLength;
{ int childKeyOffset = getNodeKeyOffset (childNodeOffset);
while (getKeyChar (childKeyOffset) != 0) {
prefix [o++] = getKeyChar (childKeyOffset);
childKeyOffset += 2; }}
names = buildAllCompletions1 (childNodeOffset, names, prefix, o); }
nodeOffset += UNIT_SIZE; }
while ((control & LAST_UNIT) == 0);
return names;
}
If the complete list of all completions is too big, we
retreat to completions to a word boundary:
PRIVATE int countWordCompletions (int nodeOffset) {
int count = 0;
int control;
do {
control = getNodeUnitControl (nodeOffset);
if ((control & FUNCTION_MASK) == USV_VALUE) {
count++; }
if ((control & FUNCTION_MASK) == CHILD_OFFSET) {
int childNodeOffset = getNodeUnitValue (nodeOffset);
int childKeyOffset = getNodeKeyOffset (childNodeOffset);
if (wcschr ((WCHAR *)(keyStore + childKeyOffset), ' ') == 0) {
count += countWordCompletions (childNodeOffset); }
else {
count ++; }}
nodeOffset += UNIT_SIZE; }
while ((control & LAST_UNIT) == 0);
return count;
}
PRIVATE UTF16CodeUnit* buildWordCompletions (int nodeOffset,
UTF16CodeUnit* names) {
//TODO UTF16CodeUnit *prefix = (UTF16CodeUnit *) malloc (maxNameLength * 2);
UTF16CodeUnit prefix [100];
return buildWordCompletions1 (nodeOffset, names, prefix, 0);
}
PRIVATE UTF16CodeUnit* buildWordCompletions1 (int nodeOffset,
UTF16CodeUnit* names, UTF16CodeUnit* prefix, int prefixLength ) {
int count = 0;
int control;
int i;
do {
control = getNodeUnitControl (nodeOffset);
if ((control & FUNCTION_MASK) == USV_VALUE) {
for (i = 0; i < prefixLength; i++) {
*(names++) = prefix [i]; }
if (prefixLength > 0) {
*(names++) = ' '; }
names += formatUSVuplus (names, getNodeUnitUSV (nodeOffset));
*(names++) = 0; }
if ((control & FUNCTION_MASK) == CHILD_OFFSET) {
int childNodeOffset = getNodeUnitValue (nodeOffset);
int o = prefixLength;
int lastSpace = -1;
{ int childKeyOffset = getNodeKeyOffset (childNodeOffset);
while (getKeyChar (childKeyOffset) != 0) {
if (getKeyChar (childKeyOffset) == ' ') {
lastSpace = o; }
prefix [o++] = getKeyChar (childKeyOffset);
childKeyOffset += 2; }}
if (! nodeHasChildren (childNodeOffset)) {
names = buildWordCompletions1 (childNodeOffset, names, prefix, o); }
else {
if (lastSpace == -1) {
names = buildWordCompletions1 (childNodeOffset, names, prefix, o); }
else {
int i;
for (i = 0; i <= lastSpace; i++) {
*(names++) = prefix [i]; }
*(names++) = '.';
*(names++) = '.';
*(names++) = '.';
*(names++) = 0; }}}
nodeOffset += UNIT_SIZE; }
while ((control & LAST_UNIT) == 0);
return names;
}
Our final level is to retreat to completions to the next
character:
PRIVATE int countLetterCompletions (int nodeOffset) {
int count = 0;
int control;
do {
control = getNodeUnitControl (nodeOffset);
if ((control & FUNCTION_MASK) == USV_VALUE) {
count++; }
if ((control & FUNCTION_MASK) == CHILD_OFFSET) {
int childNodeOffset = getNodeUnitValue (nodeOffset);
count ++; }
nodeOffset += UNIT_SIZE; }
while ((control & LAST_UNIT) == 0);
return count;
}
PRIVATE UTF16CodeUnit* buildLetterCompletions (int nodeOffset,
UTF16CodeUnit* names) {
int control;
do {
control = getNodeUnitControl (nodeOffset);
if ((control & FUNCTION_MASK) == USV_VALUE) {
names += formatUSVuplus (names, getNodeUnitUSV (nodeOffset));
*(names++) = 0; }
if ((control & FUNCTION_MASK) == CHILD_OFFSET) {
int childNodeOffset = getNodeUnitValue (nodeOffset);
int childKeyOffset = getNodeKeyOffset (childNodeOffset);
*(names++) = getKeyChar (childKeyOffset);
*(names++) = '.';
*(names++) = '.';
*(names++) = '.';
*(names++) = 0; }
nodeOffset += UNIT_SIZE; }
while ((control & LAST_UNIT) == 0);
return names;
}
With that in place, here is the driver that figures out
which level we should use, and invokes it.
PRIVATE void buildCandidates (int nodeOffset, UTF16CodeUnit *names,
int *count, int* length) {
int c;
WCHAR *endOfCandidates;
names [0] = 0;
c = countAllCompletions (nodeOffset);
debugMessage (DBG_CORE, L" %d completions from there", c);
if (c < 15) {
endOfCandidates = buildAllCompletions (nodeOffset, names); }
else {
c = countWordCompletions (nodeOffset);
debugMessage (DBG_CORE, L" %d word completions from there", c);
if (c < 15) {
endOfCandidates = buildWordCompletions (nodeOffset, names); }
else {
c = countLetterCompletions (nodeOffset);
debugMessage (DBG_CORE, L" %d immediate completions from there", c);
endOfCandidates = buildLetterCompletions (nodeOffset, names); }}
*count = c;
*length = endOfCandidates - names;
{ WCHAR *n = names;
int i;
for (i = 0; i < c; i++) {
debugMessage (DBG_CORE, L" ..%s", n);
while (*n != 0) {
n++; }
n++; }}
}
Here are the declarations for all those functions:
PRIVATE void buildCandidates (int nodeOffset, UTF16CodeUnit *names,
int *count, int *length);
PRIVATE int countCompletions (int nodeOffset);
PRIVATE int countWordCompletions (int nodeOffset);
PRIVATE int countLetterCompletions (int nodeOffset);
PRIVATE UTF16CodeUnit *buildAllCompletions (int nodeOffset, UTF16CodeUnit *names);
PRIVATE UTF16CodeUnit *buildAllCompletions1 (int nodeOffset, UTF16CodeUnit *names,
UTF16CodeUnit *prefix, int prefixLength);
PRIVATE UTF16CodeUnit *buildWordCompletions (int nodeOffset, UTF16CodeUnit *names);
PRIVATE UTF16CodeUnit *buildWordCompletions1 (int nodeOffset, UTF16CodeUnit *names,
UTF16CodeUnit *prefix, int prefixLength);
PRIVATE UTF16CodeUnit *buildLetterCompletions (int nodeOffset, UTF16CodeUnit *names);
These two functions intepret the composition string as a
fully form character specification, either in the form of a
number or in the form of a character name, and return the
corresponding scaler value. If the composition string cannot be
interpreted properly, the return value is -1.
PRIVATE int getUSVFromCompositionAsUSV (UTF16CodeUnit* name, int len) {
int usv = 0;
int i;
for (i = 0; i < len; i++) {
UTF16CodeUnit ch = name [i];
if ('0' <= ch && ch <= '9') {
usv = usv * 16 + (ch - '0'); }
else if ('A' <= ch && ch <= 'F') {
usv = usv * 16 + (ch - 'A' + 10); }
else {
return -1; }}
if (usv > 0x10FFFF) {
return -1; }
else {
return usv; }
}
PRIVATE int getUSVFromCompositionAsName (UTF16CodeUnit *name, int len) {
int nodeOffset;
int keyOffset;
LPCTSTR x;
x = followString (name, len, &nodeOffset, &keyOffset);
if (x != 0) {
return -1; }
else if (getKeyChar (keyOffset) != 0) {
return -1; }
else {
return getNodeUSV (nodeOffset); }
}
PRIVATE int getUSVFromCompositionAsUSV (UTF16CodeUnit *name, int len);
PRIVATE int getUSVFromCompositionAsName (UTF16CodeUnit *name, int len);
Of course, we need some structure to represent our IME
state: which keystrokes have been typed, what feedback to give
to the user, etc.
typedef struct {
int prefixLength;
UTF16CodeUnit prefixString [3];
int compositionLength;
UTF16CodeUnit compositionString [1000];
int completionLength;
UTF16CodeUnit completionString [1000];
BOOL canStopHere;
int resultLength;
UTF16CodeUnit resultString [1000];
int candidatesLength;
UTF16CodeUnit candidates [6000];
int candidatesCount;
int curMap;
int curMod;
} ImeState;
prefixLength/String hold the prefix characters typed by
the user.
compositionLength/String hold what the user has typed
after the prefix.
completionLength/String are filled based on the what the
user has typed and represent the interpretation the
composition string.
If candidatesLength is 0, then candidates and
candidatesLength are undefined.
If prefixLength = 0, then all the other fields are
undefined.
handleKeyStroke is given all the keystrokes that have not
been filtered away by wantKeyStroke. It modifies the state the
of IME based on the input, as well as prepares what is given
as feedback to the user. The return value indicates what
should be done next and takes one of the following
values:
This function handles directly the initial states (0 or 1
prefix character already entered), as well as the the escape
key in any state. The remaining cases are dispatched to
specialized functions, based on the prefix.
When we are done with elaborating the input and have
a scalar value, we need to create the result string. If full
is true, this will be of the form 'U+xxx <ch> name', otherwise
it is of the form '<ch>'. If ncr is true, the character is
represented as a '&#x<usv>;', otherwise, we insert the
character itself.
I started this work under Windows 2000 and then moved on
to Windows XP. It is my understanding that the framework is
the same under Windows 98, although this particular IME may be
less interesting there, since applications all not all capable
of handling Unicode.
There is also the possibility that all this would work on
Windows 95 and Windows NT. There where very few additions
between 95/NT and 98/2000.
Let's first deal with the environment you need to develop
IMEs.
The first think you need is Visual C++. Version 5.0 is
fine, and is the one I used, standard installation.
Next, you need the Driver Development Kit (or DDK) for
Windows NT. It turns out that Microsoft is retiring it to
replace it with the XP DKK. I have stashed a copy in case you
need it. A standard installation is fine, and we assume that
you put it in C:\ntddk.
The documentation for building IMEs in
c:\ntddk\src\ime\docs. There are two Word documents. The first
is titled “Win32 Multilingual IME Overview for IME
Development,”, and the second is titled “Vin32
Multilingual IME Application Programming
Interface”. Both are labeled “Version 1.41,
04-01-1999”. These documents are not the most easy to
navigate. In the code below, when implementing an API, we will
point to the relevant place in the documentation by an
annotation of the form “I/32”: “I”
refers to the first document, and “32” is the
page number.
c:\ntddk\src\ime also contains three examples of
imes; I believe that cht is a real one, while jpn is a demo
only. cht seems to be the best written.
The documentation for using IMEs from applications is in
the platfrom SDK documentation, under Windows Base Services,
International Features, Input Method Editor. This is for
applications that want to behave differently when an IME is
used. However, all applications should work fine without any
modification.
For development, I have found two SDK tools useful: spy++
to look at the messages that are exchanged, and dbmon (which
can be found under Visual C++, Visual C++ Samples, SKD
Samples, SDK Tool Samples, SDK Windows NT Samples,
Dbmon).
Here are a few functions that are helpful while developing
the input method, for “debugging by
printf”. They all eventually call the Win32 function
OutputDebugString. To see the messages, run dbmon in a
shell.
All debugging output can be turned on or off, on a global
basis (that is, for all input contexts), by switching the
global variable debugMode:
PRIVATE BOOL debugMode = TRUE;
Here are a couple of macros to manipulate this
variable. They offer a convenient choke point to debug the UI
that normally controls debugging.
The next function is useful when a reporting a
GetLastError error: it outputs a printf-style formatted line,
and then another line with the message corresponding to
err.
PRIVATE void debugLastError (DWORD err, LPCTSTR format, ...) {
TCHAR msg [1024];
int ret;
va_list marker;
va_start (marker, err);
debugMessage (DBG_ERROR, format, marker);
va_end (marker);
if (getDebugMode () == FALSE) {
return; }
ret = FormatMessage (FORMAT_MESSAGE_FROM_SYSTEM,
NULL,
err,
MAKELANGID (LANG_NEUTRAL, SUBLANG_DEFAULT), // Default language
msg, sizeof (msg) / sizeof (TCHAR),
NULL);
msg [ret-2] = 0; // delete the newline
debugMessage (DBG_ERROR, L" LastError = %x, '%s'", err, msg);
}
The overall organization of a traditional IME under Windows
is this:
The code that implements an IME resides in a DLL which is
loaded and unloaded when needed by the Input Method Manager
(or IMM). This DLL exports a few procedures that allows the
IMM to control the global operation of the IME; for example,
the DLL exports ImeInquire, and that
function is called by the IMM to discover some characteristics
of the IME.
For (almost) each application window, the IMM ensures that
there is a corresponding IME UI window and a corresponding
INPUTCONTEXT structure. This window has no graphic appearance,
it's simply a tool to receive messages. The best is to think
of this window, together with the INPUTCONTEXT structure, as
an IME instance object; of the various messages on that window
as method calls; and of the call to DLL functions that take an
INPUTCONTEXT argument as other method calls.
When an IME is active for an application window, it gets
the first crack at keystrokes. First, the keystroke is
presented to the ImeProcessKey method (really, a DLL entry
point that takes an INPUTCONTEXT method), which determines if
the IME wants to handle that keystroke, or if it should be
passed directly to the application. If the IME instance
decides to handle the keystroke, it is then given to the
ToAsciiEx method, which can then deal with it.
As the IME elaborates keystrokes into a string to be given
as input to the application, it manipulates a
COMPOSITIONSTRING structure that is part of the
INPUTCONTEXT. This structure seems to be targeted as Far East
IMEs; however, there is no clear definition of what the
various fields are. For our purposes, we consider the
COMPOSITIONSTRING to be two strings: the raw keystrokes and
the finish string which is "pasted" into the
application.
As the IME modifies the COMPOSITIONSTRING, it generates
some messages: IME_START_COMPOSITION, IME_COMPOSITION,
IME_END_COMPOSITION. The first and last are just delimit the
refinement operation, while the middle one is sent every time
the COMPOSITIONSTRING is modified.
Clearly, the user needs some feedback on this elaboration
process, for example to see the keystrokes he types, and the
input into which they are transformed by the IME. This is
where one need to distinguish various classes of applications:
fully aware applications interpret the
IME*COMPOSITION messages and handle all the feedback to
the user. They are entirely in charge of the IME user
interface (for the elaboration process), and they can also
somewhat control the operation of the IME.
half aware applications know that an IME is
processing their input but they do not provide a UI for
the elaboration itself; at best they exert some control
over it.
unware applications are just that: they have no clue
that an IME is operating and the IME is must handle the UI
for elaboration.
The SDK contains a sample fully aware application and a sample
half aware application.
If the application decides to take control of the
composition UI, it just handles the IME*COMPOSITION
messages. Otherwise, those messages are routes to the IME UI
window, which can then decide how to implement the composition
UI.
In this section, we deals with the IME functions that do no
depend on the input context; we can think of that as the static
members of the IME objects.
An INPUTCONTEXT includes a piece of memory that is
private to the IME. This is useful for the IME to keep some
data about the input context around. What we need to say
right now is the size of that private memory, which we will
represent as an INPUTCONTEXT_PRIVATE structure.
The meaning of the conversion modes and sentence
capabilities is unclear.
Remember that we described an IME instance as made of a UI
window and the set of DLL procedures that operate in an
INPUTCONTEXT. In this section, we deal with those things.
In addition, the instance also has a UI_PRIVATE structure
(accessed via the IMMGWLP_PRIVATE key) attached to the UI
window, which let's us store the composition window and the
candidate window:
While composing the input to the application, the user
gets feedback via a COMPOSITIONSTRING structure and a
CANDIDATEINFO structure; more precisely, the input method is
in charge of filling those structure
Here is a method to fill those structure from the
internal state of the IME. It is rather mechanical and long:
the highlights are:
the composition string is made of the prefix, the
composition string proper, and completion string.
the whole thing is a single clause
every character has the same attribute:
ATTR_TARGET_CONVERTED if the input can terminate here
(complete name, valid code point, ...), ATTR_INPUT
otherwise.
This class is registered when the DLL is attached. The
only odd thing about this class is that it has the style CS_IME,
and that the per-instance private data is 2 LONGs; these aspects
are mandated by the IMM.
The function of the composition window is to display the
composition string as keystrokes are accumulated to build an
input string to the application. It is possible for the
application to provide its own services, in which cases, the
composition window provided by the IME is not used.
Let's register the class for this window when then IME DLL
is attached.
In our IME, we position the composition window near the
current point, slightly below it. To avoid continuous dance
of the window (especially in U++ mode), we actually move it
only if it is too far of its ideal position.
The function of the candidate window is to display the
candidates whenever the user asks for them. It is possible for the
application to provide its own services, in which cases, the
candidate window provided by the IME is not used.
Let's register the class for this window when then IME DLL
is attached.
In our IME, we position the softKbd window near the
current point, slightly below it. To avoid continuous dance
of the window (especially in U++ mode), we actually move it
only if it is too far of its ideal position.
unsigned char *nodeStore;
int nodeStoreLength;
unsigned char *keyStore;
int keyStoreLength;
unsigned char *keyboardsStore;
int keyboardsStoreLength;
int maxNameLength;
int maxCandidatesLength;
Our IME offers a few configuration parameters. Those are
available via the configuration panel (itself available via
right click on the system tray pen icon or the Regional
Options control panel (when the corresponding input language
is selected, "IME Settings" button), but we'd like to offer a
faster access. We do that via the left click menu on the
system tray pen icon.
First, we need a few constants to identify our menus:
For the IME to be accessible via the Regional Option
control panel, under the Input Locales tab, it must be
described in the registry, under the key
HLM\SYSTEM\CurrentControlSet\Control\Keyboard Layouts. This
key contains one subkey per keyboard layout and one subkey per
IME. (It seems that there should also be subkeys for other
input systems, such as text to speech systems). The name of
that subkey is a 32 bit hexadecimal number and it top nibble
is 0 for a keyboard layout and E for an input method. The
bottom four nibbles form the language under which the IME is
accessible, e.g. 040C for French. Finally, middle three
nibbles serve to distinguish the various IMEs for a given
language. Thus, to have our IME accessible under French, we
can use any of the subkeys E000040C, E001040C, etc. I have not
been able to find a way to register an IME for all
languages.
Inside the subkey, there must be three entries of type
REG_SZ:
'IME file' points to the IME dll; this path is
relative to WINNT/System32. In our case, we just use
uniime.ime.
'layout file' points to the keyboard layout file to use
with the IME, also relative to WINNT/System32. In our
case, we just use the US layout, kbdus.dll.
'layout text' is the name under with the IME will
appear in the Regional Options control panel.