Working on adding UNICODE support for Multi-Edit

  • #2055
    deleyd
    Participant

    I’m working on adding some basic UNICODE support for Multi-Edit.

    As I read the UNICODE standard at http://www.unicode.org I realized what a programmer’s nightmare it is, it’s so complex. It appears to be not one standard but five standards:

    - UTF-8
    - UTF-16LE (Little Endian)
    - UTF-16BE (Big Endian)
    - UTF-32LE (Little Endian)
    - UTF-32BE (Big Endian)

    This is quite a lot to support. Fortunately it appears Microsoft Windows primarily concentrates on UTF-16LE:

    [quote]Unicode-enabled functions are often referred to as wide character functions, as described in Conventions for Function Prototypes. This designation is made because of the use of the UTF-16 encoding, which is the most common encoding of Unicode and the one used for native Unicode encoding on Windows operating systems. Each code value is 16 bits wide, in contrast to the older code page approach to character and string data, which uses 8-bit code values. The use of 16 bits allows the direct encoding of 65,536 characters. In fact, the universe of symbols used to transcribe human languages is even larger than that, and UTF-16 code points in the range U+D800 through U+DFFF are used to form surrogate pairs, which constitute 32-bit encodings of supplementary characters. See Surrogates and Supplementary Characters for further discussion.[/quote]
    (Ref: http://msdn.microsoft.com/library/en-us/intl/unicode_9i79.asp)

    So to start with I’m working on a front end that will handle just UTF-16LE ASCII files, that is, files whose text is plain ASCII but happens to be stored in this UNICODE format. We’ll convert the file to a plain ASCII text file, work with that, then on saving convert it back to UNICODE (well, that’s Phase II, I guess; Phase I is getting it to read a UNICODE file).

    EDIT: Another problem is just identifying which files are UNICODE files, as the standard does not mandate a Byte Order Mark (BOM) at the beginning of the file.

    [code]
    Bytes        Encoding Form
    00 00 FE FF  UTF-32, big-endian
    FF FE 00 00  UTF-32, little-endian
    FE FF        UTF-16, big-endian
    FF FE        UTF-16, little-endian
    EF BB BF     UTF-8
    [/code]
    I’ll alleviate this problem by initially only supporting UTF-16LE files which start with the little-endian Byte Order Mark FF FE.
    EDIT: I got confused by this, so here I make it clear for myself. For little-endian, which is what PC computers are, the BOM is:
    [code]
    FF (first byte) FE (second byte)
    [/code]
    which, if loaded into a longword of memory, looks like:
    [code]
          +------+------+------+------+
          |      |      |  FE  |  FF  |
          +------+------+------+------+
    byte:     3      2      1      0
    [/code]
    Hence, interpreted as an unsigned integer word value, the value is 0xFEFF.
    This integer value is UNICODE code point U+FEFF.
    Unicode code point U+FEFF corresponds to the "character" ZERO WIDTH NO-BREAK SPACE, but when it appears as the first two bytes of a file it is interpreted as the BOM, which is not considered part of the text and should be stripped off before the text is displayed.

    (Ref: http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf Section 15.9: Specials)
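
    Just so I don’t forget how that check looks in practice, here is a minimal sketch of the BOM sniff I have in mind (the function and type names are mine, nothing from Multi-Edit):
    [code]
    // Minimal BOM sniffer sketch. Reads up to 4 bytes and checks the longer
    // signatures first, since FF FE 00 00 (UTF-32LE) also begins with FF FE.
    #include <cstdio>

    enum BomKind { BOM_NONE, BOM_UTF32LE, BOM_UTF32BE, BOM_UTF16LE, BOM_UTF16BE, BOM_UTF8 };

    BomKind SniffBom(const char* path)
    {
        unsigned char b[4] = { 0, 0, 0, 0 };
        FILE* f = fopen(path, "rb");
        if (!f) return BOM_NONE;
        size_t n = fread(b, 1, 4, f);
        fclose(f);
        if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return BOM_UTF32BE;
        if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return BOM_UTF32LE;
        if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return BOM_UTF8;
        if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return BOM_UTF16BE;
        if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return BOM_UTF16LE; // the only case I plan to support at first
        return BOM_NONE;
    }
    [/code]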

    EDIT: Also, this will only be supported for Windows 2000, XP, and Vista (whenever I get one of those), i.e. no Windows 95, 98, or ME, and I hope nobody is still using Windows NT.

    EDIT: Oh, and another problem is that in UTF-16 there is the possibility of 4-byte characters, known as ‘surrogate’ pairs. That’s really a programmer’s nightmare, since we can’t just assume all characters are now 16 bits wide instead of 8 bits. Only UTF-32 guarantees all characters are equal width. (I don’t even know how to convert to UTF-32. I don’t know if a library function in Windows XP exists to do this. We could just say "no surrogate characters allowed".)
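
    (For my own notes: the arithmetic for gluing a surrogate pair back into a code point is fixed by the standard and is at least simple; a sketch, with a helper name I just made up:)
    [code]
    // Combine a UTF-16 surrogate pair into a Unicode scalar value (code point).
    // High surrogates are 0xD800-0xDBFF, low surrogates are 0xDC00-0xDFFF.
    unsigned long CombineSurrogates(unsigned short hi, unsigned short lo)
    {
        return 0x10000UL + ((unsigned long)(hi - 0xD800) << 10) + (unsigned long)(lo - 0xDC00);
    }
    // Example: hi = 0xD834, lo = 0xDD1E  ->  0x1D11E (MUSICAL SYMBOL G CLEF)
    [/code]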

    And there’s the problem that UNICODE allows the same text to be represented in more than one way. For example,
    ã
    may be the single character LATIN SMALL LETTER A WITH TILDE (U+00E3), or it may be LATIN SMALL LETTER A followed by the COMBINING TILDE character (U+0061 U+0303). Fortunately there appears to be a normalized form for Unicode text. Unfortunately there appear to be four normalized forms of Unicode text.
    (Ref: http://www.unicode.org/faq/normalization.html
    Ref: http://www.unicode.org/reports/tr15/
    and no, I have not read all this. I’m quite a beginner at Unicode.)
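
    (I haven’t tried this yet, but it looks like the existing Win32 FoldString call with MAP_PRECOMPOSED would at least collapse simple base + combining sequences into their precomposed form; an untested sketch:)
    [code]
    #include <windows.h>

    // Two encodings of the same visible text: precomposed U+00E3 versus
    // decomposed U+0061 U+0303 (a followed by COMBINING TILDE).
    const WCHAR decomposed[] = { 0x0061, 0x0303, 0 };
    WCHAR composed[8];

    // MAP_PRECOMPOSED asks FoldString to map base+combining pairs to their
    // precomposed forms where one exists.
    int len = FoldStringW(MAP_PRECOMPOSED, decomposed, -1, composed, 8);
    // On success, composed should now hold { 0x00E3, 0 }.
    [/code]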

    It appears that Windows Vista will have some new library functions to help with Unicode. Plus http://icu-project.org/ may be a source for code.

    Then there’s the question "Do all languages have available fixed width fonts for their characters?" There may be some language out there that doesn’t.

    EDIT: And for those of you who think Notepad is pretty good at handling plain text files, try the following:

      1. Open up Notepad (not Wordpad, not Word or any other word processor)
      2. Type in this sentence exactly (without quotes): "this app can break"
      3. Save the file to your hard drive.
      4. Close Notepad.
      5. Open the saved file by double clicking it.

      Instead of seeing your sentence, you should see a series of squares. For whatever reason, Notepad can’t figure out what to do with that series of characters and breaks.
      (Ref: http://blogs.msdn.com/michkap/archive/2006/06/14/631016.aspx
      Also see the interesting follow-up http://blogs.msdn.com/michkap/archive/2006/07/11/662342.aspx)
    #6941
    John Martzouco
    Participant

    Hey Dave,

    I think that you might have hit on a pretty good stop-gap with the decoding. I, for one, would be happy enough to convert the Unicode files to ANSI and would never bother to convert them back. In most cases, this is what I do anyway… I open the file in Notepad, save it with ANSI encoding and then use ME to do my work.

    I’m an anglophone working (almost) exclusively in English, so it would be safe for me to assume that any code exported from SQL Server 2005 will convert from Unicode to ANSI without any loss. Until my French colleagues start embedding French characters into their comments or string literals, the one way conversion will be fine for me.

    Saying that, I’d be tickled pink if my text editor had built-in functionality allowing me to make this one-way conversion! What are the chances that you could put together a DLL exposing the one-way functionality and post it here? It sounds to me like you’re pretty far along already.

    Congratulations on finding that first step! You were thinking outside the box, and it looks to me like you’ve found a strategy that will be very helpful.

    With my best regards,
    John

    #6942
    John Martzouco
    Participant

    Concerning: this app can break

    Yup, you’re right, Notepad saves it in ANSI, but then opens it up as Unicode by default. So Notepad’s rules for testing a file’s encoding aren’t perfect yet.

    To see the correct content, all one has to do is force Notepad to open the file with ANSI encoding… File/Open/(don’t select a file)/(change encoding to ANSI)… Notepad isn’t broken, it just isn’t parsing it right.

    Did the source explain what it is about those four words that beat the application’s logic? I’ve never seen this happen with any other content.

    J

    #6947
    deleyd
    Participant

    Re: Notepad, I find you have to make sure there is no trailing CR/LF at the end of the line. And the second link http://blogs.msdn.com/michkap/archive/2006/07/11/662342.aspx talks in detail about why we have the same problem with the phrase

    [quote]Bush hid the facts[/quote]

    but not with the phrase

    [quote]Bush hid the truth[/quote]
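
    (As I understand the blog posts, the culprit is the IsTextUnicode heuristic Notepad uses to guess a file’s encoding: the byte pattern of those particular words statistically "looks like" UTF-16. A toy illustration of poking at that call, assuming I have the mechanism right:)
    [code]
    #include <windows.h>
    #include <cstdio>
    #include <cstring>

    int main()
    {
        // The ANSI bytes of the problem sentence, with no trailing CR/LF.
        const char text[] = "this app can break";
        INT tests = IS_TEXT_UNICODE_STATISTICS;
        BOOL looksUnicode = IsTextUnicode(text, (int)strlen(text), &tests);
        // If the statistical test claims these ANSI bytes look like UTF-16,
        // an editor trusting it will open the file as Unicode and show squares.
        printf("IsTextUnicode: %d (tests passed: 0x%X)\n", looksUnicode, tests);
        return 0;
    }
    [/code]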

    Re the UNICODE front end for Multi-Edit:
    Looks like I’ll drop the requirement that the text all be ASCII that just happens to be in UNICODE format (i.e. byte values all 127 and below), and will instead translate to ANSI (i.e. byte values 0 – 255). All I need then is a good byte value to use in place of characters which cannot be translated to ANSI. The default is a question mark (?), but that’s no good. I need a non-character which still shows up as something in Multi-Edit, so users can search for that character (um, non-character) to find where translation failed. I’ll have to do some tests with that, as I know from experience that 0x7F, which is often used in Multi-Edit *.db files, sometimes shows up, but other times doesn’t, which means the cursor ends up out of sync with the editing point.

    Anyway the plan is to make a DLL, using my experience porting my EDX Spelling Checker to Multi-Edit, and then in the Multi-Edit source *.s files replace load_file with something like edx$load_file. The only problem is there are something like 110 instances of load_file in the *.s code. Probably quite a few of them don’t need to be changed as they just load internal Multi-Edit files. I’ll have to look at them and figure out which ones need to be changed.

    #6948
    DanHughes
    Participant

    David,

    If you succeed in getting your project working as you believe, I’ll be more than willing to add the hooks into the Multi-Edit kernel so that no macro source would have to be changed to use it.

    #6950
    John Martzouco
    Participant

    Gentlemen, this is great news!

    If there’s anything I can do to help, let me know. I’ve got a pretty good basis in C/C++ OOP and feel strongly enough about this to commit some time to it. Gratis of course, I want to see ME stay ahead of anything else on the market.

    Regards,
    John

    #6993
    deleyd
    Participant

    The Unicode front end for Multi-Edit when finished will convert UTF-16(LE), UTF-16(BE), and UTF-8 to your choice of ANSI, OEM, or UTF-8.

    The TOOLS->CUSTOMIZE->UNICODE dialog:

    When Multi-Edit opens a file, it will first examine the very beginning of the file to see if it starts with a unicode BOM byte sequence identifying it as a unicode file (FF FE = UTF-16(LE), FE FF = UTF-16(BE), EF BB BF = UTF-8). Then if you have the box checked in the customize dialog above to translate that type of unicode file, Multi-Edit will create a new translated file next to the original and load that new translated file.

    The icon used in the Unicode Conversion message above indicates how the conversion went:

    SPOCK HAND: Unicode was translated with no problems. You’re good to go.

    RED EXCLAMATION: Unicode was translated, but one or more of the following warning conditions occurred:
    - A replacement character was used to represent a unicode character that could not be translated to the target code page.
    - A UTF-16 file had an odd number of bytes (ill-formed text).

    STOP SIGN: There was a problem and the unicode was not translated.

    (Note: After converting the file Multi-Edit still needs to determine the line terminator used for the file. If Multi-Edit can’t determine if the line terminator is PC, Unix, or Mac, it may still load the file as a binary file. If this happens you can reload the unicode file and specify if the file uses PC, Unix, or Mac line terminators.)

    Explanation of the ‘Unicode Translation Flags’ in the Customize dialog
    (These are flags passed to the ‘WideCharToMultiByte’ system routine which does the translation. See http://msdn.microsoft.com/library/en-us/intl/unicode_2bj9.asp These flags are ignored if translating to UTF-8.)

    DEFAULT CHAR – Character to substitute if a unicode character cannot be translated to the target code page. (Specified as a decimal number.)

    WC_NO_BEST_FIT_CHARS – ‘WideCharToMultiByte’ may choose to use a close match to the unicode character if an exact match is not available. Checking this box prevents that, and forces ‘WideCharToMultiByte’ to substitute the default character instead.

    WC_COMPOSITECHECK – Convert composite characters, consisting of a base character and a nonspacing character, each with different character values, into precomposed characters, which have a single character value for the base/nonspacing combination. For example, in the character è, the e is the base character and the grave accent is the nonspacing character. The following three flags apply only if WC_COMPOSITECHECK is checked:
    - WC_DISCARDNS – Discard nonspacing characters during conversion.
    - WC_SEPCHARS – Generate separate characters during conversion.
    - WC_DEFAULTCHAR – Replace exceptions with the default character specified above during conversion.
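
    For anyone curious, here is roughly how those customize options map onto the actual API call; this is only a sketch with simplified buffer handling, not the code in the DLL:
    [code]
    #include <windows.h>

    // Translate a UTF-16(LE) buffer to ANSI (the system code page), substituting
    // a chosen default character for anything the code page cannot represent.
    int TranslateToAnsi(const WCHAR* wide, int wideLen,
                        char* out, int outLen, char defaultChar)
    {
        BOOL usedDefault = FALSE;
        DWORD flags = WC_COMPOSITECHECK | WC_DEFAULTCHAR | WC_NO_BEST_FIT_CHARS;
        int n = WideCharToMultiByte(CP_ACP, flags, wide, wideLen,
                                    out, outLen, &defaultChar, &usedDefault);
        // usedDefault reports whether any replacement character was emitted,
        // which is what drives the RED EXCLAMATION warning condition above.
        return n;   // 0 on failure, otherwise the number of bytes written
    }
    [/code]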

    My next post will explain the horrors of supporting Unicode and why it’s not as easy as it sounds.

    #6994
    deleyd
    Participant

    UNICODE: THE PROGRAMMER’S NIGHTMARE

    There are two problems with the UTF-16(LE) Unicode format:

      1. Some characters require TWO 16-bit code units to represent. The first 16-bit code unit is called a High Surrogate, and the second 16-bit code unit is called a Low Surrogate. Put them together and you get the Unicode Scalar Value, which is a number between 0 and 0x10FFFF. The Unicode Scalar Value is what maps to a character.

      2. I lied when I said the Unicode Scalar Value maps to a character. Unicode allows composite characters. For example, in the character è, the e is the base character and the grave accent is the non-spacing character. Unicode allows this to be represented by TWO Unicode Scalar Values: first the value for the letter e, followed by a second value for the grave accent (e.g. e + ` = è).

      Some exotic characters may have accent marks all over them. So you might have a base character (e.g. the letter e), followed by several non-spacing combining accent characters. Put it all together and you get the final composite "character", what they call a "Grapheme Cluster".

      So you can kiss goodbye to indexing your way into a unicode string to get the n’th character.

      Plus it’s possible to have an illegal byte sequence (ill-formed text). You might have a lone high surrogate 16-bit code unit not followed by a low surrogate 16-bit code unit, or the opposite: a low surrogate 16-bit code unit not preceded by a high surrogate 16-bit code unit.

      Or you might just have an endless stream of non-spacing combining characters with no base character.
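
      (The lone-surrogate case at least is cheap to detect when scanning the 16-bit code units; a sketch:)
      [code]
      #include <cstddef>

      // Scan a UTF-16 buffer for lone surrogates (one kind of ill-formed text).
      bool IsWellFormedUtf16(const unsigned short* s, size_t len)
      {
          for (size_t i = 0; i < len; ++i)
          {
              if (s[i] >= 0xD800 && s[i] <= 0xDBFF)             // high surrogate...
              {
                  if (i + 1 >= len || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                      return false;                             // ...without a low surrogate after it
                  ++i;                                          // skip the low surrogate
              }
              else if (s[i] >= 0xDC00 && s[i] <= 0xDFFF)        // low surrogate with no high surrogate before it
                  return false;
          }
          return true;
      }
      [/code]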

      And there is one more level of complication if you want to break a unicode stream into chunks. Once you’ve gone to all the trouble of finding two adjacent Unicode Scalar Values, either of which could be represented by a Surrogate Pair, you still have one more check to make, because you can’t break between just any two adjacent Unicode Scalar Values. You have to first determine the type of each value. There are 10 different types: {OTHER, CR, LF, CONTROL, EXTEND, L, V, T, LV, or LVT.} You determine the type of each Unicode Scalar Value (Grapheme) by deciphering the file http://unicode.org/Public/5.0.0/ucd/auxiliary/GraphemeBreakProperty.txt

      Then, once you’ve determined the type of each of your two adjacent Unicode Scalar Values (Graphemes), you then have to consult the Grapheme Break Chart to determine if it’s OK to break between these two Unicode Scalar Values. http://unicode.org/Public/5.0.0/ucd/auxiliary/GraphemeBreakTest.html

      If the chart says you can’t break, then you have to scan the stream forwards or backwards and repeat the whole ordeal.
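
      (The data file itself is at least easy to read mechanically. A rough sketch of pulling the code point ranges and their types out of GraphemeBreakProperty.txt, just to show the shape of it; this is not what the conversion utility does:)
      [code]
      #include <cstdio>
      #include <string>
      #include <vector>

      // One entry from GraphemeBreakProperty.txt, e.g.
      //   "0300..036F    ; Extend # ..."   or   "000D          ; CR # ..."
      struct BreakRange { unsigned long first, last; std::string type; };

      std::vector<BreakRange> LoadBreakProperties(const char* path)
      {
          std::vector<BreakRange> ranges;
          FILE* f = fopen(path, "r");
          if (!f) return ranges;
          char line[512];
          while (fgets(line, sizeof(line), f))
          {
              if (line[0] == '#' || line[0] == '\n') continue;   // skip comments and blank lines
              unsigned long first = 0, last = 0;
              char type[64] = "";
              if (sscanf(line, "%lx..%lx ; %63[^ ;#]", &first, &last, type) == 3)
                  ranges.push_back(BreakRange{ first, last, type });       // a range of code points
              else if (sscanf(line, "%lx ; %63[^ ;#]", &first, type) == 2)
                  ranges.push_back(BreakRange{ first, first, type });      // a single code point
          }
          fclose(f);
          return ranges;
      }
      // Anything not covered by a range has type OTHER; the pairwise break
      // rules then come from the Grapheme Break chart referenced above.
      [/code]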

      THE CHEAP WAY OUT:
      (Note: The Multi-Edit unicode conversion utility does not use this cheap way out.)

      For UTF-16(LE) most characters do have a single 16-bit Unicode Scalar Value to represent them, so although you can get an è by combining an e followed by a non-spacing combining accent grave mark `, you can get it easier by just specifying the Unicode Scalar Value for è.

      This is what the WC_COMPOSITECHECK flag is for. If during translation we encounter an e followed by a "non-spacing combining grave accent" ` character, it will convert this sequence into a single "e with grave accent" è character.

      Also, most common characters can be represented by a single 16-bit wide char. Surrogate Pairs are used for the Chinese, Japanese, and Korean supplementary and compatibility ideographs, mathematical alphanumeric symbols, musical symbols, Aegean numbers, ancient Greek numbers, Old Persian, Ugaritic, Deseret, Shavian, Osmanya, Cypriot Syllabary, Phoenician, Kharoshthi, Cuneiform, and a few other things.

      So you can disallow combining marks, and disallow surrogate pairs, and you’re back to a basic 16-bit per character code stream. Unfortunately it’s no longer unicode.

      THE UTF-8 NIGHTMARE
      For bytes 0x00 – 0x7F, UTF-8 characters are exactly the ASCII characters, with one byte per character. For the rest of the unicode characters, UTF-8 requires a 2, 3, or 4 byte sequence to give you one character.
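
      (The byte length of a sequence is at least easy to tell from the lead byte; a sketch of decoding one code point, assuming the input is well-formed:)
      [code]
      // Decode one UTF-8 sequence starting at p (assumes well-formed input)
      // and advance p past it. Returns the Unicode scalar value.
      unsigned long DecodeUtf8(const unsigned char*& p)
      {
          unsigned long cp;
          int extra;                                           // continuation bytes to follow
          if      (*p < 0x80) { cp = *p;        extra = 0; }   // 0xxxxxxx : plain ASCII
          else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }   // 110xxxxx : 2-byte sequence
          else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }   // 1110xxxx : 3-byte sequence
          else                { cp = *p & 0x07; extra = 3; }   // 11110xxx : 4-byte sequence
          ++p;
          while (extra-- > 0)
              cp = (cp << 6) | (*p++ & 0x3F);                  // each continuation byte is 10xxxxxx
          return cp;
      }
      [/code]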

      And as before, there’s also the possibility of combining characters, such as having an e followed by a "non-spacing combining accent grave" ` character, to create the final Grapheme è.

      So you still can’t index your way into the unicode string to get the n’th character, and you still have the problem of finding a Grapheme Break if you want to break the text into chunks.

      UTF-32 DOESN’T GET YOU OUT
      UTF-32 uses one 32-bit code unit for each Unicode Scalar Value, bypassing the problem of UTF-16 surrogate pairs or UTF-8 2, 3, or 4 byte sequences. However it doesn’t get rid of the e followed by a "non-spacing combining accent grave" ` character problem. You still can’t index your way into the unicode string to get the n’th character, and you still have the problem of finding a Grapheme Break if you want to break the text into chunks.

      So updating Multi-Edit to use Unicode isn’t as simple as it may seem.

    #6996
    sdw162006
    Participant

    Unicode sounds like something written by a committee that could not make up its mind, so they kept everyone’s ideas and patched it all together with duct tape.
    Then hid the scissors

    #7442
    deleyd
    Participant

    Using Multi-Edit to edit unicode files

    Here we have a simple unicode UTF-8 file viewed in Notepad:

    If we uncheck in Multi-Edit the box for unicode translation of UTF-8 files:

    and load the file into Multi-Edit, it looks like this:

    The first 3 characters, which look like garbage, are the UTF-8 BOM identifying this file as a UTF-8 unicode file. The final 3 characters are the UTF-8 byte sequence for that Chinese character. (The number of bytes for a special unicode symbol can vary from 2 to 4 in UTF-8 files.)

    Notice if most of the file is just plain English, you can edit this file as you please, and when you save it, it will still be a UTF-8 file with that Chinese character in it.

    If instead we check the box for unicode translation of UTF-8 files, and say Translate To ANSI:

    Then when we load the file, the file will be translated to a new file, with "~~#ANSI#" appended to the new file’s name, and the Chinese character is converted to a single character. The original UTF-8 file is left unchanged.

    In this case the Chinese character was replaced with a 0x08 character, but you could select a different character to use in TOOLS -> CUSTOMIZE -> UNICODE -> Default Char. For example, if you wanted all untranslatable characters converted to a question mark (?), set this value to 63. (See TOOLS -> ASCII TABLE)

    We can also convert a "standard" unicode file, which is UTF-16(LE), to a UTF-8 file, and save it as a UTF-8 file:

    Now when we load this UTF-16(LE) unicode file, it will be translated to a new UTF-8 file, with "~~#UTF-8#" appended to the new file’s name, and the Chinese character is converted to a UTF-8 character. The original UTF-16 file is left unchanged.

    When we save this file it will be a UTF-8 unicode file.
    Chinese Character (UTF-8).txt
    Chinese Character (UTF-16).txt

    #7778
    John Martzouco
    Participant

    I’d like to take a minute to thank you again for adding this feature to the editor.

    Yesterday, I was preparing a bunch of .rdp files for Remote Desktop on XP and it turns out that Microsoft saves these in Unicode by default. It was a complete bonus to open them in the editor and have them immediately converted to ANSI so I could work with them without jumping through hoops.

    Keep up the great work,
    John

    #7910
    daantje
    Participant

    Although I admire the work done by deleyd, which adds at least some Unicode support to ME, I do think ME desperately needs proper Unicode support built into the editor itself.
    That is what I expected when downloading ME2008 RC1. But unfortunately I was disappointed.
    One thing that is missing is the ability to edit a file in place in Unicode without having to translate it into another format. Also missing is the ability for the ME windows to properly show the right characters or glyphs.
    The other thing is that the ME file support should support Unicode and use the BOM (if present) to determine the type, guess the type, or in case of conflicting possibilities, let the user decide.
    Unfortunately, I have to work on a mix of UTF-8 and UTF-16 encoded files. So it seems I need to start looking for another editor that does properly support Unicode, without bothering me with extra files, and that properly shows all the characters. I really regret this because I do like ME and its other features, and I have happily used it for 10 or 12 years now. So I really hope the final version of ME2008 will natively support Unicode.

    #7967
    PhilHibbs
    Participant

    OK, here’s my suggestion on how to process Unicode files in ME.

    1. Convert everything to UTF-32 on reading
    2. If you encounter a "grapheme cluster" (GC), then represent the whole cluster as a single fixed value, is there a reserved value range that can be used for this?
    3. Keep a list of all the GCs as they are encountered, and write them back as they were found when saving the file. This could be a simple ordered list, so you would need to know "this is the third GC in the file, so look at the third entry in the GC list", or it could be more clever than that, I don’t know what kind of libraries or structures you have available.
    4. If a GC is deleted or overtyped, then remove it from the GC list.

    This allows GCs to be read in and written back out, and it allows indexed character access. GC creation or editing could be done through some kind of dialog box.

    Files with many GCs could cause a problem with the list becoming overly long, and working out which GC on the list is the current character could be tricky or time consuming if done brute-force.

    Copying chunks of text containing GCs would require the GC list to be updated.

    This would also require ME to keep track of what format to write the file out as, but I guess any UTF-32-conversion based solution would need that.

    Searching for a string that contains a GC could be handled by the search string having its own GC list. Could every buffer – be it a file, the clipboard, a search string, a replacement string – have a GC list associated with it?

    *Update* Could the GC list be stored at a line level? Presumably ME stores line-level metadata, such as the "modified" flag, could a reference to a GC list be added to this?
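
    If it helps picture it, here is roughly how I imagine the GC list described above; the names are invented and it’s only a sketch of the idea, not a design:
    [code]
    #include <vector>

    // Each grapheme cluster found while reading the file is replaced in the
    // buffer by a single placeholder value; its original scalar values are
    // remembered here, in order, so it can be written back out unchanged on save.
    struct GraphemeCluster
    {
        std::vector<unsigned long> scalars;   // the original Unicode scalar values
    };

    struct GcList
    {
        std::vector<GraphemeCluster> clusters;   // "the third GC in the file" = clusters[2]

        // Record a cluster and return its index, for whatever placeholder
        // scheme the buffer ends up using.
        size_t Add(const std::vector<unsigned long>& scalars)
        {
            clusters.push_back(GraphemeCluster{ scalars });
            return clusters.size() - 1;
        }
    };
    [/code]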

    #8058
    John Martzouco
    Participant

    You know what really perplexes me about this?… Microsoft has shifted to Unicode but it doesn’t seem like they’ve offered any stop-gap in the OS to help people move forward gently.

    Am I wrong, or would it have been helpful if a Win32 API function existed to translate Unicode into ANSI? Sure, it won’t handle every character, but it would be as good as what David’s built.

    David has done it… how come MS didn’t?… or did they? Does anybody know if there’s an API call that could do the translation?

    Sorry to bring it up after you’ve done this great work… just wondering if MS hasn’t shipped (or secretly implanted) a backdoor for this in Windows. I expect their dev teams have done this to alleviate their own pain… so does anyone have any friends at MS that might leak the API call?

    hmmmmm?

    #8059
    Michal Vodicka
    Participant

    [quote]David has done it… how come MS didn’t?… or did they? Does anybody know if there’s an API call that could do the translation?[/quote]
    Sure, they did. It has been there since the very first NT version. See "http://msdn.microsoft.com/en-us/library/ms776413(VS.85).aspx" and "http://msdn.microsoft.com/en-us/library/ms776420(VS.85).aspx". There are also CRT functions which allow you to convert strings: http://msdn.microsoft.com/en-us/library/6y9se58z.aspx.

    (sorry for the non-clickable links; the forum script doesn’t handle URLs with parentheses inside)
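
    (For what it’s worth, the CRT route looks roughly like this; an untested sketch:)
    [code]
    #include <stdlib.h>

    // Convert a wide (UTF-16) string to the current ANSI code page using the
    // secure CRT routine. A character that cannot be converted makes the call
    // report an error rather than being silently replaced.
    int WideToAnsiCrt(const wchar_t* wide, char* out, size_t outSize)
    {
        size_t converted = 0;
        errno_t err = wcstombs_s(&converted, out, outSize, wide, _TRUNCATE);
        return err == 0 ? (int)converted : -1;
    }
    [/code]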
