Working on adding UNICODE support for Multi-Edit


This topic contains 29 replies and was last updated by deleyd 10 years, 9 months ago.

  • #8060

    John Martzouco
    Participant

    Thanks Michal! My far-fetched hunch was right.

    Did the Win32 APIs help you at all with your work, David?

    #8062

    deleyd
    Participant

    The Unicode front end just uses WideCharToMultiByte and MultiByteToWideChar.

To translate anything to anything, first use MultiByteToWideChar to translate it to standard UTF-16LE, then use WideCharToMultiByte to translate that to any code page you please (a sketch of the round trip follows).
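To make that concrete, here is a minimal sketch of the two-step pivot in C, using only the two Win32 calls named above. (This is my illustration, not Multi-Edit’s actual front-end code; the function name and the simplified error handling are mine.)

[code]
#include <windows.h>
#include <stdlib.h>

/* Convert a buffer from code page cpFrom to cpTo by pivoting through
   UTF-16. Returns a malloc'd buffer (caller frees), or NULL on error. */
char *convert_via_utf16(const char *in, int inLen,
                        UINT cpFrom, UINT cpTo, int *outLen)
{
    /* First call asks for the required UTF-16 length in code units. */
    int wlen = MultiByteToWideChar(cpFrom, 0, in, inLen, NULL, 0);
    if (wlen == 0) return NULL;

    WCHAR *wide = (WCHAR *)malloc(wlen * sizeof(WCHAR));
    if (wide == NULL) return NULL;
    MultiByteToWideChar(cpFrom, 0, in, inLen, wide, wlen);

    /* Same two-pass dance in the other direction. */
    int blen = WideCharToMultiByte(cpTo, 0, wide, wlen, NULL, 0, NULL, NULL);
    char *out = (blen > 0) ? (char *)malloc(blen) : NULL;
    if (out != NULL)
        *outLen = WideCharToMultiByte(cpTo, 0, wide, wlen, out, blen, NULL, NULL);

    free(wide);
    return out;
}
[/code]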

I originally wanted to make the front end handle unlimited-size files. But to do that you need to be able to break a very large file into smaller pieces that these functions can handle. With ASCII there’s no problem: you can split the file anywhere between any two bytes. You can’t do that with Unicode text; breaking a stream of Unicode text is very complicated. I ultimately abandoned the idea and just had the front end make one call to MultiByteToWideChar, passing it the entire file all at once. It means there’s a file size limitation of 210 MB.

I know of no Unicode library that gives you a tool to help you find where you can split a file. That’s what we need: a Unicode library of tools to help us handle Unicode strings. I’ve been looking around everywhere and just haven’t had any success yet. I’ve tried looking at the source code of open source text editors that handle Unicode. I figure some open source code must exist somewhere, probably in Linux land. I just haven’t been able to locate anything suitable.

    (Just today I got a call back from Basis Technology (http://www.basistech.com). I had asked them about their "Rosette Core Library for Unicode", which sounded promising. They said it does support surrogate pairs and composite characters, meaning it’s true Unicode and not a cheap UCS-2 knockoff. But today they called and said it’s not what I’m looking for.)

Here’s what you need to do to find a suitable place to break a Unicode stream of text. First, make sure the candidate break point sits between two complete Unicode Scalar Values: in UTF-16 a value outside the Basic Multilingual Plane is stored as a Surrogate Pair, and you must never split between its two halves (see the sketch below).
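A minimal sketch of that first check, assuming a well-formed UTF-16LE buffer already in memory (the function name is mine, not anything in Multi-Edit):

[code]
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* True if it is safe to cut a UTF-16 buffer immediately before code
   unit i. A low surrogate (0xDC00-0xDFFF) is the second half of a
   Surrogate Pair, so cutting in front of one would split a scalar value. */
bool on_scalar_boundary(const uint16_t *buf, size_t i)
{
    if (i == 0) return true;
    return buf[i] < 0xDC00 || buf[i] > 0xDFFF;
}
[/code]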

And there is one more level of complication if you want to break a Unicode stream into chunks. Once you’ve gone to all the trouble of finding two adjacent Unicode Scalar Values (either of which could be represented by a Surrogate Pair), you still have one more check to make, because you can’t break between just any two adjacent Unicode Scalar Values. You first have to determine the type of each value. There are 10 different types: OTHER, CR, LF, CONTROL, EXTEND, L, V, T, LV, or LVT. You determine the type of each Unicode Scalar Value (Grapheme) by deciphering the file http://unicode.org/Public/5.0.0/ucd/aux … operty.txt

Then, once you’ve determined the type of each of your two adjacent Unicode Scalar Values (Graphemes), you have to consult the Grapheme Break Chart to determine whether it’s OK to break between them. http://unicode.org/Public/5.0.0/ucd/aux … kTest.html
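To give a feel for what that chart encodes, here is my own rough condensation of the Unicode 5.0 default rules into C. It is a sketch of the published rules (UAX #29), not code from any shipping library, and it assumes the per-character type lookup (the hard part, from the property file above) already exists elsewhere:

[code]
#include <stdbool.h>

typedef enum { OTHER, CR, LF, CONTROL, EXTEND, L, V, T, LV, LVT } GBType;

/* Default grapheme cluster boundary rules (UAX #29, Unicode 5.0).
   Returns true if a break is permitted between types a and b.
   The rules must be tested in this order. */
bool can_break_between(GBType a, GBType b)
{
    if (a == CR && b == LF) return false;                 /* keep CR+LF together   */
    if (a == CONTROL || a == CR || a == LF) return true;  /* break after controls  */
    if (b == CONTROL || b == CR || b == LF) return true;  /* break before controls */
    if (a == L && (b == L || b == V || b == LV || b == LVT)) return false; /* Hangul */
    if ((a == LV || a == V) && (b == V || b == T)) return false;           /* Hangul */
    if ((a == LVT || a == T) && b == T) return false;                      /* Hangul */
    if (b == EXTEND) return false;        /* never break before a combining mark */
    return true;                          /* otherwise a break is allowed        */
}
[/code]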

If the chart says you can’t break, then you have to scan the stream forwards or backwards and repeat the whole ordeal.

    #8065

    John Martzouco
    Participant

    ugh! absolutely ugh!

It’s no wonder that nobody has taken up the challenge of building a library. Are Graphemes variable-length? It almost gives that impression from reading this. It’s hokey enough without that twist… but I can see the magnitude of trouble that variable length would throw in there.

    Luckily (I’m guessing) most files are going to fit inside that 210 MB limit.

    Once again, congratulations on unraveling this… it saves my butt every three or four months.

    #8069

    John Martzouco
    Participant

The attached file is padded with 0x00 bytes by the source (SQL Server Management Studio), but neither Notepad nor ME recognizes it as a Unicode file.

They both show it as 16-column-wide binary data.

Setting my TXT file type to [MSDOS Text] forces ME to show it in 3 long rows, which is good… but why doesn’t the conversion from Unicode get triggered?

    Is there some combination of characters I could inject to the file during its construction that will signal ME and Notepad that it is Unicode?

Thanks in advance (I know you’ll enjoy the challenge, David)
    SQL_Server_Trace.txt

    #8070

    deleyd
    Participant

Add these two bytes to the beginning of the file:
[code]
FF – 1st byte
FE – 2nd byte
[/code]

This is what they call a Byte Order Mark (BOM). It helps identify the file as a UTF-16LE Unicode file. Unfortunately, the standard says it’s not mandatory, so without it a program has to scan the file and guess what type it is. I did not add any guessing to the Unicode front end, because sometimes the guess is wrong. Try this:

1. Open up Notepad (not WordPad, Word, or any other word processor)
2. Type in this sentence exactly (without quotes): "Bush hid the facts"
3. Save the file to your hard drive
4. Close Notepad
5. Open the saved file in Notepad

You’ll see a bunch of squares instead of the text. Notepad incorrectly guesses that this is Unicode text. (If it doesn’t work, make sure there is no trailing CR/LF at the end of the line in the text file, and that you are saving the document as ANSI encoding.)
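The guess Notepad makes is the Win32 IsTextUnicode call, and "Bush hid the facts" is a famous false positive for it. Here is a minimal sketch of detect-by-BOM-then-guess; the function name and return strings are mine, not the front end’s actual logic:

[code]
#include <windows.h>

/* Classify a file's leading bytes. A BOM is definitive; anything else
   falls back to IsTextUnicode's statistical guess, which is exactly
   where short ANSI lines like "Bush hid the facts" can go wrong. */
const char *sniff_encoding(const unsigned char *p, int len)
{
    if (len >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return "UTF-8 (BOM)";
    if (len >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return "UTF-16LE (BOM)";
    if (len >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return "UTF-16BE (BOM)";
    return IsTextUnicode(p, len, NULL) ? "UTF-16LE (guessed)"
                                       : "ANSI (guessed)";
}
[/code]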

I recall John Martzouco is from Canada. Let me say it is truly frightening what is happening to our country. And even more alarming that most people aren’t noticing.

The president has assumed unprecedented powers:

• Eliminating habeas corpus and other safeguards of liberty
• Eliminating checks and balances and supporting the powers of the "unitary executive"
• Refusing to enforce selected provisions of the laws passed by Congress
• Wiretapping without a warrant
• and so on…

It all stems from the conservative ideology that Obedience is Freedom.

      This is a very different concept of the word "freedom". About the opposite of what our founding fathers had in mind, and the opposite of what the rest of the world thinks is freedom.

I’m reading the latest book by cognitive scientist and professor George Lakoff. This is probably the most important book of the century. He explains how Progressives are losing elections because they believe the 18th-century Enlightenment idea that people are rational.

However, poor people in conservative states have been voting against their own self-interest. How is this possible? Simple: people aren’t rational.

      Lakoff talks about how the brain really works and what we need to do to change people’s minds.

      (I’ll probably start a new general chat post on this.)

    #8071

    John Martzouco
    Participant

Our Prime Minister just extradited one of his own ministers to your government because of the sale of cannabis seeds. I haven’t read the details, but this is what I’ve picked up from those around me (I don’t read the papers).

Our PM is an ally of your war-mongering President, and the slippery slope began here the day he was elected. Our country is not the proud nation we knew in our youth.

We have this thing called minority government here… a tri-partisan parliament. Because of the backlash over these latest actions, the government has been forced to call a premature federal election three-quarters of the way into its four-year term.

    At this time, there is not a single Canadian political party that has any dignity.

    #8072

    John Martzouco
    Participant

    It worked like a charm David!

    I’ll add the two characters to the macro I use to automatically open that file.

    Man, I absolutely have to tell you that you are the best reason to buy Multi-Edit! Thanks for the great work and the help as always!

    #8073

    deleyd
    Participant

You can also add /FT=<FileType> to the call to uni$load_file. The options are:

• UNI$ftNOTUNICODE
• UNI$ftUTF8
• UNI$ftUTF16LE
• UNI$ftUTF16BE

These are defined in Unicode.sh; 0 = not specified. UTF-16LE is what Microsoft considers regular Unicode. If no type is specified, then we look for a Unicode Byte Order Mark (BOM) at the beginning of the file.

uni$load_file is in Unicode.s.
The call to uni$load_file is in LdFiles, in MeSys.s.

    #8074

    John Martzouco
    Participant

    Good thing this is in General Chat 8)

    #8081

    John Martzouco
    Participant

    Hey David,

    I call
[code]
RM("uni$load_file /FT=UNI$ftUTF16LE /NM=1 /FN=" + strLogFile); // Load the file
[/code]

but I still get 16-character columns. Am I using the parameters correctly? I’ll try a couple of the other designators. Thanks.

    #8082

    John Martzouco
    Participant

    I bet that it’s this:
[code]
// if file type is specified as Binary then skip all this and just load
if ( Copy( Line_Terminator, 1, 1 ) == "\x00" ) {
   Goto load;
}
[/code]

    I’ll add a bypass parameter in the .s to force Unicode conversion even if it looks binary.

    #8086

    John Martzouco
    Participant

    David,

    You’re missing the Load_File_Name assignment when the branch is bypassed:
[code]
else {  // [JM Sep 02 2008]
   // "success" doesn't mean we converted Unicode; it just means we handle
   // the return code here. If this bit is not set, then we display the
   // message returned to us in errbuf.
   if ( success ) {
      switch ( ExitStatus ) {
         case UNI__UNICODETRANSLATED:       // Unicode was converted
            Load_File_Name = Out_File_Name; // Change so we load converted file
            break;
      }
   }
}
[/code]

    #8112

    deleyd
    Participant

[quote]
I bet that it’s this:
[code]
// if file type is specified as Binary then skip all this and just load
if ( Copy( Line_Terminator, 1, 1 ) == "\x00" ) {
   Goto load;
}
[/code]
I’ll add a bypass parameter in the .s to force Unicode conversion even if it looks binary.
[/quote]
I was looking at the code in Unicode.s, wondering how Line_Terminator gets set when we haven’t yet loaded the file. It looks like LdFiles in MeSys.s calls ExtSetup before calling uni$load_file, and ExtSetup sets Line_Terminator according to the file extension, unless you explicitly specified a file type for LdFiles.

    Except usually the setup for a file extension specifies "Auto Detect". Not sure how that works. Gotta load the file before you can do an auto detect on it.

Looks like in MeSys.s ExtSetup:
[code]
else {
   Line_Terminator = '|13|10';
   if ( Jx == 0 ) {
      FileType_Override = 255;
   }
}
[/code]
it defaults the file type to CR/LF if it’s not specified. I wonder where Auto Detect happens? My guess is that setting the undocumented FileType_Override to 255 tells XLoad_File to perform an auto-detect.

    #8299

    John Martzouco
    Participant

    I’ve recently started working with Visual Studio 2008. It looks like every single file generated by this generation of the tool is in Unicode.

    On top of that, a lot of my current work is in French, so I really do need the Unicode support.

    Unlike the past, when I only needed to convert a couple of files, I’m in a situation where I need an editor that can handle Unicode characters full-time.

    Dan, what does the future look like for Unicode in ME? What is the timeline for a Unicode product?

    Thanks,
    John

    #8300

    deleyd
    Participant

    I found a Unicode package that may be what we need to add true unicode support to ME.

    The package is ICU4J version 4.0 released July 2, 2008. (There may be an update by now.)

    http://icu-project.org/

    See ICU User Guide:
    http://download.icu-project.org/files/i … rguide.zip

    (J is the Java version, which I think in the long run will be less troublesome than the corresponding C version.)
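ICU exposes exactly the "where may I break?" primitive that was missing earlier in this thread. As an example, here is a minimal sketch using the C API (ICU4C); ICU4J offers the same facility as BreakIterator in Java. The compile command is an assumption about a typical install:

[code]
#include <stdio.h>
#include <unicode/ubrk.h>

/* Compile with something like: gcc graphemes.c -licuuc */
int main(void)
{
    /* "e" + combining acute accent + "t": the accent pairs with the "e"
       to form one grapheme, so the legal break offsets are 0, 2, 3. */
    UChar text[] = { 0x0065, 0x0301, 0x0074, 0x0000 };
    UErrorCode status = U_ZERO_ERROR;

    UBreakIterator *bi = ubrk_open(UBRK_CHARACTER, "en_US", text, 3, &status);
    if (U_FAILURE(status)) return 1;

    for (int32_t p = ubrk_first(bi); p != UBRK_DONE; p = ubrk_next(bi))
        printf("legal break at code unit %d\n", p);

    ubrk_close(bi);
    return 0;
}
[/code]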

    ————————-

I also had the idea of a quick interim fix: if a file only occasionally has some Unicode characters, we could make a pop-up window that displays such a line in full glory and allows some primitive editing on it.

    ————————-

A true Unicode overhaul requires a strategy decision: do we internally use UTF-8 or UTF-16? Microsoft Windows wants everything UTF-16. All Linux editors I’ve seen use UTF-8. I’m leaning towards UTF-8. UTF-16 made sense when Unicode was just 16-bit characters, but now, with surrogate pairs and combining characters, the UTF-16 advantage is lost, and it becomes just a way to use twice as much memory and disk space. (For instance, a character beyond the Basic Multilingual Plane such as U+1D11E takes four bytes either way: F0 9D 84 9E in UTF-8, or the surrogate pair D834 DD1E in UTF-16.)

(I wonder if Multi-Edit on Linux via CodeWeavers will support Unicode? Probably not. I’m working on a Lnx Add-On that will get around all the problems with tabbed dialogs.)

    —David Deley
    http://members.cox.net/deleyd/

