How should we implement the Unicode editing core?


This topic contains 11 replies and was last updated by Kriten 8 years, 1 month ago.

    #2517

    DanHughes
    Participant

    We would like to know how important it is to keep the feel of the current editing core, or whether we could switch to another editing component at the core of the next version of Multi-Edit.

    #8368

    deleyd
    Participant

    Unicode is a real bear to work with. I found that out when I wrote the Unicode translator front end. I originally wanted to make it translate a file of any size, the usual way: break the input into manageable blocks and process each block. Breaking up the stream turned out to be way more complicated than expected.

    You Can’t Just Break the Stream Anywhere
    It’s a bear to find a place in the stream where you can break. You can’t just break at any arbitrary 16-bit word boundary. There are "Surrogate Pairs", two 16-bit words that go together, and you don’t want to break that in half. And that’s just the beginning.
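
    As a tiny illustration of why (plain Java here, purely because it stores text as UTF-16; nothing Multi-Edit specific):
    [code]
    public class SurrogateDemo {
        public static void main(String[] args) {
            String s = "x\uD83D\uDE00y";   // 'x', then U+1F600 (one code point, two 16-bit units), then 'y'
            // Index 2 falls between the two halves of the surrogate pair, so a block
            // boundary there would tear the character apart.
            System.out.println(Character.isHighSurrogate(s.charAt(1)));  // true
            System.out.println(Character.isLowSurrogate(s.charAt(2)));   // true
            System.out.println(s.length() + " units, " + s.codePointCount(0, s.length()) + " code points"); // 4 units, 3 code points
        }
    }
    [/code]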

    There are also Composite Characters, where you have a base character followed by a number of combining marks that go with it. For example, you could have the letter A followed by a combining tilde to put on top of the A, and that can be followed by an unlimited number of additional marks attached to the same letter.
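
    The same kind of check, this time for the combining marks just described (again only an illustrative Java sketch):
    [code]
    public class CombiningDemo {
        public static void main(String[] args) {
            String s = "A\u0303\u0301";   // 'A' + COMBINING TILDE + COMBINING ACUTE ACCENT
            // One user-perceived character built from three code points; splitting after
            // the 'A' would strand the marks with no base character.
            System.out.println(s.codePointCount(0, s.length()));                               // 3
            System.out.println(Character.getType(s.charAt(1)) == Character.NON_SPACING_MARK);  // true
            System.out.println(Character.getType(s.charAt(2)) == Character.NON_SPACING_MARK);  // true
        }
    }
    [/code]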

    Illegal degenerate streams
    Then there’s the possibility of an illegal sequence of 16-bit words, and other degenerate cases, such as nothing but an endless stream of combining marks with no base character to begin it all.
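
    One way such degenerate input shows up in practice: a strict converter simply refuses it. A minimal Java sketch (the JDK encoder’s default error action is to report):
    [code]
    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.StandardCharsets;

    public class DegenerateStream {
        public static void main(String[] args) {
            try {
                // An unpaired high surrogate is not legal UTF-16, so encoding it fails.
                StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap("\uD800"));
                System.out.println("encoded fine");
            } catch (CharacterCodingException e) {
                System.out.println("malformed input: " + e);   // this branch is taken
            }
        }
    }
    [/code]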

    Grapheme Break
    To break up the stream, assuming the stream itself isn’t an illegal degenerate sequence, you have to find a grapheme break. I started writing code to do that, pushed and pushed to get something that worked, and eventually just threw it all away. That’s why the Unicode front end does it all at once, and if the input file is too big, too bad.

    Unicode Core Package
    Someone else must have made a code package that handles all these Unicode problems. There are numerous editors in the free Linux world which support Unicode to some degree.

    Last September I finally found such a package (which I think is free even for commercial use; you can check that):

    The package is ICU4J version 4.0 released July 2, 2008.

    http://icu-project.org/

    See ICU User Guide:
    http://download.icu-project.org/files/i … rguide.zip

    (J is the Java version, which I think in the long run will be less troublesome than the corresponding C version.)
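
    For what it’s worth, here is a sketch of how the “find a safe place to break the stream” problem described above might look with ICU’s grapheme-cluster BreakIterator (assuming icu4j is on the classpath; java.text.BreakIterator in the JDK has the same methods, though ICU tracks the Unicode rules more closely):
    [code]
    import com.ibm.icu.text.BreakIterator;

    public class SafeSplit {
        // Return the largest grapheme-cluster boundary at or before 'desired', so a
        // buffer can be cut there without splitting a surrogate pair or separating
        // a base character from its combining marks.
        static int safeSplitPoint(String text, int desired) {
            BreakIterator bi = BreakIterator.getCharacterInstance();
            bi.setText(text);
            return bi.isBoundary(desired) ? desired : bi.preceding(desired);
        }

        public static void main(String[] args) {
            String s = "A\u0303\uD83D\uDE00e\u0301";   // A+tilde, an emoji, e+acute: six code units, three clusters
            System.out.println(safeSplitPoint(s, 3));  // 2 -- index 3 is inside the emoji's surrogate pair
        }
    }
    [/code]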

    UTF-8 or UTF-16?
    Linux programs generally use UTF-8 for Unicode; Windows uses UTF-16. I’m leaning towards UTF-8 as the better choice. It saves memory internally, which could be a big performance boost: fewer page faults could greatly improve speed, even with the extra conversion from UTF-8 to UTF-16 whenever you call a Windows system routine. Overall it could be much faster. (03/27/2009: Yes, there are no NULL bytes in UTF-8, with the single exception of the NULL character itself; UTF-8 byte values 0 – 127 are still the standard ASCII characters.) That’s probably the biggest design decision to make, and it will affect the long-term future of Multi-Edit.
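
    A quick way to see both points, as a sketch in plain Java (illustration only, not Multi-Edit code; Java strings are UTF-16 internally, so decoding UTF-8 into a String is essentially the conversion mentioned above):
    [code]
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class Utf8VsUtf16 {
        public static void main(String[] args) {
            String src = "for (int i = 0; i < n; i++) {}";        // typical mostly-ASCII program text
            byte[] utf8  = src.getBytes(StandardCharsets.UTF_8);
            byte[] utf16 = src.getBytes(StandardCharsets.UTF_16LE);
            System.out.println(utf8.length + " bytes as UTF-8, " + utf16.length + " as UTF-16"); // 30 vs 60

            // ASCII keeps its one-byte values in UTF-8, and no byte is zero unless the
            // text really contains U+0000.
            System.out.println(Arrays.toString("abc".getBytes(StandardCharsets.UTF_8)));  // [97, 98, 99]

            // Decoding UTF-8 back into a Java String is the UTF-8 -> UTF-16 conversion
            // you would do before calling a Windows "W" API.
            System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(src));     // true
        }
    }
    [/code]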

    #8375

    AndyColson
    Participant

    I’m not sure how to vote here, but I do have one caution:

    if you do use a component like Scintilla, will we lose partial file loading? I love being able to open a 20 MB file across the network and have it come up instantly.

    As long as I’m not losing too much functionality, I don’t really care.

    -Andy

    #8381

    daantje
    Participant

    I agree with Andy: as long as partial loading is not broken, the rest is fully up to you.

    Daniël

    #8385

    samej71
    Participant

    Notepad++ uses Scintilla, and I tested it out with a large file. It loads the entire thing at once. Granted, this could just be how N++ implemented things and not a consequence of using Scintilla, but I thought I’d mention it.

    Other than that, it seems like Scintilla is fast and able to support a number of features that would be used/useful within ME.

    #8407

    pschwalm
    Participant

    Hi,

    This is only a small remark on grapheme breaking. I’m not sure if I can answer at the level you expect. Anyway:

    I’m using Python, which has a module called "unicodedata" in the standard library. With it you can query the attributes of any Unicode character. The functions "category" and "combining" might be of interest to you.

    I suppose you’re not using Python, but it might be possible to use parts of the library. It’s written in C.
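
    For anyone going the Java/ICU4J route suggested earlier in the thread, roughly the same per-character queries are available there too. A minimal sketch (the UCharacter call is taken from ICU4J’s documented API; treat the exact names as something to verify):
    [code]
    import com.ibm.icu.lang.UCharacter;

    public class CharProps {
        public static void main(String[] args) {
            int tilde = 0x0303;   // COMBINING TILDE
            // Rough analogue of unicodedata.category(): the general category is "Mn".
            System.out.println(Character.getType(tilde) == Character.NON_SPACING_MARK);  // true
            // Rough analogue of unicodedata.combining(): the canonical combining class.
            System.out.println(UCharacter.getCombiningClass(tilde));                     // 230
        }
    }
    [/code]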

    Greetings from
    Peter Schwalm

    #8408

    igtorque
    Participant

    The first time I missed Unicode support in ME was a couple of years ago, when I translated a module of a PHP-based forum program into Spanish. It was written in UTF-8, without a BOM marker. I ended up using PSPad, which is fine, but it takes too long to load and lacks a lot of functionality compared with ME.

    In other words, although I program mainly on Windows hosts, I need UTF-8 more than UTF-16.

    So I would be happy to have an ME which can edit UTF-8 files (with or without BOM) as soon as possible. I voted for the first choice because, from the outside, it looks like the fastest path, but that is only an ill-informed guess…

    #8414

    curtm
    Participant

    Unfortunately, Unicode has become unavoidable.

    Re-write the core.

    #8431

    John Martzouco
    Participant

    I don’t know if it’s Visual Studio 2008 or my team’s setup (we work in French) but every *.cs file in our project is Unicode now.

    I use ME for about 2% of my work now.

    6 months ago, I used it 98% of the time (different team).

    Go figure how I feel about this.

    We’ve been asking you for Unicode for about three years now; I hope that this time, you really *will* act on this. If this is smoke again, you’ll lose me on the basis that I’ve given up believing that you will ever follow up on your promises. You’ve promised before. Please be considerate of us.

    #8434

    dynalt
    Participant

    Since editing is the core of MultiEdit, I wouldn’t recommend using anyone else’s code. If editing were incidental, then perhaps. GUIs fall into that category.

    Unless there is a need to be multi-platform, the Delphi VCL is just fine and likely easier to deal with from Delphi than any third-party toolkit.

    Marco Cantu’s "Delphi 2009 Handbook" (http://www.marcocantu.com/dh2009/) covers Unicode support in Delphi 2009 in depth, and claims that Delphi 2009 provides the full support that Unicode applications demand. He has many guidelines for upgrading existing code to Unicode, including code examples and performance measurements.

    From my reading, Unicode is never going to be fun, but Delphi 2009 has all the tools to ease the pain significantly.

    #8459

    deleyd
    Participant

    Found this. —D.D.

    The migration of existing Delphi code is going to be a PITA for many, many projects, since CodeGear took the strange – and IMO wrong – decision of defaulting the aptly named PChar to PWideChar starting with Delphi "Tiburon".

    Here is how PCHAR is declared in the Windows SDK inside WinNT.h:
    [code]typedef CHAR *PCHAR, *LPCH, *PCH;[/code]
    For those not literate enough in C, here is what this means in Delphi:
    [code]type PCHAR = ^CHAR;
    type LPCH = ^CHAR;
    type PCH = ^CHAR;[/code]
    Since the Delphi language (and Pascal) is case-insensitive, you now have the definition of Delphi’s PChar as everyone else in the Windows development community understands it.

    Do we have a problem here? We sure do. While in the C/C++ world – and remember that this also includes C++ Builder – TCHAR is the basis for any pointer type that can be used for zero-terminated character strings, PCHAR was clearly meant to be ANSI only. With the preprocessor symbols _UNICODE and UNICODE defined, any Windows C/C++ preprocessor that can be used with the Windows SDK will happily replace every occurrence of TCHAR with WCHAR, which is the counterpart to WideChar in Delphi and is also defined in Windows.pas. On the other hand, if those preprocessor symbols are not defined, TCHAR resolves to CHAR, which resolves to char, the intrinsic type used for one-byte characters – read: ASCII/ANSI. So, consequently, the widely known naming scheme has been broken by CodeGear in order to break legacy code only partially.

    A rant concerning Unicode and Delphi
    http://blog.delphi-jedi.net/2008/05/10/ … nd-delphi/

    #9086

    Kriten
    Participant

    I think we should write a new editing core.

