Regex match multiline blocks between two identifiable lines

  • #853
    OChristiaanse
    Participant

    I’m looking for an expression to match the bolded text. Note that the length of the blocks varies from 2 up to 200 lines, so repeatedly running a search and replace over a fixed number of lines isn’t an option.

    I tried
    ^.@fouten tijdens.@$(^.@nivo 1.@$)@^.@deze foutmutatie.@$
    but the @ doesn’t seem to work on grouped expressions, only on single characters.

    In fact this expression matches only the first block of lines, so I guess that the @ in )@^ is ignored.

    Should I use another regex plugin? In http://www.delorie.com/gnu/docs/regex/regex_toc.html
    there is a mention of an operator that repeats whole expressions in a match (zero, one or more times, the same as for characters).

    Any suggestions?

    By the way, I’m using ME 8.

    35; Fouten tijdens verwerking.
    35; Fout Nivo 1,(4040204) in attribuut ;JAARPREMIEWERKGEVER;
    35; Informatie,( 4050136) Deze foutmutatie Afstand TPP op 01 Jan 2002.

    62; Fouten tijdens verwerking.
    62; Fout Nivo 1,(4040204) attribuut ;BEREIKBAAR_AOPA;
    62; Fout Nivo 1,(4040204) attribuut ;BEREIKBAAR_WZP;
    62; Fout Nivo 1,(4040204) attribuut ;OPGEBOUWD_WZP;
    62; Fout Nivo 1,(4040204) attribuut ;CODEREGLEMENT;
    62; Fout Nivo 1,(4040204) attribuut ;SALARISJAAR;
    62; Fout Nivo 1,( 4040282) Aanvang deelname TNBP
    62; Fout Nivo 1,( 4040284) Aanvang deelname TNBP
    62; Fout Nivo 1,(4040204) attribuut ;DATUMHUIDIGESTAND_MUTATIE;
    62; Fout Nivo 1,(4040204) attribuut ;DATUMHUIDIGE;
    62; Fout Nivo 1,(4040204) attribuut ;BEREIKBAAR_VP;
    62; Fout Nivo 1,(4040204) attribuut ;DIENSTJARENTOEKOMST;
    62; Fout Nivo 1,(4040204) attribuut ;CUMPREMIEVERZEKERDE;
    62; Fout Nivo 1,(4040204) attribuut ;PENSIOENGRONDSLAGBEREKENING;
    62; Informatie,( 4050136) Deze foutmutatie Afstand TPP op 01 Jan 2002.

    80; Fouten tijdens verwerking.
    80; Waarschuwing Nivo 1,( 4040205) Het AOPA op polis
    80; Waarschuwing Nivo 1,( 4040206) Het AOPB is nog niet

    219; Fouten tijdens verwerking.
    219; Fout Nivo 1,(4040204) in attribuut ;OPGEBOUWD_NBP;
    219; Fout Nivo 1,(4040204) in attribuut ;STATUS_NBP;
    219; Fout Nivo 1,(4040204) in attribuut ;OPGEBOUWD_WZP;
    219; Fout Nivo 1,(4040204) in attribuut ;STATUSPOLIS;
    219; Fout Nivo 1,(4040204) in attribuut ;BEREIKBAAR_NBP;
    219; Fout Nivo 1,(4040204) in attribuut ;BEREIKBAAR_WZP;
    219; Fout Nivo 1,(4040204) in attribuut ;STATUS_WZP;

    228; Fouten tijdens verwerking.
    228; Waarschuwing Nivo 1,( 4040295) Bij deze mutatie afstand TNBP

    232; Fouten tijdens verwerking.

    237; Fouten tijdens verwerking.

    #3267
    OChristiaanse
    Participant

    I received a hint to look in the archive and found a solution there:

    ^.@fouten.@$(.@fout.@$.@)@.@informatie.@$

    But I don’t yet understand the error I made in my previous expression.

    #3278
    ReidSweatman
    Participant

    First, it’s unnecessary to repeat the “start-of-line anchor” (^), and depending on exactly how the regex engine parses that symbol, doing so may result in a non-functional regex. Start-of-line is already implied by the preceding “end-of-line anchor” ($).

    Likewise, Multi-Edit’s current regex engine occasionally has trouble with end anchors nested in repeats, although it’s not clear exactly what pattern causes the problem.

    The revised version you posted makes better use of domain-specific knowledge to achieve a shorter regex, noting that the sought text pattern always begins with fouten and ends with informatie. Since there’s an internal limit of roughly 240 characters on the length of a regular expression after any aliases are expanded, it’s important to try and keep the length as short as is commensurate with an accurate match.

    Realistically, the key changes you made were dropping the internal start anchors and adding the “minimal munch, any characters” after the end anchor within the group parentheses.

    Someone (don’t remember who, or I’d credit him) a couple of years back came up with an alias to match two symbols possibly separated by multiple lines. It’s defined as
    (.@$.@)@.@
    and commonly added to the alias list in Multi-Edit as <ml>. Then you use it like so:
    symbol1<ml>symbol2
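For readers more used to Perl-style syntax, here is a rough sketch of what the <ml> alias does, written in Python only as a stand-in. It assumes that .@ is a minimal match of any characters and that $ consumes the line break; the helper name between() and the sample text are made up for illustration. The translation is slightly simplified to "any number of whole lines, then the shortest run up to symbol2", which accepts the same spans with less backtracking.

```python
import re

# Assumed Perl-style counterpart of the Multi-Edit alias (.@$.@)@.@ :
# zero or more whole lines (minimal), then as few characters as possible.
ML = r"(?:.*\n)*?.*?"


def between(symbol1, symbol2):
    """Compile 'symbol1<ml>symbol2' as a Perl-style pattern (hypothetical helper)."""
    return re.compile(re.escape(symbol1) + ML + re.escape(symbol2))


if __name__ == "__main__":
    text = "alpha start one\ntwo\nthree end omega\n"
    match = between("start", "end").search(text)
    print(match.group(0))
```

Because both parts of the pattern are minimal, the match stops at the first occurrence of symbol2 after symbol1, whether it is on the same line or several lines further down.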

    I’m not sure exactly what you’re asking in re repeat operators; the standard minimal- and maximal-munch versions of “match any number, including zero” (the so-called “Kleene Star”) and “match any number, but at least one” are implemented. The operators for “match zero or one” and “match from m to n” are not (although see the next paragraph).

    As for using a regex plugin, don’t bother. The 9.1 upgrade will include full Perl-standard regular expression support using a completely different engine than the current one. Speaking personally, that’s one of my strongest motivations to upgrade, as I use regular expressions constantly. Up to this point, whenever I ran across a regex I wanted to write that was too complex for Multi-Edit’s search engine, I ran ActivePerl externally to get Perl regexes. With 9.1, it’s native. There are a lot of nice new features in 9.1, but this is the big one for me.

    #3287
    OChristiaanse
    Participant

    Quote: First, it’s unnecessary to repeat the “start-of-line anchor” (^), and depending on exactly how the regex engine parses that symbol, doing so may result in a non-functional regex. Start-of-line is already implied by the preceding “end-of-line anchor” ($).

    Thanks, I will remember that.
    I put the first ^ at the front of my search because otherwise the search would start with a .@, and that performs very badly.
    The file I start with is 140 MB, so performance is a must. (Normally I finish with 5 MB.)

    Quote: Likewise, Multi-Edit’s current regex engine occasionally has trouble with end anchors nested in repeats, although it’s not clear exactly what pattern causes the problem.

    Maybe this will help:
    fouten.@(.@$)@.@informatie
    should (in my opinion) match a block that starts with a line containing ‘fouten’ and ends with a line containing ‘informatie’, but it fails.
    Changing it into
    fouten.@(.@$.@)@.@informatie
    does the trick. BUT WHY?

    Quote: … and commonly added to the alias list in Multi-Edit as <ml>. Then you use it like so:
    Thanks! Although in this situation I’m unable to use it, because each line must be preceded by the same identifier, or alternatively, it must not be an empty line. And <ml> won’t do that.
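As an illustration of the "same identifier on every line" requirement, the sketch below uses Python's re module purely as a stand-in for a Perl-style engine; it is not Multi-Edit syntax. It assumes the identifier is the number before the first semicolon, that a block runs from a "Fouten tijdens" line through any "Nivo 1" lines to a "Deze foutmutatie" line, and that matching is case-insensitive. The names BLOCK and SAMPLE are made up for the example.

```python
import re

# A backreference (?P=id) forces every line of the block to repeat the
# identifier captured on the opening line.
BLOCK = re.compile(
    r"^(?P<id>\d+);.*fouten tijdens.*\n"   # opening line, captures the id
    r"(?:(?P=id);.*nivo 1.*\n)*"           # any number of lines with the same id
    r"(?P=id);.*deze foutmutatie.*$",      # closing line with the same id
    re.IGNORECASE | re.MULTILINE,
)

SAMPLE = (
    "35; Fouten tijdens verwerking.\n"
    "35; Fout Nivo 1,(4040204) in attribuut ;JAARPREMIEWERKGEVER;\n"
    "35; Informatie,( 4050136) Deze foutmutatie Afstand TPP op 01 Jan 2002.\n"
    "80; Fouten tijdens verwerking.\n"
    "80; Waarschuwing Nivo 1,( 4040205) Het AOPA op polis\n"
    "232; Fouten tijdens verwerking.\n"
)

if __name__ == "__main__":
    for match in BLOCK.finditer(SAMPLE):
        print(match.group("id"))
        print(match.group(0))
```

Blocks that never reach a "Deze foutmutatie" line, such as the one for identifier 80 in the sample, are left alone. An empty line also ends a block, because it cannot start with the captured identifier.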

    #3293
    ReidSweatman
    Participant

    Yes, the anchor at the start of the regex is necessary, and not a problem; it’s the ones following end anchors that don’t work.

    I examined your original patterns. I’ve run into this in somewhat different form myself on a couple of occasions. I’ve no doubt this behavior is coded into the regex engine. The upcoming v9.1 release will have an entirely different and more powerful regex engine, so this will cease to be an issue soon.

    As for the alias I suggested, if the particular pattern that has to occur at the start of intervening lines is quite common in your files, why not create a second alias just for that? Effectively, you’ve pretty much done that already with your existing regex; the alias given is more general-purpose, but could easily be slightly modified to be specific to your files, if common enough to be worthwhile.

    If you find such expressions growing in complexity to the point where the search engine breaks on them, you can always write a short macro. This has the advantage of taking something that would require an involved regex to match and breaking it into two or more searches done procedurally. Check the documentation for the Find_Text() macro, which is the one you’d use for regex searches within macros. It’s quite useful. In creating such a macro, you’d basically be doing the same thing you’d do if you did the search by hand. Something like, find the opening pattern, mark the location, look for the closing pattern, and if you find it, mark a line block, then write a loop to step through each intervening line, checking that each fits the required pattern. After you’ve written one of these kinds of things, you’ll find yourself writing them for quite a few odd tasks that won’t quite surrender to a straight regex.
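The procedural approach described above can be sketched as follows. This is Python rather than Multi-Edit's macro language, so it only shows the logic and makes no use of Find_Text(); the function name find_blocks and the three search strings are assumptions taken from the patterns earlier in the thread.

```python
def find_blocks(lines, opening="fouten tijdens", middle="nivo 1",
                closing="deze foutmutatie"):
    """Return (first, last) index pairs of blocks found procedurally.

    Steps: find the opening pattern, look for the next closing pattern,
    then check every intervening line against the required pattern.
    """
    blocks = []
    i = 0
    while i < len(lines):
        if opening not in lines[i].lower():
            i += 1
            continue
        # Look for the closing pattern after the opening line.
        end = next((j for j in range(i + 1, len(lines))
                    if closing in lines[j].lower()), None)
        if end is None:
            break  # no closing line left, so no further blocks are possible
        # Step through each intervening line and check that it fits.
        if all(middle in lines[k].lower() for k in range(i + 1, end)):
            blocks.append((i, end))
            i = end + 1
        else:
            i += 1
    return blocks


if __name__ == "__main__":
    sample = [
        "35; Fouten tijdens verwerking.",
        "35; Fout Nivo 1,(4040204) in attribuut ;JAARPREMIEWERKGEVER;",
        "35; Informatie,( 4050136) Deze foutmutatie Afstand TPP op 01 Jan 2002.",
        "232; Fouten tijdens verwerking.",
    ]
    print(find_blocks(sample))
```

This version rescans for the closing line after every failed candidate, which is fine for a sketch but could be slow on a very large file; a real macro would probably want to remember the last closing position it found.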
