Find Duplicates and Delete


  • #1598
    Carloche
    Participant

    This macro lets you quickly remove duplicate entries from a file. Files such as email lists, dictionary files, etc. can be cleaned of all duplicate entries in seconds with a single run of the FindDup macro.
    FindDup.s

    #5796
    EEAnderson
    Participant

    Unless I am mistaken, this macro would require that the file be presorted so that all duplicate lines are grouped together.

    Of course, when sorted, the macro runs on the order of O(n) or somewhere like that. Unsorted, with a brute-force method, it would be closer to O(n^2).

    Perhaps an initial sort at the top of the macro?

    Just a thought,
    EEA
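The sort-first idea can be sketched as follows. This is Python, not CMac, and the function name `dedupe_sorted` is made up for illustration. Sorting costs O(n log n); after that, a single pass only has to compare each line with the last line it kept.

```python
def dedupe_sorted(lines):
    """Sort the lines, then drop any line equal to the one kept before it."""
    result = []
    for line in sorted(lines):
        # After sorting, duplicates sit next to each other,
        # so one comparison per line is enough.
        if not result or line != result[-1]:
            result.append(line)
    return result


print(dedupe_sorted(["bb", "aa", "cc", "aa"]))
```

The trade-off is that the original line order is lost, which may or may not matter for something like an email list.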

    #5797
    EEAnderson
    Participant

    Interestingly, I found the following macro on the TWiki site for ME

    http://www.multieditsoftware.com/twiki/pub/Main/UserScriptLibrary/dups.s

    It appears to have been written by Andy Colson.

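    [code]
    void removedups()
    {
        int start = c_line;            // line the cursor is on when the macro starts

        while (true)
        {
            str s = get_line();        // text of the current line
            down;
            // Delete every later line in which the text s is found.
            while (Find_Text(s, 0, 0))
            {
                del_line;
            }
            // Move on to the next line after the one just processed.
            start = start + 1;
            goto_line(start);
            if (at_eof) {
                break;
            }
        }
    }
    [/code]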

    #5798
    AndyColson
    Participant

    Ya know… Looking at that macro, I think it’d have to be sorted to work correctly.

    Especially this:

    [code]
    if ( Line1 == Line2 )
    {
        ++Count;
        Up;
        Del_Line; // More information can be found on Del_Line on page 220 of the CMac Users Guide
    }
    [/code]

    If we find a match, we go up a line and Del_Line. The only way that’ll delete is if they are in sorted order.

    I made myself a little test file like:

    aa
    bb
    aa
    cc

    and it didn't seem to find any dups at all. When it's sorted, it finds one dup.

    Also, I note that it doesn't do the entire file. It uses the line that the cursor is on and only finds dups of that line below the cursor.

    -Andy
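The experiment above can be reproduced with a small Python sketch (not the FindDup macro itself; `count_adjacent_dups` is a hypothetical helper that only mimics the "compare with the neighbouring line" logic):

```python
def count_adjacent_dups(lines):
    """Count lines that are identical to the line directly above them."""
    return sum(1 for prev, cur in zip(lines, lines[1:]) if prev == cur)


test_file = ["aa", "bb", "aa", "cc"]

print(count_adjacent_dups(test_file))          # unsorted: the two "aa" lines are not neighbours
print(count_adjacent_dups(sorted(test_file)))  # sorted: the two "aa" lines are now adjacent
```

With the unsorted input the two "aa" lines are never compared with each other, which matches the "no dups found" result.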

    #5799
    AndyColson
    Participant

    Hum… Looking at my old macro (removedups)… I don't think it's right either.

    If you had a line like:

    ‘aa’

    then a line like

    ‘this line contains aa as well’

    they’d be seen as matching. My find_text call is a little too generic. It should probably do a regex on ^s$ (to make sure the entire line matches, from beginning of line to end of line).

    -Andy
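To illustrate the difference between a substring search and a whole-line match, here is a minimal Python sketch (Python's `re` module, not Multi-Edit's search; `is_whole_line_match` is a name invented for this example). `re.fullmatch` plays the role of anchoring the pattern with ^ and $, and `re.escape` keeps characters such as `.` or `*` in the line from being read as regex operators.

```python
import re


def is_whole_line_match(needle, line):
    """True only if `line` consists of exactly `needle`, nothing before or after."""
    # fullmatch requires the pattern to cover the entire string,
    # which is the effect of wrapping it in ^ ... $.
    return re.fullmatch(re.escape(needle), line) is not None


needle = "aa"
other = "this line contains aa as well"

print(needle in other)                      # substring search: counts as a hit
print(is_whole_line_match(needle, other))   # whole-line match: rejected
print(is_whole_line_match(needle, "aa"))    # identical line: accepted
```

Escaping matters for real data: an unescaped line such as `a.c` would otherwise also match `abc`.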

    #5956
    CharlesG
    Participant

    This file will remove all duplicates. If you find it doesn't, please let me know.
    DelDups.s

    #5957
    Ernie Zapata
    Participant

    This macro seems to remove all blank lines and does not appear to handle lines beginning with tab characters. Take for example the following data:

    [code]
    This is not a test.
    This is not a test 0.

    This is a test.
    This is a test.

    This is not a test 1.
    This is not a test 2.
    This is not a test 3.
    This is not a test 4.

    This is a test.
    This is not a test 5.
    This is not a test 6.
    This is not a test 7.
    This is not a test 8.

    This is a test.
    [/code]

    I have attached the data file as a zip file.

    All blank lines in the example have no white-space characters, just blank lines. Those lines with the text “This is a test.” begin with a tab character, not spaces.

    Running the macro against this sample data results in all blank lines being removed and none of the duplicate “<tab>This is a test.” lines being removed.
    data.zip
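For comparison, this is roughly the behaviour the report seems to expect, sketched in Python rather than CMac. Two assumptions are made here that are not confirmed anywhere in the thread: blank lines should be left alone, and lines are compared exactly, so a leading tab is part of the line. `dedupe_keep_blanks` is a hypothetical name.

```python
def dedupe_keep_blanks(lines):
    """Remove repeated lines, keeping the first occurrence and all blank lines."""
    seen = set()
    result = []
    for line in lines:
        if line == "":
            # Assumption: empty lines are separators, not duplicates.
            result.append(line)
            continue
        if line in seen:
            # Exact comparison, so "\tThis is a test." only matches
            # another line that also starts with a tab.
            continue
        seen.add(line)
        result.append(line)
    return result


sample = [
    "This is not a test.",
    "",
    "\tThis is a test.",
    "\tThis is a test.",
    "",
    "\tThis is a test.",
]
print(dedupe_keep_blanks(sample))
```

Because membership in a set is checked in roughly constant time, this also avoids sorting and keeps the original order of the remaining lines.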

    #6117
    deleyd
    Participant

    I just released my EDX 3.0 package, which includes EDX NWS Sort. You can select:
      • Mark Lines with Duplicate Keys
      • Weed Out Duplicate Keys
      • Delete ALL Lines with Duplicate Keys
      • Keep ONLY the Lines with Duplicate Keys
      • Summary Sort: Keep Keys and counts only
    as well as do an ordinary sort with multiple keys, mixed ascending/descending.

    The EDX 3.0 package is at http://www.multieditsoftware.com/forums/viewtopic.php?p=1877#1877

    (EDX NWS Sort is a major overhaul of NWS Sort submitted by Bret Sutton)

    #6149
    CharlesG
    Participant

    Hi there,

    The attached deldups.s should remove all duplicate lines. Please let me know if it doesn’t work for anyone…
    DelDups.s

    #6188
    CharlesG
    Participant

    The attached DelDups.s works but seems to hang when processing:

    5/11/2006 00:04 69,129,209 tcmd32.out

    It has 827,676 lines. I can place it somewhere if people need it…
    DelDups.s

    #6189
    AndyColson
    Participant

    The problem could be that it's just really slow. How long have you let it run?

    With 827,676 lines, and each line comparing to everything below it:

    line 1 would compare 827,676 times.
    line 2 would compare 827,675 times.
    line 3 would compare 827,674 times.
    … etc

    So we sum those all up.
    Given: 1 + 2 + 3 + 4 + . . . . + N = (1 + N)*(N/2)

    (from http://mathforum.org/library/drmath/view/57919.html … yes, I had to look it up :-) )

    You would have a total of (1 + 827,676) * (827,676 / 2) = 827677 * 413838 = 342,524,194,326 comparisons.

    That might take a while…

    -Andy
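The arithmetic above is easy to check with a few lines of Python. The comparison rate used for the time estimate is purely an assumed figure for illustration, not a measurement of Multi-Edit.

```python
def comparisons(n):
    """Sum 1 + 2 + ... + n using the closed form (1 + n) * n / 2."""
    return (1 + n) * n // 2


# Sanity check of the closed form against a plain sum for a small n.
assert comparisons(10) == sum(range(1, 11))

n = 827_676
total = comparisons(n)
print(f"{total:,}")  # 342,524,194,326

# Hypothetical rate: one million line comparisons per second.
assumed_rate = 1_000_000
print(f"about {total / assumed_rate / 3600:.0f} hours at the assumed rate")
```

Even if the real rate were many times higher than the assumed one, a quadratic pass over a file of this size would still run for a long time.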

    #6195
    CharlesG
    Participant

    Does this work with the v9.0? product stream?

    [quote]
    I just released my EDX 3.0 package, which includes EDX NWS Sort. You can select:
      • Mark Lines with Duplicate Keys
      • Weed Out Duplicate Keys
      • Delete ALL Lines with Duplicate Keys
      • Keep ONLY the Lines with Duplicate Keys
      • Summary Sort: Keep Keys and counts only
    as well as do an ordinary sort with multiple keys, mixed ascending/descending.

    The EDX 3.0 package is at http://www.multieditsoftware.com/forums/viewtopic.php?p=1877#1877

    (EDX NWS Sort is a major overhaul of NWS Sort submitted by Bret Sutton)
    [/quote]

    #6196
    deleyd
    Participant

    Yes, EDX is currently for Multi-Edit 9.0 and 9.10, and there’ll be a version for the new Multi-Edit version 10 (called ME2006, I think; it just started beta testing yesterday).

    #8595
    CharlesG
    Participant

    Has anyone found a problem with my latest deldups.s macro – here?

    http://www.multiedit.com/forums/viewtop … =4866#4866
