
Setdown: the best tool for fast and repeatable line-based set operations

Introducing setdown

Have you ever been on the command line and tried to perform set operations? Have you ever followed crazy CLI guides on the internet that suggest complicated commands for performing set operations on files? I have. And I did not like it; I think that we can do better.

Over the weekend I wrote a pretty nifty program that I am calling Setdown. Setdown requires you to specify the set operations that you wish to perform as definitions in a set definitions file (often suffixed with ‘.setdown’). The definitions are written in a format very similar to a Makefile, except that each definition describes a set operation rather than a build step.

If you want to install setdown right now or check out the code then you can follow these links:

If you want to learn how to use setdown and write set operations in its language then you should read the README file provided in the source code. However, to show you how easy the language is to read, I have provided an example .setdown file right here:

-- All of the letters of the alphabet
alphabet: "alphabet.txt.unsorted"

-- Calculating the consonants with a set difference
consonants: alphabet - "vowels.txt.unsorted"

-- Getting any letter that is e-sounding or a vowel
e-or-vowels: "e-letters.txt.unsorted" \/ "vowels.txt.unsorted"

-- Get any letter that is e-sounding and a vowel
e-and-vowel: "e-letters.txt.unsorted" /\ "vowels.txt.unsorted"

-- Get all of the e-sounding letters, the vowels and the consonants
e-or-vowels-or-consonants: ("e-letters.txt.unsorted" \/ "vowels.txt.unsorted") \/ consonants
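
To make the meaning of those definitions concrete, here is a rough Python sketch that treats each file as a set of lines and computes the same five results. It only illustrates the semantics and says nothing about how setdown works internally; `read_set` and `evaluate` are hypothetical names, and the sketch assumes that the "or" definitions are unions and the "and" definition is an intersection.

```python
from pathlib import Path


def read_set(path):
    """Read a file as a set of lines (hypothetical helper, not part of setdown)."""
    return set(Path(path).read_text().splitlines())


def evaluate(base_dir):
    """Compute the sets named in the example definitions from files in base_dir."""
    base = Path(base_dir)
    alphabet = read_set(base / "alphabet.txt.unsorted")
    vowels = read_set(base / "vowels.txt.unsorted")
    e_letters = read_set(base / "e-letters.txt.unsorted")

    consonants = alphabet - vowels      # set difference
    e_or_vowels = e_letters | vowels    # union (assumed meaning of "or")
    e_and_vowel = e_letters & vowels    # intersection (assumed meaning of "and")
    return {
        "alphabet": alphabet,
        "consonants": consonants,
        "e-or-vowels": e_or_vowels,
        "e-and-vowel": e_and_vowel,
        "e-or-vowels-or-consonants": e_or_vowels | consonants,
    }
```

Note that `consonants` is reused by name in the last definition, just as the .setdown file refers to an earlier definition instead of repeating it.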

You should install setdown and then check out the setdown-examples project to give it a try right now!

Can you show me an example?

By this point you are probably thinking “I love the look of it, but show me it running”. So I will. Here is the full output from checking out the first example in the setdown-examples repository and running it:

$ setdown ex1.setdown 
==> Creating the environment...
Base Directory: ./
Output Directory: ./output

==> Parsed original definitions...
e-or-vowels-or-consonants: ("e-letters.txt.unsorted" \/ "vowels.txt.unsorted") \/ consonants

e-and-vowel: "e-letters.txt.unsorted" /\ "vowels.txt.unsorted"

e-or-vowels: "e-letters.txt.unsorted" \/ "vowels.txt.unsorted"

consonants: alphabet - "vowels.txt.unsorted"

alphabet: "alphabet.txt.unsorted"

==> Verification (Ensuring correctness in the set definitions file)
OK: No duplicate definitions found.
OK: No unknown identifiers found.
OK: All files in the definitions could be found.

==> Simplifying and eliminating duplicates from set definitions...DONE:
alphabet: "alphabet.txt.unsorted"

consonants: alphabet - "vowels.txt.unsorted"

e-and-vowel: "e-letters.txt.unsorted" /\ "vowels.txt.unsorted"

e-or-vowels: "e-letters.txt.unsorted" \/ "vowels.txt.unsorted"

e-or-vowels-or-consonants: e-or-vowels \/ consonants

==> Checking for cycles in the simplified definitions...DONE:
OK: No cycles were found in the definitions.

==> Copying and Sorting all input files from the definitions...
"alphabet.txt.unsorted" (unsorted) => "./output/alphabet.txt.unsorted.1.split.sorted" (sorted)
"e-letters.txt.unsorted" (unsorted) => "./output/e-letters.txt.unsorted.1.split.sorted" (sorted)
"vowels.txt.unsorted" (unsorted) => "./output/vowels.txt.unsorted.1.split.sorted" (sorted)

==> Computing set operations between the files...
Required results:
alphabet: ./output/alphabet.txt.unsorted.1.split.sorted

consonants: ./output/c989d1cf-b860-41cc-a52c-e2afc1e6a235

e-and-vowel: ./output/a8bd5974-22d5-4fdb-b269-0c09a1eeeb18

e-or-vowels: ./output/c3a8cc7c-f246-4eb4-b321-57f900964960

e-or-vowels-or-consonants: ./output/493ca813-7e3c-4259-9435-e2d5ddb4d6a5

As you can see, we have ended up with a number of output files. To pick just one example, let's see the contents of the consonants file:

$ cat ./output/c989d1cf-b860-41cc-a52c-e2afc1e6a235

And look at that: we have computed the consonants given only the vowels and the full alphabet. Hopefully you can see that this is very powerful and will let you write set operations on the command line that are far easier to get right.
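
The log above shows every input file being sorted before any set operation runs. Sorting is what makes the operations cheap: once both inputs are in order, a difference can be computed in a single pass without holding either input in memory. The Python sketch below shows that general technique. It is an assumption about the approach, not a transcription of setdown's code, and `sorted_difference` is a hypothetical name.

```python
_END = object()  # sentinel marking the end of the right-hand input


def sorted_difference(left, right):
    """Yield the items of `left` that do not appear in `right`.

    Both inputs must be iterables in ascending order. Duplicate items
    in `left` are emitted only once, so the result is a proper set.
    """
    right_iter = iter(right)
    current = next(right_iter, _END)
    previous = _END
    for item in left:
        # Skip repeated items on the left-hand side.
        if previous is not _END and item == previous:
            continue
        previous = item
        # Advance the right-hand side until it catches up with `item`.
        while current is not _END and current < item:
            current = next(right_iter, _END)
        # Keep `item` only if the right-hand side does not contain it.
        if current is _END or current != item:
            yield item


alphabet = sorted("abcdefghijklmnopqrstuvwxyz")
vowels = sorted("aeiou")
consonants = list(sorted_difference(alphabet, vowels))
```

Because the function only ever looks at the current item from each side, the same loop works on two open files of sorted lines as well as on in-memory lists.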

The benefits of setdown

Depending on your command line bent you may have used other tools in the past to perform set operations on files, like comm or fgrep, but these tools are quite lacking. Instead let me show you the full range of features that setdown gives you:

  • Maintainability
    If you get more set data to add to your collection (as often happens) then it is trivial to edit the setdown definitions to include it.
  • Repeatability
    Even if the data changes you run one single command and all of your set operations are performed again.
  • Sorted input is not required!
    Programs like comm require sorted input. This makes sense, because sorted files make set operations very efficient. However, we don’t put the onus on you to provide sorted input: setdown sorts every file that you give it. We even use External Sort, so you can give us truly massive files and still expect your set operations to complete.
  • Simplification of definitions
    If you write the same expression more than once then setdown will factor it out and perform that set operation only once, so no work is repeated:

    ==> Parsed original definitions...
    C: "b-1.out" - ("a-1.out" / "a-2.out")
    B: "a-1.out" / "a-2.out"
    A: ("a-1.out" / "a-2.out") / "b-1.out"
    ==> Verification (Ensuring correctness in the set definitions file)
    OK: No duplicate definitions found.
    OK: No unknown identifiers found.
    OK: All files in the definitions could be found.
    ==> Simplifying and eliminating duplicates from set definitions...DONE:
    A: "b-1.out" / B
    B: "a-1.out" / "a-2.out"
    C: "b-1.out" - B
  • Dependencies and cyclic dependency detection
    Since you can write set definitions that depend on other set definitions it is possible to write a cyclic dependency. We will spot this for you and also tell you exactly where the cycle is in your file, meaning that you don’t have to search for it yourself!

    ==> Simplifying and eliminating duplicates from set definitions...DONE:
    A: C
    B: D
    C: B
    D: A
    ==> Checking for cycles in the simplified definitions...DONE:
    [Error 20] found cyclic dependencies in the definitions!
    We found the following cycles:
       A -> C -> B -> D -> A
  • Validation
    We verify that your set definitions only reference files that exist. If you reference a file or a definition that does not exist then you will get an error.
  • Works nicely with version control
    You check your .setdown file and your input files into the repository and share them with your co-workers. Everybody can use setdown to get the same results! To prove it, I have written three examples in a setdown-examples repository. Check it out and give it a try!
  • Written in Haskell
    This makes the program very fast and efficient while, in my opinion, reducing the chances of having bugs.
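
The cycle check from the list above can be sketched as a depth-first search over the dependencies between definitions. This is a generic illustration in Python, not setdown's implementation; `find_cycle` and the dictionary representation of the definitions are assumptions made for the example.

```python
def find_cycle(dependencies):
    """Return one dependency cycle as a list of names, or None if there is none.

    `dependencies` maps each definition name to the names it refers to.
    """
    UNVISITED, ON_PATH, DONE = 0, 1, 2
    state = {name: UNVISITED for name in dependencies}
    path = []  # the chain of definitions currently being explored

    def visit(name):
        state[name] = ON_PATH
        path.append(name)
        for dep in dependencies.get(name, ()):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == ON_PATH:
                # `dep` is already on the current chain, so we have a cycle.
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[name] = DONE
        return None

    for name in sorted(dependencies):
        if state[name] == UNVISITED:
            found = visit(name)
            if found:
                return found
    return None


# The cyclic definitions from the example output: A: C, B: D, C: B, D: A
definitions = {"A": ["C"], "B": ["D"], "C": ["B"], "D": ["A"]}
cycle = find_cycle(definitions)
```

Joining the result with `" -> ".join(cycle)` gives `A -> C -> B -> D -> A`, the same shape of report as in the error message above.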

I think that this is a much more compelling set operations tool for the command line than anything else that exists out there and I am really happy to share it with you today for free. I also really hope that you get some great usage out of this tool and that it makes your life easier.

Concluding words

From experience I can say that, without this tool, dealing with complicated set operations on the command line and sharing your results with your co-workers is much more difficult than it should be.

At any rate, I hope that you get a great deal of value from this tool. If you have any comments or suggestions then please leave them here on this blog or raise them as issues. If you have any questions then ask them here or on Stack Overflow.

Thanks for reading and I hope this makes somebody’s life a little bit easier.

