
Setdown: the best tool for fast and repeatable line based set operations

Introducing setdown

Have you ever been on the command line and tried to perform set operations? Have you ever followed crazy CLI guides on the internet that suggest complicated commands for performing set operations on files? I have, and I did not like it. I think that we can do better.

Over the weekend I wrote a pretty nifty program that I am calling Setdown. Setdown requires you to specify the set operations that you wish to perform as definitions in a set definitions file (often suffixed with ‘.setdown’). The definitions are written in a format very similar to a Makefile, except that each rule describes a set operation rather than a build step.

If you want to install setdown right now or check out the code then you can follow these links:

The easiest way to install setdown is to:

$ nix-shell -p haskellPackages.setdown
$ setdown --help

If you want to learn how to use setdown and write set operations in its language then you should read the README file provided in the source code. However, to show you how easy the language is to read, I have provided an example .setdown file right here:

-- All of the letters of the alphabet
alphabet: "alphabet.txt.unsorted"

-- Calculating the consonants with a set difference
consonants: alphabet - "vowels.txt.unsorted"

-- Getting any letter that is e-sounding or a vowel
e-or-vowels: "e-letters.txt.unsorted" \/ "vowels.txt.unsorted"

-- Get any letter that is e-sounding and a vowel
e-and-vowel: "e-letters.txt.unsorted" /\ "vowels.txt.unsorted"

-- Get all of the e-sounding letters, the vowels and the consonants
e-or-vowels-or-consonants: ("e-letters.txt.unsorted" \/ "vowels.txt.unsorted") \/ consonants

You should install setdown and then check out the setdown-examples project to give it a try right now!

Can you show me an example?

By this point you are probably thinking “I love the look of it but show me an example”. So I will. Here is the full output from checking out the first example in the setdown-examples repository and running it:

$ setdown ex1.setdown
==> Creating the environment...
Base Directory: ./
Output Directory: ./output

==> Parsed original definitions...
e-or-vowels-or-consonants: ("e-letters.txt.unsorted" \/ "vowels.txt.unsorted") \/ consonants

e-and-vowel: "e-letters.txt.unsorted" /\ "vowels.txt.unsorted"

e-or-vowels: "e-letters.txt.unsorted" \/ "vowels.txt.unsorted"

consonants: alphabet - "vowels.txt.unsorted"

alphabet: "alphabet.txt.unsorted"

==> Verification (Ensuring correctness in the set definitions file)
OK: No duplicate definitions found.
OK: No unknown identifiers found.
OK: All files in the definitions could be found.

==> Simplifying and eliminating duplicates from set definitions...DONE:
alphabet: "alphabet.txt.unsorted"

consonants: alphabet - "vowels.txt.unsorted"

e-and-vowel: "e-letters.txt.unsorted" /\ "vowels.txt.unsorted"

e-or-vowels: "e-letters.txt.unsorted" \/ "vowels.txt.unsorted"

e-or-vowels-or-consonants: e-or-vowels \/ consonants

==> Checking for cycles in the simplified definitions...DONE:
OK: No cycles were found in the definitions.

==> Copying and Sorting all input files from the definitions...
"alphabet.txt.unsorted" (unsorted) => "./output/alphabet.txt.unsorted.1.split.sorted" (sorted)
"e-letters.txt.unsorted" (unsorted) => "./output/e-letters.txt.unsorted.1.split.sorted" (sorted)
"vowels.txt.unsorted" (unsorted) => "./output/vowels.txt.unsorted.1.split.sorted" (sorted)

==> Computing set operations between the files...
Required results:
alphabet: ./output/alphabet.txt.unsorted.1.split.sorted

consonants: ./output/c989d1cf-b860-41cc-a52c-e2afc1e6a235

e-and-vowel: ./output/a8bd5974-22d5-4fdb-b269-0c09a1eeeb18

e-or-vowels: ./output/c3a8cc7c-f246-4eb4-b321-57f900964960

e-or-vowels-or-consonants: ./output/493ca813-7e3c-4259-9435-e2d5ddb4d6a5
$

As you can see, we have ended up with a number of output files. Just to pick one example, let’s see the contents of the consonants file:

$ cat ./output/c989d1cf-b860-41cc-a52c-e2afc1e6a235
b
c
d
f
g
h
j
k
l
m
n
p
q
r
s
t
v
w
x
y
z
$

And look at that: we have computed the consonants given only the alphabet and the vowels. Hopefully you can see that this is very powerful and will let you write set operations on the command line that are much easier to get right.
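To make the three operations in the example concrete, here is a minimal sketch in Python. It is only an illustration of difference, union and intersection over lines; it is not how setdown is implemented, and the contents chosen for the e-letters set are an assumption made up for this sketch.

```python
# Illustrative sketch only: each "file" is modelled as an iterable of lines
# and the set operations are done with Python's built-in set type.

def as_set(lines):
    """Treat every non-empty line as one element of the set."""
    return {line.strip() for line in lines if line.strip()}

alphabet = as_set("abcdefghijklmnopqrstuvwxyz")
vowels = as_set("aeiou")
# Hypothetical contents standing in for e-letters.txt.unsorted.
e_letters = as_set("bcdegptvz")

consonants = alphabet - vowels      # set difference
e_or_vowels = e_letters | vowels    # union
e_and_vowel = e_letters & vowels    # intersection

# One element per line, in sorted order, like the consonants file above.
print("\n".join(sorted(consonants)))
```

The printed consonants match the contents of the consonants file shown above.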

The benefits of setdown

Depending on your command line bent you may have used other tools in the past to perform set operations on files, like comm or fgrep, but these tools are quite lacking. Instead let me show you the full range of features that setdown gives you:

  • Maintainability
    If you get more set data to add to your collection (as often happens) then it is trivial to edit the setdown definitions to include it.
  • Repeatability
    Even if the data changes you run one single command and all of your set operations are performed again.
  • Sorted input is not required!
    Programs like comm require that you have sorted input if you want to do efficient set operations on files. This makes sense because sorted files make set operations very efficient. However, we don’t put the onus on you to provide us with sorted input: setdown will sort any files that you give it. We even use External Sort so that you can give us truly massive files and expect that we will still be able to perform your set operations.
  • Simplification of definitions
    If you write the same definition twice then setdown will factor that out and only perform the set operation once. This makes setdown run as efficiently as possible:

    ==> Parsed original definitions...
    C: "b-1.out" - ("a-1.out" / "a-2.out")
    B: "a-1.out" / "a-2.out"
    A: ("a-1.out" / "a-2.out") / "b-1.out"
    
    ==> Verification (Ensuring correctness in the set definitions file)
    OK: No duplicate definitions found.
    OK: No unknown identifiers found.
    OK: All files in the definitions could be found.
    
    ==> Simplifying and eliminating duplicates from set definitions...DONE:
    A: "b-1.out" / B
    B: "a-1.out" / "a-2.out"
    C: "b-1.out" - B
  • Dependencies and cyclic dependency detection
    Since you can write set definitions that depend on other set definitions it is possible to write a cyclic dependency. We will spot this for you and also tell you exactly where the cycle is in your file, meaning that you don’t have to search for it yourself!

    ==> Simplifying and eliminating duplicates from set definitions...DONE:
    A: C
    B: D
    C: B
    D: A
    
    ==> Checking for cycles in the simplified definitions...DONE:
    [Error 20] found cyclic dependencies in the definitions!
    We found the following cycles:
       A -> C -> B -> D -> A
  • Validation
    We verify that your set definitions only reference files that exist. If you reference a file or a definition that does not exist then you will get an error.
  • Works nicely with version control
    You check your .setdown file and your input files into the repository and share them with your co-workers. Everybody can use setdown to get the same results! To prove it, I have written three examples in a setdown-examples repository. Check it out and give it a try!
  • Written in Haskell
    This makes the program very fast and efficient while, in my opinion, reducing the chances of having bugs.
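The point in the list above about sorted input is worth a small illustration. Once both inputs are sorted, a set difference needs only a single pass over each of them. The Python sketch below shows that general idea; the function name is hypothetical, the in-memory sort is a stand-in for the external sort that setdown uses, and none of this is setdown's actual code.

```python
def sorted_difference(left, right):
    """Yield items of sorted, de-duplicated `left` that are absent from sorted `right`."""
    right_iter = iter(right)
    current = next(right_iter, None)
    for item in left:
        # Advance the right side until it catches up with the left item.
        while current is not None and current < item:
            current = next(right_iter, None)
        if current is None or current != item:
            yield item

# Unsorted input is sorted (and de-duplicated) first, then merged in one pass.
alphabet = sorted(set("qwertyuiopasdfghjklzxcvbnm"))
vowels = sorted(set("uoiea"))
print("".join(sorted_difference(alphabet, vowels)))
```

Because each input is only read front to back, the same approach works on files that are too large to hold in memory.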

I think that this is a much more compelling set operations tool for the command line than anything else that exists out there and I am really happy to share it with you today for free. I also really hope that you get some great usage out of this tool and that it makes your life easier.
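For the curious, the cyclic dependency detection described in the list above can be approximated with a depth-first search over the definitions. This is a sketch under my own assumptions (a plain dictionary of dependencies and a hypothetical find_cycle function), not the algorithm that setdown itself uses.

```python
def find_cycle(dependencies):
    """Return one dependency cycle as a list of names, or None if there is none.

    `dependencies` maps each definition name to the names it refers to.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {name: WHITE for name in dependencies}
    path = []

    def visit(name):
        state[name] = GREY
        path.append(name)
        for dep in dependencies.get(name, ()):
            if state.get(dep, WHITE) == GREY:
                # Back edge: the cycle is the current path from `dep` onwards.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[name] = BLACK
        return None

    for name in sorted(dependencies):
        if state[name] == WHITE:
            found = visit(name)
            if found:
                return found
    return None

# The same definitions as in the cyclic example above.
definitions = {"A": ["C"], "B": ["D"], "C": ["B"], "D": ["A"]}
cycle = find_cycle(definitions)
print(" -> ".join(cycle))
```

Names that are on the current search path are marked grey, so meeting a grey name again means the path has looped back on itself.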

Concluding words

From experience I can say that, without this tool, dealing with complicated set operations on the command line and sharing your results with your co-workers is much more difficult than it should be.

At any rate I hope that you get a great deal of value from this tool. If you have any comments or suggestions then please leave them here on this blog or raise them as issues. If you have any questions then ask them here or on Stack Overflow.

Thanks for reading and I hope this makes somebody’s life a little bit easier.


A line based file splitter for the command line.

Have you ever wanted to extract only a certain set of lines from a file? Maybe you wanted to get everything from line 400 onwards, or just lines 25 to 50? Well I did. I call the end result ‘splitter’.

Splitter is a program designed to be used on the command line and it has been written entirely in Haskell. I have uploaded Splitter so that it is available on Hackage. You can find the source code for splitter on BitBucket, along with the source code for ‘range’, the library that I wrote in order to make the splitter program easier to deal with. The repositories are here:

But words are just words and I really need to show you some examples.

Show me an Example!

For this demo let’s make a file that has twenty lines in it, where each line holds the next number from one to twenty written out as a word, like this one: Twenty Numbers.

If you were to get that file (calling it ‘twenty.txt’) then the following commands would have the following results. You could get single lines from files:

$ cat twenty.txt | splitter 3
three
$

You could get an entire range of lines from a file:

$ cat twenty.txt | splitter 5-9
five
six
seven
eight
nine
$

You could get multiple ranges from the file:

$ cat twenty.txt | splitter 10-14,2-4
two
three
four
ten
eleven
twelve
thirteen
fourteen
$

You can get ranges that are only bounded on one side:

$ cat twenty.txt | splitter -5,15-
one
two
three
four
five
fifteen
sixteen
seventeen
eighteen
nineteen
twenty
$

You can invert the selection if you choose to:

$ cat twenty.txt | splitter -i -5,15-
six
seven
eight
nine
ten
eleven
twelve
thirteen
fourteen
$

And you can specify an infinite range if you really want to (even though it would be the same as ‘cat’):

$ cat twenty.txt | splitter '*'
one
two
three
four
five
six
seven
eight
nine
ten
eleven
twelve
thirteen
fourteen
fifteen
sixteen
seventeen
eighteen
nineteen
twenty
$

And there are a few more options that you can see by running ‘splitter --help’. I would recommend that you have a play around with it yourself. It can be installed on any platform that has cabal-install, which is part of the Haskell Platform.
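If you are wondering how the range specifications in the examples above behave, here is a rough model of them in Python. The parse_ranges and select_lines functions are hypothetical names made up for this sketch. It only covers the forms shown in this post and it is not how splitter or the range library are written.

```python
import math

def parse_ranges(spec):
    """Parse a spec such as "10-14,2-4", "-5,15-", "3" or "*" into (low, high) pairs."""
    ranges = []
    for part in spec.split(","):
        if part == "*":
            ranges.append((1, math.inf))
        elif "-" in part:
            low, high = part.split("-", 1)
            # A missing bound means the range is open on that side.
            ranges.append((int(low) if low else 1, int(high) if high else math.inf))
        else:
            ranges.append((int(part), int(part)))
    return ranges

def select_lines(lines, spec, invert=False):
    """Lazily yield the lines whose 1-based line number falls inside the spec."""
    ranges = parse_ranges(spec)
    for number, line in enumerate(lines, start=1):
        inside = any(low <= number <= high for low, high in ranges)
        if inside != invert:
            yield line

words = ["one", "two", "three", "four", "five", "six", "seven", "eight",
         "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen", "twenty"]
print(list(select_lines(words, "10-14,2-4")))
print(list(select_lines(words, "-5,15-", invert=True)))
```

Lines come out in file order no matter how the ranges are ordered in the specification. Because select_lines is a generator, each line is produced as soon as it has been read, which is the same property that lets a tool like this follow a growing log file.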

Concluding Words

The bottom line is that splitter makes it really easy to extract only certain lines from your files. It has the following features:

  • Select any range that you like, whether infinite or fixed.
  • Invert your selection so that you get all of the lines that you did NOT specify.
  • Print the line numbers alongside the lines from the file.
  • Lines are printed as soon as they are ready, meaning that you can use splitter on a logfile in the same way that you can use ‘tail -f’.

I have tried to make it a highly useful and focussed tool for getting certain lines from files, using an easy to understand format to specify which lines you want. For more detailed information you should check out the README file on BitBucket. It is perhaps the most comprehensive and up to date resource on the way to use the splitter tool.

Extra: Range code huh? That sounds useful.

While I was writing this I did indeed look around for Range libraries that would meet my criteria. I discovered the following:

  • ranges
    A nice looking package that has been marked as Obsolete by the Author. I did not want to have to be stuck on an obsolete version of code that would not be updated. Also, this library cannot handle infinite ranges.
  • Ranged-set
    This is a nice library and it makes good use of Haskell classes but it does not support infinite ranges either and thus was not suitable for this project.

So, getting excited and wanting to start from scratch, I wrote my own library called range, which I have now placed on Hackage. Please feel free to use it for your own purposes and I will happily accept pull requests on that work.