Saturday, June 8, 2013

Rust wc

So in my attempt to practice writing POSIX command-line tools in Rust, the next tool I tried to rewrite was wc (or word count). This is a simple tool that counts 4 things:

  • lines (separated by U+0D, LINE FEED)
  • words (separated by any whitespace)
  • characters (multibyte encoded, probably UTF-8)
  • bytes

In the interest of brevity, I'm going to refer the reader to this link for the source code. One thing I learned is that the current std::getopts library doesn't have many features, e.g., it doesn't link short options and long options, so you have to check each separately. While I was writing this, some people in #rust on irc.mozilla.org suggested that I look at docopt which has not been ported to Rust yet, still mostly a Python library, but if ported would make a great addition to the language.

Saturday, June 1, 2013

Rust echo

I recently found the Rust programming language, and I've been having a blast of a time learning it. So far, it seems like a mixture of 50 different languages, most notably C++ and Haskell.

To help myself learn it, I decided a good place to start was with command-line utilities, the most basic of which are defined by POSIX. POSIX echo does not have any parameters, it just prints all of its arguments back to the user. BSD echo takes one parameter '-n' and if it's not given, then it prints a newline after all arguments are printed.

This seemed like a good example to learn a language so this was my first attempt:

fn main() {
    let mut newline = true;
    let mut arguments = os::args();
    arguments.shift();
    if arguments[0] == ~"-n" {
        arguments.shift();
        newline = false;
    }
    for arguments.eachi |argi, &argument| {
        print(argument);
        if argi == arguments.len() - 1 {
            if newline {
                print("\n");
            }
        } else {
            print(" ");
        }
    }
}

I talked about this on #rust at irc.mozilla.org and bjz and huonw recommended:

fn main() {
    let args = os::args();
    match args.tail() {
        [~"-n",..strs] => print(str::connect(strs, " ")),
        strs => println(str::connect(strs, " ")),
    }
}

which had some issues with it, so I refactored it to be more functional:

fn unwords(args: &[~str]) -> ~str {
    return str::connect(args, " ");
}
 
fn echo(args: &[~str]) {
    match args.tail() {
        [~"-n", ..strs] => print(unwords(strs)),
        strs => println(unwords(strs)),
    }
}
 
fn main() {
    echo(os::args());
}

which defined a Haskell-like unwords function to help. I am quite pleased with how supportive Rust is to approaching the problem from a procedural paradigm or a functional paradigm. Hooray for multi-paradigm programming!

Review of previous posts

In this post, I'm going to be reviewing my old posts with what I've learned over the past few years. The purpose behind this analysis is to consolidate my ideas into fewer ideas and hopefully come up with some new insights along the way.

This was my first post about XML entity references. The goal was to restrict the DTD grammar to only entities so facilitate usage in other applications that have no concern about elements and attribute lists. Now I would recommend using only DTD since its usage is widespread. I was considering folding this into a macro system I was planning to write, so that one could replace © with © just as easily as it could replace sqrt(4) with 2. I've come to the conclusion that this post was premature, and needs to be included in a more detailed discussion about macro systems, and templating systems such as M4, Genshi, Mustache, and many others.

This was an example of binary parsing, which is still not possible in many languages, but a topic more suited to C structs. GCC has a large number of attributes and packing parameters that accomplish some of the goals of binary parsing, but I still haven't found a good binary parsing framework to date.

This was a bunch of things trying to make Go something that it wasn't. I have since found the Rust programming language, which has all the features I was looking for in this post and more. So, now I would recommend using Rust.

These posts were just crazy comments.

The following posts were all about RDF:

This was a comment primarily intended to allow for arbitrary Common Lisp lists to be represented in RDF.

This was a comment as well, and I still haven't found a datatype URI for this, perhaps something for a future article.

This is interesting because there isn't an XSD datatype associated with large finite-size integers, although you could define it as xsd:integer, then restrict the domain until it matches int128_t. I have since defined a URI for this on my other website.

This was fun. I have since written about these algorithms on my other website.

This is a type of post I'd like to write more of. You know, explaining the pros and cons of certain implementation techniques and why one is better or worse than the other.

This was attempting to illustrate the inconsistencies that I found between different XML standards.

This and this were comments about The Web. I would like to add that I've started to use unqualified identifiers to refer to RDF and OWL core concepts in my private notes:

  • Type = rdfs:Class
    • ObjectType = owl:Class
    • DataType = rdfs:Datatype
  • Property = rdf:Property
    • ObjectProperty = owl:ObjectProperty
    • DataProperty = owl:DataProperty

The following posts were unintentionally about Data URIs:

This was attempting to solve a problem that Data URIs already solve. Now I recommend using Data URIs instead.

This was a good comment, but I don't recommend these namespaces anymore. Now I recommend using Data URIs instead, for example:

<data:text/x-scheme;version=R5RS,vector-length>

This mentions a problem regarding xsd:hexBinary and xsd:base64Binary, which can be solved in several ways:

  1. Candle assigned the type identifier "binary" to the combined value space.
  2. The MediaType "application/octet-stream" is naturally isomorphic to the binary datatype.
  3. The Data URI <data:application/octet-stream,DATA> may be used to represent data of type binary.
  4. The URI
    <http://www.iana.org/assignments/media-types/application/octet-stream>
    may also be used to represent the binary datatype itself.

Sunday, July 29, 2012

On binary search

There are many articles about the pedagogical benefits of writing a binary search algorithm. Instead of just implementing binary search, I wanted to take it one step furtuer. Is is possible to write a standards compliant, formally correct, and beautiful implementation of bsearch() in C?

// This type simplifies the 
// argument list of bsearch.
typedef int (*compare_f)(
  const void *x,
  const void *y);
void *
bsearch(
  const void *key,
  const void *base,
  size_t nel,
  size_t width,
  compare_f compar)
{
  size_t k;
  size_t lo = 0;
  size_t hi = nel - 1;
  const void *mid;

  while (lo <= hi) {
    // If we used (lo + hi)/2, then it
    // would overflow for large integers. 
    // The following formula prevents that.
    k = lo + (hi - lo)/2;
    mid = base + k*width;

    // If we used compar(mid, key), then it 
    // would violate POSIX, which specifies
    // that key must be the first argument.
    int cmp = compar(key, mid);
    if (cmp < 0) hi = k - 1;
    if (cmp > 0) lo = k + 1;
    if (cmp == 0)
      return (void *)mid;
  }

  return NULL;
}

Note that not every argument and variable have been colorized, just the ones that usually give people the hardest time when reasoning about this function. The average (k) can be reduced to (lo + hi)/2 if the array you are handling is less than 263 elements long, but since it isn't technically correct, I didn't write it that way.

Saturday, May 12, 2012

Byte-swap

Today I want to take a brief look at byte swapping from a different point of view. Many algorithms take mathematical concepts and turn them into a step-by-step process, but what if we turn it around, and mathematize a process that originates from computer science?
The idea of mathematizing computer programs is generally done within the context of the design of high-reliability systems and formal methods, which provide a way to prove that a program is correct. Perhaps if we took the time to ensure the correctness of the programs we write, then there wouldn't be so many bugs!

Monday, April 30, 2012

GoLang Highlights

I have been using Go quite frequently lately, and I would like to talk about the experience. There are several introductions to Go floating around, and reactions to its different conventions and ways of doing things than some people are used to. First, I would like to make it clear, though, that I have only used the gc compiler, as it was impossible to install gccgo on Mac OS X at the time of this writing.

What makes Go similar to X?

  • garbage collection
  • interface embedding
  • struct embedding
What makes Go better than X?
  • type syntax (backwards = fowards)
  • typed channels (no serialization)
  • implicit interfaces
  • segmented stacks

Similarities

Every scripting language has some form of garbage collection, so Go is not unique in that respect. Interface embedding is when you used a named interface type instead of a method when defining an interface type. It is similar to Haskell's => when constraining a type class. Struct embedding is when you use a named struct type instead of a field declaration when defining a struct type. This is very similar to inheritance in OOP languages, and it also gives the method sets of those child structs to the parent struct. Both of these are called embedding because they are syntactically similar, you just list the typename. However, they mean different things, and it's important to remember this.

Type Syntax

One of the distinguishing features of Go is its beautiful type syntax. Many programmers have noticed that writing types in C can be a headache, because when you read the type aloud, you have to read backward, or sometimes you have to skip around, back and forth in order to read the type properly. In Go there is no such problem, because the types are written exactly how you would say them in English. While this may be an issue in other languages, most speakers of languages of languages with SVO (Subject-Verb-Object) structure would be happy to see this change. It makes programming in Go a pleasure, and contributed greatly to my own productivity in my Go projects.

Typed Channels

Another Go feature is channels. Go channels are different from most other forms of I/O found in other languages because they are not based on bytes or characters, but data. Typed data. Which means instead of being required to serialize and deserialize your data when communicating between threads, or parts of a larger design, you can send them over a channel as is, knowing that you can use the data immediately instead of validating it first. Serialization is also a big issue with large software projects, since they may be written in many languages, which requires serialization in order to get data from one component to another. Go also has a serialization format, called gob, which can be considered a codec for most major types in the language. So you don't have to serialize, but if you want to, then it's available.

Implicit Interfaces

Another distinguishing feature of Go is implicit interfaces. Most languages with interfaces require explicit statements of implementation, such as Java and Haskell (for type classes). Go has no such requirement (and no such syntax). An interfaces is a collection of methods. A named type has an associated set of methods attached to it, called its method set. A named type implements an interface iff the methods it requires are a subset of the named type's method set. That's it! It's that simple. And because you don't have to make it explicit, it allows you to focus on more important things.

Segmented Stacks

The first and foremost under-appreciated and under-documented feature found only in Go is that of segmented stacks. Segmented stacks is what Go developers call the combination of implementation choices that led to the Go calling convention. It is an intersection of many features, such as: dynamic stack allocation, runtime checks, variable arguments, and deferred functions. Simply put, there is so much that happens between 'return' the statement, and 'RET' the instruction. First, the Go runtime has to check to see if there are any deferred functions to run, then it has to see if it allocated any stack space for the current function, and if it did and it isn't required anymore, then it frees the stack. Similarly, when a function is called, one of the first things that happens is the Go runtime checks to see if more stack space is required. What this means is that - in Go - there are no stack overflows!

Since segmented stacks are such an interesting feature, I suspect that I will revisit them in the future with a more in-depth discussion than I can ever hope to achieve in one paragraph.

Friday, March 2, 2012

Gödel and powerbooleans

What is a boolean?

A boolean is either true or false.

What is a powerboolean?

A powerboolean is a subset of the booleans (B):

  • True = {true}
  • Both = {true, false}
  • False = {false}
  • Null = {}

This set is equivalent to 2B (the powerset of the booleans), hence the name. The powerboolean operations are simply defined as the image of the normal boolean operations over B. The truth tables are as follows:

ANDTBFN
TTBFN
BBBFN
FFFFN
NNNNN
ORTBFN
TTTTN
BTBBN
FTBFN
NNNNN
NOT
TF
BB
FT
NN

These truth tables should not be confused with Belnap logic (which is another 4-valued logic) because there are different definitions for (N AND B), (B AND N), (N OR B), (B OR N). Also, In Belnap logic, B and N can be swapped, and the same rules apply, so they are not unique. In the powerbooleans, B and N cannot be swapped in a similar way, because if you do swap them, the truth tables would have to change. So the powerbooleans have unique values, they aren't just repetitions of some third truth value found in most 3-value logics. The 4-values of the powerbooleans are truely unique.

Who is Gödel?

Gödel is best known for his incompleteness theorems, but he is also known for the fuzzy logic named after him: "Gödel–Dummett" logic (also known as superintuitionistic, intermediate, or minimum t-norm logic). I've talked about product fuzzy logic before, but this time I'd like to talk about Gödel fuzzy logic. The operations are as follows:

  • (x AND y) = min(x, y)
  • (x OR y) = max(x, y)
  • NOT(x) = 1 - x

These operations are defined over a continuous interval from 0 to 1, so I can't really give truth tables for these, but to cut to the chase, they would look very similar to the tables above.

What do they have to do with each other?

If we let NOT(x) = (if x < 0, then x, else 1 - x), extend the interval of Gödel fuzzy logic below zero (which forces OR to be redefined in terms of our new NOT), and assign the following fuzzy values to each powerboolean:

  • True = 1
  • Both = 1/2
  • False = 0
  • Null = -1/2

then Gödel fuzzy logic and powerboolean operations are the same.