Silly game of the week: Grep Pipes

After writing my last post I couldn’t stop thinking about pipes, and remembered a nice game called Pipe Dream (aka Pipe Mania). Then the geeky version of it came to me:

You have a starting point (some lines of text) and some ending points (stripped versions of the original text) and a few grep blocks with regular expressions. The objective is to place the grep blocks from start to finish before the data floods out.

grep pipes

After a few minutes with OODraw and Gimp I came up with this (horrible and ill-drawn) interface for the game. Nothing really exciting, just to give you an idea of what should happen… 😉

I’ve also written some sample code showing what the underlying library should look like, available here.

The idea is to have more commands, such as awk, sed and perl, to make it harder (because each one has its own regular expression syntax). A tee should also be required to split paths, and other programs such as cat, diff and comm to re-unite them.

Harder levels of the game should have bigger boards, trickier regular expressions and even more than one board, with netcat or ssh bridges to send data across the network. Also, to increase the realism, some programs such as cat and grep should let the data flow faster than others like perl.

Another option is to let the user define the regular expression, or the number of lines to crop, by hand. This would be like having some of the blocks as wild cards, in case there isn’t any suitable block available.

Anyway, the options are endless and I’m sure there are lots of people who would love it (me included), but I’m a complete failure at designing user interfaces. So this is an open invitation: if you’d like to see this game come out and could give me a hand with the interface, just let me know… 😉

Just bear in mind the following fundamental pre-condition: the game must allow (better still, encourage) keyboard-only playing, even if a high-end OpenGL graphics interface is used.

Unix plumbing

Unix has some fantastic plumbing tools. It’s not easy to appreciate the power of pipes if you don’t use them every day, and Windows users normally think they’re no big deal at all. Let me give you some examples and see what you think…

Tools

With a small set of tools we can do very complex plumbing on Unix. The basic tools are:

  • Pipes (represented by the pipe symbol ‘|’) are interprocess communication devices. They’re similar to connectors in real life. They attach the output of a process to the input of another.
  • FIFOs are fake files that do pretty much the same thing but have a representation on the file system. In real life they would be the pipes themselves (as they’re somewhat more visible).
  • Background execution (represented by the ampersand ‘&’) enables you to run several programs at the same time from the same command line. This is important when you need a program running at each corner of the piping system.

Simple example

Now you can understand what the grep below is doing:

cat file.txt | grep "foobar" > foobar.txt

It’s filtering every line that contains “foobar” and saving the result to a file called foobar.txt.

simple example

Multiple pipelining

With tee you can run two or more pipes at the same time. Imagine you want to create three files: one containing all the foo occurrences, another with all the bar occurrences and a third with only the lines that contain both foo and bar. You can do this:

mkfifo pipe1; mkfifo pipe2
cat file | tee pipe1 | grep "foo" | tee pipe2 > foo.txt &
cat pipe1 | grep "bar" > bar.txt &
cat pipe2 | grep "bar" > foobar.txt

The tees redirect the intermediate states to the FIFOs, which hold those states until another process reads them. All of them run at the same time because they’re running in the background. Check the plumbing example here.

multiple example

Full system mirror

Today there are many tools to replicate entire machines and rapidly build a cluster with a configuration identical to a given machine at a given point in time, but none of them is as simple as:

tar cfpsP - /usr /etc /lib /var (...) | ssh -C dest tar xvf -

With the dash, tar writes the archive to its standard output, which is piped to ssh; ssh then connects to the destination machine and un-tars the data from its standard input.

The pipe is very simple and at the same time very powerful. The information is carried from one machine to the other, encrypted by ssh, and you didn’t have to set up anything special. It works on most Unixes and even between different flavours of Unix.

There is a wiki page explaining the hack better here.

Simplicity, performance and compliance

Pipes, FIFOs and tees are universal tools, available on all Unices and supported by all shells. Because everything is handled in memory, it’s much faster than creating temporary files, and even if a program is not prepared to read from standard input (and thus use pipes), you can create a FIFO and get the same effect, cheating the program. It’s also much simpler to use pipes and FIFOs than to create temporary files with non-colliding names and remove them later.
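For instance, a (purely hypothetical) tool called some_tool that only accepts file names can still be fed from a pipeline through a FIFO:

mkfifo results.fifo
grep "foo" file.txt > results.fifo &   # the writer blocks until someone reads the FIFO
some_tool results.fifo                 # the tool thinks it's reading a regular file
rm results.fifo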

It can be compared with static vs. dynamic allocation in programming languages like C or C++. With static allocation you can create new variables, use them locally and have them automatically thrown away when you don’t need them any more, but it can be quite tricky to deal with huge or changing data. Dynamic allocation, on the other hand, handles that quite easily, but the variables must be created, manipulated correctly and cleaned up after use, otherwise you get a memory leak.

Using files on Unix requires the same kind of care not to fill up the quota or put too many files in a single directory, but you can easily copy them around and they can be modified and re-modified, over and over. It really depends on what you need, but for most uses a simple pipe/FIFO/tee is more than enough. People just don’t use them…

Object Orientation in C: Structure polymorphism

Listening to a podcast about the internals of GCC, I learnt that, in order to support object-oriented languages in a common AST (abstract syntax tree), GCC does polymorphism in quite an exquisite way.

There is a page that describes how to do function polymorphism in C, but not structure polymorphism as it happens in GCC, by means of a union, so I decided this was a good post to write…

Unions

Like a struct, a union groups a list of things together but, unlike a struct, the things share the same space. So, if you create a struct of an int and a double, the size of the structure is (at least) the sum of both sizes. In a union, the size is the size of the biggest element, and all elements live in the same area of memory, accessed from the first byte of the union.

Its usage is somewhat limited and can be quite dangerous, so you won’t find many C programs, and will rarely find any C++ programs, using it. One of the uses is to (unsafely) convert numbers (double, long, int) to their byte representation by accessing them as an array of chars. But the use we’ll see now is how to entangle several structures together to achieve real polymorphism.
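As a quick illustration (this snippet is mine, not from the original post), here is the size difference and the byte-representation trick described above:

#include <stdio.h>

struct Pair { int i; double d; };   /* members laid out one after the other */
union Both {                        /* members share the same memory        */
    int i;
    double d;
    unsigned char raw[sizeof(double)];
};

int main(void) {
    union Both b;
    b.d = 1.5;

    /* the struct is at least the sum of its members' sizes;
       the union is only as big as its biggest member */
    printf("struct: %zu bytes, union: %zu bytes\n",
           sizeof(struct Pair), sizeof(union Both));

    /* the (unsafe) trick: read the double back as raw bytes */
    for (size_t k = 0; k < sizeof b.raw; k++)
        printf("%02x ", b.raw[k]);
    printf("\n");
    return 0;
}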

Polymorphism

In object-oriented polymorphism, you can have a list of different objects sharing a common interface and access them all through that interface. But in C you don’t have classes and you can’t build structure inheritance, so to achieve the same effect you need to put them all in the same box, while at the same time defining a generic interface to access their members.

So, if you define your structures like:

struct Interface {
    int foo;
    void (*bar)();
};
struct OneClass {
    int foo;
    void (*bar)();
};
struct TwoClass {
    int foo;
    void (*bar)();
};

and implement the methods (here represented by function pointers) like:

void one_class_bar () {
    printf("OneClass.Bar()\n");
}
void two_class_bar () {
    printf("TwoClass.Bar()\n");
}

and associate the functions with the objects (you could use a Factory for that), you have three different structures, still not connected. The next step is to connect them via the union:

typedef union {
    struct Interface i;
    struct OneClass o;
    struct TwoClass t;
} Object;

and you have just created a generic object that can hold both OneClass and TwoClass and be accessed via the Interface. Later on, when reading from a list of Objects, you can go through the interface (as long as you declare your structures with members in the same order) and it’ll call the correct method (or use the correct variable):

Object list[2];

/* Setting: store the concrete structures inside the generic Object */
list[0].o = one;   /* a struct OneClass with foo and bar already filled in */
list[1].t = two;   /* a struct TwoClass */

/* Using: access everything through the Interface member */
list[0].i.bar();
list[1].i.bar();

Note that when iterating over the list, we access the Object via the Interface (list[0].i) and not via OneClass or TwoClass. Although the result would be the same (as they share the same portion of memory, and thus would execute the same method), going through the Interface is conceptually correct and compatible with object-oriented polymorphism.

The code above produces the following output:

$ ./a.out
OneClass.Bar()
TwoClass.Bar()

You can get the whole example here. I haven’t checked the GCC code, but I believe they’ve done it in a much better and more stable way, of course; the idea, though, is probably the same.
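For reference, here is a minimal sketch of the setup step that the snippets above leave out (the “factory” mentioned earlier). It’s my guess at how the linked example wires things up, not a copy of it:

#include <stdio.h>   /* for the printf calls inside one_class_bar/two_class_bar */

/* assumes struct Interface, OneClass and TwoClass, the one_class_bar and
   two_class_bar functions and the Object union exactly as defined above */

struct OneClass make_one(void) {
    struct OneClass o;
    o.foo = 1;
    o.bar = one_class_bar;   /* associate the "method" with the object */
    return o;
}

struct TwoClass make_two(void) {
    struct TwoClass t;
    t.foo = 2;
    t.bar = two_class_bar;
    return t;
}

int main(void) {
    Object list[2];
    list[0].o = make_one();
    list[1].t = make_two();

    list[0].i.bar();   /* prints "OneClass.Bar()" */
    list[1].i.bar();   /* prints "TwoClass.Bar()" */
    return 0;
}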

Disclaimer: this is just a proof of concept. It’s not nice, they (the GCC programmers) were not proud of it (at least in the podcast) and I wouldn’t recommend anyone use it in production.

Scrum, Agile, XP, RUP or just plain common sense?

There has always been a great deal of fuss about development methodologies, especially because of their creative or funny names, which lead people to believe that any one of them will save their project.

Funnily enough, all of them share, within limits, the same basic concepts: design, implementation and deployment. All of them say that interaction with the client is fundamental, and that a mediator between developers and the client is needed. Project managers, product owners or masters, they always value the customer’s point of view and, one way or another, they help the project stay close to the original plan.

If you dig through Wikipedia or books about any of them, you’ll find that they’re good for a bunch of situations and bad for a whole bunch of others. You’ll also find that completely different methodologies are recommended for similar projects with only tiny differences, and the truth is, you never know which case you fall into until you’ve already fallen into it!

Wizards, Gurus and Masters

Your project is going berserk and you need a saviour. Who are you gonna call? You browse the internet in search of a new methodology, you find Agile or Scrum or XP or whatever, and you post a new job at your company: [methodology-name-here] Master.

He comes, whatever master he is, and starts pointing out the obvious problems you have and what his new methodology brings to solve them. It really looks like it’s gonna work, and it probably will, but the true question is: was it the new methodology that saved it? What if, instead of Scrum, you had chosen XP? Or whatever buzzword comes next month?

Learning from the past

If you’ve been in the development field for a few decades, you’ll remember the bright glow RUP had in the late 80’s. I particularly love the graph where they show how much work of each type you have in each phase.

Even if they had measured the man-hours in each phase, at the granularity required to produce that graph, averaged over all the projects measured, it would still be absolutely pointless. But they didn’t even have proper tools, and the reality they were trying to model wasn’t capable of being modelled in the first place.

But it was beautiful; people in the 80’s liked graphics and colours very much. You might remember what a boom it was to have a colour computer with a graphical interface at that time (Amiga lovers should smile now).

All in all, RUP became the standard for years and years, and many big companies still use it with a large margin of success on their projects. But for small projects it wasn’t viable (I learnt that the hard way) and people had to come up with new ways during the 90’s. By then nice graphics weren’t enough; the new fashion was funny names like eXtreme Programming, Agile and so on.

So, once again, when the first attempts didn’t work out as expected (pair programming is a bit pointless anyway), new methodologies came, one after the other, and today we’re still saving projects with methodologies that will themselves need new saviours in the near future.

What’s the point of trusting a project to a methodology that was never proven to be effective? As a matter of fact, is there any way of proving anything at all about methodologies? Of course not! Each case is a different case; all you need is common sense. (John Lennon would disagree.)

Common sense

The Agile guru will drive your project his own way, to prove his method is effective. The same goes for Scrum Masters and XP Wizards, but following a fixed-scheme methodology on random projects within the same company is asking to have to start all over again next decade.

If your project is small enough, or if your company is focused enough or short-sighted enough, you might well stick to whichever one you like most and be happy. You’ll probably make a lot of money for a long time, the same way Microsoft and Yahoo! have been doing for decades, but the quality of their software (and yours) makes the big ball of mud look like just a special case.

Worse, those companies have recently been jumping from one methodology to another every year. Each new project manager who comes in brings a new salvation, screws things up even more and goes away, giving place to a new wizard who screws up a bit more, on and on.

If you need to be agile, use common sense. If you need quality, testability, deliverability (that’s a new word, isn’t it?), use common sense. Every methodology brings some good things and lots of bad things for your particular case; take the parts that maximise your investment (and are still compatible with the others you’ve chosen) and you’ll be fine.

Wizards, Gurus and Masters will defend their ideas, not your projects. Ask your developers what they think. Have as many meetings as you need, not the crazy madness of one a day, because we developers know that a half-hour meeting never takes half an hour! We know that the client’s vision is fundamental, we know that the software must work all the time, all of what’s delivered, and that tests should always pass regardless of what happened last weekend.

A bit different from the others, Agile is more a list of recommendations than a formal methodology, and that makes a big difference when adapting its central ideas to your project. If you’re a developer, get to know it better. If you’re a manager, ask your developers about it.

The Linux Kernel

Funnily enough, Agile and the like were made for small and quick projects, while RUP is still used in some massive developments, apparently with good quality and testability. But the biggest project of all time, with the biggest team of all time and the fastest-growing code of all time, does not use any of those fancy-named methodologies: the Linux kernel.

Greg Kroah-Hartman gave a very inspiring talk about the development of the kernel and how there is hardly any project, or group of projects, in the world that can get anywhere close to the dynamics intrinsic to Linux kernel development. (Link from Rodrigo, thanks!)

If you’ve read about the kernel people, you probably know that common sense is not one of their strongest attributes, but because the project has no single owner, no stakeholder, it just works, and the sense of the majority becomes the common sense, in the strict sense (sorry for the joke).

Each one has their own way of doing things, ownership is shared, and the sense of responsibility is beyond any level I’ve seen in industry or academia. It just works.

If you want to save your project from doom, do like the kernel guys do: use common sense.

Bad Vista

Ooops, they did it again…

A whole new hacking style has been discovered thanks to the complete incompetence of Microsoft’s engineers. When will they understand that security means the opposite of trust?

You can choose whatever framework you want (Java, .NET, ActiveX), build a simple program and have total control of the user’s machine in seconds. All because our beloved Windows browsers trust Microsoft’s technology only too much. Worse, the Windows kernel trusts Microsoft’s browsers and .NET too much as well!

ActiveX attacks are not new; IE has an extensive history of huge holes in that magnificent piece of crap. Rendering Windows’ security hopeless is not new either; for decades Outlook gave hackers a free one-click-exploit ™ feature. But this is completely crazy.

No matter which way you go, which framework you use and which path you take, total control of the machine is a few clicks away. Worse still, as this confidence in crap dates back to Windows 2.0, I wouldn’t be surprised if they found they could do the same on every version of any software (ahem…) they’ve produced so far, including DOS 1.0!!

Oh well, you can’t say you didn’t know, can you?

Silly prototypes of the week

This time I don’t have a project, nor will I probably have time to implement them (I’m on real holidays!), so I’ll just post the ideas and welcome any comments on their feasibility or usefulness. Feel free to implement them and let me know how it went.

Independent Bayesian Agents

An Artificial Neural Network is a collection of neurons connected in meaningful ways so that, when information coming from an input layer passes through them, a resulting signal is produced at the output layer. Learning is done via feedback (good or bad), by changing the internal function of each neuron to accommodate the next “pulse”.

The problem with this is that, after the learning phase, the resulting network can only classify a strict class of problems. A network created to recognise alphabet letters in an image has a different format and different components from one built to recognise other patterns. There are some implementations where the network itself can change its connections in order to learn something new, but that’s not commonly seen, not even in many scientific works on the subject.

You can have the same network structure using Bayesian nodes. The class of problems to be solved is not the same, but the overall logic is similar.

Bayesian agents are different. They are independent learning programs that can communicate with other agents when necessary and update their internal state from feedback and prior probabilities. They also have the advantage of returning not only true/false values but probabilities, or even probability distributions, to the user.

But this way your agents will be much more complex than the neurons of an ANN, and that’s a path rarely worth taking if you want to build emergent behaviour, as it becomes more and more difficult to combine them to produce unexpected behaviours. So the idea is to break the agents into very small pieces that interact with each other through just a few rules.

This way the agents themselves would be pretty much like objects in an OO design: by following the implementation of a class (the agent definition) and inheriting or using methods from other agents (the interaction), you can build a network of any shape you want (using graphs, or even network connections via a name service, for the interactions) and still keep it simple, by sticking to the agent definition and not putting complex logic inside each agent.

As in OO design, you don’t want to create one single class that does everything. What you do is build relationships between classes that express the problem you’re solving and, if well designed, you can reuse some of the components to solve other problems as well. The power of OO design is in the relationships between the classes, not in what the classes actually do, right?

So my hunch is that, if we make very simple agents and let the network itself be the relationship between them, we gain a new kind of extensibility.

Also, by changing the connections, learning happens at the network level instead of the agent level. Further, by giving each agent an “explorer behaviour” (say, by having every agent extend a common Explorer class), they can search for new answers once in a while, in the background, to increase their confidence levels.

In the same vein, by extending a Feedback agent you can make all agents able to process feedback from users, through derived classes such as BooleanFeedback (returning good or bad), MultipleChoiceFeedback, ContinuousFeedback and so on.
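None of this exists as code, but a minimal C++ sketch of the hierarchy described above could look like the following (all names are hypothetical, taken straight from the text):

#include <vector>

// Every agent can explore in the background and learn from feedback.
class Explorer {
public:
    virtual ~Explorer() {}
    virtual void explore() {}                 // search for new answers once in a while
};

class Feedback {
public:
    virtual ~Feedback() {}
};

class BooleanFeedback : public Feedback {     // "good or bad"
public:
    explicit BooleanFeedback(bool good) : good(good) {}
    bool good;
};

class Agent : public Explorer {
public:
    virtual ~Agent() {}
    virtual double belief() const = 0;            // a probability, not just true/false
    virtual void learn(const Feedback& f) = 0;    // update internal state from feedback
    std::vector<Agent*> neighbours;               // the network is the relationship
};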

Genetic programming using template Policies

Learning by structure, as given above (the network structure), is similar to genetic algorithms, but at a different level. Genetic programming lets you learn through the structure of the program itself. A standard way to do that is a rule-base approach: create a big set of simple rules and write programs that use them in mixed fashions.

Given the problem you want to solve, run all the programs against it, try to assess the quality of each solution (if they even get anywhere) and give the programs closest to the solution a higher chance of survival. But don’t kill the unfit; that’s an important step.

If there are good but dormant rules (genes) in the unfit and you kill them, you’re removing from the gene pool (rule base) solutions that could turn out to be the best once joined with others they haven’t yet had the chance to meet. Anyway, by crossing those programs (exchanging rules to create new combinations) and running the next breed on the same problem, you can iteratively steer the evolution towards the best solution found by chance.
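As a small illustration of that selection step (my sketch, not part of the original idea), fitness-proportional selection keeps even the unfit in the pool with a non-zero chance:

#include <cstddef>
#include <random>
#include <vector>

// Roulette-wheel selection: programs with higher fitness are more likely to
// be chosen as parents, but low-fitness programs still get picked sometimes,
// so their dormant rules (genes) are not lost from the pool.
std::size_t select_parent(const std::vector<double>& fitness, std::mt19937& rng) {
    std::discrete_distribution<std::size_t> roulette(fitness.begin(), fitness.end());
    return roulette(rng);
}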

So, what about template policies?

C++ templates are very powerful. One of their great powers (which comes with great responsibility) is the ability to inherit functionality from the template arguments.

template <class EatPolicy, class MovePolicy, class ReproducePolicy>
class LifeForm : public EatPolicy, public MovePolicy, public ReproducePolicy {
};

With this, you can create a life form following the policies provided. It will inherit the methods and properties of the policies as if it were normal OO inheritance. So, to create a bacterium, just implement some policies inheriting from EatPolicy, MovePolicy and ReproducePolicy and instantiate it like this:

LifeForm<Fermentation, Osmosis, Mitosis> bacteria;
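For the instantiation above to compile, the policies only need to be ordinary classes. Here is a minimal, self-contained sketch (the policy bodies are my assumption, not part of the original skeleton):

#include <iostream>

// Base policy interfaces (empty tags are enough for the mechanism)
class EatPolicy {};
class MovePolicy {};
class ReproducePolicy {};

// Concrete policies used by the bacterium example
class Fermentation : public EatPolicy {
public:
    void eat() { std::cout << "fermenting\n"; }
};
class Osmosis : public MovePolicy {
public:
    void move() { std::cout << "moving by osmosis\n"; }
};
class Mitosis : public ReproducePolicy {
public:
    void reproduce() { std::cout << "dividing\n"; }
};

// LifeForm exactly as above, repeated so this sketch compiles on its own
template <class EatPolicy, class MovePolicy, class ReproducePolicy>
class LifeForm : public EatPolicy, public MovePolicy, public ReproducePolicy {
};

int main() {
    LifeForm<Fermentation, Osmosis, Mitosis> bacteria;
    bacteria.eat();        // inherited from Fermentation
    bacteria.move();       // inherited from Osmosis
    bacteria.reproduce();  // inherited from Mitosis
    return 0;
}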

You can also create several policies inheriting from Fermentation (fermenting different materials, etc.) and so create several types of bacteria, but the problem is how to cross them later. Because templates are evaluated at compile time (when the types are created), you can’t create new types at run time, and the crossing over has to be done off-line (followed by a recompile). A few pre-processor commands and a Perl script could do it, though, and isolate that nasty part from the rest of the code.

I’m not sure about the implementation details of the cross-over, but the basic skeleton did work: Skeleton of genetic policies. I’ll give it a try later on.