Thursday, 14 November 2013

Moving blogs!

This one sucks! I can't host code on here, so I moved to this one:

http://plookn.herokuapp.com/

Come over, we'll make pies and talk about politics.

Functional Programming and You

Let me blow your mind real quick.

The kind of programming I've learned through high school has always been imperative, procedural, and/or object-oriented. In other words, most of the time I told the compiler exactly what I wanted, step by step, like

for (int i = 0; i < arr.length; i++) {
    arr[i].doThisThing("Not sure what language this is written in");
}

This has taught me a lot about logical thinking and low-level computer architecture, but it's really clunky - I mean I gotta access each item by its index, and then do some laundry list of tasks.

Enter Javascript*, stage left


Writing Javascript introduced me to functional programming - kind of out of necessity, since procedural-style Javascript sucks so bad. The damn language can't even run for (object in list) on an array without doing something weird (it hands you the indices as strings, not the values)! But look at this:

forEach( some_array, function (each_object) {
    each_object.doThisThing("OMG YOU GUYS LOOK");
});

Did you see that? We just passed each element of the array to a function and operated directly on them, without indices! Sure, you say, Python and Java can do that too with their for-each constructs. But what if I do this**?***

linear = [1, 2, 3, 4];
squares = map( linear, function (val) {
    return (val * val);
});

squares;    // [1, 4, 9, 16]
linear;     // still [1, 2, 3, 4] - untouched!

Using the same functional style as before, we generated a new array from the previous one without specifically manipulating each value (imagine the tedium of doing that with indices)! Here's another:

evens = filter( [1, 2, 3, 4], function (val) {
    return (val % 2 == 0);    //True if even
});

evens;    // [2, 4]

An even nicer way to do it is to use first-class functions and define an is_even function separately:

is_even = function (val) {
    return (val % 2 == 0);
};
evens = filter( [1, 2, 3, 4], is_even);
more_evens = filter( [5, 6, 7, 8], is_even);

Did we just pass a function as an argument?! Yeah! Wowwowow! That's what first-class functions means, and it usually goes hand-in-hand with functional programming. This stuff was popularized by Scheme (a minimalist dialect of Lisp), but hardly anybody used that. Good thing Javascript took up the cause.
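In case forEach, map, and filter feel like magic, here's a minimal sketch of how they could be written in plain Javascript, matching the call style above (these names are my own stand-ins - real libraries like underscore.js ship hardened versions as _.each, _.map, and _.filter):

```javascript
// Hand each element of the array to the caller's function.
var forEach = function (arr, fn) {
    for (var i = 0; i < arr.length; i++) {
        fn(arr[i]);
    }
};

// Build a new array out of the transformed values; the original is untouched.
var map = function (arr, fn) {
    var result = [];
    forEach(arr, function (val) {
        result.push(fn(val));
    });
    return result;
};

// Keep only the values the predicate accepts.
var filter = function (arr, pred) {
    var result = [];
    forEach(arr, function (val) {
        if (pred(val)) {
            result.push(val);
        }
    });
    return result;
};
```

Notice that map and filter are themselves built on forEach plus a passed-in function - first-class functions all the way down.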

And that's it! Hope you learned something. Happy_cat.gif



* I know Javascript isn't much of a functional language, but that's where I started to pick this up. It is very extensible, however, with libraries like wu.js and underscore.js providing some functional power.

** Technically the map() function isn't global in Javascript, but you could use the underscore.js version the same way.

*** Sorry for clobbering the global namespace, I didn't want to cause confusion around the var statement.

Wednesday, 13 November 2013

Building a Scalable Network Search

WARNING: This whole discussion assumes that every node is well-behaved, and not malicious. Let's take this stuff one step at a time, ok? 
ALSO: Sorry this is so long. Lot to say.

Not sure how to preamble this so... I'll just dive in. Picture of a kitten.

Plain ol' search


Since we already know (more or less) how to think about tree algorithms, let's model our networks as trees, the same way I showed in my earlier post.

This makes common-sense search pretty easy - let the node send a search request to all its subtrees, and make them pass it on. If a node has the desired information, it can call back straight to the origin, and celebrate a job well done with a pint of orange juice.

Here's what it looks like: If our tree has a branching factor of two (the network then has a branching factor of three, since we include the pseudo-parent node), each node checks its own records and passes the request to its two pseudo-children:

        ---C
       /
   ---B
  /    \
 /      ---D
A
 \      ---F
  \    /
   ---E
       \
        ---G


(fyi I labelled the children in pre-order)


Let's call the time taken for a request to reach a child delta_t. We can assume that actually firing a request takes negligible time, since the parent need not wait for a response before sending the next request. It's like how you can send five letters from the post office at the same time, instead of sending one, waiting for a response, then the second, and so on.

Taking time steps through the process looks like this:

t = 0:  A fires a request to B and then to E.
t = delta_t:  B fires a request to C then D, and E asks F and G
t = 2 * delta_t:  C, D, F, and G each send requests to their children, and so on.


This tells us that we can reach twice as many peers with each increment delta_t, and therefore search takes logarithmic time - doubling the number of peers only adds a constant increment to the search time! Changing the branching factor or the magnitude of delta_t just changes the size of that increment.

But here's the catch: this type of search will hit every peer, in both the best and the worst case. Even if the first child B has the desired information, the remaining children still propagate down their own subtrees - and since the 'tree' is really a network, if C, D, E, etc. are connected to any of B's children, those children will receive the request even though B itself stops propagating! sadface.gif!

This sucks. It means that although search time scales well, the amount of work every node does as part of other nodes' searches is linear in N. Because of this, large networks will quickly choke on all the bandwidth that searches require. Suppose each node runs, on average, one search per minute. Then every node must process about N search requests per minute. Imagine doing this with a million nodes, with the majority of results turning up negative - it's a complete waste.




Slightly better (maybe) search


So obviously we need to modify our search algorithm. We'll certainly need to sacrifice some things, but let's talk about what we want to maintain:

1)  Full searchability: If one node in N has the data we're looking for, we want to find it eventually.
2)  (More-or-less) Constant time node work: A node ought not to do much more work in a network with large N (say a few million nodes) than with small N. We could of course *let* a node do more work if it wants to take it on...
3)  Time of data retrieval is related to abundance of data: Looking for something common like cat pictures should take no more time than looking for something rare like videos of me doing something stupid at that party one time.

We can add to this list, but this is a good place to start. Here's my solution for now: limit the number of hops a request can make. For each hop, increment the 'hop counter' on the request, and once it reaches a sustainable limit J, hold the request. If any node within J hops has replied positively with the information, it should broadcast that the search can be called off. Ideally, the information is found within that radius and propagation ceases. Such a search keeps node work proportional to the radius J, and keeps a nice search time.

But what if nobody in the radius returns a positive? I'd suggest letting one node on the outside edge propagate the request after waiting some time T for confirmation that the item has not been found. It may also help for each failed node to broadcast its unique ID; that way, the propagated request can skip nodes that have already been hit. To be honest, I'm not sure what the work complexity of this kind of search is, because there are a lot of sub-algorithms going on.

Every time the bounds are exceeded, the search incurs a deficit of T extra time. Furthermore, the more 'common' the search target (fewer extra T's), the less work every node does. To this end, it helps for networks to be clustered into groups with similar interests (if I want cat pictures, I should hang out with people who have a lot of cats). But I think this post is long enough : )


At this point I fell asleep, so...

...That's all for now. If you got this far, wow what the heck! We should hang out!

Or just leave a comment, that would be nice : )