Thursday 14 November 2013

Moving blogs!

This one sucks! I can't host code on here, so I moved to this one:

http://plookn.herokuapp.com/

Come over, we'll make pies and talk about politics.

Functional Programming and You

Let me blow your mind real quick.

The kind of programming I've learned through high school has always been imperative, procedural, and/or object-oriented. In other words, most of the time I told the compiler exactly what I want, like

for (int i; i < arr.length; i++) {
    arr[i].doThisThing("Not sure what language this is written in");
}

This has taught me a lot about logical thinking and low-level computer architecture, but is really clunky - I mean I gotta access each item by its index, and then do some laundry list of tasks.

Enter Javascript* stage left


Writing Javascript introduced me to functional programming - kind of out of necessity, since procedural-style Javascript sucks so bad. The damn language can't even run for object in list on an array without doing something weird! But look at this:

forEach( some_array, function (each_object) {
    each_object.doThisThing("OMG YOU GUYS LOOK");
});

Did you see that? We just passed each element of the array to a function and operated directly on them, without indices! Sure, you say, Python and Java can do that to with their for-each constructs. But what if I do this**?***

linear = [1, 2, 3, 4];
squares = map( linear, function (val) {
    return (val * val);
});

squares == [1, 4, 9, 16];    // True
linear == [1, 2, 3, 4];   //True

Using the same functional style as before, we generated a new array from the previous one without specifically manipulating each value (imagine the tedium of doing that with indices)! Here's another:

evens = filter( [1, 2, 3, 4], function (val) {
    return (val % 2 == 0);    //True if even
});

evens == [2, 4];

An even nicer way to do it is to use first-class functions and define an is_even function separately:

is_even = function (val) {
   return (val % 2 == 0);
}
evens = filter( [1, 2, 3, 4], is_even);
more_evens = filter( [5, 6, 7, 8], is_even);

Did we just pass a function as an argument?! Yeah! Wowwowow! That's what first-class functions means, and usually goes hand-in-hand with functional programming. This stuff was first used in Scheme (a strict version of Lisp), but nobody used that. Good thing Javascript took up the cause.

And that's it! Hope you learned something. Happy_cat.gif



* I know Javascript isn't much of a functional language, but that's where I started to pick it up. It is very extensible however, with libraries like wu.js and underscore.js providing some functional power.

** Technically the map() function isn't global in Javascript, but you could use the underscore.js version the same way.

*** Sorry for clobbering the global namespace, I didn't want to cause confusion around the var statement.

Wednesday 13 November 2013

Building a Scalable Network Search

WARNING: This whole discussion assumes that every node is well-behaved, and not malicious. Let's take this stuff one step at a time, ok? 
ALSO: Sorry this is so long. Lot to say.

Not sure how to preamble this so... I'll just dive in. Picture of a kitten.

Plain ol' search

 

Since we already know (more or less) how think about tree algorithms, let's model our networks as them, the same way I show in my earlier post

This makes common-sense search pretty easy - let the node send a search request to all its subtrees, and make them pass it on. If a node has the desired information, it can call back straight to the origin, and celebrate a job well done with a pint of orange juice.
 

Here's what it looks like: If our tree has a branching factor of two (the network then has a branching factor of three, since we include the pseudo-parent node), each node checks its own records and passes the request to its two pseudo-children:

        ---C
       /
   ---B
  /    \
 /      ---D
A
 \      ---F
  \    /
   ---E
       \
        ---G


(fyi I labelled the children in pre-order)


Let's call the time taken for the request to reach a child delta_t. We can assume that actually firing a request might take negligible time, since the parent need not wait for a response to send the next request.  It's like how you can send five letters from the post office at the same time, instead of sending one, waiting for a response, then the second, and so on.

Taking time steps through the process looks like this:

t = 0:  A fires a request to B and then to E.
t = delta_t:  B fires a request to C then F, and E asks F and G
t = 2 * delta_t:  C, D, F, and G each send requests to their children, and so on.


This tells us that we can reach twice as many peers with each increment delta_t, and therefore search takes logarithmic time - doubling the number of peers only results in a constant-time incremental increase! Changing the branching factor or the magnitude of delta_t will just affect the magnitude of that increment. 

But here's the catch: this type of search will hit every peer, both in the best and the worst case. Even if the first child node B has the desired information, the remaining children will not only propagate down their own trees, but also down B's subtree! Even though B does not propagate, if C, D, E, etc.. are connected to any of B's children, those children will receive the request! sadface.gif!

This sucks. It means that although search scales well, the amount of work every node will do as part of other nodes' searches is linear with N. Because of this, large networks will quickly choke on all the bandwidth that searches require. Suppose one node runs, on average, one search per minute. The number of searches it must process is N per minute. Imagine doing this with a million nodes, with the majority of results turning up negative - it's a complete waste.




Slightly better (maybe) search


So obviously we need to modify our search algorithm. We'll certainly need to sacrifice some things, but let's talk about what we want to maintain:

1)  Full searchability: If one node in N has the data we're looking for, we want to find it eventually.
2)  (More-or-less) Constant time node work: A node ought not to do much more work in a network with large N (say a few million nodes) than with small N. We could of course *let* a node do more work if it wants to take it on...
3)  Time of data retrieval is related to abundance of data: Looking for something common like cat pictures should take less or equal time to something rare like videos of me doing something stupid at that party one time.

We can add to this list, but this is a good place to start. Here's my solution for now: Limiting the number of hops a request can make. For each hop, increment the 'hop counter' on the request, and if it reaches a sustainable limit
J, hold the request. If any node in J has replied positive with the information, it should broadcast that the search can be called off. Ideally, information has been found within that radius and propagation may cease. Such a search results in node work proportional to J, and keeps a nice search time.

But what if nobody in the radius returns positive? I'd suggest letting one node in the outside circle propagate the request after waiting some time
T for confirmation that the item has not been found. It may be helpful for each failure to broadcast their unique ID; this way, propagating the request will ignore nodes that have already been hit. To be honest, I'm not sure what the work complexity is of this kind of search, because there's a lot of sub-algorithms going on.

Every time the bounds have been exceeded, the search will experience a deficit of
T extra time. Furthermore, the more 'common' the search (small number of extra T's) the less work every node does.  To this end, it is helpful for networks to be clustered into groups with similar interests (if I want cat pictures, I should hang with people that have a lot of cats). But I think this post is long enough : )


At this point I fell asleep, so...

...That's all for now. If you got this far, wow what the heck! We should hang out!

Or just leave a comment, that would be nice : )

Monday 28 October 2013

Drawing Trees with D3

So I figure if I gotta write blog posts, I might as well learn something in the process.  I thought that visualizing trees with ASCII was cool, but not very flexible nor aesthetically appealing.  I happen to know some javascript, so why not try using a dynamic visualizing library like D3?  I like the idea of creating an interactive graph that can be shared with people on the internet, and it seems like this is the perfect way to do it.


Cerealizing Serializing

Before I do any of that, I want to set some sort of format for the tree data to take - this way someone could write trees in any language like Python or Java and still render it with the visualizer.  For this purpose we'll use JSON, just because I think it looks nicer than XML :) Although it might be more accurate for the JSON to look like this:

{
 "root": "val1",
 "children": [
  {"root": "val2", "children": {}},
  {"root": "val3", "children": null}
 ]
}

But to work with d3 (and to make it easier to generate the JSON) we'll use this format.

{
 "nodes": [
  {"val": "val1"},
  {"val": "val2"},
  {"val": "val3"}
 ],
 "links": [
  {"src": 0, "nxt": 1},
  {"src": 2, "nxt": 0}
 ]
}

Here each item in the "nodes" list represents nodes that can be extended with other properties, and links uses the indices in "nodes" to create edges. In this case, val1 links to val2 and val2. We'll also say that the "source" is the root of the subtree (it helps formatting). Here's some python code that would make it (I haven't tested it, so it might not work..):

nodes = [], links = []

def flatten_tree(node, parent_index=None):
    """Call on the root of a tree as flatten_tree(root)"""

    nodes.append({"val": node.value})
    node_index = len(nodes) - 1

    if parent_index:
        links.append({"src": parent_index, "nxt": node_index})

    for child in node.children:
        flatten_tree(child, node_index)

Notice that in this format we can have as many children as we like, and even extend the object with extra properties, like ID's.  That way, we have the option to create networks with the same framework, where we're no longer confined to the tree-like descending structure.  Networks are cool.

D3 (the relevant parts)

So D3 is a javascript library that helps you bind data to the DOM and then make it look real pretty using Scalable Vector Graphics (2010 winner of the coolest name in internet technologies).  There are a lot of similar libraries for visualization though, like Protovis, Rafael, and Processing - I chose D3 because of its document-binding, and reputation as the most flexible tool for visualization. Also it was the first one I heard about, which lends it considerable bias :)

I'd like to be concise in my explanation, so if something doesn't make sense, please RTFM.

One important part is being able to search the document and appending to a block, like so (pulled from the doc):

d3.select("body").selectAll("p")
 .data([4, 8, 15])
  .enter().append("p")
 .text(function(d) { return "I’m number " + d + "!"; });

As you can see, we can bind to the "body" block just like jQuery, and then append <p> tags to the body for each element in the data array. It results in:
I'm number 4!
I'm number 8!
I'm number 15!
This is crazy useful! Especially since we can manipulate the data using a function just before appending to the document. Also notice that we can keep calling methods on each tag, so we can style elements and set other properties.

The second thing you need to know are forces - these are a bit more difficult to explain so instead watch this excellent talk.

Now that we have some of the basics, we can pick one of the examples and leech off of pre-existing code (because learning from APIs takes too long). Since I want something minimalist and extensible, I picked out a network, an unlabeled tree , and an example with labeling. After messing around for a while, I just extended the unlabeled tree example with the label values by appending "text" elements to the "g" blocks, and made the radius of the circles larger so they're easier to see.

Yaaayyy

And that's it! The code I wrote is up on my Github: https://github.com/dplyukhin/netviewer. Since the json requests use ajax, there's a super simple node.js server you can run to test it locally (and 'cause I might put it on heroku later).

I hope you learned something! Feel free to ask a question or say that my code is dumb in the comments.

Saturday 19 October 2013

Mapping Networks using Trees

So here's an idea:

Say we have a network of four computers: A, B, C, and D. We'll model the connections using an array structure like this (that's not the idea yet, though it's still pretty cool):

    A   B   C   D 
A   0
B   1   0
C   1   1   0
D   0   1   1   0

Where 1 represents a connection, and we'll say nodes aren't connected to themselves. Try drawing it out, because I'll reference the topology later.

Now let's say A wants to make a map of the network.  In this case, there's no central node connected to everybody and keeping track of all connections.  However, we can assume that there's some path between any two computers in the network.  That is, even though A is not connected directly to D, it can find out about D through a common connection, like B or C.

Obviously, to map the network A should ask all its neighbors which machines they're connected to, and for them to pass the message along.  The problem is that because unordered networks usually form circular connections (e.g. B connects to C connects to D connects back to B) each node would get the message from each of its connections, and replying to all of them would artificially increase the tally.

Well the easy solution is to attach an ID to each request, or maybe sign it with a private key. This way, when a message gets to a machine that has already received it, that machine can just reject all subsequent same-id messages and only reply to the first one. Then it could look like this:

[A, [B, [D, [C]]], [C, [D]]]
#I expanded it out to make it readable, hope it helps
[A, 
  [B,
    [C],
    [D, 
      [C]
    ]
  ], 
  [C, 
    [B],
    [D]
  ]
]

Where the 0th node of the list is the queried machine, and the following lists are its active connections.

Notice that even if nobody replies to a node as it's passing on the message (meaning it's at the 'end' of the tree) it will still report all of its active connections, so some items will appear in the tree twice. In this case, the order was A->B, A->C, B->D (followed by a bunch of failed requests, like B->C).

The above list would look different, though, depending on the order in which a node receives the message.  For example, if the connection between A and B were really shotty, but the rest of the connections were good, the tree would look like this:

[A, 
  [C, 
    [B,
      [D]
    ], 
    [D,
      [B]
    ]
  ],
  [B]
]

This kind of stuff can be useful in larger-scale networks to trace the shortest path to a desired node.  We could even add time-to-travel between nodes (like a traceroute) to make sure we pick the most efficient path.

Of course, this model assumes that every node is totally trustworthy, which is a horrible assumption to make for any real-world tech. I imagine you could get some sort of public key crypto going on over here, but that's a topic for a later post.

Anyway, hope you thought this was as neat as i did!

Sunday 13 October 2013

CSC148: Comments on Modularity and Recursion

Hey look you guys I'm gonna talk about OOP and recursion and why they're important, ok?


Object Oriented Programming

Object-oriented programming (compare with functional programming) means modelling a program as abstracted objects interacting with each other.  Each object consists of attributes (values associated with it, like number_of_legs) and/or methods (functions associated with the object, like defenestrate(target)).  Each object is generated by a programmer-defined class, both of which can then be used or modified.  Classes can also be extended by importing properties from old classes to make new classes (inheritance).

The advantage of OOP is that it creates an excellent organizational structure around the entire program, which improves readability and makes logical sections of code more reusable.  For example, I can easily take out a DatabaseManager class from one project and reuse it elsewhere.  OOP also provides useful abstractions for high-level programming - if you want to create a linked list, instead of figuring out how it works and implementing it, you can just create the object and use its methods.

The funny thing is that there's practically nothing you can do with objects that you can't do with subroutines. But! objects make code far more reusable, as well as readable.  This means it's no longer necessary to create tightly controlled, monolithic programs (which do have their uses, such as hardware programming).  In the best case, each class can then be taken out of context and modified without consequence.  Finally, it's simply easier to think about how the pieces of a program work if you consider them as objects with properties; attributes become more organized, and therefore less likely to fail due to programmer error.  It's definitely recommended practice whenever possible, for the sake of your code's longevity.

PS >> Here's an interesting, albeit somewhat unrelated article I found while researching:  The difference between imperative and declarative programming styles


Recursion

If you're a functional programmer, this is pretty much the bee's knees.  When a problem can be divided into smaller versions of itself, you can use a recursive algorithm to reduce the length of code, and often greatly simplify the task (reducing likelihood of a bug).  Similarly to OOP, recursion lets you simplify a complex algorithm into abstractions.

For example, say you want to search through a nested list (lists containing lists containing lists containing..). You could try iterating over the list and checking cases:

l = [1, [2, 3], [4, [5]]]
for item in l:
  #Check if the item is a list, and iterate over it if it is
  #If that list has a list, loop over that..
  #More if statements to account for deeper nesting
  #If the item ain't a list, check if it's what we're looking for

As you can see, the code starts to nest a whole bunch of loops, and even then it's not very flexible (how do you handle a hundred lists nested within each other?).

Now try thinking recursively: We're looking for a value in this list.  If an item in the list is another list, we want to search it, and so on.  Since the search algorithm is the same in each case, we can just write a general statement:
1) Look through a list for a value.
2) If this list contains another list, look through that list using step 1.

Just like that, a monolithic algorithm full of looping and type checking can be shortened to just a few lines.

One big problem, however, is that debugging recursion is difficult, or at least different.  If a recursive function fails after 100 iterations, the traceback will be the function itself, 100 times.  This means you ought to take care to understand every possible case before implementing it (e.g. What if the array index goes out of range? What happens when we finish execution? Consider all the possible arguments).

All in all, recursion is a powerful tool, but with great power comes great responsibility.  Recursive functions should be thoroughly tested with a range of inputs, and kept as isolated as possible (think of it as a black box that will return the correct answer), because they're really annoying to debug when they fail.  Furthermore, thoroughly document such functions, because they can be difficult to read and understand for others reading your code.  But besides that, recursion can be particularly useful for shrinking down and conceptually simplifying code.