Sunday, August 16, 2009

Fun with app engine's task queue

I just finished an app engine site that plots some housing rentals in Ann Arbor on google maps (source). I originally started working on it out of frustration that housing maps doesn't support Ann Arbor. About half way through I realized mapskrieg does, but by then I had already found an excuse to play around with some of app engine's latest features so I was happy to finish it up :) Even though it is hard coded to listings in Ann Arbor, it works based off a list of craigslist search urls, so it could be targeted anywhere.

The meat of the server is pretty simple:

class RefreshCron(webapp.RequestHandler):
"Cron job that refreshes listings for each housing URL"
def get(self):
import pickle
# fire off tasks for each url
for url in HOUSING_URLS:
taskqueue.add(
url="/task/craigsrss",
payload=pickle.dumps(url),
headers={"Content-Type": "application/octet-stream"})
self.redirect("/")

class CraigsRssTask(webapp.RequestHandler):
"""takes in a url to a craigslist rss feed, and grabs
each entry. for each one, fires off a DetailsTask to
get the detail of each listing."""
def post(self):
import pickle
url = pickle.loads(self.request.body) + "&format=rss"
listings = parseListings(fetchContent(url))
for listing in listings:
taskqueue.add(
url='/task/fetchdetails',
payload=pickle.dumps(listing),
headers={"Content-Type": "application/octet-stream"})
self.redirect("/")

class DetailsTask(webapp.RequestHandler):
"""Fetches details from one listing and stores the
full listing in the datastore. The listing is passed in the request
body as a pickled dictionary."""
def post(self):
import pickle
listing = pickle.loads(self.request.body)
url = listing["url"]

text = fetchContent(url)
address = extractAddr(text)
photos = extractPhotos(text)
lat,longit = parseGeocode(fetchContent(geocodeUrl(address)))
listingStored = ListingStore(
title=listing["title"],
url=url,
address=address,
photos=photos,
lat=lat,
longit=longit)
listingStored.put()


A cron job posts a task for each craigslist search url using the experimental task queue feature. Each rss task parses the feed and some initial data about each listing, and then posts another task to gather the rest of the details from the listing url (including address and photo extraction, and geocoding of the address). What's really cool is this all fans out in parallel, which means a 100+ listings can be parsed and ready to go very quickly. With the addition of the task queue, I feel like app engine's request timeout has turned from a pesky limitation to a helpful reminder that if you are doing too much work in any single request handler, you are doing it wrong and should thing about how to break it down into smaller tasks that can be parallelized. Since firing off a task is almost as easy as calling a function, there's really no excuse not to.

The development server does a really nice job with tasks. Instead of firing them all off automatically, you can view each task in an admin console and fire them off one at a time to make sure everything is working properly, and also verify the tasks that you expect to show up do.



Unfortunately, you don't get the same interface once the app is running in production; it would be nice to have a way to see what tasks were pending, or at least flush them out in case a task is repeatedly barfing (as happened to me the first time I uploaded). Another feature I'd like to see was a way to have a handler notified when a queue empties out to facilitate multi-stage parallel processing. But overall I'm impressed with the task queue and look forward to it becoming a core feature (and available on the jvm).

Saturday, July 11, 2009

Understanding the Y Combinator using Java

I've been interested in clojure lately, and therefore functional programming. This lead me to reading about the Y Combinator, which is essentially the theory explaining how recursion can be implemented without naming, and therefore without being able to make the recursive call to the function name itself. It's also a fun puzzle, and I suspect that's why it is so popular among functional programming geeks.

Here is the Y Combinator in clojure:
(defn Y [r]
((fn [f] (f f))
(fn [f]
(r (fn [x] ((f f) x))))))

This article does a pretty good job explaining how to apply it, but didn't really explain how the Y Combinator works. What I wanted was the Y Combinator in a familiar form where I could step through the code and see what was going on. Since java is the environment that is easiest for me to step through code, I looked around until I came across this article that does a good job explaining the Y Combinator, and provides source code in both lisp and java.

When you perform a direct translation of functional code into java, you get a mess of anonymous inner classes that isn't particularly easy to reason about. So after pasting into into my IDE, I tried to remove all unnecessary anonymous inner classes while still preserving the essence of what the Y Combinator solves: how to implement recursion without making a direct recursive call. I ended up with a lot more code of course, but java's verbosity shines when trying to understand this dense concept.

/**
* Demonstrates the principles of the Y-Combinator in java.
*/
public class YComboDemo {

/**
* Let's start with a simple function implemented
* recursively: factorial
*/
static int factorial(int n) {
return n == 0 ? 1 : n * factorial(n - 1);
}

/**
* conceptually, factorial implements an interface
* that takes an integer and returns an integer
*/
interface IntFunc {int apply(int n);}

/**
* Here it is again in terms of a class implementing
* this interface:
*/
static class SimpleFactorial implements IntFunc {

public int apply(int n) {
return n == 0 ? 1 : n * apply(n - 1);
}
}

/**
* Suppose java didn't support recursive calls.
* How can we accomplish the same thing? Let's
* start by abstracting away the recursive call by
* delegating to some other function.
*
* This won't work unless we find a way to get
* that function to call back into us...
*/
static class DelegatingFactorial implements IntFunc {
private final IntFunc f;
public DelegatingFactorial(IntFunc f) {
this.f = f;
}

public int apply(int n) {
// we can't call apply directly, so delegate
return n == 0 ? 1 : n * f.apply(n - 1);
}
}

/**
* Let's start by defining something that can generate
* a function in terms of some other.
*/
interface IntFuncToIntFunc {IntFunc apply(IntFunc f);}

/**
* And then implement it for factorial. This will generate
* a function that delegates to the passed in argument.
*
* But we still need to pass in the right argument to make
* the "recursion" work...
*/
static class FactGen implements IntFuncToIntFunc {
public IntFunc apply(IntFunc f) {
return new DelegatingFactorial(f);
}
}

/**
* That's where the Y-combinator comes in. It finds the
* fixed point of any {@link IntFuncToIntFunc}. That is,
* it finds the function for which calling
* {@link IntFuncToIntFunc} one more time will yield the
* same function.
*
* This is all just a fancy way of saying that it contains
* the plumbing to make sure the original function that
* needs to get calls back into itself will get them.
*
* This implements recursion for any
* {@link IntFuncToIntFunc} that generates
* a function that calls back to the passed in function.
*/
static class YCombinator implements IntFunc {
private final IntFuncToIntFunc mIntFuncToIntFunc;

public YCombinator(IntFuncToIntFunc intFuncToIntFunc) {
mIntFuncToIntFunc = intFuncToIntFunc;
}

/**
* The fixed point works by kicking of the first call to
* generate the int function. We pass in ourselves as
* the argument so it will call back to our
* {@link #apply(int)} as necessary.
*/
public IntFunc fixedPoint() {
return mIntFuncToIntFunc.apply(this);
}

/**
* If the function calls back into us, we generate
* another function that calls back into us again.
* This will keep happening until it knows the
* answer itself (its base case).
*/
public int apply(int n) {
// setup another function with ourselves as
// the callback again
final IntFunc func = mIntFuncToIntFunc.apply(this);

// then just apply it
return func.apply(n);
}
}

public static void main(String args[]) {
final IntFuncToIntFunc factGen = new FactGen();

// if we apply the ycombinator to factgen, we have
// a working implementation of factorial
final IntFunc factorial = new YCombinator(factGen).fixedPoint();

System.out.println(factorial.apply(3));
}
}

So the Y Combinator works by serving as a trampoline for a generated function to bounce off back into itself after another function has been generated.