corprew reed

On Npm and Investment

This article raises a good point about the recent investment in NPM, that it might make sense as a strategic investment even if there were basically no odds that NPM would ever IPO as an entity.

Not really much to this blog entry, just that the insight “it’s worth it for VCs to invest in infrastructure if it saves them money” was one that I hadn’t thought of before.

It’s also interesting to me because I’m working on learning Node in my spare time.

No Sleep Until….

My ongoing suspicion for why people use second level administrative regions (us: states, ca: provinces, uk: complex) as the basis for applications and not third level administrative regions (counties/parishes/ridings) is that in the US, the expectations are wildly off from reality.

The most exciting example of this is that ‘Brooklyn’ is a borough of New York City, located in ‘Kings County.’ So, ‘Kings County’ properly appears in a lot of locations where you’d expect Brooklyn. But generally, the overlap of administrative regions and what people actually call things breaks down pretty hard at that level. It goes from something that you can pick up easily from a web service to custom-programmed madness, which is why foursquare gets it right but most small startups stop at the higher level.

Node, Node on the Range

For the last several weeks, I’ve been working on learning Node and Go as adjuncts to Ruby, which has been my main programming language for the last while.

So far, Node has been going fairly well, possibly because I spent about 5 years doing asynchronous programming with completions trying to push more packets back to users at one of my previous gigs.

Go is interesting, but so far all I’ve done with it is modify a Redis-based WebSocket handler for use with a longer-term project that I’m working on. Its process model seems to go better with WebSockets than Rails 3.x/4.x does, and competing Ruby frameworks seem to be defunct or less-used in production.

So Long, and Thanks for All the Images

Linea, my employer, shut down as of the end of November, and I’ve been the person keeping it up and running until today. As of today, we shut down. I want to take this opportunity to thank my team and my co-workers, all of whom have been fantastic.

As for myself, I’m going to be working on consulting projects and other side projects for the next several months before I start looking for another full time position. If you’re interested in talking, though, contact me. I’m always interested to hear from people with interesting projects going on.

Online Education

I used to be an applications developer and devops person for an online education startup, and I always kept this image from XKCD around to remind me of security issues inherent in running a school. The place was for smart teens who were interested in technical things, so they frequently would try out little hacks they could find.

Exploits of a Mom

Didn’t have any problems in my ‘tenure’ at that employer, but did have some funny logfile monitoring set up to watch students trying things. People are basically hilarious.

Operators Are Busy, Psb

Someone asked me about this recently, so I figured I’d answer here. I’m currently working fulltime at Linea Photosharing LLC, a startup that’s located in Fremont, Seattle, WA.

The blog posts on this website are largely scavenged from an earlier WordPress-based site that was hacked, so a bunch of them are out of time order. Feel free to ask me any questions about the content on that site using the content on the about page.

Linea’s a pretty great place to work and we’re currently hiring Engineers (iOS, Javascript, RoR), so especially contact me if you’re interested in that.

Dennyalps: Adventures in Mass Transit

Recently, I haven’t been happy with the various ways that Seattle and related groups (google maps, one bus away, king county transit) display bus arrival information. Because of that, I’ve created Denny Alps, which displays data for the route I use most days. I figure after I’ve done this for a while, I’ll understand the issues involved better.

When I’m not riding the bus, it only updates every 10 minutes, but I can make that happen every minute or so if people think this sort of display is useful. So far, I’ve been enjoying it. Leave a comment if you like it or have suggestions for improving this sort of view.

Anyway, http://dennyalps.herokuapp.com for now. May move it somewhere more permanent if it turns out to be as useful in the long run as I’ve found it in the short run.

Soniverse

Some details on the Soniverse project, which was a game platform from Monstrous that I worked on as a consultant, are available at the Monstrous website.

Soniverse was a fun project with a great team spread out over North America, although headquartered in Austin and the Bay Area.

I designed and implemented the server layer for the product. The only difference between the slides and the implementation was that MongoDB wasn’t used in the final versions; it was replaced by straight up RDB for maintainability and simplicity reasons. The queuing, since people have asked, was implemented through Resque, a Redis-based processing queue system.

Classifiers and Classification

For the last while, I’ve been working on a project that involves scanning large numbers of RSS/Atom feeds, and then using Bayesian[1] classifiers to sort each item into one of a number of categories for summarization and display (the system that I’m using to do this is available as a sample website, but really needs more data in the training sets before it’s ready to entertain all of you.) The categories are pretty straightforward, and they fit into a somewhat neat controlled vocabulary (ontology/thesaurus/whatever.)

There’s a relation, though, between the different terms in this sort of classification and the training data used to build the Bayesian Classifier. If the terms are arranged in a hierarchy (and certain assumptions are made about that hierarchy, like subterms encompassing part of the range of meaning of their parent term and nothing else)[2], then the training data used for classifying terms can be shared.

For example, all positive training data that belongs to the child terms can also be used for the parent. So, for (a constructed) example, positive training data for tamiflu also belongs in the positive data for bird flu vaccines. The reverse is true of negative training data. For negative data, the negative data for the parent can also be used for the child terms.
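Here is a minimal sketch of that sharing rule in shell and awk. It is not the code from my project (that runs on Classifier4J). The two file formats, the function name share_training, and the document ids are all made up for illustration, and it assumes the hierarchy is a plain tree with single-word term names.

```shell
# Hypothetical file formats, made up for this sketch:
#   hierarchy file: one "child parent" pair per line (must form a tree, no cycles)
#   training file:  one "term pos|neg doc_id" triple per line
share_training() {
  awk '
    NR == FNR { parent[$1] = $2; next }   # first file: remember child -> parent links
    { print $1, $2, $3 }                  # every example keeps its original label
    $2 == "pos" {                         # positives also apply to every ancestor
      for (t = $1; (t in parent); t = parent[t])
        print parent[t], "pos", $3
    }
    $2 == "neg" {                         # negatives also apply to every descendant
      for (c in parent)
        for (t = c; (t in parent); t = parent[t])
          if (parent[t] == $1) { print c, "neg", $3; break }
    }
  ' "$1" "$2" | sort -u
}

# Tiny demo using the constructed example from the text.
dir=$(mktemp -d)
printf '%s\n' 'tamiflu bird_flu_vaccines' > "$dir/hierarchy"
printf '%s\n' 'tamiflu pos doc1' 'bird_flu_vaccines neg doc2' > "$dir/training"
share_training "$dir/hierarchy" "$dir/training"
rm -r "$dir"
```

In the demo, the positive document for tamiflu also gets listed under bird_flu_vaccines, and the negative document for bird_flu_vaccines also gets listed under tamiflu.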

This is highly useful information when you’re making a large scale text classifier (and having it classify texts as belonging to categories or not, as opposed to just clustering texts into the categories that actually appear). It’s easier to use things like Bayesian classifiers to do this if you’re looking for somewhat fine-grained detail.

Currently, I’ve been using Classifier4J for doing the classification and text summarization[3]. The text summarization is sort of annoying, though, because it’s based on a simple statistical choice of sentences which occasionally picks up date-lines and partial phrases because of what’s ‘important.’ I’m resisting the urge to go completely POS-tagging nuts on the whole thing and only select sentences of certain types or completeness because this is, after all, a side project. (The number of times I see things like ‘this sentence no verb.’ is astounding, though, and slowly driving me nuts.)

So, another day in the life.

  • 1 although i’m also using a vector space classifier for a related, larger project and it’s driving me less nuts training it.
  • 2 this is called a meronymous (‘part-of’) relationship, and given that half the people who regularly read this blog were in LIS530 or its equivalent at some point, you should remember this.
  • 3 and will probably eventually switch to jBNC http://jbnc.sourceforge.net/ before i go nuts

Coding and Picking the Easy Target. Also: Card Tricks.

So, for those of you who don’t know, I’ve been working part-time at a local company to help pay my way through grad school. That’s a simplification, as I’m a part-owner of the company and I’m mostly getting benefits rather than cash, but for now I’m the main system administrator on one of the main systems they run.

For the last bit, I’ve been tracking down problems in the spam checking software that we use, and it’s been a merry time. Most of the problems have been getting everything on the server to be in a single known compatible state, which is a concept I greatly commend to you if you’re running a server and don’t want to spend lots of time messing with it.

Today’s project was figuring out the source of and eliminating a bunch of error messages that get mailed out to the administrators’ mailbox every night. They’re known harmless, but it’s just aggravating and it might hide other problems.

So, I was looking through the codebase, and I found this little gem:

SPAMD=`ps aux | awk --posix '{ if (($1 ~ /popuser/) && ($0 ~ /\/spamd[[:blank:]]/)) print $2; }' | wc -l | awk '{print $1}'`

You might ask yourself what that does. It’s pretty easy to figure out… it counts the processes whose first ps column matches ‘popuser’ and whose line contains ‘/spamd’ followed by a blank, which is useful for figuring out whether or not spamassassin is running on your server. It’s part of 4psa server assistant. However, this may not work depending on how your server is configured. On my server, this never works because of how ps does its output.

My main point here is that that’s a crazy way to write that code. What the person is actually trying to do is make sure that they’re only getting the main spamassassin process and not any of the child processes. The child processes display as “spamd child”, the main spamassassin process displays as something like “/usr/bin/spamd -u popuser -d -m NUMBER -x --virtual-config-dir=/MAIL/DIR/FOR/YOUR/SERVER/%d/%l --socketpath=/tmp/spamd_full.sock”. So, they’ve got to distinguish between those two lines, and they’ve decided to check for random text in the first one, and written a fairly complex little shell script (calling awk twice!) to do so. They can’t check for just the word ‘popuser’ because it might appear in the path, in case you were wondering.

I replaced this with the following line:

SPAMD=`ps ax | grep -v "grep\|spamd child" | grep -i "spamd " | wc -l | awk '{print $1}'`

This checks for all spamd processes, and just eliminates the ’spamd child’ processes first. Why this way? If you’re trying to choose between two things, and one of them changes from system to system, and one of them is fixed and simple, you probably should try to select the fixed one.
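If you want to try the elimination idea without a live mail server, here is a small sketch that runs the same kind of filter over canned text. The function name count_spamd_parents and the sample process list are made up for illustration. It also swaps in ‘grep -c’ for the ‘wc -l | awk’ tail and uses two ‘-e’ patterns in place of the ‘\|’ alternation, which is my assumption about a slightly more portable spelling and not what the production line does.

```shell
# Hypothetical helper: count spamd parent processes in ps-style text read from stdin.
count_spamd_parents() {
  # Step 1: throw away the lines we know we never want (the grep itself, the children).
  # Step 2: count whatever spamd lines are left.
  grep -v -e 'grep' -e 'spamd child' | grep -c -i 'spamd '
}

# Made-up sample standing in for `ps ax` output.
sample='  101 ?      Ss   0:01 /usr/bin/spamd -u popuser -d
  102 ?      S    0:00 spamd child
  103 ?      S    0:00 spamd child
  104 pts/0  S+   0:00 grep -i spamd '

printf '%s\n' "$sample" | count_spamd_parents
```

On a real box you would feed it with ‘ps ax | count_spamd_parents’.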

So, here I didn’t want the fixed ones, so I eliminated (‘grep -v’) them. It saved me from having to try to pick the one I wanted. It’s generally as easy to select for elimination as it is to select for further processing in computer programs. This is also true in card tricks, incidentally. Just in case you want to do some card tricks.

The basic idea behind a lot of card tricks where you choose between two things is that the magician knows beforehand which one of the two things (s)he wants you to have. So, the magician decides whether you’re selecting an item or selecting an item for elimination at the time you make the choice, to make sure that you get the right item.

Actually, it’s typically mostly used in really bad card tricks. The ‘decisive moments’ blog describes how a similar process to the magician’s force is used in many video games to keep the plot moving in a somewhat linear fashion transparently to the user, and why it fails.

I wonder how much the folks who do massive interactive games like 4orty2wo use this tactic, and whether they’ve found good ways to disguise that it’s happening.