My ongoing suspicion about why people build applications on second-level administrative
regions (US: states, CA: provinces, UK: it's complicated) rather than third-level
administrative regions (counties/parishes/ridings) is that in the US, expectations
at the third level are wildly off from reality.
The most exciting example of this is that ‘Brooklyn’ is a borough of
New York City located in ‘Kings County.’ So ‘Kings County’ properly
appears in a lot of places where you’d expect to see Brooklyn. More
generally, the overlap between administrative regions and what people
actually call things breaks down pretty hard at that level. It goes
from something you can pick up easily from a web service to
custom-programmed madness, which is why
Foursquare gets it right while most small
startups stop at the higher level.
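Concretely, the NYC case looks like this. A small sketch in Ruby; the borough-to-county names are the legal correspondences, but real geo services return much messier data than a hand-built hash:

```ruby
# NYC boroughs and the counties they legally correspond to.
# The borough names are what people actually say; the county names
# are what third-level administrative data hands you back.
BOROUGH_TO_COUNTY = {
  "Manhattan"     => "New York County",
  "Brooklyn"      => "Kings County",
  "Queens"        => "Queens County",
  "The Bronx"     => "Bronx County",
  "Staten Island" => "Richmond County",
}

# Going the other way is what a display layer usually needs.
COUNTY_TO_BOROUGH = BOROUGH_TO_COUNTY.invert

puts COUNTY_TO_BOROUGH["Kings County"]
```

Outside of a handful of well-documented special cases like this one, there's no tidy table to invert, which is where the custom-programmed madness starts.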
For the last several weeks, I’ve been working on learning
Node and Go as adjuncts to
Ruby, which has been my main programming language for the last while.
So far, Node has been going fairly well, possibly because I spent
about 5 years doing asynchronous programming with completions, trying
to push more packets back to users at one of my previous gigs.
Go is interesting, but so far all I’ve done with it is modify a
Redis-based WebSocket handler for use with a longer-term project that
I’m working on. Its process model seems to suit WebSockets better
than Rails 3.x/4.x does, and the competing Ruby frameworks seem to be
defunct or little-used in production.
Linea, my employer, wound down at the end of November, and I’ve been
the person keeping the service up and running since then. As of today, we’re
fully shut down. I want to take this opportunity to thank my team and my
co-workers, all of whom have been fantastic.
As for myself, I’m going to be working on consulting projects and
other side projects for the next several months before I start looking
for another full-time position. If you’re interested in talking in the meantime,
contact me; I’m always interested in hearing from people with
interesting projects going on.
I used to be an applications developer and devops person for an online education startup, and I always kept this image from XKCD around to remind me of the security issues inherent in running a school. The place was for smart teens who were interested in technical things, so they’d frequently try out any little hacks they could find.
I didn’t have any problems during my ‘tenure’ at that employer, but I did have some entertaining logfile monitoring set up to watch students trying things. People are basically hilarious.
Someone asked me about this recently, so I figured I’d answer here. I’m currently working full-time at Linea Photosharing LLC, a startup located in the Fremont neighborhood of Seattle, WA.
The blog posts on this website are largely scavenged from an earlier WordPress-based site that was hacked, so a bunch of them are out of chronological order. Feel free to ask me any questions about the content of that site using the contact information on the about page.
Recently, I haven’t been happy with the various ways that Seattle-area groups
(Google Maps, OneBusAway, King County Transit) display bus arrival information. Because
of that, I’ve created Denny Alps, which displays arrival
data for the route I use most days. I figure that after I’ve done this for a while, I’ll
understand the issues involved better.
When I’m not riding the bus, it only updates every 10 minutes, but I can make that
happen every minute or so if people think this sort of display is useful. So far,
I’ve been enjoying it. Leave a comment if you like it or have suggestions for
improving this sort of view.
Anyway, it’s at http://dennyalps.herokuapp.com for now. I may move it somewhere more permanent if
it turns out to be as useful in the long run as I’ve found it in the short run.
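For the curious, the display side is nothing fancy. Here’s a rough Ruby sketch of the formatting step; the field names and sample route are invented for illustration and aren’t the actual OneBusAway response shape:

```ruby
require "time"

# Turn a list of predicted arrivals into short display lines like
# "8 to Seattle Center: 3 min". The payload shape is hypothetical.
def arrival_lines(arrivals, now = Time.now)
  arrivals.map do |a|
    mins  = ((a[:predicted_at] - now) / 60).round
    label = mins <= 0 ? "due" : "#{mins} min"
    "#{a[:route]} to #{a[:headsign]}: #{label}"
  end
end

now = Time.parse("2012-06-01 08:00:00")
sample = [
  { route: "8", headsign: "Seattle Center", predicted_at: now + 180 },
  { route: "8", headsign: "Rainier Beach",  predicted_at: now + 660 },
]
puts arrival_lines(sample, now)
```

The real work is in fetching and caching the upstream data on the 10-minute (or 1-minute) schedule; the formatting is the easy part.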
Some details on the Soniverse project, which was a game platform from Monstrous that
I worked on as a consultant, are available at the Monstrous website.
Soniverse was a fun project with a great team spread out over North America, although headquartered
in Austin and the Bay Area.
I designed and implemented the server layer for the product. The only difference between the slides
and the implementation is that MongoDB wasn’t used in the final versions; it was replaced by a straight-up
relational database for maintainability and simplicity reasons. The queuing, since people have asked, was
implemented with Resque, a Redis-based processing queue system.
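For reference, a Resque job is just a plain Ruby class with a queue name and a class-level perform method. This hypothetical job shows the shape; the names are invented, not from the actual Soniverse code:

```ruby
# A Resque job: a plain Ruby class, a @queue name that workers watch,
# and a self.perform that receives the enqueued (JSON-serializable) args.
class ScoreUpdateJob
  @queue = :game_events

  def self.perform(player_id, delta)
    # In a real system this would read and write the datastore; here we
    # just compute against a placeholder value so the shape is visible.
    current = 100 # placeholder for a lookup by player_id
    current + delta
  end
end

# Producers enqueue with the class and args; Resque pushes a JSON
# payload onto the Redis list for :game_events:
#   Resque.enqueue(ScoreUpdateJob, 42, 10)
```

Workers pull from the Redis list and call `perform`, which keeps the web processes free to respond quickly.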
For the last while, I’ve been working on a project that involves scanning large numbers of RSS/Atom feeds and then using Bayesian classifiers to sort each item into one of a number of categories for summarization and display. (The system I’m using to do this is available as a sample website, but it really needs more data in the training sets before it’s ready to entertain all of you.) The categories are pretty straightforward, and they fit into a somewhat neat controlled vocabulary (ontology/thesaurus/whatever).
There’s a relation, though, between the different terms in this sort of classification and the training data used to build the Bayesian Classifier. If the terms are arranged in a hierarchy (and certain assumptions are made about that hierarchy, like subterms encompassing part of the range of meaning of their parent term and nothing else), then the training data used for classifying terms can be shared.
For example, all positive training data that belongs to the child terms can also be used for the parent: positive training data for Tamiflu, to use a constructed example, also belongs in the positive data for bird flu vaccines. The reverse holds for negative training data: negative data for the parent can also be used for the child terms.
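The sharing rule is mechanical enough to sketch in Ruby: positives propagate up from child to parent, negatives propagate down from parent to child. The term names and examples here are just the constructed ones from above:

```ruby
# Hierarchy: child term => parent term.
PARENT = { "tamiflu" => "bird flu vaccines" }

# Seed training data per term (made-up example documents).
positives = {
  "tamiflu"           => ["tamiflu dosing news"],
  "bird flu vaccines" => ["vaccine trial report"],
}
negatives = {
  "tamiflu"           => [],
  "bird flu vaccines" => ["celebrity gossip item"],
}

# Positives flow up: a positive example for a child term is also a
# positive example for its parent.
PARENT.each do |child, parent|
  positives[parent] |= positives[child]
end

# Negatives flow down: a negative example for the parent is also a
# negative example for each of its children.
PARENT.each do |child, parent|
  negatives[child] |= negatives[parent]
end

p positives["bird flu vaccines"]
p negatives["tamiflu"]
```

With a deeper hierarchy you’d repeat the propagation along every child-parent edge (bottom-up for positives, top-down for negatives), but the rule per edge is the same.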
This is highly useful when you’re building a large-scale text classifier (and having it classify texts as belonging to categories or not, as opposed to just clustering texts into the categories that actually appear). It’s easier to use things like Bayesian classifiers to do this if you’re looking for somewhat fine-grained detail.
Currently, I’ve been using Classifier4J for the classification and text summarization. The text summarization is sort of annoying, though, because it’s based on a simple statistical choice of sentences, which occasionally picks up datelines and partial phrases as ‘important.’ I’m resisting the urge to go completely POS-tagging nuts on the whole thing and only select sentences of certain types or completeness, because this is, after all, a side project. (The number of times I see things like ‘this sentence no verb.’ is astounding, though, and it’s slowly driving me nuts.)
So, another day in the life.
1. Although I’m also using a vector space classifier for a related, larger project, and it’s driving me less nuts to train.
2. This is called a meronymous (‘part-of’) relationship, and given that half the people who regularly read this blog were in LIS530 or its equivalent at some point, you should remember it.
So, for those of you who don’t know, I’ve been working part-time at a local company to help pay my way through grad school. That’s actually a simplification: I’m a part-owner of the company, and I’m mostly receiving benefits rather than cash, but for now I’m the main system administrator on one of the main systems they run.
For the last bit, I’ve been tracking down problems in the spam-checking software we use, and it’s been a merry time. Most of the work has been getting everything on the server into a single known-compatible state, a concept I greatly commend to you if you’re running a server and don’t want to spend lots of time messing with it.
Today’s project was figuring out the source of, and eliminating, a bunch of error messages that get mailed to the administrators’ mailbox every night. They’re known to be harmless, but they’re aggravating and might hide other problems.
So, I was looking through the codebase, and I found this little gem:
You might ask yourself what that does. It’s pretty easy to figure out: it counts the number of process-table entries matching 'spamd ' followed by 'popuser', which is useful for figuring out whether or not SpamAssassin is running on your server. It’s part of the 4PSA server assistant. However, it may not work depending on how your server is configured; on my server it never works, because of how ps formats its output.
My main point here is that that’s a crazy way to write that code. What the author is actually trying to do is make sure they’re only getting the main SpamAssassin process and not any of the child processes. The child processes display as "spamd child"; the main SpamAssassin process displays as something like "/usr/bin/spamd -u popuser -d -m NUMBER -x --virtual-config-dir=/MAIL/DIR/FOR/YOUR/SERVER/%d/%l --socketpath=/tmp/spamd_full.sock". So they have to distinguish between those two lines, and they’ve decided to check for incidental text in the first one, writing a fairly complex little shell script (calling awk twice!) to do so. They can’t check for just the word 'popuser' because it might appear in the path, in case you were wondering.
My replacement checks for all spamd processes and simply eliminates the 'spamd child' lines first. Why this way? If you’re trying to choose between two things, and one of them changes from system to system while the other is fixed and simple, you probably should select on the fixed one.
Here I didn’t want the fixed ones, so I eliminated them ('grep -v'). That saved me from having to pick out the one I wanted. In programs, it’s generally as easy to select things for elimination as it is to select them for further processing. This is also true in card tricks, incidentally, just in case you want to do some card tricks.
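In Ruby terms, the same select-by-elimination idea looks like this; the ps lines below are invented stand-ins for illustration, not real output from my server:

```ruby
# Hypothetical `ps` output: one parent spamd process (whose command line
# varies from system to system) and its fixed-format children.
ps_lines = [
  "/usr/bin/spamd -u popuser -d --socketpath=/tmp/spamd_full.sock",
  "spamd child",
  "spamd child",
]

# Selecting the parent directly means matching its variable command
# line. Eliminating the fixed "spamd child" lines is the stable move.
main_processes = ps_lines.reject { |line| line.include?("spamd child") }
puts main_processes.length
```

`reject` here plays the role of `grep -v`: whatever the parent’s command line happens to look like on a given box, it survives the filter.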
The basic idea behind a lot of card tricks where you choose between two things is that the magician knows which of the two they want you to have beforehand. So the magician decides, at the moment you make the choice, whether you’re selecting an item or selecting an item for elimination, to make sure you end up with the right one.