Someone asked me about this recently, so I figured I’d answer here. I’m currently working full-time at Linea Photosharing LLC, a startup located in the Fremont neighborhood of Seattle, WA.
The blog posts on this website are largely scavenged from an earlier WordPress-based site that was hacked, so a bunch of them are out of chronological order. Feel free to ask me any questions about the content of that site using the information on the about page.
Recently, I haven’t been happy with the various ways that Seattle and related groups (Google Maps, OneBusAway, King County Metro) display bus arrival information. Because of that, I’ve created Denny Alps, which displays data for the route I use most days. I figure after I’ve done this for a while, I’ll understand the issues involved better.
When I’m not riding the bus, it only updates every 10 minutes, but I can make that
happen every minute or so if people think this sort of display is useful. So far,
I’ve been enjoying it. Leave a comment if you like it or have suggestions for
improving this sort of view.
Anyway, it’s at http://dennyalps.herokuapp.com for now. I may move it somewhere more permanent if it turns out to be as useful in the long run as I’ve found it in the short run.
Some details on the Soniverse project, which was a game platform from Monstrous that
I worked on as a consultant, are available at the Monstrous website.
Soniverse was a fun project with a great team spread out over North America, although headquartered
in Austin and the Bay Area.
I designed and implemented the server layer for the product. The only difference between the slides and the implementation is that MongoDB wasn’t used in the final versions; it was replaced by a plain relational database for maintainability and simplicity reasons. The queuing, since people have asked, was implemented with Resque, a Redis-based job queue system.
For the last while, I’ve been working on a project that involves scanning large numbers of RSS/Atom feeds and then using Bayesian classifiers to sort each item into one of a number of categories for summarization and display. (The system I’m using to do this is available as a sample website, but it really needs more data in its training sets before it’s ready to entertain all of you.) The categories are pretty straightforward, and they fit into a somewhat neat controlled vocabulary (ontology/thesaurus/whatever).
There’s a relation, though, between the different terms in this sort of classification and the training data used to build the Bayesian Classifier. If the terms are arranged in a hierarchy (and certain assumptions are made about that hierarchy, like subterms encompassing part of the range of meaning of their parent term and nothing else), then the training data used for classifying terms can be shared.
For example, all positive training data belonging to the child terms can also be used for the parent. So, for a constructed example, positive training data for Tamiflu also belongs in the positive data for bird flu vaccines. The reverse holds for negative training data: the negative data for the parent can also be used for the child terms.
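The sharing rule above is small enough to sketch in a few lines of Python. All of the term names, documents, and function names here are made up for illustration; the real system’s vocabulary is much larger.

```python
# Sketch of sharing training data across a term hierarchy.
# Positives flow up (child -> parent); negatives flow down (parent -> child).

hierarchy = {"bird flu vaccines": ["tamiflu"]}  # parent -> child terms

positives = {
    "bird flu vaccines": ["new H5N1 vaccine approved for market"],
    "tamiflu": ["tamiflu shortens flu symptoms in trials"],
}
negatives = {
    "bird flu vaccines": ["stock market rallies on tech earnings"],
    "tamiflu": [],
}

def effective_positives(term):
    """A term's positive set includes its own positives plus its children's."""
    docs = list(positives.get(term, []))
    for child in hierarchy.get(term, []):
        docs.extend(effective_positives(child))
    return docs

def effective_negatives(term):
    """A term's negative set includes its own negatives plus its parents'."""
    docs = list(negatives.get(term, []))
    for parent, children in hierarchy.items():
        if term in children:
            docs.extend(effective_negatives(parent))
    return docs
```

So the classifier for ‘tamiflu’ trains on its own positives plus the parent’s negatives, while the one for ‘bird flu vaccines’ gets its own data plus the child’s positives, without anyone labeling the same document twice.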
This is highly useful information when you’re building a large-scale text classifier (one that classifies texts as belonging to categories or not, as opposed to just clustering texts into the categories that actually appear). It’s easier to use things like Bayesian classifiers to do this if you’re looking for somewhat fine-grained detail.
Currently, I’ve been using Classifier4J for the classification and text summarization. The text summarization is sort of annoying, though, because it’s based on a simple statistical choice of sentences, which occasionally picks up date-lines and partial phrases as ‘important.’ I’m resisting the urge to go completely POS-tagging nuts on the whole thing and only select sentences of a certain type or completeness, because this is, after all, a side project. (The number of times I see things like ‘this sentence no verb.’ is astounding, though, and slowly driving me nuts.)
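For the curious, that ‘simple statistical choice of sentences’ works roughly like this: score each sentence by how frequent its words are in the whole text, and keep the top scorers. (This is a generic sketch of the technique, not Classifier4J’s actual code.)

```python
import re
from collections import Counter

def naive_summary(text, n=1):
    """Return the n sentences whose words are most frequent overall.

    A generic frequency-based extractive summarizer, not
    Classifier4J's actual implementation.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        # Sum the document-wide frequency of each word in the sentence.
        return sum(freq[w] for w in re.findall(r"[a-z]+", sentence.lower()))

    return sorted(sentences, key=score, reverse=True)[:n]

print(naive_summary("Dogs bark. Dogs bark loudly at dogs. Cats meow."))
```

Note that nothing in the scoring knows anything about grammatical completeness, which is exactly how date-lines and verbless fragments full of high-frequency words sneak into the ‘summary.’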
So, another day in the life.
1 Although I’m also using a vector-space classifier for a related, larger project, and it’s driving me less nuts training it.
2 This is called a meronymic (’part-of’) relationship, and given that half the people who regularly read this blog were in LIS530 or its equivalent at some point, you should remember this.
3 And I will probably eventually switch to jBNC (http://jbnc.sourceforge.net/) before I go nuts.
So, for those of you who don’t know, I’ve been working part-time at a local company to help pay my way through grad school. That’s a simplification of the truth, since I’m a part-owner of the company and I’m mostly compensated in benefits rather than cash, but for now I’m the main system administrator on one of the main systems they run.
For the last little while, I’ve been tracking down problems in the spam checking software that we use, and it’s been a merry time. Most of the work has been getting everything on the server into a single known-compatible state, which is a concept I heartily commend to you if you’re running a server and don’t want to spend lots of time messing with it.
Today’s project was figuring out the source of, and eliminating, a bunch of error messages that get mailed to the administrators’ mailbox every night. They’re known to be harmless, but they’re aggravating and might hide other problems.
So, I was looking through the codebase, and I found this little gem:
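The snippet itself got lost somewhere between the old site and this one, but judging from the description below, it was shaped something like this. This is a hypothetical reconstruction run against canned ps output (the real thing was gnarlier, and spamd isn’t running where you’re reading this):

```shell
# Hypothetical reconstruction of the 4PSA check, not the original code.
# Canned ps output stands in for the live process listing:
ps_output='/usr/bin/spamd -u popuser -d -m 5 -x
spamd child
spamd child'

# Match 'spamd ', then match 'popuser' (calling awk twice!), and count:
count=$(printf '%s\n' "$ps_output" \
  | awk '/spamd /' \
  | awk '/popuser/' \
  | wc -l)
echo "$count"
```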
You might ask yourself what that does. It’s pretty easy to figure out: it counts the number of processes matching 'spamd ' followed by 'popuser', which is useful for figuring out whether or not SpamAssassin is running on your server. It’s part of the 4PSA server assistant. However, this may not work depending on how your server is configured. On my server, it never works because of how ps formats its output.
My main point here is that that’s a crazy way to write that code. What the author is actually trying to do is make sure they’re only getting the main SpamAssassin process and not any of the child processes. The child processes display as "spamd child", while the main SpamAssassin process displays as something like "/usr/bin/spamd -u popuser -d -m NUMBER -x --virtual-config-dir=/MAIL/DIR/FOR/YOUR/SERVER/%d/%l --socketpath=/tmp/spamd_full.sock". So they’ve got to distinguish between those two lines, and they’ve decided to check for random text in the long one, writing a fairly complex little shell script (calling awk twice!) to do so. In case you were wondering: they can’t check for just the word 'popuser', because it might appear in the path.
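What I replaced it with was something like the following sketch, again run against canned ps output so it’s reproducible anywhere:

```shell
# Canned ps output standing in for the live listing on the server:
ps_output='/usr/bin/spamd -u popuser -d -m 5 -x
spamd child
spamd child'

# Match all spamd lines, then throw away the fixed, easy-to-spot
# 'spamd child' lines. Whatever is left is the parent process.
count=$(printf '%s\n' "$ps_output" \
  | grep 'spamd' \
  | grep -v 'spamd child' \
  | wc -l)
echo "$count"
```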
This checks for all spamd processes, and just eliminates the ’spamd child’ processes first. Why this way? If you’re trying to choose between two things, and one of them changes from system to system, and one of them is fixed and simple, you probably should try to select the fixed one.
Here I didn’t want the fixed ones, so I eliminated them ('grep -v'). That saved me from having to try to pick out the one I wanted. In computer programs, it’s generally as easy to select items for elimination as it is to select them for further processing. This is also true in card tricks, incidentally. Just in case you want to do some card tricks.
The basic idea behind a lot of card tricks where you choose between two things is that the magician knows which of the two he or she wants you to have beforehand. So the magician decides whether you’re selecting an item to keep or selecting an item for elimination at the time you make the choice, to make sure you end up with the right one.
So, as a helpful hint for people doing taxonomy-type work, or any controlled vocabulary: the difference between pre-coordinate and post-coordinate terms is pretty obvious once you can actually remember it. Here it is:
Precoordinate: concepts are combined into terms before the thesaurus is created.
Postcoordinate: concepts are combined into terms after the thesaurus is created, i.e. usually at time of use.
I know that someday someone who needs this information will find it searching the web, and that makes me (relatively) happy.
The Intellectual Foundation of Information Organization is somewhat heavy going, but it’s the definitive work on a lot of areas in information organization. Many people first encounter information organization issues professionally in technical fields, but these issues have been around for ages in business and libraries.
This book packs a huge amount of information into a very small space, and it has something to say about almost every issue involved in organizing information. I highly recommend it to anyone in an MLIS/MSIM program; in fact, I looked it up today for someone whose class wasn’t reading it for some reason. To use this book, an information architect or other web designer has to be able to think about their work in terms of basic principles, looking for information about controlled vocabularies rather than the latest buzzword. The payoff from doing so is high, though, due to the clarity of the information presented.
The writing in this book is in the ‘little red schoolhouse’ academic style from the University of Chicago. I found it very easily digestible and understandable, and it had a profound effect on how I thought about information organization; I credit doing very well in my classes on the subject, and being able to speak intelligibly about it outside of class, to having started out by reading this book. I recommend it highly.
Apparently, I still remember obscure things I learned about British calendars, about 14 years after college. I’m fascinated by the measurement of time, especially from periods that had a radically different notion of what time is, how it should be measured, and how it passes than we do today.
An example of this would be a medieval book of days, which described what people did in particular seasons; but also see an old British book of days, for things that happened on particular dates that people in the 1880s felt were important.