computer programming


Drupal 5 has a few problems in its security layer, as I’ve mentioned elsewhere, and some of them stem from the sort of ‘it-works-for-me’ philosophy of open source. This is particularly a problem in a complex system like Drupal, which in most installations is made up of a few dozen modules in addition to the core.

The current issue I’m having is that nodes created by the aggregation module get their taxonomy stripped when they’re updated because of how another module uses the security functionality, which is just hilarious in a site that’s largely organized organically by taxonomy. So, after talking with the people I’m working for on the site, I ended up creating a simple PHP script to run through cron that fixes the issues ‘the hard way.’

If you check out this query…

function fix_object($name, $sqlcon)
{
  // Escape the name before splicing it into the LIKE pattern
  $name = mysql_real_escape_string($name, $sqlcon);
  $query = "SELECT term_data.name name, term_data.tid termid, node.nid nodeid, node.title title FROM node LEFT JOIN term_node ON ( term_node.nid = node.nid ) LEFT JOIN term_data ON ( term_data.tid = term_node.tid ) WHERE node.type = 'aggregation_item' AND node.title LIKE 'Xxxxx " . $name . "%'";

  // Perform Query
  $result = mysql_query($query, $sqlcon);
  // ... and so on...

You can see that this is a fairly normal SQL query that looks for all the nodes of type aggregation_item whose titles match a particular pattern. Because the joins are LEFT JOINs, any nodes that have lost their taxonomies will have NULL for name and termid. Those nodeids with NULL termids can then have the proper taxonomy entries stuffed back into them…

function insert_taxo_4_node($node_id, $taxo_id, $con)
{
  // Cast both ids to int so only numbers reach the query
  $query = "INSERT INTO term_node (nid, tid) VALUES (" . (int) $node_id . "," . (int) $taxo_id . ")";

  $result = mysql_query($query, $con);
  // Check result
  // This shows the actual query sent to MySQL, and the error. Useful for debugging.
  if (!$result)
    {
      $message  = 'Invalid query: ' . mysql_error($con) . "\n";
      $message .= 'Whole query: ' . $query;
      die($message);
    }
}
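
Put together, the repair pass looks roughly like the following. This is only a sketch, written in plain Ruby rather than the PHP of the actual cron script, with in-memory arrays standing in for the node and term_node tables. The sample rows, the helper names and the default_tid argument are all made up for illustration, and the prefix match is case-sensitive, which a MySQL LIKE usually is not.

```ruby
# In-memory stand-ins for the Drupal tables (hypothetical sample data).
NODES = [
  { nid: 1, type: "aggregation_item", title: "Xxxxx Alpha one" },
  { nid: 2, type: "aggregation_item", title: "Xxxxx Alpha two" },
  { nid: 3, type: "story",            title: "Xxxxx Alpha three" }
]

# Emulates the LEFT JOIN: every matching node yields at least one row,
# and a node with no term_node entry yields a row whose :tid is nil.
def left_join_terms(nodes, term_node, prefix)
  nodes
    .select { |n| n[:type] == "aggregation_item" && n[:title].start_with?(prefix) }
    .flat_map do |n|
      links = term_node.select { |tn| tn[:nid] == n[:nid] }
      links = [{ tid: nil }] if links.empty?
      links.map { |tn| { nid: n[:nid], tid: tn[:tid] } }
    end
end

# Re-attaches a term to every node that came back with a nil tid.
# Returns the nids that were repaired.
def repair_taxonomy(nodes, term_node, prefix, default_tid)
  orphans = left_join_terms(nodes, term_node, prefix).select { |row| row[:tid].nil? }
  orphans.each { |row| term_node << { nid: row[:nid], tid: default_tid } }
  orphans.map { |row| row[:nid] }
end
```

Running the pass a second time finds nothing to repair, which is what you want from something that runs out of cron.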

I’m largely posting this up in case people run into the same problem. This is a hilariously simple fix for a difficult to fix problem in Drupal, but it’s also an instance of a generic information architecture issue: what to do when the system that you’re working on is unreliable. I should probably mention that the issues with security in Drupal aren’t related to authentication, but instead are related to item ACLs denying access to things for strange reasons, and are not crucial security bugs in the OMG MUST PATCH NOW sense.


I’ve been working on a website in RoR for the last while, and it’s about to go live in the private beta sort of way that seems to be so popular these days. It’s handy that way, because that way I can set up the site at slicehost or similar and not have to worry (too much) about my server slowing from getting overloaded. This same site’s next incarnation is going to be facebook related, so that should overwhelm any sense of moderation (if I’m lucky.)

So, the key method of invitation to a private beta is that you mail someone a code allowing them access to the system. For these purposes, let’s just assume that the code is some reasonably long unique string (in my code, it’s actually a uuid.) So, set up a migration something like this to manage them:

  def self.up
    create_table :invites do |t|
# deleted stuff
      t.string :guid
      t.integer :used_yet
# deleted stuff
    end
  end

used_yet isn’t a boolean for reasons that are too laborious to go into here, but reflect some functionality in the code that I’m not going to display. Assuming that you’re using acts_as_authenticated and are redirecting anyone who tries to access your app to the default welcome page according to the usual methods, set up something like this in your routes.rb:

  map.root :controller => "welcome"

This is probably the case in like half the rails apps out there. Have the index method of the welcome controller put up a form with a field like:

#let's see if the formatter can handle rails erb without exploding.
<% form_tag('welcome/checkinvite', :method=>:get) do -%>
  <%= text_field_tag 'invite' %>
  <%= submit_tag 'begin' %>
<% end -%>

This lets the user enter their invite in more or less the normal method. Now in your ‘welcome’ controller, you’ll need a ‘checkinvite’ method that looks something like the following:

 def checkinvite
    @inviteguid = params[:invite]
    @invite = Invite.find_by_guid(@inviteguid)
    if(@invite == nil)
      flash[:notice] = "Your invite was invalid"
      redirect_to root_url
      return
    end
    if(@invite.used_yet == 1)
      flash[:notice] = "Your invite had already been used"
      redirect_to root_url
      return
    end
  end

After this, you’ll need to have some code in your HTML page that links you to the account/signup functionality of acts_as_authenticated. I’m not going to include that because I’m too lazy to fish it out of my app functionality, but you can do that pretty much with a link_to using :invite => @inviteguid as an extra parameter.

You need to put the same invite detection code in account/signup, and then when you’ve created the account, set invite.used_yet = 1. This is about as simple as a method that I can think of for doing the private beta functionality that seems to be so much in vogue these days. Enjoy.
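
For the signup side, the check-then-mark-used step might look something like this. It’s a sketch in plain Ruby with no Rails involved: the Struct stands in for the Invite model, and InviteStore, check and redeem are names I’ve made up for illustration, not anything out of acts_as_authenticated.

```ruby
# Minimal stand-in for the Invite model; used_yet is 0 or 1, as in the migration.
Invite = Struct.new(:guid, :used_yet)

# Hypothetical in-memory store keyed by guid, in place of Invite.find_by_guid.
class InviteStore
  def initialize(invites)
    @by_guid = invites.to_h { |inv| [inv.guid, inv] }
  end

  # Returns [:ok, invite], [:invalid, nil] or [:used, invite].
  def check(guid)
    invite = @by_guid[guid]
    return [:invalid, nil] if invite.nil?
    return [:used, invite] if invite.used_yet == 1
    [:ok, invite]
  end

  # Called once the account has been created: re-check, then mark the invite used.
  def redeem(guid)
    status, invite = check(guid)
    invite.used_yet = 1 if status == :ok
    status
  end
end
```

The point of running the check again inside redeem is that the invite may have been used between the welcome page and the signup form.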


There have been a lot of people asking angry questions of Apple today because the β that Apple gave out to iPhone developers was timed to expire today, and a lot of devs have now bricked their main mobile phone until an update appears. Lots of people appear angry, but they’re missing the main issue for Apple:

Dear Apple, why are you letting people this stupid into your β programs?

People frequently forget what beta for software means in these days where everything is β until people find a way to make money off of it. It means untested, believed working properly but may blow up at any time, not ready for production. So, I’m halfway between bemused and annoyed at the outrage that some folks seem to be fielding on various fora.

Also, calling a phone ‘bricked’ when you can easily recover it by downloading new software hours later is hitting the epistemological puff pastry with a hammer.


My first access to a unix machine was around 19 years ago, and I’m still amazed that sudo tcsh is a valid command on most systems.

I’m not saying that it isn’t convenient, mind you, but the fact that I can then execute emacs is also hilarious. Especially because sudo emacs is prohibited.

here is your system log, let me save you the trouble of auditing it by running a shell.

(I’m aware, incidentally, that it’s basically impossible to stop people from running a shell as long as they can run any naive-turing-complete interpreter or compiler. Maybe it’s time to only fight battles you can win.)


note that the latest revision of this blog’s theme seems to have introduced a weird bug with the code layout plugin (wp-syntax) on some browsers. i’m looking into it.

I think the single most useful thing I’ve figured out recently in programming for MacOSX and the iPhone is this little snippet right here.

- (void) updateListForEntityNamed:(NSString*) entityName andSearchString:(NSString*) queryString
{
[...]
 
	MyDocument* current = [[NSDocumentController sharedDocumentController] currentDocument];
	if(current && current != self)
	{
		NSLog(@"CurrentDocument:%@ != self:%@", current, self);
		[current updateListForEntityNamed: entityName andSearchString: queryString];
		return;
	}
[...]
}

What this does is intercept incoming messages that are supposed to go to the current document’s window and redirect them to that instance. I’ve run into cases in Leopard (MacOSX 10.5) where a message lands on a document other than the current one. To some extent, this is probably a misconfiguration in Interface Builder somewhere, but it is also an issue when using CoreData, because the ManagedObjectContexts are particular to instances of NSPersistentDocument, and problems arise if you end up using the wrong context.

I am slowly becoming a great fan of CoreData, it’s a great persistence/object-graph-management layer. More on this later.


For the last while, I’ve been working on a project that involves scanning large numbers of RSS/Atom feeds, and then using Bayesian[1] classifiers to sort each item into one of a number of categories for summarization and display (the system that I’m using to do this is available as a sample website, but really needs more data in the training sets before it’s ready to entertain all of you.) The categories are pretty straightforward, and they fit into a somewhat neat controlled vocabulary (ontology/thesaurus/whatever.)

There’s a relation, though, between the different terms in this sort of classification and the training data used to build the Bayesian Classifier. If the terms are arranged in a hierarchy (and certain assumptions are made about that hierarchy, like subterms encompassing part of the range of meaning of their parent term and nothing else)[2], then the training data used for classifying terms can be shared.

For example, all positive training data that belongs to the child terms can also be used for the parent. So, for (a constructed) example, positive training data for tamiflu also belongs in the positive data for bird flu vaccines. The reverse is true of negative training data. For negative data, the negative data for the parent can also be used for the child terms.
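
Here’s a small sketch of that sharing rule. It’s plain Ruby, not anything from Classifier4J, and the hierarchy, the document names and the method names are all made up for illustration. It assumes the hierarchy is a tree, so walking up from a term always terminates.

```ruby
# Hypothetical two-level hierarchy: child term => parent term.
PARENT = { "tamiflu" => "bird flu vaccines" }

# Walks up the hierarchy and returns every ancestor of the term.
def ancestors(term, parent_of)
  chain = []
  while (term = parent_of[term])
    chain << term
  end
  chain
end

# Positive examples flow up from a term's descendants;
# negative examples flow down from its ancestors.
def shared_training_data(term, positives, negatives, parent_of)
  descendants = parent_of.keys.select { |t| ancestors(t, parent_of).include?(term) }
  pos = ([term] + descendants).flat_map { |t| positives.fetch(t, []) }
  neg = ([term] + ancestors(term, parent_of)).flat_map { |t| negatives.fetch(t, []) }
  { positive: pos.uniq, negative: neg.uniq }
end
```

So the parent term ends up trained on its own positives plus its children’s, and each child ends up with its own negatives plus its parent’s.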

This is highly useful information when you’re making a large scale text classifier (and having it classify texts as belonging to categories or not, as opposed to just clustering texts into the categories that actually appear.) It’s easier to use things like Bayesian classifiers to do this if you’re looking for somewhat fine-grained detail.

Currently, I’ve been using Classifier4J for doing the classification and text summarization[3]. The text summarization is sort of annoying, though, because it’s based on a simple statistical choice of sentences which occasionally picks up date-lines and partial phrases because of what’s ‘important.’ I’m resisting the urge to go completely POS-tagging nuts on the whole thing and only selecting sentences of certain types or completeness because this is, after all, a side project. (The number of times I see things like ‘this sentence no verb.’ is astounding, though, and slowly driving me nuts.)

So, another day in the life.

[1] although i’m also using a vector space classifier for a related, larger project and it’s driving me less nuts training it.
[2] this is called a meronymous (’part-of’) relationship, and given that half the people who regularly read this blog were in LIS530 or its equivalent at some point, you should remember this.
[3] and will probably eventually switch to jBNC http://jbnc.sourceforge.net/ before i go nuts

So, I was talking to someone today about their application (which was Ruby on Rails-based), and we had a long conversation about locking. There’s a couple of different sorts of locks that show up in software development, but there’s one in particular that mostly only shows up in enterprise software development, the Long-lived Lock.

Locks are used to keep other processes from modifying resources in the system. These can show up at a variety of levels ranging from Critical Sections (Java / Win ) that synchronize access to particular pieces of code, to database locks, which keep people from reading from or writing to rows or tables while operations are done.

However, all of these operations are for short periods of time. You can’t keep a read or write lock on a row in a database for an extended period of time (or in cases where you can, you almost certainly shouldn’t.) About the longest time a row in a database should be locked is to perform a single transaction (which may be spread between multiple databases, rows, or what have you, but the time is just the changes for the transaction, not all the time that people spend staring at a screen and entering data before hitting the return key.)

But how do you let a user lock information for an extended period of time? For example, say the user is locking a row in the database that represents a document that they’re updating (a frequent setup in most ECM/DM systems.) Well, since that’s part of the ECM system, that should happen inside the logic of that application. It shouldn’t be achieved through database locking, but should instead be stored as information within the database.

It’s possible to set this up a number of different ways, but let’s assume you have a document table document and it has, by convention, an id column that represents the primary key on the table. I’m also going to make the assumption that writing to a document is done by a particular user. Your application’s security system may vary.

So, let’s look at a table set up for locking on the document table:

TABLE doc_lock
     document_id : INTEGER
     user_id : INTEGER
     lock_expires: DATETIME
END

And you just join this table in when you need to know if there are locks on a particular object, and you otherwise create and delete locks as needed. One particular thing about this sort of locking strategy is that you end up with expired locks accumulating on documents, so you want to clean those up, and also when you join in the lock table you want to have non-expired locks only.
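
To make the operations on that table concrete, here’s a rough sketch in plain Ruby, with an array standing in for doc_lock. The class and method names are mine, made up for illustration, and a real version would run against the database, where the check-then-insert in acquire has to happen atomically (inside a transaction, or backed by a unique constraint on document_id).

```ruby
# One row of the doc_lock table.
DocLock = Struct.new(:document_id, :user_id, :lock_expires)

# Hypothetical in-memory version of the doc_lock table.
class LockTable
  def initialize
    @rows = []
  end

  # The non-expired lock on a document, or nil. This mirrors joining
  # doc_lock in with a "lock_expires > now" condition.
  def active_lock(document_id, now = Time.now)
    @rows.find { |r| r.document_id == document_id && r.lock_expires > now }
  end

  # Takes (or renews) a lock unless another user holds a live one.
  def acquire(document_id, user_id, ttl_seconds, now = Time.now)
    held = active_lock(document_id, now)
    return false if held && held.user_id != user_id
    @rows.reject! { |r| r.document_id == document_id }
    @rows << DocLock.new(document_id, user_id, now + ttl_seconds)
    true
  end

  # Periodic cleanup of expired rows; returns how many were removed.
  def purge_expired(now = Time.now)
    before = @rows.size
    @rows.reject! { |r| r.lock_expires <= now }
    before - @rows.size
  end
end
```

Passing the current time in as an argument is just there to make the expiry behavior easy to exercise; in the application it would default to the clock.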

Your app needs behavior about various things to surround this, like what’s the security model surrounding locks (who can know about them, are they on a user/group/role basis, etc…), and when can a lock be broken. Sooner or later, you’ll need to break locks, like for an employee on vacation who’s got documents locked or similar. But that’s all above the database structure and the immediate operations on the lock table, which I’m discussing here.

Well, that’s part one of three. The next segment will be the Ruby-on-Rails implementation I sketched out for my interlocutor, and the last will be some variations on and exceptions to this idea. I consider long-lived locks a design pattern, because it’s a recurring pattern in enterprise computing.

Some comments on Hivelogic - The Narrative - Building Ruby, Rails, Subversion, Mongrel, and MySQL on Mac OS X

I’ve been using this set of instructions to install ruby on rails on MacOSX for a while (in case you’ve ever wondered, which you haven’t, I use a MacBook Pro set up to run Windows XP and MacOSX 10.4.x.) It doesn’t work well for me, because I use ‘tcsh’ and not ‘bash’ as my shell on the computer. I also like confining changes to my own account.

So, I use the instructions given in the cited article, with the following differences.

Paths
Here, add the following line to the end of your .cshrc

setenv PATH /usr/local/bin:/usr/local/sbin:/usr/local/mysql/bin:/sw/bin:$PATH

(This is all just one long line)

For the rest, I replace all instances of ’sudo command’ with ’sudo tcsh’ followed by the command. More concretely, instead of:

curl -O ftp://ftp.ruby-lang.org/pub/ruby/1.8/ruby-1.8.6.tar.gz
tar xzvf ruby-1.8.6.tar.gz
cd ruby-1.8.6
./configure --prefix=/usr/local --enable-pthread --with-readline-dir=/usr/local
make
sudo make install
sudo make install-doc
cd ..

I do:

curl -O ftp://ftp.ruby-lang.org/pub/ruby/1.8/ruby-1.8.6.tar.gz
tar xzvf ruby-1.8.6.tar.gz
cd ruby-1.8.6
./configure --prefix=/usr/local --enable-pthread --with-readline-dir=/usr/local
make
sudo tcsh
make install
make install-doc
exit
cd ..

This has the advantage of keeping my root environment clean and leaving root’s shell as bash, which the other solutions I’ve seen for this sort of thing don’t manage. There’s a related issue of whether you should be able to sudo a shell, but that’s not the point of this article. This article is about making sure you have the right environment variables when you type ‘make install,’ basically.

I haven’t provided exact conversions of all the sets of commands because if you can’t figure the rest out, you might want to switch your account shell back to bash to avoid more trouble later. In particular, you will want to execute the ‘rehash’ shell command on occasion.

[beansidhe:~/ruby-1.8.6] zeitgeis% ruby -v
ruby 1.8.2 (2004-12-25) [universal-darwin8.0]
[beansidhe:~/ruby-1.8.6] zeitgeis% rehash
[beansidhe:~/ruby-1.8.6] zeitgeis% ruby -v
ruby 1.8.6 (2007-03-13 patchlevel 0) [i686-darwin8.9.1]

‘rehash’ causes the shell to rebuild its cached table of the executables on your path, which is handy when you’ve just installed new executables into a directory on that path.


Enterprise Content Management (ECM) Team Blog : Taxonomy/Tagging Starter Kit for SharePoint Server, also at the Sharepoint blog

Microsoft has made a kit available for Sharepoint that makes it easier to have taxonomy and tagging.  The tagging allows authors to tag items and to also have controlled vocabularies on particular multi-valued properties.  Users can incorporate the controlled vocabularies into searches and also search by tags. 

In the default configuration, users cannot tag items on the fly (although I suspect that they could change taxonomy values if they have permissions.)

I used to work (engineering) at an ECM company, so using the phrase ‘controlled vocabulary’ in place of taxonomy for this is somewhat second nature.  Since I took a lot of classification classes at the Information School, it’s interesting to see how companies implement these concepts.  It could be interesting if these features became widely available in Sharepoint.


Why is releasing code with TODOs and FIXMEs in it ‘The Ruby Way’?

Document modeling is important to any IR approach — the bag of words approach assumes word independence, and this is simple, but inappropriate to natural language. There have been a bunch of approaches to this sort of thing in the past, but here’s a relatively new one that does well versus various TREC collections.

Here’s a link to the paper: LDA-based Document Models for Ad-hoc Retrieval.

The presentation was largely a crawl of the paper section by section, and I’m going to emulate that approach by just referring to the paper so you can have that experience.

However: it beats previous models because it maps { document vs. topic } for all topics and documents, as opposed to the cluster approaches, for example, which largely assume that all documents belong to one cluster, or in many practical approaches, belong to whichever cluster they match best. Because documents belong to n topics with probability p(d[i], n), this is better than searching against bag of words models.
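
As a toy illustration of the document-by-topic idea (and only that), here’s a sketch in Ruby. The two topics, the three words and every probability in it are invented, and it leaves out the smoothing and estimation details that the paper covers. It just shows a word’s probability in a document being summed over the topics the document belongs to.

```ruby
# Hypothetical toy model: P(word | topic) for two topics and three words.
WORD_GIVEN_TOPIC = {
  "flu"    => { "virus" => 0.7, "vaccine" => 0.3, "market" => 0.0 },
  "stocks" => { "virus" => 0.0, "vaccine" => 0.1, "market" => 0.9 }
}

# P(word | doc) = sum over topics of P(word | topic) * P(topic | doc)
def word_probability(word, topic_given_doc)
  topic_given_doc.sum do |topic, weight|
    WORD_GIVEN_TOPIC.fetch(topic).fetch(word, 0.0) * weight
  end
end

# Scores a query against a document as the product of its word probabilities.
def query_likelihood(query_words, topic_given_doc)
  query_words.reduce(1.0) { |acc, w| acc * word_probability(w, topic_given_doc) }
end
```

A document that is mostly about flu but partly about stocks still gives ‘market’ a non-zero probability, which is the thing a one-cluster-per-document model can’t do.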

All papers in this section are pretty oriented towards the idea that searching against autogenerated topics is better than word-based searching. See the papers in question for the differentiators, as a lot of it is math that I’m not going to break out the LaTeX for on the fly. I will also note that most presentations in this area are pretty high on the UMLS fetishism.


For the last couple of days, I’ve been working on a couple of interesting things, of which the most important has been getting my resumes together, and the most interesting of which has been getting Ruby on Rails installed on my Macintosh.

Rails is interesting, because it handles the idea of dispatching incoming web requests in a way of which I greatly approve. I spent a lot of time in the late 90s trying to convince people that this sort of URL syntax was the right way to do things, and now through rails I’m feeling vindicatedish. Actually, I hadn’t really thought about it until I was talking to an ex-coworker who remembered me talking a lot about the Object-Action syntax being a good start on the Model-View-Controller. At the time, a lot of pages named stuff like performSpecificActionOnSpecificObject.extension were more common. Anyway, this sort of thing lends itself to cleaner code design in a number of ways (including patterns, etc…) so it’s the sort of thing that generally ought to be encouraged.

I’m doing a project in Ruby at the moment mostly to get my hand back into programming. Because of the sorts of work/school I’ve been doing for the last while, I’m much much better at design and analysis than I’ve ever been, but my hacking skills are a little weaker than I recall. Learning a new language in a different paradigm is very useful, especially something that’s rich in hacky syntax/generator crap like Ruby seems to be at the top level.

More specifically, I’m taking on a social networking light application. I’ve always wanted to do a social networking application, and during the job search seems like a good time.

So, for those of you who don’t know, I’ve been working part-time at a local company to help pay my way through grad school. That’s actually a simplification of the actual truth, as I’m a part-owner of the company and I also am mostly getting benefits more than cash, but for now I’m the main system administrator on one of the main systems they run.

For the last bit, I’ve been tracking down problems in the spam checking software that we use, and it’s been a merry time. Most of the problems have been getting everything on the server to be in a single known compatible state, which is a concept I greatly commend to you if you’re running a server and don’t want to spend lots of time messing with it.

Today’s project was figuring out the source of and eliminating a bunch of error messages that get mailed out to the administrators’ mailbox every night. They’re known harmless, but it’s just aggravating and it might hide other problems.

So, I was looking through the codebase, and I found this little gem:

SPAMD=`ps aux | awk --posix '{ if (($1 ~ /popuser/) && ($0 ~ /\/spamd[[:blank:]]/)) print $2; }' | wc -l | awk '{print $1}'`

You might ask yourself what that does. It’s pretty easy to figure out: it counts the processes whose user column matches ‘popuser’ and whose command line contains ‘/spamd ’, which is useful for figuring out whether or not spamassassin is running on your server. It’s part of 4psa server assistant. However, this may not work depending on how your server is configured. On my server, this never works because of how ps does its output.

My main point here is that that’s a crazy way to write that code. What the person is actually trying to do is make sure that they’re only getting the main spamassassin process and not any of the child processes. The child processes display as “spamd child”; the main spamassassin process displays as something like “/usr/bin/spamd -u popuser -d -m NUMBER -x --virtual-config-dir=/MAIL/DIR/FOR/YOUR/SERVER/%d/%l --socketpath=/tmp/spamd_full.sock”. So, they’ve got to distinguish between those two lines, and they’ve decided to check for random text in the first one, and written a fairly complex little shell script (calling awk twice!) to do so. They can’t check for just the word ‘popuser’ because it might appear in the path, in case you were wondering.

I replaced this with the following line:

SPAMD=`ps ax | grep -v "grep\|spamd child" | grep -i "spamd " | wc -l | awk '{print $1}'`

This checks for all spamd processes, and just eliminates the ’spamd child’ processes first. Why this way? If you’re trying to choose between two things, and one of them changes from system to system, and one of them is fixed and simple, you probably should try to select the fixed one.

So, here I didn’t want the fixed ones, so I eliminated (’grep -v’) them. It saved me from having to try to pick the one I wanted. It’s generally as easy to select for elimination as it is to select for further processing in computer programs. This is also true in card tricks, incidentally. Just in case you want to do some card tricks.
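
The same select-by-elimination idea, outside the shell, might be sketched like this. It’s Ruby rather than the shell of the original script, the method name is made up, and it works on lines of ps-style output that you hand it rather than calling ps itself.

```ruby
# Counts main spamd processes in `ps ax`-style output by first discarding
# the lines we know we don't want, then matching whatever is left.
def count_main_spamd(ps_lines)
  ps_lines
    .reject { |line| line.include?("grep") || line.include?("spamd child") }
    .count { |line| line =~ /spamd /i }
end
```

The reject step names only the fixed, simple strings; nothing in it depends on what the main process’s command line happens to look like on a given server.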

The basic idea behind a lot of card tricks where you choose between two things is that the magician knows which one of the two things that (s)he wants you to have beforehand. So, the magician decides whether you’re selecting an item or selecting an item for elimination at the time you make the choice, to make sure that you get the right item.

Actually, it’s typically mostly used in really bad card tricks. The ‘decisive moments’ blog describes how a similar process to the magician’s force is used in many video games to keep the plot moving in a somewhat linear fashion transparently to the user, and why it fails.

I wonder how much the folks who do massive interactive games like 4orty2wo use this tactic, and whether they’ve found good ways to disguise that it’s happening.

A couple of months ago, I wrote some sample stemmers for a class I was taking in the iSchool. I’ve put some of them up on the website.

The stemmers on this site are the Porter and the Lovins stemmer, both implemented in PHP. The Porter stemmer was downloaded from one of the several sites on the internet that have the stemmer. The Lovins stemmer was converted to PHP from the Java version available at SourceForge. Copies of the source are available on request.

The stemmers are available here. There’s also a call into the php implementation of the soundex algo that I added to demonstrate some points at some point.

“hors d’oeuvres funroll-loops” is my new favorite meaningless expression. It sounds completely meaningless, but came up today with regards to a technical problem I’m working on. It’s just ludicrously fun to say, although probably not as funny as ‘boss’ or ‘grody.’

Also, on the subject of hilarity, check out these references to auderves in google.

Surprisingly, -funroll-loops isn’t the funniest sounding of the gcc options; -malign-double is. Just in case you’re having deus ex machina problems and need to specify your program has an evil twin.