Wednesday, December 2, 2009

Spike Day 2

(See New Broom for an explanation....)


Procs and Duck Typing

Picking up from yesterday: got the first RSpec example for the constituent extractor (CE) passing – no match returns an empty array. That establishes the infrastructure of the class with minimal commitment to anything else. Now I have to decide exactly what should come back as the result of an extraction.

How about:  start and end offsets and the match for the tipoff.  In the case of W1913, that would be the headword element:  someword. In the case of the PPP (Post-Platonic Parser – TBE [To Be Explicated]), that would be a matter of matching a word like ‘the’ (the tipoff) and the noun phrase it introduces (the rest of the constituent).

Or should it include the entire substring that represents the constituent? To say no at this point would smack of premature optimization. So I’ll go for it.

Should it be an object, a hash or an array? I’m leaning toward an object. I’m not sure that’s the simplest thing that can possibly work, but it seems cleaner. I’d rather create an object I don’t need and eliminate it later than have to find all the places I’d have to change a hash or an array to an object. (I’m afraid I’m still thinking too much in terms of static typing rather than duck typing, and maybe I’m overly concerned about the lack of refactoring tools.)
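To make the object-versus-hash worry concrete, here is a minimal sketch using Ruby's built-in Struct. The field names (tipoff, text, start_offset, end_offset) are my own placeholders, not a settled design. The point is only that a Struct answers both to method calls and to hash-style symbol lookup, so moving from a hash to an object later may be less painful than feared.

```ruby
# Hypothetical field names, chosen for illustration only.
Constituent = Struct.new(:tipoff, :text, :start_offset, :end_offset)

as_hash   = { tipoff: "the", text: "the cat", start_offset: 4, end_offset: 11 }
as_object = Constituent.new("the", "the cat", 4, 11)

# A Struct responds to reader methods...
as_object.text           # => "the cat"

# ...and also to hash-style lookup by symbol, so callers written
# against the hash version keep working for plain reads.
as_object[:text]         # => "the cat"
as_hash[:text]           # => "the cat"
```

Only read access carries over this way; code that calls Hash-specific methods such as `fetch` or `merge` would still need changing.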


Got to green on it “should return the proper constituent on a match”, but it took too long and involved too much code.  I learned a good chunk about procs and MatchData, but I’m not happy with my BDD process or the resulting code.
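For the record, a stripped-down sketch of the kind of thing I was aiming for. This is not the code I actually wrote; the class name, the constructor arguments and the PPP-style pattern are all assumptions made for illustration. A regexp finds the whole constituent, and a proc pulls the tipoff out of the MatchData.

```ruby
# Hypothetical transfer object: tipoff string, full constituent, start offset.
Constituent = Struct.new(:tipoff, :text, :offset)

class ConstituentExtractor
  # pattern:       Regexp matching a whole constituent
  # tipoff_finder: proc taking a MatchData and returning the tipoff string
  def initialize(pattern, tipoff_finder)
    @pattern = pattern
    @tipoff_finder = tipoff_finder
  end

  # Returns an array of Constituents; empty when nothing matches.
  def extract(buffer)
    results = []
    position = 0
    while (match = @pattern.match(buffer, position))
      results << Constituent.new(@tipoff_finder.call(match), match[0], match.begin(0))
      # Resume after this match; step one character on an empty match
      # so the loop always terminates.
      position = match.end(0) == match.begin(0) ? match.end(0) + 1 : match.end(0)
    end
    results
  end
end

# Toy PPP-style usage: 'the' is the tipoff, the next word completes the constituent.
extractor = ConstituentExtractor.new(/\b(the)\s+\w+/, proc { |m| m[1] })
found = extractor.extract("I saw the cat and the dog.")
found.map(&:text)    # => ["the cat", "the dog"]
found.map(&:offset)  # => [6, 18]
extractor.extract("no articles")  # => []
```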

Part of the problem seems to be premature generalization – trying to implement two stories at once. I should have stuck with the W1913 entry extraction. Especially when I’m working with code constructs I’m not familiar with.

And there’s something smelly about the Constituent class I created, beyond its being just a transfer object.  It exposes the tipoff string and the full constituent string, which is okay, but it also exposes the offset of the constituent in the string passed in.  That datum is irrelevant to the actual use of the object, and it ended up there for two bad reasons. The first is that I was thinking ahead to the problem of buffering the multiple files that make up W1913. The second is some interference from thinking about citations (see yesterday): I want a citation object to track the location of its target string within a source object. (I want to track offsets in the PPP as well.)

Does the fact that I’m wearing two hats at once (customer and developer) have something to do with it? Having a real dialog would help to clarify the issues (minimalist version of crowdsourcing – N heads better than one).  Similarly – I’m not pairing.

Wait – I’m beating myself up unnecessarily here (stop grinning, Milo!).  The offset is not completely illegitimate – it really is part of the story – maybe it just needs to be packaged better.  There are local/contingent offsets within the strings (buffers) that are passed into the extractor, and there are persistent and meaningful offsets in the sources (files).

And let’s face the generalization issue head-on.  There are lots of situations where we do this kind of thing: screen-scraping, data-mining, etc. No way this puppy can handle all of them. Generalization is way premature. If I end up with multiple classes that do very similar things, and they’re all good OO citizens,  generalizing should be a clean refactoring. So – back to the W1913 story.


A new thought:  what if all these extractors could be pipelined? E.g.:
W1913 files | entry extractor | PoS extractor | lexipedia updater [TBE]
(I would have used “=>” there, but it would collide with Ruby notation, so I went for the *nix operator instead).
Threads communicating via queues – not only could it work and fit into the bigger picture, it would give me a chance to work with threading and file IO in Ruby.
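A rough sketch of how the threads-and-queues idea might look, using Ruby's standard Queue. Everything here is a stand-in: the stage helper, the DONE sentinel and the two toy stages (strip and upcase) are assumptions, not the real entry or PoS extractors. It also assumes one output per input, which a real extractor would not guarantee.

```ruby
require "thread" # only needed on older Rubies; Queue is core in current ones

DONE = :done # end-of-stream sentinel, an assumption of this sketch

# Starts a thread that pops items from input, applies the block,
# and pushes the result to output until it sees DONE.
def stage(input, output, &work)
  Thread.new do
    loop do
      item = input.pop
      break if item == DONE
      output << work.call(item)
    end
    output << DONE # pass the sentinel downstream
  end
end

# files | stand-in entry extractor | stand-in PoS extractor
def run_pipeline(texts)
  files, entries, tagged = Queue.new, Queue.new, Queue.new
  workers = [
    stage(files, entries) { |text| text.strip },
    stage(entries, tagged) { |entry| entry.upcase }
  ]
  texts.each { |t| files << t }
  files << DONE
  workers.each(&:join)

  results = []
  until (item = tagged.pop) == DONE
    results << item
  end
  results
end

run_pipeline(["  aardvark ", " abacus  "]) # => ["AARDVARK", "ABACUS"]
```

With one thread per stage, items come out in the order they went in. Running several threads on the same stage would give up that ordering.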

But first...

... a bit more work on the constituent extractor.
A thought about offsets: if I take the citation/source concept seriously, the current constituent extractor is working on essentially transient strings.  So the source object would have to reflect that, and the citation object would just be an offset and a reference to the source. The constituent extractor could easily inject citations into constituent constructors.  For pipelining, the thread that feeds the extractor could use the returned offsets in managing its buffers.
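Sketching that out, with the caveat that every name below (Source, Citation, extract_with_citations) is invented for illustration: the citation is just a source reference plus an offset, the extractor injects one into each constituent, and the feeder uses the last offset to decide how much of its buffer has been consumed.

```ruby
# Hypothetical shapes for the source/citation idea.
Source      = Struct.new(:name, :text)      # a transient buffer, not a file
Citation    = Struct.new(:source, :offset)  # where in the source the target starts
Constituent = Struct.new(:tipoff, :text, :citation)

# Scans the source text and injects a Citation into each Constituent.
# Assumes the pattern's first capture group is the tipoff.
def extract_with_citations(source, pattern)
  results = []
  source.text.scan(pattern) do
    m = Regexp.last_match
    results << Constituent.new(m[1], m[0], Citation.new(source, m.begin(0)))
  end
  results
end

buffer = Source.new("buffer-1", "see the cat and the dog th")
found  = extract_with_citations(buffer, /\b(the)\s+\w+/)
found.map { |c| c.citation.offset }  # => [4, 16]

# The feeding thread can use the last offset to trim its buffer,
# keeping any unconsumed tail for the next read.
last      = found.last
consumed  = last.citation.offset + last.text.length
remaining = buffer.text[consumed..-1]  # => " th"
```

These offsets are still local to the buffer. Turning them into persistent offsets in the W1913 files would mean the source also has to know where its text started in the file.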
