12/4/09
Priorities
This is the last day of the spike. If I’m going to proceed as if this were a normal project, there are several options:  create the Entry Builder and the Orchestrator, and finish the Entry Extractor.  
The Entry Builder has to extract PoS and sense info from the entry string and construct a new object,  a lexicon entry optimized for the yet-to-be-specified parser. 
Actually, there’s more to it than just distinguishing senses.  There’s a morphological issue: the parser will be seeing inflected versions of words.  Some of them, like “walks”, “walking”, “walked” and “walker” can be mapped by general rules to their root words (“walk”) – these are the “regular” nouns and verbs, and typically the rules for encoding and decoding them would be in a separate component.  Others, like “children”, “knives” and “fish” (plural) are “irregular”, and these forms are always in the lexicon. 
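The split between rule-driven and lexicon-driven forms can be sketched in a few lines of Ruby. This is only an illustration under assumptions: `IRREGULAR_FORMS`, `REGULAR_SUFFIXES` and `root_for` are hypothetical names, the suffix list is deliberately tiny, and a real morphology component would also need spelling rules (doubling, e-deletion and so on).

```ruby
# Hypothetical sketch: none of these names come from the project itself.

# Irregular forms live in the lexicon and are looked up directly.
IRREGULAR_FORMS = {
  "children" => "child",
  "knives"   => "knife",
  "fish"     => "fish"   # plural identical to the singular
}.freeze

# Regular suffixes are stripped by rule, longest first.
REGULAR_SUFFIXES = %w[ing ed er s].freeze

# Returns the root word for a surface form, or nil if none is found.
# `lexicon` is any collection of known root words that responds to include?.
def root_for(word, lexicon)
  return IRREGULAR_FORMS[word] if IRREGULAR_FORMS.key?(word)
  return word if lexicon.include?(word)

  REGULAR_SUFFIXES.each do |suffix|
    next unless word.end_with?(suffix)
    stem = word[0...-suffix.length]
    return stem if lexicon.include?(stem)
  end
  nil
end

lexicon = %w[walk child knife]
root_for("walking", lexicon)   # regular: resolved by suffix rule
root_for("children", lexicon)  # irregular: resolved by table lookup
```

Checking the irregular table first keeps a form like “knives” from being mis-stemmed by the general “s” rule.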
The Orchestrator is essentially for setting up the pipeline: wiring up the queues and transformations/filters/processors. (Think Mario Brothers – or am I still missing Spring too much?).
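As a rough picture of that wiring, here is a minimal sketch using Ruby’s standard `Queue` and one thread per processor. The names `stage` and `run_pipeline` are hypothetical, and the real Orchestrator may be organized quite differently; the point is only how queues connect the stages and how the end-of-source marker travels down the pipe.

```ruby
require "thread"

# End-of-source marker, passed through every stage (assumed value).
EOS = "***EOS".freeze

# Starts a thread that pops items from `input`, transforms them with the
# given block, and pushes non-nil results onto `output`.
def stage(input, output, &transform)
  Thread.new do
    loop do
      item = input.pop
      if item == EOS
        output << EOS          # hand the marker downstream, then stop
        break
      end
      result = transform.call(item)
      output << result unless result.nil?
    end
  end
end

# Wires N transforms together with N + 1 queues and runs the source through.
def run_pipeline(source_items, *transforms)
  queues  = Array.new(transforms.size + 1) { Queue.new }
  threads = transforms.each_with_index.map do |transform, i|
    stage(queues[i], queues[i + 1]) { |item| transform.call(item) }
  end

  source_items.each { |item| queues.first << item }
  queues.first << EOS
  threads.each(&:join)

  results = []
  until (item = queues.last.pop) == EOS
    results << item
  end
  results
end

strip_it  = lambda { |s| s.strip }
upcase_it = lambda { |s| s.upcase }
run_pipeline([" walk ", " fish "], strip_it, upcase_it)
```

Because each stage has a single worker thread, items leave the pipeline in the order they entered it.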
The Entry Extractor needs a few more rspec cases.
Okay – given that there are outstanding domain issues on the Entry Builder, the Entry Extractor work is pretty straightforward, and this spike is more about learning how to use Ruby constructs, I’ll proceed with the Orchestrator and wire up the pieces I’ve already got.
First, some refactoring.
The EOS (end of source) string I started out using – “***EOS” – is problematic because the asterisks need to be escaped before it can be used as a regexp pattern. There’s no need to have more than one of these things, so it can be a constant and doesn’t have to be passed in to each processor class. I’ll set up a Pipeline module for all the classes to include.
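One way this refactoring might look, as a sketch: the marker becomes a constant in the Pipeline module, and `Regexp.escape` takes care of the asterisks so the marker text itself never has to change. The `EOS_PATTERN` constant and the `LineProcessor` class are hypothetical, added only to show the include.

```ruby
module Pipeline
  # Single shared end-of-source marker.
  EOS = "***EOS".freeze

  # Regexp.escape turns each "*" into "\*", so the asterisks are matched
  # literally instead of being read as repetition operators.
  EOS_PATTERN = /\A#{Regexp.escape(EOS)}\z/
end

# Hypothetical processor class, shown only to demonstrate the include.
class LineProcessor
  include Pipeline

  # True when the line (ignoring its trailing newline) is the EOS marker.
  def eos?(line)
    !(line.chomp =~ EOS_PATTERN).nil?
  end
end

processor = LineProcessor.new
processor.eos?("***EOS\n")      # marker line
processor.eos?("some entry\n")  # ordinary line
```

Constants defined in an included module are visible inside the including class, which is why `EOS_PATTERN` needs no `Pipeline::` prefix there.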
The pipeline subfolder I created was primarily for seeing how require paths work – and it still confuses me a bit.
Project Structure and Require
Now I want to move some classes around and see how Git and TextMate interact when I do that. 
Created a processors subdir and moved all the processor classes into it, along with their rspecs.  (Should there be a separate parallel test hierarchy à la Maven?)
The post “Ruby Project Tree” on Jay Fields’ Thoughts does advocate that separation, so I’ll follow his lead there. I won’t adopt his other suggestion, though: the pattern that Micah’s guidelines on Ruby require call “require farms”.
I’d like some Montrachet with this Pinot Noir: there is such a thing as “require hell”! I was sizzling in it – until Joe Banks (and Dave Chelimsky’s pre-release RSpec book PDF) bailed me out. It took quite a while to clean it all up. In the process I realized that Rake is something I still need to understand.
Spike End: Lessons Learned
What I learned is what I still need to learn. 
Interestingly ambiguous sentence – there’s a contradictory interpretation: “there’s a set of things that I still need to learn and I actually learned them (so I don’t still need to learn them, but then ....)”. What I intended was: “there’s a set of things I still need to learn and I learned what that set is”.
The biggest thing I need to learn is to unlearn what my years in the waterfall business made second nature: implicit BDUF. What’s wrong with BDUF is that it’s almost always the wrong design. I had a graphic demonstration of this when Dave H. went over my project and found a simpler way to do just about everything with regexps and Hpricot. The faulty assumption – that BDUF habit – was that I needed to work with character streams.
(Actually, this character-by-character parsing happened in a class from the earlier phase – the Tag Extractor – which I essentially abandoned for the right reason when I started on the W1913-specific classes.)
If I can bring a whole dictionary file into memory, Dave’s solution will do the whole trick. None of them is bigger than 10 megs, so it could work.
Tried it: sure it works! – and it handled the “chunking” of entries in the A-B file (~5 megs) in less than a second.
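To give a feel for the whole-file approach, here is a minimal sketch that chunks an in-memory string into entries with a single regexp. It is not Dave’s code: it uses no Hpricot, it assumes (hypothetically) that every entry begins with an `<hw>` headword tag, and the sample text is made up rather than taken from the dictionary files.

```ruby
# Splits the full text of a dictionary file into one string per entry.
# Assumption: each entry starts at an "<hw>" tag; anything before the
# first tag (file header, licence text) is discarded.
def chunk_entries(text)
  text.split(/(?=<hw>)/)                          # cut just before each <hw>
      .select { |chunk| chunk.start_with?("<hw>") }
      .map { |chunk| chunk.strip }
end

# Made-up sample standing in for a real file's contents.
sample = "file header\n" \
         "<hw>Alpha</hw> n. First made-up definition.\n" \
         "<hw>Beta</hw> n. Second made-up definition.\n"

entries = chunk_entries(sample)
entries.length  # one element per <hw> tag in the sample
```

With a real file the call would be something like `chunk_entries(File.read(path))`, where `path` points at one of the dictionary files; the lookahead `(?=<hw>)` keeps the tag at the start of each chunk instead of consuming it.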
There were two occasionally conflicting goals for this experiment: learn Ruby and craft a project. I learned a ton of Ruby, and a lot about what I still need to learn about the craft. The project code itself is not particularly usable for practical purposes –
Floorsweepings.
 
 
 
 
