Tuesday, December 1, 2009

Spike Day 1

(See New Broom for an explanation....)

11/30/09

Code

Went back to the POS extractor project.  Moved Element from element_extractor.rb to its own file.  Tried to run the rspec from TextMate. Got exception:
/Users/Tom/.gem/ruby/1.8/gems/rspec-1.2.9/lib/spec/runner/options.rb:282:in `files_to_load': File or directory not found: nt_extractor_spec.rb (RuntimeError)

Somehow the rspec file name is truncated.
Tried running a different rspec file.  Its name was truncated too – in both cases the first five characters are missing.
Tried running the element extractor test from the command line:
./element_extractor.rb:2:in `require': no such file to load -- log4r (LoadError)
Have to decide which issue to pursue. I want to be able to use TextMate, so ultimately that one needs to be resolved.  I also want to understand how to specify the load path for Ruby.
Googling the TextMate problem didn’t lead to a solution (nobody else seems to have this problem)....

Meta: Baseline Meeting with Dave H.

... but talking to Dave did – rubber-duckily!  I had opened the project in TextMate at the wrong level – once I opened it at the folder level, everything worked fine.

Also, he explained the command line problem: I was trying to run the rspec tests with the ruby command rather than the spec command.
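On the load path question from earlier, here is a minimal sketch, assuming a hypothetical lib folder under the current directory (not my actual project layout). $LOAD_PATH is the array of directories that require searches. On Ruby 1.8, installed gems such as log4r are only found after require 'rubygems', which is probably what the spec command was doing for me.

```ruby
# Hypothetical layout: project code lives in ./lib (an assumption for illustration).
lib_dir = File.expand_path('lib', Dir.pwd)

# $LOAD_PATH is the array of directories that `require` searches, in order.
# Putting lib_dir at the front lets files in it be required by bare name.
$LOAD_PATH.unshift(lib_dir) unless $LOAD_PATH.include?(lib_dir)

puts $LOAD_PATH.first
```

The same effect is available from the command line with ruby -I lib some_file.rb.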
Dave saw that I was not using RSpec mocks correctly (could have just used an Object.new) in creating my “character stream”, and the need for a ‘fake’ class emerged from the convo.  I suspect that I would eventually have been driven to this conclusion by duplication across tests, but part of the goal of this spike for me is getting to a “Ruby state of mind” where these things occur to me sooner.  As a beginner, I’m bombarded with lots of different issues while trying to write code, and only when some operations become “second nature” will I have enough mental cycles available to see things like this early.
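A minimal sketch of what such a fake might look like. The class name FakeCharStream and the methods next_char and eof? are hypothetical; the real character stream interface in the project may differ.

```ruby
# Hypothetical fake: a stand-in "character stream" backed by a plain string,
# so specs can share one simple object instead of setting up mocks each time.
class FakeCharStream
  def initialize(text)
    @chars = text.split(//)  # one array element per character
    @pos = 0
  end

  # Returns the next character, or nil once the stream is exhausted.
  def next_char
    char = @chars[@pos]
    @pos += 1 if char
    char
  end

  # True when there are no more characters to read.
  def eof?
    @pos >= @chars.length
  end
end

stream = FakeCharStream.new("ab")
first = stream.next_char
```

Each spec can then build its input with FakeCharStream.new("...") rather than repeating mock expectations.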

Efficient use of resources is a major goal, one I suspect should be part of the apprenticeship process.  The learning process is like meta-RGR (red-green-refactor):  I get something accomplished, but then I need to review how I got there and see how my own thought processes can be refactored, made DRYer and simpler.
The issue of whether to get into Rails or continue with Ruby seems to be at least temporarily resolved in favor of Ruby after the convo with Dave, both for technical reasons (no immediate solution for the Rails problems) and for reasons of interest (linguistic) which seem to trump the career-oriented argument for Rails.  I’m a starving artist!
Dave also turned me on to WordNet (http://wordnet.princeton.edu/).

Meta: Word for Mac destroys Universe

I was just about to enter something here (following section) [I started writing this blog in Word for Mac] when I made the mistake of using Cmd-W to close a Find dialog.  My MBP rebooted!
Lesson: if it hurts, don’t do it.
Fallout:  Firefox crashed, leaving its profile locked.  Have to clean that up to get back to my bookmarked pages.
How to deal with distractions/tangents like this without losing momentum? One answer would seem to be: find the simplest fix that can possibly work, but postpone the refactoring (i.e., preventing the problem). Put it on a card!
Finally got Firefox back after having to delete both .parentlock and places.sqlite from the profile folder.  No simpler fix – I needed all the bookmarks! (I had to use Safari to find this info – glad it’s there!)

Changing stories

BDD’ing the Pos Extractor has led me to some new stories. 
I need to identify the beginning and end of an entry in W1913.  This goes beyond element extraction with tags.  Entries begin with the (headword) tag, but the headword element is just the word name.  The only way to identify the end of an entry is to find the beginning of the next one.
Now I’m thinking that’s just what I need for analyzing constituents in text in general: a constituent extractor. It would have to operate on strings, not character streams, because it potentially needs to back up more than one character if it’s a matter of some kind of pattern match for the beginning of some other constituent.  In the case of a W1913 entry extractor, that would mean finding another headword start tag or EOF.
Can I use Ruby regexps for this? They would have to return the index of the beginning of the match. Or ... maybe it could be a Lispish thing of peeling off the head and leaving the tail....  The index isn’t the essence of the story: moving through the string is the key. As long as the regexp can return two matches, the constituent and the tipoff (e.g., start tag) of the next constituent, no index is needed.
Never mind! This is exactly what I need:
“Where match and =~ differ from each other chiefly is in what they return when there is a match: =~ returns the numerical index of the character in the string where the match started, whereas match returns an instance of the class MatchData....”
-- The Well-Grounded Rubyist by David Black, Section 11.2.2, p. 322

(This is a great book - the best programming language intro I've ever read!)
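A small sketch of the difference, and of the "peel off the head, leave the tail" idea. The sample text and the <hw> tag are assumptions standing in for real W1913 markup.

```ruby
# Hypothetical sample text; <hw> stands in for the W1913 headword tag.
text = "<hw>Aard</hw> an animal <hw>Abacus</hw> a counting frame"

# =~ returns the index where the match starts (or nil if there is none).
index = (text =~ /<hw>/)

# match returns a MatchData object (or nil if there is none).
m = text.match(/<hw>(.*?)<\/hw>/)
headword = m[1]        # the first capture group
rest = m.post_match    # everything after the matched element

# Peel the first entry off the string: look for the next start tag,
# beginning the search just past the first one.
next_start = text.index('<hw>', 1)
head = next_start ? text[0...next_start] : text
tail = next_start ? text[next_start..-1] : ""
```

Calling the same steps again on tail moves through the string without keeping a running index.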

Implish thought:

... as in “implementation” and “impish”.

Could operate on a char stream by extending buffers until a match or EOF is found.
Seems best to proceed by ignoring this, working with strings, and having faith in the process – i.e., that using hygienic practices will allow for easy incorporation of streams later.
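For the record, the buffering idea might look roughly like this. It is only a sketch: match_in_stream and its chunk_size parameter are hypothetical names, and StringIO stands in for a real stream.

```ruby
require 'stringio'

# Hypothetical helper: read from io in small chunks, extending a buffer
# until the pattern matches or the stream ends.
# Returns the MatchData on success, or nil if EOF is reached with no match.
def match_in_stream(io, pattern, chunk_size = 8)
  buffer = String.new
  loop do
    m = buffer.match(pattern)
    return m if m
    chunk = io.read(chunk_size)
    return nil if chunk.nil?  # EOF and still no match
    buffer << chunk
  end
end

io = StringIO.new("<hw>Aard</hw> an animal")
found = match_in_stream(io, /<\/hw>/)
```

One caveat: a pattern with greedy or variable-length parts could match before the whole constituent has been read, so this only suits patterns that match a fixed tipoff such as a tag.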

Time out to read TWGR on regexps and procs.
The goal here is for the PoS extractor to get a series of complete entries from its input, then extract PoS info from each entry. A constituent extractor would be used to get the entry.
This gets into the other objects I want: citation and source identifier.
A citation is an object that specifies a string within some persistent source entity (a file, a blob in a DB, etc.) but does not actually contain the string. It should be able to produce the string on demand.
A citation would hold a reference to a source identifier, an object that offers one or more ways to access the source entity:  URLs, filepaths, DB queries.  I’m not even going to try to imagine what “access” will turn out to mean in implementation terms.
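Purely to fill the reservoir, a sketch of how the two objects might relate. Everything here is an assumption: the class and method names are hypothetical, and "access" is reduced to reading a file by path.

```ruby
# Hypothetical source identifier: knows one way to reach its source entity.
# Here that is only a file path; URLs and DB queries are left out.
class SourceIdentifier
  attr_reader :filepath

  def initialize(filepath)
    @filepath = filepath
  end

  # Returns the full contents of the source entity.
  def read
    File.read(@filepath)
  end
end

# Hypothetical citation: records where a string lives (source, offset,
# length) without holding the string itself.
class Citation
  attr_reader :source, :offset, :length

  def initialize(source, offset, length)
    @source = source
    @offset = offset
    @length = length
  end

  # Produces the cited string on demand by going back to the source.
  def text
    @source.read[@offset, @length]
  end
end
```

A citation built as Citation.new(SourceIdentifier.new(path), 6, 5) would return the five characters starting at offset 6 of that file each time text is called.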
Is this BDUF again? Not if I BDD the constituent extractor without prejudice. If it doesn’t lead inevitably to a citation, so be it.

The reservoir

The above should have been a matter of thinking about possible designs without committing to anything. Here I’m following the concepts of one of my mentors, the actor, director and teacher Steven Ivcich.
He has a very nonstandard approach to preparing for performance.  It involves a lot of near-random improv and playing with the script.  His term for this is "filling up your reservoir".  The idea behind "classical" preparation and rehearsal for an actor is that you are locking down the character and all your moves before you ever perform.  Steven's approach is much more agile:  every performance involves new information - new audiences,  new energies, new twists on the interaction with other actors.  You develop the reservoir as a way of preparing to respond to unexpected developments.  In other words, no Big Design Up Front.
To bind the analogy:  the RGR cycle is the performance, and thinking about the domain and possible technologies before going into the cycle is filling the reservoir.

BDDing the Constituent Extractor

I’m thinking that the CE is initialized with two procs, one matching the beginning of a constituent and one matching the end, each returning an offset (maybe they just run a regexp, and maybe not).  It replies to messages containing a string and an offset with an array of strings – no, constituent objects, each of which has the offset of the beginning of the constituent and a length. If the CE finds the beginning of a constituent but not an end, the length will be nil or infinity.
This is bothering me. Already too implish? Pushing toward the constituent object being a citation?
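One way the idea could be sketched, without committing to it. The names Constituent, ConstituentExtractor and extract are hypothetical, the procs are assumed to take a string and an offset and return an offset or nil, and <hw> again stands in for the headword tag.

```ruby
# Hypothetical value object: where a constituent starts and how long it is.
Constituent = Struct.new(:offset, :length)

class ConstituentExtractor
  # begin_matcher and end_matcher are procs taking (string, offset) and
  # returning the offset of a match, or nil if nothing is found.
  def initialize(begin_matcher, end_matcher)
    @begin_matcher = begin_matcher
    @end_matcher = end_matcher
  end

  # Returns an array of Constituent objects found at or after offset.
  def extract(string, offset = 0)
    constituents = []
    while (start = @begin_matcher.call(string, offset))
      finish = @end_matcher.call(string, start + 1)
      # A nil length means a beginning was found but no end.
      length = finish ? finish - start : nil
      constituents << Constituent.new(start, length)
      break unless finish
      offset = finish
    end
    constituents
  end
end

# Assumed W1913-like usage: an entry starts at <hw> and ends where the next one starts.
find_hw = lambda { |s, from| s.index('<hw>', from) }
extractor = ConstituentExtractor.new(find_hw, find_hw)
entries = extractor.extract("<hw>A</hw> one <hw>B</hw> two")
```

Here the last entry comes back with a nil length because no further start tag follows it; treating EOF as an end would be a decision for the end matcher.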
