Sunday, December 20, 2009

Rails and (unrelated) Rants

Rails TDD Boot Camp last Monday thru Thursday at Obtiva was great. Good small group (Alan, Ashutosh and me - should have got their email addresses!), good instructor (Andy),  good coverage. I'm ready to create websites to do just about anything.

(Unrelated rant:) This is so true: Aliens, Elves, and the Politics of Utopia.

That's all. Have a nice holiday break, and don't wear a red shirt....

[Oops, posted this on the wrong blog at first, but it's okay, nobody reads that one....]

Sunday, December 6, 2009

Thanks to Andy...

...  I'm running on Rails at last.

I had problems getting Rails talking to a database - either SQLite or MySQL. I figured since I'll be at Ruby on Rails TDD Boot Camp next week, and Andy Maleh's teaching it, I could ask him for help. And help he did: we spent over two hours at the Red Eye Cafe (4164 N. Lincoln - free wifi, free parking!! and best of all, AC outlets by our table!!!) working through the combinations of MySQL version, mysql gem version and the 32/64-bit kernel option in Snow Leopard. Without going into the boring details, the combo that works is MySQL 5.1.41-osx10.5-x86_64 and mysql gem v2.7 installed with x86_64 support. The kernel default in Snow Leopard is set to 32 bits because I haven't upgraded VMware Fusion to support 64 (they want $40 - I'll try VirtualBox before I spend that!). Other combinations cause various errors in either rake db:create or rake db:migrate.

Thanks again, Andy!

Baby in the Bathwater

Yesterday's post started on a relatively optimistic note and ended less positively.  That was an overreaction, and it reflects a pattern that appeared earlier in the spike.  I test-drove some classes, decided they were not going to solve my problems, and abandoned them to go in a different direction.  But later I discovered that there was a good use for them within the new structure.

In the first case, I had been working through the char-by-char element extractor class. I eventually decided that the complexity issues I was having with it meant that it was a cul de sac of premature generalization, premature concern with implementation details and/or overengineering, and decided to drop it in favor of the W1913 element extractor aimed at a very specific story. In the course of test-driving the latter, I found that it could use the element extractor exactly as it was. 

Now it seems to me that the pattern repeated.  When Dave paired with me Friday, we at first tried to deal with some of the complexity by refactoring.  But then he suggested a much simpler approach to extracting entire dictionary entries by scanning an entire file with regexps. The result was beautifully simple and fast, and so at the end of that day's post, I declared the more complex code I had developed to be floorsweepings. 

But now I think that was premature - there's a baby in that bathwater (so much for metaphoric consistency with floorsweepings!). (I don't mean to imply that Dave was claiming his solution would work for my other stories - in fact he said it wouldn't.)

I'll approach this from another direction now - the distinction between semantics and pragmatics.

Something that annoyed me when I first saw the W1913 was the mix of visual and semantic markup.  Headword, PoS, sense and definition tags were intermixed with page break tags, <i>, <b>, etc. - purely visual markup.  (These days tags like <i> are deprecated in HTML in favor of things like <em>. The markup in W1913 was inserted in 1999.)  The bottom line for W1913 is that it supports two radically different purposes: the original intention was to be able to reproduce a book, but my interest is in extracting dictionary entries for other kinds of processing.  This is a pragmatic distinction based on differing purposes - closely related to the distinction we make in application development between models and the processes that use them (services, views, controllers, etc.).

Semantics is simpler than pragmatics because it's context-free.  Using regexps to extract W1913 entries makes perfect sense.  However, the structure inside an entry is more complex for several reasons. One is that page breaks and visual tags are noise from the current perspective.  But more important - there's context. Here's an example from the A-B file:

<p><hw>A*base"</hw> (&adot;*b&amacr;s"), <pos><i>v. t.</i></pos>[<pos><i>imp. & p. p.</i></pos> <u>Abased</u> (&adot;*b&amacr;st");<pos><i>p. pr. & vb. n.</i></pos> <u>Abasing</u>.] [F. <i>abaisser</i>, LL.<i>abassare</i>, <i>abbassare</i> ; <i>ad</i> + <i>bassare</i>, fr.<i>bassus</i> low. See <u>Base</u>, <pos><i>a.</i></pos>]<sn><b>1.</b></sn> <def>To lower or depress; to throw or cast down; as, to <i>abase</i> the eye.</def> [Archaic] <i>Bacon.</i></p>

There are four occurrences of the <pos> tag here. The first one follows the headword - it says "abase" is a transitive verb.  The next two refer to inflected forms:  "abased" is both an imperfect and past participle, while "abasing" is a progressive and a verbal nominalization.  (It's not clear why these should be present in the entry at all, seeing as  they are perfectly regular inflections.) The fourth occurrence is in an etymological section: the word "base" in the sense of "crude" or "vile" is an adjective.

What's relevant about this example from the perspective of extracting PoS information is that in the first and fourth occurrences, the form to which the PoS element applies precedes that element, whereas in the second and third cases, the form follows the PoS.  The markers that would help us distinguish these cases are not tags, but square brackets and possibly the word "See", or the language markers ("F." and "LL.").  The simple line separating data and metadata (text and tags) is blurred. (To add to the complexity, there are many entries where the headword has no PoS tag at all.)

Semantics is about independence of context and history - that's why it's simpler than pragmatics. Since pragmatic information is progressively added to and removed from context (scoped), I suspect it's easier ultimately to use a state machine to parse it. My first crude attempt at this needs a lot of rework, and certainly applying the regexp scan makes the problem a lot simpler - it removes one layer of structure, so I can concentrate on intra-entry parsing.
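To make that concrete, here's a minimal sketch of the state idea - hypothetical code, not the spike's - that tracks square-bracket depth while walking an entry, so each <pos> can be classified by the context it appears in:

    # Classify each <pos> element by whether it appears inside the square-
    # bracketed inflection/etymology sections or at the top level.
    def pos_with_context(entry)
      depth = 0
      entry.scan(/\[|\]|<pos>.*?<\/pos>/m).map do |token|
        case token
        when '[' then depth += 1; nil
        when ']' then depth -= 1; nil
        else [token, depth > 0 ? :bracketed : :top_level]
        end
      end.compact
    end

Run against the Abase entry above, the first <pos> comes back :top_level and the other three :bracketed - which is a good chunk of what the intra-entry parser needs to know.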

Spike Day 5 - Last Day

(See New Broom for an explanation....)

12/4/09

Priorities

This is the last day of the spike. If I’m going to proceed as if this were a normal project, there are several options: create the Entry Builder, create the Orchestrator, or finish the Entry Extractor.
The Entry Builder has to extract PoS and sense info from the entry string and construct a new object,  a lexicon entry optimized for the yet-to-be-specified parser. 

Actually, there’s more to it than just distinguishing senses.  There’s a morphological issue: the parser will be seeing inflected versions of words.  Some of them, like “walks”, “walking”, “walked” and “walker” can be mapped by general rules to their root words (“walk”) – these are the “regular” nouns and verbs, and typically the rules for encoding and decoding them would be in a separate component.  Others, like “children”, “knives” and “fish” (plural) are “irregular”, and these forms are always in the lexicon.
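To sketch what that separate rules component might look like (the suffix rules here are illustrative and deliberately naive - real morphology needs e-dropping, consonant doubling, "-ies" and so on):

    # Naive regular-inflection stripper - illustrative only.
    REGULAR_SUFFIXES = [
      [/ings?$/, ''],   # walking -> walk
      [/ed$/,    ''],   # walked  -> walk
      [/ers?$/,  ''],   # walker  -> walk
      [/s$/,     '']    # walks   -> walk
    ]

    def root_of(word, lexicon)
      return word if lexicon.include?(word)   # irregulars are in the lexicon as-is
      REGULAR_SUFFIXES.each do |pattern, replacement|
        candidate = word.sub(pattern, replacement)
        return candidate if candidate != word && lexicon.include?(candidate)
      end
      word
    end

    root_of('walking',  ['walk'])       # => "walk"
    root_of('children', ['children'])   # => "children"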

The Orchestrator is essentially for setting up the pipeline: wiring up the queues and transformations/filters/processors. (Think Mario Brothers – or am I still missing Spring too much?).
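For the flavor of that wiring, a toy sketch - the stages here are placeholders; the real ones would be the file reader and extractor classes, each in its own thread, talking through queues:

    require 'thread'

    EOS = '***EOS'   # end-of-source marker (see the refactoring note below)

    strings = Queue.new
    upcased = Queue.new

    producer = Thread.new do
      %w[abase abash abate].each { |word| strings << word }
      strings << EOS
    end

    transformer = Thread.new do
      while (item = strings.pop) != EOS
        upcased << item.upcase
      end
      upcased << EOS
    end

    [producer, transformer].each(&:join)
    until (out = upcased.pop) == EOS
      puts out
    end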

The Entry Extractor needs a few more rspec cases.

Okay – given that there are outstanding domain issues on the Entry Builder,  the Entry Extractor work is pretty straightforward,  and this spike is more about learning how to use Ruby constructs, I’ll proceed with the Orchestrator and wire up the pieces I’ve already got.

First, some refactoring. 

The EOS (end of source) string I started out using – “***EOS” – is problematic because the asterisks need to be escaped for it to be used as a regexp pattern. There’s no need to have more than one of these things, so it can be a constant and not have to be passed in to each processor class. I’ll set up a Pipeline module for all the classes to include.
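Something like this, I'm thinking - names are provisional:

    # Shared pipeline conventions: processors include this module instead of
    # having the EOS marker passed in to each constructor.
    module Pipeline
      EOS = '***EOS'
      EOS_PATTERN = /\*\*\*EOS/   # pre-escaped, so the asterisks can't bite

      def eos?(item)
        item == EOS
      end
    end

    class SomeProcessor   # hypothetical example
      include Pipeline

      def process(queue)
        until eos?(item = queue.pop)
          # ... transform item, push downstream ...
        end
      end
    end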

The pipeline subfolder I created was primarily for seeing how require paths work – and it still confuses me a bit.

Project Structure and Require

Now I want to move some classes around and see how Git and TextMate interact when I do that.
Created a processors subdir and moved all the processor classes into it, along with their rspecs.  (Should there be a separate parallel test hierarchy à la Maven?)

Jay Fields’ Thoughts: Ruby Project Tree does advocate that separation, so I’ll follow his lead.  I won’t take his suggestion to use what Micah’s guidelines on Ruby require call “require farms”, though.

I’d like some Montrachet with this Pinot Noir: there is such a thing as “require hell”!  I was sizzling in it – until Joe Banks (and Dave Chelimsky’s pre-release Rspec book PDF) bailed me out.  It took quite a while to clean it all up. In the process I realized that Rake is something I still need to understand.

Spike End: Lessons Learned

What I learned is what I still need to learn.

Interestingly ambiguous sentence – there’s a contradictory interpretation: “there’s a set of things that I still need to learn and I actually learned them (so I don’t still need to learn them, but then ....)”.  What I intended was: “there’s a set of things I still need to learn and I learned what that set is”.

The biggest thing I need to learn is to unlearn what my years in the waterfall business made second nature: implicit BDUF.  What’s wrong with BDUF is that it’s almost always the wrong design.  I had a graphic demonstration of this when Dave H. went over my project and found a simpler way to do just about everything with regexps and hpricot.  The faulty assumption – that BDUF habit - was that I needed to work with character streams.

(Actually, this character-by-character parsing happened in a class from the earlier phase – the Tag Extractor – which I essentially abandoned for the right reason when I started on the W1913-specific classes.)

If I can bring a whole dictionary file into memory,  Dave’s solution will do the whole trick. None of them is bigger than 10 megs, so it could work.

Tried it: sure it works! – and it handled the “chunking” of entries in the A-B file (~5 megs) in less than a second.
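The whole-file version is almost embarrassingly short - something like this (pattern and filename illustrative):

    # Read a dictionary file into memory and chunk it into entries: each
    # entry runs from one <hw> tag to just before the next (or to the end).
    source  = File.read('pgw050ab.txt')   # the A-B file, ~5 megs
    entries = source.scan(/<hw>.*?(?=<hw>|\z)/m)
    puts entries.size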

There were two occasionally conflicting goals for this experiment:  learn Ruby and craft a project.  I learned a ton of Ruby, and a lot about what I still need to learn about the craft.   The project code itself is not particularly usable for practical purposes –

Floorsweepings.

Thursday, December 3, 2009

Spike Day 4

(See New Broom for an explanation....)

12/3/09

Priorities

Feels like I’ve been yakking too much here and not coding enough, so just coding is a priority for today.

Let’s see how fast I can get a pipeline going without agonizing over the likelihood that there are better Ruby ways to do this.

I’d like to set up a git repo to keep a baseline of the extractor project, so I’ll timebox that. One hour – too much? I really want to use git, so one hour it is.

Maybe later I’ll talk about yesterday’s two interview questions (pairing, evaluating tech) and last night’s dream (grammar school).

Git ‘er did

Setting up the Git repo took 25 minutes from scratch. Admittedly I cheated, having experimented with Git before.  But today I started by googling and used this page: Git - SVN Crash Course, because I’m a longtime Subversion user.  Everything I needed to get started was there. The only thing it didn’t cover was .gitignore – and that was what I had had trouble with before. I wanted to ignore the logs folder. So I googled that and gitignore(5) did the trick.

The moral of this story is: to learn something new, see if you can start from where you are (pragmatic) rather than from some ideal image of where you want to be (semantic).

So I’ve tagged the baseline, and now I’ll start pipelining. But I want to remember to look at the code Dave recommended as an example of good Ruby style. It’s on github, so I’ll pull it when I have time.   I’ll put that on a card (thanks again, Milo).

Pipeline

Start with a story:

As a ... what? ... parser developer? ... I want to extract entries including PoS-by-sense from the public domain Merriam Webster 1913 dictionary, available online in a set of semi-marked-up text files, for eventual feed into a lexicon builder.

Decompose story into a set of tasks as follows:

Split into buffers.
Queue buffers in sequence.
Extract entries as strings.

Queue entry strings in sequence.
Build entry objects with PoS, headword and sense exposed.
Queue entries.


The struck-thru buffer and queue tasks (“Split into buffers” and the three “Queue ...” tasks – the strikethrough may not survive here) were premature. They should emerge from a process of organizing the primary tasks into objects.

Objects:

FileReader
EntryStringExtractor
EntryBuilder
Orchestrator

This isn’t completely satisfactory but it’s a start.

I should have done this the first day!
I could create CRC cards (does anybody still use them?). I haven’t done this in years but it’s worth trying. I’ll write them here and then on real cards:

FileReader

Responsibilities

Accepts a sequential list of files, a queue, an EOS marker string and a buffer size.
Opens files in sequence.
Reads files into buffers in sequence.
Puts buffers on queue.
Puts EOS marker on queue after last file is processed.

Collaborations

File
Queue
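A first cut matching that card might look like this - constructor argument order and method names are just my guesses at this point:

    class FileReader
      def initialize(paths, queue, eos_marker, buffer_size)
        @paths, @queue, @eos_marker, @buffer_size = paths, queue, eos_marker, buffer_size
      end

      # Read each file in sequence into fixed-size buffers, enqueue the
      # buffers, and enqueue the EOS marker after the last file is done.
      def run
        @paths.each do |path|
          File.open(path) do |file|
            while buffer = file.read(@buffer_size)
              @queue << buffer
            end
          end
        end
        @queue << @eos_marker
      end
    end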

EntryStringExtractor

Responsibilities

Accepts an input queue, an output queue, entry start and end patterns, and an EOS marker string.
Extracts entry strings from input buffers.
Puts entry strings on output queue.
Passes EOS marker to output queue.

Collaborations

Queue
ConstituentExtractor

EntryBuilder

Responsibilities

Accepts an input queue, an array of patterns for element extraction, an EOS marker and an output queue.
...

Wait a minute

This is starting to smell like BDUF. I don’t know where to draw the line, but the EntryBuilder definitely needs more thought – extracting multiple PoS-sense objects from one entry is not so simple. I’ll probably need to inject one or more extractors into the builder, and they’ll be more complex than the tag extractor, although they can use it.

This won’t be finished during the spike.

I’m realizing that it was unrealistic to expect to be able to come up with something anywhere near as elegant as the Bowling Game Kata – and that’s been in the back of my mind from the get-go. What I’m trying to do here is nowhere near as well-defined as a bowling game. 

There’s also the painful tension between working with a very specific source like W1913 and wanting to end up with something reusable. I think the best move is to bite the bullet, go with the specificity (pragmatic!) and have faith in the process:  when there’s another source to be plumbed, it should be possible to externalize the specificities thru refactoring. 

BDD : pragmatics :: BDUF : semantics.

So I’m definitely going to stop now, go with the FileReader and the EntryStringExtractor (which I hereby dub the W1913FileReader and W1913EntryStringExtractor) and see what things look like when I get to the point of needing an entry builder.

Dependency Dejection

Another learning bump – after months of intensive Springing, I have to get used to the absence/irrelevance of dependency injection in Ruby. I wanted to test the file reader by injecting a mock file object. Thanks to a chat with Craig D., I’m now using StringIO for that (also covered in TWGR).
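The trick, for the record (contents illustrative):

    require 'stringio'

    # StringIO quacks like a File, so the reader under test never knows the
    # difference - no mock objects needed.
    fake_file = StringIO.new("line one\nline two\n")
    fake_file.read(5)   # => "line "
    fake_file.gets      # => "one\n"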

Githubbed?

Created a repo on Github:
Now I have to push to it.  L8r.

Progress

Got the file reader working to match the story.

Got the entry extractor to the point of a non-trivial failing test (git rev 1ab471ac08be62b076d29cc5a7664f370f05b83b).  Now I have to plug in the element extractor I created previously.  But time is short – should I try to push the repo or finish the story?
Pushing the repo is for the blog readers – so far they are few and local, so I can show them the code locally. The push can wait.

Got deadlocks, have to continue tomorrow.

Git

I think the push worked. Here’s the public clone URL:
git://github.com/GHogChi/FloorSweepingsCraftsmanSpike.git

Oh yeah..

About the two interview questions and the dream:

Is pairing productive?

I hope I was clear about it. What I should have said is that, beyond the studies showing it ultimately saves time by catching problems early, it establishes a pattern of open information flow within the team.

How do you evaluate technologies?

What I hope I/should have said: after all the research/googling/book reading, it's good to do a test drive and give yourself a chance to get enthusiastic about it: as a developer, you'll have to live with it.

Dream

I dreamt that as an adult, I had somehow applied to an elementary school and been accepted. I wanted to get back to basics. I ended up having a conversation with some teachers (in the teachers' lounge?) about our graduate degrees.

Okay, it was weird - it was a dream, after all. But I think part of it is a reflection on the back-to-basics nature of this Craftsman Spike.

Wednesday, December 2, 2009

Spike Day 3

(See New Broom for an explanation....)

12/2/09

Priorities

Finish the extractor BDDing or start on pipelining?

Dave’s keeping the definition of the “Craftsman Spike” open is a blessing and a curse. It’s forcing me to prioritize across “meta” boundaries and confront the local existential question: why am I here?

Prioritization is part of the craft, so in wrestling with these issues, I am learning something important.

The story-level cycle (think about the story, run RGR cycles, repeat) requires a way to guarantee that the “thinking algorithm” will halt at a reasonable point, avoiding “analysis paralysis” and over-engineering. (In fact, the same issue comes up within the RGR cycle itself – cf. Red-Green-Refactor.)

Rereading that post by James Shore reminds me that I still need to work seriously on keeping the coding short in the RGR cycle. That first fiasco with the TagExtractor is a case in point. In fact, this is another example of the halting problem. Given that emitting only short bursts of code – baby steps – is a big priority, it points toward finishing the extractor story.

But another priority here is learning Ruby, which I’m interpreting as “getting some experience with all the constructs that are new to me as a Java/C# developer”. From that perspective, these extractor stories are like micro-spikes that don’t have to lead to production code, just proof of concept.
It occurs to me that I can use this log/blog/diary as part of the cycle in a more granular way as a check on concision.

Interrupt

In the spirit of making this more real-time (and probably more boring to folks reading it – hmm, should I be tweeting?), I'll mention that I just got an email from Andy M. about my Rails getting-started DB problems. He’s asking for my database.yml and Rails logs. (Embarrassing that I never looked at them....)  So hold on while I go look for those.

Interrupting the Interrupt

Todd just cranked up Pandora – his blues station – and I didn’t recognize Howlin’ Wolf.  That’s gigantically embarrassing, but I can’t let it stop my march toward craftsmanship.

Back to Priorities

If I want to finish the constituent extractor (CE), there are two stories I haven’t dealt with.  The easy one is extracting multiple constituents from one string. The hard one is figuring out how to deal with EOS, because in the current rspec context the end_matcher is the same as the start_matcher – i.e., the beginning of a W1913 entry is a headword element, and the only way to find the end of one is to find the next headword.

So I’ll start with the easy one and concentrate on concision.

It occurs to me that I could be providing zips of the current state of the code for anybody crazy enough to want to follow this stuff in detail.

Don’t see how to do it with Blog*Spot free hosting, although they will host images.

Stupidity Interrupt

How the hell did I ever decide that the initials for Red-Green-Refactor were “RGF”?  Sheesh! It’s RGR from now on. But I won’t go back and correct it.  (Stop laughing, Milo!)

[UPDATE:  I lied - I went back and fixed it. No point in confusing the reader.]

Back to CE

Going with the easy one first. Shore wants me to restrict myself to five lines of code per cycle. Let’s see if I can do that.  Should I prefactor? (Hmm, that word seems to have turned into something else since I heard Paul P. and Micah M. using it.)  I think it will be harmless to prevent duplication by making my headword matcher an instance variable and moving the definition into a before(:each), although Craig D. instilled the maxim “see the duplication before you remove it” in me.
(Okay, I’ll admit it – I really miss Eclipse’s (and Visual Studio/ReSharper’s) instant highlighting of errors.)

Another prefactoring – extract a validation method for the results.

Leaning over backward to keep it under five lines: wrote this very long one:
results = ConstituentExtractor.new(@headword_matcher, @headword_matcher).extract('AxByC')
(If it works, I’ll go back and verify that Ruby won’t let me break it before or after the ‘.’.)
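For what it's worth, Ruby does allow the break, as long as the '.' stays at the end of the line (a leading '.' on the continuation line only became legal later, in 1.9):

    results = ConstituentExtractor.new(@headword_matcher, @headword_matcher).
      extract('AxByC')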

Oops – I needed the offset argument in the extract call.  Now it’s the “shade of red” I wanted, and the test is only four lines long. (Just checking the size of the results array for now – expected 2, got 1.)
Extracted the extraction code to a private method. The extract method calls it in a loop and increments the offset as long as there’s a match.

Ran the test – good news, bad news: the single extraction test is green, but the double extraction is still red.  I’m inferring that the problem is possibly an off-by-one in the offset handling in the loop.

Interrupt

Milo T. IMs me.  Interesting convo. Some of it will show up here. He’s reading this blog. Suggesting I use cards to put some of the technical overthinking on the stack to keep it from slowing me down – a good idea for production, but this Craftsman Spike is rehearsal – speed is not the issue here, and overthinking is part of the process.

But Milo’s right, kids – kards are kool!

Back to CE

Wait – there is no loop yet! No wonder it only found one constituent.
Set up a begin .. end until loop.  Oops – infinite – had to force-quit TextMate!

Meta: Agile Athletics

After the IM with Milo about metrics and micrometrics, I realize I’ve thought about Agile development as an athletic activity for a long time – it goes back to Kent Beck and the name XP. So the Craftsman Spike is like training camp – not necessarily just boot camp, but a refresher.

Back to CE

Corrected the offset calc, and all is green!
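In toy form (string and pattern illustrative), the shape that works - the key is that the body advances the offset on every pass, which is exactly what the infinite version didn't do:

    offset  = 0
    results = []
    begin
      match = /[A-Z]/.match('AxByC'[offset..-1])
      if match
        results << match[0]
        offset += match.end(0)   # advance past the match, or loop forever
      end
    end until match.nil?
    results   # => ["A", "B", "C"]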

Refactor:  deleted the debug console dumps.  Could probably compress the extraction into fewer lines, but I want to move on.

Now the hard one – dealing with the end of the source text (EOS).

Realized I could put the project into git – that would be the simplest way to track it through time. Still a little learning curve there – last time I tried git I had problems getting it to ignore some things.
Okay – trying EOS. I hope it’s as simple as an alternation (‘|’) in the end matcher.  Worked in rubular anyway.
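Concretely, the end matcher becomes something like this (patterns illustrative) - separate capture groups let the caller tell which alternative fired:

    end_matcher = /(<hw>)|(\*\*\*EOS)/
    m = end_matcher.match('...to abash the eye.</def>***EOS')
    m[1]   # => nil (no next headword)
    m[2]   # => "***EOS"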

Got to green on the EOS test, but boy does this need refactoring!  Passing in a couple of procs helped me learn about procs (certainly a spike goal), but the justification for it (duck typing: you could pass in anything as a matcher as long as it returns something that quacks a bit like MatchData, in case ordinary regexps aren’t enough) isn’t cutting it at the moment, mainly because so much of MatchData is being used now that EOS is supported. What’s worse: the extractor is making assumptions about what’s captured in the regexps.

So I’m going to pass in three regexp patterns:  constituent start, constituent end, and source end. It was another case of premature generalization – regexps are fine for my current story. I have to have faith in the process: keeping the code DRY and OO-clean will make it easy to generalize in future if necessary.
Let’s see if I can do this refactoring in baby steps without breaking tests.... With Java constructor overloading it might be easier, but too bad.

Okay, after a couple of false steps, it’s done. No more procs – but at least now I know how to use them.
Noticed something nice (it may happen with Java too, but I don’t remember seeing it, maybe because Java code is more verbose):  every time I refactor, the code gets more compact.  Certainly true of the rspec class.  In the extractor class, the private extractor method is 18 lines long, which seems like a lot, but the algorithm is kinda complex and I used intermediate variables to keep it a little clearer as to what’s happening. There’s one comment, and it regrets itself:

        # (smelly that this needs a comment) following line handles the normal and EOS cases:
        end_match_index = end_match.captures[0] ? end_match.begin(1) : end_match.begin(2)

Anyway, the exposed surface of MatchData is not all that transparent.

So – pipelining?

Pipelining again

I’ll start from the outside and maybe get into some Ruby file IO.

...

Just got sidetracked into looking at issues with IO and String classes and encoding.  Can’t expect to come up with the ultimate efficient solution now – just want to concentrate on reading in files, pipelining, extracting, etc.

Will get to it tomorrow.

Spike Day 2

(See New Broom for an explanation....)

12/1/09

Procs and Duck Typing

Picking up from yesterday: got the first rspec for the CE passing – no match returns empty array. That establishes the infrastructure of the class but with a minimal commitment to anything else. Now I have to decide exactly what should come back as the result of extraction.

How about:  start and end offsets and the match for the tipoff.  In the case of W1913, that would be the headword element: <hw>someword</hw>. In the case of the PPP (Post-Platonic Parser – TBE [To Be Explicated]), that would be a matter of matching a word like ‘the’ (the tipoff) and the noun phrase it introduces (the rest of the constituent).

Or should it include the entire substring that represents the constituent? To say no at this point would smack of premature optimization. So I’ll go for it.

Should it be an object, a hash or an array? I’m leaning toward an object. I’m not sure that’s the simplest thing that can possibly work, but it seems cleaner. I’d rather create an object I don’t need and eliminate it later than have to find all the places I’d have to change a hash or an array to an object. (I’m afraid I’m still thinking too much in terms of static typing rather than duck typing, and maybe I’m overly concerned about the lack of refactoring tools.)

Smelly

Got to green on it “should return the proper constituent on a match”, but it took too long and involved too much code.  I learned a good chunk about procs and MatchData, but I’m not happy with my BDD process or the resulting code.

Part of the problem seems to be premature generalization – trying to implement two stories at once. I should have stuck with the W1913 entry extraction. Especially when I’m working with code constructs I’m not familiar with.

And there’s something smelly about the Constituent class I created, beyond its being just a transfer object.  It exposes the tipoff string and the full constituent string, which is okay, but it also exposes the offset of the constituent in the string passed in.  That datum is irrelevant to the actual use of the object, and it ended up there for two bad reasons. The first is that I was thinking ahead to the problem of buffering the multiple files that make up W1913. The second is some interference from thinking about citations (see yesterday): I want a citation object to track the location of its target string within a source object. (I want to track offsets in the PPP as well.)

Does the fact that I’m wearing two hats at once (customer and developer) have something to do with it? Having a real dialog would help to clarify the issues (minimalist version of crowdsourcing – N heads better than one).  Similarly – I’m not pairing.

Wait – I’m beating myself up unnecessarily here (stop grinning, Milo!).  The offset is not completely illegitimate – it really is part of the story - maybe it just needs to be packaged better.  There are local/contingent offsets within the strings (buffers) that are passed into the extractor, and there are persistent and meaningful offsets in the sources (files).

And let’s face the generalization issue head-on.  There are lots of situations where we do this kind of thing: screen-scraping, data-mining, etc. No way this puppy can handle all of them. Generalization is way premature. If I end up with multiple classes that do very similar things, and they’re all good OO citizens,  generalizing should be a clean refactoring. So – back to the W1913 story.

Pipeline?

A new thought:  what if all these extractors could be pipelined? E.g.:
W1913 files | entry extractor | PoS extractor | lexipedia updater [TBE]
(I would have used “=>” there, but it would collide with Ruby notation, so I went for the *nix operator instead).
Threads communicating via queues - not only could it work, and fit into the bigger picture, it gives me a chance to work with threading and file IO in Ruby.

But first...

... a bit more work on the constituent extractor.
A thought about offsets: if I take the citation/source concept seriously, the current constituent extractor is working on essentially transient strings.  So the source object would have to reflect that, and the citation object would just be an offset and a reference to the source. The constituent extractor could easily inject citations into constituent constructors.  For pipelining, the thread that feeds the extractor could use the returned offsets in managing its buffers.

Tuesday, December 1, 2009

Spike Day 1

(See New Broom for an explanation....)

11/30/09

Code

Went back to the POS extractor project.  Moved Element from element_extractor.rb to its own file.  Tried to run the rspec from TextMate. Got exception:
/Users/Tom/.gem/ruby/1.8/gems/rspec-1.2.9/lib/spec/runner/options.rb:282:in `files_to_load': File or directory not found: nt_extractor_spec.rb (RuntimeError)

Somehow the rspec file name is truncated.
Tried running a different rspec file.  Name truncated – in both cases the first five characters are missing.
Tried running the element extractor test from the command line:
./element_extractor.rb:2:in `require': no such file to load -- log4r (LoadError)
Have to decide which issue to pursue. I want to be able to use TextMate, so ultimately that one needs to be resolved.  I also want to understand how to specify the load path for Ruby.
Googling the TextMate problem didn’t lead to a solution (nobody else seems to have this problem)....

Meta: Baseline Meeting with Dave H.

... but talking to Dave did – rubber-duckily!  I had opened the project in TextMate at the wrong level – once I opened it at the folder level, everything worked fine.

Also, he explained the command line problem: I was trying to run the rspec tests with ruby rather than spec.
Dave saw that I was not using Rspec mocks correctly (could have just used an Object.new) in creating my “character stream”, and the need for a ‘fake’ class emerged from the convo.  I suspect that I would eventually have been driven to this conclusion by duplication across tests, but part of the goal of this spike for me is getting to a “Ruby state of mind” where these things occur to me sooner.  As a beginner, I’m bombarded with lots of different issues while trying to write code, and only when some operations become “second nature” will I have enough mental cycles available to see things like this early.

Efficient use of resources is a major goal, one I suspect should be part of the apprenticeship process.  The learning process is like meta-RGR (red-green-refactor):  I get something accomplished, but then I need to review how I got there and see how my own thought processes can be refactored, made DRYer and simpler.
The issue of whether to get into Rails or continue with Ruby seems to be at least temporarily resolved in favor of Ruby after the convo with Dave, both for technical reasons (no immediate solution for the Rails problems) and for reasons of interest (linguistic) which seem to trump the career-oriented argument for Rails.  I’m a starving artist!
Dave also turned me on to WordNet (http://wordnet.princeton.edu/).

Meta: Word for Mac destroys Universe

I was just about to enter something here (following section) [I started writing this blog in Word for Mac] when I made the mistake of using Cmd-W to close a Find dialog.  My MBP rebooted!
Lesson: if it hurts, don’t do it.
Fallout:  Firefox crashed, leaving its profile locked.  Have to clean that up to get back to my bookmarked pages.
How to deal with distractions/tangents like this without losing momentum? One answer would seem to be: find the simplest fix that can possibly work, but postpone the refactoring (i.e., preventing the problem). Put it on a card!
Finally got Firefox back after having to delete both .parentlock and places.sqlite from the profile folder.  No simpler fix – I needed all the bookmarks! (I had to use Safari to find this info – glad it’s there!)

Changing stories

BDD’ing the Pos Extractor has led me to some new stories. 
I need to identify the beginning and end of an entry in W1913.  This goes beyond element extraction with tags.  Entries begin with the <hw> (headword) tag, but the headword element is just the word name.  The only way to identify an entry is to find the beginning of the next one.
Now I’m thinking that’s just what I need for analyzing constituents in text in general: a constituent extractor. It would have to operate on strings, not character streams, because it potentially needs to back up more than one character if it’s a matter of some kind of pattern match for the beginning of some other constituent.  In the case of a W1913 entry extractor, that would mean finding another headword start tag or EOF.
Can I use Ruby regexps for this? They would have to return the index of the beginning of the match. Or ... maybe it could be a Lispish thing of peeling off the head and leaving the tail....  The index isn’t the essence of the story: moving through the string is the key. As long as the regexp can return two matches, the constituent and the tipoff (e.g., start tag) of the next constituent, no index is needed.
Never mind! This is exactly what I need:
“Where match and =~ differ from each other chiefly is in what they return when there is a match: =~ returns the numerical index of the character in the string where the match started, whereas match returns an instance of the class MatchData....”
-- The Well-Grounded Rubyist by David Black, Section 11.2.2, p. 322

(This is a great book - the best programming language intro I've ever read!)
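A two-line demo of the difference (strings illustrative):

    '<hw>A*base"</hw>' =~ /<hw>/                     # => 0 (match offset)
    /<hw>(.*?)<\/hw>/.match('<hw>A*base"</hw>')[1]   # => "A*base\""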

Implish thought:

... as in “implementation” and “impish”.

Could operate on a char stream by extending buffers until a match or EOF is found.
Seems best to proceed by ignoring this, working with strings, and having faith in the process – i.e., that using hygienic practices will allow for easy incorporation of streams later.

Time out to read TWGR on regexps and procs.
The goal here is for the PoS extractor to get a series of complete entries from its input, then extract PoS info from each entry. A constituent extractor would be used to get the entry.
This gets into the other objects I want: citation and source identifier.
Citation is an object that specifies a string within some persistent source entity: a file, a blob in a DB, etc., but does not actually contain the string. It should be able to produce the string on demand.
A citation would hold a reference to a source identifier, an object that offers one or more ways to access the source entity:  URLs, filepaths, DB queries.  I’m not even going to try to imagine what “access” will turn out to mean in implementation terms.
Is this BDUF again? Not if I BDD the constituent extractor without prejudice. If it doesn’t lead inevitably to a citation, so be it.

The reservoir

The above should have been a matter of thinking about possible designs without committing to anything. Here I’m following the concepts of one of my mentors, the actor, director and teacher Steven Ivcich.
He has a very nonstandard approach to preparing for performance.  It involves a lot of near-random improv and playing with the script.  His term for this is "filling up your reservoir".  The idea behind "classical" preparation and rehearsal for an actor is that you are locking down the character and all your moves before you ever perform.  Steven's approach is much more agile:  every performance involves new information - new audiences,  new energies, new twists on the interaction with other actors.  You develop the reservoir as a way of preparing to respond to unexpected developments.  In other words, no Big Design Up Front.
To bind the analogy:  the RGR cycle is the performance, and thinking about the domain and possible technologies before going into the cycle is filling the reservoir.

BDDing the Constituent Extractor

I’m thinking that the CE is initialized with two procs for matching the beginning and end of a constituent, each returning an offset (maybe they just run a regexp, and maybe not), and it replies to messages containing a string and an offset with an array of strings – no, constituent objects, each of which has the offset of the beginning of the constituent and a length. If the CE finds the beginning of a constituent but not an end, the length will be nil or infinity.
This is bothering me. Already too implish? Pushing toward the constituent object being a citation?

More Run-up to the Spike

(See New Broom for an explanation....)

11/29/09


Got the EE into reasonable shape.  Instead of proceeding on the PE, I’m going thru the Rails startup in Agile Web Development with Rails.

Set up the demo project. Need to make MySQL the DB.

Problems with database.yml:  can’t use mysql – rake db:create throws a not-very-informative exception:

Couldn't create database for {"timeout"=>5000, "username"=>"root", "adapter"=>"mysql", "database"=>"railsdb.myapp", "pool"=>5, "host"=>"127.0.0.1", "password"=>nil, "socket"=>"/tmp/mysql.sock"}, charset: utf8, collation: utf8_unicode_ci (if you set the charset manually, make sure you have a matching collation)

Googled it.  None of the proposed fixes work.

Tried creating the database manually.  Now rake db:migrate throws an exception:
uninitialized constant MysqlCompat::MysqlRes

Googled it. Again, none of the proposed fixes work.

Followed advice from my Ruby help-line (Craig D.):  use the Rails default DB sqlite3.

Now rake db:migrate succeeds, claiming it created the “users” table. After generating the scaffold, I try running the server from TextMate.  It complains:

RuntimeError: Please install the db/sqlite3 adapter: `gem install activerecord-db/sqlite3-adapter` (no such file to load — active_record/connection_adapters/db/sqlite3_adapter)

So I try sudo gem install activerecord-db/sqlite3-adapter which results in:

ERROR:  could not find gem activerecord-db/sqlite3-adapter locally or in a repository

I give up for the evening, but I’m thinking: if you’re going to throw a runtime exception that won’t be handled by code,  you should expect that it will end up in a logfile and some human will have to try to figure out what happened.  So why not have a mechanism that dumps not only a stacktrace but the context – like all variables in scope (locals, method arguments, instance variables, etc.) at each stack level?  Better to have way too much than way too little.

Run-up to the Spike

(See New Broom for an explanation....)

11/28/09

I’ve been writing Ruby using TextMate and Rspec for a couple of days now.

Yesterday I TDD’d a TagExtractor specifically for extracting PoS (Part of Speech) info from W1913 (the Project Gutenberg semi-marked-up Merriam-Webster dictionary from 1913). Strayed from the path – got into a very complicated parse method with multiple flags.

I decided to step back and approach the problem from the direction of “pure story” – i.e., let the story drive the test and let the test drive the implementation decisions. This resulted in BDDing of the
PosExtractor, which will take a character stream and return an array of triples: {:word, :sense, :PoS}.
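One element of that result array might look like this (values illustrative):

    { :word => 'abase', :sense => 1, :PoS => 'v. t.' }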

Now there’s a dialectic between the PE (PosExtractor) and the TE (TagExtractor) which is starting to drive the TE toward a simpler and hopefully more Ruby-like implementation, using Element objects that consume from a character stream.

Starting with the TE was the first mistake: a bottom-up approach locking into an implementation detail that should emerge from BDD. BDD done right prevents BDUF;  TDD can encourage it. Starting from the nouns and verbs of the story makes possible a progressive decomposition of the
story into objects and functions.

Required log4r to debug TE.

Created new class ElementExtractor (EE) from TE to eliminate the find_all_tags noise – BDD’d it.
Now the Element class, which was just a transfer object that accumulated content, has evolved under BDD pressure to detect its end tag and mark itself complete, which simplifies EE.

New Broom

I'm starting this blog to track my software apprenticeship - for an explanation of that term, see Apprenticeship Patterns: Guidance for the Aspiring Software Craftsman  by Dave Hoover and Adewale Oshineye.  In fact, buy the book, so you can understand why somebody who's been in the software development biz since computers ran on kerosene would want to be an apprentice, and whence the title of this blog.

I'm going to start off with some postings from my Ruby diary.  I'm at Obtiva this week, participating in a "Craftsman Spike".  Instead of trying to explain that, I'll let you read the posts.  And eventually I'll fill in more background.

I've used "TWGR" in some of the posts - that stands for The Well-Grounded Rubyist, the best programming language book I've ever read and currently my bible.