Feels like I’ve been yakking too much here and not coding enough, so just coding is a priority for today.
Let’s see how fast I can get a pipeline going without agonizing over the likelihood that there are better Ruby ways to do this.
I’d like to set up a git repo to keep a baseline of the extractor project, so I’ll timebox that. One hour - too much? I really want to use git, so one hour it is.
Maybe later I’ll talk about yesterday’s two interview questions (pairing, evaluating tech) and last night’s dream (grammar school).
Git ‘er did
Setting up the Git repo took 25 minutes from scratch. Admittedly I cheated, having experimented with Git before. But today I started by googling and landed on this page: Git - SVN Crash Course. I’m a longtime Subversion user, and everything I needed to get started was there. The only thing it didn’t cover was .gitignore – which was exactly where I’d had trouble before. I wanted to ignore the logs folder, so I googled that and gitignore(5) did the trick.
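For the record, the .gitignore part turned out to be a one-liner – an entry like this at the top level of the repo ignores the folder and everything in it:

```
# .gitignore
logs/
```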
The moral of this story is: to learn something new, see if you can start from where you are (pragmatic) rather than from some ideal image of where you want to be (semantic).
So I’ve tagged the baseline, and now I’ll start pipelining. But I want to remember to look at the code Dave recommended as an example of good Ruby style. It’s on github, so I’ll pull it when I have time. I’ll put that on a card (thanks again, Milo).
Start with a story:
As a ... what? ... parser developer? ... I want to extract entries, including PoS-by-sense, from the public domain Merriam Webster 1913 dictionary, available online in a set of semi-marked-up text files, for eventual feed into a lexicon builder.
Decompose story into a set of tasks as follows:
- Extract entries as strings.
- Build entry objects with PoS, headword and sense exposed.
The struck-thru buffer and queue tasks were premature. They should emerge from a process of organizing the primary tasks into objects.
This isn’t completely satisfactory but it’s a start.
I should have done this the first day!
I could create CRC cards (does anybody still use them?). I haven’t done this in years but it’s worth trying. I’ll write them here and then on real cards:
FileReader
- Accepts a sequential list of files, a queue, an EOS marker string and a buffer size.
- Opens files in sequence.
- Reads files into buffers in sequence.
- Puts buffers on the queue.
- Puts the EOS marker on the queue after the last file is processed.
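Sketching how that card might translate into Ruby (class name and parameter order are my guesses from the responsibilities above, not the actual spike code):

```ruby
# Hypothetical sketch of the FileReader card: read each file onto the
# queue in buffer_size chunks, then signal end-of-stream.
class FileReader
  def initialize(files, queue, eos_marker, buffer_size)
    @files = files
    @queue = queue
    @eos_marker = eos_marker
    @buffer_size = buffer_size
  end

  def run
    @files.each do |path|
      File.open(path) do |f|
        while (buffer = f.read(@buffer_size))
          @queue << buffer
        end
      end
    end
    @queue << @eos_marker # tell consumers nothing more is coming
  end
end
```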
EntryStringExtractor
- Accepts an input queue, an output queue, entry start and end patterns, and an EOS marker string.
- Extracts entry strings from input buffers.
- Puts entry strings on the output queue.
- Passes the EOS marker through to the output queue.
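And a rough cut at the extractor card. The tricky bit is an entry spanning two buffers, so it accumulates input until a complete start..end span appears. Names are illustrative, and I’m assuming the patterns arrive as regex-source strings:

```ruby
# Hypothetical sketch of the EntryStringExtractor card.
class EntryStringExtractor
  def initialize(in_queue, out_queue, start_pattern, end_pattern, eos_marker)
    @in_queue = in_queue
    @out_queue = out_queue
    @entry_re = /#{start_pattern}.*?#{end_pattern}/m
    @eos_marker = eos_marker
  end

  def run
    working = ''
    while (buffer = @in_queue.pop) != @eos_marker
      working << buffer
      # Emit every complete entry accumulated so far; keep the tail,
      # which may be the first half of an entry split across buffers.
      while (m = working.match(@entry_re))
        @out_queue << m[0]
        working = m.post_match
      end
    end
    @out_queue << @eos_marker
  end
end
```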
EntryBuilder
- Accepts an input queue, an array of patterns for element extraction, an EOS marker and an output queue.
Wait a minute
This is starting to smell like BDUF. I don’t know where to draw the line, but the EntryBuilder definitely needs more thought – extracting multiple PoS-sense objects from one entry is not so simple. I’ll probably need to inject one or more extractors into the builder, and they’ll be more complex than the tag extractor, although they can use it.
This won’t be finished during the spike.
I’m realizing that it was unrealistic to expect to be able to come up with something anywhere near as elegant as the Bowling Game Kata – and that’s been in the back of my mind from the get-go. What I’m trying to do here is nowhere near as well-defined as a bowling game.
There’s also the painful tension between working with a very specific source like W1913 and wanting to end up with something reusable. I think the best move is to bite the bullet, go with the specificity (pragmatic!) and have faith in the process: when there’s another source to be plumbed, it should be possible to externalize the specificities thru refactoring.
BDD : pragmatics :: BDUF : semantics.
So I’m definitely going to stop now, go with the FileReader and the EntryStringExtractor (which I hereby dub the W1913FileReader and W1913EntryStringExtractor) and see what things look like when I get to the point of needing an entry builder.
Another learning bump – after months of intensive Springing, I have to get used to the absence/irrelevance of dependency injection in Ruby. I wanted to test the file reader by injecting a mock file object. Thanks to a chat with Craig D., I’m now using StringIO for that (also covered in TWGR).
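For anyone following along, the StringIO move looks roughly like this: StringIO quacks like an open File, so anything written against the IO interface can be fed a plain string in tests. (The chunk reader here is a made-up stand-in, not the spike’s actual class.)

```ruby
require 'stringio'

# A duck-typed reader: anything responding to #read(n) will do,
# whether it's a real File or a StringIO.
def read_in_chunks(io, chunk_size)
  chunks = []
  while (buffer = io.read(chunk_size))
    chunks << buffer
  end
  chunks
end

fake_file = StringIO.new('no real file needed')
read_in_chunks(fake_file, 8) # => ["no real ", "file nee", "ded"]
```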
Created a repo on Github:
Now I have to push to it. L8r.
Got the file reader working to match the story.
Got the entry extractor to the point of a non-trivial failing test (git rev 1ab471ac08be62b076d29cc5a7664f370f05b83b). Now I have to plug in the element extractor I created previously. But time is short – should I try to push the repo or finish the story?
Pushing the repo is for the blog readers – so far they are few and local, so I can show them the code locally. The push can wait.
Got deadlocks, have to continue tomorrow.
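(For the record, the classic way to deadlock with Ruby’s Queue is a consumer blocking on pop after the producer has finished without signalling – the EOS-marker convention exists precisely to avoid that. A toy illustration of the happy path, with no claim that this is where my actual deadlock lives:)

```ruby
q = Queue.new # thread-safe; #pop blocks until something arrives

producer = Thread.new do
  3.times { |i| q << "buffer #{i}" }
  q << :EOS # forget this line and the consumer blocks forever
end

consumer = Thread.new do
  results = []
  while (item = q.pop) != :EOS
    results << item
  end
  results
end

producer.join
consumer.value # => ["buffer 0", "buffer 1", "buffer 2"]
```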
I think the push worked. Here’s the public clone URL:
About the two interview questions and the dream:
Is pairing productive?
I hope I was clear about it. What I should have said is that, quite apart from the studies showing it ultimately saves time by catching problems early, pairing establishes a pattern of open information flow within the team.
How do you evaluate technologies?
What I hope I said / should have said: after all the research, googling and book reading, it's good to take a test drive and give yourself a chance to get enthusiastic about the technology: as a developer, you'll have to live with it.
I dreamt that as an adult, I had somehow applied to an elementary school and been accepted. I wanted to get back to basics. I ended up having a conversation with some teachers (in the teachers' lounge?) about our graduate degrees.
Okay, it was weird - it was a dream, after all. But I think part of it is a reflection on the back-to-basics nature of this Craftsman Spike.