Yesterday's post started on a relatively optimistic note and ended less positively. That was an overreaction, and it reflects a pattern that appeared earlier in the spike. I test-drove some classes, decided they were not going to solve my problems, and abandoned them to go in a different direction. But later I discovered that there was a good use for them within the new structure.
In the first case, I had been working through the char-by-char element extractor class. I eventually decided that the complexity issues I was having with it meant it was a cul-de-sac of premature generalization, premature concern with implementation details, and/or overengineering, so I dropped it in favor of the W1913 element extractor aimed at a very specific story. In the course of test-driving the latter, I found that it could use the element extractor exactly as it was.
Now it seems to me that the pattern has repeated. When Dave paired with me Friday, we first tried to deal with some of the complexity by refactoring. But then he suggested a much simpler approach: extract entire dictionary entries by scanning the whole file with regexps. The result was beautifully simple and fast, and so at the end of that day's post I declared the more complex code I had developed to be floorsweepings.
But now I think that was premature - there's a baby in that bathwater (so much for metaphoric consistency with floorsweepings!). (I don't mean to imply that Dave was claiming his solution would work for my other stories - in fact he said it wouldn't.)
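For concreteness, the regexp scan amounts to something like this - a sketch from memory in Python, not Dave's actual code, and I'm assuming each entry is a single <p>...</p> paragraph that opens with an <hw> tag (the file name and encoding are made up):

    import re

    # Slurp the whole file and let a regexp pull out complete entries,
    # instead of walking the text char by char. Assumes every entry is
    # one <p>...</p> paragraph that opens with an <hw> (headword) tag.
    ENTRY_RE = re.compile(r'<p><hw>.*?</p>', re.DOTALL)

    def extract_entries(text):
        """Return every dictionary entry found in the raw file text."""
        return ENTRY_RE.findall(text)

    # Hypothetical file name; latin-1 so odd bytes can't raise.
    with open('w1913-a-b.txt', encoding='latin-1') as f:
        entries = extract_entries(f.read())
    print(len(entries), "entries")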
I'll approach this from another direction now - the distinction between semantics and pragmatics.
Something that annoyed me when I first saw the W1913 was the mix of visual and semantic markup. Headword, PoS, sense and definition tags were intermixed with page break tags, <i>, <b>, etc. - purely visual markup. (These days tags like <i> are deprecated in HTML in favor of semantic elements like <em>. The markup in W1913 was inserted in 1999.) The bottom line for W1913 is that it supports two radically different purposes: the original intention was to be able to reproduce a book, but my interest is in extracting dictionary entries for other kinds of processing. This is a pragmatic distinction based on differing purposes - closely related to the distinction we make in application development between models and the processes that use them (services, views, controllers, etc.).
Semantics is simpler than pragmatics because it's context-free. Using regexps to extract W1913 entries makes perfect sense. However, the structure inside an entry is more complex for several reasons. One is that page breaks and visual tags are noise from the current perspective. But more important - there's context. Here's an example from the A-B file:
<p><hw>A*base"</hw> (&adot;*bās"), <pos><i>v. t.</i></pos>[<pos><i>imp. & p. p.</i></pos> <u>Abased</u> (&adot;*bāst");<pos><i>p. pr. & vb. n.</i></pos> <u>Abasing</u>.] [F. <i>abaisser</i>, LL.<i>abassare</i>, <i>abbassare</i> ; <i>ad</i> + <i>bassare</i>, fr.<i>bassus</i> low. See <u>Base</u>, <pos><i>a.</i></pos>]<sn><b>1.</b></sn> <def>To lower or depress; to throw or cast down; as, to <i>abase</i> the eye.</def> [Archaic] <i>Bacon.</i></p>
There are four occurrences of the <pos> tag here. The first one follows the headword - it says "abase" is a transitive verb. The next two refer to inflected forms: "abased" is both the imperfect (past tense) and the past participle, while "abasing" is the progressive and the verbal nominalization. (It's not clear why these should be present in the entry at all, seeing as they are perfectly regular inflections.) The fourth occurrence is in an etymological section: the word "base" in the sense of "crude" or "vile" is an adjective.
What's relevant about this example from the perspective of extracting PoS information is that in the first and fourth occurrences, the form to which the PoS element applies precedes that element, whereas in the second and third cases, the form follows the PoS. The markers that would help us distinguish these cases are not tags, but square brackets and possibly the word "See", or the language markers ("F." and "LL."). The simple line separating data and metadata (text and tags) is blurred. (To add to the complexity, there are many entries where the headword has no PoS tag at all.)
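To make that concrete, here's a toy scan - my sketch, assuming brackets are balanced and ignoring lowercase "see" - that labels each <pos> occurrence by bracket depth and by whether a "See" cross-reference has appeared in the current bracketed span:

    import re

    # The disambiguating cues are square brackets and the word "See",
    # not tags. Toy version: assumes brackets are balanced.
    TOKEN_RE = re.compile(r'\[|\]|See\b|<pos>.*?</pos>', re.DOTALL)

    def label_pos(entry):
        depth, saw_see, labels = 0, False, []
        for tok in TOKEN_RE.findall(entry):
            if tok == '[':
                depth += 1
            elif tok == ']':
                depth, saw_see = depth - 1, False
            elif tok == 'See':
                saw_see = True
            elif depth == 0:
                labels.append(('headword', tok))    # form precedes the tag
            elif saw_see:
                labels.append(('etymology', tok))   # form precedes the tag
            else:
                labels.append(('inflection', tok))  # form follows the tag
        return labels

Run over the abase entry above, this labels the four occurrences headword, inflection, inflection, etymology - matching the analysis - but I wouldn't bet on those two cues surviving contact with the rest of the file.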
Semantics is about independence of context and history - that's why it's simpler than pragmatics. Since pragmatic information is progressively added to and removed from context (scoped), I suspect it will ultimately be easier to use a state machine to parse it. My first crude attempt at this needs a lot of rework, and certainly applying the regexp scan makes the problem a lot simpler - it removes one layer of structure, so I can concentrate on intra-entry parsing.
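A first skeleton of what I have in mind, generalizing the labeling above - the states and transition cues are hypothetical and surely incomplete:

    from enum import Enum, auto

    # Pragmatic context is pushed on entering a bracketed span and
    # popped on leaving it; the meaning of <pos> depends on the current
    # context. Assumes balanced brackets.
    class Ctx(Enum):
        ENTRY = auto()      # top level: <pos> modifies the headword
        BRACKET = auto()    # inside [...]: <pos> precedes an inflected form
        ETYMOLOGY = auto()  # after "See" inside [...]: <pos> follows a form

    def pos_with_context(tokens):
        stack = [Ctx.ENTRY]
        for tok in tokens:
            if tok == '[':
                stack.append(Ctx.BRACKET)
            elif tok == ']':
                stack.pop()
            elif tok == 'See' and stack[-1] is Ctx.BRACKET:
                stack[-1] = Ctx.ETYMOLOGY
            elif tok.startswith('<pos>'):
                yield stack[-1], tok

The stack is the point: context is added on "[" and removed on "]", which is exactly the scoping intuition above. Strictly that makes it a pushdown machine rather than a flat state machine, since the contexts nest.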