Problem with abbreviations and sentence detection (se_tpb_xmldetection)

Does anyone know how to keep se_tpb_xmldetection from breaking sentences when it sees abbreviations? So, for example, the following:

<docauthor>Dr. Laird W. Bergad</docauthor>

Gets sentence-detected as:

<docauthor><sent>Dr. </sent><sent>Laird W. </sent><sent>Bergad</sent></docauthor>

I have included "Dr." as an abbreviation with @mayEndSentence="no" in my custom language file like so:

<key>
  <name mayEndSentence="no">Dr.</name>
  <expansion>Doctor</expansion>
</key>

But this doesn't do anything. If anybody has gotten this working, I'd appreciate any tips about it.

James

One idea: have you checked that the language of the DTBook matches the language of your custom configuration file?
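
For reference, the document language is normally declared with xml:lang on the dtbook root element. A minimal sketch, assuming a DTBook 2005-3 file in English (the "en" value is only an illustration, and exactly how the detector matches it against the custom file is not shown in this thread):

```xml
<!-- Root element of the DTBook input; xml:lang declares the document language. -->
<!-- "en" is an assumed value for illustration. -->
<dtbook xmlns="http://www.daisy.org/z3986/2005/dtbook/" version="2005-3" xml:lang="en">
  <!-- head and book content omitted -->
</dtbook>
```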

I'll try to reproduce locally, but as far as I can see your configuration looks OK.

Romain.

I tried it locally and couldn't reproduce. It got sentence-detected as:

  <docauthor>
    <sent id="dtb2" smilref="speechgen0001.smil#tcp2"><abbr title="Doctor">Dr.</abbr> Laird W. </sent>
    <sent id="dtb3" smilref="speechgen0001.smil#tcp3">Bergad</sent>
  </docauthor>

which is still not ideal, but would be fixed by declaring "W." as another abbreviation...
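
Something along these lines might do it: a sketch modeled on the "Dr." entry quoted earlier in this thread. The expansion value is a placeholder assumption, not taken from any shipped language file, and the entry is assumed to sit inside the same abbreviation section as the other keys:

```xml
<!-- Hypothetical entry modeled on the "Dr." key from this thread. -->
<!-- The expansion is a placeholder; use whatever should be spoken for the initial. -->
<key>
  <name mayEndSentence="no">W.</name>
  <expansion>W</expansion>
</key>
```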

Romain.

I'm still having problems with that. My input looks like this:
<docauthor>Dr. Laird W. Bergad</docauthor>

My script has one task that looks like this:
<task name="se_tpb_xmldetection" interactive="false">
  <parameter>
    <name>input</name>
    <value>$parent{inputPath}/sentDetect.xml</value>
  </parameter>
  <parameter>
    <name>output</name>
    <value>$parent{inputPath}/abbrDetect.xml</value>
  </parameter>
  <parameter>
    <name>customLang</name>
    <value>${configpath}/abbr-RFBD.xml</value>
  </parameter>
  <parameter>
    <name>doOverride</name>
    <value>true</value>
  </parameter>
  <parameter>
    <name>doSentenceDetection</name>
    <value>true</value>
  </parameter>
  <parameter>
    <name>doWordDetection</name>
    <value>false</value>
  </parameter>
  <parameter>
    <name>copyReferredFiles</name>
    <value>true</value>
  </parameter>
</task>

My language file has this (among other stuff):
<abbreviation before=".*[\s(]|^" after="([,\.\s:;?!)].*)|$">
  <key>
    <name mayEndSentence="no">Dr.</name>
    <expansion>Doctor</expansion>
  </key>
</abbreviation>

But I still get this output:
<docauthor><sent><abbr title="Doctor">Dr.</abbr> </sent><sent>Laird W. </sent><sent>Bergad</sent></docauthor>

Thanks for the help,

James

This problem occurred only when the abbreviation was declared in an external global configuration file, not in a language configuration file. It is now fixed in revision 2692 of the source code.

Romain.