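Suppose the string \(s\) can be split in three ways: \(\{s_{12},s_3\}\), \(\{s_1,s_{23}\}\), and \(\{s_1,s_2,s_3\}\), where \(s_{12}\) is the concatenation of \(s_1\) and \(s_2\), and \(s_{23}\) is the concatenation of \(s_2\) and \(s_3\). The probabilities \(\{f,f_1,f_2,f_3,f_{12},f_{23}\}\) of the strings \(\{s,s_1,s_2,s_3,s_{12},s_{23}\}\) all exist and are greater than 0.
We first analyze the relationship between the following two quantities: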
\[\gamma=\frac {f}{3}(\frac {1}{f_1} + \frac {1}{f_2} +\frac {1}{f_3})\]
\[\gamma^*=\frac {f}{2}(\frac {1}{f_1} + \frac {1}{f_{23}})\]
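To make the two quantities concrete, here is a minimal Python sketch that evaluates them. The function names `gamma_three` and `gamma_two` and all probability values are hypothetical, chosen only for illustration; the values respect \(f \le f_1\) and \(f \le f_{23} \le \min(f_2, f_3)\).

```python
def gamma_three(f, f1, f2, f3):
    """gamma for the three-way split {s1, s2, s3}."""
    return f / 3 * (1 / f1 + 1 / f2 + 1 / f3)


def gamma_two(f, fa, fb):
    """gamma for a two-way split such as {s1, s23}."""
    return f / 2 * (1 / fa + 1 / fb)


# Made-up probabilities (assumption, not taken from real data).
f, f1, f2, f3, f23 = 0.001, 0.05, 0.02, 0.04, 0.004

g = gamma_three(f, f1, f2, f3)   # gamma
g_star = gamma_two(f, f1, f23)   # gamma*
print("gamma  =", g)
print("gamma* =", g_star)
```

With these particular numbers \(f_1 > f_{23}\), so the sign of \(\gamma^* - \gamma\) is not fixed in advance and has to be read off the computed values.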
Define the discriminant:
\[\begin{aligned} \Delta = \gamma^* - \gamma &= \frac {f}{2}(\frac {1}{f_1} + \frac {1}{f_{23}}) - \frac {f}{3}(\frac {1}{f_1} + \frac {1}{f_2} +\frac {1}{f_3}) \\ &= f [ \frac {1}{6} \cdot \frac {1}{f_1} + \frac {1}{2} \cdot \frac {1}{f_{23}} - \frac {1}{3}(\frac {1}{f_2} + \frac {1}{f_3})] \end{aligned}\]
Since \(\gamma_{23} = \frac {f_{23}}{2} (\frac {1}{f_2} + \frac {1}{f_3})\), we have \(\frac {1}{f_2} + \frac {1}{f_3} = 2 \cdot \frac {\gamma_{23}}{f_{23}}\).
Substituting this into \(\Delta\) gives:
\[\begin{aligned} \Delta &= f [ \frac {1}{6} \cdot \frac {1}{f_1} + \frac {1}{2} \cdot \frac {1}{f_{23}} - \frac {1}{3}(\frac {1}{f_2} + \frac {1}{f_3})] \\ &= f[\frac {1}{6} \cdot \frac {1}{f_1} + \frac {1}{2} \cdot \frac {1}{f_{23}} - \frac {1}{3} \cdot 2 \cdot \frac {\gamma_{23}}{f_{23}}] \\ &= f[\frac {1}{6} \cdot \frac {1}{f_1} + \frac {1}{f_{23}}(\frac {1}{2} - \frac {2}{3} \cdot \gamma_{23})] \end{aligned}\]
Since \(0 \le \gamma_{23} \le 1\) (which holds because \(f_{23} \le f_2\) and \(f_{23} \le f_3\)), it follows that \(-\frac {1}{6} \le \frac {1}{2} - \frac {2}{3} \cdot \gamma_{23} \le \frac {1}{2}\).
Substituting this inequality gives the range of the discriminant \(\Delta\):
\[\frac {f}{6}(\frac {1}{f_1} - \frac {1}{f_{23}}) \le \Delta \le \frac {f}{6} (\frac {1}{f_1} + 3 \cdot \frac {1}{f_{23}})\]
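The bounds on \(\Delta\) can be sanity-checked numerically. The sketch below draws random probabilities under the assumptions \(f_{23} \le \min(f_2, f_3)\) and \(f \le \min(f_1, f_{23})\) and counts how often \(\Delta\) falls outside the derived interval. The helper name `delta_and_bounds`, the sampling ranges, and the seed are arbitrary choices made for this illustration.

```python
import random


def delta_and_bounds(f, f1, f2, f3, f23):
    """Return (Delta, lower bound, upper bound) for one set of probabilities."""
    gamma = f / 3 * (1 / f1 + 1 / f2 + 1 / f3)
    gamma_star = f / 2 * (1 / f1 + 1 / f23)
    delta = gamma_star - gamma
    lower = f / 6 * (1 / f1 - 1 / f23)
    upper = f / 6 * (1 / f1 + 3 / f23)
    return delta, lower, upper


rng = random.Random(0)  # fixed seed so the run is repeatable
violations = 0
for _ in range(10_000):
    f1 = rng.uniform(0.01, 1.0)
    f2 = rng.uniform(0.01, 1.0)
    f3 = rng.uniform(0.01, 1.0)
    # A concatenation is assumed to be no more probable than its parts.
    f23 = rng.uniform(0.001, min(f2, f3))
    f = rng.uniform(0.0001, min(f1, f23))
    delta, lower, upper = delta_and_bounds(f, f1, f2, f3, f23)
    # Small tolerance to absorb floating-point rounding.
    if not (lower - 1e-12 <= delta <= upper + 1e-12):
        violations += 1

print("violations:", violations)
```

If the derivation holds, no sample should violate the bounds. A random check of this kind does not prove the inequality; it only helps catch algebra slips.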
Several conclusions follow from this inequality:
(1) If \(f_1 \le f_{23}\), then:
\(\Delta \ge \frac {f}{6}(\frac {1}{f_1}- \frac {1}{f_{23}}) \ge 0\), that is, \(\gamma^* \ge \gamma \).
Under this assumption, and because \(f \le f_1\) and \(f \le f_{23}\), we also have:
\[\Delta \le \frac {f}{6} (\frac {1}{f_1} + 3 \cdot \frac {1}{f_{23}}) \le \frac {1}{6}(1+3)=\frac {2}{3}\]
That is, the difference between \(\gamma^*\) and \(\gamma\) never exceeds \(\frac {2}{3}\).
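(2) By symmetry, we can likewise show:
For \(\gamma^{**} = \frac {f}{2}(\frac {1}{f_{12}} + \frac {1}{f_3})\), when \(f_3 \le f_{12}\), \(\gamma^{**} \ge \gamma \).
These two conclusions show that the frequencies of the strings can be used to determine a better way of combining the pieces of \(s\). This process is analogous to word segmentation (“分词”) in Chinese.
(3) Which of \(\gamma^*\) and \(\gamma^{**}\) is larger can only be determined by computing and comparing them for the actual case.
This is also a mathematical description of the “ambiguity” of words and sentences.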
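The comparison in (3) can be sketched in a few lines of Python. The function `compare_splits`, the split labels, and both sets of probabilities are hypothetical. The sketch also assumes, as conclusions (1) and (2) suggest but the text does not state outright, that the split with the larger value is the preferred one.

```python
def compare_splits(f, f1, f2, f3, f12, f23):
    """Score the three splits of s and return (label of largest, all scores)."""
    scores = {
        "s1|s2|s3": f / 3 * (1 / f1 + 1 / f2 + 1 / f3),  # gamma
        "s1|s23": f / 2 * (1 / f1 + 1 / f23),            # gamma*
        "s12|s3": f / 2 * (1 / f12 + 1 / f3),            # gamma**
    }
    # Assumption: a larger value marks the preferable split.
    best = max(scores, key=scores.get)
    return best, scores


# Two made-up cases that differ only in f12 and f23.
best_a, scores_a = compare_splits(0.001, 0.05, 0.02, 0.04, f12=0.002, f23=0.004)
best_b, scores_b = compare_splits(0.001, 0.05, 0.02, 0.04, f12=0.01, f23=0.0015)
print(best_a, scores_a)
print(best_b, scores_b)
```

In the first case \(\gamma^{**}\) is the largest, and in the second \(\gamma^*\) is, even though \(f, f_1, f_2, f_3\) are identical. The preferred split therefore depends on the measured \(f_{12}\) and \(f_{23}\), which is the ambiguity described above.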