手把手编写语音矫正(Spelling Corrector)

手把手编写语音矫正(Spelling Corrector) – 玩具级（python）

这个代码示例是翻译Morvig的，原文地址

import re
from collections import Counterdef words(text):return re.findall(r'\w+', text.lower())# big.txt 我是随便用了一个文本，我用的是莎士比亚文集
WORDS = Counter(words(open('big.txt').read()))def P(word, N=sum(WORDS.values())):"词的概率"return WORDS[word] / Ndef correction(word):"对于word这个词最大可能的拼写矫正"return max(candidates(word), key=P)def candidates(word):"创建错词的候选词组"return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])def known(words):"在辞典WORDS中出现的`words`的子集"return set(w for w in words if w in WORDS)def edits1(word):"word编辑距离为1的所有可能"letters = 'abcdefghijklmnopqrstuvwxyz'splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]deletes = [L +R[1:] for L, R in splits if R]transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]replaces = [L +c + R[1:] for L, R in splits if R for c in letters]inserts = [L + c + R for L, R in splits for c in letters]return set(deletes + transposes + replaces + inserts)def edits2(word):"word编辑距离为2的所有可能"return (e2 for e1 in edits1(word) for e2 in edits1(e1))

correction('korrectud')  --> 'corrected'
correction('speling') --> 'spelling'

工作原理–理论部分：噪声模型

函数correction(w)试图选择错词 w $w$ 最可能的校正词. 但是没有办法确认一定是哪一个（比如，”lates”应该被修正为”late” 还是 “latest” 或者”lattes” 或者其他呢？），所以我们只能用概率去评估可能性。在所有的候选的校正词中，我们找到的校正词c $c$ 应该是，给定原始的词 w $w$ ，c $c$ 具有最大的概率：

argmaxc∈candidatesP(c∣w) $argmax_{c \in candidates} P(c \mid w)$

根据贝叶斯定理，等价于:

argmaxc∈candidatesP(c)P(w∣c)P(w) $argmax_{c \in candidates} \frac {P(c)P(w \mid c)} {P(w)}$

注意到这里的 w $w$ 是给定的，所以上式等价于

argmaxc∈candidatesP(c)P(w∣c) $argmax_{c \in candidates} {P(c)P(w \mid c)}$

先解释一下各个部分吧，这个模型也叫Noisy Channel Model,主要是由两个部分组成，前面的 P(c) $P(c)$ 的信道模型和 P(w∣c) $P(w \mid c)$ 的语言模型。
下面列出噪声信道模型的四个部分的详细解释：

选择机制: argmax
- 选择具有最高组合概率的候选词
候选人模型: c∈candidates
- candidates是错词的候选人的库，也就是有哪些选择
语言模型: P(c)
- 在一个英文文章里面，单词c出现的概率。例如“the”这个词在英文文章中出现的概率为7%左右，所以 P(the)=0.07 $P(the) = 0.07$
错误模型: P(w∣c)
- 当作者的真正意图是c但是手贱写成w的时候的概率。例如 P(teh∣the) $P(teh \mid the)$ 相对较高，但是 P(theeexyz∣the) $P(theeexyz \mid the)$ 就会超级小。

一个明显的问题是：为什么不采用这个简单的表达式如 P(w∣c) $P(w \mid c)$ ，并且用一个更复杂的包含了两个模型的表达式代替呢？答案是 P(w∣c) $P(w \mid c)$ 已经混合了两个特征，并且这个很容易将两者分开，并且处理。考虑一个拼错的词 w="thew" $w = "thew"$ ,他的两个候选词 c="the" $c = "the"$ 和 c="thaw" $c = "thaw"$ .哪一个会有更高的 P(w∣c) $P(w \mid c)$ ？ ok，“thaw”看起来更好，因为它只将”a”变成了”e”，编辑距离为1，也就是小改动啦。但是另一个方面，“the”会更合适，是因为我们知道“the”是一个很常用的单词，看起来更像手贱加了w打成“thew”的。所以这两个因素都要考虑。

工作原理–代码部分函数解释

选择机制: 在一段python的代码里面，max with a key参数就是’argmax’.
候选人模型: 这里需要普及的是编辑距离(下次写),这里简略讲一下，就是一个词的编辑一般有4种情况:插入(i);删除(d);替换(s);移位(t)(这个就是abc写成acb了);一个单词变成另外一个单词所需要的最小操作次数就是这两个词的编辑距离. 函数edits1返回一组编辑距离为1的所有字符串（这个字符串不考虑是不是一个词）.

def edits1(word):"word编辑距离为1的所有可能"letters = 'abcdefghijklmnopqrstuvwxyz'splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]deletes = [L +R[1:] for L, R in splits if R]transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]replaces = [L +c + R[1:] for L, R in splits if R for c in letters]inserts = [L + c + R for L, R in splits for c in letters]return set(deletes + transposes + replaces + inserts)

这个集合可能是个很大的集合。对一个长度为n的单词，将会有n个删除，n-1个换位，26n个替换，26(n+1)个插入，所以总共又54n+25(当然其中有些是重复的).举个例子:

len(edits1('somthing')) --> 442

所以呢，一般的做法都是这些可能性呢首先需要是个在字典里的词，这样子这个集合的大小就会小很多. 这也是known(words)函数的意义。

作者也顺带写了考虑编辑距离为2的情况，给个对比给你看看。

print("'somthing'编辑距离为1的可能词为:"+ str(known(edits1('somthing'))))
print("-------编辑距离为2--------")
print("'something'编辑距离为2的所有可能性为:%d" %(len(set(edits2('somthing')))))
print("考虑是否为单词后的候选结果为:" + str(known(edits2('somthing'))))

‘somthing’编辑距离为1的可能词为:{‘somthing’, ‘something’}
——-编辑距离为2——–
‘something’编辑距离为2的所有可能性为:90902
考虑是否为单词后的候选结果为:{‘smoothing’, ‘loathing’, ‘somthing’, ‘seething’, ‘nothing’, ‘somethings’, ‘somethin’, ‘sorting’, ‘scathing’, ‘something’}

语言模型: 我们可以通过计算每个单词在big.txt中出现的次数来估计单词的概率 P(word) $P(word)$ 。我这里的big.txt是我随便用的一个莎士比亚文章集合，你们随便导几个文章，规模大概100W单词的样子。函数·words· 就是将文本变成单词(正则下次也可以讲讲，正则的理解的话参考一下指导手册就好),然后变量·WORDS·就是做个统计，是一个词典。

print("有%d个不同的单词,它们一共出现%d词，其中最常用的词%s，概率为%f."%(len(WORDS), sum(WORDS.values()), str(WORDS.most_common(1)[0]), P('the')))

有55160个不同的单词,它们一共出现1387731词，其中最常用的词('the', 52596)，概率为0.037901.

错误模型: 当年作者写这个程序的时候，是07年在飞机上，然后没网。所以没有数据，作者就走了个捷径: 定义了一个有点缺陷的模型。作者认为所有已知编辑距离为1的单词比编辑距离为2的可能性大得多，比编辑距离为0的小得多。所以，我们可以创建非空列表·candidates(word)·，就是优先级一个个来: 原来词是个单词的 > 编辑距离为1的 > 编辑距离为2的 > 原来词即是不是词的。

我们不需要乘以 P(w∣c) $P(w \mid c)$ 因为这个缺陷模型的概率是相同的。按顺序的原则来就可以了。（我后续再补充没有缺陷的）

评估

这个错误词库是”Roger Mitton’s Birkbeck spelling error corpus from the Oxford Text Archive.”
从中提取了两个测试集。设立两个测试集是比较合理的：

def unit_tests():assert correction('speling') == 'spelling'              # insertassert correction('korrectud') == 'corrected'           # replace 2assert correction('bycycle') == 'bicycle'               # replaceassert correction('inconvient') == 'inconvenient'       # insert 2assert correction('arrainged') == 'arranged'            # deleteassert correction('peotry') =='poetry'                  # transposeassert correction('peotryy') =='poetry'                 # transpose + deleteassert correction('word') == 'word'                     # knownassert correction('quintessential') == 'quintessential' # unknownassert words('This is a TEST.') == ['this', 'is', 'a', 'test']assert Counter(words('This is a test. 123; A TEST this is.')) == (Counter({'123': 1, 'a': 2, 'is': 2, 'test': 2, 'this': 2}))assert len(WORDS) == 55160assert sum(WORDS.values()) == 1387731assert WORDS.most_common(10) == [('the', 52596),('to', 36456),('i', 33097),('a', 32107),('and', 30221),('of', 23336),('you', 22512),('it', 22464),('that', 19358),('is', 17268)]assert WORDS['the'] == 52596assert P('quintessential') == 0.0assert 0.03 < P('the') < 0.04return 'unit_tests pass'def spelltest(tests, verbose=False):"Run correction(wrong) on all (right, wrong) pairs; report results."import timestart = time.clock()good, unknown = 0, 0n = len(tests)for right, wrong in tests:w = correction(wrong)good += (w == right)if w != right:unknown += (right not in WORDS)if verbose:print('correction({}) => {} ({}); expected {} ({})'.format(wrong, w, WORDS[w], right, WORDS[right]))dt = time.clock() - startprint('{:.0%} of {} correct ({:.0%} unknown) at {:.0f} words per second '.format(good / n, n, unknown / n, n / dt))def Testset(lines):"Parse 'right: wrong1 wrong2' lines into [('right', 'wrong1'), ('right', 'wrong2')] pairs."return [(right, wrong)for (right, wrongs) in (line.split(':') for line in lines)for wrong in wrongs.split()]print(unit_tests())
spelltest(Testset(open('spell-testset1.txt'))) # Development set
spelltest(Testset(open('spell-testset2.txt'))) # Final test set

unit_tests pass
64% of 270 correct (10% unknown) at 34 words per second 
64% of 400 correct (8% unknown) at 31 words per second

所以在Development集合上，我们得到64%的正确率，最终集上我们也是64%。总而言之，这个的简陋版的目标是为了简洁，开发时间和运行时的速度，但不是为了准确度。可能是测试集太难，当然也可能是这个简陋的模型效果本来就不好。

Future Work 改进的方向

作者呢，把一些比较好的想法写在她的jupyter notebook里面

其实改进的方向无非三个个一个是语言模型一个是信道模型还有一个就是返回最大概率的函数，毕竟公式就这三个部分。

语言模型 P(c) $P(c)$
- 字典中词不够
- 通过组合词
信道模型 P(w∣c) $P(w \mid c)$
这个的问题是因为有些词比如adres，候选词address会比acres更合适，但是前者的编辑距离大于后者。所以但看编辑距离的简陋模型是不可取的。
可以试试“模糊词”
看更多的上下文
我们可以通过编译语言加快计算速度

注：还有一些参考文献和后续进展文献的，我太懒了，感兴趣的自己去原作者那里找找。

发布者：admin，转转请注明出处：http://www.yc00.com/news/1703024146a1270060.html

手把手编写语音矫正(Spelling Corrector)