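Common tagging schemes for sequence labeling:
1 BIO
2 BIOES
3 IOB, among others
The sections below use named entity recognition (NER) as the example to compare them, focusing on how the choice of tagging scheme affects final model performance.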
1 BIO
B stands for 'beginning' (the word begins a named entity, i.e. an NE)
I stands for 'inside' (the word is inside an NE)
O stands for 'outside' (the word is a regular word outside any NE)
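The definitions above can be made concrete with a small sketch that turns span annotations into BIO tags. The function name `spans_to_bio` and the `(start, end, label)` span format with an exclusive `end` are assumptions made for illustration, not part of any particular library.

```python
def spans_to_bio(num_tokens, spans):
    """Build a BIO tag list from (start, end, label) spans; end is exclusive."""
    tags = ["O"] * num_tokens          # every token starts outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"     # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"     # remaining tokens of the entity
    return tags


tokens = ["Alice", "Smith", "visited", "New", "York", "City"]
print(spans_to_bio(len(tokens), [(0, 2, "PER"), (3, 6, "LOC")]))
# ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'I-LOC']
```

The sketch assumes the spans do not overlap; overlapping spans would silently overwrite each other.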
2 BIOES
B stands for 'beginning' (the word begins a multi-word NE)
I stands for 'inside' (the word is inside an NE)
O stands for 'outside' (the word is a regular word outside any NE)
E stands for 'end' (the word is the last word of a multi-word NE)
S stands for 'singleton' (the single word is itself an NE)
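Since BIOES only adds end and singleton markers, a BIO sequence can be converted mechanically. The following is a minimal sketch, assuming well-formed BIO input with tags of the form `B-TYPE`, `I-TYPE`, or `O`; the function name `bio_to_bioes` is hypothetical.

```python
def bio_to_bioes(tags):
    """Convert a well-formed BIO tag sequence to BIOES."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same type.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            bioes.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            bioes.append(f"I-{label}" if continues else f"E-{label}")
    return bioes


print(bio_to_bioes(["B-PER", "O", "B-LOC", "I-LOC", "I-LOC"]))
# ['S-PER', 'O', 'B-LOC', 'I-LOC', 'E-LOC']
```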
3 IOB (i.e. IOB-1)
The letters in IOB mean the same as in BIO. The difference is that in IOB the B tag is used only to mark the boundary between two adjacent named entities of the same type, not to mark the start of every named entity. For example:
Tokens: (word)(word)(word)(word)(word)(word)
IOB tags: (I-loc)(I-loc)(B-loc)(I-loc)(O)(O)
BIO tags: (B-loc)(I-loc)(B-loc)(I-loc)(O)(O)
In other words, the IOB scheme is similar to the BIO scheme; however, the B- tag is used to start a segment only when the previous token has the same class but belongs to a different segment.
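The difference between the two schemes can be sketched as a conversion from IOB (IOB-1) tags to BIO tags: an `I-` tag that opens an entity is rewritten as `B-`. The function name `iob1_to_bio` is an assumption for illustration, and the input is assumed to be valid IOB-1.

```python
def iob1_to_bio(tags):
    """Rewrite IOB-1 tags so that every entity starts with a B- tag."""
    bio = []
    prev_label = None  # entity type of the previous token, None if it was O
    for tag in tags:
        if tag == "O":
            bio.append(tag)
            prev_label = None
            continue
        prefix, label = tag.split("-", 1)
        # In IOB-1, an I- tag opens a new entity when the previous token
        # is O or belongs to an entity of a different type.
        if prefix == "I" and prev_label != label:
            bio.append(f"B-{label}")
        else:
            bio.append(tag)
        prev_label = label
    return bio


print(iob1_to_bio(["I-loc", "I-loc", "B-loc", "I-loc", "O", "O"]))
# ['B-loc', 'I-loc', 'B-loc', 'I-loc', 'O', 'O']
```

The printed result reproduces the BIO row of the example above from its IOB row.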
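Because IOB performed poorly overall, IOB-2 was introduced, which requires every named entity to start with a B tag. IOB-2 is therefore equivalent to the BIO scheme.
Because IOB does not mark the head of most entities with a B tag, it loses part of the annotation information, which leads to weaker results on many tasks.
BIO fixes this problem, so it generally outperforms IOB.
BIOES additionally provides End information and an S tag for single-word entities. This extra information may improve results, but the model has more labels to predict (E and S are added), which may also hurt results.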
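The role of named entity recognition:
The named entity recognition process consists of:
1 Entity boundary detection
2 Entity type classification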
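With a tagging scheme such as BIO, both steps are read off the tag sequence together: the B/I prefixes give the boundaries and the suffix gives the type. The following is a minimal sketch of decoding BIO tags into entities; the function name `extract_entities` and the `(text, type, start, end)` output format are assumptions for illustration.

```python
def extract_entities(tokens, tags):
    """Decode BIO tags into (text, type, start, end) tuples; end is exclusive."""
    entities = []
    start, label = None, None
    # A trailing "O" sentinel flushes an entity that runs to the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            # Boundary found: the open entity ends just before position i.
            entities.append((" ".join(tokens[start:i]), label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]  # open a new entity and record its type
        # An I- tag with no matching open entity is ignored in this sketch.
    return entities


tokens = ["Alice", "Smith", "visited", "New", "York", "City"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC"]
print(extract_entities(tokens, tags))
# [('Alice Smith', 'PER', 0, 2), ('New York City', 'LOC', 3, 6)]
```

Ignoring stray I- tags is one possible design choice; some evaluation tools instead treat them as the start of a new entity.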