https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

https://github.com/taoztw/note/blob/master/Sentencepiece_python_module_example.ipynb

- *user defined symbols*: Always treated as one token in any context. These symbols can appear in the input sentence.

- *control symbols*: We only reserve ids for these tokens. Even if these tokens appear in the input text, they are not handled as one token. The user needs to insert the ids explicitly after encoding.

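# sp_user is assumed to be a model trained with user_defined_symbols=<sep>,<cls>.
# User-defined symbols are kept as single vocabulary entries through both encode and decode.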
print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))
# output: ['▁this', '▁is', '▁a', '▁t', 'est', '<sep>', '▁he', 'll', 'o', '▁world', '<cls>']
print(sp_user.decode_pieces(['▁', '<s>','<sep>','<cls>', '▁he', 'll', 'o', '</s>']))
# output: <sep><cls> hello

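# sp_ctrl is assumed to be a model trained with control_symbols=<sep>,<cls>.
# Control symbols must be added manually after encoding, and decode does not preserve them.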
print(sp_ctrl.encode_as_pieces('this is a test<sep> hello world<cls>'))
# output: ['▁this', '▁is', '▁a', '▁t', 'est', '<', 'se', 'p', '>', '▁he', 'll', 'o', '▁world', '<', 'c', 'l', 's', '>']
print(sp_ctrl.decode_pieces(['▁', '<s>', '▁he', 'll', 'o', '</s>']))
# output: hello

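Title: sentencepiece user defined & control symbols

Permalink: https://www.tzer.top/archives/107.html

Unless otherwise stated, this work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Notice: please credit the source when reposting.

Last modified: September 9, 2021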