After I published a set of OmegaT sentence segmentation rules for typical Chinese text, someone in the Yahoo support group asked me to add segmentation rules for certain non-standard punctuation marks used in ancient Chinese Buddhist text.
These rules suit only this type of ancient Chinese Buddhist text (and possibly some ancient “Classical Chinese” text as well), not the punctuation marks mandated by the present governments. (The Chinese and Taiwanese governments each came out with their own set of punctuation marks, though the two are very much the same in practical usage. Also, traditional Chinese is used in Taiwan, while simplified Chinese is used in China.) I therefore suggested that these rules should not be included in the typical Chinese segmentation rules.
Instead, I volunteered to make a separate set of segmentation rules for this purpose. Whether it can also be used for “Classical Chinese” text, as another Yahoo group member suggested, I cannot be sure, but I am willing to revise it or make another set for that particular purpose.
Below are some of the non-standard punctuation mark segmentation rules suitable for ancient Chinese Buddhist text.
Basically, the steps are almost the same as in my previous blog post for typical Chinese text (https://weedytan.wordpress.com/2014/01/13/customizing-the-omegat-sentence-segmentation-rules-for-chinese-source-text/). The differences are the actual segmentation rules entered in the “Break / Exception”, “Pattern Before”, and “Pattern After” table, and the “Language Name”. For this purpose, I chose “Buddhist Articles” as the “Language Name”, though you can name it whatever you want. <g>
Here are the segmentation rules:
- Exception – Pattern Before: [。？！] – Pattern After: [』」]
- Exception – Pattern Before: [—] – Pattern After: [—]
- Exception – Pattern Before: [』] – Pattern After: [」]
- Break – Pattern Before: [。？！：；—] – Pattern After: .
- Break – Pattern Before: [。』」] – Pattern After: .
- Break – Pattern Before: [。」] – Pattern After: .
- Break – Pattern Before: [？」] – Pattern After: .
- Break – Pattern Before: [！」] – Pattern After: .
- Break – Pattern Before: [——] – Pattern After: .
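If you want to check how the rules above interact before typing them into OmegaT, here is a rough Python sketch. It is not OmegaT's own code. It assumes a simplified model in which the rules are tried from top to bottom at every position in the text and the first rule whose “Pattern Before” and “Pattern After” both match decides whether to break there. The function name `segment` and the sample sentence are made up purely for illustration.

```python
import re

# The rules in the order listed above: (is_break, pattern_before, pattern_after).
# Note: inside a character class, [——] matches the same thing as [—].
RULES = [
    (False, "[。？！]", "[』」]"),
    (False, "[—]", "[—]"),
    (False, "[』]", "[」]"),
    (True, "[。？！：；—]", "."),
    (True, "[。』」]", "."),
    (True, "[。」]", "."),
    (True, "[？」]", "."),
    (True, "[！」]", "."),
    (True, "[——]", "."),
]


def segment(text, rules=RULES):
    """Split text using a simplified first-match-wins reading of the rules."""
    compiled = [
        # Pattern Before must match right up to the candidate position.
        (is_break, re.compile("(?:" + before + r")\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    segments = []
    start = 0
    for pos in range(1, len(text)):
        left, right = text[:pos], text[pos:]
        for is_break, before, after in compiled:
            if before.search(left) and after.match(right):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # the first matching rule decides; skip the rest
    segments.append(text[start:])
    return segments


# Made-up test string, only for trying out the rules.
sample = "佛言：「善哉！」如是我聞。一時——佛在舍衛國。"
for piece in segment(sample):
    print(piece)
```

With this sample, the exception rules keep “！」” and “——” together, while the break rules split after “：”, after “」”, after “。”, and after the double dash. Real OmegaT behaviour may differ in details, so treat this only as a quick way to reason about rule order.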
You can click the picture at the top of the blog to enlarge it and get a better idea of how the rules actually appear. Note that the background Chinese text used for this segmentation test has no particular meaning (and no disrespect is intended). The Chinese text was cobbled together randomly just for the convenience of testing.
I hope you enjoyed this post, and you are welcome to “Like” and “Share” it.