Customizing the OmegaT sentence segmentation rules for Chinese text

[Image: OmegaT Chinese sentence segmentation rules]
by Weedy Tan on January 13, 2014

I started learning and using OmegaT, the free computer-assisted translation tool (or CAT tool, for short) for professional translators, only a few weeks ago, and lately I decided to look into its sentence segmentation rules for Chinese source text.

Considering that Chinese has more native speakers than any other language in the world, I was surprised that there was no official sentence segmentation rule for Chinese source text in OmegaT at this point (Jan. 13, 2014). Maybe it’s because translation has generally gone from other languages into Chinese, and it’s only in the past few decades that translation from Chinese into other languages has started to become more important.

(Edit: As of Jan. 24, 2014, the latest version of OmegaT, 3.08.02, has its own official sentence segmentation rule for Chinese source text, and it works very well.)

I searched the OmegaT support group on Yahoo, asked a few questions, and eventually learned a few tricks that enabled me to customize my own Chinese sentence segmentation rules.

In Chinese text, punctuation marks are not followed by a blank space as they are in English. These Chinese punctuation marks (。?!) indicate the end of a sentence, so a paragraph can be segmented into sentences right after them for translation purposes.

However, if other punctuation marks immediately follow one of the marks above, the paragraph should be segmented after the last punctuation mark.

Examples:

      • 。)
      • ?)
      • !)
      • 。」
      • ?」
      • !」
      • 。』
      • ?』
      • !』

There are other exceptions as well, but I will not go into detail on every possibility.

Below are the steps I took to create my Chinese sentence segmentation rules:
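As a rough illustration of the behaviour described above (this is not OmegaT's own code), here is a minimal Python sketch that keeps any trailing closing marks with the sentence they end. The function name `split_sentences` and the set of closing marks are assumptions made for the example. The full-width forms ? ! ) are included next to the marks as printed in this article, since Chinese text normally uses the full-width forms.

```python
import re

# Sentence-final marks as printed in this article, plus their full-width forms.
TERMINATORS = "。?!?!"
# Closing marks that may trail a terminator (illustrative, not exhaustive).
CLOSERS = ")」』’”)"

# One or more terminators followed by any run of closing marks.
SENTENCE_END = re.compile(f"[{TERMINATORS}]+[{CLOSERS}]*")


def split_sentences(paragraph: str) -> list[str]:
    """Split a paragraph after the last punctuation mark of each sentence."""
    segments = []
    start = 0
    for match in SENTENCE_END.finditer(paragraph):
        segments.append(paragraph[start:match.end()])
        start = match.end()
    if start < len(paragraph):
        # Text after the final terminator forms its own segment.
        segments.append(paragraph[start:])
    return segments


print(split_sentences("他說:「你好。」我說:「再見!」"))
# ['他說:「你好。」', '我說:「再見!」']
```

The closing quotation mark stays attached to the sentence it ends instead of opening the next segment.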

    1. After opening the project and loading the source file, go to Project > Properties and make sure that the source language is set to ZH-TW, ZH-CN, or ZH-HK. Under the options, check “Enable Sentence-level Segmenting” and “Remove Tags”. Note: if “Remove Tags” is not checked, segmentation will be affected, because the tags are treated as a kind of punctuation mark.
    2. Next, click the “Segmentation” button and check “Make the segmentation rules project specific”.
    3. To the right of the “Language Name” and “Language Pattern” list, click the “Add” button. This adds a new entry, “New Language and Country” with the pattern “LN-CO”, at the bottom of the list. Scroll to the bottom to find the new entry and use “Move Up” to move it all the way to the top so that it has the highest priority for sentence segmentation.
    4. Change the name “New Language and Country” to something appropriate, like “Chinese-TW” or “Chinese-CN”, and make sure that you change “LN-CO” to “ZH.*”.
    5. In the segmentation rules, “Break” means the paragraph is split at the point right after the text matching “Pattern Before” and right before the text matching “Pattern After”. Put a check mark in the box to make a rule a “Break” rule. Without the check mark, the rule is an “Exception” rule, and the paragraph will not be segmented where its “Pattern Before” and “Pattern After” match.
    6. Here are my segmentation rules (the patterns are regular expressions, so a “.” on its own matches any character):
      • Exception   –   Pattern Before:  [。?!—]   –   Pattern After:  [’””」』)—]
      • Exception   –   Pattern Before:  [、,..]   –   Pattern After:  .
      • Break   –   Pattern Before:  [。?!]   –   Pattern After:  .
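To see how the three rules above interact, here is a simplified Python model of rule-based segmentation. It is a sketch under stated assumptions, not OmegaT's implementation: it assumes that at every position in the paragraph the rules are tried from top to bottom and the first rule whose two patterns both match decides whether to break. The `Rule` type and the `segment` function are hypothetical names used only for this example, and the patterns are copied exactly as printed in this article.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    is_break: bool  # True = "Break" rule, False = "Exception" rule
    before: str     # "Pattern Before" regular expression
    after: str      # "Pattern After" regular expression


# The three rules listed above, in the same top-to-bottom order.
RULES = [
    Rule(False, "[。?!—]", "[’””」』)—]"),
    Rule(False, "[、,..]", "."),
    Rule(True, "[。?!]", "."),
]


def segment(paragraph: str, rules: list[Rule] = RULES) -> list[str]:
    """Split a paragraph; at each position the first matching rule decides."""
    segments, start = [], 0
    for pos in range(1, len(paragraph)):
        for rule in rules:
            # "Pattern Before" must end at pos, "Pattern After" must start at pos.
            ends_before = re.search(rf"(?:{rule.before})\Z", paragraph[:pos])
            starts_after = re.match(rule.after, paragraph[pos:])
            if ends_before and starts_after:
                if rule.is_break:
                    segments.append(paragraph[start:pos])
                    start = pos
                break  # first matching rule wins; lower rules are ignored
    segments.append(paragraph[start:])
    return segments


print(segment("今天很好。明天呢?後天,再說!"))
# ['今天很好。', '明天呢?', '後天,再說!']
print(segment("他說:「好。」"))
# ['他說:「好。」']
```

In this model the first exception only suppresses the break between a sentence-final mark and a closing mark such as 」. Whether a break then follows the closing mark depends on the “Break” patterns in use.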

Some Chinese documents or articles use different punctuation patterns, and some translators prefer to segment them differently. This is possible if you keep two basic principles in mind:

    1. The “Exception” rules should be placed above the “Break” rules so that they take priority.
    2. Use “Pattern Before” and “Pattern After” to describe the segmentation patterns you are after.
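The first principle can be illustrated with a small Python sketch. It assumes a simplified model in which, at each position, the topmost matching rule wins. The `segment` helper and the two sample rules are hypothetical and only meant to show the effect of rule order, not how OmegaT is implemented.

```python
import re


def segment(paragraph, rules):
    """Split where the first matching rule is a Break rule.

    rules: list of (is_break, pattern_before, pattern_after) tuples,
    listed from highest to lowest priority.
    """
    segments, start = [], 0
    for pos in range(1, len(paragraph)):
        for is_break, before, after in rules:
            if (re.search(rf"(?:{before})\Z", paragraph[:pos])
                    and re.match(after, paragraph[pos:])):
                if is_break:
                    segments.append(paragraph[start:pos])
                    start = pos
                break  # a rule matched; ignore lower-priority rules
    segments.append(paragraph[start:])
    return segments


# Sample rules: do not break before a closing mark; break after 。?!
EXCEPTION = (False, "[。?!]", "[」』)]")
BREAK = (True, "[。?!]", ".")
TEXT = "好。「走吧!」"

print(segment(TEXT, [EXCEPTION, BREAK]))
# ['好。', '「走吧!」']
print(segment(TEXT, [BREAK, EXCEPTION]))
# ['好。', '「走吧!', '」']
```

With the “Break” rule on top, the exception is never consulted, and the closing quotation mark is cut off into a segment of its own.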

I will write another post illustrating some Chinese sentence segmentation rules suitable for Classical Chinese articles, and even for some old classical Chinese Buddhist texts.

Hope you enjoy this blog!
