Source segmentation

Translation memory tools work with textual units called segments. OmegaT has two ways to segment a text: paragraph segmentation and sentence segmentation. To choose the type of segmentation, select Project → Properties... from the main menu and use the available check box. Note that paragraph segmentation is largely outdated and that sentence segmentation is the better choice for the majority of projects. If sentence segmentation has been selected, you can set up the rules by selecting Options → Segmentation... from the main menu.

Note that considerable development effort has gone into dependable segmentation rules, so in the majority of cases you will not need to write your own. On the other hand, this functionality can be very useful in special cases, allowing you to translate what needs to be translated without the risk of changing what needs to stay unchanged.

Warning! Changing the filter options while the project is open may result in the loss of data. If you change segmentation options while a project is open, you will have to reload the project for the changes to take effect.


Structure level segmentation

OmegaT first parses the text for structure-level segmentation. During this process, only the structure of the source file is used to produce segments.

For example, text files may be segmented on line breaks, empty lines, or not be segmented at all. Files with formatting (OpenOffice.org documents, HTML documents, etc.) are segmented on block-level (paragraph) tags. Translatable object attributes in XHTML or HTML files can be extracted as separate segments.
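The three options mentioned for plain text files can be pictured with a small Java sketch. It is only an illustration, not OmegaT's filter code: the class and method names are hypothetical, and the handling of blank lines (a line holding only spaces or tabs counts as empty) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names are hypothetical, not part of OmegaT.
final class TextStructureSketch {

    // Treat "\r\n" and "\r" as "\n" so the patterns below stay simple.
    private static String normalize(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    // One segment per non-blank line.
    static List<String> byLineBreaks(String text) {
        List<String> segments = new ArrayList<>();
        for (String line : normalize(text).split("\\n")) {
            if (!line.trim().isEmpty()) {
                segments.add(line);
            }
        }
        return segments;
    }

    // One segment per block of lines; blocks are separated by empty lines
    // (assumption: lines containing only spaces or tabs count as empty).
    static List<String> byEmptyLines(String text) {
        return Arrays.asList(normalize(text).split("\\n(?:[ \\t]*\\n)+"));
    }

    // No structural segmentation: the whole file is a single unit.
    static List<String> unsegmented(String text) {
        return Arrays.asList(text);
    }

    public static void main(String[] args) {
        String text = "First line\nsecond line\n\nNew block";
        System.out.println(byLineBreaks(text).size() + " segments by line breaks");
        System.out.println(byEmptyLines(text).size() + " segments by empty lines");
        System.out.println(unsegmented(text).size() + " segment when not segmented");
    }
}
```

Whichever option is chosen, the resulting units are what sentence-level segmentation then works on.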


Sentence level segmentation

After segmenting the source file into logical units, OmegaT will further segment these blocks into sentences.

Segmentation rules

The process of segmenting can be pictured as follows: imagine a cursor moving along the text, one character at a time. At each cursor position, each rule is applied in the given order to see whether its Before pattern matches the text to the left of the cursor and its After pattern matches the text to the right. If a rule matches, the program either stops examining the rules (for an exception rule) or creates a new segment (for a break rule).
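The cursor walk described above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not OmegaT's actual implementation: the names SegmenterSketch, Rule and segment are hypothetical, the Before pattern is tied to the cursor by appending \z to it, and the first matching rule ends the examination at that position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only; names are hypothetical, not OmegaT's API.
final class SegmenterSketch {

    // One rule: break/exception flag plus the Before and After patterns.
    static final class Rule {
        final boolean isBreak;
        final Pattern before;
        final Pattern after;

        Rule(boolean isBreak, String before, String after) {
            this.isBreak = isBreak;
            // Assumption: Before must end exactly at the cursor, hence \z.
            this.before = Pattern.compile("(?:" + before + ")\\z");
            this.after = Pattern.compile(after);
        }

        // True if Before matches the text left of pos and After the text right of it.
        boolean matchesAt(String text, int pos) {
            return before.matcher(text.substring(0, pos)).find()
                && after.matcher(text.substring(pos)).lookingAt();
        }
    }

    static List<String> segment(String text, List<Rule> rules) {
        List<String> segments = new ArrayList<>();
        int start = 0;
        // Move the cursor through every position between two characters.
        for (int pos = 1; pos < text.length(); pos++) {
            for (Rule rule : rules) {
                if (rule.matchesAt(text, pos)) {
                    if (rule.isBreak) {
                        segments.add(text.substring(start, pos));
                        start = pos;
                    }
                    break; // first matching rule decides; stop examining rules
                }
            }
        }
        segments.add(text.substring(start));
        return segments;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(
            new Rule(false, "Mr\\.", "\\s"), // exception: no break after "Mr."
            new Rule(true, "\\.", "\\s"));   // break: after a period, before white space
        for (String s : segment("Mr. Smith came. He left.", rules)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

With the exception rule listed first, the sample text is split only after "came."; if the two rules are swapped, the break rule matches first at the position after "Mr." and a break is made there. The sketch leaves the white space at the start of the following segment.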

Sentence segmentation has been implemented with the help of the Segmentation Rules eXchange (SRX) standard. Please note that not all SRX features are supported, and that it is not possible to import or export rules defined in SRX format. However, if you know how SRX works, you will already know a lot about how OmegaT does the segmentation.

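There are two kinds of rules: break rules, which create a new segment at the matching position, and exception rules, which prevent a break at that position.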

The predefined break rules should be sufficient for most European languages and Japanese. Given this flexibility, you may consider defining more exception rules for the language you translate from, to obtain more meaningful and coherent segments.

Rules setup

Priority

All the segmentation rule sets with a matching Language Pattern are applied in the given order of priority, so rules for a specific language should be placed higher than the default ones. For example, rules for Canadian French (FR-CA) should be higher than rules for French (FR.*), and higher than the Default (.*) ones. Then, when you translate from Canadian French, your project will use the rules defined for that language, the rules for French, and the Default rules, in the correct order.
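The selection of rule sets by language pattern can be sketched as follows. This is a hedged illustration, not OmegaT's code: the names RuleSetPriority, RuleSet and applicable are hypothetical, and it is an assumption of the sketch that a pattern must match the whole language code, case-sensitively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only; names are hypothetical, not OmegaT's API.
final class RuleSetPriority {

    // A named rule set with its language pattern (the rules themselves are omitted).
    static final class RuleSet {
        final String name;
        final Pattern languagePattern;

        RuleSet(String name, String languagePattern) {
            this.name = name;
            this.languagePattern = Pattern.compile(languagePattern);
        }
    }

    // Names of the sets whose pattern matches the source language,
    // kept in table order, that is, in order of priority.
    static List<String> applicable(List<RuleSet> table, String sourceLanguage) {
        List<String> names = new ArrayList<>();
        for (RuleSet set : table) {
            // Assumption: the pattern must match the entire language code.
            if (set.languagePattern.matcher(sourceLanguage).matches()) {
                names.add(set.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<RuleSet> table = Arrays.asList(
            new RuleSet("Canadian French", "FR-CA"),
            new RuleSet("French", "FR.*"),
            new RuleSet("Default", ".*"));
        System.out.println(applicable(table, "FR-CA"));
        System.out.println(applicable(table, "FR"));
        System.out.println(applicable(table, "DE"));
    }
}
```

Because the table order is preserved, the most specific set only takes precedence if it has been moved above the more general ones.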

Rules creation

To create an empty set of rules, click Add in the upper half of the dialog. An empty line appears at the bottom of the table. Change the name of the rule set and the language pattern. The syntax of the language pattern conforms to regular expression syntax. If your set of rules handles a language-country pair, we advise you to move it to the top using the Move Up button. To edit a set of rules, simply click on it in the table; the rules of the set will appear in the bottom half of the window.

Break/Exception

The Break/Exception check box determines whether a rule is a break rule (check box set) or an exception rule (check box unset). Two regular expressions, Before and After, specify what must come before and after a position for the break rule or exception rule to apply there.

A few simple examples

Intention: set a segment after a period ('.') and before a space
Before: \.
After: \s
Note: "\." means the character "."; "\s" means any white space character

Intention: do not segment after Mr.
Before: Mr\.
After: \s
Note: this is an exception rule, so the rule check box must be unchecked

Intention: set a segment after "。" (Japanese period)
Before: 。
After: (empty)
Note: note that After is empty

Intention: do not segment after M. Mr. Mrs. and Ms.
Before: Mr??s??\.
After: \s
Note: exception rule; see the use of ? in regular expressions (reluctant, or non-greedy, quantifier)
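To check what a Before pattern such as the one in the last example accepts, it can be tried out directly with Java's regex classes. The following is a small sketch; the class name TitleExceptionDemo and the method isTitle are hypothetical and serve only this illustration.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; names are hypothetical.
final class TitleExceptionDemo {

    // Before pattern from the last example: "M", optional "r", optional "s", then a period.
    static final Pattern TITLES = Pattern.compile("Mr??s??\\.");

    // True if the whole string is matched by the pattern.
    static boolean isTitle(String s) {
        return TITLES.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"M.", "Mr.", "Mrs.", "Ms.", "Mrss.", "Dr."};
        for (String s : samples) {
            System.out.println(s + " -> " + isTitle(s));
        }
    }
}
```

The pattern accepts exactly "M.", "Mr.", "Mrs." and "Ms.". For whole-string matching like this, the greedy form Mr?s?\. accepts the same four strings; the reluctant ?? only changes the order in which the alternatives are tried.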


Regular-expression constructs

The regular expressions used in segmentation rules are those supported by Java. A short summary is available in the Regular Expressions Constructs appendix. If you need more specific information, please consult http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html.

You can find simple tutorials on the web (for example, http://www.regular-expressions.info/quickstart.html).

