Working with plain text


Default encoding

Plain text files, in most cases with a .txt extension, contain exclusively textual information. There is no clearly defined way to tell the computer which language they contain. In (very) simple terms, this means that by default the computer will assume the text is written in the same language the computer itself uses.


Garbled displays

If you are Russian, it is very likely that your computer works in Russian too: the menus are in Russian, the files you open will be in Russian, and so on. In most cases, the computer makes the right assumption about the contents of files in general: they all contain Russian and nothing that Russian characters could not display.

Now, if you are a Russian translator who translates from Japanese, the Japanese files you receive, if they are plain text files, will most probably be treated by the computer as files containing Russian, because there is no information in the file itself that tells the computer which language it is written in. The Japanese file contents could be:

OmegaTとは、コンピュータを利用した翻訳ツールです。


Because it expects the contents to be Russian, your text editor could very well display it like this:

OmegaTВ∆ВЌБAГRГУГsГЕБ[Г^ВрЧШЧpµšЦ|ЦуГcБ[ГЛВ≈ВЈБB


However, this is far from Russian: these are Japanese characters wrongly displayed as Russian characters.
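
The effect can be reproduced in a few lines of Python. The sketch below assumes that the Japanese file was saved in Shift_JIS and that the reading side decodes it with the Mac Cyrillic code page. Both encodings are illustrative choices only; the actual pair depends on the two systems involved.

```python
# Assumption: the Japanese client saved the file as Shift_JIS and the
# Russian system decodes it as Mac Cyrillic. Both choices are illustrative.
original = "OmegaTとは、コンピュータを利用した翻訳ツールです。"

# These bytes are what is actually stored in the plain text file.
raw_bytes = original.encode("shift_jis")

# Decoding with the wrong encoding. errors="replace" keeps the demo from
# failing if a byte has no mapping in the assumed code page.
garbled = raw_bytes.decode("mac_cyrillic", errors="replace")

# Decoding the very same bytes with the right encoding recovers the text.
restored = raw_bytes.decode("shift_jis")

print(garbled)
print(restored)
```

The bytes in the file never change; only the interpretation applied to them does. Note that the Latin letters of "OmegaT" survive, because both encodings agree on the ASCII range.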

Like any other application, OmegaT is subject to this problem too. By default, it can only assume that plain text files can be displayed using the system defaults. That works well when the computer works in French, for instance, and the files are in English, or when the computer is German and you deal with Italian files.


Character sets and encoding

Why would that work with English and French but not with Russian and Japanese? Because English and French share a common character set, namely Latin-1 or some variation of it. Until recently, Russian and Japanese have not shared any character sets. Most current Russian character sets do not cover Japanese and vice versa. The result is what you have seen above.

The Japanese client works with a Japanese computer and creates text files that contain Japanese. The character set selected by the client computer will depend on the operating system and on other settings, but it is highly unlikely that the chosen (Japanese) character set will be correctly interpreted by the Russian computer.

How the textual information in the specified character set is physically transmitted (i.e. which numeric codes the computer uses to store and interpret the text) depends on the encoding. When the computer reads the file, it "decodes" the information according to the encoding and displays it according to the character set. Roughly speaking, one encoding corresponds to one character set.
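
To make the idea of "numeric codes" concrete, the following Python sketch encodes the same two Japanese characters with three different encodings and prints the resulting bytes. The sample word and the selection of encodings are assumptions made for illustration.

```python
text = "翻訳"  # sample word ("translation"), chosen for illustration

# Three encodings that can all represent these two characters.
encodings = ("shift_jis", "euc_jp", "utf-8")

# Map each encoding name to the bytes it produces for the same text.
encoded = {name: text.encode(name) for name in encodings}

for name, data in encoded.items():
    # Show the numeric codes as hexadecimal byte values.
    print(f"{name:10} {data.hex(' ')}")
    # Decoding with the matching encoding gives the text back.
    assert data.decode(name) == text
```

The characters are identical in all three cases, but the bytes written to disk differ, which is why a reader must know the encoding to interpret a file correctly.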


The OmegaT solution

There are basically three ways to address this problem in OmegaT. They all involve the file filters settings in the Options menu.

  1. Specify the encoding for your plain text files (i.e. files with a .txt extension): in the Text files section of the file filters dialog, change the Source File Encoding from <auto> to the encoding that corresponds to your source .txt file.
  2. Change the extensions of your plain text source files (from .txt to .jp for Japanese plain texts, for instance): in the Text files section of the file filters dialog, add a new Source Filename Pattern (for example *.jp) and select the appropriate parameters for the source and target encoding.
  3. Change the encoding of your files to Unicode: open your source file in a text editor that correctly interprets its encoding and save the file in the "UTF-8" encoding. Change the file extension from .txt to .utf8. OmegaT will automatically interpret the file as a UTF-8 file.
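
For many files, the third option can also be scripted instead of being done by hand in a text editor. The Python sketch below is one possible way to do it, not part of OmegaT: the helper name convert_to_utf8 is hypothetical, and the source encoding (Shift_JIS in the demo) is an assumption that you must replace with the real encoding of your file.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(source: Path, source_encoding: str) -> Path:
    """Re-encode a text file as UTF-8 and save it with a .utf8 extension.

    Hypothetical helper: `source_encoding` must be the real encoding of the
    file, otherwise the garbled text is simply preserved in UTF-8.
    """
    text = source.read_text(encoding=source_encoding)
    target = source.with_suffix(".utf8")
    target.write_text(text, encoding="utf-8")
    return target


# Demo in a temporary directory, assuming a Shift_JIS source file.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "readme.txt"
    original.write_bytes("OmegaTとは、翻訳ツールです。".encode("shift_jis"))

    converted = convert_to_utf8(original, "shift_jis")
    print(converted.name)
    print(converted.read_text(encoding="utf-8"))
```

The original .txt file is left untouched, so you can repeat the conversion with another encoding if the result still looks garbled.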

By default, OmegaT provides a short list of predefined file name patterns to make it easier for you to deal with some plain text files.

You can check that yourself by selecting File Filters in the Options menu. For example, when you have a Czech text file (very probably encoded in ISO-8859-2), you just need to change the extension from .txt to .txt2 and OmegaT will interpret its contents correctly. And of course, if you want to be on the safe side, consider converting files of this kind to Unicode, i.e. to the .utf8 file format.
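
If you have several such files, the renaming can be scripted. The Python sketch below shows one way to do it; the helper name mark_as_txt2 is hypothetical, and the code does not check that the file really is encoded in ISO-8859-2, so that remains your assumption.

```python
import tempfile
from pathlib import Path


def mark_as_txt2(path: Path) -> Path:
    """Rename a .txt file to .txt2 and return the new path.

    Hypothetical helper: it only changes the extension and does not
    verify or convert the encoding of the file.
    """
    if path.suffix != ".txt":
        raise ValueError(f"not a .txt file: {path}")
    return path.rename(path.with_suffix(".txt2"))


# Demo in a temporary directory with a small Czech sample file.
with tempfile.TemporaryDirectory() as tmp:
    czech = Path(tmp) / "navod.txt"
    czech.write_bytes("Překlad".encode("iso8859_2"))

    renamed = mark_as_txt2(czech)
    print(renamed.name)
```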

