Character Sets
Supported character sets
mnoGoSearch supports almost all known 8-bit character sets as well as some multi-byte charsets, including Korean euc-kr, Chinese big5 and gb2312, Japanese shift-jis, euc-jp and iso-2022-jp, and utf8.
mnoGoSearch also supports the following Macintosh character sets: MacCE, MacCroatian, MacGreek, MacRoman, MacTurkish, MacIceland, MacRomania, MacThai, MacArabic, MacHebrew, MacCyrillic, MacGujarati.
Several languages in one database
It is often necessary to deal with several languages simultaneously. The number of supported languages depends on the character set that mnoGoSearch uses to store data. The character set is specified with the LocalCharset command.
UTF-8 mode
When UTF-8 is specified in the LocalCharset command, you can work with any language supported by Unicode, that is, any number of more than 650 languages. However, UTF-8 may consume considerably more disk space (up to twice as much for some languages), which leads to slower indexing and search.
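The disk space cost can be estimated by encoding the same text in a single-byte charset and in UTF-8. The sketch below is only an illustration in Python, not part of mnoGoSearch; the helper name `encoded_sizes` and the sample strings are assumptions chosen for the example.

```python
# Compare the byte length of the same text in a single-byte charset and in UTF-8.
# encoded_sizes is a hypothetical helper, not a mnoGoSearch function.

def encoded_sizes(text, legacy_charset):
    """Return (bytes in legacy_charset, bytes in UTF-8) for text."""
    return len(text.encode(legacy_charset)), len(text.encode("utf-8"))

english = "search engine"
russian = "поисковая система"

print(encoded_sizes(english, "latin-1"))  # (13, 13): Latin letters cost 1 byte either way
print(encoded_sizes(russian, "koi8-r"))   # (17, 33): each Cyrillic letter takes 2 bytes in UTF-8
```

Latin text occupies the same space in both encodings, while Cyrillic text roughly doubles, which matches the "up to twice" estimate above.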
non-UTF-8 mode
Since every character set includes the Latin characters, any character set supports at least two languages: English and one or more other languages. US-ASCII is the exception: it supports only Latin characters.
Note
When using mnoGoSearch in standard (non-UTF-8) mode, you may use as many languages as you like, provided they all belong to the same language group.
Table 11.1. Language groups
| Language group | Languages | Character sets |
|---|---|---|
| Group 1 | Western Europe: Albanian, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, Swedish | ASCII 8, CP437, CP850, CP860, CP1252, ISO 8859-1, ISO 8859-15, MacRoman, MacIceland |
| Group 2 | Eastern Europe: Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Romanian, Slovak, Slovene | CP852, CP1250, ISO 8859-2, MacCentralEurope, MacRomania, MacCroatian |
| Group 4 | Baltic | CP1257, ISO 8859-4, ISO 8859-13 |
| Group 5 | Cyrillic: Bulgarian, Byelorussian, Macedonian, Russian, Serbian, Ukrainian | CP855, CP866, CP1251, ISO 8859-5, Koi8-r, Koi8-u, MacCyrillic |
| Group 6 | Arabic | CP864, CP1256, ISO 8859-6, MacArabic |
| Group 7 | Greek | CP869, CP1253, ISO 8859-7, MacGreek |
| Group 8 | Hebrew | CP1255, ISO 8859-8, MacHebrew |
| Group 9 | Turkish | CP857, CP1254, ISO 8859-9, MacTurkish |
| Group 101 | Japanese | Shift-JIS, EUC-JP, ISO-2022-JP |
| Group 102 | Simplified Chinese (PRC) | EUC-GB |
| Group 103 | Traditional Chinese (ROC) | Big 5 |
| Group 104 | Korean | EUC-KR |
| Group 105 | Thai | CP874, TIS 620, MacThai |
| Group 106 | Vietnamese | CP1258 |
| Group 107 | Indian | MacGujarati |
| Group 108 | Georgian | geostd8 |
| Unicode | Over 650 languages | UTF-8 (Unicode) |
For example, if your search engine is configured to use a LocalCharset from group 5 (Cyrillic), you can index servers containing documents in Bulgarian, Byelorussian, Macedonian, Russian, Serbian and Ukrainian. Indexing a multi-language document in UTF-8 is possible as well; however, indexer will extract and save only the Cyrillic content from the page. To support over 650 languages, use the LocalCharset utf-8 command.
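Before settling on a non-UTF-8 LocalCharset, it can help to check whether sample text from each language you plan to index is representable in that charset. The following is a rough sketch using Python's built-in codecs rather than mnoGoSearch's own conversion tables; the function name `fits_charset` and the sample words are assumptions made for the illustration.

```python
# Check whether sample words can be stored in a candidate charset.
# fits_charset is a hypothetical helper; it relies on Python's codecs,
# which may differ in detail from mnoGoSearch's internal tables.

def fits_charset(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

samples = {
    "Russian": "поиск",
    "Ukrainian": "їжак",
    "Bulgarian": "търсене",
    "German": "Grüße",
}

for language, word in samples.items():
    print(language, fits_charset(word, "cp1251"))
```

With CP1251 as the candidate, the three Cyrillic samples fit, while the German word does not, because its accented letters belong to group 1 charsets.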
Character set conversion
indexer converts all documents to the character set specified in the LocalCharset command of indexer.conf. Internally, conversion is implemented using Unicode. Note that some conversions may lose data. For example, conversion between any Greek and Russian charsets loses all national characters. This does not matter for single-language sites. If you want to build a multilingual search engine, use UTF-8 as the LocalCharset.
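The data loss can be reproduced with a two-step conversion through Unicode. This is a minimal sketch in Python, assuming that unrepresentable characters are replaced with `?`; how indexer itself treats such characters is not stated here, and the `convert` function is hypothetical.

```python
# Convert bytes between two charsets via Unicode, as a stand-in for the
# conversion described above. convert is a hypothetical helper, and the
# "?" replacement policy is an assumption of this sketch.

def convert(data, source_charset, target_charset):
    text = data.decode(source_charset)                     # source charset -> Unicode
    return text.encode(target_charset, errors="replace")   # Unicode -> target charset

greek_document = "αβγ abc".encode("iso-8859-7")

print(convert(greek_document, "iso-8859-7", "koi8-r"))  # b'??? abc'
print(convert(greek_document, "iso-8859-7", "utf-8").decode("utf-8"))  # αβγ abc
```

The Latin letters survive the conversion to KOI8-R, the Greek letters do not, and the UTF-8 target keeps both.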
Character set conversion at search time
You may use the BrowserCharset command to choose the charset used to display search results. BrowserCharset may differ from LocalCharset.
Character set aliases
Each charset is recognized by a number of aliases, because web servers can report the same charset in different notations. For example, iso-8859-2, iso8859-2 and latin2 all name the same charset. The search engine understands the following charset name aliases:
Table 11.2. Charset aliases
| Charset | Aliases |
|---|---|
| iso-8859-1 | CP819, CSISOLATIN, IBM819, ISO-8859-1, ISO-IR-100, ISO_8859-1, ISO_8859-1:1987, L1, LATIN1 |
| iso-8859-10 | CSISOLATIN6, ISO-8859-10, ISO-IR-157, ISO_8859-10, ISO_8859-10:1992, L6, LATIN6 |
| iso-8859-11 | ISO-8859-11, TIS-620, TIS620, TACTIS |
| iso-8859-13 | ISO-8859-13, ISO-IR-179, ISO_8859-13, L7, LATIN7 |
| iso-8859-14 | ISO-8859-14, ISO-IR-199, ISO_8859-14, ISO_8859-14:1998, L8, LATIN8 |
| iso-8859-15 | ISO-8859-15, ISO-IR-203, ISO_8859-15, ISO_8859-15:1998 |
| iso-8859-16 | ISO-8859-16, ISO-IR-226, ISO_8859-16, ISO_8859-16:2000 |
| iso-8859-2 | CSISOLATIN2, ISO-8859-2, ISO-IR-101, ISO_8859-2, ISO_8859-2:1987, L2, LATIN2 |
| iso-8859-3 | CSISOLATIN3, ISO-8859-3, ISO-IR-109, ISO_8859-3, ISO_8859-3:1988, L3, LATIN3 |
| iso-8859-4 | CSISOLATIN4, ISO-8859-4, ISO-IR-110, ISO_8859-4, ISO_8859-4:1988, L4, LATIN4 |
| iso-8859-5 | CSISOLATINCYRILLIC, CYRILLIC, ISO-8859-5, ISO-IR-144, ISO_8859-5, ISO_8859-5:1988 |
| iso-8859-6 | ARABIC, ASMO-708, CSISOLATINARABIC, ECMA-114, ISO-8859-6, ISO-IR-127, ISO_8859-6, ISO_8859-6:1987 |
| iso-8859-7 | CSISOLATINGREEK, ECMA-118, ELOT_928, GREEK, GREEK8, ISO-8859-7, ISO-IR-126, ISO_8859-7, ISO_8859-7:1987 |
| iso-8859-8 | CSISOLATINHEBREW, HEBREW, ISO-8859-8, ISO-IR-138, ISO_8859-8, ISO_8859-8:1988 |
| iso-8859-9 | CSISOLATIN5, ISO-8859-9, ISO-IR-148, ISO_8859-9, ISO_8859-9:1989, L5, LATIN5 |
| armscii-8 | ARMSCII-8 |
| big5 | BIG-5, BIG-FIVE, BIG5, BIGFIVE, CN-BIG5, CSBIG5 |
| cp1250 | CP1250, MS-EE, WINDOWS-1250 |
| cp1251 | CP1251, MS-CYRL, WINDOWS-1251 |
| cp1252 | CP1252, MS-ANSI, WINDOWS-1252 |
| cp1253 | CP1253, MS-GREEK, WINDOWS-1253 |
| cp1254 | CP1254, MS-TURK, WINDOWS-1254 |
| cp1255 | CP1255, MS-HEBR, WINDOWS-1255 |
| cp1256 | CP1256, MS-ARAB, WINDOWS-1256 |
| cp1257 | CP1257, WINBALTRIM, WINDOWS-1257 |
| cp1258 | CP1258, WINDOWS-1258 |
| cp437 | 437, CP437, IBM437 |
| cp850 | 850, CP850, CSPC850MULTILINGUAL, IBM850 |
| cp852 | 852, CP852, IBM852 |
| cp855 | 855, CP855, IBM855 |
| cp857 | 857, CP857, IBM857 |
| cp860 | 860, CP860, IBM860 |
| cp861 | 861, CP861, IBM861 |
| cp862 | 862, CP862, IBM862 |
| cp863 | 863, CP863, IBM863 |
| cp864 | 864, CP864, IBM864 |
| cp865 | 865, CP865, IBM865 |
| cp866 | 866, CP866, CSIBM866, IBM866 |
| cp869 | 869, CP869, IBM869 |
| cp874 | CP874, WINDOWS-874 |
| euc-kr | CSEUCKR, EUC-KR, EUCKR |
| gb2312 | CHINESE, CSGB2312, CSISO58GB231280, GB2312, GB_2312-80, ISO-IR-58 |
| koi8-r | CSKOI8R, KOI8-R |
| koi8-u | KOI8-U |
| shift-jis | CSSHIFTJIS, MS_KANJI, S-JIS, SHIFT-JIS, SHIFT_JIS, SJIS |
| cp367 | ANSI_X3.4-1968, ASCII, CP367, CSASCII, IBM367, ISO-IR-6, ISO646-US, ISO_646.IRV:1991, US, US-ASCII |
| utf8 | UTF-8, UTF8 |
| viscii | CSVISCII, VISCII, VISCII1.1-1 |
| maccyrillic | MACCYRILLIC, X-MAC-CYRILLIC |
| macroman | MACROMAN, MACINTOSH, CSMACINTOSH, MAC |
| MacCentralEurope | MACCENTRALEUROPE, MACCE |
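Alias resolution of this kind amounts to a lookup from any recognized spelling to one canonical name. The sketch below illustrates the idea in Python with three rows copied from Table 11.2; the names `CHARSET_ALIASES`, `ALIAS_INDEX` and `canonical_charset` are hypothetical, and case-insensitive matching is an assumption rather than documented mnoGoSearch behavior.

```python
# Three rows of Table 11.2: canonical charset name -> recognized aliases.
CHARSET_ALIASES = {
    "iso-8859-2": ["CSISOLATIN2", "ISO-8859-2", "ISO-IR-101", "ISO_8859-2",
                   "ISO_8859-2:1987", "L2", "LATIN2"],
    "koi8-r": ["CSKOI8R", "KOI8-R"],
    "utf8": ["UTF-8", "UTF8"],
}

# Reverse index: upper-cased alias -> canonical name.
ALIAS_INDEX = {
    alias.upper(): canonical
    for canonical, aliases in CHARSET_ALIASES.items()
    for alias in aliases
}

def canonical_charset(name):
    """Return the canonical charset for an alias, or None if it is unknown."""
    # Assumption: aliases are matched case-insensitively.
    return ALIAS_INDEX.get(name.strip().upper())

print(canonical_charset("latin2"))   # iso-8859-2
print(canonical_charset("Utf-8"))    # utf8
print(canonical_charset("klingon"))  # None
```

Building the reverse index once keeps each lookup to a single dictionary access, however many aliases a charset has.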