Huge Collections of Software Manuals and Knowledgebase

GreatManuals.com

Character Sets

Supported character sets

mnoGoSearch supports almost all known 8-bit character sets as well as some multi-byte charsets, including Korean euc-kr, Chinese big5 and gb2312, Japanese shift-jis, euc-jp and iso-2022-jp, and utf8.

mnoGoSearch also supports the following Macintosh character sets: MacCE, MacCroatian, MacGreek, MacRoman, MacTurkish, MacIceland, MacRomania, MacThai, MacArabic, MacHebrew, MacCyrillic, MacGujarati.

Several languages in one database

It is often necessary to deal with several languages simultaneously. The number of supported languages depends on the character set that mnoGoSearch uses to store data. The character set is specified with the LocalCharset command.

UTF-8 mode

When UTF-8 is specified in the LocalCharset command, you can work with any language supported by Unicode, that is, any number of over 650 languages. However, UTF-8 may consume considerably more disk space (up to twice as much for some languages), which slows down indexing and searching.
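
The disk-space cost mentioned above can be estimated with a short Python sketch. It is not part of mnoGoSearch; the sample phrases and the size_ratio helper are assumptions chosen for illustration. It compares the byte length of the same text in a single-byte charset and in UTF-8.

```python
# Compare the storage cost of the same text in a single-byte charset and in UTF-8.
# The sample phrases are arbitrary; the codec names are Python's, not mnoGoSearch's.
SAMPLES = {
    "English": ("search engine", "cp1252"),
    "Russian": ("поисковая система", "koi8_r"),
    "Greek": ("μηχανή αναζήτησης", "iso8859_7"),
}


def size_ratio(text: str, single_byte_codec: str) -> float:
    """Return len(UTF-8 bytes) / len(single-byte bytes) for the given text."""
    return len(text.encode("utf-8")) / len(text.encode(single_byte_codec))


for language, (text, codec) in SAMPLES.items():
    print(f"{language}: UTF-8 is {size_ratio(text, codec):.2f}x the single-byte size")
```

Latin text costs the same in both encodings, while Cyrillic and Greek letters take two bytes each in UTF-8, which is where the "up to twice" figure comes from.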

non-UTF-8 mode

Since every character set includes Latin characters, every character set supports at least two languages: English and one or more others. US-ASCII is an exception: it supports only Latin characters.

Note

When using mnoGoSearch in standard (non-UTF-8) mode, you may use as many languages as you like, provided they all belong to the same language group.

Table 11.1. Language groups

Language group | Languages | Character sets
Group 1 | Western Europe: Albanian, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, Swedish | ASCII 8, CP437, CP850, CP860, CP1252, ISO 8859-1, ISO 8859-15, MacRoman, MacIceland
Group 2 | Eastern Europe: Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Romanian, Slovak, Slovene | CP852, CP1250, ISO 8859-2, MacCentralEurope, MacRomania, MacCroatian
Group 4 | Baltic | CP1257, iso-8859-4, iso-8859-13
Group 5 | Cyrillic: Bulgarian, Byelorussian, Macedonian, Russian, Serbian, Ukrainian | CP855, CP866, CP1251, ISO 8859-5, Koi8-r, Koi8-u, MacCyrillic
Group 6 | Arabic | CP864, CP1256, ISO 8859-6, MacArabic
Group 7 | Greek | CP869, CP1253, ISO 8859-7, MacGreek
Group 8 | Hebrew | CP1255, ISO 8859-8, MacHebrew
Group 9 | Turkish | CP857, CP1254, ISO 8859-9, MacTurkish
Group 101 | Japanese | Shift-JIS, EUC-JP, ISO-2022-JP
Group 102 | Simplified Chinese (PRC) | EUC-GB
Group 103 | Traditional Chinese (ROC) | Big 5
Group 104 | Korean | EUC-KR
Group 105 | Thai | CP874, TIS 620, MacThai
Group 106 | Vietnamese | CP1258
Group 107 | Indian | MacGujarati
Group 108 | Georgian | geostd8
Unicode | Over 650 languages | UTF-8 (Unicode)

For example, if your search engine is configured to use a LocalCharset from group 5 (Cyrillic), you may index servers containing documents in Bulgarian, Byelorussian, Macedonian, Russian, Serbian and Ukrainian. Indexing a multi-language document encoded in UTF-8 is possible as well; however, indexer will extract and save only the Cyrillic content from the page. To support over 650 languages, use the LocalCharset utf-8 command.
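
The language-group rule can be approximated in a few lines of Python. This is a sketch only, not mnoGoSearch code: the LANGUAGE_GROUPS dictionary copies three rows of Table 11.1 with normalized charset names, and the group_of and content_survives helpers are hypothetical names introduced for illustration.

```python
from typing import Optional

# Three rows of the language-group table, keyed by group number.
# Charset names are lowercased and hyphenated here purely for easy comparison.
LANGUAGE_GROUPS = {
    2: {"cp852", "cp1250", "iso-8859-2"},
    5: {"cp855", "cp866", "cp1251", "iso-8859-5", "koi8-r", "koi8-u", "maccyrillic"},
    7: {"cp869", "cp1253", "iso-8859-7", "macgreek"},
}


def group_of(charset: str) -> Optional[int]:
    """Return the language group of a charset, or None if it is not in the subset."""
    name = charset.lower()
    for group, charsets in LANGUAGE_GROUPS.items():
        if name in charsets:
            return group
    return None


def content_survives(local_charset: str, document_charset: str) -> bool:
    """True if a document's national characters fit into the chosen LocalCharset."""
    if local_charset.lower() in ("utf-8", "utf8"):
        return True  # the Unicode row of the table covers every group
    group = group_of(local_charset)
    return group is not None and group == group_of(document_charset)


print(content_survives("koi8-r", "cp1251"))      # both in group 5
print(content_survives("koi8-r", "iso-8859-7"))  # Cyrillic store, Greek document
print(content_survives("utf-8", "iso-8859-7"))   # UTF-8 store accepts any group
```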

Character set conversion

indexer converts all documents to the character set specified by the LocalCharset command in indexer.conf. Internally, the conversion goes through Unicode. Note that some conversions lose data. For example, conversion between any Greek and any Russian charset loses all national characters. This does not matter for single-language sites. If you want to build a multilingual search engine, use UTF-8 as the LocalCharset.
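
The data loss described here can be reproduced with Python's built-in codecs. The sketch below is not mnoGoSearch's converter: the convert helper is a hypothetical name, and replacing unmappable characters with "?" is Python's behavior, assumed here only to make the loss visible.

```python
def convert(data: bytes, source_charset: str, local_charset: str) -> bytes:
    """Convert bytes between charsets via Unicode; unmappable characters become '?'."""
    text = data.decode(source_charset)                   # source charset -> Unicode
    return text.encode(local_charset, errors="replace")  # Unicode -> local charset


# A Greek page fragment with some Latin text, stored as ISO 8859-7 bytes.
greek_page = "Ελληνικά abc".encode("iso8859_7")

as_koi8r = convert(greek_page, "iso8859_7", "koi8_r")  # Greek -> Russian charset
as_utf8 = convert(greek_page, "iso8859_7", "utf-8")    # Greek -> UTF-8

print(as_koi8r)                 # Greek letters are lost, Latin letters survive
print(as_utf8.decode("utf-8"))  # everything survives
```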

Character set conversion at search time

You may use the BrowserCharset command to choose the charset in which search results are displayed. BrowserCharset may differ from LocalCharset.
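
As a rough illustration of search-time conversion, the Python sketch below re-encodes a stored string from one Cyrillic charset to another before display. The two constants stand in for LocalCharset and BrowserCharset settings and, like the to_browser helper, are assumptions for this example rather than mnoGoSearch internals.

```python
LOCAL_CHARSET = "koi8_r"    # assumed storage charset (LocalCharset)
BROWSER_CHARSET = "cp1251"  # assumed display charset (BrowserCharset)


def to_browser(stored: bytes) -> bytes:
    """Re-encode stored bytes from the local charset into the browser charset."""
    return stored.decode(LOCAL_CHARSET).encode(BROWSER_CHARSET, errors="replace")


stored_title = "Результаты поиска".encode(LOCAL_CHARSET)
shown = to_browser(stored_title)

# The bytes differ, but the text is unchanged because both charsets are Cyrillic.
print(shown != stored_title)
print(shown.decode(BROWSER_CHARSET))
```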

Character set aliases

Each charset is recognized by a number of aliases, because web servers can report the same charset under different names. For example, iso-8859-2, iso8859-2 and latin2 all refer to the same charset. The search engine understands the following charset name aliases:

Table 11.2. Charset aliases

iso-8859-1: CP819, CSISOLATIN, IBM819, ISO-8859-1, ISO-IR-100, ISO_8859-1, ISO_8859-1:1987, L1, LATIN1
iso-8859-10: CSISOLATIN6, ISO-8859-10, ISO-IR-157, ISO_8859-10, ISO_8859-10:1992, L6, LATIN6
iso-8859-11: ISO-8859-11, TIS-620, TIS620, TACTIS
iso-8859-13: ISO-8859-13, ISO-IR-179, ISO_8859-13, L7, LATIN7
iso-8859-14: ISO-8859-14, ISO-IR-199, ISO_8859-14, ISO_8859-14:1998, L8, LATIN8
iso-8859-15: ISO-8859-15, ISO-IR-203, ISO_8859-15, ISO_8859-15:1998
iso-8859-16: ISO-8859-16, ISO-IR-226, ISO_8859-16, ISO_8859-16:2000
iso-8859-2: CSISOLATIN2, ISO-8859-2, ISO-IR-101, ISO_8859-2, ISO_8859-2:1987, L2, LATIN2
iso-8859-3: CSISOLATIN3, ISO-8859-3, ISO-IR-109, ISO_8859-3, ISO_8859-3:1988, L3, LATIN3
iso-8859-4: CSISOLATIN4, ISO-8859-4, ISO-IR-110, ISO_8859-4, ISO_8859-4:1988, L4, LATIN4
iso-8859-5: CSISOLATINCYRILLIC, CYRILLIC, ISO-8859-5, ISO-IR-144, ISO_8859-5, ISO_8859-5:1988
iso-8859-6: ARABIC, ASMO-708, CSISOLATINARABIC, ECMA-114, ISO-8859-6, ISO-IR-127, ISO_8859-6, ISO_8859-6:1987
iso-8859-7: CSISOLATINGREEK, ECMA-118, ELOT_928, GREEK, GREEK8, ISO-8859-7, ISO-IR-126, ISO_8859-7, ISO_8859-7:1987
iso-8859-8: CSISOLATINHEBREW, HEBREW, ISO-8859-8, ISO-IR-138, ISO_8859-8, ISO_8859-8:1988
iso-8859-9: CSISOLATIN5, ISO-8859-9, ISO-IR-148, ISO_8859-9, ISO_8859-9:1989, L5, LATIN5
armscii-8: ARMSCII-8
big5: BIG-5, BIG-FIVE, BIG5, BIGFIVE, CN-BIG5, CSBIG5
cp1250: CP1250, MS-EE, WINDOWS-1250
cp1251: CP1251, MS-CYRL, WINDOWS-1251
cp1252: CP1252, MS-ANSI, WINDOWS-1252
cp1253: CP1253, MS-GREEK, WINDOWS-1253
cp1254: CP1254, MS-TURK, WINDOWS-1254
cp1255: CP1255, MS-HEBR, WINDOWS-1255
cp1256: CP1256, MS-ARAB, WINDOWS-1256
cp1257: CP1257, WINBALTRIM, WINDOWS-1257
cp1258: CP1258, WINDOWS-1258
cp437: 437, CP437, IBM437
cp850: 850, CP850, CSPC850MULTILINGUAL, IBM850
cp852: 852, CP852, IBM852
cp855: 855, CP855, IBM855
cp857: 857, CP857, IBM857
cp860: 860, CP860, IBM860
cp861: 861, CP861, IBM861
cp862: 862, CP862, IBM862
cp863: 863, CP863, IBM863
cp864: 864, CP864, IBM864
cp865: 865, CP865, IBM865
cp866: 866, CP866, CSIBM866, IBM866
cp869: 869, CP869, IBM869
cp874: CP874, WINDOWS-874
euc-kr: CSEUCKR, EUC-KR, EUCKR
gb2312: CHINESE, CSGB2312, CSISO58GB231280, GB2312, GB_2312-80, ISO-IR-58
koi8-r: CSKOI8R, KOI8-R
koi8-u: KOI8-U
shift-jis: CSSHIFTJIS, MS_KANJI, S-JIS, SHIFT-JIS, SHIFT_JIS, SJIS
cp367: ANSI_X3.4-1968, ASCII, CP367, CSASCII, IBM367, ISO-IR-6, ISO646-US, ISO_646.IRV:1991, US, US-ASCII
utf8: UTF-8, UTF8
viscii: CSVISCII, VISCII, VISCII1.1-1
maccyrillic: MACCYRILLIC, X-MAC-CYRILLIC
macroman: MACROMAN, MACINTOSH, CSMACINTOSH, MAC
MacCentralEurope: MACCENTRALEUROPE, MACCE
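
Alias resolution of this kind amounts to a lookup table. The Python sketch below copies four rows of the alias table and maps any listed alias to its canonical name. The canonical_charset helper is a hypothetical name, and case-insensitive matching is an assumption, not documented mnoGoSearch behavior.

```python
from typing import Optional

# Four rows copied from the alias table: canonical name -> known aliases.
ALIAS_TABLE = {
    "iso-8859-2": ["CSISOLATIN2", "ISO-8859-2", "ISO-IR-101", "ISO_8859-2",
                   "ISO_8859-2:1987", "L2", "LATIN2"],
    "cp1251": ["CP1251", "MS-CYRL", "WINDOWS-1251"],
    "koi8-r": ["CSKOI8R", "KOI8-R"],
    "utf8": ["UTF-8", "UTF8"],
}

# Invert the table once so each lookup is a single dictionary access.
CANONICAL = {
    alias.lower(): name
    for name, aliases in ALIAS_TABLE.items()
    for alias in aliases
}


def canonical_charset(reported: str) -> Optional[str]:
    """Map a charset name reported by a web server to its canonical name."""
    return CANONICAL.get(reported.strip().lower())


print(canonical_charset("Latin2"))
print(canonical_charset("windows-1251"))
print(canonical_charset("x-unknown"))  # not in the table, so None
```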