mnoGoSearch HTML Parser
Tag parser
The tag parser understands the following tag notations:
- < ... parameter=value ... >
- < ... parameter="value" ... >
- < ... parameter='value' ... >
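The three notations can be illustrated with a small sketch. This is not mnoGoSearch code: `ATTR_RE` and `parse_attributes` are hypothetical names, and the regular expression only assumes that a value is either double-quoted, single-quoted, or a run of characters without whitespace, quotes, or `>`.

```python
import re

# Hypothetical pattern, not taken from mnoGoSearch: one alternative per
# notation (double-quoted, single-quoted, unquoted value).
ATTR_RE = re.compile(
    r"""([A-Za-z_:][-\w:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))"""
)

def parse_attributes(tag: str) -> dict:
    """Return the attributes of a single tag as {lower-cased name: value}."""
    attrs = {}
    for name, double_quoted, single_quoted, bare in ATTR_RE.findall(tag):
        # Exactly one of the three value groups is non-empty
        # (or all are empty for an empty quoted value).
        attrs[name.lower()] = double_quoted or single_quoted or bare
    return attrs

print(parse_attributes('<A HREF=page.html>'))
print(parse_attributes('<A HREF="page.html">'))
print(parse_attributes("<A HREF='page.html'>"))
```

All three calls yield the same dictionary, which is the point of accepting every notation.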
Special characters
The indexer understands the following special HTML characters:
- &lt; &gt; &amp; &quot;
- All SGML ISO-8859-1 entities: &auml; &uuml; and others.
- Characters in their Unicode notation: &#234;
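To see what decoding these character references means in practice, here is a short sketch that uses Python's standard `html.unescape` as a stand-in. It is only an illustration: `decode_entities` is a hypothetical wrapper, and Python's table covers the full HTML5 entity set, which is broader than the list above.

```python
import html

def decode_entities(text: str) -> str:
    """Decode named and numeric character references (stand-in for the indexer)."""
    return html.unescape(text)

print(decode_entities("&lt;b&gt; &amp; &quot;"))  # markup characters
print(decode_entities("&auml; &uuml;"))           # ISO-8859-1 named entities
print(decode_entities("&#234;"))                  # numeric (Unicode) notation
```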
META tags
The indexer's HTML parser currently understands the following special META tags. Note that "HTTP-EQUIV" may be used instead of "NAME" in all entries.
- <META NAME="Content-Type" Content="text/html; charset=xxxx"> This is used to detect the document character set if it is not specified in the "Content-Type" HTTP header.
- <META NAME="REFRESH" Content="5; URL=http://www.somewhere.com"> The URL value is inserted into the database.
- <META NAME="Robots" Content="xxx"> with a content value of ALL, NONE, INDEX, NOINDEX, FOLLOW, or NOFOLLOW.
The content of any META tag, including these three, can be indexed and stored in the database. Go to the Section tab to configure the list of META tags you'd like to index.
The KEYWORDS and DESCRIPTION META tags are indexed by default.
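The following sketch shows one way the Content values of the three special META tags could be interpreted. It is not the indexer's implementation: the function names are hypothetical, and the meaning given to the Robots keywords (ALL enables indexing and link following, NONE disables both) follows the common robots META convention, which mnoGoSearch is assumed to share.

```python
import re

def charset_from_content_type(content: str):
    """Return the charset from e.g. 'text/html; charset=xxxx', or None."""
    match = re.search(r"charset\s*=\s*[\"']?([^\s;\"']+)", content, re.IGNORECASE)
    return match.group(1) if match else None

def url_from_refresh(content: str):
    """Return the target URL from e.g. '5; URL=http://www.somewhere.com', or None."""
    match = re.search(r"url\s*=\s*(\S+)", content, re.IGNORECASE)
    return match.group(1) if match else None

def robots_flags(content: str):
    """Map a Robots value to (index, follow); both default to True (assumed)."""
    index = follow = True
    for word in content.upper().replace(",", " ").split():
        if word == "ALL":
            index = follow = True
        elif word == "NONE":
            index = follow = False
        elif word == "INDEX":
            index = True
        elif word == "NOINDEX":
            index = False
        elif word == "FOLLOW":
            follow = True
        elif word == "NOFOLLOW":
            follow = False
    return index, follow

print(charset_from_content_type("text/html; charset=xxxx"))
print(url_from_refresh("5; URL=http://www.somewhere.com"))
print(robots_flags("NOINDEX, FOLLOW"))
```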
Links
The HTML parser understands the following links:
- <A HREF="xxx">
- <IMG SRC="xxx">
- <LINK HREF="xxx">
- <FRAME SRC="xxx">
- <AREA HREF="xxx">
- <BASE HREF="xxx">
If the BASE HREF value is an incorrectly formed URL, the URL of the current document is used instead to compose relative links.
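The link handling and the BASE HREF fallback can be sketched with Python's standard library. This is an illustration under stated assumptions, not the indexer's code: `LinkCollector` and `extract_links` are hypothetical names, and "correctly formed" is assumed here to mean that the BASE URL has both a scheme and a host.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Tag -> attribute carrying the link, as listed above (BASE is handled separately).
LINK_ATTRS = {"a": "href", "img": "src", "link": "href", "frame": "src", "area": "href"}

class LinkCollector(HTMLParser):
    def __init__(self, document_url: str):
        super().__init__()
        self.base = document_url  # fallback: URL of the current document
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # HTMLParser lower-cases tag and attribute names
        if tag == "base":
            candidate = attrs.get("href") or ""
            try:
                parsed = urlparse(candidate)
            except ValueError:
                return  # unparsable BASE: keep the current document URL
            # Assumption: a usable BASE needs a scheme and a host.
            if parsed.scheme and parsed.netloc:
                self.base = candidate
            return
        attr = LINK_ATTRS.get(tag)
        if attr and attrs.get(attr):
            self.links.append(urljoin(self.base, attrs[attr]))

def extract_links(html_text: str, document_url: str) -> list:
    """Return absolute URLs for every supported link in html_text."""
    collector = LinkCollector(document_url)
    collector.feed(html_text)
    return collector.links

page = '<BASE HREF="not a url"><A HREF="a.html">x</A><IMG SRC="/i.png">'
print(extract_links(page, "http://www.somewhere.com/dir/index.html"))
```

Because the BASE value in the usage example is malformed, both links are resolved against the document's own URL.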
Comments
- Text inside a <!-- .... --> tag is recognized as an HTML comment.
- You may use the special <!--UdmComment--> .... <!--/UdmComment--> comment tags to exclude the text between them from indexing. This is useful for hiding things such as menus from the indexer.
- You may also use <NOINDEX> ... </NOINDEX> as synonyms for <!--UdmComment--> and <!--/UdmComment-->.
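The effect of these exclusion markers can be approximated with a short sketch. It is a simplification rather than the parser's actual behavior: `strip_unindexed` is a hypothetical helper, the markers are matched case-insensitively as an assumption, and nested or unbalanced markers are not handled.

```python
import re

# Regions excluded from indexing: UdmComment pairs and NOINDEX pairs.
EXCLUDED = re.compile(
    r"<!--UdmComment-->.*?<!--/UdmComment-->|<NOINDEX>.*?</NOINDEX>",
    re.IGNORECASE | re.DOTALL,  # case-insensitivity is an assumption
)
# Ordinary HTML comments.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_unindexed(html_text: str) -> str:
    """Remove excluded regions first, then any remaining HTML comments."""
    without_regions = EXCLUDED.sub("", html_text)
    return COMMENT.sub("", without_regions)

sample = "Intro <!--UdmComment-->site menu<!--/UdmComment-->body <!-- note -->text"
print(strip_unindexed(sample))
```

The excluded regions are removed before ordinary comments; otherwise the opening and closing markers, which are themselves comments, would be stripped and the text between them would be kept.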