Class HtmlEncodingDetector

java.lang.Object
org.apache.tika.parser.html.HtmlEncodingDetector
All Implemented Interfaces:
Serializable, org.apache.tika.detect.EncodingDetector

public class HtmlEncodingDetector extends Object implements org.apache.tika.detect.EncodingDetector
Character encoding detector that determines the character encoding of an HTML document from the charset parameter of a Content-Type http-equiv meta tag, if one appears near the beginning of the document. It is especially useful for distinguishing among closely related encodings (such as the ISO-8859-* family), for which other encoding detection methods are unreliable.
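The idea of reading a declared charset out of a meta tag can be illustrated with a minimal, standard-library-only sketch. This is not Tika's implementation: the class name MetaCharsetSketch, the method findDeclaredCharset, and the regular expression are all hypothetical, and the pattern assumes the http-equiv attribute appears before the content attribute.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the Tika implementation.
class MetaCharsetSketch {

    // Assumed, simplified pattern: a <meta> tag whose http-equiv attribute is
    // "Content-Type", followed later in the same tag by a charset parameter.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\s[^>]*http-equiv\\s*=\\s*[\"']?content-type[\"']?"
                    + "[^>]*charset\\s*=\\s*[\"']?([a-z0-9_:.\\-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in the given document prefix, or null if none is found. */
    static Charset findDeclaredCharset(byte[] prefix) {
        // ISO-8859-1 maps every byte to a character, so the ASCII markup
        // stays readable whatever the real encoding turns out to be.
        String head = new String(prefix, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return null;
        }
    }
}
```

Returning null when no usable declaration is found leaves room for a caller to fall back to another detection method.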
Since:
Apache Tika 1.2
  • Constructor Details

    • HtmlEncodingDetector

      public HtmlEncodingDetector()
  • Method Details

    • detect

      public Charset detect(InputStream input, org.apache.tika.metadata.Metadata metadata) throws IOException
      Specified by:
      detect in interface org.apache.tika.detect.EncodingDetector
      Throws:
      IOException
    • getMarkLimit

      public int getMarkLimit()
    • setMarkLimit

      @Field public void setMarkLimit(int markLimit)
      Sets how far into the stream to read, in bytes, when looking for a charset declaration. The default is 8192.
      Parameters:
      markLimit - the maximum number of bytes to read from the start of the stream
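The effect of a mark limit can be sketched with the standard InputStream mark/reset contract. The code below is an illustration under assumptions, not Tika's code: MarkLimitSketch and declaredCharsetName are hypothetical names, and the charset search is deliberately simplistic. It reads at most markLimit bytes, rewinds the stream, and reports a declaration only if it falls inside that window.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the Tika implementation.
class MarkLimitSketch {

    // Assumed, simplified search for any "charset=<name>" parameter.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9_.:\\-]+)");

    /** Returns the charset name declared within the first markLimit bytes, or null. */
    static String declaredCharsetName(InputStream input, int markLimit) throws IOException {
        if (!input.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        input.mark(markLimit);
        byte[] buffer = new byte[markLimit];
        int total = 0;
        try {
            int n;
            // Keep reading until the window is full or the stream ends.
            while (total < markLimit
                    && (n = input.read(buffer, total, markLimit - total)) != -1) {
                total += n;
            }
        } finally {
            // Rewind so the caller can still consume the stream from the start.
            input.reset();
        }
        String head = new String(buffer, 0, total, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

In this sketch a larger limit finds declarations that appear later in the document, at the cost of buffering more bytes, while a small limit can miss a declaration that sits after long leading markup.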