GenericHTMLPipeline#

GenericHTMLPipeline indexes crawled data with MIME type text/html. You can activate it by configuring the ITEM_PIPELINES setting in your Scrapy settings as follows.

from langsearch.pipelines import assemble, DetectItemTypePipeline, GenericHTMLPipeline

ITEM_PIPELINES = {
    DetectItemTypePipeline: 100,
    **assemble(GenericHTMLPipeline)
}

When used alone (as in the example above), the pipeline discards any crawled data that does not match the MIME type text/html.

GenericHTMLPipeline consists of the following pipeline components applied in sequence.

  1. FixHTMLPipeline: Tries to fix broken HTML documents using lxml.

  2. PythonReadabilityPipeline: Removes boilerplate from the HTML document.

  3. InscriptisPipeline: Extracts text from the HTML document.

  4. TextSplitterPipeline: Splits the extracted text into smaller passages.

  5. StoreItemPipeline: Stores the extracted text in a Crawl DB. The Crawl DB is used to make re-crawling more efficient.

  6. SimpleIndexPipeline: Indexes the text passages in the Weaviate vector database (a query sketch follows this list).
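
Once a crawl has finished, the passages indexed by SimpleIndexPipeline can be queried directly from Weaviate. The following is a minimal sketch using the Weaviate Python client (v3.x), assuming the Weaviate instance described under Service requirements is running on localhost:8080; the class name "Passage" and the property names are placeholders, not necessarily the schema langsearch creates, so inspect your Weaviate schema for the real names.

import weaviate

# Placeholder class and property names -- check your Weaviate schema for the
# names langsearch actually uses.
client = weaviate.Client("http://localhost:8080")
result = (
    client.query
    .get("Passage", ["text", "url"])
    .with_near_text({"concepts": ["example question"]})
    .with_limit(5)
    .do()
)
print(result)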

Service requirements#

The GenericHTMLPipeline expects a Weaviate database to be available. Therefore, you need to make a Weaviate instance available before running the scrapy crawl command.

To make a Weaviate database available, create a docker-compose.yml file and add the following services to it.

version: "3.4"
services:
  weaviate:
    image: semitechnologies/weaviate:1.18.1
    restart: on-failure:0
    ports:
     - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 20
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
      ENABLE_MODULES: text2vec-transformers
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      CLUSTER_HOSTNAME: "langchain_qa"
    volumes:
      - ./weaviate_data:/var/lib/weaviate
    depends_on:
      - t2v-transformers
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-gtr-t5-base
    environment:
      ENABLE_CUDA: 0  # Change this to 1 to use your GPU
    # Uncomment the following to use your NVIDIA GPU
    #deploy:
    #  resources:
    #    reservations:
    #      devices:
    #        - driver: nvidia
    #          count: 1
    #          capabilities: [ gpu ]

Change the CLUSTER_HOSTNAME to any name you prefer.

This docker-compose.yml starts Weaviate with a configuration that works seamlessly with the pipeline components.

To make the Weaviate DB available, run the following command (you need to have Docker installed).

docker compose up
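
Before starting the crawl, you can check that Weaviate is reachable. A minimal sketch with the Weaviate Python client (v3.x) is shown below; it assumes the 8080:8080 port mapping from the docker-compose.yml above.

import weaviate

# Assumes the "8080:8080" port mapping from the docker-compose.yml above.
client = weaviate.Client("http://localhost:8080")
print(client.is_ready())  # True once Weaviate has finished starting up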

Note

The DetectItemTypePipeline also needs the Apache Tika service to do its job. The service can be omitted only in the special case where your crawler exclusively fetches items of MIME type text/html and the webpages are well behaved, i.e. they send correct Content-Type headers. If your situation deviates from this, the crawler will stop and complain that it can’t find the Apache Tika service. To solve this, add the Apache Tika service to the docker-compose.yml file.

version: "3.4"
services:
  weaviate:
    image: semitechnologies/weaviate:1.18.1
    restart: on-failure:0
    ports:
     - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 20
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
      ENABLE_MODULES: text2vec-transformers
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      CLUSTER_HOSTNAME: "deeprl_course"
    volumes:
      - ./weaviate_data:/var/lib/weaviate
    depends_on:
      - t2v-transformers
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-gtr-t5-base
    environment:
      ENABLE_CUDA: 0  # Change this to 1 to use your GPU
    # Uncomment the following to use your NVIDIA GPU
    #deploy:
    #  resources:
    #    reservations:
    #      devices:
    #        - driver: nvidia
    #          count: 1
    #          capabilities: [ gpu ]

  tika:
    image: apache/tika:latest-full
    ports:
      - "9998:9998"

Please set the following environment variables before starting the crawl so that the crawler can access the Tika service.

export TIKA_CLIENT_ONLY="True"
export TIKA_SERVER_ENDPOINT="http://localhost:9998"
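
If you launch the crawl from a script instead of a shell, the same variables can be set from Python before Scrapy starts. The snippet below is a sketch that also performs a quick reachability check against the Tika server; it assumes the 9998:9998 port mapping from the docker-compose.yml above.

import os
import urllib.request

# Equivalent to the shell exports above; set these before the crawl starts.
os.environ["TIKA_CLIENT_ONLY"] = "True"
os.environ["TIKA_SERVER_ENDPOINT"] = "http://localhost:9998"

# Quick reachability check: a plain GET on /tika returns a short greeting
# from a running Tika server.
with urllib.request.urlopen("http://localhost:9998/tika") as response:
    print(response.read().decode())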