# GenericImagePipeline

`GenericImagePipeline` indexes crawled data with MIME type `image/*`. You can activate it by writing your `ITEM_PIPELINES` setting as follows.
```python
from langsearch.pipelines import assemble, DetectItemTypePipeline
from langsearch.pipelines.types.image.imagepipeline import GenericImagePipeline

ITEM_PIPELINES = {
    DetectItemTypePipeline: 100,
    **assemble(GenericImagePipeline)
}
```
When used alone (as in the code example above), the pipeline will discard any crawled data that does not match the MIME type `image/*`.
`GenericImagePipeline` consists of the following pipeline components, applied in sequence:

- `ResizeImagePipeline`: Resizes images (usually making them smaller) to save space in the persistence layer, because image search does not require high-resolution images.
- `StoreItemPipeline`: Stores the image in a Crawl DB. The Crawl DB is used to make re-crawling more efficient.
- `ImageIndexPipeline`: Indexes the image in the Weaviate vector database (see the query sketch below).
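After a crawl has run, you can check that images were indexed by querying Weaviate directly. The sketch below assumes the `weaviate-client` v3 Python package and the services from the Service requirements section below; the class name `Image` and the property `url` are assumptions, so inspect the schema first to find the names the pipeline actually creates.

```python
import weaviate

# Connect to the local Weaviate instance (see the docker-compose.yml below)
client = weaviate.Client("http://localhost:8080")

# Print the schema to see which class ImageIndexPipeline created
print(client.schema.get())

# "Image" and "url" are assumed names; replace them with the class and
# property reported by the schema call above
result = (
    client.query
    .get("Image", ["url"])
    .with_near_text({"concepts": ["a mountain at sunset"]})
    .with_additional(["certainty"])
    .with_limit(5)
    .do()
)
print(result)
```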
## Service requirements
The `GenericImagePipeline` expects a Weaviate database to be available. It also needs an Apache Tika service to be up and running. Therefore, you need to make these services available before running the `scrapy crawl` command.
To do that, create a `docker-compose.yml` file and add the following services to it.
```yaml
version: "3.4"
services:
  weaviate:
    image: semitechnologies/weaviate:1.18.1
    restart: on-failure:0
    ports:
      - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 20
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
      ENABLE_MODULES: "multi2vec-clip"
      CLIP_INFERENCE_API: "http://multi2vec-clip:8080"
      CLUSTER_HOSTNAME: "unsplash_nature"
    volumes:
      - ./weaviate_data:/var/lib/weaviate
    depends_on:
      - multi2vec-clip
  multi2vec-clip:
    image: semitechnologies/multi2vec-clip:sentence-transformers-clip-ViT-B-32-multilingual-v1
    environment:
      ENABLE_CUDA: 0  # Change this to 1 to use your GPU
    # Uncomment the following to use your NVIDIA GPU
    #deploy:
    #  resources:
    #    reservations:
    #      devices:
    #        - driver: nvidia
    #          count: 1
    #          capabilities: [ gpu ]
  tika:
    image: apache/tika:latest-full
    ports:
      - "9998:9998"
```
Change the `CLUSTER_HOSTNAME` to any name you prefer.
This `docker-compose.yml` file starts Weaviate and Apache Tika with a configuration that works seamlessly with the pipeline components.
To make the services available, run the following command (you need to have Docker installed).

```bash
docker compose up
```
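Once the containers are up, you can optionally check that both services respond before starting the crawl. The commands below use the standard Weaviate readiness endpoint and the Tika server greeting, on the ports mapped in the `docker-compose.yml` above.

```bash
# Weaviate readiness probe: should return HTTP 200 when the database is ready
curl -i http://localhost:8080/v1/.well-known/ready

# Apache Tika greeting: should return a short "This is Tika Server ..." message
curl http://localhost:9998/tika
```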
Set the following environment variables before starting the crawl so that the crawler can access the Apache Tika service.

```bash
export TIKA_CLIENT_ONLY="True"
export TIKA_SERVER_ENDPOINT="http://localhost:9998"
```
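With the services running and the environment variables set, start the crawl as usual. The spider name below is a placeholder for your own spider.

```bash
# "my_spider" is a placeholder; use the name of your own Scrapy spider
scrapy crawl my_spider
```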