Quickstart#
Introduction#
In this quickstart guide, we will use LangSearch to create a QA application using ChatGPT. Specifically, we will ask questions about the popular LangChain package, which was released after ChatGPT’s training cutoff. Therefore, ChatGPT has no built-in knowledge of LangChain.
The QA application will have the following features.
The answers will contain fewer hallucinations than ChatGPT ordinarily produces
The answers will cite sources
The answers will always be up-to-date and correspond to the latest version of the documentation
Building such an application requires the following steps.
Crawling selected pages from the LangChain documentation. These pages will be used as the information source for question answering.
Removing boilerplate text from the downloaded HTML document to prevent information pollution.
Extracting the main text content from the HTML document.
Splitting long pages into smaller sections.
Indexing the sections in a vector database like Weaviate.
Building a semantic search based QA app using Weaviate and ChatGPT.
Running the crawler process periodically to keep our data up-to-date. To make this cost-efficient (especially when using paid embeddings APIs), we need to keep a record of each crawl, so that we only re-index changed pages.
LangSearch makes the above steps easy and accessible. Let’s see how by building the QA application from scratch.
We will start by installing LangSearch.
pip install langsearch
Then create a folder quickstart to hold our QA application.
mkdir quickstart && cd quickstart
Crawling selected pages of the LangChain documentation#
LangSearch uses Scrapy under the hood for crawling web pages. Every LangSearch project starts by creating a Scrapy project.
scrapy startproject langchain_qa
This will create a langchain_qa folder with the following content.
.
├── langchain_qa
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
To crawl the LangChain website, we need to create a crawler. Create a file crawler.py inside the spiders folder, and put the following code in the file.
from langsearch.spiders import WebSpider


class Crawler(WebSpider):
    name = "langchain"
That’s our crawler.
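Before moving on, you can verify that Scrapy has picked up the spider by running the following command from inside the langchain_qa folder; it should print the spider name langchain.
scrapy list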
We will control some important aspects of the crawling process in the settings.py file. So go over to that file and add the following code.
LANGSEARCH_WEB_SPIDER_START_URLS = ["https://python.langchain.com/docs/get_started/introduction"]
LANGSEARCH_WEB_SPIDER_LINK_EXTRACTOR_ALLOW = [
    r"https://python\.langchain\.com/docs/get_started",
    r"https://python\.langchain\.com/docs/modules",
    r"https://python\.langchain\.com/docs/guides",
    r"https://python\.langchain\.com/docs/ecosystem",
    r"https://python\.langchain\.com/docs/additional_resources",
]
AUTOTHROTTLE_ENABLED = True
The above settings tell the crawler to start crawling from https://python.langchain.com/docs/get_started/introduction and to only follow links that match the regular expressions in LANGSEARCH_WEB_SPIDER_LINK_EXTRACTOR_ALLOW.
The AUTOTHROTTLE_ENABLED = True setting is a Scrapy setting that ensures that we don’t hit the website too hard.
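If you want to be extra gentle with the website, you can optionally add Scrapy’s standard DOWNLOAD_DELAY setting as well; when autothrottling is enabled, it acts as a lower bound on the delay between consecutive requests to the same site.
DOWNLOAD_DELAY = 1  # optional: wait at least one second between requests to the same site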
Removing boilerplate, extracting text, splitting text and indexing#
LangSearch provides generic pipelines that orchestrate boilerplate removal, text extraction, text splitting and indexing for various MIME types. Therefore, you only need a couple of lines of code in the settings.py file for all these steps.
from langsearch.pipelines import assemble, DetectItemTypePipeline, GenericHTMLPipeline

ITEM_PIPELINES = {
    DetectItemTypePipeline: 100,
    **assemble(GenericHTMLPipeline)
}
Running the crawler#
To run the crawler, first download this docker compose file and place it in the quickstart folder. This docker compose file is responsible for running the Weaviate vector database. Start the Weaviate instance with the following command (you need to have docker installed on your system).
docker compose up
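You can check that Weaviate has come up before starting the crawl by hitting its readiness endpoint (this assumes the compose file maps Weaviate to the default port 8080); a 200 response means the instance is ready.
curl -i http://localhost:8080/v1/.well-known/ready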
Then start the crawl by going inside the Scrapy project, i.e. the langchain_qa folder, and issuing the following command.
scrapy crawl langchain
Create a QA app#
Once the crawl has finished, you can immediately start using the index to answer questions.
First, make sure that your terminal knows your OpenAI API key.
export OPENAI_API_KEY="..."
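Alternatively, you can set the key from Python before using the chain, since the underlying OpenAI client reads the standard OPENAI_API_KEY environment variable.
import os

os.environ["OPENAI_API_KEY"] = "..."  # alternative to the shell export above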
Then simply import the QAChain class from LangSearch and start asking questions.
from langsearch.chains import QAChain

chain_output = QAChain()({"question": "How can I install langchain?"})
print(chain_output["output_text"])
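The chain output also contains the retrieved documents, so you can list the sources the answer was based on (the web app below does the same thing).
for doc in chain_output["docs"]:
    print(doc.metadata["source"])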
Here’s how you can create a Streamlit app to get a web interface. First, install streamlit.
pip install streamlit
Then put the code below in a file called webapp.py.
import streamlit as st
from langsearch.chains.qa import QAChain

st.title("QA Demo")
question = st.text_input("Ask any question about Langchain")
if len(question) != 0:
    chain_output = QAChain()({"question": question})
    answer = chain_output["output_text"]
    sources = set([doc.metadata["source"] for doc in chain_output["docs"]])
    st.markdown(answer)
    for index, source in enumerate(sources):
        st.markdown(f"[{index + 1}] [{source}]({source})")
Then bring up the web app by issuing the following command.
streamlit run webapp.py
Keep your QA app up-to-date#
Simply run the following command using cron or any other scheduler.
scrapy crawl langchain
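For example, a crontab entry along the following lines re-crawls the documentation every night at 2 AM (the paths are illustrative and need to be adapted to your environment).
# m h dom mon dow  command
0 2 * * * cd /path/to/quickstart/langchain_qa && /path/to/venv/bin/scrapy crawl langchain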
This will only re-index (and compute embeddings) for pages that have changed since the last run.