Chunking blog posts for RAG with a prompt from LangChain Hub

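In RAG (Retrieval Augmented Generation), how you split data into appropriately sized pieces matters, and LangChain Hub makes it easy to use and update prompts. Creating chunks with an LLM is effective, but you need to watch out for cost and for unintended text generation. Pulling prompts from the Hub simplifies managing prompt changes, but you should also consider how changes to a third-party prompt might alter your app's behavior.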

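In RAG (Retrieval Augmented Generation), how you split (chunk) the data you put into the vector index is very important. RAG works by passing search results as additional context to the prompt that asks the model to generate an answer, so any information that is not needed for the answer simply increases the token count. Because the number of tokens is capped, irrelevant text leaves less room for useful information, and paying for tokens that do not contribute to the answer also hurts cost-effectiveness.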
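To make the token budget point concrete, here is a minimal sketch of how retrieved chunks compete for space in the prompt. The function names are hypothetical, and the "4 characters per token" estimate is only an assumption for illustration; real counts depend on the model's tokenizer.

```typescript
// Rough token estimate. This ratio is an assumption for illustration only;
// use the model's actual tokenizer for real numbers.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Adds retrieved chunks to the prompt context in ranked order
// and stops at the first chunk that would exceed the token budget.
function packChunks(chunks: string[], tokenBudget: number): string[] {
  const selected: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const cost = estimateTokens(chunk);
    if (used + cost > tokenBudget) break;
    selected.push(chunk);
    used += cost;
  }
  return selected;
}
```

With a fixed budget, one large chunk full of unrelated text can crowd out several small, relevant ones, which is the reason chunk size matters.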

This time I tried "Level 5: Agentic Chunking", one of the text chunking methods introduced in 5 Levels Of Text Splitting, using LangChain.js.

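Importing a prompt from LangChain Hub

One of the benefits of using a framework like LangChain is that you can use prompts that someone else has published and maintains with a single line of code. For LangChain, prompts are published on a site called LangChain Hub, where you can search, browse, and pull them.

The prompt for the LLM-based chunking method introduced in 5 Levels Of Text Splitting is published there as well.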

https://smith.langchain.com/hub/wfh/proposal-indexing?organizationId=50995362-9ea0-4378-ad97-b4edae2f9f22

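Now let's actually use the prompt published on LangChain Hub. The langchain package, installed with npm i langchain, provides a pull function in langchain/hub. Start by importing it. The sample below also imports from @langchain/core and @langchain/openai.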

import { StringOutputParser } from "@langchain/core/output_parsers";
// RunnablePassthrough is used by the sample code later in this post.
import { RunnablePassthrough, RunnableSequence } from "@langchain/core/runnables";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { ChatOpenAI } from "@langchain/openai";
import { pull } from "langchain/hub";

After that, just pass the id of the prompt you want as the first argument. Calling pull gives you the prompt published on LangChain Hub as a ChatPromptTemplate.

const prompt = await pull<ChatPromptTemplate>(
  "wfh/proposal-indexing"
);

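Sample code: fetching posts from the WP API and creating chunks

I wrote sample code that actually creates chunks. If you replace example.com with the domain of your own WordPress site, it should work in most cases. Note that some sites change the endpoint path or disable the WP API entirely.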

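  // Minimal shape of a WP REST API post. This type is not shown in the
  // original sample, so it is an assumption; only the field used here is declared.
  type WPPost = { content: { rendered: string } };

  // Fetch the latest posts and take the first one.
  const response = await fetch('https://example.com/wp-json/wp/v2/posts')
  const posts = (await response.json()) as WPPost[]
  const post = posts[0]

  const chatModel = new ChatOpenAI({
    modelName: "gpt-4",
    temperature: 0,
    // c.env comes from the surrounding request handler, which is not shown here.
    openAIApiKey: c.env.OPENAI_API_KEY
  });
  // Pull the chunking prompt from LangChain Hub.
  const prompt = await pull<ChatPromptTemplate>(
    "wfh/proposal-indexing"
  );
  // Map the raw string to the prompt's "input" variable, call the model,
  // and return the response as plain text.
  const chain = RunnableSequence.from([{
      input: new RunnablePassthrough()
    },
    prompt,
    chatModel,
    new StringOutputParser()
  ])
  // content.rendered is the post body as rendered HTML.
  const result = await chain.invoke(post.content.rendered)
  console.log(result)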

This code fetches the latest posts from the WP API and has the model generate chunks for the body of the first post in the list.
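Since some sites move or disable the endpoint, the fetch step could be guarded along the following lines. This is only a sketch: WPPost, FetchLike, buildPostsUrl, and fetchLatestPost are names I made up for illustration, and requesting a single post with per_page=1 and _fields is my own choice rather than something the sample above does. The fetch function is passed in as an argument so the helper can be tried without network access.

```typescript
// Hypothetical minimal types; only the fields this sketch reads are declared.
interface WPPost {
  id: number;
  content: { rendered: string };
}
interface MinimalResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type FetchLike = (url: string) => Promise<MinimalResponse>;

// Builds the posts endpoint URL, asking for one post and only the fields we need.
function buildPostsUrl(siteUrl: string): string {
  const base = siteUrl.replace(/\/+$/, "");
  return `${base}/wp-json/wp/v2/posts?per_page=1&_fields=id,content`;
}

// Fetches the first post, failing loudly when the API is moved, disabled, or empty.
async function fetchLatestPost(siteUrl: string, fetchImpl: FetchLike): Promise<WPPost> {
  const response = await fetchImpl(buildPostsUrl(siteUrl));
  if (!response.ok) {
    throw new Error(`WP API request failed with HTTP ${response.status}`);
  }
  const posts = await response.json();
  if (!Array.isArray(posts) || posts.length === 0) {
    throw new Error("WP API returned no posts");
  }
  return posts[0] as WPPost;
}
```

In a runtime that provides a global fetch, it should be possible to pass it directly as the second argument, for example fetchLatestPost("https://example.com", fetch).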

Looking at the Agentic Chunking results

Let's take a quick look at the output.

[
  "WordPress provides a REST API at the path starting from /wp-json.",
  "You can add your own API here to add APIs that collaborate with external services.",
  "The minimum plugin code is provided.",
  "The plugin code can be easily tested by overwriting the code of the Hello Dolly plugin.",
  "If the plugin containing this code is enabled, the API will increase at the path specified by the first and second arguments.",
  "The first argument is treated like a namespace.",
  "Therefore, if you throw a GET at the namespace level path, you can see information such as 'You can use this API at this path'.",
  "If you want to pass product or store IDs to the API path, specify the path with a regular expression in the second argument of register_rest_route.",
  "For example, if you want to receive a string with message: (?P<message>.+)",
  "For example, if you want to receive a number: (?P\\d+)",
  "When you call this API, the contents of the path will come out as they are.",
  "When actually using it, be sure to apply measures such as XSS to the received content.",
  "Validation and error message display can be easily implemented with validation_callback.",
  "For example, in the following code, passing a number to message is prohibited.",
  "If you pass a number such as 123 to this API, an HTTP400 error will be returned.",
  "Just having a validation process allows you to leave the verification and error message display to you.",
  "I have summarized the writing method that I regularly take care of.",
  "As block development becomes the default, it seems that we will be more likely to take care of this API, so please try it out.",
  "The reference article is 'Adding Custom Endpoints'."
]

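The first thing I noticed is that the chunks came out in English even though the input was a Japanese article. This may be because the prompt itself is written in English, so adding an instruction that specifies the output language might address it. On the other hand, depending on the embedding model you use, having the chunks in English may actually improve the accuracy of embedding-based similarity search (vector search).

The other thing I noticed is that sample code, images, and links in the article were dropped. In their place, it looks like descriptions of the code were generated. Whether this is acceptable probably depends on how important the sample code is to the article.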

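Thoughts after trying it

I tried two things this time: creating chunks from text with an LLM, and importing a prompt from LangChain Hub. Generating text from input text is something LLMs are good at, so using one to create chunks looks quite effective. My concerns are that the LLM cost for chunking could get high when there are already hundreds of posts or when posts are updated frequently, and that I am not sure how to detect cases where the model generates unintended text.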
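As a first step toward detecting unintended output, the string returned by the chain could be validated before it goes into the index. The sketch below assumes the model returns a JSON array of strings, like the result shown earlier; I have not confirmed that the prompt guarantees this format, and parseChunks is a hypothetical helper name.

```typescript
// Parses the model's string output into a list of chunk strings.
// Returns null when the output is not a non-empty JSON array of non-empty
// strings, so the caller can retry or flag the post for review.
function parseChunks(raw: string): string[] | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not valid JSON, e.g. the model added explanatory prose
  }
  if (!Array.isArray(parsed)) return null;
  const chunks: string[] = [];
  for (const item of parsed) {
    if (typeof item !== "string" || item.trim() === "") return null;
    chunks.push(item.trim());
  }
  return chunks.length > 0 ? chunks : null;
}
```

This only checks the shape of the output. Whether the chunks are faithful to the article would still need a separate check.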

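Importing prompts from LangChain Hub looks useful not only because you can reuse prompts that others have published, but also when several projects share similar prompts. It appears that updating the prompt on the Hub is enough for the change to take effect, so you would no longer need to run a CD flow for each app just to change a prompt. On the other hand, when using a prompt published by a third party, I am a little worried about the prompt being changed and the app's behavior changing with it. LangChain may already have some countermeasure for this, so I plan to look into it a bit more.

References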
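One thing that may help with this concern: as far as I understand, the Hub accepts a reference of the form owner/repo:commit, which would pin the prompt to a specific version. I have not verified this in the setup used in this post, so treat the following as a sketch. pinnedPromptRef is a hypothetical helper, and the commit hash is a placeholder you would copy from the prompt's page on the Hub.

```typescript
// Builds a Hub prompt reference, optionally pinned to a specific commit.
// Without a commit hash, the reference points at the latest version.
function pinnedPromptRef(ownerRepo: string, commitHash?: string): string {
  return commitHash ? `${ownerRepo}:${commitHash}` : ownerRepo;
}

// Usage (commented out because it needs the langchain package and network access):
// const prompt = await pull<ChatPromptTemplate>(
//   pinnedPromptRef("wfh/proposal-indexing", "<commit-hash>")
// );
```

Pinning trades away the "update on the Hub and every app picks it up" convenience, so it probably fits third-party prompts better than prompts you maintain yourself.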

https://github.com/FullStackRetrieval-com/RetrievalTutorials/tree/main