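Building a RAG from GitHub Repository Data with LangChain

Lately I have been wrestling with RAG and LangChain. In my DevRel work I quite often need to read or refer to code in OSS projects and samples related to the job, and one reason I am putting effort into this is to see whether RAG can make that work more efficient.

A RAG app that references program code

At some point a "RAG over code" page was added to the LangChain documentation. From a quick read, my understanding is that by embedding program code you can build a RAG that handles questions about that code.

How far can it go with questions? For example, when an Issue is posted, can it automatically produce an answer or a Pull Request from it? Can it auto-generate a site that introduces an OSS project you built? I tried it out to evaluate these points.

Loading data from a GitHub repository

The original sample loads a project from the local file system. Here I swapped in the GitHub web loader (GithubRepoLoader) to read code hosted on GitHub.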

import { Hono } from "hono";
import { GithubRepoLoader } from "langchain/document_loaders/web/github";

export const githubApp = new Hono();

// GET /repo: load the repository and return the loaded documents as JSON.
githubApp.get("/repo", async (c) => {
    const loader = new GithubRepoLoader(
      "https://github.com/langchain-ai/langchainjs",
      {
        branch: "main",
        recursive: false, // do not descend into subdirectories
        unknown: "warn",
        maxConcurrency: 5, // Defaults to 2
      }
    );
    const docs = await loader.load();
    console.log({ docs });
    return c.json(docs);
});

The log and the response contain data such as README.md, which shows that the repository was loaded.

        'AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n' +
        'LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n' +
        'OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\n' +
        'THE SOFTWARE.',
      metadata: [Object]
    },
    Document {
      pageContent: '# 🦜️🔗 LangChain.js\n' +
        '\n' +
        '⚡ Building applications with LLMs through composability ⚡\n' +
        '\n' +
        '[![CI](https://github.com/langchain-ai/langchainjs/actions/workflows/ci.yml/badge.svg)](https://github.com/langchain-ai/langchainjs/actions/workflows/ci.yml) ![npm](https://img.shields.io/npm/dw/langchain) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Twitter](https://img.shields.io/twitter/url/https/twitter.com/langchainai.svg?style=social&label=Follow%20%40LangChainAI)](https://twitter.com/langchainai) [![](https://dcbadge.vercel.app/api/server/6adMQxSpJS?compact=true&style=flat)](https://discord.gg/6adMQxSpJS) [![Open in Dev Containers](https://img.shields.io/static/v1?label=Dev%20Containers&message=Open&color=blue&logo=visualstudiocode)](https://vscode.dev/redirect?url=vscode://ms-vscode-remote.remote-containers/cloneInVolume?url=https://github.com/langchain-ai/langchainjs)\n' +
        '[<img src="https://github.com/codespaces/badge.svg" title="Open in Github Codespace" width="150" height="20">](https://codespaces.new/langchain-ai/langchainjs)\n' +

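Creating chunks from the loaded data

Embedding the loaded data as-is (apparently) affects token counts and search accuracy, so we split it into chunks. This time we use RecursiveCharacterTextSplitter.

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

fromLanguage lets you specify the programming language, so prepare a splitter that matches your project.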

    const docs = await loader.load();
    const javascriptSplitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
        chunkSize: 2000,
        chunkOverlap: 200,
    });
    const texts = await javascriptSplitter.splitDocuments(docs);
    return c.json(texts)

The 17 documents grew to 689, so we can tell that the chunks were created.

Loaded  17  documents.
Loaded  689  documents.
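
To make the roles of chunkSize and chunkOverlap concrete, here is a minimal sketch of a fixed-size splitter with overlap. It is a simplification for illustration only: splitWithOverlap is a hypothetical helper, not the algorithm RecursiveCharacterTextSplitter actually uses, which also prefers to break on language-aware separators.

```typescript
// Simplified sliding-window splitter (hypothetical helper, for illustration).
// Each chunk has at most `chunkSize` characters, and consecutive chunks
// share `chunkOverlap` characters so context is not cut off at the boundary.
function splitWithOverlap(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("require 0 <= chunkOverlap < chunkSize");
  }
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window already reaches the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

console.log(splitWithOverlap("abcdefghij", 4, 2));
```

With chunkSize: 2000 and chunkOverlap: 200 as in the settings above, a scheme like this would start each chunk 1800 characters after the previous one.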

Generating embeddings

Now that the chunks are ready, we create embeddings from them. Pass the Documents (chunks) returned by splitDocuments to MemoryVectorStore.fromDocuments.

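    // Needed at the top of the file (paths assume the same `langchain`
    // package version as the loader import):
    // import { OpenAIEmbeddings } from "langchain/embeddings/openai";
    // import { MemoryVectorStore } from "langchain/vectorstores/memory";

    const texts = await javascriptSplitter.splitDocuments(docs);
    console.log("Loaded ", texts.length, " documents.");

    const embeddings = new OpenAIEmbeddings({
        // Placeholder value; avoid committing a real key to source control.
        openAIApiKey: 'sk-xxxx'
    });
    // Embed every chunk and keep the vectors in memory.
    const vectorStore = await MemoryVectorStore.fromDocuments(
        texts,
        embeddings
    );
    const retriever = vectorStore.asRetriever();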

Once the retriever is ready, the data to reference is fully prepared.
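
Conceptually, a retriever backed by an in-memory vector store compares the embedding of the question with the stored chunk embeddings and returns the closest ones. The following is a minimal sketch of that idea, assuming cosine similarity as the metric and tiny hand-written vectors in place of real embeddings. StoredVector and topK are hypothetical names, and this is not the library's implementation.

```typescript
// A chunk together with its embedding vector (hypothetical shape).
type StoredVector = { content: string; embedding: number[] };

// Cosine similarity: 1 for the same direction, 0 for orthogonal vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k stored items most similar to the query vector.
function topK(query: number[], store: StoredVector[], k: number): StoredVector[] {
  return store
    .map((item) => ({ item, score: cosineSimilarity(query, item.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map(({ item }) => item);
}

// Toy 2-dimensional "embeddings" purely for demonstration.
const store: StoredVector[] = [
  { content: "README.md chunk", embedding: [1, 0] },
  { content: "LICENSE chunk", embedding: [0, 1] },
  { content: "docs chunk", embedding: [0.9, 0.1] },
];
console.log(topK([1, 0], store, 2).map((v) => v.content));
```

Real embeddings have far more dimensions, but the ranking step works the same way in this sketch.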

Adding question handling that uses the embeddings

All that remains is to add the processing that answers questions. I used ChatOpenAI this time.

    const model = new ChatOpenAI({
        modelName: 'gpt-4',
        openAIApiKey: 'sk-xxxx'
    }).pipe(
        new StringOutputParser()
    );

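Copy the prompts from the sample in the documentation. Note that this version passes only {question} to the first prompt, with no chat history.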

    const questionGeneratorTemplate = ChatPromptTemplate.fromMessages([
        AIMessagePromptTemplate.fromTemplate(
          "Given the following conversation about a codebase and a follow up question, rephrase the follow up question to be a standalone question."
        ),
        AIMessagePromptTemplate.fromTemplate(`Follow Up Input: {question}
      Standalone question:`),
    ]);
    const combineDocumentsPrompt = ChatPromptTemplate.fromMessages([
        AIMessagePromptTemplate.fromTemplate(
          "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n{context}\n\n"
        ),
        HumanMessagePromptTemplate.fromTemplate("Question: {question}"),
    ]);

Prepare the chains as well.


    const combineDocumentsChain = RunnableSequence.from([
        {
          question: (output: string) => output,
          context: async (output: string) => {
            const relevantDocs = await retriever.getRelevantDocuments(output);
            return formatDocumentsAsString(relevantDocs);
          },
        },
        combineDocumentsPrompt,
        model,
        new StringOutputParser(),
    ]);

I needed to set up more than one chain. It appears that the OpenAI API is called several times, for example to vectorize the question and to generate the answer.

    const conversationalQaChain = RunnableSequence.from([
        {
            question: (i: { question: string }) => i.question,
        },
        questionGeneratorTemplate,
        model,
        new StringOutputParser(),
        combineDocumentsChain,
    ]);
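
To see why several API calls happen, it can help to model the two chains with plain async functions. The sketch below is an assumption-laden illustration, not how RunnableSequence is implemented: sequence, fakeModel, and fakeRetrieve are hypothetical stand-ins, and the fake model just wraps its prompt so the data flow stays visible.

```typescript
type Step = (input: any) => any;

// Run the steps left to right, feeding each output into the next step.
function sequence(...steps: Step[]): (input: unknown) => Promise<unknown> {
  return async (input) => {
    let value: unknown = input;
    for (const step of steps) {
      value = await step(value);
    }
    return value;
  };
}

// Hypothetical stand-ins for the chat model and the retriever.
let modelCalls = 0;
const fakeModel = async (prompt: string): Promise<string> => {
  modelCalls += 1;
  return `LLM(${prompt})`;
};
const fakeRetrieve = async (query: string): Promise<string> =>
  `docs for "${query}"`;

// Second stage: retrieve context for the standalone question, then answer.
const answerChain = sequence(
  async (standalone: string) => ({
    question: standalone,
    context: await fakeRetrieve(standalone),
  }),
  ({ question, context }: { question: string; context: string }) =>
    `${context}\nQuestion: ${question}`,
  fakeModel // model call 2: generate the answer
);

// First stage: rephrase the question, then hand it to the second stage.
const qaChain = sequence(
  (i: { question: string }) => `Rephrase: ${i.question}`,
  fakeModel, // model call 1: produce the standalone question
  answerChain
);

qaChain({ question: "Q" }).then((answer) => {
  console.log(answer, "model calls:", modelCalls);
});
```

In this sketch the chat model is invoked twice per question. In the real chain, the retriever would additionally need an embedding for the question, which is consistent with the multiple OpenAI API calls observed above.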

Finally, pass the question as the argument to invoke.

    const question = "Tell me about the Supported Environments.";
    const result = await conversationalQaChain.invoke({
        question,
    });
    console.log(result)
    return c.json(result)

Checking the behavior

Let's run the code. The question did not say which repository it was about, but the answer is about LangChain.js, the repository that was loaded when the embeddings were created.

The LangChain framework, written in TypeScript, supports the following environments:

1. Node.js (ESM and CommonJS) - versions 18.x, 19.x, 20.x
2. Browser
3. Deno
4. Cloudflare Workers
5. Vercel / Next.js (Browser, Serverless and Edge functions)
6. Supabase Edge Functions

The referenced data appears to be the contents of README.md.

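Full code

// Import paths assume the same `langchain` package version as the loader import.
import { ChatOpenAI } from "langchain/chat_models/openai";
import {
  AIMessagePromptTemplate,
  ChatPromptTemplate,
  HumanMessagePromptTemplate,
} from "langchain/prompts";
import { RunnableSequence } from "langchain/schema/runnable";
import { StringOutputParser } from "langchain/schema/output_parser";
import { formatDocumentsAsString } from "langchain/util/document";

    // Inside the route handler, after `retriever` has been created:
    const model = new ChatOpenAI({
        modelName: 'gpt-4',
        openAIApiKey: 'sk-xxxxxx'
    }).pipe(
        new StringOutputParser()
    );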
    const questionGeneratorTemplate = ChatPromptTemplate.fromMessages([
        AIMessagePromptTemplate.fromTemplate(
          "Given the following conversation about a codebase and a follow up question, rephrase the follow up question to be a standalone question."
        ),
        AIMessagePromptTemplate.fromTemplate(`Follow Up Input: {question}
      Standalone question:`),
    ]);
    const combineDocumentsPrompt = ChatPromptTemplate.fromMessages([
        AIMessagePromptTemplate.fromTemplate(
          "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n{context}\n\n"
        ),
        HumanMessagePromptTemplate.fromTemplate("Question: {question}"),
    ]);
    const combineDocumentsChain = RunnableSequence.from([
        {
          question: (output: string) => output,
          context: async (output: string) => {
            const relevantDocs = await retriever.getRelevantDocuments(output);
            return formatDocumentsAsString(relevantDocs);
          },
        },
        combineDocumentsPrompt,
        model,
        new StringOutputParser(),
    ]);
      
    const conversationalQaChain = RunnableSequence.from([
        {
            question: (i: { question: string }) => i.question,
        },
        questionGeneratorTemplate,
        model,
        new StringOutputParser(),
        combineDocumentsChain,
    ]);
    const question = "Tell me about the Supported Environments.";
    const result = await conversationalQaChain.invoke({
        question,
    });
    console.log(result)

Trying a smaller repository

While I was at it, I tried another repository as well.

    const loader = new GithubRepoLoader(
        "https://github.com/stripe-samples/stripe-node-cloudflare-worker-template",
        {
          accessToken: 'ghp_xxxx',
          branch: "main",
          recursive: true,
          unknown: "warn",
          maxConcurrency: 2, // Defaults to 2
        }
    );

Without a GitHub access token, you can apparently run into the rate limit. It therefore seems better to generate a token that has read access to public repositories (public_repo).
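
Rather than writing the token into the source, the loader options could be built from environment values. The following is a minimal sketch under stated assumptions: buildLoaderOptions and RepoLoaderOptions are hypothetical names, the environment variable name GITHUB_ACCESS_TOKEN is an assumption, and the option keys simply mirror the ones used in the loader call above.

```typescript
// Local type mirroring the option keys used in this article (hypothetical).
type RepoLoaderOptions = {
  branch: string;
  recursive: boolean;
  unknown: "warn" | "error";
  maxConcurrency: number;
  accessToken?: string;
};

// Build loader options, attaching the token only when one is configured.
function buildLoaderOptions(
  env: Record<string, string | undefined>
): RepoLoaderOptions {
  const options: RepoLoaderOptions = {
    branch: "main",
    recursive: true,
    unknown: "warn",
    maxConcurrency: 2,
  };
  // Assumed variable name; adjust to however your runtime exposes secrets.
  const token = env["GITHUB_ACCESS_TOKEN"]?.trim();
  if (token) {
    options.accessToken = token;
  }
  return options;
}

console.log(buildLoaderOptions({ GITHUB_ACCESS_TOKEN: "ghp_xxxx" }));
```

The returned object could then be passed as the second argument of new GithubRepoLoader(...). In a Hono app on Cloudflare Workers, the env argument would typically be c.env.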

Let's ask a question here too. This time I also check whether it can answer in Japanese.

    const question = "Show me the example code to build a Stripe API application using Cloudflare Workers. 回答は日本語で行ってください。";
    const result = await conversationalQaChain.invoke({
        question,
    });
    console.log(result)
    return c.json(result)

Here is the result, in Japanese as requested. Some of the code seems to have been omitted, but it looks like it captures the main points.

はい、以下にStripe APIとCloudflare Workerを使用してアプリケーションを構築するためのNode.jsコードサンプルがあります。

```javascript
function createStripeClient(apiKey) {
  return new Stripe(apiKey, {
    appInfo: { 
      name: "stripe-samples/stripe-node-cloudflare-worker-template",
      version: "0.0.1",
      url: "https://github.com/stripe-samples"
    }
  });
}

app.get("/", async (context) => {
  const stripe = createStripeClient(context.env.STRIPE_API_KEY);
  const session = await stripe.checkout.sessions.create({
    payment_method_types: ["card"],
    line_items: [
      {
        price_data: {
          currency: "usd",
          product_data: {
            name: "T-shirt",
          },
          unit_amount: 2000,
        },
        quantity: 1,
      },
    ],
    mode: "payment",
    success_url: "https://example.com/success",
    cancel_url: "https://example.com/cancel",
  });
  return context.redirect(session.url, 303);
});
```

上記のコードでは、'createStripeClient'という関数を作成してStripeのクライントを初期化し、その後"/"の経路に対するGETリクエストのハンドラを設定しています。リクエストが来たら、チェックアウトセッションを生成し、そのURLにユーザーをリダイレクトします。

このコードはCloudflare Workerを使用していますが、具体的なWorkerの設定やデプロイの手順は提供されたドキュメンテーションを参照してください。

なお、このコードはローカル開発やテスト用であり、本番環境には適切なセキュリティ対策等を考慮して修正が必要です。

Having it write a technical article

At this point I started to wonder: could it also auto-generate an introductory blog post or a documentation site? So I tried that as well.

    const question = "Please write a new developer blog post about introducing this example application. You need to descrive 'what is this', 'how we can run this application', and 'how we can customize this'. 回答は日本語で行ってください。";
    const result = await conversationalQaChain.invoke({
        question,
    });
    console.log(result)

It is simple, but an introductory article came out. That said, it does look like it merely re-edited the contents of README.md.

新しい開発者向けブログ記事を以下にご紹介します。「Stripe-node を使用した Cloudflare Worker のテンプレート」というこのサンプルアプリケーションについて、その内容と実行方法、そしてカスタマイズ方法について説明いたします。

# Cloudflare Workerでstripe-nodeを使用するテンプレート
このフレームワークは、[`stripe-node`](https://github.com/stripe/stripe-node)を使った[Cloudflare Worker](https://workers.cloudflare.com/)の設定に役立つテンプレートです。公開用に[`wrangler`](https://developers.cloudflare.com/workers/cli-wrangler)というCLIツールを使用します。

## プロジェクトの生成
[wrangler](https://github.com/cloudflare/wrangler2)を使って、以下の手順でプロジェクトを生成できます。

```
wrangler generate projectname https://github.com/stripe-samples/stripe-node-cloudflare-worker-template
cd projectname
npm install
```

wranglerに関する詳しいドキュメンテーションは[こちら](https://developers.cloudflare.com/workers/tooling/wrangler)をご覧下さい。

## ローカルでの実行方法
wranglerを通じてSTRIPE_API_KEYをプレーンテキストの環境変数として以下の方法で追加します：

`.dev.vars.example`ファイルの名前を変更し、 `.dev.vars`というファイルに移動します。例えば：

```toml
cp .dev.vars.example .dev.vars
```

.envファイルの例：

```
STRIPE_API_KEY='sk_test_xxx'
```

デモを実行するにはStripeアカウントが必要です。アカウントを設定したら、Stripe [開発者ダッシュボード](https://stripe.com/docs/development#api-keys) からAPIキーを確認できます。

最後に、以下のコマンドでこの例を実行できます。

```
npm run dev
```

次のようなローカルアプリケーションのURLが表示されます。

```bash
[mf:inf] Ready on http://0.0.0.0:51219 
[mf:inf] - http://127.0.0.1:51219
[mf:inf] - http://192.168.86.21:51219
[mf:inf] - http://172.18.96.89:51219
```

### [オプション] ローカルでウェブフックを実行
Stripe CLIを使うと、ローカルでのウェブフックを簡単に実行できます。

まず、CLIをインストールし、Stripeアカウントをリンクします。

```
stripe listen --forward-to http://{REPLACE_TO_YOUR_LOCAL_APPLICATION_URL}/webhook
```

CLIはウェブフックの秘密鍵をコンソールに表示します。.envファイルのSTRIPE_WEBHOOK_SECRETにこの値を設定します。

CLIが実行されているコンソールでイベントがログに記録されるはずです。

## カスタマイズ方法
このテンプレートには "Software" の無制限の使用、複製、修正、結合、公開、配布、サブライセンス、および/または販売の権利が設けられており、また、これらの目的で "Software" を入手した他の人にこれらの権利を許可することができます。この制限なくアプリケーションを変更し、それに応じてコードを改訂したり、必要に応じて機能を追加したりすることができます。

それでは、この素晴らしいフレームワークを利用し、あなた独自のアプリケーションを作成しましょう。このテンプレートとライブラリは、あなたが作成する将来のすべてのプロジェクトの立ち上げを容易にします。

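Thoughts after trying it

"Loaded 17 documents." seems like too few files for LangChain.js. I suspect recursive: false is the cause, so next time I will set it to true. Also, GitHub repositories usually contain files in several languages (js / markdown / yaml, etc.). I expect that using a different splitter for each would improve accuracy, so I want to try that too.

That said, I am starting to see that "RAG over code" on a loaded GitHub project could make the work of OSS developers and Developer Relations more efficient. I will keep testing, including ways to load multiple repositories.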
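
As a starting point for the per-language splitter idea, the loaded documents could first be grouped by file extension. This is only a sketch: it assumes each document's metadata.source holds the file path, the extension table is a hypothetical choice (for example, treating .ts files as "js"), and anything unmapped, such as yaml, falls back to a "generic" group.

```typescript
// Minimal document shape assumed for this sketch.
type Doc = { pageContent: string; metadata: { source: string } };

// Hypothetical mapping from file extension to a splitter language.
const LANGUAGE_BY_EXTENSION: Record<string, string> = {
  js: "js",
  jsx: "js",
  ts: "js",
  tsx: "js",
  md: "markdown",
  py: "python",
  go: "go",
};

// Pick a splitter language from a file path, or "generic" if unmapped.
function languageFor(path: string): string {
  const fileName = path.split("/").pop() ?? "";
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return "generic"; // no extension, or a dotfile
  const ext = fileName.slice(dot + 1).toLowerCase();
  return LANGUAGE_BY_EXTENSION[ext] ?? "generic";
}

// Group documents so that each group can be given its own splitter.
function groupByLanguage(docs: Doc[]): Map<string, Doc[]> {
  const groups = new Map<string, Doc[]>();
  for (const doc of docs) {
    const key = languageFor(doc.metadata.source);
    const bucket = groups.get(key) ?? [];
    bucket.push(doc);
    groups.set(key, bucket);
  }
  return groups;
}

const sample: Doc[] = [
  { pageContent: "...", metadata: { source: "src/index.ts" } },
  { pageContent: "...", metadata: { source: "README.md" } },
  { pageContent: "...", metadata: { source: "wrangler.toml" } },
];
console.log([...groupByLanguage(sample).keys()]);
```

Each group other than "generic" could then be passed to a splitter created with RecursiveCharacterTextSplitter.fromLanguage, with the generic group handled by a plain splitter.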
