Skip to content

feat: add web_fetch built-in tool#85

Merged
endo-ly merged 10 commits into
mainfrom
feat/35-web-fetch-tool
May 15, 2026
Merged

feat: add web_fetch built-in tool#85
endo-ly merged 10 commits into
mainfrom
feat/35-web-fetch-tool

Conversation

@endo-ly
Copy link
Copy Markdown
Owner

@endo-ly endo-ly commented May 14, 2026

Close #35

概要

web_fetch built-in tool を追加。URL からコンテンツを取得し、HTML を Markdown に変換して返す。SSRF 対策・コンテンツバリデーション・FeedSync インフラを含む。

実装内容

Config(WebFetchConfig

  • デフォルト: HTTPS-only、15s タイムアウト、20KB 最大レスポンス
  • Host denylist/allowlist(サブドメインワイルドカード対応)
  • SSRF: プライベート IP / ループバックブロック(allow_private_ips で解除可)
  • コンテンツバリデーション: プロンプトインジェクションパターンスキャン
  • FeedSync: リモート allowlist/denylist フィードのインフラ

Tool(WebFetchTool

  • 10 番目の built-in tool として ToolRegistry に登録
  • URL scheme 検証 → host denylist/allowlist → DNS SSRF チェック → HTTP(手動リダイレクト追跡)→ HTML→Markdown 変換 → コンテンツバリデーション → ページネーション → untrusted warning 付与
  • start_index / max_bytes で UTF-8 安全なページネーション
  • リダイレクトの各ホップで SSRF 再検証
  • 読み取り専用ツール

テスト

  • 99 テスト(Config 7 + URL validation 28 + Content validation 14 + HTML processing 17 + FeedSync 11 + WebFetchTool 22)
  • 全 992 テスト通過、cargo fmt / cargo clippy クリーン

コミット構成

  1. feat(web_fetch): add WebFetchConfig types and config integration — Config 型・統合
  2. feat(web_fetch): implement WebFetchTool with SSRF and content validation — ツール実装
  3. docs(web_fetch): add tool and config documentation — ドキュメント更新

Summary by CodeRabbit

  • New Features

    • 新しいビルトインツール「web_fetch」を追加:安全なURL取得、リダイレクト追跡、HTML→Markdown変換、コンテンツ検証(プロンプト注入検出)、取得上限・タイムアウト制御、未信頼コンテンツ警告、取得結果の詳細(最終URL/コンテンツ種別)を返します。
    • 設定に web_fetch を追加し、許可スキーム、タイムアウト、最大バイト、プライベートIP制御、ホスト denylist/allowlist、コンテンツ検証を指定可能。
  • Documentation

    • web_fetch の仕様、使用例、設定項目、セキュリティ動作をドキュメントに追加。

endo-ly added 3 commits May 14, 2026 18:20
- Add WebFetchConfig with defaults: HTTPS-only, 15s timeout, 20KB max
- Add WebFetchContentValidationConfig for injection detection
- Add WebFetchFeedSyncConfig for remote feed sources
- Wire config into Config struct, loader, persist, and test util
- Add htmd dependency for HTML-to-Markdown conversion
- URL validation: scheme check, host denylist/allowlist, IP literal block
- SSRF protection: DNS resolution + private IP detection on every hop
- Content validation: regex-based prompt injection pattern scanning
- HTML processing: HTML-to-Markdown via htmd, plain text passthrough
- FeedSync: infrastructure for remote allowlist/denylist feeds
- WebFetchTool: full execute pipeline with redirect handling, pagination
- Register as 10th built-in tool in ToolRegistry
- 99 tests covering all modules
- Add web_fetch section to tools.md with full spec
- Add Web Fetch config section to config.md with field reference
- Update TOC and section numbering in both files
- Add web_fetch.* to hot-reload fields table
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

web_fetch ツールを追加し、設定モデル、URL/SSRF 検証、HTML→Markdown 処理、コンテンツ検証、ツール本体・登録、テスト、ドキュメント、Cargo 依存を統合した。

Changes

web_fetch ツール実装と設定統合

Layer / File(s) Summary
設定モデルと正規化ロジック
src/config/web_fetch.rs, src/config/mod.rs, src/config/types.rs, src/config/loader.rs, src/config/persist.rs, src/config/tests.rs, src/setup/summary.rs, src/test_util.rs
WebFetchConfigWebFetchContentValidationConfig を追加。ファイル読み込み用 FileWebFetchConfig、normalize_web_fetch、Serializable 用型を実装し、既定値補完・ホスト正規化・数値フォールバックを行う。テスト用 Config 初期化とサンプルの更新を含む。
URL検証とSSRF対策
src/tools/web_fetch/url_validation.rs
スキーム/ホスト allowlist・denylist の正規化とサブドメイン照合、IP ブロック判定(private/loopback/link-local/メタデータ等)、DNS 解決による検証、リダイレクト先の再検証と上限を実装。
コンテンツ検証 (prompt injection)
src/tools/web_fetch/content_validation.rs
正規表現ルール群を LazyLock でコンパイルし、max_scan_bytes/strict モードでスキャンして一致ルールに基づくブロック判定を行う validate_content を実装。エラー型と表示、ユニットテストを追加。
HTML 抽出と Markdown 変換
src/tools/web_fetch/html_processing.rs
<main>→<article>→<body> 優先で本文抽出、不要タグ除去、htmd による HTML→Markdown 変換、content-type による分岐処理と単体テストを追加。
ツール本体実装・登録・テスト・docs/Cargo更新
src/tools/web_fetch/mod.rs, src/tools/mod.rs, docs/tools.md, docs/config.md, Cargo.toml, src/tools/web_fetch/* のテスト群, test sanitizer/チャネルのテスト更新
WebFetchTool を実装(手動リダイレクト追跡、ストリーミング読み取りと max_bytes 制限、UTF-8 安全な処理、content-type に応じた本文処理、検証、details 付き結果)。ツール登録を追加し、docs に web_fetch 章を追加、htmd = "0.5" を依存に追加。多数の wiremock ベーステストを追加。

Sequence Diagram (高レベルの実行フロー):

sequenceDiagram
  participant Client as ToolCaller
  participant Validator as url_validation::validate_url
  participant DNS as resolve_dns_and_validate
  participant HTTP as reqwest HTTP client
  participant HTML as html_processing::process_response_body
  participant CV as content_validation::validate_content
  participant Tool as WebFetchTool

  Client->>Tool: execute(url, timeout, max_bytes)
  Tool->>Validator: validate_url(url, config)
  Validator->>DNS: if needed resolve_dns_and_validate(host)
  DNS-->>Validator: validation result
  Tool->>HTTP: fetch (manual redirects loop)
  HTTP->>HTML: process_response_body(body, content_type)
  HTML->>CV: validate_content(markdown_or_text, config)
  CV-->>Tool: OK / ValidationFailure
  Tool-->>Client: ToolResult (content + details or error)
Loading

Estimated code review effort:
🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • endo-ly/egopulse#30: ToolRegistry と組み込みツール登録経路に関する変更があるため、web_fetch の登録/初期化と関連する。

Poem

web を取ってきて Markdown に、
迷える redirect もちゃんと検証、
ルールで注入もブロックするよ、
ちょこっとテストもいっぱい、
さあ安全な fetch の始まりだ 🐰✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed プルリクエストのタイトル『feat: add web_fetch built-in tool』は、変更セットの主要な内容(web_fetchというビルトインツールの追加)を明確かつ簡潔に要約しており、リンク済みイシュー#35の目的とも完全に一致している。
Linked Issues check ✅ Passed プルリクエストはイシュー#35のすべての主要目的を実装している。web_fetchツールの追加、URL検証、SSRF対策、HTML→Markdown変換、出力サイズ制限、プロンプトインジェクション検査、設定オプション、ドキュメント更新、読み取り専用マーク、秘匿フィールド経路への統合がすべて実装されている。
Out of Scope Changes check ✅ Passed 変更はすべてweb_fetchツール実装とその統合に関連しており、スコープ外の変更はない。非スコープ項目(JavaScript実行、認証ページ取得、バイナリダウンロード、ブラウザ自動化)の実装はない。
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/35-web-fetch-tool

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

🧹 Nitpick comments (2)
src/tools/mod.rs (1)

18-19: ⚡ Quick win

#[allow(dead_code)]属性が付いてるのが気になるね

web_fetchモジュールに#[allow(dead_code)]が付いてるけど、これって一時的なものかな?実装が完了してないコードや未使用の関数が残ってる可能性があるよ。

PRサマリーでは99個のテストが追加されたって言ってるから、実装自体は完了してるはずだよね。この属性が本当に必要かどうか確認して、不要なら外しておこう!

♻️ 提案:dead_code属性の確認

以下のコマンドで未使用コードをチェックしてみよう:

#!/bin/bash
# Description: web_fetchモジュール内の未使用コードをチェック

# web_fetchモジュールのファイルを検索
fd -e rs -p 'src/tools/web_fetch' --exec echo "File: {}" \;

# 未使用の関数や構造体を探す(clippy警告から)
echo "Running clippy to check for dead code..."
cargo clippy --all-targets --all-features -- -W dead_code 2>&1 | rg 'web_fetch'

もし警告が出なければ、この属性は削除できるはずだよ!

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/tools/mod.rs` around lines 18 - 19, `#[allow(dead_code)]` が `mod
web_fetch;` に付いていますが未使用のコードを無視している可能性があるので、本当に不要か確認して削除してください:まず `mod web_fetch`
を参照するテストや呼び出しがすべて存在するか確認し、`cargo clippy --all-targets --all-features -W
dead_code` や提案のスクリプトで `web_fetch` に関する dead_code 警告が出ないことを確認したら
`#[allow(dead_code)]`
を削除し、ビルド・テスト(追加した99個のテストを含む)を通すこと(もし警告が出る場合は未使用の関数を削除するか適切に公開・使用するよう修正する)。
src/tools/web_fetch/mod.rs (1)

296-710: ⚖️ Poor tradeoff

新規テスト、AAA の区切りをそろえたい。

一部のケースだけ Arrange/Act/Assert の区切りがあって、ほかは素のままだから、追加したテスト群の読み味が揃ってない。ここは新規分を AAA で統一したほうが追いやすい。 As per coding guidelines, All test functions must follow the AAA (Arrange-Act-Assert) pattern.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/tools/web_fetch/mod.rs` around lines 296 - 710, Several new tests (e.g.,
fetches_html_and_returns_markdown, fetches_plain_text_as_is,
result_details_metadata, truncation_at_max_bytes, truncation_utf8_safe,
start_index_continuation, start_index_beyond_content,
blocks_content_with_injection, follows_redirect_and_validates,
blocks_redirect_to_private_ip, too_many_redirects, http_error_status,
timeout_error, untrusted_content_warning, blocks_disallowed_scheme_in_redirect)
mix styles; update each to follow the Arrange / Act / Assert (AAA) pattern by
clearly separating setup (Arrange) — creating MockServer, configuring Mock
responses, and constructing tool via make_tool/test_web_fetch_config — from
execution (Act) — single execute(&tool, json!(...)).await call — and from
assertions (Assert) — the assert!()/assert_eq!() checks; add short comments or
blank lines marking "Arrange", "Act", "Assert" in each test and consolidate any
inline setup so each test has one Act call returning result before assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/config/web_fetch.rs`:
- Around line 165-310: Tests in src/config/web_fetch.rs do not consistently
follow the Arrange-Act-Assert (AAA) pattern; update each test
(config_default_values, config_deserialize_full_yaml,
config_deserialize_missing_optional, config_normalize_empty_schemes,
config_normalize_hosts, config_normalize_zero_max_bytes, feed_source_normalize)
to explicitly separate: Arrange (declare yaml inputs and expected values / call
WebFetchConfig::default()), Act (call yaml_serde::from_str(...) and/or
cfg.normalize()), and Assert (all assert_eq!/assert! checks). Ensure each test
clearly constructs inputs first, performs the single action under test second,
and then runs only assertions last to conform to the coding guideline.

In `@src/tools/web_fetch/content_validation.rs`:
- Around line 104-108: The slice to produce scan_text must avoid splitting UTF-8
code points: instead of slicing directly with
&text[..text.len().min(config.max_scan_bytes)], compute an end index <=
config.max_scan_bytes that is a valid UTF-8 char boundary (use
str::is_char_boundary in a loop or walk with text.char_indices to find the last
boundary ≤ max_scan_bytes), then take &text[..end]; preserve the existing else
branch when max_scan_bytes <= 0 and ensure the edge case end == 0 yields an
empty &str rather than panicking; apply this to the scan_text assignment
referencing config.max_scan_bytes and text.

In `@src/tools/web_fetch/feed_sync.rs`:
- Around line 209-414: Multiple test functions (e.g.,
inline_feed_merges_denylist, inline_feed_merges_allowlist, csv_first_column,
skips_disabled_sources, etc.) do not consistently follow the AAA
(Arrange-Act-Assert) layout; refactor each test so they clearly separate the
three phases: Arrange (setup MockServer, build WebFetchFeedSyncConfig and any
source modifications), Act (call resolve_feed_sync(&config).await and capture
the result), and Assert (perform assertions on
result.denylist/result.allowlist). Use short comments or blank lines to mark
"Arrange", "Act", "Assert" and ensure the call to resolve_feed_sync and its
expect/unwrap is only in the Act section (move any assertions after that call).
- Around line 64-77:
HTTPレスポンスのステータスを検証していないため、404/500などのエラー本文がフィードとして扱われています;修正は
client.get(&source.url).send().await の直後にレスポンスに対して .error_for_status()
を呼び出してエラー応答を早期に落とし、失敗時は既存の FeedSyncError::FetchFailed を使って url
とエラー理由(ステータスやエラーメッセージ)を格納するようにしてください(参照シンボル: client, source.url, body 変数取得箇所,
FeedSyncError::FetchFailed)。

In `@src/tools/web_fetch/html_processing.rs`:
- Around line 84-223: The tests (e.g., html_to_markdown_basic,
html_to_markdown_list, html_to_markdown_link, extract_primary_prefers_main,
content_type_html_routes_to_markdown, etc.) must be refactored to follow the AAA
pattern: for each test, create an "arrange" section that assigns input variables
(html/text/json and expected snippets), an "act" section that invokes the
function under test (convert, extract_primary_html, process_response_body) and
stores the result, and an "assert" section that runs the assertions against that
result; update all tests in the diff to follow this structure so inputs,
actions, and assertions are clearly separated and use descriptive local
variables for inputs/results.
- Around line 38-44: The current search for the closing tag uses
lower.find(&close) over the whole string which can return a match before the
opening tag and break extraction; change it to search from content_start onward
by calling lower[content_start..].find(&close) and if found add content_start to
that index to compute content_end, keep the existing check content_end <
content_start and return Some(html[content_start..content_end].to_owned()) as
before so extraction only considers closing tags after the opening tag.

In `@src/tools/web_fetch/mod.rs`:
- Around line 126-127: The code currently uses
params.timeout_secs.unwrap_or(config.timeout_secs) and
params.max_bytes.unwrap_or(config.max_bytes), which lets callers bypass config
limits and then calls response.bytes().await (allocating the whole body), so
large max_bytes can cause OOM; change the logic to clamp incoming values to
config limits (e.g. compute timeout and max_bytes by taking the min of params.*
and config.* or defaulting to config when None using params.timeout_secs.map(|v|
v.min(config.timeout_secs)).unwrap_or(config.timeout_secs) and likewise for
max_bytes) and replace the response.bytes().await path with a bounded streaming
read that stops after the clamped max_bytes (read the response body in chunks
and stop/close once the limit is reached) so you never allocate more than
config.max_bytes; update the code paths that use timeout and max_bytes
accordingly (references: params.timeout_secs, params.max_bytes,
config.timeout_secs, config.max_bytes, response.bytes().await).
- Around line 218-240: The slice at processed[start..] can panic if start is not
a UTF-8 char boundary and next_start is computed incorrectly; before slicing,
adjust start to the nearest UTF-8 char boundary ≤ start by checking
processed.is_char_boundary(start) and decrementing until it is (safeguard
against underflow), then perform the slice; keep the existing truncation loop
that finds end as the nearest boundary ≤ max_bytes, and compute next_start as
start + end (not start + max_bytes) so the returned continuation offset reflects
the actual byte consumption (use the variables start, processed, max_bytes, end,
next_start to locate changes).

In `@src/tools/web_fetch/url_validation.rs`:
- Around line 239-518: The test functions lack explicit Arrange/Act/Assert
separation; update each test (e.g., allows_https_by_default,
blocks_http_by_default, allows_http_when_configured, blocks_ftp_scheme,
blocks_invalid_url, blocks_url_without_host, blocks_denylist_host,
blocks_denylist_subdomain, allows_denylist_unrelated, enforces_allowlist,
denylist_precedes_allowlist, validation_disabled_allows_all, ssrf_* tests,
redirect_* tests, redirect_too_many, host_normalization_* tests) to follow AAA:
clearly mark Arrange by creating/configuring WebFetchConfig or inputs in one
block, Act by calling the function under test (validate_url, validate_redirect,
resolve_dns_and_validate, is_blocked_ip, is_redirect_limit_exceeded,
normalize_host) and assigning the result to a variable, and Assert by performing
assertions on that result; keep the logic the same but move setup, invocation,
and checks into those three labeled steps for readability and consistency.
- Around line 64-65: The IPv6 check only blocks loopback (IpAddr::V6 match arm
using v6.is_loopback()), leaving ULA (fc00::/7) and link-local (fe80::/10)
addresses allowed and enabling SSRF bypass; update that match arm to also reject
v6.is_unique_local() and v6.is_unicast_link_local() (i.e., treat any
v6.is_loopback() || v6.is_unique_local() || v6.is_unicast_link_local() as
blocked) so ULA and link-local ranges are covered by the same validation.
- Around line 193-195: is_redirect_limit_exceeded currently returns
redirect_count > MAX_REDIRECTS which allows one extra hop; change the comparison
in is_redirect_limit_exceeded to use >= so any redirect_count equal to
MAX_REDIRECTS is treated as exceeding the limit (reference: function
is_redirect_limit_exceeded and constant MAX_REDIRECTS).

---

Nitpick comments:
In `@src/tools/mod.rs`:
- Around line 18-19: `#[allow(dead_code)]` が `mod web_fetch;`
に付いていますが未使用のコードを無視している可能性があるので、本当に不要か確認して削除してください:まず `mod web_fetch`
を参照するテストや呼び出しがすべて存在するか確認し、`cargo clippy --all-targets --all-features -W
dead_code` や提案のスクリプトで `web_fetch` に関する dead_code 警告が出ないことを確認したら
`#[allow(dead_code)]`
を削除し、ビルド・テスト(追加した99個のテストを含む)を通すこと(もし警告が出る場合は未使用の関数を削除するか適切に公開・使用するよう修正する)。

In `@src/tools/web_fetch/mod.rs`:
- Around line 296-710: Several new tests (e.g.,
fetches_html_and_returns_markdown, fetches_plain_text_as_is,
result_details_metadata, truncation_at_max_bytes, truncation_utf8_safe,
start_index_continuation, start_index_beyond_content,
blocks_content_with_injection, follows_redirect_and_validates,
blocks_redirect_to_private_ip, too_many_redirects, http_error_status,
timeout_error, untrusted_content_warning, blocks_disallowed_scheme_in_redirect)
mix styles; update each to follow the Arrange / Act / Assert (AAA) pattern by
clearly separating setup (Arrange) — creating MockServer, configuring Mock
responses, and constructing tool via make_tool/test_web_fetch_config — from
execution (Act) — single execute(&tool, json!(...)).await call — and from
assertions (Assert) — the assert!()/assert_eq!() checks; add short comments or
blank lines marking "Arrange", "Act", "Assert" in each test and consolidate any
inline setup so each test has one Act call returning result before assertions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dd8167af-530f-4a22-bd91-90ef020d6e85

📥 Commits

Reviewing files that changed from the base of the PR and between fc5afae and 86d7db1.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (20)
  • Cargo.toml
  • docs/config.md
  • docs/tools.md
  • src/channels/web/auth.rs
  • src/channels/web/config.rs
  • src/config/loader.rs
  • src/config/mod.rs
  • src/config/persist.rs
  • src/config/tests.rs
  • src/config/types.rs
  • src/config/web_fetch.rs
  • src/setup/summary.rs
  • src/test_util.rs
  • src/tools/mod.rs
  • src/tools/sanitizer.rs
  • src/tools/web_fetch/content_validation.rs
  • src/tools/web_fetch/feed_sync.rs
  • src/tools/web_fetch/html_processing.rs
  • src/tools/web_fetch/mod.rs
  • src/tools/web_fetch/url_validation.rs

Comment thread src/config/web_fetch.rs
Comment thread src/tools/web_fetch/content_validation.rs
Comment thread src/tools/web_fetch/feed_sync.rs Outdated
Comment thread src/tools/web_fetch/feed_sync.rs Outdated
Comment thread src/tools/web_fetch/html_processing.rs Outdated
Comment thread src/tools/web_fetch/mod.rs Outdated
Comment thread src/tools/web_fetch/mod.rs Outdated
Comment thread src/tools/web_fetch/url_validation.rs Outdated
Comment thread src/tools/web_fetch/url_validation.rs
Comment thread src/tools/web_fetch/url_validation.rs
- Fix UTF-8 panic in content_validation byte slicing
- Add error_for_status() in feed_sync HTTP requests
- Fix closing tag search position in html_processing
- Clamp params to config limits (prevent OOM via large inputs)
- Fix start_index UTF-8 boundary and next_start_index calculation
- Block IPv6 ULA/link-local addresses for SSRF protection
- Fix redirect limit off-by-one (>= instead of >)
- Add AAA pattern separation to all test functions
- 12 new tests covering all fix areas (1004 total)
@endo-ly
Copy link
Copy Markdown
Owner Author

endo-ly commented May 15, 2026

レビュー対応完了

対応したレビューコメント: 11件

対象ファイル 修正内容 ステータス
src/tools/web_fetch/content_validation.rs UTF-8 境界panic修正: is_char_boundary() でスライス前に境界確認 + テスト追加 ✅ 修正済み
src/tools/web_fetch/feed_sync.rs .error_for_status() 追加: HTTP 4xx/5xx をフィード内容として扱わない + テスト追加 ✅ 修正済み
src/tools/web_fetch/html_processing.rs </tag> 検索開始位置修正: content_start 以降で検索するよう変更 + テスト追加 ✅ 修正済み
src/tools/web_fetch/mod.rs:126-127 入力値を config 上限に clamp: params.*.map(v.min(config.*)) でガードレール回避防止 + テスト追加 ✅ 修正済み
src/tools/web_fetch/mod.rs:218-240 start_index UTF-8 境界補正 + next_start_index を実際の消費バイト数で計算 + テスト追加 ✅ 修正済み
src/tools/web_fetch/url_validation.rs:65 IPv6 ULA (fc00::/7) と link-local (fe80::/10) をブロック + テスト追加 ✅ 修正済み
src/tools/web_fetch/url_validation.rs:195 リダイレクト上限の off-by-one 修正: >>= ✅ 修正済み
src/config/web_fetch.rs テスト AAA パターン区切り追加(空白行) ✅ 修正済み
src/tools/web_fetch/url_validation.rs テスト AAA パターン区切り追加(空白行) ✅ 修正済み
src/tools/web_fetch/content_validation.rs テスト AAA パターン区切り追加(空白行) ✅ 修正済み
src/tools/web_fetch/html_processing.rs テスト AAA パターン区切り追加(空白行) ✅ 修正済み
src/tools/web_fetch/feed_sync.rs テスト AAA パターン区切り追加(空白行) ✅ 修正済み

検証結果: 1004 tests passed, cargo fmt / cargo clippy clean

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/tools/web_fetch/url_validation.rs (1)

68-88: ⚡ Quick win

deny/allow の判定がコピペ重複してる🐰

is_host_denylistedis_host_allowlisted、シグネチャ以外まったく同じ実装。ルール変えるとき片方だけ直してバグる典型パターンなので、ひとつの helper にまとめたい。あと毎回 format!(".{entry_lower}") でアロケしてるのも気になるので、strip_suffix +直前バイトが . か、で書き換えると軽くなる。

♻️ 修正案: 共通 helper に集約 + アロケ削減
-/// Returns `true` when `host` matches an entry in `denylist`.
-///
-/// Both exact matches and subdomain matches are considered: if the denylist
-/// contains `"evil.com"` then `"sub.evil.com"` is also blocked.
-fn is_host_denylisted(host: &str, denylist: &[String]) -> bool {
-    let normalized = host.to_ascii_lowercase();
-    denylist.iter().any(|entry| {
-        let entry_lower = entry.to_ascii_lowercase();
-        normalized == entry_lower || normalized.ends_with(&format!(".{entry_lower}"))
-    })
-}
-
-/// Returns `true` when `host` matches an entry in `allowlist` (exact match
-/// or subdomain match, mirroring the denylist logic).
-fn is_host_allowlisted(host: &str, allowlist: &[String]) -> bool {
-    let normalized = host.to_ascii_lowercase();
-    allowlist.iter().any(|entry| {
-        let entry_lower = entry.to_ascii_lowercase();
-        normalized == entry_lower || normalized.ends_with(&format!(".{entry_lower}"))
-    })
-}
+/// Returns `true` when `host` matches an entry in `list` exactly or as a subdomain.
+fn host_matches_any(host: &str, list: &[String]) -> bool {
+    let normalized = host.to_ascii_lowercase();
+    list.iter().any(|entry| {
+        let entry_lower = entry.to_ascii_lowercase();
+        if normalized == entry_lower {
+            return true;
+        }
+        normalized
+            .strip_suffix(&entry_lower)
+            .and_then(|prefix| prefix.strip_suffix('.'))
+            .is_some()
+    })
+}

呼び出し側は is_host_denylisted(host, &config.denylist)host_matches_any(host, &config.denylist) に置き換え。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/tools/web_fetch/url_validation.rs` around lines 68 - 88, Both functions
share identical logic and should be merged into a single helper to avoid
duplication and the per-entry allocation from format!; introduce fn
host_matches_any(host: &str, list: &[String]) -> bool that lowercases the host
once (let normalized = host.to_ascii_lowercase()), then for each entry compute
let entry_lower = entry.to_ascii_lowercase() and check normalized == entry_lower
|| normalized.strip_suffix(&entry_lower).map_or(false, |p| p.ends_with('.'));
replace is_host_denylisted and is_host_allowlisted bodies with calls to
host_matches_any(host, denylist) and host_matches_any(host, allowlist)
respectively.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/tools/web_fetch/url_validation.rs`:
- Around line 55-66: The is_blocked_ip function misses IPv4-mapped IPv6 and
unspecified addresses; update is_blocked_ip to detect and unfold IPv4-mapped
IPv6 (e.g. ::ffff:a.b.c.d) into an IPv4 and apply the existing IPv4 checks, and
explicitly treat unspecified addresses (IPv4 0.0.0.0 and IPv6 ::) as blocked;
modify the IpAddr::V6 branch in is_blocked_ip to check for
v6.to_ipv4_mapped()/or inspect the last 32 bits and re-run the IPv4 logic when
mapped, and add explicit checks for IPv4::UNSPECIFIED and Ipv6Addr::UNSPECIFIED
(or v4.is_unspecified()/v6.is_unspecified()) so both 0.0.0.0 and :: are blocked,
then add tests exercising ::ffff:127.0.0.1, ::ffff:192.168.0.1, 0.0.0.0 and ::
to ensure they return true.

---

Nitpick comments:
In `@src/tools/web_fetch/url_validation.rs`:
- Around line 68-88: Both functions share identical logic and should be merged
into a single helper to avoid duplication and the per-entry allocation from
format!; introduce fn host_matches_any(host: &str, list: &[String]) -> bool that
lowercases the host once (let normalized = host.to_ascii_lowercase()), then for
each entry compute let entry_lower = entry.to_ascii_lowercase() and check
normalized == entry_lower || normalized.strip_suffix(&entry_lower).map_or(false,
|p| p.ends_with('.')); replace is_host_denylisted and is_host_allowlisted bodies
with calls to host_matches_any(host, denylist) and host_matches_any(host,
allowlist) respectively.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 87fcfe0e-d744-4264-8e7a-0f65b66c7ce6

📥 Commits

Reviewing files that changed from the base of the PR and between 86d7db1 and b63061c.

📒 Files selected for processing (6)
  • src/config/web_fetch.rs
  • src/tools/web_fetch/content_validation.rs
  • src/tools/web_fetch/feed_sync.rs
  • src/tools/web_fetch/html_processing.rs
  • src/tools/web_fetch/mod.rs
  • src/tools/web_fetch/url_validation.rs
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/tools/web_fetch/content_validation.rs
  • src/tools/web_fetch/html_processing.rs
  • src/tools/web_fetch/mod.rs
  • src/config/web_fetch.rs
  • src/tools/web_fetch/feed_sync.rs

Comment thread src/tools/web_fetch/url_validation.rs
- Unfold ::ffff:a.b.c.d to IPv4 and re-run IPv4 checks (prevents
  ::ffff:127.0.0.1 SSRF bypass)
- Block 0.0.0.0 and :: (unspecified addresses)
- Add 4 new tests for IPv4-mapped, unspecified cases
@endo-ly
Copy link
Copy Markdown
Owner Author

endo-ly commented May 15, 2026

レビュー対応完了(2巡目)

対応したレビューコメント: 2件

対象ファイル 修正内容 ステータス
src/tools/web_fetch/url_validation.rs:66 IPv4-mapped IPv6 (::ffff:127.0.0.1 等) を IPv4 チェックに展開 + 0.0.0.0 / :: (unspecified) をブロック + テスト4件追加 ✅ 修正済み
src/tools/web_fetch/html_processing.rs:268 AAA パターンは前回対応済み(空白行で Arrange/Act/Assert 分離済み) ℹ️ 既に対応済み

検証結果: 1008 tests passed, cargo fmt / cargo clippy clean

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/tools/web_fetch/url_validation.rs (1)

121-128: 💤 Low value

scheme().to_ascii_lowercase() は冗長

url クレートはパース時点で scheme を ASCII 小文字に正規化済み。to_ascii_lowercase() で再アロケートしなくても、そのまま eq_ignore_ascii_case で比較すれば十分。config.allowed_schemes 側にうっかり大文字が混ざっても eq_ignore_ascii_case が吸収してくれるから挙動も変わらない。

ホットパスじゃないから別に放置でもいいけど、エラー時のメッセージで小文字のスキームを返すなら parsed.scheme() 直接で OK。

♻️ こんな感じ
-    let scheme = parsed.scheme().to_ascii_lowercase();
+    let scheme = parsed.scheme();
     if !config
         .allowed_schemes
         .iter()
-        .any(|s| s.eq_ignore_ascii_case(&scheme))
+        .any(|s| s.eq_ignore_ascii_case(scheme))
     {
-        return Err(UrlValidationError::SchemeNotAllowed(scheme));
+        return Err(UrlValidationError::SchemeNotAllowed(scheme.to_string()));
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/tools/web_fetch/url_validation.rs` around lines 121 - 128,
スキームを再度小文字化して再アロケートしている箇所を削除してください: 現状の let scheme =
parsed.scheme().to_ascii_lowercase(); とそれを使った比較を、parsed.scheme()
を直接使う比較に置き換え、config.allowed_schemes.iter().any(|s|
s.eq_ignore_ascii_case(parsed.scheme()))
のようにしてチェックし、許可されない場合のエラー作成(UrlValidationError::SchemeNotAllowed)にも
parsed.scheme() を元にした値を渡すようにしてください(再アロケート不要ならそのまま String 化するか既存の型に合わせて渡す)。
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/tools/web_fetch/url_validation.rs`:
- Around line 86-102: The denylist/allowlist check is using raw config entries
so wildcard entries like "*.example.com" and trailing-dot forms are not
normalized; update behavior by normalizing list entries with the existing
normalize_host / normalize_host_list logic before comparison and consolidate
duplicate logic into a single helper (e.g., is_host_match or is_host_in_list)
that both is_host_denylisted and is_host_allowlisted call; this helper should
lowercase and strip/normalize each allow/deny entry (removing leading "*." and
trailing dots as normalize_host_list does) and then perform the same matching
used currently (exact match or subdomain via ends_with with ".{entry}"), and
replace the bodies of is_host_denylisted and is_host_allowlisted to call this
new helper (also add tests asserting denylist: vec!["*.example.com"] rejects
"sub.example.com").

---

Nitpick comments:
In `@src/tools/web_fetch/url_validation.rs`:
- Around line 121-128: スキームを再度小文字化して再アロケートしている箇所を削除してください: 現状の let scheme =
parsed.scheme().to_ascii_lowercase(); とそれを使った比較を、parsed.scheme()
を直接使う比較に置き換え、config.allowed_schemes.iter().any(|s|
s.eq_ignore_ascii_case(parsed.scheme()))
のようにしてチェックし、許可されない場合のエラー作成(UrlValidationError::SchemeNotAllowed)にも
parsed.scheme() を元にした値を渡すようにしてください(再アロケート不要ならそのまま String 化するか既存の型に合わせて渡す)。
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5691a5a8-4cae-4953-ac09-3cb651006498

📥 Commits

Reviewing files that changed from the base of the PR and between b63061c and a9bc7b6.

📒 Files selected for processing (1)
  • src/tools/web_fetch/url_validation.rs

Comment thread src/tools/web_fetch/url_validation.rs Outdated
endo-ly added 2 commits May 15, 2026 02:29
…ching

- Consolidate is_host_denylisted/is_host_allowlisted into host_matches_any
- Apply normalize_host to entries so *.example.com matches subdomains
- Add 4 tests for wildcard denylist/allowlist behavior
@endo-ly
Copy link
Copy Markdown
Owner Author

endo-ly commented May 15, 2026

レビュー対応完了(3巡目)

対応したレビューコメント: 2件

対象ファイル 修正内容 ステータス
src/tools/web_fetch/url_validation.rs:102 denylist/allowlist の *. ワイルドカードが正規化されず機能していなかった問題を修正。is_host_denylisted/is_host_allowlistedhost_matches_any に統合し、エントリにも normalize_host を適用。テスト4件追加。 ✅ 修正済み
src/tools/web_fetch/html_processing.rs:268 AAA パターンは2巡目ですでに対応済み(全テスト関数に Arrange/Act/Assert の空白行区切りあり) ℹ️ 既に対応済み

検証結果: 1011 tests passed, cargo fmt / cargo clippy clean

Blocker 1 - Memory DoS prevention:
- Add Content-Length pre-check: reject responses exceeding max_bytes
  before reading any body data
- Replace response.bytes() with streaming chunk reader that caps at
  max_bytes, preventing unbounded memory allocation
- Skip start_index pagination for stream-capped bodies (prevents data
  loss); non-truncated responses retain full pagination
- Update affected tests for new Content-Length rejection behavior

Blocker 2 - Remove dead feed_sync feature:
- Delete feed_sync.rs (440 lines of uncalled code)
- Remove WebFetchFeedSyncConfig/WebFetchFeedSource from config types,
  loader, persist, and docs
- Zero remaining feed_sync references in codebase

999 tests pass, cargo fmt + clippy clean
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
docs/config.md (1)

320-331: 💤 Low value

YAML例でワイルドカードの実例を見せるとより親切かも

現在のdenylist: []allowlist: []はデフォルト値として正しいんだけど、176行目で説明されてるワイルドカード形式(*.prefix)の具体例が見られないんだよね。

例えばこんな感じで実例を追加すると、ユーザーがすぐに真似できるかも:

  denylist:
    - "*.suspicious.com"
    - "badhost.example"
  allowlist:
    - "*.trusted.org"

もちろん空配列のままでも問題ないから、完全にオプショナルな提案だよ!

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/config.md` around lines 320 - 331, Add illustrative wildcard examples to
the web_fetch YAML example so readers can see how wildcard patterns are used for
denylist and allowlist; update the denylist and allowlist entries under the
web_fetch block (referencing keys denylist and allowlist within the web_fetch
section and noting the wildcard format described near content_validation) to
include sample items such as a domain wildcard like "*.suspicious.com" and a
literal host like "badhost.example" for denylist and a wildcard like
"*.trusted.org" for allowlist, while keeping empty arrays as an acceptable
option.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@docs/config.md`:
- Around line 320-331: Add illustrative wildcard examples to the web_fetch YAML
example so readers can see how wildcard patterns are used for denylist and
allowlist; update the denylist and allowlist entries under the web_fetch block
(referencing keys denylist and allowlist within the web_fetch section and noting
the wildcard format described near content_validation) to include sample items
such as a domain wildcard like "*.suspicious.com" and a literal host like
"badhost.example" for denylist and a wildcard like "*.trusted.org" for
allowlist, while keeping empty arrays as an acceptable option.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c3c8c292-6cc1-4b9c-9c6f-5ec7f5246ca6

📥 Commits

Reviewing files that changed from the base of the PR and between a9bc7b6 and a9d58f8.

📒 Files selected for processing (6)
  • docs/config.md
  • src/config/loader.rs
  • src/config/persist.rs
  • src/config/web_fetch.rs
  • src/tools/web_fetch/mod.rs
  • src/tools/web_fetch/url_validation.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/config/persist.rs
  • src/tools/web_fetch/mod.rs

endo-ly added 2 commits May 15, 2026 08:33
…es a true hard limit

Streaming overflow now returns an immediate error consistent with
Content-Length pre-check. Removed truncated/next_start_index/total_bytes
from details — only final_url and content_type remain.
max_bytes 20000 → 65536, max_scan_bytes 50000 → 65536.
Updated defaults, normalize fallbacks, persist skip-serializing
predicates, doc comments, tests, and documentation.
@endo-ly endo-ly merged commit a5363f0 into main May 15, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: add safe web_fetch tool with Markdown extraction

1 participant