Project

Web Research for AI Analysis

Timeline: 2026, ongoing

A web research workflow designed to extend document intelligence beyond internal files, preparing public web content for structured AI analysis while preserving source attribution and reviewability.

This work was designed as an extension to a Document Intelligence Platform, with the goal of making public web research usable inside the same structured analysis workflow as internal documents. Instead of leaving web search as a separate manual step, the process was built to automate the way a person would normally research online while keeping the results reviewable and attributable.

The delivery focused on preparing web content so it could be handled reliably by a language model. Search was used to surface relevant pages, those pages were converted into markdown, and the content was narrowed to the sections most likely to carry useful factual detail. That preparation step mattered because raw page text is rarely suitable for direct model input. Search results were cached, repeat requests were paced carefully, and failed network or model calls were retried so larger research runs could continue without constant operator intervention.

Source handling was treated carefully. Generated outputs were not meant to float free from their origins, so the workflow preserved references back to the original websites and published material. That gave operators a way to verify where information came from, review source context before accepting a result, and keep proper attribution attached to the people and organisations that published the underlying content.

The generation layer was built for structured Polish-language output rather than loose summarisation. Prepared source text was passed to an OpenAI-compatible model with clear instructions to produce usable, reviewable descriptions while stating openly where information was missing or unclear. A lightweight interface supported single lookups, batch intake, review, browsing, and export, while stored source text and final outputs kept the process resumable and easier to audit over time.

Challenge

The main challenge was extending a document-based analysis workflow to the open internet without lowering the quality of downstream LLM output. Search results are noisy, source quality varies, and useful material is often surrounded by navigation, boilerplate, and unrelated page sections. The workflow needed to mirror how a user would normally look for information online, but do it automatically and repeatedly across larger research sets while keeping a clear link back to the original published sources.

Solution

I built a two-part workflow: a batch processor for larger research lists and a browser-based interface for single-item checks, imports, and review. The process follows a familiar research pattern: search for relevant information, gather the most useful public sources, convert those pages into clean markdown, and prepare the strongest sections for structured Polish-language output through an OpenAI-compatible model. Search results are cached, prepared source text is stored alongside final outputs, failed network and model calls are retried, and operators can inspect both the prepared material and the original source links before using the generated result.

Outcome

The finished system extends document intelligence work into the public internet in a way that is practical for day-to-day use. It automates the same kind of online research a user would otherwise carry out manually, then prepares that material for language model analysis in a consistent format. Just as importantly, it keeps attribution attached to the original websites and content authors, so generated outputs remain traceable and reviewable rather than detached from their source material.