Accelerating and improving the accuracy of manufacturing process capture with VLMs
We explore how visual language models (VLMs) can speed up manufacturing process capture and keep work instructions accurate and easy to maintain.
TL;DR
Problem
- Creating manufacturing work instructions manually is a lot of work (capture, media, writing).
- Instructions are hard to maintain; updates often mean redoing the whole workflow.
Context
- Visual language models (VLMs) join vision and text: they can interpret demonstrations, frames, and prompts in one pipeline—and they’re getting more capable and deployable (cloud and increasingly local).
Solution
- VLMs are an unlock for faster, richer process documentation—in our trials, creation time dropped by as much as ~90% vs fully manual methods.
- Owning the creation pipeline (capture setup, prompts, review, and where the model runs) is what makes it work in production.
Problems
Capturing
Creating accurate and accessible manufacturing procedures is a critical but time-intensive task for many organizations. The process is complicated by several persistent challenges that slow down the pace of progress and introduce opportunities for error. First and foremost are the technical obstacles. Capturing the necessary photos and videos to illustrate each step in a procedure often requires juggling multiple devices—cameras, tablets, smartphones—all of which create their own media formats and require different methods for transferring files. These assets frequently must be moved manually, typically with SD cards, cables, or unreliable wireless transfers, resulting in a workflow that is both tedious and error-prone. Files can become misplaced, duplicated, or corrupted, adding further frustration and lost time.
Editing
Once these images and videos are finally collected, the challenge shifts to documentation. There is no universally accepted tool specifically designed for creating rich, multimedia work instructions in the manufacturing sector. Many teams default to generic office applications—Word, PowerPoint, Excel—to organize their content. Unfortunately, none of these tools are truly optimized for this purpose. Embedding and arranging media can be clumsy, and collaboration is limited or unwieldy, especially when teams need to keep documentation up-to-date as processes evolve. Furthermore, these formats make it difficult to standardize procedures across departments or sites, leading to inconsistencies in quality and presentation.
Maintenance
This lack of effective tools and integrated workflows directly impacts operational efficiency. Technically oriented staff are forced to spend a disproportionate amount of time on basic documentation tasks rather than on higher-value activities such as process improvement. Maintenance of instructions is also tedious—updates or revisions to existing documentation typically require repeating the entire manual workflow, making it less likely that procedures reflect the latest best practices on the shop floor. Ultimately, these technical and procedural bottlenecks slow down the transfer of knowledge in manufacturing settings, limit the ability to respond to changes quickly, and hinder the adoption of new or improved work methods.
Given / Context
At a high level, a vision–language model (VLM) combines an image path (preprocessor → ViT → projection into token space) with a text path (tokenizer → embeddings) inside a shared LLM, then decodes language tokens back to text (embeddings → de-tokenizer). The schematic below follows that multimodal layout—aligned with the prose in the following subsections.
What Are Visual Language Models?
Visual Language Models (VLMs), also known as Vision-Language Models, represent the state of the art in integrating visual and textual information. VLMs are extensions of the powerful transformer architecture that underpins natural language processing advances, but they are adapted to jointly process both images (or videos) and text. At the technical core, these models receive visual inputs (such as frames from a camera or screenshots) and textual inputs (such as instructions or prompts), and learn to encode both modalities into a shared semantic space. This enables the models to “understand” not just what is shown in an image, but also how it relates to language, tasks, or step-by-step procedures.
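To make that shared-space idea and the image-path/text-path layout described above concrete, here is a minimal sketch of the dataflow in plain Python with NumPy stand-ins; every module, shape, and weight below is a hypothetical placeholder rather than a real model.

```python
# Minimal, illustrative sketch of the VLM dataflow described above.
# All modules, shapes, and weights are hypothetical stand-ins, not a real model.
import numpy as np

D = 512  # assumed embedding width of the shared language model

def preprocess(image: np.ndarray) -> np.ndarray:
    """Normalize and split the image into flattened patches (naive reshape as a stand-in)."""
    patches = image.reshape(-1, 16 * 16 * 3)             # pretend 16x16 RGB patches
    return (patches - patches.mean()) / (patches.std() + 1e-6)

def vit_encode(patches: np.ndarray) -> np.ndarray:
    """Stand-in for a ViT: one patch in, one visual embedding out."""
    W = np.random.randn(patches.shape[1], 1024) * 0.02
    return patches @ W                                    # (num_patches, 1024)

def project_to_token_space(visual: np.ndarray) -> np.ndarray:
    """Projection layer mapping visual embeddings into the LLM's token space."""
    W = np.random.randn(visual.shape[1], D) * 0.02
    return visual @ W                                     # (num_patches, D)

def embed_text(prompt: str) -> np.ndarray:
    """Tokenizer + embedding table stand-in: one byte acts as one 'token'."""
    ids = np.frombuffer(prompt.encode("utf-8"), dtype=np.uint8)
    table = np.random.randn(256, D) * 0.02
    return table[ids]                                     # (num_tokens, D)

# The image path and the text path meet as one token sequence that the shared
# LLM decodes from; the decoder and de-tokenizer are omitted here.
image = np.random.rand(224, 224, 3)
sequence = np.concatenate([
    project_to_token_space(vit_encode(preprocess(image))),
    embed_text("Describe the assembly step shown."),
])
print(sequence.shape)  # (visual tokens + text tokens, D)
```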
How Are VLMs Trained and What Can They Do?
Contemporary VLMs, including OpenAI’s GPT-4V, Google’s Gemini, Alibaba’s Qwen-VL, and open models such as LLaVA and Hugging Face’s IDEFICS, are pretrained on massive datasets composed of image-text pairs scraped from the web, as well as increasingly curated collections such as instructional manuals or domain-specific datasets. During training, they learn to generate descriptions (captions), answer questions about images (“what is happening here?”), and even generate step lists or detailed instructions grounded in what’s visible. Their vision encoders are typically based on high-capacity models like CLIP or vision transformers (ViT), while their language components leverage large transformer LLMs. After pretraining, these models can be further fine-tuned for specialized tasks, such as process documentation or manufacturing instructions.
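As a small illustration of the captioning behaviour, the sketch below sends one captured frame to a hosted VLM through an OpenAI-style multimodal chat endpoint; the model name, file name, and prompt are assumptions, and any provider with an equivalent API could be substituted.

```python
# Hedged sketch: ask a hosted VLM to describe one captured frame as a work-instruction step.
# Assumes an OpenAI-style multimodal chat endpoint; model, file name, and prompt are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("frame_042.jpg", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this assembly step as a single, imperative work instruction."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```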
Multi-Modal Processing and Practical Benefits
Importantly, VLMs work in a “multi-modal” manner: that is, they are able to connect what they “see” with what they “read” or “write.” For example, given a demonstration video of an assembly process, a VLM can segment the process into discrete steps, extract key frames, and generate concise, human-readable explanations for each stage. This enables a streamlined documentation workflow where much of the tedious manual description and screenshotting is handled by the model. Advanced VLMs can run on the cloud and, increasingly, on local hardware, allowing for flexible deployment depending on organizational security or data privacy needs. As the technology matures, VLMs are rapidly closing the performance gap between proprietary (cloud-based) offerings and open-source, locally deployable models, making them an attractive option for accelerating and improving process knowledge capture.
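A minimal sketch of that workflow, assuming OpenCV is used for the capture side: sample candidate key frames from a demonstration video, keep them as media assets, and hand each one to whichever VLM endpoint is in use. The describe_step helper is a hypothetical placeholder to be wired to the hosted call shown earlier or to a local model.

```python
# Minimal sketch of the capture workflow: sample candidate key frames from a
# demonstration video, then let a VLM turn each one into a step description.
import cv2

def sample_frames(video_path: str, every_n_seconds: float = 2.0):
    """Yield (timestamp_seconds, frame) pairs sampled at a fixed interval."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * every_n_seconds))
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield index / fps, frame
        index += 1
    cap.release()

def describe_step(frame) -> str:
    """Hypothetical placeholder: send the frame to your cloud or local VLM endpoint."""
    return "<step description from the VLM>"  # replace with a real VLM call

if __name__ == "__main__":
    steps = []
    for t, frame in sample_frames("assembly_demo.mp4"):        # placeholder file name
        cv2.imwrite(f"keyframe_{int(t):04d}s.jpg", frame)      # keep the media asset
        steps.append((t, describe_step(frame)))                # caption it
    for t, text in steps:
        print(f"[{t:7.1f}s] {text}")
```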
Solution / Results
Accelerated Process Creation
The introduction of Visual Language Models (VLMs) into the process documentation workflow has proven transformative, dramatically reducing the time and effort required to create accurate manufacturing procedures. In our trials, leveraging VLMs for process capture and step description accelerated process creation by as much as 90%, compared to traditional, fully manual methods. What used to take hours—such as extracting key frames from video, generating step-by-step instructions, and organizing multimedia—can now be accomplished in mere minutes. VLMs can automatically interpret demonstration videos, segment procedures into logical steps, generate concise captions, and even enrich documentation with contextually relevant details that may have otherwise been overlooked.
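Once a VLM has produced steps and key frames (as in the capture sketch above), assembling them into a shareable document is mechanical; the sketch below writes a simple Markdown work instruction, and the step titles, captions, and file names are purely illustrative.

```python
# Sketch: assemble VLM-generated steps and their key frames into a Markdown
# work instruction. Step content and file names are illustrative assumptions.
from pathlib import Path

steps = [
    {"title": "Seat the bracket",
     "caption": "Align the bracket with the two locating pins.",
     "image": "keyframe_0002s.jpg"},
    {"title": "Fasten the bracket",
     "caption": "Drive both screws until snug.",
     "image": "keyframe_0014s.jpg"},
]

lines = ["# Work instruction: bracket assembly (example)", ""]
for i, step in enumerate(steps, start=1):
    lines += [f"## Step {i}: {step['title']}",
              "",
              f"![{step['title']}]({step['image']})",
              "",
              step["caption"],
              ""]

Path("work_instruction.md").write_text("\n".join(lines), encoding="utf-8")
print("wrote work_instruction.md")
```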
Quality, Security, and Control
Beyond sheer speed, VLMs offer additional advantages in maintaining both quality and security. While the most powerful VLMs currently come from proprietary cloud providers—offering best-in-class performance—recent advancements in open-weight and locally deployable models have begun to close the gap. Many organizations are understandably concerned about the privacy of their manufacturing data. The good news is that VLMs can increasingly be run on local hardware, ensuring that sensitive videos and process information never leave the organization’s secure environment. Although open-source models may have previously lagged behind commercial alternatives, they are now reliable enough to deliver substantial efficiency gains, and they offer a promising path for companies seeking more control over their data.
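For the locally deployed route, one common option is loading an open-weight model such as LLaVA through the Hugging Face transformers library; the sketch below assumes that integration, and the exact model id, prompt template, and device settings will vary by model and release.

```python
# Hedged sketch: run an open-weight VLM locally so frames never leave the
# internal network. Assumes the Hugging Face transformers LLaVA integration;
# the model id, prompt template, and dtype/device settings are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"   # placeholder open-weight VLM
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("frame_042.jpg")  # placeholder frame from the capture step
# LLaVA-1.5 style prompt; other models expect different chat templates.
prompt = "USER: <image>\nDescribe this assembly step as one work instruction. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=80)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```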
Impact on Documentation
In summary, by integrating VLMs into manufacturing process capture, organizations are able to greatly increase documentation speed, reduce repetitive manual tasks, and ensure that work instructions are kept accurate and up to date. Whether using state-of-the-art cloud services or privacy-preserving local deployments, the current wave of VLMs makes it possible to systematize and scale process knowledge in a way that was not previously feasible.
// Content: Video of a person wearing a head cam, executing a simple POV assembly task.
// Content: ?? UI showing how you can wirelessly connect a GoPro.
// Content: A breakdown of the cost structure: camera setup, conversion cost per minute of video (based on token usage, etc.).
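As a starting point for that breakdown, a back-of-envelope estimator of model cost per minute of captured video; every figure below is a placeholder to be replaced with measured token counts and your provider's (or amortized local hardware) pricing, and the camera setup is a separate one-off cost.

```python
# Back-of-envelope cost estimator for VLM-based process capture.
# Every figure is a placeholder; substitute your own sampling rate, measured
# token counts, and provider (or amortized local hardware) pricing.
frames_per_minute = 30             # e.g. one candidate key frame every 2 seconds
input_tokens_per_frame = 850       # placeholder: encoded image + prompt tokens
output_tokens_per_frame = 100      # placeholder: generated caption tokens
usd_per_1k_input_tokens = 0.005    # placeholder provider price
usd_per_1k_output_tokens = 0.015   # placeholder provider price

cost_per_minute = frames_per_minute * (
    input_tokens_per_frame / 1000 * usd_per_1k_input_tokens
    + output_tokens_per_frame / 1000 * usd_per_1k_output_tokens
)
print(f"~${cost_per_minute:.3f} of model usage per minute of captured video")
```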
Underlining this idea, we proposed a set of challenge tasks with seemingly simple everyday behaviours: spreading peanut butter, washing a greasy pan, putting a key in a lock, and turning socks inside-out. These tasks might not seem as cognitively demanding as abstract problems, but experts believe they present exceptional challenges for autonomous systems. We wanted to see how many we could tackle.
The Events
The proposed tasks are separated into categories, with bronze, silver, and gold levels within each. We did not optimize for the highest success rate, and the policies are often not consistent, though on average we observed meaningful progress.
🥇 Event 1: Full body. The gold-medal task in this category is to open and go through a self-closing door. This is hard because the system has to keep the door open while moving through it.
🥈 Event 2: Laundry. The gold-medal task is to hang an inside-out dress shirt after turning it right-side out. We tackled the silver-medal task instead: turning a sock inside-out. This task is quite difficult due to the shape of the gripper, but the policy was able to learn it with sufficient data.
🥇 Event 3: Basic tool use. We tested all three tasks in this category. The gold-medal task is to use a key. This is hard because it requires fine manipulation and reorienting the key without putting it down. The silver-medal task is to make a peanut butter sandwich. We believe this task is actually harder: it requires using a knife to scoop and spread with delicate application of force.
Why are the easy things so hard?
Our ancestors rarely had to calculate multivariate integrals, but they had to contend with unforgiving physical challenges on a daily basis. As a result, our minds are very well tuned to manipulating objects and solving everyday physical challenges. We immediately notice how hard it is to repurpose our brains for abstract problems, but we hardly break a sweat when we use them for exactly the things they evolved for.
Precisely because we are so good at physical interaction, building machines that can interact with the physical world is harder than building machines that solve cognitive tasks. We can “explain” to a machine how to perform a task through a programming language, but this is no more effective than “explaining” a task to a person.
Capturing and applying prior knowledge
Language models are so powerful precisely because they can capture large quantities of knowledge and then generalize in a compositional manner. But language models by themselves do not solve physical intelligence, because they are trained on human communication, which does not communicate physical skills. We don’t post detailed instructions on a web forum about how to move your arm to clean a greasy pan, because everyone already knows it. The key is to integrate prior knowledge with diverse and representative data of real physical behaviours.
If you are excited about these ideas and would like to join us, then get in touch!