Netflix’s VOID AI: Redefining Video Editing with Interaction-Aware Object Removal
Netflix has open-sourced a powerful new AI framework called VOID (Video Object and Interaction Deletion), a tool designed to “rewrite” video scenes after they’ve been filmed. Unlike traditional video inpainting tools that simply erase an object and fill the gap, VOID understands the physical relationship between objects. This allows it to remove not only the target object but also the physical interactions it induced on the surrounding scene.
Released on April 3 under an Apache 2.0 license and available via Hugging Face and GitHub, VOID represents a significant leap in how AI handles temporal consistency and physical realism in video manipulation.
Beyond Simple Erasure: The Power of Interaction Deletion
Most object removal tools stop at secondary effects, such as erasing the removed object's shadows or reflections. VOID goes further by addressing physical interactions. For example, if a person holding a guitar is removed from a scene, VOID doesn’t just leave a floating instrument; it recognizes the interaction and causes the guitar to fall naturally.
This capability transforms video editing from a process of “patching” to one of “scene reconstruction,” allowing creators to alter the physical logic of a shot without needing to reshoot the entire sequence.
The Technical Engine: CogVideoX and Quadmask Conditioning
VOID is built upon the CogVideoX-Fun-V1.5-5b-InP base model, utilizing a 3D Transformer architecture with 5 billion parameters. The core of its intelligence lies in its “interaction-aware quadmask conditioning.”
Instead of a simple binary mask (keep vs. remove), VOID uses a 4-value mask that encodes specific roles for different areas of the frame:
- Primary Object: The specific item or person to be removed.
- Overlap Regions: Areas where the object intersects with other elements.
- Affected Regions: Areas containing items that should react to the removal, such as objects that should fall or be displaced.
- Background: The static areas of the scene that must remain unchanged.
To guide the generation, the model takes the video, the quadmask, and a text prompt describing what the scene should look like after the removal.
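To make the four roles concrete, here is a minimal sketch of how two binary masks might be merged into a single 4-value mask. The label values, the function name `build_quadmask`, and the choice to treat the overlap as the intersection of the primary and affected masks are all assumptions made for illustration; the encoding VOID actually uses may differ.

```python
import numpy as np

# Hypothetical label values chosen for this sketch; not taken from VOID.
PRIMARY, OVERLAP, AFFECTED, BACKGROUND = 0, 1, 2, 3


def build_quadmask(primary: np.ndarray, affected: np.ndarray) -> np.ndarray:
    """Merge two boolean (H, W) masks into one 4-value mask.

    primary  -- True where the object to remove is visible
    affected -- True where other items should react to the removal
    """
    primary = primary.astype(bool)
    affected = affected.astype(bool)
    if primary.shape != affected.shape:
        raise ValueError("masks must have the same shape")

    # Start with everything as background, then overwrite by role.
    quad = np.full(primary.shape, BACKGROUND, dtype=np.uint8)
    quad[affected] = AFFECTED
    quad[primary] = PRIMARY
    # Assumption: overlap is where the two input masks intersect.
    quad[primary & affected] = OVERLAP
    return quad


if __name__ == "__main__":
    primary = np.zeros((4, 6), dtype=bool)
    primary[1:3, 1:4] = True   # the object to remove
    affected = np.zeros((4, 6), dtype=bool)
    affected[2:4, 3:6] = True  # an item that should react
    print(build_quadmask(primary, affected))
```

A single-channel mask like this carries strictly more information than a binary one: the model can tell "erase this" apart from "this should move" and "leave this alone".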
A Two-Pass Approach to Temporal Consistency
To ensure that the edited video doesn’t flicker or warp unnaturally over time, VOID employs a sequential two-pass inference process:
- Pass 1 (Base Inpainting): This uses the `void_pass1.safetensors` checkpoint to perform the primary removal and fill. This pass is sufficient for many shorter or simpler videos.
- Pass 2 (Warped-Noise Refinement): For longer clips requiring higher temporal consistency, the `void_pass2.safetensors` model applies optical flow-warped latent initialization to refine the output and smooth out transitions.
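The idea behind flow-warped initialization can be illustrated with a small sketch: instead of drawing fresh noise for every frame, the previous frame's noise is moved along the optical flow so that noise "sticks" to the scene content. The code below is not VOID's implementation. It works on a single 2D noise map rather than latents, uses nearest-neighbor sampling, and assumes a backward flow convention; the function name `warp_noise` is hypothetical.

```python
import numpy as np


def warp_noise(prev_noise: np.ndarray, backward_flow: np.ndarray) -> np.ndarray:
    """Warp a (H, W) noise map with a (H, W, 2) backward flow field.

    Assumed convention: backward_flow[y, x] = (dx, dy) points from pixel
    (x, y) in the current frame to its source location in the previous frame.
    """
    h, w = prev_noise.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbor lookup, clamped to the frame borders.
    src_x = np.clip(np.rint(xs + backward_flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + backward_flow[..., 1]).astype(int), 0, h - 1)
    return prev_noise[src_y, src_x]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.standard_normal((4, 5))
    # Every pixel came from one column to its left: content moves right by 1.
    flow = np.zeros((4, 5, 2))
    flow[..., 0] = -1.0
    warped = warp_noise(noise, flow)
    print(np.array_equal(warped[:, 1:], noise[:, :-1]))
```

Note that clamping at the borders duplicates noise values, so this sketch does not preserve the statistics of the noise; a production method would need to handle that.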
Hardware Requirements and Implementation
Because of the complexity of 3D Transformers and high-resolution video processing, VOID requires significant computing power. The model’s default resolution is 384×672, supporting up to 197 frames, and it uses BF16 precision with FP8 quantization to keep memory use manageable.
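A quick back-of-the-envelope calculation shows why precision matters at this scale. The sketch below only counts bytes for the model weights and raw pixel values at the default clip size; it ignores activations, the text encoder, the VAE, and framework overhead, so it is a lower bound rather than a prediction of real VRAM use. The helper names are hypothetical.

```python
PARAMS = 5_000_000_000  # 5 billion parameters, as stated in the article
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def pixel_values_per_clip(side_a: int = 384, side_b: int = 672,
                          frames: int = 197, channels: int = 3) -> int:
    """Number of raw RGB values in one clip at the default settings."""
    return side_a * side_b * frames * channels


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(dtype, weight_memory_gb(PARAMS, dtype), "GB")
    print("pixel values per clip:", pixel_values_per_clip())
```

Under these assumptions the weights alone take roughly 10 GB in BF16 and 5 GB in FP8, which gives a sense of how much of a 40GB card is left for everything else.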

System Requirements:
- GPU: A GPU with 40GB+ VRAM is required (e.g., an NVIDIA A100).
- Dependencies: The mask pipeline integrates Gemini via the Google AI API for Stage 1 and utilizes SAM2 (Segment Anything Model 2) for mask generation.
Key Takeaways: VOID AI at a Glance
| Feature | Detail |
|---|---|
| Base Architecture | CogVideoX 3D Transformer (5B Parameters) |
| Core Innovation | Interaction-aware quadmask conditioning |
| License | Apache 2.0 (Open Source) |
| Max Frame Capacity | 197 frames |
| Hardware Need | 40GB+ VRAM GPU |
By open-sourcing VOID, Netflix is providing the creative community with a tool that moves beyond static editing and into the realm of dynamic scene rewriting. As these models become more efficient, the ability to manipulate physical interactions in video will likely become a standard part of the digital production pipeline.