Integrating StreamDiffusion into a TouchDesigner network is now possible thanks to an innovative custom TOX created by DotSimulate, an independent developer based in New York City. (There are one or two other options, but this is the only one I’ve used, and it works great.) Unfortunately, you really need an Nvidia graphics card to run StreamDiffusion (RTX 3090 and 4090 seem to be the best options), so Mac users probably can’t quite join the party yet. There isn’t a publicly available implementation for the 50-series cards, but a beta version is in testing and should be available soon enough.
What is StreamDiffusion?
To put it simply, StreamDiffusion is an approach to running Stable Diffusion at extremely high speeds. In a standard workflow, generating a 1024×1024 image can take anywhere from 2 to 24 seconds. StreamDiffusion slashes generation time—often achieving 10× to 50× faster performance—by introducing a memory component into the diffusion process. Instead of starting from random noise each frame, it reuses the previous frame’s latent and applies light denoising, allowing for continuity and speed without recomputing everything from scratch. It’s a bit like how a video game engine updates only what’s needed each frame rather than redrawing the entire screen.
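To make that concrete, here’s a rough Python sketch of the idea. This is not the actual StreamDiffusion code, and the names are illustrative: rather than denoising from pure noise every frame, you lightly re-noise the previous frame’s latent and only run the tail end of the denoising schedule, much like a diffusers-style img2img pass with a low strength.

```python
# Conceptual sketch only -- NOT the StreamDiffusion implementation, just the
# "reuse the previous latent and lightly denoise it" idea expressed with
# diffusers-style objects. All names and values are illustrative.
import torch

def next_frame_latent(prev_latent, unet, scheduler, prompt_embeds, strength=0.3):
    """Lightly re-noise the previous frame's latent, then denoise it back down."""
    # Only run the last ~30% of the schedule instead of starting from pure noise.
    start = int(len(scheduler.timesteps) * (1.0 - strength))
    noise = torch.randn_like(prev_latent)
    latent = scheduler.add_noise(prev_latent, noise, scheduler.timesteps[start])

    for t in scheduler.timesteps[start:]:
        noise_pred = unet(latent, t, encoder_hidden_states=prompt_embeds).sample
        latent = scheduler.step(noise_pred, t, latent).prev_sample
    return latent
```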
(A well-deserved shoutout goes to the team behind StreamDiffusion—a group of researchers from UC Berkeley, the University of Tsukuba, and other institutions. You can read more about their work here.)
Wait, what’s a latent?
A latent is a compressed, lower-resolution map of an image that encodes qualitative features instead of raw color values. When an image is encoded into latent space (where most of the AI magic occurs), it’s transformed into something more abstract — a kind of semantic sketch of the image. It doesn’t contain pixels, but it does contain information like: “Here’s an edge,” “this region resembles a face,” or “this texture feels like concrete.” You can think of it as a rough outline layered with meaning — a conceptual version of the image that’s deeply tied to the model’s training data. The latent connects what it “sees” in the image to what it has “learned” over millions of examples, compressing visual content into a format that’s fast to process and rich with possibility.
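If you want to see a latent for yourself outside of TouchDesigner, a few lines of Python with the diffusers library will do it. The model name and shapes below are just examples; the point is how much smaller and more abstract the latent is than the image it came from.

```python
# Quick look at a latent using the diffusers library. Model name and shapes
# are illustrative; SD 1.5's VAE turns a 512x512 RGB image into a 4-channel,
# 64x64 latent -- a compressed, abstract version of the picture.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.rand(1, 3, 512, 512) * 2 - 1   # stand-in image, scaled to [-1, 1]
with torch.no_grad():
    latent = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

print(image.shape)    # torch.Size([1, 3, 512, 512])
print(latent.shape)   # torch.Size([1, 4, 64, 64]) -- roughly 48x fewer values
```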
Why Do I Care?
Understanding how the underlying system works helps you feed it data that’s actually useful. Whether or not the imagery you send into a StreamDiffusion setup looks good in a traditional sense is somewhat beside the point — unless you’re blending the input and output directly. In that case, you’ll want a layer of preprocessing before passing the image to the AI. What the model is really looking for are contours and patterns — shifts in texture and color, spatial relationships like near vs. far, and broad visual categories like tree vs. rock or person vs. animal. If your input isn’t feature-rich, the model has nothing meaningful to grab onto, and your output will reflect that.

Let’s look at an example.
You’re building a Kinect (or Orbbec)-based installation — something like a Magic Castle, where particles shoot out of a person’s hands and blossom into natural wonders: flowers, birds, fish. Using skeletal data, you generate particle sources at the hands, and apply vector fields — maybe some beautiful curl noise — to make the particles swirl, trail, and fade gracefully with feedback. Visually, it’s stunning.
So, naturally, you isolate the output of that particle system and send it into StreamDiffusion, expecting it to riff on your prompt and transform those swirls into imaginative scenes. But what comes out is kind of… junk.
Why? Because the system doesn’t have much to work with. All it’s seeing is lines — elegant ones, sure — but still just contours. When that gets encoded into latent space, the model doesn’t think “butterfly” or “snowflake” or “forest spirit.” It thinks: “squiggle.” There’s no depth, no texture, no structure to hook into. So instead of generating rich, scene-driven imagery, it does its best with what it’s given… and you get a mess of abstract fragments with no real cohesion.
All Noise Is Not Created Equal

Once again, as is so often the case, noise to the rescue! Take your particle system output, blur it heavily, and multiply it by noise. Noise is a big part of building a successful generative AI art system.
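In TouchDesigner this is just a Blur TOP followed by a Composite TOP set to multiply against a Noise TOP. Here’s a rough NumPy/OpenCV sketch of the same operation, with illustrative values, in case it helps to see the math:

```python
# Rough equivalent of the TouchDesigner chain described above: heavy blur,
# then multiply by low-frequency noise. Values are illustrative, not tuned.
import cv2
import numpy as np

def enrich_particles(particles_rgb):
    """Turn thin particle trails into feature-rich input for the diffuser."""
    # A heavy blur turns thin lines into broad regions of varying density.
    blurred = cv2.GaussianBlur(particles_rgb.astype(np.float32) / 255.0,
                               (0, 0), sigmaX=25)
    # Low-frequency noise adds texture and contrast for the model to grab onto.
    h, w = particles_rgb.shape[:2]
    noise = cv2.resize(np.random.rand(h // 16, w // 16).astype(np.float32),
                       (w, h), interpolation=cv2.INTER_CUBIC)
    enriched = blurred * noise[..., None]
    return (np.clip(enriched, 0, 1) * 255).astype(np.uint8)
```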

Take a look at the further examples below (this time working in ComfyUI, but the general rule still applies). Building noise systems that have variety and contours yields much better results than static noise systems.


Taking this concept a step further, consider the variety of noise systems you can deploy with respect to color, shape, and texture. A basic multi-colored noise system doesn’t offer much in the way of contrast, even though it gives the diffusion engine a variety of color. There are plenty of ways to interpret shapes in such a system, but no obvious way for the model to differentiate and categorize regions.

Take a look at the example below. Using color inversion and a series of lookups and multipliers in TouchDesigner (including the specialized noise TOX from the palette) to offer the diffusion system a variety of densities, contours, and color distinctions produces much more detailed and interesting results. Note that the term “3D” appears in both prompts, but in the example above, Stable Diffusion struggles to find anything it can read as depth information, because nothing is noticeably larger or smaller than anything else.
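If you’d rather reason about it in code than in TOPs, here’s a minimal NumPy sketch of the same idea: layer coarse and fine noise, push it through a simple color lookup, and invert part of the range so the regions stay clearly differentiated. Everything here is illustrative.

```python
# Minimal sketch of layering noise for contrast and categorization: coarse
# "near/far" blobs, fine texture, a two-color lookup, and a partial inversion.
# All values are illustrative.
import numpy as np

def layered_noise(h=512, w=512, seed=0):
    rng = np.random.default_rng(seed)

    def coarse_noise(cells):
        """Coarse random grid blown up to full size -- blocky but low-frequency."""
        grid = rng.random((cells, cells))
        return np.kron(grid, np.ones((h // cells, w // cells)))

    base = coarse_noise(8)       # large blobby regions (near vs. far cues)
    detail = coarse_noise(64)    # fine texture riding on top
    density = base * (0.5 + 0.5 * detail)

    # Simple "lookup": blend density between two contrasting colors, then invert
    # the darker regions to sharpen the boundaries between them.
    warm = np.array([0.9, 0.5, 0.2])
    cool = np.array([0.1, 0.3, 0.8])
    img = density[..., None] * warm + (1.0 - density[..., None]) * cool
    dark = density < 0.4
    img[dark] = 1.0 - img[dark]
    return (np.clip(img, 0, 1) * 255).astype(np.uint8)
```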

Here’s another example, this time in the context of a Magic Mirror setup in StreamDiffusion. Notice the noise system entering the diffuser: rich in texture and color variation…

Using Feedback
One of the most common complaints about StreamDiffusion is that the output looks jumpy or “skippy,” because every frame differs from the last. Yes, the system DOES consider the latent from the previous frame, but it doesn’t make any special effort to blend frames as they come out of the diffusion engine. Fortunately, TouchDesigner is great at smoothing data, and that’s extremely helpful when trying to calm the hiccups and occasional rapid transformations the system generates.
A basic feedback system using an opacity reduction, like the one Elburz outlines in this tutorial, can be a great way to soften jagged frames. In fact, I’m apt to use one on BOTH SIDES of the StreamDiffusionTD TOX. Smoothing out sudden changes before your imagery enters the system AND after it comes out makes for a pleasing blurring of the animation over time. (It doesn’t have to be so intense that you see trails. A subtle addition of feedback will simply make the output feel more satisfying and cohesive.)
One last tip on feedback: if your feedback system is causing colors to wash out (as pixels build up on each other and whiten), simply put a Level TOP before your feedback loop and lower the opacity. This trick lets you get a little more aggressive with your feedback, particularly if you really want long trails and smearing of visual data across more than just a few frames.
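Mathematically, that whole chain boils down to something like an exponential blend between the incoming frame and the accumulated history. Here’s a minimal sketch, with illustrative values, including a rough analogue of the Level TOP trick above:

```python
# Minimal sketch of what an opacity-reduction feedback chain is doing: each
# output frame is a weighted blend of the new frame and the accumulated
# history. Values are illustrative.
import numpy as np

class FeedbackSmoother:
    def __init__(self, decay=0.8, pre_gain=1.0):
        self.decay = decay        # how much of the history survives each frame
        self.pre_gain = pre_gain  # attenuate incoming frames first -- a rough
                                  # analogue of the "Level TOP before feedback" trick
        self.history = None

    def step(self, frame):
        frame = frame.astype(np.float32) * self.pre_gain
        if self.history is None:
            self.history = frame
        else:
            self.history = self.decay * self.history + (1.0 - self.decay) * frame
        return np.clip(self.history, 0, 255).astype(np.uint8)

# Used on both sides of the diffuser: smooth the input on the way in, and
# smooth the generated frames again on the way out.
pre_smooth = FeedbackSmoother(decay=0.6)
post_smooth = FeedbackSmoother(decay=0.75, pre_gain=0.9)
```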
V2V
The StreamDiffusion TOX includes a Video-to-Video (V2V) feature designed to enhance visual coherence and stability across frames. It works by caching latent information from previous frames and directing the diffusion process toward areas of change — a clever approach that aims to reduce flicker and keep imagery consistent over time.
However, enabling V2V disables TensorRT acceleration, which is essential for achieving high frame rates. For my purposes, the performance hit is too steep to justify the gain in stability. As a result, I’ve steered clear of V2V and instead leaned on traditional feedback techniques to smooth out motion and reinforce temporal continuity.
That said, I’d love to be proven wrong — if anyone’s found a way to run V2V efficiently or combine it with TensorRT-like speed, I’m all ears.
NVIDIA Upscale
This might seem obvious to some, but it’s surprising how many people haven’t explored the relatively new NVIDIA Upscaler TOP in TouchDesigner. And it’s worth a mention—because one of the most frustrating limitations of StreamDiffusion is its resolution ceiling. Try running it much above 1024×1024, and your framerate tanks to something closer to a slideshow than a live animation (think: 4 fps). Personally, I keep things light—512×512 or 512×768—because speed is everything in a real-time context.
But low resolution doesn’t have to mean blocky, LEGO-brick visuals.
Enter the NVIDIA Upscaler. This AI-powered TOP has been trained to recognize and smooth contours in your images. It’s not doing any deep semantic analysis (which is probably why it’s so fast), but it does a great job of making low-res imagery look much cleaner—without sacrificing too much performance.
It runs quite happily alongside the StreamDiffusion TOX, as long as you don’t overdo it. I usually set the Upscale factor to 4, with a strength around 0.65. Yes, this will cost you a few frames per second, but the improvement in visual quality is absolutely worth it.
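If you prefer to set this from a script rather than the parameter dialog, it’s a couple of lines of TouchDesigner Python. Note that the operator and parameter names below are my assumptions, not confirmed internal names; check the Nvidia Upscaler TOP’s parameters in your build.

```python
# Hypothetical TouchDesigner snippet -- operator and parameter names are
# assumptions; verify the actual internal names in your build's parameter dialog.
ups = op('nvidiaupscaler1')   # assumed default name of the Nvidia Upscaler TOP
ups.par.scale = 4             # assumed internal name for the Upscale factor
ups.par.strength = 0.65       # assumed internal name for the Strength parameter
```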

No, it’s not going to match the sharpness and detail of an image generated natively at a high resolution in Stable Diffusion. There’s definitely some fidelity loss. But that’s not the point of StreamDiffusion. The magic is in the real-time responsiveness. With a little help from the NVIDIA Upscaler, you can push your output resolution into a territory that looks surprisingly good—even on standard GPU hardware—and keep that smooth, live generative feel intact.
In conclusion, you might take a look at my StreamDiffusion Tutorial, which walks you through the basic steps of getting the system up and running. And here’s a peek at an installation that uses the technology, hot off the presses. Happy diffusing!