In light of Twilio's announcement deprecating Programmable Video, here are some tips for those migrating from Twilio Video to the Zoom Video SDK for web.
For context, at Multi we build on a mix of Zoom Video SDK and WebRTC. Although our focus is on our native macOS app, we also built a web experience using the Zoom Video SDK in November 2022. Here are our learnings.
1. The sample app is the source of truth
In the first couple of weeks of our work, we based our implementation on the official documentation, which is written for vanilla JS. However, we discovered that the React sample app contains answers to several questions we had:
- How do we handle gallery view in browsers that don’t use shared array buffer and the offscreenCanvas API (e.g. Safari and Firefox)?
  - The offscreenCanvas API is now supported by Safari and Firefox in both desktop and mobile browsers. As of summer 2023, Zoom supports gallery view across all mobile and desktop browsers.
 
- Which events should we subscribe to in order to handle users joining, leaving, and changing their mic/camera settings?
  - There is now expanded documentation on how to handle these cases.
 
- How do we render videos of remote participants already in the call for a new joiner?
  - The key is to use getAllUser() after joining to check for existing participant videos. See here for updated documentation with examples.
 
- How do we handle users clicking on the “Stop sharing” button provided by the browser?
  - Note: The Zoom Video SDK now provides a passively-stop-share event for this purpose.
 

Browser-provided screen share buttons.
The React sample app subscribes to three events (user-added, user-updated, user-removed) to handle participants joining and leaving, as well as participants changing their mute/unmute for camera and microphone. Though the documentation suggests peer-video-state-change for rendering remote participant videos, this isn’t the full picture: for users that join while others are already present in the call, no events will be triggered.
Note: As of December 2023, the official documentation has been updated to use the user-added and user-removed events for user joins and leaves.
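A minimal sketch of how the three events could feed one participant roster. The `ParticipantPatch` shape, the `Roster` type, and `applyRosterEvent` are hypothetical names for illustration. The sketch also assumes each event delivers an array of participant objects and that user-updated carries only the changed fields, so check the SDK reference for the exact payloads.

```typescript
// Hypothetical local shape; the SDK's Participant interface has more fields.
interface ParticipantPatch {
  userId: number;
  displayName?: string;
  bVideoOn?: boolean;
  muted?: boolean;
}

type Roster = Map<number, ParticipantPatch>;

// Apply one event payload and return a new Map, so that React state
// updates see a changed reference.
function applyRosterEvent(
  roster: Roster,
  event: "user-added" | "user-updated" | "user-removed",
  payload: ParticipantPatch[],
): Roster {
  const next = new Map(roster);
  for (const patch of payload) {
    if (event === "user-removed") {
      next.delete(patch.userId);
    } else {
      // Merge instead of overwrite: user-updated is assumed to be partial.
      next.set(patch.userId, { ...next.get(patch.userId), ...patch });
    }
  }
  return next;
}

// Possible wiring inside a React effect (setRoster is a hypothetical state setter):
//   client.on("user-added", (p) => setRoster((r) => applyRosterEvent(r, "user-added", p)));
//   client.on("user-updated", (p) => setRoster((r) => applyRosterEvent(r, "user-updated", p)));
//   client.on("user-removed", (p) => setRoster((r) => applyRosterEvent(r, "user-removed", p)));
```

Keeping the merge logic in a pure function makes it easy to unit test without joining a session.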
To solve this problem, we look to the React sample app, where client.getAllUser() is used shortly after joining the session. We get all remote participants using this function, and iterate through each participant’s bVideoOn property to decide whether to render their video.
See more info on the Participant interface here.
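The join-time check described above can be sketched as follows. `SessionClient` and `RemoteParticipant` are hypothetical, trimmed-down stand-ins for the SDK's client and Participant types, and `participantsToRenderOnJoin` is an illustrative helper, not code from the sample app.

```typescript
// Hypothetical subset of the SDK's Participant interface.
interface RemoteParticipant {
  userId: number;
  bVideoOn: boolean;
}

// Hypothetical subset of the SDK client used in this sketch.
interface SessionClient {
  getAllUser(): RemoteParticipant[];
  getCurrentUserInfo(): { userId: number };
}

// Returns the ids of remote participants whose camera is already on.
// Call it shortly after joining, because no events fire for users who
// were in the session before the local user arrived.
function participantsToRenderOnJoin(client: SessionClient): number[] {
  const selfId = client.getCurrentUserInfo().userId;
  return client
    .getAllUser()
    .filter((p) => p.userId !== selfId && p.bVideoOn)
    .map((p) => p.userId);
}
```

Each returned id can then be passed to whatever code renders a remote video onto the canvas.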
2. Rendering remote user video streams: one canvas for everyone or one canvas per person?
Note: There is an update coming in the Zoom Video SDK in 2024 to allow video to be rendered on individual HTML elements.
One canvas per person is not officially supported by the Zoom Video SDK and has significant performance drawbacks. (Note: As of December 2023, @tommygaessler from Zoom reports that a Video SDK update supporting video per HTML element is coming soon.) While we experimented with this approach, we ultimately found a single canvas rendering all participants to be the most performant and stable option. It's important to us that the web experience be as elegant as our macOS app, because web calls are many users’ first impression of Multi.

A Multi web call with five users.
The supported implementation (single HTML canvas element for all remote participants) is far more performant for calls with more than ~4 users, can show up to 25 users in a grid, and, specifically for React, has the benefit of only needing to track a single ref created through the useRef() hook (check out our post on creating, forwarding and using refs in React).
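To make the single-canvas idea concrete, here is a rough sketch of computing where each participant's tile could sit on the shared canvas. `layoutGrid` and `Cell` are hypothetical names. The near-square grid and the 16:9 tiles are assumptions, while the cap of 25 comes from the limit mentioned above. Coordinates use a top-left origin, so check which corner the SDK treats as the origin before passing them to its render call.

```typescript
interface Cell {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Lay out up to 25 tiles in a near-square grid of 16:9 cells.
function layoutGrid(count: number, canvasWidth: number, canvasHeight: number): Cell[] {
  const n = Math.min(Math.max(count, 0), 25);
  if (n === 0) return [];
  const columns = Math.ceil(Math.sqrt(n));
  const rows = Math.ceil(n / columns);
  // Largest 16:9 cell that fits both the column width and the row height.
  const width = Math.min(canvasWidth / columns, (canvasHeight / rows) * (16 / 9));
  const height = width * (9 / 16);
  const cells: Cell[] = [];
  for (let i = 0; i < n; i++) {
    const col = i % columns;
    const row = Math.floor(i / columns);
    cells.push({ x: col * width, y: row * height, width, height });
  }
  return cells;
}
```

Because every tile lives on one canvas, a React component only needs the single canvas ref plus this array of rectangles.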
When we experimented with rendering each remote participant on a separate canvas, we were able to avoid splitting our sizing/positioning algorithms into separate implementations for our bubble user video view. Unfortunately, requesting multiple streams instead of one from the Zoom Video SDK is computationally heavy, so we mitigated this by changing pagination to a maximum of five video bubbles per page.
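The pagination mitigation amounts to chunking the participant list. A minimal sketch, where `paginate` is a hypothetical helper and the default page size of five matches the limit described above:

```typescript
// Split a list into pages of at most pageSize items.
function paginate<T>(items: readonly T[], pageSize = 5): T[][] {
  if (pageSize < 1) throw new RangeError("pageSize must be at least 1");
  const pages: T[][] = [];
  for (let i = 0; i < items.length; i += pageSize) {
    pages.push(items.slice(i, i + pageSize));
  }
  return pages;
}
```

Only the participants on the visible page would have video requested, which bounds the number of concurrent streams.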
We have now moved to a single canvas approach for rendering all user videos. The main difficulty is how to render users, since we use bubbles (diameter-based spacing and calculation) instead of rectangles (height/width-based spacing). Two methods to achieve our existing designs using a single canvas have been hotly debated: the cookie cutter method vs the swiss cheese method.
Cookie Cutter Method
This method was recommended to us by @tommygaessler, Lead Developer Advocate and our contact at Zoom. The approach uses a single virtual canvas hidden offscreen to render all participant videos. We then stamp out remote participant “video cookies” from this offscreen canvas and render them on the page with the CanvasRenderingContext2D.drawImage() function. MDN has a great explanation of how this works, though this JSFiddle demonstrates it more practically. Notably missing from the JSFiddle, though, is the use of the sx, sy, sWidth, and sHeight arguments to selectively copy over only a single remote user for each destination canvas.

The browser viewport in the cookie cutter method: users see only participant "video cookies".

The offscreen virtual canvas element where the original videos are actually rendered in a single video stream.
A notable downside of this approach is that canvases will be created and destroyed frequently, triggering frequent re-render cycles in React.
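The stamping step can be sketched with the nine-argument form of drawImage(), which copies a source rectangle (sx, sy, sWidth, sHeight) into a destination rectangle. `squareCrop`, `stampCookie`, and `CookieContext` are hypothetical names. `CookieContext` is a structural subset of CanvasRenderingContext2D, so the sketch can be exercised outside a browser. Cropping to a centred square and rounding the destination canvas with CSS (for example border-radius: 50%) are assumptions about how a bubble might be produced.

```typescript
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Structural subset of CanvasRenderingContext2D used by this sketch.
interface CookieContext {
  clearRect(x: number, y: number, w: number, h: number): void;
  drawImage(
    image: unknown,
    sx: number, sy: number, sWidth: number, sHeight: number,
    dx: number, dy: number, dWidth: number, dHeight: number,
  ): void;
}

// Largest centred square inside a participant's tile, so that the
// bubble is not stretched when a 16:9 tile becomes a circle.
function squareCrop(tile: Rect): Rect {
  const side = Math.min(tile.width, tile.height);
  return {
    x: tile.x + (tile.width - side) / 2,
    y: tile.y + (tile.height - side) / 2,
    width: side,
    height: side,
  };
}

// Copy one participant's region from the offscreen canvas into a
// size x size destination canvas.
function stampCookie(ctx: CookieContext, source: unknown, tile: Rect, size: number): void {
  const crop = squareCrop(tile);
  ctx.clearRect(0, 0, size, size);
  ctx.drawImage(source, crop.x, crop.y, crop.width, crop.height, 0, 0, size, size);
}
```

In a browser, `ctx` would be the 2D context of a participant's visible canvas and `source` the offscreen canvas, with `stampCookie` called once per frame (for instance from requestAnimationFrame).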
Swiss Cheese Method
In the swiss cheese approach, a single canvas holding all remote users is rendered with z-index: -1. On top of this canvas, we place a grid overlay that effectively crops remote participants into bubbles. This is very similar to the approach taken in the React sample app, in which a grid of div elements is placed on top of the remote participant canvas to outline each participant and display their name along the bottom edge. Though this method avoids the frequent re-renders that the cookie cutter method would trigger, it has limitations of its own.

The browser viewport in the swiss cheese method: the canvas is directly on screen

An overlay with transparent cutouts is placed on top of the on screen canvas element.

The final browser viewport in the swiss cheese method: vertical and horizontal spacing are inconsistent.
The main limitation with the swiss cheese method is that the video capture aspect ratio for Zoom is 16:9. Since the rendered remote participants are rectangular, the cropped bubbles would have larger gaps between them horizontally than vertically. Moving in this direction entails a re-design of our web call UI, based on rectangles instead of circles.
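The spacing problem is simple arithmetic. A small sketch, assuming tiles sit edge to edge on the canvas and each bubble is the largest circle that fits its tile (`bubbleGaps` is a hypothetical helper):

```typescript
// For a tile of the given width and aspect ratio, return the bubble
// diameter and the empty space left between neighbouring bubbles.
function bubbleGaps(cellWidth: number, aspect = 16 / 9) {
  const cellHeight = cellWidth / aspect;
  const diameter = Math.min(cellWidth, cellHeight);
  return {
    diameter,
    horizontalGap: cellWidth - diameter,
    verticalGap: cellHeight - diameter,
  };
}
```

For a 320px-wide tile, the bubble diameter is 180px, leaving a 140px gap between horizontal neighbours while vertical neighbours touch.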
3. Scale down user video stream while screen sharing is active

Multi web call while screensharing.
While testing screen sharing, we found that calls with more than roughly 5 participants were susceptible to having some of the participant videos freeze unexpectedly. This is another performance limitation of having separate canvases for each remote user (and thus separate video streams). To address this issue, we lowered the resolution of remote participant video streams while screen sharing was active. This had the happy side effect of stabilizing audio in calls with many users.
To accomplish this, we used our own isRemoteUserScreensharing field in our database, but the same could be done by listening to the active-share-change event in Zoom. We scaled video resolution down to 90p, which solved our video freezing problem. However, our more permanent solution to this problem remains the same as above: a single remote user video canvas implementation.
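A minimal sketch of the quality switch, driven by the active-share-change event instead of a database field. `RemoteQuality`, `ShareChange`, and `qualityForShareState` are hypothetical names. The payload shape (a state of "Active" or "Inactive" plus a userId) is an assumption to verify against the SDK reference, and the "360p" default is illustrative only.

```typescript
// Hypothetical labels; map these to the SDK's video quality constants.
type RemoteQuality = "90p" | "360p";

// Assumed shape of the active-share-change payload.
interface ShareChange {
  state: "Active" | "Inactive";
  userId: number;
}

// Drop remote videos to 90p while someone is sharing, and restore the
// normal quality when the share ends.
function qualityForShareState(
  state: ShareChange["state"],
  normal: RemoteQuality = "360p",
): RemoteQuality {
  return state === "Active" ? "90p" : normal;
}

// Possible wiring (rerenderRemoteVideos is a hypothetical function that
// re-renders each remote participant at the requested quality):
//   client.on("active-share-change", (payload: ShareChange) => {
//     rerenderRemoteVideos(qualityForShareState(payload.state));
//   });
```

Keeping the decision in one pure function means the same rule applies whether the trigger is the Zoom event or an application-level flag.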
Would we do this again?
In a word, absolutely. Integrating with 3rd parties is always challenging, and Zoom was no exception, but they listened to our feedback, improved the developer experience, and were helpful along the way. This project made us take a good look at our web video call architecture, a system that had grown organically over time to include an ever-expanding feature set. It also allowed us to refactor a big chunk of our existing web call codebase and introduce abstraction layers via a shared React context and several custom hooks. We expect that once we complete our second iteration of Zoom web calls with a single canvas, we will be able to address the performance issues brought up in both the remote user rendering and screen share scaling sections.
As a final note to readers working with the Zoom Video SDK, these three challenges were what we spent most of our time on during this project. Zoom has since made rendering the local user's video much easier by abstracting the optimization logic into the Video SDK internals. Please let us know @with_multi if you’d be interested in a comprehensive step-by-step guide to working with the Zoom Video SDK on React!
