Why MJPEG over USB?
When you design a video capture system, you’re really balancing two variables:
- How the image is encoded (codec)
- How the image is transported (interface)
Everything else, like resolution, frame rate, hardware requirements, and stability, is a consequence of those two decisions. Most people skip straight to resolution and frame rates, but there is a bit more to it. Let's walk through the exercise of designing a system.
Pixel format
A pixel format describes how a single frame (pixels forming a static image) stores color information. Each format trades color fidelity against bits per pixel, which directly impacts how much data you need to move. The common ones are:
- RGB - full red/green/blue values for each pixel
- YUV - YUYV or NV12. These store luma (brightness) and chroma (U, the blue difference, and V, the red difference). YUYV and NV12 are probably the most popular in cameras.
- RAW/Bayer - direct sensor data
The pixel format determines how much data a frame takes: RGB needs 24 bits per pixel, while YUYV, thanks to chroma subsampling, typically needs only 16.
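To make the per-pixel numbers concrete, here is a small sketch computing raw frame sizes for these formats. The bits-per-pixel values are the standard ones; the 1080p resolution is just an example:

```python
# Raw frame sizes for common pixel formats.
FORMATS = {
    "RGB24": 24,  # 8 bits each for red, green, blue
    "YUYV":  16,  # 4:2:2 subsampling: chroma shared between pixel pairs
    "NV12":  12,  # 4:2:0 subsampling: chroma shared across a 2x2 block
}

width, height = 1920, 1080  # 1080p

for name, bpp in FORMATS.items():
    megabytes = width * height * bpp / 8 / 1e6
    print(f"{name}: {megabytes:.2f} MB per frame")
```

Multiply any of those by 60 frames per second and the raw data rates add up quickly, which is where the codec and transport choices start to bite.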
Codec
The codec describes how frames are compressed.
- Uncompressed. This is typically YUYV over HDMI or USB, although it can be RAW over MIPI.
- MJPEG. Each frame is compressed as an individual JPEG.
- H264. The most popular streaming codec; compresses across frames (inter-frame prediction).
- H265 (HEVC). A more efficient successor to H264, but less widely supported, although it is gaining ground in some 4K broadcast streams.
The codec primarily affects your bandwidth requirements, but it has secondary effects on compute requirements and latency. Decoding H265 requires significantly more compute than handling uncompressed frames, and a codec that does inter-frame compression, like H264, naturally has higher latency than something like MJPEG.
Your codec also defines your losses. Uncompressed is nice because you don't lose any data, but can you really tell how much pixel data you lose when looking at JPEGs online? H264/H265 can be difficult to compare because you typically control bitrate, which in turn controls data loss.
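A rough way to see how the codec drives bandwidth is to apply ballpark compression ratios to an uncompressed stream. The 10:1 and 100:1 ratios below are illustrative assumptions, not measurements; real MJPEG and H264 bitrates depend heavily on quality settings and content:

```python
# Ballpark bitrate comparison for 1080p@60 under different codecs.
uncompressed_bps = 1920 * 1080 * 16 * 60  # YUYV, roughly 2 Gbps
mjpeg_bps = uncompressed_bps / 10         # assumed ~10:1 JPEG ratio
h264_bps = uncompressed_bps / 100         # inter-frame prediction buys roughly 10x more

for name, bps in [("YUYV", uncompressed_bps), ("MJPEG", mjpeg_bps), ("H264", h264_bps)]:
    print(f"{name}: {bps / 1e9:.3f} Gbps")
```

Even with generous error bars on the ratios, the ordering is what matters: each step down the list trades compute and latency for an order of magnitude less bandwidth.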
Transport
Once a frame exists, we need to transport it, or transport groups of frames (video).
- MIPI. Typically only used internally between the sensor and a processor. High speed, but only valid over short distances; you can't do something like a 12-foot run across a machine.
- USB. Ubiquitous. Affordable. Flexible. USB3 offers 5 Gbps nominal, although practical throughput is a bit lower (~4 Gbps). Multiple devices share a bus, so you need to be careful with contention.
- HDMI. Typically transports uncompressed RGB or YUV. Most common for 1:1 connections like graphics card to monitor, or camera to capture card.
- DisplayPort. Typically used for monitors, not for capture, even though it can technically be daisy-chained and has higher bandwidth than HDMI.
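To put these transports side by side, here is a rough sketch of how many uncompressed 1080p@60 YUYV streams each link could carry. The rates are headline spec numbers, not usable throughput, and the MIPI figure assumes one common configuration (a 4-lane D-PHY v1.2 link at 2.5 Gbps per lane):

```python
# Nominal link rates in Gbps; real usable throughput is lower
# once encoding overhead is subtracted.
transports_gbps = {
    "MIPI CSI-2 (4-lane D-PHY v1.2)": 10.0,
    "USB 3.0": 5.0,
    "HDMI 2.0": 18.0,
    "DisplayPort 1.4": 32.4,
}

stream_gbps = 1920 * 1080 * 16 * 60 / 1e9  # one 1080p@60 YUYV stream

for name, rate in transports_gbps.items():
    print(f"{name}: fits {int(rate // stream_gbps)} uncompressed 1080p60 streams")
```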
At the end of the day, everything ends up on a PCIe bus. Your USB and HDMI controllers both hang off PCIe; this is why you see a lot of PCIe capture adapters for HDMI or USB. You are even starting to see some which route through an M.2 slot. M.2 is a smaller, nicer connector, but it still ends up on the PCIe bus.
Reality check
1080p@60 over YUYV (16 bits per pixel) is about 2 Gbps. USB3 is rated at 5 Gbps; in practice, that means one or two uncompressed streams per controller before you push limits. Even though you might have two or four ports, they are probably sharing a single bus. That's why you commonly see either 1:1 connections over something like HDMI, or compressed streams sharing a bus, like MJPEG over USB.
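The arithmetic behind that claim is easy to check. This sketch uses the ~4 Gbps practical USB3 figure from above; the 10:1 MJPEG compression ratio is an assumption for illustration:

```python
# How many simultaneous 1080p@60 streams fit on one USB3 controller?
practical_usb3_bps = 4e9

yuyv_bps = 1920 * 1080 * 16 * 60  # ~2 Gbps uncompressed
mjpeg_bps = yuyv_bps / 10         # assumed ~10:1 JPEG compression

print(f"uncompressed streams: {int(practical_usb3_bps // yuyv_bps)}")
print(f"MJPEG streams: {int(practical_usb3_bps // mjpeg_bps)}")
```

Two uncompressed streams versus roughly twenty MJPEG streams per controller: that gap is the whole argument for compressing before the bus.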
MJPEG is great because it's a bunch of individually compressed frames, so decompressing it is trivially easy and requires minimal computation. We actually decompress it on the GPU so that the resulting surface can be processed immediately without ever leaving the GPU; it's a neat DMA trick I might go into some day. Something like H264 requires significantly more computation, which is why you want to go into something like OBS and select "enable hardware decoding". H264 can be configured in a bunch of ways (fixed bandwidth, fixed quality, and so on), but because it relies on inter-frame prediction and buffering, it is typically higher latency, higher decode complexity, and not truly fixed bandwidth.
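Because each MJPEG frame is a complete JPEG, the stream can be split just by scanning for the JPEG start-of-image (0xFFD8) and end-of-image (0xFFD9) markers. The sketch below is simplified: real demuxers handle the fact that marker-like bytes can appear inside entropy-coded data, and the sample bytes here are placeholders, not valid JPEGs:

```python
def split_mjpeg(stream: bytes) -> list[bytes]:
    """Split a concatenated MJPEG byte stream into individual JPEG frames."""
    frames = []
    start = stream.find(b"\xff\xd8")  # SOI marker
    while start != -1:
        end = stream.find(b"\xff\xd9", start + 2)  # EOI marker
        if end == -1:
            break  # incomplete trailing frame
        frames.append(stream[start:end + 2])
        start = stream.find(b"\xff\xd8", end + 2)
    return frames

# Two fake "frames" back to back (placeholder bodies, not real JPEG data):
data = b"\xff\xd8AAAA\xff\xd9\xff\xd8BBBB\xff\xd9"
print(len(split_mjpeg(data)))
```

Compare that to H264, where you cannot even decode a frame without first reassembling NAL units and holding reference frames in memory; the structural simplicity is where MJPEG's low decode cost comes from.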
The reality is that pinball is closer to a high-action sport than to traditional influencer or display capture. Pinballs move fast, so motion blur and shutter artifacts degrade image quality more than a minor MJPEG compression artifact. Auto-exposure causes temporal brightness shifts, and rolling shutter introduces geometric distortion during fast motion. In multiball, those artifacts are far more visible than moderate JPEG compression, which we correct for anyway.
Why did we choose MJPEG over USB?
We optimized for 1080p@60 fps. Long cable runs. Multi-camera. Cost. Low latency. In that order.
USB cables are small and routable inside a frame. Moving to HDMI would require a much larger (and more expensive) frame, not to mention the additional cost of HDMI cables and more expensive capture hardware.
While MJPEG has some pixel loss, we correct for it automatically (degauss, color correction, sharpness), which makes the loss minimal. The final encode for streaming dominates quality anyway. Maybe in the future, if we offer a 4K setup or an even higher tier, we will look at other techniques. Doubling system cost for marginal visual gains was not worth it.
When would you choose something else?
If you want the crème de la crème, get HDMI cameras and a three-input direct HDMI capture card, with the only conversion being the final encode to your streaming codec (H264).
If you are bandwidth constrained or have much longer runs, you could look at something like an H264 camera.
If you have a high-bandwidth USB controller to yourself, you could look at YUYV over USB too.