One of my newest projects that I’m most excited about is using the Apache Kafka distributed message queue to share compute jobs across many servers. I have a large video library of drone footage that I am always adding to. I use various tools to browse and categorize this footage, all of which require the videos to be in a consistent format: H.264 video in an MP4 container. Unfortunately, not all of the cameras I’m using record in H.264; I have roughly 1 TB of HD video in WMV, AVI, and MOV formats. I wrote a script to transcode this footage automatically using FFmpeg, archiving the originals in case a transcode failed. That was the easy part… Even with a powerful i7 computer dedicated to the job, it would take days or weeks to complete. The good news is that I have multiple i7 servers lying around. This is where Apache Kafka comes in.
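A minimal sketch of that kind of transcode-and-archive script is below. The function names, the codec choices (libx264/AAC), and the archive layout are illustrative assumptions on my part, not the exact script:

```python
import shutil
import subprocess
from pathlib import Path

# Formats that need transcoding to MP4 H.264 (assumed set, per the post).
SOURCE_EXTENSIONS = {".wmv", ".avi", ".mov"}


def build_ffmpeg_command(src: Path, dst: Path) -> list[str]:
    """Build an FFmpeg invocation that transcodes src to H.264 in an MP4 container."""
    return [
        "ffmpeg", "-i", str(src),
        "-c:v", "libx264",   # H.264 video
        "-c:a", "aac",       # AAC audio, the usual pairing for MP4
        str(dst),
    ]


def transcode_and_archive(src: Path, archive_dir: Path) -> None:
    """Transcode one video, then move the original aside in case the transcode failed."""
    dst = src.with_suffix(".mp4")
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
    archive_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(archive_dir / src.name))
```

Running `transcode_and_archive` over every file whose extension is in `SOURCE_EXTENSIONS` is the part that takes days on a single machine.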
I designed a system where one computer scans the video library recursively for new videos. If a new video is not in MP4 H.264 format, it produces a “job” and pushes it to a Kafka topic. The job identifies the video that needs to be transcoded and includes details such as resolution and bitrate, which FFmpeg uses when transcoding. I then have “worker” virtual machines running my consumer program, pulling jobs from the Kafka topic and executing FFmpeg. Kafka lets me easily coordinate a large volume of work across a dynamic number of computers. If I film an event some weekend and drop 100 GB of video on my fileserver all at once, I can simply clone my “worker” VM onto another server and have it transcode video as well. With this system I don’t have to manually tell one server to transcode one folder of video and another server a different one, and no machine sits idle while others are still working. I can even set computers to shut down when the queue is empty. I can definitely see why Kafka is so popular in the industry: it makes message queues so easy to work with.
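The producer and worker sides can be sketched roughly as follows. This assumes the kafka-python client library, a broker at `localhost:9092`, and a topic named `transcode-jobs`; all of these names, plus the `make_job` message schema and the fixed resolution/bitrate, are hypothetical stand-ins for the real setup:

```python
import json
import subprocess
from pathlib import Path

TOPIC = "transcode-jobs"        # hypothetical topic name
BROKERS = ["localhost:9092"]    # assumed broker address
SOURCE_EXTENSIONS = {".wmv", ".avi", ".mov"}


def make_job(path: Path, width: int, height: int, bitrate_kbps: int) -> dict:
    """One message per video: the file to transcode plus the details FFmpeg needs."""
    return {
        "path": str(path),
        "width": width,
        "height": height,
        "bitrate_kbps": bitrate_kbps,
    }


def produce_jobs(library: Path) -> None:
    """Scan the library and push a job for every video that is not MP4 H.264."""
    # Imported here so the job schema above works without a Kafka broker present.
    from kafka import KafkaProducer  # kafka-python client (an assumption)

    producer = KafkaProducer(
        bootstrap_servers=BROKERS,
        value_serializer=lambda j: json.dumps(j).encode(),
    )
    for video in library.rglob("*"):
        if video.suffix.lower() in SOURCE_EXTENSIONS:
            # Resolution and bitrate would really be probed per file
            # (e.g. with ffprobe); fixed values here for brevity.
            producer.send(TOPIC, make_job(video, 1920, 1080, 8000))
    producer.flush()


def run_worker() -> None:
    """Consume jobs and run FFmpeg; a shared group_id spreads jobs across workers."""
    from kafka import KafkaConsumer  # kafka-python client (an assumption)

    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKERS,
        group_id="transcode-workers",  # same group => each job goes to one worker
        value_deserializer=lambda b: json.loads(b.decode()),
    )
    for msg in consumer:
        job = msg.value
        src = Path(job["path"])
        subprocess.run([
            "ffmpeg", "-i", str(src),
            "-c:v", "libx264",
            "-b:v", f"{job['bitrate_kbps']}k",
            str(src.with_suffix(".mp4")),
        ], check=True)
```

The consumer group is what makes cloning a worker VM “just work”: every consumer that joins `transcode-workers` gets a share of the topic’s partitions, so adding a machine automatically adds throughput without any manual job assignment.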