Script to Extract Text From Every Second of a YouTube Video's Frames
Motivation for Project
It would be neat to be able to search the text content displayed on the screen of any YouTube video. In order to do that, you would have to run optical character recognition (OCR) on the frames of the video.
I watch a lot of hacking walkthroughs from ippsec, where most of the videos feature terminal output. Searching this content could be useful for finding examples of using a particular hacking tool or concept. Searching is actually already pretty easy with ippsec's videos in particular, because so many annotations are added to them, and they are all searchable on ippsec's website.
But I want to create a CLI tool that can convert a video to text for later processing. It should print a line with the current timestamp followed by the text content of the video at that frame. It is also possible to watch the video within the terminal if you time the conversion process so that the text appears at roughly the same moment it would in the video.
Interesting Combination
Besides being an interesting computer science experiment, this project also gave me a little experience doing some solutions engineering. It is made possible by an interesting combination of programs:
- youtube-dl - a project to download youtube videos
- ffmpeg - record videos, convert videos, and extract frames from videos
- tesseract - OCR images
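To sketch how these fit together, the per-frame pipeline for a single timestamp looks roughly like this (the 0:30 timestamp and file names are just placeholders for illustration):

youtube-dl "$url"
ffmpeg -ss 0:30 -i video.mp4 -frames 1 frame.png
tesseract frame.png out
cat out.txt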
Installing
It was awkward to install ffmpeg on my system. I had to do a full upgrade and fix apt dependencies to install it. This was simply because I had not updated in a while.
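On a Debian-based system that boiled down to something like the following (package names may differ on other distributions):

sudo apt update
sudo apt full-upgrade
sudo apt install ffmpeg tesseract-ocr libimage-exiftool-perl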
The youtube-dl command says it can be installed from pip, but that version might not work. During my testing, the pip version did not work, but installing and running the fresh version from GitHub did.
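For reference, the pip route would be something like:

pip3 install --upgrade youtube-dl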
- Installing youtube-dl from GitHub source
git clone https://github.com/ytdl-org/youtube-dl
cd youtube-dl
sudo python3 setup.py install
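Assuming the install worked, it can be verified with:

youtube-dl --version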
Running
- Testing on ippsec's Pilgrimage video
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 00:1:10
= seOreence ee
# Nmap 7.93 scan initiated Mon Nov 20 14:28:58 2023 as: nmap -sC -sV -oA nmap/pilgrimage 10.10.11.219
Nmap scan report for 10.10.11.219
Host is up (0.091s latency).
Not shown: 998 closed tcp ports (reset)
PORT STATE SERVICE VERSION
22/tcp open ssh OpenSSH 8.4p1 Debian 5+debllu1 (protocol 2.0)
| ssh-hostkey:
| 3072 20be60d295f628c1b7e9e81706f1683 (RSA)
| 256 Oeb6a6a8c99b4173746e70180d5feOaf (ECDSA)
|_ 256 d14e293c708669b4d72cc80b486e9804 (ED25519)
80/tcp open http nginx 1.18.0
|_http-title: Did not follow redirect to http://pilgrimage.htb/
|_http-server-header: nginx/1.18.0
Service Info: 0S: Linux; CPE: cpe:/o: Linux: Linux_kernel
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
# Nmap done at Mon Nov 20 14:29:11 2023 -- 1 IP address (1 host up) scanned in 12.20 seconds
nmap/pilgrimage.nmap (END)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 00:1:11
As you can see, it creates quite accurate and searchable output.
It takes most of a second to process each frame on my cyberdeck, so it is kind of like watching the video in real time. I can add an optional sleep to make it feel more like watching a live video.
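As a rough sketch of that idea (the LIVE variable is my own placeholder, not part of the script), the commented-out sleep in the main loop below could be gated behind an environment variable:

# Only pause between frames when LIVE=1 is set
if [ "$LIVE" = "1" ]; then
    sleep 1
fi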
Code
Let's take a look at the code. It is a very simple wrapper around these three programs. This Bash script feels like I just duct-taped three programs together, but it works.
First I get the video URL from the first argument and check if youtube-dl is installed.
youtube-dl may return an mkv file or an mp4 file; I am not sure what causes this. Both exiftool and ffmpeg work with mkv and mp4 files, so I can account for the different file names by checking which one exists. For convenience, the video is renamed to video.mkv or video.mp4.
#!/bin/bash
url="$1"

# Bail out early if youtube-dl is not installed
command -v youtube-dl >/dev/null 2>&1 || { echo "youtube-dl not found" >&2; exit 1; }

youtube-dl "$url"

# youtube-dl may produce an mkv or an mp4; rename whichever one exists
mv *.mkv video.mkv 2>/dev/null
srcfile="video.mkv"
if [ ! -f "$srcfile" ]; then
    mv *.mp4 video.mp4 2>/dev/null
    srcfile="video.mp4"
fi
Next, I use exiftool to figure out the duration. It produces output in the format Duration: HH:MM:SS, so I can parse that with AWK. An interesting trick here is printing out a Bash script with AWK and then sourcing it into Bash via process substitution in order to get the hours, minutes, and seconds values.
# Emit Bash variable assignments from AWK, then source them via process substitution.
# Adding and subtracting 1 forces AWK to treat the hours field as a number,
# so the value sources cleanly into Bash.
source <(exiftool "$srcfile" | grep ^Duration | awk -F : '
/^Duration/ {
    print "seconds=" $NF
    print "minutes=" $(NF-1)
    print "hours=" ($(NF-2) + 1) - 1
} ')
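For example, if exiftool reports a duration of 0:27:14 (an arbitrary value for illustration), the sourced snippet is equivalent to running:

seconds=14
minutes=27
hours=0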
Finally, I loop over the minutes and seconds (and should loop over hours too). For each second, I print a header and run ffmpeg to extract that frame as a 1024x768 image. For convenience, the image is called image.png each time. Then I run tesseract on the file and print the output. It is important to clean up the image and output files so the programs do not fail to overwrite them on the next frame.
for m in $(seq 0 $minutes); do
    # Loop over a full minute of seconds, except on the last (partial) minute
    if [ $m -lt $minutes ]; then
        sec=59
    else
        sec=$seconds
    fi
    for s in $(seq 0 $sec); do
        echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 00:$m:$s"
        # Extract a single 1024x768 frame at the current timestamp
        ffmpeg -ss $m:$s -i "$srcfile" -frames 1 -s 1024x768 -f image2 image.png >/dev/null 2>/dev/null
        # OCR the frame and print the recognized text
        tesseract image.png output >/dev/null 2>/dev/null
        cat output.txt
        # Clean up so the next frame is not blocked by existing files
        rm image.png
        rm output.txt
        #sleep 1
    done
done
rm "$srcfile"
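Putting it all together, a hypothetical run looks something like this (video2text.sh is just what I am calling the script here, and the URL is a placeholder):

./video2text.sh "https://www.youtube.com/watch?v=VIDEO_ID" > pilgrimage.txt
grep -A5 "nmap -sC" pilgrimage.txt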