Recently I wanted to set up a periodic job to mirror some of my personal youtube playlists. There’s plenty of reasons one might want to do this. For me it’s simple: one copy is none copy, and two copies is one copy, so I want a second copy of youtube videos I care about stored locally. This protects against videos getting removed, copyright stuck, youtube shutting down, or anything else that might make the youtube video otherwise unavailable.
The final straw for me was the recent YouTube Kids change. YouTube has implemented a system whereby it flags videos as “For Kids”, and a video flagged as “For Kids” can not be added to playlists, or interacted with beyond viewing, and may eventually start disappearing from search results. This seems kinda fine on the surface, but the automated systems flagging videos are very bad at actually detecting kid-friendly content. For example, Pony Music Videos (PMVs), which are fan made cuts of My Little Pony video footage over top unrelated songs, have been getting flagged quite a bit, even when the song has obviously explicit lyrics. (EDIT 2020-03-02: this video has since been un-marked as “For Kids”, so you’ll have to take my word that it was “For Kids” when this post was written.)
Anyways, I just wanted a simple script adding some extra features around youtube-dl, so I made one! You can find the full script at https://github.com/faithanalog/x/blob/master/youtube-archiver/archive-playlists. I’m going to go over a few specifics in case you’re interested in making your own rather than just using my script or someone else’s. But, if you don’t care about that, and you want to use my script, here’s what to do:
First, create a list of playlists in a file called, for example,
Nightcore Songs https://www.youtube.com/playlist?list=PLckeMyCaCCIN_JU1V4oADW50DlGOREoLj Vaporwave https://www.youtube.com/playlist?list=PLgP_WFDJWjxTRPJtV4DV99lGB92rq5we_
Optionally, create a list of SOCKS proxies
you want to use in another file, for example,
Then you can run my script!
# If you don't have any proxies to use archive-playlists playlists.txt # If you do have proxies to use archive-playlists playlists.txt proxies.txt
This will create folders with the names provided by
download each playlist to its own folder. You can run it periodically however
you feel most comfortable (crontab, systemd timers, etc.) and it should keep
all the local copies up to date.
The requirements I had for my script were pretty simple:
- Download videos, thumbnails, subtitles, and video metadata
- Write playlist information to its own file
- When re-running the script, download only the newly required data
- Ideally, avoid getting myself IP-banned
Most of this can be taken care of with various youtube-dl flags! I have a
dl_playlist() which implements all of this. Arguments passed
dl_playlist() are transparenltly passed on to any youtube-dl commands,
which allows me to pass it a playlist URL, and optionally a proxy.
First, we download the playlist metadata:
playlist_json="playlist-$(date --rfc-3339=date).json" youtube-dl \ --flat-playlist \ -J \ "$@" < /dev/null > "$playlist_json" temp_playlist_json="$(mktemp -p ./)" cp "$playlist_json" "$temp_playlist_json" mv "$temp_playlist_json" playlist.json
This writes the current metadata to a dated file, copies that to a temporary
file, and then renames the temporary file to
playlist.json. This allows us to
have a history of the playlist over time, which may make it easier to figure
out what videos are missing when they get taken down.
always contain the latest playlist information, to make it easier for me to
write tools for this later.
The process of copying to a temporary file and then renaming the temporary file
may seem overkill, but
mv is an atomic operation, so it eases my mind about
any race conditions I might run into if I’m running other tools over this data
to process it.
Next, we can download the playlist videos.
youtube-dl \ --download-archive ytdl-archive.txt \ --write-info-json \ --write-description \ --write-thumbnail \ --all-subs \ -i \ -f bestvideo+bestaudio \ -r 500K \ --sleep-interval $min_sleep \ --max-sleep-interval $max_sleep \ "$@" < /dev/null
Here’s where all the fun flags come into play!
--download-archive ytdl-archive.txttracks which videos have been fully downloaded in a file called ytdl-archive.txt. This allows youtube-dl to skip loading the page for the video entirely on subsequent runs.
--write-thumbnailare all fairly self explanatory. I don’t even really need to write the description, since the info json contains it, but it might be convenient to have in its own file.
--all-subsinstructs youtube-dl to download all available subtitles. I don’t really know which subtitles I need, so might as well just have them all available.
-imakes youtube-dl ignore download errors. Without this, if a video is missing, youtube-dl will stop running after trying to retrieve it, and skip the rest of the playlist.
-f bestvideo+bestaudiomakes youtube-dl download the best quality video and audio files separately, and merge then into a single
-r 500Krate-limits downloads to 500 Kilobytes per second. This is part my attempts to avoid the ire of automated IP-bans. I don’t have any source on what a safe range for download rates is, but this seems to work for my usecase, so I’m keeping it.
--max-sleep-intervaltogether specify the minimum and maximum sleep times. youtube-dl will pick a random time between these values between downloads. I was a bit confused about the semantics of what counts as a download- youtube-dl sleeps after downloading a thumbnail, video file, or audio file, but doesn’t sleep after downloading the info json. Anyways, I use a range of 30-90 seconds, and this seems to keep things slow enough that it shouldn’t be a problem.
The rest of the script is just housekeeping around this. It loads playlists from a file, creates separate directories for each one, and randomly shuffles through a list of SOCKS proxies for each playlist. My VPN provider provides SOCKS proxies for each of their exit nodes, so this is a really convenient way for me to distribute my downloads across a broader range of IP addresses.
And that’s it! The script is pretty well commented I feel, but if you have any questions about it, feel free to ask!
Thanks to everyone on the fediverse who helped me iterate on this script and iron out the details. ❤️