Media Scraper

A full-stack application that lets admins scrape and manage media content for a website.

I have wanted to work on a scraping project ever since I first learned about scraping during my university days. It has always fascinated me how data flows in the digital world: once something is public, it ends up everywhere. That is especially true of media. New streaming sites seem to pop up almost every single day, all offering a free and accessible movie streaming experience to anyone, and I have always wondered how that works. Midway through 2025, I was approached to work on something that helped unlock that secret for me. I dubbed it "Media Scraper".

What's Media Scraper?

Media Scraper is a system written in NestJS (backend) and Vue (admin panel) that helps admins scrape, download, and manage information for any kind of media. The idea is simple: my company wanted to host a media streaming site. Unlike free streaming platforms that rely on video sources from third-party providers, we wanted an experience similar to Netflix or Disney Plus, where the data is in-house and we do all the encoding, decoding, and streaming ourselves. We had all the technology and tools for this type of application; what we needed was a video source, and that is where I came in.

Two other developers and I got to work, and we quickly hit every wall that comes with this sort of task.

Surfing and Scraping the Web

The first thing to do was to learn what to scrape and how to scrape it. Normally I would have picked Go for the backend service; however, when it comes to web scraping, the JS/TS ecosystem just cannot be beaten. Okay, but why NestJS? Express would have been great, but since I got to choose the stack for the project, I decided to go with a safe option. Nest has everything configured right out of the box: error handling, validation pipes, rules and patterns. At first I thought that, since we would be working as a team, something like this would be better than Express, which is extremely unopinionated.

At first I tried cheerio, which is easy to understand and lightweight, but I ultimately went with JSDOM because it felt more familiar: it is similar to the DOM manipulation I often did on previous projects. I ran into zero problems with this decision throughout the whole project, which I am really proud of.

With the tools chosen, the next step was the source. At first I thought this would be the easiest step; however, it quickly became a mess that slowed our development significantly.

The Nightmare

Initially, the source we used worked wonderfully, and we were able to download more than 1,000 media files. We then managed the data and organized it into neat folders for the streaming team to use. But after a quality inspection, the admin team discovered that the majority of the media had a distinct watermark in it, appearing at random, unpredictable points. Since there was so much media to go through, checking and verifying all of it was deemed an impossible task, so everything was deleted.

The second source also worked great... for a few days. After that, we were rate-limited and blocked. The same story goes for the third, fourth, and fifth.

Sixth time's the charm. To make sure there were no more failures, this time I decided to use proxies, which seemed to solve the problem perfectly. I used Webshare, which has a built-in function to rotate proxies automatically, so any time we got rate-limited, I just needed to retry the request to make it work.

Now that we had the metadata, it was time to manage it, which turned out to be another tangled knot.
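The retry-on-rate-limit idea can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `Fetcher`, `FetchResult`, and `fetchWithRetry` are hypothetical names, the HTTP call is injected so that any client routed through a rotating proxy can be plugged in, and treating HTTP 429 as the rate-limit signal is an assumption.

```typescript
// Hypothetical types: the real HTTP client and proxy setup are not shown.
type FetchResult = { status: number; body: string };
type Fetcher = (url: string) => Promise<FetchResult>;

// Retry a request while the response is HTTP 429 (Too Many Requests).
// Assumption: the fetcher goes through a rotating proxy, so each retry
// is expected to leave from a different IP address.
async function fetchWithRetry(
  fetcher: Fetcher,
  url: string,
  maxAttempts = 3,
  delayMs = 0,
): Promise<FetchResult> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await fetcher(url);
    if (result.status !== 429) {
      return result; // not rate-limited, hand the response back
    }
    // Optional pause before the next attempt.
    if (attempt < maxAttempts && delayMs > 0) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error(`Still rate-limited after ${maxAttempts} attempts: ${url}`);
}
```

Capping the number of attempts keeps a permanently blocked source from turning into an endless loop.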

Managing Details Should Not Be This Hard

It seems obvious: we have the source, so why not use its metadata as well? Well, let's just say that the data is... all over the place. Some of it is correct, some is misleading, and some is missing entirely. So we pivoted to another source for metadata.

We were eyeing Douban, which had all the data we wanted; however, it is heavily rate-limited, and even proxies could not get around it. Ultimately we landed on TMDB, which has just about everything we needed and even supports multiple languages. Oh, and did I mention that it's free? The only downside I encountered is that a film's name on TMDB is sometimes not the same as on the source I was using, so when that happens, an admin has to adjust the name to match TMDB. Aside from that, everything is solid.

After acquiring the metadata, the next step was to package it. The streaming team's instruction was to organize it into the pattern below:
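As a rough illustration of the TMDB lookup, the sketch below builds a movie-search URL for TMDB's v3 `/search/movie` endpoint with its `query`, `language`, and `year` parameters. The function and option names (`buildTmdbSearchUrl`, `TmdbSearchOptions`) are hypothetical, and the request would still need TMDB credentials (for example a bearer token header), which are left out here.

```typescript
// Hypothetical helper: only the URL is built, no request is sent.
interface TmdbSearchOptions {
  language?: string; // e.g. "en-US" or "zh-CN" for localized results
  year?: number;     // helps when several films share a title
}

function buildTmdbSearchUrl(title: string, options: TmdbSearchOptions = {}): string {
  // The title must be URL-encoded because names often contain spaces and colons.
  const params: string[] = [`query=${encodeURIComponent(title.trim())}`];
  if (options.language) {
    params.push(`language=${encodeURIComponent(options.language)}`);
  }
  if (options.year !== undefined) {
    params.push(`year=${options.year}`);
  }
  return `https://api.themoviedb.org/3/search/movie?${params.join("&")}`;
}

const url = buildTmdbSearchUrl("Avatar: The Way of Water", { language: "en-US", year: 2022 });
```

Because the search is driven by the title string, a name that differs between the source and TMDB returns no match, which is why the admin-side rename is needed.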

Prod/Completed-Movie/{movie_name}/...

Inside, we need to store the video, the poster, the thumbnail cover, and an NFO file that contains all the movie details in this XML structure:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<movie>
  <title>Avatar: The Way of Water</title>
  <year>2022</year>
  <runtime>192</runtime>
  <genre>Action</genre>
  <director>James Cameron</director>
  <rating>7.6</rating>
  <thumb>poster.jpg</thumb>
  <fanart>background.jpg</fanart>
  <actor>Sam Worthington</actor>
</movie>
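Generating that NFO file from the metadata could look roughly like this. It is a sketch under assumptions: `MovieNfo`, `escapeXml`, and `buildNfo` are hypothetical names, and allowing several `genre` and `actor` entries is my assumption, since the sample above shows only one of each.

```typescript
// Hypothetical shape of the metadata gathered for one movie.
interface MovieNfo {
  title: string;
  year: number;
  runtime: number; // minutes
  genres: string[];
  director: string;
  rating: number;
  thumb: string;
  fanart: string;
  actors: string[];
}

// Escape the five characters that are special in XML text.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Render one indented element such as "  <year>2022</year>".
function tag(name: string, value: string | number): string {
  return `  <${name}>${escapeXml(String(value))}</${name}>`;
}

function buildNfo(movie: MovieNfo): string {
  return [
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>',
    "<movie>",
    tag("title", movie.title),
    tag("year", movie.year),
    tag("runtime", movie.runtime),
    ...movie.genres.map((genre) => tag("genre", genre)),
    tag("director", movie.director),
    tag("rating", movie.rating),
    tag("thumb", movie.thumb),
    tag("fanart", movie.fanart),
    ...movie.actors.map((actor) => tag("actor", actor)),
    "</movie>",
  ].join("\n");
}
```

Escaping matters because titles such as "Fast & Furious" would otherwise produce invalid XML.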

So the whole process should look like this:

Scrape pagination page -> Save media name per page -> Go through every media detail page -> Scrape download video -> Get Metadata from TMDB -> Organize into Folder

Journey to S3

The next instruction was that everything must be stored in S3. So the next challenge was to make downloading seamless: an admin should be able to click one button and have everything done for them in one smooth operation.

We came up with a solution: download to another storage provider, PikPak, first, then transfer the files to S3.

Why PikPak? One thing PikPak does well is magnet downloads. Given a magnet link, PikPak was able to download the media at blazing speed, and luckily our source also provided magnet links.

One issue is that PikPak does not provide a public API, so we had to get creative. We went to its official website and captured its API calls in the browser's network tab; through this we were able to gather all the API endpoints needed for our operation. You might be wondering how we handled PikPak authentication. Since it's just an email-and-password sign-in, we simply check whether we have a stored token and whether it is still valid. If not, we sign in and store the new token. It's that simple.

After we got that working, the next step was transferring files to S3. Rclone works splendidly for this: we just set up two remotes, and whenever a PikPak download finishes, a callback kicks off rclone to transfer the files from PikPak to S3.
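The "reuse the token if it is still valid, otherwise sign in again" flow can be sketched as below. Everything here is illustrative: `StoredToken`, `createTokenProvider`, and the injected `signIn` function are hypothetical names, the token is kept in memory rather than in whatever store the project used, and the expiry field is an assumption about what a sign-in returns.

```typescript
// Hypothetical token shape; the real sign-in response is not shown in this post.
interface StoredToken {
  accessToken: string;
  expiresAt: number; // epoch milliseconds
}

// Returns a getToken() function that signs in only when needed.
// signIn is injected, so the email/password call itself stays out of this sketch.
function createTokenProvider(
  signIn: () => Promise<StoredToken>,
  now: () => number = () => Date.now(),
  safetyMarginMs = 60_000, // refresh slightly before the token actually expires
): () => Promise<string> {
  let cached: StoredToken | null = null;

  return async function getToken(): Promise<string> {
    let token = cached;
    const expired = token === null || token.expiresAt - safetyMarginMs <= now();
    if (token === null || expired) {
      token = await signIn(); // no usable token: sign in and store the new one
      cached = token;
    }
    return token.accessToken;
  };
}
```

Every API call can then start with `await getToken()`, so the sign-in request only happens when the stored token is missing or about to expire.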
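The rclone hand-off could be wired up along these lines. This sketch only builds the arguments for an `rclone copy source:path dest:path` invocation; the remote names (`pikpak`, `s3`), the bucket name, and the `TransferJob` and `buildRcloneCopyArgs` names are assumptions for illustration, not the project's configuration.

```typescript
// Hypothetical description of one finished download that needs to move to S3.
interface TransferJob {
  sourceRemote: string; // rclone remote configured for PikPak
  sourcePath: string;   // folder of the finished download
  destRemote: string;   // rclone remote configured for S3
  destPath: string;     // bucket name plus key prefix
}

// Build the argument list for: rclone copy <source>:<path> <dest>:<path>
function buildRcloneCopyArgs(job: TransferJob): string[] {
  return [
    "copy",
    `${job.sourceRemote}:${job.sourcePath}`,
    `${job.destRemote}:${job.destPath}`,
  ];
}

const args = buildRcloneCopyArgs({
  sourceRemote: "pikpak",
  sourcePath: "Downloads/Avatar The Way of Water",
  destRemote: "s3",
  destPath: "media-bucket/Prod/Completed-Movie/Avatar The Way of Water",
});
```

In a download-finished callback, the array could be passed to something like Node's `child_process.spawn("rclone", args)`. Passing the arguments as an array rather than one shell string avoids quoting problems with movie names that contain spaces.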

Is NestJS Really the Move?

This is the part I am not proud of. I should never have used NestJS for this project, and here are a few reasons:

It builds slowly: NestJS pulls in loads of dependencies, and that shows at build time. At first it took almost 5 minutes every time we shipped an update, even a small one. We were able to improve this by using Bun instead of Node and by making better use of the Docker build cache for our image. But whenever the build stage did kick off, it still took a long time.

It's overkill: I initially thought I would have more than 5 members on the project; however, it ended up being only 3, so things like conventions and patterns were never a real concern.

That said, one thing I truly appreciated about NestJS was its structured conventions. It made me fall in love with its module-based architecture and helped me understand and implement the repository pattern effectively.

Conclusion

Media Scraper started as curiosity and quickly turned into a reality check. Scraping isn't hard—making it reliable is. Between watermarks, rate limits, bad metadata, and blocked IPs, every "working" solution broke eventually. Proxies, TMDB, PikPak, and S3 weren't perfect choices, just the least painful ones.

The real lesson: scraping is not about clever code, it's about trade-offs, resilience, and operational discipline. If you can't handle source instability, data inconsistency, and constant failure, you don't have a scraper—you have a fragile script. This project taught me how real streaming platforms are built: not magically, just stubbornly.
