Track new manga releases in Italy
Hey there! I’m going to build something I always needed and that I could’t ever find on the internet: an email alert application that notifies me whenever a new physical release of the mangas I’m following is going to come out soon. This app is going to scrape a set of predefined websites (only the official publishers of the italian releases of those mangas) and send an email to every address in the database with the release date of the new volume.
Language: Python
This is a forced choice: Python is hands down the best programming language for web scraping.
Scraping: requests
Since I’m going to be making simple GET requests to the publishers’ websites and they don’t seem to block me if I use Python’s library requests, there is really no need for me to use more complex and advanced tools like curl-cffi or Scrapy. I’m also going to be using BeatifulSoup to parse the HTML pages I will get.
TDD: pytest and unittest.mock
I’m going to be following a Test Driven Development approach, using pytest and unittest.mock to properly test every functionality of the application. I will aim to have at least 85% code coverage.
Database: SQL
No need to use a NoSQL database, I’ll use a relational database with something like PostgreSQL, MySQL or SQLite as DBMS. The following are the tables I’m probably going to need to setup the application:
Subscribers table - subscribers
- id (Primary Key)
- email_address (string)
- subscriptions (string) – NORMALIZED IN ANOTHER TABLE
Manga Releases table - manga_releases
- id (Primary Key)
- manga_title (string)
- volume_number (string)
- release_date (date)
- publisher (string)
- alert_sent (boolean, to track whether an alert has already been sent for this release)
Subscribers’ subscriptions table – subscribers_subscriptions
- subscriber_id (integer)
- manga_title (string)
I decided to normalize the subscriptions field from the table subscribers to make everything cleaner, and also to make queries more efficient: at each new manga release I’m going to query the database searching for subscribers to that manga title so having a whole new column with the manga titles associated to a subscriber will make thing a lot cleaner and easier.
System design: serverless event-driven vs server-based polling system
You could argue that a classic server-based polling system would be highly inefficient compared to a serverless event-driven approach, that uses AWS RDS events to trigger email alerts only when changes on the Database happen; and you are 100% right. In the classic approach the EC2 instance would be waiting for most of the time and would poll the database every [x] (probably 6 or something, I’m still not sure) hours, but still I’m going to use this approach for two main reasons: I want to learn how to deploy and orchestrate Docker containers using AWS ECS and also I already know how to use AWS Lambdas in production environments so the serverless approach wouldn’t be really interesting from a “learning perspective”.
cron, Docker and ECS
I’m going to build both the scraper and the email alerter as cron jobs (ECS Scheduled Task) that run on Docker container through AWS ECS. I’ll probably write another blog post on that, since it’s going to be quite a long task.