Things I’ve learned while working on my bot
Created on 2019-01-04
It’s been a long year for me. Plenty of changes on the work side, together with tremendous changes in my private life, gave me more than enough time to start treating the development of the LittleGuardianBot more seriously. The whole process came at a cost - sometimes sleepless nights and dozens of takeaways, and at one point a single mistake that cost me hundreds of pounds. I’m leaving the things I’ve learned below, for my future self.
The first version was mostly a PoC: a simple bot to guard one of my groups from annoying spammers. Nothing special at the beginning - it started as 30 lines of logic and after a few months grew into a few hundred lines of non-optimised Ruby code. Everything was kept in memory, so with every restart the bot got amnesia and the fun with spam detection started all over again. It was “designed” to work only on my group and to listen to hardcoded admins, which obviously caused issues when people wanted to use it on their own groups.
Lesson 1: Never, ever hardcode anything.
It doesn’t matter if your project is used only by you or close friends, or that you never intend to make it public. Always assume that things will change randomly and that the amazing solution you have implemented will not cover all the use cases and possibilities.
Questions I should’ve asked myself back then
- What if I’d add a new admin?
- What if someone wasn’t trustworthy and I’d have to remove them?
- What if someone added the bot to a new group with a completely different set of admins?
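Those questions all point the same way: anything that can change belongs in data, not in code. A minimal sketch of the idea in Go - the types, field names and JSON layout here are illustrative assumptions, not the bot’s actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// GroupConfig holds per-group settings instead of hardcoded values
// (hypothetical shape, for illustration only).
type GroupConfig struct {
	GroupID int64   `json:"group_id"`
	Admins  []int64 `json:"admins"`
}

// LoadConfigs reads group configurations from a JSON file, so adding an
// admin or a whole new group is a data change, not a code change.
func LoadConfigs(path string) ([]GroupConfig, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cfgs []GroupConfig
	if err := json.Unmarshal(data, &cfgs); err != nil {
		return nil, err
	}
	return cfgs, nil
}

// IsAdmin checks a user against the configured admin list.
func (g GroupConfig) IsAdmin(userID int64) bool {
	for _, id := range g.Admins {
		if id == userID {
			return true
		}
	}
	return false
}

func main() {
	// Inline sample standing in for a real config file on disk.
	cfg := GroupConfig{GroupID: -100123, Admins: []int64{42, 77}}
	fmt.Println(cfg.IsAdmin(42), cfg.IsAdmin(99)) // true false
}
```

With this shape, answering all three questions above is a config edit rather than a redeploy.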
Version 2.0 was rewritten from scratch, still in Ruby, with Google Cloud Datastore for persistent storage across restarts, and with support for multiple groups. There were plenty of IFs and plenty of data to store to make the bot a little smarter than before. With persistent storage I was finally able to restart the bot at will, and a few groups with hundreds of active users made no difference to the running cost - this was the version that saw the biggest growth in functionality. Constant development caused restarts, and everything worked quite smoothly, at least until the number of active groups and users grew and the bot became really slow with replies and actions. As more and more information was stored in the GCP Datastore, every restart or change of group settings caused the bot to re-read all the entities. The number of available functions grew as well, so I decided to create a website with simple (at that time) documentation and an FAQ for people to visit before asking any questions.
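The move from an in-memory map to Datastore (and later to SQL) is much less painful when storage sits behind an interface. A hedged sketch of that shape in Go - the names and fields are invented for illustration, and only the throwaway in-memory backend is shown:

```go
package main

import "fmt"

// GroupSettings is a stand-in for whatever the bot keeps per group.
type GroupSettings struct {
	WelcomeMessage string
	SpamThreshold  int
}

// Store abstracts persistence, so the v1 in-memory map can be swapped
// for Google Cloud Datastore (or SQL) without touching the bot logic.
type Store interface {
	Save(groupID int64, s GroupSettings) error
	Load(groupID int64) (GroupSettings, bool)
}

// MemoryStore mimics the v1 behaviour: everything is lost on restart.
type MemoryStore struct {
	data map[int64]GroupSettings
}

func NewMemoryStore() *MemoryStore {
	return &MemoryStore{data: map[int64]GroupSettings{}}
}

func (m *MemoryStore) Save(groupID int64, s GroupSettings) error {
	m.data[groupID] = s
	return nil
}

func (m *MemoryStore) Load(groupID int64) (GroupSettings, bool) {
	s, ok := m.data[groupID]
	return s, ok
}

func main() {
	// The bot code only ever sees the interface, never the backend.
	var store Store = NewMemoryStore()
	store.Save(-100123, GroupSettings{WelcomeMessage: "hi", SpamThreshold: 3})
	s, ok := store.Load(-100123)
	fmt.Println(ok, s.SpamThreshold) // true 3
}
```

A Datastore- or SQL-backed type implementing the same interface slots in with a one-line change in wiring.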
Another portion of questions
- Is the language fit for the task, and does it allow me to develop further?
- What happens to the bill when traffic skyrockets?
- Do users know how to use my product?
Lesson 2: Pick your tools wisely
As much as I love Ruby - it wasn’t the perfect choice for the task. Ruby as a language has its demons: it can become very difficult to debug, and with an increasing codebase, finding the real source of a problem was more than difficult. On top of that, every hotfix of the previous commit caused a restart and a re-read of the whole database, which became a costly process, especially during heavy development days.
After answering those questions and countless hours spent firefighting, I decided to rewrite everything in Golang. It’s well known for its speed, and an actual compiler tells you about potential errors - instead of the customers, in this case. I also added basic statistics using GCP Stackdriver for better insight into traffic, and got rid of the logging noise, which was tolerable with a few groups but with increased popularity became unreadable and difficult to understand. This version was full of beginner mistakes and wrong assumptions which annoyed customers. Nothing changed at the data storage level at that time, while more billable and free functions arrived. This was also the first version using Docker and deployed on a Kubernetes cluster in GCP. Thanks to Google Cloud Build, making changes was as easy as pushing new code into the Github repository.
Even more questions
- How do I keep an eye on actual traffic?
- Are things I’m logging and keeping an eye on actionable?
- Do I utilise the potential of the technologies I use?
Lesson 3: Logs and monitoring
Rapidly decreasing traffic for my bot could mean three things: either the bot doesn’t do what it’s meant to and people remove it from their groups, Telegram has issues with its bot API, or the bot quietly died. Having more logs than necessary under large amounts of traffic doesn’t work either - you start ignoring all the logs and can’t find any useful information about potential issues.
This stage was focused mostly on code optimisation and the migration to an SQL database. It took a few quite long days to change all the necessary queries to point at the new source of truth. From that moment I was free to restart everything at will, so I focused my efforts on code optimisation. Google’s amazing Profiler tool helped me tremendously with this task - I was able to identify all the chokepoints in my bot, and after a few weeks of rethinking a few strategies, improve, for example, the image and file scanning procedures. Until then, every image or file uploaded was scanned separately - either by VirusTotal or Google Vision. I came up with a simple solution to both speed up the bot’s response and decrease API calls (and the cost as well). Before scanning, the bot downloads every image, calculates its MD5 hash and checks against the SQL database whether the file is already known, so it can return results instantly. For an entirely new file, the bot waits for the scan results and saves them in the local database for future reference. It’s especially useful with people using gifs and sharing the same files on multiple groups. I also added a smart community filter which stores information about every user and their doings across all the groups the bot is present on, so when the same person joins another group, we already know if they’re a known spammer.
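The hash-before-scan idea fits in a few lines of Go. Everything here is illustrative - ScanResult, the map standing in for the SQL table, and the fake scan function are assumptions, not the bot’s real code:

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// ScanResult stands in for whatever the external scanner
// (VirusTotal / Google Vision in the post) returns.
type ScanResult struct {
	Spam bool
}

// Scanner caches results by content hash, so a file already seen on one
// group never triggers a second billable API call on another.
type Scanner struct {
	cache map[string]ScanResult        // the SQL table in the real bot
	scan  func(data []byte) ScanResult // the expensive external call
	calls int                          // external calls actually made
}

func (s *Scanner) Check(data []byte) ScanResult {
	sum := md5.Sum(data)
	key := hex.EncodeToString(sum[:])
	if res, ok := s.cache[key]; ok {
		return res // known file: answer instantly, zero API cost
	}
	s.calls++
	res := s.scan(data)
	s.cache[key] = res // remember for every future group
	return res
}

func main() {
	s := &Scanner{
		cache: map[string]ScanResult{},
		scan:  func(data []byte) ScanResult { return ScanResult{Spam: len(data) > 5} },
	}
	gif := []byte("same gif shared on three groups")
	for i := 0; i < 3; i++ {
		s.Check(gif)
	}
	fmt.Println(s.calls) // 1: two of the three checks were cache hits
}
```

The same lookup table doubles as the cross-group memory: once a gif is scanned anywhere, every group gets the verdict instantly.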
Set of questions at that stage
- Is my code optimised enough to work under high load?
- What are the areas where I can improve its performance?
- Can I make it even easier to code, test and deploy?
Lesson 4: Constant improvement
Yes, everything worked at that stage - I could easily keep adding new functionalities, but as I learned in the previous iterations, code should be easy to maintain. Even when I work on a project on my own, I’m always verbose with my commit messages, to make it easier to come back to the code and know why and when I made a change. I also started managing the bot’s development with the Github issue tracker/kanban board, together with the Todoist application, to keep notes on new functions and improvements to work on later. Rewriting the codebase from scratch turned out to be a good idea and a good exercise as well - avoiding the same problems and coming up with better solutions helps both the project and self-development.
The fifth and current iteration
The bot was finally ready to go public. I decided to rewrite it once again, using all the knowledge gained in the previous steps, adding tests run just before committing changes into the repository, and even more tests running on the Cloud Build side. I also focused heavily on gathering more statistics about what the bot is doing. Adding timers for every function and action was a great move - especially in the Telegram world, where spammers can mass-join a group and flood it with unwanted content within a few seconds - I wanted to be faster. Tons of code optimisation at that stage resulted in an average receive-analyse-action response time of 30 milliseconds, which beats all the present spammers and makes the spam itself close to invisible to regular users. A great success, but at the same time the growing popularity (almost 200 groups with thousands of users at that stage) made me think about gathering some extra funds to cover the hosting costs (Kubernetes cluster, SQL database server). I had no choice but to add a premium version (using Telegram payments, so the whole payment process happens in your Telegram application) and to establish limits on free image scans. I also decreased the Docker container size from 210 to 9 MB by using multi-stage builds, which allowed me to build and deploy faster and to spend less on storage. On top of everything, I finally added alerting and Slack integration to have essential information always handy, so even if a container restarts itself after the bot segfaults, I’m able to fix the underlying issue almost immediately so it never happens again. The bot also gained an official announcements channel where I can inform users of new functions, organise polls with suggestions for new functionalities and stay in touch.
Even the smallest project, after going public, will cost you - and unless you want to spend your own funds on people using your stuff, try to come up with an idea of how to make it pay for itself. Stopping the development of new functions to reiterate over the work already done helps enormously with avoiding potential tech debt. Using the latest goodies from the technologies you use may significantly improve parts of your workflow, so keep an eye on them and ask yourself: “what’s in it for me?”. If you can’t find an answer now - don’t worry, it’ll come back to you in the future. Be as transparent with your users as possible to build trust. Users and customers are the best source of information and ideas, as it’s them who use your product.
Happy ending (for now)
- From 50 to 2700 lines of highly optimised code.
- From 1 to 800 Telegram Groups in a few months.
- Over 5 million messages processed.
- Over 350 thousand unique users.
- Support for 7 languages (thanks to the community).
- Premium & custom versions and a Patreon page to cover the costs.
Hundreds of hours of development and fun - utterly priceless.
Official bot website: telegram-bot.in. Live statistics are viewable by everyone, of course!