Dev log: reaper.lite – Part 1

“As long as you painstakingly follow all the steps,
it is 10 billion percent possible.
That is science.”

– Senku, Dr. Stone

Last weekend, I finally finished the massive refactoring of my two-year-old Python web-scraping project. It is now online in a repository, and the program has been divided up into proper modules, so I can focus on improving each part. The downloader is now multithreaded, has improved sorting, and supports more sites.

Day 1

After that, I wanted to build something lighter and more accessible. There’s nothing more dubious than distributing an .exe online with hard dependencies that need to be explained and installed. My first thought was an actual webpage, inspired by https://qsniyg.github.io/maxurl/, a tool that takes image URLs and finds better resolutions. A *small* script on a page would do, right? Or so I thought.

So off I went to GitHub to set up a new repository and project page. Got that working and added simple HTML for a textbox. The first problem I encountered was access to Twitter’s API, which would have been the easiest route. But it turns out I just can’t have my API keys hardcoded in a public, open-source script. It would need to be old-school text scraping to get the data.

I really didn’t have a deep understanding of JavaScript’s syntax and patterns yet, so for about half the day I was over at https://www.w3schools.com/js/ looking up the basics. I learned about the onchange/oninput events and some workarounds for clipboard access. Then I hit another roadblock: a page cannot just fetch another website’s source text from a different domain. Browsers forbid it as a security risk (the same-origin policy). I looked into what maxURL was actually doing, and it is *simply* a URL parser with thousands of hardcoded text-replacement rules. It does not access any of the sites or any external resource. The standalone website was a dead end.
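For the curious, the page-side experiment looked roughly like this before I hit that wall – a minimal sketch, where the #url-input id and the rewrite rule are made up for illustration:

```js
// Hypothetical rewrite rule, standing in for maxURL-style logic.
function rewriteUrl(url) {
  // e.g. strip a size suffix and ask for the original resolution
  return url.replace(/:(?:small|medium|large)$/, ":orig");
}

// React to the textbox as the user types or pastes.
document.getElementById("url-input").addEventListener("input", function (e) {
  const result = rewriteUrl(e.target.value);

  // Clipboard workaround: select a temporary textarea and execCommand,
  // since direct clipboard access is restricted on plain pages.
  const tmp = document.createElement("textarea");
  tmp.value = result;
  document.body.appendChild(tmp);
  tmp.select();
  document.execCommand("copy");
  document.body.removeChild(tmp);
});
```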

Day 2

I wanted to access data on a webpage the user is already on, and run some code over the current page. It turns out there are already well-defined tools to do exactly that – userscripts. They are JavaScript snippets that run on specific webpages after the browser loads them. I also needed to learn some fancy jQuery – https://www.w3schools.com/jquery/.
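A userscript is just a .js file with a metadata block that managers like Tampermonkey or Greasemonkey read. A minimal header looks something like this (the values here are illustrative, not the project’s actual ones):

```js
// ==UserScript==
// @name        reaper.lite
// @match       https://twitter.com/*
// @require     https://code.jquery.com/jquery-3.4.1.min.js
// @grant       none
// ==/UserScript==

(function () {
  "use strict";
  // Everything below runs after the matched page has loaded.
  console.log("reaper.lite loaded on", location.href);
})();
```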

An intuitive way to serve the tool was somewhere familiar – I wanted a new button above “Copy link to Tweet”. The first step: find the dropdown and append the button. There were a few more intricacies to it, like matching the element’s CSS so it fits in, adding the new onclick event, and actually closing the dropdown after use. Those were solved with good old StackOverflow posts and YouTube tutorials.
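Roughly, the injection looks like the sketch below. Every selector is a placeholder – old Twitter’s real class names differ and change over time – and copyTweetForDiscord() is sketched after the next paragraph:

```js
// Add our item whenever a tweet's dropdown is opened.
// Selectors are placeholders for old Twitter's actual class names.
$(document).on("click", ".tweet .dropdown-toggle", function () {
  const $menu = $(this).closest(".dropdown").find(".dropdown-menu ul");
  if ($menu.find(".reaper-item").length) return; // only inject once

  // Borrow the existing items' classes so the CSS matches.
  const $item = $('<li class="reaper-item dropdown-link">Reap for Discord</li>');

  $item.on("click", function () {
    const $tweet = $(this).closest(".tweet"); // reference to the parent tweet
    copyTweetForDiscord($tweet);              // scraping step, sketched below
    $(this).closest(".dropdown").removeClass("open"); // close the dropdown after use
  });

  // Place it above "Copy link to Tweet" (placeholder selector).
  $item.insertBefore($menu.find(".copy-link-to-tweet").closest("li"));
});
```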

Now I have a working event and a reference to the parent tweet. It’s time to scrape the data for the Discord post. The easy ones first: the username and the permalink. For the date string, find the tweet’s text body and run regular-expression searches over it. The image links are already embedded in the tweet, so a little more digging gets them into an array; then parse them to point at the original resolutions. Put them all together into a formatted Discord post and push that into the clipboard with some hacks – roughly as sketched below. Basically, that’s done. But… Twitter’s timeline is a dynamic stream of content: new tweets load in after the initial page load and are untouched by the userscript at this point. That needs to be solved.
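Under the same caveats – placeholder selectors, old-Twitter-style markup – the scraping step is roughly:

```js
// Scrape one tweet and copy a formatted Discord post to the clipboard.
function copyTweetForDiscord($tweet) {
  const username = $tweet.find(".username").first().text().trim();
  const permalink = "https://twitter.com" + $tweet.attr("data-permalink-path");

  // Pull a date string out of the tweet's text body with a regex.
  const body = $tweet.find(".tweet-text").text();
  const dateMatch = body.match(/\b\d{4}[./-]\d{1,2}[./-]\d{1,2}\b/);

  // Collect the embedded image links and point them at the originals.
  const images = $tweet.find(".AdaptiveMedia img").map(function () {
    return $(this).attr("src").replace(/(\.(?:jpg|png))(:\w+)?$/, "$1:orig");
  }).get();

  const post = [username, dateMatch ? dateMatch[0] : "", permalink]
    .concat(images)
    .join("\n");

  // Clipboard hack: select a temporary textarea and execCommand("copy").
  const $tmp = $("<textarea>").val(post).appendTo("body");
  $tmp[0].select();
  document.execCommand("copy");
  $tmp.remove();
}
```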

Part 2 soon with more polish and the New Twitter problem. For anyone interested, the project is currently on GitHub – https://github.com/roguesleipnir/reaper.lite
