How to Use Markov Chains to Generate Fashion Blogger Captions
What is a Markov Chain?
You may have noticed odd captions on some of our Instagram posts lately, with notes indicating they were generated by a Markov Chain. A Markov Chain is a probabilistic model that can generate text (and other kinds of data) based on the probability of one item following another in a given dataset. For a more technical and precise explanation, check out this post from Towards Data Science.
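To make the idea concrete, here's a toy word-level Markov chain in Python: count which word follows which in the training text, then random-walk those transitions to generate new text. (The actual project relies on the markovify library, which is a more robust version of the same principle.)

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start, length=8):
    """Random-walk the chain from a starting word."""
    word, output = start, [start]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: the last word in the corpus
        word = random.choice(followers)
        output.append(word)
    return ' '.join(output)

chain = build_chain('obsessed with this look obsessed with this bag')
print(generate(chain, 'obsessed'))
```

Because the chain only knows which word can follow which, every generated sentence is locally plausible, but there's nothing keeping the whole thing coherent, which is exactly why the captions swing between believable and weird.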
The funny thing about these text generators is that sometimes the captions were completely believable and other times, they were varying degrees of weird. Here's a small sampling:
Taking Advantage of Tutorials
FloydHub has built some incredible tutorials that make it easier than ever to implement machine learning models for a range of use cases. In June 2018, they released a blog post, Generating Commencement Speeches with Markov Chains. I had seen projects like this before, but implementation usually required a lot of development setup. (That said, you don't need to use FloydHub to use this code, but they do make it easy.)
Tutorials are a great jumping-off point for projects that you might have a difficult time tackling from scratch. In fact, most software engineers will tell you that the most valuable skill you can have when writing code is knowing what to search for. Depending on the project, it might make sense to retrofit existing code rather than starting from zero.
Collecting Data / Scraping Instagram Captions
First, I'd like to address the approach I've taken in this project. I tackled it with a hacker's mentality more than a software engineer's; the code in this project is not elegant.
Using Google Chrome as my browser, I navigated to the page I wanted to pull captions from and opened the Developer Tools. Then I opened the Network tab and began to scroll, loading new data onto the page (Instagram uses continuous scroll).
I clicked on the requests that started with ?query_hash=, then stored the response as a global variable.
All this did was make that blob of data straightforward to access from the console. Back in the console, the blob is now called something like temp1.
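For reference, the console.save helper looks something like the widely shared DevTools snippet below. This is a reconstruction of that well-known snippet, not the exact code from the original post; paste it into the browser console once per session, and it will download whatever you pass it as a file:

```javascript
// Attach a save() helper to the console so any object or string can be
// downloaded as a file. Serializes objects to JSON, wraps the result in a
// Blob, and clicks a temporary link to trigger the browser download.
console.save = function (data, filename) {
  if (!data) {
    console.error('console.save: no data');
    return;
  }
  filename = filename || 'console.json';
  if (typeof data === 'object') {
    data = JSON.stringify(data, undefined, 4);
  }
  const blob = new Blob([data], { type: 'text/json' });
  const a = document.createElement('a');
  a.download = filename;
  a.href = window.URL.createObjectURL(blob);
  a.click();
};
```

So `console.save(temp1, 'captions.json')` would download the stored blob as captions.json.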
Then, using those stored global variables, I pulled out each caption and saved it to its own .txt file using the snippet below. You'll have to do this for every ~10 captions, so I recommend keeping it separate from the console.save function above. Just swap temp1 for the new global variable names you store from other requests in the Network tab; these will probably look like temp2, temp3, and so on.
(Again, this snippet exports only the data in the blob I stored, temp1, which had ~10 captions. To store more, you can repeat the step storing requests as global variables.)
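If you want a concrete starting point for that extraction step, a sketch is below. The property path matches the shape of Instagram's ?query_hash= GraphQL responses at the time and may well have changed since, so treat the field names (edge_owner_to_media, edge_media_to_caption) as assumptions and adjust them to whatever you see when you expand the blob in the console:

```javascript
// Pull the caption text out of one stored ?query_hash= response blob.
// Posts without a caption have an empty edges array, so they are skipped.
function extractCaptions(blob) {
  const edges = blob.data.user.edge_owner_to_media.edges;
  return edges
    .map(e => e.node.edge_media_to_caption.edges[0])
    .filter(Boolean) // drop posts with no caption
    .map(edge => edge.node.text);
}

// In the browser console, save each caption to its own .txt file
// using the console.save helper:
// extractCaptions(temp1).forEach((caption, i) => {
//   console.save(caption, 'caption_' + i + '.txt');
// });
```

Repeat with temp2, temp3, and so on for each request you stored.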
After I scraped a few hundred captions, I stored them in a folder and uploaded them to FloydHub as their own dataset. You can find the dataset I created here.
Creating a Workspace
You can make a copy of the Commencement Speech project from FloydHub by clicking this link. Once you've done that, you can make changes to the dataset and launch a Workspace where you can run the source code without much setup at all.
Mounting Your Dataset
If you don't have experience doing it, mounting a dataset can be very confusing. I'll admit that it took me some time to figure out this simple but critical step.
Once you launch your workspace, you're going to see the commencement speech dataset on the right. You will also see a space where you can add datasets. In the image below, I've added the fashion caption dataset that I created. (You can leave the commencement speech dataset there even if you're not using it. It won't get in your way. In fact, I found it useful when troubleshooting.)
The next step is to "Run" the JupyterLab environment to use the Markov Chain code in this project. (JupyterLab was introduced in February 2018 as a web-based interface for Project Jupyter. If this is all foreign to you, that's ok! Check out their announcement post to learn more about it.)
Navigate to speech_maker.ipynb, then update SPEECH_PATH to reflect the location of your dataset. This should be the path shown on the right-hand side when you add a dataset (like above).
SPEECH_PATH = '/floyd/input/fashioncaptions/'
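For context, here's a rough sketch of how a notebook might consume that path: gather every .txt caption file under the mounted dataset into one training corpus. The load_corpus helper and the one-caption-per-file layout are my assumptions, not the notebook's exact code (the FloydHub post builds its model with the markovify library):

```python
import glob
import os

SPEECH_PATH = '/floyd/input/fashioncaptions/'

def load_corpus(path):
    """Concatenate every .txt file under `path` into one training string."""
    texts = []
    for filename in sorted(glob.glob(os.path.join(path, '*.txt'))):
        with open(filename, encoding='utf-8') as f:
            texts.append(f.read().strip())
    return '\n'.join(texts)

# In the notebook, this corpus would then be fed to the Markov model,
# roughly along these lines:
# import markovify
# model = markovify.Text(load_corpus(SPEECH_PATH))
# print(model.make_short_sentence(140))
```

If the path is wrong, glob simply finds no files and the corpus comes back empty, so an empty result here is a good hint that the dataset isn't mounted where you think it is.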
Once you've done that, navigate over to floyd.yml and update the location of your dataset there as well.
What went wrong?
That's basically it! If you're lost, I recommend taking a look at the original post on FloydHub's blog.
Working on a similar project? We would really love to hear about it! Feel free to share below in the comments or reach out and let me know what you're up to.