How to: Using Markov Chains to Generate Fashion Blogger Captions


Image borrowed from Marco Waldner.

What is a Markov Chain?

You may have noticed odd captions on some of our Instagram posts lately, with notes indicating they were generated by a Markov Chain. A Markov Chain is a probabilistic model that can generate text (and other kinds of data) based on the probability of each element following the previous one, learned from a specified dataset. For a more technical and precise explanation, check out this post from Towards Data Science.
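To make that concrete, here's a minimal word-level sketch in Python (my own illustration, not the code from the tutorial): it records which words follow which in a source text, then walks those counts to generate new text.

import random
from collections import defaultdict

def build_chain(text):
    # Map each word to the list of words that followed it in the text;
    # repeats in the list make frequent followers more likely to be picked.
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, max_words=10):
    # Walk the chain from a start word, sampling each next word
    # in proportion to how often it followed the current one.
    word = start
    output = [word]
    for _ in range(max_words - 1):
        followers = chain.get(word)
        if not followers:  # dead end: this word was never followed
            break
        word = random.choice(followers)
        output.append(word)
    return ' '.join(output)

corpus = "good morning loves #ootd good vibes only good morning sunshine"
chain = build_chain(corpus)
print(generate(chain, "good"))  # e.g. "good morning sunshine"

Real implementations add more than this (sentence boundaries, longer context windows), but the core idea is the same: the next word depends only on the current state.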

The funny thing about these text generators is that sometimes the captions were completely believable and other times, they were varying degrees of weird. Here's a small sampling:

"It's a constant struggle to be sure #ootn"
"Your time on this outfit?"
"Morning sip with the family #ootd"
"I hope you find a moment for yourself."

Taking Advantage of Tutorials

Floydhub has built some incredible tutorials that make it easier than ever to implement machine learning models for a range of use cases. In June 2018, they released a blog post, Generating Commencement Speeches with Markov Chains. I had seen projects like this before, but usually implementation required a lot of development setup. (That said, you don't need to use Floydhub to use this code, but they do make it easy.)

Tutorials are a great jumping-off point for projects that you might have a difficult time tackling from scratch. In fact, most software engineers will tell you that the most valuable skill you can have when writing code is knowing what to search for. Depending on the project, it might make more sense to retrofit existing code than to start from zero.

Collecting Data / Scraping Instagram Captions

First, I'd like to address the approach I've taken in this project. I tackled it with a hacker's mentality more than a software engineer's; the code in this project is not elegant.

To collect data for this project, I decided to match the format of the dataset in the tutorial so that I wouldn't have to figure out how to adapt the code later on. This isn't a choice that everyone would make, but it's what I did. In that dataset, each text 'item' is stored as an individual text file (.txt). I scraped captions from Instagram by running some basic JavaScript in the browser console. The console is a developer tool that lets you write scripts against the current page and quickly test whether they work, or, in this case, pull information off the page.
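To give a sense of that format, here's a hypothetical Python sketch of the end result, one .txt file per caption (the folder and filenames here are my own invention; as described below, I actually saved the files from the browser console):

from pathlib import Path

# Hypothetical example: write each caption in a list to its own .txt file,
# matching the one-file-per-item layout of the tutorial's dataset.
captions = ["Morning sip with the family #ootd", "It's a constant struggle to be sure #ootn"]

out_dir = Path("captions")
out_dir.mkdir(exist_ok=True)
for i, text in enumerate(captions):
    (out_dir / f"caption_{i}.txt").write_text(text, encoding="utf-8")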

Using Google Chrome as my browser, I navigated to the page I wanted to pull captions from and opened the Developer Tools. Then I opened the Network tab and began to scroll, loading new data onto the page (Instagram uses continuous scroll).

[Screenshot: Chrome Developer Tools open on an Instagram page]

I clicked on the requests that started with ?query_hash=, then stored each response as a global variable.

[Screenshot: storing response data from Instagram as a global variable]

All this did was make that blob of data straightforward to access; back in the console, it will show up under a name like temp1.

[Screenshot: a stored network request appearing in the console as temp1]
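For orientation, the relevant part of that stored blob is nested roughly like this. This is an abbreviated sketch written as a Python dict for readability, inferred from the field names used in the extraction loop later in this post; the real response contains many more fields.

temp1 = {
    "data": {
        "user": {
            "edge_owner_to_timeline_media": {
                "edges": [  # one entry per post in this batch
                    {
                        "node": {
                            "edge_media_to_caption": {
                                "edges": [
                                    {"node": {"text": "Morning sip with the family #ootd"}}
                                ]
                            }
                        }
                    },
                    # ...more posts
                ]
            }
        }
    }
}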

Now I'm going to show you a little bit of code. I used an existing script that I found for saving text to a basic text file. Running it in the console stores the function in your browser session, and you can then call it any time with console.save. If you refresh the page, you'll have to run it again. I'm glossing over a lot of JavaScript 101 information here, but you can learn more here if you're interested.

// Defines console.save(data, filename) for downloading data from the
// console as a file. Adapted from an existing snippet.
(function(console){
    console.save = function(data, filename){
        if (!data) {
            console.error('Console.save: No data')
            return
        }
        // Fall back to a default filename
        if (!filename) filename = 'console.json'
        // Serialize objects as pretty-printed JSON
        if (typeof data === 'object') {
            data = JSON.stringify(data, undefined, 4)
        }
        var blob = new Blob([data], {type: 'text/json'}),
            e    = document.createEvent('MouseEvents'),
            a    = document.createElement('a')
        // Point a temporary download link at the blob...
        a.download = filename
        a.href = window.URL.createObjectURL(blob)
        a.dataset.downloadurl = ['text/json', a.download, a.href].join(':')
        // ...and simulate a click on it to trigger the download
        e.initMouseEvent('click', true, false, window, 0, 0, 0, 0, 0, false, false, false, false, 0, null)
        a.dispatchEvent(e)
    }
})(console)

Then, using those stored global variables, I pulled out each caption and saved it to its own .txt file using the snippet below. You'll have to do this for each request you stored (each one holds only about a dozen captions), so I recommend keeping this snippet separate from the console.save function above. Just swap out temp1 for the new global variable names you store from other requests in the Network tab. These will probably look like temp2, temp3, and so on.

// Save each caption in the stored request (temp1) to its own .txt file;
// swap temp1 for temp2, temp3, etc. for later batches.
for (var i = 0; i < 12; i++) {
    console.save(temp1.data.user.edge_owner_to_timeline_media.edges[i].node.edge_media_to_caption.edges[0].node.text, 'caption_' + i + '.txt')
}

(Again, this snippet exports only the data in the blob I stored, temp1, which held about a dozen captions. To export more, repeat the step of storing requests as global variables.)

After I scraped a few hundred captions, I stored them in a folder and uploaded them to Floydhub as their own dataset. You can find the dataset I created here.

Creating a Workspace

You can make a copy of the Commencement Speech project from Floydhub by clicking this link. Once you've done that, you can make changes to the dataset and launch a Workspace where you can run the source code with very little setup.

Mounting Your Dataset

If you haven't done it before, mounting a dataset can be very confusing. I'll admit that it took me some time to figure out this simple but critical change.

Once you launch your workspace, you're going to see the commencement speech dataset on the right. You will also see a space where you can add datasets. In the image below, I've added the fashion caption dataset that I created. (You can leave the commencement speech dataset there even if you're not using it. It won't get in your way. In fact, I found it useful when troubleshooting.)

[Screenshot: the commencement speech dataset shown on the right side of the Floydhub workspace]

The next step is to "Run" the JupyterLab environment to use the Markov Chain code in this project. (JupyterLab was introduced in February 2018 as a web-based interface for Project Jupyter. If this is all foreign to you, that's ok! Check out their announcement post to learn more about it.)

Navigate to speech_maker.ipynb, then update SPEECH_PATH to reflect the location of your dataset. This should be the path shown on the right-hand side when you add a dataset (as above).

SPEECH_PATH = '/floyd/input/fashioncaptions/'
[Screenshot: updating SPEECH_PATH in speech_maker.ipynb]

Once you've done that, navigate over to floyd.yml and update the location of your dataset there as well.

[Screenshot: updating the dataset path in floyd.yml]

What went wrong?

In this project, I ran into a problem with some of the special characters used in the captions in my original data. I needed to add the bit of code below to the speech_maker.ipynb file in order to remove those characters. I should mention that this snippet is written in Python, not JavaScript (the primary language discussed in this post so far).

# Strip characters from the captions that were breaking the parsing
bad_chars = [
    '(',
    ')',
    '[',
    ']',
    '"',
    "'",
]
for char in bad_chars:
    contents = contents.replace(char, '')
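To show where a cleanup like this fits, here's a hedged sketch that reads every caption file under SPEECH_PATH and strips those characters as it goes (the file-reading loop is my own illustration; the notebook's actual code may differ):

import os

SPEECH_PATH = '/floyd/input/fashioncaptions/'
bad_chars = ['(', ')', '[', ']', '"', "'"]

captions = []
for filename in os.listdir(SPEECH_PATH):
    if not filename.endswith('.txt'):
        continue
    with open(os.path.join(SPEECH_PATH, filename), encoding='utf-8') as f:
        contents = f.read()
    # Apply the same character cleanup to each caption as it's read
    for char in bad_chars:
        contents = contents.replace(char, '')
    captions.append(contents)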
Here are a few more generated captions:

"on our way to steal yo toast"
"getting the house photo ready to crush the week"
"that time i properly got dressed today becaues i was at the top of the best medicine god has ever blessed me with"
"testing out the link in my life"

That's basically it! If you're lost, I recommend taking a look at the original post on Floydhub's blog.

Working on a similar project? We would really love to hear about it! Feel free to share below in the comments or reach out and let me know what you're up to.