Thursday, October 15, 2015

Training neural nets using Amazon EC2

In my opinion, one of the most attractive aspects of artificial neural networks is that they provide incredible power and flexibility for very little implementation overhead. For most applications, very good software libraries exist that make implementing and applying these techniques simple.

Lately I've been playing around with Keras, an elegant high-level Python library that provides a broad set of tools for building and using neural nets. Keras sits on top of Theano, a powerful math library for n-dimensional arrays with extensive optimization features and support for parallel CUDA architectures, pretty much engineered for deep learning applications. Taking advantage of the highly parallel potential of CUDA-enabled graphics hardware can speed up neural net training tremendously. One way to do this without having to drop a few grand on your very own GPU workstation is to use Amazon EC2, which provides access to modest-but-capable virtual GPU compute nodes at a much lower cost than rolling your own.
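
Just to give a sense of how little code it takes, here's a minimal sketch of a small binary classifier in Keras, trained on random placeholder data. Layer signatures have shifted between Keras versions; this follows the current 0.x-style API, where Dense takes input and output dimensions positionally, so adjust to taste for your version:

import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Activation

# random placeholder data: 1000 samples, 20 features, binary labels
X = np.random.random((1000, 20))
y = np.random.randint(2, size=(1000, 1))

model = Sequential()
model.add(Dense(20, 64))           # 20 inputs, 64 hidden units
model.add(Activation('relu'))
model.add(Dense(64, 1))            # single output unit
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='sgd')
model.fit(X, y, nb_epoch=10, batch_size=32)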

You'll need an Amazon Web Services account, so if you don't have one yet, go ahead and get that set up. Once you're signed in to your AWS account, head to the EC2 spot instances panel.

I'll point out that at the time I'm writing this, Amazon offers two GPU instance types: g2.2xlarge and g2.8xlarge. Chances are good you want the smaller and considerably cheaper g2.2xlarge. The bigger 8x instance has the same GPU and won't be any faster for CUDA-enabled tasks, although it does have more memory, which might be helpful depending on the size of your inputs. You can also save a considerable amount of money by using spot instances (which you bid on) instead of on-demand instances (for which you pay a fixed hourly rate), although spot instances may be terminated without warning if the going price rises above your bid. The average price for g2.2xlarge spot instances is typically very affordable (around $0.10 per hour or less), but I have been setting my maximum bid around $3 to make sure my instances don't get interrupted during unpredictable price spikes. You can read more about how spot instances work and check the current prices here.
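
You can place spot requests through the console panel linked above, but if you'd rather script it, something along these lines should work with the boto3 library. The region, AMI ID, and key pair name here are placeholders you'd replace with your own:

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

response = ec2.request_spot_instances(
    SpotPrice='3.00',                  # maximum bid, in dollars per hour
    InstanceCount=1,
    LaunchSpecification={
        'ImageId': 'ami-xxxxxxxx',     # placeholder: your base AMI
        'InstanceType': 'g2.2xlarge',
        'KeyName': 'my-key-pair',      # placeholder: your EC2 key pair
    },
)

print(response['SpotInstanceRequests'][0]['SpotInstanceRequestId'])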

Setting up Keras on EC2


Running Keras on EC2 first requires setting up Theano and the NVIDIA CUDA drivers, which isn't difficult, but is a bit tedious. Markus Beissinger has some great instructions on how to do this, which I took the liberty of adapting into a basic set of shell scripts that automate the process of installing Theano, Keras, and CUDA on a remote machine. Note that if you just want Theano/CUDA and nothing else, and you don't need the latest versions, you can simply initialize your EC2 instance with Markus's AMI ami-b141a2f5. If you want more control, e.g. to specify newer versions or install additional components, my shell scripts are a good place to start. You can tweak them until you've got your installation process nailed down, and then create your own AMI so you can easily spawn new instances with your custom setup pre-loaded. For instance, I added a few custom steps to my own copies of the scripts, which set up additional security credentials, install a few more Python libraries I use, and pull a couple of my personal git repos onto the EC2 instance.

To get started, you can use git to grab the setup scripts:

git clone git@github.com:chinchliff/setuptheano.git

The scripts assume the remote instance is running Ubuntu Linux, so when you're choosing the initial AMI for your new EC2 instance, make sure you specify the base Ubuntu installation (currently Ubuntu 14.04). Once your instance is up and running, open the setup.sh script in the setuptheano directory you just cloned. At the top of the script, set the CUDA_IP variable to the public IP of your EC2 instance, and set CUDA_PEM to the local path of the EC2 .pem file you chose to use for authentication:

### set configuration variables here

CUDA_IP=127.0.0.1 # replace this with the ip of your ec2 instance
CUDA_USER=ubuntu # don't modify
CUDA_PEM=~/cuda.pem # local path to your ec2 ssh identity file


Then save and close the setup script, go to the command line, and run it by entering the command:

./setup.sh

When it asks you to add the new IP to the list of known hosts, answer yes. The script will run, installing Theano, CUDA support, and Keras onto your remote machine.
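
Once it finishes, it's worth confirming that Theano can actually see the GPU. This little test, adapted from the Theano documentation, times a simple elementwise computation and reports which device ran it:

# run on the EC2 instance with:
#   THEANO_FLAGS=device=gpu,floatX=float32 python check_gpu.py
import time
import numpy
from theano import function, config, shared, tensor

vlen = 10 * 30 * 768
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))

t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()

print('Looping %d times took %f seconds' % (iters, t1 - t0))
if numpy.any([isinstance(node.op, tensor.Elemwise)
              for node in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')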

Tuesday, September 22, 2015

Open Tree of Life paper is out!

The flagship paper from the Open Tree of Life project is finally out! The paper represents the culmination of three years of work by the participants of the Open Tree of Life project, including me. I share first authorship with my former advisor Stephen Smith, with whom I worked alongside Mark Holder, Joseph Brown, Jonathan Rees, and others to build the tools we used to combine hundreds of phylogenetic trees and source taxonomies into a comprehensive tree of life containing over 2.3 million tips.

The tree is not only browsable online and downloadable in its entirety (for those of you with a lot of RAM), but is also accessible via web services which are already being used to facilitate outside research and provide example data for the development of research pipelines such as Arbor. The machinery we built to construct and serve the tree of life was designed with frequent updates and community participation in mind—new versions of the tree will be released as community feedback and contributions allow us to refine details and improve the accuracy of the relationships among organisms.
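
For example, here's about the simplest possible call you can make to the tree of life web services, using the Python requests library; it just asks for summary information about the current synthetic tree. The v2 path reflects the API version as I write this and may change in future releases:

import requests

# ask the Open Tree of Life web services about the current synthetic tree
r = requests.post('https://api.opentreeoflife.org/v2/tree_of_life/about')
print(r.json())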

Although this project is not the first to approach the challenging task of publishing the tree of life, it is the first to combine available phylogenetic data in a way that makes a single comprehensive phylogenetic hypothesis spanning all life accessible to the public and the scientific community, and we are pretty stoked about what we've accomplished. If you're interested, you can read more about the tree in this press release from the University of Michigan and this article from the Christian Science Monitor, or for more details check out our recent AMA on reddit or browse our open source software repositories here.

Wednesday, September 2, 2015

Inceptionism

I recently read a blog post by Google about "inceptionism" where they describe a way to gain insight into the behavior of neural networks for image recognition by asking the networks to show what they "see" in images of random noise and arbitrary scenes. The results of this process are called "deep dream" images for short, and they are fascinating but also more than a little freaky. Popular Science called them nightmares, and I don't disagree.

Actually, the images really remind me of the visions I used to have as a young child when I'd rub my eyes really hard just before going to sleep. Strange fractalized swirling noise with recognizable objects blending together in infinite spinning repetition. They freaked me out then too, but I couldn't help myself. I'm not sure whether it's exciting or terrifying to think that Google's deep image learning tools have apparently managed to mimic the human brain so accurately that they can even reproduce the patterns we see where none really exist.

Search Google for more inceptionist images.

Thursday, August 20, 2015

Turn the wind to gravel roads

Forgetting for a moment about the blog posts I promised on methods for merging evolutionary trees (we have a paper coming out soon, at which point perhaps I'll make good on them), today I'll focus on something more tangible: I have successfully moved to Idaho.

I'm working here at the University of Idaho in Moscow, with Luke Harmon, a personal friend and also pretty bad-ass professor of evolutionary biology whose specialty is evolutionary statistics. I'm here for the opportunity to get more experience working with genomic data and to learn new analytical and statistical techniques for big data. Luke is the principal investigator on the Arbor project, which is funding my current position, and I'll be contributing to that project, hopefully with some analyses of sponge microbiome genomic data. I'm also currently burning rubber through a machine learning course from Coursera, which is fantastic. More to come.

Idaho is really a hidden jewel (a big one) of the Pacific Northwest, especially the mountainous northern region. There are few people, massive conservation lands, amazing outdoor opportunities, killer skiing, low taxes, and lots of interesting towns and places, like Coeur d'Alene, Riggins, Sun Valley, Boise, McCall, and Moscow. If you're ever in the area, give yourself a day or two to check it out.

Thursday, February 19, 2015

Ice coding

It's winter in Michigan. If you've never experienced a Midwest winter, you're missing out on one of the most remarkable tests of endurance a native Californian can undertake. Flowers are blooming up and down the west coast, but Ann Arbor is a world of ice, wind, and road salt. At the moment I write this, Mt. Everest base camp is actually a few degrees warmer (at -9F) than us. We're around -12F. Just going outside is dangerous.

Since frostbite and ice fishing just aren't my things, I tend to spend a lot of the winter coding. This winter, my team has been working on solving some big problems in scientific data analysis, and we've made big progress in the last few weeks. Our task is to combine many smaller evolutionary trees (e.g. the tree of dog life, the tree of fish life, the tree of mushroom life, etc.) into one massive tree of all life. It's a nontrivial problem—many of the trees conflict with one another in subtle ways, and choosing how best to combine them becomes technically challenging for even small numbers of trees and relatively few species. We're attempting to combine hundreds of trees and produce a tree of life with over 2 million tips.
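
To make the conflict problem a little more concrete, here's a toy sketch (emphatically not our actual pipeline) using the DendroPy library: two small trees that disagree about whether B groups with A or with C, and the Robinson-Foulds distance that quantifies their disagreement:

import dendropy
from dendropy.calculate import treecompare

# two toy trees over the same taxa that disagree about where B belongs
tns = dendropy.TaxonNamespace()
tree1 = dendropy.Tree.get(data='((A,B),C,D);', schema='newick', taxon_namespace=tns)
tree2 = dendropy.Tree.get(data='((A,C),B,D);', schema='newick', taxon_namespace=tns)

tree1.encode_bipartitions()
tree2.encode_bipartitions()

# the symmetric (Robinson-Foulds) difference counts groupings found in
# one tree but not the other; zero would mean the trees agree completely
print(treecompare.symmetric_difference(tree1, tree2))

Now imagine resolving disagreements like that one across hundreds of input trees and a couple million species, and you have a sense of the scale of the problem.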

I've been working on this project for a couple of years now, but we've recently had some methodological breakthroughs. Some are great examples of the application of classic algorithms; others are much less classic techniques we've had to invent, particularly with regard to merging taxonomic information with evolutionary trees. My next few posts will focus on these challenges and some of the solutions I've come up with to address them.

For now, I'm going to let the dog out and brave the short trip to the bus stop. Even though it's bitterly cold, it's also beautifully sunny today, so besides my hair freezing a few seconds after I walk out the door, it should be a nice walk.