Cornell Lab of Ornithology Improves Machine Learning Workflow with Azure HDInsight
For the last 14 years, the Cornell Lab of Ornithology has been collecting millions of bird observations through a citizen science project called eBird. This data can be used to model and understand the distribution, abundance and movements of birds across large geographic areas and over long periods of time, which yields priorities for broad-scale bird conservation initiatives. Previously, researchers at the lab used mid-sized traditional academic high performance computers, with modeling run times of 3 weeks for a single species. By moving their open-source workflow to Microsoft’s scalable Azure HDInsight service, the researchers were able to reduce their analysis run times to 3 hours, generating results for more species and providing quicker results for conservation staff to use in planning.
Hosted at the Cornell Lab of Ornithology, eBird is a citizen science project that engages birders in submitting their observations of birds to a central database. Birders seek to identify and record all birds that they find at a location and report how much effort they made to find those birds. eBird provides easy-to-use web and mobile applications that make recording and interacting with data convenient. Over the past 14 years, eBird has accumulated over 350 million records of birds across the world, representing 25 million hours of bird observation effort, with data volumes continuing to grow geometrically. This valuable resource offers Cornell researchers analysis opportunities at spatial and temporal scales they would not have been able to study otherwise.
Birds are known to be strong indicators of environmental health: they use a wide variety of habitats, respond to seasonal environmental cues, and undergo dramatic migrations that link distant landscapes across the globe. Conserving birds begins with understanding their distribution, abundance and movements across large geographic areas, at a high spatial resolution, and over long periods of time. By combining the bird observation data from eBird with remotely sensed land cover data from NASA, researchers at the Lab of Ornithology build models that can be used to understand the patterns of bird abundance and their associations with habitats across large geographic extents (such as the Western Hemisphere) and at a high spatial resolution (3 kilometres). With high resolution descriptions of bird abundance and habitat associations in hand, researchers can then work with bird conservation staff to identify broad-scale conservation priorities and monitor trends in bird abundance.
Creating high resolution models built with large amounts of eBird and NASA data requires a significant amount of computation time. The original solution for Cornell researchers was to employ mid-sized traditional academic high performance computers (HPCs) to run the machine learning models. However, a model for a single species required three weeks to run, making it impractical to generate results for the almost 700 species of birds that regularly inhabit North America and delaying the release of results to conservation staff. This was not a scalable solution and presented a challenge in generating enough results to be useful to the research and conservation communities. Researchers at the Cornell Lab of Ornithology also needed a way to scale their analyses with open-source technologies, since existing code bases used R and ran in the Linux environment. These challenges motivated the team to move to the cloud and look to Microsoft Azure as a platform for decreasing the clock time of their machine learning modeling workflow.
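To put the three-week figure in perspective, the backlog can be worked out with simple arithmetic. The sketch below is only an illustration: it assumes the species models would be run strictly one after another on a single HPC allocation, with no queue waits or downtime, which the case study does not state. The function name and the 52-week year are likewise assumptions made for the example.

```python
WEEKS_PER_YEAR = 52  # assumption: round figure, ignores downtime and queue waits


def serial_backlog_years(n_species: int, weeks_per_species: float) -> float:
    """Years needed to fit every species model one after another (assumed serial)."""
    return n_species * weeks_per_species / WEEKS_PER_YEAR


if __name__ == "__main__":
    # Figures from the text: almost 700 species, about three weeks per model.
    years = serial_backlog_years(n_species=700, weeks_per_species=3)
    print(f"{years:.1f} years")  # prints "40.4 years"
```

Under these assumptions, covering every regularly occurring North American species would take roughly four decades of continuous computing, which is why a per-species run time of weeks does not scale.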
Solution after Adopting the Cloud
With the help of a grant from Microsoft Research, Cornell researchers were able to develop, test, and deploy a scalable, open-source solution with Microsoft’s Azure HDInsight product. In total, the solution uses Microsoft Azure, Microsoft Azure Storage, Microsoft Azure HDInsight Service, Microsoft R Server, Ubuntu Linux, and Apache Hadoop MapReduce and Spark. Using these services, the researchers can scale clusters large enough to reduce the run time of a single-species model to 3 hours. Cornell researchers have so far been able to run dozens more species than they would have otherwise, continue to improve their modeling workflow in the Azure environment, and provide results more quickly to conservation staff.
Thanks to the scalability of Microsoft Azure, Cornell researchers are now able to run models for more species and share results with conservation staff more quickly, making them leaders in producing these kinds of results within the bird research community.
This Azure solution is scalable into the future. As eBird data grows, so will the computational needs of running the model. To accommodate this need, HDInsight clusters can continue to scale in size, so that even as eBird data volumes and total computation increase, wall clock run times for the models can remain the same. When working on HPCs in the past, Cornell researchers were often forced to wait in queues for long periods of time before computational resources became available. With Microsoft Azure, the researchers can create clusters on demand, 24-7, as needed. This availability improves the efficiency of overall analysis workflows and allows the broader Lab research group to ask and answer questions about bird distributions more easily.
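The scaling argument above can be made concrete with a back-of-the-envelope calculation. The following is a minimal sketch, assuming ideal linear scaling (wall clock time equals total work divided by node count), which real clusters only approximate. The function names and the 300 node-hour workload are hypothetical values chosen for illustration; only the 3-week and 3-hour run times come from the text.

```python
import math

HOURS_PER_WEEK = 7 * 24


def speedup(old_weeks: float, new_hours: float) -> float:
    """Ratio of the old run time (in weeks) to the new run time (in hours)."""
    return old_weeks * HOURS_PER_WEEK / new_hours


def nodes_needed(work_node_hours: float, target_hours: float) -> int:
    """Nodes required to finish the work within target_hours.

    Assumes ideal linear scaling, which is an idealization.
    """
    return math.ceil(work_node_hours / target_hours)


if __name__ == "__main__":
    # 3 weeks down to 3 hours, as reported in the case study.
    print(speedup(old_weeks=3, new_hours=3))  # prints 168.0

    base_work = 300.0  # hypothetical node-hours for one species model today
    for growth in (1, 2, 4):  # data volume (and work) doubling over time
        nodes = nodes_needed(base_work * growth, target_hours=3.0)
        print(f"work x{growth}: {nodes} nodes for a 3-hour run")
```

Under this idealized model, doubling the work while doubling the cluster size leaves the 3-hour wall clock time unchanged, which is the sense in which run times "can remain the same" as the data grows.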
Working in an academic environment, Cornell researchers need to be able to share their code with colleagues and port their workflow between systems. Their solution is built on the R language and parallelized with either Apache Hadoop MapReduce or Apache Spark, all open-source products. The academic community thrives on open source, since sharing code speeds innovation and makes the sharing of knowledge faster and easier. By maintaining this open-source environment on Microsoft Azure, the researchers can continue sharing their code and keep their workflow portable, leading to greater innovation within the research community and giving the researchers the flexibility to test and develop their workflow on other platforms.
Quicker to produce results and analyze more species
Most of all, moving to Microsoft Azure enables Cornell researchers to run their models dramatically faster than before, speeding up their research workflow. Now more of the nearly 700 species of birds found regularly in North America can be analyzed and better understood, with information about their abundances and habitat associations flowing faster to the bird conservation staff, who can use that information to prioritize and implement solutions to help protect and grow populations of birds. This workflow improvement would not have been possible without Microsoft Azure.
The Cornell Lab of Ornithology has been collecting millions of bird observations through the citizen science project eBird. Using this data, combined with remote sensing data from NASA, researchers at the lab have built machine learning models that can be used to understand the distribution, abundance, and movements of birds across large geographic scales and over long periods of time, yielding priorities for bird conservation targets. Moving from mid-sized academic computing environments to the cloud and Microsoft Azure HDInsight allowed the research team to adequately scale their open-source workflow and generate more results in less time, better enabling conservation action. Read more about Cornell’s work at:
Abundance models improve spatial and temporal prioritization of conservation resources. 2015. Ecological Applications. http://onlinelibrary.wiley.com/doi/10.1890/14-1826.1/abstract
Data-intensive science applied to broad-scale citizen science. 2014. Trends in Ecology & Evolution. http://www.sciencedirect.com/science/article/pii/S0169534711003296
The eBird enterprise: An integrated approach to development and application of citizen science. 2014. Biological Conservation. http://www.sciencedirect.com/science/article/pii/S0006320713003820
Spatiotemporal exploratory models for broad-scale survey data. 2010. Ecological Applications. http://onlinelibrary.wiley.com/doi/10.1890/09-1340.1/abstract