DevOps for Data Science – Load Testing and Auto-Scale
In this series on DevOps for Data Science, I’ve explained the concept of a DevOps “Maturity Model” – a list of things you can do, in order, that will set you on the path for implementing DevOps in Data Science. You can find each Maturity Model article in the series here:
- Infrastructure as Code (IaC)
- Continuous Integration (CI) and Automated Testing
- Continuous Delivery (CD)
- Release Management (RM)
- Application Performance Monitoring
- Load Testing and Auto-Scale (This article)
The final level of the DevOps Maturity Model is Load Testing and Auto-Scale. Note that you want to follow this progression: there’s no way to do proper load testing if you haven’t already automated the Infrastructure as Code, CI, CD, RM and APM phases. The reason is that the automatic balancing you’ll do depends on the automation that precedes it. There’s no reason to scale something that you’re about to change.
I covered automated testing in a previous article, but that type of testing focuses primarily on functionality and integration. For load testing, you’re running the system with as many inputs as you can, until it fails. As the Data Science team, you should inform the larger testing team about any load testing you’ve done on your trained model (or on the re-training task, if that is incorporated into your part of the solution). You can use any load testing tools that run in R, Python, or whatever language and runtime you are using.
The larger testing team will incorporate those numbers and run a “hammer” test on the entire solution to see when the application becomes overloaded.
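To make the Data Science side of that concrete, here is a minimal sketch of a ramp-style load test in Python, using only the standard library. Everything named in it is an assumption for illustration: `predict` is a hypothetical stand-in for your model's scoring call (it just sleeps for about 2 ms), and the concurrency levels and the 0.5 second latency budget are made-up values, not recommendations. The idea is simply to raise the number of concurrent callers step by step and stop at the first step where the 95th percentile latency exceeds the budget.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def predict(features):
    """Hypothetical stand-in for a trained model's scoring call."""
    time.sleep(0.002)  # simulate roughly 2 ms of inference work
    return sum(features)


def timed_call(features):
    """Return how long one predict() call took, in seconds."""
    start = time.perf_counter()
    predict(features)
    return time.perf_counter() - start


def run_step(concurrency, requests_per_worker=20):
    """Send concurrency * requests_per_worker calls and summarize them."""
    payloads = [[1.0, 2.0, 3.0]] * (concurrency * requests_per_worker)
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, payloads))
    wall_seconds = time.perf_counter() - wall_start
    # Simple index-based estimate of the 95th percentile latency.
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return {
        "concurrency": concurrency,
        "requests": len(latencies),
        "throughput_rps": len(latencies) / wall_seconds,
        "p95_seconds": p95,
    }


def ramp(levels=(1, 2, 4, 8), p95_budget_seconds=0.5):
    """Increase load level by level; stop once p95 exceeds the budget."""
    results = []
    for level in levels:
        step = run_step(level)
        results.append(step)
        if step["p95_seconds"] > p95_budget_seconds:
            break  # this is the level where the model stopped keeping up
    return results


if __name__ == "__main__":
    for row in ramp():
        print(row)
```

The rows this produces (concurrency, throughput, and p95 latency per step) are the kind of numbers you would hand to the larger testing team. A thread pool is used here because the stand-in call only waits; a CPU-bound model would likely need processes or a dedicated load testing tool instead.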
An interesting development I’m seeing lately is that the Data Science team is asking for the metrics from the load tests (which of course also contain performance information) to do data analysis and even prediction. That’s a great value-add.
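As a rough illustration of that kind of analysis, the sketch below fits a straight line to load test metrics and uses it to estimate the load at which latency would cross a budget. The numbers in it are invented purely for the example, and the function names are hypothetical. The linear model is also an assumption: real systems often degrade much faster than linearly as they approach saturation, so treat this as a starting point rather than a forecast you can rely on.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_latency(slope, intercept, users):
    """Estimated latency at a given number of concurrent users."""
    return slope * users + intercept


def load_at_budget(slope, intercept, budget_ms):
    """Estimated user count where latency reaches the budget, or None."""
    if slope <= 0:
        return None  # latency is not growing with load in this data
    return (budget_ms - intercept) / slope


if __name__ == "__main__":
    # Made-up sample metrics: concurrent users vs. p95 latency in ms.
    users = [10, 20, 30, 40]
    p95_ms = [120.0, 140.0, 160.0, 180.0]

    slope, intercept = fit_line(users, p95_ms)
    print("estimated p95 at 100 users:", predict_latency(slope, intercept, 100))
    print("users at a 500 ms budget:", load_at_budget(slope, intercept, 500.0))
```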
The Auto-Scale maturity level is where you really need to interact with the entire team, from the very earliest planning phase. Of course, that is the very definition of DevOps. You need to find out how large the system will be as early as possible, because it can affect the design of your system. Certain technologies are built to scale (Spark, Hadoop, Docker, and others), while other technologies don’t parallelize or scale well. If the solution will grow, writing your code in an efficient but unscalable technology will come back to hurt the application in the end. Likewise, if you create a huge architecture and the solution needs to scale down to an “Internet of Things” environment, you’ll face issues. Some languages can be used on both scalable technologies and smaller ones, so it’s up to you to know the limits and features of each option as you work through these scenarios.
With that, we’re done with my series on DevOps for Data Science. Follow the Maturity Model, develop the DevOps mindset, and take it one step at a time. It’s a journey worth taking.
- Need a quick introduction to DevOps? Check out this series: https://channel9.msdn.com/Series/DevOps-Fundamentals
- Here’s a complete, full course on DevOps on the Microsoft Virtual Academy - https://mva.microsoft.com/en-us/training-courses/devops-with-visual-studio-team-services-and-team-foundation-server-16779#!