The Dark Secret in Data Science
Trivia question: What do the following projects have in common?
1. Jim is tasked with predicting failures of industrial compressors, using extensive fleet telematics. He knows it’s a case for recurrent neural networks (RNNs), but training the models is highly compute-intensive, and requires fast networks and storage. Provisioning that compute capacity can take months. By the time he’s got the needed approvals, his internal business champion has moved on.
2. Susan is designing video analytics solutions for safety control in manufacturing. Her convolutional neural network (CNN) model is built, trained, and tested, and excels at high-speed image classification. Real-time inference needs to drive action, without the latency of the cloud. Lacking the edge compute capacity, the project never moves from pilot showcase into the real world.
3. Amit’s inspection robotics model has been developed, tested, and deployed to production. It performs so well that a fleet of robots is deployed at gas pipelines to identify equipment abnormalities. However, a few months later, robots fail to identify one potentially catastrophic defect. By then Amit has been assigned to new projects, and the fleet of robots gets parked.
The answer to the trivia question: All three projects were motivated by clear business objectives and involved highly skilled data scientists (both critical ingredients); sadly, all three also failed due to a lack of alignment across organizational silos and technology delivery processes.
Data science’s dark secret is that only a small percentage of Big Data projects ever sees the light of day in day-to-day business operations. That is despite the hype, and often significant efforts by IT and business teams. While in some cases that may be acceptable (e.g. for one-off research studies), in most cases it means wasted investments (in IT and people) and, even more critically, missed business opportunities.
Data is the new Code
Given the opportunities presented by “Artificial Intelligence” (AI), the inability to effectively productize data science is becoming a major impediment to business success. That’s a problem for Machine Learning (ML), and for Deep Learning (DL) in particular (given its huge appetite for training data).
Consider this: with the meteoric rise of Deep Learning, the focus is increasingly shifting from hand-programmed software logic and rules to the training of models. Software engineering is being replaced (in some instances, at least) by data engineering. De facto, data is becoming the new code, creating new opportunities across industries such as finance, healthcare, manufacturing, transportation, and more.
So, should we replace software engineers with data engineers and data scientists, and enter the brave new world? Not so fast. For one thing, not surprisingly, the software engineering discipline is alive and well, far from doomed. For another, building complex systems was never just about coding. The software industry has learned hard lessons that the data science world is only starting to grapple with.
Traditional “waterfall” software delivery approaches were challenged in the mid-1990s by two important realizations: first, it is very hard to plan for and scale complex projects with many dependencies; and second, real-life situations almost certainly won’t match planned-for scenarios or user expectations.
If data is the new code, what can data science learn from computer science? Let’s go down memory lane and see how software engineering and deployment have evolved over the years.
From Waterfall to DevOps
Lean manufacturing revolutionized physical industrial processes in the 80’s. Methods popularized by the Toyota Production System (TPS) proved you can materially increase throughput while improving total product quality. Central to this approach was (and continues to be) the continuous elimination of waste (‘muda’, or 無駄), driven by process transparency, automation, and empowered employees (e.g. allowed to take risks and to ‘fail fast’ by pulling the andon cord to stop the production line).
Fast-forward twenty years, and Agile software development kicked into full gear (with the Agile Manifesto published in 2001). Adopting lessons from industrial manufacturing processes, the software industry discovered the flaws of the traditional waterfall model and adopted its own ‘fail fast’ approach, anchored (again) in transparency, automation, frequent releases, and decision autonomy.
But Agile development and Continuous Integration (the frequent merging of code into shared repositories) did not address the ‘final mile’ of software delivery into production. To make software delivery a truly fast and reliable process, the industry needed to bring together stakeholders across organizations, challenging conventional wisdom and breaking down traditional power structures.
Fast-forward again ten years, and DevOps has become a well-accepted (although not always well understood or well implemented) approach for software delivery. Central to DevOps is joint ownership between software engineering (Dev) and corporate IT (Ops) for service delivery in support of digital initiatives. Engineering is responsible for delivering code that can be deployed to production, and it maintains this responsibility well beyond deployment. IT in turn ensures that infrastructure and services will deliver based on jointly committed service level agreements (SLAs).
From DevOps to DataOps
Lessons learned from computer science should translate well to data science; intuitively that makes sense. The similarities are plentiful, and besides, a good part of data science involves writing code: for data engineering, for model development, and for model implementation.
The philosophy and principles of the Agile Manifesto are readily applicable to data science: the focus on customer outcomes; the embrace of changing requirements; the frequent delivery; the cross-functional collaboration. Equally applicable are the philosophy and principles of Jez Humble’s lesser-known DevOps Manifesto (which builds on the Agile Manifesto).
Which brings us to DataOps (“Data Operations”). While the term is not necessarily new, it has recently gained prominence in conversations with practitioners across industries (see e.g. LinkedIn Jobs), technology and solution providers (e.g. Hitachi Vantara), and industry observers and analysts (e.g. Gartner).
So, if the DataOps philosophy and principles are so similar to DevOps, why do we need a new term and discipline? Despite the similarities, there are a few fundamental differences, especially when it comes to implementation: the core disciplines and stakeholders are different, the processes are different, and the tools and IT needs are different too. DataOps isn’t just DevOps applied to data science. Let’s take a closer look at key principles of DevOps (derived from the DevOps Manifesto), and how they can be implemented for DataOps.
Iterative and incremental
In DevOps, software is delivered through repeated cycles (iterative) and in small portions at a time (incremental). In each increment, a slice of functionality is delivered through cross-discipline work, from the requirements to the deployment.
In DataOps, models are also delivered through iterative and incremental cycles. In addition, however, model training itself happens through iterative and incremental cycles of data engineering. What is a single loop in DevOps becomes a double loop in DataOps, an increase in complexity.
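The double loop can be sketched in a few lines of Python. This is a hypothetical illustration only: `train`, `deliver_increment`, and the dictionary-based ‘model’ are stand-ins, not a real framework.

```python
# Hypothetical sketch of the DataOps "double loop": the outer loop is the
# DevOps-style incremental delivery cycle; the inner loop is the iterative
# model-training cycle nested inside each increment.

def train(model, batch):
    """Inner-loop step: update the model on one batch (placeholder logic)."""
    model["updates"] += 1
    return model

def deliver_increment(model, training_batches):
    """One outer-loop increment: train on all batches, then hand off."""
    for batch in training_batches:      # inner loop: iterative training
        model = train(model, batch)
    return model

model = {"version": 0, "updates": 0}
for increment in range(3):              # outer loop: incremental delivery
    model = deliver_increment(model, training_batches=[[1], [2]])
    model["version"] += 1

print(model)  # after 3 increments of 2 batches each: version 3, 6 updates
```

Note that each pass through the outer loop carries a full inner training loop, which is exactly where the extra complexity (and compute demand) of DataOps comes from.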
Continuous and automated
In DevOps, software delivery is a continuous and ideally highly automated process. Code commits trigger build, unit test, deployment and test, and ideally automated deployment into production. The phases are typically described as continuous integration, continuous delivery, and continuous deployment. A fairly mature industry of tooling has evolved to address the needs of software delivery.
In DataOps, the phases are similar at a macro level: the development and training of data science models; the testing of model accuracy and efficiency; the deployment of models to production; and the re-training of models as new data become available. However, there is no ‘magic quadrant’ for DataOps tools (and traditional DevOps tools don’t readily apply, notably lacking data management capabilities).
Self-service and collaborative
In DevOps, self-service is about empowering developers and operators, in order to decrease problematic or incorrect hand-offs, accelerate time to market, and improve organizational capacity. Self-service is achieved through organizational empowerment, but also through the ability to auto-provision: for example, test environments based on the specific needs of code to be built and integrated.
In DataOps, self-service is about empowering data scientists to use their tools and frameworks of choice for model development (such as Python/Jupyter, PyCharm, RStudio), but also about enabling the team to build and deploy data pipelines for model training and data engineering, as well as push-button provisioning of infrastructure for training and production inference.
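Push-button provisioning can be pictured as a toy resource broker. Everything here, the `provision` function and the request and capacity shapes, is hypothetical:

```python
# Hedged sketch of self-service provisioning: a data scientist requests
# resources declaratively, and a (hypothetical) provisioner resolves them
# without tickets or manual hand-offs.

def provision(request, capacity):
    """Grant the request if capacity allows; return an environment handle."""
    if request["gpus"] <= capacity["gpus"]:
        capacity["gpus"] -= request["gpus"]
        return {"status": "ready", "gpus": request["gpus"],
                "framework": request["framework"]}
    return {"status": "queued"}  # otherwise wait for capacity to free up

capacity = {"gpus": 8}
env = provision({"framework": "pytorch", "gpus": 4}, capacity)
print(env["status"], capacity["gpus"])  # ready 4
```

The design point is that the data scientist states *what* is needed (framework, compute), not *how* to stand it up; the broker owns the how.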
Five (5) Critical Technology Success Factors
DataOps is not a technology. As much as some vendors may claim you can “buy DataOps” — you cannot. It’s a philosophy and practice, which requires organizational alignment and cultural readiness. That said, in order to be successful with DataOps initiatives, some level of technology maturity is mission critical.
Let’s see what could have helped make Jim, Susan and Amit (see this blog’s intro trivia question) successful.
1. Agile data infrastructure
DataOps hinges on an agile data infrastructure: the organizational ability to allocate and scale compute infrastructure elastically, as needed. Jim, in our intro case study, was slowed down by his inability to rapidly provision compute infrastructure.
While in some cases cloud-based services may be the answer, the reality is typically not as straightforward. Data may be distributed across on-premises and multi-cloud platforms; compliance and regulatory requirements may constrain where and when to deploy compute for DataOps. A cohesive data management strategy and technology infrastructure is recommended.
2. Automated data pipelines
Data infrastructure and data pipelines need to support self- or auto-provisioning, based on data scientists’ needs. To stay with Jim: as a data scientist, he doesn’t really want to spin up IT infrastructure. All he wants to do is train his RNN models, and to do so fast.
The data pipelines the data scientist (or data engineer) provisions should automatically, behind the scenes, scale the infrastructure so it is up to the task, and de-provision it when no longer needed. For example, data pipelines need to understand the recipes for scaling up with Hadoop and Spark.
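A minimal sketch of that behind-the-scenes scaling, with a toy sizing ‘recipe’ in place of real Hadoop/Spark scaling logic (the records-per-worker ratio is an arbitrary illustration):

```python
# Illustrative auto-scaling pipeline: workers are sized to the data volume
# on the way in, and released on the way out, whether or not the job fails.

def required_workers(num_records, records_per_worker=1000):
    # Scale-up recipe: one worker per 1000 records, at least one worker.
    return max(1, -(-num_records // records_per_worker))  # ceiling division

def run_pipeline(num_records):
    workers = required_workers(num_records)  # auto-provision for this run
    try:
        pass  # ... run the actual training / data-engineering job here ...
    finally:
        workers = 0                          # de-provision when done
    return workers

print(required_workers(2500))  # → 3
```

The `try`/`finally` shape matters: de-provisioning must happen even when a job fails, or idle (and costly) infrastructure accumulates.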
3. Model deployment workflows
Once developed using the preferred (and often open-source-based) tools, data models need to be wrapped, deployed, and injected into end-to-end analytical workflows. Susan lacked the ability to deploy her model where it was really needed: at the ‘edge’ of the IoT infrastructure.
Often, those analytical workflows need to be embedded into business applications, in support of processes such as call centers, sales, or manufacturing; and model execution may need to happen in the cloud, or on the ‘IoT edge’. For example, a container-friendly IoT edge architecture should be considered to enable rapid deployment.
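One simple way to picture the ‘wrapping’ step is bundling the model artifact with deployment metadata, so the same package can target cloud or edge. A hedged sketch using only the Python standard library (the package format is invented for illustration):

```python
# Hypothetical model-wrapping sketch: bundle the serialized model with
# metadata (version, target runtime) into a single deployable package.

import json
import pickle

def wrap_model(model, version, target):
    """Bundle model bytes and deployment metadata into one package."""
    return {
        "artifact": pickle.dumps(model),
        "metadata": json.dumps({"version": version, "target": target}),
    }

def unwrap_model(package):
    """Restore the model and its metadata at the deployment target."""
    return pickle.loads(package["artifact"]), json.loads(package["metadata"])

package = wrap_model({"weights": [0.1, 0.2]}, version="1.0", target="edge")
model, meta = unwrap_model(package)
print(meta["target"])  # edge
```

In practice the metadata is what lets a deployment workflow route the same artifact to the right runtime (cloud service, edge container) without changing the model itself.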
4. Model testing
Even with 100% accurate data, data science isn’t an exact science; there are typically several approaches to solving a given problem. Amit learned late in the project that his model wasn’t quite up to the task.
Multiple models need to be compared and benchmarked, not just in terms of accuracy under different scenarios, but also in terms of compute infrastructure cost and maintenance cost. Technologies are available to compare models and identify the most accurate one for a specific task.
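Benchmarking candidate models on both accuracy and compute cost can be sketched as follows; the two ‘models’ are trivial stand-ins, and the wall-clock timing is only a proxy for infrastructure cost:

```python
# Sketch of benchmarking multiple candidate models on the same task,
# scoring both prediction error and (a proxy for) compute cost.

import time

def mean_model(xs):          # candidate 1: always predict the training mean
    m = sum(xs) / len(xs)
    return lambda _: m

def last_value_model(xs):    # candidate 2: always predict the last value
    last = xs[-1]
    return lambda _: last

def benchmark(factory, train, test):
    start = time.perf_counter()
    predict = factory(train)
    error = sum(abs(predict(x) - y) for x, y in test) / len(test)
    return {"error": error, "seconds": time.perf_counter() - start}

train = [1, 2, 3, 4]
test = [(None, 3), (None, 3)]
results = {f.__name__: benchmark(f, train, test)
           for f in (mean_model, last_value_model)}
best = min(results, key=lambda name: results[name]["error"])
```

Because every candidate runs through the same `benchmark` harness, accuracy and cost figures are directly comparable, which is the essence of the model-testing discipline described above.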
5. Metrics and monitoring
You cannot manage what you cannot measure. That old saying certainly applies to data science. Jim, Susan, and Amit would all have benefited from real-time metrics measuring project success, and from alerts highlighting potential or actual bottlenecks.
Model accuracy, consumption of compute infrastructure, the speed to train models and infer results: a DataOps system needs to account for all relevant metrics. Measurement frameworks, tools, and dashboards can help.
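A minimal sketch of such a metrics check, with illustrative metric names and thresholds (not taken from any real framework):

```python
# Toy metrics-and-alerts check: collect the metrics a DataOps system should
# track and flag any that fall outside an agreed threshold range.

def check_metrics(metrics, thresholds):
    """Return alert messages for metrics outside their allowed ranges."""
    alerts = []
    for name, (low, high) in thresholds.items():
        value = metrics.get(name)
        if value is None or not (low <= value <= high):
            alerts.append(f"{name}={value} outside [{low}, {high}]")
    return alerts

metrics = {"accuracy": 0.87, "gpu_hours": 140, "train_minutes": 95}
thresholds = {
    "accuracy": (0.90, 1.00),    # model-accuracy floor
    "gpu_hours": (0, 120),       # compute-consumption budget
    "train_minutes": (0, 120),   # time-to-train SLA
}
alerts = check_metrics(metrics, thresholds)
# accuracy and gpu_hours breach their thresholds in this example
```

Wired into a dashboard, checks like this are what would have alerted Amit to degrading model accuracy before the robots missed a critical defect.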
Data science, machine learning, and deep learning are here to stay, and will continue to have a profound impact on organizations’ top and bottom lines, across industries. The faster organizations can deploy working models into production (and continue to improve them over time), the better prepared they’ll be for business success. While DataOps is still a fairly young discipline, best practices and proven technologies already exist today to drive more frequent releases at higher quality, for ultimately better business outcomes.