DevSecOps and the Problem of Machine-Scale Data

“Shifting Left” From DevOps to DevSecOps

When development teams using waterfall approaches couldn’t keep up with customer requirements, they adopted DevOps and Agile SDLCs. While these flexible approaches attempt to meet customer demands, security processes get left behind. You either skip security, or you aren’t really agile. Either way, you’re losing the benefits of adapting rapidly to customer needs.

Now that new regulations and consumer awareness have made privacy and security a priority, the industry’s recognized that they need to be built into the SDLC. “Shift left” means integrating processes and testing that have traditionally happened at the end into the development process itself, and you often hear that term used to describe a transition from DevOps to DevSecOps.

DevSecOps is generally described as a manifesto holding that security is built in from the beginning, into every step of the SDLC. DevSecOps recognizes that security is the responsibility of all team members, yet none of the top 40 computer science programs requires security classes.1 Developers are not prepared for the shift.

This is a fundamental problem, but even more troubling is that, as described today, DevSecOps is merely a software and application development mindset. What about data, the main driver for any security strategy? If in order to remain agile you must build security into your SDLC from the start, then you also have to make data – and data handling – a first-class citizen in that conversation.

Coming to Terms with “Machine-Scale” Data

Some 2.5 quintillion bytes of data are created daily, according to Domo.com,2 and that number is growing. Exponentially. It’s no longer just people creating this data; the applications we build are a fast-growing contributor. How much data does your Nest generate? Your Fitbit?

So much data is being generated that people – and manual processes – can’t scale. Those bytes are aggregated into data lakes in hopes that machine learning can make sense of them. We develop algorithms to reason over complex inputs within fractions of seconds because even as data has proliferated, our patience has shrunk. How tolerant are you when an application lags?

Coming to terms with “machine-scale” data means acknowledging that we need automation. We need the right technological tools to enable data handling at scale. No matter how many skilled humans we have, we can’t process this much data as swiftly as it’s being generated. We have to remove humans – and human error – as a chokepoint to the flow of information.

Machine-Scale Opportunities…And Threats

Mining the vast trove of data we generate leads to better business intelligence, informed decision-making, and enhanced customer offerings. Taking advantage of machine-scale data results in operational advancements and market improvements. 

Data lakes used to be the domain of large enterprises that could afford the huge infrastructure to store and process massive amounts of data, but cloud tools and IaaS have made this technology widely available. The opportunity to glean appreciable value from troves of data is now affordable for smaller companies, too. Organizations of all sizes are now moving, storing, processing, and handling data in the cloud.

You are, no doubt, familiar with the risks this presents. Data no longer remains inside your network perimeter. Even if you do have great perimeter security – and you probably don’t – the data no longer resides there.

Besides the complexities of managing data in the cloud – or in multiple clouds, or in conjunction with an on-premises environment – data aggregation itself has a creepy side. Brokers trade location and personal information through data markets, and the insights gleaned from such correlations can range from manipulative targeting, as in the case of Cambridge Analytica,3 to national security threats, as in the case of the OPM breach.4

The rise of regulations like GDPR and CCPA mirrors these increasing threats to privacy and security. The regulations pose significant risks to companies that don’t have the visibility to understand how data is being used and can’t lock down how it should be used. But these regulations also represent an opportunity to accelerate the “shift left” of security tools and processes to cover the entire DevOps lifecycle.

How Do You Remain Agile in a Machine-Scale World?

Returning to our first quandary of remaining agile but secure: How do you do this with data? To maintain agility while minimizing risk and compliance overhead, you need a set of criteria. You need to:

  • Know the data. How do I maintain compliance if I don’t know what’s being accessed? What is contained within those bytes? This requires both the tools necessary to automate discovery and a data identity store capable of scale.
  • Know the consumers…and this includes users, services and devices. Who (or what) is accessing those bytes? What are the things that are known about the consumer at the time of any data access? How is it being consumed? This requires a consumer identity store that can integrate easily with identity providers, and directory and claims systems.
  • Know the context. When, where, and how can a piece of data be accessed? For what purpose? This requires a flexible decision engine to determine if an access represents an appropriate use of information according to policy.
  • Audit the chain. Can you, in an automated fashion, show when, where, by whom, and for what reason a piece of data was accessed? Can you demonstrate regulatory compliance? Auditing itself generates more data in our machine-scale world, so the audit records need to be easily consumable, pre-packaged if possible. Encrypt when practical; control and audit always.
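To make the four criteria concrete, here is a minimal Python sketch of how they might fit together. It is an illustration under stated assumptions, not a description of any particular product: the store contents, attribute names, and the decide function are all hypothetical, and a real deployment would back the two stores with discovery tooling and identity providers rather than in-memory dictionaries.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Every name in this sketch is hypothetical and exists only for illustration.

@dataclass(frozen=True)
class AccessRequest:
    data_id: str      # which piece of data is being requested
    consumer_id: str  # which user, service, or device is asking
    purpose: str      # declared reason for the access
    location: str     # where the request originates

# "Know the data": a toy data identity store mapping data to attributes.
DATA_STORE = {
    "record-17": {"classification": "personal", "region": "EU",
                  "allowed_purposes": {"billing"}},
    "report-42": {"classification": "public", "region": "US",
                  "allowed_purposes": set()},
}

# "Know the consumers": a toy consumer identity store.
CONSUMER_STORE = {
    "svc-billing": {"kind": "service", "clearance": "personal"},
    "dev-kiosk": {"kind": "device", "clearance": "public"},
}

# "Audit the chain": every decision is appended here, allowed or not.
AUDIT_LOG = []

def decide(request):
    """'Know the context': evaluate one request and record the outcome."""
    data = DATA_STORE.get(request.data_id)
    consumer = CONSUMER_STORE.get(request.consumer_id)
    if data is None or consumer is None:
        allowed, reason = False, "unknown data or consumer"
    elif data["classification"] == "public":
        allowed, reason = True, "public data"
    elif consumer["clearance"] != data["classification"]:
        allowed, reason = False, "insufficient clearance"
    elif request.location != data["region"]:
        allowed, reason = False, "request outside data region"
    elif request.purpose not in data["allowed_purposes"]:
        allowed, reason = False, "purpose not permitted"
    else:
        allowed, reason = True, "policy satisfied"
    AUDIT_LOG.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "data_id": request.data_id,
        "consumer_id": request.consumer_id,
        "purpose": request.purpose,
        "location": request.location,
        "allowed": allowed,
        "reason": reason,
    })
    return allowed
```

Calling decide(AccessRequest("record-17", "svc-billing", "billing", "EU")) would be allowed under these toy rules, while the same request from "dev-kiosk" would be denied. Both attempts land in AUDIT_LOG, which is the point: the denial is as much a part of the auditable chain as the grant.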

So, Practically Speaking, What Does All This Mean for DevSecOps?

I’m a practical person, and this is where I get passionate. DevSecOps is ostensibly about making security part of your whole process, right from the beginning. But it doesn’t talk about data handling, and it needs to. Static scanning is important, but you can’t keep telling your developers to just review application security testing results and hope that’s going to teach them how to write more secure code. Because it’s not.

Remember the reason that you need security – data – and remember the machine-scale problems it poses. You need to build scalable data handling technology into your frameworks as an automatic part of any data production, storage, and access solution.

Let me introduce an analogy using a low-level runtime library: If I statically link libgcc into every application, then discover it has a bug, I have to go in and rebuild every single one of my applications. I have to go back through the whole SDLC…for every application. Data handling policy presents the same challenge: If I code my application to address GDPR requirements, then CCPA comes out…each application must be rebuilt, retested, and redeployed. Perhaps this isn’t that big of a deal if you only have two or three apps, but most enterprises I talk to have hundreds or thousands. Just the effort of mapping which app is accessing what data creates a migraine. Now think about redoing the hard-coded policy in each one. Ouch.

The practical answer is: Abstract data handling from application code. Allow your policy to adapt just-in-time to the context of the request, using the data and consumer identity stores I called out earlier. You don’t have to write code to do this; externalizing policy introduces the scale you need to survive and thrive in this machine-scale world. 
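As a rough illustration of what abstracting policy from application code can look like, the Python sketch below keeps the rules in a data document and has the application call a single generic check. Everything here is an assumption made for the example: the rule format, the function names, and especially the rule contents, which are placeholders and not statements of what GDPR or CCPA actually require.

```python
import json

# Hypothetical policy document. In practice it would live outside the
# application (a file, a service, a database) so it can change without a
# rebuild. The rule contents are placeholders, not real legal requirements.
POLICY_V1 = json.loads("""
{
  "rules": [
    {"regulation": "GDPR",
     "when": {"subject_region": "EU"},
     "require": {"purpose": ["billing", "support"]}}
  ]
}
""")

def is_allowed(policy, request):
    """Return True if every rule that applies to the request is satisfied."""
    for rule in policy["rules"]:
        # A rule applies only when all of its "when" attributes match.
        applies = all(request.get(k) == v for k, v in rule["when"].items())
        if not applies:
            continue
        # Each required attribute must take one of the permitted values.
        for key, permitted in rule["require"].items():
            if request.get(key) not in permitted:
                return False
    return True

def with_rule(policy, rule):
    """Return a new policy with one more rule added."""
    return {"rules": policy["rules"] + [rule]}

# A second regulation arrives later: it is added as data, not as code.
CCPA_RULE = {
    "regulation": "CCPA",
    "when": {"subject_region": "California"},
    "require": {"purpose": ["billing"]},
}
POLICY_V2 = with_rule(POLICY_V1, CCPA_RULE)
```

The application only ever calls is_allowed. A request such as {"subject_region": "California", "purpose": "marketing"} passes under POLICY_V1 and is refused under POLICY_V2, yet no application was rebuilt, retested, or redeployed to get there.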

You can cobble this solution together using a patchwork of technologies already on the market, but it will be time-intensive and cost-prohibitive. A more practical approach is to leverage an SDK and a set of APIs for a solution that already exists today. DevSecOps is focused on automation and agility through the right tools and processes. It should include removing the need for programmers to understand, code, test, and maintain data protection in silos. Don’t make your software engineers reinvent policy and data handling rules every single time, for every single application. Free them to focus on the features and functionality that are core to your business.

Machina solves for the four criteria listed above that I consider necessary for addressing data protection while remaining agile, minimizing risk and compliance overhead, and maximizing runtime visibility. Its external decision engine reasons over the attributes of the data, the identities of the data consumers, and the context of the request itself. Every data access attempt – and the resulting policy decision – is logged for a fully-auditable chain. Machina helps make security an integral part of both Dev and Ops disciplines.

  1. DeMartine, Amy and Trevor Lyness with Stephanie Balaouras, John R. Rymer, Kate Pesa, and Peggy Dostie. “Show, Don’t Tell, Your Developers How To Write Secure Code.” Forrester Research, Inc., April 19, 2019.
  2. Marr, Bernard. “How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read.” Forbes, May 21, 2018.
  3. Hern, Alex. “Cambridge Analytica: how did it turn clicks into votes?” The Guardian, May 6, 2018.
  4. Nakashima, Ellen. “Hacks of OPM databases compromised 22.1 million people, federal authorities say.” The Washington Post, July 9, 2015.