Note: This article is a copy of one I have written for the compliance.engineering site. Check it out!
The organizational dilemma of understanding systems
Every day decisions are made, from the low-risk choice of how you take your coffee or tea, to high-risk decisions about organizational strategy or accepting the implications of proposed contractual terms with a customer or partner. Every one of these decisions requires supporting information, and making the right call requires a baseline level of accuracy and exposure to the relevant information in a holistic snapshot. Without that exposure to the relevant details, and an accurate understanding of what those details mean, the consequences can harm your business in a significant way.
Organizational management of IT systems and their underlying processes is no different; it simply sits on the mid-to-upper end of risk management for a large organization. With contractual commitments and regulatory requirements (e.g., SOX, PCI, FedRAMP), it is critical that organizational leadership is accurately informed about the state of the digital estate. Once external auditors are involved, the risk that previously undiscovered deficiencies come to light by accident only grows. Because of this, the organization needs a clear, accurate, and concise understanding of all of its systems. Without it, the organization runs an undue risk of damage from accidental data leakage, data loss through intrusion, accidental outages, or exposure and audit failure driven by external findings.
The current landscape
So, how do organizations manage these systems? The answer isn’t ideal: approaches are fragmented and vary, often in materially significant ways, between organizations. A smaller organization might have a solid understanding of its digital estate and configurations, while a much larger enterprise might carry significant technical debt and a loss of understanding of its assets after years of team turnover. More mature organizations will often have a process for reviewing operational systems and processes, along with a production readiness assessment workflow and ongoing reviews to ensure compliance with the appropriate business standards, especially around security and privacy. Even in these currently-ideal scenarios, however, the methods of discovery, documentation, presentation of details, and sharing of maturity scores and findings vary widely, both in effectiveness and in interoperability between teams and corporate entities. Often these assessment processes exist but don’t operate in a meaningfully effective way. In this worst-case scenario they provide false confidence in the resilience, security, and risk posture of an organization’s systems.
Without a solid grasp of a system’s architecture and design, implementation details, operational practices, and risk mitigation strategies, an organization has a significant gap in truly understanding its security posture and appropriately managing the associated risk. A breakdown in foundational knowledge can start small, such as a lapse in patching nodes in line with your business practices and commitments. The issue compounds when, for lack of accurate knowledge, your documentation of your internet-exposed footprint fails to note that those nodes are reachable from the internet. The gap then grows organically when your annual penetration test skips those IPs/FQDNs because nobody realizes they are in scope. This predictable scenario can eventually lead to a successful, malicious compromise of your systems, all stemming from a fundamental lack of understanding of the system’s architecture, configuration, and associated security details.
Even setting aside a security breach and the resulting loss of data (and trust), risks also exist on the corporate side. With the declarations made during various compliance endeavors, as well as those enshrined in corporate contracts, there is a risk that what is declared differs from the actual state of an organization’s systems. As many audit reports state, the assessment is based solely on the information provided by the organization being assessed; if that information is found to be lacking, the organization faces a risk of litigation. Even before that point, an auditor’s questions about your controls can uncover aspects of the system that the organization had previously overlooked. While less damaging, this scenario is still an embarrassment and can lead to a qualified opinion or an audit failure, with the associated corporate fallout once the findings are released.
Discovery, Documentation, and Dissemination
So, what can we do to improve this position, not just at one organization, but at a national or global level? The answer is likely multi-dimensional, but I believe it hinges on several key actions. First, we must be serious about defining the scope of our systems and accept that it’s likely larger than we’d like to admit once underlying dependencies are counted. Second, we must have a consistent approach to defining what makes up a system. Third, we need a consistent method to gather information about the system and store it in a machine-readable format that other parties can share and consume. Lastly, we must enable development, operational changes, and break-fix work with automated methods that ensure changes don’t compromise the desired state of our systems. Let’s look into each of these in more detail.
Scope is important, even critical, in determining whether we have cast a wide-enough net around what we are trying to define, describe, and ultimately protect. There are many ways to describe a system, from the code used, to physical or virtual machines, to cloud accounts; the challenge becomes articulating the difference between a portfolio, a product, a component, and even a feature, not to mention supporting infrastructure and services. To compound this, organizations have an internal drive to keep the inspected/managed scope as small as possible for audit purposes, since a reduced scope improves the odds of a successful audit. While this is a subject that could consume hours of discussion on its own, I believe the fundamental, defining factor of a system is connectivity, and all scoping should use connectivity as the definitive building block for description and scoping. To make this happen in my previous role, we started by defining all ports and protocols across every component of the system being assessed. From this information, architecture diagrams were generated automatically, ensuring that our diagrams matched the reality of what was actually in play. Almost without fail, this step eliminated over-simplification of the system and let us easily identify components that would otherwise be overlooked or go unmentioned by the owning team.
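As a minimal, hypothetical sketch of that connectivity-first approach (the component names, ports, and flows below are invented for illustration, and the diagram is emitted as plain Graphviz DOT text rather than through any particular tooling we used):

```python
# A hypothetical sketch of connectivity-first scoping: each entry declares a
# flow (source, destination, port, protocol), and a Graphviz DOT diagram is
# generated directly from that data, so the picture cannot drift from the
# declared reality.
from collections import namedtuple

Flow = namedtuple("Flow", ["source", "destination", "port", "protocol"])

# Invented components and flows, for illustration only.
flows = [
    Flow("load-balancer", "web-app", 443, "TCP"),
    Flow("web-app", "customer-db", 5432, "TCP"),
    Flow("web-app", "siem-collector", 6514, "TCP"),
    Flow("ci-runner", "web-app", 22, "TCP"),
]

def to_dot(flows):
    """Render the declared flows as a Graphviz DOT digraph."""
    lines = ["digraph system {"]
    for f in flows:
        lines.append(
            f'  "{f.source}" -> "{f.destination}" '
            f'[label="{f.protocol}/{f.port}"];'
        )
    lines.append("}")
    return "\n".join(lines)

print(to_dot(flows))  # pipe the output into `dot -Tpng` to produce the diagram
```

Because the diagram is generated from the declared flows rather than drawn by hand, it can only be wrong in the same places the connectivity data is wrong, which is exactly where review effort should be focused.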
A system is rarely just one device or a handful of related devices, so the second principle is to ensure our understanding of the system covers everything applicable, including supporting systems. For example, you may have a production web service that stores customer data. As part of that system, you also need to consider everything from the workstations with access to manage that web application, to the CI/CD pipeline handling deployment, to supporting security tooling such as your SIEM and NIDS devices, and even the backup site where copies of that sensitive data are stored. Without a proper understanding of the entire ecosystem, an organization quickly falls prey to issues such as logging agents being installed while the SIEM never actually receives or parses the logs, or an internet gateway owned by another team providing access to your system while slipping past your internal security processes because it isn’t managed by them. Whatever the negative scenario, every system needs to be measured holistically, not just the device(s) most of us would instinctively classify as “the system”.
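To make the idea concrete, here is a small, hypothetical sketch of a holistic inventory in which supporting components are first-class entries; the component names and fields are invented, and the single check shown (customer data without SIEM coverage) stands in for whatever checks an organization actually cares about:

```python
# A hypothetical sketch of a holistic system inventory: the "system" is more
# than the production web service, so supporting components are listed
# alongside it, and a simple check flags anything that handles customer data
# but is not yet feeding the SIEM.
inventory = [
    {"name": "web-app",           "role": "production", "handles_customer_data": True,  "logs_to_siem": True},
    {"name": "customer-db",       "role": "production", "handles_customer_data": True,  "logs_to_siem": True},
    {"name": "backup-site",       "role": "supporting", "handles_customer_data": True,  "logs_to_siem": False},
    {"name": "ci-cd-pipeline",    "role": "supporting", "handles_customer_data": False, "logs_to_siem": True},
    {"name": "admin-workstation", "role": "supporting", "handles_customer_data": False, "logs_to_siem": False},
]

# Flag the gap described above: sensitive data handled outside SIEM coverage.
gaps = [c["name"] for c in inventory
        if c["handles_customer_data"] and not c["logs_to_siem"]]

if gaps:
    print("Components with customer data but no SIEM coverage:", ", ".join(gaps))
```

The point is not this particular check but the shape of the data: once supporting components live in the same inventory as the production service, gaps like the uncovered backup site fall out of a few lines of analysis instead of an auditor’s interview.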
While many organizations have successfully implemented a holistic or partial strategy for these first two fundamentals, very few are even starting to plan how to implement their process in a way that is machine-readable and suitable for reuse across teams and enterprises. This is a new, emerging trend as cybersecurity matures toward a layered understanding, where each system is understood to rely on underlying systems. Much like a Software Bill of Materials (SBOM) shows all the layers of code that make up a system, and the Vulnerability Exploitability eXchange (VEX) shows the exploitability of those components in a standardized way, we need a consistent, additive (per-layer) way to measure the security and compliance aspects of the holistic system, with every layer accounted for. If your enterprise’s internally-developed code is compliant but relies on an underlying library or database that isn’t, then your real posture is only as strong as those deficient dependencies. Today there is no commonly used way to perform this additive analysis across components that span proprietary and open source code in a meaningful way, especially when it comes to compliance. Once that point is reached, we can truly start to realize a future of compliance as code.
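One way to picture that additive, per-layer measurement is a simple minimum taken over the dependency tree. The sketch below is purely illustrative: the component names and the numeric compliance level are invented stand-ins for whatever scoring model eventually gets standardized.

```python
# A hypothetical sketch of "additive" compliance across layers: a component's
# effective posture can be no better than the weakest posture anywhere in its
# dependency chain, much like an SBOM walks every layer of included code.
components = {
    # component: (declared compliance level, direct dependencies)
    "our-service":   (3, ["web-framework", "customer-db"]),
    "web-framework": (3, ["tls-library"]),
    "customer-db":   (2, []),   # the deficient underlying dependency
    "tls-library":   (3, []),
}

def effective_level(name, components):
    """Effective compliance is the minimum level over the whole dependency tree."""
    declared, deps = components[name]
    return min([declared] + [effective_level(d, components) for d in deps])

print(effective_level("our-service", components))  # prints 2, not 3
```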
Future Plans
This future, where a developer’s commit is automatically rejected with an explanation that the change breaks specific security or compliance requirements that must first be resolved, is the ideal: as far left in the process as possible. It is the same future in which compliance can be described programmatically in a standardized syntax (e.g., OSCAL) with consistent policy language, and organizations begin to capture the value currently lost in countless security and privacy questionnaires, reviews, and interviews. To get there, we must become consistent in how we scope, describe, and document services in a universal manner. This is the first of ten articles intended to lay out the needs, the route, and the tools for making universally-usable service details available across organizations and through the various layers of proprietary and open source solutions, bringing us toward this vision.
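As a closing illustration of what such a shift-left gate could look like, here is a deliberately small, hypothetical sketch; the requirement names, the shape of the proposed change, and the checks themselves are all invented, and in practice they might be derived from an OSCAL-based catalog and the actual diff under review.

```python
# A hypothetical sketch of a shift-left compliance gate: a pre-merge check
# evaluates the proposed change against machine-readable requirements and
# rejects the commit with an explanation, rather than waiting for an audit.
import sys

# Invented requirements: each maps a name to a check over the proposed change.
requirements = {
    "encrypt-in-transit": lambda change: change.get("tls_enabled", False),
    "no-public-db-port":  lambda change: 5432 not in change.get("public_ports", []),
}

# Invented stand-in for whatever the commit actually changes.
proposed_change = {"tls_enabled": True, "public_ports": [443, 5432]}

failures = [name for name, check in requirements.items() if not check(proposed_change)]

if failures:
    print("Commit rejected; unresolved compliance requirements:", ", ".join(failures))
    sys.exit(1)
print("All declared requirements satisfied.")
```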