During my career I’ve moved from a technical background to a more strategic position and picked up several memorable lessons along the way. One of those is especially relevant to my current role, where I have to decide between shipping a short-term fix (while still pursuing the long-term solution) and having the discipline to go the long-term route without a workaround. I’m personally biased towards the short-term fix, so it’s been a good lesson that has saved me trouble as my role becomes more strategic. Many people with a technical background are familiar with one of these two scenarios when supporting internal operational tasks.
- You have a deadline to deliver a solution, but the amount of time to have “the right” team deliver it is far beyond your project’s deadline.
- You have an open source tool that handles most of your needs. However, great business value can be added with a few tweaks to the tool’s code today.
These two scenarios represent what I have experienced as the two biggest drivers of this dilemma: customized code can provide critical business value and separate our organization from the competition. On the other edge of this double-edged sword, customized code carries a high risk of falling into mismanagement, presenting security risks, or locking you into your own forked version and losing the benefits of updates provided by the larger, original code stream. As an engineer at heart, I want to explore and identify several data points to help me (and hopefully you) better answer the question: “Should I invest in custom code, or should I justify a longer-term solution to my management and get deadlines pushed?” Let’s dive into each of these two categories to see if we can identify clear decision indicators and maybe some best practices.
Solution Gap Fill / Workaround
Gap fills are usually more prevalent in companies or departments that do not have stringent compliance requirements and enjoy more engineering freedom. This approach has great potential when properly managed and can give the business a real advantage over more rigid competitors. However, it can become very costly if neglected or not properly managed. It’s a risk/reward balance that we want to keep on the side of reward by mitigating the risk. How do we do that? Let’s start by identifying the top risks associated with this option.
- Lack of Visibility – This risk is probably the most critical since you fundamentally cannot manage what you do not have visibility into. It is important that you have well documented visibility into your entire software portfolio, and that includes the “temporary” solutions that usually last an order of magnitude longer than originally planned.
- Lack of Documentation – For some reason I’ve never really understood, engineers typically don’t like to write documentation. It’s a situation that every company struggles with to varying degrees, but one that must be owned by each and every team. For core organizational products this is usually well managed. However, for smaller, usually internal tools the situation is usually much more dismal. When you start to look in the “temporary solutions” department it usually continues to go downhill.
- Lack of Uniform Source Control – Hopefully this doesn’t affect many teams, but it’s a definite problem if your organization doesn’t have a unified source control system. Code that isn’t readily available and tracked feeds directly into our largest problem, Lack of Visibility.
- Lack of Strategic Planning – This really breaks down into several distinct but related areas.
- Lack of Compliance, Security and Legal Concerns – Engineers love to bring solutions to a problem, especially when it involves their own creation. However, how many short or even long-term solutions start with instructions like “Disable SELinux”? Our current shortage of cybersecurity specialists is in some ways analogous to a shortage of band-aids for a person who keeps shooting themselves in the foot. Don’t get me wrong here, we need cybersecurity specialists, but I’d wager that the majority of our industry problems in this area are due to poorly planned projects that didn’t consider security or compliance in their design, architecture, and/or operations. There’s a whole topic that could be written about here with many things that have to be considered, but for now, you get the gist of it.
- Lack of 2 Owners – Smaller projects have great potential, but what happens when the one engineer who wrote the whole thing leaves the organization?
- Lack of Sustainability Planning – This one is squarely on the shoulders of the engineers who write the code and self-enforced by necessity with the Own/Operate model in the SRE style of management. You should assume that you won’t get to touch your code again for a long time due to other, higher priorities and plan accordingly with your initial release.
- Lack of Support – This is really a consequence of the lack of sustainability planning as well as the single owner problem.
- Lack of Monitoring – Especially with small projects, there is a tendency for the engineering staff to not actually build out what is necessary for the management and lifecycle of our solutions. I would classify telemetry in the same category as monitoring; without it you are flying blind around usage and don’t know whether the product is still being used or how critical it has become while you weren’t looking.
- Lack of Long-Term Plan – This is hard, because it feels like we are solving the problem twice. After all, I’ve already solved the problem, right? But alas, now we need to work with the proper team to define requirements, demonstrate the value provided by the temporary solution and, most importantly, ensure that traction happens. Let’s be honest, if you have a working temporary solution then the long-term solution is going to be a low priority. Again, plan accordingly and expect your solution to last longer than you ever imagined.
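Several of the risks above (visibility, documentation, source control, two owners, a long-term plan) can be checked mechanically once every “temporary” solution is registered somewhere. Here is a minimal Python sketch of that idea. The `Workaround` record, its field names and the sample entry are all hypothetical, made up for illustration rather than taken from any real inventory tool.

```python
from dataclasses import dataclass, field


@dataclass
class Workaround:
    """One entry in a hypothetical inventory of "temporary" solutions."""
    name: str
    owners: list[str] = field(default_factory=list)
    repo_url: str = ""        # where the code lives in source control
    docs_url: str = ""        # where the documentation lives
    long_term_plan: str = ""  # ticket or doc describing the replacement


def find_gaps(entry: Workaround) -> list[str]:
    """Return the risks from the list above that this entry still carries."""
    gaps = []
    if len(set(entry.owners)) < 2:
        gaps.append("fewer than two owners")
    if not entry.repo_url:
        gaps.append("not in source control")
    if not entry.docs_url:
        gaps.append("no documentation")
    if not entry.long_term_plan:
        gaps.append("no long-term plan")
    return gaps


if __name__ == "__main__":
    # Illustrative entry only; the name and URL are placeholders.
    entry = Workaround(
        name="report-exporter",
        owners=["alice"],
        repo_url="https://git.example.com/ops/report-exporter",
    )
    for gap in find_gaps(entry):
        print(f"{entry.name}: {gap}")
```

The point is less the specific fields than having one place where a workaround without a second owner or a replacement plan shows up as a visible gap.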
This second area is unique to the world of open source software and usually takes the form of forking a project and beginning to add your own needed features. It’s quite attractive at first glance to take what someone else has built and tweak it for our purposes. However, there are some problems with this approach that we need to discuss to figure out if they can be mitigated.
- Lack of Upstream Merges – It’s easy to bolt on features for your own use. However, we need to make sure that our changes to the original code can be, and actually are, merged back into the original project. This process is sometimes tedious and requires working with the upstream owners and finding a way to work with their release schedules and criteria. Sometimes it becomes so complex that it necessitates a long-term relationship and not just an occasional random merge request that shows up in their approval queue. Count this cost in advance or it will surprise you a year or several years down the road when you find yourself managing a business-critical snowflake with no upgrade path.
- Lack of Support – Who interacts with the upstream community? Again, if you want to add features to an open source project please do! Just make sure that you have a long-term plan to get your improvements in the upstream project. If you can’t, I would vote that you never go down this road.
- Legal and Licensing – Can you contribute to this project? Make sure that you have the appropriate business and legal sign-offs before you start this work.
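One way to notice a fork turning into a snowflake is to measure how far it has drifted from the original project. The sketch below is only an illustration: it assumes the fork is a Git repository, that the original project is configured as a remote named `upstream` with a `main` branch, and that it has been fetched recently. The function names are made up for this example.

```python
import subprocess


def parse_divergence(output: str) -> tuple[int, int]:
    """Parse the two counts printed by `git rev-list --left-right --count A...B`."""
    behind, ahead = output.split()
    return int(behind), int(ahead)


def fork_divergence(repo_dir: str, upstream_ref: str = "upstream/main") -> tuple[int, int]:
    """Return (commits only upstream has, commits only the fork has)."""
    result = subprocess.run(
        ["git", "rev-list", "--left-right", "--count", f"{upstream_ref}...HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. the upstream remote is missing
    )
    return parse_divergence(result.stdout)
```

Calling `fork_divergence(".")` inside the fork returns a pair such as `(behind, ahead)`. A steadily growing "ahead" count suggests changes that have not been merged upstream, and a growing "behind" count suggests updates the fork is not receiving.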
Now that we’ve explored this a bit, I hope that we have a good understanding of some of the associated pitfalls and how to avoid them. Let’s take that a bit further and see if we can identify some best practices if you find yourself in the business of providing short-term solutions.
- Establish Policy – Establish and document your policy for all teams to see. Hold an informative meeting around your expectations in this area. Have Slack? Perfect, create a new group and an email distro for questions.
- Provide Tooling – Most people take the path of least resistance. Because of that you’ll want to provide a toolset and instructions for their usage for the common components of these services. I would classify these at a minimum as logging, monitoring and telemetry for internal resources.
- Enable Visibility – Visibility is key. Having established and documented policies as well as centralized logging, monitoring and telemetry solutions is a great start. However, to improve on this you can add documentation around API Standards and start to build a more cohesive ecosystem.
- Target the Correct Audience – In the recent past I ended up creating a simple API layer that allowed us to easily manage dozens of PagerDuty Services, Schedules and Escalation Policies in a very non-technical end-user friendly (YAML) way and managed it via source control with a validation pipeline. From a purely technical point of view this API was a prime candidate for a containerized application like I’ve done in the past. However, in this case, the other team members who would be supporting this API were not familiar with Docker, let alone Kubernetes. To help lower the entry bar for support we built this into a native Linux service instead. It may not be the “right” way to go from a technical point of view, but it’s the “right” way to go from a support point of view. Don’t immediately go for the coolest tech around; know your team and plan accordingly.
What are the conclusions that I take away from this exercise?
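A validation pipeline like the one mentioned above can start very small. The sketch below is not the actual implementation from that project and does not reflect PagerDuty’s API. It assumes a hypothetical file layout in which a top-level `services` list holds entries with `name` and `escalation_policy` keys, and that the YAML has already been parsed into a Python dictionary (for example with PyYAML’s `yaml.safe_load`).

```python
# Hypothetical schema: every service needs these keys.
REQUIRED_KEYS = {"name", "escalation_policy"}


def validate_services(config: dict) -> list[str]:
    """Return human-readable errors for a parsed service definition file."""
    services = config.get("services")
    if not isinstance(services, list) or not services:
        return ["'services' must be a non-empty list"]

    errors = []
    seen_names = set()
    for index, service in enumerate(services):
        if not isinstance(service, dict):
            errors.append(f"services[{index}]: expected a mapping")
            continue
        # Report each missing key in a stable order.
        for key in sorted(REQUIRED_KEYS - service.keys()):
            errors.append(f"services[{index}]: missing '{key}'")
        # Service names must be unique across the file.
        name = service.get("name")
        if isinstance(name, str):
            if name in seen_names:
                errors.append(f"services[{index}]: duplicate name '{name}'")
            seen_names.add(name)
    return errors
```

In a pipeline, an empty list means the change can proceed. Anything else can be printed and used to fail the build before the file reaches the API.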
- Mindset – Proper mindset, in my opinion, is the cornerstone required to be successfully involved in the world of customized code over the long term. If there is one item that cannot be lost on the road to success it would be this one, and it should be prevalent at an appropriate level of understanding across the whole team. Without it, regardless of the provided tooling and policies, it will be an uphill battle and one that will ultimately suffer or possibly fail.
- Enabled Tooling – Policy is great on paper, but without tooling two things are going to happen. The first is that the policies will be ignored, obviously bad; but the second is that the policies will be implemented in an inconsistent manner which is even worse because you expect consistency. If policy implementation is inconsistent then auditing for compliance becomes an enormous administrative overhead and sometimes even impossible at scale.
- Clear and Lightweight Policy – I work with a lot of attorneys, and while they serve a valuable purpose and add a lot of value, internally policy text can quickly become too much and end up as words on a page that mean nothing to the reader. Strive to make your policies as readable as possible, and make sure each one contains a couple of components.
- Policy Text – Clearly articulated policy including the reasons for the policy, the requirements to be met and a contact who is authoritative on the text.
- Audit Mechanism – While it can be incredibly difficult, I am a firm believer that no policy should be written that doesn’t have an associated audit mechanism. Ideally, an automated audit mechanism, but that’s not required in my mind. Do you have a policy to have an email alias created? Great, just set up a Jenkins job that checks that it’s valid every day. Sometimes an elegant solution isn’t required, and yet it will potentially save you a lot of trouble down the road. The best example I have of this is from years ago when my manager had us walk the datacenter rows and audit every server for an amber light on a weekly basis. His logic? We might have missed a critical alert, and that 5 minute secondary audit might save us a data loss incident. He was right, and the principle still applies today.
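An inelegant daily audit of the kind described above can be a short script that a scheduler runs. The sketch below shows one possible shape, assuming each policy can be expressed as a named check that returns true or false. The `run_audits` helper is a name invented for this example, and the actual checks (such as confirming an email alias exists) would be supplied by whoever owns the policy.

```python
from typing import Callable

# Each audit is a policy name plus a zero-argument check that
# returns True when the policy is currently being met.
Audit = tuple[str, Callable[[], bool]]


def run_audits(audits: list[Audit]) -> list[str]:
    """Run every audit and return the names of the ones that failed."""
    failures = []
    for name, check in audits:
        try:
            passed = bool(check())
        except Exception:
            passed = False  # a check that crashes is treated as a failed audit
        print(f"{'PASS' if passed else 'FAIL'}  {name}")
        if not passed:
            failures.append(name)
    return failures
```

A scheduled job, for example in Jenkins, could call `run_audits` and exit with a non-zero status when the returned list is not empty, so that a failed audit shows up as a failed job.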