This article is about "the death [of productivity] by a thousand cuts" - about the many obstacles that make enterprise development unnecessarily slow, costly, and painful. And it is about the "invisible cost" of ignoring them. I look at the top obstacles that we encounter and at what we could do about them. I argue that we must prioritize great developer experience and invest in our tools and in simplicity - and that this will yield benefits both to developers and to the business. Those in power need to realize that developer happiness isn't about perks, huge monitors, and relaxation pods (though those would be cool!). It is, to a large extent, about our ability to do our job without hindrances and thus it is about delivering more value, faster.
Writing software can sometimes be extremely unproductive and frustrating. Last week I spent a few hours coding - and wasted ten times as much due to various technical problems. The VPN did not want to connect. Npm was failing with weird socket exceptions (due to left-over HTTP proxy config). IntelliJ could not compile - and thus properly navigate - the codebase, and when I told it to fall back onto Gradle, it failed again, this time due to missing proxy configuration (because it was oblivious to the environment variables I had set to solve the issue in the terminal). Login stopped working and test data was messed up. With a deadline looming, this all was really frustrating. But it does not need to be that way.
I focus here on my company and its particular challenges and possible solutions but I am sure that many others face similar obstacles and the general problem and the conclusions apply to them equally well.
Bad developer experience and complexity slow development down considerably. They increase time to market and costs - even though the costs are incurred on all development and thus "invisible," like the air we breathe. Therefore you must invest in great developer experience - create a developer portal to make it easy to find APIs, encourage delightful API documentation so they can be used effectively, require helpful error messages, keep tools up to date and seek ways to speed up feedback, build test data management that makes obtaining relevant test data a breeze, grow a culture of continuous improvement (of tools and processes). Simplify aggressively to fight the rise of incidental complexity. What can you remove or streamline? Make sure shared services are always up - and have backup solutions when they are not.
Too abstract? Read on to get insight into the day of a frustrated developer and learn more about the problems and possible solutions.
It should be noted that companies, especially large ones, have many issues that hinder productivity. The technical issues I focus on here are typically secondary to those caused by lacking communication, bad prioritization, and generally suboptimal organization. Still, it is worth addressing what we can address.
I have loosely grouped the technical problems we encounter into two large categories: Lack of focus on developer experience and Complexity (of our infrastructure, primarily).
Nick Tune has written a nice article about The Importance of a Great Developer Experience. He concludes:
As a result of the great developer experience me and my colleagues enjoyed, the benefits to the business were tremendous. The speed at which new features got implemented, and importantly delivered to production for customers to benefit from, was as fast as humanly possible.
Developer experience (DX) is crucial not only for the happiness of the developers but also for the business. However, it does not just happen; you need to prioritize it and give it resources. Here we will explore some of the obstacles to a great developer experience.
Finding the data I need and the APIs I can use to achieve my goal is
painstaking detective work. It typically starts by asking more
experienced colleagues and posting in a few channels such as
#neo-maintainers, hoping that for once somebody with an answer will
pop up. It is much easier if you have been here a while and know the
people who know people. I have no idea how a newcomer would manage it.
Having posed my question, it is up to chance whether somebody is able
to point me in the best direction. For example who knows that the shared
login team has a service for sending OTP codes? I had no idea until
It is similar for data. How do I find the price to display for a product to a particular customer? (Beware: the answer might make your brain explode!) What does a particular product code mean, and how to categorize and thus correctly process the product in our particular system? (Somebody has some email or contacts in Billing we can use to try to get an answer, I believe… .)
So finding APIs and the right data is a time-consuming, error-prone process involving a number of people. We can do better!
Side note: I want to deeply thank all my awesome colleagues from Norway and Ostrava, who have always been so supportive and helped me to find and use the APIs I needed.
I imagine we could have a shared "developer portal" where all APIs would be registered, with their documentation, contact details etc., a portal with great categorization of the APIs and even better smart search so that finding APIs would be trivial.
The 7 Ingredients That Make Up a Superb Developer Center can provide some inspiration:
So, for this article we compared 10 successful public API programs to see what attributes their developer centers have in common, and distilled this research into seven main ingredients providers should prioritize when creating a developer center.
A closely related problem to finding APIs and data is understanding
them. Even if there is somebody who can point you to the Swagger
documentation of the service (which seems to be different for each one
and I never remember their locations), it is not all that helpful. Take
PUT /orders/<order-id> - the order has about a million different
attributes (in nested data structures) and it can be used to make many
different kinds of orders. But how do the attributes relate to each
other? For each kind of order, which of them must / can / mustn’t be
there, and when?
I typically start by studying how we did it on a previous project. Then I do my best to put an order together and discuss any errors with the API maintainers (who sometimes have to fall back on the maintainers of the underlying APIs). After a few days and many attempts I arrive at something we all hope might work in production.
How to solve this problem? I have asked some smart people and the answer is that Swagger / OpenAPI (and/or JSON Schema) has a much richer type system and more possibilities than we currently use. Instead of a single "order" type, we should have a dedicated sub-type for each of the different types of orders and perhaps even for subtypes within those (such as an order without any hardware / with a SIM only / with some hardware). And when we hit the limits of this, we could combine it with other solutions. We should also include in the docs realistic examples of the different kinds of things that can be sent to the API and we should leverage documentation strings on attributes to describe rules that cannot be expressed in code. And/or we could connect it to a "living documentation" based on unit tests.
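To make the sub-type idea a bit more concrete, here is a minimal sketch written as Python dicts that mirror an OpenAPI fragment. Everything in it is an assumption made for illustration: the two order kinds, the `kind` discriminator property, and all attribute names are hypothetical and not taken from our real order API. It shows `oneOf` with one sub-schema per kind of order, an example per kind, and a tiny checker that reports which required attributes are missing for the declared kind.

```python
from typing import Dict, List

# Hypothetical OpenAPI-style fragment: one sub-schema per kind of order
# instead of a single catch-all "order" type. All names are illustrative.
ORDER_SCHEMA = {
    "oneOf": [
        {"$ref": "#/components/schemas/SimOnlyOrder"},
        {"$ref": "#/components/schemas/HardwareOrder"},
    ],
    "discriminator": {"propertyName": "kind"},
}

COMPONENTS: Dict[str, dict] = {
    "SimOnlyOrder": {
        "description": "Order with a SIM card only; must not contain hardware.",
        "required": ["kind", "customerId", "simType"],
        "example": {"kind": "SimOnlyOrder", "customerId": "c-1", "simType": "esim"},
    },
    "HardwareOrder": {
        "description": "Order that ships a device; needs a delivery address.",
        "required": ["kind", "customerId", "deviceSku", "deliveryAddress"],
        "example": {
            "kind": "HardwareOrder",
            "customerId": "c-1",
            "deviceSku": "phone-123",
            "deliveryAddress": "Some Street 1",
        },
    },
}


def missing_attributes(order: dict) -> List[str]:
    """Return the required attributes the order lacks for its declared kind."""
    kind = order.get("kind")
    schema = COMPONENTS.get(kind) if isinstance(kind, str) else None
    if schema is None:
        raise ValueError(
            f"Unknown order kind {kind!r}; expected one of {sorted(COMPONENTS)}"
        )
    return [name for name in schema["required"] if name not in order]
```

With per-kind schemas like these, the documentation can state for each kind which attributes must be there, and a client can check an order before sending it, for example `missing_attributes({"kind": "HardwareOrder", "customerId": "c-1"})`.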
For inspiration, check out these 5 Examples of Excellent API Documentation (and Why We Think So) - featuring Stripe, Twilio, Dropbox & more. (Which also refers to 7 Items No API Documentation Can Live Without including 5. Human Readable Method Descriptions and 6. Requests and Examples.)
I have written before about how to create troubleshooting-friendly responses and how important it is. This applies even more so to errors from APIs and libraries - something we developers are very bad at. (My favorite scarecrow is Oracle DB. To say the least, I am not overly fond of it.) I have wasted uncountable hours due to unhelpful, if not misleading, error messages. Imagine you get the HTTP response "403 Forbidden by administrative rules". You have no idea who returned it (a load balancer? any of the number of proxies? the API itself? some framework part of it or is it an actual business error?), what it means (what rules?), and - most importantly - what to do about it. (This particular one is a built-in error returned by Nginx, which is used as a shared proxy. For years I have asked for this to be improved but it turned out to be non-trivial. Fortunately, an improvement is finally on its way.)
Unhelpful errors are one of the "invisible costs" that plague us - the developers sped up their work by not taking the time necessary to write awesome error messages (certainly under the constant pressure to deliver more features) but we all pay for it many times over when developing our solutions.
Here is a small taste:
<An API> 403 forbidden by administrative rules [no information about the origin of the error provided]
This error should have been: The API /xxx/yyy is missing from the list of APIs opened to your AWS account. Edit your configuration in <link> to add it.
<Spring MVC> Could not find acceptable representation
should have been: Could not find acceptable representation - your Accept header says you only want X or Y but the endpoint for /some/url (see SomeController.someMethod) can only produce Z. See <link> for more information and solutions.
[GraphQL error]: Message: 14 UNAVAILABLE: unavailable
should have been: The request header clientid contained an unknown value 'X'. See <link> for valid values.
<npm> A "socket" was not created for HTTP request before 30000ms
should have been: Connection to <host> timed out after 30000 ms using HTTP proxy <host:port>, as configured in
<link> for troubleshooting tips.
<docker-compose> FileNotFoundError: [Errno 2] No such file or directory
should have been: The Docker daemon is not running and you need to start it first.
We know that providing helpful error messages is a tough job but not an impossible one - just look at the famously great Elm compiler errors or the Rust compiler ones. What applies to writing user-friendly error messages applies equally to API errors. The teams and business just need to recognize and acknowledge the cost the default bad errors incur and prioritize great developer experience.
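The "should have been" messages above share a pattern: they say who raised the error, what exactly went wrong, and what to do about it. A minimal sketch of that pattern in Python follows. The class name `HelpfulError`, the service name `order-api`, and the URL are all hypothetical placeholders, not an existing library or one of our services.

```python
class HelpfulError(Exception):
    """An error that says who raised it, what went wrong, and what to do next."""

    def __init__(self, origin: str, problem: str, remedy: str, docs_url: str = ""):
        self.origin = origin
        self.problem = problem
        self.remedy = remedy
        self.docs_url = docs_url
        message = f"[{origin}] {problem} {remedy}"
        if docs_url:
            message += f" See {docs_url} for more information."
        super().__init__(message)


def check_client_id(client_id: str, known_ids: set) -> None:
    """Reject unknown client IDs with an actionable message instead of 'unavailable'."""
    if client_id not in known_ids:
        raise HelpfulError(
            origin="order-api",  # hypothetical service name
            problem=f"The request header clientid contained an unknown value {client_id!r}.",
            remedy=f"Use one of: {', '.join(sorted(known_ids))}.",
            docs_url="https://example.com/docs/client-ids",  # placeholder URL
        )
```

Making `origin` and `remedy` mandatory constructor arguments is a deliberate choice in this sketch: whoever raises the error has to think about the reader at the moment the error is written, when the context is still fresh.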
Test data is one of the major pain points and true time sinks. I can sometimes fix a problem within minutes and then spend hours trying (and regularly failing) to get test data to verify the fix. (The good thing is that it then does not matter that the application takes forever to start 😂.) For example, I changed the way we fetch information about the logged-in user and needed to check that it works properly for multi-organization administrators. Where do I get one? Creating one from scratch requires cooperation from multiple busy teams and likely some luck to find a good starting point. Another example: I needed a B2C user who would be required to pay a breakage fee if she changed her subscription now. Despite the long catalogue of various pre-generated test cases, I found none that satisfied that (having tried about 5 of the most likely ones). Can you imagine how much time we have all collectively wasted over the years here? And when we finally give up, we increase the risk and occasionally inconvenience end-users by releasing untested changes to production.
This is another of the invisible costs that, if they were made visible, would make the eyes of the management pop out and would finally persuade them to pour enough resources into fixing this problem. Once again it is nothing insurmountable - it is only a question of understanding and priorities.
(I want to thank Anders and the kind testers in Ostrava for having helped me so many times in this most laborious and occasionally futile task. I admire their strength and patience in dealing with this <censored>.)
An elephant in the room (or perhaps rather a dragon, in this case) is data quality and the lack thereof. I can only scratch the surface because I cannot remember the really juicy examples, but I am sure there are many. It is a problem acknowledged by many, but nobody is willing to take up the enormous battle and improve the lives of us all. That is likely partly because everything hangs together, in this particular case with the monstrosity of the underlying legacy system for managing everything, which I will call F in this article.
People use the limited fields in F to record necessary additional information that they were never meant to hold, they do not follow procedures when changing organizational structures, they make mistakes when entering data, … . Just to be clear, I am not blaming the employees - it is always the system, here both the IT system F and the incentive and management system they are subject to, that is to blame.
Here are two examples of bad data I recently encountered. One is when we get monthly usage data for a subscription, which sometimes contains charges for previously unknown "services". That breaks our billing process, because we do not know into which category to place them and thus how to deal with them. The code of the "service" tells us nothing (essentially everything is a "service") and there is not much more information anywhere. It is very similar with the "additional services" you can activate on a subscription - they all look very similar in the data even though there are clearly very different types of these. We are forced to hard-code mappings from their IDs to some actually meaningful categories. That takes a lot of time to figure out - and is a nightmare to maintain. Not to mention that different applications have to repeat this process.
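For readers who have not seen such a mapping, here is a minimal sketch of what it tends to look like, and of one way to at least surface unknown codes instead of breaking on the first one. The service codes and categories are made up for illustration; they are not real codes from F.

```python
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class ChargeCategory(Enum):
    ROAMING = "roaming"
    INSURANCE = "insurance"
    CONTENT = "content"


# Hand-maintained mapping from opaque service codes to meaningful categories.
# The codes are hypothetical; every application ends up keeping its own copy.
SERVICE_CATEGORIES: Dict[str, ChargeCategory] = {
    "SVC-1001": ChargeCategory.ROAMING,
    "SVC-2002": ChargeCategory.INSURANCE,
}


def categorize(codes: Iterable[str]) -> Tuple[Dict[str, ChargeCategory], List[str]]:
    """Split codes into known ones (with their category) and unknown ones."""
    known: Dict[str, ChargeCategory] = {}
    unknown: List[str] = []
    for code in codes:
        if code in SERVICE_CATEGORIES:
            known[code] = SERVICE_CATEGORIES[code]
        else:
            unknown.append(code)
    return known, unknown
```

Collecting the unknown codes, rather than raising on the first one, lets a billing run report everything that needs a human decision in one go. It does not fix the underlying problem that the source data carries no usable category.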
The solution is again realizing how much this actually costs us, nominating a chief data quality officer, and giving her the power and resources necessary to do something about it.
The shorter the feedback loop between making a change and seeing its effect, the more productive you are. Waiting kills the flow. I have written about slow feedback and productivity before. The key point is that every second (and minute!) counts and that our languages, frameworks, and tools are generally not optimized to provide us with quick feedback on our code changes. We are limited by the choice of our technological stack but nevertheless there is always room for improvement (or worsening).
There is an overwhelming amount of good advice, practices, tools, and processes that you should use to improve. It is very hard to know what to do. My research has shown that there are a number of key developer feedback loops. I recommend focusing on optimizing these loops, making them fast and simple. Measure the length of the feedback loop, the constraints, and the resulting outcome.
— Tim Cochran Maximizing Developer Effectiveness
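Measuring a feedback loop does not need to be sophisticated. The following is a minimal sketch, assuming the loop can be triggered as a single command (a unit test run, a build, and so on). The function name `time_command` is my own and the command shown is only a stand-in.

```python
import statistics
import subprocess
import sys
import time
from typing import List


def time_command(command: List[str], runs: int = 3) -> List[float]:
    """Run the command `runs` times and return each run's wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True makes a failing command raise instead of being timed silently
        subprocess.run(command, check=True, capture_output=True)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Stand-in command; replace it with your own loop, e.g. a unit test run.
    durations = time_command([sys.executable, "-c", "pass"])
    print(f"median: {statistics.median(durations):.2f}s, worst: {max(durations):.2f}s")
```

Tracking the median and the worst case over time is enough to notice when a loop gets slower, and to show whether a tool upgrade actually helped.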
We do not invest enough into finding the best tools and keeping up to date. It is understandable that we stick with what works and focus rather on delivering features. We use npm because it is a standard even though pnpm is much faster. We use old Gradle 5 because it is too much work to upgrade, forgoing any speed and dev support improvements the latest version(s) might bring. (At least I hope it has some improvements because v5 is <censored>.) Fortunately there are exceptions, such as the recent switch from Webpack to swc.
We need to be aware of the cost of using suboptimal / old tools and invest more in this.
The second category of obstacles is unnecessary complexity. Complexity is the true killer of productivity. It increases with the increasing number of parts and of connections between them. As complexity grows linearly, the cost of working with(in) the system grows exponentially.
We will look at some elements that add complexity and thus delays to a developer’s everyday life.
I cannot even imagine how many days (weeks?) I have wasted struggling
with the corporate HTTP proxy - trying to understand weird network
errors (such as npm’s "A 'socket' was not created") when I forgot to
switch it off when off VPN or on when on VPN, or even figuring out that
network was the problem, trying to persuade millions of different
programs to use it, each having of course its unique way(s) of
configuring it, getting right the list of domains to exclude from
proxying, … . CLI programs have their env variables (some lower-case,
some upper-case, some both), git has the
http[s].proxy config, npm has
.npmrc, Java has its system properties (good luck setting those for
Gradle so that it works both from CLI and IntelliJ, both on and off
VPN), Docker has
~/.docker/config.json, Node.js has no built-in
support and you need to add and initialize something like
global-agent with its own
unique env. variable, ssh needs to be
tunneled, and some things
simply will not work. It is a tangled mess of complexity.
(To be fair: some colleagues claim that it got better over the last years. But I suspect they have simply learned the correct configuration for the tools they use and perhaps learned to work mostly just on or off VPN thus avoiding the risk of switching and screwing things up.)
I understand that forcing all outgoing traffic through a proxy is important for security. But nowhere else have I experienced such problems. And I am sure there must be a better way than being forced to configure a number of applications in a number of different ways. We should look for better, simpler ways. Perhaps we should simply only configure Docker to use the proxy and run everything inside Docker images? Or perhaps there is some kind of virtual network interface that both connects to the company VPN and makes sure that all requests pass through the proxy. Perhaps a part of the solution is to realize that developers are not standard workers and to establish separate security profiles for them - possibly acknowledging that for them the cost of the proxy outweighs the benefits. I do not have a solution - but I am sure we should search for one.
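As a stop-gap for the CLI tools that do read environment variables, the switching can at least be done in one place. The sketch below is only that, a sketch: the function name `proxy_environment` and the proxy address are hypothetical, and it does nothing for tools with their own configuration, such as Java system properties or Docker's config file.

```python
import os
import subprocess
from typing import Dict, Iterable, Mapping, Optional

PROXY_VARS = ("http_proxy", "https_proxy")


def proxy_environment(
    base: Mapping[str, str],
    proxy_url: Optional[str],
    no_proxy: Iterable[str] = (),
) -> Dict[str, str]:
    """Return a copy of `base` with proxy variables set (on VPN) or removed (off VPN).

    Both lower- and upper-case spellings are handled, because CLI tools
    disagree on which one they read.
    """
    env = dict(base)
    # Start from a clean slate so that stale values never leak through.
    for name in (*PROXY_VARS, "no_proxy"):
        env.pop(name, None)
        env.pop(name.upper(), None)
    if proxy_url:
        for name in PROXY_VARS:
            env[name] = env[name.upper()] = proxy_url
        excluded = ",".join(no_proxy)
        if excluded:
            env["no_proxy"] = env["NO_PROXY"] = excluded
    return env


if __name__ == "__main__":
    # Hypothetical proxy address and exclusion list; pass None as the URL when off VPN.
    env = proxy_environment(os.environ, "http://proxy.example.com:8080", ["localhost"])
    subprocess.run(["env"], env=env, check=False)
```

Removing the variables explicitly when off VPN matters as much as setting them when on it: a left-over value is exactly what produces the confusing socket errors described above.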
Having a shared end-user login across all our applications, one based on the OIDC standard, is awesome. It means I mostly do not need to bother with user management, login, and logout, other than adding a few standardized components/libraries to integrate with this. But when login stops working, due to bugs, data integration issues, or backwards-incompatible changes, a lot of development grinds to a halt.
Another disadvantage is that the creation of test users needs to involve even more systems than before, including an external system for BankIDs. It is a real pain, and another obstacle slowing down and occasionally blocking our testing efforts.
I am not sure what to do about this. Perhaps the login system in the test environment should be treated as a critical component, with an extreme focus on stability and smooth transitions between versions. Perhaps we should have a backup solution we can use when there are data issues, such as a number of hardcoded test users that can always log in or something.
I love the ability to spin up, control, and scale any infrastructure my application needs in essentially no time in the cloud, and AWS offers many great services. For better control and traceability we implement Infrastructure as Code with the help of Terraform. However AWS is an incarnation of complexity - to get anything usable, you need to create a number of resources and ensure correct "connections" between them. You need roles, policies, role policy attachments, load balancers with listeners with rules, … . Often you only guess at what permissions a role must have to work properly. Terraform does not help with this. There are higher-level modules but typically they bring in other modules you might not want, are too opinionated, and when your needs diverge from what they offer, your only option is to clone them or start from scratch. And then there are the issues with incompatible versions of Terraform and AWS between your code and the module(s). I have written about my issues with Terraform before, so I won’t repeat myself here.
What are the consequences? One is that you need an expert to design and maintain your infrastructure, and few team members are willing to put in the necessary time to become one at the expense of their normal work. Another one is that, in the absence of deep expertise, you end up with a suboptimal and perhaps overpriced infrastructure. (I am certainly guilty of this.) Finally, you waste inordinate amounts of time getting things to work (and then again when you need to change them). That might be acceptable for infrastructure that you set up once and only change infrequently. (Though I will never get back the days I put into CodeBuild and CodePipeline. GitHub Actions, where were you then?!) But the low-levelness and the associated cost also plague more dynamic services, such as our frontend-for-backend platform based on Lambdas, Step Functions, and AppSync. I won’t bother you with the details (perhaps because it is too painful to think about them? :-)) but it requires you to understand and get right too many pieces. (Don’t get me wrong - the team behind it is doing an awesome job and they are constantly improving things. But you can only get so far, if all you have are many tiny pieces that you must manually put together.)
Is there something with a better balance between flexibility and
complexity? Perhaps another cloud provider with likely fewer features
but greater focus on developer-friendliness? Perhaps Cloud Development
Kit can ease some of the pain, providing
high-level abstractions (I love its convenience functions such as
bucket.grantReadWrite(user|role|…)), while making it possible to drop
down to gain more control without having to throw everything out. (I.e.
rather the Fulcro
than the Terraform one.)
To demonstrate that it is possible, with little effort, to improve productivity, I want to tell you about our customer support app.
The app is created with the full-stack web framework Fulcro, which means that the frontend is in ClojureScript and the backend in Clojure and both use a few well-integrated, great tools optimized for an awesome developer experience. I can change either and get immediate feedback without waiting for any restart, and I can interact with either from the REPL for any exploratory activities. I use a machine-to-machine token for interacting with APIs and thus do not need the end-user’s token and therefore do not require login locally. Most problems with end-users and the shared login solution thus do not impact me. I also cache all remote calls to files so that I can - to a large extent - work offline or when something in the backend is down.
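The file cache is the easiest of these ideas to carry over to another stack. The app itself is written in Clojure, so the following Python sketch is not its actual implementation, only an illustration of the idea. The function name `cached_call` and the cache key format are my own assumptions.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable


def cached_call(cache_dir: Path, key: str, fetch: Callable[[], Any]) -> Any:
    """Return the cached JSON result for `key` if present; otherwise call `fetch` and store its result."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the key so that URLs and query strings become safe file names.
    file_name = hashlib.sha256(key.encode("utf-8")).hexdigest() + ".json"
    cache_file = cache_dir / file_name
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = fetch()  # the real remote call; its result must be JSON-serializable
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

A call such as `cached_call(Path(".dev-cache"), "GET /customers/42", lambda: fetch_customer(42))`, where `fetch_customer` stands for whatever client function you already have, hits the backend once and reads from disk afterwards. Such a cache belongs in local development only, and deleting the cache directory is the way to refresh the data.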
To avoid painting a way too one-sided picture, I have to say that there are also many things that work very well. We have come very far with DevOps. Most teams have full control over their infrastructure and release changes at will. The teams I have seen are quite self-organized and people are very helpful to each other. Some teams have very good automated test coverage. We already have some tools that enable us to test what would otherwise be impossible.
There is a lot of waste in our current development "environment". Its causes are well known but the costs are invisible to the decision makers because they are spread over all of our development and not easily attributed to a particular source. All hope is not lost - there is a lot we can do to improve both the developer experience (and thus happiness and retention) and productivity, resulting in a faster time to market and overall savings. We just need to acknowledge the current situation and prioritize its improvement. The lean movement brought us the idea of flow and of focusing on detecting and removing the top obstacle. I think that a lot of what it teaches can help us here. The improvement kata can be a useful tool for going about it.
Those in power need to realize that developer happiness isn’t about perks, huge monitors, and relaxation pods (though those would be cool!). It is, to a large extent, about our ability to do our job without hindrances and thus it is about delivering more value, faster. Something the business would certainly love. They need to look into the eyes of the "dark matter" that sucks up so much of our time and put money into getting rid of it so that we can all end up happier. And we, the developers, need to get better at surfacing these hidden costs and voicing our issues, instead of accepting the way things are as an unavoidable consequence of working in a large enterprise. We must help the management help us - and thus all of us.