Bending The Serverless Spoon

“Do not try and bend the spoon. That’s impossible. Instead, only try to realize the truth… there is no spoon. Then you’ll see that it is not the spoon that bends, it is only yourself.” — The Matrix

[tl;dr To change the world around them, organizations should change themselves by adopting serverless + agile as a target. IT organizations should embrace serverless to optimize and automate IT workflows and processes before introducing it for critical business applications.]

“Serverless” is the latest shiny new thing to come on the architectural scene. An excellent (opinionated) analysis on what ‘serverless’ means has been written by Jeremy Daly, a serverless evangelist – the basic conclusion being that ‘serverless’ is ultimately a methodology/culture/mindset.

If we accept that as a reasonable definition, how does this influence how we think about solution design and engineering, given that generations of computer engineers have grown up with servers front-and-center of design thinking?

In other words, how do we bend our way of thinking of a problem space to serverless-first, and use that understanding to help make better architectural decisions – especially with respect to virtual machines, containers, and orchestration, and distributed systems in general?

Worked Example

To provide some insight into the practicalities of building and running a serverless application, I used a worked example, “Building a Serverless App Using Athena and AWS Lambda” by Epsagon, a serverless monitoring specialist. This uses the open-source Serverless framework to simplify the creation of serverless infrastructure on a given cloud provider. This example uses AWS.

Note to those attempting to follow this exercise: not all the required code was provided in the version I used, so the tutorial does require some (JavaScript) coding skills to fill the gaps. The code that worked for me (with copious logging…) can be found here.

This worked example focuses on two reference-data oriented architectural patterns:

  • The transactional creation via a RESTful API of a uniquely identifiable ‘product’ with an ad-hoc set of attributes, including but not limited to ‘ProductId’, ‘Name’ and ‘Color’.
  • The ability to query all ‘products’ which share specific attributes – in this case, a shared name.

In addition, the ability to create/initialize shared state (in the form of a virtual database table) is also handled.

Problem-domain Non-Functional Characteristics

Conceptually, the architecture has the following elements:

  • Public, anonymous RESTful APIs for product creation and query
    • APIs could be defined in OpenAPI 3.0, but by default are created implicitly from the framework configuration
  • Durable storage of product data information
    • Variable storage cost structure based on access frequency can be added through configuration
    • Long-term archiving/backup obligations can be met without using any other services.
  • Very low data management overhead
  • Highly resilient and available infrastructure
    • Additional multi-regional resilience can be added via Application Load Balancer and deploying Lambda functions to multiple regions
    • S3 and Athena are globally resilient
  • Scalable architecture
    • No fixed constraint on number of records that can be stored
    • No fixed constraint on number of concurrent users using the APIs (configurable)
    • No fixed constraint on the number of concurrent users querying Athena (configurable)
  • No servers to maintain (no networks, servers, operating systems, software, etc)
  • Costs based on utilization
    • If nobody is updating or querying the database, then no infrastructure is being used and no charges (beyond storage) are incurred
  • Secure through AWS IAM permissioning and S3 encryption.
    • Many more security authentication, authorization and encryption options available via API Gateway, AWS Lambda, AWS S3, and AWS Athena.
  • Comprehensive log monitoring via CloudWatch, with ability to add alerts, etc.
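Several of the characteristics above (the default API creation, and the configurable concurrency limits) are expressed declaratively in the Serverless framework’s serverless.yml. Below is a minimal sketch; the service, handler and bucket names are illustrative assumptions, not taken from the Epsagon tutorial:

```yaml
service: products-api              # hypothetical service name

provider:
  name: aws
  runtime: nodejs18.x
  environment:
    PRODUCTS_BUCKET: products-data-bucket   # assumed S3 bucket for product records

functions:
  createProduct:
    handler: handler.createProduct
    reservedConcurrency: 100       # caps concurrent executions (the 'configurable' limit above)
    events:
      - http:                      # API Gateway endpoint created by default
          path: products
          method: post
  queryProducts:
    handler: handler.queryProducts
    events:
      - http:
          path: products
          method: get
```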

For a couple of days coding, that’s a lot of non-functional goodness… and overall the development experience was pretty good (albeit not CI/CD optimized: I used Microsoft’s Visual Studio Code IDE on a MacBook, and the locally installed Serverless framework to deploy). Of course, I needed to be online and connected to AWS, but this seemed like a minor issue (for this small app). I did not attempt to deploy any serverless mock services locally.

So, even for a slightly contrived use case like the above, there are clear benefits to using serverless.

Why bend the spoon?

There are a number of factors that typically need to be taken into consideration when designing solutions, and which tend to drive architectures away from ‘serverless’ towards ‘serverful’. Typically, these revolve around resource management (i.e., network, compute, storage) and state management (i.e., transactional state changes).

The fundamental issue that application architects need to deal with in any solution architecture is the ‘impedance mismatch’ between general-purpose storage services and applications. Applications and application developers fundamentally want to treat all their data objects as if they are always available, in-memory (i.e., fast to access) and globally consistent, forcing engineers to optimize infrastructure to meet that need. This generally precludes using general-purpose or managed services, and results in infrastructure being tightly coupled to specific application architectures.

The simple fact is that a traditional well-written, modular 3-tier (GUI, business logic, data store) monolithic architecture will always outperform a distributed system – for the set of users and use-cases it is designed for. But these architectures are (arguably) increasingly rare in enterprises for a number of reasons, including:

  • Business processes are increasing in complexity (aka features), consisting of multiple independently evolving enterprise functions that must also be highly digitally cohesive with each other.
  • More and more business functions are being provided by third-parties that need close (digital) integration with enterprise processes and systems, but are otherwise managed independently.
  • There are many, disparate consumers of (digital) process data outputs – in some cases enabling entirely new business lines or customer services.
  • (Digital) GUI users extend well outside the corporate network, to mobile devices as well as home networks, third-party provider networks, etc.

All of the above conspire to drive even the most well-architected monolithic application towards a ‘ball-of-mud‘ architecture.

Underpinning all of this is the real motivation behind modern (cloud-native) infrastructure: in a digital age, infrastructure needs to be capable of being ‘internet scale’ – supporting all 4.3+ billion internet users, and growing.

Such scale demands serverless thinking. However, businesses that do not aspire to internet-scale usage still have key concerns:

  • Ability to cope with sudden demand spikes in b2c services (e.g., due to marketing campaigns, etc), and increased or highly variable utilisation of b2b services (e.g., due to b2b customers going digital themselves)
  • Ability to provide secure and robust services to their customers when they need them, resilient to known risks
  • Ability to continuously innovate on products and services to retain customers and remain competitive
  • Ability to comply with all regulatory obligations, including data privacy and protection, without impeding the ability to change
  • Ability to reorganize how internal capabilities are provisioned and provided with minimal impact to any of the above.

Without serverless thinking, meeting all of these sometimes conflicting needs becomes very complex, and will consume ever more enterprise IT engineering capacity.

Note: for firms to really understand where serverless should fit in their overall investment strategy, Wardley Maps are a very useful strategic planning tool.

Bending the Spoon

Bending the spoon means rethinking how we architect systems. It fundamentally means closing the gap between models and implementation, and recognizing that where an architecture is deficient, the instinctive reaction to fix or change what you control needs to be overcome: i.e., drive the change to the team (or service provider) where the issue properly belongs. This requires out-of-the-box thinking – and is perhaps a decision that should not be taken by individual teams on their own unless they really understand their service boundaries.

This approach may require teams to scale back new features, or modify roadmaps, to accommodate what can currently be appropriately delivered by the team, and accepting what cannot.

Most firms fail at this – because typically senior management focus on the top-line output and not on the coherence of the value-chain enabling it. But this is what ‘being digital’ is all about.

Everyone wants to be serverless

The reality is, the goal of all infrastructure teams is to avoid developers having to worry about infrastructure at all. So while technologies like Docker initially aimed to democratize deployment, infrastructure engineering teams are working to ensure developers never need to know how to build or manage a Docker image, configure a virtual machine, manage a network or storage device, etc. This even extends to hiding the specifics of IaaS services exposed by cloud providers.

Organizations that are evaluating Kubernetes, OpenFaaS, or Knative, or which use services such as AWS Fargate, AWS ECS, Azure Container Service, etc, ultimately want to minimize the knowledge developers need of the infrastructure they are working on.

Unfortunately for infrastructure teams, most developers still develop applications using the ‘serverful’ model – i.e., they want to know what containers are running where, how they are configured, how they interact, how they are discovered, etc. Developers also want to run containers on their own laptop whenever they can, and deploy applications to authorized environments whenever they need to.

Developers also build applications which require complex configuration which is often hand-constructed between and across environments, as performance or behavioural issues are identified and ‘patched’ (i.e., worked around instead of directing the problem to the ‘right’ team/codebase).

At the same time, developers do not want anything to do with servers…containers are as close as they want to get to infrastructure, but containers are just an abstraction of servers – they are most definitely not ‘serverless’.

To be Serverless, Be Agile

Serverless solutions are still in the early stages of maturity. For many problems that require a low-cost, resilient and always-available solution, but are not particularly performance sensitive (i.e., are naturally asynchronous and eventually consistent), serverless solutions are ideal.

In particular, IT processes (the proverbial shoes for the cobbler’s children) would benefit significantly from extensive use of serverless, as the management overhead of serverless solutions will be significantly less than that of other solutions. Integrating bespoke serverless solutions with workflows managed by tools like ServiceNow could be a significant game changer for existing IT organizations.

However, mainstream use of serverless technologies and solutions for business-critical enterprise applications is still some way off – but if IT departments develop skills in serverless now, it won’t be long before it finds its way into critical business solutions.

For broader use of serverless, firms need to be truly agile. Work needs to flow to teams as much from other, dependent teams as from top-down sources. Teams themselves need to be smaller (and ‘senior’ staff need to rethink their roles), and be prepared to split or plateau. And feature roadmaps need to be driven as much by capabilities as by imagined needs.

Conclusion

Organizations already know they need to be ‘agile’. To truly change the world (bend the spoon), serverless and agile together will enable firms to change themselves, and so shape the world around them.

Unfortunately, for many organizations it is still easier to try to bend the spoon… for those who understand they need to change, adopting the ‘serverless’ mindset is key to success, even if – at least initially – true serverless solutions remain a challenge to realize in organizations dealing with legacy (serverful) architectures.

What I realized from studying AWS Services & APIs

[tl;dr The weakest link for firms wishing to achieve business agility is principally based around the financial and physical constraints imposed by managing datacenters and infrastructure. The business goals of agile, devops and enterprise architecture are fundamentally unachievable unless these constraints can be fully abstracted through software services.]

Background

Anybody who grew up with technology in the PC generation (1985-2005) will have developed software with a fairly deep understanding of how that software worked from an OS/CPU, network, and storage perspective. Much of that generation would have had some formal education in the basics of computer science.

Initially, the PC generation did not have to worry about servers and infrastructure: software ran on PCs. As PCs became more networked, dedicated PCs running ‘server’ software needed to be connected to the desktop PCs. Folks tasked with building software to run on those servers would have to buy higher-spec PCs for the server side, install (network) operating systems, connect them to desktop PCs via LAN cables, install disk drives and databases, etc. This would all form part of the ‘waterfall’ project plan to deliver working software, and would all be rather predictable in its timeframes.

As organizations added more and more business-critical, network-based software to their portfolios, organization structures were created for datacenter management, networking, infrastructure/server management, storage and database provisioning and operation, middleware management, etc, etc. A bit like the mainframe structures that preceded the PC generation, in fact.

Introducing Agile

And so we come to Agile. While Agile was principally motivated by the flexibility HTML offered over traditional desktop GUI design – basically allowing development teams to rapidly iterate over, and improve on, different implementations of a UI – ‘Agile’ quickly became more ‘enterprise’ oriented, as planning and coordinating demand across multiple teams, both infrastructure and application development, was rapidly becoming a massive bottleneck.

It was, and is, widely recognized that these challenges are largely cultural – i.e., that if only teams understood how to collaborate and communicate, everything would be much better for everyone – all the way from the top down. And so a thriving industry exists in coaching firms how to ‘improve’ their culture – aka the ‘agile industrial machine’.

Unfortunately, it turns out there is no silver bullet: the real goal – organizational or business agility – has been elusive. Big organizations still expend vast amounts of time and resources doing small incremental change, most activity is involved in maintaining/supporting existing operations, and truly transformational activities which bring an organization’s full capabilities together for the benefit of the customer still do not succeed.

The Reality of Agile

The basic tenet behind Agile is the idea of cross-functional teams. However, it is obvious that most teams in organizations are unable to align themselves perfectly according to the demand they are receiving (i.e., the equivalent of providing a customer account manager), and even if they did, the number of participants in a typical agile ‘scrum’ or ‘scrum of scrums’ meeting would quickly exceed the consensus maximum of about 9 participants needed for a scrum to be successful.

So most agile teams resort to the only agile they know – i.e., developers, QA and maybe product owner and/or scrum-master participating in daily scrums. Every other dependency is managed as part of an overall program of work (with communication handled by a project/program manager), or through on-demand ‘tickets’ whereby teams can request a service from other teams.

The basic impact of this is that pre-planned work (resources) gets prioritized ahead of on-demand ‘tickets’ (excluding tickets relating to urgent operational issues), and so agile teams are forced to compromise the quality of their work (if they can proceed at all).

DevOps – Managing Infrastructure Dependencies

DevOps is a response to the widening communications/collaboration chasm between application development teams and infrastructure/operations teams in organizations. It recognizes that operational and infrastructural concerns are inherent characteristics of software, and software should not be designed without these concerns being first-class requirements along with product features/business requirements.

On the other hand, infrastructure/operations providers, being primarily concerned with stability, seek to offer a small number of efficient standardized services that they know they can support. Historically, infrastructure providers could only innovate and adapt as fast as hardware infrastructure could be procured, installed, supported and amortized – which is to say, innovation cycles measured in years.

In the meantime, application development teams are constantly pushing the boundaries of infrastructure – principally because most business needs can be realized in software, with sufficiently talented engineers, and those tasked with building software often assume that infrastructure can adapt as quickly.

Microservices – Managing AppDev Team to AppDev Team Dependencies

While DevOps is a response to friction in application development and infrastructure/operations engagement, microservices can usefully be seen as a response to how application development teams manage dependencies on each other.

In an ideal organization, an application development team can leverage/reuse capabilities provided by another team through their APIs, with minimum pre-planning and up-front communication. Teams would expose formal APIs with relevant documentation, and most engagement could be confined to service change requests from other teams and/or major business initiatives. Teams would not be required to test/deploy in lock-step with each other.

Such collaboration between teams would need to be formally recognized by business/product owners as part of the architecture of the platform – i.e., a degree of ‘mechanical sympathy’ is needed by those envisioning new business initiatives to know how best to leverage, and extend, software building blocks in the organization. This is best done by Product Management, who must steward the end-to-end business and data architecture of the organization or value-stream in partnership with business development and engineering.

Putting it all together

To date, most organizations have been fighting a losing battle. The desire to do agile and devops is strong, but the fundamental weakness in the chain is the ability of internal infrastructure providers and operators to move as fast as software development teams need them to – an issue as much related to financial management as it is to managing physical buildings, hardware, etc.

What cloud providers are doing is creating software-level abstractions of infrastructure services, allowing the potential of agile, devops and microservices to begin to be realized in practice.

Understanding these services and abstractions is like re-learning the basic principles of Computer Science and Engineering – but through a ‘service’ lens. The same issues need to be addressed, and the same technical challenges exist. Except now some aspects of those challenges no longer need to be solved by organizations themselves (e.g., how to efficiently abstract infrastructure services at scale), and businesses can focus on designing the infrastructure services that match the needs of application developers (rather than a compromise).

Conclusion

The AWS service catalog and APIs are an extraordinary achievement (as is similar work by other cloud providers, although they have yet to achieve the catalog breadth that AWS has). Architects need to know and understand these service abstractions, focus on matching application needs with business needs, and worry less about the traditional constraints infrastructure organizations have had to work with.

In many respects, these abstractions will vary across providers only in syntax and features. Ultimately (probably at least 10 years from now) all commodity services will converge, or be available through efficient ‘cross-plane’ solutions which abstract away the providers. That is why I am choosing to ‘go deep’ on the AWS APIs: it is, in my opinion, the most concrete starting point for helping firms achieve ‘agile’ nirvana.

Culture, Collaboration & Capabilities vs People, Process & Technology

[TL;DR The term ‘people, process and technology’ has been widely understood to represent the main dimensions impacting how organisations can differentiate themselves in a fast-changing technology-enabled world. This article argues that this expression may be misinterpreted with the best of intentions, leading to undesirable/unintended outcomes. The alternative, ‘culture, collaboration and capability’ is proposed.]

People, process & technology

When teams, functions or organisations are under-performing, the underlying issues can usually be narrowed down to one or more of the dimensions of people, process and technology.

Unfortunately, these terms can lead to an incorrect focus. Specifically,

  • ‘People’ can be understood to mean individuals who are under-performing or somehow disruptive to overall performance
  • ‘Process’ can be understood to mean formal business processes, leading to a focus on business process design
  • ‘Technology’ can be understood to mean engineering or legacy technology challenges which are resolvable only by replacing or updating existing technology

In many cases, this may in fact be the approach needed: fire the disruptive individual, redesign business processes using Six Sigma experts, or find another vendor selling technology that will solve all your engineering challenges.

In general, however, these approaches are neither practical nor desirable. Removing people can be fraught with challenges, and should only be used as a last resort. Firms using this as a way to solve problems will rapidly build up a culture of distrust and self-preservation.

Redesigning business processes using Six Sigma or other techniques may work well in very mature, well understood, highly automatable situations. However, in most dynamic business situations, no sooner has the process been optimised than it requires changing again. In addition, highly optimised processes may cause the so-called ‘local optimisation’ problem, where the sum of the optimised parts yields something far from an optimised whole.

Technology is generally not easy to replace: some technologies are significantly embedded in an organisation, with a large investment in the people and skills needed to support them. But technologies change faster than people can adapt, and business environments change faster still. So replacing technologies comes at a massive cost (and risk): replacing functionally rich existing systems with relatively immature new technology, and replacing existing people with people who may be familiar with the technology, but less so with your organisation. And what to do with the folks who have invested so much of their careers in the ‘old’ technology? (Back to the ‘people’ problem.)

A new meme

In order to effect change within a team, department or organisation, the focus on ‘people, process and technology’ needs to be adapted to ‘culture, collaboration and capabilities’. The following sections lay out what the subtle difference is, and how it could change how one approaches solving certain types of performance challenges.

Culture

When we talk about ‘people’, we are really not talking about individuals, but about cultures. The culture of a team, department or organisation has a more significant impact on how people collectively perform than anything else.

Changing culture is hard: for a ‘bad’ culture caught early enough, it may be possible to simply replace the most senior person who is (consciously or not) leading the creation of the undesirable cultural characteristics. But once a culture is established, even replacing senior leadership does not guarantee it will change.

Changing culture requires the willing participation of most of the people involved. For this to happen, people need to realise that there is a problem, and they need to be open to new cultural leadership. Then, it is mainly a case of finding the right leadership to establish the new norms and carry them forward – something which can be difficult for some senior managers to do, particularly when they have an arms-length approach to management.

Typical ‘bad’ cultures in a (technology) organisation include poor practices such as lack of testing discipline, poor collaboration with other groups focused on different concerns (such as stability, infrastructure, etc), a lack of transparency into how work is done, or even a lack of collaboration within members of the same team (i.e., a ‘hero’ based approach to development).

Changing these can be notoriously difficult, especially if the firm is highly dependent on this team and what it does.

Collaboration

Processes are, ultimately, a way to formalise how different people collaborate at different times. Processes formalise collaborations, but often collaborations happen before the process is formalised – especially in high-performance teams who are aware of their environment and are open to collaboration.

Many challenges in teams, departments or organisations can be boiled down to collaboration challenges. People not understanding how (or even if) they should be collaborating, how often, how closely, etc.

In most organisations, ‘cooperation’ is a necessity: there are many different functions, most of which depend on each other. So there is a minimum level of cooperation in order to get certain things done. But this cooperation does not necessarily extend to collaboration, which is cooperation based on trust and a deeper understanding of the reasons why a collaboration is important.

Collaboration ultimately serves to strengthen relationships and improve the overall performance of the team, department or organisation.

Collaborations can be captured formally using business process design notation (such as BPMN) but often these treat roles as machines, not people, and can lead to forgetting the underlying goal: people need to collaborate in order to meet the goals of the organisation. Process design often aims to define people’s roles so narrowly that the individuals may as well be a machine – and as technology advances, this is exactly what is happening in many cases.

People will naturally resist this; defining processes in terms of collaborations will change the perspective and result in a more sustainable and productive outcome.

Capabilities

Much has been written here about ‘capabilities’, particularly when it comes to architecture. In this article, I am narrowing my definition to anything that allows an individual (or group of individuals) to perform better than they otherwise would.

From a technology perspective, particular technologies provide developers with capabilities they would not have otherwise. These capabilities allow developers to offer value to other people who need software developed to help them do their job, and who in turn offer capabilities to other people who need those people to perform that job.

When a given capability is ‘broken’ (for example, where people do not understand a particular technology very well, so it limits their capabilities rather than expands them), the damage ripples up to everybody who depends directly or indirectly on that capability: systems become unstable, change takes a long time to implement, users of systems become frustrated and unable to do their jobs, and the clients of those users become frustrated at people being paid to do a job they cannot do.

In the worst case, this can bring a firm to its knees, unable to survive in an increasingly dynamic, fast-changing world where the weakest firms do not survive long.

Technology should *always* provide a capability: the capability to deliver value in the right hands. When it is no longer able to achieve that role in the eyes of the people who depend on it (or when the ‘right hands’ simply cannot be found), then it is time to move on quickly.

Conclusion

Many of today’s innovations in technology revolve around culture, collaboration and capabilities. An agile, disciplined culture, where collaboration between systems reflects collaboration between departments and vice-versa, and where technologies provide people with the capabilities they need to do their jobs, is what every firm strives for (or should be striving for).

For new startups, much of this is a given – this is, after all, how they differentiate themselves against more established players. For larger organisations that have been around for a while, the challenge has been, and continues to be, how to drive continuous improvement and change along these three axes, while remaining sensitive to the capacity of the firm to absorb disruption to the people and technologies that those firms have relied on to get them to the (presumably) successful state they are in today.

Get it wrong and firms could rapidly lose market share and become over-taken by their upstart competitors. Get it right, and those upstart competitors will be bought out by the newly agile established players.

The hidden costs of PaaS & microservice engineering innovation

[tl;dr The leap from monolithic application development into the world of PaaS and microservices highlights the need for consistent collaboration, disciplined development and a strong vision in order to ensure sustainable business value.]

The pace of innovation in the PaaS and microservice space is increasing rapidly. This, coupled with increasing pressure on ‘traditional’ organisations to deliver more value more quickly from IT investments, is causing a flurry of interest in PaaS enabling technologies such as Cloud Foundry (favoured by the likes of IBM and Pivotal), OpenShift (favoured by RedHat), Azure (Microsoft), Heroku (SalesForce), AWS, Google Application Engine, etc.

A key characteristic of all these PaaS solutions is that they are ‘devops’ enabled – i.e., it is possible to automate both code and infrastructure deployment, enabling the way to have highly automated operational processes for applications built on these platforms.

For large organisations, or organisations that prefer to control their infrastructure (because of, for example, regulatory constraints), PaaS solutions that can be run in a private datacenter rather than the public cloud are preferable, as this preserves a future option to deploy to external clouds if needed/appropriate.

These PaaS environments are feature-rich and aim to provide a lot of the building blocks needed to build enterprise applications. Other framework initiatives, such as Spring Boot, DropWizard and Vert.x, aim to make it easier to build PaaS-based applications.

Combined, all of these promise to provide a dramatic increase in developer productivity: the marginal cost of developing, deploying and operating a complete application will drop significantly.

Due to the low capital investment required to build new applications, it becomes ever more feasible to move from a heavy-weight, planning intensive approach to IT investment to a more agile approach where a complete application can be built, iterated and validated (or not) in the time it takes to create a traditional requirements document.

However, this also has massive implications, as – left unchecked – the drift towards entropy will increase over time, and organisations could be severely challenged to effectively manage and generate value from the sheer number of applications and services that can be created on such platforms. So an eye on managing complexity should be in place from the very beginning.

Many of the above platforms aim to make it as easy as possible for developers to get going quickly: this is a laudable goal, and if more of the complexity can be pushed into the PaaS, then that can only be good. The consequence of this approach is that developers have less control over the evolution of key aspects of the PaaS, and this could cause unexpected issues as PaaS upgrades conflict with application lifecycles, etc. In essence, it could be quite difficult to isolate applications from some PaaS changes. How these frameworks help developers cope with such changes is something to closely monitor, as these platforms are not yet mature enough to have gone through a major upgrade with a significant number of deployed applications.

The relative benefit/complexity trade-off between established microservice frameworks such as OSGi and the easier-to-use solutions described above needs to be tested in practice. Specifically, OSGi’s more robust dependency model may prove more useful in enterprise environments than in environments which have a ‘move fast and break things’ approach to application development, especially if OSGi-based PaaS solutions such as JBoss Fuse on OpenShift and Paremus Service Fabric gain more popular use.

So: all well and good from the technology side. But even if the pros and cons of the different engineering approaches are evaluated and a perfect PaaS solution emerges, that doesn’t mean Microservice Nirvana can be achieved.

A recent article on the challenges of building successful micro-service applications, coupled with a presentation by Lisa van Gelder at a recent Agile meetup in New York City, has emphasised that even given the right enabling technologies, deploying microservices is a major challenge – but if done right, the rewards are well worth it.

Specifically, there are a number of factors that impact the success of a large scale or enterprise microservice based strategy, including but not limited to:

  • Shared ownership of services
  • Setting cross-team goals
  • Performing scrum of scrums
  • Identifying swim lanes – isolating content failure & eventually consistent data
  • Provision of Circuit breakers & Timeouts (anti-fragile)
  • Service discoverability & clear ownership
  • Testing against stubs; customer driven contracts
  • Running fake transactions in production
  • SLOs and priorities
  • Shared understanding of what happens when something goes wrong
  • Focus on mean time to repair (recover) rather than mean time to failure
  • Use of common interfaces: deployment, health check, logging, monitoring
  • Tracing a user’s journey through the application
  • Collecting logs
  • Providing monitoring dashboards
  • Standardising common metric names

Some of these can be provided technically by the chosen PaaS, but much depends on best practices being consistently applied within and across development teams. In fact, it is quite hard to capture these key success factors in traditional architectural views – something that needs to be considered when architecting large-scale microservice solutions.
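To make one of the factors above concrete, the sketch below shows a minimal circuit breaker of the kind a PaaS or resilience framework might provide (the class and parameter names are illustrative, not from any particular library): after a run of consecutive failures it ‘opens’ and fails fast, only retrying the downstream service after a cool-off period.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: trips open after `max_failures`
    consecutive failures, and allows a retry after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                # Fail fast: don't hammer a struggling downstream service
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Combined with timeouts on the wrapped call, this is the basic ‘anti-fragile’ mechanism the bullet list refers to: failures are contained within a swim lane rather than cascading.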

In summary, the leap from monolithic application development into the world of PaaS and microservices highlights the need for consistent collaboration, disciplined development and a strong vision in order to ensure sustainable business value.

Scaled Agile needs Slack

[tl;dr In order to effectively scale agile, organisations need to ensure that a portion of team capacity is explicitly set aside for enterprise priorities. A re-imagined enterprise architecture capability is a key factor in enabling scaled agile success.]

What’s the problem?

From an architectural perspective, Agile methodologies are heavily dependent on business- (or function-) aligned product owners, who tend to be very focused on *their* priorities – and not the enterprise’s priorities (i.e., other functions or businesses that may benefit from the work the team is doing).

This results in very inward-focused development, and where dependencies on other parts of the organisation are identified, these (in the absence of formal architecture governance) tend to be minimised where possible, if necessary through duplicative development. And other teams requiring access to the team’s resources (e.g., databases, applications, etc) are served on a best-effort basis – often causing those teams to seek other solutions instead, or work without formal support from the team they depend on.

This, of course, leads to architectural complexity, which in turn reduces agility all round.

The Solution?

If we accept the premise that, from an architectural perspective, teams are the main consideration (it is where domain and technical knowledge resides), then the question is how to get the right demand to the right teams in as scalable and agile a manner as possible?

In agile teams, the product backlog determines their work schedule. The backlog usually has a long list of items awaiting prioritisation, and part of the Agile processes is to be constantly prioritising this backlog to ensure high-value tasks are done first.

Well-known management research such as The Mythical Man Month has influenced Agile’s goal to keep team sizes small (e.g., 5-9 people for scrum). So when new work comes in, adding people is generally not a scalable option.

So, how to reconcile the enterprise’s needs with the Agile team’s needs?

One approach would be to ensure that every team pays an ‘enterprise’ tax – i.e., in prioritising backlog items, at least, say, 20% of work-in-progress items must be for the benefit of other teams. (Needless to say, such work should be done in such a way as to preserve product architectural integrity.)

20% may seem like a lot – especially when there is so much work to be done for immediate priorities – but it cuts both ways. If *every* team allows 20% of their backlog to be for other teams, then every team has the possibility of using capacity from other teams – in effect, increasing their capacity by much more than they could do on their own. And by doing so they are helping achieve enterprise goals, reducing overall complexity and maximising reuse – resulting in a reduction in project schedule over-runs, higher quality resulting architecture, and overall reduced cost of change.
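As an illustrative sketch only (the item structure and the numbers are invented, not a prescribed tool), the 20% ‘enterprise tax’ could be applied mechanically during sprint planning by reserving a fixed share of capacity for items that benefit other teams:

```python
def plan_sprint(backlog, capacity, enterprise_share=0.2):
    """Pick the highest-value items for a sprint, reserving a share of
    capacity for items benefiting other teams ('enterprise' items).
    Each backlog item is a dict: {'name', 'value', 'enterprise': bool}.
    Sketch only: assumes enough local items to fill the remainder."""
    reserved = max(1, round(capacity * enterprise_share))
    enterprise = sorted((i for i in backlog if i["enterprise"]),
                        key=lambda i: i["value"], reverse=True)
    local = sorted((i for i in backlog if not i["enterprise"]),
                   key=lambda i: i["value"], reverse=True)
    picked = enterprise[:reserved]          # the 'enterprise tax' slots
    picked += local[:capacity - len(picked)]  # fill the rest by value
    return picked
```

The point of the sketch is that the reservation is structural, not discretionary: enterprise items compete with each other on value, but never with purely local items for the reserved slots.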

Slack does not mean Under-utilisation

The concept of ‘slack’ is well described in the book ‘Slack: Getting Past Burnout, Busywork, and the Myth of Total Efficiency’. In effect, in an Agile sense, we are talking about organisational slack, not team slack. Teams, in fact, will continue to be 100% utilised, as long as their backlog consists of more high-value items than they can deliver. The backlog owner – e.g., scrum master – can obviously embed local team slack into how a particular team’s backlog is managed.

Implications for Project & Financial Management

Project managers are used to getting funding to deliver a project, and then to be able to bring all those resources to bear to deliver that project. The problem is, that is neither agile, nor does it scale – in an enterprise environment, it leads to increasingly complex architectures, resulting in projects getting increasingly more expensive, increasingly late, or delivering increasingly poor quality.

It is difficult for a project manager to accept that 20% of their budget may actually be supporting other (as yet unknown) projects. So perhaps the solution here is to have Enterprise Architecture account for the effective allocation of that spending in an agile way? (i.e., based on how teams are prioritising and delivering those enterprise items on their backlog). An idea worth considering..

Note that the situation is a little different for planned cross-business initiatives, where product owners must actively prioritise the needs of those initiatives alongside their local needs. Such planned work does not count in the 20% enterprise allowance, but rather counts as part of how the team’s cost to the enterprise is formally funded. It may result in a temporary increase in resources on the team, but in this case discipline around ‘staff liquidity’ is required to ensure the team can still function optimally after the temporary resource boost has gone.

The challenge regarding project-oriented financial planning is that, once a project’s goals have been achieved, what’s left is the team and underlying architecture – both of which need to be managed over time. So some dissociation between transitory project goals and longer term team and architecture goals is necessary to manage complexity.

For smaller, non-strategic projects – i.e., no incoming enterprise dependencies – the technology can be maintained on a lights-on basis.

Enterprise architecture can be seen as a means to assess the relevance of a team’s work to the enterprise – i.e., managing both incoming and outgoing team dependencies. The higher the enterprise relevance of the team, the more important it is that the team is managed well over time – i.e., team structure changes must be carefully managed, and not left entirely to the discretion of individual managers.

Conclusion

By ensuring that every project that purports to be Agile has a mandatory allowance for enterprise resource requirements, teams can have confidence that there is a route for them to get their dependencies addressed through other agile teams, in a manner that is independent of annual budget planning processes or short-term individual business priorities.

The effectiveness of this approach can be governed and evaluated by Enterprise Architecture, which would then allow enterprise complexity goals to be addressed without concentrating such spending within the central EA function.

In summary, to effectively scale agile, an effective (and possibly rethought) enterprise architecture capability is needed.

Making good architectural moves

[tl;dr Every change is an opportunity to make the ‘right’ architectural move to improve complexity management and to maintain an acceptable overall cost of change.]

Accompanying every new project, business requirement or product feature is an implicit or explicit ‘architectural move’ – i.e., a change to your overall architecture that moves it from a starting state to another (possibly interim) state.

The art of good architecture is making the ‘right’ architectural moves over time. The art of enterprise architecture is being able to effectively identify and communicate what a ‘right’ move actually looks like from an enterprise perspective, rather than leaving such decisions solely to the particular implementation team – who, it must be assumed, are best qualified to identify the right move from the perspective of the relevant domain.

The challenge is the limited inputs that enterprise architects have, namely:

  • Accumulated skill/knowledge/experience from past projects, including any architectural artefacts
  • A view of the current enterprise priorities based on the portfolio of projects underway
  • A corporate strategy and (ideally) individual business strategies, including a view of the environment the enterprise operates in (e.g., regulatory, commercial, technological, etc)

From these inputs, architects need to guide the overall architecture of the enterprise to ensure every project or deliverable results in a ‘good’ move – or at least not a ‘bad’ move.

In this situation, it is difficult if not impossible to measure the actual success of an architecture capability. This is because, in many respects, the beneficiaries of a ‘good’ enterprise architecture (at least initially) are the next deliverables (projects/requirements/features), and only rarely the current deliverables.

Since the next projects to be done are generally unknown (i.e., the business situation may change between the time the current projects complete and the time the next projects start), it is rational for people to focus exclusively on delivering the current projects – which makes it even more important that the current projects are in some way delivering the ‘right’ architectural moves.

In many organisations, the typical engagement with enterprise architecture is often late in the architectural development process – i.e., at a ‘toll-gate’ or formal architectural review. And the focus is often on ‘compliance’ with enterprise standards, principles and guidelines. Given that such guidelines can be quite detailed, it can be difficult for anyone building a project start architecture (PSA) to come up with an architecture that fully complies: the first priority is to develop an architecture that will work, and that is feasible to execute within the project constraints of time, budget and resources.

Only then does it make sense to formally apply architectural constraints from enterprise architecture – at least some of which may negatively impact the time, cost, resource or feasibility needs of the project – albeit to the presumed benefit of the overall landscape. Hence the need for board-level sponsorship for such reviews, as otherwise the project’s needs will almost always trump enterprise needs.

The approach espoused by an interesting new book, Chess and the Art of Enterprise Architecture, is that enterprise architects need to focus more on design and less on principles, guidelines, roadmaps, etc. Such an approach involves enterprise architects more closely in the creation and evolution of project (start) architectures, which represents the architectural basis for all the work the project does (although it does not necessarily lay out the detailed solution architecture).

This approach is also acceptable for planning processes which are more agile than waterfall. In particular, it acknowledges that not every architectural ‘move’ is necessarily independently ‘usable’ by end users of an agile process. In fact, some user stories may require several architectural moves to fully implement. The question is whether the user story is itself validated enough to merit doing the architectural moves necessary to enable it, as otherwise those architectural moves may be premature.

The alternative, then, is to ‘prototype’ the user story, so users can evaluate it – but at the cost of non-conformance with the project architecture. This is also known as ‘technical debt’, and where teams are mature and disciplined enough to pay down technical debt when needed, it is a good approach. But users (and sometimes product owners) struggle to tell the difference between an (apparently working) prototype and a production solution that is fully architecturally compliant, and it often happens that project teams move on to the next visible deliverable without removing the technical debt.

In applications where the end-user is a person or set of persons, this may be acceptable in the short term, but where the end-user could be another application (interacting via, for example, an API invoked by either a GUI or an automated process), then such technical debt will likely cause serious problems if not addressed. At the very least, it will make future changes harder (unmanaged dependencies, lack of automated testing), and may present significant scalability challenges.

So, what exactly constitutes a ‘good’ architectural move? In general, this is what the project start architecture should aim to capture. A good basic principle could be that architectural commitments should be postponed for as long as possible, by taking steps to minimise the impact of changed architectural decisions (this is a ‘real option’ approach to architectural change management). Over the long term, this reduces the cost of change.

In addition, project start architectures may need to make explicit where architectural commitments must be made (e.g., for a specific database, PaaS or integration solution, etc) – i.e., areas where change will be expensive.
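One common way to postpone such commitments – a hypothetical sketch, not a prescribed pattern from the text – is to put a stable interface in front of the decision, so the expensive choice (here, the storage technology) can be deferred or reversed without touching calling code. All names below are invented for illustration:

```python
from abc import ABC, abstractmethod

class TradeStore(ABC):
    """Stable seam: the rest of the system depends only on this
    interface, deferring the expensive-to-reverse storage commitment."""

    @abstractmethod
    def save(self, trade_id, trade): ...

    @abstractmethod
    def load(self, trade_id): ...

class InMemoryTradeStore(TradeStore):
    """Cheap interim commitment; swapping in a database-backed
    implementation later changes no calling code."""

    def __init__(self):
        self._data = {}

    def save(self, trade_id, trade):
        self._data[trade_id] = trade

    def load(self, trade_id):
        return self._data[trade_id]

def book_trade(store: TradeStore, trade_id, trade):
    # Caller is written against the abstraction, not the commitment
    store.save(trade_id, trade)
    return store.load(trade_id)
```

The ‘option’ being kept open is precisely the area the PSA would flag as expensive to change: as long as callers only see `TradeStore`, exercising a different storage decision later costs one new implementation class.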

Other things the project start architecture may wish to capture or at least address (as part of enterprise design) could include:

  • Cataloging data semantics and usage
    • to support data governance and big data initiatives
  • Management of business context & scope (business area, product, entity, processes, etc)
    • to minimize unnecessary redundancy and process duplication
  • Controlled exposure of data and behaviour to other domains
    • to better manage dependencies and relationships with other domains
  • Compliance with enterprise policies around security and data management
    • to address operational risk
  • Automated build, test & deploy processes
    • to ensure continued agility and responsiveness to change, and maximise operational stability
  • Minimal lock-in to specific solution architectures (minimise solution architecture impact on other domains)
    • to minimize vendor lock-in and maximize solution options

The Chess book mentioned above includes a good description of a PSA, in the context of a PRINCE2 project framework. Note that the approach also works for Agile, but the PSA should set the boundaries within which the agile team can operate: if those boundaries must be crossed to deliver a user story, then enterprise design architects should be brought back into the discussion to establish the best way forward.

In summary, every change is an opportunity to make the ‘right’ architectural move to improve complexity management and to maintain an acceptable overall cost of change.

Achieving modularity: functional vs volatility decomposition

Enterprise architecture is all about managing complexity. Many EA initiatives tend to focus on managing IT complexity, but there is only so much that can be done there before it becomes obvious that IT complexity is, for the most part, a direct consequence of enterprise complexity. To recap, complexity needs to be managed in order to maintain agility – the ability for an organisation to respond (relatively) quickly and efficiently to changes in markets, regulations or innovation, and to continue to do this over time.

Enterprise complexity can be considered to be the activities performed and resources consumed by the organisation in order to deliver ‘value’, a metric usually measured through the ability to maintain (financial) income in excess of expenses over time.

Breaking down these activities and resources into appropriate partitions that allow holistic thinking and planning to occur is one of the key challenges of enterprise architecture, and there are various techniques to do this.

Top-Down Decomposition

The natural approach to decomposition is to first understand what an organisation does – i.e., what are the (business) functions that it performs. Simply put, a function is a collection of data and decision points that are closely related (e.g., ‘Payments’ is a function). Functions typically add little value in and of themselves – rather they form part of an end-to-end process that delivers value for a person or legal entity in some context. For example, a payment on its own means nothing: it is usually performed in the context of a specific exchange of value or service.

So a first course of action is to create a taxonomy (or, more accurately, an ontology) to describe the functions performed consistently across an enterprise. Then, various processes, products or services can be described as a composition of those functions.
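A hypothetical sketch of such a taxonomy (the function and process names are invented for illustration): processes are simply ordered compositions of catalogued functions, which makes gaps in the ontology easy to detect mechanically:

```python
# Illustrative function taxonomy: name -> description
FUNCTIONS = {
    "quote":   "produce a price for a product",
    "order":   "capture a customer's intent to trade",
    "payment": "exchange funds between parties",
    "settle":  "complete the exchange of value",
}

# Each process is described as a composition of catalogued functions
PROCESSES = {
    "buy_product": ["quote", "order", "payment", "settle"],
    "refund":      ["order", "payment"],
}

def validate(processes, functions):
    """Return, per process, any steps that are not in the taxonomy –
    i.e., candidate additions to the function catalogue."""
    return {name: [step for step in steps if step not in functions]
            for name, steps in processes.items()}
```

Even a toy catalogue like this makes the earlier point visible: ‘payment’ appears in multiple processes, meaning nothing on its own, but composable everywhere.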

If we accept (and this is far from accepted everywhere) that EA is focused on information systems complexity, then EA is not responsible for the complexity relating to the existence of processes, products or services. The creation or destruction of these are usually a direct consequence of business decisions. However, EA should be responsible for cataloging these, and ensuring these are incorporated into other enterprise processes (such as, for example, disaster recovery or business continuity processes). And EA should relate these to the functional taxonomy and the information systems architecture.

This can get very complex very quickly due to the sheer number of processes, products and services – including their various variations – most organisations have. So it is important to partition or decompose the complexity into manageable chunks to facilitate meaningful conversations.

Enterprise Equivalence Relations

One way to do this at the enterprise level is to group functions into partitions (aka domains) according to synergy or autonomy (as described by Roger Sessions), for all products/services supporting a particular business. This approach is based on the mathematical concept of equivalence relations. Because different functions in different contexts may have differing equivalence relationships, functions may appear in multiple partitions. One role of EA is to assess and validate whether those functions are actually autonomous, or whether there is the potential to group apparently duplicate functions into a new partition.
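Treating synergy as an equivalence relation means the partitions are simply the equivalence classes – the transitive closure of declared pairwise synergies. A small illustrative sketch (function names invented) using union-find:

```python
def partition_functions(functions, synergies):
    """Group business functions into partitions, treating declared
    pairwise synergies as an equivalence relation (transitive closure
    computed via union-find with path compression)."""
    parent = {f: f for f in functions}

    def find(f):
        while parent[f] != f:
            parent[f] = parent[parent[f]]  # path compression
            f = parent[f]
        return f

    for a, b in synergies:
        parent[find(a)] = find(b)  # union the two classes

    partitions = {}
    for f in functions:
        partitions.setdefault(find(f), set()).add(f)
    return sorted(partitions.values(), key=lambda s: sorted(s))
```

Note how transitivity does the work: if ‘payments’ has synergy with ‘settlement’, and ‘settlement’ with ‘reconciliation’, all three land in one partition without that third pair ever being declared.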

Once partitions are identified, it is possible to apply ‘traditional’ EA thinking to a particular partition, because that partition is of a manageable size. By ‘traditional’ EA, I mean applying Zachman, TOGAF, PEAF, or any of the myriad methodologies/frameworks that are out there. More specifically, at that level, it is possible to establish a meaningful information systems strategy or goal for a particular partition that is directly supporting business agility objectives.

The Fallacy of Functional Decomposition

Once you get down to the level of a partition, functional decomposition becomes less useful for architecting solutions. The natural tendency for architects would be to build reusable components or services that realise the various functions that comprise the partition. In fact, this may be the wrong thing to do. As Juval Löwy demonstrates in his excellent webinar, this may result in more complexity, not less (and hence less agility).

When it comes to software architecture, the real reason to modularise your architecture is to manage volatility or uncertainty – and to ensure that volatility in one part of the architecture does not unnecessarily negatively impact another part of the architecture over time. Doing this allows agility to be maintained, so volatile parts of the application can, in fact, change frequently, at low impact to other parts of the application.

When looking at a software architecture through this lens, a quite different set of components/modules/services may become evident than those which may otherwise be obvious when using functional decomposition – the example in the webinar demonstrates this very well. A key argument used by Juval in his presentation is that (to paraphrase him somewhat) functions are, in general, highly dependent on the context in which they are used, so splitting them out into separate services may require making often impossible assumptions about all the possible contexts in which the functions could be invoked.

In this sense, identified components, modules or services can be considered to be providing options in terms of what is done, or how it is done, within the context of a larger system with parts of variable volatility. (See my earlier post on real options in the context of agility to understand more about options in this context.)
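A minimal sketch of volatility-based decomposition (the domain and the rule are invented for illustration): the frequently changing part – here, commission rules – is encapsulated behind a stable seam, so its churn never ripples into callers:

```python
class CommissionPolicy:
    """The VOLATILE part of the system: these rules change frequently.
    Encapsulating them here keeps churn from rippling into callers."""

    def rate(self, order):
        # Today's rule; freely changeable without touching OrderProcessor
        return 0.002 if order["notional"] > 1_000_000 else 0.005

class OrderProcessor:
    """The STABLE part: knows an order carries a commission, but is
    deliberately ignorant of how the rules evolve over time."""

    def __init__(self, policy: CommissionPolicy):
        self._policy = policy

    def commission(self, order):
        return order["notional"] * self._policy.rate(order)
```

A functional decomposition might instead have produced a shared ‘CommissionService’ reused across contexts; the volatility view asks a different question – what changes together, and what must be shielded from that change.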

Partitions as Enterprise Architecture

When each partition is considered with respect to its relationship with other partitions, there is a lot of uncertainty around how different partitions will evolve. To allow for maximum flexibility, every partition should assume each other partition is a volatile part of their architecture, and design accordingly for this. This allows each partition to evolve (reasonably) independently with minimum fixed co-ordination points, without compromising the enterprise architecture by having different partitions replicate the behaviours of partitions they depend on.

This then allows:

  • Investment to be expressed in terms of impact to one or more partitions
  • Partitions to establish their own implementation strategies
  • Agile principles to be agreed on a per partition basis
  • Architectural standards to be agreed on a per partition basis
  • Partitions to define internally reusable components relevant to that partition only
  • Partitions to expose partition behaviour to other partitions in an enterprise-consistent way

In generative organisation cultures, partitions do not need to be organisationally aligned. However, in other organisation cultures (pathological or bureaucratic), alignment of enterprise infrastructure functions such as IT or operations (at least) with partitions (domains) may help accelerate the architectural and cultural changes needed – especially if coupled with broader transformations around investment planning, agile adoption and enterprise architecture.
