Infra and App Evolution

TL;DR
How infrastructure and application architecture contribute to the complex, bureaucracy-heavy, error-prone processes of large-scale IT operations, and how to improve the situation using a step-by-step approach.

Background
Having worked in a ‘few’ organizations over the last couple of decades, mostly in North America and occasionally around the world, I finally started to recognize some common patterns that slow productivity within an organization, sometimes to a crawl. The cycle seems to repeat itself over time: at first, productivity and velocity are high, and, little by little, both are lost to processes and bureaucracy.
Spending my days, and often enough burning the midnight oil, designing, coding, deploying, and troubleshooting software in different environments, a question started bothering me while working at Shomi.com: ‘What am I doing wrong with my architecture practices that causes such an enormous loss of productivity?’
From the common-sense idea of blaming everyone else, through self-pity, I eventually started to realize the magnitude of the factors contributing to the trouble - factors that cannot possibly be resolved with a silver bullet or finger-pointed to a single ‘root cause’.

Hidden technical debt
Overview

The good
On the surface there is nothing wrong with the diagram of how infrastructure and applications have been designed and operated for an eternity in IT terms.
Networks have been built using private IP address space, subnets were designed to separate traffic flows, and firewalls were introduced to control traffic between locations, subnets, and internet access.
VPNs have been added to interconnect remote locations and trusted partners and to protect the traffic between them.
Direct lines have been added to support higher network loads with decreased latency and increased bandwidth.
The bad
In the earlier days, encryption at the application level could not easily be supported due to computation overhead; encryption would often be offloaded to a hardware appliance to improve performance.
Networks were perceived as trusted, and there was little to no concern about clear-text traffic on internal networks.
Systems physically deployed close to one another would create interdependencies at any level, with little to no concern for security (a rough illustration follows the list):
  • db link for database-to-database connectivity, read or read/write - whatever is needed
  • ETL (extract-transform-load) processes would move increasing amounts of data
  • file exchange using the FTP protocol
  • raw TCP/IP socket calls
  • web services
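
As a rough illustration of the kind of coupling these shortcuts create, here is a hypothetical sketch (not taken from any particular system) of a batch job that reaches straight into another system's database over the trusted network and then pushes the extract over plain FTP. The hostnames, credentials, table, and file names are invented, and psycopg2/ftplib simply stand in for whatever driver and transfer tooling a given shop used.
```python
# Sketch of a typical legacy integration shortcut: reach directly into another
# system's database, then push the extract over plain (unencrypted) FTP.
# Hostnames, credentials, table, and file names are hypothetical.
import csv
import ftplib

import psycopg2  # any direct database driver plays the same role here

# Direct database-to-database style coupling: the batch job knows the other
# system's schema and connects straight to it over the trusted network.
conn = psycopg2.connect(host="billing-db.corp.local", dbname="billing",
                        user="etl_user", password="cleartext-password")
with conn, conn.cursor() as cur:
    cur.execute("SELECT account_id, balance FROM invoices WHERE status = 'OPEN'")
    with open("open_invoices.csv", "w", newline="") as f:
        csv.writer(f).writerows(cur.fetchall())

# File exchange over plain FTP: credentials and data travel in clear text.
with ftplib.FTP("reports.corp.local") as ftp:
    ftp.login(user="reports", passwd="also-cleartext")
    with open("open_invoices.csv", "rb") as f:
        ftp.storbinary("STOR open_invoices.csv", f)
```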

The ugly
While it is true that proper software architecture promotes loose coupling, abstraction, encapsulation, and separation of concerns, a term I first heard from Deb Butts probably describes the reality more accurately - the decay of software over time:
  • messy dependency graphs between components
  • hard-coding of parameters
  • poor encapsulation
  • lack of motivation to upgrade the operating system, 3rd-party modules, and frameworks over time
  • close proximity to other software and systems in the legacy deployment model, which indirectly encourages shortcuts during the design or implementation phases of dependent systems: 'just connect to that database' or 'just send this file over FTP'
The end result is an incredible complexity of environment setup and troubleshooting that, in turn, encourages the growth of change management processes, which slow everything down further on top of the technical complexity.
The horrendous inflexibility of adopting another operating system, back-end system, programming language, etc. translates into a 'not invented here' syndrome where 'new shiny things need not apply'.
Since systems run over trusted networks, software continues to be built insecurely, even though encryption at the application level is cheap and readily available nowadays.
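
To show just how cheap application-level encryption has become, here is a minimal sketch that uses nothing but Python's standard library to wrap an ordinary TCP client socket in TLS; the host name is hypothetical.
```python
# Minimal sketch of application-level encryption with the standard library:
# wrap an ordinary TCP client socket in TLS. Host name is hypothetical.
import socket
import ssl

context = ssl.create_default_context()  # sane defaults: cert validation, modern TLS

with socket.create_connection(("internal-api.example.com", 443)) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname="internal-api.example.com") as tls_sock:
        tls_sock.sendall(b"GET /health HTTP/1.1\r\nHost: internal-api.example.com\r\n\r\n")
        print(tls_sock.recv(4096).decode(errors="replace"))
```
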
Provisioning additional servers, managing DNS entries, and configuring load balancers, firewall rules, subnets, and network zones translate into a snowball effect of decreasing productivity over time, and bureaucracy becomes a necessary evil to deal with all of that.
The next thing you know, results-oriented people start moving out and process-oriented candidates start moving in. Productivity declines sharply, and the velocity of the earlier days slows down.

What's your point?

It is probably not going to be well received, but there is an element of blame to attribute to solution architects like myself:
  • legacy system rewrites/upgrades don't tend to go well in terms of timelines and budget; the more complex the system, the more painful it will be to upgrade or decommission
  • designing a complex system as one or a few deployment units is a recipe for increasingly expensive enhancement and maintenance cycles - decay happens over time
  • deploying a large number of systems over a fast, private network tends to translate into tighter-than-ideal coupling between systems and services - hence SOA's less-than-impressive success stories
  • deploying a large number of systems into the same network tends to promote heavy change/risk management processes - though it did look like a great idea at the time for increasing performance
  • there is a false comfort in defining and working toward an end-state; the end-state will most likely change halfway through whatever time interval is allocated to the project - the best architecture is the one that is easiest to change
  • the owner of the solution (technical or business) will develop an emotional attachment to the choices/decisions made, likely in proportion to the investment made in the solution; a complex system means heavy investment, paving the way for clinging to it longer at an increasing total cost of ownership
  • progressive software developers don't get excited about digging through someone else's code that uses a 10-year-old version of a framework - how did we not see that coming: the then new and shiny MVC would soon become dated and dull with the release of AngularJS

What's next?
Next is not rewriting the old systems with new tools; it is creating a risk-reduced infrastructure and application architecture where old systems operate as-is and new systems are built using the newest technology. IT is not paralyzed by the fear of breaking existing infrastructure/applications and is ready to take on new initiatives with startup-like velocity. Sounds like a hallucination, does it not?

Step 1 - New Apps Deployed Almost Isolated


What's the difference?
  • new application deployments don't reside on the same network as the legacy systems
  • new applications are closer to microservices than to a monolithic architecture (a minimal service sketch follows this list)
  • there is still a heavy dependency on private network connectivity, as legacy systems rely on the network for security and there is no appetite to revamp them
  • connectivity to third parties remains the same - cannot count on the same vision and timeframes
  • new applications have greater freedom to deploy faster
  • new applications have greater autonomy to select newer technology and programming languages
  • new applications still rely on the VM deployment model (IaaS) due to organizational familiarity
  • reduced risk and reduced change management overhead - isolation, imperfect as it is, decreases the chances of a widespread outage or failure
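
As a minimal sketch of what a Step 1 style application might look like, the service below is small, self-contained, and takes its configuration from environment variables instead of hard-coded parameters. The service name, port, and endpoint are hypothetical, and Flask is just one convenient choice.
```python
# Minimal sketch of a Step 1 style service: small, self-contained, and
# configured via environment variables instead of hard-coded parameters.
# Names, port, and endpoint are hypothetical; Flask is one convenient choice.
import os

from flask import Flask, jsonify

app = Flask(__name__)

# Configuration is injected by the environment, not baked into the code.
SERVICE_NAME = os.environ.get("SERVICE_NAME", "orders-api")
LISTEN_PORT = int(os.environ.get("PORT", "8080"))


@app.route("/health")
def health():
    # A simple health endpoint a load balancer can probe.
    return jsonify(status="ok", service=SERVICE_NAME)


if __name__ == "__main__":
    # Bind inside the (almost) isolated application subnet.
    app.run(host="0.0.0.0", port=LISTEN_PORT)
```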

Step 2 - Apps Moved to PaaS and DBaaS


What's the difference?
  • new applications decrease their dependency on IaaS and increase their dependency on PaaS, Serverless, and DBaaS (a serverless sketch follows this list)
  • building and deploying microservices is becoming cheaper and less risky
  • reduced dependency on private network connectivity - communication between new applications uses public IPs with application-level protection instead of network-level protection
  • connectivity to third parties remains the same - cannot count on the same vision and timeframes
  • new applications benefit from increased delivery velocity by reducing dependency on IaaS
  • new applications benefit from reduced cost running on PaaS/Serverless instead of IaaS
  • SaaS and DBaaS further reduce TCO - no daily backups to take
  • further reduction of deployment risk and cost using cloud-specific or 3rd-party continuous integration and deployment tools
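
As a rough sketch of the Step 2 style, here is a minimal serverless handler written in the AWS Lambda style, assuming an API Gateway proxy integration; the payload shape and names are hypothetical, and any serverless platform with an equivalent entry point would do.
```python
# Minimal sketch of a Step 2 style serverless function (AWS Lambda style,
# assuming an API Gateway proxy integration). Payload and names are hypothetical.
import json


def handler(event, context):
    # API Gateway passes the HTTP request as 'event'; there is no server to
    # provision, patch, or back up - the PaaS/Serverless platform owns that layer.
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```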

Step 3 - Apps and Networks Are Least Interdependent


What's the difference?
  • new applications are mostly running on PaaS, Serverless and DBaaS with the lowest TCO
  • no dependency on private network connectivity; applications on-premises and in the cloud are integrated using public IPs
  • legacy on-prem systems have been encapsulated behind a service layer exposed over public IP (see the sketch after this list)
  • connectivity to third parties remains the same - cannot count on the same vision and timeframes
  • new applications can select any cloud, any technology, any SaaS to deliver the fastest business results using the most appropriate toolset
  • possibly the perfect stage for switching new apps to a DevOps model with greater accountability and small-team SDLC processes
  • an application team can slide the scale between microservices and a gigantic deployment without cross-organizational convergence
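
A rough sketch of the call path Step 3 enables: a new cloud application reaches a legacy on-prem system through its HTTPS service facade over public IP, relying on TLS and an application-level token rather than a shared private network. The URL, token variable, and response shape are hypothetical.
```python
# Sketch of a Step 3 style integration: a new app calls a legacy system through
# an HTTPS service facade over public IP, using application-level security
# (TLS plus a bearer token) instead of trusting a private network.
# The URL, token variable, and response shape are hypothetical.
import os

import requests

LEGACY_FACADE_URL = "https://legacy-gateway.example.com/api/v1/orders/42"
API_TOKEN = os.environ["LEGACY_API_TOKEN"]  # issued by the facade, not a network ACL

response = requests.get(
    LEGACY_FACADE_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=5,  # fail fast; the public internet is not a trusted LAN
)
response.raise_for_status()
print(response.json())
```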

Conclusions
  • large deployments are risky and hence process-heavy - isolated networks and a microservices architecture should have a leg up on monolithic applications and gigantic networks
  • the end-state architecture is a continuous, risk-reduced model for change, not a perfect static result
  • which specific technology is being used is less important: C++, C#, JavaScript, and Java would all suffer the same fate over time - increased fragility and complexity
  • which IP addressing scheme is being used, IPv4 or IPv6, is less important: even if we run out of public and private IPv4 addresses and switch to IPv6 overnight, the question of application-level security vs. network-level security remains open
  • network architects and solution architects have, in a way, brought the risk and change management processes onto themselves while suffering from reduced productivity and business agility - address the underlying factors rather than struggling with the symptoms
  • maybe there is hope for helping the business stay in startup mode longer by isolating complexity and interdependencies at the application and network layers