Taking DCIM Further

The declared aim of data center infrastructure management (DCIM) is to bridge the gap between organizational silos and support agile methodologies across multiple departments. But there is more to DCIM than a merging of facility and server management. With the rise of virtualization and SDx, it is necessary to consider a number of other stakeholders to ensure truly professional and efficient data center operation.

A data center is home to a wide range of disciplines and stakeholders: facility management teams, server and storage admins, database specialists, and network engineers. Often, these groups are siloed off in separate organizational units, making properly synchronized and orchestrated collaboration a rarity in many contemporary data centers.

Responsibilities and operational concepts have usually evolved over time, rather than being part of an overarching plan. Attempts to improve things through internal restructuring are often rendered obsolete by changing conditions, new technologies, or acquisitions and expansion. The result is a major discrepancy between actual task profiles, organizational structures, and the current demand for agile data center operation.

Silos and the problem of virtualization

The first question we have to ask is what we really mean when we talk about a “data center.” Depending on one’s perspective, the definitions can be very different indeed. For some, it is simply the room in which an organization’s servers are located; for others, it is the entire building, i.e., the “white space” and associated IT rooms, the “gray space,” and various other facilities.

Just as the lack of a shared language makes it difficult to communicate, so the differing perspectives of individual stakeholders can have a negative impact on the efficiency of their processes and methods. In most cases, each team – or silo – has its own methods and tools for managing tasks and operating the systems for which it is responsible. However, while the server team might find it helpful in its daily work to concentrate on “its” specific assets, this insular view is not really productive for the organization as a whole.

Planning, management, and monitoring in a single system is the key to modern data center management.

Virtualization of systems and new programmable, software-defined networks (SDN) have brought enormous gains in the speed at which services are delivered. Yet there is still room for improvement in the planning and introduction of these methods and technologies, given the frequent absence of a robust planning platform and appropriate tools. The reason for this deficiency is that the whole discussion around “virtual” and “software-defined” technologies tends to overlook the fact that these are not entirely virtual and abstract assets, but rather ones based on physical hardware resources.

Every private cloud system runs on a physical server and every data packet travels through numerous cables and patch panels. But it is impossible to detect passive components using autodiscovery tools, and they cannot be controlled through software. This physical layer also has its own lifecycles, its own particular rhythm, and – occasionally and unfortunately – its own specific faults. Without adequate documentation and planning of the real-world relationships, it is extremely difficult and time-consuming to identify the physical cause in the event of a fault. It is also impossible to conduct a full contingency analysis for a planned change – or a reliability analysis that addresses each individual layer in the stack.

Adopting a coordinated strategy for digital transformation

The situation described above will be exacerbated by the anticipated increase in virtualization, cloud, private cloud, and hybrid cloud services. The ongoing digital transformation will place even greater pressure on data centers to accelerate delivery of services. Where once clients could wait as long as six months for a bare-metal system, most cloud clients today consider a wait time of six minutes for a virtual system to be too long.

And these expectations will continue to rise, of course. The digital transformation is not only creating new application scenarios and business models, it is also giving rise to new methods and models designed for greater agility, flexibility, and speed. The processes used in data centers 20 years ago are not always practicable today. Since technology has moved on, it is time to examine whether operational processes and management methods are still fit for purpose.

To achieve meaningful improvements in the use of IT, we have to look beyond optimization of individual subsystems and improve the system as a whole.

A classic example of subsystem optimization is deployment of UPS systems with better PUE (power usage effectiveness) – or rather pPUE (partial power usage effectiveness). The problem with improving a subsystem in isolation is that it does not necessarily lead to a significant improvement of the overall system. In many cases, the investment costs exceed the benefits. Improving the PUE on an asset can often be difficult to justify if it is accompanied by a bigger electricity bill. PUE optimization runs the risk of becoming an end in itself, overshadowing all other efforts to bring down costs or dooming them to failure through a lack of appropriate means.
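The gap between PUE and the electricity bill can be made concrete with a small calculation. The following is a minimal sketch in Python, using the standard definition of PUE (total facility power divided by IT equipment power). All load figures and scenario names are hypothetical and chosen purely for illustration.

```python
def pue(it_kw: float, overhead_kw: float) -> float:
    """PUE = total facility power / IT equipment power."""
    if it_kw <= 0:
        raise ValueError("IT load must be positive")
    return (it_kw + overhead_kw) / it_kw


def total_kw(it_kw: float, overhead_kw: float) -> float:
    """Total facility power, which is what the electricity bill follows."""
    return it_kw + overhead_kw


# Hypothetical scenarios: (IT load in kW, cooling/UPS/other overhead in kW)
scenarios = {
    "baseline": (500.0, 300.0),
    # Idle servers are switched off; overhead shrinks less than the IT load.
    "idle servers removed": (400.0, 270.0),
    # More IT load is added; overhead grows only slightly.
    "more load added": (600.0, 330.0),
}

for name, (it_kw, overhead_kw) in scenarios.items():
    print(f"{name}: PUE {pue(it_kw, overhead_kw):.3f}, "
          f"total {total_kw(it_kw, overhead_kw):.0f} kW")
```

In this illustration, switching off idle servers lowers total consumption but makes the PUE look worse, while adding load improves the PUE even though consumption rises. The metric and the bill can move in opposite directions, which is why PUE should not be optimized in isolation.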

A holistic approach instead of partial optimization

To understand partial optimization in the IT environment, we only have to consider migration from one server technology – or one server supplier – to the next.

Just as pizza box servers were replaced by blade server solutions, now many data centers are switching to converged systems. Regular tendering for new server hardware often results in a change of supplier. Typically, this change is viewed in isolation, with comparisons limited to the previous generation of server or previous supplier. The assessment covers the procurement and operating costs, a brief examination of the new hardware in terms of processing power and perhaps space requirements or power efficiency, plus maybe the impact on licensing costs.

However, many operators neglect to carry out a complete and detailed examination of the changes in power density (i.e., power requirements per rack unit or per unit of floor area in the room), heat emissions (heat-flow volume and outlet temperature), or cooling requirements (required air-flow rate, optimum temperature range, air velocity, and pressure differences). Yet precisely herein lies a major risk during both the roll-out and operational phases.
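A rough estimate of these quantities takes only a few lines. The sketch below, in Python, compares power per rack unit and the air-flow rate needed to remove the heat for two server generations. It assumes that all electrical power ends up as heat and uses approximate properties of air (about 1.2 kg/m³ and 1005 J/(kg·K)). The rack figures are hypothetical, and a real assessment would also have to consider air velocity, pressure differences, and the permitted temperature range.

```python
AIR_DENSITY_KG_M3 = 1.2      # approximate density of air near room temperature
AIR_CP_J_PER_KG_K = 1005.0   # approximate specific heat capacity of air


def watts_per_rack_unit(rack_power_w: float, occupied_units: int) -> float:
    """Average power requirement per occupied rack unit."""
    return rack_power_w / occupied_units


def required_airflow_m3h(heat_load_w: float, delta_t_k: float) -> float:
    """Air-flow rate needed to carry away heat_load_w at a given
    temperature rise between inlet and outlet (sensible heat only)."""
    if delta_t_k <= 0:
        raise ValueError("temperature rise must be positive")
    m3_per_s = heat_load_w / (AIR_DENSITY_KG_M3 * AIR_CP_J_PER_KG_K * delta_t_k)
    return m3_per_s * 3600.0


# Hypothetical racks: (power draw in W, occupied rack units)
old_rack = (5_000.0, 40)
new_rack = (12_000.0, 24)

for label, (power_w, units) in (("old", old_rack), ("new", new_rack)):
    density = watts_per_rack_unit(power_w, units)
    airflow = required_airflow_m3h(power_w, delta_t_k=12.0)
    print(f"{label}: {density:.0f} W per rack unit, "
          f"about {airflow:.0f} m3/h of air at a 12 K rise")
```

Even this simplified comparison shows how a denser server generation can occupy less rack space while demanding considerably more power and cooling per rack.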

Another problem with even the most successful partial optimization is that neighboring teams or users are often completely unaware of the change and may inadvertently work against it. Even the most efficient server will still consume too much power if left running in idle mode. Similarly, a basic test system doesn’t need to run in a Tier 3 environment if a simpler alternative is available.

The aim should be to optimize the overall performance of the data center.

To do that, it is necessary to consider every part of the organization, all IT and data center components, and every layer in the stack. Accordingly, the senior management team must buy into the objective and add it to their agenda. Rather than focusing on a prestigious new facility building, for instance, it often makes better long-term sense to restructure the internals or optimize operational processes.

There is also a need to rethink redundancy in light of the latest NFV (network functions virtualization) technologies and methods, given that system reliability will increasingly be assured on the application level, i.e., with fewer demands on the physical infrastructure of the data center. Hyperscale service providers, who rely on self-healing and auto-scaling on the application layer, have long understood the potential savings that are possible if, for example, A and B power feeds do not have to be provided through a UPS, due to a lower level of reliability being required on this layer. Without planning and monitoring, a management team has no objective and no insight into current states.

New demands on DCIM solutions

For companies that, unlike hyperscale providers, are unable to develop and deploy their own management solutions, there is a wide range of powerful DCIM products to choose from. These tools support planning and operation, automation of processes, fault prevention, and better utilization of resources. In doing so, they merge the three key areas of planning, management, and monitoring.

It is easy to see why these three components are a natural fit – and why no sophisticated data center management system can do without them. When monitoring is separate from management and planning, it is purely reactive – the operator waits until a threshold value is breached or an alarm is sounded, and only then starts to think about what to do. Obviously, it is better to prevent this type of situation from occurring, if possible not merely through lower threshold values but rather through proper planning, better operating conditions, and a proactive stance. Conversely, planning without monitoring means there is no way of comparing planned and actual states, of verifying implemented changes, or of detecting potential deviations before they become a problem.
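The comparison of planned and actual states can be sketched in a few lines of Python. This is an illustration only, not the interface of any particular DCIM product: the rack names, the power figures, and the 10 percent tolerance are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deviation:
    rack: str
    planned_kw: float
    actual_kw: float


def find_deviations(planned: dict[str, float],
                    actual: dict[str, float],
                    tolerance: float = 0.10) -> list[Deviation]:
    """Return racks whose measured power differs from the planned value
    by more than the given fraction of the plan."""
    deviations = []
    for rack, planned_kw in planned.items():
        if rack not in actual:
            continue  # no measurement available for this rack
        actual_kw = actual[rack]
        if abs(actual_kw - planned_kw) > tolerance * planned_kw:
            deviations.append(Deviation(rack, planned_kw, actual_kw))
    return deviations


# Hypothetical planning data and monitoring readings (kW per rack)
planned = {"R01": 4.0, "R02": 6.0, "R03": 5.0}
actual = {"R01": 4.2, "R02": 7.5, "R03": 3.0}

for d in find_deviations(planned, actual):
    print(f"{d.rack}: planned {d.planned_kw} kW, measured {d.actual_kw} kW")
```

Such a check reports a drift from the plan long before a hard alarm threshold is reached. It is only possible when planning data and monitoring data are available in the same system.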

A proven DCIM solution is the only way to ensure better use of resources and faster processes while maintaining or improving system reliability.

Using an appropriate data model across all layers – from the smallest fuse upwards through the server, VM, application, and business service – ensures that data center stakeholders have access to planning and analysis/optimization options. The addition of workflows helps to improve efficiency by simplifying recurring tasks and preventing errors, while also meeting compliance and security requirements. The result is a significant reduction in workload, which frees up staff to focus on non-automatable tasks and planning.
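What such a cross-layer data model makes possible can be illustrated with a small dependency graph. The following Python sketch links a fuse to a business service and determines everything affected by a failure. The component names are hypothetical and the model is deliberately simplified: it ignores redundancy such as A and B power feeds, which a real analysis would have to take into account.

```python
from collections import defaultdict, deque

# Hypothetical dependencies as (provider, consumer) pairs,
# from the fuse up to the business service.
DEPENDENCIES = [
    ("fuse-F12", "pdu-A3"),
    ("pdu-A3", "server-srv01"),
    ("server-srv01", "vm-web01"),
    ("server-srv01", "vm-db01"),
    ("vm-web01", "app-shop"),
    ("vm-db01", "app-shop"),
    ("app-shop", "service-online-store"),
    ("fuse-F13", "pdu-B3"),
    ("pdu-B3", "server-srv02"),
    ("server-srv02", "vm-test01"),
    ("vm-test01", "app-test"),
]


def build_graph(edges):
    """Map each component to the components that depend on it directly."""
    graph = defaultdict(list)
    for provider, consumer in edges:
        graph[provider].append(consumer)
    return graph


def impacted_by(component, graph):
    """Return every component that depends, directly or indirectly,
    on the given one (breadth-first search, no redundancy modeled)."""
    seen = set()
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for consumer in graph.get(current, []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen


graph = build_graph(DEPENDENCIES)
print(sorted(impacted_by("fuse-F12", graph)))
```

With the physical and logical layers in one model, the question of which business services depend on a given fuse becomes a simple query instead of a manual investigation across several teams' tools.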

User involvement is crucial to success

Bridging the gap is only possible through integration and collaboration of all stakeholders. Only when all participants use the same database and can access the same up-to-date information is it possible to plan reliably and assess situations correctly. All planning steps should be shared with colleagues before a decision is taken. Playing Chinese whispers by e-mail is not a good way of synchronizing processes, and a spreadsheet program is not an orderly, revision-proof means of documenting critical infrastructure.

Even if it is not always easy to roll out a DCIM solution to a large number of stakeholders, it is a hurdle that is well worth overcoming.

Objections on cost grounds are no longer valid today: numerous studies indicate a very reasonable return on investment, with payback periods of sometimes even less than one year. In short, the question is no longer whether data center operators can afford a DCIM solution, but whether they can still afford to do without one – since the costs of inefficient operation are much greater in the long term.

More interesting blog posts written by Oliver Lindner, Head of Business Line DCIM at FNT:

Predictive Maintenance in the Data Center

Reliable Planning and Operation with DCIM tools

Data Center Business Value Dashboards