Advanced Environmental Monitoring, Control and Optimization
Today, the vast majority of data centers employ environmental monitoring controls that are outdated and uncoordinated. While ASHRAE guidelines specify temperature and humidity levels at the IT equipment inlet, nearly all CRAC/CRAH units regulate independently and control both temperature and humidity at the air return point. This disparity creates conflicts between cooling units, and the temperature and humidity of the air entering the IT equipment fall outside the ASHRAE recommended range. The cooling units end up opposing each other: one cooling while another bears no load or even heats the air, or one humidifying while another dehumidifies.
Alongside a discussion with 42U CTO Scot Heath, PE, on effective environmental monitoring to improve efficiency and maximize availability in your data center, this webinar presents a data center case study of a Fortune 50 company that achieved a 3-month ROI on monitoring and optimization.
This Webinar covers the following environmental monitoring topics:
- The Current State of Monitoring
- Monitoring Considerations
- Pitfalls of Individual Return Air Control
- Case Study: 3-Month ROI for Fortune 50 Company
- How Control Systems Save Energy & Impact PUE
- Q&A Session
Read the Full Transcription
Tanisha White: Ladies and Gentlemen, thanks for standing by and welcome to today’s session in the 2011 42U web seminar series. My name is Tanisha White and I’ll be your moderator for today’s webinar called Advanced Environmental Monitoring, Control & Optimization.
During this presentation all participants will be in a listen only mode. However, we encourage your questions or comments at any time through the Chat feature located at the lower left of your screen. These questions will be addressed as time allows.
This webinar is being recorded today, August 18, 2011. A replay of this webinar recording will be available on our Web site at 42u.com approximately 48 hours after our presentation.
Our speaker for today’s presentation is 42U’s Chief Technology Officer, Scot Heath. And at this time I’d like to turn the presentation over to Scot.
Scot Heath: Thank you Tanisha. Welcome again everyone. Let’s go ahead and go to the next slide Tanisha.
Those of you who have been here before may have seen this slide before, it’s one of my favorites, I’ll continue to show it. This is actually a very dense environment, the inside of a containerized data center.
And the quote here, you know, regarding density I just want to reiterate, you know, over and over and over, I go into data centers that, you know, were initially laid out with little to no thought for the increase in density that is rapidly becoming the norm.
And I used to say, "Ten years ago," it’s even more years ago than that now, we didn’t think about hot aisle cold aisle arrangements and so on and so forth. And I still go into a significant number of data centers that have that. Even with hot aisle/cold aisle arrangement, you know, knowing what’s going on in the data center is imperative.
And we’re going to talk today a little bit about, you know, how you know about what’s going on; what your purpose in knowing about what’s going on is that is relative to temperatures; and I’ll throw in a little bit of other monitoring parameters; and finally, you know, how some of these monitoring systems have actually evolved into control systems. Let’s go ahead and go on.
So you know, top of the slide there, data center cooling is not comfort cooling, yet we still treat it that way by-and-large. You know, monitoring at present is usually a few points, you know, I’ve seen all kinds of schemes — NetBotz, I look at a few servers, whatever the case may be — and to really get a good idea of what’s going on, you know, having just that few points doesn’t quite cut it.
Most of the time, you know, we rely on just making the temperature very cool to make things all right. And I’ll talk about, you know, what’s going on there with the control system, a little bit later.
But fact of the matter is, kind of keep this in mind, when you add equipment you should turn the thermostat up — up, that is warmer — if you have a classical, you know, CRAC control in your data center.
Now one of the other problems is just no clear ownership of the environment in the data center.
You know, is it the IT guy’s responsibility to make sure that the environment is the right temperature? Well he’s certainly not the guy in most cases who specified the cooling equipment and maintains that cooling equipment.
Is it the building maintenance people who go into the data center and check temperatures? You know, most of the time that’s not true either.
When I was still at HP in fact, we built a new data center that’s kind of a research data center, and I and one of the last guys were in there adjusting some temperatures. And the guard showed up and wanted to know what we were doing. "Well we’re in here, you know, adjusting temperatures."
"Well you shouldn’t be in here adjusting temperatures." So in that case, the ownership was so screwed up that we were being chased out of something that was our responsibility. So things can get quite muddy there.
And finally you know, measuring is evolving. And I don’t know if you’ve had a chance to take a look at the Green Grid’s data center maturity model or guide, but you know, everything in the data center is evolving, and monitoring is no different. You know, this little chart right here at the bottom is 90% of the time we’ve got, you know, some points out there that are distributed some place.
And in the case I just cited, you know, the reason the guard was in there was because his alarm was going off. Why was his alarm going off? Well, the alarm point that got put in when the builders built that data center — and this is brand new, probably only 10% occupied — was put in a contained hot aisle.
So, what a silly place to put an alarm point. But nonetheless that’s where it was. And we only had I think three in that data center of 33,000 square feet, so you know, way under monitored from the point of view of, you know, the people who were coming in to tell us not to do things.
And as we moved forward to optimization, you know, more and more points are necessary. And kind of the number of points varies with the density that’s there. If you’ve got some pretty high densities in, you want to have a lot of points, because you know, the localized temperatures can swing significantly based on how much air is being flowed through that particular area.
Bad things can happen, you know, people can take out machines, racks, blanking panels, whatever the case may be and end up with a condition that’s unsuitable for the equipment right around that localized area. And you’ll miss it if you’re not monitoring in a very close proximity.
And finally, you know, if we do get to the control point, very pervasive monitoring is necessary because we want to ensure that we are minimizing margin. I mean, that's our whole purpose in even trying to put in more advanced control: to have the ability to have reserve cooling capacity on tap when we need it, because things happen. You know, power goes down and I lose cooling, belts break, my fan quits in a CRAC unit.
So I need to have more cooling than absolutely necessary. But running that cooling all the time, especially at a high capacity level is quite expensive. And so much better to have smarter controls. Let’s go ahead and go to the next slide Tanisha.
And I have a plug here for "Poor monitoring is worse than none," and I kind of lumped, you know, poor monitoring and poor control together here.
You know, we have an inaccurate view of what’s going on in the data center. Sometimes we have no view of what’s going on in the data center. But if we have a little bit of monitoring, it depends on where that monitoring is.
Now the picture I have there is actually an example of the data center I was in where there’s a CRAC very nearby, in fact you can barely see the edge of it in the right hand side of that picture. And that CRAC was happily running, thinking it was doing a great job.
But the issue here is, that you know, the cold supply tiles for that row of equipment are right up against the front of the CRAC essentially. And so a significant portion of that air returns directly to the CRAC without going through the server equipment at all.
It’s happy to do that; the CRAC’s creating a low pressure zone on the top and it’s drawing air back, you know, presumably warm air from the room, but if I dump cold air out right in front of it, where’s it going to go? It’s going to go right back to the source that gave it the energy to get there and that’s the CRAC.
And so you know, it thought it was happy; the BMS thought that CRAC was happy because it’s, you know, not running a very high load, and the temperature set point and the temperature monitored are the same, when in fact it was cooling, you know, hardly anything.
Look at the top of the rack there. I think the Delta T in this case was around 20 degrees top to bottom in this rack. So we could have seen that of course if we had, you know, much better monitoring in place. Okay let’s go ahead and go on to the next slide.
So you know, when you think about putting monitoring in place there are several considerations. The first is cost and cost can be quite high depending on, you know, what scheme you pick. I’ve heard, you know, a standard quote of around $1000 per monitored point for a typical classical BMS — you know, a Johnson Controls or a Siemens or whoever the case may be.
I’ve got to run wire out there, I’ve got to get, you know, somebody to come out and change the program to add that point in if I don’t have staff on site to do that, and now I’ve got to put that, you know, functionality in an alarm someplace — you know, whatever the case may be it’s not a simple proposition to do.
You know, where I’m going to put this stuff is also paramount. And that is, you know, as close as possible to the inlets of the machines because that’s really what I care about.
If I go look at the ASHRAE specs, they don’t spec, you know, temperature back at the CRAC, they don’t spec temperature in the rear of the machines; they spec the temperature of the air going into those machines.
And if I’m cost conscious and I want to minimize the amount of monitoring I put out there but maximize the benefit, I want to spend some time out there understanding where my worst-case locations are. You know, that is equipment that’s very dense in particular areas, you know, I may have hot spots in areas — I certainly want to monitor those.
And that’s by no means a, you know, one time deal. It’s a continuous kind of walk through this and make sure that I’m monitoring the right places. Because you all know that your data centers are dynamic; things evolve. You know, I changed out my Outlook servers, now all of a sudden I have blades in place where I had, you know, old 1U servers before, or you know, I installed a new database server, and it changes the dynamics of the room significantly when those kinds of things happen.

And you know, open air return rooms can suffer greatly from, you know, dense loads placed without planning or with little thought given to, you know, the amount of cooling that’s available in that particular location. And this is common. You know, I had my Outlook servers there before, I put my Outlook servers there now. I had my whatever there before, I put them there now.

You know, we were just at a university here in Colorado and they put in a super compute node that’s very dense. You know, it’s probably a dozen racks that are 27 to 35 kilowatts each. And they had room on one side of the data center so that’s where they put it. Unfortunately, you know, that side of the data center is the absolute furthest from the return ducts to the CRAC. So every molecule of hot air that comes out of the back of those servers has to propagate around the racks themselves and other racks to get back to the return. And it’s not, you know, a pretty situation. They end up having to run their temperature extremely low. I think they’re delivering 50 or maybe even slightly below 50 degree air just to get ASHRAE temperature air into that super compute node.
So you know, had a little bit of planning been done, and maybe some monitoring ahead of time, they would have known that that location was, you know, quite unsuitable for that particular application or putting in there.
And finally, what are you going to use the monitoring for? You know, is it going to be just an alarm feature? Optimization, where you look at this and go make changes to your floor on a regular basis? You know, actually use the data to plan, as we’ve been talking about, equipment placements, tuning of your CRAC zone and so forth? Or do you want to go whole hog and implement some kind of a control scheme that’s based on this?
I will say one thing about the optimization here, and that is the picture I’ve got up here. You know, optimization becomes much more straightforward, a much easier, you know, activity to do if you’ve got a valuable set of data to look at. Numbers are fine, and you know, you can go correlate, you know, where that rack is with where that vent tile is and look and see what the temperatures are, but a visual indication like this picture I have up here is quite valuable. And this is, you know, available from several vendors; this happens to be a SynapSense display that shows inlet temperatures. I can’t remember what level it is; they monitor top, middle and bottom.

But the bottom line is, you know, it’s very quick and easy to see that if red is too hot, green is okay and blue is too cold, you know, we have no place that’s red, so we’re overcooling the data center. You know, we like to be kind of on the edge of that yellow in at least some places here to, you know, minimize the margin that we have out there. We have some areas that are overcooled, and what’s the cause of that? Well you know, we don’t show the CRACs in this particular picture, but knowing where the CRACs are, sure enough those are areas that are very close to CRACs. So cool air is being delivered there at a higher volume; we could probably shut the tiles down in that area somewhat.
So very easy to look at this and know that, you know, we can make some changes here to both save some money because we’re not going to be overcooling and enable better cooling in the rest of the data center.
The right hand end of my floor right there looks like, you know, if I raise my set point much that’s where it’s going to turn yellow. Be nice to get a little bit more cool air down on that end. And I can tell how I’m doing very quickly by just taking a look at these graphics. Okay let’s go ahead and go on to the next slide.
So let’s talk a little bit about the different uses here. So alarming: again, you know, cost is a major driving factor here. If all I want to do is alarm, I want to absolutely minimize my investment here, but I want to make it so that I monitor enough points. And there are many very inexpensive but reliable solutions out there.
So you know, if your interest is just in alarms, you have a wide variety, including just letting the machines alarm themselves. Now you have to set them up and listen to them, but nearly every, you know, piece of IT gear out there now has the capability of broadcasting some kind of an alarm if it sees, you know, air that’s too warm.
You know, I’m very familiar with the HP machines; they absolutely have this capability. And you know, they’ll report and allow you to see all kinds of data. Some of it’s free, comes just standard; some of it you pay for, software using their iLO monitoring system. But you know, as time goes on, this will become more and more common, and you know, more and more cost effective.

In fact, you know, I was chair of a committee for a while; the Green Grid is developing a set of standards for server and IT equipment manufacturers, not just servers, that is meant to standardize the environmental parameters that are monitored and how they’re reported. So things like instantaneous power, you know, cumulative energy use, temperature at the front of the machines, temperature at the exhaust of the machines. I can’t remember the entire list, but basically, you know, most of those things most manufacturers monitor today, including, you know, temperatures all over the internals.
The difficulty in using that information for you know, folks like us is, it’s not standard. So developing a package that goes out, and you know, queries machines is not an easy proposition. As we define better standards and people adhere to those, then it will become much easier to query.
And the guide that’s being developed there is the Data Center Design Guide and it not only covers, you know, the IT equipment that goes in there, it covers the, you know, power and cooling equipment as well.
So UPSs, another great example, the ability to go out and query the UPS, for you know, "What’s your output load, what’s your capability, what’s your input, and therefore you know, what’s your efficiency, you know, what’s your health" — all those kinds of parameters are available on most big, you know, building level UPSs but without standardization they’re very difficult to implement.
So another plug for the Green Grid, you know, please go take a look at, you know, what’s being done there in that Data Center Design Guide. And think in the future, you know, "How am I going to implement this as time moves along?"
If you’re going to do optimization of course you need enough points to give you an accurate profile of what’s going on in the data center. And that’s largely based on density. You know, we typically recommend at least one out of every three racks being monitored, one out of four if it’s a very sparsely populated data center.
But as the density goes up, you know, we quickly can get to every other rack. And if you’re in a very dense environment — I mean, that super compute node that I just mentioned at the university would really benefit from having monitors at every single rack.
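The rack-count guidance above can be turned into a rough sizing sketch. Note that the kilowatt density cutoffs below are illustrative assumptions of mine, not figures from the talk:

```python
import math

def sensors_needed(racks: int, avg_rack_kw: float) -> int:
    """Rough sensor count from the one-in-three / one-in-four guidance.

    The kW density cutoffs are illustrative assumptions, not from the webinar.
    """
    if avg_rack_kw < 3:        # sparsely populated floor: one in four
        per_rack = 1 / 4
    elif avg_rack_kw < 10:     # typical density: one in three
        per_rack = 1 / 3
    elif avg_rack_kw < 20:     # dense: every other rack
        per_rack = 1 / 2
    else:                      # HPC-class, like the 27-35 kW racks mentioned
        per_rack = 1
    return math.ceil(racks * per_rack)

print(sensors_needed(120, 5))   # 40 sensors for a typical 120-rack floor
print(sensors_needed(12, 30))   # 12, i.e. one per rack for the super compute node
```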
And not only, you know, on an aggregate scale (how many racks out of the data center I want to monitor), but placement on the rack is very important.

One of the big issues in that particular case is that the racks that were implemented do not have good seals from the rails themselves to the sides of the rack. And so the high pressure area generated just behind the servers drives hot air back around the sides of the machines and into the front.

Now if I monitor right in the middle of those racks I see a significantly different temperature than if I monitor near the edge where that hot air’s wrapping around. Yet that hot air gets sucked right back into the fans on the outside edges of those servers and passes through.
So the fans ramp way up in speed. And again you know, the total mix of air — that air that I’m supplying — I need to supply some pretty cold air to keep those servers in an operating range that they can tolerate.
Good visualization. You know, again, a plug for that picture that I had up there before; much easier to look at that and get a view of what’s going on than to try and wade through individual data and match that stuff up. It’s night and day. You know, imagine trying to drive a car by looking not down the road but at numbers telling you how far you are from the left and how far you are from the right; you’d never be able to do it. And it’s almost that bad trying to tune a data center.
So you know, a few steps here about optimization. You know, the very first thing that needs to happen in any data center is to create as consistent of an environment as possible. Prudent monitoring will tell you how inconsistent you are, but it won’t help you, you know, get it consistent. You have to go out there and do the hard work, you know, put blanking panels in every slot.

It’s kind of like communes, you know. I go into data centers all the time and I say, "Do you use blanking panels?" "Oh yes, we use blanking panels." And I go out there and 50% of the slots will be wide open. "Well how come these aren’t blanked?" "Oh you know, nobody gets around to it." I heard someone say one time, "Why did the commune fail? Sounds like a great idea." And the answer was, "Because nobody was willing to take out the trash." You know, it’s a menial job. Putting in blanking panels is kind of a menial job, but gee, you know, putting those blanking panels in, making sure that all those gaps in rows are sealed up, all your racks are tied together, you know, sides that got taken off are put back on, cabling is neat in the back — all that stuff just helps immensely to even out the temperature in the data center.

And that’s really, you know, the very first goal: to make the temperature of the air going into the IT equipment be all the same. Once that’s the case, then getting the right airflow there and adjusting the temperature to be correct becomes easy. And I really mean easy. You know, you look at the pictures and you say, "Oh I can see, I’ve got, you know, too big a vent tile opening here, too small a one here." In no time at all you can get that, you know, wrapped right up.
And in the end, you know, what are you trying to do with that? Well, you’re trying to use the least energy possible, so minimizing the total energy. And that’s not necessarily the hottest temperature; that’s the temperature that gives you the minimum energy. Because remember, that IT equipment’s very smart and it’s going to, you know, speed up its fans as it sees fit to account for temperature increases it sees going in, because the only way to get heat out is, you know, a Delta T and the amount of air that goes through there. And if I’ve got too hot an air coming in, I need to flow more of it to take that same amount of heat out of there.
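That Delta T relationship can be sketched numerically. This is a back-of-the-envelope illustration with standard air properties, not a calculation from the webinar:

```python
# Heat removed by an airstream: Q = m_dot * c_p * delta_T.
# For a fixed heat load, required airflow scales inversely with delta-T.

RHO_AIR = 1.2      # kg/m^3, approximate air density at sea level
CP_AIR = 1005.0    # J/(kg*K), specific heat of air

def airflow_m3_per_s(heat_load_w: float, delta_t_k: float) -> float:
    """Volumetric airflow needed to remove heat_load_w at a given delta-T."""
    mass_flow = heat_load_w / (CP_AIR * delta_t_k)  # kg/s
    return mass_flow / RHO_AIR

# A hypothetical 10 kW rack: halving the delta-T doubles the required airflow.
print(airflow_m3_per_s(10_000, 20))  # ~0.41 m^3/s at a 20 K delta-T
print(airflow_m3_per_s(10_000, 10))  # ~0.83 m^3/s at a 10 K delta-T
```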
So a plug for monitoring here: for optimization, you don’t know what minimum energy is unless you’re monitoring energy. So monitoring the thermal environment’s very important. Monitoring the power environment, especially if optimization is your goal, is equally important.
Now PUE is related to total energy but not necessarily, you know, one-to-one. So think about what’s going on in a case where I crank the thermostat up.
First thing that happens is I make my compressors run in a more effective mode. So their COP goes up, total power goes down. Did PUE get better? You bet, PUE got better. I turned my temperature up some more, probably the same thing happened.
But I turn my temperature up some more and now all of a sudden fans all over the data center start to speed up. And those fans have a significant power curve. The power a fan consumes goes as the cube of the amount of air that it moves.
So think about that for a minute, if I double the amount of air going through a server, I take eight times as much power to push that air through there. I can quickly swamp out the COP gain I got in my compressor by that fan energy that I’m seeing in the data center.
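The cubic fan-affinity relationship is easy to sketch; the base power below is just a placeholder number:

```python
# Fan affinity laws: power scales roughly with the cube of airflow.

def fan_power(base_power_w: float, flow_ratio: float) -> float:
    """Fan power at flow_ratio times the baseline airflow."""
    return base_power_w * flow_ratio ** 3

print(fan_power(500, 2.0))   # doubling airflow takes 8x the power: 4000 W
print(fan_power(500, 0.8))   # a 20% flow cut drops power to about half: ~256 W
```

This is also why the under-floor pressure control discussed later pays off so well: even modest reductions in delivered airflow yield outsized fan-energy savings.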
Did my PUE get better? My PUE got a lot better, because now all of a sudden my floor load went way, way up. Even in the newest standard, you know, the best place to measure is energy consumed at the server itself.
While those fans are in the server they get to count as part of that IT load. So despite the fact that my PUE got much, much better, my total energy went up. So PUE is a very important measure for comparing, you know, general data center health; maybe not the best measure for tuning the floor itself. Let’s go ahead and go on.
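This PUE-versus-total-energy effect can be illustrated with made-up numbers; the kilowatt figures below are purely hypothetical:

```python
def pue(it_kw: float, facility_overhead_kw: float) -> float:
    """Power Usage Effectiveness: total facility power over IT power."""
    return (it_kw + facility_overhead_kw) / it_kw

# Before raising the set point (illustrative numbers, not from the webinar):
before = pue(it_kw=800, facility_overhead_kw=400)   # 1.50, total 1200 kW

# After: compressors save 60 kW of overhead, but server fans ramp up and add
# 100 kW, which counts as IT load because the fans live inside the servers.
after = pue(it_kw=900, facility_overhead_kw=340)    # ~1.38, total 1240 kW

print(before, after)  # PUE improved even though total power rose from 1200 to 1240 kW
```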
Okay let’s get to controls. So controls are very, very tied to monitoring, right? The simplest control that you can think of is very simple monitoring, and that is just the monitor point that you’ve got in the CRAC itself.
That’s the standard supply; it’s what every CRAC comes with. It’s got that little thermistor right up there in the corner and it does its best to tell you what kind of job it’s doing. And it may be reporting good information, it may not. You know that picture I showed you before: that was bad information.
There are several downfalls to this, right? One of the biggest is supply temperature varies with load. Remember back on Slide 1 when I said, "Remember this, when you add equipment you should turn the thermostat up?" This is why.
You know ideally in the data center we’d like to supply a constant temperature air. We want the same temperature of air coming out of the floor of every vent in the data center.
Now I might have to go make adjustments to vents and things like that, but if I’m doing a good job of containing the environment to my data center — that is keeping the hot aisle and cold aisle air separated — I want that supply air to be the same temperature all the time.
Every time I add load, the only way the CRAC can take more heat out of that air is to lower its supply temperature, because I’m regulating my return air temperature to the same point. That thermistor right up there in the corner of the machine is in the return air stream. So as load goes up, outlet temperature goes down.
Now the right response to that is to go turn the thermostat up so that I’ve got the same supply air temperature going out. What temperature do I want? Well, I’d like to have about 72, 75, something like that at the inlets of my machines.
What does that relate back to? Well, you know, that might be all the way down to 68 degrees or something like that coming out of the CRACs themselves, but what I’m regulating is the return temperature. And that can be as high as 90 or even in excess of 90 degrees.
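The supply-drops-as-load-rises behavior is just an energy balance. The CRAC airflow and set point below are hypothetical, used only to show the shape of the effect:

```python
# A CRAC regulating its *return* air to a fixed set point must lower its
# supply temperature as the heat load grows, since delta-T = Q / (m_dot * c_p).

RHO_AIR = 1.2      # kg/m^3
CP_AIR = 1005.0    # J/(kg*K)

def supply_temp_f(return_setpoint_f: float, load_kw: float, airflow_m3_s: float) -> float:
    """Supply temperature of a CRAC holding a fixed return-air set point."""
    delta_t_c = (load_kw * 1000) / (airflow_m3_s * RHO_AIR * CP_AIR)
    return return_setpoint_f - delta_t_c * 9 / 5  # delta-T converted to F degrees

# Hypothetical CRAC moving 5 m^3/s with a 75 F return set point:
print(supply_temp_f(75, 30, 5.0))  # ~66 F at 30 kW
print(supply_temp_f(75, 50, 5.0))  # ~60 F at 50 kW: more load, colder supply
```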
So certainly that causes an allergic reaction, you know, in a lot of people to say, "You should have your thermostat set at 85 degrees, 90 degrees, whatever."
But you know, it’s what – it’s just the place that you’re monitoring in the control loop that gives you that reaction. You’re not seeing what these machines are seeing, you’re seeing some number that relates to that but in kind of a far distant land.
And I always use a car example here, and I’ll get to that on the next slide. One of the other pitfalls of having that monitor up there is that’s where humidity is monitored as well. So if you are controlling the humidity in your data center and you go change that temperature set point, suddenly, if I don’t change my relative humidity set point, I’m controlling relative humidity at a different temperature.
That means that the relative humidity of my supply air just changed. So I need to go change my humidity set point as well.
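This follows from the fact that the same air reads a different relative humidity at a different temperature. A sketch using the standard Magnus approximation, with an assumed 12 C dew point (the numbers are illustrative, not from the talk):

```python
import math

def saturation_vp(temp_c: float) -> float:
    """Saturation vapor pressure in hPa (Magnus approximation)."""
    return 6.112 * math.exp(17.62 * temp_c / (243.12 + temp_c))

def rh_at(temp_c: float, dew_point_c: float) -> float:
    """Relative humidity of the same air parcel measured at temp_c."""
    return 100 * saturation_vp(dew_point_c) / saturation_vp(temp_c)

# Same moisture content, very different RH readings:
print(rh_at(32, 12))  # roughly 30% RH at a 32 C (90 F) return
print(rh_at(20, 12))  # roughly 60% RH at a 20 C (68 F) supply
```

So a humidity set point chosen for the return stream does not describe the supply air, which is why the set point has to move when the control point moves.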
Seems like an awful lot of work. Why don’t we just control that outlet temperature? Well, that’s pretty easy to do with a water cooled CRAC, and in fact most of the new manufacturers allow you to control on that side. Sometimes it’s a kit, sometimes you just move the sensor that’s there.
But with DX CRACs, you know, refrigeration CRACs it’s very difficult to control close to the CRAC because you get huge temperature swings when the compressors come on or change stages and when they drop out. And so that causes oscillation in the control system and that’s really the reason that the control ended up where it did.
So poor information here leads to poor operation for all the reasons that I just mentioned. You know, we don’t have pervasive monitoring as standard in most places because it’s expensive. The CRAC is delivered with that control system in it, and geez, it’s pretty cheap just to use it that way. But you can do yourself a big favor by using it correctly. Go out and adjust those temperatures. Let’s go on to the next slide.
All right, let’s get to my car example here. So I know a lot of you are thinking, "Well gee whiz, my data center is running just fine," and that old adage, "If it ain’t broke, don’t fix it," applies. And my answer is kind of, "Well look, it is broke."
And it’s broke because, you know, we’re consuming more energy than we need to. And I don’t want to sound, you know, like efficiency is my cry, but competitiveness is my cry. And part of that is being efficient, especially, you know, for larger providers, colos for instance. I mean, the operating cost of the data center is paramount to making a buck.
You know, your main business may not be being a colo; you may be a bank. Well if you’re a bank, and this is quite common in the financial industry, you know, they don’t look at the cost of the data center because, you know, so many dollars of transactions are going through that if the data center goes down it’s just awful.
But I’ll throw out the car example here as a case in point. Look at the car I have in the upper left. I wouldn’t say it’s one of my favorites, but it’s certainly one I remember. In fact, in the last Cars movie that came out, Cars 2, the bad cars were old American Motors cars like this. It just cracked me up.
And so this thing is a Pacer. We used to call it the Spacer. And the funniest thing, in my opinion, about this car is it’s got a longer door on one side than the other. The two sides don’t even match. So contrast that to, you know, one of my current favorites there in the lower right. That’s a brand new BMW M5.
Now think for a minute about the engines in these cars. You know, there’s no comparison, right? I mean, the Pacer is low powered with crappy fuel economy and probably breaks all the time, versus the BMW, which has probably one of the most sophisticated engines in the world.
Well what’s sophisticated about it? Still has pistons, right? Still has a crankshaft, right? What’s different? The difference is control. The control on that engine is not even in the same universe as the control on the Pacer engine.
You know, the Pacer engine is carbureted; the only feedback really is your foot. Somebody at some point in time put some jets in it that were supposedly right for the altitude that you ran at, and that’s just the way it lives. If you drive into the mountains it runs worse; if you drive it down to the coast it runs worse.
But the BMW on the other hand continuously monitors about a gazillion places and adjusts on the fly. And I contend that as we move forward that’s what the difference will be in data centers, control. To operate in the most effective and therefore the most competitive mode, advanced control is necessary, and it’s coming. Let’s go ahead and go on.
And this is a great example. So it’s not only coming, it’s here, right? This is a big data center, 100,000 square feet; they have over, you know, 3,500 sensor points in this data center. And this is a slide I took from a joint presentation done by SynapSense and PG&E. And it illustrates a couple of things. One is the power of advanced control and the other is how much rebates can swing, you know, the viability of a project like this.
The savings were tremendous here, you know, three quarters of a million dollars’ worth of savings per year, half a million in rebate. So the net customer cost of this was almost nothing compared to the operating cost of the data center.
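A simple payback calculation in the spirit of this case study. The webinar cites only the savings and rebate; the project cost below is a made-up placeholder chosen to land on the 3-month figure:

```python
def payback_months(project_cost: float, rebate: float, annual_savings: float) -> float:
    """Simple payback period in months after a utility rebate."""
    return (project_cost - rebate) * 12 / annual_savings

# Illustrative: ~$750k/yr savings and a ~$500k rebate are from the talk;
# the $687.5k project cost is a hypothetical placeholder.
print(payback_months(687_500, 500_000, 750_000))  # 3.0 months
```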
So anytime you can see an ROI in less than three months, it truly is a no-brainer. And this customer implemented this and is quite happy with it. Now, it’s not, you know, a simple thing to do. I certainly don’t want to oversimplify the case here. It took a lot of work to go in and make this environment, you know, have the integrity it needs. From a thermal view, that is separation of the hot and cold air, blanking panels everyplace, adjusting sensor points.
And it can be as simple as, you know, the sensor strings themselves. So I put my sensor string down the front of the rack. Well, one sensor may be by a machine, just where I want it to be, measuring quite happily. But the other one may be, you know, by a blanked-off spot. There is no machine there, and so the airflow is very low.
In fact the airflow may be reversed. Even though I have a blanking panel in there, if it's not doing the greatest job of sealing the air, some of that warm air sneaks back through. And while it doesn't really affect the air that the servers see, because the vast amount is coming from the floor tile delivery, it does affect the temperature of that sensor.
So you know, it takes some time to get these sensors in the right places, but once you do, it certainly has a stupendous payback here. This is implemented with standard CRACs; you know, the communication is (unintelligible). In fact, let's go ahead and go on to the next slide, Tanisha.
Probably not the most exciting to a lot of folks here, this is a classical control equation, G over 1 plus GH. And the only thing I really want to illustrate here is the C on the right – the thing that I'm trying to control is what I want to measure. And I'm measuring that here with this H function. I'm actually looking right where I care about the temperature. And that's the beauty of having a pervasive monitoring system be the control system.
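For readers who want the equation from this slide written out, the standard closed-loop transfer function being described is the following (a sketch in conventional textbook notation; the symbol names are the usual ones, not taken from the slide itself):

```latex
% Closed-loop response of a feedback control system:
% G(s) is the forward path (controller + plant), H(s) the feedback sensor.
\frac{C(s)}{R(s)} = \frac{G(s)}{1 + G(s)\,H(s)}
% C is the controlled variable (e.g. inlet temperature), R the reference
% (set point), and H is where the measurement is taken -- the point of the
% slide is that H can measure right at the IT equipment inlet.
```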
So, you know, rather than regulating, you know, back at the CRAC – and in fact on the wrong side of the CRAC – we regulate clear out at the temperature of the machines themselves.
And this system also employs under-floor pressure monitoring, another thing that you can certainly monitor and, in this case, you know, use for control. And they use under-floor pressure to control the fan speed. So there's a big, you know, bang for the buck there as well. Just like the servers, where power has a cubic relationship to airflow, so do the CRAC fans, or in this case air handler fans.
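The cubic relationship mentioned here is the standard fan affinity law, and the payoff is easy to see in a couple of lines (an illustrative calculation, not part of the webinar's materials):

```python
def fan_power_fraction(speed_fraction):
    """Fan affinity laws: airflow scales linearly with fan speed,
    but fan power scales with the cube of fan speed."""
    return speed_fraction ** 3

# Slowing a fan to 80% of full speed still delivers 80% of the airflow,
# but the fan draws only about half the power (0.8 cubed is ~0.512).
print(fan_power_fraction(0.8))
```

This is why matching delivered air to actual demand, rather than running fans flat out, is such a large energy win.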
And so minimizing the amount of air delivered, to just match what is necessary at the machine, is vital. So it's kind of a complete process here, right? You go out and put this monitoring in, and you may not choose to control up front; you may just optimize the floor. And that's a very important step, even in control – optimizing the floor.
I'd like to have even pressure everywhere, so I vary the tile sizes to match the amount of load in their proximity, so I'm delivering the air that I need out of those floor tiles. And by actually controlling the pressure under the floor to a constant value, I can go in and make changes almost autonomously to the environment above the floor.
So let's say all of a sudden I do get a big high-density, you know, piece of equipment in and I need to open up those vents more, install new vents in the floor. The CRACs automatically respond by increasing the speed of their fans to keep that under-floor pressure the same. So the air I was delivering out of the rest of the vents on the floor is largely unaffected.
The way that the system works here is it actually adds a loop, if you will, to this control system. If we think of this control system being at the CRAC itself, you know, I'm controlling my temperature of air coming back, let's say.
In this case I think they used supply air temperature instead of return air temperature. So that’s the C on that little controller. And every CRAC’s got one, right? It has a little control board in there and it’s got this function implemented.
Well, what the SynapSense system does, and some others as well, is put another loop around this. So if you will, they take another loop from C through another function and pass it back to the reference. They actually go in and tell the CRAC to change its set point.
So where I had the CRAC set point before, maybe at 80 degrees, you know, I now have a sensor out there on the floor and it’s the sensor that’s farthest away from its set point that actually decides, you know, what the new set point on the CRAC should be.
Something happened, you know, they fired up a job and it’s running and the temperature went up in there significantly, and so I tell the CRAC, "Hey, you have to do something. You have to change your temperature." And I tell it to get a lower set point. So I may have changed it from 85 down to, you know, 83 degrees. And I do this on each individual CRAC for the set of servers that is in its vicinity.
So I kind of have a guideline, of you know, how much air is delivered. I draw a semicircle around that CRAC and I say, "Any sensor that lives in this semicircle has the opportunity to adjust the temperature on that CRAC."
And I set that up in the control system itself. I map sensors to, you know, the CRACs that they're going to be able to control. The same thing is true with under-floor pressure sensors; I map the under-floor pressure sensors to the CRACs that influence the floor in that area.
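The outer-loop behavior described above – the worst sensor in a CRAC's mapped zone nudging that CRAC's set point – might be sketched like this. This is a hypothetical simplification; the function name, parameter names, and step size are mine, not SynapSense's:

```python
def adjust_crac_setpoint(current_setpoint, zone_inlet_temps,
                         target_inlet=80.0, step=2.0):
    """Supervisory-loop sketch: if the hottest mapped inlet sensor in this
    CRAC's zone exceeds the target, lower the CRAC's set point by one step."""
    if max(zone_inlet_temps) > target_inlet:
        return current_setpoint - step  # ask the CRAC for colder supply air
    return current_setpoint  # all inlets within target; leave the CRAC alone

# A hot job pushes one inlet to 84 F, so the CRAC set point drops 85 -> 83.
print(adjust_crac_setpoint(85.0, [76.2, 79.8, 84.0]))  # 83.0
```

The inner loop (the CRAC's own controller) keeps running unchanged; the supervisor only moves its reference.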
And I am quite surprised, actually, that they have as good a success as they do at controlling, you know, under-floor pressure individually. So if you go look at the (VFD) speeds on all these CRACs, they're all different. They're all supplying, you know, air at a different rate based on what it takes to keep the floor pressure constant in the area.
Now the set point on the floor pressure is the same; it's just that we run the fans at different speeds to get the floor pressure that we desire. And they do it several different ways.
The typical way is, you know, just to go through the (unintelligible) BMS; we talk to the BMS via BACnet or something like that. And then it talks out to the individual CRACs themselves. They have dual controls, so they have 100% backup, they recognize failure conditions – they've just done a great job here.
I actually worked a lot with an older system HP had called Dynamic Smart Cooling that was, you know, the same basic idea – a control loop with, you know, sensors all over the place out on the floor – but not nearly as cleanly implemented. This system is quite clean: easy to install sensors on additional racks and map them into the system, and it doesn't take a lot of maintenance.
But still, you know, whenever the data center changes, it bears an examination of the location of sensors in that vicinity and what my floor tile placement is. You know, I still have to do my job to keep my house in order, making sure that I've got, you know, the integrity that I need in my environment to allow the control to do its job.
So, kind of a recap: you know, there are lots of things we can do with monitoring. For the basic alarm functions, you want to go as cheap but reliable as possible.
For optimization functions, you'd like to have as many points as you need to give you an accurate view of what's going on out there; and visualization is very, very important.
You know, data is great, but having that, you know, map – and if we could just go – can we go back a slide here Tanisha? The map here I think is even more dynamic, you know, the hot aisles and cold aisles are just clear as a bell on this map.
And you know, having the temperature – I can see some yellows in there in the hot aisles – I mean in the cold aisles – and that's probably just exactly the minimum energy point for this particular layout. Very, very valuable to be able to look at this, to go make changes out there on the floor.
And finally control, you know, making that jump to control is not an easy decision. You know, it’s – there’s a lot of FUD, fear, uncertainty and doubt, about you know, giving control up from whatever the classical means has been, you know, "If it ain’t broke don’t change it," to this new control methodology.
And I think that one of the things that, you know, is kind of aiding that is to look at building management systems, things that we trust every day to go do controls in our whole building, not just our data center.
You know, it used to be they were standalone, very, you know, specialized pieces of equipment. What are they today? Well, they're programs – programs that run on x86-based workstations, you know, just like the SynapSense program does.
So there’s just not a lot of difference, when it comes down to the, you know, reliability and availability of the control system itself from this older, I’ll call it antiquated technology to new technology, the stuff that’s really going to let us wring the absolute minimum margin that we can out of our data center and be as competitive as possible.
Well gee whiz we have some time for questions here. Do we have any questions?
Tanisha White: Yes, we have a few questions here. And I also want to remind participants that you can use the Chat feature to send in any questions in real-time.
But right now our first question comes from (Scott) about measuring pressure, "What sort of monitoring" – excuse me, "What sort of monitoring is out there for pressure, vibration and airflow in the data center?"
Scot Heath: Well let me address the vibration one first, that’s an interesting case. You know, I’ve read a couple papers about vibration in the data center. And we did some work of course, you know, when I was at HP, around vibration. And vibration is probably the most mythical of all of these parameters that you can go measure.
There are a couple of companies out there selling isolation systems for racks, claiming that data rates can be severely impacted by excessive vibration from, you know, sources like chiller plants and things like that as they start up and shut down.
And I guess I would counter that continuous measurement of vibration is pretty low on my list of things to measure. You know, if you do have an issue – and I'm not saying there is an issue or isn't an issue; you know, we never experienced an issue.
Testing shows that it would take quite a bit of vibration to cause that in the machines that I've looked at. But I don't want to discount the fact that it might happen in somebody's machine.
It's going to be – eventually – a characterized and well-understood effect that you can eliminate kind of in one fell swoop. So needing lots of vibration measurement there, I'd have to say, is not typical.
Now if you still want to go implement vibration monitoring, most of the big monitoring systems, you know, the BMS systems, certainly have the capability of reading any kind of 4-to-20-milliamp signal that you want to give them. And you can get vibration monitors that run on that standard.
Less expensive monitoring systems – and I'll use SynapSense as an example again – they actually have a device, and I'm not coming up with the name of it. I'm looking at one of the engineers here to see if he remembers the little multi-input device that has the 4-to-20-milliamp inputs.
Anyway, it's a brick that you put in, and then you can feed that with almost anything: door open and close – vibration's a great example – different kinds of power pulses. So you may be able to pick up power off an existing meter (unintelligible) pulses, so on and so forth. So that's certainly available.
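Reading any 4-to-20-milliamp transducer like the ones mentioned above comes down to one linear scaling step. A minimal sketch (the sensor range in the example is illustrative, not from a specific product):

```python
def scale_4_20ma(current_ma, range_lo, range_hi):
    """Map a 4-20 mA current-loop signal linearly onto an engineering
    range: 4 mA reads as range_lo, 20 mA as range_hi."""
    if not 4.0 <= current_ma <= 20.0:
        # Below 4 mA usually indicates a broken wire; above 20 mA, a fault.
        raise ValueError("loop current out of range")
    return range_lo + (current_ma - 4.0) * (range_hi - range_lo) / 16.0

# A vibration sensor spanning 0-10 mm/s that reports 12 mA reads 5 mm/s.
print(scale_4_20ma(12.0, 0.0, 10.0))  # 5.0
```

The live-zero at 4 mA is the reason the standard is popular for monitoring: a dead sensor is distinguishable from a genuine zero reading.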
Temperature monitoring – gosh, you know, the gamut. I mentioned NetBotz before; they're a very inexpensive source, all the way up to, you know, pervasive monitoring from a BMS system, which would be a very expensive source.
In fact, we were at a large genetic research installation out in California recently where they quoted us – they're paying about $3,000 a monitoring point to have their BMS come in and implement monitoring. So it can be quite expensive there.
Certainly take a look at SynapSense again. They have a pretty wide variety of, you know, off-the-shelf power options – you can add power meters, (CT)s up to, I don't know what the largest is, you know, whatever (CT)s they make, probably – versus tapping in to existing monitoring. You know, they talk Modbus, so you can go look at existing monitoring if you've got, you know, Modbus-capable monitors out there already.
So you know, lots and lots of choices for hardware to do this.
Tanisha White: All right, thank you. Another question from (Mark) is about air cross-flow, "How does air interact within the data center in terms of cross-flow and can this be monitored?"
Scot Heath: So monitoring flow is very, very difficult. Monitoring pressure – you know, in fact, the pressure sensors that SynapSense uses are flow monitors. Because you have a calibrated orifice inside the device, it knows what the pressure is based on the flow, and it works on differential temperature.
So really knowing that there’s cross-flow there is a secondary effect; that is you look at the temperatures and you deduce that there’s some airflow issue there.
And when you say cross-flow, I don’t know if you mean flow across the front of a machine or if you mean recirculation, you know, where you’re getting hot air coming back to the inlets of other machines. But that’s very obvious.
You know, if you look at temperature profiles – so, the picture I put up before of the racks, the fronts of the racks where they're cool at the bottom, warm at the top – that is clearly a case of recirculation. Warm air is coming over the tops of the racks on its way back to the CRAC, and it's causing a significant temperature differential top to bottom in the rack.
Now remember, one of the first things I said, especially about optimization is you need to clean up your house. And that is, we need to keep good separation between hot and cold.
Containment, even partial containment – you know, curtains over the tops of the racks, end caps on the rows – is probably one of the best uses of your dollar for eliminating those kinds of problems.
Seeing those problems, monitoring for those – the variable, again, that you really care about is, "What temperature am I seeing going into the fronts of the racks?" So detecting cross-flow or recirculation, if that's what you mean, is done by looking at temperatures, not necessarily airflow.
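Detecting recirculation from temperatures alone, as described here, can be as simple as checking the bottom-to-top inlet gradient on a rack. A sketch – the 5-degree threshold is a hypothetical default, not a published guideline:

```python
def recirculation_suspected(inlet_temps_bottom_to_top, max_delta=5.0):
    """Flag likely hot-air recirculation when rack inlet temperature
    rises sharply from the bottom sensor to the top sensor."""
    delta = inlet_temps_bottom_to_top[-1] - inlet_temps_bottom_to_top[0]
    return delta > max_delta

# Cool at the bottom, 11 degrees warmer at the top: classic recirculation.
print(recirculation_suspected([68.0, 72.0, 79.0]))  # True
```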
Tanisha White: Okay, thank you. Here's another question from (Ed) about environmental monitoring, "Can monitoring aid data centers in being environmentally conscious? If so, how?"
Scot Heath: Can you tell how environmentally conscious the data center is by monitoring? Maybe that’s your question? Well, that’s a great question and I’m going to say, "Maybe, I don’t know." I haven’t given that a lot of thought.
You know, there are a lot of thoughts around environmentally conscious. Lots of research groups are devoting a significant amount of time to, you know, net zero data centers, that is, "Are we, you know, really contributing zero waste?"
In fact there was a recent publication on one of the LinkedIn sites, I don’t remember which group, that’s a biofuel data center in the British Isles someplace. I want to say England, but I’m not quite sure about that, where you know, crops are raised and decomposed and methane is burned, blah, blah, blah — quite an interesting operation.
But to tell, you know, via monitoring, that that's happening – you know, you want to look at, you know, what your total carbon output is, and that's hard to tell. You know, am I getting my power from a hydro source, where I have no carbon emissions, or am I getting it from a coal-fired plant, which has a lot of carbon emissions? And you know, does that change minute to minute?
So you know, I can't think of anything straightforward right now, and I'm certainly not going to claim expertise here.
Tanisha White: Okay, thank you for that. This question is from (Leif), "If a pressure spike occurs, can monitoring pick this up and make sense of an equipment failure?"
Scot Heath: Make sense of an equipment failure – so gee, you know, we talked about that quite a bit, looking at profiles. And when I say we talked about that quite a bit, this was back, you know, working with the researchers at HP on the system that we had.
They actually had several examples of failures that they had recorded using their monitoring system. And you know, looking at the data profile there, they could tell certain things had happened.
And one example of that is the grounds crew one day blew, you know, leaves into the cooling towers and clogged up the air intake. And they could tell by the shape of the temperature rise in the data center that something was wrong with the chiller plant.
They probably couldn’t go as far as, you know, "This guy blew leaves in the cooling tower," but they were able to predict that something was happening and get out there and take a look, and then it was obvious what was going on, and saved the center.
And so after that occurred they did other experiments, and you mentioned pressure, so take for instance pulling floor tiles. You know, somebody pulls a floor tile; that's fine in the area where the floor tile's pulled, but three floor tiles away, that may cause a localized pressure drop enough that it's a concern.
And you know, they did it all with temperature monitoring and were able to deduce, from the shape of that temperature profile, where something was wrong on the floor.
And so yes, it's possible. I don't think anyone has got an automated pattern recognition or analysis system to go look at that yet and tell you as an operator, "Hey, you know, you had a belt break," or something else.
Tanisha White: Okay. This question is asking, regarding control from monitoring, "What is the availability of large scale control systems? And how hard is it to get feedback for that control as a backup?"
Scot Heath: Feedback for that control as a backup is typically done with the control that’s already in place. So you know, if you think about the way the SynapSense system works for instance, it’s really just giving commands to that CRAC.
So the alarming features of the CRAC are still intact; the localized regulation is still intact. The danger there would be if you, you know, had a set point that was put in place, the load changed, and the whole, you know, control system went down.
Is there a condition where, you know, with that increased load you could have localized overheating? Sure, absolutely, but the guard against that is multiple servers on the SynapSense system. So we have two sets of servers getting information from all these sense points.
They flag sense points that are bad. The servers talk to each other, and if one server has suffered a, you know, cessation of operation of the control system, the other server takes over.
And then finally, the building monitoring system, or however the communication with the CRACs is done – as I say, you know, that alarming feature is still in place there. So if temperature starts to rise, you know, uncontrollably – and remember, the CRAC is still in control, it's still doing its job – but if you're operating very close, you know, there's a chance that you could go outside of limits.
Now how far outside of limits you're going to go depends on how close you want to run to having zero margin. The machines – IT machines – will typically run all the way up to 90-95 degrees; I would never recommend you run at 95-degree inlet temperatures. You can't stand any kind of an incident if you're already there.
Just think about a momentary power loss. Even if you're, you know, generator-backed, all the CRACs go down, all the compressors shut off – you're not cooling, you know, one molecule there; you're just recirculating stuff.
And you know, you need to be low enough – you have to have enough thermal storage in the components on the floor and in the air that's being circulated that you're able to withstand those kinds of events.
So you know, nothing is foolproof, especially not damn-foolproof. The worst is, you know, some human goes out and does something bad. You can defeat, you know, the most carefully thought-out backup system with enough ingenuity, but my personal opinion is this system is quite robust.
And I’ve looked very carefully at, you know, both this one and the HP system, and I’m very impressed.
Tanisha White: All right, this question comes from (Jim) on standards and best practices, "Are there any standards or regulations in place regarding data center monitoring? What are some best practices in place currently?"
Scot Heath: So no standards. Best practices, publicized best practices, that is a great question. I’m actually not aware of any standards body, you know an ASHRAE or an IEEE or anybody who’s got data center monitoring guidelines for best practices in place. Good question, don’t know.
Tanisha White: Okay. So here's another question about carbon footprints – let me get there – it's from (Renee), "Is there power consumption and carbon footprint mapping down to the rack or appliance levels?"
Scot Heath: Sure, there's absolutely power consumption monitoring down to the rack level. Now, mapping that to carbon depends on the source of the power. So you know, lots of people make plug-level monitored power strips, you know, PDUs that you put in the back of your rack.
SynapSense has got kind of a nice alternative, which is a replacement cord that has wireless monitoring built in, so it doesn't take a network drop, doesn't take a convenience outlet, nothing. You just replace the power cord you've got with this one, and it communicates wirelessly to a base station.
And then of course, for mapping that back to carbon usage, you can monitor instantaneous demand, total energy consumption, all the parameters that go around that.
And then finally, you know, in the machines themselves, as we move forward – as I say, you know, there's not an HP machine out there that doesn't have this information available someplace in there, buried in some secret code. Getting to it becomes much easier as we become more standardized.
Tanisha White: Okay, this will be our last question. Thank you, Scot. And this question comes from (Robert), "Scot mentioned in one of the slides racks too close to the CRAC being starved of cool air due to negative pressure under the floor close to the CRAC. Is there a best-practice distance to avoid lack of cool air from the floor vent?"
Scot Heath: Yes, so you may have misinterpreted what I thought I said, or I may have said the wrong thing. What you're talking about is the low-pressure effect from, you know, the high-velocity air, you know, very close to a CRAC.
And you're absolutely right, that's a case that can certainly happen. You know, as air – sometimes it's called jetting – but as air moves horizontally across the openings of the floor tiles, it causes a pressure drop there and can actually cause reverse flow; that is, air from the data center gets sucked in those vents under the floor.
A colleague of mine, actually one time, was demonstrating this effect with his business card and he laid it down, and it got sucked under the floor and he went, "Oh rats, that’s going to end up in a CRAC someplace clogging up a filter and it has my name on it."
So yes, best practice there is to stay, you know, at least 4 feet away from the CRAC; 6 feet is preferable. But the picture that I showed – and if you can go back to that while I'm talking here, Tanisha – was actually an even worse case than that.
And I don’t know that you could have – no let’s go on back to the one with the temperature stratification on the front of the rack, 4, there it is. Four, one more back.
So the case that was happening here is that, you know, air is actually coming out of those vents. We measured the flow and it was around, you know, 150-200 (CFM). But it doesn’t go into the racks, it goes right back to the CRAC. The CRAC is, you know, creating this low pressure zone on top of itself. That’s where it’s drawing air back in, so it’s just above the filter.
And that cold air – I mean, it has a chance to go through some of the machines, because they also have, you know, fans that are drawing air in. But the vast majority just goes right back to the CRAC. What a convenient thing to do, right? You put vents here and you give it a direct path right back to the CRAC – bad news.
The right thing to have done here – and they could have done this, because this is a brand new row; I mean, this was put in less than a month before I got there, and there was nothing around it – they could have turned this row around.
If you turn this row around, put the vents on the other side, they would have been 6 feet away from the CRAC – more than that, probably 8 feet away from the CRAC, and the hot air exhaust would have been what’s going back to the CRAC, not the cold air on the supply side.
This was just, you know, a case of the poor guys who put the stuff in, you know, didn't realize – or didn't think about, you know, the repercussions of having those vent tiles be close to the CRAC. Bad, bad thing to do.
The good news is, it's a pretty easy solution, because there is adequate air coming out of these vents and they are far enough away. You know, curtains are pretty easy to put in around this thing to contain that cold aisle and keep the cold air going where it should be going when it comes out of those vents.
Well looks like we’re out of time here. Thanks very much for all your questions and attentiveness. Looks like most people stuck around. And have a good day.
Tanisha White: Yes, thank you Scot for a great presentation on today’s topic, Advanced Environmental Monitoring, Control and Optimization, and answering a lot of our questions.
I want to remind our participants that the replay of this webinar will be available within 48 hours on our Web site at www.42u.com. If you feel like your questions were not addressed today, I would like to invite you to call us at 1-800-638-2638 or send your questions via our Project Evaluation Form on our Web site.