Database 101-6: What does Page Life Expectancy (PLE) do to your database?

By Ryan Huber

A continuation of the blog post Database 101-5: How does the processing power (CPU) and memory affect the database performance?

Page Life Expectancy

So how can you tell if you’re giving SQL a big enough “desk”? There are different metrics around memory usage that can help you figure that out, but the most common one is called Page Life Expectancy. 

This metric measures, in seconds, how long a data page is expected to stay in the buffer pool before it stops being accessed and gets flushed out to make room for new data. 

In our real-world example, it would be how long a single piece of paper from the cabinet can stay on your desk before you don’t have room for it anymore and have to put it back in the cabinet. 

So what’s a good PLE number? 

A long time ago, Microsoft published a standard threshold of 300 seconds (5 minutes) for PLE. If your SQL instance held pages in the buffer pool for longer than that, your server was in good shape. If it was lower than that, you needed to do some investigation. And modern SQL monitoring tools, if any of you are using them, still stick with that number for the most part. 
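If you'd rather see the raw number yourself than trust a monitoring tool's dashboard, PLE is exposed through a dynamic management view. This is the standard query against sys.dm_os_performance_counters; the LIKE on object_name is there because the counter prefix varies with the instance name:

```sql
-- Server-level Page Life Expectancy, in seconds
SELECT [object_name],
       cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Page life expectancy'
  AND [object_name] LIKE N'%Buffer Manager%';
```

Run it a few times over the course of a busy period; a single snapshot right after a big report can look far worse than the server's steady state.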

But that number is completely outdated.

When the metric was conceived, an average server had about 8 GB of RAM, which meant that if PLE was at 300, you were reading about 6.5 GB of data into the buffer pool every 5 minutes. 

Today, a medium-sized SQL instance could be using 128 GB of RAM, which would equate to cycling through over 100 GB of data every 5 minutes. 

That is tremendous memory pressure, and in turn a huge load on your disks as they try to keep up with the I/O. And this doesn't even take into account newer hardware and VM configurations, which use NUMA. 

What is NUMA? 

NUMA stands for non-uniform memory access. I know that probably means absolutely nothing to most people, so let me translate. Imagine you have 8 workers all trying to do their work on the same desk. 

It’s very likely they’re going to be getting in each other’s way a lot. So to fix this, you take a saw and cut the big desk up into 8 smaller desks so that each worker can do what they need to do without interfering with the others.

This is what NUMA does. SQL carves up the RAM so that each processor gets its own section of the buffer pool to work with, and the CPUs can complete their work without contending with each other for memory resources. 

Now, imagine a situation where 7 of the workers are just doing their work leisurely, grabbing paper as they need and having ample space to work. 

But one of the workers is completely overloaded, has 20 folders of paper on his desk and is running back and forth to the cabinet constantly because he just doesn’t have enough room. 

This exact type of situation could be occurring on your SQL instance. 

For example, say you have a particular report using a scalar function, which keeps that transaction from being parallelized across all of your CPUs. So a single CPU is just churning through this huge query while the others sit there with ample memory.

Because the server-level PLE is still high (it’s an average of PLE across all of the NUMA nodes), you think everything is running great. But you have a front-end user whose transaction is stuck on this one CPU, sitting there wondering what the heck is going on. 

In this situation, there’s a good chance you might not even notice the contention.
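One way to catch it is to look at PLE per NUMA node instead of the server-level number. The same DMV exposes a per-node counter under the Buffer Node object, with the node ID in instance_name. If one node's value is dramatically lower than the rest, that's your overloaded desk:

```sql
-- Page Life Expectancy per NUMA node;
-- one node far below the others points to local memory pressure
SELECT instance_name AS numa_node,
       cntr_value    AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Page life expectancy'
  AND [object_name] LIKE N'%Buffer Node%'
ORDER BY numa_node;
```

On a single-node server this returns one row that matches the Buffer Manager value, so the query is harmless to run anywhere.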

And if you’re waiting until your PLE hits 300, then you’re really not noticing.   

So then what is a good threshold for PLE? 

There really isn’t a good across-the-board answer anymore, which is why Microsoft hasn’t published any new guidance in the last 20 years. But some SQL Server MVPs (i.e., really smart people who get paid a lot of money to know SQL) have studied this and come up with a metric that I’ve found to be much more appropriate. 

You take the size of your buffer pool in GB, divide it by 4 (a likely RAM size for a NUMA node these days) and then multiply that by 300 secs for your new threshold. 

(GB/4) * 300 = PLE Threshold

As an example, if your server has a 64 GB buffer pool, that’s a PLE threshold of 4,800 seconds. For the non-math majors in the audience, that’s 16 times the old 300-second threshold. 
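To put the formula into practice, you can compute the adjusted threshold straight from the server's current buffer pool size. This sketch reads the Buffer Manager's "Database pages" counter (a count of 8 KB pages currently in the buffer pool) and plugs it into the formula; treat the 4 GB NUMA-node divisor as the assumption it is and adjust it for your hardware:

```sql
-- Adjusted PLE threshold: (buffer pool GB / 4) * 300
-- cntr_value is a count of 8 KB pages, so * 8 / 1024 / 1024 converts to GB
SELECT cntr_value * 8.0 / 1024 / 1024             AS buffer_pool_gb,
       (cntr_value * 8.0 / 1024 / 1024) / 4 * 300 AS ple_threshold_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Database pages'
  AND [object_name] LIKE N'%Buffer Manager%';
```

Compare your observed PLE against that number, not against 300.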

If you’re not investigating SQL performance issues until PLE hits 300, then you’re likely missing a lot of issues. 

And if your server’s PLE metric is hovering in that range on a consistent basis under load, and there’s nothing egregious running (like huge reports), then that server is a good candidate for an upgrade. Which leads to our next topic… 


Ryan Huber is the Senior Database Architect at SoftWriters, a Pittsburgh-based company that offers pharmacy management software solutions to pharmacies serving the long-term care community. Ryan is our in-house SQL guru, providing technical expertise in every facet of database operations, including architecture and code review, resolving performance issues for customers, and formulating a strategy for existing and future product features and enhancements.