Database 101-2: Why should I care about statistics and index fragmentation?

by Ryan Huber

A continuation of the blog post Database 101-1: Tables and Indexes

Statistics

Now that you have your returns ordered, and that sheet of paper to help you do searches “By State”, there’s another method you might use to save time when searching. 

Let’s say you take another piece of paper and write down how many folders you have in the 2016 drawer, and inside the “A-H” folder there are 52 returns, inside the “I-K” folder there are 180 and so on.

Non-clustered Indexes.png

With this piece of paper, you can quickly determine how many returns you’re going to have to leaf through in order to find the return you’re looking for.

If you know ahead of time that there are only 2 returns in the folder you’re targeting, you know that search won’t take very long.

But if you know there are 500 in the folder you’re targeting, you know that search might take you a while and you might decide to grab a coworker to help you out.

In SQL, this type of data tracking is called Statistics. SQL keeps these data distribution statistics on every table and index in your database. And the optimizer makes use of this data when formulating an execution plan to decide exactly what type of operation to use and the resources required to complete that query. 

If SQL knows it has to read a large set of rows to return the right data, a full table or index scan is often best. If SQL knows the data set to return is very small and targeted, an index seek is best. Without table and index statistics, SQL can’t make good decisions when making execution plans. 

Index Fragmentation

Back to our example…now that you have these things in place, everything runs very smoothly because everything is neatly ordered. But your business isn’t static. As time goes by, you get new customers who bring their old returns with them that you now need to keep on file. And (hopefully not too much), you lose customers who take their returns with them that you remove from the cabinet.

Because of this data movement in and out, you might have some folders that didn’t have room for more returns, so you put them in a new folder at the back of the drawer. Other folders may only have 1 or 2 returns in them now because others were removed. 

The “By State” sheet of paper on the outside of the cabinet now has crosses through the old entries and new entries added at the end of the list. And your statistics sheet is now so out of date that you have no idea how many returns you have in folders or overall. 

As things get messier, you now can’t find anything efficiently.

In SQL terms, this situation is called index fragmentation. In our real world example, to fix the situation you would take all of the paper out of the drawers and cabinets and reorder everything again. 

SQL Server is no different. When you rebuild a clustered index, SQL literally takes every data page and does a complete re-sort of the data. If you neglect this maintenance, your database performance can get very bad, very fast. 

Ryan Huber is the Senior Database Architect at SoftWriters, a Pittsburgh-based company that offers pharmacy management software solutions to pharmacies serving the long-term care community. Ryan is our in-house SQL guru, providing technical expertise in every facet of database operations, including architecture and code review, resolving performance issues for customers, and formulating a strategy for existing and future product features and enhancements.