Is Entity Framework Hurting Your Database Performance?

An infrastructure guy at Mews, avid cyclist.

Introduction

ORMs are pretty great these days. We depend heavily on Entity Framework and the man-months saved in aggregate are arguably worth the downsides. But, there are downsides and sometimes they are just too big to ignore.

Using Enumerable.Contains? Of course we are!

Consider this innocent piece of code, getting bonuses awarded to three of our employees:

var employeeIds = new[]{ 1, 2, 3 };
var bonuses = dbContext.Bonuses
                       .Where(b => employeeIds.Contains(b.EmployeeId))
                       .ToList();

What SQL query do you think this produces? With the help of the SQL Profiler, we get our answer (adjusted for brevity):

select *
from Bonus
where EmployeeId in (1, 2, 3)

That does not seem surprising, but can you see the problem?

Introducing the plan cache

Stepping back a little, we need to say at least a few words about how the SQL Server is processing queries.

After making sure the query is valid and optimizing it, SQL Server has to decide whether there is already a cached execution plan for the query or a new one has to be built.

Compiling an execution plan is a CPU-heavy process and you want SQL Server to utilize the plan cache as much as possible (assuming the cached plans are good enough).

That’s why parameterization is important – by not hardcoding parameter values in the query text, the same execution plan can be used for all queries of the same shape where the only difference is a parameter value.

That gets us back to our query – no parameters in sight, just hard-coded number literals. After SQL Server compiles an execution plan for this query and executes it, we hit it with another one:

select *
from [Bonus]
where [EmployeeId] in (2, 3, 4)  -- Note the different values.

Unable to reuse the existing execution plan, SQL Server spends more CPU time compiling a new plan every time such a query arrives.

Erin Stellato ran an experiment to measure the impact non-parameterized queries have on CPU and this is what she observed:

The difference between a workload consisting of ad-hoc vs. parameterized queries. Created by Erin Stellato at https://sqlperformance.com/2019/05/sql-plan/perf-impact-adhoc-workload.

Obviously, SQL Server is spending a lot of CPU compiling the execution plans over and over again.

Let’s take a look at how the plan cache looks after we let such queries come at SQL Server for a while. For the purpose of this demonstration, we will run exactly 1000 of them, each one with different values used as employee IDs.

Now let’s examine the plan cache, in particular queries for which we have multiple plans cached. To do that, we can use this query. The result is:

Query hashPlans cachedTotal plan sizeExecutionsSample text
0xABA22…1000655441921000select * from Bonus where EmployeeId in (9003, 5387, 3441, 8494, 4170)

As you can see, after 1000 executions, we have 1000 plans cached. How is SQL Server even able to group the queries together, when the ID values differ across them? That is because it is how the query hash is defined – it disregards any literal values and considers only the query shape. That makes it a very useful tool to investigate, among other things, ad-hoc queries polluting the plan cache.

Another important piece of information is the total plan size – 65 MB in our case. This might not sound like much, but consider that:

  • There will typically be more queries like this running in the system – in our case, ad-hoc queries of this shape came at a rate of about 100 per second.
  • The plans generated by these queries are as simple as they get – an index seek followed by key lookups. More complicated plans will take up more space.

Ultimately, this is 65 MB which should be considered wasted – instead of caching useful information like frequently accessed index data pages, we store many copies of the same execution plan with no reuse possible.

Did you know… cycling is a thing in Mews tech team?

Join Standa for a bike ride or tackle slowdown problems together. Are you up for it?

To summarize the main problems with ad-hoc (non-parameterized) queries:

  • Wasting CPU cycles on repeated compilation of the execution plans
  • Wasting storage on caching the plans

How do I become a better Entity Framework person?

There are several workarounds with various degree of implementation complexity. Here are some of them:

Bucketizing

With this approach, the IN predicate is rewritten (modifying the IQueryable) in such a way that parameters are used instead of literals and to reduce the amount of plans required for different numbers of the values, they are “bucketized”. Typically, powers of two are used as the list sizes, which means there will be just a couple of distinct plans overall. Sentinel values (like NULL or -1 depending on the column type and its nullability) are used to pad the input to the nearest larger list size without changing the query semantics. For example, when three IDs are provided, it would result in the following query:

select *
from Bonus
where EmployeeId in (@p1, @p2, @p3, /* Sentinel value */ @p4)

where NULL would be passed for @p4.

If you’re thinking about using this method, note that the maximum number of parameters is 2100.

Table-valued parameters

TVPs utilize user-defined types and are a popular choice when the client needs to provide multiple rows of data to the server in a single round-trip. In our case, the usage would look like this:

create type EmployeeTableType as table (
    EmployeeId int
);
declare @tvp EmployeeTableType
insert into @tvp (EmployeeId) values (155), (6412), (6525), (6666)
select *
from Bonus b
where exists (
   select 1
   from @tvp
   where EmployeeId = b.EmployeeId
)

The downside might be the complexity of implementation, in particular getting TVP to play nicely with Entity Framework (especially when you are on EF 6). It might be perfectly feasible, I just have not spent enough time exploring this path.

Parameter-splitting function

This is the approach we went with at Mews. Why?

  • Unlike bucketizing, you do not have to implement fallbacks for large input list sizes.
  • Unlike the TVP approach, we could see a clear way to make this work with Entity Framework 6 after a reasonable effort.

The idea is simple – instead of sending a parameter for each value in the list (what the Bucketizing method does), we send one, possibly large argument which contains all of the values concatenated. The SQL query then refers to a splitting function, which unbundles the parameter into individual values. An example query will make it easy to understand:

-- We are using sp_executesql to run a parameterized query.
-- This is the same procedure that Entity Framework uses to run all your queries.
sp_executesql 
 'select *
  from Bonus
  where EmployeeId in (
    select Value
    from string_split(@idList, ',')
  )',
  N'@idList nvarchar(max)',
  @idList = N'7010,3086,7773,7160,7534' 

After running this 1000 times with different values, let’s again inspect the plan cache:

Query hashPlans cachedTotal plan sizeExecutionsSample text
0xABA22…11064961000(@idList nvarchar(max)) select * from Bonus where EmployeeId in (select Value from string_split(@idList, ‘,’))…

Regardless of the number of IDs, the query stays the same. A single plan is generated and reused across all executions. 🎉

But how do we generate such queries with Entity Framework?

A gentle nudge

In essence, we need the Entity Framework to use a custom SQL function instead of the non-parameterized IN predicate, as the code sample demonstrates:

var employeeIds = new[]{ 1, 2, 3 };
var idList = string.Join(',', employeeIds);
// Instead of calling Contains(), we tell Entity Framework to use a parameter splitting
// function to evaluate the collection membership.
var bonuses = dbContext.Bonuses
   .Where(b => dbContext.SplitToIds(idList).Any(id => id.Value == b.EmployeeId))
   .ToList();

Over the years, Entity Framework accumulated enough extension points to make our solution feasible. Start by installing the EntityFramework.Functions NuGet package, which adds support for calling server functions in the code-first model.

Next, we will define the SplitToIds method which is going to be used inside the LINQ queries:

// Inside your DbContext class
//
[Function(FunctionType.TableValuedFunction, name: "SplitToIds", namespaceName: nameof(MyDbContext), Schema = "dbo")]
public IQueryable<IdSplitResult> SplitToIds(
    [Parameter(DbType = "nvarchar", Name = "string")] string @string,
    [Parameter(DbType = "nvarchar", Name = "separator")] string separator)
{
    var sourceParameter = new ObjectParameter("Source", @string);
    var separatorParameter = new ObjectParameter("Separator", separator);
    return ObjectContext.CreateQuery<IdSplitResult>($"[{GetType().Name}].string_split(@{sourceParameter.Name}, @{separatorParameter.Name})", sourceParameter, separatorParameter);
}

And a class which represents the database type of the function result:

[ComplexType]
public class IdSplitResult
{
    public int Value { get; set; }
}

Finally, to make EF register the custom function and the result type, add this to your OnModelCreating overload:

protected override void OnModelCreating(System.Data.Entity.DbModelBuilder mb)
{
    base.OnModelCreating(mb);
    mb.AddFunctions<MyDbContext>();
    ...
}

That’s it! Using the SplitToIds function in the way shown above, Entity Framework generates the following query:

sp_executesql
  N'select *        
    from Bonus b
    where exists (
        select 1
        from string_split(@idList, N'','')
        where Value = b.Id
    )',
  N'@idList nvarchar(max)',
  @idList = N'1,2,3'

Our experience

After rolling this change out, we observed a significant drop in the number of execution plans SQL Server compiles per second:

The number of SQL query compilations per second.

The end result is

  • Less memory wasted for plans used just once. In our case, it was about 700 MB in the prime real estate of SQL Server’s RAM.
  • More CPU available for doing useful things instead of recompiling the same queries repeatedly.

Is it all rainbows and unicorns from now on?

Though I would argue in most cases it is worthwhile to address the issue of non-parameterized queries bloating your plan cache, you should still consider all of the consequences. In particular:

  • You will most likely be using a table-valued function (like string_split) for splitting the parameters. SQL Server does not maintain statistics for those and has a hard-coded number of rows it expects to be returned from them. In our case, it is 50 rows, but I think this is different across SQL Server versions. Why does it matter? As with all potentially bad estimates, this might lead to a bad execution plan being generated. In particular, the small number of rows estimated means that index seeks will probably be preferred over index scans even if the actual number of values warrants the latter.
  • Now that the queries are parameterized, they are subject to parameter sniffing. Once such a query is seen for the first time, a plan is compiled based on the actual parameters provided and all subsequent executions will use that plan, regardless of their parameters (unless the plan is evicted from the cache for any reason).
  • Based on the complexity and frequency of the queries this change affects, you might suddenly see those pop up in database performance monitoring tools, in particular those using Query Store data. That is because now that the queries are parameterized, all of the executions will be aggregated under a single query ID. I would say this is a good thing, but better to be prepared for it.

Are we done here?

We made it. Our plan cache is lean again and SQL Server does not waste CPU on repeated compilation of the same execution plans.

But will you and your teammates be content with having to remember to not use the Contains method in LINQ queries and replacing it with SplitToIds instead? I wouldn’t. Is there a better way to do it? Find out in part 2 of this blog post, coming soon! 💪


Learn more

  • There is an Optimize for ad-hoc workload option in SQL Server by which you instruct it to only save a compiled plan after it sees a specific query for the second time. It still compiles every query string as it arrives, you just save some memory.
  • You can also use Forced parameterization, which makes SQL Server replace every literal with a parameter automatically. That means better plan reuse and reduced memory and CPU usage. However, it might not be enough when dealing with the IN predicates – you will get one plan per each input list size, so you might still have to at least employ some kind of bucketizing.

An infrastructure guy at Mews, avid cyclist.
Share:

More About

Cookie Management

Here you can manage your preferences regarding cookies: