SSIS Lookups modes or leaky abstractions
I've got a question about SSIS Lookup,
how-to change Partial or No_Cache mode in Lookup task ?
I could give a quick answer, but wanted to explain what I think we did wrong in SQL 2005, and how are we planning to fix this. Before continuing, I recommend everyone to read this article:
Joel Spolsky, The Law of Leaky Abstractions
What that article has to do with SSIS Lookup and the question in particular?
If you look at the Lookup UI, you wound not find any reference to Partial or NoCache mode (it does show up in Properties panel, but most users miss it). The idea was that you don't have to know about all these details. The UI abstracts away these modes by providing advanced options, where you can enable Memory Restrictions. If have enough memory, you would keep defaults. But if you have less memory, you would enable Memory Restrictions, and specify how much memory Lookup should use. It is all clean and simple, and ideally you don't need to know about all these internal modes and how Lookup operates. I wish :).
And of course, like most abstractions, this one is very leaky. It is even mentioned in some Books Online articles, even though BOL does not mention the Lookup modes when describing Lookup Editor.
Internally, SSIS Lookup has three caching modes: Full, Partial and None. They dramatically change how the Lookup operates. Full cache is used if you don't check Enable memory restrictions, None if you check Enable memory restrictions but don't select Enable caching, and Partial cache is used when you check both Enable memory restrictions and Enable caching.
In Fully Cached mode, Lookup fetches all the data into memory first, then processes its input, matching (using internal binary comparison) input data with reference data. Obviously, SQL is not used anymore to process all this data. So you don't need SQL indexes, but you better have enough RAM to hold all the reference data.
In Partial or No Cache mode, Lookup does not fetch any data in advance. It starts with empty internal cache. For each input row, it looks into the cache, and then if it does not find data, it executes SQL SELECT statement and saves the result in the cache. You set the size of the cache for Partial cache mode in the Lookup properties. For No Cache mode, only the very last row is cached (which makes No Cache a bit of confusing name, but it's been decided it does not make much sense to throw away this last row, the memory for it has already been allocated anyway). Obviously, you need an index on the reference table for this to perform well. Also, SQL uses its own collation rules that are usually different from internal binary comparison.
Now you probably see why this abstraction was very leaky. The results could be different (SQL collation vs. binary collation), you need to create indexes for partial/no-cache mode, etc. And hopefully, now you know enough about what is hidden by this abstraction to use Lookup efficiently and correctly.
P.S. In SQL 2008, we decided to stop hiding this abstraction, and provide direct access to the Lookup cache modes. Here is the first page of new Lookup editor:
I'll probably blog about other new items you see here - Cache connection manager and handling of no-match entries later.