question

chrisrdba avatar image
0 Votes"
chrisrdba asked chrisrdba commented

Best way to export data with LOB columns.

Greetings. I need to export several tables with LOB columns into new tables for long winded reasons. My current example has two XML columns on it, is 10 million records, and took about 12 hours to load through a job that calls SSIS.

A few fun facts:

  1. Nobody else is on the box, resources are fine, no blocking, etc.

  2. The source and destination are on the same box, the only real difference is they're on different disks, but this is a VM anyways, so not sure that really matters.

  3. From googling I'm under the impression this is totally common. I've seen a couple suggestions, but not really clear on a common solution.


Questions:

  1. Are there any settings within SSIS itself I can tweak to speed this up?

  2. Is there an better option than SSIS -- Bulk Insert, Insert...Select, etc.?

  3. Is there any reason to think something goofy like export all columns but LOB columns first, then do an update for LOB columns only would be faster?

  4. Because resources are fine, is there any reason to not export several tables at the same time, as there won't be resource contention?

  5. Editing the source to be an actual query instead of just relying on the GUI, I can convert one of the columns as such. If I use 1024, I get a warning about the column being too small for XML, but if I stick to 2048 I don't. Is this safe? Will it perform better?

    convert(varchar(2048), [RequestHashInfo]) as RequestHashInfo,

  6. My wait types are constantly ASYNC_NETWORK_IO -- why would this be under these circumstances?

Thanks!








sql-server-general
· 1
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

I've proven question # 5 won't work, disregard.

0 Votes 0 ·
ErlandSommarskog avatar image
0 Votes"
ErlandSommarskog answered

My wait types are constantly ASYNC_NETWORK_IO -- why would this be under these circumstances?

ASYNC_NETWORK_IO means that SQL Server is waiting for the client to pick up data. It could be due to slow network, but more likely the client is slow with processing the data.

I'm not sure why you brought in SSIS here. It sounds like you are copying the data to tables in the same database. In that case, it is probably not a good idea to extract the data to a client.

I would do this with INSERT SELECT, having only the clustered in place, adding non-clustered indexes once data has been copied.

I would also considering inserting the rows in batches to keep down the strain on the transaction log. But it is important that the batches are defined through the clustered index, since you will repeat a scan again and again. How many rows there should be per batch, depends on the size of those LOBs. Look at the total size of the table with sp_spaceused (the reserved column). Then compute the batch size, so that a batch is 500 MB in size on average. (There is no science behind that number; but I picked something which is not too big, nor too small.)

5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

chrisrdba avatar image
0 Votes"
chrisrdba answered chrisrdba commented

I started wondering more about this and did the following:

  1. Modified the SSIS Package to be an Execute SQL Task, containing the same INSERT...SELECT statement I was already using.

  2. Deployed the Package to the same server.

  3. Ran the job.

It ran in 1 hour. Like the INSERT...SELECT from a job directly on the server, the plan had parallelism, and the wait types were mainly pageLatchIO_% and MEMORY_ALLOCATION_EXT.

It would appear this isn't an issue with SSIS in general, but instead just a Data Flow Task issue.

I have no idea how/ why doing the same export from a different Task produced such a radically different outcome, but it did.



· 2
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Well, I don't know SSIS, but Execute SQL Task sounds like something that runs an SQL Statement, so that is just a more fancy way than running it from SSMS directly. Whereas the Data Flow Task, apparent reads all the data into SSIS and sends it back. Which certainly can make sense if you are moving data from one data sources to another, not the least if the data sources are from different architectures and you also want to apply some transformations along the way.

0 Votes 0 ·

Yeah that's about the idea. The original motivation for using it was because it's an easy way to load several tables at the same time if desired, use precedence constraints for loading in a specific order, etc.

Now that the wait types have changed -- network is out and each table can just gobble up all the RAM while using parallelism I'll just create a job with INSERT....SELECT'S and not bother with SSIS.

I've never seen this much of a profound difference between the two load types, and suspect it's in part due to the XML columns in some way.

0 Votes 0 ·
chrisrdba avatar image
0 Votes"
chrisrdba answered ErlandSommarskog commented

My 12 hour SSIS Package went down to a 1 hour INSERT...SELECT.

I had no idea it would make that much of a difference.

I'm off to enjoy the remainder of the weekend -- you do the same, and thanks!

· 1
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

I'm glad to have help you to be enjoy at least part of the weekend! ...and I may not be in a hurry to lead SSIS.

0 Votes 0 ·
chrisrdba avatar image
0 Votes"
chrisrdba answered chrisrdba commented

Thanks Erland -- I was wondering if this would be faster and coded it right after I posted the question. It's not done yet but looking very hopeful.

Question just to satisfy my curiosity -- how/ where is ASYNC_NETWORK_IO coming into play? Literally the only piece besides the source/ destination DB is where the SSIS Package exists and is running the job from, but I can't imagine SSIS causes this overhead itself, does it?

· 3
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Since I don't know SSIS, I have no reason not to give it the blame. :-)

But of course, is you run SSIS on a separate machine which is connected to the SQL Server machine with a 9600 baud line, that could be the reason. :-)

0 Votes 0 ·

If you are running the package on your client (in SSDT for example) - then the data is extracted from SQL Server to your client and then loaded from your client back to SQL Server. If SSIS is actually running on a server - but is not the same server as the instance, then the same operations apply. And finally - if the package is run on the same server - you are still extracting out of SQL to the OS and back to SQL which will use the network.

That can be avoided by using a different source and destination specific to SQL Server - but for something on the same instance there isn't any reason to use SSIS. As Erland stated - insert/select and batching would be much better.

0 Votes 0 ·
chrisrdba avatar image chrisrdba JeffreyWilliams-3310 ·

Jeffery I just spotted this -- see below for more info.

0 Votes 0 ·