Tuesday, February 14, 2012

Distributed SSIS server setup

I work in my organization's warehouse team and we are rearchitecting our hardware. One of the ideas on the table is to devote a server to running ETL. There would be several beefy database servers, and a separate server that would run the ETL (SSIS). I'm wondering if that actually hinders the process.

The DB servers will still do all of the query processing, so that isn't offloaded to the ETL server. Then the recordsets, I think, would need to go over the network to the ETL server for the SSIS packages to work on, and then make another trip over the network back to their destination. Would it be better to place the ETL process on either the source or destination DB server? What could be gained by using hardware in the way described here?

Thanks.

Hi, it's a good question, but at the same time there is no perfect answer. It all depends. In a perfect world, everything on one understressed server would be screaming fast: unlimited hard drive space (and high RPMs), bucket loads of free RAM, minimal network interaction.

So it depends a lot on how stressed you feel the box would be, AND its expected growth.

Are you running ETL jobs during most of the day?

Are they small but frequent or infrequent but intensive?

Capture performance with perfmon or some other tool, even just basic counters like CPU utilization and free memory, for at least a day, and see where the peaks are and how long they last.
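Once you have the counter samples captured (perfmon can export them as CSV), the "where are the peaks and how long do they last" question is easy to answer programmatically. A minimal sketch, assuming samples taken once a minute and an 80% threshold for "stressed" (both numbers are assumptions, tune them to your environment):

```python
SAMPLE_INTERVAL_SECS = 60   # assumed: one sample per minute
PEAK_THRESHOLD = 80.0       # assumed: percent CPU considered "stressed"

def find_peaks(samples, threshold=PEAK_THRESHOLD):
    """Return (start_index, duration_secs, max_value) for each run of
    consecutive samples at or above the threshold."""
    peaks, run_start = [], None
    for i, value in enumerate(samples):
        if value >= threshold and run_start is None:
            run_start = i                      # a peak period begins
        elif value < threshold and run_start is not None:
            run = samples[run_start:i]         # the peak period just ended
            peaks.append((run_start, len(run) * SAMPLE_INTERVAL_SECS, max(run)))
            run_start = None
    if run_start is not None:                  # peak still running at end of capture
        run = samples[run_start:]
        peaks.append((run_start, len(run) * SAMPLE_INTERVAL_SECS, max(run)))
    return peaks

cpu = [20, 35, 90, 95, 88, 40, 30, 85, 82, 25]
for start, duration, peak in find_peaks(cpu):
    print(f"peak starting at sample {start}: {duration}s long, max {peak}%")
# → peak starting at sample 2: 180s long, max 95%
# → peak starting at sample 7: 120s long, max 85%
```

A few long, rare peaks argue differently than constant moderate load: the former might be fine to absorb on a shared box, the latter is what pushes you toward a dedicated ETL server.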

Eggs in one basket and downtime are other things to consider. A huge downside of the single server is that if it smokes out, you lose everything.

If an isolated ETL box dies, all your other operations keep going. So now a question for your question is: what are your recovery/backup plans, needs, and desires?

Do you have the hardware to keep a duplicate of each type of server that you can work into production quickly if need be? Perhaps a good approach is to have the DBs and the ETL process on multiple boxes, with one as the primary DB server and the other as the ETL server. If something happened to the ETL server, you could in a pinch get the ETL process up and running fast on the DB box, because all the needed bits are already there; you just need to make it 'live'. Similarly, assume your ETL box has mirrored (in some way) the data that's on the DB server; then if the DB server has problems, you could get that process live quickly from the ETL box.

Again, there is no one perfect answer. Hope this was useful food for thought.

|||

A distributed SSIS farm works quite well for micro-batch ETL (sub-day processing). In Craig's terms, the job resource-utilization profiles are "small (data-wise), frequent AND intensive": not much RAM per job (~200MB), lots of CPU, and lots of jobs (thousands to millions per day). This is a generalization, since there are certainly some packages which run on a less frequent schedule and have different resource-utilization profiles.

The distributed SSIS server architecture can scale out nicely, too. There is custom work up front to get it going, because there needs to be some type of queueing mechanism (a database table, MSMQ, a Service Broker queue, the file system) which farms out the work, operating outside of and interacting with the SSIS infrastructure.
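To make the "database table as queue" option concrete, here is a minimal sketch of how each ETL server might atomically claim the next pending package run before launching it (e.g. via dtexec). This uses SQLite purely for illustration; the table, column, and package names are hypothetical, not part of any real SSIS infrastructure:

```python
import sqlite3

# Illustrative queue table; in practice this would live on a SQL Server
# instance shared by all the ETL servers in the farm.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE etl_queue (
    id         INTEGER PRIMARY KEY,
    package    TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'pending',
    claimed_by TEXT)""")
conn.executemany("INSERT INTO etl_queue (package) VALUES (?)",
                 [("LoadCustomers.dtsx",), ("LoadOrders.dtsx",)])

def claim_next(conn, worker):
    """Atomically claim the oldest pending job for this worker.
    Returns the package name, or None when the queue is empty."""
    with conn:  # one transaction: the select-and-update must not race
        row = conn.execute(
            "SELECT id, package FROM etl_queue "
            "WHERE status = 'pending' ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE etl_queue SET status = 'running', claimed_by = ? "
            "WHERE id = ?", (worker, row[0]))
        return row[1]

print(claim_next(conn, "etl-server-1"))  # → LoadCustomers.dtsx
print(claim_next(conn, "etl-server-2"))  # → LoadOrders.dtsx
print(claim_next(conn, "etl-server-1"))  # → None (queue drained)
```

On SQL Server proper, the claim step is usually done with a single `UPDATE ... OUTPUT` (or `READPAST` locking hints) so multiple servers can pull work concurrently without blocking each other; the transaction-wrapped select-then-update above is the portable stand-in for that.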

As far as mirroring dimensional data out to the SSIS servers (in an effort to reduce network traffic), something to consider is SQL Compact. SQL Compact is not 64-bit compatible (yet), but it looks like a good option if and when a 64-bit version arrives. If not, something that works right now is a local SQL installation alongside SSIS, which each ETL server uses to "assist" in areas like caching stock (relatively non-volatile) data.
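The local-cache idea amounts to copying the slowly changing dimension rows onto each ETL server so that key lookups during a package run stay off the network. A minimal sketch, again using SQLite as a stand-in for the local SQL instance; the dimension and column names are invented for illustration:

```python
import sqlite3

# Stand-in for the local SQL installation on each ETL server.
local = sqlite3.connect(":memory:")
local.execute("""CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    sku         TEXT NOT NULL)""")

def refresh_cache(rows):
    """Re-copy the (relatively non-volatile) dimension from the central DB.
    Run this on a schedule far less frequent than the package runs."""
    with local:
        local.execute("DELETE FROM dim_product")
        local.executemany("INSERT INTO dim_product VALUES (?, ?)", rows)

def lookup_key(sku):
    """Resolve a business key to a surrogate key locally, with no
    network round trip to the central DB server."""
    row = local.execute(
        "SELECT product_key FROM dim_product WHERE sku = ?", (sku,)).fetchone()
    return row[0] if row else None

refresh_cache([(1, "SKU-100"), (2, "SKU-200")])
print(lookup_key("SKU-200"))  # → 2
print(lookup_key("SKU-999"))  # → None (not in the cached dimension)
```

The trade-off is staleness: this only makes sense for dimensions that change far less often than the cache refresh interval, which is exactly the "stock, relatively non-volatile" data described above.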
