Tuesday, March 27, 2012

Do I really need a cursor?

I've built an application to import transactions into the database. Bad transactions go in a separate table and dupe transactions get updated. Currently, it takes about 2 hours to import ~40K records using the code below. Obviously I'd like this to run as fast as possible and since cursors are a real drag I was wondering if there was a more efficient way to accomplish this.

DECLARE
    @contact_id int,
    @product_code char(9),
    @status_date datetime,
    @business_code char(4),
    @expire_date datetime,
    @prod_status char(4),
    @transaction_id int,
    @emailAddress varchar(50),
    @journal_id int

BEGIN TRAN
DECLARE transaction_import_cursor CURSOR
FOR SELECT transaction_id, product_code, emailAddress, status_date, business_code, expire_date, prod_status
    FROM transactions_batch_tmp
OPEN transaction_import_cursor
FETCH NEXT FROM transaction_import_cursor INTO @transaction_id, @product_code, @emailAddress, @status_date, @business_code, @expire_date, @prod_status
WHILE (@@FETCH_STATUS = 0)
BEGIN
    SELECT TOP 1 contacts.contact_id AS contact_id, transactions_batch_tmp.status_date AS status_date,
        transactions_batch_tmp.product_code AS product_code, transactions_batch_tmp.business_code AS business_code,
        transactions_batch_tmp.expire_date AS expire_date, transactions_batch_tmp.prod_status AS product_status
    FROM transactions_batch_tmp INNER JOIN
        journal INNER JOIN
        contacts ON journal.contact_id = contacts.contact_id ON transactions_batch_tmp.emailAddress = contacts.emailAddress AND
        transactions_batch_tmp.product_code = journal.product_code INNER JOIN
        products ON transactions_batch_tmp.product_code = products.product_code
    WHERE rtrim(ltrim(contacts.emailAddress)) = @emailAddress AND journal.product_code = @product_code
    ORDER BY transactions_batch_tmp.status_date DESC

    IF @@ROWCOUNT = 0
    BEGIN
        PRINT 'NEW transaction! ' + @product_code + @emailAddress
        INSERT INTO journal (contact_id, product_code, status_date, business_code, expire_date, entryTypeID, product_status, date_entered)
        SELECT DISTINCT rtrim(ltrim(contacts.contact_id)) AS cid, rtrim(ltrim(products.product_code)), transactions_batch_tmp.status_date,
            rtrim(ltrim(transactions_batch_tmp.business_code)), transactions_batch_tmp.expire_date, 21,
            rtrim(ltrim(transactions_batch_tmp.prod_status)), getDate()
        FROM contacts INNER JOIN
            (transactions_batch_tmp INNER JOIN products ON transactions_batch_tmp.product_code = products.product_code)
            ON contacts.emailAddress = transactions_batch_tmp.emailAddress
        WHERE transactions_batch_tmp.transaction_id = @transaction_id
    END
    ELSE
    BEGIN
        --print 'UPDATE transaction! ' + @product_code + @emailAddress
        UPDATE journal
        SET status_date =
            (SELECT max(tmp.status_date)
             FROM transactions_batch_tmp tmp, contacts c, products p, journal j
             WHERE tmp.emailaddress = @emailAddress
               AND tmp.emailaddress = rtrim(c.emailaddress)
               AND c.contact_id = j.contact_id
               AND j.product_code = @product_code
               AND j.product_code = tmp.product_code)
        FROM transactions_batch_tmp tmp, contacts c, products p, journal j
        WHERE tmp.emailaddress = @emailAddress
          AND tmp.emailaddress = rtrim(c.emailaddress)
          AND c.contact_id = j.contact_id
          AND j.product_code = @product_code
          AND j.product_code = tmp.product_code
    END

    FETCH NEXT FROM transaction_import_cursor INTO @transaction_id, @product_code, @emailAddress, @status_date, @business_code, @expire_date, @prod_status
END
CLOSE transaction_import_cursor
DEALLOCATE transaction_import_cursor
COMMIT TRAN

/** purge data from temp error table before writing bad records for this batch **/
truncate table tran_import_error;

/** write bad records (missing product code or email address) to temp_error table **/
insert into tran_import_error (transaction_id, product_code, emailAddress, date_entered)
SELECT DISTINCT transactions_batch_tmp.transaction_id, transactions_batch_tmp.product_code, transactions_batch_tmp.emailAddress, getDate()
FROM transactions_batch_tmp
where transactions_batch_tmp.emailaddress not in (select emailaddress from contacts)
OR
transactions_batch_tmp.product_code not in (select product_code from products)
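For comparison, the whole cursor loop above could be collapsed into one set-based UPDATE plus one set-based INSERT per batch. This is only a sketch, assuming the same tables and columns as the script above, and assuming a row counts as a dupe when a journal row already exists for the same contact and product code:

```sql
BEGIN TRAN;

-- Dupes: refresh existing journal rows to the newest status_date in the batch.
UPDATE j
SET j.status_date = t.max_status_date
FROM journal j
INNER JOIN contacts c ON c.contact_id = j.contact_id
INNER JOIN (SELECT emailAddress, product_code, MAX(status_date) AS max_status_date
            FROM transactions_batch_tmp
            GROUP BY emailAddress, product_code) t
    ON RTRIM(LTRIM(c.emailAddress)) = t.emailAddress
   AND j.product_code = t.product_code;

-- New transactions: contact/product pairs with no existing journal row.
INSERT INTO journal (contact_id, product_code, status_date, business_code,
                     expire_date, entryTypeID, product_status, date_entered)
SELECT DISTINCT c.contact_id, p.product_code, t.status_date,
       RTRIM(LTRIM(t.business_code)), t.expire_date, 21,
       RTRIM(LTRIM(t.prod_status)), GETDATE()
FROM transactions_batch_tmp t
INNER JOIN contacts c ON c.emailAddress = t.emailAddress
INNER JOIN products p ON p.product_code = t.product_code
WHERE NOT EXISTS (SELECT 1 FROM journal j
                  WHERE j.contact_id = c.contact_id
                    AND j.product_code = t.product_code);

COMMIT TRAN;
```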

TIA|||I don't see anything in your code that requires a cursor. It would run much faster as set-based INSERT and UPDATE statements.|||Well, how would I handle the update part without a cursor? I need to make sure that *only* unique contact_id-product_code values exist in the journal table.

Thanks.|||Add some bit flag and notes columns to your import table. Then you can run data checks against the records prior to importing them. Flag any duplicates or bad records and add a note as to why they were flagged. Then import only the non-flagged records. Delete the non-flagged records when you are done, and you are left with a list of bad records that you can review or discard.|||blindman - Thanks for your help.

I'm almost there (I hope), but was wondering if there was a more efficient way to delete the dupe records than having to write two separate queries. I need to keep the most recent product_code-status_date transaction for *each* person. This runs after I insert ALL the records in the journal table.

--delete dupe trans with status_date as the flag
DELETE journal FROM journal, contacts
JOIN
(select product_code, contact_id, max(status_date) as max_status_date
from journal
group by product_code, contact_id) AS G
ON G.[contact_id] = contacts.[contact_id]
WHERE journal.[status_date] < G.[max_status_date]
AND G.[product_code] = journal.[product_code];

--delete dupe trans with journal_id as the flag
DELETE journal FROM journal, contacts
JOIN
(select product_code, contact_id, max(journal_id) as maxID
from journal
group by product_code, contact_id) AS G
ON G.[contact_id] = contacts.[contact_id]
WHERE journal.[journal_id] < G.[MaxID]
AND G.[product_code] = journal.[product_code];

Thanks again.|||The contact table has nothing to do with your delete, except to limit the deleted records to those that have a contact_id. I assume that contact_id is part of journal's natural key and that all records have a valid contact_id, so drop it from both your queries. (If you do need it for filtering, join it in the subquery.)

--delete dupe trans with status_date as the flag
DELETE journal
FROM journal
INNER JOIN
(select product_code, contact_id, max(status_date) as max_status_date
from journal
group by product_code, contact_id) AS G
ON journal.[product_code] = G.[product_code]
and journal.[contact_id] = G.[contact_id]
and journal.[status_date] < G.[max_status_date]

--delete dupe trans with journal_id as the flag
DELETE journal
FROM journal
INNER JOIN
(select product_code, contact_id, max(journal_id) as maxID
from journal
group by product_code, contact_id) AS G
ON journal.[product_code] = G.[product_code]
and journal.[contact_id] = G.[contact_id]
and journal.[journal_id] < G.[MaxID]

It also appears that the first query should handle all product_code/contact_id duplicates except those that share exactly the same status_date. If status_date stores only whole-date values, then I see the point of the second delete statement, but otherwise I wouldn't expect it to produce a high rowcount.

Now to your question: can this be done as a single SQL statement? Yes, but it would essentially require two nested subqueries, so I don't think you would get a big performance boost from it, and you would certainly have to sacrifice code clarity. I recommend that you leave it as two separate deletes.
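On SQL Server 2005 or later, one alternative single-statement form worth mentioning is a ROW_NUMBER() delete through a CTE. A sketch, assuming journal_id is usable as the tie-breaker when two rows share a status_date:

```sql
-- Keep only the newest row per (contact_id, product_code);
-- journal_id breaks ties when status_date values are equal.
WITH ranked AS (
    SELECT journal_id,
           ROW_NUMBER() OVER (PARTITION BY contact_id, product_code
                              ORDER BY status_date DESC, journal_id DESC) AS rn
    FROM journal
)
DELETE FROM ranked WHERE rn > 1;
```

This folds both of the separate deletes into one pass, at the cost of requiring the newer syntax.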
