Ask the Expert

Removing duplicate records from a physical file

I have a file (900,000 records, 1 GB) in which multiple records share the same key (not very useful, but the result of multiple copies). I would like to remove these duplicate records in a simple way (SQL, ...).

I already tried creating a file with a unique key and then copying the data into it, but it's very slow (probably because the index is recomputed for each record).

I've seen some SQL advice, but the last step (a DELETE command on a view over the file) doesn't work on my V4R5 system:

DELETE FROM tmp/cq0942xxxx
    WHERE DUPRRN in (SELECT MRRN FROM tmp/cq0942xxx) 

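It really depends on which columns are involved in your definition of "duplicate data". The following query assumes that the column holding the unique identifier is distinct in every row, but that two other columns contain duplicate data. It deletes every row for which another row with a lower identifier has the same values in those two columns, so only the first row of each group is kept.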

-- Delete each row that has an earlier twin (same names, lower idcol)
DELETE FROM mytab
WHERE EXISTS (SELECT idcol FROM mytab innertab
   WHERE innertab.FirstName = mytab.FirstName
     AND innertab.LastName = mytab.LastName
     AND innertab.idcol < mytab.idcol )
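
Before running a delete like this against a file of this size, it may be worth previewing which rows it would remove. The following is only a sketch, reusing the example names from the answer (mytab, idcol, FirstName, LastName); they are placeholders, not the asker's actual file or fields.

```sql
-- Preview only: lists the rows that a duplicate-removing DELETE
-- with the same predicate would target. Nothing is deleted here.
-- mytab, idcol, FirstName and LastName are placeholder names.
SELECT *
  FROM mytab
 WHERE EXISTS (SELECT idcol FROM mytab innertab
    WHERE innertab.FirstName = mytab.FirstName
      AND innertab.LastName = mytab.LastName
      AND innertab.idcol < mytab.idcol)
```

If the rows returned are the ones you expect to lose, the same WHERE clause can then be used in the DELETE.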

Just about every method is going to be fairly slow when a large number of rows has to be scanned and processed.
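
The query above relies on a unique identifier column. If the file has none, as in the question where whole records were copied more than once, the relative record number can stand in for idcol. This is a rough sketch, assuming the RRN scalar function is accepted in this position on your release; keyfld is a hypothetical name for the key field and mytab is the placeholder table from the example.

```sql
-- Keep the record with the lowest relative record number for each key
-- and delete all the others. mytab and keyfld are hypothetical names.
DELETE FROM mytab a
 WHERE RRN(a) > (SELECT MIN(RRN(b))
                   FROM mytab b
                  WHERE b.keyfld = a.keyfld)
```

With a composite key, each key field would need its own equality test in the subquery. This statement still has to visit every record, so trying it on a copy of the file first seems prudent.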


This was first published in February 2003
