How to find duplicates - With code samples

This WebDNA talk-list message is from 2000. It keeps the original formatting.
numero = 30640
interpreted = N
texte = {ORIGINAL MESSAGES SHOWN BELOW}

NONONONONONO! Please do not use nested searches to find duplicates!

(Disclaimer: I'm not sure I have all of the formal programming terms below exactly right. The technique is sound, however. If you'd like to check the terminology, feel free to read any book on database searching.)

When you use a nested search to find duplicate records (the outer search finds unique records, then the inner search looks up each unique record to count how many were found), you end up searching the database N*N times. Let's say you have a database of 5,000 records: you have to run the inner search through the WHOLE database once for each record, so you are doing 5,000 inner searches of 5,000 records each, which works out to about 25 MILLION comparisons.

A MUCH BETTER way is to search one time, with a carefully structured sort, and then use text variables to compare two records (you'll see this a little below) to see whether you have a duplicate.

Here's the theory: in a database of, say, customer contacts, you want to find areas with more than one customer per zip code. The database is set up as one customer, one record, so finding unique customers is easy, but finding two (or more) customers in the same zip code isn't. If you were doing this by hand and eye, you would create a report listing all of the contacts - sorted by zip code - and then skim down the report ONE TIME to find the duplicates, right? Of course you would. This works because you are simply comparing each record to the one immediately below it. So what we need to do is write some code that does the same thing. A computer doesn't know what the next record is going to be, but it can remember what the previous record was and compare it to the current record to see if it's a duplicate.

HERE'S SOME CODE: [!]
Let's find the duplicate (or more) zip codes [/!]
[!] Set the first "previous record" value to something that will NOT match our very first record [/!]
[text]prevrecord=NadaMatch[/text]
[!] The key is to find ALL of the records in the database, sorted by the field you are looking for duplicates in [/!]
[search db=contacts.db&geIDdata=0&ZIPCODEsort=1&max=99999]
[founditems]
[!] Here's .... Johnnnnnnnnny [/!]
[showif [url][ZIPCODE][/url]=[prevrecord]]
A duplicate was detected for zip code: [ZIPCODE]
[/showif]
[!] The last thing we have to do is remember this record, so we can use it in the next pass through this founditems loop [/!]
[text]prevrecord=[url][ZIPCODE][/url][/text]
[/founditems]
[/search]

-----------------

I use a more complicated version of this code to occasionally look for duplicate email addresses in a file of about 180K records. It runs in a few seconds. Also, I'll leave it as an exercise for the reader to convert this code to perform summary functions (for example, writing a summary file containing the number of orders per day in an order-history file).

If anyone in the New York City (ish) area is looking for training, let me know.

Thanks,
Brian B. Burton
BOFH - Department of Redundancy Department
---------------------------------------------------------------
MMT Solutions - Specializing in Online Shopping Solutions
973-808-8644
http://www.safecommerce.com
=================================
time is precious. waste it wisely
=================================

> I was just reading your question, and I was thinking that you
> could maybe do something with the numfound function.
>
> In other words, if the number of items found were greater
> than one, then you could make a list of them by showing them.
>
> So what if you made a summary of your zip codes, and then
> did a search on everyone?
>
> When you searched for a particular zip code, look at the number
> found in this second search; if it was greater than one, that
> would give you a list of all the repeated zip codes.
>
> I am thinking that you just want to find them so you can
> delete them or something, right? Because this would use up a lot
> of processor time if you did it constantly.
>
> jaks
>
>> Hey all,
>> This has become a little more perplexing than I would like. Situation:
>> I'm wondering if there is a simple way to find ALL records in a database
>> that have duplicates in a certain field.
>> Example: I want search results
>> that will return all ZipCodes that have duplicates, and not show the rest.
>> I'm not searching for just one ZipCode at a time either; I want the full
>> list of duplicates. As in:
>> Name         ZipCode
>> Tom Jones    90012
>> Haray Caray  90012
>> Joe Smith    90014
>> Jane Doe     90014
>> etc...
>> Basically, it would be the exact opposite of what Summ=T does in a search.
>>
>> Any advice would be appreciated!
>> Gary

Associated Messages, from the most recent to the oldest:

    
  1. How to find duplicates - With code samples (Brian B. Burton 2000)
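The technique in the message above is written in WebDNA, but it is language-independent: sort once by the field in question, then make a single pass comparing each record to the previous one. A minimal sketch in Python (field names and sample data here are invented for illustration, not taken from the original post):

```python
def find_duplicates(records, key):
    """Return each value of `key` that appears on more than one record,
    using one sort plus one linear scan (O(N log N)) instead of a
    nested search over every record (roughly N*N comparisons)."""
    duplicates = []
    prev = object()  # sentinel that can never equal a real field value
    for rec in sorted(records, key=lambda r: r[key]):
        value = rec[key]
        # Same value as the record immediately above it => duplicate;
        # only report each duplicated value once.
        if value == prev and (not duplicates or duplicates[-1] != value):
            duplicates.append(value)
        prev = value
    return duplicates

contacts = [
    {"name": "Tom Jones",   "zip": "90012"},
    {"name": "Haray Caray", "zip": "90012"},
    {"name": "Joe Smith",   "zip": "90014"},
    {"name": "Jane Doe",    "zip": "90014"},
    {"name": "Sue Nguyen",  "zip": "90210"},
]

print(find_duplicates(contacts, "zip"))  # ['90012', '90014']
```

As the author notes, the sort-and-scan approach replaces the ~25 million comparisons of a 5,000-record nested search with one sort plus one 5,000-record pass.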
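For the "summary functions" exercise the author leaves to the reader (e.g. counting the number of orders per day in an order-history file), the same sorted single pass works: count each group instead of merely flagging repeats. A sketch in Python, again with invented field names and sample data:

```python
def summarize(records, key):
    """Count records per distinct value of `key` in one sorted pass,
    e.g. the number of orders per day in an order-history file."""
    counts = {}
    prev = object()  # sentinel: never equal to any real field value
    for rec in sorted(records, key=lambda r: r[key]):
        value = rec[key]
        if value == prev:
            counts[value] += 1  # same group as the previous record
        else:
            counts[value] = 1   # a new group starts here
        prev = value
    return counts

orders = [
    {"id": 1, "date": "2000-05-01"},
    {"id": 2, "date": "2000-05-01"},
    {"id": 3, "date": "2000-05-02"},
]

print(summarize(orders, "date"))  # {'2000-05-01': 2, '2000-05-02': 1}
```

This mirrors the WebDNA pattern exactly: the `prev` variable plays the role of the `prevrecord` text variable, and each change in the sorted key starts a new summary row.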



Talk List

The WebDNA community talk-list is the best place to get help: several hundred highly proficient programmers with deep knowledge of WebDNA and a generous spirit will share all the tips and tricks you can imagine...

Related Readings:

calculating ShipCost depending on weight (1997) Table sorting by selected category then others (2004) page redirect in webDNA (1997) Did you hear about this? (1997) generating unique IDs (2000) EmailFolder Correct Permissions ? (2003) Webcat interfering with Webstar? (1998) RequiredFields template (1997) How does WebCatalog search the database? (1997) WebCat Error Log problem (2.0.1, acgi, Mac) (1997) Queertrons? (1997) Problems with [Search] param - Mac Plugin b15 (1997) form population (1998) [search] within [listfiles]? (2003) suffix mapping for NT? (1997) how do I delete 1 of 2 identical records? (2003) WC2.0 Memory Requirements (1997) What if SSL server down? MAC (1997) cookies ok ??? (2006) Am I going senile? (Price recalc based on quantity) (1997)