How to find duplicates - With code samples

This WebDNA talk-list message is from 2000. It keeps the original formatting.
numero = 30640
interpreted = N
texte = {ORIGINAL MESSAGES SHOWN BELOW}

NONONONONONO! Please do not use nested searches to find duplicates!

(Disclaimer: I'm not sure I have all of the formal programming terms below exactly right. The technique is sound, however. If you'd like to check the terminology, feel free to read any book on database searching.)

When you use a nested search to find duplicate records (the outer search finds unique records, then the inner search looks up each unique record to count how many were found), you end up searching the database N*N times. Let's say you have a database of 5,000 records: you have to run the inner search through the WHOLE database once for each record, so you are doing 5,000 inner searches of 5,000 records each, which works out to about 25 MILLION comparisons.

A MUCH BETTER way is to search one time, with a carefully structured sort, and then use text variables to compare two records (you'll see this a little below) to see whether you have a duplicate.

Here's the theory: in a database of, say, customer contacts, you want to find areas with more than one customer per zip code. The database is set up as one customer, one record, so finding unique customers is easy, but finding two (or more) customers in the same zip code isn't. If you were doing this by hand and eye, you would create a report listing all of the contacts - sorted by zip code - and then skim down the report ONE TIME to find the duplicates, right? Of course you would. This works because you are simply comparing each record to the one immediately below it. So what we need to do is write some code that does the same thing. A computer doesn't know what the next record is going to be, but it can remember what the previous record was and compare it to the current record to see if it's a duplicate.

HERE'S SOME CODE: [!]
Let's find the duplicate (or more) zip codes [/!]
[!] Set the first "previous record" value to something that will NOT match our very first record [/!]
[text]prevrecord=NadaMatch[/text]
[!] The key is to find ALL of the records in the database, sorted by the field you are looking for duplicates in [/!]
[search db=contacts.db&geIDdata=0&ZIPCODEsort=1&max=99999]
[founditems]
[!] Here's .... Johnnnnnnnnny [/!]
[showif [url][ZIPCODE][/url]=[prevrecord]]
A duplicate was detected for zip code: [ZIPCODE]
[/showif]
[!] The last thing we have to do is remember this record, so we can use it in the next pass through this founditems loop [/!]
[text]prevrecord=[url][ZIPCODE][/url][/text]
[/founditems]
[/search]

-----------------

I use a more complicated version of this code to occasionally look for duplicate email addresses in a file of about 180K records. It runs in a few seconds. Also, I'll leave it as an exercise for the reader to convert this code to perform summary functions (for example, writing a summary file containing the number of orders per day in an order-history file).

If anyone in the New York City (ish) area is looking for training, let me know.

Thanks,
Brian B. Burton
BOFH - Department of Redundancy Department
---------------------------------------------------------------
MMT Solutions - Specializing in Online Shopping Solutions
973-808-8644
http://www.safecommerce.com
=================================
time is precious. waste it wisely
=================================

> I was just reading your question, and I was thinking that you
> could maybe do something with the numfound function.
>
> In other words, if the number of items found were greater
> than one, then you could make a list of them by showing them.
>
> So what if you made a summary of your zip codes, and then
> did a search on everyone?
>
> When you searched for a particular zip code, look at the number
> found in this second search; if it was greater than one, that
> would give you a list of all the repeated zip codes.
>
> I am thinking that you just want to find them so you can
> delete them or something, right? Because this would use up a lot
> of processor time if you did it constantly.
>
> jaks
>
>> Hey all,
>> This has become a little more perplexing than I would like. Situation:
>> I'm wondering if there is a simple way to find ALL records in a database
>> that have duplicates in a certain field.
>> Example: I want search results
>> that will return all ZipCodes that have duplicates, and not show the rest.
>> I'm not searching for just one ZipCode at a time either; I want the full
>> list of duplicates. As in:
>> Name         ZipCode
>> Tom Jones    90012
>> Haray Caray  90012
>> Joe Smith    90014
>> Jane Doe     90014
>> etc...
>> Basically, it would be the exact opposite of what Summ=T does in a search.
>>
>> Any advice would be appreciated!
>> Gary

Associated Messages, from the most recent to the oldest:

    
  1. How to find duplicates - With code samples (Brian B. Burton 2000)
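The technique in the message above is written in WebDNA, but it is language-independent: sort once by the field in question, then make a single pass comparing each record to the previous one. A minimal sketch in Python (field names and sample data here are invented for illustration, not taken from the original post):

```python
def find_duplicates(records, key):
    """Return each value of `key` that appears on more than one record,
    using one sort plus one linear scan (O(N log N)) instead of a
    nested search over every record (roughly N*N comparisons)."""
    duplicates = []
    prev = object()  # sentinel that can never equal a real field value
    for rec in sorted(records, key=lambda r: r[key]):
        value = rec[key]
        # Same value as the record immediately above it => duplicate;
        # only report each duplicated value once.
        if value == prev and (not duplicates or duplicates[-1] != value):
            duplicates.append(value)
        prev = value
    return duplicates

contacts = [
    {"name": "Tom Jones",   "zip": "90012"},
    {"name": "Haray Caray", "zip": "90012"},
    {"name": "Joe Smith",   "zip": "90014"},
    {"name": "Jane Doe",    "zip": "90014"},
    {"name": "Sue Nguyen",  "zip": "90210"},
]

print(find_duplicates(contacts, "zip"))  # ['90012', '90014']
```

As the author notes, the sort-and-scan approach replaces the ~25 million comparisons of a 5,000-record nested search with one sort plus one 5,000-record pass.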
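For the "summary functions" exercise the author leaves to the reader (e.g. counting the number of orders per day in an order-history file), the same sorted single pass works: count each group instead of merely flagging repeats. A sketch in Python, again with invented field names and sample data:

```python
def summarize(records, key):
    """Count records per distinct value of `key` in one sorted pass,
    e.g. the number of orders per day in an order-history file."""
    counts = {}
    prev = object()  # sentinel: never equal to any real field value
    for rec in sorted(records, key=lambda r: r[key]):
        value = rec[key]
        if value == prev:
            counts[value] += 1  # same group as the previous record
        else:
            counts[value] = 1   # a new group starts here
        prev = value
    return counts

orders = [
    {"id": 1, "date": "2000-05-01"},
    {"id": 2, "date": "2000-05-01"},
    {"id": 3, "date": "2000-05-02"},
]

print(summarize(orders, "date"))  # {'2000-05-01': 2, '2000-05-02': 1}
```

This mirrors the WebDNA pattern exactly: the `prev` variable plays the role of the `prevrecord` text variable, and each change in the sorted key starts a new summary row.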



Talk List

The WebDNA community talk-list is the best place to get help: several hundred highly proficient programmers with deep knowledge of WebDNA and a generous spirit will share all the tips and tricks you can imagine...

Related Readings:

calculating ShipCost depending on weight (1997) Table sorting by selected category then others (2004) page redirect in webDNA (1997) Did you hear about this? (1997) generating unique IDs (2000) EmailFolder Correct Permissions ? (2003) Webcat interfering with Webstar? (1998) RequiredFields template (1997) How does WebCatalog search the database? (1997) WebCat Error Log problem (2.0.1, acgi, Mac) (1997) Queertrons? (1997) Problems with [Search] param - Mac Plugin b15 (1997) form population (1998) [search] within [listfiles]? (2003) suffix mapping for NT? (1997) how do I delete 1 of 2 identical records? (2003) WC2.0 Memory Requirements (1997) What if SSL server down? MAC (1997) cookies ok ??? (2006) Am I going senile? (Price recalc based on quantity) (1997)