Also, munching Fritos and looking at this: we could assume that any asset that is new to one avatar but was created by a different avatar is a high-probability candidate for being a duplicate, and should be checked out.
That would capture a good chunk (over 50%?) of duplicates without having to touch the renaming-or-making-a-copy processes.

Again, this could be event-driven, or db-trigger-driven on INSERT, etc. (Or does MySQL not have transactions and not have on-insert triggers? I'm used to Oracle.)
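(For what it's worth, MySQL does have transactions with InnoDB tables, and it has supported AFTER INSERT triggers since version 5.0, so either route should be available.) Here's a very rough sketch of the event-driven variant in Python, purely illustrative, with made-up names for the queue and the asset fields since I don't have the real schema in front of me:

    import queue

    # Queue of asset IDs that look like duplicate candidates; a background
    # worker drains this whenever it gets around to it.
    dedup_candidates = queue.Queue()

    def on_asset_filed(asset_id, creator_id, receiving_avatar_id):
        """Called from the normal filing path, after the asset row is written.

        Heuristic: an asset that is new to this avatar but was created by a
        different avatar is a likely duplicate, so flag it for later checking.
        Field names are invented; adapt to the real asset schema.
        """
        if creator_id != receiving_avatar_id:
            dedup_candidates.put(asset_id)
        # Filing itself is already done either way; we never block on dedup here.

    # Example: avatar B receives an asset originally created by avatar A.
    on_asset_filed("asset-1234", creator_id="avatar-A", receiving_avatar_id="avatar-B")
    print(dedup_candidates.qsize())   # -> 1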
Wade

On Thu, Mar 8, 2012 at 8:06 PM, Wade Schuette <wade.schuette@gmail.com> wrote:
Justin,

I have to respectfully agree with Cory.

Wouldn't something like the following address your valid concerns about complexity, while also reducing total load and the perceived system response time for both filing and retrieving assets?
First, if you use event-driven processes, there's no reason to rescan the entire database. Separating the work into distinct streams also decouples the two sides, which simplifies both of them; I see no reason they need to be coupled, and keeping them apart lets each be optimized and tested separately, which is a good thing.

In fact, the entire deduplication process could run overnight at a low-load time, which is even better, or have multiple "worker" processes assigned to it if it's taking too long. Seems very flexible.
I'm assuming that a hash-code isn't unique, but just specifies the bucket into which this item can be categorized.
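In other words (made-up hash values and IDs, just to illustrate the bucketing I have in mind):

    # One hash value ("bucket") can hold several distinct assets; a matching
    # hash only means "worth comparing", not "identical".
    buckets = {
        "9f3a77c2": ["asset-0001"],                # only one occupant so far
        "41d0be55": ["asset-0042", "asset-0137"],  # two assets share a hash; bytes may still differ
    }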
When a new asset arrives: if its hash-code already exists, put the unique ID in a pipe, finish filing it, and move on. If the hash-code doesn't already exist, just file it and move on.
At the other end of the pipe, this wakes up a process that can, as time allows, check in the background whether not just the hash-code but the entire item is the same, and if so, change the handle to point to the existing copy. (For all I know this could be done in one step if CRC codes were sufficiently unique, but computing such a code is CPU-intensive unless you can do it in hardware.)
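Both ends of that pipe, as a bare-bones Python sketch (an in-memory stand-in for the real asset store, with invented names throughout; the real thing would obviously talk to MySQL and the asset service):

    import hashlib
    import queue
    import threading

    # Toy in-memory stand-ins for the real asset store (all names invented).
    asset_data = {}            # asset_id -> raw bytes as filed
    asset_handle = {}          # asset_id -> asset_id that actually holds the bytes
    hash_index = {}            # hex digest -> asset_ids already filed under that digest
    suspects = queue.Queue()   # the "pipe": IDs awaiting a full byte-for-byte check

    def file_asset(asset_id, data):
        """Filing path: store the asset; if its hash is already known, drop the
        ID in the pipe for the background checker.  Filing never waits."""
        digest = hashlib.sha256(data).hexdigest()
        asset_data[asset_id] = data
        asset_handle[asset_id] = asset_id
        if digest in hash_index:
            suspects.put((asset_id, digest))    # possible duplicate; verify later
        hash_index.setdefault(digest, []).append(asset_id)

    def dedup_worker():
        """Other end of the pipe: as time allows, confirm the whole item matches
        (not just the hash) and, if so, repoint the handle at the existing copy."""
        while True:
            item = suspects.get()
            if item is None:                    # shutdown sentinel
                break
            asset_id, digest = item
            for other_id in hash_index[digest]:
                if other_id != asset_id and asset_data[other_id] == asset_data[asset_id]:
                    asset_handle[asset_id] = other_id   # share the already-filed copy
                    break

    worker = threading.Thread(target=dedup_worker)
    worker.start()

    file_asset("asset-A", b"some texture bytes")
    file_asset("asset-B", b"some texture bytes")   # identical bytes, different ID

    suspects.put(None)                             # let the worker drain and exit
    worker.join()
    print(asset_handle["asset-B"])                 # -> asset-A: B now shares A's copy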
Of course, now the question arises of what happens when the original person DELETES the shared item. If you have solid database integrity, you only need to know how many pointers to it exist: if someone deletes "their copy", you decrease the count by one, and once the count is down to one, the next delete can actually remove the stored entry itself.
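The bookkeeping I have in mind, sketched in Python (invented names again; in the database this would just be a reference-count column kept up to date inside a single transaction):

    # ref_count[asset_id] = how many inventory entries currently point at the
    # stored copy.
    ref_count = {}
    stored_assets = {}         # asset_id -> bytes actually kept in the db

    def add_reference(asset_id):
        ref_count[asset_id] = ref_count.get(asset_id, 0) + 1

    def delete_my_copy(asset_id):
        """Someone deletes "their copy": drop one reference; only when the last
        reference goes away does the stored entry itself get removed."""
        ref_count[asset_id] -= 1
        if ref_count[asset_id] == 0:
            del ref_count[asset_id]
            del stored_assets[asset_id]    # the actual delete, at last

    stored_assets["asset-A"] = b"shared texture bytes"
    add_reference("asset-A")   # original creator's inventory entry
    add_reference("asset-A")   # second avatar now shares the same stored copy

    delete_my_copy("asset-A")  # original creator deletes; the shared copy survives
    delete_my_copy("asset-A")  # last reference gone; the entry really is deleted
    print("asset-A" in stored_assets)   # -> False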
Wade
On 3/8/12 7:41 PM, Justin Clark-Casey wrote:

On 08/03/12 22:00, Rory Slegtenhorst wrote:
@Justin
Can't we do the data de-duplication at the database level? E.g., find the duplicates and just get rid of them at a regular interval (cron)?
This would be enormously intricate. Not only would you have to keep rescanning the entire asset db, but it would also add another moving part to an already complex system.
--
R. Wade Schuette, CDP, MBA, MPH
698 Monterey Ave
Morro Bay CA 93442
cell: 1 (734) 635-0508
fax: 1 (734) 864-0318
wade.schuette@gmail.com