Now that I have secured our initial 25.6 million files, I indexed everything just days ago so it could be put into my own search engine system (SOTDS), and I observed some quite revealing information about the number of files that are unique!
During December/January, while running the second automated ripping batch, I discovered a huge number of WinUAE crashes. These were caused by a good number of files compressed with "XPKF", "PACK" and "RNC" that were still lurking in the archive.
These were files that could never be decompressed, either because they were encrypted with a password/key or because the compressed data was damaged. So I hunted them all down and deleted them, which took the collection from 25.6 million files down to 22.9 million.
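As a rough illustration of how such files could be hunted down, here is a minimal Python sketch that flags files by their leading bytes. It is not the tooling actually used here: the function names are made up for the example, and it assumes each tag sits at the very start of the file, which will not hold for every packer variant.

```python
from pathlib import Path

# Tags named in the post. Matching them at offset 0 is an assumption.
PACKER_TAGS = (b"XPKF", b"PACK", b"RNC")

def packer_tag(header: bytes):
    """Return the matching tag as text, or None if the header matches none."""
    for tag in PACKER_TAGS:
        if header.startswith(tag):
            return tag.decode("ascii")
    return None

def find_packed_files(root):
    """Yield (path, tag) for every file under root whose first bytes match a tag."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            with path.open("rb") as fh:
                tag = packer_tag(fh.read(4))
            if tag is not None:
                yield path, tag
```

A file flagged this way is only a candidate. Whether it is really encrypted or damaged still has to be decided by trying to unpack it.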
Further, once I had run the database scan in which my SOTDS system automatically removes any duplicates, a much more revealing result came to light!
Out of the 22.96 million files, the number of actual non-duplicated files came all the way down to 4.44 million!
Yep, this does make sense. Remember that during my extraction and decrunching of original Amiga files from all kinds of archives, CDs, disk images, loose files etc., I had only one filter applied: files below 100 bytes or so would be ignored initially, and then all damaged files that could not be decompressed would be ignored. There was no way any of those files could result in a rippable music file that could be recorded later!
So, with this in mind, any disk with "loadwb" on it would naturally be part of this scan and left in the collection. Of course, how many variations of "loadwb", each with its own MD5 checksum, could there really be?
Well, logic dictates that a high number of files WOULD turn out to be identical when comparing MD5 checksums, and do we really need to scan a duplicated file twice?
Of course not!
So, the new number of unique files I need to scan went down to 4.4 million, shaving away about 18 million duplicated ones - PHEW!
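For anyone curious what this kind of duplicate removal looks like in practice, here is a minimal Python sketch. It is not the SOTDS code. The function names are made up for the example, and the exact 100-byte cutoff is an assumption based on the "100 bytes or so" rule mentioned earlier.

```python
import hashlib
from pathlib import Path

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def unique_files(root, min_size=100):
    """Map each MD5 to the first file found with that content.

    Files smaller than min_size bytes are skipped (assumed cutoff).
    """
    seen = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.stat().st_size < min_size:
            continue
        # setdefault keeps the first path and ignores later duplicates
        seen.setdefault(md5_of_file(path), path)
    return seen
```

Only the paths in the returned dictionary would need to go through the rippers. Every other file is a byte-for-byte copy of one of them.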
The new number of files I need to scan is 4445098 (4.4 million). Multiplying that by 4 ripper techniques, I end up with a total of roughly 17.8 million file scans instead of the initial 102 million!
That should mean we won't have to wait 2.2 years to scan through everything. It could be down to about 10-17 months instead, if I am able to scan around 1+ million files a month. Remember, everything is done through "real" Amigas using WinUAE emulation (although in JIT, turbo mode), so it does still take time!
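The arithmetic behind these estimates can be checked with a few lines of Python. This sketch uses the exact file count rather than the rounded 4.4 million. The helper names are made up for the example, and the rate of exactly 1,000,000 scans per month is an assumed round figure standing in for "1+ million".

```python
UNIQUE_FILES = 4_445_098     # unique files after duplicate removal
ORIGINAL_FILES = 25_600_000  # initial collection size
RIPPER_PASSES = 4            # each file goes through 4 ripper techniques

def total_scans(files, passes=RIPPER_PASSES):
    """Number of file scans when every file goes through every ripper."""
    return files * passes

def months_needed(scans, scans_per_month):
    """Months required at a steady monthly scanning rate."""
    return scans / scans_per_month

print(f"{total_scans(ORIGINAL_FILES):,}")  # 102,400,000
print(f"{total_scans(UNIQUE_FILES):,}")    # 17,780,392

# Assumed rate: exactly one million scans per month
print(f"{months_needed(total_scans(UNIQUE_FILES), 1_000_000):.1f}")  # 17.8
```

At one million scans a month the job lands at the slow end of the estimate. Any rate above that shortens it proportionally.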
On the other hand, I should probably kiss the Guinness World Records goodbye, as processing 4.4 million files instead of 102 million files through an emulator isn't much of a world record, more like peanuts :-)
Anyway, who cares about that. It's music we care about in the end here!
The next interesting thing is how many of the ripped files I end up with will match, by MD5 checksum, what is already stored in the original SOAMC= archive.
Hopefully I should be able to produce at least a couple of thousand, or even several hundred thousand, music files not yet recorded; otherwise these many years spent on this crazy project wouldn't have been needed after all. Let's just wait and see before we salute ourselves and the ENDGAME project!
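The comparison itself is simple once both sides are reduced to MD5 checksums. Here is a minimal Python sketch, assuming the ripped files are available as a dictionary of checksum to file name and the archive as a plain list of checksums. Both structures and the function name are made up for the example.

```python
def new_rips(ripped, archived_md5s):
    """Return the ripped files whose MD5 is not already in the archive.

    ripped: dict mapping MD5 hex string -> file name (assumed layout)
    archived_md5s: iterable of MD5 hex strings already stored
    """
    # Normalise case so "ABC..." and "abc..." count as the same checksum
    known = {md5.lower() for md5 in archived_md5s}
    return {md5: name for md5, name in ripped.items() if md5.lower() not in known}
```

Whatever comes back from a check like this is the set of candidates for music not yet recorded.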
I have started to delete duplicated data files. In other words, our original 536GB dataset will be reduced to the point where it is easily transferable as a single package to the other computers I will set up.
Right now, we have a Linux machine serving the files and a single Windows 7 64-bit machine accessing them over the network. This isn't a perfect way of processing this kind of data (in my opinion, using Linux at all for this is just troublesome), so my plan ahead is to copy the new set of data to local computers and process it locally. That would give me the chance to deploy the dataset onto other computers (any Windows machine, really) and process 100% locally. This will in turn speed up ripping, and I should probably manage to pull off the entire ripping of 17 million files before Christmas 2019, let's hope.
More details will be given when it progresses!