Hide Document Parts:
Images
   Digital Media
Forms
QR Code
Output Options:
Print Document
Download PDF
Article
Comments

A Unique Identifier Hash for Hive LMS

So I got distracted but in a good way. I realized as I was coding up the login system for the LMS Project (my Bachelor’s Degree Senior Project) that I’m going to need some kind of ID (unique identifier hashing) for Hive and almost four hours later I have a working but extremely broken YouTube-esque unique identifier hashing algorithm.

Research

First let me lay out the background research. Naturally I Google searched like crazy and came up with a ton of resources but the first thing that actually caught my attention long enough to be of use was Instagram’s Sharding & IDs at Instagram article. To be honest it hurt my head trying to understand it but it lead me to eventually find Hashids which looked promising. After picking apart what Hashids does though its not exactly for me. My goal is more YouTube minded where I create a unique identifier that is not based off a auto incremented database column. Hashids does a decent job at hiding the original integer ID but I like the security and ease of scale of YouTube’s video URL hash. With YouTube’s algorithm you can not just switch around a few letters or numbers and easily pull up hidden videos and when need you can allow the ID to be bigger than the current 11 character limit. Currently the algorithm has 6411 possibilities, in non-math speak 73,786,976,294,838,206,464 possibilities, but can really be represented as 64since they do not seem to check for curses: bad words that accidentally get generated. This is about the time that I stumbled upon this awesome video.

My Attempt

At this point I got to coding and realized another potential problem; feature creep. I would like to have a time stamp hidden or built into this unique identifier. After failing miserably at my initial implementations I decided to get the unique identifier algorithm working first instead of trying to kill two birds with one stone as the saying goes. And here is my first working and tested solution using base62 not base64 like YouTube.

function genId($length){
    $array = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P'
             ,'Q','R','S','T','U','V','W','X','Y','Z','a','b','c','d','e','f'
             ,'g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v'
             ,'w','x','y','z','0','1','2','3','4','5','6','7','8','9'];
    $curses = ['c','f','h','i','s','t','u','C','F','H','I','S','T','U'];
    $result = '';
    $current = '';
    $previous = FALSE;
    for($x=0; $x<$length; $x++){
        shuffle($array);
        $current = $array[rand(0, 61)];
        if($previous){
            $previous = in_array($current, $curses);
            if($previous){
                $x--;
            } else {
                $result .= $current;
            }
        } else {
            $previous = in_array($current, $curses);
            $result .= $current;
        }
    }

    return $result;
}

I noticed that having  shuffle($array); outside of the for loop caused way more collisions than was acceptable so I popped it inside and there you have it: The Worst Unique Identifier Hashing Algorithm Ever. Using a test script I wrote in JavaScript I tested collisions out of 10,000 attempts and the results were not promising (see below). I think the way PHP handles generating random numbers and random array shuffles is the biggest hindrance at the moment. Since the code runs so fast the randomness in PHP does not change enough.

Test Results

UID’s made:10,000
Collisions:8,426
% of collisions:84%

Now after taking a break from my coding spree and giving it some thought I realized that I was implementing the algorithm wrong: I was introducing to much randomness. Evan Caldwell took a look at my first attempt and also pointed this out to me Taking out the shuffle($array); completely and testing collisions out of 10,000 attempts now gives:

Test Results

UID’s made:10,000
Collisions:6,879
% of collisions:69%

What I’m realizing now though is I may need to head away from randomness and move towards adding some kind of control to limit the collisions. Kevin van Zonneveld has a nice little function that I might be able to borrow some ideas from but it violates the rule that I’m setting for this algorithm: be randomly generated with enough uniqueness that you can not easily guess the next or previous ID. For reference the article can be found here: Create Youtube-Like IDs With PHP/Python/Javascript/Java/SQL.

A Possible Solution

Still not factoring in my desire to include a time stamp yet I think the following business logic will solve my problem and be fairly simple to implement:

  1. Create a variable that tracks the amount of ID’s I have generated so far. This essentially is a variable or database record of the last number I used.
  2. Use a base64 alphabet shifted by an organizational salt; this helps increase uniqueness globally but does not grantee it.
  3. Generate blocks of usable ID’s incrementing from where we last left off guaranteeing 100% uniqueness of the ID.
  4. Pad each ID with data from another fast randomizing method to create ID’s that are the length desired; this step is done at the same time as the previous.
  5. Shuffle this block of usable ID’s before pulling one off the top for use.
Comments are currently disabled. Please check back later.
Close
Categories