Find the non repeating char in O(n) time and O(1) space – v2
My colleague and friend Luca (@lucabox) described a better solution to the problem of “Finding the first non repearing char in a string in O(n) time and O(1) space“. It uses smartly the space, making the solution nicer and slicker.
Or we are just 2 geeks that need to give more attention to their girlfriends
Luca’s solution description
The logic of this solution is based on the usage of an array of unsigned chars.
Every char (assumed to be lowecase) has an associated small byte (1 char = 8 bits), where the bits 0×1 and 0×2 (the 2 least significant) represents, respectively, “present once in the input string” and “present multiple times in the input string“. After the input is “scanned” once, and every letter is marked with the correspondent “presence bit” (once, multiple or none), it get’s scanned a second time to find the first char of the input which has the bit “present once” set to “1″.
Before I show you the code
Again, this comes from Luca Colantonio (http://uk.linkedin.com/in/lucacolantonio), smart ass that is too lazy to maintain a blog and post it himself (or even implementing himself – he just explained to me at work and I had to code it). Thanks Luca
Now, the code.
#include <stdio .h>
#include <string .h>
#define PRESENCE_ONCE_BIT 0x1
#define PRESENCE_MULTIPLE_BIT 0x2
#define NORMALIZE_CHAR(C) (C-'a')
int main(int argc, char** argv)
{
char* input;
unsigned int input_length;
unsigned int i;
unsigned char char_presence_bitmask[26];
memset(char_presence_bitmask, 0x0, 26);
// Check the Input
if ( argc == 2 ) input = argv[1];
else return (1);
// Search for every character "present only once"
for ( i = 0;; ++i )
{
// Avoid use of "strlen", just to look cooler
if ( input[i] != '\0' )
{
// If the current char is not already marked as "present multiple times"
if ( !( char_presence_bitmask[NORMALIZE_CHAR(input[i])] & PRESENCE_MULTIPLE_BIT) )
{
// It it is already marked as "present once"
if ( char_presence_bitmask[NORMALIZE_CHAR(input[i])] & PRESENCE_ONCE_BIT )
{
// Then mark it as "present multiple times" and unmark it's "present once" bits
char_presence_bitmask[NORMALIZE_CHAR(input[i])] |= PRESENCE_MULTIPLE_BIT;
char_presence_bitmask[NORMALIZE_CHAR(input[i])] &= ~PRESENCE_ONCE_BIT;
}
else
{
// Otherwise, mark it as "present once"
char_presence_bitmask[NORMALIZE_CHAR(input[i])] |= PRESENCE_ONCE_BIT;
}
}
}
else
{
// Store the input length to make next iteration easier
input_length = i;
break;
}
}
// Look for the first char that is marked as "present only once" => that's the solution
for ( i = 0; i < input_length; ++i )
{
if ( char_presence_bitmask[NORMALIZE_CHAR(input[i])] & PRESENCE_ONCE_BIT )
{
printf("=== FINAL RESULT: %c ===\n", input[i]);
return (0);
}
}
printf("=== FINAL RESULT: no char appears only once ===\n");
return (0);
}

about 2 months ago
Why not replace all the if stuff in the middle with:
char_presence_bitmask[NORMALIZE_CHAR(input[i])] <<= 1;
Change the memset to:
memset(char_presence_bitmask, 1, 26);
And check in the last for loop if it equals 2? Time/space complexity is the same, but will it not run faster than with the jumps?
about 2 months ago
@V.R.:
Your first suggestion is cool: yes, it will make the if even slimmer.
About your second suggestion, I think you should check this out: http://www.cplusplus.com/reference/clibrary/cstring/memset/. The second parameter represent the initialization value: doing like you say would mean “marking all chars as present once”.
And, for your last suggestion, a “bitwise AND” is faster then using comparison. That’s why it should be used whenever possible.
Thanks.
about 2 months ago
Yes, I know about memset. It has to be 1 (i.e. non zero) because shifting 0 is 0. It’s not really “marking all chars as present once”, but checking if it has been shifted once.
Your second point is interesting, can you give any references? Probably specific to platform/compiler.
about 2 months ago
About your first point, what I mean is if we initialize the array to “1″, every char will result already marked as “present once”. I assume you maintain the convention I set. Unless, then, you are changing it somehow, using:
About the second point, I guess you are right: it is something dependent on the platform: if there is a specific set of registry already able to do 1-clock-heartbit AND and none for “EQUALITY COMPARISON”, then my point stands. Otherwise it’s irrelevant.
I’m unable to find a reference, but I was told looooong time ago that this was the case for ARM architectures. But, again, I could be wrong.
Any Architecture expert out there?
BUT, there is still a reason for NOT using the EQUALITY COMPARISON: given the fact that the algorithm is all designed around the concept of “bit manipulation”, it is more clear, in my opinion, to maintain constant the usage of bitwise operators. Avoids, for who reads the algorithm, to move from the mental scope of “bitwise operations” to “interger’s math”.
Thank you very much for this comments anyway, really really useful to compare with someone else that knows these matters