Lexer Builder - Problem

Build a lexer that tokenizes a simple programming language. The lexer should identify and classify different types of tokens from source code.

Token Types:

KEYWORD: if, else, while, for, return, int, float, bool
IDENTIFIER: Variable names (letters, digits, underscore; must start with a letter or underscore)
NUMBER: Integer or floating-point numbers
OPERATOR: +, -, *, /, =, ==, !=, <, >, <=, >=
DELIMITER: (, ), {, }, ;, ,
WHITESPACE: Spaces, tabs, newlines (separate tokens but are never emitted)

Return an array of tokens, where each token is a string in the format "TYPE:value".

Note: Skip whitespace tokens in the output. Process tokens from left to right.
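
The "TYPE:value" format above can be sketched as a small helper. The names TokenType, TOKEN_NAMES, and formatToken are hypothetical; the problem statement only fixes the output text, not how a solution represents tokens internally.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical names: the problem only specifies the "TYPE:value" text. */
typedef enum {
    TOKEN_KEYWORD,
    TOKEN_IDENTIFIER,
    TOKEN_NUMBER,
    TOKEN_OPERATOR,
    TOKEN_DELIMITER
} TokenType;

/* Indexed by TokenType, so the order must match the enum above. */
static const char* const TOKEN_NAMES[] = {
    "KEYWORD", "IDENTIFIER", "NUMBER", "OPERATOR", "DELIMITER"
};

/* Writes "TYPE:value" into out. Returns the length snprintf reports;
   a result >= outSize means the text was truncated. */
int formatToken(TokenType type, const char* value, char* out, size_t outSize) {
    assert(value != NULL && out != NULL);
    return snprintf(out, outSize, "%s:%s", TOKEN_NAMES[type], value);
}
```

Calling formatToken(TOKEN_KEYWORD, "if", buf, sizeof buf) fills buf with KEYWORD:if. There is no WHITESPACE entry because whitespace never reaches the output.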

Input & Output

Example 1 — Basic If Statement

$ Input: sourceCode = "if(x == 5)"

› Output: ["KEYWORD:if","DELIMITER:(","IDENTIFIER:x","OPERATOR:==","NUMBER:5","DELIMITER:)"]

💡 Note: Tokenizes: 'if' as keyword, '(' and ')' as delimiters, 'x' as identifier, '==' as operator, '5' as number

Example 2 — Variable Declaration

$ Input: sourceCode = "int count = 10;"

› Output: ["KEYWORD:int","IDENTIFIER:count","OPERATOR:=","NUMBER:10","DELIMITER:;"]

💡 Note: Tokenizes variable declaration: 'int' keyword, 'count' identifier, '=' operator, '10' number, ';' delimiter

Example 3 — Expression with Float

$ Input: sourceCode = "result = 3.14 * radius"

› Output: ["IDENTIFIER:result","OPERATOR:=","NUMBER:3.14","OPERATOR:*","IDENTIFIER:radius"]

💡 Note: Handles floating point number '3.14' and mathematical expression with identifiers and operators

Constraints

1 ≤ sourceCode.length ≤ 10⁴
sourceCode contains only printable ASCII characters
Keywords are case-sensitive
Identifiers start with letter or underscore
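
The last two constraints can be sketched as character tests plus a case-sensitive keyword lookup. The function names below are hypothetical, chosen for illustration only.

```c
#include <ctype.h>
#include <string.h>

/* An identifier may start with a letter or underscore... */
int isIdentifierStart(char c) {
    return isalpha((unsigned char)c) || c == '_';
}

/* ...and continue with letters, digits, or underscores. */
int isIdentifierPart(char c) {
    return isalnum((unsigned char)c) || c == '_';
}

/* Returns "KEYWORD" or "IDENTIFIER" for a complete word.
   strcmp is case-sensitive, so "If" is not the keyword "if". */
const char* classifyWord(const char* word) {
    static const char* const keywords[] = {
        "if", "else", "while", "for", "return", "int", "float", "bool"
    };
    for (size_t k = 0; k < sizeof keywords / sizeof keywords[0]; k++) {
        if (strcmp(word, keywords[k]) == 0) return "KEYWORD";
    }
    return "IDENTIFIER";
}
```

With this sketch, classifyWord("if") returns "KEYWORD", while classifyWord("If") and classifyWord("_tmp") both return "IDENTIFIER".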

Asked in

Google (35), Microsoft (28), Amazon (22), Apple (18)

Build a lexer using a state machine that processes the source code character by character. The key insight is that each state records which kind of token is currently being built (a word, a number, an operator), so the next character either extends that token or ends it, and no input is ever re-scanned. The State Machine Lexer is the best approach, with O(n) time and O(n) space.

Common Approaches

Character-by-Character with Backtracking

⏱️ Time: O(n × m) Space: O(n)

For each character, try to match all m possible token patterns starting at that position. This involves backtracking and repeated work when tokens share common prefixes, such as "=" and "==".

State Machine Lexer

⏱️ Time: O(n) Space: O(n)

Implement a state machine that transitions between different states based on input characters. Each state knows what type of token it's building, eliminating backtracking and redundant checks.
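
A minimal sketch of such a state machine follows. It is one possible design, not a reference solution: the state names, the fixed-size output rows (TOKEN_MAX), and the rule that a '.' needs a following digit are all assumptions made for illustration. Peeking one character ahead for the decimal point is lookahead, not backtracking, because no input is re-scanned.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define TOKEN_MAX 64  /* assumption: longer "TYPE:value" strings are truncated */

typedef enum { S_START, S_WORD, S_INT, S_FRAC, S_OP } State;

static int isKeywordSpan(const char* s, int n) {
    static const char* const kw[] = {
        "if", "else", "while", "for", "return", "int", "float", "bool"
    };
    for (size_t k = 0; k < sizeof kw / sizeof kw[0]; k++) {
        if ((int)strlen(kw[k]) == n && strncmp(s, kw[k], (size_t)n) == 0) return 1;
    }
    return 0;
}

/* Appends "TYPE:value" for the n characters at s; drops tokens beyond max. */
static void emit(char out[][TOKEN_MAX], int* count, int max,
                 const char* type, const char* s, int n) {
    if (*count < max) snprintf(out[(*count)++], TOKEN_MAX, "%s:%.*s", type, n, s);
}

/* Fills out with tokens and returns how many were written. */
int lexStateMachine(const char* src, char out[][TOKEN_MAX], int maxTokens) {
    assert(src != NULL);
    State state = S_START;
    int count = 0, start = 0;
    int len = (int)strlen(src);

    /* i == len feeds the terminating '\0', which flushes a pending token. */
    for (int i = 0; i <= len; ) {
        unsigned char c = (unsigned char)src[i];
        switch (state) {
        case S_START:
            start = i;
            if (isalpha(c) || c == '_') state = S_WORD;
            else if (isdigit(c)) state = S_INT;
            else if (c == '=' || c == '!' || c == '<' || c == '>') state = S_OP;
            else if (c != '\0' && strchr("+-*/", c)) emit(out, &count, maxTokens, "OPERATOR", src + i, 1);
            else if (c != '\0' && strchr("(){};,", c)) emit(out, &count, maxTokens, "DELIMITER", src + i, 1);
            /* whitespace and unrecognized characters fall through and are skipped */
            i++;
            break;
        case S_WORD:
            if (isalnum(c) || c == '_') { i++; break; }
            emit(out, &count, maxTokens,
                 isKeywordSpan(src + start, i - start) ? "KEYWORD" : "IDENTIFIER",
                 src + start, i - start);
            state = S_START;  /* c is examined again in S_START; i is unchanged */
            break;
        case S_INT:
            if (isdigit(c)) { i++; break; }
            if (c == '.' && isdigit((unsigned char)src[i + 1])) { state = S_FRAC; i++; break; }
            emit(out, &count, maxTokens, "NUMBER", src + start, i - start);
            state = S_START;
            break;
        case S_FRAC:
            if (isdigit(c)) { i++; break; }
            emit(out, &count, maxTokens, "NUMBER", src + start, i - start);
            state = S_START;
            break;
        case S_OP:
            /* src[start] is one of = ! < >; a following '=' completes a two-character operator */
            if (c == '=') { emit(out, &count, maxTokens, "OPERATOR", src + start, 2); i++; }
            else if (src[start] != '!') emit(out, &count, maxTokens, "OPERATOR", src + start, 1);
            /* a lone '!' is not in the operator list, so it is dropped */
            state = S_START;
            break;
        }
    }
    return count;
}
```

Every iteration either consumes a character or returns to S_START, where a character is always consumed, so the loop runs at most about 2n times for an input of length n.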

Character-by-Character with Backtracking — Algorithm Steps

Step 1: For each position, try matching each token type
Step 2: Use backtracking when partial matches fail
Step 3: Choose the longest successful match
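
Steps 1 and 3 can be sketched for operators alone: try every operator pattern at a position and keep the longest one that matches. The helper name matchOperator is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Returns the length (2, 1, or 0) of the longest operator that starts at
   src[pos]. Assumes src is NUL-terminated and pos <= strlen(src). */
int matchOperator(const char* src, int pos) {
    static const char* const ops[] = {
        "+", "-", "*", "/", "=", "==", "!=", "<", ">", "<=", ">="
    };
    assert(src != NULL && pos >= 0);
    int best = 0;
    for (size_t k = 0; k < sizeof ops / sizeof ops[0]; k++) {
        int n = (int)strlen(ops[k]);
        /* Only a strictly longer match replaces the current best. */
        if (n > best && strncmp(src + pos, ops[k], (size_t)n) == 0) best = n;
    }
    return best;
}
```

For the input "x == 5", matchOperator(src, 2) returns 2, so "==" is emitted as one token instead of two "=" tokens. A lone '!' returns 0 because it is not in the operator list.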

Step-by-Step Walkthrough

Position Scan

Start at each character position

Try Patterns

Test all token types (operators, keywords, numbers)

Backtrack

If partial match fails, try next pattern

Code -

solution.c — C

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

int isKeyword(char* word) {
    char* keywords[] = {"if", "else", "while", "for", "return", "int", "float", "bool"};
    int numKeywords = 8;
    for (int i = 0; i < numKeywords; i++) {
        if (strcmp(word, keywords[i]) == 0) return 1;
    }
    return 0;
}

int isOperator(char* op) {
    char* operators[] = {"+", "-", "*", "/", "=", "==", "!=", "<", ">", "<=", ">="};
    int numOps = 11;
    for (int i = 0; i < numOps; i++) {
        if (strcmp(op, operators[i]) == 0) return 1;
    }
    return 0;
}

int isDelimiter(char c) {
    return c == '(' || c == ')' || c == '{' || c == '}' || c == ';' || c == ',';
}

// Build a heap-allocated "TYPE:value" string from a slice of the source.
static char* makeToken(const char* type, const char* start, int length) {
    char* token = malloc(strlen(type) + 1 + (size_t)length + 1);
    if (token != NULL) sprintf(token, "%s:%.*s", type, length, start);
    return token;
}

char** solution(char* sourceCode, int* returnSize) {
    int len = strlen(sourceCode);
    // Every token consumes at least one character, so len bounds the token count.
    char** tokens = malloc((len > 0 ? len : 1) * sizeof(char*));
    int tokenCount = 0;
    int i = 0;
    
    while (i < len) {
        // Cast to unsigned char: passing a negative char to <ctype.h> is undefined.
        unsigned char c = (unsigned char)sourceCode[i];
        if (isspace(c)) {
            i++;
            continue;
        }
        
        // Try operators, longest match first so "==" wins over "="
        int found = 0;
        for (int opLen = 2; opLen >= 1; opLen--) {
            if (i + opLen <= len) {
                char candidate[3];
                memcpy(candidate, sourceCode + i, opLen);
                candidate[opLen] = '\0';
                if (isOperator(candidate)) {
                    tokens[tokenCount++] = makeToken("OPERATOR", sourceCode + i, opLen);
                    i += opLen;
                    found = 1;
                    break;
                }
            }
        }
        
        if (found) continue;
        
        // Try delimiters
        if (isDelimiter(sourceCode[i])) {
            tokens[tokenCount++] = makeToken("DELIMITER", sourceCode + i, 1);
            i++;
            continue;
        }
        
        // Try numbers: digits, then at most one '.' that is followed by a digit
        if (isdigit(c)) {
            int j = i;
            while (j < len && isdigit((unsigned char)sourceCode[j])) j++;
            if (j + 1 < len && sourceCode[j] == '.' && isdigit((unsigned char)sourceCode[j + 1])) {
                j++;
                while (j < len && isdigit((unsigned char)sourceCode[j])) j++;
            }
            tokens[tokenCount++] = makeToken("NUMBER", sourceCode + i, j - i);
            i = j;
            continue;
        }
        
        // Try identifiers/keywords
        if (isalpha(c) || c == '_') {
            int j = i;
            while (j < len && (isalnum((unsigned char)sourceCode[j]) || sourceCode[j] == '_')) j++;
            int wordLen = j - i;
            char word[8];  // the longest keyword, "return", has 6 characters
            int keyword = 0;
            if (wordLen < (int)sizeof(word)) {
                memcpy(word, sourceCode + i, wordLen);
                word[wordLen] = '\0';
                keyword = isKeyword(word);
            }
            tokens[tokenCount++] = makeToken(keyword ? "KEYWORD" : "IDENTIFIER", sourceCode + i, wordLen);
            i = j;
            continue;
        }
        
        // Unrecognized character (for example a lone '!'): skip it
        i++;
    }
    
    *returnSize = tokenCount;
    return tokens;
}

int main() {
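    // Room for 10^4 characters plus the newline and the terminating NUL.
    char sourceCode[10002];
    if (fgets(sourceCode, sizeof(sourceCode), stdin) == NULL) sourceCode[0] = '\0';
    sourceCode[strcspn(sourceCode, "\n")] = 0;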
    
    int returnSize;
    char** result = solution(sourceCode, &returnSize);
    
    printf("[");
    for (int i = 0; i < returnSize; i++) {
        if (i > 0) printf(",");
        printf("\"%s\"", result[i]);
    }
    printf("]\n");
    
    for (int i = 0; i < returnSize; i++) {
        free(result[i]);
    }
    free(result);
    
    return 0;
}

Time & Space Complexity

Time Complexity

O(n × m)

For each of the n characters, we may try matching m different token patterns. Because m is a small constant for this language, the running time grows linearly with the input length.

Space Complexity

O(n)

Store the result tokens array and temporary matching state, which is linear in the input length.
