Creating a Programming Language from Scratch

So you want to make a programming language from scratch?
This post dives into creating one and how you can design one for yourself. More importantly, we will be exploring the insides of how an interpreted language works under the hood. There can be differences in implementation for the language we’re going to build today, but the core concepts will remain the same.
But first let’s set the scene for what we’re gonna do today.

We will be building a toy interpreted language - something similar to BhaiLang. We are going to call our language SimpleLang. All the source code will be in C++.
Here is an example of what our language will look like 👇:

fun main{
  print Hii!!
}
main;

Hii!!

You can run the above code snippet by clicking the “Run” button below:

SimpleLang Interpreter

Try it yourself:

Output

Hii!!

From Source Code to Execution

Here are the steps by which we’ll turn this code into something a computer can understand and execute it.

Programming Language Implementation Flowchart

The first step is reading the source file into memory so we can process it.

int main() {
  std::ifstream inFile("./main.sl");
  std::string fileData(
    (std::istreambuf_iterator<char>(inFile)),
    std::istreambuf_iterator<char>()
  );
  inFile.close();
}

Tokenization

At this stage, we have the entire source code as a single string in memory and we need to find a way to break it down into meaningful components.
So next, we tokenize the source code by breaking it into meaningful symbols that our Interpreter can understand.
This process is called tokenization, where the raw source code is converted into symbols that our interpreter can understand.

struct KeywordsType {
  std::string name;
  std::string type;
};

std::vector<KeywordsType> Keywords = {
  KeywordsType{"fun", "functionDef"},
  KeywordsType{"print", "print"}
};

struct Token {
  std::string type;
  std::string value;
};

static void Tokenizer(
  std::vector<std::string>& tokens,
  std::vector<Token>& keywords,
  std::string& fileData
 ) {
  std::smatch m;
  std::regex r("\\{|\\}|[^;\\s{}]+");

  for (std::sregex_iterator
    it(fileData.begin(), fileData.end(), r), end;
    it != end; ++it) {
    tokens.emplace_back(it->str());
  }

  for (auto& j : tokens) {
        if (j == "{" || j == "}") {
      Token t;
      t.type = "bracket";
      t.value = j;
      keywords.push_back(t);
      continue;
    }

  bool isName = true;
  for (auto& i : Keywords) {
    if (i.name == j) {
      Token t;
      t.type = i.type;
      t.value = i.name;
      keywords.push_back(t);
      isName = false;
      break;
    }
  }

    if (isName)
    {
      Token t;
      t.type = "name";
      t.value = j;
      keywords.push_back(t);
    }
  }
}

Tokens are defined meaningful characters or strings and the process of converting raw strings to tokens is called Tokenization.

Parser

We now need to parse these tokens into an AST - Abstract Syntax Tree.
The parser consumes tokens sequentially and builds the AST by following simple, rule-based patterns.

AST

Programming Language Implementation Flowchart

An AST is a Data Structure used in compilers to represent the structure of code and the order of execution. Nodes in the tree represent concrete items in the code such as function definitions, function calls, print statements, etc.

Defining AST Nodes

struct RootNode
{
public:
  std::string type;
  std::vector<std::unique_ptr<RootNode>> body;

  virtual ~RootNode() = default;
};

struct FunctionDefNode : RootNode
{
public:
  std::string functionName;
};

struct FunctionCallNode : RootNode
{
public:
  std::string functionName;
};

struct PrintNode : RootNode
{
public:
  std::string arguments;
};

Parser Implementation

This is where we convert our tokens into an AST by defining parsing rules for each token in our language.

class Parser
{
  std::vector<Token> tokens;
  int pos = 0;
public:
  Parser(std::vector<Token>& t) : tokens(t) {}

  std::unique_ptr<RootNode> parseProgram() {
    auto program = std::make_unique<RootNode>();
    program->type = "Program";

    while (pos < tokens.size()) {
      program->body.push_back(parseStatement());
    }

    return program;
  }

  std::unique_ptr<RootNode> parseStatement() {
    std::unique_ptr<RootNode> r;

    if (tokens.at(pos).type == "bracket" && tokens.at(pos).value == "{")
    {
      return parseBlock();
    }
    if (tokens.at(pos).type == "functionDef")
    {
      return parseFunctionDef();
    }
    if (tokens.at(pos).type == "print")
    {
      return parsePrint();
    }
    if (tokens.at(pos).type == "name")
    {
      return parseFunctionCall();
    }

    throw std::runtime_error("Unknown statement");
  }

  std::unique_ptr<RootNode> parseFunctionDef() {
    auto r = std::make_unique<FunctionDefNode>();
    r->type = tokens.at(pos).type;
    pos++;
    if (tokens.size() > pos)
    {
      r->functionName = tokens.at(pos).value;
    }
    pos++;

    r->body.push_back(parseBlock());

    return r;
  }

  std::unique_ptr<RootNode> parseFunctionCall() {
    auto r = std::make_unique<FunctionCallNode>();
    r->type = "call";
    r->functionName = tokens.at(pos).value;
    pos++;

    return r;
  }

  std::unique_ptr<RootNode> parsePrint() {
    auto r = std::make_unique<PrintNode>();
    r->type = tokens.at(pos).type;
    pos++;
    if (tokens.size() > pos)
    {
      r->arguments = tokens.at(pos).value;
    }
    pos++;
    return r;
  }

  std::unique_ptr<RootNode> parseBlock() {
    auto r = std::make_unique<RootNode>();
    r->type = "Body";
    pos++;
    while (tokens.at(pos).value != "}")
    {
      r->body.push_back(parseStatement());
    }
    pos++;
    return r;
  }
};

Interpreter

The final step where we get the Output, this part interprets the AST and executes the code.

class Interpreter {
  std::unordered_map<std::string, FunctionDefNode*> functions;

public:
  void registerFunctions(RootNode* node) {
    if (auto fn = dynamic_cast<FunctionDefNode*>(node)) {
      functions[fn->functionName] = fn;
    }
    if (node != NULL && !node->body.empty()) {
      for (auto& child : node->body)
        registerFunctions(child.get());
    }
  }

  void run(RootNode* node) {
    if (node->type == "Program") {
      for (auto& stmt : node->body) {
        run(stmt.get());
      }
      return;
    }

    if (auto fn = dynamic_cast<FunctionDefNode*>(node)) {
      // don't auto-run
      return;
    }

    if (auto call = dynamic_cast<FunctionCallNode*>(node)) {
      auto it = functions.find(call->functionName);
      if (it == functions.end())
        throw std::runtime_error("Undefined function");

      run(it->second->body[0].get());
      return;
    }

    if (node != NULL && node->type == "Body") {
      for (auto& stmt : node->body)
        run(stmt.get());
    }

    if (auto p = dynamic_cast<PrintNode*>(node)) {
      std::cout << p->arguments << "\n";
    }
  }
};

An Easter Egg 🐣

Even though SimpleLang is minimal, it supports recursion. Since this language has no conditional statement this creates an infinite recursion

fun main{
  print Hii!!;
    main;
}
main;

Hii!!
Hii!!
Hii!!
...

Conclusion

In this post, we built a simple interpreted programming language from scratch. We have covered tokenization, parsing, AST construction, and interpretation. Even though SimpleLang is minimal, we saw how a language is implemented.

If you want to expand on this, feel free to do so - you may want to add variables and conditional statements to stop infinite recursion.

If you want to explore a real programming language, I’d recommend checking out Gleam. It is a relatively new language with a great community written on Rust, and built on top of Erlang’s OTP.