Creating a Programming Language from Scratch
So you want to make a programming language from scratch?
This post dives into creating one and how you can design one for yourself.
More importantly, we will be exploring the insides of how an interpreted language works under the hood.
There can be differences in implementation for the language we’re going to build today, but the core concepts will remain the same.
But first let’s set the scene for what we’re gonna do today.
We will be building a toy interpreted language - something similar to BhaiLang. We are going to call our language SimpleLang. All the source code will be in C++.
Here is an example of what our language will look like 👇:
fun main{ print Hii!!}main;Hii!!
You can run the above code snippet by clicking the “Run” button below:
SimpleLang Interpreter
Try it yourself:Output
Hii!!
From Source Code to Execution
Here are the steps by which we’ll turn this code into something a computer can understand and execute it.
The first step is reading the source file into memory so we can process it.
int main() { std::ifstream inFile("./main.sl"); std::string fileData( (std::istreambuf_iterator<char>(inFile)), std::istreambuf_iterator<char>() ); inFile.close();}Tokenization
At this stage, we have the entire source code as a single string in memory and we need to find a way to break it down into meaningful components.
So next, we tokenize the source code by breaking it into meaningful symbols that our Interpreter can understand.
This process is called tokenization, where the raw source code is converted into symbols that our interpreter can understand.
struct KeywordsType { std::string name; std::string type;};
std::vector<KeywordsType> Keywords = { KeywordsType{"fun", "functionDef"}, KeywordsType{"print", "print"}};
struct Token { std::string type; std::string value;};
static void Tokenizer( std::vector<std::string>& tokens, std::vector<Token>& keywords, std::string& fileData ) { std::smatch m; std::regex r("\\{|\\}|[^;\\s{}]+");
for (std::sregex_iterator it(fileData.begin(), fileData.end(), r), end; it != end; ++it) { tokens.emplace_back(it->str()); }
for (auto& j : tokens) { if (j == "{" || j == "}") { Token t; t.type = "bracket"; t.value = j; keywords.push_back(t); continue; }
bool isName = true; for (auto& i : Keywords) { if (i.name == j) { Token t; t.type = i.type; t.value = i.name; keywords.push_back(t); isName = false; break; } }
if (isName) { Token t; t.type = "name"; t.value = j; keywords.push_back(t); } }}Tokens are defined meaningful characters or strings and the process of converting raw strings to tokens is called Tokenization.
Parser
We now need to parse these tokens into an AST - Abstract Syntax Tree.
The parser consumes tokens sequentially and builds the AST by following simple, rule-based patterns.
AST
An AST is a Data Structure used in compilers to represent the structure of code and the order of execution. Nodes in the tree represent concrete items in the code such as function definitions, function calls, print statements, etc.
Defining AST Nodes
struct RootNode{public: std::string type; std::vector<std::unique_ptr<RootNode>> body;
virtual ~RootNode() = default;};
struct FunctionDefNode : RootNode{public: std::string functionName;};
struct FunctionCallNode : RootNode{public: std::string functionName;};
struct PrintNode : RootNode{public: std::string arguments;};Parser Implementation
This is where we convert our tokens into an AST by defining parsing rules for each token in our language.
class Parser{ std::vector<Token> tokens; int pos = 0;public: Parser(std::vector<Token>& t) : tokens(t) {}
std::unique_ptr<RootNode> parseProgram() { auto program = std::make_unique<RootNode>(); program->type = "Program";
while (pos < tokens.size()) { program->body.push_back(parseStatement()); }
return program; }
std::unique_ptr<RootNode> parseStatement() { std::unique_ptr<RootNode> r;
if (tokens.at(pos).type == "bracket" && tokens.at(pos).value == "{") { return parseBlock(); } if (tokens.at(pos).type == "functionDef") { return parseFunctionDef(); } if (tokens.at(pos).type == "print") { return parsePrint(); } if (tokens.at(pos).type == "name") { return parseFunctionCall(); }
throw std::runtime_error("Unknown statement"); }
std::unique_ptr<RootNode> parseFunctionDef() { auto r = std::make_unique<FunctionDefNode>(); r->type = tokens.at(pos).type; pos++; if (tokens.size() > pos) { r->functionName = tokens.at(pos).value; } pos++;
r->body.push_back(parseBlock());
return r; }
std::unique_ptr<RootNode> parseFunctionCall() { auto r = std::make_unique<FunctionCallNode>(); r->type = "call"; r->functionName = tokens.at(pos).value; pos++;
return r; }
std::unique_ptr<RootNode> parsePrint() { auto r = std::make_unique<PrintNode>(); r->type = tokens.at(pos).type; pos++; if (tokens.size() > pos) { r->arguments = tokens.at(pos).value; } pos++; return r; }
std::unique_ptr<RootNode> parseBlock() { auto r = std::make_unique<RootNode>(); r->type = "Body"; pos++; while (tokens.at(pos).value != "}") { r->body.push_back(parseStatement()); } pos++; return r; }};Interpreter
The final step where we get the Output, this part interprets the AST and executes the code.
class Interpreter { std::unordered_map<std::string, FunctionDefNode*> functions;
public: void registerFunctions(RootNode* node) { if (auto fn = dynamic_cast<FunctionDefNode*>(node)) { functions[fn->functionName] = fn; } if (node != NULL && !node->body.empty()) { for (auto& child : node->body) registerFunctions(child.get()); } }
void run(RootNode* node) { if (node->type == "Program") { for (auto& stmt : node->body) { run(stmt.get()); } return; }
if (auto fn = dynamic_cast<FunctionDefNode*>(node)) { // don't auto-run return; }
if (auto call = dynamic_cast<FunctionCallNode*>(node)) { auto it = functions.find(call->functionName); if (it == functions.end()) throw std::runtime_error("Undefined function");
run(it->second->body[0].get()); return; }
if (node != NULL && node->type == "Body") { for (auto& stmt : node->body) run(stmt.get()); }
if (auto p = dynamic_cast<PrintNode*>(node)) { std::cout << p->arguments << "\n"; } }};An Easter Egg 🐣
Even though SimpleLang is minimal, it supports recursion. Since this language has no conditional statement this creates an infinite recursion
fun main{ print Hii!!; main;}main;Hii!!Hii!!Hii!!...Conclusion
In this post, we built a simple interpreted programming language from scratch. We have covered tokenization, parsing, AST construction, and interpretation. Even though SimpleLang is minimal, we saw how a language is implemented.
If you want to expand on this, feel free to do so - you may want to add variables and conditional statements to stop infinite recursion.
If you want to explore a real programming language, I’d recommend checking out Gleam. It is a relatively new language with a great community written on Rust, and built on top of Erlang’s OTP.