Building a language?

Started by oconnor0

oconnor0

I'm a professional compiler developer and so find myself annoyed with existing languages.

What projects or languages exist that build a language up from "first principles"? Lisp and Scheme are the obvious two since they both allow constructing new abstractions on top of the existing ones. I'm interested in something more statically typed.

I have this idea about a language where each layer of abstraction is built on top of the previous one and is done in such a way that the language is built from the "ground up". I love functional languages but find them so disconnected from the actual hardware we run on.

I don't know exactly what I'm asking for… :D

Kartik Agaram

I've actually been thinking about this for almost a year now. Since 2013 I've had a bee in my bonnet about getting to a stack that is easy for newcomers to understand rather than for insiders to maintain. Initially I thought the answer lay in building the perfect programming language, so I built one based on Lisp. However, my language was incredibly slow, and I simultaneously started to realize that linguistic mechanisms didn't really attack the real problem.

There's a blind spot in the way we program: we diligently encode the rules we want to automate, but leave gaping holes in the reasons why we chose the rules we did. When the reason for some design decision is not recorded in code, there's literally not enough information in a codebase for a newcomer to understand it deeply without talking to old hands. Entropy becomes inevitable, as old ways become cargo-culted because people can't decide if some seemingly-vestigial feature is truly obsolete, or a subtle regression waiting to happen.

Tests help encode some of the reasons, and their use has greatly expanded (with good reason) in the past decade. But we're not yet at the point where we can blindly deploy software if all automated tests pass. I decided it's incredibly valuable to get there as quickly as possible, so that newcomers would be able to ask arbitrary what-if questions about a codebase and learn about it interactively by running it in various situations rather than just passively by trying to read its code. "Why is this line of code written like this?" Change it, run tests, see a failure, go read what it does. "Oh, that's why." Or if no tests fail, then assume with confidence that it doesn't matter, and perhaps you found an improvement.

In chasing after this goal, I ended up creating to a new language called Mu (named after the next step after lambda) which is:

a) designed to be easy to compile down to machine code. Statically typed, minimal impedance-mismatch with Assembly.

b) expressive enough to support co-evolving an OS along with the language, just like C co-evolved with Unix.

In the process I found a third benefit:

c) it's easier to teach to kids than either Lisp or C.

Lately, Mu has kinda reached a milestone where I've managed to gain confidence in several key mechanisms. The next step is to apply these lessons in a less "toy" context, something that we can use for building real stuff rather than just teaching kids. I've been at an impasse on how to do this because I get hung up on precisely the same property you mentioned: bootstrapping. I want a hierarchy of languages built using my layers, that starts from machine code and extends all the way up to a high-level language like Lisp. I want each layer to be gossamer-thin and easy to understand. I don't want the language to be "self-hosted"; building a compiler in its source language is great for geek cred but makes understanding it much harder. I want each layer to only use what has come before, allowing readers to go over layers sequentially.

Some days/weeks I think the way to achieve this goal is to start with OpenBSD and gradually reshape the C it's written in, without regard for backwards compatibility (though I'm not aiming for Mu in this thread). Other days/weeks I get obsessed with building a compiler to transform Mu into a simpler VM that can be efficiently interpreted. Mu is just so much safer than C. (Everything is bounds-checked and use-after-free errors are impossible, at a fraction of the implementation complexity of Rust but at the cost of runtime checks.)

So: I don't know if this is what you might be asking for :) But I love talking about it, so feel free to ask me more questions, let's see if we can figure that out.


To answer your precise question, here are some interesting bootstrapping projects I've found:

In general, concatenative languages like Forth seem like a very useful milestone for a bootstrapped language.

oconnor0

Yeah, I've looked at concatenative languages because of that, but I haven't been convinced they scale up nicely in abstractions.

oconnor0

Your thoughts about having a language that enables the encoding of why along with what and how is fascinating. Do you feel like you've captured that with Mu? Or gotten closer?

Abstractions and naming are really hard. I have spent hours stuck on a problem because I couldn't name some transformation or operation. This is one of the reasons I've been reticent to "believe" in concatenative languages. They can read beautifully - if all the words are small and properly named. As soon as it turns into bi dup dup swap my eyes glaze over.

And abstractions have a tendency to leak and to miss important details - requiring ugly hacks or poking at the innards to do what is needed.

I am also very intrigued by both total languages and dependently-typed languages. But one of my big peeves about languages is when they have dramatically different syntaxes for different things. C++ being my favorite example. Recursion is relatively easy to write, but attempting to do the same recursion at compile-time in the template system is awful. C++ template syntax is so foreign that I find myself incapable of thinking in it beyond the simplest of expression.

Kartik Agaram

Do you feel like you've captured that [encoding of why] with Mu? Or gotten closer?

Yes, it definitely seems like an improvement. The catch is that it works best if baked in from the lowest level up. So I'm stumped on how to apply it to a stack without needing to reinvent all of modern software.

I too used to obsess about abstraction, but I don't anymore. The key for me was to realize that I can always change an abstraction when I realize it's not good enough. Abstractions are only hard when we feel disempowered from changing them in the name of backwards compatibility, componentization and other such false gods. Once you can change an abstraction it no longer has to be perfect, just the best idea you have had so far. You can just switch to better abstractions as new issues come up.

oconnor0

Attempting to reinvent all of modern software is probably a losing proposition. Can you embed reasoning in from Mu from its lowest layer?

Changing abstractions only works when I have control over the abstraction and its users. When someone else provides the abstraction or when I am responsible for providing that abstraction for others, I am not able nor empowered to change it.

Kartik Agaram

I've advocated before [1, 2] that we all use distinct words for "I know how this works but it's useful not to think about its internals most of the time" and "I don't care how this works; somebody else provides it for me". I call the first "abstraction" and the second "service". Conflating the two muddies many a conversation about abstractions and good software practice.

(I didn't follow your Mu question. Perhaps there's a typo?)

oconnor0

Basically, can you write the lowest layer of Mu on top of an existing OS and attempt to encode the appropriate assumptions and reasoning into that layer rather than starting from bare metal?

Kartik Agaram

Yes, that is basically what the current version does. For example, I started out implementing raw-mode terminal manipulation primitives using ncurses, then switched to the more lightweight termbox libraries. I've been gradually stripping out everything between Mu and the kernel, and I kinda understand now how the termbox library calls translate to raw ioctl()s.

However, I'm not sure how to wrap around certain properties of kernels that programmers may care about and want to write tests for. For example, fsync() has many gotchas. A program that used fsync() would reasonably want to check against undesirable behaviors in its automated tests. However, to make fsync() testable would require giving access from userland to kernel-private data structures.

marco

What projects or languages exist that build a language up from "first principles"?

What language doesn't try to be based on "first principles"? It depends on where you start building from. My language Egel can be thought of as being derived from "first principles" since it is based entirely on untyped eager combinatorial rewriting. But Lisp is also derived from first principles as a list processing language, and so is C, a language based around some abstract Von Neumann machine.

Duncan Cragg

Nice to see some activity here!

My own language is aimed at non-coders - at least at the level of those comfortable with spreadsheet programming. Like spreadsheets, it works well via a 2D 'visual' interface.

Its 'style' could be described as graph rewriting or state transformation. There isn't really a known 'paradigm' it falls under. I call it 'Functional Observer': linked data objects set their state as a function of their own current state plus the state of peer objects observed 'through' those links.

Duncan Cragg

@Kartik:

Thanks!

I'm building it in … C ! :-)

Well, the core in C, the outside bits for Android, etc, in C++.

I'm currently playing with Vulkan for the render, which gives me an easy route into 3D when I'm ready for that.

Kartik Agaram

Given that my Mu is also in C[1], I think we have a lot in common here. What I'm curious to see is if there's a graphics stack that I can steal from you. I gave up on graphics programming 15 years ago when I left Allegro on DOS and switched to Linux. X has always been a pain in my backside.

[1] Well, C++, but in a very C style.

Duncan Cragg

Oh well it's a bit early days to share code .. the fullest implementation of the core ideas is in 'NetMash' in Java. The C version (Onex) is really a re-implementation, along with lessons learned and an outside-in, UI-first approach - to get early feedback (from my core demographic: my family - wife and two daughters!!)

Duncan Cragg

I like the idea of "C++ in a C style" - I do like some of the benefits of C++ but I ended up hating it when I was a C++ contractor. C is a simple, honest language.

I'll put my first slice of code up on GitHub as soon as it's vaguely coherent.. :-)

Duncan Cragg

Just realised I didn't get to the point w.r.t. the OP's question and Onex.

My approach is layered in the sense that Onex can easily become an operating system (it's being written in C for that reason).

It's built from four components - Object Network Format, Protocol, Rules and Types:

  • 'ONF': really the core mechanism that handles my 'objects' and their observation, notification and persistence
  • 'ONP': a P2P distribution protocol that allows objects to observe over the net
  • 'ONR': the Onex programming language
  • 'ONT': the i/o, the types, of objects (an object has a broad type but dynamically not statically)

ONF and ONP can replace or be built into the operating system, providing an API on which to build ONR and the ONT application and i/o types.

The programming model is so 'inside out' relative to conventional approaches, that it does actualy make sense to rebuild the whole stack, ground up. I guess IP networking can stay, though!

Realistically, it'll never replace current approaches, so my OS ambitions are limited to IoT devices (which probably won't even need IP :-) .