rain1

  • Started 2 topics
  • Posted 3 times

Recent activity

Started Escaping

Hello! Here is my blog post about escaping.

Escaped Text

The fundamental idea of escaping is to provide an injective (and hence invertible) function from arbitrary strings to strings in which some pattern does not occur. This could mean that a set of reserved characters does not occur in the output, or that a certain character does not occur alone.
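As a minimal sketch of such a function in Python: the reserved character (a comma), the escape character (a backslash) and the names escape and unescape are all assumptions picked for illustration, not part of any particular format.

```python
RESERVED = ","   # hypothetical reserved character, e.g. a field separator
ESCAPE = "\\"    # hypothetical escape character


def escape(text: str) -> str:
    # Escape the escape character first, otherwise the mapping is not invertible.
    return text.replace(ESCAPE, ESCAPE + ESCAPE).replace(RESERVED, ESCAPE + "c")


def unescape(text: str) -> str:
    out = []
    chars = iter(text)
    for ch in chars:
        if ch != ESCAPE:
            out.append(ch)
            continue
        nxt = next(chars, None)
        if nxt == "c":
            out.append(RESERVED)
        elif nxt == ESCAPE:
            out.append(ESCAPE)
        else:
            raise ValueError("invalid escape sequence")
    return "".join(out)


original = "a,b\\c"
escaped = escape(original)
restored = unescape(escaped)
```

The escaped form never contains a comma, and unescape gives back exactly the original string, which is the injective property described above.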

The netstring approach of putting the length before the data is an alternative to escaping. NUL delimited text is another alternative to escaping in cases where text cannot contain \0 (like C strings).
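A rough sketch of the netstring idea in Python, assuming the usual length:data, layout. The function names are made up for this example.

```python
def encode_netstring(data: bytes) -> bytes:
    # Length in decimal ASCII, a colon, the raw payload, then a comma.
    return str(len(data)).encode("ascii") + b":" + data + b","


def decode_netstring(buf: bytes) -> tuple[bytes, bytes]:
    # Returns (payload, remaining bytes after this netstring).
    length_text, sep, rest = buf.partition(b":")
    if not sep or not length_text.isdigit():
        raise ValueError("missing or malformed length prefix")
    length = int(length_text)
    if len(rest) < length + 1 or rest[length:length + 1] != b",":
        raise ValueError("payload too short or missing trailing comma")
    return rest[:length], rest[length + 1:]


encoded = encode_netstring(b"hello world!")
payload, remainder = decode_netstring(b"3:a,b,rest")
```

Because the reader knows the length up front, the payload can contain commas, colons or any other byte without being escaped.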

The ability to easily escape data we want to print out and unescape data we are taking in as input is important if we want easy and correct "plumbing" between programs. It seems to be overlooked a lot. One of the things that's amazing to lisp beginners coming from other languages is how easy it is to print out and read back data.

The rest of this document is a review of some of the ways escaping comes up in computing.

url/uri encoding

Data inside web addresses must avoid the / separator and several other characters, as documented in the URI RFC. To allow these chars to still be passed to a web application we use percent encoding with hex digits: %XX.

The choice of % as the escape char is good. Since it does not use the more common \, URLs can usually be put into a string literal without another layer of escaping.
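For illustration, here is percent encoding done with Python's standard urllib.parse module. The sample path segment is made up, and safe="" is passed so that / is encoded as well.

```python
from urllib.parse import quote, unquote

# A hypothetical path segment containing a separator, a space and a backslash.
segment = "a/b c\\d"

encoded = quote(segment, safe="")   # encode everything outside the unreserved set
decoded = unquote(encoded)
```

Here encoded is a%2Fb%20c%5Cd, and decoding it gives back the original segment.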

Originally, in RFC 2396, the \ was not allowed inside a URL. It was marked as "unwise". In the latest spec, RFC 3986, it is allowed but should be escaped as %5C.

HTML escaping

HTML and XML allow you to mix arbitrary text and <tags>. So to include in our text the same metacharacters that are used to describe tags, they need to be escaped.

The & char is used to escape special characters in HTML. They call it a character reference. Characters can be referenced by decimal, hex or by name.

According to the rules for parsing data inside HTML tags one only really needs to escape & and <. But it is advised to also escape >. The main restriction on HTML data is that it must not contain the string "</".
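A small example using Python's standard html module. Note that html.escape is more cautious than the minimum described above: it also escapes > and, by default, quote characters.

```python
import html

text = "<b>Tom & Jerry</b>"

escaped = html.escape(text)        # &, < and > become character references
restored = html.unescape(escaped)

# The same character referenced by decimal, by hex and by name.
three_ways = html.unescape("&#60;&#x3c;&lt;")
```

All three references in the last line decode to the same < character.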

sql escaping

A lot of websites have been exploited because user data ended up unescaped in part of an SQL query. If a web application forgets to escape something user controlled that ends up in a query, an attacker can use things like delimiter collision to perform their own SQL queries and take over the site.
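To make the delimiter collision concrete, here is a sketch with Python's built-in sqlite3 module and a made-up users table. Instead of escaping by hand, the safe version uses a parameterized query, which is a different technique from escaping: the data is never spliced into the SQL text at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

# Hostile input: the quote closes the string literal early.
hostile = "x' OR '1'='1"

# Unsafe: user data is spliced into the query text unescaped.
unsafe_sql = "SELECT name FROM users WHERE name = '" + hostile + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safer: the ? placeholder keeps the data out of the SQL syntax entirely.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()
```

The spliced query matches every row because the attacker's OR clause became part of the SQL, while the parameterized query looks for a user literally named x' OR '1'='1 and finds nothing.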

Command line arguments and shell scripting

Command line arguments starting with - are usually understood as flags. The magic -- flag enables one to pass further arguments which are interpreted literally instead of checked to see if they are flags.
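Python's argparse follows the same -- convention, so it can be used to show the effect. The program name and its -f flag are invented for this sketch.

```python
import argparse

# A hypothetical tool with one flag and any number of file arguments.
parser = argparse.ArgumentParser(prog="rm-sketch")
parser.add_argument("-f", "--force", action="store_true")
parser.add_argument("files", nargs="*")

# Everything after "--" is taken literally, even if it looks like a flag.
args = parser.parse_args(["-f", "--", "-f", "--force"])
```

The first -f sets the flag, and the two arguments after -- end up in args.files as ordinary file names.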

On the ext4 filesystem, file names/paths can contain any byte except 0. In shell scripting it is often useful to use find -print0 to get a NUL delimited list of filenames to loop over, although usually people use newline or space as a delimiter, which is not 100% reliable but works almost all the time.
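A sketch of why the NUL delimiter is the robust one, written in Python. The byte strings stand in for what find -print0 and plain find would output for two files, one of which has a newline in its name. The helper name split_nul is made up.

```python
def split_nul(data: bytes) -> list[bytes]:
    # Every name is terminated by NUL, so drop the empty piece after the last one.
    return [name for name in data.split(b"\0") if name]


# Two hypothetical file names: "a b" and "line1\nline2".
nul_output = b"a b\0line1\nline2\0"
newline_output = b"a b\nline1\nline2\n"

from_nul = split_nul(nul_output)
from_newline = [name for name in newline_output.split(b"\n") if name]
```

Splitting on NUL recovers the two names exactly, while splitting on newline reports three names, none of which is the second file.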

base64 encoding

As base64 uses such a small alphabet of plaintext characters, any data that has been base64 encoded can basically be put in any context without further escaping: in URLs, in HTML, in strings inside programming languages, in email, etc.
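One caveat worth showing: the standard base64 alphabet includes +, / and =, which do have meaning inside URLs, so for that context there is a URL-safe variant of the alphabet. A short example with Python's standard base64 module, using two arbitrary input bytes:

```python
import base64

raw = b"\xfb\xff"   # arbitrary bytes chosen so the output uses + and /

standard = base64.b64encode(raw)
url_safe = base64.urlsafe_b64encode(raw)   # uses - and _ in place of + and /

round_trip = base64.urlsafe_b64decode(url_safe)
```

The standard form is +/8= and the URL-safe form is -_8=, and both decode back to the same two bytes.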

string escaping in programming languages

The primary place escaped text happens is in string literals. Double quotes are used to contain and delimit the text.

The syntax of a string literal, as a regex, is at a very basic level something like: /"([^"\\]|\\"|\\\\)*"/. wiki gets this. It's not "[^"]*", which would end the string at the first escaped quote.
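To see the difference between the two patterns, here is a sketch in Python. The escape-aware pattern is slightly more general than the one above (it accepts a backslash followed by any character), and the sample source text is made up.

```python
import re

# Naive pattern: stops at the first double quote, even an escaped one.
NAIVE = re.compile(r'"[^"]*"')

# Escape-aware pattern: ordinary characters, or a backslash plus any character.
STRING_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')

source = r'say "a \"quoted\" word" now'

naive_match = NAIVE.search(source).group(0)
aware_match = STRING_LITERAL.search(source).group(0)
```

The naive pattern cuts the literal off right after the first backslash, while the escape-aware one captures the whole "a \"quoted\" word" literal.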

  • Rust - basic. Newlines are allowed. A newline and the whitespace after it are stripped if preceded by \. Escapes are \x, \u and \n \r \t. It also has raw string literals, a bit like Lua's long ones.
  • Go
  • C
  • POSIX shell. Shell supports things like glob * and ~ which expand to multiple files or a path. To enter a literal asterisk or tilde you need to either escape it (\*, \~) or put it in a string.
  • bash has weak and strong quoting, supports escapes much like C.
  • Python has a few different sorts of string literal.
  • Ruby - needs # escaped. Also has %(...)
  • Lua - \z is special. \xXX, where XX is a sequence of exactly two hexadecimal digits. \u{XXX} is for unicode, brackets mandatory. The long literal strings [==[ ]==] are a unique approach.
  • Javascript - Probably the simplest string language: x for hex, u for unicode. Special escape chars '"\bfnrtv. Also supports \0.

Another overview here: codecodex - escape sequences.

regex escaping

. is usually a metacharacter representing any char, so if you want to match a literal dot you need \..
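Most regex libraries ship a helper for this. In Python it is re.escape, shown here on a made-up literal:

```python
import re

literal = "1.5"

unescaped_hit = re.fullmatch(literal, "1x5")            # . matches any char
escaped_pattern = re.escape(literal)                    # the dot gets a backslash
escaped_hit = re.fullmatch(escaped_pattern, "1x5")      # no longer matches
exact_hit = re.fullmatch(escaped_pattern, "1.5")
```

Without escaping, the pattern 1.5 also matches 1x5. After escaping it only matches the literal text.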

A very weird choice by sed is that ( and ) are matched literally, and to use parens as grouping you need \( \). This is a kind of reverse escaping which makes it difficult to remember. I think it is done for the purpose of brevity, assuming that matching brackets literally is done more often than grouping is done - I find the opposite to be true though.

Regex often falls victim to leaning toothpick syndrome.

UNIX terminals

UNIX terminals use in-band control codes. This means just catting a file can completely mess up your terminal if the file contains these control codes. This is a mixed bag of good and bad. Command line programs should escape all terminal control sequences to avoid in-band signalling.
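One possible way a program could do this, sketched in Python. The \xNN output format and the choice to let newline and tab through are assumptions of this example, not a standard. The backslash itself is escaped too, so the output stays unambiguous.

```python
def escape_control(text: str) -> str:
    out = []
    for ch in text:
        code = ord(ch)
        if ch == "\\":
            out.append("\\\\")              # keep the mapping invertible
        elif ch in "\n\t":
            out.append(ch)                  # assumed harmless, passed through
        elif code < 0x20 or 0x7F <= code <= 0x9F:
            out.append(f"\\x{code:02x}")    # C0 controls, DEL and C1 controls
        else:
            out.append(ch)
    return "".join(out)


# ESC [ 31 m would normally switch the terminal to red text.
shown = escape_control("\x1b[31mred\n")
```

The ESC byte is printed as the four visible characters \x1b, so the terminal never sees an actual control sequence.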

Gopher

Gopher terminates text files with a line containing only a period. Spec here. Because of this it's a bit of extra work to have lines that start with a period in a gopher text page:

Note:  Lines beginning with periods must be prepended with an extra
     period to ensure that the transmission is not terminated early.
     The client should strip extra periods at the beginning of the line.
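The rule in that note can be sketched in Python like this. The function names are made up, and CRLF line endings are assumed.

```python
def dot_stuff(lines: list[str]) -> list[str]:
    # Server side: prepend an extra period to lines that begin with one.
    return ["." + line if line.startswith(".") else line for line in lines]


def dot_unstuff(lines: list[str]) -> list[str]:
    # Client side: strip the extra leading period again.
    return [line[1:] if line.startswith(".") else line for line in lines]


def frame(lines: list[str]) -> str:
    # Stuffed lines followed by the terminating line with a lone period.
    return "".join(line + "\r\n" for line in dot_stuff(lines)) + ".\r\n"


def unframe(payload: str) -> list[str]:
    lines = payload.split("\r\n")
    end = lines.index(".")   # first line that is exactly one period
    return dot_unstuff(lines[:end])


document = [".hidden", "text", "."]
wire = frame(document)
received = unframe(wire)
```

Even a line consisting of just a period survives the trip, because on the wire it becomes two periods and cannot be confused with the terminator.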

Started What UNIX shell could have been

Hello. I have written a blog post about an idea for UNIX shell. The canonical link is here https://rain-1.github.io/shell-2 but I have copied it so that people are not forced to click out to a different site. I look forward to your thoughts on the general topics.

What UNIX shell could have been

The shell (bash or whatever) is an excellent tool that saves people a huge amount of work. Being able to easily script complex jobs together is one of the best things. It does have some weaknesses, though, that I feel could be improved.

The two biggest weaknesses in shell, in my opinion, are the quoting and escaping mess and secondly that all the objects are strings. I've talked about the quotation stuff before so I won't cover that here. My idea for improving it would be to make separate (dynamic) data types for strings, paths and command line flags.

This could be implemented in a simple (but ugly) way by encoding each object into strings with a tag saying what type they are:

  • "sfoo" for a string
  • "fhelp" for a flag
  • "p/dev/random" for a path

Maybe there's a more aesthetic way to do this, open to suggestions.
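Here is a rough sketch of that tagging scheme in Python. The helper names encode_arg and decode_arg are hypothetical, and only the three tags listed above are handled.

```python
# One-character tag -> type name, following the s/f/p scheme above.
TAGS = {"s": "string", "f": "flag", "p": "path"}


def encode_arg(kind: str, value: str) -> str:
    for tag, name in TAGS.items():
        if name == kind:
            return tag + value
    raise ValueError(f"unknown argument type: {kind!r}")


def decode_arg(arg: str) -> tuple[str, str]:
    if not arg or arg[0] not in TAGS:
        raise ValueError(f"untagged argument: {arg!r}")
    return TAGS[arg[0]], arg[1:]


flag = encode_arg("flag", "help")
kind, value = decode_arg("p/dev/random")
dash_string = decode_arg(encode_arg("string", "-rf"))
```

A string that happens to start with - stays a string after the round trip, because its type comes from the tag and not from its first character.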

This change can't be done just by writing a new shell though. Every UNIX tool that we have (ls, cat, grep, jq, …) would need to conform to this protocol. It would probably steamroll over 'dd' (which has an idiosyncratic argument style).

Advantages and Drawbacks

What would be good about this idea? Command line tools would throw an error if given the wrong input, instead of what they currently do: attempt to continue with a misinterpreted string. This could be considered an improvement if you're interested in your scripts' correctness. Issues like the recent GPG security problem CVE-2018-12020 would have been avoided.
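On the tool side, the check could look something like this sketch in Python. It assumes arguments arrive tagged with a leading p for paths and f for flags, and the function names require_path and cat_sketch are hypothetical.

```python
def require_path(arg: str) -> str:
    # Accept only arguments tagged as paths (leading "p"), reject the rest.
    if not arg.startswith("p"):
        raise TypeError(f"expected a path argument, got {arg!r}")
    return arg[1:]


def cat_sketch(args: list[str]) -> list[str]:
    # A hypothetical cat-like tool: every argument must be a path.
    return [require_path(arg) for arg in args]


# A file really named "-rf" is fine, since it is tagged as a path.
paths = cat_sketch(["p-rf", "p/etc/hostname"])
```

Passing a flag where a path is expected, for example cat_sketch(["f-rf"]), raises a TypeError instead of the tool carrying on with a misread argument.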

But the drawback might be that some things take a bit more programming to achieve. For example you would have to explicitly convert a string to a path. You would have to build up paths and apply regex and things to them in a different way than you do it on strings. Maybe the language could be designed to make these things easier.

The current mess

It's worth mentioning the current "solution". In the current system if something starts with a - it's a flag. This leaves the problem of filenames that start with -. They are very rare so it isn't a big problem: We only encounter it occasionally. Some might never encounter it.

Anyway the answer is that tools can provide the -- flag to say that the next argument (or all following ones) are not flags.

You can read about that here

We also have incredibly good documentation about working around the difficulties of coping with paths and spaces and stuff in shell:

Now these are excellent resources that I value, but I believe that it is incidental complexity, and a better designed shell would result in much less documentation and far fewer edge cases to worry about.

Some shells, such as rc, have added array data types, but these only work internally. Across process boundaries everything is squeezed through the string object.

Summary

I believe we make all of our lives easier by improving the tools we use. Shell has a couple of big problems that can actually be solved. There have been experiments to solve the quotation issues, such as execline and s. And I am proposing the idea of a very basic dynamic type system to solve the 'in-band signalling' issues with command line args.

Posted in Who's Around?

Hello. I am interested in bootstrapping programming languages from nothing.