— 06 Mar 2018

reading , paper

Notes on NL2Bash

I’m SO excited about this paper. In the past few weeks I’ve been working with bash scripts a lot and I’ve never taken a particular liking on them until recently *cue Daft Punk’s Instant Crush playing on the background*. In the office we also have a long-running inside joke on voice commands and I can’t help but thinking about running bash scripts using voice commands! So here’s a short write-up that I’m writing as I go through the paper.

The paper is titled NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System by Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, and Michael D. Ernst. The paper delves into the domain of natural language (NL) control of the operating system. More specifically, it’s talking about how we can map natural language into Bash bash scripts. This way, we can perform tasks we usually perform using Bash scripts, such as file manipulation, by using natural language. Perhaps that means we can even use voice commands to automate many, many processes. :D

Natural language is way, way less precise than a formal language, so the idea that natural language is going to replace writing codes entirely is… well, we’re still far from that, I guess. However, we can utilize natural language to automate repetitive tasks like file manipulation or scripting that is application-specific.

At this point of the paper I have some ideas on how the natural language commands could be like (if I can join two csv files by using voice command I’ll be a happy camper), but I’ll have to see if this paper and I are on the same page. The good news is, the paper also provides a new dataset called NL2Bash that consists of various commonly used commands and expert-written descriptions, so hopefully there will be many more works using this dataset to come! Now the next question is: where does this dataset come from? I find this interesting because depending on the sources, building a dataset can be the hardest part (and the most time-consuming part :-)) especially when it is the first dataset of its kind. Turns out it is scraped from Q&A forums, tutorials, tech blogs, and course materials. The expert-written descriptions themselves are obtained from Bash programmers. At the end, the dataset consists of 9000 English-command pairs and 100 unique Bash utilities.

Anyway, I’ve just learned that there also exist similar datasets on semantic parsing which focuses on a particular programming language such as regex, SQL, and even IFTTT scripts! OMG right. But! Shell commands pose their own challenges: irregular syntax, wide domain coverage (> 100 bash utilities), and a large percentage of unseen words.

Shell command crash course

Three basic components make up a shell command1:

  • Utility (e.g. find, grep)
  • Option flags (e.g. -name, -i)
  • Arguments (e.g. "*.java", "TODO")

Something else I’ve just learned is that right now there are > 250 Bash utilities. Third party developers develop a new one every now and then. However, for the paper:

  • The domain for the commands are commands that come from 135 of the most useful utilities defined by the Linux user group
  • Only consider target commands that can be specified in a single line (one-liners), so functions are out of the scope
  • Omit I/O redirection, variable assignment, and compound statements (if, for, etc.) because those need to be interpreted in context
  • Omit non-bash program strings nested with language interpreters (awk, sed, python, java, etc.)

Corpus construction

More details are on the paper, but I’m particulary interested in the data cleaning process because I’ve been dealing with a ton of dirty data in the past few days so I’m always looking for other possible solutions/creative ways to solve issues that I might find relevant. ;)

  • Filtering: remove commands that violate the syntaxes specified in the Linux man pages, contain out-of-scope syntaxes, are usually used in multi-statement shell scripts, and contain non-bash language interpreters.
  • Cleaning: correct spelling errors using Peter Norvig’s probabilistic spell checker2, manually correct spelling errors that bypass the spell checker3, remove sudo, prompt characters, replace absolute pathnames to base names.

Corpus statistics

I’m leaving out some statistics but here are some that I find interesting:

  • Most mapping from NL sentences to Bash commands are one-to-one, but this is actually a many-to-many mapping problem because one Bash command can be described in many ways using NL, and for an NL description there can also exist many commands that are semantically equivalent.
  • The utility distribution is long-tailed–the utility find appeared 6268 times. The second utility is xargs which appeared 1047 times–quite a gap, huh? The 52 least common bash utilities (out of the 135 utilities) only appeared 984 times.

Data split

  • The data is split into train, dev, and test sets randomly (10:1:1). If a particular command in the dev and test sets also appear in the train set, they will be moved to the train set to avoid leaking.
  • The command-NL pairs are first clustered by the normalized NL descriptions. The normalization consists of: lowercasing, stemming (Snowball), and stop-word filtering. So at this point we’ll have a bunch of clusters where each cluster contains commands with similar normalized NL descriptions.


  • The authors opted to do manual evaluation by hiring three freelancers to independently evaluate the correctness of the top-3 translations for all test examples. The majority vote is used.
  • Test pairs that have the same normalized NL descriptions are evaluated as a single test instance.
  • Two types of accuracy are reported:
    • Top-k full command accuracy: percentage of test instances where a correct full command is ranked k or above in the model output
    • Top-k command template accuracy: percentage of test instances where a correct command template is ranked k or above in the model output
  • In this case, k = 3. We need to set k because one NL description may have multiple correct Bash command translations, thus the need to consider > 1 commands and/or command template. This is an issue that is also present in other NL-to-code translation problems.

We define a command template as a command with its arguments replaced by their semantic types. For example, the template of grep -l "TODO" *.java is grep -l [regex] [file].

  • There are other ways to evaluate NL-to-code translation (not used in this paper, but from previous related works) which you can find out more about in the paper! Seems very interesting but I haven’t had the time to delve into it yet.


  • Rich domain: As said before, Bash has a ton of different commands and utilities and all of these utilities focus on different applications. Text processing, process management, you name it. Most previous works focus on one particular domain only.
  • Out-of-vocabulary constants: File/path names, file properties, time expressions, etc.
  • Language flexibility: Many commands have a ton of possible flags, and multiple commands can be combined (nested commands such as those containing the pipeline | are in-scope, by the way). This is what I love about Bash but I can see this as a crazy challenge, on top of everything.
  • Idiomatic syntax: Syntax-tree-based parsing4 is not possible because the Bash interpreter parses command options using pattern matching so each command can have idiomatic syntax rules which makes things even more complicated.

Example: to specify an ssh remote, the format needs to be [USER@]HOST:SRC

Baseline System Performance

The authors performed evaluation using two neural machine translation models, Seq2Seq, CopyNet, and Tellina as baselines for future work. Both Seq2Seq and CopyNet are evaluated at three levels of token granularities, which are token, character, and sub-token.

  • Seq2Seq: Calculates the conditional probability of an output sequence (the code) given an input sequence (the NL command) using RNN encoder-decoder. The command sequence(s) with the highest conditional probabilities are chosen.
  • CopyNet: is an extension of Seq2Seq, it selects sub-sequences of the input sequence and emit them at proper places while generating the output sequence.
  • Tellina5: Stage-wise natural language programing model. Abstracts constants in NL to their semantic types -> performs template-level NL-to-code translation -> fills argument slots with extracted constants.


To tokenize both the NL and Bash commands, the authors used regex based natural language tokenizer and Bash parser augmented from Bashlex) respectively.


Split every constant in both the natural language and Bash commands. Each sub-token is padded with SUB_START and SUB_END.

For example, the file path “/home/dir03/*.txt” is converted to the sub-token sequence: SUB_START, “/”, “home”, “/”, “dir”, “03”, “/”, “*”, “.”, “txt”, SUB_END.


  • When it comes to command structure, token-level is superior compared to sub-token-level and character-level (remember that we have two means for evaluation: command structure and full command!)
  • However, when it comes to full command accuracy, sub-token level and character-level are superior. Structure accuracy kinda sucks tho
  • Copying slightly improves character-level models (probably because OOV are rare)
  • In the token-level, copying improves full command accuracy, but the command structure drops (possibly due to mismatch between the source constants and the com- mand arguments). Same thing with sub-token level.
  • Learned string-level transformations outperform manually written heuristics (in this case are in Tellina) when enough data is provided.

Error analysis

  • Sparsity in training data: NL describing functions that are rare in the train set. Potential solution: semi-supervised learning using unlabeled Bash commands/Linux man pages
  • Common errors of RNN translation models: misinterpreting chunks of NL descriptions leads to generating a different utility/flag
  • Constant enumeration: difficulties in extracting constants correctly
  • Complex task: Some NL descriptions are best separated into different sentence instead
  • Other: failure in translating specifications in (), long descriptions of regular expressions, and intelligible/nongrammatical NL descriptions. There are also errors propagated from preprocessing steps (tokenizing, etc.)

Overall possible solution: use separate RNNs for template translation and argument filling.


OK so this is definitely interesting and is something that is relatively new to me too so I had fun reading the paper! Although I still have to fill in sooo many gaps in my knowledge, especially in the NL-to-code problem area, I also learned a lot from the methodologies which I’m sure will be useful for me sooner or later (and not necessarily to solve an NL-to-code problem).

Now I’ll take a look at the dataset and see if there’s anything I can incorporate to my usual/frequent data manipulation steps!

  1. FYI, ExplainShell is a godsend when it comes to explaining shell commands. 

  2. I read this a while ago, it’s a super good read, and the future work is an interesting read as well! I always love reading Peter Norvig’s thought processes on solving problems. 

  3. You can never run away from this issue. 

  4. Further reading: http://www1.cs.columbia.edu/~sedwards/classes/2003/w4115f/ast.9up.pdf 

  5. Related work, also note to future self: http://victorialin.net/pubs/tellina_tr_2017.pdf