Galuh Sahid

Notes on NL2Bash

— 06 Mar 2018

I’m SO excited about this paper. In the past few weeks I’ve been working with bash scripts a lot, and I’d never taken a particular liking to them until recently *cue Daft Punk’s Instant Crush playing in the background*. In the office we also have a long-running inside joke about voice commands, and I can’t help but think about running bash scripts using voice commands! So here’s a short write-up that I’m writing as I go through the paper.

The paper is titled NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System by Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, and Michael D. Ernst. The paper delves into the domain of natural language (NL) control of the operating system. More specifically, it looks at how we can map natural language into Bash commands. This way, we can perform tasks we usually perform with Bash scripts, such as file manipulation, by using natural language. Perhaps that means we can even use voice commands to automate many, many processes. :D

[Table from the paper: examples of natural language commands and the resulting Bash commands.]

Natural language is way, way less precise than a formal language, so the idea that natural language is going to replace writing code entirely is… well, we’re still far from that, I guess. However, we can utilize natural language to automate repetitive tasks such as file manipulation or application-specific scripting.

At this point in the paper I have some ideas about what the natural language commands could look like (if I can join two csv files by using a voice command I’ll be a happy camper), but I’ll have to see if this paper and I are on the same page. The good news is, the paper also provides a new dataset called NL2Bash that consists of various commonly used commands paired with expert-written descriptions, so hopefully there will be many more works using this dataset to come! Now the next question is: where does this dataset come from? I find this interesting because, depending on the sources, building a dataset can be the hardest part (and the most time-consuming part :-)), especially when it is the first dataset of its kind. It turns out it is scraped from Q&A forums, tutorials, tech blogs, and course materials. The expert-written descriptions themselves are obtained from Bash programmers. In the end, the dataset consists of 9,000 English-command pairs covering 100 unique Bash utilities.

Anyway, I’ve just learned that there also exist similar datasets on semantic parsing that focus on a particular programming language, such as regex, SQL, and even IFTTT scripts! OMG right. But! Shell commands pose their own challenges: irregular syntax, wide domain coverage (> 100 Bash utilities), and a large percentage of unseen words.

Shell command crash course

Three basic components make up a shell command1: utilities (e.g. find, grep), option flags (e.g. -name, -l), and arguments (e.g. "*.java").
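
To make that concrete, here’s a toy command of my own (not one from the paper’s table) with the three pieces labeled:

```bash
# utility:      find
# option flags: -name, -mtime
# arguments:    ".", "*.java", "-7"
find . -name "*.java" -mtime -7   # Java files modified in the last 7 days
```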

Something else I’ve just learned is that right now there are > 250 Bash utilities, and third-party developers develop new ones every now and then. However, for the paper:

Corpus construction

More details are in the paper, but I’m particularly interested in the data cleaning process because I’ve been dealing with a ton of dirty data in the past few days, so I’m always looking for other possible solutions/creative ways to solve issues that I might find relevant. ;)

Corpus statistics

I’m leaving out some statistics but here are some that I find interesting:

Data split

Evaluation

We define a command template as a command with its arguments replaced by their semantic types. For example, the template of grep -l "TODO" *.java is grep -l [regex] [file].
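
Just to make that concrete for myself, here’s a throwaway sed sketch with my own regexes (hand-rolled for this one example, not how the authors extract templates) that turns that command into its template:

```bash
# replace the quoted string with [regex] and the glob with [file]
echo 'grep -l "TODO" *.java' \
  | sed -E 's/"[^"]*"/[regex]/; s/\*\.[[:alnum:]]+/[file]/'
# prints: grep -l [regex] [file]
```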

Challenges

Example: to specify an ssh remote, the format needs to be [USER@]HOST:SRC
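
For instance, pulling a file over rsync has to follow that shape (the hostname and paths below are made up for illustration):

```bash
# the remote source follows [USER@]HOST:SRC
rsync -av alice@example.com:/var/log/syslog ./logs/
# USER@ is optional -- without it, the current local username is used
rsync -av example.com:/var/log/syslog ./logs/
```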

Baseline System Performance

The authors evaluated two neural machine translation models, Seq2Seq and CopyNet, plus Tellina, as baselines for future work. Both Seq2Seq and CopyNet are evaluated at three levels of token granularity: token, character, and sub-token.

Tokens

To tokenize the NL and the Bash commands, the authors used a regex-based natural language tokenizer and a Bash parser augmented from Bashlex, respectively.
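
I haven’t looked at their tokenizer, but as a very rough stand-in, a regex-based word split over the NL side could look something like this (my own pattern, definitely cruder than theirs):

```bash
# crude regex-based tokenization of a natural language description:
# keep runs of letters, digits, and a few path-ish punctuation characters
echo 'find all *.java files modified in the last 7 days' \
  | grep -oE '[[:alnum:].*/_-]+'
# prints one token per line: find, all, *.java, files, ...
```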

Sub-tokens

Every constant in both the natural language and the Bash commands is split into sub-tokens, and each sub-token sequence is padded with SUB_START and SUB_END.

For example, the file path “/home/dir03/*.txt” is converted to the sub-token sequence: SUB_START, “/”, “home”, “/”, “dir”, “03”, “/”, “*”, “.”, “txt”, SUB_END.
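
Out of curiosity, this split is easy to mimic with grep by breaking on letter/digit/punctuation boundaries (leaving out the SUB_START/SUB_END markers, which would wrap the resulting sequence):

```bash
# split a constant into letter runs, digit runs, and single punctuation chars
echo '/home/dir03/*.txt' | grep -oE '[[:alpha:]]+|[[:digit:]]+|[^[:alnum:]]'
# prints, one per line: / home / dir 03 / * . txt
```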

Results

Error analysis

Overall possible solution: use separate RNNs for template translation and argument filling.

Takeaways

OK, so this is definitely interesting and relatively new to me too, so I had fun reading the paper! Although I still have to fill in sooo many gaps in my knowledge, especially in the NL-to-code problem area, I also learned a lot from the methodologies, which I’m sure will be useful for me sooner or later (and not necessarily for solving an NL-to-code problem).

Now I’ll take a look at the dataset and see if there’s anything I can incorporate into my usual/frequent data manipulation steps!

  1. FYI, ExplainShell is a godsend when it comes to explaining shell commands. 

  2. I read this a while ago, it’s a super good read, and the future work is an interesting read as well! I always love reading Peter Norvig’s thought processes on solving problems. 

  3. You can never run away from this issue. 

  4. Further reading: http://www1.cs.columbia.edu/~sedwards/classes/2003/w4115f/ast.9up.pdf 

  5. Related work, also note to future self: http://victorialin.net/pubs/tellina_tr_2017.pdf