emacs/admin/notes/tree-sitter/html-manual/Language-Definitions.html
Yuan Fu cb183f6467
Add tree-sitter admin notes
starter-guide: Guide on writing major mode features.
build-module: Script for building official language definitions.
html-manual: HTML version of the manual for easy access.

* admin/notes/tree-sitter/build-module/README: New file.
* admin/notes/tree-sitter/build-module/batch.sh: New file.
* admin/notes/tree-sitter/build-module/build.sh: New file.
* admin/notes/tree-sitter/starter-guide: New file.
* admin/notes/tree-sitter/html-manual/Accessing-Node.html: New file.
* admin/notes/tree-sitter/html-manual/Language-Definitions.html: New file.
* admin/notes/tree-sitter/html-manual/Multiple-Languages.html: New file.
* admin/notes/tree-sitter/html-manual/Parser_002dbased-Font-Lock.html:
New file.
* admin/notes/tree-sitter/html-manual/Parser_002dbased-Indentation.html:
New file.
* admin/notes/tree-sitter/html-manual/Parsing-Program-Source.html: New
file.
* admin/notes/tree-sitter/html-manual/Pattern-Matching.html: New file.
* admin/notes/tree-sitter/html-manual/Retrieving-Node.html: New file.
* admin/notes/tree-sitter/html-manual/Tree_002dsitter-C-API.html: New
file.
* admin/notes/tree-sitter/html-manual/Using-Parser.html: New file.
* admin/notes/tree-sitter/html-manual/build-manual.sh: New file.
* admin/notes/tree-sitter/html-manual/manual.css: New file.
2022-10-05 14:11:33 -07:00

326 lines
14 KiB
HTML
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<!-- Created by GNU Texinfo 6.8, https://www.gnu.org/software/texinfo/ -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<!-- This is the GNU Emacs Lisp Reference Manual
corresponding to Emacs version 29.0.50.
Copyright © 1990-1996, 1998-2022 Free Software Foundation,
Inc.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with the
Invariant Sections being "GNU General Public License," with the
Front-Cover Texts being "A GNU Manual," and with the Back-Cover
Texts as in (a) below. A copy of the license is included in the
section entitled "GNU Free Documentation License."
(a) The FSF's Back-Cover Text is: "You have the freedom to copy and
modify this GNU manual. Buying copies from the FSF supports it in
developing GNU and promoting software freedom." -->
<title>Language Definitions (GNU Emacs Lisp Reference Manual)</title>
<meta name="description" content="Language Definitions (GNU Emacs Lisp Reference Manual)">
<meta name="keywords" content="Language Definitions (GNU Emacs Lisp Reference Manual)">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="makeinfo">
<meta name="viewport" content="width=device-width,initial-scale=1">
<link href="index.html" rel="start" title="Top">
<link href="Index.html" rel="index" title="Index">
<link href="index.html#SEC_Contents" rel="contents" title="Table of Contents">
<link href="Parsing-Program-Source.html" rel="up" title="Parsing Program Source">
<link href="Using-Parser.html" rel="next" title="Using Parser">
<style type="text/css">
<!--
a.copiable-anchor {visibility: hidden; text-decoration: none; line-height: 0em}
a.summary-letter {text-decoration: none}
blockquote.indentedblock {margin-right: 0em}
div.display {margin-left: 3.2em}
div.example {margin-left: 3.2em}
kbd {font-style: oblique}
pre.display {font-family: inherit}
pre.format {font-family: inherit}
pre.menu-comment {font-family: serif}
pre.menu-preformatted {font-family: serif}
span.nolinebreak {white-space: nowrap}
span.roman {font-family: initial; font-weight: normal}
span.sansserif {font-family: sans-serif; font-weight: normal}
span:hover a.copiable-anchor {visibility: visible}
ul.no-bullet {list-style: none}
-->
</style>
<link rel="stylesheet" type="text/css" href="./manual.css">
</head>
<body lang="en">
<div class="section" id="Language-Definitions">
<div class="header">
<p>
Next: <a href="Using-Parser.html" accesskey="n" rel="next">Using Tree-sitter Parser</a>, Up: <a href="Parsing-Program-Source.html" accesskey="u" rel="up">Parsing Program Source</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Index.html" title="Index" rel="index">Index</a>]</p>
</div>
<hr>
<span id="Tree_002dsitter-Language-Definitions"></span><h3 class="section">37.1 Tree-sitter Language Definitions</h3>
<span id="Loading-a-language-definition"></span><h3 class="heading">Loading a language definition</h3>
<p>Tree-sitter relies on language definitions to parse text in that
language. In Emacs, A language definition is represented by a symbol.
For example, C language definition is represented as <code>c</code>, and
<code>c</code> can be passed to tree-sitter functions as the <var>language</var>
argument.
</p>
<span id="index-treesit_002dextra_002dload_002dpath"></span>
<span id="index-treesit_002dload_002dlanguage_002derror"></span>
<span id="index-treesit_002dload_002dsuffixes"></span>
<p>Tree-sitter language definitions are distributed as dynamic libraries.
In order to use a language definition in Emacs, you need to make sure
that the dynamic library is installed on the system. Emacs looks for
language definitions under load paths in
<code>treesit-extra-load-path</code>, <code>user-emacs-directory</code>/tree-sitter,
and system default locations for dynamic libraries, in that order.
Emacs tries each extensions in <code>treesit-load-suffixes</code>. If Emacs
cannot find the library or has problem loading it, Emacs signals
<code>treesit-load-language-error</code>. The signal data is a list of
specific error messages.
</p>
<dl class="def">
<dt id="index-treesit_002dlanguage_002davailable_002dp"><span class="category">Function: </span><span><strong>treesit-language-available-p</strong> <em>language</em><a href='#index-treesit_002dlanguage_002davailable_002dp' class='copiable-anchor'> &para;</a></span></dt>
<dd><p>This function checks whether the dynamic library for <var>language</var> is
present on the system, and return non-nil if it is.
</p></dd></dl>
<span id="index-treesit_002dload_002dname_002doverride_002dlist"></span>
<p>By convention, the dynamic library for <var>language</var> is
<code>libtree-sitter-<var>language</var>.<var>ext</var></code>, where <var>ext</var> is the
system-specific extension for dynamic libraries. Also by convention,
the function provided by that library is named
<code>tree_sitter_<var>language</var></code>. If a language definition doesn&rsquo;t
follow this convention, you should add an entry
</p>
<div class="example">
<pre class="example">(<var>language</var> <var>library-base-name</var> <var>function-name</var>)
</pre></div>
<p>to <code>treesit-load-name-override-list</code>, where
<var>library-base-name</var> is the base filename for the dynamic library
(conventionally <code>libtree-sitter-<var>language</var></code>), and
<var>function-name</var> is the function provided by the library
(conventionally <code>tree_sitter_<var>language</var></code>). For example,
</p>
<div class="example">
<pre class="example">(cool-lang &quot;libtree-sitter-coool&quot; &quot;tree_sitter_cooool&quot;)
</pre></div>
<p>for a language too cool to abide by conventions.
</p>
<dl class="def">
<dt id="index-treesit_002dlanguage_002dversion"><span class="category">Function: </span><span><strong>treesit-language-version</strong> <em>&amp;optional min-compatible</em><a href='#index-treesit_002dlanguage_002dversion' class='copiable-anchor'> &para;</a></span></dt>
<dd><p>Tree-sitter library has a <em>language version</em>, a language
definition&rsquo;s version needs to match this version to be compatible.
</p>
<p>This function returns tree-sitter librarys language version. If
<var>min-compatible</var> is non-nil, it returns the minimal compatible
version.
</p></dd></dl>
<span id="Concrete-syntax-tree"></span><h3 class="heading">Concrete syntax tree</h3>
<p>A syntax tree is what a parser generates. In a syntax tree, each node
represents a piece of text, and is connected to each other by a
parent-child relationship. For example, if the source text is
</p>
<div class="example">
<pre class="example">1 + 2
</pre></div>
<p>its syntax tree could be
</p>
<div class="example">
<pre class="example"> +--------------+
| root &quot;1 + 2&quot; |
+--------------+
|
+--------------------------------+
| expression &quot;1 + 2&quot; |
+--------------------------------+
| | |
+------------+ +--------------+ +------------+
| number &quot;1&quot; | | operator &quot;+&quot; | | number &quot;2&quot; |
+------------+ +--------------+ +------------+
</pre></div>
<p>We can also represent it in s-expression:
</p>
<div class="example">
<pre class="example">(root (expression (number) (operator) (number)))
</pre></div>
<span id="Node-types"></span><h4 class="subheading">Node types</h4>
<span id="index-tree_002dsitter-node-type"></span>
<span id="tree_002dsitter-node-type"></span><span id="index-tree_002dsitter-named-node"></span>
<span id="tree_002dsitter-named-node"></span><span id="index-tree_002dsitter-anonymous-node"></span>
<p>Names like <code>root</code>, <code>expression</code>, <code>number</code>,
<code>operator</code> are nodes&rsquo; <em>type</em>. However, not all nodes in a
syntax tree have a type. Nodes that don&rsquo;t are <em>anonymous nodes</em>,
and nodes with a type are <em>named nodes</em>. Anonymous nodes are
tokens with fixed spellings, including punctuation characters like
bracket &lsquo;<samp>]</samp>&rsquo;, and keywords like <code>return</code>.
</p>
<span id="Field-names"></span><h4 class="subheading">Field names</h4>
<span id="index-tree_002dsitter-node-field-name"></span>
<span id="tree_002dsitter-node-field-name"></span><p>To make the syntax tree easier to
analyze, many language definitions assign <em>field names</em> to child
nodes. For example, a <code>function_definition</code> node could have a
<code>declarator</code> and a <code>body</code>:
</p>
<div class="example">
<pre class="example">(function_definition
declarator: (declaration)
body: (compound_statement))
</pre></div>
<dl class="def">
<dt id="index-treesit_002dinspect_002dmode"><span class="category">Command: </span><span><strong>treesit-inspect-mode</strong><a href='#index-treesit_002dinspect_002dmode' class='copiable-anchor'> &para;</a></span></dt>
<dd><p>This minor mode displays the node that <em>starts</em> at point in
mode-line. The mode-line will display
</p>
<div class="example">
<pre class="example"><var>parent</var> <var>field-name</var>: (<var>child</var> (<var>grand-child</var> (...)))
</pre></div>
<p><var>child</var>, <var>grand-child</var>, and <var>grand-grand-child</var>, etc, are
nodes that have their beginning at point. And <var>parent</var> is the
parent of <var>child</var>.
</p>
<p>If there is no node that starts at point, i.e., point is in the middle
of a node, then the mode-line only displays the smallest node that
spans point, and its immediate parent.
</p>
<p>This minor mode doesn&rsquo;t create parsers on its own. It simply uses the
first parser in <code>(treesit-parser-list)</code> (see <a href="Using-Parser.html">Using Tree-sitter Parser</a>).
</p></dd></dl>
<span id="Reading-the-grammar-definition"></span><h3 class="heading">Reading the grammar definition</h3>
<p>Authors of language definitions define the <em>grammar</em> of a
language, and this grammar determines how does a parser construct a
concrete syntax tree out of the text. In order to use the syntax
tree effectively, we need to read the <em>grammar file</em>.
</p>
<p>The grammar file is usually <code>grammar.js</code> in a language
definitions project repository. The link to a language definitions
home page can be found in tree-sitters homepage
(<a href="https://tree-sitter.github.io/tree-sitter">https://tree-sitter.github.io/tree-sitter</a>).
</p>
<p>The grammar is written in JavaScript syntax. For example, the rule
matching a <code>function_definition</code> node looks like
</p>
<div class="example">
<pre class="example">function_definition: $ =&gt; seq(
$.declaration_specifiers,
field('declarator', $.declaration),
field('body', $.compound_statement)
)
</pre></div>
<p>The rule is represented by a function that takes a single argument
<var>$</var>, representing the whole grammar. The function itself is
constructed by other functions: the <code>seq</code> function puts together a
sequence of children; the <code>field</code> function annotates a child with
a field name. If we write the above definition in BNF syntax, it
would look like
</p>
<div class="example">
<pre class="example">function_definition :=
&lt;declaration_specifiers&gt; &lt;declaration&gt; &lt;compound_statement&gt;
</pre></div>
<p>and the node returned by the parser would look like
</p>
<div class="example">
<pre class="example">(function_definition
(declaration_specifier)
declarator: (declaration)
body: (compound_statement))
</pre></div>
<p>Below is a list of functions that one will see in a grammar
definition. Each function takes other rules as arguments and returns
a new rule.
</p>
<ul>
<li> <code>seq(rule1, rule2, ...)</code> matches each rule one after another.
</li><li> <code>choice(rule1, rule2, ...)</code> matches one of the rules in its
arguments.
</li><li> <code>repeat(rule)</code> matches <var>rule</var> for <em>zero or more</em> times.
This is like the &lsquo;<samp>*</samp>&rsquo; operator in regular expressions.
</li><li> <code>repeat1(rule)</code> matches <var>rule</var> for <em>one or more</em> times.
This is like the &lsquo;<samp>+</samp>&rsquo; operator in regular expressions.
</li><li> <code>optional(rule)</code> matches <var>rule</var> for <em>zero or one</em> time.
This is like the &lsquo;<samp>?</samp>&rsquo; operator in regular expressions.
</li><li> <code>field(name, rule)</code> assigns field name <var>name</var> to the child
node matched by <var>rule</var>.
</li><li> <code>alias(rule, alias)</code> makes nodes matched by <var>rule</var> appear as
<var>alias</var> in the syntax tree generated by the parser. For example,
<div class="example">
<pre class="example">alias(preprocessor_call_exp, call_expression)
</pre></div>
<p>makes any node matched by <code>preprocessor_call_exp</code> to appear as
<code>call_expression</code>.
</p></li></ul>
<p>Below are grammar functions less interesting for a reader of a
language definition.
</p>
<ul>
<li> <code>token(rule)</code> marks <var>rule</var> to produce a single leaf node.
That is, instead of generating a parent node with individual child
nodes under it, everything is combined into a single leaf node.
</li><li> Normally, grammar rules ignore preceding whitespaces,
<code>token.immediate(rule)</code> changes <var>rule</var> to match only when
there is no preceding whitespaces.
</li><li> <code>prec(n, rule)</code> gives <var>rule</var> a level <var>n</var> precedence.
</li><li> <code>prec.left([n,] rule)</code> marks <var>rule</var> as left-associative,
optionally with level <var>n</var>.
</li><li> <code>prec.right([n,] rule)</code> marks <var>rule</var> as right-associative,
optionally with level <var>n</var>.
</li><li> <code>prec.dynamic(n, rule)</code> is like <code>prec</code>, but the precedence
is applied at runtime instead.
</li></ul>
<p>The tree-sitter project talks about writing a grammar in more detail:
<a href="https://tree-sitter.github.io/tree-sitter/creating-parsers">https://tree-sitter.github.io/tree-sitter/creating-parsers</a>.
Read especially &ldquo;The Grammar DSL&rdquo; section.
</p>
</div>
<hr>
<div class="header">
<p>
Next: <a href="Using-Parser.html">Using Tree-sitter Parser</a>, Up: <a href="Parsing-Program-Source.html">Parsing Program Source</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Index.html" title="Index" rel="index">Index</a>]</p>
</div>
</body>
</html>