mirror of
git://git.sv.gnu.org/emacs.git
synced 2026-02-17 10:27:41 +00:00
Improve character name escapes
* doc/lispref/nonascii.texi (Character Properties):
Avoid duplication of Unicode names. Reformat examples to fit in
narrow pages.
* doc/lispref/objects.texi (General Escape Syntax):
Simplify and better-organize explanation of \N{...} escapes.
* src/character.h (CHAR_SURROGATE_PAIR_P): Remove; unused.
(char_surrogate_p): New inline function.
* src/lread.c: Do not include string.h; no longer needed.
(invalid_character_name, check_scalar_value): Remove; the ideas
behind these functions are now bundled into character_name_to_code.
(character_name_to_code): Remove undocumented support for "CJK
IDEOGRAPH-XXXX" names, as "U+XXXX" suffices. Reject monstrosities
like "\N{U+-0}" and null bytes in \N escapes. Reject floating
point in \N escapes instead of returning garbage. Use
AUTO_STRING_WITH_LEN to lessen pressure on the garbage collector.
* test/src/lread-tests.el (lread-char-number, lread-char-name)
(lread-string-char-number, lread-string-char-name):
Test runtime behavior, not compile-time, as the test framework
is not set up to test compile-time.
(lread-char-surrogate-1, lread-char-surrogate-2)
(lread-char-surrogate-3, lread-char-surrogate-4)
(lread-string-char-number-2, lread-string-char-number-3):
New tests.
(lread-string-char-number-1): Rename from lread-string-char-number.
parent e7cb38edc9
commit bd1c7ca67e
5 changed files with 87 additions and 125 deletions
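The commit message says that \N{U+...} escapes now reject surrogates, out-of-range values, and forms such as "\N{U+-0}". The following standalone C sketch illustrates that kind of check. It is not the code in src/lread.c: the names parse_u_plus, surrogate_p and MAX_UNICODE are hypothetical, and the digits are parsed by hand here, whereas the commit passes the text after the "U" to string_to_number.

```c
#include <stdbool.h>

/* Highest Unicode code point (assumed constant name, for illustration).  */
enum { MAX_UNICODE = 0x10FFFF };

/* Same range test as the char_surrogate_p function added by this commit.  */
static bool
surrogate_p (int c)
{
  return 0xD800 <= c && c <= 0xDFFF;
}

/* Hypothetical helper, not Emacs code: parse a name of the form
   "U+XXXX" and return its scalar value, or -1 if NAME does not
   denote a valid Unicode scalar value.  */
static int
parse_u_plus (const char *name)
{
  /* Require the "U+" prefix and at least one character after it.  */
  if (name[0] != 'U' || name[1] != '+' || name[2] == '\0')
    return -1;

  long code = 0;
  for (const char *p = name + 2; *p; p++)
    {
      int digit;
      if ('0' <= *p && *p <= '9')
        digit = *p - '0';
      else if ('a' <= *p && *p <= 'f')
        digit = *p - 'a' + 10;
      else if ('A' <= *p && *p <= 'F')
        digit = *p - 'A' + 10;
      else
        return -1;              /* Any non-hex character, e.g. in "U+-0".  */
      code = code * 16 + digit;
      if (code > MAX_UNICODE)
        return -1;              /* Out of range; also prevents overflow.  */
    }

  /* Surrogates are code points but not scalar values.  */
  return surrogate_p ((int) code) ? -1 : (int) code;
}
```

With this sketch, "U+A817" yields 0xA817, while "U+D800", "U+DFFF", "U+-0" and "U+110000" all yield -1, matching the inputs that the new lread-char-surrogate and lread-string-char-number-3 tests expect the reader to reject.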
@@ -622,18 +622,21 @@ This function returns the value of @var{char}'s @var{propname} property.
 @result{} Nd
 @end group
 @group
-;; U+2084 SUBSCRIPT FOUR
-(get-char-code-property ?\u2084 'digit-value)
+;; U+2084
+(get-char-code-property ?\N@{SUBSCRIPT FOUR@}
+'digit-value)
 @result{} 4
 @end group
 @group
-;; U+2155 VULGAR FRACTION ONE FIFTH
-(get-char-code-property ?\u2155 'numeric-value)
+;; U+2155
+(get-char-code-property ?\N@{VULGAR FRACTION ONE FIFTH@}
+'numeric-value)
 @result{} 0.2
 @end group
 @group
-;; U+2163 ROMAN NUMERAL FOUR
-(get-char-code-property ?\N@{ROMAN NUMERAL FOUR@} 'numeric-value)
+;; U+2163
+(get-char-code-property ?\N@{ROMAN NUMERAL FOUR@}
+'numeric-value)
 @result{} 4
 @end group
 @group
@@ -353,25 +353,32 @@ following text.)
 control characters, Emacs provides several types of escape syntax that
 you can use to specify non-@acronym{ASCII} text characters.

+@enumerate
+@item
 @cindex @samp{\} in character constant
 @cindex backslash in character constants
 @cindex unicode character escape
-Firstly, you can specify characters by their Unicode values.
-@code{?\u@var{nnnn}} represents a character with Unicode code point
-@samp{U+@var{nnnn}}, where @var{nnnn} is (by convention) a hexadecimal
-number with exactly four digits. The backslash indicates that the
-subsequent characters form an escape sequence, and the @samp{u}
-specifies a Unicode escape sequence.
+You can specify characters by their Unicode names, if any.
+@code{?\N@{@var{NAME}@}} represents the Unicode character named
+@var{NAME}. Thus, @samp{?\N@{LATIN SMALL LETTER A WITH GRAVE@}} is
+equivalent to @code{?à} and denotes the Unicode character U+00E0. To
+simplify entering multi-line strings, you can replace spaces in the
+names by non-empty sequences of whitespace (e.g., newlines).

-There is a slightly different syntax for specifying Unicode
-characters with code points higher than @code{U+@var{ffff}}:
-@code{?\U00@var{nnnnnn}} represents the character with code point
-@samp{U+@var{nnnnnn}}, where @var{nnnnnn} is a six-digit hexadecimal
-number. The Unicode Standard only defines code points up to
-@samp{U+@var{10ffff}}, so if you specify a code point higher than
-that, Emacs signals an error.
+@item
+You can specify characters by their Unicode values.
+@code{?\N@{U+@var{X}@}} represents a character with Unicode code point
+@var{X}, where @var{X} is a hexadecimal number. Also,
+@code{?\u@var{xxxx}} and @code{?\U@var{xxxxxxxx}} represent code
+points @var{xxxx} and @var{xxxxxxxx}, respectively, where each @var{x}
+is a single hexadecimal digit. For example, @code{?\N@{U+E0@}},
+@code{?\u00e0} and @code{?\U000000E0} are all equivalent to @code{?à}
+and to @samp{?\N@{LATIN SMALL LETTER A WITH GRAVE@}}. The Unicode
+Standard defines code points only up to @samp{U+@var{10ffff}}, so if
+you specify a code point higher than that, Emacs signals an error.

-Secondly, you can specify characters by their hexadecimal character
+@item
+You can specify characters by their hexadecimal character
 codes. A hexadecimal escape sequence consists of a backslash,
 @samp{x}, and the hexadecimal character code. Thus, @samp{?\x41} is
 the character @kbd{A}, @samp{?\x1} is the character @kbd{C-a}, and
@@ -379,23 +386,16 @@ the character @kbd{A}, @samp{?\x1} is the character @kbd{C-a}, and
 You can use any number of hex digits, so you can represent any
 character code in this way.

+@item
 @cindex octal character code
-Thirdly, you can specify characters by their character code in
+You can specify characters by their character code in
 octal. An octal escape sequence consists of a backslash followed by
 up to three octal digits; thus, @samp{?\101} for the character
 @kbd{A}, @samp{?\001} for the character @kbd{C-a}, and @code{?\002}
 for the character @kbd{C-b}. Only characters up to octal code 777 can
 be specified this way.

-Fourthly, you can specify characters by their name. A character
-name escape sequence consists of a backslash, @samp{N@{}, the Unicode
-character name, and @samp{@}}. Alternatively, you can also put the
-numeric code point value between the braces, using the syntax
-@samp{\N@{U+nnnn@}}, where @samp{nnnn} denotes between one and eight
-hexadecimal digits. Thus, @samp{?\N@{LATIN CAPITAL LETTER A@}} and
-@samp{?\N@{U+41@}} both denote the character @kbd{A}. To simplify
-entering multi-line strings, you can replace spaces in the character
-names by arbitrary non-empty sequence of whitespace (e.g., newlines).
+@end enumerate

 These escape sequences may also be used in strings. @xref{Non-ASCII
 in Strings}.
@@ -612,14 +612,13 @@ sanitize_char_width (EMACS_INT width)
 : (c) <= 0xE01EF ? (c) - 0xE0100 + 17 \
 : 0)

-/* If C is a high surrogate, return 1. If C is a low surrogate,
-return 2. Otherwise, return 0. */
+/* Return true if C is a surrogate. */

-#define CHAR_SURROGATE_PAIR_P(c) \
-((c) < 0xD800 ? 0 \
-: (c) <= 0xDBFF ? 1 \
-: (c) <= 0xDFFF ? 2 \
-: 0)
+INLINE bool
+char_surrogate_p (int c)
+{
+return 0xD800 <= c && c <= 0xDFFF;
+}

 /* Data type for Unicode general category.

106 src/lread.c
@@ -44,7 +44,6 @@ along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */
 #include "termhooks.h"
 #include "blockinput.h"
 #include <c-ctype.h>
-#include <string.h>

 #ifdef MSDOS
 #include "msdos.h"
@@ -2151,88 +2150,42 @@ grow_read_buffer (void)
 MAX_MULTIBYTE_LENGTH, -1, 1);
 }

-/* Signal an invalid-read-syntax error indicating that the character
-name in an \N{…} literal is invalid. */
-static _Noreturn void
-invalid_character_name (Lisp_Object name)
-{
-AUTO_STRING (format, "\\N{%s}");
-xsignal1 (Qinvalid_read_syntax, CALLN (Fformat, format, name));
-}
-
-/* Check that CODE is a valid Unicode scalar value, and return its
-value. CODE should be parsed from the character name given by
-NAME. NAME is used for error messages. */
+/* Return the scalar value that has the Unicode character name NAME.
+Raise 'invalid-read-syntax' if there is no such character. */
 static int
-check_scalar_value (Lisp_Object code, Lisp_Object name)
+character_name_to_code (char const *name, ptrdiff_t name_len)
 {
-if (! NUMBERP (code))
-invalid_character_name (name);
-EMACS_INT i = XINT (code);
-if (! (0 <= i && i <= MAX_UNICODE_CHAR)
-/* Don't allow surrogates. */
-|| (0xD800 <= code && code <= 0xDFFF))
-invalid_character_name (name);
-return i;
-}
+Lisp_Object code;

-/* If NAME starts with PREFIX, interpret the rest as a hexadecimal
-number and return its value. Raise invalid-read-syntax if the
-number is not a valid scalar value. Return −1 if NAME doesn’t
-start with PREFIX. */
-static int
-parse_code_after_prefix (Lisp_Object name, const char *prefix)
-{
-ptrdiff_t name_len = SBYTES (name);
-ptrdiff_t prefix_len = strlen (prefix);
-/* Allow between one and eight hexadecimal digits after the
-prefix. */
-if (prefix_len < name_len && name_len <= prefix_len + 8
-&& memcmp (SDATA (name), prefix, prefix_len) == 0)
+/* Code point as U+XXXX.... */
+if (name[0] == 'U' && name[1] == '+')
 {
-Lisp_Object code = string_to_number (SDATA (name) + prefix_len, 16, false);
-if (NUMBERP (code))
-return check_scalar_value (code, name);
+/* Pass the leading '+' to string_to_number, so that it
+rejects monstrosities such as negative values. */
+code = string_to_number (name + 1, 16, false);
 }
-return -1;
-}
-
-/* Returns the scalar value that has the Unicode character name NAME.
-Raises `invalid-read-syntax' if there is no such character. */
-static int
-character_name_to_code (Lisp_Object name)
-{
-/* Code point as U+N, where N is between 1 and 8 hexadecimal
-digits. */
-int code = parse_code_after_prefix (name, "U+");
-if (code >= 0)
-return code;
-
-/* CJK ideographs are not contained in the association list returned
-by `ucs-names'. But they follow a predictable naming pattern: a
-fixed prefix plus the hexadecimal codepoint value. */
-code = parse_code_after_prefix (name, "CJK IDEOGRAPH-");
-if (code >= 0)
+else
 {
-/* Various ranges of CJK characters; see UnicodeData.txt. */
-if ((0x3400 <= code && code <= 0x4DB5)
-|| (0x4E00 <= code && code <= 0x9FD5)
-|| (0x20000 <= code && code <= 0x2A6D6)
-|| (0x2A700 <= code && code <= 0x2B734)
-|| (0x2B740 <= code && code <= 0x2B81D)
-|| (0x2B820 <= code && code <= 0x2CEA1))
-return code;
-else
-invalid_character_name (name);
+/* Look up the name in the table returned by 'ucs-names'. */
+AUTO_STRING_WITH_LEN (namestr, name, name_len);
+Lisp_Object names = call0 (Qucs_names);
+code = CDR (Fassoc (namestr, names));
 }

-/* Look up the name in the table returned by `ucs-names'. */
-Lisp_Object names = call0 (Qucs_names);
-return check_scalar_value (CDR (Fassoc (name, names)), name);
+if (! (INTEGERP (code)
+&& 0 <= XINT (code) && XINT (code) <= MAX_UNICODE_CHAR
+&& ! char_surrogate_p (XINT (code))))
+{
+AUTO_STRING (format, "\\N{%s}");
+AUTO_STRING_WITH_LEN (namestr, name, name_len);
+xsignal1 (Qinvalid_read_syntax, CALLN (Fformat, format, namestr));
+}
+
+return XINT (code);
 }

 /* Bound on the length of a Unicode character name. As of
-Unicode 9.0.0 the maximum is 83, so this should be safe. */
+Unicode 9.0.0 the maximum is 83, so this should be safe. */
 enum { UNICODE_CHARACTER_NAME_LENGTH_BOUND = 200 };

 /* Read a \-escape sequence, assuming we already read the `\'.
@@ -2458,14 +2411,14 @@ read_escape (Lisp_Object readcharfun, bool stringp)
 end_of_file_error ();
 if (c == '}')
 break;
-if (! c_isascii (c))
+if (! (0 < c && c < 0x80))
 {
 AUTO_STRING (format,
-"Non-ASCII character U+%04X in character name");
+"Invalid character U+%04X in character name");
 xsignal1 (Qinvalid_read_syntax,
 CALLN (Fformat, format, make_natnum (c)));
 }
-/* We treat multiple adjacent whitespace characters as a
+/* Treat multiple adjacent whitespace characters as a
 single space character. This makes it easier to use
 character names in e.g. multi-line strings. */
 if (c_isspace (c))
@@ -2483,7 +2436,8 @@ read_escape (Lisp_Object readcharfun, bool stringp)
 }
 if (length == 0)
 invalid_syntax ("Empty character name");
-return character_name_to_code (make_unibyte_string (name, length));
+name[length] = '\0';
+return character_name_to_code (name, length);
 }

 default:
@@ -1,6 +1,6 @@
 ;;; lread-tests.el --- tests for lread.c -*- lexical-binding: t; -*-

-;; Copyright (C) 2016 Google Inc.
+;; Copyright (C) 2016 Free Software Foundation, Inc.

 ;; Author: Philipp Stephani <phst@google.com>

@@ -26,11 +26,10 @@
 ;;; Code:

 (ert-deftest lread-char-number ()
-(should (equal ?\N{U+A817} #xA817)))
+(should (equal (read "?\\N{U+A817}") #xA817)))

 (ert-deftest lread-char-name ()
-(should (equal ?\N{SYLOTI NAGRI LETTER
-DHO}
+(should (equal (read "?\\N{SYLOTI NAGRI LETTER \n DHO}")
 #xA817)))

 (ert-deftest lread-char-invalid-number ()
@@ -46,16 +45,23 @@
 (ert-deftest lread-char-empty-name ()
 (should-error (read "?\\N{}") :type 'invalid-read-syntax))

-(ert-deftest lread-char-cjk-name ()
-(should (equal ?\N{CJK IDEOGRAPH-2B734} #x2B734)))
+(ert-deftest lread-char-surrogate-1 ()
+(should-error (read "?\\N{U+D800}") :type 'invalid-read-syntax))
+(ert-deftest lread-char-surrogate-2 ()
+(should-error (read "?\\N{U+D801}") :type 'invalid-read-syntax))
+(ert-deftest lread-char-surrogate-3 ()
+(should-error (read "?\\N{U+Dffe}") :type 'invalid-read-syntax))
+(ert-deftest lread-char-surrogate-4 ()
+(should-error (read "?\\N{U+DFFF}") :type 'invalid-read-syntax))

-(ert-deftest lread-char-invalid-cjk-name ()
-(should-error (read "?\\N{CJK IDEOGRAPH-2B735}") :type 'invalid-read-syntax))
-
-(ert-deftest lread-string-char-number ()
-(should (equal "a\N{U+A817}b" "a\uA817b")))
+(ert-deftest lread-string-char-number-1 ()
+(should (equal (read "a\\N{U+A817}b") "a\uA817bx")))
+(ert-deftest lread-string-char-number-2 ()
+(should-error (read "?\\N{0.5}") :type 'invalid-read-syntax))
+(ert-deftest lread-string-char-number-3 ()
+(should-error (read "?\\N{U+-0}") :type 'invalid-read-syntax))

 (ert-deftest lread-string-char-name ()
-(should (equal "a\N{SYLOTI NAGRI LETTER DHO}b" "a\uA817b")))
+(should (equal (read "a\\N{SYLOTI NAGRI LETTER DHO}b") "a\uA817b")))

 ;;; lread-tests.el ends here
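The read_escape hunks above keep the rule that a run of whitespace inside \N{...} counts as a single space, and tighten the character check to 0 < c < 0x80. The following standalone C sketch shows one way to express that normalization. It is an illustration under assumptions, not the Emacs implementation: normalize_char_name and ascii_space_p are hypothetical names, and a null byte cannot occur in the input here because the sketch takes a NUL-terminated string.

```c
#include <stdbool.h>
#include <stddef.h>

/* True for the ASCII whitespace characters: space, \t, \n, \v, \f, \r.  */
static bool
ascii_space_p (int c)
{
  return c == ' ' || ('\t' <= c && c <= '\r');
}

/* Hypothetical helper, not Emacs code: copy the character name SRC
   into DST (capacity DST_SIZE), replacing each run of whitespace
   with one space.  Return the resulting length, or -1 if SRC holds
   a non-ASCII byte or the result does not fit.  */
static int
normalize_char_name (const char *src, char *dst, size_t dst_size)
{
  if (dst_size == 0)
    return -1;

  size_t len = 0;
  bool prev_space = false;
  for (const unsigned char *p = (const unsigned char *) src; *p; p++)
    {
      int c = *p;
      if (c >= 0x80)
        return -1;              /* Only 0 < c < 0x80 is accepted.  */
      if (ascii_space_p (c))
        {
          if (prev_space)
            continue;           /* Skip the rest of a whitespace run.  */
          c = ' ';
          prev_space = true;
        }
      else
        prev_space = false;
      if (len + 1 >= dst_size)
        return -1;              /* Leave room for the terminating NUL.  */
      dst[len++] = (char) c;
    }
  dst[len] = '\0';
  return (int) len;
}
```

Given "SYLOTI NAGRI LETTER \n DHO", the input used by the lread-char-name test, the sketch produces "SYLOTI NAGRI LETTER DHO", which is the form that would then be looked up by name.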