nchar_ctl counts all non Control Sequence characters.
nzchar_ctl returns TRUE for each input vector element that has non Control
Sequence characters. By default newlines and other C0 control
characters are not counted.
Arguments
- x
a character vector or object that can be coerced to such.
- type
character(1L) partial matching c("chars", "width", "graphemes"). See ?nchar, as well as the corresponding documentation sections on this page.
- allowNA
logical: should NA be returned for invalid multibyte strings or "bytes"-encoded strings (rather than throwing an error)?
- keepNA
logical: should NA be returned when x is NA? If false, nchar() returns 2, as that is the number of printing characters used when strings are written to output, and nzchar() is TRUE. The default for nchar(), NA, means to use keepNA = TRUE unless type is "width".
- ctl
character, which Control Sequences should be treated specially. Special treatment is context dependent, and may include detecting them and/or computing their display/character width as zero. For the SGR subset of the ANSI CSI sequences, and OSC hyperlinks, fansi will also parse, interpret, and reapply the sequences as needed. You can modify whether a Control Sequence is treated specially with the ctl parameter.
"nl": newlines.
"c0": all other "C0" control characters (i.e. 0x01-0x1f, 0x7F), except for newlines and the actual ESC (0x1B) character.
"sgr": ANSI CSI SGR sequences.
"csi": all non-SGR ANSI CSI sequences.
"url": OSC hyperlinks.
"osc": all non-OSC-hyperlink OSC sequences.
"esc": all other escape sequences.
"all": all of the above, except when used in combination with any of the above, in which case it means "all but".
- warn
TRUE (default) or FALSE, whether to warn when potentially problematic Control Sequences are encountered. These could cause the assumptions fansi makes about how strings are rendered on your display to be incorrect, for example by moving the cursor (see ?fansi). At most one warning will be issued per element in each input vector. Will also warn about some badly encoded UTF-8 strings, but a lack of UTF-8 warnings is not a guarantee of correct encoding (use validUTF8 for that).
- strip
character, deprecated in favor of ctl.
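The "all but" meaning of "all" in the ctl argument can be illustrated with a short sketch. It assumes the fansi package is installed and reuses the string from the Examples section of this page. The expectation that only the newline is counted in the second call follows from the ctl = "c0" example there.

```r
library(fansi)

x <- "\t\n\r"

# Default (ctl = "all"): tabs, newlines, and carriage returns are all
# treated as Control Sequences and excluded from the count.
nchar_ctl(x)

# "all" combined with "nl" means "all but newlines", so the newline is
# no longer treated specially and should be the only character counted.
nchar_ctl(x, ctl = c("all", "nl"))
```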
Value
Like base::nchar, with Control Sequences excluded.
Details
nchar_ctl and nzchar_ctl are implemented in statically compiled code, so
in particular nzchar_ctl will be much faster than the otherwise equivalent
nzchar(strip_ctl(...)).
These functions will warn if malformed escape or UTF-8 sequences are encountered, as they may be incorrectly interpreted.
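As a minimal sketch of the equivalence described above, the following compares nzchar_ctl with the two-step nzchar(strip_ctl(...)) form. It assumes the fansi package is installed. The input vector is illustrative only, and no timing claims are made here.

```r
library(fansi)

# An element with visible text, one with only SGR sequences, and an
# empty string.
x <- c("\033[31mred\033[0m", "\033[31m\033[0m", "")

# Single call that tests for non Control Sequence characters.
direct <- nzchar_ctl(x)

# Two-step form: strip the Control Sequences, then test for emptiness.
two.step <- nzchar(strip_ctl(x))

# Both approaches are expected to agree element by element.
identical(direct, two.step)
```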
Control and Special Sequences
Control Sequences are non-printing characters or sequences of characters.
Special Sequences are a subset of the Control Sequences, and include CSI
SGR sequences which can be used to change rendered appearance of text, and
OSC hyperlinks. See fansi for details.
Output Stability
Several factors could affect the exact output produced by fansi
functions across versions of fansi, R, and/or across systems.
In general it is best not to rely on exact fansi output, e.g. by
embedding it in tests.
Width and grapheme calculations depend on Unicode database version (see
fansi_unicode_version) and grapheme processing logic, among other
things (see "Graphemes"). Individual character widths are intended to match
R 4.5.1 definitions in an English locale, except for differences introduced by
Unicode Database Version updates and grapheme processing.
How a particular display format is encoded in Control Sequences is
not guaranteed to be stable across fansi versions. Additionally, which
Special Sequences are re-encoded vs transcribed untouched may change.
In general we will strive to keep the rendered appearance stable.
To maximize the odds of getting stable output set normalize_state to
TRUE and type to "chars" in functions that allow it, and
set term.cap to a specific set of capabilities.
Graphemes
fansi approximates grapheme widths and counts by using heuristics for
grapheme breaks that work for most common graphemes, including emoji
combining sequences. The heuristic is known to work incorrectly with
invalid combining sequences, prepending marks, and sequence interruptors.
The utf8 package provides a
conforming grapheme parsing implementation.
See also
?fansi for details on how Control Sequences are
interpreted, particularly if you are getting unexpected results;
unhandled_ctl for detecting bad control sequences.
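A hedged sketch of checking input with unhandled_ctl before counting characters is shown here. It assumes the fansi package is installed. The input strings are made up for illustration, and the exact columns of the result may differ across fansi versions.

```r
library(fansi)

# "\033[31m" is an SGR sequence that fansi interprets; "\033[2A"
# (cursor up) is a non-SGR CSI sequence that fansi does not interpret.
x <- c("\033[31mred\033[0m", "up\033[2Adown")

# Returns a data frame describing the sequences fansi could not
# handle, including which element they occur in.
problems <- unhandled_ctl(x)
problems
```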
Examples
nchar_ctl("\033[31m123\a\r")
#> [1] 3
## with some wide characters
cn.string <- sprintf("\033[31m%s\a\r", "\u4E00\u4E01\u4E03")
nchar_ctl(cn.string)
#> [1] 3
nchar_ctl(cn.string, type='width')
#> [1] 6
## Remember newlines are not counted by default
nchar_ctl("\t\n\r")
#> [1] 0
## The 'c0' value for the `ctl` argument does not include
## newlines.
nchar_ctl("\t\n\r", ctl="c0")
#> [1] 1
nchar_ctl("\t\n\r", ctl=c("c0", "nl"))
#> [1] 0
## The _sgr flavor only treats SGR sequences as zero width
nchar_sgr("\033[31m123")
#> [1] 3
nchar_sgr("\t\n\n123")
#> [1] 6
## All of the following are Control Sequences or C0 controls
nzchar_ctl("\n\033[42;31m\033[123P\a")
#> [1] FALSE