split_text
Divides input text into smaller segments that do not exceed a specified maximum size in bytes. Segmentation is based on sentence or word boundaries.
split_text(text, max_size_bytes = 29000, tokenize = "sentences")
A tibble with one row per text segment, containing the following columns:
text_id
: The index of the original text in the input vector.
segment_id
: The sequential number of the segment within its original text.
segment_text
: The resulting text segment, which does not exceed the specified byte limit.
This function uses tokenizers::tokenize_sentences (or tokenizers::tokenize_words, if specified via the tokenize argument) to split the text into natural language units before assembling byte-limited segments.
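A minimal usage sketch (assumes split_text() is available from the package this page documents and that the tokenizers package is installed; the sample text, the 1000-byte limit, and the tokenize = "words" alternative are illustrative assumptions, not documented values):

# Two input texts: one long, one short.
long_text <- c(
  paste(rep("This is a sentence.", 200), collapse = " "),
  "A short second text."
)

# Split at sentence boundaries into segments of at most 1000 bytes.
segments <- split_text(long_text, max_size_bytes = 1000)

# One row per segment; text_id maps each segment back to its input element.
head(segments)

# Word-level splitting, assuming "words" is the accepted alternative.
segments_w <- split_text(long_text, max_size_bytes = 1000, tokenize = "words")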