Shortest Edit Script — ses • diffobj

Computes the shortest edit script that converts a into b by removing elements from a and adding elements from b. Intended primarily for debugging, or for other applications that understand the edit script format described in the GNU diff documentation. See the GNU diff docs for how to interpret the symbols.
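
As a rough illustration of how the symbols can be read, the sketch below splits each command into the operation letter and the line ranges on either side of it. It assumes the commands follow the `<range in a><a|d|c><range in b>` layout seen in the Examples output; `parse_ses` is a hypothetical helper written for this illustration, not part of diffobj.

```r
# Hypothetical helper, not part of diffobj: split commands such as "3c2,3"
# into the operation and the index ranges that precede and follow it.
parse_ses <- function(script) {
  # Dropping digits and commas leaves the single operation letter
  op <- gsub("[0-9,]", "", script)
  # Splitting on that letter yields the a-side and b-side ranges
  sides <- strsplit(script, "[adc]")
  # "2,3" becomes c(2L, 3L); a lone "3" becomes c(3L, 3L)
  bounds <- function(x) {
    n <- as.integer(strsplit(x, ",", fixed = TRUE)[[1]])
    c(n[1], n[length(n)])
  }
  a_rng <- vapply(sides, function(s) bounds(s[1]), integer(2))
  b_rng <- vapply(sides, function(s) bounds(s[2]), integer(2))
  data.frame(
    cmd = script,
    op = unname(c(a = "add", d = "delete", c = "change")[op]),
    a.start = a_rng[1, ], a.end = a_rng[2, ],
    b.start = b_rng[1, ], b.end = b_rng[2, ],
    stringsAsFactors = FALSE
  )
}

# Commands copied from the Examples output of ses(a, b)
parsed <- parse_ses(c("1d0", "3c2,3", "5d4"))
print(parsed)
```

Read this way, "3c2,3" says that element 3 of a is changed into elements 2 through 3 of b, while "1d0" says that element 1 of a is deleted and would have followed position 0 of b.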

ses(a, b, max.diffs = gdo("max.diffs"), warn = gdo("warn"))

ses_dat(a, b, extra = TRUE, max.diffs = gdo("max.diffs"), warn = gdo("warn"))

Arguments

a: character
b: character
max.diffs: integer(1L), number of differences (default 50000L) after which we abandon the O(n^2) diff algorithm in favor of a naive O(n) one. Set to -1L to stick to the original algorithm up to the maximum allowed (~INT_MAX/4).
warn: TRUE (default) or FALSE, whether to warn if we hit max.diffs.
extra: TRUE (default) or FALSE, whether to also return the indices in a and b the diff values are taken from. Set to FALSE for a small performance gain.
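
The arguments above might be combined as in the following sketch. It only runs if diffobj is installed, hence the `requireNamespace` guard, and it makes no claim about timings. The variable names `full`, `slim`, and `script` are chosen for this illustration.

```r
if (requireNamespace("diffobj", quietly = TRUE)) {
  a <- letters[1:6]
  b <- c("b", "CC", "DD", "d", "f")

  # Default: index columns id.a and id.b are included
  full <- diffobj::ses_dat(a, b)
  # extra = FALSE skips the index columns for a small performance gain
  slim <- diffobj::ses_dat(a, b, extra = FALSE)
  print(names(full))
  print(names(slim))

  # max.diffs = -1L keeps the original algorithm rather than switching to
  # the naive fallback; warn = FALSE silences the max.diffs warning
  script <- diffobj::ses(a, b, max.diffs = -1L, warn = FALSE)
  print(script)
}
```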

Value

For ses, a character vector containing the shortest edit script. For ses_dat, a machine-readable version of it as a ses_dat object, which is a data.frame with columns op (factor, values “Match”, “Insert”, or “Delete”), val (character, the value taken from either a or b), and, if extra is TRUE, integer columns id.a and id.b giving the indices in a or b that val was taken from. See Details.

Details

ses will be much faster than any of the diff* methods, particularly for large inputs with limited numbers of differences.

NAs are treated as the string “NA”. Non-character inputs are coerced to character.

ses_dat provides a semi-processed “machine-readable” version of the precursor data to ses. It may be useful for those who want to use the raw diff data rather than the printed output of diffobj, but do not wish to parse the ses output manually. Whether it is faster than ses depends on the ratio of matching to non-matching values, as ses_dat includes matching values whereas ses does not. ses_dat objects have a print method that makes it easy to interpret the diff, but are actually data.frames. You can see the underlying data by using as.data.frame, removing the "ses_dat" class, etc.
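
One way to work with the raw data is sketched below: counting operations and grouping consecutive non-matching rows into hunks. To stay runnable without diffobj, it uses a plain data.frame built by hand as a stand-in for `as.data.frame(ses_dat(a, b))`, with the op and val columns copied from the Examples output. The names `changed`, `hunk_id`, and `hunks` are illustrative only.

```r
# Hand-built stand-in for as.data.frame(ses_dat(a, b)); with diffobj
# installed, the real object could be used instead.
dat <- data.frame(
  op = factor(
    c("Delete", "Match", "Delete", "Insert", "Insert", "Match", "Delete", "Match"),
    levels = c("Match", "Insert", "Delete")
  ),
  val = c("a", "b", "c", "CC", "DD", "d", "e", "f"),
  stringsAsFactors = FALSE
)

# Count how many rows fall under each operation
counts <- table(dat$op)
print(counts)

# Label runs of consecutive rows that share the same changed / unchanged status
changed <- dat$op != "Match"
runs <- rle(changed)
hunk_id <- rep(seq_along(runs$lengths), runs$lengths)

# Keep only the changed rows, one data.frame per hunk
hunks <- split(dat[changed, ], hunk_id[changed])
print(hunks)
```

Each element of `hunks` holds one contiguous block of deletions and insertions, which is roughly the unit a custom renderer would display together.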

Examples

a <- letters[1:6]
b <- c('b', 'CC', 'DD', 'd', 'f')
ses(a, b)
#> [1] "1d0"   "3c2,3" "5d4"  
(dat <- ses_dat(a, b))
#> "ses_dat" object (Match: 3, Delete: 3, Insert: 2):
#>                     
#> D: a   c         e  
#> M:   b         d   f
#> I:       CC DD      
str(dat)                 # data.frame with a print method
#> Classes ‘ses_dat’ and 'data.frame':	8 obs. of  4 variables:
#>  $ op  : Factor w/ 3 levels "Match","Insert",..: 3 1 3 2 2 1 3 1
#>  $ val : chr  "a" "b" "c" "CC" ...
#>  $ id.a: int  1 2 3 NA NA 4 5 6
#>  $ id.b: int  NA NA NA 2 3 NA NA NA

## use `ses_dat` output to construct a minimal diff
## color with ANSI CSI SGR
diff <- dat[['val']]
del <- dat[['op']] == 'Delete'
ins <- dat[['op']] == 'Insert'
if(any(del))
  diff[del] <- paste0("\033[33m- ", diff[del], "\033[m")
if(any(ins))
  diff[ins] <- paste0("\033[34m+ ", diff[ins], "\033[m")
if(any(!ins & !del))
  diff[!ins & !del] <- paste0("  ", diff[!ins & !del])
writeLines(diff)
#> - a
#>   b
#> - c
#> + CC
#> + DD
#>   d
#> - e
#>   f

## We can recover `a` and `b` from the data
identical(subset(dat, op != 'Insert', val)[[1]], a)
#> [1] TRUE
identical(subset(dat, op != 'Delete', val)[[1]], b)
#> [1] TRUE