data.tree is to hierarchical data what data.frame is to tabular data: an extensible, general-purpose structure to store, manipulate, and display hierarchical data.

Introduction

Hierarchical data is ubiquitous in statistics and programming (XML, search trees, family trees, classification, file systems, etc.). However, no general-purpose tree data structure is available in R. Where tabular data has data.frame, hierarchical data is often modeled as lists of lists or similar makeshift structures, which are often difficult to manage. This is where the data.tree package steps in. It lets you build trees of hierarchical data for various uses: to print them, to rapidly prototype search algorithms, to test new classification algorithms, and much more.

Tree Traversal

data.tree lets you Traverse trees in various orders (pre-order, post-order, level, etc.) and run operations on Nodes via Do. Similarly, you can collect and store data while traversing a tree using the Get and Set methods.
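As a minimal sketch of these traversal tools, the following uses the acme dataset that ships with the package (the same one used in the Examples section). The attribute names expCost and id are chosen for illustration only.

```r
library(data.tree)
data(acme)

# Collect node names in three different traversal orders
preOrder   <- acme$Get("name", traversal = "pre-order")
postOrder  <- acme$Get("name", traversal = "post-order")
levelOrder <- acme$Get("name", traversal = "level")

# Traverse returns a list of Nodes; filterFun keeps only the leaves here
leaves <- Traverse(acme, filterFun = isLeaf)

# Do runs a function on each Node in that list
Do(leaves, function(node) node$expCost <- node$p * node$cost)

# Set assigns one value per Node, in traversal order (pre-order by default)
acme$Set(id = 1:acme$totalCount)

print(acme, "id", "expCost")
```

Pre-order visits a parent before its children, post-order visits it after them, so the root comes first in preOrder and last in postOrder.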

Methods

The package also contains utility functions to Sort, to Prune, to Aggregate and Cumulate and to print in custom formats.
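A short sketch of how these utilities can be combined, again on the bundled acme dataset. The attribute names totalCost and cumCost and the pruning threshold of 1,000,000 are assumptions made for this illustration.

```r
library(data.tree)
data(acme)

# Aggregate: sum the leaf costs up the tree and store the result on each Node.
# Post-order ensures children are processed before their parent.
acme$Do(
  function(node) node$totalCost <- Aggregate(node, attribute = "cost", aggFun = sum),
  traversal = "post-order"
)

# Sort: order the children of each Node by aggregated cost, largest first
Sort(acme, "totalCost", decreasing = TRUE)

# Cumulate: running sum of totalCost across siblings, in their current order
acme$Do(function(node) node$cumCost <- Cumulate(node, attribute = "totalCost", aggFun = sum))

# Prune: drop every Node (and its sub-tree) for which the function returns FALSE
Prune(acme, pruneFun = function(node) node$totalCost >= 1e6)

print(acme, "totalCost", "cumCost")
```

Note that Sort and Prune modify the tree in place, so the order of these calls matters: pruning before aggregating would change the totals.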

Construction and Conversion

The package also contains many conversions from and to data.tree structures. For a full list, see the "See also" section of as.Node.

You can construct a tree from a data.frame using as.Node.data.frame, and convert it back using as.data.frame.Node. Similar options exist for lists of lists. For more specialized conversions, see as.dendrogram.Node, as.Node.dendrogram, as.phylo.Node and as.Node.phylo.
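The data.frame round trip can be sketched as follows. The input data.frame is made up for illustration; it assumes the default convention of as.Node.data.frame, where a pathString column holds the "/"-separated path of each node and the remaining columns become attributes.

```r
library(data.tree)

# Hypothetical input: one row per leaf, identified by its path from the root
df <- data.frame(
  pathString = c(
    "Acme Inc./IT/Outsource",
    "Acme Inc./IT/Go agile",
    "Acme Inc./Research/New Labs"
  ),
  cost = c(400000, 250000, 750000),
  stringsAsFactors = FALSE
)

# as.Node dispatches to as.Node.data.frame; intermediate nodes are created as needed
tree <- as.Node(df)
print(tree, "cost")

# Back to a data.frame with one row per leaf
back <- ToDataFrameTable(tree, "pathString", "cost")
```

ToDataFrameTable is used for the way back because it returns one row per leaf, which matches the shape of the input. as.data.frame(tree) would instead return one row per node, including the inner nodes.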

Finally, easy conversion options exist for list, data.frame, JSON, YAML, igraph, ape, rpart, party and more. The supported directions are:

  • list: both directions

  • data.frame: both directions

  • JSON, YAML: both directions, via lists

  • igraph: from igraph to data.tree

  • ape: both directions

  • rpart: from rpart to data.tree

  • party: from party to data.tree
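As an illustration of the list conversions, the sketch below builds a tree from a nested list and converts it back. The list contents are made up, and the example relies on the default "simple" mode of as.Node.list, in which every nested list is interpreted as a child node and every other element as an attribute.

```r
library(data.tree)

# Hypothetical nested list: the root name comes from the "name" element,
# child names come from the names of the nested lists
lol <- list(
  name = "Acme Inc.",
  IT = list(
    Outsource = list(cost = 400000),
    `Go agile` = list(cost = 250000)
  ),
  Research = list(
    `New Labs` = list(cost = 750000)
  )
)

tree <- as.Node(lol)      # dispatches to as.Node.list
print(tree, "cost")

asList <- as.list(tree)   # back to a nested list
str(asList)
```

Because JSON and YAML conversions go via lists, a list such as asList is the kind of object you would hand to a JSON or YAML serializer of your choice.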

Node and Reference Semantics

The entry point to the package is Node. Each tree is composed of a number of Nodes, referencing each other.

One of the most important things to note about data.tree is that it exhibits reference semantics. In a nutshell, this means that you can modify your tree along the way, without having to reassign it to a variable after each modification. This is rather exceptional behavior in R, where value semantics is the norm.
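The following sketch illustrates what reference semantics means in practice. The tree and the cost values are made up for illustration. Clone, which is not mentioned above, is the package function for making an independent copy of a tree.

```r
library(data.tree)

# Build a small tree by hand; AddChild returns the newly created child Node
tree <- Node$new("Acme Inc.")
it <- tree$AddChild("IT")
it$AddChild("Outsource")

# Plain assignment does not copy: alias and tree refer to the same object
alias <- tree
outsource <- alias$IT$Outsource
outsource$cost <- 400000
tree$IT$Outsource$cost      # 400000, visible without reassigning tree

# Clone creates an independent copy, so changes no longer propagate
copy <- Clone(tree)
copiedOutsource <- copy$IT$Outsource
copiedOutsource$cost <- 1
tree$IT$Outsource$cost      # still 400000
```

The same applies to functions: a function that modifies a Node passed to it changes the caller's tree, so use Clone first if you need to keep the original untouched.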

Applications

data.tree is not optimised for computational speed, but for implementation speed. In particular, its memory footprint is relatively large compared to traditional R data structures. However, it can easily handle trees with several thousand nodes, and once a tree is constructed, operations on it are relatively fast. data.tree is particularly useful when

  • you want to develop and test a new algorithm

  • you want to import and convert tree structures (it imports and exports to list-of-list, data.frame, yaml, json, igraph, dendrogram, phylo and more)

  • you want to play around with data, display it and get an understanding

  • you want to test another package, to compare it with your own results

  • you need to do homework

For a quick overview of the features, read the data.tree vignette by running vignette("data.tree"). For stylized applications, see vignette("applications", package = "data.tree").

See also

Node

For more details, see the data.tree vignette by running: vignette("data.tree")

Author

Maintainer: Christoph Glur christoph.glur@powerpartners.pro (R interface)

Other contributors:

  • Russ Hyde (improve dependencies) [contributor]

  • Chris Hammill (improve getting) [contributor]

  • Facundo Munoz (improve list conversion) [contributor]

  • Markus Wamser (fixed some typos) [contributor]

  • Pierre Formont (additional features) [contributor]

  • Kent Russel (documentation) [contributor]

  • Noam Ross (fixes) [contributor]

  • Duncan Garmonsway (fixes) [contributor]

Examples

data(acme)
print(acme)
#>                           levelName
#> 1  Acme Inc.                       
#> 2   ¦--Accounting                  
#> 3   ¦   ¦--New Software            
#> 4   ¦   °--New Accounting Standards
#> 5   ¦--Research                    
#> 6   ¦   ¦--New Product Line        
#> 7   ¦   °--New Labs                
#> 8   °--IT                          
#> 9       ¦--Outsource               
#> 10      ¦--Go agile                
#> 11      °--Switch to R             
acme$attributesAll
#> [1] "cost" "p"   
acme$count
#> [1] 3
acme$totalCount
#> [1] 11
acme$isRoot
#> [1] TRUE
acme$height
#> [1] 3
print(acme, "p", "cost")
#>                           levelName    p    cost
#> 1  Acme Inc.                          NA      NA
#> 2   ¦--Accounting                     NA      NA
#> 3   ¦   ¦--New Software             0.50 1000000
#> 4   ¦   °--New Accounting Standards 0.75  500000
#> 5   ¦--Research                       NA      NA
#> 6   ¦   ¦--New Product Line         0.25 2000000
#> 7   ¦   °--New Labs                 0.90  750000
#> 8   °--IT                             NA      NA
#> 9       ¦--Outsource                0.20  400000
#> 10      ¦--Go agile                 0.05  250000
#> 11      °--Switch to R              1.00   50000

outsource <- acme$IT$Outsource
class(outsource)
#> [1] "Node" "R6"  
print(outsource)
#>   levelName
#> 1 Outsource
outsource$attributes
#> [1] "cost" "p"   
outsource$isLeaf
#> [1] TRUE
outsource$level
#> [1] 3
outsource$path
#> [1] "Acme Inc." "IT"        "Outsource"
outsource$p
#> [1] 0.2
outsource$parent$name
#> [1] "IT"
outsource$root$name
#> [1] "Acme Inc."
outsource$expCost <- outsource$p * outsource$cost
print(acme, "expCost")
#>                           levelName expCost
#> 1  Acme Inc.                             NA
#> 2   ¦--Accounting                        NA
#> 3   ¦   ¦--New Software                  NA
#> 4   ¦   °--New Accounting Standards      NA
#> 5   ¦--Research                          NA
#> 6   ¦   ¦--New Product Line              NA
#> 7   ¦   °--New Labs                      NA
#> 8   °--IT                                NA
#> 9       ¦--Outsource                  80000
#> 10      ¦--Go agile                      NA
#> 11      °--Switch to R                   NA

acme$Get("p")
#>                Acme Inc.               Accounting             New Software 
#>                       NA                       NA                     0.50 
#> New Accounting Standards                 Research         New Product Line 
#>                     0.75                       NA                     0.25 
#>                 New Labs                       IT                Outsource 
#>                     0.90                       NA                     0.20 
#>                 Go agile              Switch to R 
#>                     0.05                     1.00 
acme$Do(function(x) x$expCost <- x$p * x$cost)
acme$Get("expCost", filterFun = isLeaf)
#>             New Software New Accounting Standards         New Product Line 
#>                   500000                   375000                   500000 
#>                 New Labs                Outsource                 Go agile 
#>                   675000                    80000                    12500 
#>              Switch to R 
#>                    50000 

ToDataFrameTable(acme, "name", "p", "cost", "level", "pathString")
#>                       name    p    cost level
#> 1             New Software 0.50 1000000     3
#> 2 New Accounting Standards 0.75  500000     3
#> 3         New Product Line 0.25 2000000     3
#> 4                 New Labs 0.90  750000     3
#> 5                Outsource 0.20  400000     3
#> 6                 Go agile 0.05  250000     3
#> 7              Switch to R 1.00   50000     3
#>                                      pathString
#> 1             Acme Inc./Accounting/New Software
#> 2 Acme Inc./Accounting/New Accounting Standards
#> 3           Acme Inc./Research/New Product Line
#> 4                   Acme Inc./Research/New Labs
#> 5                        Acme Inc./IT/Outsource
#> 6                         Acme Inc./IT/Go agile
#> 7                      Acme Inc./IT/Switch to R
ToDataFrameTree(acme, "name", "p", "cost", "level")
#>                           levelName                     name    p    cost level
#> 1  Acme Inc.                                       Acme Inc.   NA      NA     1
#> 2   ¦--Accounting                                 Accounting   NA      NA     2
#> 3   ¦   ¦--New Software                         New Software 0.50 1000000     3
#> 4   ¦   °--New Accounting Standards New Accounting Standards 0.75  500000     3
#> 5   ¦--Research                                     Research   NA      NA     2
#> 6   ¦   ¦--New Product Line                 New Product Line 0.25 2000000     3
#> 7   ¦   °--New Labs                                 New Labs 0.90  750000     3
#> 8   °--IT                                                 IT   NA      NA     2
#> 9       ¦--Outsource                               Outsource 0.20  400000     3
#> 10      ¦--Go agile                                 Go agile 0.05  250000     3
#> 11      °--Switch to R                           Switch to R 1.00   50000     3
ToDataFrameNetwork(acme, "p", "cost")
#>          from                       to    p    cost
#> 1   Acme Inc.               Accounting   NA      NA
#> 2   Acme Inc.                 Research   NA      NA
#> 3   Acme Inc.                       IT   NA      NA
#> 4  Accounting             New Software 0.50 1000000
#> 5  Accounting New Accounting Standards 0.75  500000
#> 6    Research         New Product Line 0.25 2000000
#> 7    Research                 New Labs 0.90  750000
#> 8          IT                Outsource 0.20  400000
#> 9          IT                 Go agile 0.05  250000
#> 10         IT              Switch to R 1.00   50000