data.tree is to hierarchical data what data.frame is to tabular data: an extensible, general-purpose structure to store, manipulate,
and display hierarchical data.
Hierarchical data is ubiquitous in statistics and programming (XML, search trees, family trees, classification, file systems, etc.). However, no general-use tree data structure is available in R.
Where tabular data has data.frame, hierarchical data is often modeled in lists of lists or similar makeshifts. These
structures are often difficult to manage.
This is where the data.tree package steps in. It lets you build trees of hierarchical
data for various uses: to print, to rapidly prototype search algorithms, to test out new classification algorithms, and much more.
data.tree lets you Traverse trees in various orders (pre-order, post-order, level, etc.) and run operations on Nodes via
Do.
Similarly, you can collect and store data while traversing a tree using the Get and the Set methods.
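The traversal methods described above can be sketched on a small hand-built tree. This is a minimal illustration, not part of the package examples: the node names (shop, fruit, apple, pear, bread) and the fields cost, costDoubled and id are made up for the purpose.

```r
library(data.tree)

# Hypothetical example tree: a shop with two departments.
shop <- Node$new("shop")
fruit <- shop$AddChild("fruit")
apple <- fruit$AddChild("apple")
pear <- fruit$AddChild("pear")
bread <- shop$AddChild("bread")
apple$cost <- 10
pear$cost <- 20
bread$cost <- 5

# Get collects one value per node; the traversal argument controls the order.
preOrder <- shop$Get("name", traversal = "pre-order")
postOrder <- shop$Get("name", traversal = "post-order")

# Do runs a function on each node; filterFun = isLeaf skips inner nodes.
shop$Do(function(node) node$costDoubled <- node$cost * 2, filterFun = isLeaf)

# Set assigns one value per node, following the (default pre-order) traversal.
shop$Set(id = seq_len(shop$totalCount))

print(shop, "cost", "costDoubled", "id")
```

Because Do and Set modify the nodes in place, the new fields are visible on shop immediately, without reassigning the result.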
The package also contains utility functions to Sort, to Prune, to Aggregate and Cumulate
and to print in custom formats.
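As a rough sketch of three of these utilities, the following uses the acme data set that ships with the package. It works on a Clone so that acme itself is not changed; the variable names tree, totalCost and topLevel are illustrative only.

```r
library(data.tree)
data(acme)

tree <- Clone(acme)  # copy, so the packaged acme data stays untouched

# Aggregate: sum the leaf costs up to the root.
totalCost <- Aggregate(tree, "cost", sum)

# Sort: reorder the children of every node alphabetically by name.
Sort(tree, "name")
topLevel <- names(tree$children)

# Prune: keep only nodes for which the function returns TRUE;
# a dropped node takes its whole subtree with it.
Prune(tree, function(node) node$name != "IT")

print(tree, "cost")
```

Sort and Prune change the tree in place, which is why the copy is taken first.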
The package also contains many conversions from and to data.tree structures. Check out the See Also section of as.Node.
You can construct a tree from a data.frame using as.Node.data.frame, and convert it back using as.data.frame.Node.
Similar options exist for lists of lists.
For more specialized conversions, see as.dendrogram.Node, as.Node.dendrogram,
as.phylo.Node and as.Node.phylo.
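A data.frame round trip might look like the sketch below. The paths and the kb column are invented for illustration; the example assumes the default behaviour of as.Node for data.frames, where a column named pathString holds the "/"-separated path of each leaf and the remaining columns become node fields.

```r
library(data.tree)

# Hypothetical input: one row per leaf, identified by its path.
df <- data.frame(
  pathString = c("project/docs/readme", "project/docs/guide", "project/src/main"),
  kb = c(12, 48, 7),
  stringsAsFactors = FALSE
)

# as.Node dispatches to as.Node.data.frame for a data.frame argument.
tree <- as.Node(df)
print(tree, "kb")

# Convert the leaves back to a table with the same two columns.
back <- ToDataFrameTable(tree, "pathString", "kb")
```

ToDataFrameTable is used for the way back because it returns one row per leaf, which matches the shape of the input.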
Finally, easy conversion options from and to list, data.frame, JSON, YAML, igraph, ape, rpart, party and more exist:
list: both directions
data.frame: both directions
JSON, YAML: both directions, via lists
igraph: from igraph to data.tree
ape: both directions
rpart: from rpart to data.tree
party: from party to data.tree
The entry point to the package is Node. Each tree is composed of a number of Nodes, referencing each other.
One of the most important things to note about data.tree is that it exhibits reference semantics. In a nutshell, this means that you can modify
your tree along the way, without having to reassign it to a variable after each modification. By and large, this is rather exceptional behavior
in R, where value semantics is king most of the time.
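The difference can be seen in a short sketch using the acme data set. The variable names alias and copy are illustrative; Clone is the package function for making an independent copy of a tree.

```r
library(data.tree)
data(acme)

# Assignment does not copy: both names refer to the same Node object.
alias <- acme
alias$IT$Outsource$cost <- 1
acme$IT$Outsource$cost    # the change is visible through acme as well

# Clone creates an independent copy of the whole tree.
copy <- Clone(acme)
copy$IT$Outsource$cost <- 2
acme$IT$Outsource$cost    # unchanged by the edit on the copy
```

When a function should not alter the tree it receives, cloning first is the usual way to get value-like behaviour.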
data.tree is not optimised for computational speed, but for implementation speed. In particular, its memory
footprint is relatively large compared to traditional R data structures. However, it can easily handle trees with
several thousand nodes, and once a tree is constructed, operations on it are relatively fast.
data.tree is particularly useful when
you want to develop and test a new algorithm
you want to import and convert tree structures (it imports from and exports to list-of-list, data.frame, YAML, JSON, igraph, dendrogram, phylo and more)
you want to play around with data, display it and get an understanding
you want to test another package, to compare it with your own results
you need to do homework
For a quick overview of the features, read the data.tree vignette by running vignette("data.tree"). For stylized
applications, see vignette("applications", package = "data.tree").
data(acme)
print(acme)
#>                           levelName
#> 1  Acme Inc.
#> 2   ¦--Accounting
#> 3   ¦   ¦--New Software
#> 4   ¦   °--New Accounting Standards
#> 5   ¦--Research
#> 6   ¦   ¦--New Product Line
#> 7   ¦   °--New Labs
#> 8   °--IT
#> 9       ¦--Outsource
#> 10      ¦--Go agile
#> 11      °--Switch to R
acme$attributesAll
#> [1] "cost" "p"
acme$count
#> [1] 3
acme$totalCount
#> [1] 11
acme$isRoot
#> [1] TRUE
acme$height
#> [1] 3
print(acme, "p", "cost")
#>                           levelName    p    cost
#> 1  Acme Inc.                          NA      NA
#> 2   ¦--Accounting                     NA      NA
#> 3   ¦   ¦--New Software             0.50 1000000
#> 4   ¦   °--New Accounting Standards 0.75  500000
#> 5   ¦--Research                       NA      NA
#> 6   ¦   ¦--New Product Line         0.25 2000000
#> 7   ¦   °--New Labs                 0.90  750000
#> 8   °--IT                             NA      NA
#> 9       ¦--Outsource                0.20  400000
#> 10      ¦--Go agile                 0.05  250000
#> 11      °--Switch to R              1.00   50000
outsource <- acme$IT$Outsource
class(outsource)
#> [1] "Node" "R6"
print(outsource)
#>   levelName
#> 1 Outsource
outsource$attributes
#> [1] "cost" "p"
outsource$isLeaf
#> [1] TRUE
outsource$level
#> [1] 3
outsource$path
#> [1] "Acme Inc." "IT" "Outsource"
outsource$p
#> [1] 0.2
outsource$parent$name
#> [1] "IT"
outsource$root$name
#> [1] "Acme Inc."
outsource$expCost <- outsource$p * outsource$cost
print(acme, "expCost")
#>                           levelName expCost
#> 1  Acme Inc.                             NA
#> 2   ¦--Accounting                        NA
#> 3   ¦   ¦--New Software                  NA
#> 4   ¦   °--New Accounting Standards      NA
#> 5   ¦--Research                          NA
#> 6   ¦   ¦--New Product Line              NA
#> 7   ¦   °--New Labs                      NA
#> 8   °--IT                                NA
#> 9       ¦--Outsource                  80000
#> 10      ¦--Go agile                      NA
#> 11      °--Switch to R                   NA
acme$Get("p")
#>                Acme Inc.               Accounting             New Software
#>                       NA                       NA                     0.50
#> New Accounting Standards                 Research         New Product Line
#>                     0.75                       NA                     0.25
#>                 New Labs                       IT                Outsource
#>                     0.90                       NA                     0.20
#>                 Go agile              Switch to R
#>                     0.05                     1.00
acme$Do(function(x) x$expCost <- x$p * x$cost)
acme$Get("expCost", filterFun = isLeaf)
#>             New Software New Accounting Standards         New Product Line
#>                   500000                   375000                   500000
#>                 New Labs                Outsource                 Go agile
#>                   675000                    80000                    12500
#>              Switch to R
#>                    50000
ToDataFrameTable(acme, "name", "p", "cost", "level", "pathString")
#>                       name    p    cost level
#> 1             New Software 0.50 1000000     3
#> 2 New Accounting Standards 0.75  500000     3
#> 3         New Product Line 0.25 2000000     3
#> 4                 New Labs 0.90  750000     3
#> 5                Outsource 0.20  400000     3
#> 6                 Go agile 0.05  250000     3
#> 7              Switch to R 1.00   50000     3
#>                                      pathString
#> 1             Acme Inc./Accounting/New Software
#> 2 Acme Inc./Accounting/New Accounting Standards
#> 3           Acme Inc./Research/New Product Line
#> 4                   Acme Inc./Research/New Labs
#> 5                        Acme Inc./IT/Outsource
#> 6                         Acme Inc./IT/Go agile
#> 7                      Acme Inc./IT/Switch to R
ToDataFrameTree(acme, "name", "p", "cost", "level")
#>                           levelName                     name    p    cost level
#> 1  Acme Inc.                                       Acme Inc.   NA      NA     1
#> 2   ¦--Accounting                                 Accounting   NA      NA     2
#> 3   ¦   ¦--New Software                         New Software 0.50 1000000     3
#> 4   ¦   °--New Accounting Standards New Accounting Standards 0.75  500000     3
#> 5   ¦--Research                                     Research   NA      NA     2
#> 6   ¦   ¦--New Product Line                 New Product Line 0.25 2000000     3
#> 7   ¦   °--New Labs                                 New Labs 0.90  750000     3
#> 8   °--IT                                                 IT   NA      NA     2
#> 9       ¦--Outsource                               Outsource 0.20  400000     3
#> 10      ¦--Go agile                                 Go agile 0.05  250000     3
#> 11      °--Switch to R                           Switch to R 1.00   50000     3
ToDataFrameNetwork(acme, "p", "cost")
#>          from                       to    p    cost
#> 1   Acme Inc.               Accounting   NA      NA
#> 2   Acme Inc.                 Research   NA      NA
#> 3   Acme Inc.                       IT   NA      NA
#> 4  Accounting             New Software 0.50 1000000
#> 5  Accounting New Accounting Standards 0.75  500000
#> 6    Research         New Product Line 0.25 2000000
#> 7    Research                 New Labs 0.90  750000
#> 8          IT                Outsource 0.20  400000
#> 9          IT                 Go agile 0.05  250000
#> 10         IT              Switch to R 1.00   50000