wdkrnls February 2016

How to read in XML files with multiple root elements in R?

I've been handed a bunch of XML files that are not well formed: they all have multiple root elements. Both xmlParse in the XML package and read_xml in the xml2 package barf when I try to read them in, with Error: 1: Extra content at the end of the document. Is there a package that makes reading multiple root elements easy, or do I need to resort to more brutish methods?

Answers


rumoku February 2016

The XML standard does not allow multiple root elements.

I would advise you to read the content as a string, wrap it in a single root element, and pass the result to any of the R XML libraries.
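
A minimal sketch of that wrapping approach, using xml2. The helper name read_multi_root and the dummy root name "wrapper" are made up for illustration and are not part of any package. It assumes each file fits in memory and has at most one XML declaration, at the very top; a DOCTYPE or declarations in the middle of the file would need extra handling.

```r
library(xml2)

# Hypothetical helper (not part of xml2): wrap a multi-root fragment in a
# single dummy root element so that a strict XML parser accepts it.
read_multi_root <- function(path, root = "wrapper") {
  # Read the whole file into one string
  txt <- paste(readLines(path, warn = FALSE), collapse = "\n")
  # Drop a leading XML declaration, which is only legal at the very start
  txt <- sub("^\\s*<\\?xml[^>]*\\?>", "", txt, perl = TRUE)
  # Wrap the fragment in the dummy root and parse it as ordinary XML
  read_xml(paste0("<", root, ">", txt, "</", root, ">"))
}

# Usage: write a small multi-root example to a temporary file and read it back
tmp <- tempfile(fileext = ".xml")
writeLines("<xyz>1</xyz><xyz>2</xyz>", tmp)
doc <- read_multi_root(tmp)
xml_text(xml_find_all(doc, "//xyz"))
```

The original root elements end up as children of the dummy root, so XPath queries such as //xyz work as they would on a well-formed file.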


G. Grothendieck February 2016

Try read_html in the xml2 package. It can read such input by adding html and body tags around it. Here is an example:

library(xml2)
s <- "<xyz>1</xyz><xyz>2</xyz>"
doc <- read_html(s)

giving:

> doc
{xml_document}
<html>
[1] <body>\n  <xyz>1</xyz>\n  <xyz>2</xyz>\n</body>

Now we can operate on doc, e.g.

> xml_find_all(doc, "//xyz")
{xml_nodeset (2)}
[1] <xyz>1</xyz>
[2] <xyz>2</xyz>

This also works with the XML package:

library(XML)
doc <- htmlTreeParse(s, asText = TRUE, useInternalNodes = TRUE)
xpathSApply(xmlRoot(doc), "//xyz", xmlValue)

giving:

[1] "1" "2"

Post Status

Asked in February 2016
Viewed 2,912 times
Voted 6
Answered 2 times
