XML parsing - Extracting specific internal nodes from an XML file and constructing a data frame in R
I have an XML file and want to extract specific nodes in R using xmlToDataFrame from the XML package. I can use the function to extract data from individual nodes, e.g.:
    xml <- xmlParse("file.xml")
    df <- xmlToDataFrame(getNodeSet(xml, "//lat"))
However, I am wondering whether it is possible to extract multiple nodes at the same time. I am looking to make a 5-column data frame by extracting data from the nodes //nucleotides, //lat, //lon, //bin_uri, and //record_id in the XML.
The structure of the XML file is as follows (only one record_id is shown; there are many in the file that I need to extract):
    <record>
      <record_id>634750</record_id>
      <processid>ccsma054-07</processid>
      <bin_uri>aag2098</bin_uri>
      <collection_event>
        <collectors>arctic ecology</collectors>
        <coordinates>
          <lat>58.805</lat>
          <lon>-94.214</lon>
        </coordinates>
        <country>canada</country>
        <province>manitoba</province>
      </collection_event>
      <sequences>
        <sequence>
          <sequenceid>3336699</sequenceid>
          <markercode>coi-5p</markercode>
          <genbank_accession>hq938393</genbank_accession>
          <nucleotides>ctcagagttctcacctggc</nucleotides>
        </sequence>
      </sequences>
    </record>
Consider running one XPath expression per node with xpathSApply(), then binding the results into a data frame:
    library(XML)

    doc <- xmlParse("d:/freelance work/scripts/boldxml.xml")

    # Pull the text of each target node using its full path under <record>
    record_id   <- xpathSApply(doc, "//record/record_id", xmlValue)
    bin_uri     <- xpathSApply(doc, "//record/bin_uri", xmlValue)
    lat         <- xpathSApply(doc, "//record/collection_event/coordinates/lat", xmlValue)
    lon         <- xpathSApply(doc, "//record/collection_event/coordinates/lon", xmlValue)
    nucleotides <- xpathSApply(doc, "//record/sequences/sequence/nucleotides", xmlValue)

    # Bind the vectors into one data frame (this assumes every record
    # contains each of the five nodes exactly once)
    df <- data.frame(record_id = unlist(record_id),
                     bin_uri = unlist(bin_uri),
                     lat = unlist(lat),
                     lng = unlist(lon),
                     nucleotides = unlist(nucleotides))
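The approach above lines the columns up only if every record contains each node exactly once. If some records lack, say, coordinates, the vectors end up with different lengths. A minimal sketch of a per-record variant follows, assuming the same XML package; the helper name first_value, the <records> wrapper element, and the second sample record are made up for illustration and are not part of the original data.

```r
library(XML)

# Sample input: the first record uses values from the question; the second is
# a hypothetical record with no <collection_event>, added to show the NA case.
xml_text <- '
<records>
  <record>
    <record_id>634750</record_id>
    <bin_uri>aag2098</bin_uri>
    <collection_event>
      <coordinates><lat>58.805</lat><lon>-94.214</lon></coordinates>
    </collection_event>
    <sequences>
      <sequence><nucleotides>ctcagagttctcacctggc</nucleotides></sequence>
    </sequences>
  </record>
  <record>
    <record_id>0000001</record_id>
    <bin_uri>example</bin_uri>
    <sequences>
      <sequence><nucleotides>acgt</nucleotides></sequence>
    </sequences>
  </record>
</records>'

doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: text of the first node matching a path relative to
# `node`, or NA when the record has no such node.
first_value <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# One list element per <record>; each column is filled record by record,
# so rows stay aligned even when a node is missing.
records <- getNodeSet(doc, "//record")

df <- data.frame(
  record_id   = sapply(records, first_value, path = "./record_id"),
  bin_uri     = sapply(records, first_value, path = "./bin_uri"),
  lat         = as.numeric(sapply(records, first_value,
                                  path = "./collection_event/coordinates/lat")),
  lon         = as.numeric(sapply(records, first_value,
                                  path = "./collection_event/coordinates/lon")),
  nucleotides = sapply(records, first_value,
                       path = "./sequences/sequence/nucleotides"),
  stringsAsFactors = FALSE
)
```

Only the first match per record is kept, so a record with several <sequence> entries would still produce a single row; whether that is the right choice depends on the data.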
Alternatively, you can simplify the raw XML using XSLT, a special-purpose language that restructures/re-designs XML files. While R does not have a universal XSLT package, practically all general-purpose languages (C#, Java, PHP, Perl, Python, VB) maintain XSLT libraries, and you can call scripts written in them from R with system(). What's more, command-line environments such as Windows' PowerShell and Linux's Bash can run XSLT.
XSLT script (save as .xsl or .xslt)
    <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
      <xsl:output version="1.0" encoding="utf-8" indent="yes" />
      <xsl:strip-space elements="*"/>

      <xsl:template match="/">
        <root>
          <xsl:apply-templates select="*"/>
        </root>
      </xsl:template>

      <xsl:template match="record">
        <xsl:copy>
          <xsl:copy-of select="record_id"/>
          <xsl:copy-of select="bin_uri"/>
          <xsl:copy-of select="collection_event/coordinates/lat"/>
          <xsl:copy-of select="collection_event/coordinates/lon"/>
          <xsl:copy-of select="sequences/sequence/nucleotides"/>
        </xsl:copy>
      </xsl:template>
    </xsl:transform>
XML (after transformation)
    <?xml version="1.0" encoding="utf-8"?>
    <root>
      <record>
        <record_id>634750</record_id>
        <bin_uri>aag2098</bin_uri>
        <lat>58.805</lat>
        <lon>-94.214</lon>
        <nucleotides>ctcagagttctcacctggc</nucleotides>
      </record>
    </root>
R script:
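    library(XML)

    # Placeholder: replace the string with the actual command for your XSLT processor
    result <- system('..some command line call to an external script that parses the original XML and the above XSLT script, and transforms the former with the latter..', intern = TRUE)

    # Read the flattened XML; each <record> becomes one row
    doc <- xmlParse("c:/path/to/transformed/xml.xml")
    df <- xmlToDataFrame(getNodeSet(doc, "//record"))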
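As one possible way to fill in the command-line placeholder, here is a sketch that shells out to xsltproc, the command-line XSLT 1.0 processor that ships with libxslt. It assumes xsltproc is installed and on the PATH; the function name xslt_to_data_frame and its arguments are hypothetical, and any other XSLT processor could be substituted.

```r
library(XML)

# Hypothetical wrapper: run the stylesheet over the input XML with xsltproc,
# then read the flattened <record> elements into a data frame.
#   xml_in  : path to the original XML file
#   xsl     : path to the XSLT script
#   xml_out : where to write the transformed XML (a temp file by default)
xslt_to_data_frame <- function(xml_in, xsl,
                               xml_out = tempfile(fileext = ".xml")) {
  # Equivalent shell call: xsltproc -o <xml_out> <xsl> <xml_in>
  status <- system2("xsltproc",
                    args = c("-o", shQuote(xml_out), shQuote(xsl), shQuote(xml_in)))
  if (status != 0) {
    stop("xsltproc failed with exit status ", status)
  }

  doc <- xmlParse(xml_out)
  xmlToDataFrame(getNodeSet(doc, "//record"))
}

# Example call (file names are placeholders):
# df <- xslt_to_data_frame("original.xml", "transform.xsl")
```

system2() is used instead of system() here so the arguments can be passed as a vector and the exit status checked directly.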