XML parsing - Extracting specific internal nodes from an XML file and constructing a data frame in R


I have an XML file and want to extract specific nodes in R using xmlToDataFrame() from the XML package. I can use the function to extract data from individual nodes. For example:

    xml <- xmlParse("file.xml")
    df <- xmlToDataFrame(getNodeSet(xml, "//lat"))

However, I am wondering whether it is possible to extract multiple nodes at the same time. I am looking to make a 5-column data frame by extracting data from these nodes in the XML: //nucleotides, //lat, //lon, //bin_uri, //record_id.

The structure of the XML file is as follows (just one record is shown; there are many in the file that I need to extract):

    <record>
      <record_id>634750</record_id>
      <processid>ccsma054-07</processid>
      <bin_uri>aag2098</bin_uri>
      <collection_event>
        <collectors>arctic ecology</collectors>
        <coordinates>
          <lat>58.805</lat>
          <lon>-94.214</lon>
        </coordinates>
        <country>canada</country>
        <province>manitoba</province>
      </collection_event>
      <sequences>
        <sequence>
          <sequenceid>3336699</sequenceid>
          <markercode>coi-5p</markercode>
          <genbank_accession>hq938393</genbank_accession>
          <nucleotides>ctcagagttctcacctggc</nucleotides>
        </sequence>
      </sequences>
    </record>

Consider running a separate XPath expression for each field using xpathSApply(), then bind the results into a data frame:

    library(XML)

    doc <- xmlParse("d:/freelance work/scripts/boldxml.xml")

    # One XPath query per field; xmlValue returns each node's text content
    record_id <- xpathSApply(doc, "//record/record_id", xmlValue)
    bin_uri <- xpathSApply(doc, "//record/bin_uri", xmlValue)
    lat <- xpathSApply(doc, "//record/collection_event/coordinates/lat", xmlValue)
    lon <- xpathSApply(doc, "//record/collection_event/coordinates/lon", xmlValue)
    nucleotides <- xpathSApply(doc, "//record/sequences/sequence/nucleotides", xmlValue)

    # Bind the vectors column-wise (they must all have the same length)
    df <- data.frame(record_id = unlist(record_id),
                     bin_uri = unlist(bin_uri),
                     lat = unlist(lat),
                     lng = unlist(lon),
                     nucleotides = unlist(nucleotides))
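One caveat with the column-by-column approach: each xpathSApply() call returns only the nodes that actually exist, so if some record has no coordinates (for instance), the vectors come out with different lengths and data.frame() either fails or recycles values into the wrong rows. Below is a minimal sketch of a per-record alternative, assuming the record layout shown in the question. The sample XML, its second record, and the `first_value` helper are made up for illustration and are not part of the XML package.

```r
library(XML)

# Hypothetical two-record sample; the second record has no coordinates.
xml_text <- '<records>
  <record>
    <record_id>634750</record_id>
    <bin_uri>aag2098</bin_uri>
    <collection_event>
      <coordinates><lat>58.805</lat><lon>-94.214</lon></coordinates>
    </collection_event>
    <sequences><sequence><nucleotides>ctcagagttctcacctggc</nucleotides></sequence></sequences>
  </record>
  <record>
    <record_id>1</record_id>
    <bin_uri>zzz0001</bin_uri>
    <sequences><sequence><nucleotides>acgt</nucleotides></sequence></sequences>
  </record>
</records>'
doc <- xmlParse(xml_text, asText = TRUE)

# Illustrative helper: first match of a relative XPath under `node`, or NA.
first_value <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# Build one single-row data frame per <record>, then stack the rows.
records <- getNodeSet(doc, "//record")
rows <- lapply(records, function(rec) {
  data.frame(
    record_id   = first_value(rec, "./record_id"),
    bin_uri     = first_value(rec, "./bin_uri"),
    lat         = first_value(rec, "./collection_event/coordinates/lat"),
    lon         = first_value(rec, "./collection_event/coordinates/lon"),
    nucleotides = first_value(rec, "./sequences/sequence/nucleotides"),
    stringsAsFactors = FALSE
  )
})
df <- do.call(rbind, rows)

# xmlValue returns text, so convert the coordinates to numbers.
df$lat <- as.numeric(df$lat)
df$lon <- as.numeric(df$lon)
```

Because every field is looked up relative to its own record (note the leading `./`), a missing node becomes NA in that row instead of shifting the values of later records. For very large files, growing the result with rbind may be slow, so treat this as a starting point rather than a tuned solution.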

Alternatively, you can simplify the raw XML first using XSLT, a special-purpose language for restructuring and redesigning XML files. While R does not have a universal XSLT package, practically all general-purpose languages (C#, Java, PHP, Perl, Python, VB) maintain XSLT libraries, and you can call scripts written in them from R with system(). What's more, command-line environments such as Windows' PowerShell and Linux's Bash can run XSLT.

XSLT script (save as .xsl or .xslt)

    <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
      <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
      <xsl:strip-space elements="*"/>

      <xsl:template match="/">
        <root>
          <xsl:apply-templates select="*"/>
        </root>
      </xsl:template>

      <xsl:template match="record">
        <xsl:copy>
          <xsl:copy-of select="record_id"/>
          <xsl:copy-of select="bin_uri"/>
          <xsl:copy-of select="collection_event/coordinates/lat"/>
          <xsl:copy-of select="collection_event/coordinates/lon"/>
          <xsl:copy-of select="sequences/sequence/nucleotides"/>
        </xsl:copy>
      </xsl:template>

    </xsl:transform>

XML (after transformation)

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
      <record>
        <record_id>634750</record_id>
        <bin_uri>aag2098</bin_uri>
        <lat>58.805</lat>
        <lon>-94.214</lon>
        <nucleotides>ctcagagttctcacctggc</nucleotides>
      </record>
    </root>

R script:

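    library(XML)

    # Placeholder: replace the quoted text with the real command for your setup
    result <- system('..some command line call to an external script that
                      parses the original XML and the above XSLT script, and
                      transforms the former with the latter..', intern = TRUE)

    # Read the flattened XML; each <record> now becomes one row
    doc <- xmlParse("c:/path/to/transformed/xml.xml")
    df <- xmlToDataFrame(getNodeSet(doc, "//record"))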
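As one possible way to fill in the placeholder command, here is a rough sketch that shells out to xsltproc, the command-line XSLT 1.0 processor that ships with libxslt. It assumes xsltproc is installed and on the PATH, which is common on Linux and macOS but not a given on Windows. The function name `transform_to_df` and the file names in the usage comment are hypothetical.

```r
library(XML)

# Run an XSLT 1.0 stylesheet with xsltproc and read the transformed
# <record> elements into a data frame.
# Assumption: the xsltproc command-line tool is installed and on the PATH.
transform_to_df <- function(xml_in, xsl, xml_out = tempfile(fileext = ".xml")) {
  if (!nzchar(Sys.which("xsltproc"))) {
    stop("xsltproc was not found on the PATH")
  }
  # Equivalent shell command: xsltproc -o <output> <stylesheet> <input>
  status <- system2("xsltproc",
                    args = c("-o", shQuote(xml_out), shQuote(xsl), shQuote(xml_in)))
  if (status != 0) {
    stop("xsltproc exited with status ", status)
  }
  doc <- xmlParse(xml_out)
  xmlToDataFrame(nodes = getNodeSet(doc, "//record"), stringsAsFactors = FALSE)
}

# Usage (hypothetical file names):
# df <- transform_to_df("boldxml.xml", "flatten_records.xsl")
```

system2() is used here instead of system() only because it takes the arguments as a vector, which makes quoting paths with spaces a little easier; a plain system() call with a pasted command string should work just as well.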
