Java (Apache POI) : How to retrieve comment/annotation and associated highlight text from Microsoft Word?

Discussion:

Ramani Routray

2017-05-09 16:42:45 UTC

I have a Microsoft Word (.docx) file and am trying to retrieve its comments and the highlighted text each comment is attached to. Can you please help?

I'm attaching a picture of the sample Word document and the Java code for extracting the comments. [ A file with a line "My name is John". The word "John" is highlighted with a comment "Noun" ]

I am able to extract the comments (Noun, Adjective). I would like to extract the text associated with each comment (Noun = John, Adjective = great).

// Open the .docx and read its comment table.
FileInputStream fis = new FileInputStream(new File(msWordFilePath));
XWPFDocument adoc = new XWPFDocument(fis);
XWPFWordExtractor xwe = new XWPFWordExtractor(adoc);
XWPFComment[] comments = adoc.getComments();

// Copy each comment's id and text into an annotation object.
for (int idx = 0; idx < comments.length; idx++)
{
    MSWordAnnotation annot = new MSWordAnnotation();
    annot.setAnnotationName(comments[idx].getId());
    annot.setAnnotationValue(comments[idx].getText());
    aList.add(annot);
}

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-***@poi.apache.org
For additional commands, e-mail: dev-***@poi.apache.org

Javen O'Neal

2017-05-10 06:14:07 UTC

First, if you're using Java 5 or later, you can use for-each loops for
more readable code.
for (final XWPFComment comment : adoc.getComments()) {
    final String id = comment.getId();
    final String author = comment.getAuthor();
    final String text = comment.getText();
}

I don't see anything in POI right now that makes it easy to extract the
annotated text that a track changes comment refers to.

Here's the current implementation of XWPFComment:
https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xwpf/usermodel/XWPFComment.java?view=markup

Taking a look at the OOXML 2006 schemas wml.xsd (download from
http://www.ecma-international.org/publications/files/ECMA-ST/Office%20Open%20XML%201st%20edition%20Part%204%20(PDF).zip,
extract OfficeOpenXML-Part4a.zip, extract OfficeOpenXML-XMLSchema.zip,
open wml.xsd), I see that the comment (*.docx/word/comments.xml)
doesn't refer to the document text.

<xsd:complexType name="CT_Comment">
  <xsd:complexContent>
    <xsd:extension base="CT_TrackChange">
      <xsd:sequence>
        <xsd:group ref="EG_BlockLevelElts" minOccurs="0" maxOccurs="unbounded"></xsd:group>
      </xsd:sequence>
      <xsd:attribute name="initials" type="ST_String" use="optional">
        <xsd:annotation>
          <xsd:documentation>Initials of Comment Author</xsd:documentation>
        </xsd:annotation>
      </xsd:attribute>
    </xsd:extension>
  </xsd:complexContent>
</xsd:complexType>

<xsd:complexType name="CT_TrackChange">
  <xsd:complexContent>
    <xsd:extension base="CT_Markup">
      <xsd:attribute name="author" type="ST_String" use="required">
        <xsd:annotation>
          <xsd:documentation>Annotation Author</xsd:documentation>
        </xsd:annotation>
      </xsd:attribute>
      <xsd:attribute name="date" type="ST_DateTime" use="optional">
        <xsd:annotation>
          <xsd:documentation>Annotation Date</xsd:documentation>
        </xsd:annotation>
      </xsd:attribute>
    </xsd:extension>
  </xsd:complexContent>
</xsd:complexType>

<xsd:complexType name="CT_Markup">
  <xsd:attribute name="id" type="ST_DecimalNumber" use="required">
    <xsd:annotation>
      <xsd:documentation>Annotation Identifier</xsd:documentation>
    </xsd:annotation>
  </xsd:attribute>
</xsd:complexType>

Examining the unzipped XML contents of a simple comment example .docx
file that I created, I see that the relationship is the other way
around: the document refers to the comments (this ordering makes more
sense anyway).
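
For anyone who wants to look at the raw XML themselves, here is a minimal sketch that reads one part out of a .docx, which is an ordinary zip archive. It uses only java.util.zip rather than POI. The class and method names (DocxPartReader, readPart) are made up for illustration, and the sketch assumes the part is UTF-8 encoded and that Java 9+ is available for readAllBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of POI: reads a single part from a .docx archive.
final class DocxPartReader {

    /** Returns the named part (for example "word/document.xml") as a string. */
    public static String readPart(String docxPath, String partName) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry(partName);
            if (entry == null) {
                throw new IOException("No such part in archive: " + partName);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                // Assumption: the part is UTF-8 encoded.
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }
}
```

Calling readPart(path, "word/document.xml") gives the document body, and readPart(path, "word/comments.xml") gives the comment table.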

For a simple file that I created with the text "My name is John." and
annotated the word John with a comment with the message "Noun", here's
what I got in CommentExample.docx/word/document.xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns....>
  <w:body>

    <w:p w:rsidR="00000000" w:rsidDel="00000000" w:rsidP="00000000" w:rsidRDefault="00000000" w:rsidRPr="00000000">
      <w:pPr>
        <w:pBdr/>
        <w:contextualSpacing w:val="0"/>
        <w:rPr/>
      </w:pPr>

      <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
        <w:rPr><w:rtl w:val="0"/></w:rPr>
        <w:t xml:space="preserve">My name is </w:t>
      </w:r>

      <w:commentRangeStart w:id="0"/>
      <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
        <w:rPr><w:rtl w:val="0"/></w:rPr>
        <w:t xml:space="preserve">John</w:t>
      </w:r>
      <w:commentRangeEnd w:id="0"/>

      <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
        <w:commentReference w:id="0"/>
      </w:r>

      <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
        <w:rPr><w:rtl w:val="0"/></w:rPr>
        <w:t xml:space="preserve">.</w:t>
      </w:r>

    </w:p>
    <w:sectPr>
      <w:pgSz w:h="15840" w:w="12240"/>
      <w:pgMar w:bottom="1440" w:top="1440" w:left="1440" w:right="1440" w:header="0"/>
      <w:pgNumType w:start="1"/>
    </w:sectPr>
  </w:body>
</w:document>

So to solve your problem, you could either:
1. scan document.xml for every commentRangeStart/commentRangeEnd pair,
join all the text contained between the two markers, and use the ID on
the markers to look up the comment's author and text in the comment
table
2. for each comment in the comment table, find the commentRangeStart
and commentRangeEnd tags in document.xml that carry the same ID and get
the text that was annotated (in this example, John).
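
As a rough illustration of option 1, the sketch below walks document.xml once with the JDK's DOM parser and collects the text between each pair of markers, keyed by comment ID. It does not use any POI API. The class name CommentRangeExtractor is hypothetical, and the code assumes the transitional WordprocessingML namespace and that the annotated text lives in w:t elements.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of POI.
final class CommentRangeExtractor {
    // Assumption: transitional OOXML namespace (strict documents use a different URI).
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Maps each comment id to the text found between its range markers. */
    public static Map<String, String> extract(String documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations; Word documents do not need them.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Element root = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(documentXml)))
            .getDocumentElement();

        Map<String, StringBuilder> buffers = new LinkedHashMap<>();
        Set<String> openIds = new LinkedHashSet<>(); // ranges may overlap
        walk(root, openIds, buffers);

        Map<String, String> result = new LinkedHashMap<>();
        buffers.forEach((id, text) -> result.put(id, text.toString()));
        return result;
    }

    // Depth-first walk in document order, so markers and text are seen in sequence.
    private static void walk(Node node, Set<String> openIds,
                             Map<String, StringBuilder> buffers) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && W_NS.equals(node.getNamespaceURI())) {
            Element el = (Element) node;
            String id = el.getAttributeNS(W_NS, "id");
            switch (el.getLocalName()) {
                case "commentRangeStart":
                    openIds.add(id);
                    buffers.putIfAbsent(id, new StringBuilder());
                    return;
                case "commentRangeEnd":
                    openIds.remove(id);
                    return;
                case "t":
                    // Text belongs to every comment range that is currently open.
                    for (String openId : openIds) {
                        buffers.get(openId).append(el.getTextContent());
                    }
                    return;
                default:
                    break;
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            walk(child, openIds, buffers);
        }
    }
}
```

The returned IDs can then be matched against XWPFComment.getId() to pair each annotated text with its comment.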

If you don't already have a development environment set up, I
encourage you to do so. Patches are greatly appreciated.

Javen O'Neal

2017-05-10 06:54:57 UTC

A few additions, since <paragraph><commentRangeStart id="commentId"
/><run><text>John</text></run><commentRangeEnd id="commentId"
/></paragraph> is the critical thing:


<w:commentRangeStart w:id="0"/>
<w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
  <w:rPr><w:rtl w:val="0"/></w:rPr>
  <w:t xml:space="preserve">John</w:t>
</w:r>
<w:commentRangeEnd w:id="0"/>

<xsd:element name="commentRangeStart" type="CT_MarkupRange">
  <xsd:annotation>
    <xsd:documentation>Comment Anchor Range Start</xsd:documentation>
  </xsd:annotation>
</xsd:element>
<xsd:element name="commentRangeEnd" type="CT_MarkupRange">
  <xsd:annotation>
    <xsd:documentation>Comment Anchor Range End</xsd:documentation>
  </xsd:annotation>
</xsd:element>

So if performance isn't a concern here (you don't need to save
pointers to where the comment ranges are), the pseudo-code for an
XWPFComment method that gets the text that a comment refers to would
be:

public String getRefersToText() {
    StringBuilder refersTo = new StringBuilder();
    boolean inRange = false;
    for each CTParagraph in document:
        for each child element of the CTParagraph:
            if child element is a commentRangeStart and id==this.id
                inRange = true
                continue
            if inRange and child element is a text run
                append this text run to the refersTo buffer
            if child element is a commentRangeEnd and id==this.id
                return refersTo.toString() (assuming that one
                    comment may not refer to multiple text ranges)
    return null (no complete range with this id was found)
}

This would require searching the entire document for every comment.
https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xwpf/usermodel/XWPFDocument.java?view=markup
https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xwpf/usermodel/XWPFParagraph.java?view=markup
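
The pseudo-code above could be tried out against the raw document.xml before touching POI itself. The sketch below is one way to do that with the JDK's StAX parser; it is not POI code, and the names CommentTextLookup and refersToText are invented for illustration. It assumes the transitional WordprocessingML namespace, reads text from w:t elements, and stops at the first matching commentRangeEnd.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of POI.
final class CommentTextLookup {
    // Assumption: transitional OOXML namespace.
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text anchored by the comment id, or null if no range start is found. */
    public static String refersToText(String documentXml, String commentId)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_COALESCING, true);
        XMLStreamReader reader =
            factory.createXMLStreamReader(new StringReader(documentXml));

        StringBuilder refersTo = new StringBuilder();
        boolean inRange = false; // between commentRangeStart and commentRangeEnd
        boolean inText = false;  // inside a w:t element
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && W_NS.equals(reader.getNamespaceURI())) {
                    String name = reader.getLocalName();
                    String id = reader.getAttributeValue(W_NS, "id");
                    if (name.equals("commentRangeStart") && commentId.equals(id)) {
                        inRange = true;
                    } else if (name.equals("commentRangeEnd") && commentId.equals(id)) {
                        return refersTo.toString(); // first matching end wins
                    } else if (name.equals("t")) {
                        inText = true;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && W_NS.equals(reader.getNamespaceURI())
                        && reader.getLocalName().equals("t")) {
                    inText = false;
                } else if (event == XMLStreamConstants.CHARACTERS && inRange && inText) {
                    refersTo.append(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
        // Start marker seen but no end marker: return what was collected so far.
        return inRange ? refersTo.toString() : null;
    }
}
```

Like the pseudo-code, this rescans the document for every comment, so it costs one pass per lookup.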
