Solutions for Reading Word Documents in Java


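Reading a Word document (.doc or .docx) in Java requires a library or API that can parse the Word file formats.

This solution covers Apache POI, the most popular choice. Note that XSSF, which is sometimes mentioned alongside it, is not a JavaFX API: it is POI's own component for Excel .xlsx files. For Word, the relevant POI components are HWPF (.doc) and XWPF (.docx).

1. Apache POI

Apache POI is a popular open-source Java library for working with Microsoft Office file formats, including Word documents. Reading a Word document with it takes the following steps.

1.1 Add the Apache POI dependencies to the project. In a Maven project, add the following to pom.xml:

```xml
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>4.1.2</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>4.1.2</version>
</dependency>
```

1.2 Open the Word document with the `XWPFDocument` class. For example:

```java
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

// XWPFDocument reads the stream in its constructor, so the stream can be closed afterwards
FileInputStream fis = new FileInputStream("path/to/word/document.docx");
XWPFDocument document = new XWPFDocument(fis);
fis.close();
```

1.3 Use the `XWPFParagraph` and `XWPFRun` classes to iterate over the paragraphs and text runs in the document.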
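To make the paragraph/run structure concrete without depending on any library, the sketch below reads it straight from the file. It relies on a fact that is not stated above: a .docx file is a ZIP archive whose `word/document.xml` part stores paragraphs as `w:p` elements and run text as `w:t` elements. The class and method names (`DocxParagraphSketch`, `readParagraphs`, `sample`) are hypothetical, and this is only an illustration of what `XWPFParagraph` and `XWPFRun` wrap, not a replacement for POI. It needs Java 9+ for `readAllBytes()`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

final class DocxParagraphSketch {
    // Main WordprocessingML namespace used by the w:p / w:r / w:t elements
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns one string per paragraph (w:p), joining the text (w:t) of its runs. */
    static List<String> readParagraphs(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (!"word/document.xml".equals(e.getName())) {
                    continue;
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Copy the entry into memory so the XML parser cannot close the ZIP stream
                Document xml = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(zip.readAllBytes()));
                List<String> result = new ArrayList<>();
                NodeList paragraphs = xml.getElementsByTagNameNS(W_NS, "p");
                for (int i = 0; i < paragraphs.getLength(); i++) {
                    NodeList texts = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W_NS, "t");
                    StringBuilder sb = new StringBuilder();
                    for (int j = 0; j < texts.getLength(); j++) {
                        sb.append(texts.item(j).getTextContent());
                    }
                    result.add(sb.toString());
                }
                return result;
            }
            throw new IllegalArgumentException("No word/document.xml entry; not a .docx file?");
        } catch (IOException | ParserConfigurationException | SAXException ex) {
            throw new IllegalStateException(ex);
        }
    }

    /** Builds an in-memory ZIP holding only word/document.xml (demo input, not a complete .docx). */
    static byte[] sample(String documentXml) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(documentXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException ex) {
            throw new IllegalStateException(ex);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
                + "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>Word</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        System.out.println(readParagraphs(new ByteArrayInputStream(sample(xml))));
    }
}
```

For a real file, pass `new FileInputStream("path/to/word/document.docx")` instead of the in-memory sample. The sketch ignores headers, footers, and formatting, and a paragraph nested inside a text box would be reported twice, which is one reason to prefer a library such as POI for production use.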

Extracting Text and Images from Word in Java

This article shows how to extract (read) the text and images of a Word document in Java. The extraction covers the document body as well as headers and footers.

Tool used: Spire.Doc for Java

Importing the jar (for reference):

Method 1: download the jar package, unpack it, and add Spire.Doc.jar from the lib folder to the Java project.

Method 2: import it through Maven (see the vendor's import instructions).

Java code examples (for reference)

[Example 1] Extract the text of a Word document

    import com.spire.doc.*;
    import java.io.FileWriter;
    import java.io.IOException;

    public class ExtractText {
        public static void main(String[] args) throws IOException {
            // Load the test document
            Document doc = new Document();
            doc.loadFromFile("test.docx");
            // Get the document text as a String
            String text = doc.getText();
            // Write the String to a .txt file
            writeStringToTxt(text, "ExtractedText.txt");
        }

        public static void writeStringToTxt(String content, String txtFileName) throws IOException {
            // Append mode, as in the original; try-with-resources flushes and closes the writer
            try (FileWriter fWriter = new FileWriter(txtFileName, true)) {
                fWriter.write(content);
            }
        }
    }

[Example 2] Extract the images of a Word document

    import com.spire.doc.Document;
    import com.spire.doc.documents.DocumentObjectType;
    import com.spire.doc.fields.DocPicture;
    import com.spire.doc.interfaces.ICompositeObject;
    import com.spire.doc.interfaces.IDocumentObject;
    import javax.imageio.ImageIO;
    import java.awt.image.RenderedImage;
    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Queue;

    public class ExtractImg {
        public static void main(String[] args) throws IOException {
            // Load the Word document
            Document document = new Document();
            document.loadFromFile("test.docx");
            // Queue for a breadth-first walk over the document tree
            Queue<ICompositeObject> nodes = new LinkedList<>();
            nodes.add(document);
            // Collected images
            List<RenderedImage> images = new ArrayList<>();
            // Visit every child object in the document
            while (!nodes.isEmpty()) {
                ICompositeObject node = nodes.poll();
                for (int i = 0; i < node.getChildObjects().getCount(); i++) {
                    IDocumentObject child = node.getChildObjects().get(i);
                    if (child instanceof ICompositeObject) {
                        nodes.add((ICompositeObject) child);
                        // Collect the image if this object is a picture
                        if (child.getDocumentObjectType() == DocumentObjectType.Picture) {
                            DocPicture picture = (DocPicture) child;
                            images.add((RenderedImage) picture.getImage());
                        }
                    }
                }
            }
            // Save each image as a PNG file
            for (int i = 0; i < images.size(); i++) {
                File file = new File(String.format("image-%d.png", i));
                ImageIO.write(images.get(i), "PNG", file);
            }
        }
    }

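Reading a Word Template in Java, Replacing Its Content, and Saving a Copy

Tool used: JACOB. After unpacking, the archive contains three main files: jacob.jar, jacob-1.17-M2-x64.dll, and jacob-1.17-M2-x86.dll.

Add jacob.jar to the project and put jacob-1.17-M2-x64.dll in C:\Windows\System32. On a 32-bit system, put jacob-1.17-M2-x86.dll in C:\Windows\System32 instead.

Note: the file must use the .doc extension, never .docx, otherwise it will fail to open.

Code example:

    /*
     * Java2word.java
     *
     * Created on 2007-08-13, 10:32 AM
     *
     * To change this template, choose Tools | Template Manager
     * and open the template in the editor.
     */

    /*
     * The input is a HashMap. Each key is a field to replace in the Word template,
     * and its value is the replacement.
     *
     * Every field to replace in the template (that is, every key in the HashMap) starts
     * and ends with a special character, for example $code$ or $date$, so that no
     * unintended text gets replaced.
     *
     * For a field that is replaced by an image, the key must contain "image", or the value
     * must be the full path of an image file (currently only the extensions .bmp, .jpg,
     * and .gif are recognized).
     *
     * To replace data in a table, the key has the form "table$R@N", where R is the row at
     * which replacement starts and N selects the N-th table in the template. The value is
     * an ArrayList whose elements are all String[]; each String[] is one row of data.
     * The first element is special: it lists the column numbers to replace. For example,
     * to replace the first, third, and fifth columns, the first element is
     * String[3] {"1", "3", "5"}.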
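The `table$R@N` key convention can be illustrated with a small parser. This is a sketch under assumptions: the class `TableKey` and its methods are hypothetical helpers written for this explanation, not part of JACOB or of the Java2word class, and the sketch only decodes the key and the special first record without touching Word.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that decodes keys of the form "table$R@N". */
final class TableKey {
    // Literal "table$", the start row R, "@", then the table number N
    private static final Pattern KEY = Pattern.compile("table\\$(\\d+)@(\\d+)");

    final int startRow;   // R: row at which replacement starts
    final int tableIndex; // N: which table in the template

    private TableKey(int startRow, int tableIndex) {
        this.startRow = startRow;
        this.tableIndex = tableIndex;
    }

    static TableKey parse(String key) {
        Matcher m = KEY.matcher(key);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a table key: " + key);
        }
        return new TableKey(Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)));
    }

    /** Converts the special first record, e.g. {"1", "3", "5"}, into column numbers. */
    static int[] columns(String[] firstRecord) {
        int[] cols = new int[firstRecord.length];
        for (int i = 0; i < firstRecord.length; i++) {
            cols[i] = Integer.parseInt(firstRecord[i].trim());
        }
        return cols;
    }

    public static void main(String[] args) {
        TableKey key = parse("table$2@1");
        System.out.println("start row " + key.startRow + ", table " + key.tableIndex);
    }
}
```

Validating the key up front with a strict pattern means a malformed entry fails fast with a clear message instead of silently replacing the wrong table.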

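Parsing Office (Word, Excel, PowerPoint) and PDF Files in Java: an Implementation Approach

An implementation approach for parsing Office (Word, Excel, PowerPoint) and PDF files in Java, plus a few notes from development.

First, let me share what I went through before writing this. All my feelings boil down to one word: "pit". In two words: "giant pit". The requirement did not start out this way, so let me tell the story.

At first the client agreed with us that Office and PDF files would be uploaded and parsed into HTML, and that the app would call its built-in server to "play" them directly as HTML. One month passed, then two, then three. When the requirement reached development, it turned out to be a pit.

According to the requirements spec, the whole thing was to be built as a single feature. Technical difficulty aside, the estimated hours were nowhere near enough to deliver what the spec described (too many defects!). Then one week went by, and another, and another. After trying all sorts of approaches we got the feature into usable shape, and at sign-off the client said: "We never asked you to parse these documents. We only asked that they be uploaded as source files, and that tapping one in the app lets the user choose a third-party application to open it. That was our requirement from the start."

/** Hearing that, I burst into tears. If the business side had confirmed this at the beginning, we would not have wasted so much time and energy going the long way round. */

The requirement had come full circle. Having lived through it, here is my summary of the endless pits in it:

A> The open-source community has many demos, and they have many defects. For example, WordArt, images, formulas, color styles, video, and audio in Office files cannot be parsed.

B> What can be parsed does not come out well. For example, the layout of Word and PPT files gets scrambled, and custom cell formats in Excel all turn into numbers.

C> Open-source documentation is incomplete, so different document types need different parsers (for example docx4j for Word and POI for Excel), which means a huge amount of code.

D> Because the parsing quality is poor, the revised approach required converting source files into another form before upload: PDFs cut into images, PPTs converted to video or images. That made the feature semi-automatic.

E> A major problem with parsing Word through docx4j is its low efficiency: files over 5 MB, or documents with complex content, take a very long time. In addition, POI easily runs out of memory on large Excel files (for example more than 1000 rows), and this is hard to control.

F> Too little time: only 15 days.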

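Parsing Word Documents in Java

    for (int j = 1; j < imgCount + 1; j++) {
        Dispatch shape = Dispatch.call(imgDispatch, "Item", new Variant(1)).toDispatch();
        Dispatch imageRange = Dispatch.get(shape, "Range").toDispatch();
        Dispatch.call(imageRange, "Copy");
        (imageRange, "Paste");
    }
    }

Background

In the online-education industry, content projects keep running into the same problem: how to parse Word documents. If the system cannot parse Word automatically, the content has to be entered by hand, which is slow and carries a high labor cost and a high error rate.

Difficulties

The foreseeable difficulties in parsing Word fall into the following areas.

Word structure: Word is closed-source and contains a lot of non-text content such as figures and charts, while the commonly known methods only parse plain text. Without knowing Word's internal hierarchy, parsing is hard to carry out.

Word formulas: formulas do not come from a single source. They may be LaTeX formulas generated with the MathType plug-in, formulas created with Word's built-in equation editor, or partly typed by hand and partly entered with the Sogou input method or another editor. Can the different sources be handled the same way? Can superscripts and subscripts be read reliably so they can be displayed later?

Word non-text content: Word contains a lot of non-text content such as figures and charts, again from varied sources. A figure may have been drawn with Word's own drawing tools or pasted in. Can the different sources be parsed the same way? Can the position and size of each image be captured so the content can later be displayed on PC and mobile? Whatever the final approach, all the required non-text information has to be converted into text.

Word versions: Word comes in several versions (2003, 2007, and so on), plus WPS. Must all of them be supported? Must both .docx and .doc be supported? All of this presumes that one format has already been parsed successfully.

Word conventions: some documents were produced long ago and are too costly to redo, so their formatting varies widely. Even with a formatting standard in place, newly written documents cannot be guaranteed to follow it unless they are generated by a program. For example, question numbers in an exam can be formatted in several ways that look identical to the eye. A program can only try to cover the vast majority of cases: the more cases it handles, the higher the parsing accuracy, and the more complex the program.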

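A Comparison of Java Libraries for Parsing Word Documents

Recommendation rating:
Having parsed Excel files before, my first choice was POI. After some investigation, though, I found that parsing Word documents with POI is a pit: it is very hard to use, and some features are not supported at all. After some experiments I had to give it up.

Recommendation rating:
Inspired by cloud services, I realized the problem does not have to be solved in Java. I had previously written a Node.js project that generated Office documents, so one option is to build a RESTful interface in Node.js, keep all the templates in that project, and generate documents by calling the interface. Docxtemplater is a fairly good Node.js middleware for Office documents.

Recommendation rating:
After I found POI hard to use, a colleague recommended a POI-based template library that generates documents from templates. The syntax is simple and the templates are customizable. Because the requirements this time were unusual, it fell short of what the project needed in a few places. If you are building a project from scratch, I recommend this library.

Recommendation rating:
FreeMarker is a template engine, most often used for HTML. Because a Word document can also be stored as XML in a fixed format, FreeMarker can be used to define a template and generate documents from it. The drawback is that every .doc template has to be converted into a valid .ftl template, which is a lot of work.

Recommendation rating:
JACOB is a Java-COM bridge that lets a Java application call COM components and Win32 libraries. Its drawback is obvious: it only works on Windows, so it is unsuitable for projects deployed on Linux. I did not write a test program, so I cannot say what it is like in practice.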

Reading Text, Images, and Tables from Word Text Boxes in Java

A Word document can contain text boxes, and a text box can hold text, images, tables, and other content. The content of an existing text box can also be read back. This article uses Java code to show how to read a text box, including its text, images, and tables.

[Environment] The IDE is IntelliJ IDEA, the project references free Spire.Doc.jar 3.9.0, and the installed JDK version is 1.8.0.

[Code]

1. Read the text in a text box

    import com.spire.doc.*;
    import com.spire.doc.documents.Paragraph;
    import com.spire.doc.fields.TextBox;
    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;

    public class ExtractText {
        public static void main(String[] args) throws IOException {
            // Load the Word document that contains the text box
            Document doc = new Document();
            doc.loadFromFile("sample.docx");
            // Get the first text box
            TextBox textbox = doc.getTextBoxes().get(0);
            // Start from an empty output file
            File file = new File("ExtractedText.txt");
            if (file.exists()) {
                file.delete();
            }
            file.createNewFile();
            try (BufferedWriter bw = new BufferedWriter(new FileWriter(file, true))) {
                // Visit every object in the text box
                for (Object object : textbox.getBody().getChildObjects()) {
                    // Only paragraphs carry text
                    if (object instanceof Paragraph) {
                        String text = ((Paragraph) object).getText();
                        bw.write(text);
                    }
                }
            }
        }
    }

2. Read the images in a text box

    import com.spire.doc.*;
    import com.spire.doc.documents.Paragraph;
    import com.spire.doc.fields.DocPicture;
    import com.spire.doc.fields.TextBox;
    import javax.imageio.ImageIO;
    import java.awt.image.RenderedImage;
    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class ExtractImg {
        public static void main(String[] args) throws IOException {
            // Load the Word document that contains the text box
            Document doc = new Document();
            doc.loadFromFile("sample.docx");
            // Get the first text box
            TextBox textbox = doc.getTextBoxes().get(0);
            // Collected images
            List<RenderedImage> images = new ArrayList<>();
            // Visit every paragraph in the text box
            for (int i = 0; i < textbox.getBody().getParagraphs().getCount(); i++) {
                Paragraph paragraph = textbox.getBody().getParagraphs().get(i);
                // Visit every child object of the paragraph
                for (int j = 0; j < paragraph.getChildObjects().getCount(); j++) {
                    Object object = paragraph.getChildObjects().get(j);
                    // Collect the image if the object is a picture
                    if (object instanceof DocPicture) {
                        DocPicture picture = (DocPicture) object;
                        images.add((RenderedImage) picture.getImage());
                    }
                }
            }
            // Save each image as a PNG file
            for (int z = 0; z < images.size(); z++) {
                File file = new File(String.format("image-%d.png", z));
                ImageIO.write(images.get(z), "PNG", file);
            }
        }
    }

3. Read the table in a text box

    import com.spire.doc.*;
    import com.spire.doc.documents.Paragraph;
    import com.spire.doc.fields.TextBox;
    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;

    public class ExtractTable {
        public static void main(String[] args) throws IOException {
            // Load the Word test document
            Document doc = new Document();
            doc.loadFromFile("sample.docx");
            // Get the first text box
            TextBox textbox = doc.getTextBoxes().get(0);
            // Get the first table in the text box
            Table table = textbox.getBody().getTables().get(0);
            // Start from an empty output file
            File file = new File("ExtractedTable.txt");
            if (file.exists()) {
                file.delete();
            }
            file.createNewFile();
            try (BufferedWriter bw = new BufferedWriter(new FileWriter(file, true))) {
                // Walk rows, cells, and paragraphs, writing tab-separated text
                for (int i = 0; i < table.getRows().getCount(); i++) {
                    TableRow row = table.getRows().get(i);
                    for (int j = 0; j < row.getCells().getCount(); j++) {
                        TableCell cell = row.getCells().get(j);
                        for (int k = 0; k < cell.getParagraphs().getCount(); k++) {
                            Paragraph paragraph = cell.getParagraphs().get(k);
                            bw.write(paragraph.getText() + "\t");
                        }
                    }
                    bw.write("\r\n");
                }
            }
        }
    }

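Online Preview of Word Documents in Java: Reading Office (Word, Excel, PPT) Files

Online preview of Word and other Office files is mostly done in one of two ways. One is to convert the file with OpenOffice and then preview it through another plug-in. The other is to read the content with POI and then render the preview.

I. Word preview through OpenOffice

The main idea:
1. Use the third-party tool OpenOffice to convert Word, Excel, PPT, txt, and similar files to PDF.
2. Use SWFTools to convert the PDF to SWF.
3. Display the result on the page with the FlexPaper document component.

Tool versions I used:
OpenOffice: 3.4.1
SWFTools: 1007
FlexPaper: the version matters little; I downloaded one at random. 1.5.1 is recommended.
JODConverter: requires its jar; with Maven, simply declare the dependency.

Steps:

1. Prepare OpenOffice. Download it: in the archive of past releases, under other languages, find the Chinese 3.4.1 build. After downloading, unpack and install it, then go to the program folder under the installation directory and run:

    soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard

If the command fails (there may be an error message), prefix it with .\ and try again. The OpenOffice service is now running.

2. Copy the js folder from the FlexPaper package to the web root. It contains flexpaper_flash_debug.js, flexpaper_flash.js, and jquery.js, the three scripts that make up the SWF preview plug-in. Also copy FlexPaperViewer.swf to the web root; it is the player that shows SWF files in the page.

Page code, fileUpload.jsp:

    <%@ page language="java" contentType="text/html; charset=UTF-8" pageEncoding="UTF-8"%>
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "/TR/html4/loose.dtd">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>Online Document Preview System</title>
    <style>
    body {margin-top:100px;background:#fff;font-family: Verdana, Tahoma;}
    a {color:#CE4614;}
    #msg-box {color: #CE4614; font-size:0.9em;text-align:center;}
    #msg-box .logo {border-bottom:5px solid #ECE5D9;margin-bottom:20px;padding-bottom:10px;}
    #msg-box .title {font-size:1.4em;font-weight:bold;margin:0 0 30px 0;}
    #msg-box .nav {margin-top:20px;}
    </style>
    </head>
    <body>
    <div id="msg-box">
    <form name="form1" method="post" enctype="multipart/form-data" action="docUploadConvertAction.jsp">
    <div class="title">Please upload the file to process. This may take a few minutes, so please wait.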

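Getting Online Document Data in Java

To get the data of an online document, use Java's networking APIs to connect to the server that hosts the document and download the data from it.

Some approaches that may help:

1. The URL class: `openStream()` opens a data stream for the given URL, which can then be read with the classes of the Java I/O library.

2. The URLConnection class: `getInputStream()` opens an input stream for the given URL, which can then be read with the classes of the Java I/O library.

3. An HttpClient class: an HTTP client sends HTTP requests and receives HTTP responses. With Apache HttpClient, `execute()` sends an HTTP GET request for the document, and the response body is then read with the Java I/O classes. The JDK's built-in `java.net.http.HttpClient` (Java 11 and later) uses `send()` instead.

4. Third-party libraries: many libraries can fetch data from the web, for example Jsoup and Apache HttpClient. They usually offer higher-level APIs and more features, which makes fetching online documents easier and more flexible.

Whichever approach you use, make sure to close every resource you opened once the data has been processed: input/output streams, sockets, and HTTP connections. Java's try-with-resources statement does this for you.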
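As a minimal sketch of option 1 combined with try-with-resources, the class below downloads a document into a byte array. The names `OnlineDocumentFetcher`, `fetch`, and `readFully` are made up for this illustration, the URL in `main` is a placeholder, and `readAllBytes()` assumes Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;

final class OnlineDocumentFetcher {

    /** Reads the stream to the end and always closes it, even when reading fails. */
    static byte[] readFully(InputStream source) {
        try (InputStream in = source) {
            return in.readAllBytes(); // Java 9+
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Opens the URL with URL.openStream() and returns the document bytes. */
    static byte[] fetch(String url) {
        try {
            return readFully(URI.create(url).toURL().openStream());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder address; pass a real document URL as the first argument
        String url = args.length > 0 ? args[0] : "https://example.com/";
        System.out.println(fetch(url).length + " bytes downloaded");
    }
}
```

Returning the raw bytes keeps the sketch independent of the document type; a .docx payload could then be handed to a Word parser through a `ByteArrayInputStream`.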

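Creating and Reading Office Word Documents with POI's XWPFDocument in Java: the Basics

Note: if anything here is wrong, I hope the experts will point it out. Much obliged!

I. Basic classes

It is best to create the documents with Office Word (WPS and Word structure their files somewhat differently).

IBodyElement: a body element (paragraph or table) as seen when iterating the document body
XWPFComment: comment (as I understand it, an annotation)
XWPFSDT
XWPFFooter: footer
XWPFFootnotes: footnotes
XWPFHeader: header
XWPFHyperlink: hyperlink
XWPFNumbering: numbering (the definitions behind numbered and bulleted lists)
XWPFParagraph: paragraph
XWPFPictureData: picture
XWPFStyles: styles (used when setting up multi-level headings)
XWPFTable: table

II. Body paragraphs

A document contains multiple paragraphs, and each paragraph contains a list of runs. A run (XWPFRun) is the smallest unit of the document: a stretch of text whose characters all share the same properties.

Get all paragraphs:
    List<XWPFParagraph> paragraphs = word.getParagraphs();
Get all runs of a paragraph:
    List<XWPFRun> xwpfRuns = xwpfParagraph.getRuns();
Get one run from the list:
    XWPFRun run = xwpfRuns.get(index);

III. Body tables

A document contains multiple tables, a table contains multiple rows, and a row contains multiple cells. The content of each cell behaves like a complete document.

Get all tables:
    List<XWPFTable> xwpfTables = doc.getTables();
Get all rows of a table:
    List<XWPFTableRow> xwpfTableRows = xwpfTable.getRows();
Get all cells of a row:
    List<XWPFTableCell> xwpfTableCells = xwpfTableRow.getTableCells();
Get the content of a cell:
    List<XWPFParagraph> paragraphs = xwpfTableCell.getParagraphs();
From there on, proceed as with body paragraphs.

Notes:
1. A table cell is equivalent to a complete docx document, only without headers and footers.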

Reading Word Documents with Java

Because the Word file format is complex, a Word document cannot be read as plain text through a character stream. If the document can be converted to a TXT file, of course, it can be read directly. The better open-source tools for reading Word today are POI and Jacob. POI's reading features seem slightly weaker than Jacob's, since Jacob can call Word's COM components directly. Microsoft products are closed-source, however, so reading Word with Jacob also means feeling your way forward one step at a time.

Given the complexity of Word content, reading it with Jacob is not very convenient either. Currently it can be done by paragraph, by bookmark, or by table.

Worked examples (reading Word content with a Java FileReader and with Jacob)

I. Reading Word content through a Java stream

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class ReadWordByStream {
        public static void main(String[] args) throws IOException {
            StringBuilder content = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new FileReader("d:\\test3.doc"))) {
                String rowContent;
                while ((rowContent = in.readLine()) != null) {
                    content.append(rowContent).append("\n");
                }
            }
            // .doc is a binary format, so the decoded result is mostly unreadable
            System.out.println(content);
            // Re-decoding as UTF-8 does not help: the encodings differ, so it is hard to parse
            System.out.println(new String(content.toString().getBytes(), "utf-8"));
        }
    }

II. Reading Word content through Jacob

    import com.jacob.activeX.ActiveXComponent;
    import com.jacob.com.ComThread;
    import com.jacob.com.Dispatch;
    import com.jacob.com.Variant;

    public class WordReader {
        public static void main(String[] args) {
            ComThread.InitSTA(); // initialize the COM thread
            ActiveXComponent wordApp = new ActiveXComponent("Word.Application"); // start Word
            Dispatch.put(wordApp, "Visible", new Variant(true)); // make Word visible
            Dispatch docs = wordApp.getProperty("Documents").toDispatch(); // all document windows
            // To open an existing file instead of creating a new one:
            // String inFile = "d:\\test.doc";
            // Dispatch doc = Dispatch.invoke(docs, "Open", Dispatch.Method,
            //         new Object[] { inFile, new Variant(false), new Variant(false) }, // 3rd arg: false = writable, true = read-only
            //         new int[1]).toDispatch();

            Dispatch doc = Dispatch.call(docs, "Add").toDispatch(); // create a new document
            Dispatch wordContent = Dispatch.get(doc, "Content").toDispatch(); // document content
            Dispatch font = Dispatch.get(wordContent, "Font").toDispatch();
            Dispatch.put(font, "Bold", new Variant(true));   // bold
            Dispatch.put(font, "Italic", new Variant(true)); // italic
            Dispatch.put(font, "Underline", new Variant(true));
            Dispatch.put(font, "Name", new Variant("宋体"));
            Dispatch.put(font, "Size", new Variant(14));
            for (int i = 0; i < 10; i++) { // ten pieces of text in one paragraph
                Dispatch.call(wordContent, "InsertAfter", "current paragraph" + i + " ");
            }
            for (int j = 0; j < 10; j++) { // ten separate paragraphs
                Dispatch.call(wordContent, "InsertAfter", "current paragraph" + j + "\r");
            }
            Dispatch paragraphs = Dispatch.get(wordContent, "Paragraphs").toDispatch(); // all paragraphs
            int paragraphCount = Dispatch.get(paragraphs, "Count").getInt();
            System.out.println("paragraphCount:" + paragraphCount);

            for (int i = 1; i <= paragraphCount; i++) {
                Dispatch paragraph = Dispatch.call(paragraphs, "Item", new Variant(i)).toDispatch();
                Dispatch paragraphRange = Dispatch.get(paragraph, "Range").toDispatch();
                String paragraphContent = Dispatch.get(paragraphRange, "Text").toString();
                System.out.println(paragraphContent);
            }
            Dispatch.call(doc, "SaveAs", "d:\\wordreader.doc");
            // Close without saving changes:
            //  0 = wdDoNotSaveChanges, -1 = wdSaveChanges, -2 = wdPromptToSaveChanges
            Dispatch.call(docs, "Close", new Variant(0));
            Dispatch.call(wordApp, "Quit");
            ComThread.Release(); // release the COM thread last, after Word has quit
        }
    }

A simple way to read the data of a Word document in Java:

Step 1: download tm-extractors-0.4.jar from /browser/elated-core/trunk/lib/tm-extractors-0.4.jar?rev =46 and put it on your classpath.

Reading Word Files in Java

Keywords: Java, Word. Author: bluerain, QQ: 890626471.

There are two ways to read a Word file. The jacob package can read Word files and can also modify and generate their content. If you only need the text inside the document, you can read the file with POI: first download the tm-extractors-0.4.jar package from /maven2/org/textmining/tm-extractors/.

Sample code for reading the text of a Word file:

    import java.io.FileInputStream;
    import org.textmining.text.extraction.WordExtractor;

    public class TestPoi {
        public static void main(String[] args) {
            // try-with-resources closes the file even if extraction fails
            try (FileInputStream in = new FileInputStream("D:/szqxjzhbase/doc/修改后/2001-2005年/重大致灾暴雨/20050819-20/技术总结/2005年8月20日一次大暴雨过程低空急流脉动与强降水关系分析 .doc")) {
                // FileInputStream in = new FileInputStream("D:/szqxjzhbase/技术测试/新建Microsoft Word 文档.doc");
                WordExtractor extractor = new WordExtractor();
                System.out.println(in.available());
                String str = extractor.extractText(in);
                // System.out.println("the result length is" + str.length());
                System.out.println(str);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Reading a Word Document in Java: an Example of Extracting Headings and Content

The tool used is POI. The required dependencies are:

    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>3.17</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>3.17</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-scratchpad</artifactId>
        <version>3.17</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>ooxml-schemas</artifactId>
        <version>1.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml-schemas</artifactId>
        <version>3.17</version>
    </dependency>

The approach I use to separate headings from content is to judge by font size.
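The author's code for this step is not shown, so the following is only a rough sketch of the idea of separating headings from content by font size. It assumes the paragraphs have already been read from the document and reduced to text plus font size; the `Para` class, the `split` method, and the threshold value are all hypothetical choices made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FontSizeSplitter {

    /** Hypothetical stand-in for a paragraph that was already read from the document. */
    static final class Para {
        final String text;
        final int fontSize;

        Para(String text, int fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /**
     * Treats every paragraph whose font size is at least titleSize as a heading and
     * collects the smaller-font paragraphs that follow it as that heading's content.
     */
    static Map<String, String> split(List<Para> paragraphs, int titleSize) {
        Map<String, String> sections = new LinkedHashMap<>(); // keeps document order
        String currentTitle = null;
        StringBuilder body = new StringBuilder();
        for (Para p : paragraphs) {
            if (p.fontSize >= titleSize) {
                if (currentTitle != null) {
                    sections.put(currentTitle, body.toString());
                }
                currentTitle = p.text;
                body.setLength(0);
            } else if (currentTitle != null) {
                body.append(p.text); // text before the first heading is ignored
            }
        }
        if (currentTitle != null) {
            sections.put(currentTitle, body.toString());
        }
        return sections;
    }

    public static void main(String[] args) {
        List<Para> paragraphs = List.of(
                new Para("Chapter 1", 26), new Para("First body text. ", 18),
                new Para("More body text.", 18), new Para("Chapter 2", 26),
                new Para("Other text.", 18));
        split(paragraphs, 26).forEach((title, text) -> System.out.println(title + " -> " + text));
    }
}
```

Two headings with identical text would overwrite each other in the map, and real documents mix font sizes within a paragraph, so a production version would need a stricter rule than a single threshold.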

Reading Headings and Content from Word (.doc and .docx) in Java with POI

1. Download the POI jars, or declare them in Maven:

    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>4.1.2</version>
    </dependency>
    <dependency>
        <groupId>cn.hutool</groupId>
        <artifactId>hutool-all</artifactId>
        <version>5.5.7</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>4.1.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml-schemas</artifactId>
        <version>4.1.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>ooxml-schemas</artifactId>
        <version>1.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-scratchpad</artifactId>
        <version>4.1.2</version>
    </dependency>

I. Read the full content of a Word file (works for both .doc and .docx)

    package com.wordcom;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    import org.apache.poi.openxml4j.opc.OPCPackage;
    import org.apache.poi.xwpf.extractor.XWPFWordExtractor;

    /**
     * @Author: hp
     * @Date: 2021-11-04 14:58:11
     * @Description: reads the full content of a Word file
     */
    public class DocUtil {
        public static void main(String[] args) {
            String filePath = "C:\\Users\\hp\\Desktop\\新建文件夹 (2)\\忻州地调中心站11楼机房更换通信电源三措一案.docx";
            String content = readWord(filePath);
            System.out.println(content);
        }

        public static String readWord(String path) {
            String buffer = "";
            try {
                if (path.endsWith(".doc")) {
                    // Binary .doc format: HWPF extractor
                    InputStream is = new FileInputStream(new File(path));
                    WordExtractor ex = new WordExtractor(is);
                    buffer = ex.getText();
                    ex.close();
                } else if (path.endsWith("docx")) {
                    // OOXML .docx format: XWPF extractor on the opened package
                    OPCPackage opcPackage = OPCPackage.open(path);
                    XWPFWordExtractor extractor = new XWPFWordExtractor(opcPackage);
                    buffer = extractor.getText();
                    extractor.close();
                } else {
                    System.out.println("This file is not a Word file!");
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            return buffer;
        }
    }

II. Get the headings of a Word file (.doc format). This only works if the headings in the document were given a heading style beforehand.

    package com.wordcom;

    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.model.StyleDescription;
    import org.apache.poi.hwpf.model.StyleSheet;
    import org.apache.poi.hwpf.usermodel.Paragraph;
    import org.apache.poi.hwpf.usermodel.Range;
    import java.io.*;

    /**
     * @author hp
     * Gets the headings of a .doc document
     */
    public class WordTitle {
        public static void main(String[] args) throws Exception {
            String filePath = "C:\\Users\\hp\\Desktop\\新建文件夹 (2)\\正文查找.doc";
            printWord(filePath);
        }

        public static void printWord(String filePath) throws IOException {
            InputStream is = new FileInputStream(filePath);
            HWPFDocument doc = new HWPFDocument(is);
            Range r = doc.getRange(); // whole document range
            for (int i = 0; i < r.numParagraphs(); i++) {
                Paragraph p = r.getParagraph(i); // current paragraph
                int numStyles = doc.getStyleSheet().numStyles();
                int styleIndex = p.getStyleIndex();
                if (numStyles > styleIndex) {
                    StyleSheet styleSheet = doc.getStyleSheet();
                    StyleDescription style = styleSheet.getStyleDescription(styleIndex);
                    String styleName = style.getName(); // style name of this paragraph
                    String text = p.text();             // paragraph text
                    // To match any style containing a keyword instead:
                    // if (styleName != null && styleName.contains("标题")) {
                    // "标题" is the heading style name in a Chinese-language Word installation
                    if ("标题".equals(styleName)) {
                        System.out.println(text);
                    }
                }
            }
            doc.close();
        }
    }

III. Read Word (.doc and .docx) paragraph by paragraph, extracting whatever content you need

.doc:

    package com.wordcom;

    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.model.StyleDescription;
    import org.apache.poi.hwpf.model.StyleSheet;
    import org.apache.poi.hwpf.usermodel.Paragraph;
    import org.apache.poi.hwpf.usermodel.Range;
    import java.io.*;

    /**
     * @author hp
     * Gets the headings of a .doc document
     */
    public class WordTitledoc {
        public static void main(String[] args) throws Exception {
            String filePath = "C:\\Users\\hp\\Desktop\\新建文件夹 (2)\\一案 .doc";
            printWord(filePath);
        }

        public static void printWord(String filePath) throws IOException {
            InputStream is = new FileInputStream(filePath);
            HWPFDocument doc = new HWPFDocument(is);
            Range r = doc.getRange(); // whole document range
            for (int i = 0; i < r.numParagraphs(); i++) {
                Paragraph p = r.getParagraph(i); // current paragraph
                int numStyles = doc.getStyleSheet().numStyles();
                int styleIndex = p.getStyleIndex();
                if (numStyles > styleIndex) {
                    StyleSheet styleSheet = doc.getStyleSheet();
                    StyleDescription style = styleSheet.getStyleDescription(styleIndex);
                    String styleName = style.getName(); // style name of this paragraph
                    String text = p.text();             // paragraph text
                    // Heading-like lines contain "." or "、" but no sentence punctuation
                    if (text.contains(".") || text.contains("、")) {
                        if (!text.contains("，") && !text.contains("；") && !text.contains("。

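Solutions for Reading Word Documents in Java

Hey, folks. Today let's talk about reading Word documents in Java, so that your program can "read" a Word file the way a person does.

This is a common requirement. Whether you do data analysis or document processing, it is a skill you cannot do without.

Drawing on my ten years of experience writing solutions, let me walk you through the topic.

First, to be clear: there are two main ways to read a Word document in Java. One is the Apache POI library, the other is the JODConverter library.

Each has its strengths, and I will introduce them one by one.

I. Apache POI

Apache POI is the classic choice for reading Word documents in Java.

It supports both reading and writing Word documents, and it is powerful and stable.

It can be a little hard to use, though, because its API is relatively complex.

1. Add the dependency

Add the Apache POI dependency to the project's pom.xml:

    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>5.1.0</version>
    </dependency>

2. Read the Word document

Next comes the core code for reading a Word document. I use a .docx file as the example:

    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;
    import org.apache.poi.xwpf.usermodel.XWPFRun;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.List;

    public class WordReader {
        public static void main(String[] args) {
            try (FileInputStream fis = new FileInputStream("path/to/your/document.docx");
                 XWPFDocument doc = new XWPFDocument(fis)) {
                List<XWPFParagraph> paragraphs = doc.getParagraphs();
                for (XWPFParagraph paragraph : paragraphs) {
                    List<XWPFRun> runs = paragraph.getRuns();
                    StringBuilder text = new StringBuilder();
                    for (XWPFRun run : runs) {
                        String part = run.getText(0); // null when the run holds no text
                        if (part != null) {
                            text.append(part);
                        }
                    }
                    System.out.println(text);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

Here we read the Word document through a `FileInputStream` and then create an `XWPFDocument` object to parse it.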

Reading Text and Paragraph Format Properties of a Word Document in Java

This article shows how to read the text and paragraph formatting of a Word document from Java back-end code.

Test environment:
Word version: 2013
IDE: IntelliJ IDEA 2018
Word library: free spire.doc.jar
JDK version: 1.8.0

Character formatting is read through textrange.getCharacterFormat(), and paragraph formatting through paragraph.getFormat(). The readable properties include font, font size, text color, text background, bold and italic, underline, capitalization, borders, superscript and subscript, line spacing, paragraph indentation, alignment, paragraph borders, background, and so on. The tables below list all the style properties that can be read, for reference.

Reading character format, getCharacterFormat():

    Method                             Type
    getFontName()                      String
    getFontNameAscii()                 String
    getFontNameBidi()                  String
    getFontNameFarEast()               String
    getFontNameNonFarEast()            String
    getBold()                          boolean
    getFontSize()                      float
    getHighlightColor()                Color
    getItalic()                        boolean
    getTextBackgroundColor()           Color
    getTextColor()                     Color
    getAllCaps()                       boolean
    getAllowContextualAlternates()     boolean
    getBidi()                          boolean
    getBoldBidi()                      boolean
    getBorder()                        Border
    getCharacterSpacing()              float
    getDoubleStrike()                  boolean
    getEmboss()                        boolean
    getEmphasisMark()                  Emphasis
    getEngrave()                       boolean
    getFontSizeBidi()                  float
    getFontTypeHint()                  FontTypeHint
    getHidden()                        boolean
    getItalicBidi()                    boolean
    getLigaturesType()                 LigatureType
    getLocaleIdASCII()                 short
    getLocaleIdFarEast()               short
    getNumberFormType()                NumberFormType
    getNumberSpaceType()               NumberSpaceType
    getPosition()                      float
    getStylisticSetType()              StylisticSetType
    getSubSuperScript()                SubSuperScript
    getTextScale()                     short
    getUnderlineStyle()                UnderlineStyle

Reading paragraph format, getFormat():

    Method                             Type
    getLineSpacing()                   float
    getFirstLineIndent()               float
    getLeftIndent()                    float
    getAfterSpacing()                  float
    getBeforeSpacing()                 float
    getRightIndent()                   float
    getTextAlignment()                 TextAlignment
    getAfterAutoSpacing()              boolean
    getAutoSpaceDE()                   boolean
    getAutoSpaceDN()                   boolean
    getBackColor()                     Color
    getBeforeAutoSpacing()             boolean
    getBorders()                       Borders
    getHorizontalAlignment()           HorizontalAlignment
    getKeepFollow()                    boolean
    getKeepLines()                     boolean
    getLineSpacingRule()               LineSpacingRule
    getMirrorIndents()                 boolean
    getOutlineLevel()                  OutlineLevel
    getOverflowPunc()                  boolean
    getPageBreakAfter()
    getPageBreakBefore()
    getSuppressAutoHyphens()
    getTabs()

Java sample code:

    import com.spire.doc.*;
    import com.spire.doc.documents.Paragraph;
    import com.spire.doc.documents.TextSelection;
    import com.spire.doc.fields.TextRange;
    import java.awt.*;

    public class GetTextFormat {
        public static void main(String[] args) {
            // Load the source Word document
            Document doc = new Document();
            doc.loadFromFile("test.docx");
            // Count the paragraphs in the first section
            int count = doc.getSections().get(0).getParagraphs().getCount();
            System.out.println("Total paragraphs: " + count);
            // Find the given text
            TextSelection textSelections = doc.findString("东野圭吾", false, true);
            // Font name and size of the match
            String fontname = textSelections.getAsOneRange().getCharacterFormat().getFontName();
            float fontsize = textSelections.getAsOneRange().getCharacterFormat().getFontSize();
            System.out.println("Font name: " + fontname + "\n" + "Font size: " + fontsize);
            // Get the second paragraph and its line spacing
            Paragraph paragraph2 = doc.getSections().get(0).getParagraphs().get(1);
            float lineSpacing = paragraph2.getFormat().getLineSpacing();
            System.out.println("Line spacing: " + lineSpacing);
            // Visit the child objects of the paragraph
            for (int z = 0; z < paragraph2.getChildObjects().getCount(); z++) {
                Object obj2 = paragraph2.getChildObjects().get(z);
                // Only text ranges carry character formatting
                if (obj2 instanceof TextRange) {
                    TextRange textRange2 = (TextRange) obj2;
                    String text = textRange2.getText();
                    // Text color
                    Color textcolor = textRange2.getCharacterFormat().getTextColor();
                    if (textcolor.getRGB() != 0) {
                        System.out.println("Text color: " + text + textcolor.toString());
                    }
                    // Bold
                    if (textRange2.getCharacterFormat().getBold()) {
                        System.out.println("Bold text: " + text);
                    }
                    // Italic
                    if (textRange2.getCharacterFormat().getItalic()) {
                        System.out.println("Italic text: " + text);
                    }
                    // Highlight color
                    Color highlightcolor = textRange2.getCharacterFormat().getHighlightColor();
                    if (highlightcolor.getRGB() != 0) {
                        System.out.println("Highlighted text: " + text + highlightcolor.toString());
                    }
                    // Text background (shading)
                    Color textbackgroundcolor = textRange2.getCharacterFormat().getTextBackgroundColor();
                    if (textbackgroundcolor.getRGB() != 0) {
                        System.out.println("Text background: " + text + textbackgroundcolor.toString());
                    }
                }
            }
        }
    }

Run the program to print the results.

Java Example: Reading a Word Document with POI

    package read.document;

    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStream;
    import java.sql.Connection;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.usermodel.CharacterRun;
    import org.apache.poi.hwpf.usermodel.Range;
    import pers.mysql.DBUtil;
    import pers.mysql.MysqlDao;
    import pers.mysql.MysqlDaoImp;

    public class WordReading {
        public static void main(String[] args) {
            String filePath = "*****.doc";
            readOnWord(filePath);
        }

        public static void readOnWord(String filePath) {
            if (filePath.endsWith(".doc")) {
                // Input stream (base class)
                InputStream is = null;
                try {
                    is = new FileInputStream(filePath);
                } catch (FileNotFoundException e) {
                    e.printStackTrace();
                    System.out.println("Failed to open the file.");
                }
                // Load the .doc document
                try {
                    HWPFDocument doc = new HWPFDocument(is);
                    Range text = doc.getRange(); // whole document
                    /*
                     * Decomposing Word: text -> section -> paragraph -> CharacterRun (think of it as a small unit)
                     */
                    // 1. Pick the content section: 0 = cover, 1 = table of contents, 2 = body (the third section)
                    Range hotWord = text.getSection(2);
                    // 2. Process paragraphs
                    /*
                     * Two variables are maintained.
                     * A hot word and its explanation differ by font size: word 26, explanation 18.
                     */
                    String word = "";
                    String explaining = "";
                    int wordOK = 0;
                    int explainOK = 0; // whether the current word and explanation are ready for the database
                    int count = 24;    // number of records to load into the database
                    int begin = 2;     // paragraph position to read from
                    for (int i = 0; i < count;) {
                        Range para = hotWord.getParagraph(begin);
                        CharacterRun field = para.getCharacterRun(0);
                        int fontSize = field.getFontSize();
                        if (fontSize == 26) {
                            word = para.text();
                            wordOK = 1;
                            begin++;
                        } else {
                            while (fontSize < 26) {
                                explaining += para.text();
                                begin++;
                                para = hotWord.getParagraph(begin);
                                field = para.getCharacterRun(0);
                                fontSize = field.getFontSize();
                            }
                            explainOK = 1;
                        }
                        // Store the pair once both the word and its explanation are ready
                        if (wordOK == 1 && explainOK == 1) {
                            MysqlDaoImp.addData(word, explaining);
                            i++;
                            // After storing, reset everything
                            wordOK = 0;
                            explainOK = 0;
                            word = "";
                            explaining = "";
                        }
                    }
                    // Output test
                    // System.out.println("read: " + "head:");
                } catch (IOException e) {
                    e.printStackTrace();
                    System.out.println("IO错误。


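Solutions for Reading Word Documents in Java

For reading Word documents in Java, the web describes many tools: poi, java2Word, jacob, itext, and so on. POI cannot read formatting (the new API seems to be still under development and not very stable, so I would not dare use it in a project). java2Word and jacob tend to fail with "registration not found" errors, which is rather strange: I tried them on different machines with exactly the same procedure, and some machines reported the error while others did not. Even the experts on their forums could not explain why, so deploying a project on them feels risky. itext seems convenient for writing, but after a long search I found no good way to read with it.

After weighing the options, the compromise of using RTF turned out best. RTF is an openly documented format and needs no plug-in: basic I/O plus an encoding conversion is enough.

On the surface an RTF file looks no different from a .doc file. Both open in Word, and all kinds of formatting can be applied.

What it does: read the content of an RTF template (formatting and text), replace the variable parts, and produce a new RTF document.

How it works: the fixed parts of the template are typed in by hand and the variable parts are marked with $info$, so only $info$ has to be replaced.

1. Read the RTF template content as bytes.
2. Convert the variable content string to RTF encoding.
3. Replace the variable parts of the original text to produce the new RTF document.

The main program is as follows:

The above is the core code. What remains is the replacement and reassembly, which Java's String.replace(oldstr, newstr) method can do, so I will not post it here.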
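Since the listing itself is not reproduced here, the sketch below illustrates steps 2 and 3 under stated assumptions. It is not the author's program: the class `RtfTemplateSketch` and its methods are hypothetical, and for the encoding step it uses RTF's `\uN?` Unicode escapes rather than whatever code-page conversion the original program performed.

```java
final class RtfTemplateSketch {

    /** Converts plain text to RTF: escapes control characters and non-ASCII characters. */
    static String toRtf(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                sb.append('\\').append(c);   // RTF syntax characters must be escaped
            } else if (c == '\n') {
                sb.append("\\par ");         // a line break becomes a paragraph mark
            } else if (c < 128) {
                sb.append(c);                // plain ASCII passes through unchanged
            } else {
                // \uN takes a signed 16-bit decimal value; '?' is the fallback character
                sb.append("\\u").append((int) (short) c).append('?');
            }
        }
        return sb.toString();
    }

    /** Replaces every occurrence of the placeholder with the RTF-encoded value. */
    static String fill(String template, String placeholder, String value) {
        return template.replace(placeholder, toRtf(value));
    }

    public static void main(String[] args) {
        String template = "{\\rtf1\\ansi Name: $info$}";
        // "\u4e2d\u6587" is the two-character Chinese word for "Chinese"
        System.out.println(fill(template, "$info$", "\u4e2d\u6587 {test}"));
    }
}
```

For step 1, an RTF file normally contains only 7-bit ASCII, so the template bytes can be decoded with `new String(bytes, StandardCharsets.ISO_8859_1)` before calling `fill` and encoded the same way when the result is written back.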

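See the attachment for the source code.

Before running the source code:
Create a YQ directory on drive C and copy "模板.rtf" from the attachment into it. Then run OpreatorRTF.java, and a file with a name like 21时15分19秒_cheney_记录.rtf will be generated in the YQ directory.

The file name is set in the program.

Because this demo was carved out of a commercial product, I simply separated my original code and merged it into a single Java file, so some methods look redundant in the sample and more complicated than necessary.

For the special case where the replaced part has to loop, I cannot easily split the code out without exposing parts of the commercial product, so I will not post it. If you need it, add me on QQ or MSN and we can discuss it.

I tried for a long time to upload the attachment without success, so this will have to do.

The template file cannot be attached either; contact me directly if you need it.

In fact, reading the Java program below is enough to understand it.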