Apache POI 用eventmodel 读取大文件Excel 【项目已开源】

Apache POI 用eventmodel 读取大文件Excel 【项目已开源】

文章目录

代码介绍

此代码来源于我开发的一个小工具jar包,项目名称为ExcelToDB,顾名思义,从各种文件系统读取Excel提取数据插入到各种数据库。

特点:轻量级jar包(小jar包灵活性强,而不是kettle那么笨重),配置好后一键运行,可搭载到Job调度中一键调用,甚至可以把它集成到你们公司的管理系统后端🤭。

内存消耗小,速度快,灵活性强,编写代码自动化配置既可实现一键导入,否则需要简单的手工配置。

读取几十万行的.xlsx文件,并将数据转换为标准SQL并执行的一个小jar包,打包后17MB左右,优化后可以更小,可配置日志记录内容,可写到多种目标数据库,从多种文件系统读取Excel,只需要简单的XML配置即可。

转换并执行的速度:
实际测试中,8172行,8MB,每行60个字段的Excel里的真实业务数据,8700ms即可导入到数据库,初版未优化,预计优化后速度可以更快。

认准v1.0.4版本哦!项目已开源,拿走不谢哈哈哈哈哈哈

www.zeeklog.com  - Apache POI 用eventmodel 读取大文件Excel 【项目已开源】

下面就是研发过程需要的知识了,进行了详细整理(●ˇ∀ˇ●),不想浪费时间再造一遍轮子的,可以去我项目里把写好的代码拿来改吧改吧,就是你的了嘿嘿嘿嘿。

1,初识XLSX格式

前言

最近被要求用Java读取大文件Xlsx(超过10MB的Excel文档),并导入数据库,百度搜到了POI。

一开始对POI了解不够深入,于是轻松地用POI的usermodel开发出了第一版,然鹅,usermodel对于内存的要求太高,只要文件超过5MB就开始GC overhead limit exceeded了。

(而且读取6MB的xlsx文件居然能占用3个G的内存??)

于是我决定使用eventmodel模式,然鹅,由于excel里带有公式的缘故,需要开发的代码量非常大,而且在百度很难找到使用event model的POI的案例,百度,谷歌找了一整天都没发现合适的案例,最终决定自己从底层开始,一点一点啃掉这个知识点。顺便整理个文档,输入输出,福曼学习法

也不是没考虑过市面上其它的工具,比如:

  • easyexcel
  • hutool操作poi
  • excel-streaming-reader
  • gridexcel
  • jxl

然鹅这些工具都不能同时满足以下几个特性:

  • 低内存消耗(至少不能导致GC overhead limit exceeded,或者OOM
  • 自动处理复杂公式(最难受的是把2020/12/20,ABC/BCS-DS/BUC这样的日期字符串读取为公式,总之现有的类库对公式的支持都不够多)
  • 读取超大文件(增量数据20MB以上,全量数据将近500MB)

也许easyexcel可以做到,然鹅黑盒子不敢用呀,而且easyexcel抽象级别过高,不够灵活,没有文档,学习成本过高的同时无法满足项目需求。

Excel 之 Xlsx格式详解

以下内容部分来自于以下链接,并对其内容做了部分修改。
https://www.loc.gov/preservation/digital/formats/fdd/fdd000398.shtml

了解一下 XLSX, (Office Open XML, Spreadsheet ML)

Office Open XML简称 OOXML,是xlsx格式的标准,它在ECMA-376中制定,
下面这个链接是ECMA-376的文档下载地址。

还有关于OOXML,wiki上有详细的介绍

The Open Office XML-based spreadsheet format using .xlsx as a file extension has been the default format produced for new documents by versions of Microsoft Excel since Excel 2007.

The format was designed to be equivalent to the binary .xls format produced by earlier versions of Microsoft Excel (see ).

For convenience, this format description uses XLSX to identify the corresponding format.

The primary content of a XLSX file is marked up in SpreadsheetML, which is specified in parts 1 and 4 of ISO/IEC 29500, Information technology – Document description and processing languages – Office Open XML File Formats (OOXML).

This description focuses on the specification in ISO/IEC 29500:2012 and represents the format variant known as “Transitional.”

Although editions of ISO 29500 were published in 2008, 2011, 2012, and 2016, the specification has had very few changes other than clarifications and corrections to match actual usage in documents since SpreadsheetML was first standardized in ECMA-376, Part 1 in 2006.

This description can be read as applying to all SpreadsheetML versions published by ECMA International and by ISO/IEC through 2016.

See  below for more detail on the chronological versions and differences.

The XLSX format uses the SpreadsheetML markup language and schema to represent a spreadsheet “document.”

Conceptually, using the terminology of the Spreadsheet ML specification in ISO/IEC 29500-1, the document comprises one or more worksheets in a workbook.

A worksheet typically consists of a rectangular grid of cells.

Each cell can contain a value or a formula, which will be used to calculate a value, with a cached value usually stored pending the next recalculation.

A single spreadsheet document may serve several purposes: as a container for data values;

as program code (based on the formulas in cells) to perform analyses on those values; and as one or more formatted reports (including charts) of the analyses.

Beyond basics, spreadsheet applications have introduced support for more advanced features over time.

These include mechanisms to extract data dynamically from external sources, to support collaborative work, and to perform an increasing number of functions that would have required a database application in the past, such as sorting and filtering of entries in a table to display a temporary subset.

The markup specification must support both basic and more advanced functionalities in a structure that supports the robust performance expected by users.

看看它的结构:

An XLSX file is packaged using the Open Packaging Conventions (OPC/OOXML_2012, itself based on ZIP_6_2_0).

The package can be explored, by opening with ZIP software, typically by changing the file extension to .zip.

The top level of a minimal package will typically have three folders (_rels, docProps, and xl) and one file part ([Content_Types].xml).

www.zeeklog.com  - Apache POI 用eventmodel 读取大文件Excel 【项目已开源】

The xl folder holds the primary content of the document including the file part workbook.xml and a worksheets folder containing a file for each worksheet, as well as other files and folders that support functionality (such as controlling calculation order) and presentation (such as formatting styles for cells) for the spreadsheet.

www.zeeklog.com  - Apache POI 用eventmodel 读取大文件Excel 【项目已开源】

Any embedded graphics are also stored in the xl folder as additional parts.

The other folders and parts at the top level of the package support efficient navigation and manipulation of the package:

_rels

_rels is a Relationships folder, containing a single file .rels (which may be hidden from file listings, depending on operating system and settings).

It lists and links to the key parts in the package, using URIs to identify the type of relationship of each key part to the package.

In particular it specifies a relationship to the primary officeDocument (typically named /xl/workbook.xml ) and typically to parts within docProps as core and extended properties.

docProps

docProps is a folder that contains properties for the document as a whole, typically including a set of core properties, a set of extended or application-specific properties, and a thumbnail preview for the document.

[Content_Types].xml

[Content_Types].xml is a file part, a mandatory part in any OPC package, that lists the content types (using MIME Internet Media Types as defined in RFC 6838) for parts within the package.

The standards documents that specify this format run to over six thousand pages.

Useful introductions to the XLSX format can be found at:

by Daniel Dick of Reuters.

from Open XML SDK documentation. Includes

下一篇:

2,Xlsx格式详解

此节部分内容来自于:
http://officeopenxml.com/anatomyofOOXML-xlsx.php
http://officeopenxml.com/SScontentOverview.php
http://officeopenxml.com/SSstyles.php

Package Structure

A SpreadsheetML or .xlsx file is a zip file (a package) containing a number of “parts” (typically UTF-8 or UTF-16 encoded) or XML files.

The package may also contain other media files such as images.

The structure is organized according to the Open Packaging Conventions as outlined in Part 2 of the OOXML standard ECMA-376.

You can look at the file structure and the files that comprise a SpreadsheetML file by simply unzipping the .xlsx file.

www.zeeklog.com  - Apache POI 用eventmodel 读取大文件Excel 【项目已开源】


The number and types of parts will vary based on what is in the spreadsheet, but there will always be a [Content_Types].xml, one or more relationship parts, a workbook part , and at least one worksheet.

The core data of the spreadsheet is contained within the worksheet part(s), discussed in more detail at  Overview.

Content Types

Every package must have a [Content_Types].xml, found at the root of the package.

This file contains a list of all of the content types of the parts in the package.

Every part and its type must be listed in [Content_Types].xml.

The following is a content type for the main content part:

<Override PartName="/xl/workbook.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>

It’s important to keep this in mind when adding new parts to the package.

Relationships

Every package contains a relationships part that defines the relationships between the other parts and to resources outside of the package.

This separates the relationships from content and makes it easy to change relationships without changing the sources that reference targets.

www.zeeklog.com  - Apache POI 用eventmodel 读取大文件Excel 【项目已开源】


For an OOXML package, there is always a relationships part (.rels) within the _rels folder that identifies the starting parts of the package, or the package relationships.

For example, the following defines the identity of the start part for the content:

<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="xl/workbook.xml"/>.

There are also typically relationships within .rels for app.xml and core.xml.

In addition to the relationships part for the package, each part that is the source of one or more relationships will have its own relationships part.

Each such relationship part is found within a _rels sub-folder of the part and is named by appending '.rels' to the name of the part.

Typically the main content part (workbook.xml) has its own relationships part (workbook.xml.rels).

It will contain relationships to the other parts of the content, such as sheet1.xml, sharedStrings.xml, styles.xml, theme1.xml, as well as the URIs for external links.

www.zeeklog.com  - Apache POI 用eventmodel 读取大文件Excel 【项目已开源】


A relationship can be either explicit or implicit.

For an explicit relationship, a resource is referenced using the Id attribute of a <Relationship> element.

That is, the Id in the source maps directly to an Id of a relationship item, with an explicit reference to the target.

For example, a worksheet might contain a hyperlink such as this:

<w:hyperlink ref="A11" r:id="rId4">

The r:id=“rId4” references the following relationship within the relationships part for the worksheet (worksheet1.xml.rels).

<Relationship Id="rId4" Type="http://. . ./hyperlink" Target="http://www.google.com/" TargetMode="External"/>

For an implicit relationship, there is no such direct reference to a <Relationship> Id.

Instead, the reference is understood.

Parts Specific to SpreadsheetML Documents

Below is a list of the possible parts of a SpreadsheetML package that are specific to SpreadsheetML spreadsheets.

Keep in mind that a spreadsheet may only have a few of these parts.

For example, if a spreadsheet has no pivot table, then a pivot table part will not be included in the package.

PartDescription
Calculation ChainWhen the values of cells are calculated from formulas, the order of calculation can be affected by the order in which the values are calculated. This part contains specifies the ordering. A package can contain only one such part.
ChartsheetContains a chart that is stored in its owne sheet. A package can contain multiple such parts, referenced from the workbook part.
CommentsContains the comments for a given worksheet. Since there may be more than one worksheet, there may be more than one comments part.
(可能是每个sheet的名称)
ConnectionsA spreadsheet may have connections to external data sources. This part contains such connections, explaining both how to get such externnal data and how the connection is used within the workbook. There may be only one such part.
Custom PropertyContains user-defined data. There may be zero or more such parts.
Customer XML MappingsContains a schema for an XML file, and information on the behavior to be used when allowing the schema to be mapped into the spreadsheet. There may be only one such part.
DialogsheetContains information about a legacy customer dialog box for a user form. There may be zero or more such parts.
DrawingsContains the presentation and layout information for one or more drawing elements that are present in the worksheet. There should be drawings part for each worksheet that has a drawing.
External Workbook