Requirements for Source Documents
Document Requirements and Limitations
toolsxbrl is able to tag MS Word and PDF documents properly with the following requirements and limitations:
- It is not possible to tag any value of a table that is included as an image in a document.
- Scanned (PDF) reports can't be tagged because toolsxbrl does not include an OCR module.
- For PDFs, hidden text as well as some font-specific settings might lead to issues (see more information below).
Differences Between Source Formats
Feature | Word | PDF (pdf2htmlEx) |
A4 Layout | Optional | Enforced |
WYSIWYG | N/A | Full |
Tags-Saving | In file/external | External |
Chapter detection | Styles Outline Level | Document Bookmarks/Pages |
Font handling | Integrated | Integrated |
Table detection | Auto | Manual |
Smart anchors | Yes | Yes |
XHTML formatting preserved | Partly* | Full |
Multiple tags per value | Yes | No |
*This depends on the styles and formats applied to paragraphs; see the MS Word Requirements and Limitations section.
How to Prepare a PDF File
PDF Requirements and Limitations
toolsxbrl is able to tag any PDF documents properly with the following requirements and limitations:
- It is not possible to tag any value of a table that is included as an image in a document.
- Make sure that the fonts used contain the correct glyphs (this also applies to Word fonts when converting to PDF); otherwise the conversion could produce wrong characters.
- Scanned (PDF) reports can't be tagged because toolsxbrl does not include an OCR module.
- For PDFs, hidden text as well as some font-specific settings might lead to issues. (See more information below)
- Always use the same software to create the different versions of the PDF; otherwise restoring the mapping might be an issue.
- This means that if you create a PDF from Word and tag it initially, you could run into issues when you later change the PDF with another tool, for example Adobe Pro.
- If you need to stitch multiple documents together, convert all parts to XHTML first and then use the Merge iXBRL functionality in toolsxbrl.
Recommendations on Tagging of PDF Documents
PDF is a very universal format for creating documents. Converting it to XHTML can be a challenge, especially if the PDF document that is used as a source has issues itself.
Here are some recommendations to create the best-possible conversion outcome:
- Keep in mind that PDF-to-HTML conversion is similar to actual printing, but on a very special virtual device. As with printing on a physical print station, this process can have font and color issues.
- Make all fonts embedded.
- Never add tables as pictures, including when converting from Word.
- Do not use Type 3 fonts, they are not supported in any case.
- For CID fonts, make sure they include correct character mapping definitions.
- Do not include hidden text in PDF documents, or remove it with Adobe Acrobat Redact.
- Do not add any stamps/signs as PDF comments.
- Use RGB color space.
- Do not use special ICC color profiles.
- Create a PDF document that complies with the PDF/A-1a standard and that contains no text that cannot be mapped to Unicode or that is inconsistent with the information for the rendered glyphs.
- Major layout changes (styles, one-column to two-columns) can have a serious impact on the mapping restoration. Bear that in mind when planning.
Keep in mind that the tagging of PDFs requires an extra step.
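The recommendations above can be turned into a small pre-flight checklist. The following is a minimal sketch, not part of toolsxbrl: it assumes that font and color information has already been read from the PDF by some inspection tool, and the `FontInfo` structure, the `preflight` function, and the sample font names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FontInfo:
    """Hypothetical description of one font, as reported by a PDF inspection tool."""
    name: str
    font_type: str          # e.g. "TrueType", "Type 1", "Type 3", "CID"
    embedded: bool
    has_unicode_map: bool   # True if character mapping definitions are present


def preflight(fonts: List[FontInfo], color_space: str, has_hidden_text: bool) -> List[str]:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    warnings = []
    for font in fonts:
        if not font.embedded:
            warnings.append(f"{font.name}: font is not embedded")
        if font.font_type == "Type 3":
            warnings.append(f"{font.name}: Type 3 fonts are not supported")
        if font.font_type == "CID" and not font.has_unicode_map:
            warnings.append(f"{font.name}: CID font lacks character mapping definitions")
    if color_space.upper() != "RGB":
        warnings.append(f"color space is {color_space}, RGB is recommended")
    if has_hidden_text:
        warnings.append("document contains hidden text")
    return warnings


# Sample input (made-up fonts, for illustration only).
report = preflight(
    fonts=[
        FontInfo("Arial", "TrueType", embedded=True, has_unicode_map=True),
        FontInfo("Deco", "Type 3", embedded=True, has_unicode_map=False),
        FontInfo("Mincho", "CID", embedded=False, has_unicode_map=False),
    ],
    color_space="CMYK",
    has_hidden_text=False,
)
for line in report:
    print(line)
```

A check like this only covers the points that can be expressed as simple metadata; layout changes and PDF/A compliance still need to be verified in the authoring tool.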
What Are Hidden Facts?
When converting and tagging a PDF report with a special font face in toolsxbrl, some facts (tags) might become hidden. The reason is that the Inline XBRL Specification does not allow individually formatted numbers to be tagged; for example, when the font requires special spacing between single characters and that spacing is produced with HTML tags inside the number, the number is no longer taggable. In the screenshot below, the number 24,540 is not taggable. In order to preserve the spacing and formatting of the PDF in the XHTML report, toolsxbrl moves the tag to an unformatted hidden section of the document and includes a link to the visual original number.
However, hiding facts is an official mechanism of the Inline XBRL specification, and it is also allowed by ESMA in the Reporting Manual, page 34.
From firesys's point of view, untaggable items, like the number in the example above, are not eligible for transformation and can be hidden. The XBRL International standard setter working group is aware of the issue and will probably publish an updated Inline XBRL specification, which will make those numbers taggable in the future.
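To make the idea of an "individually formatted number" more concrete, here is a small sketch. It is an illustration only and not the logic toolsxbrl uses: it assumes that the per-character spacing is expressed with `<span>` elements (the source does not name the tag), and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def has_inline_formatting(fragment: str) -> bool:
    """True if the cell content is split across child elements."""
    element = ET.fromstring(fragment)
    return len(list(element)) > 0


def visible_text(fragment: str) -> str:
    """Concatenate all text pieces, i.e. what a reader sees on screen."""
    element = ET.fromstring(fragment)
    return "".join(element.itertext())


# The same value, once as plain text and once with assumed per-character markup.
plain = "<td>24,540</td>"
spaced = "<td><span>2</span><span>4</span>,<span>540</span></td>"

for cell in (plain, spaced):
    print(visible_text(cell), has_inline_formatting(cell))
```

Both cells display the same number, but in the second one the digits are spread over several elements, which is the situation in which the fact is moved to the hidden section.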
How to Avoid Hidden Facts
There are multiple ways to avoid or reduce hidden facts in iXBRL reports:
- Tag Microsoft Word files instead of PDFs
- Do not use special non-web fonts in PDF reports that provide a special spacing between characters.
- Set the toolsxbrl CMaps option to "Ignore" when opening a PDF file (however, this might reduce the visual quality of the report).
- Use the latest toolsxbrl version, which includes some new options to reduce/avoid hidden facts.
- All numbers that are tagged need to have the OpenType setting "Default Figure Style" to avoid hidden facts. This setting only affects the digits in the report. To apply it, you can manually choose "Default Figure Style" in the number columns, or you can apply the setting in the Paragraph Style under "OpenType features". Use 0 kerning in the tagged cells for best results.
- Other problems that can occur when you convert the PDF to XHTML:
- Text opacity: if the document contains text with reduced opacity, the opacity reverts to the default 100% after the conversion to XHTML. The opacity is preserved if you convert the text to outlines.
Remove Hidden Text From PDF Files Using Adobe Redact
Hidden text becomes visible in the converted XHTML document, so it must be removed before the PDF document is processed with toolsxbrl. Some hidden elements can be removed using Adobe Redact.
Load the file into Adobe Acrobat Pro and click on the tools button.
Go to Protect & Standardize and click on Redact.
Click on Sanitize Document.
In the opening window you have to click on Click here.
After that you get a selection of all hidden elements. Remove all checks, but keep the one for Hidden Text and click on Remove.
Further Information About PDF Conversion
The limitations of the PDF converter:
- CID (Identity-H) fonts embedded in the source document.
- In this case, the converted document can contain unreadable (weird-looking) text. To resolve this, it is recommended to save the source document in PDF/X format using the Adobe Acrobat DC "Print Production" tool.
- If the converted document has a wrong color palette, see step 1.
- The converter does not support PDF hidden text layers.
- If the document contains hidden text layers, remove them with the Adobe Acrobat DC "Redact" tool.
- The converter has fine-tuning options that help to resolve these issues:
- If the converted document does not look good, change the option "PDF unicode CMaps handling" to "Auto" and "Use autohint on fonts without hint" to "Use AutoHint".
If the conversion still doesn't meet the expectations or some tables cannot be tagged properly, the source file might need corrections.
The following cases are known:
1. The converted PDF looks good, but the imported table is unreadable.
2. The converted PDF contains unreadable fragments.
3. The PDF document has not been converted at all in toolsxbrl.
4. The converted PDF shows wrong colors, visual artifacts or extra text fragments or pages.
For cases 1-3 there are two methods to repair the document in Adobe:
- Export the PDF to PostScript and create a new file from it in Adobe Acrobat Distiller DC.
- Convert the PDF to the PDF/A-1 standard with Adobe Preflight in the "PDF standards" tool.
For case 4 use "Sanitize Document" in the "Redact" tool and convert the document to PDF/X for correct colors.
If the document still has artifacts after all processing, the "fallback mode" option in toolsxbrl can be used.
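The mapping from known cases to repair methods can be summarized as a simple lookup. This is only a sketch for reference; the case numbers follow the order of the list of known cases, and the names `REPAIRS` and `suggest_repairs` are hypothetical and not part of toolsxbrl or Adobe Acrobat.

```python
from typing import List

# Repair methods, paraphrased from the text above.
DISTILL = "Export the PDF to PostScript and create a new file in Adobe Acrobat Distiller DC"
PREFLIGHT = 'Convert the PDF with Adobe Preflight in the "PDF standards" tool'
SANITIZE = 'Run "Sanitize Document" in the "Redact" tool and convert to PDF/X'

# Case number -> repair methods to try.
REPAIRS = {
    1: [DISTILL, PREFLIGHT],   # imported table is unreadable
    2: [DISTILL, PREFLIGHT],   # unreadable fragments
    3: [DISTILL, PREFLIGHT],   # document not converted at all
    4: [SANITIZE],             # wrong colors, artifacts, extra text or pages
}


def suggest_repairs(case: int) -> List[str]:
    """Return the repair methods for a known case number."""
    if case not in REPAIRS:
        raise ValueError(f"unknown case number: {case}")
    return list(REPAIRS[case])


for case in sorted(REPAIRS):
    print(case, suggest_repairs(case))
```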
How to Prepare a Word File
MS Word Requirements and Limitations
toolsxbrl is able to tag any MS Word documents properly with the following requirements and limitations:
- It is not possible to tag any value of a table that is included as an image in a document.
- For MS Word documents it is required to use styles (heading 1, heading 2, etc.) to structure the documents.
- The chapter headings are used by toolsxbrl to allow easy navigation through the document.
- All tables that have to be tagged must be normal Word tables (no embedded Excel or similar).
- To change the outline level of styles, right click on the paragraph and select Paragraph and then select Outline level. For more information look at our FAQ #304 and FAQ #305.
- Shapes and images anchored in front of text or behind text are placed at the anchor position. This might lead to different layout when converting to XHTML.
- Images and shapes inserted as embedded Office objects (e.g. diagrams from PowerPoint or Excel) can't be converted to XHTML. Those images must be converted to pure images e.g. by taking a screenshot and inserting it.
- Two-column text layout is not yet supported for MS Word to XHTML conversion.
You can also check out the FAQ for the HTML Converter, where many questions on Word documents are answered.
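Some of the requirements above can be checked before opening the file in toolsxbrl. The sketch below is a rough illustration, not a toolsxbrl feature. It relies on the fact that a .docx file is a ZIP archive whose body is stored in `word/document.xml`, and it counts heading paragraphs, Word tables, and embedded objects in that XML. It assumes the English style IDs (`Heading1`, `Heading2`, ...); localized Word installations may use different IDs, and outline levels set directly on a paragraph are not detected. The function name `summarize_body` and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# WordprocessingML main namespace used in word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def summarize_body(document_xml: str) -> Dict[str, int]:
    """Count heading paragraphs, Word tables and embedded objects."""
    root = ET.fromstring(document_xml)
    headings = 0
    for style in root.iterfind(".//w:pPr/w:pStyle", NS):
        style_id = style.get(f"{{{W}}}val", "")
        # Assumption: English style IDs such as "Heading1".
        if style_id.lower().startswith("heading"):
            headings += 1
    return {
        "headings": headings,
        "tables": len(root.findall(".//w:tbl", NS)),
        "embedded_objects": len(root.findall(".//w:object", NS)),
    }


# Minimal made-up document body for illustration.
SAMPLE = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Balance Sheet</w:t></w:r></w:p>
    <w:tbl><w:tr><w:tc><w:p><w:r><w:t>24,540</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
    <w:p><w:r><w:object/></w:r></w:p>
  </w:body>
</w:document>"""

summary = summarize_body(SAMPLE)
print(summary)
```

For a real file, the XML string could be read with `zipfile.ZipFile(path).read("word/document.xml").decode("utf-8")`. A document with zero headings or with embedded objects would then be a candidate for rework before tagging.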
How to Create a Compatible PDF From Word
The most reliable way to create an iXBRL-compatible PDF from Word is to use the PDF-export functionality from Adobe. For that, you will have to have Adobe Acrobat installed on your computer and then use the following settings: