User:Formatted Text Parsers and APIs
So the binary DOC format cannot be accessed via parsers?
Correct, in the sense that general-purpose XML or text parsers cannot read it. The binary DOC format, which Microsoft Word used before Word 2007, is not human-readable or parseable in the same way as the newer XML-based formats like DOCX. A DOCX file is a ZIP archive of XML parts, whereas a DOC file stores its content as binary streams inside an OLE compound file, so reading it takes a tool written specifically for that layout.
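To make the contrast concrete, here is a minimal sketch of reading paragraph text from a DOCX file with only the Python standard library. The function names `docx_paragraphs` and `make_minimal_docx` are made up for this illustration, and the demo archive contains only `word/document.xml`, so it is not a complete DOCX that Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each paragraph in a DOCX (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Text runs are stored in <w:t> elements inside each <w:p> paragraph.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs


def make_minimal_docx(lines):
    """Build an in-memory ZIP holding only word/document.xml (demo input)."""
    body = "".join(f"<w:p><w:r><w:t>{line}</w:t></w:r></w:p>" for line in lines)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


print(docx_paragraphs(make_minimal_docx(["Hello", "World"])))
```

Nothing comparable works on a DOC file, because there is no ZIP archive or XML inside it to open. This sketch also ignores tables, headers, and nested text boxes, which a library like python-docx handles more carefully.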
For binary DOC files, the recommended approach is to use a library or tool that can interpret the binary format and expose an API for reading and manipulating the content. As mentioned earlier, one common approach on Windows is COM automation of Microsoft Word itself, for example through pywin32 in Python. This requires Word to be installed, because Word does the actual parsing.
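As a rough sketch of the COM route, the function below asks Word to open a DOC file and hands back its plain text. It assumes Windows, an installed copy of Microsoft Word, and the pywin32 package; the name `extract_doc_text` is made up for this example.

```python
import os
import sys


def extract_doc_text(path):
    """Open a .doc file in Word via COM and return its plain text.

    Assumes Windows, an installed Microsoft Word, and the pywin32 package.
    """
    if sys.platform != "win32":
        raise RuntimeError("COM automation of Word is only available on Windows")

    # Imported lazily so this module can still be imported on other platforms.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Word resolves relative paths against its own working directory,
        # so pass an absolute path.
        doc = word.Documents.Open(os.path.abspath(path))
        try:
            return doc.Content.Text
        finally:
            doc.Close(False)  # False: close without saving changes
    finally:
        word.Quit()
```

Starting Word for every file is slow, so for a large batch it may be worth reusing one `Word.Application` instance across documents.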
Here's a summary:
Binary DOC Format: Requires specialized tools or libraries for direct access.
XML-Based Formats (e.g., DOCX): Can be more easily parsed using libraries like python-docx.
If you're working with a mix of DOC and DOCX files, you may need to use different approaches for each format. For newer projects or when possible, it's generally recommended to work with the XML-based formats like DOCX due to their increased transparency and ease of parsing.
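If you have a mixed folder and cannot trust file extensions, one way to pick the right approach is to look at the first bytes of each file. The sketch below only identifies the container: other legacy Office formats such as XLS and PPT share the OLE header, and any ZIP archive shares the ZIP header, so treat the result as a hint. The function names are made up for this example.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE compound file header (DOC, XLS, PPT)
ZIP_MAGIC = b"PK\x03\x04"  # ZIP local file header (DOCX, XLSX, any ZIP archive)


def sniff_container(header):
    """Classify a file by its leading bytes as 'ole2', 'zip', or 'unknown'."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff_file(path):
    """Read just enough of the file to classify its container."""
    with open(path, "rb") as handle:
        return sniff_container(handle.read(len(OLE2_MAGIC)))
```

A file reported as "ole2" would go to the COM route or another DOC-aware tool, and one reported as "zip" to a DOCX library such as python-docx.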