Overview
Flexible content and metadata extraction framework
Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems (e.g. file systems, websites, mailboxes) and the file formats (e.g. documents, images) occurring in these systems.
Features
- crawl information systems such as file systems, websites, mailboxes and mail servers
- extract full-text and metadata from many common file formats
- view files in their native applications
- ease of use: easy to learn, easy to code, easy to deploy in industrial projects
- flexible architecture: can be extended with custom file formats, data sources, etc., with support for deployment on OSGi platforms
- data exchange based on Semantic Web standards (e.g. RDF, SPARQL)
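To illustrate what "can be extended with custom file formats" might look like in practice, here is a minimal sketch of a pluggable extractor registry keyed by MIME type. It is not Aperture's API: the names `ExtractorRegistrySketch`, `TextExtractor` and `Registry` are hypothetical and were invented for this example, which uses only the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class ExtractorRegistrySketch {

    /** Hypothetical extension point (not an Aperture interface): turns raw bytes into text. */
    public interface TextExtractor {
        String extract(byte[] data);
    }

    /** Maps MIME types to extractors so new formats can be added without changing callers. */
    public static class Registry {
        private final Map<String, TextExtractor> byMimeType = new HashMap<>();

        public void register(String mimeType, TextExtractor extractor) {
            byMimeType.put(mimeType, extractor);
        }

        /** Returns the extracted text, or an empty Optional if no extractor is registered. */
        public Optional<String> extract(String mimeType, byte[] data) {
            TextExtractor extractor = byMimeType.get(mimeType);
            if (extractor == null) {
                return Optional.empty();
            }
            return Optional.of(extractor.extract(data));
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();

        // A built-in style extractor for plain text, assuming UTF-8 input.
        registry.register("text/plain", data -> new String(data, StandardCharsets.UTF_8));

        // A "custom format" plugged in later; the MIME type is made up for the example.
        registry.register("application/x-example",
                data -> new String(data, StandardCharsets.UTF_8).trim());

        byte[] sample = "  hello  ".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.extract("application/x-example", sample).orElse("(unsupported)"));
        System.out.println(registry.extract("application/pdf", sample).orElse("(unsupported)"));
    }
}
```

The point of the design is that callers only ask the registry for text by MIME type, so supporting a new format means registering one more extractor rather than editing existing code.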
Supported file formats
- plain text
- HTML, XHTML
- XML
- PDF (Portable Document Format)
- RTF (Rich Text Format)
- Microsoft Office: Word, Excel, PowerPoint, Visio, Publisher
- Microsoft Works
- OpenOffice 1.x: Writer, Calc, Impress, Draw
- StarOffice 6.x - 7.x+: Writer, Calc, Impress, Draw
- OpenDocument (OpenOffice 2.x, StarOffice 8.x)
- Corel WordPerfect, Quattro, Presentations
- e-mails (.eml files)
Crawlers
Crawlers support the extraction of information from heterogeneous data sources. At the moment we support the following source types:
- file systems (local, remote, removable media)
- websites and intranets
- IMAP e-mail servers
- Microsoft Outlook (alpha)
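As a rough illustration of what crawling a file system source involves, the sketch below walks a directory tree with the standard JDK and records basic metadata for each regular file. It does not use or reproduce Aperture's crawler classes: `FileCrawlSketch` and `CrawledFile` are hypothetical names chosen for this example, and the metadata fields are an assumption about what a crawler might report.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FileCrawlSketch {

    /** Hypothetical per-file result: location plus two simple metadata values. */
    public static final class CrawledFile {
        public final Path path;
        public final long sizeInBytes;
        public final String mimeType; // null when the platform cannot guess the type

        CrawledFile(Path path, long sizeInBytes, String mimeType) {
            this.path = path;
            this.sizeInBytes = sizeInBytes;
            this.mimeType = mimeType;
        }
    }

    /** Recursively visits every regular file under root, in sorted path order. */
    public static List<CrawledFile> crawl(Path root) throws IOException {
        List<Path> files;
        // Files.walk holds directory handles open, so close the stream promptly.
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }

        List<CrawledFile> results = new ArrayList<>();
        for (Path file : files) {
            results.add(new CrawledFile(file, Files.size(file), Files.probeContentType(file)));
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (CrawledFile f : crawl(root)) {
            System.out.println(f.path + "\t" + f.sizeInBytes + " bytes\t" + f.mimeType);
        }
    }
}
```

A real crawler would typically hand each discovered file to a format-specific extractor and would also need to handle unreadable files and incremental re-crawls, both of which this sketch leaves out.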
Community
Aperture is developed by a community of developers and users, led by engineers from Aduna and DFKI, the premier German institute for artificial intelligence research.