Aperture

Overview

Flexible content and metadata extraction framework

Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems (e.g. file systems, websites, mailboxes) and from the file formats (e.g. documents, images) that occur in these systems.

Features

  • crawl information systems such as file systems, websites, mailboxes and mail servers
  • extract full-text and metadata from many common file formats
  • view files in their native applications
  • ease of use: easy to learn, easy to code, easy to deploy in industrial projects
  • flexible architecture: can be extended with custom file formats, data sources, etc., with support for deployment on OSGi platforms
  • data exchange based on Semantic Web standards (e.g. RDF, SPARQL)

Supported file formats

  • plain text
  • HTML, XHTML
  • XML
  • PDF (Portable Document Format)
  • RTF (Rich Text Format)
  • Microsoft Office: Word, Excel, PowerPoint, Visio, Publisher
  • Microsoft Works
  • OpenOffice 1.x: Writer, Calc, Impress, Draw
  • StarOffice 6.x - 7.x+: Writer, Calc, Impress, Draw
  • OpenDocument (OpenOffice 2.x, StarOffice 8.x)
  • Corel WordPerfect, Quattro, Presentations
  • e-mails (.eml files)
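
To illustrate how a framework of this kind can associate file formats with extraction logic, here is a minimal sketch in plain Java. It does not use Aperture's actual API: the `TextExtractor` interface, the `SimpleExtractorRegistry` class, the MIME type keys and the UTF-8 decoding are all assumptions made for illustration, and the HTML handling is deliberately naive.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch, NOT Aperture's API: maps MIME types to extractors. */
public class ExtractorSketch {

    /** Hypothetical extractor contract: turn raw file content into plain text. */
    interface TextExtractor {
        String extract(byte[] content);
    }

    /** Registry keyed by MIME type, so support for new formats can be plugged in. */
    static class SimpleExtractorRegistry {
        private final Map<String, TextExtractor> extractors = new HashMap<>();

        void register(String mimeType, TextExtractor extractor) {
            extractors.put(mimeType, extractor);
        }

        Optional<TextExtractor> lookup(String mimeType) {
            return Optional.ofNullable(extractors.get(mimeType));
        }
    }

    /** Plain text: decode the bytes as UTF-8 (an assumption of this sketch). */
    static String extractPlainText(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }

    /** Naive HTML handling: drop anything that looks like a tag, collapse whitespace. */
    static String extractHtml(byte[] content) {
        String html = new String(content, StandardCharsets.UTF_8);
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    /** Builds a registry with two illustrative formats registered. */
    static SimpleExtractorRegistry defaultRegistry() {
        SimpleExtractorRegistry registry = new SimpleExtractorRegistry();
        registry.register("text/plain", ExtractorSketch::extractPlainText);
        registry.register("text/html", ExtractorSketch::extractHtml);
        return registry;
    }

    public static void main(String[] args) {
        SimpleExtractorRegistry registry = defaultRegistry();
        byte[] page = "<p>Hello <b>world</b></p>".getBytes(StandardCharsets.UTF_8);

        // Look up the extractor for the detected type and print the extracted text.
        registry.lookup("text/html")
                .map(extractor -> extractor.extract(page))
                .ifPresent(System.out::println);
    }
}
```

Keying the registry by MIME type keeps format detection separate from extraction, so a custom file format only needs one additional `register` call in this sketch.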

Crawlers

Crawlers support the extraction of information from heterogeneous data sources. At the moment we support the following source types:

  • file systems (local, remote, removable media)
  • websites and intranets
  • IMAP e-mail servers
  • Microsoft Outlook (alpha)
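
As a rough illustration of what crawling the first source type involves, the following sketch walks a directory tree with the standard `java.nio.file` API and reports each regular file together with a best-effort MIME type. It is not Aperture's crawler API: the `FileCrawlSketch` class, the `FileHandler` callback and the `application/octet-stream` fallback are hypothetical choices made for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch, NOT Aperture's API: a minimal file system crawl. */
public class FileCrawlSketch {

    /** Hypothetical callback, invoked once per discovered regular file. */
    interface FileHandler {
        void onFile(Path file, String mimeType);
    }

    /** Walks the tree under root, reports every regular file, returns the file count. */
    static int crawl(Path root, FileHandler handler) {
        List<Path> files;
        // Files.walk returns a lazy stream that must be closed after use.
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile)
                        .sorted()
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        for (Path file : files) {
            String mimeType;
            try {
                // Platform-dependent detection; returns null when the type is unknown.
                mimeType = Files.probeContentType(file);
            } catch (IOException e) {
                mimeType = null;
            }
            handler.onFile(file, mimeType == null ? "application/octet-stream" : mimeType);
        }
        return files.size();
    }

    public static void main(String[] args) {
        // Crawl the directory given on the command line, or the working directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        int count = crawl(root, (file, mime) -> System.out.println(mime + "\t" + file));
        System.out.println(count + " files visited");
    }
}
```

Passing a handler instead of returning extracted data keeps the traversal independent of what is done with each file, which is one way to let the same crawl feed different extractors.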

Community

Aperture is developed by a community of developers and users, led by engineers from Aduna and DFKI, the premier German institute for artificial intelligence research.