The idea is to annotate a set of training documents (HTML pages or PDF files) and then apply semi-supervised learning methods to train the system, so that it can extract the learned patterns and relations from real-world data.
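To make this concrete, here is a minimal sketch of one possible semi-supervised approach: pattern bootstrapping, where a few annotated relation pairs are used to learn textual patterns, and those patterns are then applied to unannotated text to discover new pairs. This is only an illustration, not necessarily the method intended above. It assumes the HTML or PDF documents have already been converted to plain text, and every name in it (`learn_patterns`, `apply_patterns`, `bootstrap`, the sample sentences, the "capital of" relation) is hypothetical.

```python
import re


def learn_patterns(texts, pairs):
    """Collect the short text that appears between each known (left, right) pair."""
    patterns = set()
    for text in texts:
        for left, right in pairs:
            # Non-greedy match of up to 30 characters between the two entities.
            m = re.search(re.escape(left) + r"(.{1,30}?)" + re.escape(right), text)
            if m:
                patterns.add(m.group(1))
    return patterns


def apply_patterns(texts, patterns):
    """Find new capitalized-word pairs joined by one of the learned patterns."""
    found = set()
    for text in texts:
        for middle in patterns:
            regex = r"([A-Z]\w+)" + re.escape(middle) + r"([A-Z]\w+)"
            for m in re.finditer(regex, text):
                found.add((m.group(1), m.group(2)))
    return found


def bootstrap(annotated, unlabeled, seeds, rounds=2):
    """Alternate between learning patterns and extracting pairs for a few rounds."""
    known = set(seeds)
    for _ in range(rounds):
        patterns = learn_patterns(annotated + unlabeled, known)
        known |= apply_patterns(unlabeled, patterns)
    return known


# Hypothetical data: one annotated sentence with its labeled relation pair,
# plus unannotated "real world" sentences (assumed already extracted to text).
annotated = ["Paris is the capital of France."]
seeds = {("Paris", "France")}
unlabeled = [
    "Berlin is the capital of Germany.",
    "Rome is the capital of Italy.",
    "Rome, capital of Italy, is old.",
    "Madrid, capital of Spain, is large.",
]

found = bootstrap(annotated, unlabeled, seeds)
print(sorted(found))
```

In the first round only the annotated pattern " is the capital of " is available, which yields Berlin/Germany and Rome/Italy. In the second round the newly found Rome/Italy pair teaches the system the pattern ", capital of ", which in turn yields Madrid/Spain. A real system would also need to score patterns and filter out noisy matches, which this sketch leaves out.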