New Compression Techniques for the Presentation of Printed Documents on the WEB.
Bogdan Smolka, Konrad Wojciechowski
Silesian Technical University, Department of Computer Science
Address: Academica 16, 44-101, Gliwice, POLAND
Much of our cultural heritage is currently available only in paper form. Archives, libraries and museums are the places where the majority of scientific and cultural resources are stored and preserved. Unfortunately, direct access to this huge deposit of valuable material for professional and personal use is rather complicated and expensive.
That is why the most interesting documents have been microfilmed, both to ensure the permanence of the endangered information and to give students and scientists easy access to it. Microfilm offers an acceptable level of quality, media longevity, little machine dependence and the possibility of producing additional copies without much loss of information. However, to enable remote access to the filmed information, the microfilms have to be digitised, which means an additional loss of quality.
In a time of rapid development in network and data technology, with constantly improving transmission capacity, the microfilm approach is beginning to belong to the old analogue era. The new era is that of the Internet and of rapid technical advances in digitisation. The explosive development of the Internet as a universal platform for the exchange of information now enables easy access to various treasures of the cultural heritage.
Electronic access to printed or written documents is, however, a complex matter. Paper documents contain text and illustrations, and very often the whole document has to be digitised and transformed into an image format. Sometimes it is possible to use optical character recognition tools to separate the text from the photographs or drawings, but this is very difficult when dealing with old documents, and in many cases even inappropriate. An old document is not simply the sum of the information contained in its text and illustrations. The colours, the texture of the paper, the style of the handwriting or the printing technique are very often more important than the informational content of the document, which in many cases is already known.
The philosophy behind this approach is that old documents should be shown as an integration of the information embedded in the document and its visual content. For this reason the documents are digitised and shown as images. The so-called virtual libraries allow users to view the documents, to print copies and, just as importantly, to collect and store the documents in a private archive for later use.
The major problem with the presentation of printed documents on the WEB is that a compromise has to be found between the quality of reproduction of the document and the transmission time required to download the huge amount of data contained in an image file.
At the moment, the rapid growth in the number of Internet users is worsening the problem, as the transmission capacities are almost exhausted. The introduction of high-speed Internet connections may alleviate the problem, but at present and in the near future the only possible solution is to compress the image data, so that the user can access the document in a reasonable time and view it in satisfactory quality. Digitising an average-size document at a medium resolution and colour depth produces a huge file of about 20–50 MB. The quality of such an image is high, but the transmission time, the computational burden and the hardware requirements are simply enormous. That is why compression techniques are commonly used, as they are capable of reducing the amount of data without much loss of image quality. The most popular standards for image data transmission are the GIF and JPEG formats.
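The order of magnitude of these file sizes can be checked with simple arithmetic. The sketch below is only an illustration: the A4 page dimensions, the 24-bit colour depth and the 56 kbit/s modem speed are assumptions chosen for the example, not figures taken from the measurements reported here.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Size in bytes of an uncompressed raster scan."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def transfer_seconds(size_bytes, link_bits_per_second):
    """Ideal transfer time, ignoring protocol overhead and congestion."""
    return size_bytes * 8 / link_bits_per_second


# Assumed example: an A4 page (8.27 x 11.69 in) at 300 dpi, 24-bit colour.
page_bytes = raw_scan_bytes(8.27, 11.69, 300, 24)

# Assumed link: a 56 kbit/s dial-up modem.
modem_seconds = transfer_seconds(page_bytes, 56_000)

print(f"raw size: {page_bytes / 1e6:.1f} MB")
print(f"transfer time at 56 kbit/s: {modem_seconds / 60:.0f} min")
```

Under these assumptions a single page occupies about 26 MB and would need roughly an hour to download, which is why uncompressed delivery is not practical.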
The GIF format is mainly used for the compression of images which contain a low number of different colours. As it uses a lossless coding scheme, its efficiency is not very high and it is not suitable for the distribution of real-life colour or grey-scale images. Much better results are obtained using the JPEG format from the Joint Photographic Experts Group, which performs subsampling of the chrominance information, quantisation of the DCT coefficients and Huffman coding of the image data. Although compression ratios of about 40:1 are easily achieved without much sacrifice of image quality, the JPEG format is not suitable for the compression of documents. Documents contain many high-frequency objects such as letters and drawings, so the elimination of the high-frequency components of the cosine transform leads to a severe loss in the quality of document reproduction. As compression rates increase, the text rapidly becomes distorted and illegible. To keep the document legible, the JPEG files have to be large, and this is the main obstacle to building an efficient Internet library.
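The effect of discarding high-frequency cosine components on sharp edges can be illustrated in one dimension. The sketch below is not the JPEG algorithm (which works on 8x8 blocks and quantises rather than simply zeroing coefficients); it only shows, under that simplification, why a sharp stroke boundary suffers far more than smooth shading. The function names and the two test signals are invented for the illustration.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        total = sum(x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i, x in enumerate(signal))
        out.append(scale * total)
    return out


def idct(coeffs):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt((1 if k == 0 else 2) / n)
            total += scale * c * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        out.append(total)
    return out


def max_error_after_lowpass(signal, keep):
    """Zero all but the first `keep` coefficients; return the worst sample error."""
    coeffs = dct(signal)
    truncated = coeffs[:keep] + [0.0] * (len(coeffs) - keep)
    restored = idct(truncated)
    return max(abs(a - b) for a, b in zip(signal, restored))


text_edge = [0, 0, 0, 0, 255, 255, 255, 255]       # sharp stroke boundary
smooth_ramp = [0, 32, 64, 96, 128, 160, 192, 224]  # gradual shading

edge_error = max_error_after_lowpass(text_edge, keep=2)
ramp_error = max_error_after_lowpass(smooth_ramp, keep=2)
print(f"edge: {edge_error:.1f}  ramp: {ramp_error:.1f}")
```

With only the two lowest-frequency coefficients kept, the worst-case error on the edge signal is several times larger than on the ramp, which mirrors the blurring and ringing seen around letters in heavily compressed JPEG files.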
A common solution is to transform the document image into bi-tonal form and then compress it using the CCITT fax compression standards Group 3 or Group 4. This approach keeps the text legible at high compression rates, at the expense of a total loss of colour information.
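The bi-tonal approach can be sketched as a threshold followed by run-length coding of each scan line. This is a simplified stand-in: the real Group 3 and Group 4 coders go further and encode the runs with variable-length codes (and, in the two-dimensional modes, relative to the previous line), which is not shown. The threshold value and the tiny sample image are assumptions made for the example.

```python
def to_bitonal(gray_rows, threshold=128):
    """Map grey levels (0-255) to 1 for ink (dark) and 0 for paper (light)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray_rows]


def run_lengths(row):
    """Run lengths of alternating colours, always starting with a white run.

    A row that begins with ink therefore gets a leading white run of length 0.
    """
    runs = []
    current, count = 0, 0
    for bit in row:
        if bit == current:
            count += 1
        else:
            runs.append(count)
            current, count = bit, 1
    runs.append(count)
    return runs


# Hypothetical 2 x 8 grey-scale fragment: light paper with dark strokes.
gray = [
    [250, 250, 30, 20, 25, 240, 245, 250],
    [250, 40, 35, 240, 240, 30, 35, 250],
]
bitonal = to_bitonal(gray)
encoded = [run_lengths(row) for row in bitonal]
print(encoded)
```

The colour and grey-level values are discarded at the thresholding step, which is exactly the loss of colour information mentioned above; only the run lengths remain to be coded.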
The JPEG, GIF and fax formats used for the distribution of documents are now being replaced by new wavelet-based formats geared towards the direct compression of high-quality scanned documents. These new formats enable fast transmission of document images over the Internet with an acceptable level of quality.
Among the new wavelet-based formats, three are especially interesting for the presentation of scanned documents on the Internet: DjVu, LuraDocument and MrSID.
The DjVu* format was developed at AT&T Bell Labs with the aim of enabling the distribution of scanned high-resolution images over the WEB. DjVu represents the image using three layers:
· the mask, which is a bi-tonal bitmap representing the text and drawings. This layer, coded with a lossless compression algorithm, indicates which pixels belong to the foreground (text, drawings, signatures) and which to the background (paper texture, photographs);
· the second layer, which represents the colours of the background using wavelet-based transform coding;
· the third layer, which contains the information of the foreground, encoded using the same wavelet-based algorithm.
Using DjVu, it is possible to compress a 25 MB document scanned at 300 dpi down to about 100–200 kB while still reproducing it faithfully. The quality of the DjVu images is acceptable, and the format is suitable for the compression of books, newspaper pages with colour photographs, catalogues and so on. The compressed document images are so small that the format can be used for the distribution of documents on CD-ROMs (a CD-ROM can contain about 5000 newspaper pages) and of course over the Internet (web pages contain on average about 50 KB of information). Another important feature of DjVu is the good performance of the freely available browser plug-ins, which allow fast zooming and scrolling of images, splitting an image into background and foreground, conversion to black and white and many other useful facilities.
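The three-layer representation described above can be sketched as follows. This is a minimal illustration and not the DjVu segmenter: a fixed luminance threshold is assumed as the foreground test, whereas the actual format uses a considerably more elaborate segmentation, and the separate compression of each layer is not shown. The function names, the threshold and the sample colours are invented for the example.

```python
def split_layers(pixels, threshold=100):
    """Split an RGB image into a mask plus foreground and background layers.

    pixels: rows of (r, g, b) tuples.
    Returns (mask, foreground, background). mask[y][x] is 1 for foreground
    pixels; each colour layer holds None where the pixel belongs to the
    other layer.
    """
    mask, foreground, background = [], [], []
    for row in pixels:
        mask_row, fg_row, bg_row = [], [], []
        for r, g, b in row:
            luminance = 0.299 * r + 0.587 * g + 0.114 * b
            is_ink = luminance < threshold  # assumed, simplistic criterion
            mask_row.append(1 if is_ink else 0)
            fg_row.append((r, g, b) if is_ink else None)
            bg_row.append(None if is_ink else (r, g, b))
        mask.append(mask_row)
        foreground.append(fg_row)
        background.append(bg_row)
    return mask, foreground, background


def merge_layers(mask, foreground, background):
    """Recombine the layers: the mask selects which layer supplies each pixel."""
    return [
        [fg if m else bg for m, fg, bg in zip(mask_row, fg_row, bg_row)]
        for mask_row, fg_row, bg_row in zip(mask, foreground, background)
    ]


PAPER = (230, 220, 190)  # hypothetical yellowed paper colour
INK = (40, 20, 10)       # hypothetical dark brown ink

page = [
    [PAPER, INK, INK, PAPER],
    [PAPER, PAPER, INK, PAPER],
]
mask, foreground, background = split_layers(page)
restored = merge_layers(mask, foreground, background)
```

Because the mask has sharp edges and only two values, it suits lossless bi-tonal coding, while the two colour layers vary slowly and tolerate lossy coding; that division of labour is the point of the layered design.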
As digital images and scanned documents place enormous demands on memory, storage capacity and transmission speed, the LuraTech** company has developed its own image compression standard called LuraDocument (LDF), based on recent advances in wavelet theory.
The LDF format, like DjVu, segments the document into text and background, which allows high compression ratios to be achieved. Using wavelets to compress the background and foreground layers, the LDF format achieves compression ratios of about 200:1 while preserving acceptable image quality. The format is specially designed for the processing of document images and allows their distribution over the WEB and local networks in business and non-profit applications. LDF offers significant savings in disk and network resources and can be used on many computer platforms. An efficient plug-in for the popular Internet browsers is also available.
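The figures quoted for the two formats can be put on a common footing with a short calculation. The sketch below assumes decimal units (1 MB = 1 000 000 bytes) and applies the 200:1 ratio to the same 25 MB page used in the DjVu example; it is arithmetic on the quoted numbers, not a measurement.

```python
def compression_ratio(original_bytes, compressed_bytes):
    """How many times smaller the compressed file is."""
    return original_bytes / compressed_bytes


def compressed_size(original_bytes, ratio):
    """Expected file size for a given compression ratio."""
    return original_bytes / ratio


PAGE_BYTES = 25_000_000  # the 25 MB scan from the DjVu example (decimal MB assumed)

djvu_ratio_small = compression_ratio(PAGE_BYTES, 100_000)  # 100 kB result
djvu_ratio_large = compression_ratio(PAGE_BYTES, 200_000)  # 200 kB result
ldf_bytes = compressed_size(PAGE_BYTES, 200)               # 200:1 ratio

print(f"DjVu: {djvu_ratio_large:.0f}:1 to {djvu_ratio_small:.0f}:1")
print(f"LDF at 200:1: {ldf_bytes / 1000:.0f} kB")
```

On these numbers the two formats fall in the same range: 100–200 kB for a 25 MB page corresponds to ratios between 125:1 and 250:1, and a 200:1 ratio gives a file of 125 kB.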
LizardTech's MrSID*** for Photography is a file format which encodes large, high-resolution images to a fraction of their original file size while maintaining high image quality. Images become truly scalable and can be reduced, enlarged, zoomed or printed without much loss of quality. MrSID is designed specifically for the compression of large files of scanned documents, such as old books, newspapers and especially large geographical maps. This multi-resolution, device-independent image format delivers the optimal resolution for the screen and the Internet.
With this format, portions of huge files can be sent quickly via the Internet at print-ready quality, and the transfer of data takes only a few seconds. The same is true for the encoding process. MrSID requires only seconds to open large files once they have been encoded. MrSID's high compression ratios give a large overall reduction in file size without visible loss of quality.
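The multi-resolution property of wavelet coding can be illustrated with the simplest wavelet, the Haar transform, applied to one row of pixels. This is an assumption made for clarity: nothing here states which wavelet MrSID actually uses, and a real coder works in two dimensions and quantises the detail coefficients. The function names and sample values are invented for the example.

```python
def haar_step(samples):
    """One Haar level: pairwise averages (coarse) and half-differences (detail)."""
    pairs = list(zip(samples[0::2], samples[1::2]))
    coarse = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return coarse, detail


def haar_inverse(coarse, detail):
    """Rebuild the finer level exactly from coarse and detail coefficients."""
    samples = []
    for c, d in zip(coarse, detail):
        samples.extend([c + d, c - d])
    return samples


def pyramid(samples, levels):
    """Return [full, half, quarter, ...] resolution versions of the signal."""
    versions = [list(samples)]
    for _ in range(levels):
        coarse, _detail = haar_step(versions[-1])
        versions.append(coarse)
    return versions


row = [10, 12, 200, 202, 50, 54, 90, 94]  # hypothetical scan line, even length
coarse, detail = haar_step(row)
levels = pyramid(row, 2)
print([len(v) for v in levels])
```

A viewer that needs only a thumbnail can stop at a coarse level and never fetch the detail coefficients, while zooming in requires just the details for the region on screen; this is the general mechanism by which a multi-resolution format can serve portions of a large image quickly.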
One of the benefits of a MrSID image is its ease of use. It is possible to compress an image from Photoshop® or to use the MrSID Workgroup Encoder for image collections. MrSID files are supported in commonly used graphics applications and standard Web browsers, allowing users to take advantage of the flexibility of MrSID throughout their workflows.
As shown in this short paper, the new compression techniques based on the wavelet transform deliver substantially higher image quality than the standard formats and enable the presentation of scanned documents over the Internet. The high quality of the compressed images and their low average size make them an interesting tool for building virtual libraries on the WEB.
Our experience with the new technology shows that it should be widely used to allow easy and fast access to high-quality material available only in paper form. The introduction of the new techniques will surely bring the Internet a step closer to being the universal, most powerful medium of information exchange.
The capabilities of the new formats can be assessed by visiting the experimental WWW site http://plum.ia.polsl.gliwice.pl/vb, prepared in co-operation with the Silesian Library, Katowice, Poland. At this site we have collected many examples of different documents such as incunabula, old letters, newspapers, song books, photographs and maps. Every document is compressed using both DjVu and LuraDocument to enable a comparison of these two software packages. Our virtual library also provides information on other projects whose aim is to enhance access to the cultural heritage over the Internet.
The Silesian University of Technology is today a large, modern university teaching approximately 25 000 students. It offers courses in 21 engineering disciplines, with more than 100 Honours, including full-time MSc courses, full-time and part-time BSc courses, as well as supplementary MSc courses. Optional PhD courses and post-MSc courses are offered in the most attractive engineering disciplines, which enjoy increasing popularity. Several post-MSc studies are run in English or French. Pictures of the University can be found at http://www.polsl.gliwice.pl/alma.mater/pictures.html.