If you’ve heard of the Vatican Secret Archives, you know that it’s one of the grandest collections of historical documents in the world. Until now, it’s also been all but unusable for those trying to make sense of most of what’s in it.
The VSA is located within the walls of the Vatican, just north of the Sistine Chapel and right next to the Apostolic Library. It houses 53 linear miles of shelving, with documents dating back more than 12 centuries. These include a plea from Mary Queen of Scots to Pope Sixtus V just before she was executed, the papal bull that excommunicated Martin Luther, and much more. There’s nothing like the VSA anywhere else in the world.
Of all the material in the VSA, only a few pages have been scanned, made searchable, and put online for researchers and students. That makes it very difficult to find much within the vast amount of information behind the walls of the Vatican. If what you’re looking for isn’t available through the basic search, you must apply for access and go through the archives yourself to find what you need. Even then, there’s no guarantee.
In Codice Ratio Will Make A Difference
A new project called In Codice Ratio is marrying artificial intelligence with optical character recognition (OCR) to scan through all of the material that has yet to be sorted and uploaded to the online database.
OCR is used to scan books, images, and other printed material and transform them into machine-encoded text. It is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, text-to-speech, key data and text mining.
The traditional OCR method is great for typeset text, but doesn’t work so well with handwritten documents, which make up the majority of the information in the VSA. Since OCR works by reading the spaces in between text, known as dirty segmentation, reading text from centuries ago that looks like a mix of calligraphy and cursive can be rather difficult to accomplish. Without clear gaps, OCR can’t tell where one letter stops and another starts, and therefore doesn’t know how many letters there are.
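To see why splitting on gaps breaks down on joined-up handwriting, here is a minimal Python sketch. It is not the project's code or the method of any particular OCR engine. The tiny binary "images" and the function names column_ink and split_on_gaps are made up for illustration, and the approach assumes that letters are separated by at least one completely blank column.

```python
def column_ink(image):
    """Count ink pixels (1s) in each column of a binary image (list of rows)."""
    return [sum(column) for column in zip(*image)]


def split_on_gaps(image):
    """Return (start, end) column ranges of ink separated by blank columns."""
    profile = column_ink(image)
    segments = []
    start = None
    for x, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = x                      # a new run of ink begins
        elif ink == 0 and start is not None:
            segments.append((start, x))    # a blank column ends the run
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments


def parse(rows):
    """Turn rows of '#' (ink) and '.' (blank) into a 0/1 grid."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


# Three separate "letters", as in typeset text.
printed = parse([
    "##..##..##",
    "##..##..##",
    "##..##..##",
])

# The same three letters joined by a connecting stroke along the bottom.
cursive = parse([
    "##..##..##",
    "##..##..##",
    "##########",
])

printed_segments = split_on_gaps(printed)
cursive_segments = split_on_gaps(cursive)
print(len(printed_segments), "segments in the printed sample")
print(len(cursive_segments), "segment(s) in the cursive sample")
```

In this toy case the printed sample splits into three pieces, while the connecting stroke in the cursive sample leaves no blank column, so everything comes back as a single blob.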
By using AI to enhance OCR and create what is known as jigsaw segmentation, the tool is able to identify separate pen strokes rather than relying on identifying whole letters or words. The four scientists behind the project – Paolo Merialdo, Donatella Firmani, and Elena Nieddu at Roma Tre University, and Marco Maiorino at the VSA – explain how the advanced process works. In jigsaw segmentation, OCR looks at the thinner strokes, making it easier to analyze them, then carves out letters using the joints of the strokes, thus creating what looks like jigsaw pieces. These pieces are then scanned and turned into searchable data, which in this case is uploaded to an online database.
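The idea of carving at the thin joints can be sketched in a few lines of Python. This is a simplified, one-dimensional illustration under stated assumptions, not the team's actual implementation: it only counts ink per column and cuts wherever that count dips to a local minimum. The names column_ink, cut_points, and carve, and the sample grid, are hypothetical.

```python
def column_ink(image):
    """Count ink pixels (1s) in each column of a binary image (list of rows)."""
    return [sum(column) for column in zip(*image)]


def cut_points(profile):
    """Find columns where the ink count dips to a local minimum.

    A dip may be a flat stretch several columns wide; the cut is placed
    in the middle of that stretch.
    """
    cuts = []
    x = 1
    while x < len(profile) - 1:
        if profile[x] < profile[x - 1]:
            end = x
            # Walk to the end of the flat stretch at this ink level.
            while end + 1 < len(profile) and profile[end + 1] == profile[x]:
                end += 1
            # Only a minimum if the ink count rises again afterwards.
            if end + 1 < len(profile) and profile[end + 1] > profile[x]:
                cuts.append((x + end) // 2)
            x = end + 1
        else:
            x += 1
    return cuts


def carve(profile):
    """Split the full column range into pieces at the cut points."""
    bounds = [0] + cut_points(profile) + [len(profile)]
    return list(zip(bounds, bounds[1:]))


# Three thick strokes joined by a thin connecting stroke along the bottom.
rows = [
    "##..##..##",
    "##..##..##",
    "##########",
]
image = [[1 if ch == "#" else 0 for ch in row] for row in rows]

profile = column_ink(image)
pieces = carve(profile)
print("ink per column:", profile)
print("jigsaw-style pieces (column ranges):", pieces)
```

Note that the sample has no blank columns at all, so there are no gaps to split on; cutting where the ink thins out still breaks the connected stroke into separate pieces. A real system would also have to decide which pieces belong to which letter, which this sketch does not attempt.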
The VSA isn’t the only archive the project has its sights on, either. If it can successfully open up more of the VSA, In Codice Ratio also plans to tackle other large archives around the world.
Putting OCR to the Test
To give the refined OCR an opportunity to show its true potential, the team had it scour through documents from the Vatican Register. The 18,000-page batch of documents includes letters to European kings, rulings on legal matters, and other correspondence between rulers and religious leaders from centuries ago.
The software achieved a 96% success rate after it finished reading through the documents. The most common mistakes involved the letters ‘m’, ‘n’, and ‘i’, as well as confusing the archaic ‘f’ with ‘s’. While it would be ideal to have 100% accuracy, “imperfect transcriptions can provide enough information and context about the manuscript at hand” to be useful, says Paolo Merialdo.
The team claims that, like any form of AI, the process will improve itself over time. As it teaches itself to detect more of the features that distinguish these letters, the results will become much more accurate. In Codice Ratio also plans to apply its strategy – jigsaw segmentation mixed with crowdsourced training of the software – to other projects and languages.
Although the project is expected to bring great progress in this type of research, Rega Wood, a historian of philosophy and paleographer (expert on ancient handwriting) at Indiana University, claims even artificial intelligence will always have some kind of limitation. It “will be problematic for manuscripts that are not professionally written but copied by nonprofessionals,” she says. The larger the amount of information it has access to, the less accurate it will be as well. In some cases, says Wood, “it is not only more accurate, but just as quick to make transcriptions without such technology.”