

Noam, what are you working on at the moment?

I'm working with Marina Rustow at the Princeton Geniza Project (PGP), helping to train a computer algorithm to transcribe documentary material from the Genizah. I'm also working on revising my dissertation into a book.

How do you set about training a computer? Can you outline the process, or your part in it?

The computer science part is beyond my ken – we're working with a digital palaeography platform called eScriptorium, and collaborating with scholars around the world; there are a lot of people involved with this program. The PGP has lots of transcriptions and editions of documents that various Genizah scholars have done, so right now we're getting the computer to match up what it sees on an image with the transcription that a person has done. Once it can do that, hopefully it will be able to transcribe documents that scholars haven't looked at yet. Even if it's only 80 or 90% accurate, that would still be an immense advantage for scholarship.
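Figures like "80 or 90% accurate" are usually expressed at the character level, as character accuracy or its complement, the character error rate (CER). The sketch below is purely illustrative – it is not the PGP's or eScriptorium's actual evaluation code – but it shows how such a figure can be computed by comparing a machine transcription against a scholar's reference.

```python
# Illustrative only: character error rate (CER) between a machine transcription
# and a scholar's reference, using plain edit distance. Not project code.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def character_error_rate(machine: str, human: str) -> float:
    """CER of the machine output measured against the human reference."""
    return edit_distance(machine, human) / max(len(human), 1)

if __name__ == "__main__":
    # Hypothetical strings: a CER of about 0.10 corresponds to roughly 90% accuracy.
    cer = character_error_rate("machne transcripton", "machine transcription")
    print(f"CER: {cer:.2f}, character accuracy: {1 - cer:.0%}")
```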

My working process at the moment involves looking at an image of a Genizah document and checking to make sure the computer has "segmented" it appropriately – identified all the lines of text and individual letters on the page. Sometimes it misses words, or accidentally picks up something else. Lots of times it actually highlights the handwritten label of the fragment number! That's something that for a human is obviously not part of the original fragment text, but a computer has no way of knowing it without multiple corrections; eventually it learns that writing of that size/colour/shape should be ignored. Once a page has been segmented, we start to correct the transcriptions – matching up what the computer sees with the transcriptions we have through the PGP, which have accumulated over many years of scholarship on the Genizah. The other issue is that humans are very good at inferring things from very little context, especially trained scholars who know Judeo-Arabic writing conventions, palaeography and so on – sometimes we can guess a whole word just from half a letter! But a computer only knows what it sees, so a lot of the time we're actually removing material from the transcription that a human scholar was able to infer but that is not actually visible on the page and not accessible to a computer algorithm.
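As a purely hypothetical illustration of that correction step (the data structures, names and example strings below are invented; eScriptorium has its own internal model), the idea is to mark lines such as the modern fragment-number label as not belonging to the fragment, drop them, and pair the remaining machine-read lines with the scholar's PGP transcription to produce corrected examples for training.

```python
# Hypothetical sketch of the correction step described above; all names and
# example strings are invented for illustration.
from dataclasses import dataclass

@dataclass
class SegmentedLine:
    machine_text: str       # what the model currently reads on this line
    is_fragment_text: bool  # False for e.g. a modern fragment-number label

# A segmented page where the first "line" is really the handwritten shelfmark,
# which a human corrector marks as not part of the original text.
page = [
    SegmentedLine("T-S 13J20.5", is_fragment_text=False),  # made-up label
    SegmentedLine("first line as the computer read it", is_fragment_text=True),
    SegmentedLine("second line as the computer read it", is_fragment_text=True),
]

# The scholar's transcription from the PGP, one entry per visible line, with
# anything merely inferred (and not actually on the page) already removed.
pgp_lines = [
    "first line as the scholar transcribed it",
    "second line as the scholar transcribed it",
]

# Keep only genuine fragment lines and pair them, in order, with the PGP lines;
# these (machine reading, corrected text) pairs feed the next round of training.
kept = [line for line in page if line.is_fragment_text]
for machine, human in zip((line.machine_text for line in kept), pgp_lines):
    print(f"model read: {machine!r} -> corrected to: {human!r}")
```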

How successful is the computer so far? Is it better or worse with particular scribes? Or genres?

As you know, Genizah documents can be quite messy. In the best cases – clear writing, straight lines and so on – it does pretty well! I don't have exact numbers (Marina would know), but so far the transcriptions it's producing are already fairly close to what a human scholar has produced.
