The Hidden Challenges of DIY Machine Translation: It Isn’t Always Enough
May 15, 2024
In today’s digital age, machine translation (MT) software has become increasingly accessible, leading many to believe that translating documents is as simple as feeding them into an MT engine. However, as countless legal professionals have discovered, the reality is far more complex, highlighting the need for a more reliable solution than DIY machine translation.
Many clients initially turn to DIY online machine translation solutions or rely on the MT plug-in tools of their eDiscovery platforms, only to find themselves facing significant challenges. The problem? First and foremost, free online MT tools are not really free. The tradeoff is that those platforms may use the data you submit to train their engines. That may benefit humanity in the long run, but it can put you in breach of your client’s NDA. Free or not, DIY solutions often fall short for several reasons, which we cover below.
The first and most important challenge is handling the intricacies of document formatting, which in practice means Optical Character Recognition (OCR) technology. Not all OCR tools are created equal. As anybody in the legal field knows, much of the problem starts with the quality of the source documents. Whether they are scanned PDFs, documents with handwriting, or files with complex charts, images, and tables, stock OCR technology does not convert everything perfectly. These challenges become more complicated with languages written in non-Latin scripts, such as Korean, Arabic, Chinese, Japanese, and Russian. Merely relying on the built-in OCR capability of a free MT plug-in, without any further analysis of those conversions, will inevitably lead to subpar translations, rendering the output unreliable and unsuitable for review. This is where Divergent’s machine translation approach adds significant value. Making the source files as clean and small as possible before translation dramatically improves the final product.
Unlike some DIY approaches, which dump everything into an MT engine, Divergent’s team meticulously organizes, saves, and tracks every document submitted for MT. This level of expertise and attention to detail instills confidence in the quality of our service, ensuring that the process doesn’t end with machine translation of the OCRed files.
A variety of errors commonly occur during MT processing. While many pre-MT steps within our process help avoid these issues, some are inevitable. Some errors can be remediated simply by re-running the affected file on its own, in isolation from the rest of the batch. Others need more investigation. If a file cannot be processed (inevitably, some files won’t play nicely), we will identify it for the client at the point of delivery. These problematic files might require MT with advanced formatting (more to come on this in another piece), human translation, or a summary, all of which Divergent can provide.
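The retry-then-flag handling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Divergent’s actual tooling: `translate_one` is a hypothetical callable standing in for whatever MT engine is in use, and the function name and `max_retries` parameter are invented for the example.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def translate_with_isolated_retry(
    files: Iterable[str],
    translate_one: Callable[[str], str],  # hypothetical wrapper around an MT engine
    max_retries: int = 1,
) -> Tuple[Dict[str, str], List[str]]:
    """Translate each file, re-running failures on their own.

    Returns (translated, flagged): the successful outputs keyed by file
    name, and the files that still failed and need further investigation.
    """
    translated: Dict[str, str] = {}
    flagged: List[str] = []
    for name in files:
        # One initial attempt plus up to max_retries isolated re-runs.
        for _ in range(1 + max_retries):
            try:
                translated[name] = translate_one(name)
                break
            except Exception:
                continue  # try this file again on its own
        else:
            # Every attempt failed: report the file at delivery.
            flagged.append(name)
    return translated, flagged
```

Files that end up in `flagged` are the ones a reviewer would route to another option, such as MT with advanced formatting, human translation, or a summary.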
As is already apparent, DIY MT requires considerable time, care, and attention to get right. This becomes especially crucial when dealing with vast volumes of documents. In our experience, it takes only a few hundred pages to start encountering these hurdles. Many MT engines limit how much can be processed at a time, so files sometimes need to be batched into smaller packages. In our experience, most clients, even the largest international organizations, do not have the time, the tools, or the human resources needed to carry out the steps above, all of which help to produce significantly better MT output. Most importantly, clients who rely on MT to review and flag what is critical to their matter need to know that the MT quality is as good as possible.
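To make the batching point concrete, here is a minimal Python sketch of splitting a file list into packages that stay under an engine’s size limit. The limit, the file names, and the `batch_by_size` function are all assumptions for illustration; real engines publish their own limits, which may be expressed in characters, pages, or bytes.

```python
from typing import List, Tuple


def batch_by_size(files: List[Tuple[str, int]], max_batch_size: int) -> List[List[str]]:
    """Greedily group (name, size) pairs into batches of at most max_batch_size.

    Order is preserved. A single file larger than the limit is placed in a
    batch of its own so that it can be handled separately.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Close the current batch if adding this file would exceed the limit.
        if current and current_size + size > max_batch_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

For example, with a hypothetical limit of 100 units, `batch_by_size([("a.pdf", 40), ("b.pdf", 50), ("c.pdf", 30)], 100)` groups the first two files together and puts the third in a second batch.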
Another increasingly common issue for our clients is bilingual text, or a batch of hundreds or thousands of files that includes multiple foreign languages. We have seen firsthand that various platforms’ foreign language analysis capabilities struggle in this regard. In such situations, it is an actual person (the original AI) who makes the big difference.
Meticulously reviewing and documenting which languages have been found can help clients cut costs by deciding which languages to ignore. Just as with OCR technology, some MT engines do not cover specific languages, and here our team offers the know-how and capability to use various engines to cover all your needs. Divergent has also seen cases where files previously flagged as entirely in a foreign language turn out to be fully bilingual documents that do not require translation.
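A rough first pass at spotting mixed-language files can be sketched with Python’s standard library alone. The example below tallies letters by writing system using Unicode character names. It is a triage aid under loud assumptions, not language identification: a script is not a language (Latin covers English, French, and many others, and Han characters appear in both Chinese and Japanese), the prefix table and the `min_share` threshold are invented for illustration, and a human reviewer still makes the final call.

```python
import unicodedata
from collections import Counter
from typing import List

# Assumed mapping from Unicode character-name prefixes to script labels.
SCRIPT_PREFIXES = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "ARABIC": "Arabic",
    "HANGUL": "Hangul",
    "HIRAGANA": "Kana",
    "KATAKANA": "Kana",
    "CJK": "Han",
}


def script_counts(text: str) -> Counter:
    """Count the letters in text by writing system."""
    counts: Counter = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        for prefix, script in SCRIPT_PREFIXES.items():
            if name.startswith(prefix):
                counts[script] += 1
                break
        else:
            counts["Other"] += 1
    return counts


def dominant_scripts(text: str, min_share: float = 0.2) -> List[str]:
    """Return the scripts that make up at least min_share of the letters."""
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(s for s, n in counts.items() if n / total >= min_share)
```

A file for which `dominant_scripts` returns more than one entry is a candidate bilingual document worth a closer look before it is sent for translation.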
By choosing Divergent, clients can experience the immediate relief of outsourcing instead of using DIY MT, which can be complex and time-consuming. This decision saves clients time and effort and ensures unparalleled accuracy and reliability, providing a sense of reassurance and lightening the burden of document translation.