The seven (7) “deadly” sins of text analytics

John Martin of “BeyondRecognition” posted a couple of interesting articles on LinkedIn concerning the use of Text Analytics or Text Mining to classify files and documents.

Of course his “catch” is that one needs visual recognition as well as text-based pattern recognition; BeyondRecognition delivers visual recognition technology.
Nearly every article mentions the “problem” of “image-only” PDFs or TIFFs: when there is no text, text mining will not work. We all know, however, that it is very easy to OCR PDFs and TIFFs. One step further is image recognition within photos. Both technologies give us text and metadata to associate with the files.

Still, the articles make some good points that have to be taken into account when using text-based classification solutions:

Parts 5 through 7 are still to come…