Extract Data From Bank Statements
Extracting Data from documents, recipients or financial documents using machine learning models or algorithms is very challenging task or difficult task to accomplish. During my first job I had given a task to “Extract Data From Bank Statements.” I was working alone on this project and it was very frustrating for me to work on this kind of project without any senior supervision.
So I requested accounts department to share some pdfs of bank statements. I started working on them, tried different Open CV algorithms but no success. Did some more RnD and found a solution for a format.
So my first task was to handle the above image format.
Solution for 3 columns format
As we know that OCR extract data as single string and I was using OCR.Space. So I came with solution to classify the text. So I used “Naive Bayes classifier” to classify the text from columns.
After training data on Naive Bayes model, I got the accuracy of 95% which is incredible.
Final Results
The Next Big Challenge
So after some time when I lost my job due to corona crisis, I had plenty of time to improve this problem and I was very optimistic if I improve this problem so I could definitely get back my job. I discussed this problem with someone on linked and promised me to mentor me for this project.The next challenge was to extract data from 5 columns bank statements pdfs.
Again working alone as 1 man army working alone on this problem. Started data labeling for YOLO architecture on 1177 images. It took me few days to labels those images but when I starting training images on YOLO the loss was not going down means Gradient Descent is not working well. Another failed attempt!
Mask RCNN
The “MASK RCNN” the final solution, I would not go into the technical details of the architecture you could find other medium article which could help you to understand the mechanism. Anyhow again data labeling for MASK RCNN so I used “Via Annotator Tool” link:https://www.robots.ox.ac.uk/~vgg/software/via/via_demo.html. Now I am progressing!
Started training the images on “MASK RCNN”. So I used “Matterport Git repo” https://github.com/matterport/Mask_RCNN but I had to adjust the code according to my requirement. So here is the link to my git repo for multiple classes: https://github.com/atta007/Mask_RCNN-for-Multiple-classess.
The Magic of Mask RCNN
So finally I got the results which I was looking for.
Yes there is non maximum problem but that is not a big issue just increase the threshold value ,train or more data and I have achieved what I was looking for but still could not get the job and still searching.
This is the so far best work done by me. I hope you like my work. Please do comment for any question or suggestion. Criticism is always welcome.
Conclusion
Won the Battle but Lost the War!
Reference Links
Contact
attam059@gmail.com