As a data scientist who specialized in developing algorithms to help real estate analysts and underwriters, I’ve seen a lot of manual tasks performed in real estate. Whether it was observing the process through which clients put market data in their Excel models, or seeing them transcribe transaction data from scanned PDFs into sizers and web apps, it was always a bit painful to see how things were being done in ride-along meetings and requirement gathering calls.
The real estate industry tends to lock its data in Excel models and PDFs. Every large real estate company says they have a “massive database”, but the reality is that data is often siloed and inaccessible for analytical purposes. For most data scientists, this barrier would be a significant roadblock for analyzing data and delivering insights. I saw it as an opportunity, however, and developed AnyExtract.ai to solve the real estate industry’s siloed data problem once and for all.
Structured vs Unstructured Data
Instead of applying OCR and extracting key-value pairs, I built an algorithm that actually understands the hierarchy of the data in a document. For example, in student housing, you can have multiple households in a unit, many units within a property, and many different lease charges applied to those units and households. If you type “pdf to excel” into Google and use the first PDF to Excel converter you find to extract data from a student housing rent roll, you’d probably get a ton of key-value pairs for various residents and charge codes. But the relationships between the data wouldn’t be understood or represented in the outputs. You wouldn’t know which students are associated with which units or which charges are associated with their leases.
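To make the difference concrete, here is a minimal Python sketch of the two shapes of output. It is only an illustration: the class names, field names, residents, and amounts are all made up for this example and are not AnyExtract.ai's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical classes for illustration only (not AnyExtract.ai's real schema).
@dataclass
class Charge:
    code: str
    amount: float

@dataclass
class Household:
    resident: str
    charges: List[Charge] = field(default_factory=list)

@dataclass
class Unit:
    number: str
    households: List[Household] = field(default_factory=list)

@dataclass
class Property:
    name: str
    units: List[Unit] = field(default_factory=list)

# What a generic converter tends to produce: values with no relationships.
# Nothing here says which RENT belongs to which resident.
flat_pairs = [
    ("Unit", "101"),
    ("Resident", "A. Smith"),
    ("Resident", "B. Jones"),
    ("RENT", 850.0),
    ("RENT", 900.0),
    ("PARKING", 50.0),
]

# The same (made-up) rent roll with its hierarchy preserved.
rent_roll = Property(
    name="Example Student Housing",
    units=[
        Unit(
            number="101",
            households=[
                Household("A. Smith", [Charge("RENT", 850.0), Charge("PARKING", 50.0)]),
                Household("B. Jones", [Charge("RENT", 900.0)]),
            ],
        )
    ],
)

def charges_by_resident(prop: Property) -> dict:
    """Total charges keyed by (unit number, resident)."""
    return {
        (unit.number, household.resident): sum(c.amount for c in household.charges)
        for unit in prop.units
        for household in unit.households
    }

totals = charges_by_resident(rent_roll)
```

With the nested version, a question like "what does each resident in unit 101 owe?" is a short loop. With the flat pairs, the same question cannot be answered without guessing.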
AnyExtract.ai looks at data like a person would. When you look at a data table, you intuitively understand that the data in a column is related to the column header. If there’s a sub header, you know that the data under that sub header is related to it, and that the sub header is related to the header at the top of the document. After looking at only a few documents of a particular type, you intuitively understand how to read them, even if the formats change from template to template because you understand the data structure.
AnyExtract.ai applies a probabilistic model to build this structural understanding with a very limited training set. Instead of training on tens of thousands of templates to pick up slight variations, the AnyExtract.ai algorithm understands the structure of the data to deliver superior results after training on literally 20-30 documents. That’s a small enough number that an entry-level real estate analyst can train the algorithm to achieve 98%+ accuracy in an afternoon!
A More Usable Data Format
More importantly, the structured data unlocked by AnyExtract.ai can be delivered in a consistent JSON format that can be used to populate a database or build a data pipeline. Business logic is built in when the data is parsed, so there is no need to build heuristic models to analyze the key-value pairs and put the data in the right place. Outputs are provided with a probability too, so if there is an issue with Optical Character Recognition (OCR), spurious results can be flagged and manually addressed.
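As a rough sketch of how a client might use those probabilities, the Python snippet below splits an extract into fields that can flow straight into a pipeline and fields that need a human look. The JSON layout, the field names, and the 0.90 threshold are assumptions made for this example, not the product's documented output.

```python
import json

# Hypothetical extract; the real JSON layout may differ.
sample_extract = json.loads("""
{
  "document_type": "rent_roll",
  "fields": [
    {"name": "unit_number", "value": "101", "probability": 0.99},
    {"name": "monthly_rent", "value": "85O.00", "probability": 0.62},
    {"name": "lease_start", "value": "2022-08-15", "probability": 0.97}
  ]
}
""")

# Assumed cutoff for illustration; each team would tune its own.
REVIEW_THRESHOLD = 0.90

def split_for_review(extract, threshold=REVIEW_THRESHOLD):
    """Return (accepted, flagged) field lists based on each field's probability."""
    accepted, flagged = [], []
    for item in extract["fields"]:
        if item["probability"] >= threshold:
            accepted.append(item)
        else:
            flagged.append(item)
    return accepted, flagged

accepted, flagged = split_for_review(sample_extract)

for item in flagged:
    # e.g. an OCR slip where the letter O was read in place of a zero
    print(f"Needs review: {item['name']} = {item['value']!r} "
          f"(p={item['probability']:.2f})")
```

The point of the design is that only the flagged fields cost analyst time; everything above the threshold can be loaded without anyone opening the PDF.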
This is why we have no user interface for our product. We charge clients an implementation fee to train the algorithm on their specific document type (which again, we can do in a few hours with just a handful of documents) and to help integrate the API directly into their workflows. Instead of logging in, highlighting fields, and confirming results, clients send PDF documents to a designated email address or directly to our API and receive JSON extracts of the data in seconds.
Now at this point, if you’ve looked at other document extraction software, you’re probably thinking, “Yeah Nico, that sounds great, but how accurate is it really?” I’d answer that question with another question: “How accurate is it when your analyst opens a scanned PDF on one screen and manually types data from it into an Excel workbook?” In reality, human transcription is less accurate than you’d think, and certainly less accurate than AnyExtract.ai. I’ll support this with a story.
With one client, we had just trained the algorithm on a new document type, tested the results on our end, and embedded AnyExtract.ai into their platform. This client had to process very large volumes of a particular document type, and they were wasting a ton of time reading documents on one screen and manually typing the key information from them into spreadsheets to ingest into their accounting software.
The client wanted us to test the outputs from AnyExtract.ai against their database to judge the accuracy of the algorithm. That’s fair. So we pulled a sample of 100 files from which they had manually keyed data into their system the prior year, ran them through, and compared our outputs to their database… and the results weren’t quite what we expected.
According to that comparison, only about 80% of the documents appeared to be parsed 100% accurately. Ouch. I had tested so thoroughly that I thought for sure the results would be better than that. So naturally, I dove in and reviewed the data, comparing what was in their database to what we extracted as well as to the original documents. On the first supposedly mis-parsed file I opened, I looked at the outputs from AnyExtract.ai and saw they matched the original document perfectly. I was confused. Maybe there was an error when we tried to run the parser on that many documents? So I compared the outputs to their database, and found that their database actually had incorrect information!
That was a relief, but could have been an outlier. So I went to the next doc… same issue. And the next, and the one after that, and so on and so forth until we went through about 20 documents together with the client and realized the algorithm was actually working perfectly. At the end of the day, 98 out of the 100 documents were parsed with zero errors (only a couple of fields on the remaining 2 documents had OCR-related extraction errors), but the client’s database had typos and errors in 20% of the deals. It turns out that when a human being has to manually type in the information from thousands of documents a year, day in and day out, they tend to make mistakes. Now that lucky analyst can focus on the 2% of documents that are actually difficult to process, and the algorithm handles the rest.
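For anyone who wants to run a similar audit, the comparison boils down to something like the Python sketch below. The document IDs, field names, and values are invented for illustration; the client's real data is not shown here.

```python
def exact_match_rate(records_a, records_b):
    """Share of document IDs present in both mappings whose fields are identical."""
    shared = records_a.keys() & records_b.keys()
    if not shared:
        return 0.0
    matches = sum(1 for doc_id in shared if records_a[doc_id] == records_b[doc_id])
    return matches / len(shared)

def mismatched_fields(record_a, record_b):
    """Sorted names of fields whose values differ between two records."""
    keys = record_a.keys() | record_b.keys()
    return sorted(k for k in keys if record_a.get(k) != record_b.get(k))

# Invented sample data: five documents, keyed by a hypothetical document ID.
extracted = {
    "doc-1": {"borrower": "Acme LLC", "amount": "1250000"},
    "doc-2": {"borrower": "Birch LP", "amount": "480000"},
    "doc-3": {"borrower": "Cedar Inc", "amount": "975000"},
    "doc-4": {"borrower": "Dune LLC", "amount": "310000"},
    "doc-5": {"borrower": "Elm Partners", "amount": "2200000"},
}

# The manually keyed database, with one transposition typo in doc-2.
database = {doc_id: dict(fields) for doc_id, fields in extracted.items()}
database["doc-2"]["amount"] = "840000"

rate = exact_match_rate(extracted, database)
to_check = {
    doc_id: mismatched_fields(extracted[doc_id], database[doc_id])
    for doc_id in extracted
    if extracted[doc_id] != database[doc_id]
}
```

Here `rate` comes out to 0.8 and `to_check` points at the `amount` field of `doc-2`. Note what the numbers do and don't say: a mismatch only shows that the two sources disagree. Finding out which one is wrong still means opening the original document, which is exactly what we ended up doing.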
The Future of AnyExtract.ai
At the end of the day, I’m glad I witnessed firsthand the trials and tribulations of real estate analysts and underwriters. Seeing how they work inspired me to launch a product that has already improved the lives of so many real estate professionals, and we’re just getting started. In 2023, we are going to expand AnyExtract.ai to cover many of the documents the real estate world hates to work with: SREOs, personal financial statements, YTD statements, and more. The list goes on.
Since there are nearly infinite document types to cover, we're going to do the smart thing and let our customers guide us. If you're interested in implementing AnyExtract.ai for a particular document type, reach out and let us know: https://hellodcata.ai/contact/
AnyExtract.ai was designed to help real estate professionals save time and focus on the work they love by automating the extraction of structured data from scanned PDFs and Excel workbooks. It handles data that is usually entered by hand, such as schedules of real estate owned, personal financial statements, and tax documents, which saves money as well as time. Learn more about the product from the resources below: