Data harvesting superapp admits it struggled to wield data – until it built an LLM

Asia’s answer to Uber, Singaporean superapp Grab, has admitted it gathered more data than it could easily analyze – until a large language model and generative AI turned things around.

Grab offers ride-share services, food delivery, and even some financial services. In 2021 the biz revealed it collects 40TB of data every day. Execs have bragged that its fintech arm knows enough about its drivers that it can rate their suitability for a loan before they even bother applying.

In a Thursday blog post, the developer admitted it has sometimes struggled to make sense of all that data.

“Companies are drowning in a sea of information, struggling to navigate through countless datasets to uncover valuable insights,” the org wrote, before admitting it was no exception. “At Grab, we faced a similar challenge. With over 200,000 tables in our data lake, along with numerous Kafka streams, production databases, and ML features, locating the most suitable dataset for our Grabber’s use cases promptly has historically been a significant hurdle.”

Prior to mid-2024, Grab used an in-house tool called Hubble – built on top of the popular open source platform DataHub and utilizing open source search and analytics engine Elasticsearch – to sort through its giant data pile.

“While it excelled at providing metadata for known datasets, it struggled with true data discovery due to its reliance on Elasticsearch, which performs well for keyword searches but cannot accept and use user-provided context (ie it can’t perform semantic search, at least in its vanilla form),” Grab’s engineering blog explains.
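The gap Grab describes can be shown with a toy example. The sketch below is not Grab's code: the two-table catalog, the hand-written `CONCEPTS` map, and every function name are assumptions made up for illustration, and the concept map is only a crude stand-in for the embedding model a real semantic search would use. It shows a literal keyword match finding nothing when the searcher uses different words from the table's author.

```python
import re

# Hypothetical two-table catalog, invented for illustration.
CATALOG = {
    "driver_earnings_daily": "daily earnings per driver",
    "food_orders": "food delivery orders by merchant",
}

# Toy stand-in for an embedding model: words mapped to hand-picked concepts.
CONCEPTS = {
    "driver": "driver", "drivers": "driver", "couriers": "driver",
    "earnings": "income", "paid": "income", "pay": "income",
}

def tokens(text):
    """Lowercase the text and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def table_tokens(name):
    """Words from a table's name and its description."""
    return tokens(name) | tokens(CATALOG[name])

def keyword_search(query):
    """Match only tables that share a literal word with the query."""
    q = tokens(query)
    return sorted(n for n in CATALOG if q & table_tokens(n))

def concepts(words):
    """Map words to concepts, dropping words the toy map does not know."""
    return {CONCEPTS[w] for w in words if w in CONCEPTS}

def concept_search(query):
    """Match tables that share a concept with the query."""
    q = concepts(tokens(query))
    return sorted(n for n in CATALOG if q & concepts(table_tokens(n)))

print(keyword_search("driver earnings"))                # ['driver_earnings_daily']
print(keyword_search("how much do couriers get paid"))  # []
print(concept_search("how much do couriers get paid"))  # ['driver_earnings_daily']
```

The second query shares no words with either table, so the keyword matcher comes back empty, while the concept matcher still lands on the earnings table.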

Eighteen percent of searches were abandoned by staff users. Grab guessed the searches were abandoned because the Elasticsearch parameters provided by DataHub were not yielding helpful results.

But Elasticsearch wasn’t the only problem to blame for laborious data discovery – oodles of documentation was missing. Only 20 percent of the most frequently queried tables had any descriptions.

The developer’s data analysts and engineers were forced to rely on internal tribal knowledge in order to find the datasets they needed. Most reported it took days to find the right dataset.

Grab sought to rectify this through three initiatives: enhancing Elasticsearch; improving documentation; and creating an LLM-powered chatbot to catalog its datasets.

The Singaporean superapp enhanced Elasticsearch by boosting relevant datasets, hiding irrelevant ones, and simplifying the user interface.
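Grab has not published the exact ranking rules, so the following is only a sketch of what "boosting" and "hiding" could look like in principle. The field names (`text_score`, `monthly_queries`, `deprecated`), the log-based boost, and the sample hits are all assumptions for illustration, not Grab's or DataHub's actual configuration.

```python
import math

def rerank(hits):
    """Drop hidden tables, then sort the rest by a usage-boosted score."""
    # Hide tables flagged as irrelevant (here: a hypothetical 'deprecated' flag).
    visible = [h for h in hits if not h["deprecated"]]

    def boosted(hit):
        # Multiply the text-match score by a factor that grows slowly with usage.
        return hit["text_score"] * (1 + math.log1p(hit["monthly_queries"]))

    return sorted(visible, key=boosted, reverse=True)

# Hypothetical search hits for the query "orders".
hits = [
    {"name": "orders_tmp_backup", "text_score": 3.0, "monthly_queries": 0, "deprecated": True},
    {"name": "orders_raw", "text_score": 2.0, "monthly_queries": 0, "deprecated": False},
    {"name": "orders_curated", "text_score": 1.5, "monthly_queries": 500, "deprecated": False},
]

print([h["name"] for h in rerank(hits)])  # ['orders_curated', 'orders_raw']
```

The backup table had the best raw text match but is hidden outright, and the heavily used curated table overtakes the unused raw one.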

Eventually it brought the number of abandoned searches down to just six percent. It also built a documentation generation engine that used GPT-4 to produce labels based on table schemas and sample data. That effort increased the share of datasets with thorough descriptions from 20 to 70 percent.
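Grab's post does not reproduce its prompts, but the general idea of feeding a schema and sample data to a model can be sketched. Everything below is hypothetical: the function name, the prompt wording, and the `bookings` table are invented for illustration, and the call to the model itself is deliberately left out.

```python
def build_doc_prompt(table, columns, sample_rows):
    """Assemble a documentation prompt from a table schema and sample rows.

    columns: list of (name, type) pairs; sample_rows: list of dicts.
    """
    lines = [f"Table: {table}", "Columns:"]
    lines += [f"- {name} ({col_type})" for name, col_type in columns]
    lines.append("Sample rows:")
    lines += [", ".join(f"{k}={v!r}" for k, v in row.items()) for row in sample_rows]
    lines.append(
        "Write a one-paragraph description of this table "
        "and a one-line description of each column."
    )
    return "\n".join(lines)

# Hypothetical table, invented for illustration.
prompt = build_doc_prompt(
    "bookings",
    [("booking_id", "string"), ("fare", "double")],
    [{"booking_id": "b1", "fare": 12.5}],
)
print(prompt)
```

The resulting string would then be sent to the model, and the generated description would presumably be reviewed before being stored against the table.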

And then it built the pièce de résistance: its own LLM-powered chatbot. Called HubbleIQ, the chatbot uses an off-the-shelf search tool called Glean to draw on its newly expanded descriptions and recommend datasets to its employees.

“We aimed to reduce the time taken for data discovery from multiple days to mere seconds, eliminating the need for anyone to ask their colleagues data discovery questions ever again,” the superapp techies blogged.

The upgrades are a work in progress. Grab intends to improve the accuracy of its documentation and incorporate more dataset types into its LLM, among other initiatives.

Grab’s hyperlocalization strategy, which is enabled by its massive quantities of data, has given it the edge to know the ins and outs of Asia’s people and roads – and frankly kept the business alive.

While its 2021 IPO results were disappointing, it did run Uber out of town.

In its Q2 2024 earnings, Grab reported a record high of 41 million monthly transacting users, narrowing losses, and 17 percent revenue growth.

“Features like mapping, hyper batching and just-in-time allocation, they’re all unique to Grab and none of our competitors have that and we believe that makes us consistently more reliable as well as more affordable,” explained CEO Anthony Tan.

Consistently reliable, affordable … and drowning in datasets. ®
