Big Data is morphing into Vast Data. The next generation of the technology will lead to insights and correlations that reveal new strategies—even new business models.
Dan McCaffrey has an ambitious goal: solving the world’s looming food shortage.
As vice president of data and analytics at The Climate Corporation (Climate), which is a subsidiary of Monsanto, McCaffrey leads a team of data scientists and engineers who are building an information platform that collects massive amounts of agricultural data and applies machine-learning techniques to discover new patterns. These analyses are then used to help farmers optimize their planting.
“By 2050, the world is going to have too many people at the current rate of growth. And with shrinking amounts of farmland, we must find more efficient ways to feed them. So science is needed to help solve these things,” McCaffrey explains. “That’s what excites me.”
“The deeper we can go into providing recommendations on farming practices, the more value we can offer the farmer,” McCaffrey adds.
But to deliver that insight, Climate needs data—and lots of it. That means using remote sensing and other techniques to map every field in the United States and then combining that information with climate data, soil observations, and weather data. Climate’s analysts can then produce a massive data store that they can query for insights.
Meanwhile, precision tractors stream data into Climate’s digital agriculture platform, which farmers can then access on iPads through simple data flows and visualizations. They gain insights that help them optimize their seeding rates, soil health, and fertility applications. The overall goal is to increase crop yields, which in turn boosts a farmer’s margins.
Climate is at the forefront of a push toward deriving valuable business insight from Big Data that isn’t just big, but vast. Companies of all types—from agriculture through transportation and financial services to retail—are tapping into massive repositories of data known as data lakes. They hope to discover correlations that they can exploit to expand product offerings, enhance efficiency, drive profitability, and discover new business models they never knew existed.
The internet democratized access to data and information for billions of people around the world. Ironically, however, access to data within businesses has traditionally been limited to a chosen few, until now. Today’s advances in memory, storage, and data tools make it possible for companies both large and small to cost-effectively gather and retain a huge amount of data, both structured (such as data in fields in a spreadsheet or database) and unstructured (such as e-mails or social media posts). They can then allow anyone in the business to access this massive data lake and rapidly gather insights.
It’s not that companies couldn’t do this before; they just couldn’t do it cost-effectively or without a lengthy development effort by the IT department. With today’s massive data stores, line-of-business executives can generate queries themselves and quickly churn out results, and they are increasingly doing so in real time. Data lakes have democratized both the access to data and its role in business strategy.
Indeed, data lakes move data from being a tactical tool for implementing a business strategy to being a foundation for developing that strategy through a scientific-style model of experimental thinking, queries, and correlations. In the past, companies’ curiosity was limited by the expense of storing data for the long term. Now companies can keep data for as long as it’s needed. And that means companies can continue to ask important questions as they arise, enabling them to future-proof their strategies.
Climate’s McCaffrey has many questions to answer on behalf of farmers. Climate provides several types of analytics to farmers including descriptive services, which are metrics about the farm and its operations, and predictive services related to weather and soil fertility. But eventually the company hopes to provide prescriptive services, helping farmers address all the many decisions they make each year to achieve the best outcome at the end of the season. Data lakes will provide the answers that enable Climate to follow through on its strategy.
Behind the scenes at Climate is a deep-science data lake that provides insights, such as predicting the fertility of a plot of land by combining many data sets to create accurate models. These models allow Climate to give farmers customized recommendations based on how their farm is performing.
“Machine learning really starts to work when you have the breadth of data sets from tillage to soil to weather, planting, harvest, and pesticide spray,” McCaffrey says. “The more data sets we can bring in, the better machine learning works.”
The deep-science infrastructure already has terabytes of data but is poised for significant growth as it handles a flood of measurements from field-based sensors.
“That’s really scaling up now, and that’s what’s also giving us an advantage in our ability to really personalize our advice to farmers at a deeper level because of the information we’re getting from sensor data,” McCaffrey says. “As we roll that out, our scale is going to increase by several magnitudes.”
Also on the horizon is more real-time data analytics. Currently, Climate receives real-time data from its application that streams data from the tractor’s cab, but most of its analytics applications are run nightly or even seasonally.
In August 2016, Climate expanded its platform to third-party developers so other innovators can also contribute data, such as drone-captured data or imagery, to the deep-science lake.
“That helps us in a lot of ways, in that we can get more data to help the grower,” McCaffrey says. “It’s the machine learning that allows us to find the insights in all of the data. Machine learning allows us to take mathematical shortcuts as long as you’ve got enough data and enough breadth of data.”
Growth is essential for U.S. railroads, which reinvest a significant portion of their revenues in maintenance and improvements to their track systems, locomotives, rail cars, terminals, and technology. With an eye on growing its business while also keeping its costs down, CSX, a transportation company based in Jacksonville, Florida, is adopting a strategy to make its freight trains more reliable.
In the past, CSX maintained its fleet of locomotives through regularly scheduled maintenance activities, which prevent failures in most locomotives as they transport freight from shipper to receiver. To achieve even higher reliability, CSX is tapping into a data lake to power predictive analytics applications that will improve maintenance activities and prevent more failures from occurring.
Beyond improving customer satisfaction and raising revenue, CSX’s new strategy also has major cost implications. Trains are expensive assets, and it’s critical for railroads to drive up utilization, limit unplanned downtime, and prevent catastrophic failures to keep the costs of those assets down.
That’s why CSX is putting all the data related to the performance and maintenance of its locomotives into a massive data store.
“We are then applying predictive analytics—or, more specifically, machine-learning algorithms—on top of that information that we are collecting to look for failure signatures that can be used to predict failures and prescribe maintenance activities,” says Michael Hendrix, technical director for analytics at CSX. “We’re really looking to better manage our fleet and the maintenance activities that go into that so we can run a more efficient network and utilize our assets more effectively.”
“In the past we would have to buy a special storage device to store large quantities of data, and we’d have to determine cost benefits to see if it was worth it,” says Donna Crutchfield, assistant vice president of information architecture and strategy at CSX. “So we were either letting the data die naturally, or we were only storing the data that was determined to be the most important at the time. But today, with the new technologies like data lakes, we’re able to store and utilize more of this data.”
CSX can now combine many different data types, such as sensor data from across the rail network and other systems that measure movement of its cars, and it can look for correlations across information that wasn’t previously analyzed together.
One of the larger data sets that CSX is capturing comprises the findings of its “wheel health detectors” across the network. These devices capture different signals about the bearings in the wheels, as well as the health of the wheels in terms of impact, sound, and heat.
“That volume of data is pretty significant, and what we would typically do is just look for signals that told us whether the wheel was bad and if we needed to set the car aside for repair. We would only keep the raw data for 10 days because of the volume and then purge everything but the alerts,” Hendrix says.
With its data lake, CSX can keep the wheel data for as long as it likes. “Now we’re starting to capture that data on a daily basis so we can start applying more machine-learning algorithms and predictive models across a larger history,” Hendrix says. “By having the full data set, we can better look for trends and patterns that will tell us if something is going to fail.”
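To make the difference between keeping only alerts and keeping the full history concrete, here is a minimal sketch in Python. It is not CSX's method. The readings, the thresholds, and the function names `slope_per_day` and `needs_inspection` are hypothetical, chosen only to show how a retained history allows a trend test that a single-reading alert cannot perform.

```python
from typing import Sequence


def slope_per_day(readings: Sequence[float]) -> float:
    """Least-squares slope of one reading per day (units per day)."""
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings to fit a trend")
    mean_x = (n - 1) / 2                      # mean of day indices 0..n-1
    mean_y = sum(readings) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(readings))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def needs_inspection(
    readings: Sequence[float],
    alert_threshold: float = 90.0,   # hypothetical alert limit
    trend_threshold: float = 0.5,    # hypothetical rise per day
) -> bool:
    """Flag a wheel if any reading breaches the limit OR the history is rising."""
    over_limit = max(readings) >= alert_threshold
    rising = slope_per_day(readings) >= trend_threshold
    return over_limit or rising


# Hypothetical bearing temperatures, one per day. None breaches the alert
# limit, so an alerts-only archive would record nothing for this wheel.
history = [60.0, 61.0, 62.0, 63.0, 64.0, 65.0]
flagged = needs_inspection(history)
```

In this made-up history no single reading would raise an alert, yet the steady climb is visible once the raw values are kept. A production model would draw on far more signals than one temperature series.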
Another key ingredient in CSX’s data set is locomotive oil. By analyzing oil samples, CSX is developing better predictions of locomotive failure. “We’ve been able to determine when a locomotive would fail and predict it far enough in advance so we could send it down for maintenance and prevent it from failing while in use,” Crutchfield says.
“Between the locomotives, the tracks, and the freight cars, we will be looking at various ways to predict those failures and prevent them so we can improve our asset allocation. Then we won’t need as many assets,” she explains. “It’s like an airport. If a plane has a failure and it’s due to connect at another airport, all the passengers have to be reassigned. A failure affects the system like dominoes. It’s a similar case with a railroad. Any failure along the road affects our operations. Fewer failures mean more asset utilization. The more optimized the network is, the better we can service the customer.”
Detecting Fraud Through Correlations
Traditionally, business strategy has been a very conscious practice, presumed to emanate mainly from the minds of experienced executives, daring entrepreneurs, or high-priced consultants. But data lakes take strategy out of that rarefied realm and put it in the environment where just about everything in business seems to be going these days: math—specifically, the correlations that emerge from applying a mathematical algorithm to huge masses of data.
The Financial Industry Regulatory Authority (FINRA), a nonprofit group that regulates broker behavior in the United States, used to rely on the experience of its employees to come up with strategies for combating fraud and insider trading. It still does that, but now FINRA has added a data lake to find patterns that a human might never see.
Overall, FINRA processes over five petabytes of transaction data from multiple sources every day. By switching from traditional database and storage technology to a data lake, FINRA was able to set up a self-service process that allows analysts to query data themselves without involving the IT department; search times dropped from several hours to 90 seconds.
While traditional databases were good at defining relationships with data, such as tracking all the transactions from a particular customer, the new data lake configurations help users identify relationships that they didn’t know existed.
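One simple way to picture the search for relationships nobody defined in advance is to compare every pair of columns in a data set and keep the pairs that move together. The sketch below is a generic illustration in Python and makes no claim about FINRA's tooling. The column names and values are invented, and `pearson` and `strongest_pairs` are hypothetical helpers.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(var_x * var_y)


def strongest_pairs(
    columns: Dict[str, Sequence[float]], min_abs_r: float = 0.8
) -> List[Tuple[str, str, float]]:
    """Return column pairs whose |r| meets the cutoff, strongest first."""
    hits = []
    for (name_a, xs), (name_b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= min_abs_r:
            hits.append((name_a, name_b, r))
    return sorted(hits, key=lambda hit: abs(hit[2]), reverse=True)


# Invented columns, for illustration only.
columns = {
    "order_size": [1.0, 2.0, 3.0, 4.0, 5.0],
    "cancel_rate": [2.0, 4.0, 6.0, 8.0, 10.0],
    "hour_of_day": [3.0, 1.0, 4.0, 1.0, 5.0],
}
pairs = strongest_pairs(columns)
```

A strong correlation is only a lead for an analyst to investigate, not evidence in itself. The number of pairs also grows quadratically with the number of columns, which is one reason this kind of scan depends on inexpensive storage and compute.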
Leveraging its data lake, FINRA creates an environment for curiosity, empowering its data experts to search for suspicious patterns of fraud, market manipulation, and compliance violations. As a result, FINRA was able to hand out 373 fines totaling US$134.4 million in 2016, a new record for the agency, according to Law360.
Data Lakes Don’t End Complexity for IT
Though data lakes make access to data and analysis easier for the business, they don’t necessarily make the CIO’s life a bed of roses. Implementations can be complex, and companies rarely want to walk away from investments they’ve already made in data analysis technologies, such as data warehouses.
“There have been so many millions of dollars going to data warehousing over the last two decades. The idea that you’re just going to move it all into a data lake isn’t going to happen,” says Mike Ferguson, managing director of Intelligent Business Strategies, a UK analyst firm. “It’s just not compelling enough of a business case.” But Ferguson does see data lake efficiencies freeing up the capacity of data warehouses to enable more query, reporting, and analysis.
Data lakes also don’t free companies from the need to clean up and manage data as part of the process required to gain these useful insights. “The data comes in very raw, and it needs to be treated,” says James Curtis, senior analyst for data platforms and analytics at 451 Research. “It has to be prepped and cleaned and ready.”
Companies must have strong data governance processes, as well. Customers are increasingly concerned about privacy, and rules for data usage and compliance have become stricter in some areas of the globe, such as the European Union.
Companies must create data usage policies, then, that clearly define who can access, distribute, change, delete, or otherwise manipulate all that data. Companies must also make sure that the data they collect comes from a legitimate source.
Many companies are responding by hiring chief data officers (CDOs) to ensure that as more employees gain access to data, they use it effectively and responsibly. Indeed, research company Gartner predicts that 90% of large companies will have a CDO by 2019.
Data lakes can be configured in a variety of ways: centralized or distributed, with storage on premises or in the cloud or both. Some companies have more than one data lake implementation.
“A lot of my clients try their best to go centralized for obvious reasons. It’s much simpler to manage and to gather your data in one place,” says Ferguson. “But they’re often plagued somewhere down the line with much more added complexity and realize that in many cases the data lake has to be distributed to manage data across multiple data stores.”
Meanwhile, the massive capacities of data lakes mean that data that once flowed through a manageable spigot is now blasting at companies through a fire hose.
“We’re now dealing with data coming out at extreme velocity or in very large volumes,” Ferguson says. “The idea that people can manually keep pace with the number of data sources that are coming into the enterprise—it’s just not realistic any more. We have to find ways to take complexity away, and that tends to mean that we should automate. The expectation is that the information management software, like an information catalog for example, can help a company accelerate the onboarding of data and automatically classify it, profile it, organize it, and make it easy to find.”
Beyond the technical issues, IT and the business must also make important decisions about how data lakes will be managed and who will own the data, among other things (see How to Avoid Drowning in the Lake).
How to Avoid Drowning in the Lake
The benefits of data lakes can be squandered if you don’t manage the implementation and data ownership carefully. Deploying and managing a massive data store is a big challenge. Here’s how to address some of the most common issues that companies face:
- Determine the ROI. Developing a data lake is not a trivial undertaking. You need a good business case, and you need a measurable ROI. Most importantly, you need initial questions that can be answered by the data, which will prove its value.
- Find data owners. As devices with sensors proliferate across the organization, the issue of data ownership becomes more important.
- Have a plan for data retention. Companies used to have to cull data because it was too expensive to store. Now companies can become data hoarders. How long do you store it? Do you keep it forever?
- Manage descriptive data. Software that allows you to tag all the data in one or multiple data lakes and keep it up to date is not mature yet. We still need tools to bring the metadata together to support self-service and to automate metadata to speed up the preparation, integration, and analysis of data.
- Develop data curation skills. There is a huge skills gap for data repository development. But many people will jump at the chance to learn these new skills if companies are willing to pay for training and certification.
- Be agile enough to take advantage of the findings. It used to be that you put in a request to the IT department for data and had to wait six months for an answer. Now, you get the answer immediately. Companies must be agile to take advantage of the insights.
- Secure the data. Besides the perennial issues of hacking and breaches, a lot of data lake software is open source and less secure than typical enterprise-class software.
- Measure the quality of data. Different users can work with varying levels of quality in their data. For example, data scientists working with a huge number of data points might not need completely accurate data, because they can use machine learning to cluster data or discard outlying data as needed. However, a financial analyst might need the data to be completely correct.
- Avoid creating new silos. Data lakes should work with existing data architectures, such as data warehouses and data marts.
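The point above about measuring data quality mentions discarding outlying data. As a rough illustration of what that can look like in practice, the Python sketch below drops readings that sit far from the median, using the median absolute deviation (MAD). The sample values, the conventional 3.5 cutoff, and the function name `discard_outliers` are assumptions made for this example. A financial analyst who needs every figure to be correct would investigate such values rather than drop them.

```python
from statistics import median
from typing import List, Sequence


def discard_outliers(values: Sequence[float], cutoff: float = 3.5) -> List[float]:
    """Keep values whose modified z-score (based on the MAD) is within cutoff."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return list(values)  # no spread to judge against; keep everything
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score
    return [v for v in values if abs(0.6745 * (v - med) / mad) <= cutoff]


# Invented sensor readings with one obviously bad value.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 55.0]
cleaned = discard_outliers(readings)
```

The median is used instead of the mean so that one wild reading cannot distort the baseline it is judged against.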
From Data Queries to New Business Models
The ability of data lakes to uncover previously hidden data correlations can massively impact any part of the business. For example, in the past, a large soft drink maker used to stock its vending machines based on local bottlers’ and delivery people’s experience and gut instincts. Today, using vast amounts of data collected from sensors in the vending machines, the company can essentially treat each machine like a retail store, optimizing the drink selection by time of day, location, and other factors. Doing this kind of predictive analysis was possible before data lakes came along, but it wasn’t practical or economical at the individual machine level because the amount of data required for accurate predictions was simply too large.
The next step is for companies to use the insights gathered from their massive data stores not just to become more efficient and profitable in their existing lines of business but also to actually change their business models.
For example, product companies could shield themselves from the harsh light of comparison shopping by offering the use of their products as a service, with sensors on those products sending the company a constant stream of data about when they need to be repaired or replaced. Customers are spared the hassle of dealing with worn-out products, and companies are protected from competition as long as customers receive the features, price, and the level of service they expect. Further, companies can continuously gather and analyze data about customers’ usage patterns and equipment performance to find ways to lower costs and develop new services.
Data for All
Given the tremendous amount of hype that has surrounded Big Data for years now, it’s tempting to dismiss data lakes as a small step forward in an already familiar technology realm. But it’s not the technology that matters as much as what it enables organizations to do. By making data available to anyone who needs it, for as long as they need it, data lakes are a powerful lever for innovation and disruption across industries.
“Companies that do not actively invest in data lakes will truly be left behind,” says Anita Raj, principal growth hacker at DataRPM, which sells predictive maintenance applications to manufacturers that want to take advantage of these massive data stores. “So it’s just the option of disrupt or be disrupted.”
Read more thought-provoking articles in the latest issue of the Digitalist Magazine, Executive Quarterly.
Timo Elliott is Vice President, Global Innovation Evangelist, at SAP.
John Schitka is Senior Director, Solution Marketing, Big Data Analytics, at SAP.
Michael Eacrett is Vice President, Product Management, Big Data, Enterprise Information Management, and SAP Vora, at SAP.
Carolyn Marsan is a freelance writer who focuses on business and technology topics.