Big Data & Unemployment in the U.S.


On October 5, the U.S. Bureau of Labor Statistics announced that the country’s unemployment rate fell to 7.8 percent after the economy added 114,000 workers in August and September. Some very public figures cried foul at the news and vented on Twitter. Jim Cramer, host of CNBC’s Mad Money, also weighed in there, suggesting that SAP crunch the numbers. “Just give the payroll calc job to $SAP or $TIBX and we can get them daily,” he tweeted.

In-memory computing in the White House

Can SAP HANA accurately process national payroll data to produce U.S. unemployment figures, and do it daily? I asked a few SAP HANA experts whether the company’s in-memory database technology is up to that prodigious task.

“We can really only speculate, as nobody collects this data today,” says David Hull, senior manager, Technology & Innovation Platform Marketing at SAP Labs. “Hence the controversy over the recent numbers – they’re not based on hard data, and so they’re likely accurate within a certain margin of error.”

To clarify, Cramer suggests a model in which companies send their payroll data to an outfit like SAP, just as they are required to report the information to the IRS, only a lot more often. His tweet puts forth a daily scenario, on the logic that tracking day-to-day fluctuations in the job numbers would yield a more accurate measure of the nation’s unemployment rate (and deeper insight into the factors affecting it) than the quarterly reporting system in place today.

That means every employer in the nation would provide payroll information for all its workers every day. That’s a lot of data. To approximate just how much, Hull says you would need to visualize and estimate the data set, which might consist of:

  • An irreversible hash of the worker’s social security number (to enable per-capita tracking while preserving individual anonymity);
  • Number of hours the worker worked each week;
  • Who the worker worked for, represented by a hash of the company’s employer ID or social security number; and,
  • The worker’s age, zip code, and other relevant stats.
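As a rough illustration of the record Hull describes, here is a minimal Python sketch. The field names, the `PayrollRecord` class, and the `SECRET_KEY` value are hypothetical and not part of any SAP design. It uses a keyed hash (HMAC-SHA-256) rather than a bare hash, on the assumption that the small space of possible social security numbers would make an unkeyed hash easy to reverse by brute force.

```python
import hashlib
import hmac
from dataclasses import dataclass

# Hypothetical secret held by the data collector; an assumption, not from the article.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest of an identifier (SSN or employer ID)."""
    # Strip dashes and spaces so "000-12-3456" and "000123456" map to the same hash.
    normalized = identifier.replace("-", "").strip()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


@dataclass(frozen=True)
class PayrollRecord:
    worker_hash: str     # keyed hash of the worker's social security number
    employer_hash: str   # keyed hash of the employer ID
    hours_worked: float  # hours worked in the reporting period
    age: int
    zip_code: str


def make_record(ssn: str, employer_id: str, hours_worked: float,
                age: int, zip_code: str) -> PayrollRecord:
    """Build a record that carries hashes only, never the raw identifiers."""
    return PayrollRecord(
        worker_hash=pseudonymize(ssn),
        employer_hash=pseudonymize(employer_id),
        hours_worked=hours_worked,
        age=age,
        zip_code=zip_code,
    )


# Made-up identifiers for illustration only.
record = make_record("000-12-3456", "00-1234567", 38.5, 41, "94304")
```

Because the same worker always maps to the same hash, records can be linked across payroll periods without storing the underlying number.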

Using active-workforce numbers and student data from reporting educational institutions, Hull estimates such a database would include around 200 million people, and store and process 1KB of data per person, per payroll period.

“Unless my math is wrong, that’s about 10TB of data per year,” figures Hull. “Let’s say you want to size for three years of data, that’s 7.5TB compressed, which would require a 16-node cluster with 16TB of DRAM. We have partners that ship these today.”
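Hull’s back-of-the-envelope figures can be reproduced with a few lines of Python. Two inputs below are assumptions inferred from his numbers rather than stated in the quote: roughly weekly payroll periods (52 per year), which is what 10TB per year implies, and a compression ratio of about 4:1, which is what takes three years of raw data down to roughly 7.5TB. Decimal units (1KB = 1,000 bytes) are assumed as well.

```python
WORKERS = 200_000_000      # people in the data set (Hull's estimate)
BYTES_PER_RECORD = 1_000   # 1KB per person per payroll period (decimal KB assumed)
PERIODS_PER_YEAR = 52      # assumption: weekly payroll periods
YEARS = 3                  # sizing horizon from the quote
COMPRESSION_RATIO = 4      # assumption inferred from ~30TB raw -> 7.5TB compressed
TB = 10**12                # bytes per decimal terabyte


def raw_tb_per_year(workers: int, bytes_per_record: int, periods_per_year: int) -> float:
    """Uncompressed data volume per year, in terabytes."""
    return workers * bytes_per_record * periods_per_year / TB


yearly_tb = raw_tb_per_year(WORKERS, BYTES_PER_RECORD, PERIODS_PER_YEAR)
three_year_tb = yearly_tb * YEARS
compressed_tb = three_year_tb / COMPRESSION_RATIO

# Same record size, but one record per weekday (260 per year, also an assumption).
daily_yearly_tb = raw_tb_per_year(WORKERS, BYTES_PER_RECORD, 260)

print(f"{yearly_tb:.1f} TB per year")           # 10.4 TB per year
print(f"{three_year_tb:.1f} TB over 3 years")   # 31.2 TB over 3 years
print(f"{compressed_tb:.1f} TB compressed")     # 7.8 TB compressed
print(f"{daily_yearly_tb:.1f} TB per year if reported daily")  # 52.0 TB per year if reported daily
```

Under these assumptions the results land close to the rounded 10TB and 7.5TB in the quote, and the compressed total fits within the 16TB of DRAM Hull mentions. A truly daily feed at the same record size would be about five times larger.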

Giving Jack Welch the data he wants

So, in effect, hypothetically … Yes, SAP HANA could crunch these mighty numbers ‘til they’re manageable and meaningful. Unfortunately, persuading every U.S. business to kindly forward their payroll data to SAP so that it can test this hypothesis will take some time.

Enter Joe King, chancellor of SAP HANA Academy. King thinks there’s a way to simulate Jim Cramer’s scenario in the near term, and it goes like this:

  • Load all appropriate historical labor statistics into an SAP HANA data mart on the Experience SAP HANA site.
  • Build out a complete set of analytics on top of that data.
  • Recruit several payroll-processing firms to share their historical data.
  • With that, SAP can look at the payroll company numbers to see if they indicate or justify any reported change in labor statistics.
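The last step, checking whether payroll-company numbers support a reported change, might look something like the Python sketch below. Everything in it is hypothetical: the `MonthlyChange` fields, the tolerance, and the naive assumption that a payroll sample can be scaled up to a national figure by dividing by its workforce coverage. The numbers are placeholders, not real labor statistics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonthlyChange:
    month: str
    reported_change: int        # officially reported change in jobs
    payroll_sample_change: int  # change observed in the payroll-processor sample
    sample_coverage: float      # fraction of the national workforce in the sample


def implied_national_change(obs: MonthlyChange) -> float:
    """Scale the sample change to a national figure (naive proportional scaling)."""
    return obs.payroll_sample_change / obs.sample_coverage


def unsupported_months(observations: list[MonthlyChange], tolerance: float) -> list[str]:
    """Return months where the reported change differs from the payroll-implied
    change by more than `tolerance` jobs."""
    flagged = []
    for obs in observations:
        gap = abs(obs.reported_change - implied_national_change(obs))
        if gap > tolerance:
            flagged.append(obs.month)
    return flagged


# Placeholder values for illustration only.
observations = [
    MonthlyChange("month A", 100_000, 20_000, 0.2),  # implied 100,000: consistent
    MonthlyChange("month B", 100_000, 5_000, 0.2),   # implied 25,000: large gap
]
flagged = unsupported_months(observations, tolerance=50_000)
print(flagged)  # ['month B']
```

A real analysis would need a statistically sound way to weight the sample and set the tolerance; the proportional scaling here only stands in for that step.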

“We can then meet Jack Welch’s challenge of knowing when the announced labor statistics are not supported by other metrics,” says King. “We could then do real-time analysis.”

This story originally appeared as a blog on SAP Business Trends.