Pittsburgh Technology Council

BIG DATA - Making sense of 'Big Data' and the inherent challenges and opportunities

Article Published: December 22, 2014

Everyone is talking about Big Data, but no one can really define it. People agree on one thing: the world is generating unfathomable amounts of data, creating never-before-seen business opportunities.

Big Data cuts across every industry at every level. It is ubiquitous and evolving at a blistering speed as we generate ever-increasing amounts of data. The Pittsburgh Technology Council convened a full-day conference, “I Love It When You Call Me Big Data,” to explore the intersections of this business-changing reality. Leaders from industries as diverse as healthcare, data analytics and manufacturing provided insights through Tech Talks, panel discussions and keynote addresses.

We asked two speakers to bring their Tech Talks to the pages of TEQ. Karl Herleman of Management Science Associates details the latest tools businesses can deploy to make sense of terabytes of data.  The Pittsburgh Supercomputing Center’s Nick Nystrom reports on how the center is bringing the power of supercomputing to tame Big Data. 

We also sought out one of Pittsburgh’s top Big Data thought leaders, Bob Seiner of KIK Consulting, to provide insight on proper data governance and its importance in a world driven by enormous amounts of data.

Our Big Data coverage wouldn’t be complete without a profile of a local company that has Big Data at its core. Discover how Rhiza is making Big Data actionable for media companies and marketers like Comcast, Univision and Cox Media with powerful web-based research and sales tools.

This is just the beginning for Big Data. Here’s a look at what’s happening in Pittsburgh.

Big Data for One and All

By Karl Herleman, Management Science Associates

Companies looking for text analytics now have an alternative to proprietary providers like IBM®, SAS® and SAP. Distributed, open-source, large-scale technology ecosystems, such as Apache™ Hadoop® and the Apache UIMA™ project, are providing a platform for unstructured data management that everyone can afford and apply to everyday business problems.

What could previously be accomplished only in an academic setting with Ph.D. scientists and large computing power is now available to anyone, thanks to IBM open-sourcing its Unstructured Information Management Architecture, or UIMA. UIMA is the cornerstone of IBM’s Watson™ question-answering system that defeated Jeopardy! champions Ken Jennings and Brad Rutter in 2011.

For data management companies, this is a fantastic breakthrough that will lead to business improvements in core data warehousing practices, such as data cleansing, master data management, standardization and matching.

So how might a data management company create such a powerful and cutting-edge solution?

It begins with Apache Hadoop, a free, open-source framework for reliable, scalable, distributed computing across clusters of commodity computers. Hadoop can parallel-process petabytes of data in a short amount of time and ensures the work completes even when hardware fails.
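
To make Hadoop’s programming model concrete, here is a minimal sketch of the canonical MapReduce “word count” job, written in Java against Hadoop’s public API. The class names follow the conventional Hadoop tutorial example, and the input and output paths are supplied on the command line; this is an illustrative sketch, not production code.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs in parallel across the cluster, emitting (word, 1)
  // for every token in its slice of the input.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word across all mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework handles the rest: it splits the input across nodes, reruns failed tasks on healthy hardware, and shuffles each word to the reducer responsible for it.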

Layered onto Hadoop is UIMA, a framework for large-scale text, video and audio analytics. UIMA provides powerful abstractions that enable the assembly of a sophisticated information-processing pipeline in very little time.
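
To suggest what assembling such a pipeline looks like, here is a minimal Java sketch using UIMA together with the uimaFIT companion library. The CapitalizedWordAnnotator class and its regular expression are hypothetical stand-ins for a real analysis component.

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

public class UimaPipelineSketch {

  // Toy annotator: marks capitalized words as plain Annotations.
  // A real pipeline would declare its own type system (e.g., an Entity type).
  public static class CapitalizedWordAnnotator extends JCasAnnotator_ImplBase {
    private static final Pattern CAP = Pattern.compile("\\b[A-Z][a-z]+\\b");

    @Override
    public void process(JCas jcas) {
      Matcher m = CAP.matcher(jcas.getDocumentText());
      while (m.find()) {
        new Annotation(jcas, m.start(), m.end()).addToIndexes();
      }
    }
  }

  public static void main(String[] args) throws Exception {
    JCas jcas = JCasFactory.createJCas();
    jcas.setDocumentText("Watson defeated Ken Jennings and Brad Rutter in 2011.");

    // Run a one-stage pipeline; real pipelines chain many annotators.
    SimplePipeline.runPipeline(
        jcas,
        AnalysisEngineFactory.createEngineDescription(CapitalizedWordAnnotator.class));

    for (Annotation a : JCasUtil.select(jcas, Annotation.class)) {
      // Skip UIMA's built-in annotation that spans the whole document.
      if (a.getBegin() == 0 && a.getEnd() == jcas.getDocumentText().length()) {
        continue;
      }
      System.out.println(a.getCoveredText());
    }
  }
}

Because each annotator reads and writes a shared analysis structure (the JCas), stages can be swapped or chained without rewriting the pipeline, which is the abstraction that makes rapid assembly possible.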

All of this work takes place under an Apache license, granted by the Apache Software Foundation, the nonprofit that creates open-source tools and allows users to enhance them. By using open-source software, companies can lower their cost of software acquisition while gaining access to innovative solutions.

At the same time, the pool of people with the skills to use these tools is growing. Many universities now teach Hadoop, producing graduates with real, hands-on experience.

This perfect storm of technology and skills is setting the stage for innovation at virtually unlimited scale, ready to meet business needs for years to come.

Karl Herleman is the VP of IT, Strategy and Innovation at Management Science Associates, Inc. in Pittsburgh. He has more than 25 years of experience in IT, spanning large-scale enterprise application development and integration, middleware, product and project management, quality assurance, database and security management, business intelligence, networking, and telecommunications. He holds an MBA from Barry University, an MS in Computer Science from the University of Central Florida and a BS in Computer Science from Penn State. Karl is a Senior Member of the IEEE, a long-time member of the ACM, and was recently appointed to the Board of the Pittsburgh Data Works.

Bringing the Power of Supercomputing to Big Data

By Nick Nystrom, Ph.D., Director, Strategic Applications, Pittsburgh Supercomputing Center

The Pittsburgh Supercomputing Center (PSC), a joint project of Carnegie Mellon University and the University of Pittsburgh launched in 1986, provides uniquely capable computational, data, networking, and collaborative resources that enable researchers to analyze their most challenging datasets and perform advanced simulations. Using PSC’s resources, researchers in Pittsburgh, throughout Pennsylvania, and across the nation can unlock the full potential of their data. In this brief article, we survey three recent Big Data initiatives at PSC.

Bridges will be a unique national resource for data-intensive computation, designed both to enable applications that transform how people work with and combine big data, and to be easy to use, including for users with no prior supercomputing experience. Bridges will couple three tiers of large-memory computational nodes; innovative shared, distributed, and flash storage; and an unusually flexible software environment to integrate data-analytic capabilities in transformative ways.

Bridges will incorporate technologies proven in PSC’s Data Exacell (DXC), an accelerated pilot project to create, deploy, and test software and hardware building blocks that enable data-intensive research. Funded by the National Science Foundation (NSF) through its Data Infrastructure Building Blocks (DIBBs) program, the DXC is driven by the requirements of challenging, data-intensive applications. The DXC brings supercomputing to Big Data by integrating uniquely capable analytic engines with PSC’s innovative, high-performance data storage system and cutting-edge database technologies. The first of these analytic engines is Blacklight, which features very large coherent shared memory (two partitions of 16 terabytes each) that is ideal for genomic research, machine learning, and high-productivity programming languages. The second is Sherlock, which couples custom-built processors to a custom-built internal network to accelerate graph analytics, advancing fields such as cybersecurity, the life sciences, social networks, and unstructured text analytics.

Complementing PSC’s hardware and software resources, its multidisciplinary staff engages in research and develops algorithms and applications. The Center for Causal Modeling and Discovery of Biomedical Knowledge from Big Data (CCMD) was initiated in October through a four-year, $11M grant from the National Institutes of Health (NIH) Big Data to Knowledge (BD2K) program. Led by the University of Pittsburgh in collaboration with Carnegie Mellon University, PSC, and Yale University, the CCMD aims to extend techniques for discovering causal relationships to research on the gene pathways that drive the development of cancer, on lung disease susceptibility and severity, and on the functional connections within the human brain (the “connectome”) associated with autism and schizophrenia. The significance of the new Center is its focus on discovering cause-and-effect relationships, not merely correlations, which are necessary for making reliable predictions from Big Data. PSC will contribute to the CCMD’s software architecture and its implementation for supercomputers, which will remain conveniently accessible from users’ notebooks and even mobile devices when analyses call for large datasets or extensive computation.

Allocations on PSC’s resources are available at no charge to academic researchers. PSC also works with industry to improve competitiveness by providing computational and data resources and by partnering to develop effective data- and simulation-based solutions.

Governance of Big Data

By Robert S. Seiner, KIK Consulting & Educational Services

The governance of Big Data and its associated metadata will explode in 2015 and become an established norm. I define data governance as “the execution and enforcement of authority over the management of data,” because for governance to succeed, the appropriate influence over the management, protection and quality of the data must be exercised.

My definition is considered brutally aggressive by some and brutally honest by others. In fact, organizations either 1) embrace the “execution and enforcement” aspect because they recognize the need for teeth, or 2) temper the words out of necessity, crafting a more “touchy-feely” definition in the belief that it will be better accepted.

Organizations implement data governance in a variety of ways, from command-and-control to less-invasive approaches. Productive approaches include a focus on effective risk management, data integration, quality and analytics. Matching the discipline to the culture is paramount to success.

The management of “Big” metadata, that is, the information about the Big Data itself, including management rules, business meaning and usage, lineage and stewards, will also be vital to realizing value from your investments in Big Data. Big Data cannot be governed without metadata.
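
As a simple illustration, a catalog record for a governed data set might capture the elements above in a structure like the following sketch. The field names are hypothetical, not a prescribed standard.

import java.util.List;

// Hypothetical sketch: one catalog record per governed data set,
// capturing the metadata elements listed above.
public class DatasetMetadata {
    String datasetName;           // e.g., "customer_transactions"
    String businessDefinition;    // agreed-upon business meaning and usage
    String steward;               // the person accountable for the data
    List<String> managementRules; // rules for quality, handling and protection
    List<String> lineage;         // upstream sources this data set derives from
}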

In 2015 the term Big Data will continue to evolve away from meaning merely super-sized data sets. It will come to describe data sets in a variety of formats, available in every direction you turn, at high volume and tremendous speed, and ultimately to mean big-VALUE data: the data with the most value to the organization. Determining which data has the most value will depend on whom you ask in your organization. The word “Big” may actually be removed from the equation. Whose definition of “Big” will YOU use?

A data governance plan will therefore be necessary to execute and enforce authority over your highest-value data in a consistent manner across your organization. The cross-organizational aspect will play a big role in Big Data, forcing companies to break down silos and govern their most valuable data as a highly valued asset.

By 2015 Big Data will have infiltrated all aspects of your life. Do you travel, with a frequent-traveler ID for each plane, train and automobile trip you take? Do you buy things at stores using something other than cash, or use an advantage card when you shop? Do you have a license plate on your car, and do you drive through monitored intersections? Does your family subscribe to the triple bundle, and do you use the Internet? Do you have a publicly recorded image (e.g., a driver’s license or passport), and do you notice those little cameras not so hidden in almost every public place and many private ones? It’s all in the data.

The truth is that all this big data about me being out there probably does not bother me as much as it should. The other truth is that a lot of people are worried about this data. I tend to lead a pretty boring life, so if “they” want to observe my data, more power to them. Besides, I could not stop “them” from watching me even if I tried. I suppose I could severely shrink my global high-tech “Big Footprint,” but what fun would that be? And besides, I might not save that $.55 per gallon.

By 2015 Big Data will have permeated the news as well. Watch the evening news and the first four of five big stories are typically about data. Producers will tell you that at least one in five news stories should have a personal angle (chuckle). Four out of five doctors say this. Joe Blow leads in the polls by a percentage, with a reasonable margin of error of plus or minus some percent. This company has had its data stolen, and the adversary is hacking the data that was not stolen. It’s all in the data.

Newsworthy big data bothers me more than it should, especially when the analysis of the big data, or the combination of big data facts, appears ludicrous. Do I care that Ben Roethlisberger completes ninety percent of his passes of more than five yards in the fourth quarter of games on Thursdays in the second half of the season? All I care is that he completes his next pass and that we win enough games to make the playoffs each and every year. Or take the fact that men over the age of forty buy beer along with their diapers on Friday or Saturday evenings at Walmarts in rural areas … or something like that.

In 2015 Big Data will become so big that it will be impossible for one entity to govern it all at once. There are laws about personally identifiable information (PII), personal health information (PHI), web data, financial data, phone data, and so on. The data may be governed in pockets, and even then poorly. But who is governing those who govern the data? That is another question for another day.

Has “Big Data” become big news and a way of life for your company or organization? If it has, what is the plan? If it has not, what is the delay? I look forward to communicating more with you on this subject in 2015 and beyond through TEQ with “It’s all in the data.” Thank you.

Robert S. Seiner is a Pittsburgh-based expert in the field of data governance and an authority on Big Data and data management. Seiner is the President and Principal of KIK Consulting & Educational Services and recently authored his first book, Non-Invasive Data Governance. He speaks regularly across the globe and has hosted a monthly webinar series titled Real-World Data Governance since January 2012. Seiner has assisted many well-known organizations in the financial, education, manufacturing, insurance, energy, and government industries.

Rhiza: Making Big Data Actionable

By Matt Pross, Editor

The explosion of mobile computing and smartphone-enabled eCommerce over the last five to seven years has forced “Big Data” into our popular lexicon; however, very few businesses actually know how to get the most out of the massive amount of data now available to them. Executives know it’s there and know it can help their business, but they don’t have the faintest idea how to monetize the potential insight locked within the enigma of “Big Data.”

Shadyside-based Rhiza is bucking this trend, delivering fantastic results for its clients through data analytics platforms that not only pinpoint the important data sets but also display the information in easy-to-understand formats.

“Technologically, we are in the infancy of Big Data,” Josh Knauer, CEO of Rhiza, explained. “We have the computing power to collect tons of data, and most of the people working in this space (data geeks) are merely optimizing the processing of this data. But very few people are actually focused on making sense of the data being collected, and even fewer are using this data to make informed business decisions.

“Rhiza is one of the few companies focused on creating solutions that put massive data sets into real-world context,” Knauer continued. “In order for Big Data to be useful, it has to fade into the background. Rhiza creates tools that make data accessible to a broad audience of business users. Sales teams and marketers, for example, use our tools to create beautiful data visualizations that are customized to the specific market/demographic they’re pitching. Sales presentations, it turns out, are incredibly effective when they show something new to the buyer.”

Rhiza offers focused product lines to two distinct customer profiles: large consumer products companies (PepsiCo, Inc., The Clorox Company and Highmark Blue Cross Blue Shield) and global media companies (Cox, Comcast, Experian, Univision and Dun & Bradstreet). 

“Our technology empowers business users to find answers to some of the toughest questions a company has,” Knauer explained. “For media companies, we provide data-driven sales presentations with unique insight into the demographic composition of specific markets. Our tool then allows the sales guy to track the effectiveness of each presentation and track sales performance over time, providing measurable ROI to all our media customers.

“Rhiza has also specifically developed a product line of tools for the CMOs of large consumer brands,” he continued. “Our technology subverts the shotgun-marketing approach that so many companies use due to lack of specific customer insight. Not only does our solution for brands enable CMOs to define the profile of their optimal customer, it also pinpoints the exact media channels where they can find and influence the customer. We enable our clients to spend their advertising dollars precisely, while also boosting the effectiveness of their marketing strategies—a result that makes a real impact on the bottom line.”

As for the future value of big data in the marketplace, Knauer believes that Rhiza’s current work is a good indicator of where big data will be in the near future.

“Right now, most people and companies are fascinated by the data itself,” Knauer explained. “But companies don’t know what to expect from the data, and don’t know how to use it effectively. Imagine if you had more than just data. Imagine if you could also have suggestions for what data you should examine and then recommendations for what you should do with your findings. Everyone has data, but few organizations know what to do with it.

“Our marketing analytics tools are driving actionable decisions for our customers right now,” Knauer said. “In the next 10 years, every organization will use big data to increase profits and to increase operational efficiency. Our clients are already doing those things today.”

It’s important to mention that Rhiza’s leadership position in the big data space is far more than self-promotion. In 2010, Knauer was appointed to a working group of the President’s Council of Advisors on Science and Technology (PCAST). According to www.whitehouse.gov, “PCAST is an advisory group of the nation’s leading scientists and engineers who directly advise the President… PCAST makes policy recommendations in the many areas where understanding of science, technology, and innovation is key to strengthening our economy and forming policy that works for the American people.”

According to www.whitehouse.gov and Rhiza’s blog, Knauer’s work for PCAST has centered on data management best practices and standardization of how open data is defined, collected and published by federal agencies.

“People make better decisions when they have access to data,” Knauer explained. “Our goal is to empower our clients to make better decisions. Big Data is more than a buzzword or a concept that businesses struggle with—it’s a promise of greater revenue and greater efficiencies. So the sooner you can get past the awkward teenage phase, the better.”
