Wednesday, May 13, 2015

Blueprint: Data Management Challenges with Continuously Streaming Telemetry

by Ravi Mayuram, Senior Vice President of Products and Engineering at Couchbase

As the Internet of Things continues to take shape, millions of connected devices collect and share remote sensor data that organizations use to understand and improve business efficiencies. The process of collecting this remote sensor data, which can be anything from movement to temperature to power status, is known as telemetry, and it presents a unique challenge for organizations. As a technology, telemetry has been around for decades, but with the rise of the Internet of Things, the characteristics of telemetry data (speed, volume, and so on) have changed significantly. Whether it’s the utility industry leveraging new smart meters to improve the reliability and efficiency of smart grids in real time, the airline industry monitoring all aspects of a jet engine in motion, or carmakers providing blind spot detection, telemetry has become a primary source of Big Data. Big Data, in turn, requires new capabilities to catalog, interpret, and derive value from the data collected.

As a result, most industries that rely on telemetry are implementing new database technologies that can capture and analyze vast amounts of incoming real-time data within milliseconds. These continuous streams of bite-sized data require high-performance databases that can sustain very high throughput, maintain very low latency, and store semi-structured data in a useful and easy-to-understand format.

Challenges of Managing Continuously Streaming Telemetry Data - The Three Vs: Volume, Velocity and Variety 

The biggest challenge in managing streams of telemetry data is keeping up with the demands of “Big Data” (as the unprecedented volume of data generated today is called). Big Data has three primary characteristics: its volume (there is an awful lot of it), its velocity (how fast this volume is generated), and its variety (the very different types of data that are generated). All three create unique challenges for next-generation databases.

Velocity
First, consider the performance needed to keep up with the speed at which data is generated in a true Big Data implementation. A single sensor can generate hundreds of readings per second. Your average jet engine may have as many as 250 sensors operating simultaneously. As a result, a single component of a machine can generate tens or even hundreds of thousands of readings per second. All of the machines on a single oil rig can generate millions, if not tens of millions, of readings per second. A machine cannot wait seconds, or even milliseconds, for a data point to be collected, because it is continuously producing new data.
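To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. The 250-sensor figure comes from the paragraph above; the rate of 400 readings per sensor per second is only an assumed stand-in for "hundreds of readings per second", and the function name is illustrative.

```python
# Back-of-the-envelope telemetry rate for one jet engine.
SENSORS_PER_ENGINE = 250            # figure quoted in the article
READINGS_PER_SENSOR_PER_SEC = 400   # assumption: one value for "hundreds per second"
SECONDS_PER_DAY = 60 * 60 * 24


def readings_per_second(sensors: int, rate_per_sensor: int) -> int:
    """Total readings per second across all sensors on one machine."""
    return sensors * rate_per_sensor


engine_rate = readings_per_second(SENSORS_PER_ENGINE, READINGS_PER_SENSOR_PER_SEC)
per_day = engine_rate * SECONDS_PER_DAY

print(engine_rate)  # 100000 readings per second
print(per_day)      # 8640000000 readings per day
```

Even under these modest assumptions, a single engine produces billions of data points per day, which is why ingest latency matters so much.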

Volume
The need for scalability follows from the amount of infrastructure required to handle the volume of data generated by a Big Data implementation. It does not take long to accumulate massive volumes of data when generating millions of data points per second. Enterprises now store far more data than they used to, because a more complete data set enables earlier detection of trends as well as deeper insights.

Variety
As the type or variety of data continues to expand, organizations find they require a flexible data model. For example, when a new sensor is added to a machine or firmware is upgraded (which can be often, depending on the application), the data model changes, because the new sensor captures new information. Modern NoSQL databases that support flexible schemas take this change in stride and adapt to the new data model in real time, with no upgrade outages. The ability of a database to incorporate a variety of sensor data (new and updated) without changes to application code or relational schemas is what allows these more modern systems to keep applications agile.
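The idea can be illustrated with a minimal sketch in which a plain Python dictionary stands in for a document database. The device name, field names, and helper functions are all hypothetical; the point is only that a reading carrying a new field is stored and queried without any schema migration.

```python
import json
from typing import Optional

# Hypothetical readings from the same machine before and after a firmware
# upgrade that adds a vibration sensor.
old_reading = '{"device_id": "pump-7", "ts": 1431475200, "temperature_c": 71.5}'
new_reading = ('{"device_id": "pump-7", "ts": 1431475260, '
               '"temperature_c": 72.1, "vibration_mm_s": 3.4}')

store = {}  # stand-in for a document database, keyed by document ID


def ingest(raw: str) -> str:
    """Parse a JSON reading and store it as-is, whatever fields it carries."""
    doc = json.loads(raw)
    key = "{}::{}".format(doc["device_id"], doc["ts"])
    store[key] = doc
    return key


def average(field: str) -> Optional[float]:
    """Average a field over the documents that have it; None if none do."""
    values = [doc[field] for doc in store.values() if field in doc]
    return sum(values) / len(values) if values else None


ingest(old_reading)
ingest(new_reading)

print(len(store))                 # 2 documents, with different shapes
print(average("vibration_mm_s"))  # 3.4, computed from the one document that has it
```

Older documents simply lack the new field, so queries over that field skip them rather than fail.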

NoSQL and Real-Time Database Systems 

Because of these characteristics of today’s telemetry, more and more organizations are turning to NoSQL databases. These next-generation solutions deliver the performance and scale needed to meet the challenges of today’s telemetry streams. They also deliver significant developer agility, a requisite for today’s telemetry applications. Telemetry data is semi-structured, making it difficult to force into the highly structured format of a relational database. In contrast, instead of the rows and columns used in relational models, document-oriented NoSQL databases store data as JSON documents. JSON, or JavaScript Object Notation, is a lightweight, developer-friendly data-interchange format that is easy to parse and generate. The database is schema-less, so the data model is flexible, fast to read and write, and easy to scale across commodity servers. This makes it an ideal format for telemetry data.

The natively distributed nature of NoSQL technology is another advantage when massive amounts of data are being stored and accessed. Data is spread across multiple processing nodes and often across multiple servers. As a result, NoSQL databases are referred to as horizontally scalable systems. This means that if you need more database capacity, you don’t have to buy a more powerful hardware system. You simply add one more server (commodity-class hardware will do) to your database server farm, and add further servers as needed. You can do this based on immediate need, without planning far in advance or procuring sophisticated hardware. This ability to scale the system elastically in real time is a distinct advantage for agile businesses that cannot predict data (and business) growth a priori.
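One common way to get this kind of elasticity is to hash each key into a fixed number of partitions and to move only some partitions when a server joins. The sketch below illustrates that general technique under assumed names and an assumed partition count of 64; it is not a description of how any particular NoSQL product distributes data.

```python
import hashlib
from collections import Counter
from typing import Dict, List

NUM_PARTITIONS = 64  # assumption: a small fixed partition count for illustration


def partition_for(key: str) -> int:
    """Map a document key to a partition; this never changes as nodes are added."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


def initial_assignment(nodes: List[str]) -> Dict[int, str]:
    """Spread partitions round-robin over the starting nodes."""
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}


def add_node(assignment: Dict[int, str], new_node: str) -> Dict[int, str]:
    """Return a new map in which new_node takes a fair share from loaded nodes."""
    updated = dict(assignment)
    node_count = len(set(updated.values()) | {new_node})
    target = NUM_PARTITIONS // node_count
    owned = Counter(updated.values())
    taken = 0
    for p in sorted(updated):
        if taken == target:
            break
        owner = updated[p]
        if owned[owner] > target:  # only take from nodes above the fair share
            owned[owner] -= 1
            updated[p] = new_node
            taken += 1
    return updated


before = initial_assignment(["node-a", "node-b", "node-c"])
after = add_node(before, "node-d")
moved = sum(1 for p in before if before[p] != after[p])

print(moved)  # 16 of 64 partitions move; the rest stay where they were
```

Because keys map to partitions rather than directly to servers, adding a fourth server here relocates only a quarter of the partitions, and existing servers keep serving the rest without interruption.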

NoSQL solutions are also ideal for the kind of analysis required by Hadoop-based machine learning, a critical form of advanced analytics that’s capable of identifying deep insights from Big Data sources like telemetry. NoSQL databases provide the best of both worlds when it comes to analytics, since NoSQL enables organizations to reap the benefits of real-time analysis while simultaneously feeding the company’s Hadoop platform the data it needs for deeper, offline analysis.

A high performance, distributed NoSQL database can store raw and/or processed data from stream processors; it can further process the data by sorting, filtering, and aggregating it; and it enables external applications to access raw and processed data via live dashboards for real-time monitoring and visualization. For real-time analysis, sensor data can be ingested as messages for a stream processor to analyze in real-time. The analysis can be written to a NoSQL database for display by live dashboards so enterprises can take immediate action. At the same time the stream processor writes the analysis to the NoSQL database, it continues to write raw sensor data to Hadoop for offline analysis. This allows enterprises to combine real-time analytics and data access with offline analytics to improve both short-term and long-term operations.
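The flow described above can be sketched in a few lines. Here a list stands in for the Hadoop archive and a dictionary stands in for the NoSQL database behind the dashboards; the message format, device names, and function name are assumptions made for illustration, and a real deployment would use a message queue and a stream-processing framework rather than a simple loop.

```python
import json
from collections import defaultdict

raw_archive = []      # stand-in for Hadoop: every raw reading, kept for offline analysis
dashboard_store = {}  # stand-in for the NoSQL database read by live dashboards

# Running per-device totals kept by the (hypothetical) stream processor.
_totals = defaultdict(lambda: {"count": 0, "sum": 0.0, "max": float("-inf")})


def process(message: str) -> None:
    """Handle one sensor message: archive the raw data, update the live view."""
    reading = json.loads(message)
    raw_archive.append(reading)  # raw path, for offline analysis

    stats = _totals[reading["device_id"]]
    stats["count"] += 1
    stats["sum"] += reading["value"]
    stats["max"] = max(stats["max"], reading["value"])

    # Processed path: overwrite the per-device summary the dashboard reads.
    dashboard_store[reading["device_id"]] = {
        "count": stats["count"],
        "mean": stats["sum"] / stats["count"],
        "max": stats["max"],
    }


messages = [
    '{"device_id": "turbine-1", "value": 10.0}',
    '{"device_id": "turbine-1", "value": 14.0}',
    '{"device_id": "turbine-2", "value": 7.0}',
]
for m in messages:
    process(m)

print(dashboard_store["turbine-1"])  # {'count': 2, 'mean': 12.0, 'max': 14.0}
print(len(raw_archive))              # 3
```

The dashboard side holds only small, frequently overwritten summaries, while the archive grows with every reading, which mirrors the split between real-time and offline analysis.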

Conclusion

Continuously streaming telemetry data has the potential to transform industries from healthcare, to energy, to transportation. But in order to maximize its potential, we need the right Big Data architecture in place. The combination of NoSQL and Hadoop gives organizations a modern data architecture that can continuously analyze streaming data, make smart decisions in milliseconds, and intelligently evolve system behavior to respond to events happening in real-time.

About the Author
Ravi Mayuram is the Senior Vice President of Products and Engineering at Couchbase. He leads product development and delivery of the company’s NoSQL offerings. He was previously with Oracle, where he was a senior director of engineering leading innovations in the areas of recommender systems and social graph, search and analytics, and lightweight client frameworks. He was also responsible for kickstarting the cloud collaboration platform. Ravi has also held senior technical and management positions at BEA, Siebel, Informix and HP, in addition to a couple of startups including BroadBand office, a Kleiner Perkins funded venture. Ravi holds an MS in Mathematics from the University of Delhi.

About Couchbase
Couchbase delivers the world’s highest performing NoSQL distributed database platform. Developers around the world use the Couchbase platform to build enterprise web, mobile, and IoT applications that support massive data volumes in real time. All Couchbase products are open source projects. Couchbase customers include industry leaders like AOL, AT&T, Bally’s, Beats Music, BSkyB, Cisco, Comcast, Concur, Disney, eBay, Intuit, KDDI, Nordstrom, Neiman Marcus, Orbitz, PayPal, Rakuten / Viber, Ryanair, Tencent, Verizon, Wells Fargo, Willis Group, as well as hundreds of other household names. Couchbase investors include Accel Partners, Adams Street Partners, Ignition Partners, Mayfield Fund, North Bridge Venture Partners, and West Summit. www.couchbase.com

