Big Data News – 24 Sep 2015

Top Stories
Volkswagen CEO Martin Winterkorn resigns, saying the company needs new leadership to win back trust.

Researchers in the US and China explore rice genomes with AWS analytics tools to develop drought- and disease-resistant crops.

Excerpt: There are four key data visualization techniques used by data analysis pros in government and local law enforcement. As financial institutions, e-commerce organizations and social network analysts begin to apply data visualization more frequently, these techniques will help guide the process of uncovering meaningful insights hidden within mountains of disparate data. This post focuses on advanced data visualization using relationship graphs. In our last post ("Four Key Data Visualization Techniques Used by the Pros"), we mentioned four important techniques in data visualization: 1) data preparation and data connectivity, 2) data profiling, 3) advanced analysis using relationship graphs, and 4) annotation, collaboration and presentation.

At its .conf2015 users conference in Las Vegas yesterday, operational intelligence specialist Splunk took the wraps off a new version of its Splunk Enterprise platform and a new premium offering, Splunk IT Service Intelligence. Splunk Enterprise 6.3 — designed for on-premises, cloud or hybrid deployment — is focused on enhancements to performance and total cost of ownership as well as high-volume event collection for DevOps and Internet of Things (IoT) devices. In many cases, says Clint Sharp, Splunk director of product management, Big Data & Operational Intelligence, the hardware cost of a Splunk Enterprise 6.3 deployment can be cut in half compared with Splunk Enterprise 6.0.

An article in The New York Times touched off a spirited debate about the work culture at Amazon, now the most valuable retailer in the country. One of the central issues at hand is that Amazon uses data not only to provide an exceptional customer experience, but also to manage its staff and improve productivity. To many of us, this news comes as no surprise. For years now, companies of all sizes have been increasingly turning to data analytics to improve employee engagement and performance.

IBM has written a new eight-page white paper, "IBM Storage with OpenStack Brings Simplicity and Robustness to Cloud," reviewing the increasingly popular OpenStack cloud platform and the capabilities that IBM storage solutions provide to enable and enhance OpenStack deployments.

At this year's nginx.conf, Nginx has announced a preview of nginScript, a JavaScript-based server configuration language. Meant to accompany existing scripting offerings like Lua, nginScript will give technologists with experience in JavaScript a lower barrier to entry to create more advanced configuration and delivery options.

In this book excerpt published on, I write about why taking a siloed approach to creating a BI architecture framework leads to problems. The excerpt is from chapter 4 of my book Business Intelligence Guidebook: From Data Integration to Analytics. In the chapter, I provide insight into how the BI environment in many organizations has been waylaid by the siloed approach to IT and application development. I also explain the benefits of a comprehensive and well-planned BI architecture strategy, and list the four architectural layers of a BI framework. View the excerpt on

Swiss Postal Services has used scaled Scrum with seven teams to replace a legacy system. InfoQ interviewed Ralph Jocham about how they scaled Scrum, how they dealt with legacy issues using a definition of done, how they managed to deliver their system three months earlier than planned, and the main lessons from the project.

Today a study will come out saying that Spark is eating Hadoop — really! That's like saying SQL is eating RDBMSes or HEMIs are eating trucks. Spark is one more execution engine on an overall platform built of various tools and parts. So, dear pedants, if it makes you feel better, when I say "Hadoop," read "Hadoop and Spark" (and Storm and Tez and Flink and Drill and Avro and Apex and …). The major Hadoop vendors say Hadoop is not an enterprise data warehouse (EDW) solution, nor does it replace EDW solutions. That's because Hadoop providers want to co-sell with Teradata and IBM Netezza, despite hawking products that are increasingly eating into the market established by the big incumbents.

Ryan Polk talks about the future direction of the Rally product and the merger with Computer Associates.

As infrastructure becomes code, reviewing (and testing) provides the confidence necessary for refactoring and fixing systems. Reviews also help spread consistent best practices throughout an organization and are applicable where testing might require too much scaffolding.

Puppet Labs' latest version of Puppet Enterprise – version 2015.2 – includes new features like node graph visualization, inventory filtering, and a VMware vSphere module. It provides users with major enhancements of the Puppet Language and an updated web UI. InfoQ spoke with Michael Olson, Senior Product Marketing Manager at Puppet Labs.

Microsoft announced new tie-ups in China on the same day that the country's President Xi Jinping and a delegation visited its campus at Redmond, Washington. The deals with Chinese companies and government institutions will likely give Microsoft greater access to the country's large market. Other companies like Cisco Systems and Hewlett-Packard have also announced ties with Chinese companies, a market that has been proving complex for U.S. companies because of the government's strong backing for local players. Microsoft, for example, announced an agreement with its cloud partner in Beijing, 21Vianet, and IT company Unisplendour to provide custom hybrid cloud solutions and services to Chinese customers, particularly state-owned enterprises.

Optimizing queries in Splunk's Search Processing Language is similar to optimizing queries in SQL. The two core tenets are the same: change the physics, and reduce the amount of work done. Added to that are two precepts that apply to any distributed query.
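A minimal sketch of the second tenet, reducing work by filtering as early in the pipeline as possible (the event shape and function names here are illustrative Python, not SPL):

```python
# Sketch: the "reduce work early" principle behind SPL (and SQL) optimization.
# Filtering before an expensive per-event transformation does strictly less
# work than transforming everything and filtering afterwards.

def expensive_parse(event, counter):
    counter[0] += 1          # track how much work we actually did
    return {"host": event["host"], "bytes": event["bytes"] * 2}

def late_filter(events):
    counter = [0]
    parsed = [expensive_parse(e, counter) for e in events]
    return [p for p in parsed if p["host"] == "web01"], counter[0]

def early_filter(events):
    counter = [0]
    kept = [e for e in events if e["host"] == "web01"]   # filter first
    return [expensive_parse(e, counter) for e in kept], counter[0]

events = [{"host": f"web{i % 3:02d}", "bytes": i} for i in range(300)]
late, late_cost = late_filter(events)
early, early_cost = early_filter(events)
assert late == early            # same answer
assert early_cost < late_cost   # for a fraction of the work
```

The distributed-query analogue is pushing that filter down to the indexers so that less data crosses the network at all.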

Today Google is releasing Google Cloud Dataproc as a beta service. Cloud Dataproc gives you anytime access to super-fast, simple yet powerful, managed Spark and Hadoop clusters.

A surprisingly common theme at the Splunk Conference is the architectural question, "Should I push, pull, or search in place?"

If you could handle all of the data you need to work with on one machine, then there is no reason to use big data techniques. So clustering is pretty much assumed for any installation larger than a basic proof of concept. In Splunk Enterprise, the most common type of cluster you'll be dealing with is the Indexer Cluster.
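As a rough sketch of what clustering buys you: each bucket of indexed data is copied to several peer nodes, so a single node failure loses no data. The round-robin placement below is a toy policy for illustration, not Splunk's actual bucket-assignment algorithm.

```python
# Toy sketch of replicated bucket placement, the core idea of an indexer
# cluster. Peer names and the round-robin policy are illustrative only.
from itertools import cycle

def place_buckets(buckets, peers, replication_factor):
    """Assign each bucket to `replication_factor` distinct peers."""
    assert replication_factor <= len(peers)
    ring = cycle(range(len(peers)))
    placement = {}
    for b in buckets:
        start = next(ring)
        placement[b] = [peers[(start + k) % len(peers)]
                        for k in range(replication_factor)]
    return placement

peers = ["idx1", "idx2", "idx3", "idx4"]
placement = place_buckets(["b0", "b1", "b2"], peers, replication_factor=2)
# Every bucket lives on two distinct peers, so any single peer can fail
# without data loss.
for copies in placement.values():
    assert len(set(copies)) == 2
```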

During the second technical keynote at SpringOne2GX last week Guillaume Laforge talked about plans for Groovy 2.4.x and 2.5. Perhaps the most significant is improved compiler performance with a new Abstract Syntax Tree (AST) class reader in place of using class loading tricks.

Apple's acquisition of the mapping visualization startup Mapsense adds to the growing momentum behind analyzing and rendering location data. According to multiple reports, Apple (AAPL) acquired the San Francisco-based startup recently for an estimated $30 million. Mapsense was founded in 2013 by Erez Cohen, who previously worked for data analytics specialist Palantir Technologies. The Mapsense platform and accompanying developer tools focus on helping users ingest and analyze huge volumes of location data to make sense of "geotagged data." The platform also includes a high-speed mapping engine designed to ingest, index, search and filter up to 1 terabyte of location data.

Unprecedented opportunities await enterprises that are involved with the Internet of Things, but only if they apply analytics to their production or operational processes with Predictive Asset Optimization. This addition can help prevent costly delays, maximize assets and improve the consumer experience in the long run.

As enterprises embrace NoSQL and other emerging database technologies, they are encountering familiar teething problems as they attempt to scale out and deal with stress points like soaring costs for hardware like fast media storage. In the thick of the enterprise shift to NoSQL is the Apache Cassandra database and leading platform vendor DataStax. Network partner Datagres, the file-level storage intelligence specialist, this week rolled out an automated tiered storage package for the DataStax enterprise implementation of Cassandra that leverages fast storage media like flash and SSDs. The goal is to boost database performance that is slowing as data volumes grow.

Presto, an open source distributed SQL query engine for big data initially developed by Facebook, enables you to easily query your data on Hadoop in a more interactive manner. The engine not only allows you to query data on multiple Hadoop distributions using standard ANSI SQL, but also other data sources such as Cassandra, MySQL and PostgreSQL. Teradata, a leading vendor in data management and analytic software, is now contributing code to advance Presto and increase enterprise adoption of the open source platform, and is also providing full enterprise support for Presto. Download a free open source copy and begin to gain more insight from your Hadoop data today.

The critically acclaimed Predictive Analytics Innovation Summit returns to Chicago on November 11 & 12 at the Hyatt Regency McCormick Place. Check out the diverse schedule here. This summit will gather the industry's leading analytics executives to offer insights through keynote presentations, dynamic workshops and interactive panel sessions.

By using data analytics to provide personalized patient care, you can improve your patients' health and reduce their office visits.

Webinar: Fostering a Data-Driven Culture with Interactive Data Analysis in Slack Effective data-driven organizations make data accessible and easy to analyze. However, for many, data is difficult to obtain and sharing analysis can be cumbersome. Panoptez, a platform for direct interactive data analysis within the enterprise chat platform Slack, makes sharing data analysis uncomplicated and enjoyable. Slack has become an ideal app for ad hoc data analysis due to its transparent yet secure communication channels, simple file sharing system, comprehensive tool integration, and powerful search operators. With so much data and an interface to interact with machines available, all that is missing is a platform to manage the computations.

In part one of this blog series, I talked about changing your IT culture to better support self-service BI and data discovery. Absolutely essential. However, your work is not done! Self-service BI and data discovery will rapidly expand the number of users of your BI solutions. Yet all of these more casual users

Two of the developers behind the KVM and OSv projects have now released and open-sourced a direct replacement for the Apache Cassandra NoSQL database that they say is an order of magnitude faster. ScyllaDB is meant as a substitute for Cassandra in the same way that MariaDB can be swapped in for MySQL without blinking. ScyllaDB is written in C++ as opposed to Cassandra's Java, and its creators, Avi Kivity and Dor Laor, claim its sharded architecture provides the parallelism and speed-up on a single computer that was previously only available in a cluster.
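The sharded architecture can be sketched as hash-routing every key to a fixed, core-owned shard, so cores never contend for the same data and need no locks between them. The toy below only illustrates that routing idea; ScyllaDB's real engine is C++ on the Seastar framework.

```python
# Toy sketch of a shard-per-core design: each key deterministically
# belongs to exactly one shard, so shards never contend for the same data.
import hashlib

NUM_SHARDS = 4  # pretend one shard per CPU core
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(key: str) -> int:
    # Stable hash, so the same key always routes to the same shard
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Ada"})
assert get("user:42") == {"name": "Ada"}
# The key is owned by exactly one shard
assert sum(len(s) for s in shards) == 1
```

Because ownership is deterministic, each core can run its shard to completion with no cross-core synchronization, which is where the claimed single-machine parallelism comes from.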

As the "tsunami of data and information" flooding organizations threatens to become overwhelming, many companies–particularly telecoms–need a solution that handles enormous data volumes with stellar performance and cost-efficiency. See how one company uses predictive analytics to tame this tsunami while improving network insight.

Helping businesses stay competitive is the impetus behind the Top Recurring Revenue Metrics infographic below from our friends at Aria Systems.

Analytic insights are no longer just for data analysts and technologists. A movement to make these insights available to anyone using the latest design concepts is helping to create simpler, more intuitive user experiences.

Your data represents business value that your organization can't afford to ignore. In this webcast, listen as two chief data officers discuss how you can position the CDO role for success and help your organization provide true value to customers.

The Omni Parker House in downtown Boston has been home to some great stories.  Charles Dickens gave his first American reading of A Christmas Carol there in 1867, and Jack Kennedy proposed to Jackie O at Table #40 in Parker's Restaurant in 1953. Yesterday Sales Director David France and I were there to share the Attunity story with CIOs, CTOs and other local tech decision makers at the CIOarena event.  

Big Data has changed the way that we use and manage data. With more data than ever before in higher velocities from more sources across the organisation, enterprises simply can't afford to miss business opportunities due to time spent "data wrangling" for useful nuggets in their data. Many enterprises have outgrown the traditional Operational Data Store (ODS) and instead are looking for a new way forward. We can help.

IT is faced with enormous challenges in delivering data to the enterprise. Data is growing exponentially while IT budgets are staying flat. It's nearly impossible to invest in infrastructure at the same rate as data growth. The lines of business (LOBs) expect the right data at the right place at the right time so that they can extract value from the data as quickly as possible. In this virtual series, you'll learn how to pinpoint the best combination of hardware, software, and services for each step of the way. And you'll leave armed with ideas and vital information for building an impactful big data and analytics game plan.

Strata + Hadoop World 2015 brings together technology masters next week in New York to consider the future of data and machines. It is a tricky subject. Ubiquitous data collection and ever-smarter machines can make us all feel grim and dumb. But there is room to be light-hearted and intelligent. Here are three big show themes that underscore how data and software will improve how we live, work and play.

Manufacturing made the US a world power in the 19th Century.  Today we share the industrial stage, and experts disagree on our standing. Amidst all the offshoring, onshoring and reshoring, one trend is clear: with wages converging globally, the supply of digital skills has a big say in what jobs are being created, and where.

When it comes to building things, data is everything.  It improves production and supply chains, and creates new possibilities for custom products. Accordingly, manufacturers hire more analytics experts and use more advanced tools than other industries as they look to harness emerging opportunities for the Internet of Things and 3D printing. The challenge is integrating all those digits across platforms and preparing them for meaningful analysis.

Moving data is tricky!

In your business and every other, efficient management of assets matters. It matters to Colmobil, the #1 importer and distributor of cars, trucks and buses in Israel. Vehicles must go from Point A to Point B with minimal time and expense.

Dolby Laboratories is dedicated to advancing the science of sound and has transformed the entertainment experience for 50 years. It began with Dolby noise reduction for tape recordings, and continues through today with many groundbreaking technologies in cinemas, professional recording studios, video games, and mobile media to name a few.

eTix, the largest ticketing service provider in North America, decided to implement real-time analytics in the cloud rather than on-premises. And they chose Attunity CloudBeam and Amazon Redshift to make it happen.

by Andrie de Vries Every once in a while I try to remember how to do interpolation using R. This is not something I do frequently in my workflow, so I do the usual sequence of finding the appropriate…
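For anyone in the same boat, the operation in question (R's approx(), or numpy.interp) is simple linear interpolation, small enough to sketch from scratch in plain Python:

```python
# Minimal linear interpolation, the same idea as R's approx(x, y, xout)
# or numpy.interp. Assumes xs is sorted ascending; values outside the
# range clamp to the endpoints, as numpy.interp does by default.
from bisect import bisect_right

def interp(x, xs, ys):
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

xs = [0.0, 1.0, 2.0]
ys = [0.0, 10.0, 40.0]
assert interp(0.5, xs, ys) == 5.0
assert interp(1.5, xs, ys) == 25.0
```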

Getting insights out of big data is typically neither quick nor easy, but Google is aiming to change all that with a new, managed service for Hadoop and Spark. Cloud Dataproc, which the search giant launched into open beta on Wednesday, is a new piece of its big data portfolio that's designed to help companies create clusters quickly, manage them easily and turn them off when they're not needed. Enterprises often struggle with getting the most out of rapidly evolving big data technology, said Holger Mueller, a vice president and principal analyst with Constellation Research. "It's often not easy for the average enterprise to install and operate," he said. When two open source products need to be combined, "things can get even more complex."

Guest infographic, originally posted on Udemy. It explains the concept, alternatives and career advice.


Learn how your feedback in hands-on sessions at Insight 2015 will help IBM create products designed to meet your specific needs while making analytics available to everyone in business. It's all part of IBM Design Thinking, which focuses on users first.

Why Corporate Leadership Should Care About Talent Analytics: talent analytics have a vital role to play, and management should utilize them. (Published 2015-09-23; tags: Analytics, Human Capital, Human Capital & Careers, Human Resources, Predictive Analytics.)

Sanjiv Augustine wrote Scaling Agile: A Lean JumpStart, a short and informative book on scaling Agile methods. It covers an essential set of Lean building blocks as a starting foundation for larger Agile scaling frameworks, including the Scaled Agile Framework (SAFe), Large-Scale Scrum (LeSS), and Disciplined Agile Delivery (DAD).

Guest blog post by Kyvos Insights. The HDFS design is based entirely on the design of the Google File System (GFS). Its implementation addresses a number of problems that are present in other distributed filesystems such as the Network File System (NFS). Specifically, the problems that the HDFS implementation addresses are…

As announced at CppCon, Bjarne Stroustrup and Herb Sutter have started working on a set of guidelines for modern C++. The goal of this effort is to improve how developers use the language and to help ensure they write code that is type safe, has no resource leaks, and is as free as possible of programming logic errors.

In this contributed article, Prat Moghe, CEO and founder of Cazena, dives into the different ways companies use data lakes, with real-world examples. Prat will share practical issues to consider upfront, as well as hidden gotchas that can drain the success out of a project.

By: Eric Siegel, Founder, Predictive Analytics World In anticipation of his upcoming conference presentation, Predictive Analytics for Project Management – Cost Avoidance, at Predictive Analytics World Boston, Sept 27-Oct 1, 2015, we asked Scott Lancaster, Vice President at State Street Corp., a few questions about his work in predictive analytics. Q: In your work with predictive analytics, what behavior do your models predict? A: I use the Putnam model for estimating project cost/effort, duration, size, and productivity at a certain level of quality. This model is used for project management and is based on the Rayleigh distribution curve. Q: How does predictive analytics deliver value at your organization?
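For readers unfamiliar with it, the Rayleigh curve behind the Putnam model describes effort that ramps up, peaks, and tails off over a project's life. A sketch of the standard form, where K is total effort and t_d is the time of peak staffing (these parameter names are the conventional ones from the literature, not anything specific to State Street's models):

```python
# Rayleigh staffing curve used by the Putnam model.
# Cumulative effort:  E(t)  = K * (1 - exp(-a * t^2))
# Staffing rate:      E'(t) = 2 * K * a * t * exp(-a * t^2)
# which peaks at t = t_d when a = 1 / (2 * t_d^2).
import math

def cumulative_effort(t, K, t_d):
    a = 1.0 / (2.0 * t_d ** 2)
    return K * (1.0 - math.exp(-a * t * t))

def staffing_rate(t, K, t_d):
    a = 1.0 / (2.0 * t_d ** 2)
    return 2.0 * K * a * t * math.exp(-a * t * t)

K, t_d = 100.0, 12.0   # e.g. 100 person-months total, peak staffing at month 12
# Nearly all effort has been expended well past the peak...
assert cumulative_effort(5 * t_d, K, t_d) > 0.99 * K
# ...and the staffing rate really does peak at t_d
assert staffing_rate(t_d, K, t_d) > staffing_rate(t_d - 1, K, t_d)
assert staffing_rate(t_d, K, t_d) > staffing_rate(t_d + 1, K, t_d)
```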

In this special guest feature, Opher Kahane, CEO and Co-founder of Origami Logic, highlights five key areas in which data professionals must work closely with their marketing counterparts to truly realize the promise of Big Data.

HP Enterprise will offer a portfolio of IT operations management, security and other products. But it plans to differentiate all of them by harnessing the power of its HP Haven big data platform.

Cast your mind back to the early 2000s, when the chatter amongst those in the energy industry most likely focused on 'keeping the lights on' in the face of growing demand for power versus available generation. There would also have been some talk of saving the planet by lowering carbon emissions. And while these remain key objectives for energy providers in many countries, it has become clear that the industry needs something more. Why? Because those energy firms that have made the most progress have destroyed their own businesses in the process. Ironic, isn't it? The fact is, legacy generation revenues are not what they used to be, and they're no longer exclusive to the 'old-school' utilities either.

Apache Spark, running on Hadoop, is great for processing large amounts of data quickly, but wouldn't it be even better if you could process data in real time? If your business depends on making decisions quickly, you should definitely consider the MapR distribution, which ships the complete Spark stack, including Spark Streaming.
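Spark Streaming's "real time" is, strictly speaking, micro-batching: the stream is chopped into short time slices and each slice is processed with ordinary batch logic. A toy sketch of that idea in plain Python (this is the concept only, not the Spark API):

```python
# Toy illustration of micro-batching, the model behind Spark Streaming:
# an unbounded stream is cut into fixed-size batches, and each batch is
# processed with ordinary batch logic (here: a running event count).
from collections import Counter

def micro_batches(stream, batch_size):
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

totals = Counter()
stream = ["error", "ok", "ok", "error", "ok", "timeout", "ok"]
for batch in micro_batches(stream, batch_size=3):
    totals.update(batch)     # batch logic applied per time slice

assert totals == Counter({"ok": 4, "error": 2, "timeout": 1})
```

In Spark the slices are defined by a batch interval (seconds, not counts) and each slice becomes an RDD, but the processing model is the same.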

The word "metadata" has different meanings for different people. Most people think of this as the embodiment of big brother grabbing information about everything we do and say. More fundamentally, metadata is really data that describes other data. In essence, it allows for quicker insight or easier interpretation of the data than one might get from analyzing all of the data at an atomic level.
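As a concrete illustration of metadata enabling quicker insight than atomic-level analysis, consider per-block min/max summaries that let a query skip blocks that cannot possibly match, the same idea as zone maps and partition pruning. A hedged sketch:

```python
# Sketch: metadata (a per-block min/max pair) answers "which blocks could
# contain values above a threshold?" without touching the underlying rows.
blocks = [
    list(range(0, 100)),      # block 0
    list(range(100, 200)),    # block 1
    list(range(200, 300)),    # block 2
]
# The metadata is tiny compared to the data it describes
metadata = [(min(b), max(b)) for b in blocks]

def find_over(threshold):
    scanned = 0
    hits = []
    for meta, block in zip(metadata, blocks):
        if meta[1] <= threshold:       # whole block ruled out by metadata
            continue
        scanned += len(block)          # only now touch atomic-level data
        hits.extend(v for v in block if v > threshold)
    return hits, scanned

hits, scanned = find_over(250)
assert hits == list(range(251, 300))
assert scanned == 100                  # blocks 0 and 1 were never read
```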

We've spent some time since the Apple Watch was announced last year getting intimately familiar with the operating system, the equipment and its use in the business world. Since then, we've maintained our view that the Apple Watch can improve business productivity. Yesterday, the latest operating system for the watch came out, watchOS 2. We spent some time with it, and there are a few features you should know about that will make productivity even better.

Guest blog post by Dr. Vincent Granville. Sqrrl views the Big Data market as 11 large segments (isn't analytics / data science missing?):
- Hardware providers: Big Data software runs on both commodity disks and flash/SSD.
- Services providers: These folks help with both strategy and implementation of Big Data solutions.
- Cloud providers: Many organizations run their Big Data solutions in public, private, or hybrid clouds.
- Enterprise Data Warehouse (EDW) vendors: These are traditional EDW vendors and the relational databases that typically sit on top of them.
- Data Integration vendors: These companies sell the tools that assist in getting data into Hadoop or Scale-Out databases.
- Hadoop vendors: These folks license commercial distributions of the Hadoop Distributed File System and related Apache projects (or in some cases, just sell support services around them).

Oracle Big Data SQL has been out for a while, and we announced Oracle Big Data SQL Cloud Service back in June. So what's different when it moves to the cloud? The pricing is different and it's running in our datacenter, not yours. But technically it offers the same capability, which means these entries from earlier in the year still apply. So if you want to see how Big Data SQL Cloud Service handles metadata, has the best SQL on Hadoop, extends Oracle Database security to Hadoop and NoSQL, and minimizes data movement, you can check out these other entries. Or come and see us at Booth 123 (yes, that's the real number) at Strata Hadoop World in New York on September 29-October 1.

Will a red call to action result in a higher conversion rate than a green one? One of the most controversial topics in web design is the issue of color. This subject attracts a great deal of attention, based on the notion that the color of an object can affect the way we feel about that object. The Priming Effect: color is shown to be a significant determinant of both website trust and satisfaction. Color has the potential to communicate meaning to the user and influence the visitor's perception through the priming effect. This is when exposure to one stimulus influences the way we respond to a subsequent stimulus. In this way, exposure to a certain color can influence the visitor's reaction towards the site in a 'carry-over' effect, meaning that the emotional reaction towards a color can be translated into positive or negative interaction with the website.

Big Data, the Internet of Anything (IoAT) and the Connected Car have created a new Information Superhighway that fundamentally changes the relationship between automakers and car buyers. Previously, automakers had an incomplete feedback loop after they sold a vehicle. They learned of negative customer sentiment through slumping sales, increasing warranty expenses or when they needed to recall their vehicles. Positive signals of driver happiness were similarly sparse. The connected car has changed all that. Read the White Paper on the Automotive Information Superhighway.

Looker, the company on a mission to power data-driven companies, today announced Looker Blocks: reusable, customizable components of business logic packaged as apps, such as churn prediction or lifetime value metrics.

In this special guest feature, John Thielens, VP of Technology at Cleo discusses how the ability to connect to non-traditional storage repositories can solve the security, access control, and scalability challenges of data lakes, which are more suited to handle today's less structured data.

IBM has written a new 10-page report, "Mitigating IT Risk for Financial Risk Analytics," which explores deploying an integrated solution for risk analytics based on the IBM Application Ready Solution for Algorithmics.

In this article I wrote for, I explain how, before selecting a BI analytics tool, you should create BI use cases and then match those requirements with BI analytics tool categories and styles. Which BI analytics tool does my company need? Over the years, many business intelligence (BI) tool styles have emerged to match the varied ways that business people need to analyze data, across broad product categories that include guided analysis and reporting as well as self-service BI and analysis.

Skytree®, a leader in enterprise-grade machine learning on big data, announced the release of Skytree 15.2. This release enhances Skytree's high performance, scalable machine-learning platform with streamlined data preparation for unstructured text data.

In a world that creates 2.5 quintillion bytes of data every day, how can organizations take advantage of unprecedented amounts of data? Is data becoming the largest untapped asset? What architectures do companies need to put in place to deliver new business insights while reducing storage and maintenance costs? Cisco and Hortonworks have been partnering since 2013 to offer operational flexibility, innovation and simplicity when it comes to deploying and managing Hadoop clusters. UCS Director Express for Big Data provides a single-touch solution that automates deployment of Apache Hadoop and gives a single management pane across both physical infrastructure and Hadoop software.

Guest Blog Author: Eric Sammer, CTO and Founder at Rocana I'm very excited to announce our partnership with Oracle. We've been spending months optimizing and certifying Rocana Ops for the Oracle Big Data Appliance (BDA) and will be releasing certified support for Oracle's Big Data software. Ever since we worked with the Oracle Big Data team in the early days of the Big Data Appliance Engineered System while still at Cloudera, we've had a vision of the power of a pre-packaged application backed by the BDA.

by Youko Watari You manage your company's Teradata system and receive a phone call: "We are adding a new subsidiary, so get ready!" You are told that the data for this subsidiary needs to be hosted on your Teradata system right away, but the data must be completely secured and isolated from the rest of the company's information. In fact, even you are not supposed to be able to view it. Your Teradata system is already scaled to accommodate additional data, but you are not sure how you can tackle the data security requirement.

In this special guest feature, Ashish Gupta of Actian discusses how legacy solutions were not architected to handle the requirements of today, but with traditional systems so pervasive (and so much having been invested in them) organizations are reluctant to rip and replace.

Strata, the conference where cutting-edge science and new business fundamentals intersect, will take place September 29th to October 1st in New York City. The conference is a deep-immersion event where data scientists, analysts, and executives explore the latest in emerging techniques and technologies. Quantopian Talks & Tutorials: our team will be presenting several talks and tutorials at Strata. The topics range from how global sourcing is flattening finance, to a Blaze tutorial, to a review of pyfolio and how it can improve your portfolio and risk analytics, to an out-performing investment algorithm on women-led companies in the Fortune 1000. To see our entire lineup, please click here. Join us! If you would like to attend the conference, RSVP here and enter discount code QUANT for a 20% discount on any pass. We hope to see you there!

This article is the third in an editorial series aimed at line-of-business leaders working in conjunction with enterprise technologists, with a focus on opportunities for retailers and how Dell can help them get started. The guide also will serve as a resource for retailers that are farther along the big data path and have more advanced technology requirements.

Yahoo! JAPAN needed a data platform that could scale to generate 100,000 reports per day and process large amounts of data. It needed to keep the last 13 months' worth of data, approximately 500 billion rows, organized and easily accessible. Relational Database Management Systems (RDBMS) cannot scale to these levels from a cost and processing-power perspective. Yahoo! JAPAN explored Hadoop to achieve this and evaluated two platforms based on its requirements: Hortonworks Hive and Tez on YARN, and Cloudera Impala. Hive and Tez on YARN was able to scale beyond 15,000 queries per hour, while Impala hovered at about 2,500 queries per hour. BACKGROUND: Our business report systems generate 15,000 reports per hour from about 500 billion lines of time-series data over the last 13 months, based on conditions predefined by a user.

The new NFL season has commenced! This visualization by Shine Pulikathara uses Tableau Public to analyze the heights and weights of the NFL players for the 2015 regular season.
