Big Data News – 25 May 2016

Today's Infographic Link: World’s Biggest Data Breaches

Featured Article
The big data revolution is upon us. Firms are scrambling to hire a new breed of analysts dubbed “data scientists,” and universities have responded to this demand by introducing data science courses into degrees ranging from computer science to business. Survey-based reports find that firms are currently spending an estimated $36 billion on storage and infrastructure, and that spending is expected to double by 2020.

Top Stories
Apple is reportedly preparing its response to the Amazon Echo by opening up Siri to developers and forging a home-based speaker of its own.

The majority of network administrators don't have the programming skills needed to manage an SDN environment.

In preparation for Hadoop Summit San Jose, I asked the chair of the Apache Committer Insights track, Andy Feng, VP Architecture at Yahoo!, which three sessions he would most recommend. Although it was tough to choose only three, he recommended: HDFS: Optimization, Stabilization and Supportability (speakers: Chris Nauroth from Hortonworks and Arpit Agarwal)… The post Top 3 Insights from the Apache Committers appeared first on Hortonworks.

Analysis: The countdown to the implementation of the General Data Protection Regulation has begun, and nobody seems to be ready.

An unnamed homeland security agency has signed a contract with a company that claims it can "reveal" your personality "with a high level of accuracy" just by analyzing your face, whether that facial image is captured via photo, live-streamed video, or stored in a database. The system then sorts people into categories, labeling some as potentially dangerous, such as terrorist or pedophile. Disturbingly, some experts believe the science behind it is antiquated, has previously been discredited, and that its results are inaccurate.

An app-centric start-up can under-price and out-perform longstanding business models simply by connecting existing services in new and innovative ways.

Innovation is vital to business and helps to drive growth. It defines how we transform the things we do, how we make more from the services we offer and how we improve the things we make. Innovation is the successful exploitation of ideas to make our businesses ever more competitive across international markets and at home. In most, if not all, of these markets it is the key differentiator that shapes Scotland's competitive advantage and helps us win market share. How we innovate has changed. It is no longer happening as an add-on for businesses but as a fundamental part of their plans for success.

Speed is of the essence, but is it always necessary?

What is feature engineering? One of the growing discussions and debates within the data science community is the determination of which inputs or variables should be included in any predictive analytics algorithm. This type of process is more commonly referred to as feature engineering. Historically, this process is typically the most time-consuming element in building… The post Feature Engineering within the Predictive Analytics Process — Part One appeared first on Predictive Analytics Times.
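
As a small illustration of what feature engineering typically looks like in practice (a sketch, not code from the article; the column names and transformations are hypothetical), derived inputs such as recency, tenure, and ratio features are built from raw columns before any model is trained:

```python
# A minimal feature-engineering sketch (hypothetical columns, not from the article).
import pandas as pd

raw = pd.DataFrame({
    "signup_date": pd.to_datetime(["2015-01-10", "2015-06-01", "2016-02-20"]),
    "last_purchase": pd.to_datetime(["2016-05-01", "2016-04-15", "2016-05-20"]),
    "total_spend": [120.0, 40.0, 310.0],
    "n_orders": [4, 1, 9],
})

features = pd.DataFrame(index=raw.index)
# Recency: days since the last purchase, relative to a reference date.
features["days_since_purchase"] = (pd.Timestamp("2016-05-25") - raw["last_purchase"]).dt.days
# Tenure: how long the customer has been signed up.
features["tenure_days"] = (pd.Timestamp("2016-05-25") - raw["signup_date"]).dt.days
# Ratio feature: average order value, guarding against division by zero.
features["avg_order_value"] = raw["total_spend"] / raw["n_orders"].clip(lower=1)

print(features)
```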

Interest in ride-sharing specialists continues to grow as auto industry giants Toyota and Volkswagen announce strategic partnerships to grow the industry. Uber and Gett, a smaller ride-sharing company, are the beneficiaries of these new investments.

LinkedIn has open-sourced Kafka Monitor, a framework for monitoring and testing Kafka deployments, and it's now available on GitHub. It continuously monitors SLAs in production clusters and runs regression tests in test clusters.
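
Kafka Monitor itself is a Java framework, but the core idea, continuously producing tagged messages and consuming them back to measure end-to-end latency and loss, can be sketched in a few lines. The snippet below is a hedged illustration of that probe pattern using the kafka-python client, not Kafka Monitor's actual API; the broker address and topic name are assumptions.

```python
# Sketch of an end-to-end probe in the spirit of Kafka Monitor (not its actual API).
# Assumes a local broker at localhost:9092 and a freshly created "monitor-probe" topic.
import json
import time
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "monitor-probe"
BROKER = "localhost:9092"

producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",   # read the probe messages from the start of the topic
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    consumer_timeout_ms=10000,      # stop iterating after 10 s of silence
)

# Produce a handful of timestamped probe messages.
for seq in range(5):
    producer.send(TOPIC, {"seq": seq, "sent_at": time.time()})
producer.flush()

# Read them back and report per-message end-to-end latency.
for record in consumer:
    latency_ms = (time.time() - record.value["sent_at"]) * 1000
    print(f"seq={record.value['seq']} latency={latency_ms:.1f} ms")
```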

Alpine Data announced Chorus 6, an upgrade of its integrated analytics platform that adds collaboration and governance capabilities to machine learning projects for both business users and data scientists. It's designed to reduce friction between the people working on big data projects and to focus the work beyond the never-ending quest for the perfect algorithm.

Cray, a supercomputing giant, launched its Cray Urika-GX system, a platform that combines supercomputing with an open, enterprise-ready software framework for big data analytics. Think of it as the speed and scale of supercomputers with the handiness of an appliance and the flexibility and reliability of open-source. The product is aimed at conquering cluster and application sprawl in large enterprises.

Confluent launched Kafka Streams, a lightweight product for stream processing development. It's part of the now generally available open source Confluent Platform 3.0, which also features Confluent's first commercial product, the Confluent Control Center, for managing Kafka clusters.

KPMG, an audit, tax and advisory firm, released a report profiling fraudsters and measuring how successfully data and analytics are being used to detect fraud. While proactive analytics are often a good way to detect fraud, KPMG found they were not the primary detection means in any North American frauds and were used to detect only 3 percent of fraudsters worldwide.

TD Bank has embarked on an effort to transform traditional banking data infrastructure into a more modern system built on a new application framework and APIs. This system leverages TD Bank's data to deliver insights on-demand.

This morning I saw #tweepsmap on my Twitter feed and decided to check it out. Tweepsmap is a neat tool that can analyze any Twitter account from a social network perspective. It can create interactive maps showing where the followers of a Twitter account reside, segment followers, and even show who unfollowed you! Here is my followers map generated by country. You can create the followers map based on city and state as well. Tweepsmap also provides demographic information such as languages, occupation and gender, but it relies on the Twitter user having entered this information in their Twitter profile.

Citrix continues to provide a steady stream of updates to its core technologies.

Amazon Web Services consultancy 2nd Watch this week released the findings of an analysis of 100,000 public cloud instances to determine the 30 most popular services being used. It's not surprising that AWS's two core products, compute and storage, lead the pack. 100% of the environments 2nd Watch examined were using Amazon Simple Storage Service (S3), the massively scalable object storage service. 99% of customers were also using Amazon Elastic Compute Cloud (EC2), the on-demand virtual machine service. And 100% of customers use AWS Data Transfer, because if you have data in the cloud, you need to transfer it in or out at some point.

In this contributed article, Morten Brøgger, CEO of Huddle, talks about how enterprise thought leaders need to evolve into the role of big data evangelist.

Telecoms revenue fraud is a primary driver for increased Apache Hadoop adoption, according to a recent poll of telco and enterprise users by Cloudera and Argyle Data. Communication service providers lose around U.S. $38 billion to fraud every year.

Thanks to its broad applicability, data analytics has rapidly become a critical business function for modern organisations. But with expertise in the field in short supply and high demand, companies with an identified need for data analytics are looking beyond their traditional borders to monetise their information assets. Forrester Research predicts that a third of businesses will "pursue data science through outsourcing and technology"  as organisations become less process-driven and look to their data to find new opportunities for innovation.

Data becomes more difficult to manage every year, so implementing an effective information governance program helps IT better handle this growing problem.

Hewlett Packard Enterprise (HPE) will spin off its Enterprise Services business and merge it with CSC, creating a $26 billion IT services giant. The company made the announcement in conjunction with its second-quarter earnings report, which saw quarterly revenues rise year-over-year for the first time in five years.

There are so many different players in the revenue and contract management space, from those focused purely on the subscription and billing aspects (Zuora, Aria and Vindicia), to the ERP vendors who believe they offer the best solution (NetSuite, Intacct and FinancialForce), and on to the pure-play vendors in various other spaces — CPQ, CLM, etc. (vendors like Apttus, Icertis, PROS and a host of others).

Have you ever attended a sporting event where billboards flashed advertisements? Quite likely. Have you ever been to a conference with sponsored events, shuttles and meals? Happens all the time. In these cases, and many more, advertisers in effect help subsidize your ticket purchase. The practice is so common as to be unremarkable.  And yet the Federal Communications Commission (FCC), prodded by self-styled consumer interest groups, is seeking to ban sponsored data on the Internet. Spotify and Pandora may want to help pay for part of your T-Mobile subscription, and Facebook may wish to defray the cost of getting online for cash-strapped young people, but these groups insist these consumer subsidies are secretly dangerous. Free data, they are hoping the FCC will declare, is bad for you. 

News: IBM's Watson gives developers insights through its natural language processing capabilities.

On 9th-11th May 2016, over 120 of the leading CDOs and senior data and analytics executives met at the Millennium Hotel, Mayfair to network and engage in an exchange of ideas to bolster their respective strategies. The general consensus is that the CDO position is indeed gaining global momentum, with Gartner predicting that 50% of all companies in regulated industries will have a CDO by 2017.

For anyone who isn't aware, and there must be at least one or two of you out there, healthcare IT has a tendency to look the way IT generally did about 30 years ago. There are lots of reasons for that — a desire to invest in patient services rather than infrastructure, concerns about security and privacy, and a general conservatism in the space. No matter what the causes, it's clear that healthcare's historical reluctance to move the needle on technology change has had an impact on patient outcomes. But in recent years, we've seen a change. Of course, the highest-profile event was Obamacare and a move toward results-based funding. But underlying that very important event has also been a more general awareness that things need to change, as well as a broader acceptance that in healthcare, IT can be a major enabler of patient outcomes.

So, there you are relaxing in your Asian grassland holiday villa; daydreaming you're Jack Bauer battling against real-time odds in an old episode of 24, when this snarling tiger leaps through an open window. What do you do? Throw a chair at it? Wave a kitchen knife in its general direction? What? The simple answer is… pray. You can't rely on knee-jerk reactions to deal effectively with this kind of intense, in-the-moment situation. A little planning goes a long way. It's the same with real-time marketing. Big Data analytics can help you predict and plan for threats and opportunities, and make appropriate just-in-time decisions. Going, going, gone: real-time marketing implies a sequence of steps — business-event data, analytical insights and decisions based on context.

In statistics, an outlier is defined as an observation that stands far away from most other observations. Often an outlier is present due to measurement error. Therefore, one of the most important tasks in data analysis is to identify and (if necessary) remove the outliers. There are different methods to detect outliers, including the standard deviation approach and Tukey's method, which uses the interquartile range (IQR) approach. In this post I will use Tukey's method because I like that it does not depend on the distribution of the data. Moreover, Tukey's method ignores the mean and standard deviation, which are influenced by extreme values (outliers).
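
As a quick illustration of the rule described above (a sketch, not code from the original post), Tukey's method flags any point falling more than 1.5 times the IQR below the first quartile or above the third quartile:

```python
# Tukey's IQR rule for flagging outliers (illustrative sketch).
import numpy as np

data = np.array([2.1, 2.3, 2.2, 2.5, 2.4, 2.6, 2.3, 9.8, 2.2, -4.0])

q1, q3 = np.percentile(data, [25, 75])   # first and third quartiles
iqr = q3 - q1                            # interquartile range
lower = q1 - 1.5 * iqr                   # Tukey's fences
upper = q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(f"fences: [{lower:.2f}, {upper:.2f}], outliers: {outliers}")
```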

It took a decade of research before scientists decrypted human DNA for the first time. Today – after 13 years of progress – the same work is done within 24 hours! We continuously sharpen our data processing tools, and the amount of data has grown drastically over the past ten years. But is there still room for innovation? Does the future hold new, jaw-dropping revelations? There's no need to guess. Let's check out what the biggest data science gurus think about Big Data trends over the next ten years, and how they will change the world we know. Simplicity is the new black! First of all, data analysis will become more "dummy-friendly". Business-centric data analysis tools will not require programming skills. Both use and even development will be a piece of cake.

Could your security and performance be in jeopardy? Nearly half (3.2 billion, or 45%) of the seven billion people in the world used the Internet in 2015, according to a BBC news report. If you think all those people generate a huge amount of data (in the form of website visits, clicks, likes, tweets, photos, online transactions, and blog posts), wait for the data explosion that will happen when the Internet of Things (IoT) meets the Internet of People. Gartner, Inc. forecast that there will be twice as many (6.4 billion) Internet-connected gadgets (everything from light bulbs to baby diapers to connected cars) in use worldwide in 2016, up 30 percent from 2015, and that the number will reach over 20 billion by 2020.

Most companies today claim to be fluent in data, but as with most trends, these claims tend to be exaggerated. Companies are high on data, but what does it mean to be a data-driven company? I went ahead and asked a number of business leaders. According to Amir Orad, CEO of Sisense, a business intelligence software provider, true data-driven companies understand that data should be omnipresent and accessible. "A data-driven company is an organization where every person who can use data to make better decisions has access to the data they need when they need it. Being data-driven is not about seeing a few canned reports at the beginning of every day or week; it's about giving the business decision makers the power to explore data independently, even if they're working with big or disparate data sources."

The market for self-service data preparation tools is having a golden moment in the sun, with analyst firms like Gartner deciding that it does, in fact, have legs to stand on its own. The health of that market is also why Trifacta today launched a formal business partner program. With the new Wrangler Partner Program, Trifacta aims to bring a variety of firms into the self-service data prep fold, including system integrators, consulting firms, software vendors, and Hadoop and data warehouse platform providers.

Designing IoT applications is complex, but deploying them in a scalable fashion is even more complex. A scalable, API-first IaaS cloud is a good start, but in order to understand the various components specific to deploying IoT applications, one needs to understand the architecture of these applications and figure out how to scale these components independently. In his session at @ThingsExpo, Nara Rajagopalan, CEO of Accelerite, will discuss the fundamental architecture of IoT applications and the various components that the cloud needs to support in order to make IoT as a Service possible.

Global supercomputer leader Cray Inc. (Nasdaq: CRAY) today announced the launch of the Cray® Urika®-GX system — the first agile analytics platform that fuses supercomputing technologies with an open, enterprise-ready software framework for big data analytics.

If you manage your whole LAN in the cloud, why not add in the desk phones, too? That's what Cisco's Meraki division has done. Its first phone, the MC74, can be managed on the same dashboard Meraki provides for its switches, Wi-Fi access points, security devices, and other infrastructure. Cisco bought Meraki in 2012 when it was a startup focused on cloud-managed Wi-Fi. The wireless gear remains, but Cisco took the cloud management concept and ran with it. Now Meraki's approach is the model for Cisco's whole portfolio.

In his session at 18th Cloud Expo, Bruce Swann, Senior Product Marketing Manager at Adobe, will discuss how the Adobe Marketing Cloud can help marketers embrace opportunities for personalized, relevant and real-time customer engagement across offline (direct mail, point of sale, call center) and digital (email, website, SMS, mobile apps, social networks, connected objects). Bruce Swann has more than 15 years of experience working with digital marketing disciplines like web analytics, social media, mobile marketing and email marketing, as well as marketing and CRM technologies, including marketing automation, predictive analytics and marketing resource management.

Enterprise networks are complex. Moreover, they were designed and deployed to meet a specific set of business requirements at a specific point in time. But, the adoption of cloud services, new business applications and intensifying security policies, among other factors, require IT organizations to continuously deploy configuration changes. Therefore, enterprises are looking for better ways to automate the management of their networks while still leveraging existing capabilities, optimizing performance and reducing operational risk through standardization and best-practice architectures.

Enterprises' initial entrance into the cloud is over, and they are witnessing the arrival of Cloud 2.0. That's the word from Diane Greene, senior vice president for Google's cloud businesses. The first phase of the cloud involved testing the waters, figuring out how companies could save time and in-house effort by having apps and services run in the cloud and using the cloud to store data. The top concerns were security and reliability. Fast forward several years, and enterprises that have moved to the cloud have resolved most of their worries, figured out whether they want a private, public or hybrid cloud, and chosen their vendors. Now CIOs want to do more than store their data and run their apps in the cloud.

One of the big questions surrounding the rise of real-time stream processing applications is consistency. When you have a distributed application involving thousands of data sources and data consumers, how can you be sure that the data going in one side comes out the other unchanged? That's the challenge that Confluent is addressing with today's launch of new software for Apache Kafka.

UC vendors have to simultaneously keep up with more general technology while creating their own niche.

New User Experience for Iterative Analytics and a Re-Architected, Future-Proof Back End for Fastest Processing Every Time. SAN FRANCISCO, CA — May 24, 2016 – Further democratizing big data analytics by making traditionally complex tasks easy, Datameer today unveiled Datameer 6 to enable a new class of data-driven business analysts. Datameer 6 introduces an elegant new front end that reinvents the entire user experience, making the previously linear steps of data integration, preparation, analytics and visualization a single, fluid interaction. Shifts in context, tools or teams are no longer required every time a data change is needed, saving both time and cost over traditional analytic workflows. Datameer 6 also introduces the addition of Spark to its patent-pending Smart Execution™ technology, which intelligently selects the best processing framework for every single job while abstracting complexity from the end user.

The company's IRIS system promises dramatic speed improvements over existing technologies. These storage improvements should benefit anyone running large database queries, particularly transactions that require speed.

Cray continued its courtship of the advanced scale enterprise market with today's launch of the Urika-GX, a system that integrates Cray supercomputing technologies with an agile big data platform designed to run multiple analytics workloads concurrently. While Cray has pre-installed analytics software in previous systems, the new system takes pre-configuration to a new level with an open software framework designed to eliminate installation, integration and update headaches that stymie big data implementations. Available in the third quarter of this year, the Urika-GX is pre-integrated with the Hortonworks Data Platform, providing Hadoop and Apache Spark. The system also includes the Cray Graph Engine, which can handle multi-terabyte datasets comprised of billions of objects, according to Cray. The Graph Engine runs in conjunction with open analytics tools for support of end-to-end analytics workflows, minimizing data movement. The system includes enterprise tools, such as OpenStack for management and Apache Mesos for dynamic configuration. 

With the partnership, developers will be able to invoke a REST-based application programming interface to connect a device to a cellular network.

Redis isn't your typical NoSQL data store, and that's exactly why it hits the sweet spot for certain use cases. Get started using Spring Data Redis as a remote cache server to store and query volatile data.
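
The article itself covers Spring Data Redis, a Java API. Purely as a language-neutral sketch of the same remote-cache pattern (cache-aside with a TTL), here is the equivalent idea using the redis-py client; the host, key format, and the expensive_lookup function are hypothetical.

```python
# Cache-aside pattern against a remote Redis server (illustrative sketch, not Spring Data Redis).
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)  # assumed local Redis instance

def expensive_lookup(user_id):
    # Placeholder for a slow database query or remote call.
    return {"user_id": user_id, "name": "example"}

def get_user(user_id, ttl_seconds=300):
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)             # cache hit: serve volatile data from Redis
    value = expensive_lookup(user_id)          # cache miss: compute the value,
    r.set(key, json.dumps(value), ex=ttl_seconds)  # then store it with a TTL
    return value

print(get_user(42))
```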

A multi-tiered architecture built on top of Java EE presents a powerful server-side programming solution. As a Java EE developer for many years, I've been mostly satisfied with a three-tiered approach for enterprise development: a JPA/Hibernate persistence layer at the bottom, a Spring or EJB application layer in the middle, and a web tier on top. For more complex use cases I've integrated a workflow-driven solution with BPM (business process management), a rules engine like Drools, and an integration framework such as Camel.

Big data has never been bigger, nor more of a crapshoot. At least, that's the sense one gets from a new survey revealing that 76 percent of all enterprises are looking to maintain or increase their investments in big data over the next few years. This despite a mere 23.5 percent owning up to a clear big data strategy. That wouldn't be so bad if things were getting better, but they're not. Three years ago 64 percent of enterprises told Gartner that they were hopped up on the big data opportunity. But then, as now, the vast majority of big data acolytes didn't have a clue as to how to get value from their data.

The enterprise needs to build roughly the same data ecosystem that exists within the data center, but tailored to the more abstract nature of the IoT.

The rise of the citizen data scientist has placed a new premium on easy-to-use analytics tools, and on Tuesday Datameer announced a fresh version of its namesake platform designed with that imperative in mind. New in Datameer 6 are a simplified front end as well as an expanded tool for selecting the best processing framework for the job. Datameer 6's new front end combines the previously linear steps of data integration, preparation, analytics and visualization into a single screen. Shifts in context, tools or teams are no longer required every time a data change is needed, the company says. Instead, users can toggle among different phases of the workflow, with visualization along the way to illustrate the effects any changes have made.

As a data scientist, your job doesn't always make sense to others. Ever tried explaining what you do to your parents? They may nod their heads, but their eyes scream confusion. Well, aside from possibly stifling job-related conversations, this isn't a big deal. However, when it comes to explaining what you do to potential clients, who happen to be just as technology averse, it's a major issue.

Calling 2016 "the year mobile has happened," Google's SVP of ads and commerce Sridhar Ramaswamy announced improvements to advertising functionality for tablets and smartphones. More text, better bidding, responsive formats, and expanded location-based capabilities are all coming Google's mobile ad business.

Manufacturers will tell you that unified storage arrays are the Shangri-La for enterprise storage, yet few environments need every protocol. Virtualization, databases, cloud and file all have different needs for the best user experience. Despite this, for the first 5-10 years of my IT career I heard from the majority of senior IT staff that "Fibre Channel is the most secure, best-performing protocol and should be used for all enterprise-class products."

How do you handle your semi-structured data? In this whitepaper from IBM Cloud Data Services, find out how keeping your data warehouse on the cloud can help you gain the competitive advantage by freeing valuable resources.

In previous posts, we've looked at why organizations transforming themselves with data should care about good development practices, and the characteristics unique to data infrastructure development. In this post, I'm going to share what we've learned at SVDS over our years of helping clients build data-powered products: specifically, the capabilities you need to have in place in order to successfully build and maintain data systems and data infrastructure. If you haven't looked at the previous posts, I would encourage you to do so before reading this post, as they'll provide a lot of context as to why we care about the capabilities discussed below.

The Ambari Metrics System (AMS) released with Ambari 2.0 about a year ago is an Ambari-native pluggable and scalable system for collecting and querying Hadoop Metrics (AMBARI-5707). Since that time, the community has been working hard at adding new capabilities to the system and recently announced the availability of Ambari 2.2.2 where AMS now includes… The post Under-the-Hood with Ambari Metrics and Grafana appeared first on Hortonworks.

by John Mount, Ph.D., Data Scientist at Win-Vector LLC. In part 2 of her series on Principal Components Regression, Dr. Nina Zumel illustrates so-called y-aware techniques. These often neglected methods…

There is an unprecedented volume of data being created, with an unprecedented number of people around the world regularly producing and storing data. Research shows that 90 percent of the data in the world today was created in the last two years alone. This may not be news to those of us who plan for, manage, or process this barrage of data, but questions still remain about best practices when taking on infrastructure changes to address big data in a big way. Without structure, big data is simply noise: massive amounts of information pouring in from a large, and growing, pool of internal and third-party sources. What was once a question of how to extract, process and store the data is now a question of how to manage, analyze and operationalize accurate insights from it.

Worldwide revenues for big data and business analytics will grow from nearly $122 billion in 2015 to $187 billion in 2019, according to the new Worldwide Semiannual Big Data and Analytics Spending Guide from research firm International Data Corporation (IDC). That's an increase of more than 50 percent over IDC's five-year forecast period.

Cloudera considers the handling and reporting of security vulnerabilities a very serious matter. In this post, learn the processes involved. In addition to expecting enterprise-class standards for stability and reliability, Cloudera's customers also have expectations for industry-standard processes around the discovery, fix, and reporting of security issues. In this post, I will describe how Cloudera addresses such issues in our software. The first step in the life cycle of a security vulnerability is that it is discovered and reported to Cloudera. The post Cloudera's Process for Handling Security Vulnerabilities appeared first on Cloudera Engineering Blog.

When it comes to IT security, many business owners think that hackers are only targeting large businesses. We see things like the Target and Home Depot breaches in the media and think those companies are the only ones having trouble with hackers. But the fact of the matter is that, more and more, hackers are turning to small businesses to try to cash in. From ransomware to phishing for your credentials, the risk is real for business owners who are trying to protect their data.

Your organization's data is money unaccounted for on the balance sheet, and it must be protected. A secure production site with no single point of failure goes a long way, but disaster recovery is the insurance that keeps you running in case of a fire, cyber attack, or accidental database wipe.

GRC2016 in Vienna – It's almost here, and I can hardly wait!  Sure, I've never seen Vienna, and I hear it is beautiful, historic, musical, and even magical.  Am I excited to be booked for a visit at the Spanish Riding School in Vienna before the conference?  Yes, definitely.  But, believe it or not, I…

Use open-source tools to supercharge the data science lifecycle, giving data science teams a boost as they work to provide compelling results in the complex team environments that mark modern corporations. Learn how you can make open data science an ongoing part of your business environment when you attend the Apache Spark Maker Community Event, whether in person or over the Internet.
