Big Data News – 21 Jan 2016

Today's Infographic Link: Killing Time – How to Destroy Your Productivity

Top Stories
IBM turned in better-than-expected fourth-quarter results this week. However, the numbers still show how far Big Blue has to go to return to profit and growth. When does the Street run out of patience?

A new report by Forrester Research's big data analysts says that adopting Hadoop is "mandatory" for any organization that wishes to do advanced analytics and get actionable insights on their data. Forrester estimates that between 60 percent and 73 percent of data that enterprises have access to goes unused for business intelligence and analytics.

Trying to show that the data analysis package R is no scarier than Excel, John Mount of the Win-Vector blog walks through a simple analysis in both Excel and R.

Are you running an ETL job or batch process in Hadoop? If so, you have load steps whose ordering you need to manage: some can run in parallel, while others depend on earlier steps and shouldn't start until those complete. The company I work for is a big Kettle user. Kettle is an open source tool from Pentaho (also known as Pentaho Data Integration). Our main use for it is to push data from A to B and transform it at step C, where steps D and E can happen in parallel, but not before step C. In other words, we use Kettle/PDI for orchestration.
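That kind of ordering is just a dependency graph, and the scheduling logic can be sketched in a few lines of Python with the standard library's topological sorter. The step names A–E below are the hypothetical steps from the example above, not real Kettle transformations:

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies mirroring the example above:
# B needs A, C needs B, and D and E each need C,
# so D and E may be dispatched in parallel once C finishes.
deps = {"B": {"A"}, "C": {"B"}, "D": {"C"}, "E": {"C"}}

ts = TopologicalSorter(deps)
ts.prepare()
batches = []
while ts.is_active():
    ready = sorted(ts.get_ready())   # every step whose prerequisites are done
    batches.append(ready)            # each batch could run in parallel
    ts.done(*ready)

print(batches)
```

This prints [['A'], ['B'], ['C'], ['D', 'E']]: the final batch is the only one with two steps, which is exactly the D/E parallelism described above.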

A quick heads-up that I'll be presenting a live webinar on Thursday January 28, Introduction to Microsoft R Open. If you know anyone who should get to know R and/or Microsoft R Open, send 'em along.

As the Apple Watch and other smartwatches gain momentum, we've seen increasing adoption in the business world. These wearable devices have done a lot for employee productivity. But one concern must be raised: what security gaps do these devices open up? Hackers are always looking for ways to infiltrate company networks and data, and wearables are just one more opportunity for them to find a hole or weakness.

The Internet has also made it possible to outsource call center services. This frees a company from managing in-house call center staff and lets it focus on other parts of the organization. Outsourced call center providers are becoming very popular these days. Still, it's not a decision to be taken lightly. While outsourced centers are more affordable, they are often negligent about quality control, so many companies remain staunch believers in on-site call center staff, unwilling to risk their brand's reputation even when it is cost-ineffective.

News: Partnership with Nvidia will help to boost its capabilities.

The following 50 companies each received between $77 MM and $1,200 MM in funding, for a grand total of nearly $9 billion when aggregated. We scraped Yahoo Finance, press releases, Wikipedia, and other sources to gather this data. If you include all big data companies with at least $10 MM in funding, the grand total is about $18 billion. Of course, this is a typical example of a Zipf distribution, where a few lucky companies received massive funding and the vast majority received zero. Also, when performing this analysis, we had to deal with dirty data: $10 MM is sometimes listed as $10 Million, 10 million, 10MM, 10M, 10 M, $10 M, and so on.
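Normalizing that kind of dirty funding data is mostly pattern matching. Here's a minimal sketch of one way to do it; the parse_funding helper and the exact set of accepted forms are illustrative assumptions, not the cleaning code actually used for the analysis:

```python
import re

def parse_funding(raw):
    """Normalize messy funding strings like '$10 MM', '10 million',
    '10MM', '10 M' into a plain dollar amount (float)."""
    s = raw.strip().lstrip("$").strip()
    m = re.match(r"([\d,.]+)\s*(mm|m|million)?$", s, re.IGNORECASE)
    if not m:
        raise ValueError(f"unrecognized funding string: {raw!r}")
    value = float(m.group(1).replace(",", ""))
    # A trailing M / MM / million suffix means "millions of dollars";
    # a bare number is taken at face value.
    return value * 1_000_000 if m.group(2) else value
```

For example, parse_funding("$1,200 MM") yields 1,200,000,000 dollars, so all the variant spellings collapse onto one comparable number.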

A common scenario for data scientists: the marketing, operations, or business group gives you two sets of similar data with different variables and asks the analytics team to normalize both data sets into a common record for modelling. Here is an example of two similar data sets:

Data Set 1 (Organization Name, Sales):
  John Doe Inc           $300
  Saint Rogers           $400
  Sally Harper Center    $500

Data Set 2 (Organization Name, # of Customers):
  Sally Harper Cntr       10
  John Doe Incorporated   50
  St. Rogers             100

How would you, as a data scientist, match these two different but similar data sets to produce a master record for modelling? Short of doing it manually, the most common method is fuzzy matching.
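As a toy illustration of fuzzy matching on the records above, the character-level similarity ratio in Python's standard difflib is already enough; real pipelines typically use more robust token-based scorers, so treat this only as a sketch:

```python
from difflib import SequenceMatcher

# The organization names from the two example data sets above.
set1 = ["John Doe Inc", "Saint Rogers", "Sally Harper Center"]
set2 = ["Sally Harper Cntr", "John Doe Incorporated", "St. Rogers"]

def best_match(name, candidates):
    # Score each candidate by character-level similarity (0..1)
    # and keep the highest-scoring one.
    return max(candidates,
               key=lambda c: SequenceMatcher(None, name.lower(), c.lower()).ratio())

for name in set1:
    print(name, "->", best_match(name, set2))
```

Each name in the first set pairs up with its variant spelling in the second ("Saint Rogers" -> "St. Rogers", and so on), giving one master key per organization.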

Big data platform vendors make the case that "data overload," driven by the proliferation of mobile devices along with the emergence of Internet of Things (IoT) sensor networks, is making it harder than ever to extract accurate information from this data waterfall. Even harder is connecting the dots to provide analyses that can be used to make reasoned business decisions. Hence, analytics vendors like Microsoft partner Webtrends are stressing, in recent product releases, the need for "achieving data accuracy at scale."

If you are deploying Skylake for Business, turn on and require the included authentication technology. Don't be the cause of your CEO's fall.

The SLS fabric interconnects give IT the flexibility they need to migrate from one processor architecture to another on their own terms.

Dell is opening up its network operating system, one step toward a data-center OS that could help enterprises emulate cloud companies like Google and Facebook. Operating System 10, rolling out in stages this year, changes the company's networking foundation from a closed Dell platform to open software based on an unmodified Linux kernel. It will let enterprises add third-party networking components and use common scripting languages to develop new network capabilities. But beyond its usefulness in networking, OS10 could become a single OS for computing and storage, too. That idea has the potential to make it much easier for enterprises to work on all three parts of their infrastructure in one place using a devops approach.

After a few years of proving its value, addressing security concerns, and developing viable business models, the cloud has achieved significant market share in data storage, applications, and computing infrastructure. More businesses are opting for cloud storage and services, and your business is likely among those ranks. Ready to go cloud? Here's what you need to know. 1. Why Cloud? Great reasons to migrate to the cloud include eliminating the expense of hardware and shifting the burden of updates and security to the cloud service provider. But "migrating to the cloud" isn't, and shouldn't be, a goal in itself.

Focus on environmental issues appears to be creating an atmosphere in which more substantial actions are possible on dealing responsibly with e-waste.

Another big part of the food supply comes from ranches and farms that raise and slaughter various livestock. While ranching is sometimes bundled with agriculture, I discussed farming in Big Data in Agriculture, so we'll focus on ranching this time around. Somewhat surprising is that big data usage in ranching appears more limited than in farming. That said, there are a number of novel uses of technology and data in animal husbandry.

Advice from top security experts to help you identify areas of vulnerability and develop strategies for securing your data and information systems.

Since my last post, I had an opportunity at work to take over the responsibilities for a couple of Web apps. I also implemented one from scratch. The last time I had anything to do with JavaScript was over a decade ago. The browsers were weak and JavaScript support was not standard. Web pages were rendered using server-side templates and all business logic happened on the server. A decade in the software industry is like a century in other fields. Browsers are no longer dumb terminals, and JavaScript has emerged as a tool for building cross-platform apps. Expensive and bloated Java application servers declined in popularity years ago. Node has emerged as a platform for server-side JavaScript.

Google today announced that it has made a proposal to submit its Dataflow data processing technology to the Apache Software Foundation (ASF) in order to make Dataflow an Apache incubator project and thereby introduce broader governance and transparency around the software. Dataflow is interesting because it can handle both batch and stream processing of large data sets. It goes far beyond the MapReduce technology at the core of the Hadoop open source big data software that Google first documented in a paper in 2004.

A report released today by Glassdoor says that data scientists have the best jobs in the U.S., according to that company's analysis of its outsized database of job information. With a median base salary of $116,840, more than 1,700 job openings on Glassdoor's site, and a user-provided career opportunities rating of 4.1, "data scientist" took the prize for most highly rated job title in America, ahead of "tax manager," "solutions architect," "engagement manager" and "mobile developer."

Historically, bank robberies could be violent episodes perpetrated by armed criminals. But in today's digital era, bank robberies and other fraudulent crimes are being committed through highly sophisticated technological means cleverly disguised as legitimate transactions. Take a look at how collaboration and correlation can be applied to elevate risk levels and take action to thwart innovative fraud schemes.

Summary: At least one instance of real-time predictive model development on a streaming data problem has been shown to be more accurate than its batch counterpart. Whether this can be generalized is still an open question, but it does challenge the assumption that time-to-insight can never be real time. A few months back I was making my way through the latest literature on "real time analytics" and "in stream analytics" and my blood pressure was rising. The cause was the developer-driven hyperbole claiming that the creation of brand new insights using advanced analytics has become "real time".

In this special guest feature Ajay Anand, Vice President of Marketing at Kyvos Insights, gives his views on speed and interactivity of OLAP and the scalability and flexibility of Hadoop.

Microsoft R Open 3.2.3, the performance-enhanced distribution of R 3.2.3, is now available for download. This is the latest (and first of many to come!) update to Revolution R…

Yesterday (Jan 19th, 2016) we held our first meetup in Toronto. What a great start! More than 130 people attended, and more than 135 people watched the live stream! If our first meetup is a good indication of what we will see the rest of the year, I'm excited! Before describing what happened at this first meetup, let me quickly recap what we did with meetups in 2015: The Toronto BDU meetup grew from 1,100 members to 3,800; it more than tripled! We re-focused the meetup to cover mainly big data, analytics, and data science topics (before, we were covering everything: social, mobile, analytics, and cloud topics). We aligned the meetup closely to Big Data University (the meetup name was changed from SMAC to Big Data University). And we delivered 37 meetups on a very aggressive schedule in the last quarter: 2 meetups per week! We did this on purpose to drive us crazy =), and at the same time to create content in Big Data University.

In the 1970s, if you had a car with air-conditioning, you'd probably have been the envy of all your friends, and you'd even have gotten more for your car on the second-hand market. Today, it's pretty much impossible to buy a new car that doesn't have air-conditioning. You're more likely to find that your new car offers not just air-conditioning, but also heated seats (that you can control remotely with your smartphone), climate control, heated windscreens and a whole lot more. Product innovation has always been a top priority for car manufacturers, which is why they are among the top spenders in research and development.

Narrative Science, a leader in advanced natural language generation (NLG) for the enterprise, today announced availability of Narratives for Qlik® for Qlik Sense® users, a first-of-its-kind product extension.

If you're shopping for a Hadoop distribution on which to hang your big data hat, you have your work cut out for you, according to Forrester Research, which found four strong performers in a market it says is neck and neck. "Choosing a Hadoop distribution will be difficult for most AD&D [application development and delivery] pros who carefully consider each of these Leaders," write Forrester analysts Mike Gualtieri and Noel Yuhanna. "Forrester doesn't think there is a wrong choice among the Leaders in this evaluation. This is still a neck-and-neck market."

Even organizations well-versed in data warehousing realize that building infrastructure for the so-called "data lake" is a different ballgame.

SPONSORED: This sponsored post is produced by PubGalaxy. If you ask the experts to name the most important aspect of today's complex online advertising landscape, it's a pretty safe bet you'll hear a myriad of buzzwords, including big data, omni-channel, personalization, marketing cloud, customer journey, and so on. In fact, our industry is so cluttered with trendy buzzwords that it's fairly easy to forget that a human being still stands at the center of this technology-dominated environment.

Guest blog post by Kumar Chinnakali. We have received many requests from friends who regularly read our blogs to provide them with a complete guide to sparkle in Apache Spark. So we have come up with a learning initiative called "Self-Learn Yourself Apache Spark in 21 Blogs". We have drilled down through various sources and archives to provide a perfect learning path for you to understand and excel in Apache Spark. These 21 blogs, which will be written over a period of time, will be a complete guide for you to understand and work with Apache Spark quickly and efficiently.

Spark Dataflow from Cloudera Labs is now part of Google's New Dataflow SDK, which will be proposed to the Apache Incubator. Spark Dataflow is an experimental implementation of Google's Dataflow programming model that runs on Apache Spark. The initial implementation was written by Josh Wills, and entered Cloudera Labs exactly a year ago. Since then, we've seen a number of contributions to the project, culminating in the recent addition of an implementation of streaming (running on Spark Streaming) by Amit Sela from PayPal. The post Spark Dataflow Joins Google's Dataflow SDK appeared first on Cloudera Engineering Blog.

Intelligence analysis and data analytics provide organizations with the ability to detect and counter cybercriminals. Identifying even the smallest changes in users' online behavior patterns and in daily system operations provides crucial clues that help stop bad actors from stealing data.

According to a new report, companies are failing when it comes to security monitoring and goals.

Ciena encryption tech is designed to automatically encrypt data the second it comes on to a Ciena network and decrypt it once it exits that network.

The benefits of data analytics for travel industry leaders are not confined to understanding customer behavior over time — insights can also be gained in the moment. Real-time weather data provides marketing and operations teams within various arms of the travel industry with immediate, actionable insights.

Guest blog post by Ritesh Gujrati. Those who follow big data technology news probably know about Apache Spark, and how it's popularly known as the Hadoop Swiss Army Knife. For those not so familiar, Spark is a cluster computing framework for data analytics designed to speed up and simplify common data-crunching and analytics tasks. Spark is certainly creating buzz in the big data world, but why? What's so special about this framework? The reasons fall under a few headings: Spark is speedy; data scientists love Spark; Spark effectively handles workloads; demand for Spark professionals is soaring; and IoT fits right in. Enterprises are increasingly adopting Spark for many reasons, ranging from speed and efficiency to analytics versatility, development familiarity, ease of use, and a single integrated system for all data pipelines. Spark has established considerable momentum across many verticals, and we can only expect to see it grow in 2016.

Although the weather can frequently be the go-to topic when engaging in small talk, it is very much intertwined with deeper discussions around the environment, global warming, and reducing carbon emissions. A rising tide of sensor-derived Internet of Things data is now fueling a worldwide focus on applying advanced atmospheric science, which will very likely stimulate the insight needed to develop solutions.

Charts and graphs may be some of the most commonly used tools for bringing data sets to life, but Narrative Science wants you to consider another one: stories. The company already helps enterprises put data-driven stories to work through its flagship Quill natural language generation platform, and on Wednesday it debuted a new option in the form of an extension designed specifically for users of visual analytics tools from business intelligence software vendor Qlik. Companies that use Qlik Sense data-visualization software can now download the free Narratives for Qlik extension and automatically create stories that explain what's most interesting and important about their graphs and charts.

Redpaper, published: Wed, 20 Jan 2016 The IBM® Smarter Asset Management for Oil and Gas solution gives oil and gas companies direct visibility into asset usage and operational health.

Digital marketing is essential for any and all businesses, whether you are an e-commerce business or a brick-and-mortar one. In everything you do with digital marketing, it is also essential that you provide an exceptional customer experience; in other words, focus on experience-driven digital marketing. If you dedicate your efforts to these marketing areas, your business will reap the rewards of long-term sustainability through customer loyalty. B2B and B2C: it is important that you do not forget that many different types of customers exist.

The world of big data and data science can often seem complex or even arcane from the outside looking in. In business, a lot of people by now probably understand the basics of what big data analysis involves: collecting the ever-growing amount of data we are generating, and using it to come up with meaningful insights. But what does this actually involve, on a day-to-day level, for the professionals who get their hands dirty with the nuts and bolts? To look under the hood of a job that some describe as the 'Sexiest Job Of The 21st Century', I spoke to leading data scientist Dr Steve Hanks to get an overview of what the work of a data scientist actually involves, and what sort of person is likely to be successful in the field.

The most popular example of the application of analytics in sports is the Hollywood movie Moneyball, which is based on the true story of a baseball coach who assembles a stellar team despite having a very limited budget. He enlists an economics student to use data models to identify players who have the potential to perform outstandingly but are under-valued in the market. Fast forward to today, and this has become a best practice in team-building across nations and across sports. This has made it necessary for player data to be collected and maintained exhaustively.

This post is a summary of three different posts about outlier detection methods. You can find the original posts, with detailed implementations, at the links below: "Detecting Outliers in High-Dimensional Data Sets"; "Local Outlier Factor (LOF): Identifying Density-Based Local Outliers"; and "Outlier Detection Using Principal Component Analysis". One of the challenges in data analysis in general, and predictive modeling in particular, is dealing with outliers.
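The linked posts cover high-dimensional, density-based (LOF), and PCA-based methods. As a point of contrast, the simplest univariate baseline is the box-plot (IQR) rule, sketched here in plain Python; this is only a baseline for intuition, not the method from any of the posts:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]: the classic
    box-plot rule, a simple baseline before density-based methods."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

# A cluster of typical values plus one obvious anomaly.
print(iqr_outliers([10, 12, 11, 13, 12, 95]))
```

The rule flags 95 as the lone outlier; methods like LOF generalize this idea to multivariate data, where "far from the rest" depends on local density rather than a single axis.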

For some, simply skimming over material with words like 'systems analysis' or 'business statistical data' causes eyes to glaze over as interest and attention vaporize into thin air like a wisp of smoke. But these business buzzwords are not so nebulous in their foundations. Analytical data and use of Big Data actually permeates our everyday lives; take your GPS system, for example, guiding you through your morning commute. As you drive along, the system is constantly sorting and sifting through vast quantities of information, detecting the amount of traffic along a particular route, thereby predicting (based upon available data) alternative routes to save you time in traffic. But how do businesses use this?
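The rerouting in the GPS example is, at heart, shortest-path search over a road graph whose edge weights reflect current traffic. A toy sketch with Dijkstra's algorithm, where the road names and travel times are invented purely for illustration:

```python
import heapq

# Toy road network: edge weights are travel minutes, inflated
# where traffic is heavy (all names and numbers are made up).
roads = {
    "home":    [("main_st", 10), ("back_rd", 4)],
    "main_st": [("office", 5)],
    "back_rd": [("office", 15)],
}

def quickest(graph, start, goal):
    # Dijkstra's algorithm: always expand the route with the
    # least accumulated travel time so far.
    queue = [(0, start)]
    best = {start: 0}
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            return cost
        for nxt, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt))
    return None

print(quickest(roads, "home", "office"))
```

With these weights the main-street route wins (15 minutes); if congestion raised its weight, the back road would take over, which is exactly the real-time rerouting described above.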

This article has two parts: listing the top 20 experts, along with their Twitter handle, rank in reverse order, number of Twitter followers, and Klout score. We hope to soon see a woman among the top 10. The top woman is currently #11.
