Big Data News – 17 Jun 2016

Featured Article
Summary: What changes in your analytics and Big Data stack would you have to make if you only had 10 milliseconds to make a decision? There's an entire industry that has to live by that rule. This is a great story about the collision of ecommerce, digital advertising, predictive analytics, and AI in a new digital battleground: the automation and optimization of advertising targeting and spend. Like so many interesting opportunities, this new market starts with a pain point and an unmet need: in this case, the lag in digital advertising spend, particularly on mobile devices. Mary Meeker pointed out this trend in her 2016 Internet Trends report, which we featured last week.

Top Stories
Data visualizations can transform analytics data into actionable business information. But remember to keep things easy to understand and remember your audience, an expert cautions.

Changes to one piece of the data environment affect the performance of others, and not always in a good way.

Six steps for organizations to strengthen their hybrid directory environment and ensure successful hybrid cloud environment performance.

Asmita Barve-Karandikar is an SDE with DynamoDB. Customers often want to process streams on an Amazon DynamoDB table with a significant number of partitions or with a high throughput. AWS Lambda and the DynamoDB Streams Kinesis Adapter are two ways to consume DynamoDB streams in a scalable way.
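As a rough illustration of the Lambda route, the sketch below shows what a stream-consuming handler might look like in Python. It is not taken from the linked article: the event layout (`Records`, `eventName`, `dynamodb.Keys`) follows the documented DynamoDB Streams record format, while the counting logic, the `order_id` key, and `SAMPLE_EVENT` are hypothetical stand-ins.

```python
# Sketch of an AWS Lambda handler attached to a DynamoDB stream.
# Assumption: the handler only tallies change types; real code would
# act on the keys or images carried by each record.

def summarize_record(record):
    """Return (event_name, keys) for a single stream record."""
    change = record.get("dynamodb", {})
    return record.get("eventName"), change.get("Keys", {})


def handler(event, context):
    """Count the records in one batch by type: INSERT, MODIFY or REMOVE."""
    counts = {}
    for record in event.get("Records", []):
        event_name, _keys = summarize_record(record)
        counts[event_name] = counts.get(event_name, 0) + 1
    return counts


# Hypothetical batch, shaped like the events Lambda delivers for a stream.
SAMPLE_EVENT = {
    "Records": [
        {"eventName": "INSERT", "dynamodb": {"Keys": {"order_id": {"S": "1"}}}},
        {"eventName": "INSERT", "dynamodb": {"Keys": {"order_id": {"S": "2"}}}},
        {"eventName": "REMOVE", "dynamodb": {"Keys": {"order_id": {"S": "1"}}}},
    ]
}
```

Calling `handler(SAMPLE_EVENT, None)` locally is enough to exercise the logic before wiring the function to a real stream.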

Experienced business intelligence and analytics managers offer their thoughts on how to get started on instilling more data-driven management and decision-making processes in organizations.

Hadoop Summit San Jose (#HS16SJ) is rapidly approaching and is less than 2 weeks away! The Hortonworks office is perfecting every detail of Summit, from keynotes to special events. One particular event that I am eager to attend at Summit is the Women in Big Data (WiBD) Lunch and Panel. On Tuesday, June 28th from… The post My Summer At Hortonworks — Part 2: WiBD Be Assertive. Be Innovative. Take Risks. appeared first on Hortonworks.

In the months ahead, Samsung will increasingly become an anchor tenant on the Joyent cloud.

An expert panel discussion examines the value and direction of The Open Group IT4IT initiative, a new reference architecture for managing IT to help business become digitally innovative. IT4IT was a hot topic at The Open Group San Francisco 2016 conference in January. This panel, conducted live at the event, explores how the reference architecture grew out of a need at some of the world's biggest organizations to make their IT departments more responsive, more agile.

Natero aims to empower Customer Success, Sales, and Marketing teams to become data-driven, without having to be data experts. It automatically aggregates and mines all sources of customer data to uncover actionable insights.

Congratulations to Peter Aldhous and Charles Seife of BuzzFeed News, winners of the 2016 Data Journalism Award for Data Visualization of the Year. They were recognized for their reporting for Spies in…

Graph technology is popping up in many places, including master data management. A major data integration player has joined the quest, as seen recently at Informatica World.

"The world is one big data problem." Andrew McAfee, associate director of the Center for Digital Business at MIT Sloan. One whole year of almost daily client meetings and discussions with industry leaders has helped me crystallize my view of an important yet abstract idea into reality. That is, Big Data capabilities or the lack of… The post The six megatrends helping enterprises derive massive value from Big Data appeared first on Hortonworks.

Olli is more than just a 3D-printed self-driving vehicle. By leveraging IBM's Watson cognitive computing platform, the bus can foster a more interactive journey with riders.

Distributed computing cannot simultaneously guarantee consistency, availability, and partition tolerance. Most system architects need to think carefully about how they should appropriately balance the needs of their application across these competing objectives. Linh Da and Kyle discuss the CAP Theorem using the analogy of a phone tree for alerting people about a school snow day.
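To make the trade-off concrete, here is a toy Python sketch, not drawn from the episode itself. It models two replicas of a single value (say, the snow-day announcement) and shows what each design gives up once the link between them is cut. The `Cluster` and `Replica` classes and the "CP"/"AP" mode names are hypothetical illustrations, not a real database API.

```python
class Replica:
    """One copy of the announcement held by one node."""

    def __init__(self, name):
        self.name = name
        self.value = None


class Cluster:
    """Two replicas that either stay consistent (CP) or stay available (AP)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False  # True while the replicas cannot talk

    def write(self, replica, value):
        """Write through one replica; return True if the write is accepted."""
        other = self.b if replica is self.a else self.a
        if not self.partitioned:
            replica.value = value
            other.value = value  # replicate to the peer
            return True
        if self.mode == "CP":
            return False  # refuse the write: consistent but unavailable
        replica.value = value  # accept locally: available but divergent
        return True

    def consistent(self):
        return self.a.value == self.b.value
```

With a healthy network both modes behave the same. During a partition the CP cluster rejects the update, while the AP cluster accepts it and lets the two replicas disagree until they can reconcile.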

The last time Gartner published their IaaS/PaaS provider rankings, Amazon AWS and Microsoft Azure occupied the coveted upper right quadrant. To make it into Gartner's Magic Quadrant, both Amazon and Microsoft needed to demonstrate the quality of their services as well as completeness of their vision. According to Amazon's company profile on Reuters, they participate in a number of business segments. Amazon operates and markets an Android App Store, streaming video and music, mobile advertising, retail analytics, movie production, mobile devices (e.g., Kindle tablets), audiobooks and book publishing.

With over 1.65 billion monthly active users, it's no surprise that businesses want to access user-generated content from Facebook. Status updates, likes, reviews, comments, photos, and videos all contribute to a sea of web data generated by Facebook users. But what exactly can be collected from Facebook and how are companies typically accessing that data?… The post What Type of Web Data Can You Collect From Facebook? appeared first on BrightPlanet.

Oracle revenues declined slightly year-over-year in the fourth fiscal quarter, but they still beat analyst estimates. Meanwhile, Chairman and CTO Larry Ellison told financial analysts the company is ready to meet customer demand in the Infrastructure-as-a-Service market with its second-generation data center technology.

Most commercial distributions of Spark offer a cloud option, and these have been popular with customers. But that doesn't mean cloud Spark platforms make sense for every situation.

The acquisition of cloud provider Joyent will support Samsung's mobile, IoT, and cloud service businesses.

Data theft plagues banks. Blockchain and tokenization are potential solutions to the problem. So why won't banks use them?

The cloud has made customer data management easier than ever. It has also created some unique complications that brands need to be prepared for. Brands need to weigh their options carefully before choosing a cloud management solution.

IBM Research and The Weather Company are putting forecasting models and machine learning to work on historical data. The result? Deep Thunder, a hyper-local forecasting tool offering precision insights and predictions for business customers.

In the compelling keynote address below, Josh Wills, Director of Data Engineering at Slack, discusses an all-too-common theme these days: "Data Engineering and Data Science: Bridging the Gap."

The big data boom has given rise to a host of software tools and capabilities that enable companies to capture, manage, and analyze large sets of structured and unstructured data for results-oriented insights and competitive success. But with this technology comes the challenge of keeping confidential information secure and private. Big data that resides within a Hadoop environment can contain sensitive data such as bank account details, credit card and other financial information, corporate business and property information, personal information, and clients' security information. Given the confidential nature of this data and the damage that could result should it fall into the wrong hands, it is mandatory that it be protected from unauthorized access.

Through 2020, spending on self-service data tools will grow 2.5 times faster than spending on traditional data tools.

Oracle Corp is catching up in cloud. Although it was late to the party, it says it's now doing very well, as its fiscal Q4 numbers demonstrate. And in an ironic troll to Salesforce CEO Marc Benioff, Larry Ellison publicly hopes to reach $10 billion revenue before him. In IT Blogwatch, bloggers break out the popcorn. Your humble blogwatcher curated these bloggy bits for your entertainment. Not to mention: LOVE…

News: Software company could be valued at around $1.9bn.

2014 was a watershed moment in Indian politics for a variety of reasons. The foremost one was the stupendous victory of the Narendra Modi-led NDA in the 2014 Lok Sabha elections. Modi managed to win 282 seats for his party in the elections, the first time in 30 years that a political party had come to power with an absolute majority. Several reasons have been attributed to this victory. Political analysts have spent hours in TV newsroom debates discussing the reasons for this victory. Some have credited the presidential style campaign that Modi ran, while some cited the anti-incumbency against the UPA government. While all these may have contributed to the victory, there is one thing which analysts have overlooked, or at the very least not considered as a major contributing factor for the victory.

Designed specifically for banks, telcos, government agencies and manufacturers, the applications provide users with tailored analytics and insights via visualizations.

The Internet of Things is as much about computing as it is about the "things" themselves, and that's why Samsung Electronics is buying Joyent. At first glance, a maker of smartphones, home appliances and wearables doesn't seem like it would need a cloud computing company. But so-called smart objects rely on a lot of number-crunching behind the scenes. A connected security camera can't handle all its video storage and image analysis by itself, for example, and that's where cloud services come in. The real money in IoT will be in the services more than the devices themselves, research firm Gartner says. It's not entirely up to Samsung to deliver services for its devices, but the company sees an opportunity there.

Machine learning is a hot topic, but what applications is the technology driving and what drives machine learning? The short answer, concludes a new research report on machine learning, is data–lots of data. "Data is available in volumes never dreamed of, which means the algorithms have more to get their teeth into," concluded a survey by market watcher 451 Research, noting that "machine learning lives or dies depending on the quantity and quality of the data available." Moreover, the rise of application containers in software development "means small [machine learning] applications can be placed into software without breaking the rest of the application," the report's authors continue.

Editor's note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here. At Silicon Valley Data Science, our focus on data strategy has given us a window into how various organizations are thinking about data at the executive level.

The challenge for online retailers is to combine effective fraud prevention that protects their bottom line with great customer experience and high approvals.

In crisis management, the first step happens before the storm: Create robust communications infrastructures, policies and procedures.

Enterprises interested in tapping container technology now have a brand-new option for managing it: ContainerX, a multitenant container-as-a-service platform for both Linux and Windows. Launched into beta last November by a team of engineers from Microsoft, VMware and Citrix, the service became generally available in both free and paid versions on Thursday. Promising an all-in-one platform for orchestration, compute, network, and storage management, it provides a single "pane of glass" for all of an enterprise's containers, whether they're running on Linux or Windows, bare metal or virtual machine, public or private cloud.

Your Step-By-Step Guide To Learning R Programming. Do you want to learn R Programming? Do you get overwhelmed by complicated lingo and want a guide that is easy to follow, detailed and written to make the process enjoyable? 

Summary: Introducing Data Science teaches you how to accomplish the fundamental tasks that occupy data scientists. Using the Python language and common Python libraries, you'll experience firsthand the challenges of dealing with data at scale and gain a solid foundation in data science. About the Technology: Many companies need developers with data science skills to work on projects ranging from social media marketing to machine learning. Discovering what you need to learn to begin a career as a data scientist can seem bewildering. This book is designed to help you get started.

In an earlier post I explored the value of using scalable machine learning to extract value from huge amounts of data. In this post, I will dive down into the technical side of things, particularly the challenges and benefits that come with making algorithms scalable on large clusters of computers. Machine learning algorithms are written to run on single-node systems, or on specialized supercomputer hardware, which I'll refer to as HPC boxes. They grew up in a world where they didn't have to scale across multiple nodes. It's relatively easy to get high performance when running algorithms on a single computer. With distributed computing, things get a great deal harder for some algorithms due to the communications latencies among what could be thousands of server nodes.
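A minimal Python sketch of the idea, not taken from the post: each simulated node computes a small summary of its own shard, and a single combine step merges the summaries. That combine step stands in for the network round trip that makes distributed algorithms harder. The function names and the round-robin sharding are illustrative assumptions.

```python
def shard_data(values, num_nodes):
    """Split values round-robin into one shard per simulated node."""
    return [values[i::num_nodes] for i in range(num_nodes)]


def partial_stats(shard):
    """Work done locally on one node: a (sum, count) summary of its shard."""
    return sum(shard), len(shard)


def combine(partials):
    """The communication step: merge per-node summaries into a global mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


values = list(range(1, 101))
shards = shard_data(values, 4)
global_mean = combine([partial_stats(shard) for shard in shards])
```

A mean needs only one exchange of tiny summaries, so it scales easily. Iterative algorithms that must exchange state on every pass pay the latency cost described above many times over.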

Hadoop is growing among small and large businesses. See how and why it's being used for business operations today.

One of the key challenges sales compensation leaders face is demonstrating the ROI of Sales Performance Management (SPM) to other key stakeholders, particularly, the CFO. Consider three key, strategic arguments for advancing CFO and other organizational leadership stakeholder interest in SPM.

In the domain of data science, solving problems and answering questions through data analysis is standard practice. Often, data scientists construct a model to predict outcomes or discover underlying patterns, with the goal of gaining insights. Organizations can then use these insights to take actions that ideally improve future outcomes.  The flow of the methodology illustrates the iterative nature of the problem-solving process. As data scientists learn more about the data and the modeling, they frequently return to a previous stage to make adjustments, iterate quickly and provide continuous value to the organization. Models are not created once, deployed and left in place as is; instead, they are continually improved and adapted to evolving conditions.

The Data Science Summit is packed with industry experts, authors, researchers and business leaders delivering concrete examples of data science and machine learning in action. Promotional Code: datasciencecentral15 for 15% off. We hope you'll join us!  View the schedule here   Highlights: 2 days & 3 tracks with 60 talks and tutorials, and a startup showcase. Hands-on trainings: gain practical skills using the best tools in the industry. Best practices from business leaders at Cloudera, Google/TensorFlow, Kaggle, Pandora, Pinterest, Quora, Salesforce, Stitch Fix, Tableau, Uber and more. Machine learning advances from professors at Carnegie Mellon, Stanford, UC Berkeley & University of Washington. Book signings by scikit-learn's Andreas Mueller & Pedro Domingos, author of "The Master Algorithm". 1400 fellow data scientists, developers and business leaders for networking.

The DNC hack is a high-profile example of what can happen if cybersecurity isn't taken more seriously.

With Google CEO Sundar Pichai sharply focused on AI, machine learning and their potentials, the search giant is opening a dedicated machine learning research center in Europe.

By: Bala Deshpande, Conference Co-Chair, Predictive Analytics World for Manufacturing 2016. In anticipation of his upcoming Predictive Analytics World for Manufacturing conference presentation, Building a Predictive Analytics Organization, we interviewed Chris Labbe, Managing Technologist at Seagate Technology. View the Q-and-A below for a glimpse of what’s in store at the PAW Manufacturing conference. Q: What are the challenges in translating the lessons of predictive analytics from other verticals into manufacturing? A: "Manufacturing at a company like Seagate means volume.

There's no doubt that the Apache Spark phenomenon has taken the big data world by storm. But can the technology actually deliver according to the tremendous hype that is accompanying it, or has it been oversold? There were some interesting takes on that question during last week's Spark Summit.

The current crop of wearables, such as smartwatches and fitness bands, is expected to power the market into the near future, but eyewear is where the profits lie. By the end of the year, wearable shipments will top 101 million. They will grow to 213 million by 2020.

Humorous metaphor about the cloud aside, this is a not-so-lighthearted blog entry. This is the convergence of progress, loss, and humility. In the fall of 2015, Salesforce.com (SFDC) announced upcoming initiatives for the Internet of Things. What we now know is that it was envisioned with AWS as its backbone. By some accounts, the affected infrastructure team at SFDC advocated fiercely for themselves. After all, they had built something in the industry which is still unparalleled, despite many newcomers in the field. Salesforce.com was all about the cloud, even when it was derided, not like today when everyone who's not rushing to it is considered ancient.

Check out the details on a tool that can change the game for data scientists–open source analytics notebooks. Learn what notebooks are, what value they provide and how to get started using them today.

Data analytics is fueling new strategies in law enforcement from the federal level down to local departments. Whether it's finding patterns across time and location, predicting new threats or linking resources to responders during major events, data is the future of proactive emergency plans.

Do you remember Captain James Kirk using his wrist watch to communicate with the crew of the Starship Enterprise back in 1966? Today, almost 50 years later, it has finally become a reality! Digital disruption is occurring in all business functions all around the world. Wearables are becoming mainstream and disrupting almost every industry, with the biggest impact being seen in customer service, healthcare and manufacturing. Wearables currently stand at the stage where smartphones were back in 2007.

During the 46th session of the UN Statistical Commission on the post-2015 development agenda, UN Deputy Secretary-General Jan Eliasson said data will be the "lifeblood of decision-making and the raw material for accountability" in the new agenda and called for a statistical framework that would meet such expectations (http://sd.iisd.org/news/un-statistical-commission-sets-roadmap-to-post-2015-indicators/). Statistics has always been presented as a support to decision making, whether it is official statistics or statistics collected for monitoring and evaluation purposes.

Last week Microsoft announced that Apache Spark on Azure HDInsight (Microsoft's managed Hadoop and Spark cloud service) is now generally available. I spoke to Tampa Bay Data Science Group last night regarding Apache Spark on Azure HDInsight and the associated offerings.  Spark for Azure HDInsight offers customers an enterprise-ready Spark solution that's fully managed, secured, and highly available and made simpler for users with compelling and interactive experiences. The slides from my presentation along with references to codebase and links are available as follows.

What is Text Mining?   Text Mining is a general catch-all for a range of techniques for extracting information from text strings. Being able to extract, clean and summarize text data is a key ability for any Data Scientist. The following blog aims to highlight some of the process steps I use to clean text data as well as some summarization methods.   Initial cleaning   To illustrate some of the approaches to text mining I am going to use the full text of 1984 by George Orwell. This data was extracted from msxnet.org/orwell with analysis carried out in R.
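The original analysis was carried out in R. As a hedged illustration of the same kind of cleaning and summarization, here is a small Python sketch. The stop-word list, function names, and cleaning rules are assumptions for illustration, not the author's actual steps.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "and", "of", "to", "in", "is", "it", "was", "were"}


def clean_text(text):
    """Lowercase the text, replace non-letters with spaces, split into tokens.

    Note: apostrophes are treated as separators, so "don't" becomes
    the two tokens "don" and "t".
    """
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()


def top_terms(text, n=3):
    """Return the n most frequent tokens after removing stop words."""
    tokens = [token for token in clean_text(text) if token not in STOP_WORDS]
    return Counter(tokens).most_common(n)


sample = "It was a bright cold day in April, and the clocks were striking thirteen."
tokens = clean_text(sample)
```

Running `top_terms` over the full text of a book, rather than one sentence, gives a first rough summary of its vocabulary.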

Data mining (sometimes called knowledge discovery) is the process of analyzing and summarizing data into useful information which can be used to understand common features, the origin of data and to extract hidden predictive information. Data mining is used in science, engineering, modeling and analysis of financial markets. In this article we will discuss a free data-analysis framework called DMelt (The DataMelt project, http://jwork.org/dmelt/) which can be used to facilitate data analysis and data mining. It is a great program for scientists, engineers and students who need numerical and statistical computations, data and function visualization and even symbolic computation.

Recently I spoke with a Brand Director about how to find market space he can win and defend in the long run. The discussion shifted to the meaning of information and insights and, after an interesting debate on what insight is at all[1], we found agreement by means of an example, which I'd like to present to you too.  From data to information The following table shows data on automobile brand perception gathered through a survey. 324 auto owners were asked, among other questions, "For each of these brands mark the statements you agree with." The Total % column shows the frequency of occurrence of both row and column counts. Table: Survey on automobile brand perception. This table requires just a few simple data transformations to deliver information.

Introduction: The City and County of San Francisco launched an official open data portal called SF OpenData in 2009 as a product of its official open data program, DataSF. The portal contains hundreds of city datasets for use by developers, analysts, residents and more. Under the category of Public Safety, the portal contains the list of SFPD Incidents since Jan 1, 2003. In Part 1 of this series of analysis, I performed an exploratory time-series analysis on the crime incidents data to identify any patterns.

Yeah… working with data sets means that you have a way to get them first. After you get them you have to clean them. Data scientists spend 80% of their time on data cleaning and data manipulation and only 20% of their time actually analyzing the data. At the same time, deadlines and management demands keep you up at night. This is one reason data analysts and data scientists regularly scour the web looking for anything that could help. Tools, tutorials, resources.

Mobile games are serious business, as eMarketer clearly states, and proper management of the game's economy can mean the difference between success and failure. Start here: The science (and art) of game economics management starts with listing the sources and sinks. Sources are all resources provided to the users, either free or paid (examples can be the daily amount of free chips in a poker game, or another life purchased but not yet consumed). Sinks are all of the ways in which users can spend their accumulated resources (such as spending gold and wood to build a new house, or using a potion to heal your hero).
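One way to picture the source-and-sink listing is as a simple ledger. The Python sketch below is a hypothetical illustration, not the article's method: the `EconomyLedger` class, the channel names, and the amounts are all made up for the example.

```python
from collections import defaultdict


class EconomyLedger:
    """Track how much currency enters (sources) and leaves (sinks) a game."""

    def __init__(self):
        self.sources = defaultdict(int)
        self.sinks = defaultdict(int)

    def record_source(self, name, amount):
        """Record currency granted to players through the named channel."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.sources[name] += amount

    def record_sink(self, name, amount):
        """Record currency spent by players through the named channel."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.sinks[name] += amount

    def net_flow(self):
        """Positive means more currency is entering the economy than leaving."""
        return sum(self.sources.values()) - sum(self.sinks.values())


ledger = EconomyLedger()
ledger.record_source("daily_free_chips", 1000)
ledger.record_source("purchased_chips", 500)
ledger.record_sink("table_buy_in", 1200)
```

A persistently positive `net_flow()` suggests currency is piling up faster than players can spend it, which is the kind of imbalance the listing exercise is meant to surface.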
