Big Data News – 14 Nov 2016

Our site has not been updated recently because our VPS provider went out of business. The shutdown caught us by surprise and left us without hosting. We are porting the site to another VPS provider and will soon be updating on a daily basis once again. This teaches an important lesson about trusting “The Cloud”.

Our new VPS server provider is IO Zoom. Their migration support has been exemplary. IO Zoom’s servers are fast and reasonably priced.

For the curious, our old servers ran FreeBSD Unix and our new ones run CentOS Linux.

We expect to be back to updating the industry news on a daily basis by February 27.

Today’s Infographic Link: Big Data Myths and Facts


Featured Article
Are your data analytics prediction models suffering from the same problems as the models that predicted Hillary Clinton would easily win the US Presidential election? Here are a few things to consider.


Top Stories
Samsung Electronics wants to buy its way into the connected car market, with a plan to acquire Harman for $8 billion. It’s the latest in a line of acquisitions by Samsung, as it seeks to diversify its business beyond the slowing smartphone market. Other recently announced deals include last month’s buy of artificial intelligence startup Viv Labs, which has developed a virtual personal assistant Samsung hopes to put in its consumer electronics products, and the June purchase of Joyent, a supplier of cloud services for the internet of things.


We now experience life through an algorithmic lens. Whether we realize it or not, machine learning algorithms shape how we behave, engage, interact, and transact with each other and with the world around us. Deep learning is the next advance in machine learning. While machine learning has traditionally been applied to textual data, deep learning goes beyond that to find meaningful patterns within streaming media and other complex content types, including video, voice, music, images, and sensor data.


Machine learning couldn’t be hotter, with several heavy hitters offering platforms aimed at seasoned data scientists and newcomers interested in working with neural networks. Among the more popular options is TensorFlow, a machine learning library that Google open-sourced a year ago.
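For a feel of what such libraries automate, here is a toy sketch in plain Python (our illustration, not TensorFlow code): fitting a single neuron, y = w·x + b, by gradient descent. This is the basic loop that frameworks like TensorFlow scale up to deep networks with many layers.

```python
# Toy gradient descent on a single neuron (y = w*x + b), illustrating the
# kind of differentiable computation that libraries like TensorFlow
# automate at scale. Plain Python; no TensorFlow required.

def train(xs, ys, lr=0.01, steps=2000):
    """Fit y = w*x + b to (xs, ys) by minimizing mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of MSE = mean((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Data generated from y = 3x + 1; training should recover w ~ 3, b ~ 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 4.0, 7.0, 10.0, 13.0]
w, b = train(xs, ys)
print(round(w, 2), round(b, 2))  # close to 3.0 and 1.0
```

Deep learning frameworks do exactly this, except the gradients are derived automatically and the parameters number in the millions.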


Some of the biggest issues facing humanity — from global climate change, to water and power infrastructure, to monetary systems, social networks, and other complex systems — involve massive amounts of data that are daunting to analysts and policymakers. To help address these great challenges, MIT’s leaders decided a new approach was needed. That decision led to the creation of the new Institute for Data, Systems, and Society (IDSS) two years ago, as a way to support and coordinate research using analytical tools to tackle major societal issues. Ali Jadbabaie was recruited from the University of Pennsylvania to serve as interim director and help establish the new center and its doctoral program. Now, with the program well underway, Jadbabaie has come to MIT full-time and become the associate director of IDSS and the director of one of its parts, the Sociotechnical Systems Research Center. Jadbabaie’s work has spanned many disciplines and departmental affiliations, but his central focus has remained relatively constant: understanding the way distributed systems of people and/or devices interact and work together, and how to optimize those systems and interactions.


However, extracting data from plain text and organising it for quantitative analysis can be time-consuming.


Our World in Data is a data visualization site for exploring the history of civilization. The site was created by Max Roser. Our World in Data contains tons of information about many aspects of people’s lives. It also includes numerous visuals (like the one below) that can be easily shared or embedded on other sites. Beware: the site is addictive, and you might spend a lot of time exploring its data.


Today’s information-driven economy is leading to an influx of new big data jobs. Discover which job might be the best fit for your aptitudes and interests.


News this week includes lots of telecom merger developments.


Reports call for a continuation of last year’s double-digit growth for 2017 and beyond, but some assumptions about cloud computing haven’t panned out.


Fixing our cybersecurity problem trumps nearly everything else either candidate discussed.


In the aftermath of the election, pollsters and pundits are scrambling to account for misguided forecasts. What went wrong (and right)? Political analysts weigh in on the role of data in politics.


Author Jim Whiddon says technology might be handicapping young people’s social and professional skills.


Extraordinary costs of misconduct can be a daunting challenge for financial services institutions. In a recent episode of Finance in Focus, hear wealth management experts Marc Andrews and April Rudin discuss a cognitive and holistic solution for monitoring complex trading scenarios. Learn how it allows financial firms to go beyond traditional, rules-based alert systems to create a robust and unified surveillance system.


IBM’s Project Intu is an effort to create new forms of AI that can proactively interact with humans across multiple dimensions.


IBM Watson Customer Insight for Insurance helps you leverage dynamic customer segmentation to create a more personalized policyholder experience based on the policyholder’s financial and life events. This video demonstrates how to view and share actionable insights from easy-to-use, customizable dashboards to segment policyholders based on behavior, determine churn propensity and choose retention or replacement actions. You can also use the dashboards to analyze life events across the span of the customer relationship and assign customer lifetime value and risk. Learn how you can leverage Watson Customer Insight for Insurance to create more meaningful policyholder connections that reduce churn and open new revenue streams.



IT organizations want to be able to create a hybrid cloud spanning the VMware and AWS platforms. Velostrata has added support for spot instances.


Feel the power with our new ebook, “11 Analytics Use Cases for Getting the Most from Smart Grid Data.” For operations that are geographically diverse, it pays to harness the streams of data generated by smart meters, networked devices, power generators, customers and more. Smarter analytics improve productivity, maximize currently generated power, and support data-informed decisions on capital expenditures. Read this ebook to see how advanced analytics benefits power producers, distributors, consumers and government agencies with:

- Greater efficiency
- Real-time utilities monitoring
- Demand-side management programs that save energy
- Regulatory compliance and security
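As a flavor of the kind of analytics the ebook describes, here is a deliberately simplified sketch (our illustration, not from the ebook; the meter IDs, readings, and threshold are made up): flagging smart meters whose readings deviate sharply from the fleet average.

```python
# Illustrative only: flag smart-meter readings that deviate sharply from
# the fleet average, a simplified stand-in for the anomaly-detection use
# cases the ebook describes. Data and threshold are invented.
from statistics import mean, stdev

def flag_anomalous_meters(readings, threshold=2.0):
    """Return meter IDs whose reading is more than `threshold` standard
    deviations from the mean of all readings."""
    values = list(readings.values())
    mu, sigma = mean(values), stdev(values)
    return sorted(mid for mid, kwh in readings.items()
                  if abs(kwh - mu) > threshold * sigma)

readings = {
    "meter-01": 29.8, "meter-02": 30.1, "meter-03": 30.5,
    "meter-04": 31.2, "meter-05": 29.6, "meter-06": 30.9,
    "meter-07": 30.3, "meter-08": 29.7, "meter-09": 30.6,
    "meter-10": 30.0,
    "meter-11": 88.0,  # anomalous spike
}
print(flag_anomalous_meters(readings))  # ['meter-11']
```

A production system would of course work on streaming data and use far more robust statistics, but the core idea of comparing each meter against the fleet is the same.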


Cities are looking to the Internet of Things to help them track and report on sustainability and climate change goals required under various environmental regulations such as COP 21 and Horizon 2020.

Financial analysis techniques for studying numeric, well-structured data are very mature. While using unstructured data in finance is not necessarily a new idea, the area is still largely unexplored. On this episode, Delia Rusu shares her thoughts on the potential of unstructured data and discusses her work analyzing Wikipedia to help inform financial decisions. Delia’s talk at PyData Berlin, “Estimating stock price correlations using Wikipedia,” can be watched on YouTube. The slides can be found here and all related code is available on GitHub.
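The numerical core of a study like this is straightforward; the following plain-Python sketch (our illustration, not Delia’s code, with made-up prices) estimates the correlation between two stocks from their daily returns.

```python
# Sketch of the numerical core of a stock-correlation study: Pearson
# correlation between the daily returns of two price series. Illustrative
# plain Python, not code from the talk; the sample prices are invented.
from math import sqrt

def returns(prices):
    """Simple daily returns: (p[t] - p[t-1]) / p[t-1]."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

stock_a = [100.0, 101.5, 99.8, 102.2, 103.0]
stock_b = [50.0, 50.9, 49.7, 51.4, 51.9]
corr = pearson(returns(stock_a), returns(stock_b))
print(round(corr, 3))  # the two made-up series move together, so corr is near 1
```

The interesting part of the research is not this arithmetic but the feature engineering: deriving comparable signals from unstructured Wikipedia text in the first place.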


About a decade ago, photomosaics were all the rage: a near-recreation of a famous image composed of many smaller images. Here, for example, is the Mona Lisa, created using the Metapixel program by overlaying 32×32 images of vehicles and animals. An image like this presents an interesting computer vision challenge: can you use deep learning techniques to find the pictures of boats and cars embedded in the image, amongst all the noise and clutter of the other images around and often on top of them?

This is the challenge that Max Kaznady and his colleagues on the data science team took upon themselves, using the power of an Azure N-Series virtual machine with 24 cores and 4 K80 GPUs. The model was trained using the mxnet package running on Microsoft R Server, which takes advantage of the powerful GPUs to train a Residual Network (ResNet) DNN with 21 convolutional blocks. (You can read about ResNet in this Microsoft Research paper.) Once the model was trained, an HDInsight Spark cluster running Microsoft R Server was used to parallelize the problem of finding boat and car images within the photomosaic.

Here’s the architecture of the system, with the steps marked in order in yellow; another blog post explains how to set up such an architecture yourself. You can also just use the Deep Learning Toolkit with the Data Science Virtual Machine. To learn more about this application, check out this recorded presentation from the Data Science Summit presented by Max Kaznady and Tao Wu, or the blog post linked below. Cortana Intelligence and Machine Learning Blog: Applying Deep Learning at Cloud Scale, with Microsoft R Server & Azure Data Lake
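Before any model can classify the mosaic’s elements, the image has to be cut back into fixed-size cells. Here is a minimal sketch of that tiling step (our illustration, not the team’s pipeline; a tiny 4×4 grid of numbers stands in for a real image, and the tile size is 2 only to keep the example small, where the post uses 32×32 tiles).

```python
# Minimal sketch of the tiling step that precedes classification: cut a
# mosaic (here a 2-D grid of pixel values) into fixed-size tiles. Our
# illustration, not the Microsoft team's pipeline.

def tiles(image, tile_size):
    """Yield (row, col, tile) for each tile_size x tile_size block."""
    rows, cols = len(image), len(image[0])
    for r in range(0, rows, tile_size):
        for c in range(0, cols, tile_size):
            tile = [row[c:c + tile_size] for row in image[r:r + tile_size]]
            yield r // tile_size, c // tile_size, tile

# A 4x4 "image" split into four 2x2 tiles.
image = [[1, 1, 2, 2],
         [1, 1, 2, 2],
         [3, 3, 4, 4],
         [3, 3, 4, 4]]
grid = {(r, c): tile for r, c, tile in tiles(image, 2)}
print(grid[(0, 1)])  # the top-right tile: [[2, 2], [2, 2]]
```

Each extracted tile would then be fed to the trained classifier, and because the tiles are independent, the work parallelizes naturally across a cluster, which is exactly what the Spark stage of the pipeline exploits.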


A holistic and dynamic surveillance system that looks across the trading activity, specific actions and communications of the trader goes beyond traditional, rules-based surveillance systems. See how IBM Surveillance Insight for Financial Services provides a cognitive system that takes surveillance to the next level by providing unique visualization experiences to identify new patterns that haven’t yet been considered.


Enterprises are outgrowing their traditional business intelligence solutions and struggling to use the data they have collected to gain real insights. What happens is that data engineers scramble to shift data sets between repositories so that the data can be analyzed using older methods. As data volumes and data sources grow, this problem is only getting… (From the Hortonworks post “How Verizon is Solving Big Data Problems with Interactive BI.”)


We just concluded our highly attended 7-part Data-In-Motion webinar series. The final installment was a very informative session on how Apache NiFi, Kafka and Storm work together. Slides and Q&A below. Should you have any more questions at any time, we encourage you to check out the Data Ingestion & Streaming track of Hortonworks Community Connection, where…


MiNiFi is a subproject of NiFi designed to solve the difficulties of managing and transmitting data feeds to and from the source of origin (often the first/last mile of the digital signal), enabling edge intelligence to adjust flow behavior and communicate bi-directionally. Since the first mile of data collection (the far edge) is very distributed and likely involves…


Last week, Eric Thorsen, our vice president for industry solutions, visited Europe to meet with customers, and we managed to get some time with him to talk specifically about the trends we are seeing in the retail sector as part of our ‘Five Minutes with….’ video series. As we ramp up for the busy shopping…



This new release adds support for Amazon EBS volumes and the ability to diagnose errors quickly. Cloudera Director provides a simple, reliable, enterprise-grade way to deploy, scale, and manage Apache Hadoop in the cloud of your choice. Cloudera Director enables you to deploy production-ready clusters for big data applications and successfully run workloads in the cloud. Cloudera Director makes it easier for customers to:

- Deploy clusters in line with patterns native to cloud infrastructure
- Use an interface to define in one place the desired cluster specification, all the way down to the operating system
- Repeatedly and programmatically instantiate these cluster definitions
- Adapt to the dynamic nature of cloud infrastructure

Cloudera Director 2.2 provides additional mechanisms to get that initial cluster definition right, plus the ability to diagnose errors and iterate quickly.


TidalScale helps companies draw insights from big data faster, more easily and with greater flexibility at improved costs.


The shock of Donald Trump’s upset victory has begun to wear off. Now the search for answers begins. In particular: how, in this age of big data collection and data-crunching analytics, could so many polls, economic election models, and surveys (even those by top Republican pollsters) have been so wrong going into Election Day?


The Apache Software Foundation is overhauling project leadership for the Cassandra database. Will it survive the change?


Mirai is a threat that may or may not be fading, but the idea of assaulting the internet with DDoS attacks launched from IoT-connected devices is a potent one.


Cybersecurity is everybody’s responsibility. Ignoring it might seem to make your work life easier, but the consequences hurt everyone in the long run.

Discover how your organization can improve sales results and achieve operational efficiencies with IBM Incentive Compensation Solution.





This animation of Airbnb host locations from 2011-2014, presented by Ricardo Bion (data science manager at Airbnb) at the EARL Boston conference earlier this week, shows the dramatic growth in properties to rent through the service along with the most common routes of travellers. (You can find the R code that created this animation here.) How did Airbnb achieve such rapid growth? According to Ricardo, it’s by being a “data-informed company”. One of the first eight hires at Airbnb was a data scientist. And as we’ve noted here before, R is widely used at Airbnb along with other data science tools including Python and Jupyter.