Code Like A Girl

Inside the Data Science Summit 2016: 1400 Data Scientists and Practitioners together

This was my first time attending the Data Science Summit in SF, presented by Turi. It was a great experience overall, with interesting sessions, good networking and, well, San Francisco!

What/Who is Turi?

Turi is a Seattle-based machine learning company, formerly called Dato. It is one of the platinum sponsors of the Data Science Summit, along with Intel and TAP. Turi is also the main organizer of the conference and the creator of the GraphLab Create framework for machine learning.

The conference was a two-day event with lots of interesting sessions, hands-on tutorials and lightning talks. 1400 data scientists and practitioners attended. Here are short summaries of some of the talks I attended:

Day 1:

1. Making Data Accessible with SQL-on-everything: Apache Drill, by Tomer Shiran, Co-Founder and CEO of Dremio

This was an interesting talk about Apache Drill, a new open source technology that addresses the challenge of accessing a variety of non-relational data stores easily through a SQL engine interface.

2. Introducing the Trusted Analytics Platform Intel TAP — Kyle Ambert, Intel

In this talk, Kyle introduced TAP (Trusted Analytics Platform), open source software from Intel. It is an integrated platform for big data analytics that provides capabilities for data ingestion, analysis and model design, and makes analytics easily consumable by applications.

3. Evolution of the SFrame: Scalable Data Structure for ML — Sethu Raman, VP Engineering, Turi

The core content of this talk was the SFrame data structure in Turi’s GraphLab Create machine learning framework. It is a column-mutable dataframe that can scale to big data. The great thing about this talk was that it delved into how and when to use this data structure.

4. Tools for exploring, explaining, and evaluating your recommender system — Dr. Chris DuBois, Data Scientist, Turi

A common theme in most talks at the Data Science Summit was explaining models so that they can be trusted. This talk delved into verifying model behavior in different contexts using interactive visualizations.

5. Advancing the Python Data Stack with Apache Arrow — Wes McKinney, Software Engineer, Cloudera

This was my fangirl moment! The person who created the pandas library, which I have been using for a year now, was talking on stage! The Python stack plus big data problems has been a perfect recipe for headaches, and this talk was about a remedy. Wes talked about Apache Arrow, a specification for a columnar in-memory data format that supports analytics and makes better use of CPU caches. It is important for the Python and R communities, because data interoperability has been one of the biggest obstacles to tighter integration with big data systems, which mostly run on the JVM.
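To make the columnar idea concrete, here is a minimal sketch using only the Python standard library. It is not Arrow and does not use any Arrow API; the trip records and column names are made up for illustration. It only shows the difference between keeping one object per record and keeping one contiguous typed buffer per column, which is the layout that lets an analytic scan touch a single column.

```python
from array import array

# Row-oriented layout: one tuple per record (hypothetical trip data).
rows = [(1, 2.5, 10.0), (2, 1.0, 6.5), (3, 4.2, 15.0)]

# Column-oriented layout: one contiguous, typed buffer per column.
columns = {
    "trip_id": array("q", (r[0] for r in rows)),   # 64-bit signed integers
    "distance": array("d", (r[1] for r in rows)),  # 64-bit floats
    "fare": array("d", (r[2] for r in rows)),
}


def total_fare_rowwise(records):
    # Has to walk every record even though only one field is needed.
    return sum(r[2] for r in records)


def total_fare_columnar(cols):
    # Scans a single contiguous buffer and ignores the other columns.
    return sum(cols["fare"])


row_total = total_fare_rowwise(rows)
col_total = total_fare_columnar(columns)
```

Both functions return the same total; the difference is in how much unrelated memory each one has to walk through to get there.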

6. “Why Should I Trust You?”: Explaining the predictions of Any Classifier — Prof. Carlos Guestrin, University of Washington

Again, explaining predictions in order to develop trust was the theme of this talk, as it was for several other talks at the Data Science Summit this year. Prof. Carlos Guestrin talked about his student’s research and gave examples of how to check whether you can trust the predictions generated by your model. His examples were on point and, well, humorous. In a world where deep learning is fast gaining popularity, it is important to pause and understand whether the features used by the model are indeed trustworthy and reliable. One example he used to illustrate this was a model’s prediction in a “Husky vs Wolf” task. If the “deep predictor” is basically just detecting snow instead of wolf features, it is not a very useful model.
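The husky example can be mimicked with a toy sketch. This is not the method presented in the talk; it is a much simpler leave-one-feature-out check, and the classifier, its weights and the feature names are all hypothetical. It only illustrates how perturbing inputs can reveal which feature a model is really leaning on.

```python
def toy_wolf_score(features):
    """Hypothetical flawed classifier that mostly relies on a snowy background."""
    weights = {"snow_background": 0.8, "pointed_ears": 0.1, "yellow_eyes": 0.1}
    return sum(weights[name] * value for name, value in features.items())


def feature_influence(predict, features):
    """Switch off each feature in turn and record how far the score drops."""
    baseline = predict(features)
    influence = {}
    for name in features:
        perturbed = dict(features, **{name: 0.0})
        influence[name] = baseline - predict(perturbed)
    return influence


# A husky photographed in the snow: the background is present, wolf eyes are not.
husky_in_snow = {"snow_background": 1.0, "pointed_ears": 1.0, "yellow_eyes": 0.0}

influence = feature_influence(toy_wolf_score, husky_in_snow)
most_influential = max(influence, key=influence.get)
```

Here `most_influential` comes out as the snow feature, which is exactly the kind of finding that should make you distrust the model even if its accuracy looks fine.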

7. MOOCs Turn 4: What have we Learned? — Prof. Daphne Koller, President/Co-Founder at Coursera

Prof. Daphne Koller talked about Coursera celebrating 4 successful years and the lessons they have learned. She also talked about the data Coursera collects about courses and learners, and what they are learning from it.

Day 2:

1. Engineering Open Machine Learning Software — Andreas Muller, Research Engineer, NYU Center for Data Science

Andreas is a core contributor to scikit-learn, which has become the default machine learning library in the Python community. He talked about the important design decisions made during development that led to the success of the library. He also talked about the challenges of integrating scikit-learn into intelligent applications, such as the evolution of pandas dataframes, better defaults, benchmarking, correctness testing etc.

2. Machine Learning in Production — Dr. Yucheng Low, Chief Architect, Turi

In this talk, Dr. Yucheng Low started by illustrating the need for online learning, showing that offline model training tends to miss short-lived trends. Online learning can capture these trends, but it is very difficult to do. He talked about a solution to this problem: online re-ranking and correction, which amounts to building a correction model to accommodate changes. Think of it as piling diffs, computed online, onto the offline-trained model, which reduces the complexity.
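The "piling diffs onto the offline model" idea can be sketched roughly as follows. This is my own illustration under simple assumptions, not Turi’s implementation: the class name, the learning rate and the update rule are all hypothetical. The offline scores stay frozen, and a small per-item correction is nudged towards whatever feedback arrives online.

```python
from collections import defaultdict


class OnlineCorrectedRanker:
    """Frozen offline scores plus a per-item correction learned online."""

    def __init__(self, offline_scores, learning_rate=0.5):
        self.offline_scores = dict(offline_scores)
        self.corrections = defaultdict(float)  # the online "diffs"
        self.learning_rate = learning_rate

    def score(self, item):
        return self.offline_scores.get(item, 0.0) + self.corrections[item]

    def record_feedback(self, item, observed):
        # Move the correction a fraction of the way towards the observed value.
        error = observed - self.score(item)
        self.corrections[item] += self.learning_rate * error

    def rank(self):
        items = set(self.offline_scores) | set(self.corrections)
        return sorted(items, key=self.score, reverse=True)


ranker = OnlineCorrectedRanker({"evergreen_hit": 0.9, "new_release": 0.2})
before = ranker.rank()

# A short trend: the new release suddenly gets strong positive feedback.
for _ in range(5):
    ranker.record_feedback("new_release", 1.0)
after = ranker.rank()
```

The offline model is never retrained here; only the small correction table changes, which is what keeps the online part cheap.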

3. The Five Tribes of Machine Learning, And What You Can Take From Each — Prof. Pedro Domingos, UW

In this talk, Prof. Pedro Domingos talked at length about the 5 main schools of thought in machine learning and the master algorithm of each: inverse deduction for symbolists, backpropagation for connectionists, genetic programming for evolutionaries, probabilistic inference for Bayesians and support vector machines (SVM) for analogizers. However, each algorithm has its own drawbacks and its own area of suitable application. Hence the need to combine the key features of all 5 algorithms into a single Master Algorithm, which is what he went on to discuss. His research towards this goal, the new applications it will enable and how society will change with such a universal learner were the main points of the talk.

4. Small Team, Large Impact: how we solved it — Robin Glinton, Sr. Director of Data Science Applications, Salesforce

This talk was more of a case study and a discussion of how Robin and his small team of data scientists were able to make a large impact, using automated model generation, monitoring frameworks and collaborative experimentation. He mentioned an important technique his team uses to make model verification more efficient: dockerizing experiments (i.e. algorithm + data). This makes it easier to confirm that, for example, the only change between experiment 1 and experiment 2 was the value of parameter x, and that the data used did not change.
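The benefit of pinning algorithm and data together can be illustrated without Docker. The sketch below is a stand-in, not what the Salesforce team described: it fingerprints the data with a hash and stores the parameters next to it, so two experiments can be compared mechanically. All function names and the sample values are hypothetical.

```python
import hashlib
import json


def fingerprint(obj):
    """Stable SHA-256 hash of any JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def describe_experiment(data_rows, params):
    """Record what an experiment ran on: a data hash plus its parameters."""
    return {"data_hash": fingerprint(data_rows), "params": dict(params)}


def diff_experiments(first, second):
    """Report whether the data changed and which parameters differ."""
    keys = sorted(set(first["params"]) | set(second["params"]))
    changed_params = {
        key: (first["params"].get(key), second["params"].get(key))
        for key in keys
        if first["params"].get(key) != second["params"].get(key)
    }
    return {
        "data_changed": first["data_hash"] != second["data_hash"],
        "changed_params": changed_params,
    }


data = [[1.0, 0], [2.0, 1], [3.5, 1]]
experiment_1 = describe_experiment(data, {"x": 0.1, "depth": 3})
experiment_2 = describe_experiment(data, {"x": 0.5, "depth": 3})
diff = diff_experiments(experiment_1, experiment_2)
```

With a record like this per experiment, "only parameter x changed and the data did not" becomes something you can check rather than something you have to remember.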

5. Scaling Data Science in Python — Christine Doig, Senior Data Scientist, Continuum Analytics

This was the only hands-on training session at the Data Science Summit this year. Christine Doig talked about scaling data analysis, visualization and machine learning with new data structures, tools and libraries. For scaling data analysis, she talked about moving from pandas dataframes to dask dataframes, which look and feel like pandas dataframes but use multiple threads to handle larger datasets, something that was a limitation of pandas dataframes. For scaling data visualization, she talked about moving from bokeh to datashader. Both bokeh and datashader were new to me, but I quickly learnt from the examples in her tutorial, which used the NYC Taxi and Limousine Commission dataset, that bokeh is a powerful visualization library and that datashader is an even more powerful graphics pipeline system.

You can find her slide deck here and her tutorial code here. It is pretty easy to follow and the tools are great! However, one point to note is that dask dataframes do not implement the entire pandas interface yet.
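The general idea behind splitting a dataframe into partitions and working on them with multiple threads can be sketched with the standard library alone. This is not the dask API and not code from the tutorial; the helper names and the fare values are made up. Each partition produces a small partial result, and the partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum_count(chunk):
    """Per-partition work: return the partial sum and the row count."""
    return sum(chunk), len(chunk)


def chunked_mean(values, chunk_size=4, workers=2):
    """Mean computed from per-partition partial results (non-empty input)."""
    chunks = partition(values, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_count, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Hypothetical taxi fares standing in for a much larger dataset.
fares = [5.0, 7.5, 12.0, 9.0, 30.0, 4.5, 8.0, 6.0, 11.0, 7.0]
mean_fare = chunked_mean(fares)
```

Because only the small partial results need to be held together at the end, the same pattern works when each partition is read from disk instead of sliced from a list.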

6. The Exploit-Explore Dilemma in Music Recommendation — Dr. Oscar Celma, Director of Research, Pandora

Pandora is a music recommendation and streaming service. Pandora Radio is best known for the Music Genome Project, a music catalog of 1.5M+ tracks. Having collected more than a decade of contextual listener feedback, they have lots of data with which to build a powerful recommender system. Dr. Oscar Celma gave an overview of recommenders at Pandora. He also went into the details of their dynamic ensemble learning system, which provides a personalized experience to users. He illustrated this with a case study of Thumbprint Radio, a product recently launched by the research team at Pandora. My key takeaway from this talk was the critical role played by their online and offline architecture stack in the success of this product.
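The exploit-explore dilemma in the talk title is often introduced with the epsilon-greedy rule, so here is a minimal sketch of it. I am not claiming Pandora uses this; it is a textbook strategy swapped in for illustration, and the track names and ratings are hypothetical. With probability epsilon the recommender tries a random track (explore), otherwise it plays the best-rated one (exploit).

```python
import random


def choose_track(avg_ratings, epsilon, rng):
    """Epsilon-greedy choice over a dict of track -> average rating."""
    if rng.random() < epsilon:
        # Explore: pick any track, even one that currently looks worse.
        return rng.choice(sorted(avg_ratings))
    # Exploit: pick the track with the highest average rating so far.
    return max(avg_ratings, key=avg_ratings.get)


ratings = {"familiar_favorite": 0.9, "deep_cut": 0.6, "brand_new_artist": 0.4}
rng = random.Random(0)  # seeded so the sketch is repeatable

always_exploit = choose_track(ratings, 0.0, rng)
explored = {choose_track(ratings, 1.0, rng) for _ in range(200)}
```

Setting epsilon to 0 always returns the familiar favorite, while a non-zero epsilon gives lesser-known tracks a chance to collect feedback.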

7. Synthesizing human and machine capabilities — Eric Colson, Chief Algorithms Officer, StitchFix

This was a very interesting talk on a topic I had not heard discussed before: infusing the unique abilities of expert humans into intelligent services. As the world moves towards using intelligent machines, we humans can think less like machines and more like humans. As a result, we can take human abilities like empathy, grasping broader contexts, and absorbing and leveraging ambient information, and synthesize them with machine capabilities to enhance software systems.

8. Active Learning and Human-in-the-loop — Lucas Biewald, CEO, CrowdFlower

Human-in-the-loop is real-world active learning. No machine learning algorithm is perfect. That is why we sometimes see legitimate emails sitting in our spam folders, map tools sometimes mislead us into walking through unsafe areas, and a book recommendation system might suggest a book that does not suit our tastes. Humans often make smarter choices than the most advanced machines: we can parse language better, and we can identify music and images more accurately and faster. It is also known that the algorithm is not the secret sauce of intelligent systems; the data and parameters are. Hence, simulating active learning for the real world, that is, using human input to create training datasets (cleaning and preparing data) and to make difficult decisions (for example, parsing sarcasm, which machines are not great at), will improve the performance of these ML algorithms. In this talk, Lucas talked about the CrowdFlower AI platform, which integrates these capabilities into a single platform so that human-in-the-loop machine learning can be done easily.
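A human-in-the-loop setup can be sketched as a simple routing rule. This is not the CrowdFlower platform or its API; the threshold, the function names and the example sentences are assumptions for illustration. Confident predictions are accepted automatically, and uncertain ones (the sarcastic sentence here) are sent to a person whose answer then joins the training data.

```python
def route_predictions(predictions, confidence_threshold=0.8):
    """Split (item, label, confidence) triples into accepted and uncertain."""
    auto_accepted, needs_human = [], []
    for item, label, confidence in predictions:
        if confidence >= confidence_threshold:
            auto_accepted.append((item, label))
        else:
            needs_human.append(item)
    return auto_accepted, needs_human


def apply_human_labels(needs_human, ask_human):
    """Ask a person (any callable here) to label each uncertain item."""
    return [(item, ask_human(item)) for item in needs_human]


# Hypothetical sentiment predictions from a model.
predictions = [
    ("I love this album", "positive", 0.95),
    ("Oh great, another outage", "positive", 0.55),  # sarcasm, low confidence
    ("Worst service ever", "negative", 0.91),
]

# Stand-in for a human annotator: a lookup of answers a person would give.
human_answers = {"Oh great, another outage": "negative"}

accepted, uncertain = route_predictions(predictions)
human_labeled = apply_human_labels(uncertain, human_answers.get)
training_set = accepted + human_labeled
```

The human only sees the one sentence the model was unsure about, which is the point: human effort goes where the machine is weakest.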

9. The Purpose and Power of Platforms in Data Science — Kevin Novak, Sr. Manager, Data Science Platform, Uber

Kevin talked about the history of platform teams at Uber, how they have evolved and, of course, how they have increased in number. He talked about why platforms are essential as a foundation for data science groups, using Uber’s organization as a case study.

This conference provided a good discussion of the current state of the art in machine learning algorithms, tools and architectures for intelligent systems. I have installed Turi’s GraphLab Create and am using their 30-day trial to assess the capabilities of the framework. I will also be evaluating some other technologies that I learnt about at the conference in the coming weeks.

Happy exploring!

If you like this post, don’t forget to recommend and share it. Check out more great articles at Code Like A Girl.