jackmyers.info

"Paraphrasing Tim O'Reilly, the person who has the most data wins. That's a neat slogan, but the more data one has, the more likely it is to be unlabeled. Unfortunately, there aren't that many unsupervised learning algorithms out there, for machine learning in general and for NLP in particular. Recent advances in deep learning provide new tools for text mining of large unlabeled datasets. In particular, I will talk about the math, intuition and implementation of the word2vec algorithm, its variants (skip-gram and continuous bag of words), use cases, and extensions (e.g. paragraph2vec, doc2vec). I will wrap up with a simple demonstration at scale using Scala, Apache Spark, MLlib, and the Apache Zeppelin Notebook." - Marek Kolodziej

Marek Kolodziej is a Sr. Research Engineer at Nitro, Inc. He has worked on a diverse set of machine learning, distributed computing, and big data problems for the past four years, and on statistics and econometrics for the past nine. He is passionate about functional programming and static typing in general, and about Scala in particular. He is obsessed with production-quality data science: insights are only useful if the deployment is rock-solid, hence his focus on the JVM. Marek received his PhD in Energy and Environmental Economics from Boston University.

Unsupervised NLP Tutorial using Apache Spark