Machine Learning: Using Spark to Deal with Massive Data

By John Paul Mueller, Luca Massaron

The real world of machine learning relies heavily on huge datasets. Imagine trying to wend your way through the enormous amount of data generated just by the sales that Amazon.com makes every day. You need products that help you manage these huge datasets in a manner that makes them easier to work with and faster to process. This is where Spark comes in. It relies on cluster computing: the data and the work are spread across a group of machines that process their shares in parallel.

The emphasis of Spark is speed. When you visit the site, you’re greeted by statistics, such as the claim that Spark can process data in memory up to a hundred times faster than other products, such as Hadoop MapReduce (see the tutorial). However, Spark also offers flexibility in that it works with Java, Scala, Python, and R, and it runs on its own standalone cluster manager or on top of other Apache cluster software, such as Hadoop and Mesos. You can even run Spark in the cloud if you want.

Because Spark works with huge datasets, you need to know programming languages, database management, and other developer techniques to use it. The Spark learning curve can therefore be quite steep, and you need to provide time for developers on your team to learn it. The simple examples at Spark’s website give you some idea of just what is involved. Notice that all the examples include some level of coding, so you really do need to have programming skills to use this option.
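
To give a feel for the kind of code involved, here is a minimal sketch in plain Python of the split, process, and combine pattern that cluster frameworks build on, applied to a word count. It does not use Spark or its API. The function names (`split_into_partitions`, `count_words`, `word_count`) and the sample sentences are hypothetical, and worker threads on one machine stand in for the separate machines of a real cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(lines, num_partitions):
    """Deal the lines into num_partitions roughly equal chunks."""
    return [lines[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Map step: count the words found in a single partition."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def word_count(lines, num_partitions=3):
    """Count words by processing partitions in parallel, then merging."""
    partitions = split_into_partitions(lines, num_partitions)
    # Assumption for illustration: worker threads stand in for cluster
    # machines. A real cluster spreads the partitions across many nodes.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    # Reduce step: combine the per-partition results into one total.
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return dict(total)


if __name__ == "__main__":
    sample = ["spark is fast", "spark is flexible", "big data is big"]
    print(word_count(sample))
```

A framework such as Spark takes over the parts that this sketch handles by hand: dividing the data, shipping the work to each machine, and gathering the partial results. You still have to write the per-record logic yourself, which is why programming skills remain a requirement.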