Correlation is often used in the initial exploratory stage of working with a dataset, because it can comb through pairs of variables and quickly summarize whether they appear to be related.
This is the second (and last) part of the series dealing with the formal comparison of Machine Learning (ML) algorithms from a statistical point of view. In this post, we examine how statistical tests are applied to performance data of ML algorithms.
Apache Ignite is a distributed in-memory cache, query and processing platform for working with large-scale data sets in real time (leaving aside streaming processing, Spark integration, the machine learning grid, Ignite FileSystem, persistence, transactions…)
Have you ever watched a cooking show? You have probably noticed that the chefs usually have all the ingredients already separated and chopped. Likewise, a data scientist will be more useful and creative building models than spending time on data preprocessing…
In industry, when a practitioner (often a Data Scientist) uses a machine learning algorithm to build a predictive model to solve a real-world problem, they are interested in how the model performs once it is deployed into a production environment…
Spark Streaming is one of the most widely used frameworks for real-time processing, along with Apache Flink, Apache Storm and Kafka Streams. However, compared to the others, Spark Streaming has more performance problems, and it processes data in time windows rather than event by event, which introduces delay.
Data analysts are often confronted with a seemingly difficult decision: choosing between a simple model and a more complex one. Discover more in this post, in which Carlos del Cacho explains the unexplainable.