Big data has become a buzzword for an exciting new set of tools and approaches for modern, data-driven applications that are revolutionising the way the world computes. To the dismay of statisticians, this all-encompassing term is now widely used to cover the application of well-known statistical techniques to huge datasets for predictive purposes. Although big data has become a cliché, contemporary distributed computing techniques are enabling analyses of datasets substantially larger than those previously studied, with astonishing results.
Distributed computing alone, however, does not automatically lead to data science. Data products have emerged as a new economic paradigm, driven by the tremendous growth of datasets generated by the Internet and the insight that these datasets can be used to power predictive models (“more data is better than better algorithms”). Striking accomplishments of data modelling across vast heterogeneous datasets, such as Nate Silver’s seemingly supernatural ability to forecast the 2008 election using big data techniques, have led to widespread recognition of data science’s significance and attracted a diverse collection of practitioners to the field.
By offering a framework for distributed data storage and parallel computation, Hadoop has developed from a cluster-computing abstraction into an operating system for big data. Spark has built on these concepts, making cluster computing more accessible to data scientists. However, data scientists and analysts who are new to distributed computing may feel that these technologies are designed for programmers rather than analysts. This is because they require a fundamental shift in thinking: data must be handled and computed in parallel rather than sequentially.
This book aims to prepare data scientists for that shift in thinking by giving an accessible and straightforward overview of cluster computing and analytics. We’ll cover most of the concepts, tools, and techniques involved in distributed computing for data analysis, and lay the groundwork for more in-depth exploration of specific topics.
Written for an audience of data scientists, this book aims to fill that gap. It introduces the world of cluster computing and analytics with Hadoop from a data science standpoint. The focus is on common analytics, data warehousing approaches, and higher-order data workflows rather than deployment, operations, or software development.