Learning Scrapy - Second Edition
Dimitrios Kouzis-Loukas
Overview
Scrapy is an application framework designed specifically for crawling websites and extracting meaningful data that can be used in a wide range of applications, such as data mining and information processing. This book gives you a rundown of the required concepts and fundamentals of the Scrapy 1.4 framework, followed by a thorough description, with practical examples, of how to extract data from sources ranging from simple to complex websites.
You will learn how to clean up the data and shape it to your requirements using Python and third-party APIs. You will explore the steps involved in scraping data from online shops such as eBay and from news portals such as CNN and BBC News. You will also get hands-on experience of using Scrapy with Selenium. You will learn how to build and run web spiders and deploy them to Scrapy Cloud. Next, you will be introduced to the process of storing the scraped data in databases as well as search engines, so that you can perform real-time analytics with Spark Streaming. You will also become familiar with the best practices you can follow to get the best results.
By the end of this book, you will have mastered the art of scraping data for your applications and will be able to apply it in your projects with ease.
What you will learn
Understand HTML pages and write XPath to extract the data you need
Write Scrapy spiders in simple Python and crawl news portals and online shops
Push your data into any database, search engine or analytics system
Discover the steps involved in scraping JavaScript sites with Selenium
Use the Twisted asynchronous API to process hundreds of items concurrently
Make your crawler super-fast by learning how to tune Scrapy's performance through best practices
Perform large-scale distributed crawls with scrapyd and Scrapinghub