Data Science at the Command Line: Facing the Future with Time-Tested Tools

This hands-on guide demonstrates how the flexibility of the command line can help you become a more effective and efficient data scientist. You'll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data.

To get you started, whether you're on Windows, OS X, or Linux, author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.

Discover why the command line is an agile, scalable, and extensible technology. Even if you're already comfortable processing data with, say, Python or R, you'll greatly improve your data science workflow by also leveraging the power of the command line.

  • Obtain data from websites, APIs, databases, and spreadsheets
  • Perform scrub operations on plain text, CSV, HTML/XML, and JSON
  • Explore data, compute descriptive statistics, and create visualizations
  • Manage your data science workflow using Drake
  • Create reusable tools from one-liners and existing Python or R code
  • Parallelize and distribute data-intensive pipelines using GNU Parallel
  • Model data with dimensionality reduction, clustering, regression, and classification algorithms
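The workflow these bullets describe can be sketched with nothing but standard Unix tools. The following is a minimal, hypothetical obtain-scrub-explore pipeline on an inline data set; the file path and column names are made up for illustration and do not come from the book:

```shell
# Obtain: a tiny inline "data set" of daily temperatures, as CSV.
printf 'day,temp\nmon,18\ntue,21\nwed,19\n' > /tmp/temps.csv

# Scrub: drop the header row and keep only the numeric column.
temps=$(tail -n +2 /tmp/temps.csv | cut -d, -f2)

# Explore: compute a descriptive statistic (the mean) with awk.
echo "$temps" | awk '{ sum += $1 } END { printf "%.2f\n", sum / NR }'
```

Run as-is, the final command prints `19.33`, the mean of the three temperatures.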



Best Computer Science books

Web Services, Service-Oriented Architectures, and Cloud Computing, Second Edition: The Savvy Manager's Guide (The Savvy Manager's Guides)

Web Services, Service-Oriented Architectures, and Cloud Computing is a jargon-free, highly illustrated explanation of how to leverage the rapidly multiplying services available on the Internet. The future of business depends on software agents, mobile devices, public and private clouds, big data, and other highly connected technology.

Software Engineering: Architecture-driven Software Development

Software Engineering: Architecture-driven Software Development is the first comprehensive guide to the underlying skills embodied in the IEEE's Software Engineering Body of Knowledge (SWEBOK) standard. Standards expert Richard Schmidt explains the traditional software engineering practices recognized for developing projects for government or corporate systems.

Platform Ecosystems: Aligning Architecture, Governance, and Strategy

Platform Ecosystems is a hands-on guide that offers a complete roadmap for designing and orchestrating vibrant software platform ecosystems. Unlike software products that are managed, the evolution of ecosystems and their myriad participants must be orchestrated through a thoughtful alignment of architecture and governance.

Extra resources for Data Science at the Command Line: Facing the Future with Time-Tested Tools


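The sample below centers on a small streaming filter that squares integers read from standard input. Its behavior can be checked from the shell with an inline equivalent; this is a sketch that mirrors the sample's Python script, not code taken verbatim from the book:

```shell
# Feed the integers 1..3 to an inline filter equivalent to the sample's
# Python script: read stdin line by line, print each value squared.
seq 3 | python3 -c '
import sys
for line in sys.stdin:
    print(int(line) ** 2)
'
```

Run against `seq 3`, this prints 1, 4, and 9, one value per line.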
    #!/usr/bin/env python
    from sys import stdin, stdout
    while True:
        line = stdin.readline()
        if not line:
            break
        stdout.write("%d\n" % int(line)**2)
        stdout.flush()

Example 4-8. ~/book/ch04/stream.R

    #!/usr/bin/env Rscript
    f <- file("stdin")
    open(f)
    while(length(line <- readLines(f, n = 1)) > 0) {
        write(as.integer(line)^2, stdout())
    }
    close(f)

Further Reading

  • Docopt. (2014). Command-Line Interface Description Language. Retrieved from http://docopt.org.
  • Robbins, A., & Beebe, N. H. F. (2005). Classic Shell Scripting. O'Reilly Media.
  • Peek, J., Powers, S., O'Reilly, T., & Loukides, M. (2002). Unix Power Tools (3rd Ed.). O'Reilly Media.
  • Perkins, J. (2010). Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing.
  • McKinney, W. (2012). Python for Data Analysis. O'Reilly Media.
  • Rossant, C. (2013). Learning IPython for Interactive Computing and Data Visualization. Packt Publishing.
  • Wirzenius, L. (2013). Writing Manual Pages. Retrieved from http://liw.fi/manpages/.
  • Raymond, E. S. (2014). Basics of the Unix Philosophy. Retrieved from http://www.faqs.org/docs/artu/ch01s06.html.

Chapter 5: Scrubbing Data

In Chapter 3, we looked at the first step of the OSEMN model for data science: how to obtain data from a variety of sources. It's not uncommon for this data to have missing values, inconsistencies, errors, weird characters, or uninteresting columns. Sometimes we only need a specific portion of the data. And sometimes we need the data to be in a different format. In those cases, we have to clean, or scrub, the data before we can move on to the third step: exploring data.

The data we obtained in Chapter 3 can come in a variety of formats. The most common ones are plain text, CSV, JSON, and HTML/XML. Because most command-line tools operate on one format only, it is worthwhile to be able to convert data from one format to another. CSV, which is the main format we're working with in this chapter, is actually not the easiest format to work with. Many CSV data sets are broken or incompatible with each other because there is no standard syntax, unlike XML and JSON.

Once our data is in the format we want it to be, we can apply common scrubbing operations. These include filtering, replacing, and merging data. The command line is especially well-suited for these kinds of operations, as there exist many powerful command-line tools that are optimized for handling large amounts of data. Tools that we'll discuss in this chapter include classic ones such as cut (Ihnat, MacKenzie, & Meyering, 2012) and sed (Fenlason, Lord, Pizzini, & Bonzini, 2012), and newer ones such as jq (Dolan, 2014) and csvgrep (Groskopf, 2014).

The scrubbing tasks that we discuss in this chapter not only apply to the input data. Sometimes we also need to reformat the output of some command-line tools. For example, to transform the output of uniq -c to a CSV data set, we could use awk (Brennan, 1994) and header:

    $ echo 'foo\nbar\nfoo' | sort | uniq -c | sort -nr
          2 foo
          1 bar
    $ echo 'foo\nbar\nfoo' | sort | uniq -c | sort -nr |
    > awk '{print $2","$1}' | header -a value,count
    value,count
    foo,2
    bar,1

If your data requires more functionality than what is offered by (a combination of) these command-line tools, you can use csvsql.
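The `header` tool used in the sample is one of the author's own helpers and may not be installed outside the Data Science Toolbox. As a fallback sketch, assuming only POSIX awk and a portable `printf` in place of `echo 'foo\nbar\nfoo'`, the same uniq-to-CSV conversion can be done with awk alone:

```shell
# Turn the output of `uniq -c` ("count value" pairs) into a CSV data set,
# emitting the header row from awk itself instead of the `header` tool.
printf 'foo\nbar\nfoo\n' | sort | uniq -c | sort -nr |
  awk 'BEGIN { print "value,count" } { print $2 "," $1 }'
```

This prints the same three lines as the sample: `value,count`, `foo,2`, `bar,1`.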

