Introduction to Data Science: Lesson 1

Intro to Data Science
Lesson 1 Notes

Introduction

Introduction to Data Science

Hi, and welcome to Introduction to Data Science. My name's Dave, and I'll be the instructor for this course. I've worked as a data scientist in Silicon Valley, most recently at a small company called Yub and before that at a company called TrialPay. I'm formally trained as a physicist, and I originally became interested in data science because I love the idea of improving the quality of people's lives or building really cool products by using data and mathematics.

In this lesson, we'll discuss data science at a high level. Together we'll find out what data science is and discuss what skills are required to be a data scientist. We'll also hear from a bunch of other data scientists about interesting projects they've worked on, and discuss how data science is being used to solve a bunch of different problems. This lesson in particular is going to be a little bit different from the others. We're not going to build as much, because I think it's important to understand data science at a high level before we dive into the details. Alright, I'm really excited about this course, so why don't we get started.

What Is a Data Scientist

People have many different conceptions of what data scientists do. Some might say that a data scientist is just a data analyst who lives in California, while others might say that a data scientist is a person who's better at statistics than any software engineer, and better at software engineering than any statistician. As you can see, definitions vary wildly from place to place and from person to person.

Quiz: What Is a Data Scientist

So before we get started, let me ask you a question. What do you think data scientists do in their day-to-day work? Type your thoughts in the text box below. Don't worry, there are no right or wrong answers, and this quiz will not be graded.

Answer:

Let me tell you my perspective.
From personal experience, data scientists today are people who have a blend of many different skills. This Venn diagram shows a definition of a data scientist that I like a lot. A data scientist is someone who knows mathematics and statistics, which allows them to identify interesting insights in a sea of data. They also have the programming skills to code up statistical models and get data from a variety of different data sources. Furthermore, a data scientist is someone who knows how to ask the right questions and translate those questions into a sound analysis. After doing the analysis, they have the communication skills to report their findings in a way that people can easily understand. In other words, data scientists have the ability to perform complicated analysis on huge data sets. Once they've done this, they also have the ability to write and make informative graphs to communicate their findings to others.

What Does a Data Scientist Do

Here are some things that a data scientist may do in his or her daily work. They might wrangle data: that is, collect data from the real world, process it, and create a data set that can be analyzed. Once they have a data set, they may analyze trends in the existing data or try to make data-driven predictions about the future using the data at hand. Based on these models or predictions, they can not only build data-driven products but also communicate their findings to other data scientists and the general public via visualizations, reports, or blog posts. But hey, this is my point of view. Why don't we talk to some other data scientists and hear their thoughts.

Pi Chuan Introduction

My name is Pi-Chuan Chang. I've been doing computer science ever since college. I did a CS PhD at Stanford, and I worked at Google for four years. Now I'm at a startup called AltSchool.

Pi Chuan - What Is Data Science

Now that I think about it, I have been doing data science since the days I was in Taiwan doing a master's in speech recognition.
The way we would understand speech is to collect a lot of data and understand how to model things like a phoneme in speech. Understanding people's language processing requires a lot of data collection as well. And at Google, which is a company that collects a lot of data, I also did personalization, which requires a lot of data to understand a person's behavior. So that, to me, is data science: using data to build a useful model, or to understand a particular pattern that is then useful later on for other software applications.

Gabor Introduction

So my name is Gabor Szabo, I work at Twitter, and I am a data scientist. I actually come from a background in the natural sciences. I did statistical physics, and I have a PhD in statistical physics and everything that goes with that. I was looking at a lot of big systems as interactions of the very small entities that compose them. Later on I did complex network research. That means we have interactions again: imagine a big gas composed of molecules, but here, instead, we have humans interacting with each other through social networks and through mobile communication networks. That was the main focus of my research all those years.

Gabor - What is Data Science

That's a great question. So what do data scientists do? I think it's really hard to pinpoint exactly what they do, because it's going to be tailored to the application area they work in. But in general, what they do is take data and find meaning in the data. And what that meaning is will be geared towards what they would like to explain. It could be about a particular company or a project, or they could be looking for some particular signal. I think in general, in my mind, what data science does is use this data.
Data science uses data to explain, and perhaps predict, behavior, be it human behavior or the behavior of a more machine-generated system; it could be anything like that.

Quiz: Basic Data Scientist Skills

Just to recap, let me ask you a quick question. What does it mean for a data scientist to have substantive expertise, and why is it important? Type your answer in the box below. Don't worry, your response won't be graded.

Answer:

As we discussed earlier in this lesson, a data scientist needs to have substantive expertise. What does that mean? Well, typically it means that a data scientist knows which questions to ask, can interpret the data well, and understands the structure of the data. You can imagine that a data scientist needs to know about the problem they're solving. For example, if you are solving an online advertising problem, you want to make sure you understand what types of people are coming to your website, how they are interacting with the website, and what the different data mean. That can help you ask the right questions, like: are people falling off and not completing our ads at a certain point in the flow, or do people complete more ads at a certain time of the day? You would also, then, be very familiar with how the data is stored and structured, and that could help you work more efficiently and more effectively. This is why it's important for a data scientist to have substantive expertise.

It's important to note that data scientists usually work in teams, so it's normal for data scientists to be stronger in some areas and weaker in others. So even if you, as a data scientist, don't have tons of substantive expertise, if you have great hacking skills or know a lot of statistics you can still be a valuable member of a data science team.

Problems Solved by Data Science

Now that you have a better idea of what data science is and what data scientists do, let's talk about how data science can be applied across a wide spectrum of industries.
You might have signed up for this class under the notion that if you become a data scientist, you'll end up working for a Silicon Valley startup. While it's true that most tech companies do employ data scientists, data science can also be used to solve problems in many different fields. What are some examples of the types of problems being solved using data science? Well, for one, Netflix uses collaborative filtering algorithms to recommend movies to users based on things they've previously watched. Also, elements of many popular social media websites are powered by data science: things like recommending new connections on LinkedIn, constructing your Facebook news feed, or suggesting new people to follow on Twitter. Many other online services or apps, such as dating websites like OKCupid or ride-sharing services like Uber, use the vast amounts of user data available to them not only to customize and improve the user experience, but also to publish interesting anthropological findings about people's behavior in the offline world on their blogs.

Okay, so I know what you're thinking. So far, all of these seem like problems you'd expect data scientists to solve, but data scientists work in many other domains too. Data science concepts are integral to processing and modeling data in the field of bioinformatics, where scientists are working on projects like annotating genomes and analyzing data sequences. This past summer, Data Science for Social Good fellows in Chicago worked on a project attempting to solve Chicago's bus crowding issues using data. In addition, physicists use data science concepts when building a 100-terabyte database of astronomical data collected by the Sloan Digital Sky Survey. This one's cool: analyzing electronic medical records allowed the city of Camden, New Jersey to save enormous amounts of money by targeting their efforts towards specific buildings accounting for a majority of emergency admissions.
Finally, NBA teams like the Toronto Raptors are installing a new technology, SportVU cameras, on their basketball courts. These collect huge amounts of data on players' movements and playing styles, which helps teams to better analyze game trends and improve coaching decisions. You've probably noticed by now that data science is making an impact in areas far and wide. Data science isn't simply a trendy new way to think about tech problems. It's a tool that can be used to solve problems in a variety of fields. Data scientists are working at Silicon Valley start-ups to enrich our online experiences, but they're also doing important work in our cities, in our laboratories, and in our sports stadiums.

Pandas

As you can imagine, since data science is being deployed in such a wide range of fields, data scientists use many different tools depending on the task at hand. One of the most versatile and ubiquitous is a Python package called Pandas, which we'll use to handle and manipulate data during this course. You might wonder why we'll be using Pandas as opposed to another tool. Pandas allows us to structure and manipulate our data in ways that are particularly well suited for data analysis. If you happen to be familiar with the scripting language R, Pandas takes a lot of the best elements from R and implements them in Python.

Dataframes

First of all, data in Pandas is often contained in a structure called a dataframe. A dataframe is a two-dimensional labeled data structure with columns which can be of different types if necessary, for example string, int, float, and Boolean. You can think of a dataframe as being similar to an Excel spreadsheet. We'll talk about making dataframes in a bit. For now, here's what an example dataframe might look like, using data describing passengers on the Titanic and whether or not they survived the Titanic's tragic collision with an iceberg. Note that there are numerous columns: name, age, fare, and survived.
These columns have different data types. There are also some Not-a-Number (NaN) entries, which happen when we don't specify a value. There are a bunch of cool things we can do with this dataframe. Let's jump to the command line. Say that I had already loaded this data into a dataframe called df. We can operate on specific columns by calling on them as if they were keys in a dictionary, for example df['name']. And we can select specific rows by using the dataframe object's loc indexer and passing the row index label in square brackets, for example df.loc['a'].

Create a New Dataframe

Pandas also allows us to operate on a dataframe in a vectorized, item-by-item way. What does it mean to operate on a dataframe in a vectorized way? Well, first let's create a new dataframe. Note that first I create a dictionary where the keys are going to be my column names and the values are Series holding each column's values along with the indexes of the rows where those values should appear. In order to make a dataframe, I can simply say df = DataFrame(d) for this dictionary d. Let's see what this dataframe looks like. We can call df.apply and pass in some arbitrary function, in this case numpy.mean, to perform that function on every single column in the dataframe. So when we call df.apply(numpy.mean), what we get back is the mean of each column in our dataframe df. There are also some operations that simply cannot be vectorized in this way, that is, take a NumPy array as their input and then return an array or a value. So we can also call map on particular columns, or applymap on entire dataframes. These methods also accept functions, but functions that take in a single value and return a single value. For example, if I were to type df['one'].map(lambda x: x >= 1), what I get back is whether or not each value in the column 'one' is greater than or equal to 1.
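The steps just described can be sketched in a few lines of Python. This is a minimal illustration rather than the instructor's actual code: the dictionary name d, the column names 'one' and 'two', the row labels, and all of the values are assumptions chosen for the example.

```python
import numpy as np
from pandas import DataFrame, Series

# Hypothetical data: keys become column names, and each Series
# carries the row labels at which its values should appear.
rows = ["a", "b", "c", "d"]
d = {
    "one": Series([0.5, 1.0, 2.0, 3.5], index=rows),
    "two": Series([1, 2, 3, 4], index=rows),
}
df = DataFrame(d)

# Select a column like a dictionary key, and a row by label with .loc[...].
column_one = df["one"]
row_a = df.loc["a"]

# apply runs a function on each column; here it yields one mean per column.
means = df.apply(np.mean)

# map runs a single-value function on every element of one column.
at_least_one = df["one"].map(lambda x: x >= 1)

print(df)
print(means)
print(at_least_one)
```

Note that loc is indexed with square brackets rather than called like a function, and that map here returns a Series of Booleans with the same row labels as the original column.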
Now say that I were to call df.applymap(lambda x: x >= 1). What this function returns is whether or not each value in the dataframe df is greater than or equal to 1. This is just the tip of the iceberg when it comes to Pandas' functionality. If you're interested in reading more about what the library can do, you should check out the full documentation at the URL contained in the instructor notes. Now we know some of the very basics when it comes to handling data, but how do we acquire the data that we wish to handle and analyze?

Lesson Project - Titanic Data

Alright. Before we get started on the class project, this lesson's assignment will allow you to get comfortable with the type of work that data scientists do, using a small and classic data set. This data set describes the passengers on the Titanic and a bunch of information about them: for example, what class they were in, whether they were male or female, how old they were, et cetera. Over the course of the assignment, you'll build a few different models. The models will start out simple, but they'll get increasingly complex, to see whether, using data science, we can predict who will survive and who won't survive the Titanic tragedy. This may sound complicated, but don't worry. We'll give you plenty of help. This assignment is meant to get you in the habit of thinking like a data scientist.

Class Project

Through the class project, you'll investigate New York City subway ridership as a data scientist might. First, you'll pull some publicly available data on subway ridership, and also on New York weather conditions, using the New York MTA website and the Weather Underground API. Then, you'll answer some questions about subway ridership using statistics and machine learning. Does the weather influence how many people ride the subway? Is the subway busier at certain times than others? Can we predict subway ridership?
Finally, you'll develop some charts and graphics that communicate your findings, and synthesize everything into a cohesive write-up that your friends or family might find useful and informative. This may sound daunting, but we'll be going through this step by step, and learning how to use the necessary tools as we go along.

Pi Chuan's Advice for Aspiring Students

So for me, the reason I came into this field of data science is my passion for something specific, which is natural language. When I observe the people around me who know about data science, I think one thing that's very important is that they have a passion for the particular data they're looking at, like natural language or speech recognition data. Or some people are just very interested in patterns in data. When they see some data, they try to calculate the mean of the data, the variance of the data. It comes naturally to them because they want to find patterns in the data. So I think for anyone who wants to become a data scientist, it's good to think about what kind of data you are interested in working with, and start with that. Then later on, when you have this skill of analyzing data, you can apply it to any other kind of data.

Gabor's Advice for Aspiring Data Scientists

I think the most important thing for aspiring data scientists to keep in mind is that they should have a very curious mind. They should have the ability to ask questions, and to formulate these questions as they pertain to them, as they would see these questions being raised in their own lives. So if there's a problem they see with the pieces they are working with, or with the project they're working on, they should try to ask these questions in terms they can understand for themselves. And once they understand what the gist of the question is, then they can go and use algorithms.
Obviously it helps tremendously if you have experience with all the algorithms that are out there to attack these questions. But I think the most important ability is to have the mindset of a data scientist, and you can obviously improve throughout your career if you have this inquisitive mindset, where you are trying to ask the right questions. While you are trying to see what is important, you should also have an overview of what kind of data can support your conclusions, and draw conclusions with the help of the algorithms that you are going to use to solve these problems.

Recap of Lesson 1

To recap today's lesson, data science is a discipline that incorporates methods and expertise from a variety of different fields to glean insights from large data sets, and then use those insights to build data-driven products or to create action items. There's no cookie-cutter way to become a data scientist, but a lot of data scientists have a very strong background in mathematics and statistics, and also have the programming skills to handle these data sets and to perform analysis. Currently, data science is most closely associated with the tech field, but more and more, data science is being used all over the world in a variety of different fields in order to solve new and old problems more efficiently and effectively.

Now I know what you're thinking. This sounds awesome, data science sounds really cool, I want to work on a project, get me some data. Unfortunately, data seldom comes in the formats that we want. It might be in a format we've never seen, it might come from a weird source, and there might be missing or erroneous values. Because of this, any data scientist worth their salt is skilled at acquiring data and then massaging it into a workable format. That's what we're going to be covering in the next lesson.



Teaching in these new programs has significant overlap in curricular subject matter with traditional statistics courses; in general, though, the new initiatives steer away from close involvement with academic statistics departments.

This paper reviews some ingredients of the current “Data Science moment”, including recent commentary about data science in the popular media, and about how/whether Data Science is really different from Statistics.

The now-contemplated field of Data Science amounts to a superset of the fields of statistics and machine learning which adds some technology for ‘scaling up’ to ‘big data’. This chosen superset is motivated by commercial rather than intellectual developments. Choosing in this way is likely to miss out on the really important intellectual event of the next fifty years.

Because all of science itself will soon become data that can be mined, the imminent revolution in Data Science is not about mere ‘scaling up’, but instead the emergence of scientific studies of data analysis science-wide. In the future, we will be able to predict how a proposal to change data analysis workflows would impact the validity of data analysis across all of science, even predicting the impacts field-by-field.

Drawing on work by Tukey, Cleveland, Chambers and Breiman, I present a vision of data science based on the activities of people who are ‘learning from data’, and I describe an academic field dedicated to improving that activity in an evidence-based manner. This new field is a better academic enlargement of statistics and machine learning than today’s Data Science Initiatives, while being able to accommodate the same short-term goals.

Based on a presentation at the Tukey Centennial workshop, Princeton NJ, Sept 18 2015.

Contents

1 Today’s Data Science Moment
2 Data Science ‘versus’ Statistics
  2.1 The ‘Big Data’ Meme
  2.2 The ‘Skills’ Meme
  2.3 The ‘Jobs’ Meme
  2.4 What here is real?
  2.5 A Better Framework
3 The Future of Data Analysis, 1962
4 The 50 years since FoDA
  4.1 Exhortations
  4.2 Reification
5 Breiman’s ‘Two Cultures’, 2001
6 The Predictive Culture’s Secret Sauce
  6.1 The Common Task Framework
  6.2 Experience with CTF
  6.3 The Secret Sauce
  6.4 Required Skills
7 Teaching of today’s consensus Data Science
8 The Full Scope of Data Science
  8.1 The Six Divisions
  8.2 Discussion
  8.3 Teaching of GDS
  8.4 Research in GDS
    8.4.1 Quantitative Programming Environments: R
    8.4.2 Data Wrangling: Tidy Data
    8.4.3 Research Presentation: Knitr
  8.5 Discussion
9 Science about Data Science
  9.1 Science-Wide Meta Analysis
  9.2 Cross-Study Analysis
  9.3 Cross-Workflow Analysis
  9.4 Summary
10 The Next 50 Years of Data Science
  10.1 Open Science takes over
  10.2 Science as data
  10.3 Scientific Data Analysis, tested Empirically
    10.3.1 DJ Hand (2006)
    10.3.2 Donoho and Jin (2008)
    10.3.3 Zhao, Parmigiani, Huttenhower and Waldron (2014)
  10.4 Data Science in 2065
11 Conclusion

Acknowledgements: Special thanks to Edgar Dobriban, Bradley Efron, and Victoria Stodden for comments on Data Science and on drafts of this mss. Thanks to John Storey, Amit Singer, Esther Kim, and all the other organizers of the Tukey Centennial at Princeton, September 18, 2015. Belated thanks to my undergraduate statistics teachers: Peter Bloomfield, Henry Braun, Tom Hettmansperger, Larry Mayer, Don McNeil, Geoff Watson, and John Tukey. Supported in part by NSF DMS-1418362 and DMS-1407813.

Acronym   Meaning
ASA       American Statistical Association
CEO       Chief Executive Officer
CTF       Common Task Framework
DARPA     Defense Advanced Research Projects Agency
DSI       Data Science Initiative
EDA       Exploratory Data Analysis
FoDA      The Future of Data Analysis, 1962
GDS       Greater Data Science
HC        Higher Criticism
IBM       IBM Corp.
IMS       Institute of Mathematical Statistics
IT        Information Technology (the field)
JWT       John Wilder Tukey
LDS       Lesser Data Science
NIH       National Institutes of Health
NSF       National Science Foundation
PoMC      The Problem of Multiple Comparisons, 1953
QPE       Quantitative Programming Environment
R         R, a system and language for computing with data
S         S, a system and language for computing with data
SAS       System and language produced by SAS, Inc.
SPSS      System and language produced by SPSS, Inc.
VCR       Verifiable Computational Result

Table 1: Frequent Acronyms

1 Today’s Data Science Moment

On Tuesday September 8, 2015, as I was preparing these remarks, the University of Michigan announced a $100 Million “Data Science Initiative” (DSI), ultimately hiring 35 new faculty. The university’s press release contains bold pronouncements:

    “Data science has become a fourth approach to scientific discovery, in addition to experimentation, modeling, and computation,” said Provost Martha Pollack.

The web site for DSI gives us an idea what Data Science is:

    “This coupling of scientific discovery and practice involves the collection, management, processing, analysis, visualization, and interpretation of vast amounts of heterogeneous data associated with a diverse array of scientific, translational, and interdisciplinary applications.”

This announcement is not taking place in a vacuum. A number of DSI-like initiatives started recently, including

(A) Campus-wide initiatives at NYU, Columbia, MIT, ...
(B) New Master’s Degree programs in Data Science, for example at Berkeley, NYU, Stanford, ...

There are new announcements of such initiatives weekly.[1]

2 Data Science ‘versus’ Statistics

Many of my audience at the Tukey Centennial where these remarks were presented are applied statisticians, and consider their professional career one long series of exercises in the above “...
collection, management, processing, analysis, visualization, and interpretation of vast amounts of heterogeneous data associated with a diverse array of ... applications.” In fact, some presentations at the Tukey Centennial were exemplary narratives of “... collection, management, processing, analysis, visualization, and interpretation of vast amounts of heterogeneous data associated with a diverse array of ... applications.”

To statisticians, the DSI phenomenon can seem puzzling. Statisticians see administrators touting, as new, activities that statisticians have already been pursuing daily, for their entire careers; and which were considered standard already when those statisticians were back in graduate school.

The following points about the U of M DSI will be very telling to such statisticians:

• U of M’s DSI is taking place at a campus with a large and highly respected Statistics Department.
• The identified leaders of this initiative are faculty from the Electrical Engineering and Computer Science Department (Al Hero) and the School of Medicine (Brian Athey).
• The inaugural symposium has one speaker from the Statistics department (Susan Murphy), out of more than 20 speakers.

[1] For an updated interactive geographic map of degree programs, see http://data-science-university-programs.silk.co

Seemingly, statistics is being marginalized here; the implicit message is that statistics is a part of what goes on in data science but not a very big part. At the same time, many of the concrete descriptions of what the DSI will actually do will seem to statisticians to be bread-and-butter statistics.
Statistics is apparently the word that dare not speak its name in connection with such an initiative![2]

Searching the web for more information about the emerging term ‘Data Science’, we encounter the following definitions from the Data Science Association’s “Professional Code of Conduct”[3]:

    “Data Scientist” means a professional who uses scientific methods to liberate and create meaning from raw data.

To a statistician, this sounds an awful lot like what applied statisticians do: use methodology to make inferences from data. Continuing:

    “Statistics” means the practice or science of collecting and analyzing numerical data in large quantities.

To a statistician, this definition of statistics seems already to encompass anything that the definition of Data Scientist might encompass, but the definition of Statistician seems limiting, since a lot of statistical work is explicitly about inferences to be made from very small samples; this has been true for hundreds of years, really. In fact Statisticians deal with data however it arrives, big or small.

The statistics profession is caught at a confusing moment: the activities which preoccupied it over centuries are now in the limelight, but those activities are claimed to be bright shiny new, and carried out by (although not actually invented by) upstarts and strangers. Various professional statistics organizations are reacting:

• Aren’t we Data Science?
  Column of ASA President Marie Davidian in AmStat News, July 2013[4]
• A grand debate: is data science just a ‘rebranding’ of statistics?
  Martin Goodson, co-organizer of the Royal Statistical Society meeting May 11, 2015 on the relation of Statistics and Data Science, in internet postings promoting that event.
• Let us own Data Science.
  IMS Presidential address of Bin Yu, reprinted in IMS bulletin October 2014[5]

[2] At the same time, the two largest groups of faculty participating in this initiative are from EECS and Statistics.
Many of the EECS faculty publish avidly in academic statistics journals; I can mention Al Hero himself, Raj Rao Nadakuditi and others. The underlying design of the initiative is very sound and relies on researchers with strong statistics skills. But that’s all hidden under the hood.
[3] /code-of-conduct.html
[4] /blog/2013/07/01/datascience/
[5] /2014/10/ims-presidential-address-let-us-own-data-science/

One doesn’t need to look far to see click-bait capitalizing on the befuddlement about this new state of affairs:

• Why Do We Need Data Science When We’ve Had Statistics for Centuries?
  Irving Wladawsky-Berger, Wall Street Journal, CIO report, May 2, 2014
• Data Science is statistics. When physicists do mathematics, they don’t say they’re doing number science. They’re doing math. If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics. ... You may not like what some statisticians do. You may feel they don’t share your values. They may embarrass you. But that shouldn’t lead us to abandon the term “statistics”.
  Karl Broman, Univ. Wisconsin[6]

On the other hand, we can find pointed comments about the (near-)irrelevance of statistics:

• Data Science without statistics is possible, even desirable.
  Vincent Granville, at the Data Science Central Blog[7]
• Statistics is the least important part of data science.
  Andrew Gelman, Columbia University[8]

Clearly, there are many visions of Data Science and its relation to Statistics. In discussions one recognizes certain recurring ‘Memes’. We now deal with the main ones in turn.

2.1 The ‘Big Data’ Meme

Consider the press release announcing the University of Michigan Data Science Initiative with which this article began. The University of Michigan President, Mark Schlissel, uses the term ‘big data’ repeatedly, touting its importance for all fields and asserting the necessity of Data Science for handling such data. Examples of this tendency are near-ubiquitous.

We can immediately reject ‘big data’ as a criterion for meaningful
distinction between statistics and data science[9].

• History. The very term ‘statistics’ was coined at the beginning of modern efforts to compile census data, i.e. comprehensive data about all inhabitants of a country, for example France or the United States. Census data are roughly the scale of today’s big data; but they have been around more than 200 years! A statistician, Hollerith, invented the first major advance in big data: the punched card reader to allow efficient compilation of an exhaustive US census.[10] This advance led to formation of the IBM corporation which eventually became a force pushing computing and data to ever larger scales. Statisticians have been comfortable with large datasets for a long time, and have been holding conferences gathering together experts in ‘large datasets’ for several decades, even as the definition of large was ever expanding.[11]
• Science. Mathematical statistics researchers have pursued the scientific understanding of big datasets for decades. They have focused on what happens when a database has a large number of individuals or a large number of measurements or both. It is simply wrong to imagine that they are not thinking about such things, in force, and obsessively. Among the core discoveries of statistics as a field were sampling and sufficiency, which allow one to deal with very large datasets extremely efficiently. These ideas were discovered precisely because statisticians care about big datasets.

The data-science = ‘big data’ framework is not getting at anything very intrinsic about the respective fields.[12]

[6] https:///2013/04/05/data-science-is-statistics/
[7] /profiles/blogs/data-science-without-statistics-is-possible-even-desirable
[8] /2013/11/14/statistics-least-important-part-data-science/
[9] One sometimes encounters also the statement that statistics is about ‘small datasets’, while Data Science is about ‘big datasets’. Older statistics textbooks often did use quite small datasets in order to allow students to make hand calculations.

2.2 The ‘Skills’ Meme

Computer Scientists seem
to have settled on the following talking points:
(a) data science is concerned with really big data, which traditional computing resources could not accommodate
(b) data science trainees have the skills needed to cope with such big datasets.

The CS evangelists are thus doubling down on the ‘Big Data’ meme,¹³ by layering a ‘Big Data skills meme’ on top.

What are those skills? Many would cite mastery of Hadoop, a variant of Map/Reduce for use with datasets distributed across a cluster of computers. Consult the standard reference Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale, 4th Edition by Tom White. There we learn at great length how to partition a single abstract dataset across a large number of processors. Then we learn how to compute the maximum of all the numbers in a single column of this massive dataset. This involves computing the maximum over the sub-database located in each processor, followed by combining the individual per-processor maxima across all the many processors to obtain an overall maximum. Although the functional being computed in this example is dead-simple, quite a few skills are needed in order to implement the example at scale.

¹⁰ /2014/10/ims-presidential-address-let-us-own-data-science/
¹¹ During the Centennial workshop, one participant pointed out that John Tukey’s definition of ‘Big Data’ was: “anything that won’t fit on one device”. In John’s day the device was a tape drive, but the larger point is true today, where device now means ‘commodity file server’.
¹² It may be getting at something real about the Masters’ degree programs, or about the research activities of individuals who will be hired under the new spate of DSI’s.
¹³ ...which we just dismissed!

Lost in the hoopla about such skills is the embarrassing fact that once upon a time, one could do such computing tasks, and even much more ambitious ones, much more easily than in this fancy new setting! A dataset could fit on a single processor, and the global maximum of the array ‘x’ could be computed with the
six-character code fragment ‘max(x)’ in, say, Matlab or R. More ambitious tasks, like large-scale optimization of a convex function, were easy to set up and use. In those less-hyped times, the skills being touted today were unnecessary. Instead, scientists developed skills to solve the problem they were really interested in, using elegant mathematics and powerful quantitative programming environments modeled on that math. Those environments were the result of 50 or more years of continual refinement, moving ever closer towards the ideal of enabling immediate translation of clear abstract thinking to computational results.

The new skills attracting so much media attention are not skills for better solving the real problem of inference from data; they are coping skills for dealing with organizational artifacts of large-scale cluster computing. The new skills cope with severe new constraints on algorithms posed by the multiprocessor/networked world. In this highly constrained world, the range of easily constructible algorithms shrinks dramatically compared to the single-processor model, so one inevitably tends to adopt inferential approaches which would have been considered rudimentary or even inappropriate in olden times. Such coping consumes our time and energy, deforms our judgements about what is appropriate, and holds us back from data analysis strategies that we would otherwise eagerly pursue.

Nevertheless, the scaling cheerleaders are yelling at the top of their lungs that using more data deserves a big shout.

2.3 The ‘Jobs’ Meme

Big data enthusiasm feeds off the notable successes scored in the last decade by brand-name global Information Technology (IT) enterprises, such as Google and Amazon, successes currently recognized by investors and CEOs. A hiring ‘bump’ has ensued over the last 5 years, in which engineers with skills in both databases and statistics were in heavy demand. In ‘The Culture of Big Data’ [1], Mike Barlow summarizes the situation:

According to Gartner, 4.4 million big data jobs will be
created by 2014 and only a third of them will be filled. Gartner’s prediction evokes images of “gold rush” for big data talent, with legions of hardcore quants converting their advanced degrees into lucrative employment deals.

While Barlow suggests that any advanced quantitative degree will be sufficient in this environment, today’s Data Science initiatives per se imply that traditional statistics degrees are not enough to land jobs in this area; formal emphasis on computing and database skills must be part of the mix.¹⁴

We don’t really know. The booklet ‘Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work’ [20] points out that

Despite the excitement around “data science”, “big data”, and “analytics”, the ambiguity of these terms has led to poor communication between data scientists and those who seek their help.

¹⁴ Of course statistics degrees require extensive use of computers, but often omit training in formal software development and formal database theory.

Yanir Seroussi’s blog¹⁵ opines that “there are few true data science positions for people with no work experience.”

A successful data scientist needs to be able to become one with the data by exploring it and applying rigorous statistical analysis ... But good data scientists also understand what it takes to deploy production systems, and are ready to get their hands dirty by writing code that cleans up the data or performs core system functionality ... Gaining all these skills takes time [on the job].

Barlow implies that would-be data scientists may face years of further skills development post masters degree, before they can add value to their employer’s organization. In an existing big-data organization, the infrastructure of production data processing is already set in stone. The databases, software, and workflow management taught in a given Data Science Masters program are unlikely to be the same as those used by one specific employer. Various compromises and constraints were settled upon by the hiring organizations
and for a new hire, contributing to those organizations is about learning how to cope with those constraints and still accomplish something.

Data Science degree programs do not actually know how to satisfy the supposedly voracious demand for graduates. As we show below, the special contribution of a data science degree over a statistics degree is additional information technology training. Yet hiring organizations face difficulties making use of the specific information technology skills being taught in degree programs. In contrast, Data Analysis and Statistics are broadly applicable skills that are portable from organization to organization.

2.4 What here is real?

We have seen that today’s popular media tropes about Data Science don’t withstand even basic scrutiny. This is quite understandable: writers and administrators are shocked out of their wits. Everyone believes we are facing a zero-th order discontinuity in human affairs.

If you studied a tourist guidebook in 2010, you would have been told that life in villages in India (say) had not changed in thousands of years. If you went into those villages in 2015, you would see that many individuals there now have mobile phones and some have smartphones. This is of course the leading edge fundamental change. Soon, 8 billion people will be connected to the network, and will therefore be data sources, generating a vast array of data about their activities and preferences.

The transition to universal connectivity is very striking; it will, indeed, generate vast amounts of commercial data. Exploiting that data seems certain to be a major preoccupation of commercial life in coming decades.

2.5 A Better Framework

However, a science doesn’t just spring into existence simply because a deluge of data will soon be filling telecom servers, and because some administrators think they can sense the resulting trends in hiring and government funding.

¹⁵ /2014/10/23/what-is-data-science/

Fortunately, there is a solid case for some entity called ‘Data Science’ to be
created, which would be a true science: facing essential questions of a lasting nature and using scientifically rigorous techniques to attack those questions.

Insightful statisticians have for at least 50 years been laying the groundwork for constructing that would-be entity as an enlargement of traditional academic statistics. This would-be notion of Data Science is not the same as the Data Science being touted today, although there is significant overlap. The would-be notion responds to a different set of urgent trends, intellectual rather than commercial. Facing the intellectual trends needs many of the same skills as facing the commercial ones and seems just as likely to match future student training demand and future research funding trends.

The would-be notion takes Data Science as the science of learning from data, with all that this entails. It is matched to the most important developments in science which will arise over the coming 50 years. As science itself becomes a body of data that we can analyze and study, there are staggeringly large opportunities for improving the accuracy and validity of science, through the scientific study of data analysis.

Understanding these issues gives Deans and Presidents an opportunity to rechannel the energy and enthusiasm behind today’s Data Science movement towards lasting, excellent programs canonicalizing a new scientific discipline.

In this paper, I organize insights that have been published over the years about this new would-be field of Data Science, and put forward a framework for understanding its basic questions and procedures. This framework has implications both for teaching the subject and for doing scientific research about how data science is done and might be improved.

3 The Future of Data Analysis, 1962

This paper was prepared for the John Tukey centennial. More than 50 years ago, John prophesied that something like today’s Data Science moment would be coming. In “The Future of Data Analysis” [42], John deeply shocked his readers (academic
statisticians) with the following introductory paragraphs:¹⁶

For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. ... All in all I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.

This paper was published in 1962 in “The Annals of Mathematical Statistics”, the central venue for mathematically-advanced statistical research of the day. Other articles appearing in that journal at the time were mathematically precise and would present definitions, theorems, and proofs. John’s paper was instead a kind of public confession, explaining why he thought such research was too narrowly focused, possibly useless or harmful, and the research scope of statistics needed to be dramatically enlarged and redirected.

¹⁶ One questions why the journal even allowed this to be published! Partly one must remember that John was a Professor of Mathematics at Princeton, which gave him plenty of authority! Sir Martin Rees, the famous astronomer/cosmologist once quipped that “God invented space just so not everything would happen at Princeton”. JL Hodges Jr. of UC Berkeley was incoming editor of Annals of Mathematical Statistics, and deserves credit for publishing such a visionary but deeply controversial paper.

Peter Huber, whose scientific breakthroughs in robust estimation would soon appear in the same journal, recently commented about FoDA:

Half a century ago, Tukey, in an ultimately enormously influential paper redefined our subject ... [The paper] introduced the term “data analysis” as a name for what applied statisticians do, differentiating this term
from formal statistical inference. But actually, as Tukey admitted, he “stretched the term beyond its philology” to such an extent that it comprised all of statistics. Peter Huber (2010)

So Tukey’s vision embedded statistics in a larger entity. Tukey’s central claim was that this new entity, which he called ‘Data Analysis’, was a new science, rather than a branch of mathematics:

There are diverse views as to what makes a science, but three constituents will be judged essential by most, viz:
(a1) intellectual content,
(a2) organization in an understandable form,
(a3) reliance upon the test of experience as the ultimate standard of validity.
By these tests mathematics is not a science, since its ultimate standard of validity is an agreed-upon sort of logical consistency and provability.

As I see it, data analysis passes all three tests, and I would regard it as a science, one defined by a ubiquitous problem rather than by a concrete subject. Data analysis and the parts of statistics which adhere to it, must then take on the characteristics of a science rather than those of mathematics, ... These points are meant to be taken seriously.

Tukey identified four driving forces in the new science:

Four major influences act on data analysis today:
1. The formal theories of statistics
2. Accelerating developments in computers and display devices
3. The challenge, in many fields, of more and ever larger bodies of data
4. The emphasis on quantification in an ever wider variety of disciplines

John’s 1962 list is surprisingly modern, and encompasses all the factors cited today in press releases touting today’s Data Science initiatives. Shocking at the time was item #1, implying that statistical theory was only a (fractional!) part of the new science.

Tukey and Wilk 1969 compared this new science to established sciences and further circumscribed the role of Statistics within it:
