Table of Contents
- 1. What is statistics good for?
- 1.1 Statistics
- 2. What is machine learning?
- 2.1 Two main activities of machine learning
- 2.2 Some characteristics of ML
- 3. What is Software Engineering for Data Science?
- 3.1 Types of Software
- 4. The Structure of a Data Science Project
- 4.1 Five phases of a data science project
- 4.2 Two main goals of exploratory data analysis
- 4.3 There is another approach that can be taken
- 5. The outputs of a data science experiment
- 5.1 The type of the output
- 5.2 A few hallmarks of a good data science report
- 6. The four secrets of a successful data science experiment
- 7. Data Scientist Toolbox
- 8. Separating Hype from Value
1. What is statistics good for?
1.1 Statistics
- Descriptive statistics
Descriptive statistics includes exploratory data analysis, unsupervised learning, clustering and basic data summaries.
- Inference
Inference is the process of making conclusions about populations from samples.
- Prediction
Prediction overlaps quite a bit with inference, but modern prediction tends to have a different mindset.
- Experimental design
Experimental design is the act of controlling your experimental process to optimize the chance of arriving at sound conclusions.
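The difference between descriptive statistics and inference can be seen in a small sketch. The population below is simulated and the numbers (mean 50, standard deviation 10, sample size 100) are assumptions chosen only for illustration. The interval uses the normal approximation with the 1.96 quantile.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated measurements.
population = [random.gauss(50, 10) for _ in range(10_000)]

# The sample is what an analyst would actually get to observe.
sample = random.sample(population, 100)

# Descriptive statistics: summarize the data in hand.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Inference: use the sample to make a statement about the population mean.
# Approximate 95% interval based on the normal quantile 1.96.
standard_error = sample_sd / math.sqrt(len(sample))
interval = (
    sample_mean - 1.96 * standard_error,
    sample_mean + 1.96 * standard_error,
)

print(f"sample mean = {sample_mean:.2f}, sample sd = {sample_sd:.2f}")
print(f"approximate 95% interval: {interval[0]:.2f} to {interval[1]:.2f}")
```

The summary lines describe only the 100 observed values, while the interval is a claim about the 10,000 values that were not all observed.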
2. What is machine learning?
2.1 Two main activities of machine learning
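- Unsupervised learning: trying to uncover unobserved factors in the data.
- Supervised learning: using a collection of predictors and observed outcomes to build an algorithm that predicts the outcome when it is not observed.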
2.2 Some characteristics of ML
- the emphasis on predictions;
- evaluating results via prediction performance;
- having concern for overfitting but not model complexity per se;
- emphasis on performance;
- obtaining generalizability through performance on novel datasets;
- usually no superpopulation model specified;
- concern over performance and robustness.
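Evaluating results via prediction performance on novel data can be sketched as follows. This is a minimal illustration, not a recommended workflow: the data are simulated from an assumed relationship (y = 2x + 1 plus noise), the model is a plain least-squares line, and the helper names `fit_line` and `mean_squared_error` are hypothetical.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Simulated data (an assumption for illustration): y = 2x + 1 plus noise.
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 2 * x + 1 + random.gauss(0, 1)) for x in xs]

# Hold out observations that the model never sees during fitting.
random.shuffle(data)
train, test = data[:150], data[150:]


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(points, slope, intercept):
    """Average squared gap between observed and predicted outcomes."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


slope, intercept = fit_line(train)
train_mse = mean_squared_error(train, slope, intercept)
test_mse = mean_squared_error(test, slope, intercept)

print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
print(f"training MSE = {train_mse:.2f}, held-out MSE = {test_mse:.2f}")
```

The held-out error is the number that speaks to generalizability. A training error far below the held-out error would be a sign of overfitting.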
3. What is Software Engineering for Data Science?
3.1 Types of Software
- Just some code
  - That you wrote code at all is the first step.
  - Encapsulating automation with a loop or similar.
- Some sort of function
  - First level of abstraction; a defined "interface".
- Software package
  - API plus convenience for the user.
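The progression from loose code to a function can be sketched with a toy task. The temperature conversion and the name `fahrenheit_to_celsius` are assumptions made up for this illustration.

```python
# 1. Just some code: a one-off calculation on a single value.
temps_f = [68.0, 72.5, 75.2]
first_c = (temps_f[0] - 32) * 5 / 9

# 2. Automation with a loop: the same calculation applied to every value.
temps_c_loop = []
for temp in temps_f:
    temps_c_loop.append((temp - 32) * 5 / 9)


# 3. A function: the first level of abstraction, with a defined interface
#    (one temperature in Fahrenheit in, one temperature in Celsius out).
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from Fahrenheit to Celsius."""
    return (temp_f - 32) * 5 / 9


temps_c = [fahrenheit_to_celsius(temp) for temp in temps_f]
print(temps_c)

# 4. A software package would collect functions like this one in an
#    importable module with documentation, so that users only need
#    to know the interface, not the implementation.
```

Each step makes the code easier to reuse: the loop removes repetition, the function gives the logic a name and a contract, and a package makes that contract available to other people.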
4. The Structure of a Data Science Project
4.1 Five phases of a data science project
- question
- exploratory data analysis
- formal modeling
- interpretation
- communication
4.2 Two main goals of exploratory data analysis
- Are the data suitable for the question?
- Sketch the solution.
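A first check of whether the data are suitable might look like the sketch below. The records, the column names and the helpers `missing_fraction` and `complete_values` are all hypothetical. A real analysis would use the actual data set and usually a data-frame library.

```python
import statistics

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
    {"age": None, "income": 61000},
    {"age": 52, "income": 75000},
]


def missing_fraction(rows, column):
    """Share of rows in which the column is missing."""
    return sum(row[column] is None for row in rows) / len(rows)


def complete_values(rows, column):
    """Values of the column with missing entries dropped."""
    return [row[column] for row in rows if row[column] is not None]


# Basic suitability checks: how much is missing, and what range is covered?
for column in ("age", "income"):
    values = complete_values(records, column)
    print(
        column,
        f"missing={missing_fraction(records, column):.0%}",
        f"min={min(values)}",
        f"median={statistics.median(values)}",
        f"max={max(values)}",
    )
```

If a key variable is mostly missing, or its range does not cover the cases the question is about, that is worth knowing before any formal modeling starts.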
4.3 There is another approach that can be taken
Often a data set is available, but it is not immediately clear what it will be useful for. In that case it can help to start with exploratory data analysis: look at the data, summarize it, make some plots, see what is there, and generate interesting questions based on the data. This is sometimes called hypothesis generating, because it produces questions that were not already there.
5. The outputs of a data science experiment
5.1 The type of the output
- Reports
- Presentations
- Interactive web pages
- Apps
5.2 A few hallmarks of a good data science report
- Be clearly written
- Involve a narrative around the data
- Discuss the creation of the analytic dataset
- Have concise conclusions
- Omit unnecessary details
- Be reproducible
6. The four secrets of a successful data science experiment
- New knowledge is created.
- Decisions or policies are made based on the outcome of the experiment.
- A report, presentation or app with impact is created.
- It is learned that the data can’t answer the question being asked of it.
7. Data Scientist Toolbox
- Large-scale data sets
  - Hadoop
  - Spark
- Communicate with others
  - Slack
- Solve questions
  - Stack Overflow
- Reproducible or literate documentation
  - R Markdown
  - IPython notebooks
- Build data products quickly
  - Shiny
8. Separating Hype from Value
- What is the question you are trying to answer with the data?
- Do you have the data to actually answer that question?
- If you could answer the question, could you use the answer?