DSCI 510: Principles of Programming for Data Science
Final Project Guidelines
In the final project for this class, you will have the opportunity to apply the knowledge and programming skills you have learned to a real-world problem. Your project should focus on web scraping (or collecting data through APIs), data cleaning, analysis, and visualization using Python.
Final Project Due Date: December 19th, 2024 at 4pm PT
Final grade submission via the Grading and Roster System (GRS) for Fall 2024 is the week after December 19th, and we should have graded every project by then. We need to set some time aside to grade your projects, so we have to be strict about this deadline. Please refer to the Academic Calendar for the specific dates.
Final Project Submission via GitHub Classroom
To submit your final project assignment, you will need to accept the assignment on our GitHub Classroom (similar to the lab assignments). With the final assignment repository you will get a template where you can upload all of your files. To get started, you can accept the final project assignment here: https://classroom.github.com/a/7A rrid
Project Proposal
You may send a one-page proposal document (in PDF format) describing your final project. This proposal should include the following:
- Name of your final project and a short synopsis/description (1 paragraph max).
- What problem are you trying to solve, which question(s) are you trying to answer?
- How do you intend to collect the data and where on the web is it coming from?
- What type of data cleaning and/or analysis are you going to perform on the data?
- What kind of visualizations are you going to use to illustrate your findings?
There is no official due date for the proposal, but the sooner you send it to us, the sooner you will get feedback on it. We will provide feedback and suggest changes if required. This is usually to test the feasibility of the project and give you a sense of whether you need to scale back because it is too ambitious, or whether you need to do more work in order to improve your grade. Please upload the original proposal to the same repository as the other files of your final project.
Note: For faster processing, you can send an email to Gleb (gleb@isi.edu), Mia (osultan@usc.edu) or Zhivar (souratih@usc.edu) with the subject “DSCI 510: Final Project Proposal”. Please also upload your proposal document to the final project GitHub repository. The email should contain a link to your GitHub repository or the proposal.pdf file itself.
Project Goals and Steps
- Data Collection (20%)
You should identify websites or web resources from which you will get raw data for your project. You can either web-scrape data or collect data using publicly available APIs. This could include news articles, e-commerce websites, social media posts, weather data, or any other publicly available web content. This step should be fairly sophisticated so as to demonstrate the techniques you have learned in the class. Use multiple data sources to compare different data in your analysis. Using Python libraries like BeautifulSoup and requests, you should be able to write scripts to scrape data from the chosen websites. This step includes making HTTP requests, handling HTML parsing, and extracting relevant information.
Please note that if you need to collect data that changes over time, you might want to set up a script that runs every day and collects the data at a certain time of the day. That way you can collect enough data to run your analysis for the final project later.
We recommend that you scrape data from static websites, or use publicly available APIs. If you scrape data from dynamically generated pages, you might run into issues, as certain websites are not keen on giving away their data (think of sites like Google, Amazon, etc.).
Please note that some APIs are not free and you need to pay to use them. You should try to avoid those, because when we are grading your final project we should be able to replicate your code without paying for an API.
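As a rough illustration of the collection step, the sketch below extracts links from an HTML page and stores a date-stamped raw snapshot, which is one way a daily collection script could keep its files apart. It is only a sketch under stated assumptions: it uses the standard library (urllib.request and html.parser) as a stand-in so that it runs without extra installs, whereas in your project you would more likely use requests and BeautifulSoup. The function names, the User-Agent string, the sample HTML and the file naming scheme are all hypothetical.

```python
from datetime import date
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import Request, urlopen


class LinkParser(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch_html(url, timeout=10):
    """Download a page as text (hypothetical helper, not called in this sketch)."""
    request = Request(url, headers={"User-Agent": "dsci510-final-project"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html):
    """Parse an HTML string and return the links found in it."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


def save_raw(html, raw_dir, day=None):
    """Write one snapshot per day, e.g. <raw_dir>/page_2024-12-01.html."""
    day = day or date.today()
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    path = raw_dir / f"page_{day.isoformat()}.html"
    path.write_text(html, encoding="utf-8")
    return path


# A small inline page stands in for a real download.
sample = '<ul><li><a href="/a">First</a></li><li><a href="/b"> Second </a></li></ul>'
links = extract_links(sample)
print(links)  # [('/a', 'First'), ('/b', 'Second')]
```

In a real run you would pass the result of fetch_html(...) to both save_raw(...) and extract_links(...), so that the untouched page is kept as raw data alongside whatever you extract from it.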
- Data Cleaning (20%)
Once your data collection is complete, you will need to clean the data in order to be able to process it. This will involve handling missing values, cleaning HTML tags, removing duplicates, and converting data into a structured format for analysis in Python. If your raw data is not in English, you should attempt to translate the data into English as part of this step.
Depending on the size of your data, you can upload both raw and preprocessed data to the data folder in the repository of your final project.
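The cleaning operations named above (missing values, HTML tags, duplicates, structured output) can be sketched as follows. This is a minimal example under assumptions, not a required approach: the record fields title and price, the sample rows, and the choice to drop rows with a missing title are all hypothetical, and the standard library is used here where a project might equally use pandas.

```python
import csv
import html
import io
import re

TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Remove tags, decode entities such as &amp;, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


def clean_records(records):
    """Drop rows with a missing title, strip HTML, and remove duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        title = strip_html(record.get("title") or "")
        if not title:
            continue  # missing value: this sketch simply skips the row
        price_text = (record.get("price") or "").replace("$", "").strip()
        price = float(price_text) if price_text else None
        key = (title.lower(), price)
        if key in seen:
            continue  # duplicate of a row we already kept
        seen.add(key)
        cleaned.append({"title": title, "price": price})
    return cleaned


def to_csv(rows):
    """Serialize cleaned rows as CSV text (None becomes an empty cell)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


raw = [
    {"title": "<b>Blue &amp; Green Mug</b>", "price": "$12.50"},
    {"title": "Blue &amp; Green Mug", "price": "12.50"},  # duplicate after cleaning
    {"title": None, "price": "3.00"},                     # missing title
    {"title": "<i>Notebook</i>", "price": ""},            # missing price is kept as None
]
rows = clean_records(raw)
print(to_csv(rows))
```

Whether a row with a missing value should be dropped, kept, or filled in depends on your research question, so treat the rules in clean_records as placeholders for your own decisions.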
- Data Analysis (20%)
In this step, you will perform an analysis on the scraped data to gain insights or answer
- Data Visualization (20%)
Last but not least, you should create plots, graphs, or charts using Matplotlib, Seaborn, D3.js, ECharts or any other data visualization library to effectively communicate your findings. Visualizations created in this step could be static or interactive. If they are interactive, you need to describe this interaction and its added value in the final report. Our team should be able to replicate your interactive visualizations when we are grading your final projects.
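A static chart of the kind described above could be produced along these lines. This is a hedged sketch rather than a template: the category field, the sample rows, the axis labels and the function names are hypothetical. Matplotlib is imported inside plot_counts so that the aggregation part still runs where Matplotlib is not installed.

```python
from collections import Counter


def count_by_category(rows, key="category"):
    """Return (labels, counts), ordered from most to least frequent."""
    counts = Counter(row[key] for row in rows if row.get(key))
    ordered = counts.most_common()
    labels = [label for label, _ in ordered]
    values = [value for _, value in ordered]
    return labels, values


def plot_counts(labels, values, out_path):
    """Save a static bar chart to out_path (requires Matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(labels, values)
    ax.set_xlabel("Category")
    ax.set_ylabel("Number of items")
    ax.set_title("Items per category")
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)


rows = [
    {"category": "news"},
    {"category": "sports"},
    {"category": "news"},
    {"category": None},  # rows without a category are ignored
]
labels, values = count_by_category(rows)
print(labels, values)  # ['news', 'sports'] [2, 1]
```

Calling plot_counts(labels, values, "items_per_category.png") would then write the image. Keeping the aggregation separate from the plotting makes it easier for graders to rerun either part.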
- Final Report (20%)
Finally, you will submit a final report describing your project and the problem you are trying to solve or the questions that you are trying to answer. What data did you collect, and how was it collected? What type of data processing/cleaning did you perform? You would also need to explain your analysis and visualizations. See Final Report section for
the analysis is fairly complicated, you could score more points in the data analysis step to compensate. Similarly, the complexity of the final data visualizations could be used to get additional points if you decide to make your visualizations more interactive and engaging to the end users.
Project Deliverables
GitHub Repository
We will create an assignment for the final project. You will need to accept the assignment and commit your code and any additional files (e.g. raw data or processed data) to the repository. Here is a generic structure of the repository:
    .gitignore
    README.md
    proposal.pdf
    requirements.txt
    data/
        raw/
        processed/
    results/
    src/
        get_data.py
        clean_data.py
        analyze_data.py
        visualize_results.py
        utils/
And here is a description of what each of the folders/files could contain:
- proposal.pdf
The project proposal file (PDF). This is what you can send us in advance to see if your project meets the minimum requirements, or if the scope is too large and you need to scale it back. See the section: Project Proposal.
- requirements.txt
This file lists all of the external libraries you have used in your project (e.g. pandas, requests, etc.) and the specific version of each library. You can create this file manually, or run the following command in your virtual (conda) environment to create it:
    pip freeze > requirements.txt
To install all of the required libraries based on this requirements file, run this command:
    pip install -r requirements.txt
- README.md
This file typically contains installation instructions, or the documentation on how to install the requirements and ultimately run your project. Here you can explain how to run your code: how to get the data, how to clean the data, how to run the analysis code and finally how to produce the visualizations. We have created sections in the README.md file for you to fill in. Make sure you fill in all of the sections.
Please note that this file is the most important one to us, as we will try to reproduce your results on our end to verify that everything is working. If there is anything tricky about the installation of your project, mention it here to make it easier for us to run your project.
- data/ directory
Simply put, this folder contains the data that you used in this project.
(a) The raw data folder will have the raw files you downloaded/scraped from the web. It could contain (not exhaustive) html, csv, xml or json files. If your raw data happens to be too large to upload to GitHub (i.e. larger than 25 MB), then please upload your data to the USC Google Drive and provide a link to the data in your README.md file.
(b) The processed data folder will contain your structured files after data cleaning. For example, you could clean the data and convert them to JSON or CSV files. Your analysis and visualization code should perform operations on the files in this folder.
Note: Make sure your individual files are less than 25 MB in size. You can use USC Google Drive if the files are larger than 25 MB; in that case, please provide a link for us to get to the data in your README.md file.
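Before committing, it may help to check the size limit mentioned above automatically. The sketch below lists files under a data folder that exceed 25 MB. The function name is hypothetical, and interpreting 25 MB as 25 * 1024 * 1024 bytes is an assumption.

```python
from pathlib import Path

# 25 MB limit from the guidelines; treating 1 MB as 1024 * 1024 bytes is an assumption.
LIMIT_BYTES = 25 * 1024 * 1024


def oversized_files(data_dir, limit=LIMIT_BYTES):
    """Return the files under data_dir that are larger than limit, sorted by path."""
    root = Path(data_dir)
    if not root.is_dir():
        return []
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.stat().st_size > limit
    )


# Report any file that should be shared via a link in README.md instead.
for path in oversized_files("data"):
    size_mb = path.stat().st_size / (1024 * 1024)
    print(f"{path}: {size_mb:.1f} MB is over the limit")
```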
- results/ directory
This folder will contain your final project report and any other files you might have as part of your project. For example, if you choose to create a Jupyter Notebook for your data visualizations, this notebook file should be in this results folder. If you have any static images of the data visualizations, those images should go in this folder as well.
- src/ directory
This folder contains the source code for your project.
(a) get_data.py will download, web-scrape or fetch the data from an API and store it in the data/raw folder.
(b) clean_data.py will clean the data, transform the data and store structured data files in the data/processed folder, for example as csv or json files.
(c) analyze_data.py will contain methods used to analyze the data to answer the project-specific questions.
(d) visualize_results.py will create any data visualizations using matplotlib or any other library to conclude the analysis you performed.
(e) The utils/ folder should contain any utility functions that your code needs. This could be something generic, such as regular expressions used to clean the data or to parse and lowercase otherwise case-sensitive information.
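As an illustration of the kind of generic helper that could live in the utils/ folder, the sketch below normalizes free-text values with regular expressions so that differently cased or punctuated spellings compare as equal. The function names and the exact normalization rules are hypothetical choices, not requirements.

```python
import re

_WHITESPACE_RE = re.compile(r"\s+")
_PUNCTUATION_RE = re.compile(r"[^\w\s-]")  # anything but word chars, whitespace, hyphen


def normalize_text(value):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    lowered = value.casefold()
    no_punctuation = _PUNCTUATION_RE.sub("", lowered)
    return _WHITESPACE_RE.sub(" ", no_punctuation).strip()


def slugify(value):
    """Turn a free-text label into a file-name-friendly slug."""
    return normalize_text(value).replace(" ", "-")


print(normalize_text("  New  York! "))  # new york
print(slugify("Los Angeles, CA"))       # los-angeles-ca
```

Keeping such helpers in one place means the cleaning and analysis scripts apply exactly the same rules.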
- .gitignore
Last but not least, the .gitignore file is here to help keep certain metadata or otherwise unnecessary files from being added to the repository. This includes files that were used in development or were created as a by-product but are not necessary for you to run the project (for example, cached files added by various IDEs like VS Code or PyCharm).
Please note that this project structure is only a suggestion; feel free to add more files or change the names of files and folders as you prefer. That being said, please take into account that we will be looking for the specific files to get the data, clean the data, analyze the data, etc. If you change this structure or create more files in this repository, please do mention what is where in your README.md file.
Final Report
You’ll find an empty template for the final report document (pdf) in the GitHub repository once you accept our final project assignment. At the very least, your final report should have the following sections:
- What is the name of your project?
(a) Please write it as a research question and provide a short synopsis/description.
(b) What is/are the research question(s) that you are trying to answer?
- What type of data did you collect?
(a) Specify exactly where the data is coming from.
(b) Describe the approach that you used for data collection.
(c) How many different data sources did you use?
(d) How much data did you collect in total? How many samples?
(e) Describe what changed from your original plan (if anything changed) as well as the challenges that you encountered and resolved.
- What kind of analysis and visualizations did you do?
(a) Which analysis techniques did you use, and what are your findings?
(b) Describe the type of data visualizations that you made.
(c) Explain the setup and meaning of each element.
(d) Describe your observations and conclusion.
(e) Describe the impact of your findings.
- Future Work
(a) Given more time, what would you do in order to further improve your project?
(b) Would you use the same data sources next time? Why or why not?
Your final project report should be no less than 2 and no more than 5 pages, including any images (e.g. of data visualizations) that you want to embed in the report. Please spend a decent amount of time on the report. Your report is the first file we will read. We will not know how great your project is if you don’t explain it clearly and in detail.