
SIT742: Modern Data Science


Deakin University Trimester 2, 2024

School of IT, Assignment 2

Unit Team: SIT742

Extension Request Students who have difficulty meeting the deadline for various reasons must apply for an assignment extension no later than 5:30pm on 20/09/2024 (Friday). Apply via 'CloudDeakin', using the menu item 'Extension Request' under the 'Assessment' drop-down menu.

Academic Integrity All assignments will be checked for plagiarism, and any academic misconduct will be reported to the unit chair and the university.

Generative AI Deakin's policy and advice on responsible use of Generative AI in your studies: https://www.deakin.edu.au/students/study-support/study-resources/artificial-intelligence

Instructions

Assignment Questions There are two parts in assessment task 2:

Part 1 The first part focuses on data manipulation and PySpark skills, including Data Acquisition, Data Wrangling, EDA and Spark, using the modules and libraries from M03 and M04.

Part 2 The second part focuses on more advanced data science skills within a particular scenario. This part requires the knowledge covered in M05.

What to Submit?

There is no optional part for assignment 2. You (your group) are required to submit the following completed files to the corresponding Assignment (Dropbox) in CloudDeakin:

SIT742Task2.ipynb The completed notebook with all the runnable code for all requirements (Part 1 and Part 2). In general, you (your group) need to complete the notebook, save the results of running it, download/export it as a local file from a Python platform such as Google Colab, and submit it. You need to clearly list the answer for each question; the expected format of your notebook is shown in Figure 1 (one notebook for each group).

Figure 1: Notebook Format

SIT742Task2report.pdf You (your group) are also required to write a report with your answers (code) and running results from SIT742Task2.ipynb for all the questions (Part 1 and Part 2). You could take screenshots of your answers (code) and running results from SIT742Task2.ipynb and paste them into the report. Please include the code comments and results, including plot images, in the report, and make sure the code formatting (such as indentation) is the same as in the ipynb notebook. In this report (one for each group), you also need to provide a clear explanation of your logic for solving each question (you could write the explanation below your solution and results in the report). The explanation needs to cover: 1) why you decided to choose your solution; 2) whether there are any other solutions that could solve the question; 3) whether your solution is optimal or not, and why. The length of the explanation for each question is limited to below 100 words.

At the end of your report, you (your group) also need to discuss the following three points:

  • How did you and your team members collaborate on this assignment?
  • What have you and your team members learned from the second assignment?
  • What is the contribution of each team member to finishing the second assignment?

SIT742Task2video.avi A video demonstration of between 10 and 15 minutes; the file format can be any common video format, such as 'MKV', 'WMV', 'MOV', etc. One important submission for your group is a short video in which each of you (group members) orally presents the solutions provided in the notebook and illustrates the running of the code and the logic used. In the video, your group needs to work together to discuss the following three points:

  • Which question(s) you worked on and how you collaborated with other team members.
  • What is the logic behind your solution to the question(s)? Is there any alternative, optimised way to resolve the question?
  • What is your understanding of code collaboration? How do you collaborate with your group in coding? What are the common tools/platforms that support code collaboration?

Part I

Data Acquisition and Manipulation

There are 10 questions in this part, totalling 60 marks. Each question is worth 5 marks. Additionally, the quality of your explanation in both the report and video will collectively be worth 10 marks.

You are recommended to use Google Colab to finish all the coding in the code block cells, provide sufficient code comments, and also save the results of running. The (transactionrecord.zip) data used for this part can be found here. You will need to use Spark to read the unzipped (csv) data to start. You can find the code for reading csv data with Spark in M04G.

Question 1.1

Use PySpark to do some of the data wrangling, so that:

1.1.1 For the 'NA' values in the CustomerNo column, change them to '-1'.

1.1.2 Process the text in the productName column so that only alphabetic characters are left, save the processed result to a new column productName_process, and show the first 5 rows.
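The two transformations can be sketched as follows. The sketch uses pandas on a toy frame for brevity; the assignment asks for PySpark, where the rough equivalents are a conditional `when`/`otherwise` (or `replace`) for 1.1.1 and `F.regexp_replace` for 1.1.2. The column names, sample data, and the choice to keep spaces in the cleaned text are assumptions.

```python
import pandas as pd

# Toy stand-in for the transaction data; column names assumed from the brief.
df = pd.DataFrame({
    "CustomerNo": ["12345", "NA", "67890"],
    "productName": ["Red Mug 12oz!", "Blue-Pen #2", "Gift Box"],
})

# 1.1.1: replace the literal string 'NA' with '-1'
df["CustomerNo"] = df["CustomerNo"].replace("NA", "-1")

# 1.1.2: keep only alphabetic characters (spaces kept here for readability,
# which is an assumption about the intended output)
df["productName_process"] = df["productName"].str.replace(
    r"[^A-Za-z ]", "", regex=True)

print(df.head(5))
```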

Question 1.2

Find out the revenue on each transaction date. To achieve this, some wrangling work is required:

1.2.1 Use PySpark to calculate the revenue (price * Quantity), save it in float format in the PySpark dataframe, and show the top 5 rows.

1.2.2 Transform the PySpark dataframe to a pandas dataframe (named df) and create the column transaction_date in date format according to Date. Print the top 5 rows of your df pandas dataframe after creating the column transaction_date.

1.2.3 Plot the sum of revenue by transaction_date in a line plot. Is there any immediate pattern / insight?
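A minimal sketch of the three steps on toy data, done in pandas. In the assignment, 1.2.1 runs on the Spark dataframe (roughly `(col("Price") * col("Quantity")).cast("float")`) before `.toPandas()`; the sample values and the day-first date format are assumptions.

```python
import pandas as pd

# Toy data standing in for the transaction records
df = pd.DataFrame({
    "Date": ["1/2/2019", "1/2/2019", "2/2/2019"],
    "Price": ["2.50", "10.00", "4.00"],
    "Quantity": [4, 1, 3],
})

# 1.2.1: revenue = price * quantity, as float
df["revenue"] = df["Price"].astype(float) * df["Quantity"].astype(float)

# 1.2.2: parse Date into a proper date column (day-first format assumed)
df["transaction_date"] = pd.to_datetime(df["Date"], dayfirst=True).dt.date

# 1.2.3: daily revenue; calling .plot() on this series draws the line chart
daily = df.groupby("transaction_date")["revenue"].sum()
print(daily)
```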

Question 1.3

Let's continue to analyse transaction_date vs revenue.

1.3.1 Determine which workday (day of the week) generates the most sales (plot the results in a line chart of averaged revenue by workday).

1.3.2 Identify the name of the product (column productName_process) that contributes the highest revenue on 'that workday' (which you need to find out from 1.3.1), and the name of the product (column productName_process) that has the highest sales volume (sum of Quantity; no need to remove negative-quantity transactions) on 'that workday'.

1.3.3 Provide two plots showing the top 5 products that contribute the highest revenues in general and the top 5 products that have the highest sales volumes in general.

Question 1.4

Which country generates the highest revenue? Additionally, identify the month in that country that has the highest revenue.

Question 1.5

Let's do some analysis on CustomerNo and the customers' transactions. Determine the shopping frequency of customers to identify who shops most frequently (find the highest distinct count of transactionNo at customer level; be careful with transactions that are not for shopping – filter out transactions with quantity <= 0). Also, find out what products (column productName_process) 'this customer' typically buys, based on the Quantity of products purchased.
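The distinct-count-then-rank idea can be sketched like this on toy data; the sample values are invented, and in the real data there may be ties that need a deliberate tie-breaking rule.

```python
import pandas as pd

# Toy transactions; rows with Quantity <= 0 (cancellations/returns) are
# filtered out before counting, as the question requires.
df = pd.DataFrame({
    "CustomerNo": ["c1", "c1", "c1", "c2", "c2"],
    "transactionNo": ["t1", "t2", "t2", "t3", "t4"],
    "productName_process": ["mug", "pen", "pen", "mug", "box"],
    "Quantity": [2, 3, -1, 5, -2],
})

shopped = df[df["Quantity"] > 0]

# distinct transaction count per customer -> most frequent shopper
freq = shopped.groupby("CustomerNo")["transactionNo"].nunique()
top_customer = freq.idxmax()

# what that customer typically buys, ranked by total Quantity
typical = (shopped[shopped["CustomerNo"] == top_customer]
           .groupby("productName_process")["Quantity"]
           .sum()
           .sort_values(ascending=False))
print(top_customer, typical.index[0])
```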

Question 1.6

As the data scientist, you would like to build a basket-level analysis of the products customers buy (filter the 'df' dataframe with df['Quantity'] > 0). In this task, you need to:

1.6.1 Group by transactionNo and aggregate the category of product (column product_category) into a list at transactionNo level. Similarly, group and aggregate the name of product (column productName_process) into a list at transactionNo level.

1.6.2 Remove duplicates on adjacent elements in the list from product_category obtained in 1.6.1, such that [product category 1, product category 1, product category 2, ...] is processed as [product category 1, product category 2, ...]. After this processing, there will be no duplicates on adjacent elements in the list. Save your processed dataframe as 'df_1' and print the top 10 rows.
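Both steps can be sketched as follows; `itertools.groupby` collapses runs of equal adjacent elements, which is exactly the 1.6.2 requirement. The toy rows are assumed to be already filtered to positive quantities.

```python
from itertools import groupby

import pandas as pd

# Toy positive-quantity rows, as if already filtered with df[df["Quantity"] > 0]
df = pd.DataFrame({
    "transactionNo": ["t1", "t1", "t1", "t2"],
    "product_category": ["0ca", "0ca", "1ca", "2ca"],
    "productName_process": ["mug", "cup", "pen", "box"],
})

# 1.6.1: aggregate categories and product names into per-transaction lists
df_1 = df.groupby("transactionNo").agg(
    product_category=("product_category", list),
    productName_process=("productName_process", list),
).reset_index()

# 1.6.2: collapse runs of identical adjacent categories;
# itertools.groupby yields one key per run of equal elements
df_1["product_category"] = df_1["product_category"].apply(
    lambda cats: [k for k, _ in groupby(cats)])

print(df_1.head(10))
```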

Question 1.7

Continue working with the results of question 1.6; now, for each transaction, you have a list of product categories. To further conduct the analysis, you need to finish the following using dataframe 'df_1':

1.7.1 Create a new column prod_len containing the length of the product_category list for each transaction. Print the first five rows of dataframe 'df_1'.

1.7.2 Transform the list in product_category from [productcategory1, productcategory2, ...] to 'start > productcategory1 > productcategory2 > ... > conversion' in a new column path. You need to add 'start' as the first element and 'conversion' as the last. You also need to use ' > ' to connect each transition between products (there is a space between the elements and the transition symbol >). The final format after the transition is given in the example in Figure 2. Define the function data_processing to achieve the above, with three arguments: df, the dataframe; maxlength, with default value 3, for filtering the dataframe with prod_len <= maxlength; and minlength, with default value 1, for filtering the dataframe with prod_len >= minlength. The function data_processing returns the new dataframe 'df_2'. Run your defined function with dataframe 'df_1', maxlength = 5 and minlength = 2, and print the top 10 rows of 'df_2'.

Figure 2: Example of the transformation in 1.7.2; the left column is before the transformation, the right column is after. After the transformation it is no longer a list.

Hint: you might consider using the str.replace() syntax from default Python 3.
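A sketch of `data_processing` as specified, run on a toy 'df_1'. The toy category values and the use of `" > ".join` (rather than `str.replace` as in the hint) are implementation choices, not the only valid approach.

```python
import pandas as pd

def data_processing(df, maxlength=3, minlength=1):
    """Filter baskets by list length, then build the 'path' string column."""
    out = df[(df["prod_len"] >= minlength) & (df["prod_len"] <= maxlength)].copy()
    out["path"] = out["product_category"].apply(
        lambda cats: " > ".join(["start"] + list(cats) + ["conversion"]))
    return out

# Toy 'df_1' as produced by question 1.6
df_1 = pd.DataFrame({
    "transactionNo": ["t1", "t2"],
    "product_category": [["0ca", "1ca"], ["2ca"] * 6],
})

# 1.7.1: length of the category list per transaction
df_1["prod_len"] = df_1["product_category"].str.len()

# 1.7.2: run with maxlength=5, minlength=2; t2 (length 6) is filtered out
df_2 = data_processing(df_1, maxlength=5, minlength=2)
print(df_2.head(10))
```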

Question 1.8

Continue to work with the results of question 1.7, the dataframe 'df_2'. We would like to build the transition matrix together, but before we actually do the programming, we need to finish a few exploratory questions:

1.8.1 Checking your transaction-level baskets from question 1.7, find out respectively how many transactions ended with the pattern '... > 0ca > conversion' / '... > 1ca > conversion' / '... > 2ca > conversion' / '... > 3ca > conversion' / '... > 4ca > conversion' (1 result for each pattern; 5 results in total are expected).

1.8.2 Checking your transaction-level baskets from question 1.7, find out respectively how many times the transactions contain '0ca > 0ca' / '0ca > 1ca' / '0ca > 2ca' / '0ca > 3ca' / '0ca > 4ca' / '0ca > conversion' in the whole data (1 result for each pattern; 6 results in total are expected. Each transaction could contain these patterns multiple times; for example, 'start > 0ca > 1ca > 0ca > 1ca > conversion' counts 'two' times for the pattern '0ca > 1ca'. If there are none, return 0. You need to sum the counts from each transaction to return the final value).

1.8.3 Checking your transaction-level baskets from question 1.7, find out how many times the transactions contain '... > 0ca > ...' in the whole data (1 result is expected. Each transaction could contain the pattern multiple times; for example, 'start > 0ca > 1ca > 0ca > 1ca > conversion' counts 'two' times. You need to sum the counts from each transaction to return the final value).

1.8.4 Divide each of the 6 results from 1.8.2 by the result from 1.8.3, then sum all of the quotients and return the value.

Hint: you might consider using the endswith and count functions from default Python 3.
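The hinted `endswith`/`count` approach can be sketched on two toy paths (the real ones come from `df_2["path"]`). Note these toy category labels never embed '0ca' inside another token; on real labels you may need stricter matching.

```python
# Toy paths in the 1.7.2 format
paths = [
    "start > 0ca > 1ca > 0ca > 1ca > conversion",
    "start > 0ca > conversion",
]

# 1.8.1: transactions ending with a given pattern
ends_0ca = sum(p.endswith("0ca > conversion") for p in paths)

# 1.8.2: occurrences of a sub-path, counted within each transaction and summed
n_0ca_1ca = sum(p.count("0ca > 1ca") for p in paths)

# 1.8.3: how often '0ca' appears followed by another step
n_0ca = sum(p.count("0ca >") for p in paths)

# 1.8.4 would then divide each 1.8.2 count by the 1.8.3 count and sum the quotients
print(ends_0ca, n_0ca_1ca, n_0ca)
```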

Question 1.9

Let's now look at question 1.6 again: you have the list of products and the list of product categories for each transaction. We will use transactionNo and productName_process to conduct Association rule learning.

1.9.1 Work on the dataframe df from question 1.2 (filter out the transactions with negative quantity values and keep only the top 100 products by ranking the sum of quantity) and build the transaction-level product dataframe (each row represents a transactionNo, the productName_process values become the columns, and the value in each column is the Quantity).

Hint: you might consider using the pivot function in pandas.
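The pivot step can be sketched as below with `pivot_table`; the toy rows are assumed to be already filtered (positive quantities, top-100 products), and zero-filling of missing transaction/product cells is an assumption about the desired output.

```python
import pandas as pd

# Toy positive-quantity transactions
df = pd.DataFrame({
    "transactionNo": ["t1", "t1", "t2"],
    "productName_process": ["mug", "pen", "mug"],
    "Quantity": [2, 1, 3],
})

# one row per transaction, one column per product, Quantity as cell value
basket = df.pivot_table(index="transactionNo",
                        columns="productName_process",
                        values="Quantity",
                        aggfunc="sum").fillna(0)
print(basket)
```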

1.9.2 Run the apriori algorithm to identify items with a minimum support of 1.5% (only looking at baskets with 4 or more items).

Hint: you might consider using mlxtend.frequent_patterns to run apriori rules.

1.9.3 Run the apriori algorithm to find the items with support >= 1.0% and lift > 10.

1.9.4 Explore three more examples with different support / confidence / lift measurements (you could build your rule mining on one of the three measurements, or on all of them) to find any interesting patterns from the Association rule learning. Save your code and results in a clean and tidy format, and write down your insights.
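To make the support and lift thresholds concrete, here is a by-hand count on four toy baskets. This is only to show what the measurements mean; the assignment itself should one-hot encode the pivoted basket table and use mlxtend's apriori / association_rules rather than this manual enumeration.

```python
from collections import Counter
from itertools import combinations

# Toy baskets of products
baskets = [
    {"mug", "pen", "box"},
    {"mug", "pen"},
    {"mug", "box"},
    {"pen", "box"},
]
n = len(baskets)

# support(itemset) = fraction of baskets containing all items of the itemset
pair_counts = Counter()
for b in baskets:
    for pair in combinations(sorted(b), 2):
        pair_counts[pair] += 1
pair_support = {p: c / n for p, c in pair_counts.items()}

item_support = {i: sum(i in b for b in baskets) / n
                for i in {"mug", "pen", "box"}}

# lift(A -> B) = support(A and B) / (support(A) * support(B))
lift_mug_pen = pair_support[("mug", "pen")] / (
    item_support["mug"] * item_support["pen"])
print(pair_support, lift_mug_pen)
```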

Question 1.10

After we have finished the Association rule learning, it is time to consider doing customer analysis based on shopping behaviours.

1.10.1 Work on the dataframe df from question 1.2 and build the customer-product dataframe (each row represents a single customerNo, the productName_process values become the columns, and the value in each column is the aggregated Quantity value from all transactions; the result is an N by M matrix where N is the number of distinct customerNo and M is the number of distinct productName_process. Please filter out the transactions with negative quantity values and keep only the top 100 products by ranking the sum of quantity).

1.10.2 Using the customer-product dataframe, calculate the pairwise Euclidean distance at customer level (you will need to use the product Quantity information for each customer to calculate the Euclidean distance to all other customers; the result is an N by N matrix where N is the number of distinct customerNo).
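The N by N distance matrix can be sketched with NumPy broadcasting on a toy customer-product matrix; the customer IDs and quantities are invented, and scipy's `cdist` would be an equally valid implementation.

```python
import numpy as np
import pandas as pd

# Toy customer-by-product quantity matrix, as from 1.10.1
cp = pd.DataFrame([[2, 0, 1],
                   [2, 0, 0],
                   [0, 5, 0]],
                  index=["c1", "c2", "c3"],
                  columns=["mug", "pen", "box"])

X = cp.to_numpy(dtype=float)
# pairwise Euclidean distances via broadcasting: an N x N matrix
dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
dist_df = pd.DataFrame(dist, index=cp.index, columns=cp.index)

# 1.10.3-style lookup: nearest customers to c1 (drop c1's zero self-distance)
nearest = dist_df["c1"].drop("c1").nsmallest(3)
print(dist_df.round(2))
print(nearest)
```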

1.10.3 Use the customer pairwise Euclidean distances to find the top 3 most similar customers to CustomerNo == 13069 and CustomerNo == 17490.

1.10.4 For the customer CustomerNo == 13069, you can see there are some products that this customer has never shopped for before. Could you please give some suggestions on how to recommend these products to this customer? Please write down your suggestions and provide the coding logic (steps on how to achieve it, not actual code).

Part II

Sales Prediction

There are 3 questions in this part, totalling 40 marks. Each question is worth 10 marks. Additionally, the quality of your explanation in both the report and video will collectively be worth 10 marks.

You are required to use Google Colab to finish all the coding in the code block cells, provide sufficient code comments, and also save the results of running. In this part, we will focus only on the two columns revenue and transaction_date, to form the revenue time series based on transaction_date. We will use the dataframe df from question 1.2 (without any filtering on transactions) to finish the sub-tasks below:

Question 2.1

You are required to explore the revenue time series. Some days are not available in the revenue time series, such as 2019-01-01. Please add those days into the revenue time series with a default revenue value equal to the mean revenue over the whole data (without any filtering on transactions). After that, decompose the revenue time series with the additive model and analyse the results to find whether there is any seasonality pattern (you could leverage the M05A material from the lab session, with the default settings of the seasonal_decompose function).
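The gap-filling step can be sketched as below on a toy series; the dates and values are invented. The decomposition call is shown only as a comment because it depends on statsmodels, per the M05A lab material.

```python
import pandas as pd

# Toy daily revenue with a missing day (2019-01-02)
s = pd.Series([10.0, 30.0],
              index=pd.to_datetime(["2019-01-01", "2019-01-03"]))

# reindex onto a complete daily range and fill the gaps with the overall mean
full_idx = pd.date_range(s.index.min(), s.index.max(), freq="D")
filled = s.reindex(full_idx).fillna(s.mean())
print(filled)

# The decomposition step would then be, per the M05A lab material:
#   from statsmodels.tsa.seasonal import seasonal_decompose
#   result = seasonal_decompose(filled, model="additive")
#   result.plot()
```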

Question 2.2

We will try to use the time series model ARIMA to forecast the future. You need to find the best model by trying different parameters of the ARIMA model. The parameter ranges for p, d, q are all [0, 1, 2]. In total, you need to find the best model, with the lowest Mean Absolute Error, out of 27 choices, based on the period from 'Jan-01-2019' to 'Nov-01-2019' (you might need to split the time series into train and test sets and run a grid search, according to the M05B material).
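The grid-search skeleton over the 27 (p, d, q) combinations can be sketched as follows. To keep the sketch self-contained, the ARIMA fit is replaced by a naive last-value placeholder; in the assignment that stand-in would be statsmodels' `ARIMA(train, order=(p, d, q)).fit()` followed by `.forecast(len(test))`, and the series would be the real 2019 revenue data.

```python
from itertools import product

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Stand-in forecaster so the skeleton runs on its own; swap in the real
# ARIMA fit/forecast here.
def fit_and_forecast(train, test_len, order):
    return [train[-1]] * test_len  # naive last-value placeholder

series = [10.0, 12.0, 11.0, 13.0, 14.0, 15.0]
train, test = series[:4], series[4:]        # chronological split, no shuffling

best = None
for order in product([0, 1, 2], repeat=3):  # all 27 (p, d, q) combinations
    preds = fit_and_forecast(train, len(test), order)
    mae = mean_absolute_error(test, preds)
    if best is None or mae < best[1]:
        best = (order, mae)
print("best order:", best[0], "MAE:", best[1])
```

With a real forecaster the 27 runs would produce different MAE values, and the loop keeps the order with the lowest one.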

Question 2.3

There are many deep learning time series forecasting methods. Could you please explore those methods and write down the necessary data wrangling and modelling steps (steps on how to achieve it, not actual code)? Also, please give references for the deep learning time series forecasting models you are using.
