
2019.09.14

Numbers

The assignment is worth 15% of your final grade.

Read everything below carefully!

Why?

The purpose of this project is to explore some techniques in supervised learning. It is important to realize that understanding an algorithm or technique requires understanding how it behaves under a variety of circumstances. As such, you will be asked to "implement" some simple learning algorithms (for sufficiently small values of implement, meaning I don't really want you to implement anything at all), and to compare their performance.

The Means

In this assignment you will go through the process of exploring your chosen datasets, tuning the algorithms you learned about, and writing a thorough analysis of your findings. All that matters is the analysis. It doesn't matter if you implement any learning algorithms yourself (you probably shouldn't) as long as you participate in this journey of exploring, tuning, and analyzing. Concretely, this means you may program in any language that you wish, and you are allowed to use any library you wish as long as it was not written specifically to solve this assignment, and as long as a TA can recreate your experiments on a standard linux machine if necessary (we know how to install a pip package). You might want to look at scikit-learn or Weka, for example.

Some examples of acceptable libraries:

• Machine learning algorithms: scikit-learn (python), Weka (java), e1071/nnet/randomForest (R), ML toolbox (matlab), tensorflow/pytorch (python)
• Scientific computing: numpy/scipy (python), Matlab, R
• Plotting: matplotlib (python), seaborn (python), Matlab, R

You can use other libraries as long as they fulfill the conditions above. If you are unsure, ask a TA (but please use common sense first)! There is no trick here for you to overthink. Again, the key issue is that I don't care that you implement any of the learning algorithms below; however, I care very much about your analysis.

The Problems Given to You

You should implement five learning algorithms. They are:

• Decision trees with some form of pruning
• Neural networks
• Boosting
• Support Vector Machines
• k-nearest neighbors
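As a rough illustration only (it is not part of the assignment text), the five learners listed above might be set up with scikit-learn along the following lines. Everything specific here is an assumption: the synthetic data from `make_classification`, the helper names `build_learners` and `evaluate`, and every hyperparameter value (`ccp_alpha`, `hidden_layer_sizes`, `n_estimators`, `n_neighbors`), which are placeholders to tune rather than recommendations. Cost-complexity pruning and AdaBoost are each just one possible choice of pruning and boosting method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_learners(seed=0):
    """Return one illustrative estimator per algorithm family (values are guesses)."""
    return {
        # Pruned tree: ccp_alpha > 0 turns on minimal cost-complexity pruning.
        "decision_tree": DecisionTreeClassifier(ccp_alpha=0.01, random_state=seed),
        # Small multilayer perceptron; inputs are standardized first.
        "neural_network": make_pipeline(
            StandardScaler(),
            MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=seed),
        ),
        # Boosting over depth-1 trees (very aggressively limited trees).
        "boosting": AdaBoostClassifier(
            DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=seed
        ),
        # Two SVMs that differ only in the kernel argument.
        "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        "svm_linear": make_pipeline(StandardScaler(), SVC(kernel="linear")),
        # k-nearest neighbors with one arbitrary value of k.
        "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    }


def evaluate(learners, X_train, y_train, X_test, y_test):
    """Fit each learner and return {name: (train_error, test_error)}."""
    results = {}
    for name, model in learners.items():
        model.fit(X_train, y_train)
        train_error = 1.0 - model.score(X_train, y_train)
        test_error = 1.0 - model.score(X_test, y_test)
        results[name] = (train_error, test_error)
    return results


if __name__ == "__main__":
    # Synthetic stand-in data; a real submission would use its own two datasets.
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    for name, (tr, te) in evaluate(build_learners(), X_tr, y_tr, X_te, y_te).items():
        print(f"{name:15s} train error {tr:.3f}  test error {te:.3f}")
```

A single run like this only gives one number per learner. The analysis the assignment asks for comes from varying the settings (pruning strength, network size, number of boosting rounds, kernel, k) and explaining what changes.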

Each algorithm is described in detail in your textbook, the handouts, and all over the web. In fact, instead of implementing the algorithms yourself, you may (and by may I mean should) use software packages that you find elsewhere; however, if you do so you should provide proper attribution. Also, you will note that you have to do some fiddling to get good results, graphs and such, so even if you use another's package, you may need to be able to modify it in various ways.

Decision Trees. For the decision tree, you should implement or steal a decision tree algorithm (and by "implement or steal" I mean "steal"). Be sure to use some form of pruning. You are not required to use information gain (for example, there is something called the GINI index that is sometimes used) to split attributes, but you should describe whatever it is that you do use.

Neural Networks. For the neural network you should implement or steal your favorite kind of network and training algorithm. You may use networks of nodes with as many layers as you like and any activation function you see fit.

Boosting. Implement or steal a boosted version of your decision trees. As before, you will want to use some form of pruning, but presumably because you're using boosting you can afford to be much more aggressive about your pruning.

Support Vector Machines. You should implement (for sufficiently loose definitions of implement including "download") SVMs. This should be done in such a way that you can swap out kernel functions. I'd like to see at least two.

k-Nearest Neighbors. You should "implement" (the quotes mean I don't mean it: steal the code) kNN. Use different values of k.

Testing. In addition to implementing (wink) the algorithms described above, you should design two interesting classification problems. For the purposes of this assignment, a classification problem is just a set of training examples and a set of test examples. I don't care where you get the data. You can download some, take some from your own research, or make some up on your own. Be careful about the data you choose, though. You'll have to explain why they are interesting, use them in later assignments, and come to really care about them.

What to Turn In

You must submit:

1. a file named README.txt containing instructions for running your code (see note below)

2. a file named yourgtaccount-analysis.pdf containing your writeup

Note below: if the data are way, way, too huge for submitting, see if you can arrange for a URL. This goes for code, too. Submitting all of Weka isn't necessary, for example, because I can get it myself; however, you should at least submit any files you found necessary to change and enough support and explanation so we could reproduce your results if we really wanted to do so. In any case, include all the information in README.txt.

The file yourgtaccount-analysis.pdf should contain:

• a description of your classification problems, and why you feel that they are interesting. Think hard about this. To be at all interesting the problems should be non-trivial on the one hand, but capable of admitting comparisons and analysis of the various algorithms on the other.
• the training and testing error rates you obtained running the various learning algorithms on your problems. At the very least you should include graphs that show performance on both training and test data as a function of training size (note that this implies that you need to design a classification problem that has more than a trivial amount of data) and, for the algorithms that are iterative, training times/iterations. Both of these kinds of graphs are referred to as learning curves, BTW.
• analyses of your results. Why did you get the results you did? Compare and contrast the different algorithms. What sort of changes might you make to each of those algorithms to improve performance? How fast were they in terms of wall clock time? Iterations? Would cross validation help (and if it would, why didn't you implement it?)? How much performance was due to the problems you chose? How about the values you choose for learning rates, stopping criteria, pruning methods, and so forth (and why doesn't your analysis show results for the different values you chose? Please do look at more than one. And please make sure you understand it, it only counts if the results are meaningful)? Which algorithm performed best? How do you define best? Be creative and think of as many questions as you can, and as many answers as you can.

For the sanity of your graders, please keep your analysis as short as possible while still covering the requirements of the assignment: to facilitate this sanity, the analysis writeup is limited to 12 pages.
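The learning curves and the "more than one value" comparison asked for above could be produced along the following lines. This is a minimal sketch, not a prescribed method: the helper names `error_vs_training_size` and `error_vs_hyperparameter`, the synthetic dataset, the five training sizes, and the `max_depth` range are all assumptions made for illustration. Note also that the second error returned by each helper is a cross-validated error on held-out folds, which stands in for test error here; a final test set should still be kept apart.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.tree import DecisionTreeClassifier


def error_vs_training_size(estimator, X, y, cv=5):
    """Mean training and held-out error at five increasing training-set sizes."""
    sizes, train_scores, held_out_scores = learning_curve(
        estimator,
        X,
        y,
        train_sizes=np.linspace(0.1, 1.0, 5),  # fractions of the available training fold
        cv=cv,
        scoring="accuracy",
    )
    # Scores have shape (n_sizes, n_folds); average over folds, convert to error.
    return sizes, 1.0 - train_scores.mean(axis=1), 1.0 - held_out_scores.mean(axis=1)


def error_vs_hyperparameter(estimator, X, y, param_name, param_range, cv=5):
    """Mean training and held-out error for each value of one hyperparameter."""
    train_scores, held_out_scores = validation_curve(
        estimator,
        X,
        y,
        param_name=param_name,
        param_range=param_range,
        cv=cv,
        scoring="accuracy",
    )
    return 1.0 - train_scores.mean(axis=1), 1.0 - held_out_scores.mean(axis=1)


if __name__ == "__main__":
    # Synthetic stand-in data; replace with one of your own classification problems.
    X, y = make_classification(n_samples=600, n_features=10, random_state=0)
    tree = DecisionTreeClassifier(random_state=0)

    sizes, train_err, held_out_err = error_vs_training_size(tree, X, y)
    for n, tr, ho in zip(sizes, train_err, held_out_err):
        print(f"n={n:4d}  train error {tr:.3f}  held-out error {ho:.3f}")

    depths = [1, 2, 4, 8]
    train_err, held_out_err = error_vs_hyperparameter(tree, X, y, "max_depth", depths)
    for d, tr, ho in zip(depths, train_err, held_out_err):
        print(f"max_depth={d}  train error {tr:.3f}  held-out error {ho:.3f}")
```

The returned arrays can be passed straight to a plotting library such as matplotlib, with training size or hyperparameter value on the x axis and the two error curves on the y axis. Wrapping the `fit` call with `time.perf_counter()` is one simple way to record wall clock time alongside these curves.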

Grading Criteria

You are being graded on your analysis more than anything else. Roughly speaking, implementing everything and getting it to run is worth maybe 0% of the grade on this assignment (I know you don't believe me, but in fact, steal the code; I not only don't care, I am encouraging you to use one of the many packages available both from the resources page and on the web). Of course, analysis without proof of working code makes the analysis suspect.

The key thing is that your explanations should be both thorough and concise. Imagine you are writing a paper for the major conference in your field the year you will be graduating and you need to impress all those folks who will be deciding whether to interview you later. You don't want them to think you're shallow do you? Or that you're incapable of coming up with interesting classification problems, right? And you surely don't want them to think that you make up for a lack of content by blathering on about irrelevant aspects of your work? Of course not.

Finally, I'd like to point out that I am very particular about the format of the assignments. Follow the directions carefully. Failure to turn in files with the proper naming scheme, or anything else that makes the graders' lives unduly hard is simply going to lead to an ignored assignment. I am remarkably inflexible about this. Also, there will be no late assignments accepted, so start now. Have fun. One day you'll look back on this and smile. There may be tears, but they will be tears of joy.

When your assignment is graded, you will receive feedback explaining your errors (and your successes!) in some level of detail. This feedback is for your benefit, both on this assignment and for future assignments. It is considered a part of your learning goals to internalize this feedback.

If you are convinced that your grade is in error in light of the feedback, you may request a regrade within a week of the grade and feedback being returned to you. A regrade request is only valid if it includes an explanation of where the grader made an error. Send a private Piazza post to only the head TA. In the Summary add “[Request] Regrade <whichever assignment>”. In the Details add sufficient explanation as to why you think the grader made a mistake. Be concrete and specific. We will not consider requests that do not follow these directions.

It is important to note that because we consider your ability to internalize feedback a learning goal, we also assess it. This ability is considered 10% of each assignment. We default to assigning you full credit. If you request a regrade and do not receive at least 5 points as a result of the request, you will lose those 10 points.

A note about plagiarism and proper citations

Probably the easiest way to fail this class is to plagiarize. Read the note on Piazza when it arrives and make sure it doesn’t happen.

BTW, I consider using the code of others in this class to perform the analysis itself to be plagiarism. Way above at the beginning of this assignment I note that I do not care about your implementing machine learning algorithms (and I mean it: I do not believe the learning value is worth it for this course); however, I do care very much that you understand why your algorithms work and how they are affected by your choice in data and hyperparameters (I believe very, very much in the learning value of this process). The assignments are actually designed to force you to immerse yourself in the empirical and engineering side of ML that one must master to be a viable practitioner and researcher. The phrase "as long as you participate in this journey of exploring, tuning, and analyzing" is the key one here. Taking someone else's random seeds, hyperparameters, and such is, in fact, avoiding the work I care about and circumvents this process because they are the results of the process. Do not circumvent the process.