POLS0010 Data Analysis Term II
ESSAY QUESTIONS 2024
Guidelines for Completing and Submitting POLS0010 Term II Essay
• Read the guidelines below to avoid losing marks unnecessarily.
• The assessment is due on Wednesday 1st May 2024, 2.00pm. It has two parts (I and II), both of which will need to be submitted together. Note that standard late submission penalties apply.
• Please follow all designated Department of Political Science submission guidelines. These may be different to those of your home department. You must submit one copy of your essay via Turnitin.
• The datasets for the essay can be found in the ‘Term 2 Assessment’ folder in the ‘Assessment 3’ section of Moodle.
• The word limit for both Parts I and II is 3,000 words, excluding your R script appendix (see below). You can divide the word limit as you like between the two parts.
• This is an assessed piece of coursework for the POLS0010 module; collaboration and/or discussion with anyone is strictly prohibited. The rules for plagiarism apply and any cases of suspected plagiarism of published work or the work of classmates will be taken very seriously.
• You may open up the datasets and work on the essay questions anytime up until the submission date. There is no limit on the number of times you may open the data files. Be sure to save your data files and R script file.
• You should include a copy of your R script as an appendix to your essay. FAILURE TO INCLUDE THE R SCRIPT WILL INCUR A 10 POINT PENALTY. Note that your R script file should be neatly presented and easy to follow, including comments indicating the question being addressed. The essay answers should not contain any code.
• All tables or figures must be included within your answers to the essay, not in the code appendix.
• Answers should be written in complete sentences; no bulleting or outlining.
• You may assume the methods you have used (e.g. logit regressions, etc) are understood by the reader and do not need definitions, but you do need to say which techniques you have used and why.
• As this is an assessed piece of work, you may not email/ask the course tutors for help with the essay questions.
PART I
This part of the final essay contains two questions. You must answer both of them. Question A is worth 30 points and Question B is worth 25 points.
Up to an additional five points will be awarded for clarity of presentation, especially tables and figures. Lecture 10 will give guidelines on good presentation.
Both questions require you to write a brief report. It is up to you how you structure the reports, but it is advisable to keep introductory material to a minimum, given the word limit. Your reports should discuss your methods, your results and the conclusions that you draw from them. You are welcome to use sub-headings to structure your reports.
QUESTION A: Support for the British Labour Party [30 points]
The next general election for the British parliament will be held by January 2025. For this question, suppose that you work for a political consultancy, and the Labour Party has contacted you for your expertise. The party would like to target their upcoming election campaign towards those who are most likely to vote for them. Your job is to tell them which types of people are most supportive of the Labour Party. To help measure the likely effectiveness of their advertising, they also want to know how much each characteristic matters in explaining support. To answer these questions you will look at data on how people voted previously from a survey of 723 British voters in the European Social Survey. This was fielded in 2022 and asked about respondents' vote in the last election. You need to:
1. Run at least three logit models containing different sets of variables, and select the one that you think has the best performance in classifying supporters of the Labour Party.
2. Present your chosen final model’s findings in ways that clearly explain how much the variables matter in explaining support for Labour.
3. Use the findings to make recommendations on whom the new campaign should target.
Present your approach and findings in a brief report. The dataset is called “ess_gb” and is contained in the file “labourvote.Rda”. It contains the following variables for each individual in the survey:
Name | Variable description
labvote | dependent variable: =1 if supported Labour, 0 if supported another party
gndr | = “Male” if male, “Female” if female
yrbrn | year of birth
eduyrs | number of years of education completed
lrscale | scale measuring how politically right-wing respondents are, from 0 to 10: higher values are more right-wing [treat as a continuous variable]
polintr | level of interest in politics: not interested, hardly interested, quite interested, or very interested
hinctna | household total net income decile (1=individual is in the lowest 10% of households, 10=highest 10%)
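The model-comparison step in Question A can be illustrated with a minimal sketch. It uses invented data rather than “ess_gb”: the data frame `toy`, the simulated coefficients and the helper `accuracy()` are all assumptions made for illustration, and the 0.5 cut-off is one common choice rather than a requirement of the brief.

```r
# Synthetic stand-in data: variable names mimic the brief, the values are invented.
set.seed(1)
n <- 500
toy <- data.frame(
  lrscale = sample(0:10, n, replace = TRUE),
  eduyrs  = sample(9:20, n, replace = TRUE)
)
# Simulate the outcome so that left-leaning respondents are more likely to be coded 1.
toy$labvote <- rbinom(n, 1, plogis(2 - 0.5 * toy$lrscale))

# Two candidate logit models with different predictor sets.
m1 <- glm(labvote ~ lrscale, data = toy, family = binomial)
m2 <- glm(labvote ~ lrscale + eduyrs, data = toy, family = binomial)

# Hypothetical helper: share of respondents classified correctly at a given cut-off.
accuracy <- function(model, data, cutoff = 0.5) {
  pred <- as.integer(predict(model, newdata = data, type = "response") > cutoff)
  mean(pred == data$labvote)
}

acc <- c(m1 = accuracy(m1, toy), m2 = accuracy(m2, toy))
print(acc)
print(AIC(m1, m2))

# Predicted probability of a Labour vote at two points on the left-right scale.
p_left_right <- predict(m1, newdata = data.frame(lrscale = c(2, 8)), type = "response")
print(p_left_right)
```

The accuracy here is computed on the same data used to fit the models, so it is likely to be optimistic. Differences in predicted probabilities, as in the last lines, are one way to express how much a variable matters, since logit coefficients are hard to read directly.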
QUESTION B: Estimating Constituency-Level Results from the EU Referendum [25 points]
In the 2016 referendum on leaving the EU, results were not released for individual electoral constituencies. However, many scholars would like to know why people voted to leave the EU, and how support differed across constituencies. One previous study by Chris Hanretty has already estimated constituency-level support for ‘leave’ authoritatively. Your tasks in this question are (i) to produce estimates of the percentage of voters that voted ‘leave’ in every constituency in Great Britain using multilevel modeling and post-stratification that are as close as possible to the Hanretty estimates, as measured by the Mean Absolute Error, and (ii) to use your results to explain why people voted to leave. You need to:
i) Estimate an appropriate logistic multilevel model explaining voting for leave, using the predictors in the dataset.
ii) Present the multilevel model results and interpret how the variables affect voting to leave the EU (Note: you do not need to discuss statistical significance).
iii) Produce post-stratified estimates of the percentage of people who voted ‘leave’ in all 631 constituencies in England, Scotland and Wales.
iv) Compare your results to the existing estimates by Chris Hanretty, including with the Mean Absolute Error. (Note: your grade does not depend on achieving a perfect match to the existing estimates. You are unlikely to be able to achieve this.)
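Steps iii) and iv) can be written compactly. The notation is an assumption introduced here, not taken from the brief: $\hat{p}_{gc}$ is the model-predicted probability of voting leave for demographic group $g$ in constituency $c$, $N_{gc}$ is the number of people in that group (recorded as c_count in the post-stratification data), $h_c$ is the existing Hanretty estimate, and $C = 631$. The formula also assumes that $h_c$ is expressed in percent.

```latex
\hat{\theta}_c \;=\; 100 \times \frac{\sum_{g} N_{gc}\,\hat{p}_{gc}}{\sum_{g} N_{gc}},
\qquad
\mathrm{MAE} \;=\; \frac{1}{C} \sum_{c=1}^{C} \left| \hat{\theta}_c - h_c \right|
```

Each constituency estimate is therefore a population-weighted average of group-level predictions, and the MAE is the average absolute gap, in percentage points, between the two sets of estimates.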
Present your approach and findings in a brief report. The survey data is called “e” and is in the file “eusurvey.Rda”. It comes from the British Election Study and it contains the following variables:
Name | Variable description
cname | constituency name
ccode | constituency code
leave | dependent variable: =1 if respondent voted to leave EU, 0 if respondent voted to remain in the EU
votecon | =1 if respondent voted Conservative in the 2015 election, 0 otherwise
voteukip | =1 if respondent voted UKIP in the 2015 election, 0 otherwise [note: UKIP is the United Kingdom Independence Party, which campaigned in favour of the UK leaving the EU]
female | =1 if female, 0 otherwise
age | in years
highed | =1 if respondent is educated to degree level or higher, 0 otherwise
lowed | =1 if respondent has no educational qualifications, 0 otherwise
c_con15 | percent vote for Conservative party in the constituency, 2015 election
c_ukip15 | percent vote for UKIP in the constituency, 2015 election
c_unemployed | constituency unemployment rate, percent
c_whitebritish | percent of constituency population who are white British
c_deprived | percent of constituency population living in poverty
Post-stratification data for the 631 constituencies in Great Britain is called “post” and is contained in the file “eupoststrat.Rda”. Each row contains one particular demographic group in one constituency. In addition to the variables in “e”, it also contains these variables:
Variable name | Variable description
c_count | Number of people in the demographic group
c_total | Number of people in the constituency
percent | percent of constituency represented by the demographic group
Finally, the comparison data containing the existing estimates by constituency produced by Chris Hanretty is called “est” and is in the file “existing_estimates.Rda”. In addition to the constituency name and code, it contains the existing estimate of the leave vote share for each constituency (called estimate).
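The mechanics of post-stratification and the Mean Absolute Error can be sketched on a toy example. Everything below is invented for illustration: `post_toy`, `p_hat` and `est_toy` are hypothetical stand-ins, and in practice the group-level probabilities would come from predictions of the fitted multilevel model rather than being typed in by hand.

```r
# Toy post-stratification frame: two constituencies, two demographic groups each.
# Column names follow the brief; all numbers are invented for illustration.
post_toy <- data.frame(
  ccode   = c("A", "A", "B", "B"),
  c_count = c(300, 700, 500, 500),
  p_hat   = c(0.60, 0.40, 0.70, 0.30)  # hypothetical predicted leave probabilities
)

# Weight each group's prediction by its head count, then sum within constituency.
num <- tapply(post_toy$c_count * post_toy$p_hat, post_toy$ccode, sum)
den <- tapply(post_toy$c_count, post_toy$ccode, sum)
leave_pct <- 100 * num / den

# Hypothetical benchmark estimates in percent, keyed by constituency code.
est_toy <- c(A = 50, B = 48)

# Mean Absolute Error between the two sets of estimates, matched by name.
mae <- mean(abs(leave_pct[names(est_toy)] - est_toy))
print(leave_pct)
print(mae)
```

Matching the two sets of estimates by constituency code, rather than relying on row order, guards against comparing the wrong pairs.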
PART II
This part of the final essay contains one question. It is worth 40 points. Again, 5 points are reserved for clarity of presentation, especially tables and figures. See Q+A session 5 for guidelines on presentation.
The question requires you to write a brief report. It is up to you how you structure the report, but it is advisable to keep introductory material to a minimum, given the word limit. Your report should discuss your methods, your results and the conclusions that you draw from them.
QUESTION C: Describing and Classifying Tweets [40 points]
Many companies monitor social media posts to gauge how customers feel about their company and their competitors. For this question, imagine that you have been hired as a consultant by one of the major American airline companies to analyse tweets about airlines. They want to find out how people talk about airlines on Twitter, and then build a predictive tool that can classify tweets in future into ‘negative’ or ‘positive’ sentiment, to help them respond better to their customers in real time. They have provided you with a dataset of 11,541 tweets about airlines that have been labelled as ‘negative’ or ‘positive’ by their staff. The dataset also identifies which airline each tweet is talking about.
Your task is to prepare a brief report that describes the tweets, and recommends a classification method for future tweets. You need to:
i) Use appropriate tools to describe the tweets. What words are associated with negative or positive sentiment? How does word usage differ across the different airlines?
ii) Use your analysis from i) to build a short dictionary of negative and positive words describing airlines, then use it to classify tweets as ‘negative’ if they contain more negative than positive language, and ‘positive’ otherwise [code for creating your own dictionary is provided below]
iii) Use the lasso logit method to classify the tweets into ‘negative’ and ‘positive’
iv) Compare the performance of your classifiers from ii) and iii), and use this analysis to decide which one would be the better classifier for the company to use for future tweets
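For step iv), one possible way to compare two classifiers is to compute the same performance measures for each against the labelled sentiment. The sketch below uses invented labels and predictions, and the helper `perf()` is hypothetical. Which measures matter most is a judgment for the report.

```r
# Invented labels and predictions; 1 = negative, 0 = positive, as in the brief's coding.
truth      <- c(1, 1, 1, 1, 0, 0, 1, 0, 1, 1)
pred_dict  <- c(1, 0, 1, 0, 0, 0, 1, 1, 0, 1)  # hypothetical dictionary classifier
pred_lasso <- c(1, 1, 1, 0, 0, 0, 1, 0, 1, 1)  # hypothetical lasso logit classifier

# Accuracy, sensitivity (negative tweets caught) and specificity (positive tweets caught).
perf <- function(pred, truth) {
  c(accuracy    = mean(pred == truth),
    sensitivity = mean(pred[truth == 1] == 1),
    specificity = mean(pred[truth == 0] == 0))
}

results <- rbind(dictionary = perf(pred_dict, truth),
                 lasso      = perf(pred_lasso, truth))
print(round(results, 2))

# Confusion matrix for one of the classifiers.
print(table(predicted = pred_lasso, actual = truth))
```

Because the classifier is intended for future tweets, comparing the two methods on tweets held back from model fitting is likely to give a fairer picture than in-sample performance.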
The dataset for this question is called “tweets” and is contained in the file “tweets.Rda”. It contains the following variables:
Variable name | Variable description
text | The text of each tweet
sentiment | Labeled sentiment of each tweet: 1=negative, 0=positive
airline | The airline company featured in the tweet: United, JetBlue, American Airlines, US Airways, Virgin America or Southwest
You should first create a corpus of tweets using the following code:
library(quanteda)  # provides corpus() and dictionary()
tweetCorpus <- corpus(tweets$text, docvars = tweets)
Here is some advice for part ii):
• Your dictionary should contain a minimum of 5 words and a maximum of 15 words in each category
• You are not expected to exhaustively compare the performance of different dictionaries. Instead, simply choose one dictionary based on your analysis from i), explaining how you chose the words.
Code for creating a dictionary:
You can create a dictionary called “mydict” in R that contains two categories (‘negative’ and ‘positive’) using the following code:
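# Insert your chosen words as quoted strings inside c()
neg.words <- c()
pos.words <- c()
# Combine both word lists into a two-category quanteda dictionary
mydict <- dictionary(list(negative = neg.words,
                          positive = pos.words))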
You need to insert your chosen sets of negative and positive words in ‘neg.words’ and ‘pos.words’. This dictionary can then be used with quanteda in exactly the same way as any of the existing built-in dictionaries.
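The classification rule in part ii) can be illustrated in base R, as a stand-in for the quanteda workflow rather than a replacement for it. The word lists and tweets below are invented placeholders, and `neg_toy`, `pos_toy` and `tweets_toy` are hypothetical names. With quanteda, the counting step would typically be done by applying the dictionary to a document-feature matrix, for example with `dfm_lookup()`.

```r
# Placeholder words and tweets, invented purely to show the mechanics.
neg_toy <- c("delayed", "lost")
pos_toy <- c("thanks", "great")
tweets_toy <- c("Flight delayed and bag lost, thanks a lot",
                "Great crew, thanks!",
                "Boarding now")

# Lower-case each tweet and split it into words on anything that is not a letter.
words <- strsplit(tolower(tweets_toy), "[^a-z]+")

# Count how many words in each tweet fall in each category.
n_neg <- sapply(words, function(w) sum(w %in% neg_toy))
n_pos <- sapply(words, function(w) sum(w %in% pos_toy))

# Rule from part ii): 'negative' (coded 1) only if negative words outnumber positive ones.
pred <- as.integer(n_neg > n_pos)
print(data.frame(n_neg, n_pos, pred))
```

Under this rule, ties, including tweets that contain no dictionary words at all, are classified as ‘positive’, which is worth bearing in mind when interpreting the dictionary classifier's errors.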