代写data编程、代做SQL程序语言
- 首页 >> Java编程 Due Jul 2 by 11:59pm
Points 100
Submitting a file upload
File Types rmd and html
Available until Jul 14 at 11:59pm
:[HY[(ZZPNUTLU[
5V[H)LUL
Read this entire page twice before you get started.
The due date is July 2 at 11:59pm ET. Late submissions are accepted (with the usual penalty of
2.5% per day late) until July 14 at 11:59pm ET. No submissions are accepted after that -- not
submitting the practicum by then will make it impossible to earn a passing grade for the course.
Work on the practicum for at least three hours every day and use the time during the week prior to
the practicum to start working on it, especially the configuration of the MySQL Server and the loading
of the data. Do not wait to get started.
A gentle reminder that the average of both practicums must be above 70% to pass this course, so be
sure to complete on time and seek help right away. Do not procrastinate -- things that appear simple
often take more time than expected and, of course, programming is fraught with potholes on the road
to success. So plan accordingly. Do not wait until shortly before it is due.
The average time to complete the practicum is 15-25 hours. Do not wait to start. Seek help early.
Submit often and as soon as you have enough code that works. We will only grade the last
submission. Check your submission before you submit and after.
6]LY]PL^
In this practicum you will build a database that can be used to analyze bird strikes on aircraft. For an
existing data set from the FAA [1], you will build a logical data model, a relational schema, realize the
relational schema in MySQL/MariaDB, load data into the database, execute SQL queries, a finally
perform some simple analysis of the data.
The graphic below shows some statistics regarding bird strikes and helps frame the data in the data
file.
Use the provided time estimates for each tasks to time-box your time. Seek assistance if you spend
more time than specified on a task -- you are likely not solving the problem correctly. A key objective
is to learn how to look things up, how to navigate complex problems, and how to identify and resolve
programming errors.
3LHYUPUN6IQLJ[P]LZ
In this practicum you will learn how to:
configure cloud MySQL
connect to cloud MySQL from R in an R Notebook
implement a relational schema for an existing data set
load data from CSV files into a relational database through R
execute SQL queries against a MySQL database through R
perform simple analytics in R
identify and resolve programming errors
look up details for R, SQL, and MySQL
time-box work
-VYTH[
Complete individually.
7YLYLX\PZP[L;HZRZ
1. Read the Hints and Tips section below and go back to them often when you encounter problem.
Most problems are addressed in that section. Consult the list before contacting us for help as
you'll be able to resolve the issue more quickly.
2. Create an R Project and within that project create an R Notebook in which to do all of your work.
Place all files within the project folder and do not use absolute file paths. Embed any diagrams inthe R Notebook and be sure to submit them as files in your submission. Keep updating this
notebook as you complete tasks. Learn how to use journaling in data analysis work. Use 6.204 --
Literate Programming with R Notebooks (http://artificium.us/lessons/06.r/l-6-204-rnotebooks/l-6-204.html)
for reference. Of course, in practice, you will likely build different
programs (perhaps even using different programming language) for database creation, data
loading, and the dashboard. But here we will do it all in one R Notebook.
3. Become familiar with the problem of bird strikes on aircraft. The report in [1] provides an overview.
You should scan it to become familiar with the domain but you do not need to read it deeply.
4. Download the file BirdStrikesData.csv (https://s3.us-east2.amazonaws.com/artificium.us/datasets/BirdStrikesData-V4-SuF24.csv)
and save it locally to your
R Project folder (the same folder that contains you .Rmd and your .Rproj files) or reference it from
the URL. You may wish to create a new data file that is a subset of the full data that you use for
development so loading takes less time. This is a common strategy in practice. To download the
file, use the right mouse button and choose "Save Link As..." or a similar menu choice on your
browser. Do not click on the link as that will cause the browser to open and display the file.
Read the Hints & Tips section frequently and before posting questions.
2L`9LZV\YJLZ
7YLYLX\PZP[L3LZZVUZ
06.103 Working with Vectors and Data Frames in R (http://artificium.us/lessons/06.r/l-6-
103-vecs-and-dfs/l-6-103.html)
06.106 Import Data into R from CSV, TSV, and Excel Files
(http://artificium.us/lessons/06.r/l-6-106-load-csv-tsv-excel-files/l-6-106.html)
06.108 Loops and Iteration in R (http://artificium.us/lessons/06.r/l-6-108-loops-iteration-inr/l-6-108.html)
06.112
Basics of Text & String Processing in R (http://artificium.us/lessons/06.r/l-6-112-
text-proc/l-6-112.html)
06.121 Writing Functions in R (http://artificium.us/lessons/06.r/l-6-121-funcs-in-r/l-6-
121.html)
06.204 Literate Programming with R Notebooks (http://artificium.us/lessons/06.r/l-6-204-rnotebooks/l-6-204.html)
06.191
Debugging R Code (http://artificium.us/lessons/06.r/l-6-191-debugging/l-6-191.html)
06.301 Working with Databases in R (http://artificium.us/lessons/06.r/l-6-681-key-value-dbredis-from-r/l-6-301.html)
06.304
Configure and Connect to Cloud MySQL from R (http://artificium.us/lessons/06.r/l6-304-cloudMySQL-from-r/l-6-304.html)
06.302
Bulk Load Data from CSV into Database in R (http://artificium.us/lessons/06.r/l-6-
302-bulkload-data-into-db/l-6-302.html)
06.306 Dates in R and SQLite (http://artificium.us/lessons/06.r/l-6-306-dates-in-r-and-sql/l-6-
306.html)70.907 Stored Procedures in MySQL (http://artificium.us/lessons/70.sql/l-70-907-storedprocs-mysql/l-70-907.html)
;HZRZ
Before
you start, read all of the questions. Inspect the CSV data file that you downloaded so you are
familiar with its columns, data types, and overall structure. Assume that this database will eventually
be used for an app that can be used by pilots (of any kind of aircraft) to report wildlife incidents.
All R and SQL code blocks must be named, as shown in the example below. This is necessary so
that you can reference your code blocks in the self-evaluation rubric to be filled out at the end of the
practicum. The names of code blocks must be unique.
```{r nameOfRCodeBlock, eval = T, warning = F}
```{sql nameOfSQLCodeBlock, connection = xDB}
You may add any additional block parameters as needed.
Use functions to structure your code so that it becomes more readable and easier to develop and
debug. Use headers to segment your notebook and add explanations as to what each code block
means. Follow common coding practices and format your code so it is readable, and use functions to
break down complex code.
1. (5 pts / 2.0 hrs) Set up and configure a MySQL Server database on db4free.net
(http://db4free.net/) or any other MySQL cloud service of your choice such as
freemysqlhosting.net (https://www.freemysqlhosting.net/) , Google Cloud or AWS. Note that
you may use SQLite instead of MySQL but you will not get credit for this question nor credit for the
question below that asks you to create a stored procedure, as those are not supported in SQLite.
See the Hints below for information on db4free. Lesson 6.304 -- Configure and Connect to
Cloud MySQL from R (http://artificium.us/lessons/06.r/l-6-304-cloudMySQL-from-r/l-6-304.html)
provides some advice on connecting to these two databases.
If you use freemysqlhosting, we recommend that you pay (for the remainder of the term or the
entire year) for a 100MB upgrade (really cheap, like cost of couple cups of coffee) as otherwise
exceeding the 5MB limit will get your account suspended. You may share a subscription as it
allows five databases. Alternatively, you could split the data over multiple databases. Of course,
using a different MySQL host such as Google Cloud or Amazon AWS (or any other of your choice)
can solve this issue too, but they are a bit more difficult to set up. If you use AWS RDS you need
to pay particular attention to ensuring you set global access policies and proper security policies --
this can be quite tricky and therefore we do not recommend using AWS. We recommend that you
purchase an educational subscription to freemysqlhosting (and share it if you'd like to); it is the
easiest to set up and work with.
Both Amazon AWS and Google Cloud offer free credits, but be sure to monitor usage so you do
not exceed the free credit or are prepared to pay (they can be a bit costly, so be careful to
suspend database when not in use and delete after the Practicum has been graded).You may collaborate and work with others to set up a cloud MySQL installation, but not for the
remainder of the practicum. A cloud MySQL installation is necessary for us to run your code -- we
cannot connect to a local installation of MySQL, although you may use one for testing and
development.
2. (1 pts / 0.1 hrs) Create an R Notebook named "LastNameFirstInitial.CS5200.PractI-SuF24.Rmd"
where LastName is your last name and FirstInitial is the first letter of your first name, e.g., GilesM.
Set the title of the R Notebook to "Analysis of Wildlife Strikes to Aircraft" and the subtitle to
"Practicum I CS5200", add an author field, and set the date to "Summer Full 2024". Be sure to do
all work within an R Project (named as you wish).
3. (4 pts / 0.1 hrs) Add an R code chunk that connects to your MySQL database. Use headers for all
other questions with appropriate titles so you (and we) can navigate the notebook more easily. If
you have difficulty connecting to or setting up MySQL, then use SQLite and proceed. You can
always come back to this question and change your configuration so that you connect to MySQL.
This is the benefit of relational databases: you can easily switch between databases without
changing your code. Do not echo the code in the notebook and supress any warnings or other
messages. Add an appropriate code chunk label.
4. (30 pts / 3.5 hrs) In a single R code chunk that is not echoed (i.e., set "echo = F" for the code
chunk and suppress all messages), create the database schema described below (do not use
{sql} code chunks). Add appropriate constraints, primary key and foreign key definitions. In the
schema definitions below, primary keys are underlined and foreign keys are bolded.
A. (5 pts / 0.5 hrs) Create a table flights that stores information about flights that experienced
wildlife strikes and follows this schema:
flights (fid : integer,
date : date, originAirport : integer,
airlineName : text, aircraftType : text, isHeavy: boolean)
Make isHeavy a Boolean flag and use TRUE if the aircraft is a "heavy" aircraft, FALSE
otherwise -- this information is in the CSV. For aircraftType use the column aircraft from the
CSV. Use appropriate data types for the columns and store any date as a date type not as text
(subject to the data types your chosen database supports). If date or boolean data types are
not supported, choose another data type that will work or split the dates into month, day, and
year columns. Note that some columns contain periods so that will require special treatment in
SQL -- investigate how to deal with this common issue.
The column originAirport is a foreign key linking a flight to an airport (defined in the table
below).B. (5 pts / 0.5 hrs) Create a table that stores airports and that follows this schema:
airports (aid : integer, airportName : text, airportState : text, airportCode : text)
aid is a synthetic primary key, airportName is the name of the airport and is from the airport
column in the CSV, airportState corresponds to the airport state (the origin column) from the
data file. The airportCode should be the airport's international code, e.g., BOS for Boston or
LGA for LaGuardia. For now, set it to the default value "ZZZ" -- it is for future expansion.
C. (4 pts / 0.3 hrs) Link the flights and airports tables via the origin foreign key in flights to the
primary key aid in airports. The origin is an FK to the airport in the airports table. Update the
above table definitions for airports and flights as necessary.
D. (4 pts / 0.5 hrs) Create a lookup table called conditions defined as follows:
conditions (cid, sky_condition, explanation)
Link this lookup table to the incidents table defined below through the conditions foreign key.
This table contains the values of all sky conditions, e.g., 'Overcast'. Leave
the explanation column empty (future expansion). Use the values from the CSV.
E. (4 pts / 0.5 hrs) Create a table that stores wildlife strike incidents called incidents and that
follows this schema:
incidents (iid : integer, fid : integer, wlsize : text, impact : text, altitude : integer ≥ 0, conditions
: {...})
iid is a synthetic key. fid is a foreign key for the flight for which this strike occurred for. The
conditions column is a foreign key link to the conditions lookup table. The column wlsize is the
wildlife size from the corresponding column in the CSV. The other columns are directly from
the CSV, or, if not present in the CSV, set them to a default value for now. The column altitude
should be restricted to positive integers greater than or equal to zero using either application
logic or a CHECK constraint (if the database supports constraints; MySQL version prior to
8.0.16 do not enforce any CHECK constraints). Using a trigger is also an option, but do not
turn on triggers until all data is inserted, as it will have significant a negative performance
impact on row insertion.
F. (3 pts / 0.3 hrs) Link the incidents and flights tables via the fid foreign key in incidents to the
primary key fid in flights.
G. (2 pts / 0.3 hrs) Add one or more code chunks that are not evaluated (eval = F) when the
Notebook is knitted but can be used to test your table definitions. Add whatever test code youneed to assure yourself that your table definitions are correct.
5. (1 pts / 0.1 hrs) If you haven't yet, download the bird strikes CSV data file from the link provided
before. Place the CSV file into the same folder as your R Notebook and then, in a separate code
chunk (not echoed) load it into a dataframe called bds.raw. Do not use a path name when loading.
The default path is the local folder that contains the R Notebook when you have the R Notebook
in an R Project. Be sure to always re-open the project when returning to R Studio.
6. (20 pts / 8 hrs) Using the table definitions and the data in the dataframe bds.raw from above,
populate the tables with the data from the appropriate columns. Omit any columns from the CSV
that are not referenced in the tables. You do not need to create any additional tables. Because we
are not adding additional tables there will be (unnormalized) data and repetitions, for example for
aircraft -- that is acceptable for this practicum due to time constraints but would not be if this were
an actual analytics database project. Assume "Business" to be an airline name (it is actually a
private flight but we assume it is the airline called "Business") and store those incidents.
You may (nd should) use functions to structure your code. Be sure not to echo any code chunks
and supress messages.
Use default values where the data file does not contain values or leave empty. If there is no
airport or airline, then link to a "sentinel" airline or airport, i.e., add an "unknown" airline and airport
to the tables rather than leaving the value NULL. Assign synthetic key values as and where
needed and use them as primary keys. Whether you generate them in R code or the database is
up to you -- each has pros and cons and part of the objective of this practicum is for you to think
through such decisions.
Map the values in the data file to appropriate values in any lookup tables using reasonable rules
that you may define as necessary.
All data manipulation and importing work must occur in R. You may not modify the original
data outside of R -- that would not be reproducible work. It may be helpful to create a subset
of the data for development and testing as the full file is quite large and takes time to load.
7. (1 pts / 0.5 hr) Add code chunks that you can run and that show that the loading of the data
worked by displaying parts of each table (do not show the entire tables). Do not echo the results
of those code chunks and make sure they are eval=F so they are not included in any knitting
process.
8. (3 pts / 1 hr) Add a level two (##) header with the title "Top Airlines with Strikes" and below create
a SQL query against your database to find the top 5 airlines with the greatest number of wildlife
strike incidents. You must use an R function to execute the query. It must be a single query.
Display the airline and the number of incidents. Note that every row in the incidents tableconstitutes one "incident". Do not echo the code.
9. (10 pts / 1 hr) Add a level two (##) header with the title "Analysis by Airport". Create a SQL query
against your database to find the airports that had an above average number wildlife strike
incidents. You must use an R function to execute the query. It must be a single query. To do this,
find the number of bird strike incidents for each airport (remember that each row in the incidents
table is a single bird strike incident). Then calculate the average across all airports and from there
find those airports which had an above average number of bird strike incidents. List the names of
the top 5 airports and the number of incidents for each. Do not echo the code, only the result.
10. (8 pts / 1 hr) Add a level two (##) header with the title "Analysis by Year". Create a SQL query
against your database to find the (total) number of wildlife strikes per year. Save the result of the
query in a dataframe. You must use an R function to execute the query. It must be a single query.
Format the result with appropriate table headers using the kableExtra package. This query can
help answer the question which year has more wildlife striking aircraft than normal. Do not echo
the code, only the result.
11. (5 pts / 2 hrs) Add a level two (##) header with the title "Trend by Year". Using the dataframe from
Question 10 above, build a line chart that plots year along the x-axis versus number of the
number of strikes on the y-axis. Adorn the graph with appropriate axis labels, titles, legend, data
labels, etc. You should use the standard R plot() function; you do not need to use packages
such as ggplot, ggplot2, or plotly -- although you may, of course. This tutorial
(https://www.statmethods.net/graphs/scatterplot.html) may help you get started.
12. (10 pts / 3 hrs) Create a stored procedure in MySQL (note that if you used SQLite, then you
cannot complete this step) that updates a strike (identified by its primary key) from the database.
You may decide what you need to pass to the stored procedure to update a wildlife incident. You
must log this update in an audit log that contains the type of modification (an update), the table
that was manipulated, the time and date of removal, and the original values of the strike that was
updated. Create the "update_log" table if it does not exist.
Show (in R) that your procedure works and that the log table is updated as required.
Note that if you used SQLite rather than the required MySQL for the practicum, then you cannot
complete this question as SQLite does not support stored procedures.
13. (5 pts) Create professionally developed code that is well documented and all chunks are labeled.
/PU[ZHUK;PWZ
Ask clarifying questions in the "Practicum I QnA" channel on Teams.
If you find it helpful, draw an ERD.
Ask questions as soon as you encounter them.
While you need to look up details for functions and R, the solution cannot be found via Google.Do not spend extraordinary time on code errors; ask for help if you cannot resolve them within 30-
minutes. First through Teams and then by going to TA office hours -- the TAs will alert the
instructor if there are doubts they cannot resolve.
The TA can help you and answer questions, but they are not your personal tutors; do not rely on
them to debug your code. Code small, test often.
You may use AI assistants as needed but let us know which ones you used and for what; add
acknowledgments in your code and you must be able to explain any code you copied or
borrowed.
If you have trouble connecting, be sure to disable any firewall or anti-virus software that may be
clocking port 3306 -- or add port 3306 to the list of open ports in your firewall software
configuration.
The function dbWriteTable() is disabled for bulk loading on the MySQL cloud installation
on db4free. It does work fine for MySQL on freemysqlhosting.net and AWS RDS and for local
installations of MySQL if you allow for batch loading.
Be sure to click the activation link in the email from db4free.net after you create your database; if
you get "Access denied..." error messages then you did not activate your account.
Be sure to shut down your databases if you are not using them; otherwise you might exceed the
monthly free time. And, be sure to delete the database after the course is done or you will be
billed.
Batch loading (via dbWriteTable() ) is disabled for security reasons by default on local installations
of MySQL. Here's how to enable it: mysql - ERROR: Loading local data is disabled - this must
be enabled on both the client and server sides - Stack Overflow
(https://stackoverflow.com/questions/59993844/error-loading-local-data-is-disabled-this-must-beenabled-on-both-the-client)
. Note that you cannot enable this for db4free.net -- it does not give
you privileged access to do this.
Enable functions, procedures, and triggers on Amazon AWS RDS:
https://aws.amazon.com/premiumsupport/knowledge-center/rds-mysql-functions/
(https://aws.amazon.com/premiumsupport/knowledge-center/rds-mysql-functions/)
On AWS, if you get errors that the service is not reachable, then you need to edit the "inbound
rules" of the security group that is attached to the RDS -- add the MySQL/Aurora type and port
range 3306 with cidr 0.0.0.0/0
If you use sqldf for manipulating or querying internal data frames, be mindful of conflicts between
SQLite and MySQL; see mysql - How to use sqldf in R to manipulate local dataframes? -
Stack Overflow. (https://stackoverflow.com/questions/60117431/how-to-use-sqldf-in-r-tomanipulate-local-dataframes)
A potential issue with sqldf can occur when you connect to MySQL
or a non-SQLite database as sqldf attempts to use your existing database connection as a
backing store for its data; this will often not work due to security constraints. So, you need to add
the R code options(sqldf.driver = 'SQLite') which forces sqldf to use SQLite as its backing
store.
sqldf is very slow for querying or extracting data from a data frame as it actually copies the data
frame to an in-memory SQLite database and then runs a SQL query; so, only use sqldf is there's
no simple way to do a native R "query" with logical operations and which() and any() , e.g., use
sqldf to do grouping but not much elseIf you get errors connecting to MySQL, make sure you have the latest version of R and upgrade R
and all packages as necessary.
Using mixed upper and lower case for table names sometimes causes issues with
dbSendStatement() and dbWriteTable() when using MySQL; make all table names and attributes
names fully lower case; SQL is not case sensitive when it comes to keywords like INSERT vs
insert BUT it is when it comes to table and attribute names
Put columns that contain special character such as period (.) or have names that are keywords
into backticks, e.g., SELECT `condition` FROM incidents;
If you get a message "Can't initialize character set unknown", see
https://stackoverflow.com/questions/52613809/rmysql-error-cant-initialize-character-setunknown
(https://stackoverflow.com/questions/52613809/rmysql-error-cant-initialize-characterset-unknown)
and https://github.com/pBlueG/SA-MP-MySQL/issues/203
(https://github.com/pBlueG/SA-MP-MySQL/issues/203)
There appears to be a bug in MySQL stored procedures that contain a SELECT as the last
statement in the procedure; if you get the message "Commands out of sync; you can't run this
command now", you must use INTO with your SELECT; see
https://stackoverflow.com/questions/6583020/mysql-stored-procedure-caused-commandsout-of-sync
(https://stackoverflow.com/questions/6583020/mysql-stored-procedure-causedcommands-out-of-sync)
All
your work must be within your programs; we will run your code against one of our MySQL
servers with a blank database (or SQLite if that's what you used) and it has to run from beginning
to end; so if you did work outside of your code then we cannot reproduce your work you'll get a
grade of 0.
Before submitting, test that you code runs and your any notebooks that may be required knit in a
clean environment. So, remove all objects first by either including this code in the beginning of
your R Notebook rm(list = ls()) or by clearing your environment in R Studio with the menu item
Session/Clear Workspace.
Explain your approaches and any manipulations, omissions, deletions, or modifications of data.
Be sure to install packages within your code (but only if not installed) to ensure they get installed
when the graders run your code. Here's an elegant way to do this:
https://statsandr.com/blog/an-efficient-way-to-install-and-load-r-packages/
(https://statsandr.com/blog/an-efficient-way-to-install-and-load-r-packages/)
In R, if you want to change a global variable (one used first outside a function), then you need to
use a special assignment operator: <<- instead of <-.
Unlike MySQL and other databases, SQLite does not have support for a separate "Boolean" data
type; instead Boolean values are stored as integers, although SQLite recognizes the keyword
TRUE (1) and FALSE (0) as of version 3.23.0 but those are just synonyms for the integers values
1 (TRUE) and 0 (FALSE).
If your data frames are not written to the database when you use dbWriteTable() then check to
see if your data frame contains columns of type date -- there may be issues in converting R dates
to SQL dates; for more information on this, see Lesson 6.306 Dates in R and SQLite
(http://artificium.us/lessons/06.r/l-6-306-dates-in-r-and-sql/l-6-306.html)To find version of your database, issue this SQL meta-query from an R code chunk or via MySQL
Workbench: SELECT version()
freemysqlhosting does not support triggers and has limits on constraints.
If you are using Google Cloud or AWS MySQL (which are awesome), be sure to suspend when
not using and shut down when you are done -- you will continue to be charged and eventually
your free credits will run out.
If you use freemysqlhosting, we recommend that you pay for a 100MB upgrade (really cheap, like
cost of lunch or dinner out) as otherwise exceeding the 5MB limit may get your account
suspended. Alternatively, you could split the data over multiple databases, or you could reduce
the size of the CSV so that the database is below 5MB. You may share a single subscription
among several students as it allows for up to five independent databases. You may not share the
database, of course.
4`:83*SV\K:LY]LYZ
Amazon RDS Free Tier | Cloud Relational Database | Amazon Web Services
(https://aws.amazon.com/rds/free/)
db4free.net - MySQL Database (https://www.db4free.net/)
Hosting with PHP, MySQL and cPanel (freehosting.host) (https://freehosting.host/)
Hosting with PHP, MySQL (AwardSpace.com) (https://www.awardspace.com/)
0UZWLJ[PUN3HYNL*:=-PSLZ
In the video tutorial below, Khoury Boston's Prof. Feinberg demonstrates how to find the path of a file
and how to inspect large CSV files prior to loading into R or another programming
environment. Follow the steps as you watch the tutorial. Stop the tutorial as needed. This
demonstration presumes that you have MacOS or use a Linux/Unix environment.
0:00 / 5:59
:\ITPZZPVU+L[HPSZ
Before submitting your code, complete the self-evaluation rubric (separate "assignment"; see
Canvas).1. Provide in your R Notebook:
A. name and email
B. table creation SQL for MySQL/MariaDB (or SQLite)
C. results of queries, visualizations, computations to show that your code works as expected
D. clear explanations of your code that is decomposed into chunks and preceded by headers
2. Complete the self-evaluation rubric: [30 min] COMPLETE: Practicum I / Self-Evaluation
(https://northeastern.instructure.com/courses/180750/assignments/2248391?wrap=1)
3. Submit the .Rmd source file of your R Notebook. The code must run from start to end, so be sure
to test carefully, load any required libraries, and ensure that your chunks run sequentially from
start to end.
4. Submit a knitted HTML of your Rmd; the HTML must be a compact page, so pay attention to
formatting, headers, and how much data you print. If you have trouble knitting on your local
installation of R Studio, consider using the cloud version of R Studio to do the knitting. Not
submitting a knitted file results in a loss of 5 points. Submitting a poorly formatted HTML results in
a loss up to 5 points.
5. All your work must be within your R Notebook; during the demonstration, we will ask you to run
your R Notebook against one of our MySQL servers with a blank database and it has to run from
beginning to end; so if you did work outside of your R Notebook (e.g., in Excel or MySQL
Workbench) then we won't be able to reproduce your work and you will get a grade of 0.
6. Your code has to run, obviously, but it also has to run somewhat efficiently... if everyone else's
code runs in 10-30 minutes but yours takes several hours then clearly is due to poor programming
and not due to the inherent complexity of the problem... follow common coding strategies for
writing efficient code such as factoring out invariants from loops, not calling functions repeatedly,
pre-allocating memory, not copying objects needlessly, not calling expensive functions when
simpler ones will do (e.g., call substring() instead of doing regular expressions), use which() when
searching and don't use sqldf, and so forth. These practices are not specific to R, although there
are R specific performance issues, but those are less likely to be a concern here.
9LX\PYLK5L_[:[LWZ
Check your submission to ensure the correct files have been submitted
Complete and submit the self-evaluation rubric
Schedule demo
Points 100
Submitting a file upload
File Types rmd and html
Available until Jul 14 at 11:59pm
:[HY[(ZZPNUTLU[
5V[H)LUL
Read this entire page twice before you get started.
The due date is July 2 at 11:59pm ET. Late submissions are accepted (with the usual penalty of
2.5% per day late) until July 14 at 11:59pm ET. No submissions are accepted after that -- not
submitting the practicum by then will make it impossible to earn a passing grade for the course.
Work on the practicum for at least three hours every day and use the time during the week prior to
the practicum to start working on it, especially the configuration of the MySQL Server and the loading
of the data. Do not wait to get started.
A gentle reminder that the average of both practicums must be above 70% to pass this course, so be
sure to complete on time and seek help right away. Do not procrastinate -- things that appear simple
often take more time than expected and, of course, programming is fraught with potholes on the road
to success. So plan accordingly. Do not wait until shortly before it is due.
The average time to complete the practicum is 15-25 hours. Do not wait to start. Seek help early.
Submit often and as soon as you have enough code that works. We will only grade the last
submission. Check your submission before you submit and after.
6]LY]PL^
In this practicum you will build a database that can be used to analyze bird strikes on aircraft. For an
existing data set from the FAA [1], you will build a logical data model, a relational schema, realize the
relational schema in MySQL/MariaDB, load data into the database, execute SQL queries, a finally
perform some simple analysis of the data.
The graphic below shows some statistics regarding bird strikes and helps frame the data in the data
file.
Use the provided time estimates for each tasks to time-box your time. Seek assistance if you spend
more time than specified on a task -- you are likely not solving the problem correctly. A key objective
is to learn how to look things up, how to navigate complex problems, and how to identify and resolve
programming errors.
3LHYUPUN6IQLJ[P]LZ
In this practicum you will learn how to:
configure cloud MySQL
connect to cloud MySQL from R in an R Notebook
implement a relational schema for an existing data set
load data from CSV files into a relational database through R
execute SQL queries against a MySQL database through R
perform simple analytics in R
identify and resolve programming errors
look up details for R, SQL, and MySQL
time-box work
-VYTH[
Complete individually.
7YLYLX\PZP[L;HZRZ
1. Read the Hints and Tips section below and go back to them often when you encounter problem.
Most problems are addressed in that section. Consult the list before contacting us for help as
you'll be able to resolve the issue more quickly.
2. Create an R Project and within that project create an R Notebook in which to do all of your work.
Place all files within the project folder and do not use absolute file paths. Embed any diagrams inthe R Notebook and be sure to submit them as files in your submission. Keep updating this
notebook as you complete tasks. Learn how to use journaling in data analysis work. Use 6.204 --
Literate Programming with R Notebooks (http://artificium.us/lessons/06.r/l-6-204-rnotebooks/l-6-204.html)
for reference. Of course, in practice, you will likely build different
programs (perhaps even using different programming language) for database creation, data
loading, and the dashboard. But here we will do it all in one R Notebook.
3. Become familiar with the problem of bird strikes on aircraft. The report in [1] provides an overview.
You should scan it to become familiar with the domain but you do not need to read it deeply.
4. Download the file BirdStrikesData.csv (https://s3.us-east2.amazonaws.com/artificium.us/datasets/BirdStrikesData-V4-SuF24.csv)
and save it locally to your
R Project folder (the same folder that contains you .Rmd and your .Rproj files) or reference it from
the URL. You may wish to create a new data file that is a subset of the full data that you use for
development so loading takes less time. This is a common strategy in practice. To download the
file, use the right mouse button and choose "Save Link As..." or a similar menu choice on your
browser. Do not click on the link as that will cause the browser to open and display the file.
Read the Hints & Tips section frequently and before posting questions.
2L`9LZV\YJLZ
7YLYLX\PZP[L3LZZVUZ
06.103 Working with Vectors and Data Frames in R (http://artificium.us/lessons/06.r/l-6-
103-vecs-and-dfs/l-6-103.html)
06.106 Import Data into R from CSV, TSV, and Excel Files
(http://artificium.us/lessons/06.r/l-6-106-load-csv-tsv-excel-files/l-6-106.html)
06.108 Loops and Iteration in R (http://artificium.us/lessons/06.r/l-6-108-loops-iteration-inr/l-6-108.html)
06.112
Basics of Text & String Processing in R (http://artificium.us/lessons/06.r/l-6-112-
text-proc/l-6-112.html)
06.121 Writing Functions in R (http://artificium.us/lessons/06.r/l-6-121-funcs-in-r/l-6-
121.html)
06.204 Literate Programming with R Notebooks (http://artificium.us/lessons/06.r/l-6-204-rnotebooks/l-6-204.html)
06.191
Debugging R Code (http://artificium.us/lessons/06.r/l-6-191-debugging/l-6-191.html)
06.301 Working with Databases in R (http://artificium.us/lessons/06.r/l-6-681-key-value-dbredis-from-r/l-6-301.html)
06.304
Configure and Connect to Cloud MySQL from R (http://artificium.us/lessons/06.r/l6-304-cloudMySQL-from-r/l-6-304.html)
06.302
Bulk Load Data from CSV into Database in R (http://artificium.us/lessons/06.r/l-6-
302-bulkload-data-into-db/l-6-302.html)
06.306 Dates in R and SQLite (http://artificium.us/lessons/06.r/l-6-306-dates-in-r-and-sql/l-6-
306.html)70.907 Stored Procedures in MySQL (http://artificium.us/lessons/70.sql/l-70-907-storedprocs-mysql/l-70-907.html)
;HZRZ
Before
you start, read all of the questions. Inspect the CSV data file that you downloaded so you are
familiar with its columns, data types, and overall structure. Assume that this database will eventually
be used for an app that can be used by pilots (of any kind of aircraft) to report wildlife incidents.
All R and SQL code blocks must be named, as shown in the example below. This is necessary so
that you can reference your code blocks in the self-evaluation rubric to be filled out at the end of the
practicum. The names of code blocks must be unique.
```{r nameOfRCodeBlock, eval = T, warning = F}
```{sql nameOfSQLCodeBlock, connection = xDB}
You may add any additional block parameters as needed.
Use functions to structure your code so that it becomes more readable and easier to develop and
debug. Use headers to segment your notebook and add explanations as to what each code block
means. Follow common coding practices and format your code so it is readable, and use functions to
break down complex code.
1. (5 pts / 2.0 hrs) Set up and configure a MySQL Server database on db4free.net
(http://db4free.net/) or any other MySQL cloud service of your choice such as
freemysqlhosting.net (https://www.freemysqlhosting.net/) , Google Cloud or AWS. Note that
you may use SQLite instead of MySQL but you will not get credit for this question nor credit for the
question below that asks you to create a stored procedure, as those are not supported in SQLite.
See the Hints below for information on db4free. Lesson 6.304 -- Configure and Connect to
Cloud MySQL from R (http://artificium.us/lessons/06.r/l-6-304-cloudMySQL-from-r/l-6-304.html)
provides some advice on connecting to these two databases.
If you use freemysqlhosting, we recommend that you pay (for the remainder of the term or the
entire year) for a 100MB upgrade (really cheap, like cost of couple cups of coffee) as otherwise
exceeding the 5MB limit will get your account suspended. You may share a subscription as it
allows five databases. Alternatively, you could split the data over multiple databases. Of course,
using a different MySQL host such as Google Cloud or Amazon AWS (or any other of your choice)
can solve this issue too, but they are a bit more difficult to set up. If you use AWS RDS you need
to pay particular attention to ensuring you set global access policies and proper security policies --
this can be quite tricky and therefore we do not recommend using AWS. We recommend that you
purchase an educational subscription to freemysqlhosting (and share it if you'd like to); it is the
easiest to set up and work with.
Both Amazon AWS and Google Cloud offer free credits, but be sure to monitor usage so you do
not exceed the free credit or are prepared to pay (they can be a bit costly, so be careful to
suspend database when not in use and delete after the Practicum has been graded).You may collaborate and work with others to set up a cloud MySQL installation, but not for the
remainder of the practicum. A cloud MySQL installation is necessary for us to run your code -- we
cannot connect to a local installation of MySQL, although you may use one for testing and
development.
2. (1 pts / 0.1 hrs) Create an R Notebook named "LastNameFirstInitial.CS5200.PractI-SuF24.Rmd"
where LastName is your last name and FirstInitial is the first letter of your first name, e.g., GilesM.
Set the title of the R Notebook to "Analysis of Wildlife Strikes to Aircraft" and the subtitle to
"Practicum I CS5200", add an author field, and set the date to "Summer Full 2024". Be sure to do
all work within an R Project (named as you wish).
3. (4 pts / 0.1 hrs) Add an R code chunk that connects to your MySQL database. Use headers for all
other questions with appropriate titles so you (and we) can navigate the notebook more easily. If
you have difficulty connecting to or setting up MySQL, then use SQLite and proceed. You can
always come back to this question and change your configuration so that you connect to MySQL.
This is the benefit of relational databases: you can easily switch between databases without
changing your code. Do not echo the code in the notebook and supress any warnings or other
messages. Add an appropriate code chunk label.
4. (30 pts / 3.5 hrs) In a single R code chunk that is not echoed (i.e., set "echo = F" for the code
chunk and suppress all messages), create the database schema described below (do not use
{sql} code chunks). Add appropriate constraints, primary key and foreign key definitions. In the
schema definitions below, primary keys are underlined and foreign keys are bolded.
A. (5 pts / 0.5 hrs) Create a table flights that stores information about flights that experienced
wildlife strikes and follows this schema:
flights (fid : integer,
date : date, originAirport : integer,
airlineName : text, aircraftType : text, isHeavy: boolean)
Make isHeavy a Boolean flag and use TRUE if the aircraft is a "heavy" aircraft, FALSE
otherwise -- this information is in the CSV. For aircraftType use the column aircraft from the
CSV. Use appropriate data types for the columns and store any date as a date type not as text
(subject to the data types your chosen database supports). If date or boolean data types are
not supported, choose another data type that will work or split the dates into month, day, and
year columns. Note that some columns contain periods so that will require special treatment in
SQL -- investigate how to deal with this common issue.
The column originAirport is a foreign key linking a flight to an airport (defined in the table
below).B. (5 pts / 0.5 hrs) Create a table that stores airports and that follows this schema:
airports (aid : integer, airportName : text, airportState : text, airportCode : text)
aid is a synthetic primary key, airportName is the name of the airport and is from the airport
column in the CSV, airportState corresponds to the airport state (the origin column) from the
data file. The airportCode should be the airport's international code, e.g., BOS for Boston or
LGA for LaGuardia. For now, set it to the default value "ZZZ" -- it is for future expansion.
C. (4 pts / 0.3 hrs) Link the flights and airports tables via the origin foreign key in flights to the
primary key aid in airports. The origin is an FK to the airport in the airports table. Update the
above table definitions for airports and flights as necessary.
D. (4 pts / 0.5 hrs) Create a lookup table called conditions defined as follows:
conditions (cid, sky_condition, explanation)
Link this lookup table to the incidents table defined below through the conditions foreign key.
This table contains the values of all sky conditions, e.g., 'Overcast'. Leave
the explanation column empty (future expansion). Use the values from the CSV.
E. (4 pts / 0.5 hrs) Create a table that stores wildlife strike incidents called incidents and that
follows this schema:
incidents (iid : integer, fid : integer, wlsize : text, impact : text, altitude : integer ≥ 0, conditions
: {...})
iid is a synthetic key. fid is a foreign key for the flight for which this strike occurred for. The
conditions column is a foreign key link to the conditions lookup table. The column wlsize is the
wildlife size from the corresponding column in the CSV. The other columns are directly from
the CSV, or, if not present in the CSV, set them to a default value for now. The column altitude
should be restricted to positive integers greater than or equal to zero using either application
logic or a CHECK constraint (if the database supports constraints; MySQL version prior to
8.0.16 do not enforce any CHECK constraints). Using a trigger is also an option, but do not
turn on triggers until all data is inserted, as it will have significant a negative performance
impact on row insertion.
F. (3 pts / 0.3 hrs) Link the incidents and flights tables via the fid foreign key in incidents to the
primary key fid in flights.
G. (2 pts / 0.3 hrs) Add one or more code chunks that are not evaluated (eval = F) when the
Notebook is knitted but can be used to test your table definitions. Add whatever test code youneed to assure yourself that your table definitions are correct.
5. (1 pts / 0.1 hrs) If you haven't yet, download the bird strikes CSV data file from the link provided
before. Place the CSV file into the same folder as your R Notebook and then, in a separate code
chunk (not echoed) load it into a dataframe called bds.raw. Do not use a path name when loading.
The default path is the local folder that contains the R Notebook when you have the R Notebook
in an R Project. Be sure to always re-open the project when returning to R Studio.
6. (20 pts / 8 hrs) Using the table definitions and the data in the dataframe bds.raw from above,
populate the tables with the data from the appropriate columns. Omit any columns from the CSV
that are not referenced in the tables. You do not need to create any additional tables. Because we
are not adding additional tables there will be (unnormalized) data and repetitions, for example for
aircraft -- that is acceptable for this practicum due to time constraints but would not be if this were
an actual analytics database project. Assume "Business" to be an airline name (it is actually a
private flight but we assume it is the airline called "Business") and store those incidents.
You may (nd should) use functions to structure your code. Be sure not to echo any code chunks
and supress messages.
Use default values where the data file does not contain values or leave empty. If there is no
airport or airline, then link to a "sentinel" airline or airport, i.e., add an "unknown" airline and airport
to the tables rather than leaving the value NULL. Assign synthetic key values as and where
needed and use them as primary keys. Whether you generate them in R code or the database is
up to you -- each has pros and cons and part of the objective of this practicum is for you to think
through such decisions.
Map the values in the data file to appropriate values in any lookup tables using reasonable rules
that you may define as necessary.
All data manipulation and importing work must occur in R. You may not modify the original
data outside of R -- that would not be reproducible work. It may be helpful to create a subset
of the data for development and testing as the full file is quite large and takes time to load.
7. (1 pts / 0.5 hr) Add code chunks that you can run and that show that the loading of the data
worked by displaying parts of each table (do not show the entire tables). Do not echo the results
of those code chunks and make sure they are eval=F so they are not included in any knitting
process.
8. (3 pts / 1 hr) Add a level two (##) header with the title "Top Airlines with Strikes" and below create
a SQL query against your database to find the top 5 airlines with the greatest number of wildlife
strike incidents. You must use an R function to execute the query. It must be a single query.
Display the airline and the number of incidents. Note that every row in the incidents tableconstitutes one "incident". Do not echo the code.
9. (10 pts / 1 hr) Add a level two (##) header with the title "Analysis by Airport". Create a SQL query
against your database to find the airports that had an above average number wildlife strike
incidents. You must use an R function to execute the query. It must be a single query. To do this,
find the number of bird strike incidents for each airport (remember that each row in the incidents
table is a single bird strike incident). Then calculate the average across all airports and from there
find those airports which had an above average number of bird strike incidents. List the names of
the top 5 airports and the number of incidents for each. Do not echo the code, only the result.
10. (8 pts / 1 hr) Add a level two (##) header with the title "Analysis by Year". Create a SQL query
against your database to find the (total) number of wildlife strikes per year. Save the result of the
query in a dataframe. You must use an R function to execute the query. It must be a single query.
Format the result with appropriate table headers using the kableExtra package. This query can
help answer the question which year has more wildlife striking aircraft than normal. Do not echo
the code, only the result.
11. (5 pts / 2 hrs) Add a level two (##) header with the title "Trend by Year". Using the dataframe from
Question 10 above, build a line chart that plots year along the x-axis versus number of the
number of strikes on the y-axis. Adorn the graph with appropriate axis labels, titles, legend, data
labels, etc. You should use the standard R plot() function; you do not need to use packages
such as ggplot, ggplot2, or plotly -- although you may, of course. This tutorial
(https://www.statmethods.net/graphs/scatterplot.html) may help you get started.
12. (10 pts / 3 hrs) Create a stored procedure in MySQL (note that if you used SQLite, then you
cannot complete this step) that updates a strike (identified by its primary key) from the database.
You may decide what you need to pass to the stored procedure to update a wildlife incident. You
must log this update in an audit log that contains the type of modification (an update), the table
that was manipulated, the time and date of removal, and the original values of the strike that was
updated. Create the "update_log" table if it does not exist.
Show (in R) that your procedure works and that the log table is updated as required.
Note that if you used SQLite rather than the required MySQL for the practicum, then you cannot
complete this question as SQLite does not support stored procedures.
13. (5 pts) Create professionally developed code that is well documented and all chunks are labeled.
/PU[ZHUK;PWZ
Ask clarifying questions in the "Practicum I QnA" channel on Teams.
If you find it helpful, draw an ERD.
Ask questions as soon as you encounter them.
While you need to look up details for functions and R, the solution cannot be found via Google.Do not spend extraordinary time on code errors; ask for help if you cannot resolve them within 30-
minutes. First through Teams and then by going to TA office hours -- the TAs will alert the
instructor if there are doubts they cannot resolve.
The TA can help you and answer questions, but they are not your personal tutors; do not rely on
them to debug your code. Code small, test often.
You may use AI assistants as needed but let us know which ones you used and for what; add
acknowledgments in your code and you must be able to explain any code you copied or
borrowed.
If you have trouble connecting, be sure to disable any firewall or anti-virus software that may be
clocking port 3306 -- or add port 3306 to the list of open ports in your firewall software
configuration.
The function dbWriteTable() is disabled for bulk loading on the MySQL cloud installation
on db4free. It does work fine for MySQL on freemysqlhosting.net and AWS RDS and for local
installations of MySQL if you allow for batch loading.
Be sure to click the activation link in the email from db4free.net after you create your database; if
you get "Access denied..." error messages then you did not activate your account.
Be sure to shut down your databases if you are not using them; otherwise you might exceed the
monthly free time. And, be sure to delete the database after the course is done or you will be
billed.
Batch loading (via dbWriteTable() ) is disabled for security reasons by default on local installations
of MySQL. Here's how to enable it: mysql - ERROR: Loading local data is disabled - this must
be enabled on both the client and server sides - Stack Overflow
(https://stackoverflow.com/questions/59993844/error-loading-local-data-is-disabled-this-must-beenabled-on-both-the-client)
. Note that you cannot enable this for db4free.net -- it does not give
you privileged access to do this.
Enable functions, procedures, and triggers on Amazon AWS RDS:
https://aws.amazon.com/premiumsupport/knowledge-center/rds-mysql-functions/
(https://aws.amazon.com/premiumsupport/knowledge-center/rds-mysql-functions/)
On AWS, if you get errors that the service is not reachable, then you need to edit the "inbound
rules" of the security group that is attached to the RDS -- add the MySQL/Aurora type and port
range 3306 with cidr 0.0.0.0/0
If you use sqldf for manipulating or querying internal data frames, be mindful of conflicts between
SQLite and MySQL; see mysql - How to use sqldf in R to manipulate local dataframes? -
Stack Overflow. (https://stackoverflow.com/questions/60117431/how-to-use-sqldf-in-r-tomanipulate-local-dataframes)
A potential issue with sqldf can occur when you connect to MySQL
or a non-SQLite database as sqldf attempts to use your existing database connection as a
backing store for its data; this will often not work due to security constraints. So, you need to add
the R code options(sqldf.driver = 'SQLite') which forces sqldf to use SQLite as its backing
store.
sqldf is very slow for querying or extracting data from a data frame as it actually copies the data
frame to an in-memory SQLite database and then runs a SQL query; so, only use sqldf is there's
no simple way to do a native R "query" with logical operations and which() and any() , e.g., use
sqldf to do grouping but not much elseIf you get errors connecting to MySQL, make sure you have the latest version of R and upgrade R
and all packages as necessary.
Using mixed upper and lower case for table names sometimes causes issues with
dbSendStatement() and dbWriteTable() when using MySQL; make all table names and attributes
names fully lower case; SQL is not case sensitive when it comes to keywords like INSERT vs
insert BUT it is when it comes to table and attribute names
Put columns that contain special character such as period (.) or have names that are keywords
into backticks, e.g., SELECT `condition` FROM incidents;
If you get a message "Can't initialize character set unknown", see
https://stackoverflow.com/questions/52613809/rmysql-error-cant-initialize-character-setunknown
(https://stackoverflow.com/questions/52613809/rmysql-error-cant-initialize-characterset-unknown)
and https://github.com/pBlueG/SA-MP-MySQL/issues/203
(https://github.com/pBlueG/SA-MP-MySQL/issues/203)
There appears to be a bug in MySQL stored procedures that contain a SELECT as the last
statement in the procedure; if you get the message "Commands out of sync; you can't run this
command now", you must use INTO with your SELECT; see
https://stackoverflow.com/questions/6583020/mysql-stored-procedure-caused-commandsout-of-sync
(https://stackoverflow.com/questions/6583020/mysql-stored-procedure-causedcommands-out-of-sync)
All
your work must be within your programs; we will run your code against one of our MySQL
servers with a blank database (or SQLite if that's what you used) and it has to run from beginning
to end; so if you did work outside of your code then we cannot reproduce your work you'll get a
grade of 0.
Before submitting, test that you code runs and your any notebooks that may be required knit in a
clean environment. So, remove all objects first by either including this code in the beginning of
your R Notebook rm(list = ls()) or by clearing your environment in R Studio with the menu item
Session/Clear Workspace.
Explain your approaches and any manipulations, omissions, deletions, or modifications of data.
Be sure to install packages within your code (but only if not installed) to ensure they get installed
when the graders run your code. Here's an elegant way to do this:
https://statsandr.com/blog/an-efficient-way-to-install-and-load-r-packages/
(https://statsandr.com/blog/an-efficient-way-to-install-and-load-r-packages/)
In R, if you want to change a global variable (one used first outside a function), then you need to
use a special assignment operator: <<- instead of <-.
Unlike MySQL and other databases, SQLite does not have support for a separate "Boolean" data
type; instead Boolean values are stored as integers, although SQLite recognizes the keyword
TRUE (1) and FALSE (0) as of version 3.23.0 but those are just synonyms for the integers values
1 (TRUE) and 0 (FALSE).
If your data frames are not written to the database when you use dbWriteTable() then check to
see if your data frame contains columns of type date -- there may be issues in converting R dates
to SQL dates; for more information on this, see Lesson 6.306 Dates in R and SQLite
(http://artificium.us/lessons/06.r/l-6-306-dates-in-r-and-sql/l-6-306.html)To find version of your database, issue this SQL meta-query from an R code chunk or via MySQL
Workbench: SELECT version()
freemysqlhosting does not support triggers and has limits on constraints.
If you are using Google Cloud or AWS MySQL (which are awesome), be sure to suspend when
not using and shut down when you are done -- you will continue to be charged and eventually
your free credits will run out.
If you use freemysqlhosting, we recommend that you pay for a 100MB upgrade (really cheap, like
cost of lunch or dinner out) as otherwise exceeding the 5MB limit may get your account
suspended. Alternatively, you could split the data over multiple databases, or you could reduce
the size of the CSV so that the database is below 5MB. You may share a single subscription
among several students as it allows for up to five independent databases. You may not share the
database, of course.
4`:83*SV\K:LY]LYZ
Amazon RDS Free Tier | Cloud Relational Database | Amazon Web Services
(https://aws.amazon.com/rds/free/)
db4free.net - MySQL Database (https://www.db4free.net/)
Hosting with PHP, MySQL and cPanel (freehosting.host) (https://freehosting.host/)
Hosting with PHP, MySQL (AwardSpace.com) (https://www.awardspace.com/)
0UZWLJ[PUN3HYNL*:=-PSLZ
In the video tutorial below, Khoury Boston's Prof. Feinberg demonstrates how to find the path of a file
and how to inspect large CSV files prior to loading into R or another programming
environment. Follow the steps as you watch the tutorial. Stop the tutorial as needed. This
demonstration presumes that you have MacOS or use a Linux/Unix environment.
0:00 / 5:59
:\ITPZZPVU+L[HPSZ
Before submitting your code, complete the self-evaluation rubric (separate "assignment"; see
Canvas).1. Provide in your R Notebook:
A. name and email
B. table creation SQL for MySQL/MariaDB (or SQLite)
C. results of queries, visualizations, computations to show that your code works as expected
D. clear explanations of your code that is decomposed into chunks and preceded by headers
2. Complete the self-evaluation rubric: [30 min] COMPLETE: Practicum I / Self-Evaluation
(https://northeastern.instructure.com/courses/180750/assignments/2248391?wrap=1)
3. Submit the .Rmd source file of your R Notebook. The code must run from start to end, so be sure
to test carefully, load any required libraries, and ensure that your chunks run sequentially from
start to end.
4. Submit a knitted HTML of your Rmd; the HTML must be a compact page, so pay attention to
formatting, headers, and how much data you print. If you have trouble knitting on your local
installation of R Studio, consider using the cloud version of R Studio to do the knitting. Not
submitting a knitted file results in a loss of 5 points. Submitting a poorly formatted HTML results in
a loss up to 5 points.
5. All your work must be within your R Notebook; during the demonstration, we will ask you to run
your R Notebook against one of our MySQL servers with a blank database and it has to run from
beginning to end; so if you did work outside of your R Notebook (e.g., in Excel or MySQL
Workbench) then we won't be able to reproduce your work and you will get a grade of 0.
6. Your code has to run, obviously, but it also has to run somewhat efficiently... if everyone else's
code runs in 10-30 minutes but yours takes several hours then clearly is due to poor programming
and not due to the inherent complexity of the problem... follow common coding strategies for
writing efficient code such as factoring out invariants from loops, not calling functions repeatedly,
pre-allocating memory, not copying objects needlessly, not calling expensive functions when
simpler ones will do (e.g., call substring() instead of doing regular expressions), use which() when
searching and don't use sqldf, and so forth. These practices are not specific to R, although there
are R specific performance issues, but those are less likely to be a concern here.
9LX\PYLK5L_[:[LWZ
Check your submission to ensure the correct files have been submitted
Complete and submit the self-evaluation rubric
Schedule demo