Reading Complex Data: SPAM email
STA 141B, Spring 2024
DUE: Wednesday, May 1, 10pm
We are all familiar with SPAM email messages, phone calls, texts, ... Nowadays, SPAM filters are quite effective. These filters use statistics to classify new email messages as SPAM or HAM (valid mail). There are many statistical techniques one could use. But before we can use them we need “data”. We need to “measure” variables on each message in a sample of email messages. We also need to know if they are SPAM or HAM. We can then train a statistical classifier using these variables to predict if a new message is SPAM or HAM.
Your job is to process a collection of email messages and create an R data frame of “derived” variables that give various measures of the email messages, e.g. the number of recipients to whom the mail was sent, the percentage of capital words in the body of the text, whether the message is a reply to another message. See below for a list of all the variables and also consider other variables you think might help classify a message as SPAM versus HAM.
The data we use are derived from the email messages on the SpamAssassin website
http://spamassassin.apache.org/publiccorpus. These are old messages from the early 2000s. Accordingly, the nature of SPAM messages is different from what it is today.
The data you are to work with is available at https://canvas.ucdavis.edu/files/23930657/download.
The messages are in 5 different directories/folders. The name of the directory indicates whether the messages it contains are HAM or SPAM. There are 6,541 messages in total.
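As a starting point, the files can be listed and labeled from their directory names. The sketch below is only one way to do this: the function name label_messages is made up for illustration, and it assumes that every SPAM directory has "spam" somewhere in its name, which you should check against the actual folder names.

```r
# Minimal sketch: list every message file under `root` and label it
# from the name of the directory that holds it.
# ASSUMPTION: SPAM directories contain "spam" in their name (e.g. "spam_2").
label_messages <- function(root) {
  files <- list.files(root, recursive = TRUE, full.names = TRUE)
  dirs <- basename(dirname(files))
  data.frame(file = files,
             isSpam = grepl("spam", dirs, ignore.case = TRUE),
             stringsAsFactors = FALSE)
}
```

Calling table(label_messages(root)$isSpam) on the unzipped data then gives a quick check that the total number of files matches the count stated above.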
Once you have this data frame of “derived” variables, explore the relationships amongst these variables and especially how they might be used to classify SPAM and HAM messages. In other words, look at density plots, scatter plots, mosaic plots of the variables. For scatter plots, you might color code the points based on if the message is SPAM or HAM. Which variables seem to do best at discriminating between SPAM and HAM messages?
For each message, extract URLs in the body of the message. Which URLs are only in the SPAM messages? Which Web domains (e.g., www.google.com) are most common in the SPAM URLs?
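One possible approach to the URL question uses gregexpr and regmatches. This is a rough sketch, not a complete URL parser: the function names extract_urls and url_domain are hypothetical, and the pattern is deliberately simple, so it will miss URLs without an http:// or https:// prefix.

```r
# Pull http(s) URLs out of a character vector of body lines.
extract_urls <- function(body) {
  # Match the scheme, then any run of characters that are not
  # white space, quotes, or angle brackets.
  m <- gregexpr("https?://[^[:space:]\"'<>]+", body, ignore.case = TRUE)
  unique(unlist(regmatches(body, m)))
}

# Reduce each URL to its (lower-cased) host name.
url_domain <- function(urls) {
  host <- sub("^https?://", "", urls, ignore.case = TRUE)
  # Drop everything from the first "/", ":", "?" or "#" onward.
  tolower(sub("[/:?#].*$", "", host))
}
```

With the URLs of all SPAM and HAM messages collected in two vectors, setdiff() gives the URLs seen only in SPAM, and sort(table(url_domain(...)), decreasing = TRUE) ranks the domains.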
You are to write functions to
• process a file containing an email message, converting it into a list with 3 elements
1. the header of the email message as a named vector of values,
2. the body of the message
3. a list with each attachment, where each attachment is a list with
– a header as a named vector of values
– the body of the attachment
Once you have a list of message information in this format, you can compute the different variables.
Compute at least 20 of these.
And there are bonus marks for proposing and computing other variables that could help classify messages.
Some functions you might find useful include: list.files, readLines, strsplit, grep, grepl, gregexpr, gregexec, regmatches, gsub, substring, nchar.
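To illustrate how a few of the variables in the Description of Variables table might be computed with these functions, here is a hedged sketch. It assumes a message has already been converted to a list with elements header (a named character vector), body (a character vector of lines), and attachments (a list); the function name derive_vars is hypothetical, and the exact definitions (for example, whether line breaks count as characters) are judgment calls you should make yourself.

```r
# Compute a handful of derived variables for one parsed message.
derive_vars <- function(msg) {
  hdr <- msg$header
  subject <- if ("Subject" %in% names(hdr)) hdr[["Subject"]] else ""
  body <- paste(msg$body, collapse = "\n")
  # Keep letters only, then count how many of them are upper case.
  letters_only <- gsub("[^[:alpha:]]", "", body)
  n_letters <- nchar(letters_only)
  n_upper <- nchar(gsub("[^[:upper:]]", "", letters_only))
  bangs <- regmatches(subject, gregexpr("!", subject, fixed = TRUE))
  list(
    isRe = grepl("^[[:space:]]*Re:", subject, ignore.case = TRUE),
    numLinesInBody = length(msg$body),
    bodyCharacterCount = sum(nchar(msg$body)),  # line breaks not counted
    subjectExclamationCount = lengths(bangs),
    numAttachments = length(msg$attachments),
    percentCapitals = if (n_letters > 0) 100 * n_upper / n_letters else NA_real_,
    isInReplyTo = "In-Reply-To" %in% names(hdr)
  )
}
```

Applying this to every message with lapply and binding the results, for example with do.call(rbind, lapply(results, as.data.frame)), yields one row per message.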
1 The Anatomy of an E-mail message
Electronic mail, usually called e-mail, consists of simple text messages – a piece of text sent to a recipient via the internet. An e-mail message consists of two parts, the header and the body. The body of the e-mail message is separated from the header by a single blank line. When an attachment is added to an e-mail message, the attachment is included in the body of the message. Even with attachments, e-mail messages are still only text messages.
1.1 The E-mail Header
The header contains information about the message such as the sender’s address, the recipient’s address, and the date of transmission. This information is relayed in a special format that consists of KEY:VALUE pairs. Below is a sample header from a message found on the SpamAssassin website.
Return-Path: [email protected]
Delivery-Date: Fri Sep 6 20:53:36 2002
From: [email protected] (David LeBlanc)
Date: Fri, 6 Sep 2002 12:53:36 -0700
Subject: [Spambayes] Deployment
In-Reply-To: <[email protected]>
Message-ID: <[email protected]>
Notice the keys are Return-Path, Delivery-Date, From, Date, Subject, In-Reply-To, and Message-ID. The value follows the key. For example, in the above header, the value of the From key is whisperatoz.net (David LeBlanc).
Some of these keys are mandatory, such as Date, From, and To (or In-Reply-To, or Bcc). Other keys are optional but widely used, such as Subject, Cc, Received, and Message-ID. Many keys are ignored by the mail system, but the entire header is relayed on to the recipient’s server whether or not it is recognized. For example, keys starting with “X-” are for personal application or institution use and are ignored by other applications. The Received header lines are important because they allow the message to be tracked. As a message makes its way to the intended recipient, servers add additional Received lines to the header.
Description of Variables
Variable                 Type     Description
isSpam                   logical  whether mail is Spam (TRUE) or Ham (FALSE)
isRe                     logical  if the string Re: appears as the first word in the subject of the message
numLinesInBody           integer  a count of the number of lines in the body of the email message
bodyCharacterCount       integer  the number of characters in the body of the email message
replyUnderline           logical  whether the Reply-To field in the header has an underline and numbers/letters
subjectExclamationCount  integer  a count of the number of exclamation marks (!) in the subject of the message
subjectQuestCount        integer  the number of question marks in the subject
numAttachments           integer  the number of attachments in the message
priority                 logical  whether the message’s header had an X-Priority or X-Msmail-Priority that was set to high
numRecipients            integer  the number of recipients in the To, Cc fields
percentCapitals          numeric  the percentage of the characters in the body of the email that are upper case (excluding blanks, numbers, and punctuation)
isInReplyTo              logical  whether the header of the message has an In-Reply-To field
sortedRecipients         logical  the recipient list is sorted by address
subjectPunctuationCheck  logical  whether the subject has punctuation or digits surrounded by characters, e.g. V?agra and pay1ng, but not New!
hourSent                 integer  the hour in the day the mail was sent (0 – 23)
multipartText            logical  whether the header states that the message is a multipart/text, i.e. with attachments
containsImages           logical  whether the message contains images (in HTML, i.e. the <img> tag)
isPGPsigned              logical  indicates whether the mail was digitally signed (e.g. using PGP or GPG)
percentHTMLTags          numeric  the proportion of any HTML text in the message’s body that is made up of HTML markup and not content
subjectSpamWords         logical  whether the subject contains one of the following phrases: viagra, pounds, free, weight, guarantee, millions, dollars, credit, risk, prescription, generic, drug, money back, credit card
percentSubjectBlanks     numeric  the percentage of blanks in the subject
messageIdHasNoHostname   logical  whether the message identifier (id) that uniquely identifies the message has no component identifying the machine from which it was sent
fromNumericEnd           logical  whether the user login in the From: field ends in numbers
isYelling                logical  whether the Subject of the mail is in capital letters
percentForwards          numeric  percent of the message’s body that is made up of content included from other messages
isOriginalMessage        logical  body does not contain the phrase “original message” or something similar
isDear                   logical  whether the message body contains a form of the introduction Dear
isWrote                  logical  whether the text includes a line indicating an included message as identified by the word wrote: in several different possible languages
averageWordLength        numeric  the average length of the words in the body of the message
numDollarSigns           integer  the number of dollar signs in the body of the message
Below are some typical header keys:
• Message-Id: a unique identifier for the message, assigned by the originating server;
• Return-Path: specifies the sender’s address and bounced mail gets sent to this address;
• Date: added by the e-mail client;
• Cc: lists the recipients of a “carbon copied” message;
• Reply-To: the address set by the sender to which the recipient can reply;
• MIME-Version: used for encoding binary content as attachments.
A value may be continued on a second line of the header, in which case the line will be indented and begin with a tab character or blank spaces. Consider this header
From [email protected] Thu Aug 22 14:23:39 2002
Return-Path: <[email protected]>
Delivered-To: [email protected]
Received: from localhost (localhost [127.0.0.1])
    by phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9DAE147C66
    for <zzzz@localhost>; Thu, 22 Aug 2002 09:23:38 -0400 (EDT)
Received: from phobos [127.0.0.1]
    by localhost with IMAP (fetchmail-5.9.0)
    for zzzz@localhost (single-drop); Thu, 22 Aug 2002 14:23:38 +0100 (IST)
MIME-Version: 1.0
After the ’From ’ row, there are two fields (Return-Path and Delivered-To) on separate lines. The next two fields however are ’Received’ fields and both span 3 rows. The first row of each starts in the first column and identifies the key ’Received’. The subsequent lines start with white space. The second ’Received’ starts in the first column and so indicates the start of a new field, and the conclusion of the previous field.
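The continuation rule described above can be sketched in R as follows. This is an illustrative outline rather than a full parser: the function name parse_header is hypothetical, it assumes the header lines have already been separated from the body, and it simply drops an initial "From " line because that line has no KEY: prefix.

```r
# Turn header lines into a named character vector of KEY = VALUE.
parse_header <- function(lines) {
  # Drop the leading "From " line (no colon after the key), if present.
  if (length(lines) > 0 && grepl("^From ", lines[1])) lines <- lines[-1]
  # A line starting with a tab or blank continues the previous field.
  is_cont <- grepl("^[ \t]", lines)
  field_id <- cumsum(!is_cont)
  keep <- field_id > 0
  # Glue each field's lines together with single spaces.
  pieces <- split(trimws(lines[keep]), field_id[keep])
  fields <- vapply(pieces, paste, character(1), collapse = " ")
  # The key is everything before the first colon; the value is the rest.
  values <- trimws(sub("^[^:]*:", "", fields))
  names(values) <- sub(":.*$", "", fields)
  values
}
```

Repeated keys such as Received simply appear more than once in the result, so names(h) == "Received" can be used to pick out all of them.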
1.2 The Body of the Email
The body of the email is all the text after the first blank line following the header and up to any attachments. If the message has no attachments, then the body is everything excluding the header. If the message has attachments, we need to find where they begin to find the body. So we look at attachments next.
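The blank-line rule can be sketched like this, assuming the message has been read into a character vector with readLines. The function name split_message is hypothetical, and treating a message with no blank line as all header is an assumption, not something the mail format prescribes.

```r
# Split the lines of a message at the first blank line.
split_message <- function(lines) {
  blank <- which(lines == "")
  if (length(blank) == 0) {
    # ASSUMPTION: with no blank line, treat everything as header.
    return(list(header = lines, body = character(0)))
  }
  first <- blank[1]
  list(header = lines[seq_len(first - 1)],
       # Everything after the first blank line, later blank lines included.
       body = lines[seq_len(length(lines) - first) + first])
}
```

For a message with attachments, the body element returned here still contains the attachments, which have to be separated in a further step.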
1.3 E-mail Attachments
An Internet standard called MIME, Multipurpose Internet Mail Extensions, specifies how messages may be formatted and how to separate the attachments from the message. Information about the MIME encoding is provided through header fields, which are specified in an RFC.
The Content-Type key is used to describe the content of a component or of the entire body. The value provides the top-level type and subtype using the syntax:
top-level/subtype; parameter
Parameters may be required or optional. Below is an example of a content-type where the top-level is multipart, which indicates there will be several documents in the body of the message, the mixed subtype tells us that the documents may be of different types, and the boundary parameter provides a special character string for delimiting the start and end of the message parts.
Content-Type: multipart/mixed;
    boundary="----=_NextPart_000_00DE_01511A02.DB1A02A0"
The Content-Type field in this example tells the receiving e-mail program that this message has more than one component, and each component will be separated by the string of characters
----=_NextPart_000_00DE_01511A02.DB1A02A0
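Pulling the boundary string out of a Content-Type value might look like the sketch below. The function name get_boundary is hypothetical. The sketch assumes the value has already been joined into one string, matches the parameter name case-insensitively because the sample messages write both boundary and BOUNDARY, and treats the surrounding quotes as optional.

```r
# Extract the boundary parameter from a Content-Type value.
get_boundary <- function(content_type) {
  # boundary = "value" or boundary=value; capture the value only.
  pat <- "boundary[[:space:]]*=[[:space:]]*\"?([^\";]+)\"?"
  m <- regmatches(content_type,
                  regexec(pat, content_type, ignore.case = TRUE))[[1]]
  # No boundary parameter: the message is not multipart.
  if (length(m) < 2) return(NA_character_)
  trimws(m[2])
}
```

A return value of NA is a convenient signal that the message has no attachments to look for.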
The boundary string marks the beginning of each component. It is prefaced with two additional hyphens in all instances. The boundary string is also used to denote the end of the message, where it is both prefaced by two hyphens and followed immediately by two hyphens. The receiving email program knows when the last component of the message has been read when it reads the boundary string with two additional hyphens on either end of the string,
------=_NextPart_000_00DE_01511A02.DB1A02A0--
Each component of a message must be prefaced by the boundary string and a blank line. It may also contain MIME information. If the blank line is missing, the recipient’s e-mail client may have difficulty telling where the header information stops and the text of the message begins.
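Given the boundary string, the body can be cut into its components along the lines described above. The following is a sketch under stated assumptions: the function name split_parts is hypothetical, trailing white space on delimiter lines is ignored, and a missing closing delimiter is tolerated by running to the end of the body.

```r
# Split body lines into MIME parts using the boundary string.
split_parts <- function(body, boundary) {
  trimmed <- sub("[[:space:]]+$", "", body)
  is_start <- trimmed == paste0("--", boundary)
  is_end <- trimmed == paste0("--", boundary, "--")
  # Stop at the closing delimiter, or at the end if it is missing.
  last <- if (any(is_end)) which(is_end)[1] else length(body) + 1
  starts <- which(is_start)
  starts <- starts[starts < last]
  if (length(starts) == 0) {
    return(list(preamble = body[seq_len(last - 1)], parts = list()))
  }
  # Each part runs from just after its delimiter to just before the next.
  stops <- c(starts[-1], last) - 1
  parts <- lapply(seq_along(starts), function(i) {
    if (stops[i] > starts[i]) body[(starts[i] + 1):stops[i]] else character(0)
  })
  list(preamble = body[seq_len(starts[1] - 1)], parts = parts)
}
```

Each returned part begins with its own header lines, so the same blank-line rule used for the whole message can be applied to split a part into its header and body.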
There are seven top-level types of attachments: text, image, audio, video, application, multipart, and message. Other examples of Content-Type values follow:
Content-type: text/html; charset=euc-kr;
Content-Type: application/zip; name="testFile.zip"
The first example indicates that the message is in HTML format using a Korean character set. The second indicates that the component is a zip file, and the sender named it testFile.zip. Binary files (such as a compressed archive) can be sent as attachments. In such cases, the sender’s software must first encode the binary file so that it can be sent over the Internet. One common encoding scheme is known as base64.
We conclude by providing two sample e-mail messages. The first is a plain text e-mail with no attachments. It consists of an instructor’s response to an e-mail inquiry sent by a student. The second e-mail message consists of a text message and two attachments sent by a student to the instructor. This e-mail message has then been forwarded by the instructor to the teaching assistant. The three periods at the end of each attachment indicate that only part of the attachment has been displayed. The first attachment is a PDF file and the second is an HTML file. The forwarded message is a plain text file.
From [email protected] Mon Feb 2 22:16:19 2004 -0800
Date: Mon, 2 Feb 2004 22:16:19 -0800 (PST)
From: [email protected]
X-X-Sender: [email protected]
To: Txxxx Uxxx <[email protected]>
Subject: Re: prof: did you receive my hw?
In-Reply-To: <[email protected]>
Message-ID: <[email protected]>
References: <[email protected]>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: O
X-Status:
X-Keywords:
X-UID: 9079

Yes it was received.
On Mon, 2 Feb 2004, txxxx wrote:
> hey prof. nolan,
>
> i sent out my hw on sunday night. i just wonder did you receive it.
> because i am kinda scared thatyou didnt’ receive it.
> like i just wonder how do i know if you got it or not, since the cal
> mail system is kinda weird sometimes. thanks
>
> txxxx
>
Figure 1: Sample email message with no attachments. The header includes fourteen key:value pairs. Note the Date key includes a time-zone offset, the Message-ID key gives the unique ID to track the mail from the stat.berkeley.edu mail server, the Content-Type key indicates it is a plain text message with no attachments, and there are four X- keys.
From [email protected] Mon Feb 2 22:18:56 2004 -0800
Date: Mon, 2 Feb 2004 22:18:55 -0800 (PST)
From: [email protected]
X-X-Sender: [email protected]
To: Gang Liang <[email protected]>
Subject: Assignment 1 sorry (fwd)
Message-ID: <[email protected]>
MIME-Version: 1.0
Content-Type: MULTIPART/Mixed; BOUNDARY="_===669732====calmail-me.berkeley.edu===_"
Content-ID: <[email protected]>
Status: RO
X-Status:
X-Keywords:
X-UID: 9080

--_===669732====calmail-me.berkeley.edu===_
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII; FORMAT=flowed
Content-ID: <[email protected]>
Figure 2: This sample email (split over two figures) has two attachments, a PDF file and an HTML file. The Content-Type key indicates that the attachments are separated by _===669732====calmail-me.berkeley.edu===_ with a -- prefix. The first part of the email body is a forwarded message. Note that it has its own header indicating the content type is plain text. Next is a PDF attachment which the owner has named PLOTS.pdf, and the third part is an HTML attachment. Both attachments are encoded in base64.
---------- Forwarded message ----------
Date: Mon, 02 Feb 2004 21:50:47 -0800
From: Yyyy Zzz <[email protected]>
To: [email protected]
Subject: Assignment 1 sorry
I am sorry to send this email again, but my outbox told me that
the last email only send 1 attached file.
I am sending this again to make sure you recieve the all
the necessary files.
Thank You and sorry for the inconvenience.
--_===669732====calmail-me.berkeley.edu===_
Content-Type: APPLICATION/PDF; CHARSET=US-ASCII
Content-Transfer-Encoding: BASE64
Content-ID: <[email protected]>
Content-Description:
Content-Disposition: ATTACHMENT; FILENAME="PLOTS.pdf"
JVBERi0xLjEKJYHigeOBz4HTDQoxIDAgb2JqCjw8Ci9DcmVhdGlvbkRhdGUgKEQ6MjAwNDAy
MDIxMTIwMTEpCi9Nb2REYXRlIChEOjIwMDQwMjAyMTEyMDExKQovVGl0bGUgKFIgR3JhcGhp
Y3MgT3V0cHV0KQovUHJvZHVjZXIgKFIgMS44LjEpCi9DcmVhdG9yIChSKQo+PgplbmRvYmoK
...
--_===669732====calmail-me.berkeley.edu===_
Content-Type: TEXT/HTML; CHARSET=US-ASCII
Content-Transfer-Encoding: BASE64
Content-ID: <[email protected]>
Content-Description:
Content-Disposition: ATTACHMENT; FILENAME="Stat133HW1.htm"
PGh0bWwgeG1sbnM6bz0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6b2ZmaWNlˆM
PGh0bWwgeG1sbnM6bz0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6b2ZmaWNlˆM
Ig0KeG1sbnM6dz0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6d29yZCINCnhtˆM
...
--_===669732====calmail-me.berkeley.edu===_--
Figure 3: This sample email (split over two figures) has two attachments, a PDF file and an HTML file. The Content-Type key indicates that the attachments are separated by _===669732====calmail-me.berkeley.edu===_ with a -- prefix. The first part of the email body is a forwarded message. Note that it has its own header indicating the content type is plain text. Next is a PDF attachment which the owner has named PLOTS.pdf, and the third part is an HTML attachment. Both attachments are encoded in base64.