Reddit askwomen dating shorter guys

Building the Data Set

How Do We Represent Language So Computers Can Analyze It?

If we want to analyze Reddit comments to make some prediction about them, we need to first find a way to vectorize, or numerically represent, the text of the comments. This is the first concern of natural language processing (NLP). The entire comment collection taken together is referred to as the corpus, from the Latin for body.

One common approach to vectorizing words is called the bag of words representation. In this strategy, we don't care about the relative positions of any of the words in a comment. All we do is break the words into units (tokenization), count how often a given word appears in each comment, and then create a vector for each comment that is just the number of times each word appears, including all the 0s for words that appear somewhere in the corpus, but don't appear in that particular comment.
When we do this, we usually want to exclude words that are extremely common, like "the" and "and". These words dominate the frequency counts, but they contribute little to the meaning of the text, so it's better to leave them out if we're trying to make accurate predictions. In NLP, these words are called stop words, and the available vectorizers offer standard exclusion lists as a built-in option.
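
The bag of words idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in the original project: the helper names `tokenize` and `bag_of_words`, the tiny stop-word list, and the two sample comments are all assumptions made up for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "and", "a", "is", "to", "of"}

def tokenize(comment):
    """Lowercase the comment and split it into word tokens."""
    return re.findall(r"[a-z']+", comment.lower())

def bag_of_words(comments, stop_words=STOP_WORDS):
    """Return (vocabulary, vectors) with one count vector per comment."""
    tokenized = [
        [token for token in tokenize(comment) if token not in stop_words]
        for comment in comments
    ]
    # The vocabulary is every distinct word that appears anywhere in the corpus.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for words that are absent from this comment.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical sample corpus of two comments.
comments = ["The dog chased the cat", "The cat sat and the cat slept"]
vocabulary, vectors = bag_of_words(comments)
print(vocabulary)
print(vectors)
```

Each vector has one slot per vocabulary word, so the two comments become comparable even though they share only one word. Word order is discarded entirely.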

There are a few different vectorizers to choose from: CountVectorizer() and TfidfVectorizer() are the most widely used. CountVectorizer tokenizes and counts, and that's it. TfidfVectorizer goes a step further and normalizes the frequencies. It applies a transformation to our comment vectors that down-weights words that appear in a lot of comments and up-weights rarer words. The idea is that common words tend to be less interesting to us because they give us less information about the comment. Rarer words are more likely to have predictive value.
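
To make the re-weighting concrete, here is a small sketch of the textbook TF-IDF calculation: each raw count is multiplied by log(N / df), where N is the number of comments and df is the number of comments containing the word. This is an assumption-laden simplification. Library implementations typically add smoothing and vector normalization, so their exact numbers will differ, and the function name `tfidf` and the sample counts are hypothetical.

```python
import math

def tfidf(count_vectors):
    """Re-weight raw count vectors by inverse document frequency."""
    if not count_vectors:
        return []
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # Document frequency: how many comments contain each term at least once.
    df = [sum(1 for vec in count_vectors if vec[j] > 0) for j in range(n_terms)]
    # Textbook IDF; a term found in every comment gets weight log(1) = 0.
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[vec[j] * idf[j] for j in range(n_terms)] for vec in count_vectors]

# Hypothetical counts for 3 comments over a 3-word vocabulary.
# Word 0 appears in every comment; words 1 and 2 each appear in only one.
counts = [
    [1, 1, 0],
    [1, 0, 2],
    [1, 0, 0],
]
weights = tfidf(counts)
for row in weights:
    print(row)
```

The word that shows up in every comment ends up with zero weight, while the words confined to a single comment keep a positive weight that scales with their count.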

That said, if a word is too rare, and only appears in one or two comments, there isn't much point to including it - what patterns can it really give us? There are parameters that will allow us to only include words in our analysis if they appear in at least n comments.
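
A minimum document frequency cutoff of this kind can be sketched as a simple filter over count vectors. In scikit-learn's vectorizers the corresponding parameter is `min_df`. The function name `filter_rare_terms` and the sample data below are assumptions for illustration only.

```python
def filter_rare_terms(vocabulary, count_vectors, min_df=2):
    """Keep only the terms that appear in at least min_df comments."""
    # Document frequency of each term across all comments.
    df = [
        sum(1 for vec in count_vectors if vec[j] > 0)
        for j in range(len(vocabulary))
    ]
    keep = [j for j, d in enumerate(df) if d >= min_df]
    kept_vocabulary = [vocabulary[j] for j in keep]
    kept_vectors = [[vec[j] for j in keep] for vec in count_vectors]
    return kept_vocabulary, kept_vectors

# Hypothetical vocabulary and count vectors for two comments.
vocabulary = ["cat", "chased", "dog", "sat", "slept"]
count_vectors = [
    [1, 1, 1, 0, 0],
    [2, 0, 0, 1, 1],
]
kept_vocabulary, kept_vectors = filter_rare_terms(vocabulary, count_vectors, min_df=2)
print(kept_vocabulary)
print(kept_vectors)
```

With a threshold of 2, only "cat" survives here because it is the only word present in both comments. On a real corpus the threshold is a tuning choice.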

Single words may be able to help us predict whether a comment is from AskMen or AskWomen, but what if we want to consider the frequency of certain pairs or triples of words? This is where n-grams, or groups of n consecutive words, come into play. We can run our model with different combinations of n-grams included and see which performs best. In this case, we found the best performance with a model that included n=1 (individual words), n=2 (pairs), and n=3 (triples). Using n-grams gives us at least some characterization of the relative positions of words in each comment. It's no surprise that this improves our ability to predict a comment's origin.
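
As a rough sketch of what extracting n-grams for n = 1 to 3 looks like, the function below slides a window of each size across a tokenized comment. In scikit-learn's vectorizers this range is set with `ngram_range=(1, 3)`. The function name `ngrams` and the sample tokens are assumptions for the example.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return every run of n consecutive tokens for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Hypothetical tokenized comment.
tokens = ["i", "love", "hiking"]
grams = ngrams(tokens)
print(grams)
```

Each n-gram is then counted like a single word in the bag of words representation, which is how some local word order makes it into the feature vectors.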

Source: http://www.eamonfleming.com/projects/reddit-gender.html
