# TF-IDF

### TF

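(Term Frequency)

> |Di(word)| is the number of times the word `word` occurs in document Di
>
> |Di| is the total number of words in document Di, counting duplicates; some variants instead use the count of the most frequent word in the document

Term frequency: $$TF(word)=\frac{|Di(word)|}{|Di|}$$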
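
As a rough illustration of the definition above, here is a minimal sketch that computes TF for one document. It assumes the document is already split into tokens on whitespace; the function name `term_frequency` is made up for this example.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {word: count / total_tokens} for one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)  # |Di|: total number of tokens, duplicates included
    return {word: n / total for word, n in counts.items()}


doc = "this is the second second document".split()
tf = term_frequency(doc)
print(tf["second"])  # 2 occurrences out of 6 tokens
```

Dividing by the document length keeps long documents from dominating simply because they contain more words.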

### IDF

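(Inverse Document Frequency)

Computing IDF requires a corpus, which stands in for the environment in which the language is used.

> |Corpus| is the total number of documents in the corpus
>
> $$|Corpus_{Di(word)}|$$ is the number of documents that contain the word `word`; the +1 smooths the ratio and avoids division by zero

$$IDF(word)=\log\left(\frac{|Corpus|}{|Corpus_{Di(word)}|+1}\right)$$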
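
The IDF formula can be sketched the same way. This is only an illustration under the assumption of whitespace tokenization and the natural logarithm; `inverse_document_frequency` is a hypothetical helper name, not a library function.

```python
import math


def inverse_document_frequency(corpus_tokens):
    """Return {word: log(|Corpus| / (document_frequency + 1))}."""
    n_docs = len(corpus_tokens)
    doc_freq = {}
    for tokens in corpus_tokens:
        for word in set(tokens):  # count each document at most once per word
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / (df + 1)) for word, df in doc_freq.items()}


corpus_tokens = [
    "this is the first document".split(),
    "this is the second second document".split(),
    "and the third one".split(),
    "is this the first document".split(),
]
idf = inverse_document_frequency(corpus_tokens)
print(idf["second"])  # appears in 1 of 4 documents: log(4 / 2)
print(idf["the"])     # appears in all 4 documents: log(4 / 5), slightly negative
```

Note that with this +1 smoothing, a word that occurs in every document gets a slightly negative IDF.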

### TF-IDF

$$
TF\text{-}IDF(word) = TF(word) \times IDF(word)
$$

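TF-IDF grows with the number of times a word occurs in a document and shrinks as the word becomes more common across the corpus, which serves as a proxy for the language as a whole.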
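
Putting the two factors together, a minimal sketch of the product might look like the following. It assumes whitespace tokenization and the natural logarithm, and the function name `tf_idf` is chosen only for this example.

```python
import math
from collections import Counter


def tf_idf(corpus_tokens):
    """Return one {word: TF * IDF} dict per tokenized document."""
    n_docs = len(corpus_tokens)
    # Document frequency: number of documents containing each word
    doc_freq = Counter(word for tokens in corpus_tokens for word in set(tokens))
    scores = []
    for tokens in corpus_tokens:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (n / total) * math.log(n_docs / (doc_freq[word] + 1))
            for word, n in counts.items()
        })
    return scores


corpus_tokens = [
    "this is the first document".split(),
    "this is the second second document".split(),
    "and the third one".split(),
    "is this the first document".split(),
]
scores = tf_idf(corpus_tokens)
# The highest-scoring word of the second document
print(max(scores[1], key=scores[1].get))
```

In the second document, "second" occurs twice and appears nowhere else in the corpus, so it receives the highest score there.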

## Using TF-IDF with sklearn

`CountVectorizer` builds a matrix of raw term counts; `TfidfVectorizer` builds TF-IDF weighted features.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Corpus
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
# Convert the text into a term-count matrix
vectorizer = CountVectorizer()
# Count how many times each term occurs
X = vectorizer.fit_transform(corpus)
# Get all terms in the bag-of-words vocabulary
# (get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names())
word = vectorizer.get_feature_names_out()
print(word)
# Inspect the term counts
print(X.toarray())
```
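
To make the output of the count matrix easier to read, here is a pure-Python sketch that approximates what `CountVectorizer` does with its default settings: lowercase the text, keep tokens of two or more word characters, sort the vocabulary, and count. It is an approximation for illustration, not the library's implementation.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Tokens of at least two word characters, after lowercasing
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in corpus]

# Column order: vocabulary sorted alphabetically
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one column per vocabulary term
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column is one vocabulary term, so the 2 in the second row is the repeated word "second".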

**Extracting features with TF-IDF**

```python
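from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# xTrain and xTest (raw text documents) and yTrain (labels) are assumed to be defined elsewhere.
# ngram_range=(min, max) extracts word n-grams of every length from min to max.
# min_df is a minimum document frequency: a document count if an integer, a proportion of
# documents if a float. Terms below it are dropped. max_df is the analogous upper bound.
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9,
                      use_idf=True, smooth_idf=True, sublinear_tf=True)
xTrain_tfidf = vec.fit_transform(xTrain)
xTest_tfidf = vec.transform(xTest)

# Train a logistic regression model
# (dual=True is only implemented for the liblinear solver with an L2 penalty)
clf = LogisticRegression(C=4, dual=True, solver='liblinear')
clf.fit(xTrain_tfidf, yTrain)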
```
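
The options passed to `TfidfVectorizer` change the weighting relative to the plain formulas earlier on this page. The sketch below illustrates three of them in plain Python, based on how the scikit-learn documentation describes them; the helper names are invented for this example, and the library additionally L2-normalizes each row by default, which is not shown here.

```python
import math


def word_ngrams(tokens, min_n, max_n):
    """All contiguous word n-grams for n from min_n to max_n, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def smoothed_idf(n_docs, doc_freq):
    # smooth_idf=True: ln((1 + n) / (1 + df)) + 1, which never drops below 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def sublinear_tf(count):
    # sublinear_tf=True: a raw count is replaced by 1 + ln(count)
    return 1 + math.log(count) if count > 0 else 0.0


# ngram_range=(1, 2): unigrams plus bigrams
print(word_ngrams(["and", "the", "third", "one"], 1, 2))
# A term found in all 4 of 4 documents still gets a positive weight
print(smoothed_idf(4, 4))
# A term occurring 10 times counts for far less than 10x a single occurrence
print(sublinear_tf(10))
```

Sublinear scaling dampens the effect of very frequent terms within a document, and smoothing the IDF keeps terms that appear in every document from being zeroed out.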

