minisearchengine_basedon_yahoo's Introduction

readme.md

Run order

  1. Crawl the pages: python yahoo.py

  2. Extract the text content: python htmlProcess.py

  3. Compute idf: python idf.py

  4. Build the word and file indices: python word_file_index/word_file_index.py

  5. Store the idf of every word in the vocabulary: python word_idf_array/word_idf_array.py

  6. Compute and store the tf of every document: python file_word_tf_array/file_word_tf_array.py

  7. Compute tf-idf: python tf_idf_array/tf_idf_array.py

  8. Build the inverted list: python inverted_list/inverted_list.py

  9. Query: python query.py
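
The last two steps (inverted list, then query) can be sketched as follows. This is only an illustration under assumptions, not the code in inverted_list.py or query.py: the function names, the shape of the `tfidf` input, and the weights are all hypothetical, and ranking by summed weight is just one simple scoring choice.

```python
from collections import defaultdict

def build_inverted_list(tfidf):
    """tfidf maps file name -> {word: weight}; returns word -> [(file, weight), ...]."""
    inverted = defaultdict(list)
    for name, weights in tfidf.items():
        for word, weight in weights.items():
            inverted[word].append((name, weight))
    # Sort each posting list so the highest-weighted file comes first.
    for postings in inverted.values():
        postings.sort(key=lambda pair: pair[1], reverse=True)
    return dict(inverted)

def query(inverted, words):
    """Rank files by the summed weight of the query words they contain."""
    scores = defaultdict(float)
    for word in words:
        for name, weight in inverted.get(word, []):
            scores[name] += weight
    return sorted(scores, key=scores.get, reverse=True)

# Made-up weights for three files, for illustration only.
tfidf = {
    "a.html": {"yahoo": 0.1, "sport": 0.2},
    "b.html": {"yahoo": 0.4, "mail": 0.5},
    "c.html": {"sport": 0.6, "score": 0.4},
}
inverted = build_inverted_list(tfidf)
ranking = query(inverted, ["sport", "yahoo"])
```

Looking up a word in the inverted list touches only the files that contain it, which is why the query step does not need to scan every document.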

htmlProcess.py

Extracts the text content from the HTML,
tokenizes it,
applies stemming,
and finally counts word frequencies.
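
A minimal sketch of these four stages, using only the standard library so that it runs without extra installs. It is not the code in htmlProcess.py: `TextExtractor` and `process_html` are hypothetical names, and the `stem` argument defaults to a no-op where the project uses NLTK's PorterStemmer.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def process_html(html, stem=lambda w: w):
    """Return (raw text, token list, word counts) for one HTML page."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    raw = " ".join(extractor.parts)
    # \w+ keeps runs of word characters, which drops punctuation.
    tokens = [stem(t.lower()) for t in re.findall(r"\w+", raw)]
    return raw, tokens, Counter(tokens)

raw, tokens, counts = process_html(
    "<html><head><title>Yahoo</title><style>p {color: red}</style></head>"
    "<body><p>News, news and Sport!</p></body></html>"
)
```

To plug in real stemming, a callable such as `PorterStemmer().stem` could be passed as `stem`.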

idf.py

Counts how many times each word occurs across all documents.
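
As a rough sketch, assuming the count meant here is the document frequency (the number of files a word appears in), idf could be derived as below. The function names are hypothetical, and log(N / df) is only one common idf variant; the README does not say which formula idf.py uses.

```python
import math
from collections import Counter

def document_frequency(docs):
    """docs is a list of token lists; count the files each word appears in."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count a word at most once per file
    return df

def inverse_document_frequency(df, n_docs):
    # Assumed variant: idf = log(N / df). Rare words get a larger value.
    return {word: math.log(n_docs / count) for word, count in df.items()}

docs = [["yahoo", "news", "news"], ["yahoo", "mail"], ["sport"]]
df = document_frequency(docs)
idf = inverse_document_frequency(df, len(docs))
```

Here "news" occurs twice in the first file but still has a document frequency of 1, which is the difference between this count and a plain term count.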

urls

The original HTML files.
1000 pages crawled from www.yahoo.com, each saved with a name ending in html.
mainPage.html is the www.yahoo.com home page.

dictionaries

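The files produced by processing the HTML.
They are dumped in pickle format.
To read one:

import pickle
# pickle data is binary, so open the file in 'rb' mode
with open(filepath, 'rb') as fin:
    d = pickle.load(fin)

The loaded d is a dict:
d['Url'] is the URL
d['Title'] is the title of the page
d['Raw'] is the original text, as a string
d['Content'] is the tokenized text, as a list with one entry per token
d['Length'] is the number of words left after stop words are removed
The remaining keys are the words that occur in the text,
and each value is the number of occurrences.
All words are lowercased and then stemmed.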

# code for stemming
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
# On Python 2 with byte strings, decode first: word.lower().decode('utf-8')
porter_stemmer.stem(word.lower())
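
The per-page dict layout described above can be illustrated with a small round trip through pickle. This is a sketch under assumptions: `make_record`, `word_counts`, and `META_KEYS` are hypothetical helpers, the sample values are made up, and the tokens are assumed to be already lowercased, stemmed, and free of stop words.

```python
import os
import pickle
import tempfile
from collections import Counter

META_KEYS = {"Url", "Title", "Raw", "Content", "Length"}

def make_record(url, title, raw, tokens):
    """Build one per-page dict: five metadata keys plus word -> count."""
    record = {"Url": url, "Title": title, "Raw": raw,
              "Content": tokens, "Length": len(tokens)}
    record.update(Counter(tokens))  # remaining keys are the words themselves
    return record

def word_counts(record):
    """Return only the word -> count entries, dropping the metadata keys."""
    return {k: v for k, v in record.items() if k not in META_KEYS}

record = make_record("http://www.yahoo.com/", "Yahoo", "News, news!",
                     ["news", "news"])
path = os.path.join(tempfile.mkdtemp(), "page.pkl")
with open(path, "wb") as fout:
    pickle.dump(record, fout)
with open(path, "rb") as fin:
    loaded = pickle.load(fin)
```

Because every word key is lowercase, it cannot collide with the capitalized metadata keys, so filtering on `META_KEYS` is enough to recover the counts.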

tokenize

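Compared with splitting on whitespace, the advantage of this approach is that it strips punctuation.

# code for tokenize
from nltk.tokenize import RegexpTokenizer, word_tokenize
# html_doc is the text of one page; word_tokenize keeps punctuation as tokens
tokens = word_tokenize(html_doc)
# \w+ keeps only runs of word characters, so punctuation is dropped
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')
# ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']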

stop word list

174 words in total.
After applying the stop words, the idf vocabulary shrinks from 43936 to 43776 entries.
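
A minimal sketch of stop word filtering and its effect on vocabulary size. The five-word set below is a tiny stand-in for the real 174-word list, and `remove_stop_words` is a hypothetical helper.

```python
# Tiny stand-in for the project's 174-word stop word list.
STOP_WORDS = {"the", "a", "of", "and", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not stop words."""
    return [t for t in tokens if t not in stop_words]

tokens = ["the", "news", "of", "the", "day"]
filtered = remove_stop_words(tokens)

# The vocabulary can shrink by at most the number of stop words
# that actually occur in the corpus.
vocab_before = set(tokens)
vocab_after = set(filtered)
```

The reported drop from 43936 to 43776 is 160 entries, fewer than 174, which is what one would expect if some stop words never appear in the vocabulary in that exact form.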

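Saving and loading with numpy

With this method the saved file always ends in .npy (numpy.save appends the extension if it is missing).

numpy.save("filename.npy", a)

Files in this format are best read back only with a = numpy.load("filename.npy"). Note that numpy.load needs the full file name, including the extension.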
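
A short round trip showing the extension behaviour. The array contents and the temporary path are arbitrary choices for illustration.

```python
import os
import tempfile
import numpy as np

a = np.arange(6).reshape(2, 3)

# Save under a name with no extension; numpy.save appends ".npy" itself.
base = os.path.join(tempfile.mkdtemp(), "filename")
np.save(base, a)
saved_path = base + ".npy"

# numpy.load does not add the extension, so pass the full file name.
b = np.load(saved_path)
```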

Contributors

smellly
