
iir's People

Contributors

harph, shuyo


iir's Issues

Error while implementing lda.py

Hi @shuyo ,

Thanks a lot for providing this script for the public community.

I ran your code as part of Karpathy's Nipspreview. I generally followed his README.md to generate the wordclouds, thumbnails, etc., but when I execute python lda.py -f allpapers.txt -k 7 --alpha=0.5 --beta=0.5 -i 100, I get the following output:

$ python lda.py -f allpapers.txt -k 7 --alpha=0.5 --beta=0.5 -i 100
Traceback (most recent call last):
File "lda.py", line 150, in
main()
File "lda.py", line 139, in main
docs = [voca.doc_to_ids(doc) for doc in corpus]
File "/home/lex/Desktop/nipspreview/vocabulary.py", line 65, in doc_to_ids
id = self.term_to_id(term)
File "/home/lex/Desktop/nipspreview/vocabulary.py", line 48, in term_to_id
term = lemmatize(term0)
File "/home/lex/Desktop/nipspreview/vocabulary.py", line 35, in lemmatize
w = wl.lemmatize(w0.lower())
File "/usr/local/lib/python2.7/dist-packages/nltk/stem/wordnet.py", line 40, in lemmatize
lemmas = wordnet._morphy(word, pos)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/util.py", line 99, in getattr
self.__load()
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/util.py", line 64, in __load
except LookupError: raise e
LookupError:

Resource u'corpora/wordnet' not found. Please use the NLTK
Downloader to obtain the resource: >>> nltk.download()
Searched in:
- '/home/lex/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'

and I'm wondering what went wrong. Could you help me out? Thanks!

Input Low Res SVHN instead of uniform random distribution.

I want to use your DCGAN model to perform super-resolution on SVHN. I have created a dataset of low-resolution SVHN digits; however, I'm not sure where to feed them in. I believe they should be fed into Z, but I get gibberish images. I assumed the first epoch's output would resemble my original low-res samples, but it ended up looking like static.

[images: generated samples at epochs 1, 5, 10, and 14 (dcgan-svhn-001 through dcgan-svhn-014), and the low-res input (ground_truth)]
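One common cause of static-like output early in training (a general note, not a confirmed diagnosis for this issue) is an input value-range mismatch: DCGAN generators typically end in tanh, so real and conditioning images are expected in [-1, 1], while raw image arrays arrive as uint8 in [0, 255]. A minimal sketch of the rescaling, with `to_tanh_range` being a hypothetical helper name:

```python
import numpy as np

def to_tanh_range(images):
    """Scale uint8 images in [0, 255] to the [-1, 1] range that a
    tanh-output DCGAN generator/discriminator typically expects."""
    return images.astype(np.float32) / 127.5 - 1.0

# Stand-in for a batch of 8x8 low-res SVHN digits (4 images, RGB).
batch = np.random.randint(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)
scaled = to_tanh_range(batch)
assert scaled.min() >= -1.0 and scaled.max() <= 1.0
```

Separately, feeding images directly into Z will not preserve them: a vanilla DCGAN treats Z as unstructured noise, so a conditional architecture (the low-res image as an extra input to both networks) is usually needed for super-resolution.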

Default values

Hi,
I was wondering how and where you derived the default values for alpha and beta (both 0.01, I believe).

Thanks in advance!
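For context (a general note, not the author's stated rationale): alpha and beta are the concentration parameters of symmetric Dirichlet priors over the per-document topic mixtures and per-topic word distributions, and small values like 0.01 push draws toward sparse distributions. A quick way to see the effect:

```python
import numpy as np

rng = np.random.default_rng(0)
sparse = rng.dirichlet([0.01] * 5)  # small alpha: mass concentrates on few components
smooth = rng.dirichlet([1.0] * 5)   # alpha = 1: flat prior over the simplex

# Both are valid probability distributions over 5 components.
assert abs(sparse.sum() - 1.0) < 1e-9
assert abs(smooth.sum() - 1.0) < 1e-9
```

With alpha = 0.01, most of the probability mass in each draw typically lands on one or two components, which matches the intuition that a document uses only a few topics.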

English version of your slides

Hello Shuyo, do you have an English version of your slides? I am also trying to follow your CRF implementation, with some hiccups on the gradient computation as well as the feature definitions. I would appreciate any help in understanding the steps in your code.

Thanks

lda

I am new to both Python and R. I would like to implement LDA using either of them. I am looking at your code, but I don't quite understand the format of the input (filename). I also tried to work out what the regex r"\w+(?:'\w+)?" does.
Can I input several documents?
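For what it's worth, the pattern asked about is a word tokenizer: it matches a run of word characters, optionally followed by an apostrophe and another word part, so contractions and possessives stay as one token. A small sketch of what it produces:

```python
import re

# A word, optionally followed by an apostrophe plus more word
# characters (e.g. "don't", "LDA's" stay single tokens).
pattern = re.compile(r"\w+(?:'\w+)?")

print(pattern.findall("I don't know what LDA's input format is."))
# -> ['I', "don't", 'know', 'what', "LDA's", 'input', 'format', 'is']
```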

complement_label - category 0 is always 1

Hi shuyo,

could you explain why you always set category 0 to 1? (vec[0] = 1.0)

def complement_label(self, label):
    if not label: return numpy.ones(len(self.labelmap))
    vec = numpy.zeros(len(self.labelmap))
    vec[0] = 1.0
    for x in label: vec[self.labelmap[x]] = 1.0
    return vec

Thank you and best regards!
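To illustrate what the function computes (a standalone sketch with a hypothetical labelmap; one plausible reading, not confirmed by the author, is that index 0 is a shared "common" label every document may use, as in Labeled LDA with a background topic):

```python
import numpy as np

# Hypothetical label layout for illustration only.
labelmap = {"common": 0, "sports": 1, "tech": 2}

def complement_label(label):
    # Unlabeled documents may use every label; labeled documents get
    # the shared label 0 plus their own labels.
    if not label:
        return np.ones(len(labelmap))
    vec = np.zeros(len(labelmap))
    vec[0] = 1.0
    for x in label:
        vec[labelmap[x]] = 1.0
    return vec

print(complement_label(["sports"]))  # -> [1. 1. 0.]
print(complement_label([]))          # -> [1. 1. 1.]
```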

hdplda.py:sampling_k

Hello shuyo,
Thanks for your code, which has given me a lot to think about. I am new to the HDP model. While reading iir/lda/hdplda.py, lines 252 and 253 puzzle me: when we sample k, what do 'log_p_k[i]' and 'log_p_k[K]' mean?
Also, in the line 'k_new = self.sampling_topic(numpy.exp(log_p_k - log_p_k.max()))', should I change 'log_p_k.max()' to 'log_p_k.sum()'? Could you explain the process or the formula behind this? Thanks!
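On the second question, a general note (not specific to hdplda.py): subtracting the maximum before exponentiating is the standard numerical-stability trick. Since the weights are renormalized afterwards, subtracting any constant in log space leaves the sampling distribution unchanged, and the max is chosen so the largest exponent is 0, avoiding underflow/overflow:

```python
import numpy as np

log_p = np.array([-1000.0, -1001.0, -1002.0])

naive = np.exp(log_p)                  # underflows: every entry is 0.0
stable = np.exp(log_p - log_p.max())   # largest term becomes exp(0) = 1
probs = stable / stable.sum()          # valid sampling distribution

assert naive.sum() == 0.0
assert abs(probs.sum() - 1.0) < 1e-12
```

Subtracting `log_p_k.sum()` instead would also cancel out after normalization, but it would not bound the exponents near 0, so it loses the protection against underflow.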

Labelled LDA implementation

In the complement_label function you use in Labelled LDA, there seems to be an error. complement_label accepts a single label from a list of labels, but it runs a for loop over that label, which would iterate over its individual characters. Isn't this wrong?

for x in label: vec[self.labelmap[x]] = 1.0
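A quick way to see the concern: iterating a bare string yields its characters, while iterating a list of labels yields whole label strings, so the loop only behaves as intended when `label` is a sequence of label strings rather than one string.

```python
# Iterating a string yields characters, not the label itself.
chars = [x for x in "sports"]
labels = [x for x in ["sports"]]

print(chars)   # -> ['s', 'p', 'o', 'r', 't', 's']
print(labels)  # -> ['sports']
```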

Error LLDA

Hi, why does this error occur when I try to use the LLDA model?

BadOptionError Traceback (most recent call last)
File ~\anaconda3\lib\optparse.py:1387, in OptionParser.parse_args(self, args, values)
1386 try:
-> 1387 stop = self._process_args(largs, rargs, values)
1388 except (BadOptionError, OptionValueError) as err:

File ~\anaconda3\lib\optparse.py:1431, in OptionParser._process_args(self, largs, rargs, values)
1428 elif arg[:1] == "-" and len(arg) > 1:
1429 # process a cluster of short options (possibly with
1430 # value(s) for the last one only)
-> 1431 self._process_short_opts(rargs, values)
1432 elif self.allow_interspersed_args:

File ~\anaconda3\lib\optparse.py:1513, in OptionParser._process_short_opts(self, rargs, values)
1512 if not option:
-> 1513 raise BadOptionError(opt)
1514 if option.takes_value():
1515 # Any characters left in arg? Pretend they're the
1516 # next arg, and stop consuming characters of arg.

BadOptionError: no such option: -f

During handling of the above exception, another exception occurred:

SystemExit Traceback (most recent call last)
[... skipping hidden 1 frame]

Input In [32], in <cell line: 8>()
7 parser.add_option("-n", dest="samplesize", type="int", help="dataset sample size", default=100)
----> 8 (options, args) = parser.parse_args()
9 random.seed(options.seed)

File ~\anaconda3\lib\optparse.py:1389, in OptionParser.parse_args(self, args, values)
1388 except (BadOptionError, OptionValueError) as err:
-> 1389 self.error(str(err))
1391 args = largs + rargs

File ~\anaconda3\lib\optparse.py:1569, in OptionParser.error(self, msg)
1568 self.print_usage(sys.stderr)
-> 1569 self.exit(2, "%s: error: %s\n" % (self.get_prog_name(), msg))

File ~\anaconda3\lib\optparse.py:1559, in OptionParser.exit(self, status, msg)
1558 sys.stderr.write(msg)
-> 1559 sys.exit(status)

SystemExit: 2

During handling of the above exception, another exception occurred:

AssertionError Traceback (most recent call last)
[... skipping hidden 1 frame]

File ~\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:1972, in InteractiveShell.showtraceback(self, exc_tuple, filename, tb_offset, exception_only, running_compiled_code)
1969 if exception_only:
1970 stb = ['An exception has occurred, use %tb to see '
1971 'the full traceback.\n']
-> 1972 stb.extend(self.InteractiveTB.get_exception_only(etype,
1973 value))
1974 else:
1975 try:
1976 # Exception classes can customise their traceback - we
1977 # use this in IPython.parallel for exceptions occurring
1978 # in the engines. This should return a list of strings.

File ~\anaconda3\lib\site-packages\IPython\core\ultratb.py:585, in ListTB.get_exception_only(self, etype, value)
577 def get_exception_only(self, etype, value):
578 """Only print the exception type and message, without a traceback.
579
580 Parameters
(...)
583 value : exception value
584 """
--> 585 return ListTB.structured_traceback(self, etype, value)

File ~\anaconda3\lib\site-packages\IPython\core\ultratb.py:443, in ListTB.structured_traceback(self, etype, evalue, etb, tb_offset, context)
440 chained_exc_ids.add(id(exception[1]))
441 chained_exceptions_tb_offset = 0
442 out_list = (
--> 443 self.structured_traceback(
444 etype, evalue, (etb, chained_exc_ids),
445 chained_exceptions_tb_offset, context)
446 + chained_exception_message
447 + out_list)
449 return out_list

File ~\anaconda3\lib\site-packages\IPython\core\ultratb.py:1118, in AutoFormattedTB.structured_traceback(self, etype, value, tb, tb_offset, number_of_lines_of_context)
1116 else:
1117 self.tb = tb
-> 1118 return FormattedTB.structured_traceback(
1119 self, etype, value, tb, tb_offset, number_of_lines_of_context)

File ~\anaconda3\lib\site-packages\IPython\core\ultratb.py:1012, in FormattedTB.structured_traceback(self, etype, value, tb, tb_offset, number_of_lines_of_context)
1009 mode = self.mode
1010 if mode in self.verbose_modes:
1011 # Verbose modes need a full traceback
-> 1012 return VerboseTB.structured_traceback(
1013 self, etype, value, tb, tb_offset, number_of_lines_of_context
1014 )
1015 elif mode == 'Minimal':
1016 return ListTB.get_exception_only(self, etype, value)

File ~\anaconda3\lib\site-packages\IPython\core\ultratb.py:865, in VerboseTB.structured_traceback(self, etype, evalue, etb, tb_offset, number_of_lines_of_context)
856 def structured_traceback(
857 self,
858 etype: type,
(...)
862 number_of_lines_of_context: int = 5,
863 ):
864 """Return a nice text document describing the traceback."""
--> 865 formatted_exception = self.format_exception_as_a_whole(etype, evalue, etb, number_of_lines_of_context,
866 tb_offset)
868 colors = self.Colors # just a shorthand + quicker name lookup
869 colorsnormal = colors.Normal # used a lot

File ~\anaconda3\lib\site-packages\IPython\core\ultratb.py:799, in VerboseTB.format_exception_as_a_whole(self, etype, evalue, etb, number_of_lines_of_context, tb_offset)
796 assert isinstance(tb_offset, int)
797 head = self.prepare_header(etype, self.long_header)
798 records = (
--> 799 self.get_records(etb, number_of_lines_of_context, tb_offset) if etb else []
800 )
802 frames = []
803 skipped = 0

File ~\anaconda3\lib\site-packages\IPython\core\ultratb.py:854, in VerboseTB.get_records(self, etb, number_of_lines_of_context, tb_offset)
848 formatter = None
849 options = stack_data.Options(
850 before=before,
851 after=after,
852 pygments_formatter=formatter,
853 )
--> 854 return list(stack_data.FrameInfo.stack_data(etb, options=options))[tb_offset:]

File ~\anaconda3\lib\site-packages\stack_data\core.py:546, in FrameInfo.stack_data(cls, frame_or_tb, options, collapse_repeated_frames)
530 @classmethod
531 def stack_data(
532 cls,
(...)
536 collapse_repeated_frames: bool = True
537 ) -> Iterator[Union['FrameInfo', RepeatedFrames]]:
538 """
539 An iterator of FrameInfo and RepeatedFrames objects representing
540 a full traceback or stack. Similar consecutive frames are collapsed into RepeatedFrames
(...)
544 and optionally an Options object to configure.
545 """
--> 546 stack = list(iter_stack(frame_or_tb))
548 # Reverse the stack from a frame so that it's in the same order
549 # as the order from a traceback, which is the order of a printed
550 # traceback when read top to bottom (most recent call last)
551 if is_frame(frame_or_tb):

File ~\anaconda3\lib\site-packages\stack_data\utils.py:98, in iter_stack(frame_or_tb)
96 while frame_or_tb:
97 yield frame_or_tb
---> 98 if is_frame(frame_or_tb):
99 frame_or_tb = frame_or_tb.f_back
100 else:

File ~\anaconda3\lib\site-packages\stack_data\utils.py:91, in is_frame(frame_or_tb)
90 def is_frame(frame_or_tb: Union[FrameType, TracebackType]) -> bool:
---> 91 assert_(isinstance(frame_or_tb, (types.FrameType, types.TracebackType)))
92 return isinstance(frame_or_tb, (types.FrameType,))

File ~\anaconda3\lib\site-packages\stack_data\utils.py:172, in assert_(condition, error)
170 if isinstance(error, str):
171 error = AssertionError(error)
--> 172 raise error

AssertionError:
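A likely cause (an inference from the traceback, not a confirmed answer from the thread): `parser.parse_args()` with no arguments reads `sys.argv`, and inside a Jupyter notebook `sys.argv` contains the kernel's own flags, including `-f <connection-file>`, which this parser does not define. Passing the intended arguments explicitly sidesteps this:

```python
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-n", dest="samplesize", type="int",
                  help="dataset sample size", default=100)

# In a notebook, pass the argument list explicitly instead of letting
# optparse fall back to sys.argv (which holds the kernel's -f flag).
options, args = parser.parse_args(["-n", "50"])
print(options.samplesize)  # -> 50
```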

How do I apply trained model on new data input?

Hi, I see that the inference function is called during training; however, it doesn't take new data (a test set, for example) as input. How can I apply the finished model to new data?

Thank you.
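One common approach for this (a general sketch, not the repo's actual API) is "fold-in": freeze the trained topic-word matrix phi and estimate only the new document's topic mixture theta with a few iterative updates.

```python
import numpy as np

def fold_in(doc_ids, phi, alpha=0.5, iters=50):
    """Estimate a held-out document's topic mixture theta while keeping
    the trained topic-word matrix phi (K x V) fixed. A simplified
    EM-style fold-in; doc_ids are the document's word ids."""
    K = phi.shape[0]
    theta = np.full(K, 1.0 / K)
    for _ in range(iters):
        # responsibility of each topic for each word position (K x N)
        r = theta[:, None] * phi[:, doc_ids]
        r /= r.sum(axis=0, keepdims=True)
        # re-estimate theta from expected topic counts plus the alpha prior
        theta = r.sum(axis=1) + alpha
        theta /= theta.sum()
    return theta

rng = np.random.default_rng(1)
phi = rng.dirichlet(np.ones(10), size=3)  # stand-in for trained topics (K=3, V=10)
theta = fold_in([0, 1, 1, 4], phi)
assert theta.shape == (3,) and abs(theta.sum() - 1.0) < 1e-9
```

The key point is that phi is never updated, so the trained model is reused as-is and only the new document's theta is inferred.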

llda.py: question on output

Hello Shuyo,

Thanks for putting this code online! I got it to work with my own data, which includes several labels, as in the following start of a line (each text has 5 labels, some of which are unique, others are recurring):
[Ponson,criminel,1850s,ExploitsRocambole1,rp169] brick commerce route nœud heure temps brise ...

Purely for testing, I used a small collection of just 20 rather long texts, with only 5 iterations and 10 topics (of course this is not enough for serious results). But now I'm not sure how to interpret and further use the output. There seem to be two outputs (lines 145 and 146).

For the first one, I get something like this (just zeros with the occasional 1):
someword 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

For the second one, I get something like:
someword,1.98242060772e-08,1.35954869509e-07,3.20370912628e-07,2.61753929319e-07,4.15044189755e-07,3.13934090166e-07,4.24199387286e-07,7.01074466728e-07,2.85357018727e-07,6.1751952288e-07,3.20370912628e-07, ...

What is the difference between the two results? Are these per-word scores for each label? The length of this list of scores is identical to the length of the "labelmap.keys()" dict of labels; how do they match up? Or are these values something else?

Thanks for any hints, and best wishes.

my question

Hey, pal! I couldn't find your email on GitHub, so I'm asking for help this way. I'm interested in CRFs and want to implement one in Python. I'm reading your code, but I'm a newbie to CRFs. Could you provide a flow diagram, or add more detail about the parameters? (I can't match them to the math formulas.) Any guidance would be greatly appreciated. Thanks very much in advance!
