susanli2016 / machine-learning-with-python

Python code for common Machine Learning Algorithms

Python 0.02% Jupyter Notebook 99.98%

linear-regression polynomial-regression logistic-regression decision-trees random-forest svm svr knn-classification naive-bayes-classifier kmeans-clustering

machine-learning-with-python's Introduction

Machine-Learning-with-Python

Python code for common Machine Learning Algorithms

machine-learning-with-python's Issues

Dataset not found

Hello Mrs. Susan, I hope you are doing well. I have followed your project carefully and I am very interested in it.

I could not find the dataset "sample_data_search.tar.gz" for the "Trip Segmentation by User Search Behaviors.ipynb" notebook. I tried to contact you on several platforms but could not find your contact information. I need this dataset urgently, so I would appreciate it very much if you could provide it.

Machine learning

.

How can we calculate the probability of an outcome?
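The issue does not say which notebook or model is meant. As a minimal sketch, assuming a scikit-learn classifier that supports `predict_proba` (here `LogisticRegression` on made-up toy data), class probabilities can be read like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): one feature, binary outcome.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# Each row holds one probability per class, ordered as in clf.classes_.
proba = clf.predict_proba(np.array([[0.5], [4.5]]))
print(clf.classes_)
print(proba)
```

Each row of `proba` sums to 1, and the column order follows `clf.classes_`.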

source of Sales_Product_Price_by_Store.csv

Does anyone know the data source of ./data/Sales_Product_Price_by_Store.csv? I hope to use this data in research.

dimension mismatch

I am trying to predict the class of a (random) string, but clf.predict always gives a dimension mismatch error. Here I pass the very first row to check whether it is classified correctly, but it raises the mismatch error. I have done everything the same way as in the notebook.

s = []
s.append((df['final'][0]))
print(clf.predict(count_vect.transform(s)))
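The cause cannot be confirmed from the snippet alone. One common source of this error is that the classifier was trained on features from a different (or re-fitted) vectorizer than the one used at prediction time. A sketch that avoids the problem by bundling the vectorizer and classifier in one scikit-learn pipeline, using made-up toy texts and labels:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical), standing in for the notebook's DataFrame columns.
texts = [
    "my credit card was charged twice",
    "late fee on my credit card",
    "mortgage payment was lost",
    "the bank sold my mortgage",
]
labels = ["Credit card", "Credit card", "Mortgage", "Mortgage"]

# The pipeline fits the vectorizer and the classifier together, so the
# feature dimensions at predict time always match those seen in training.
model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
model.fit(texts, labels)

pred = model.predict(["credit card fee"])
print(pred)
```

With separate objects, the same rule applies: call `transform` (never `fit_transform`) on new text, using the exact vectorizer instance the classifier was trained with.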

How can I show accuracy from this model?

Cell 176. After correction works great!

def plot_fruit_knn(...

clf = KNeighborsClassifier(....

Is precision confused with recall?

In https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Employee_Turnover.ipynb,
for the confusion matrix of the random forest model, I think the precision is 991/(991+47) and the recall is 991/(991+54).
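Which of the two fractions is the precision depends on how the matrix is oriented, which cannot be settled from the issue text alone. As a reference sketch on made-up labels: scikit-learn's `confusion_matrix` puts true labels on rows and predicted labels on columns, so false positives sit in the same column as the true positives and false negatives in the same row.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical), 1 = positive class.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found

print(cm)
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```

Checking the manual fractions against `precision_score` and `recall_score` on the notebook's own `y_test` and predictions would show which denominator belongs to which metric.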

019_Polo_Towers.csv

Hello Susan,

Where can I find the data file "019_Polo_Towers.csv"? This data is needed for the "Analysis - Polo Towers OCC & ADR & Rental RevPar & Time Series" project.

Thanks.

Manny

TypeError: object of type 'numpy.int64' has no len()

I'm getting the error 'TypeError: object of type 'numpy.int64' has no len()' for the last section of code.

My data file doesn't have column headings, so I used 'header=None' when reading the csv file.

My data file also uses integers as the labels rather than text.

# Load data file from GCS
%gcs read --object "gs://projectname/data/data.csv" --variable csv_as_bytes
df = pd.read_csv(BytesIO(csv_as_bytes), header=None, encoding='latin-1')
df.head()

Memory error on Consumer_complaints.ipynb

MemoryError Traceback (most recent call last)
in ()
3 tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
4
----> 5 features = tfidf.fit_transform(df.Consumer_complaint_narrative).toarray()
6 labels = df.category_id
7 features.shape

~\Anaconda3\lib\site-packages\scipy\sparse\compressed.py in toarray(self, order, out)
945 if out is None and order is None:
946 order = self._swap('cf')[0]
--> 947 out = self._process_toarray_args(order, out)
948 if not (out.flags.c_contiguous or out.flags.f_contiguous):
949 raise ValueError('Output array must be C or F contiguous')

~\Anaconda3\lib\site-packages\scipy\sparse\base.py in _process_toarray_args(self, order, out)
1182 return out
1183 else:
-> 1184 return np.zeros(self.shape, dtype=self.dtype, order=order)
1185
1186

MemoryError:
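The traceback shows the failure inside `.toarray()`, which allocates a dense array of the full documents-by-features shape. A possible workaround, sketched on toy documents, is to keep the TF-IDF output as a SciPy sparse matrix. This assumes the downstream estimator accepts sparse input, as `LinearSVC` does; any later step that needs a dense array would still have to be adapted.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy documents and labels (hypothetical), standing in for
# df.Consumer_complaint_narrative and df.category_id.
docs = [
    "credit card charged twice",
    "credit card late fee",
    "mortgage payment lost",
    "bank sold mortgage loan",
]
labels = [0, 0, 1, 1]

# Same options as in the issue, except min_df=5 is dropped because the toy corpus is tiny.
tfidf = TfidfVectorizer(sublinear_tf=True, norm='l2', encoding='latin-1',
                        ngram_range=(1, 2), stop_words='english')

# No .toarray(): the result stays sparse, storing only non-zero entries.
features = tfidf.fit_transform(docs)
print(type(features), features.shape)

model = LinearSVC().fit(features, labels)
pred = model.predict(tfidf.transform(["late fee on credit card"]))
print(pred)
```

If a dense array is unavoidable, reducing the vocabulary (for example with `max_features` or a higher `min_df`) or sampling fewer rows shrinks the allocation.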

Dataset

In Logistic Regression balanced, I was looking for the dataset
but was not able to find it.
Can you share the dataset?

h2o import file error

I am not able to run this line...

higgs = h2o.import_file('higgs_boston_train.csv')

Getting this error:

H2OResponseError Traceback (most recent call last)
in
----> 1 higgs = h2o.import_file('higgs_boston_train.csv')

/opt/conda/lib/python3.7/site-packages/h2o/h2o.py in import_file(path, destination_frame, parse, header, sep, col_names, col_types, na_strings, pattern, skipped_columns, custom_non_data_line_markers)
434 else:
435 return H2OFrame()._import_parse(path, pattern, destination_frame, header, sep, col_names, col_types, na_strings,
--> 436 skipped_columns, custom_non_data_line_markers)
437
438

/opt/conda/lib/python3.7/site-packages/h2o/frame.py in _import_parse(self, path, pattern, destination_frame, header, separator, column_names, column_types, na_strings, skipped_columns, custom_non_data_line_markers)
334 if H2OFrame.LOCAL_EXPANSION_ON_SINGLE_IMPORT and is_type(path, str) and "://" not in path: # fixme: delete those 2 lines, cf. PUBDEV-5717
335 path = os.path.abspath(path)
--> 336 rawkey = h2o.lazy_import(path, pattern)
337 self._parse(rawkey, destination_frame, header, separator, column_names, column_types, na_strings,
338 skipped_columns, custom_non_data_line_markers)

/opt/conda/lib/python3.7/site-packages/h2o/h2o.py in lazy_import(path, pattern)
296 assert_is_type(pattern, str, None)
297 paths = [path] if is_type(path, str) else path
--> 298 return _import_multi(paths, pattern)
299
300

/opt/conda/lib/python3.7/site-packages/h2o/h2o.py in _import_multi(paths, pattern)
302 assert_is_type(paths, [str])
303 assert_is_type(pattern, str, None)
--> 304 j = api("POST /3/ImportFilesMulti", {"paths": paths, "pattern": pattern})
305 if j["fails"]: raise ValueError("ImportFiles of '" + ".".join(paths) + "' failed on " + str(j["fails"]))
306 return j["destination_frames"]

/opt/conda/lib/python3.7/site-packages/h2o/h2o.py in api(endpoint, data, json, filename, save_to)
102 # type checks are performed in H2OConnection class
103 _check_connection()
--> 104 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
105
106

/opt/conda/lib/python3.7/site-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to)
405 auth=self._auth, verify=self._verify_ssl_cert, proxies=self._proxies)
406 self._log_end_transaction(start_time, resp)
--> 407 return self._process_response(resp, save_to)
408
409 except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError) as e:

/opt/conda/lib/python3.7/site-packages/h2o/backend/connection.py in _process_response(response, save_to)
741 # Client errors (400 = "Bad Request", 404 = "Not Found", 412 = "Precondition Failed")
742 if status_code in {400, 404, 412} and isinstance(data, (H2OErrorV3, H2OModelBuilderErrorV3)):
--> 743 raise H2OResponseError(data)
744
745 # Server errors (notably 500 = "Server Error")

H2OResponseError: Server error water.exceptions.H2ONotFoundArgumentException:
Error: File /tmp/Machine-Learning-with-Python/higgs_boston_train.csv does not exist
Request: POST /3/ImportFilesMulti
data: {'paths': '[/tmp/Machine-Learning-with-Python/higgs_boston_train.csv]'}

`AttributeError: 'DataFrame' object has no attribute 'ix'` in `Time Series Forecastings`

Hi,
I tried to run the Time Series Forecastings example. From

first_date = store.ix[np.min(list(np.where(store['office_sales'] > store['furniture_sales'])[0])), 'Order Date']

print("Office supplies first time produced higher sales than furniture is {}.".format(first_date.date()))

I got
AttributeError: 'DataFrame' object has no attribute 'ix'
If I replace ix with iloc, based on https://stackoverflow.com/questions/59991397/attributeerror-dataframe-object-has-no-attribute-ix, then I get
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

How can I fix it? Thanks
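`.iloc` accepts only integer positions, so mixing a row position with the column label 'Order Date' fails. A sketch of one way around it, on a made-up stand-in for the notebook's `store` frame: select the column by label first, then index by position.

```python
import numpy as np
import pandas as pd

# Toy stand-in (hypothetical values) for the notebook's `store` DataFrame.
store = pd.DataFrame({
    "Order Date": pd.to_datetime(["2014-01-01", "2014-02-01", "2014-03-01", "2014-04-01"]),
    "furniture_sales": [500.0, 400.0, 300.0, 450.0],
    "office_sales": [300.0, 350.0, 320.0, 480.0],
})

# Integer position of the first row where office supplies outsold furniture.
pos = np.min(np.where(store["office_sales"] > store["furniture_sales"])[0])

# Pick the column by label, then the row by position.
first_date = store["Order Date"].iloc[pos]

print("Office supplies first time produced higher sales than furniture is {}.".format(first_date.date()))
```

If the frame has the default 0..n-1 index, `store.loc[pos, 'Order Date']` gives the same result, because position and label then coincide.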

Logistic regression

Hi, I have a problem with the implementation of logistic regression.
When I try
logit_model=sm.Logit(y,X)
I get
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data)
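This error usually means that `X` (or `y`) still contains non-numeric columns, such as strings or unencoded categories. A sketch of how the inputs could be made numeric before they are passed to `sm.Logit`, using a made-up frame with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): one numeric column stored as text, one categorical column.
X = pd.DataFrame({
    "age": ["30", "45", "52", "23"],
    "job": ["admin", "technician", "admin", "services"],
})

# Inspect the dtypes first; non-numeric columns are the usual culprits.
print(X.dtypes)

X_num = X.copy()
X_num["age"] = pd.to_numeric(X_num["age"])            # text digits -> numbers
X_num = pd.get_dummies(X_num, columns=["job"],        # categories -> 0/1 indicator columns
                       drop_first=True, dtype=float)
X_num = X_num.astype(float)

print(np.asarray(X_num).dtype)
```

Once `np.asarray(X_num)` reports a float dtype, the frame can be handed to `sm.Logit(y, X_num)`; `y` needs the same check.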

Superstore.xls missing for Anomaly_Detection_for_Dummies

Link : https://community.tableau.com/docs/DOC-1236?source=post_page---------------------------

What's the definition of Diabetes Pedigree Function?

There is a feature in Diabetes.csv named Diabetes Pedigree Function (pedi). I searched for a description of it and found the following:

A particularly interesting attribute used in the study was the Diabetes Pedigree Function, pedi. It provided some data on diabetes mellitus history in relatives and the genetic relationship of those relatives to the patient. This measure of genetic influence gave us an idea of the hereditary risk one might have with the onset of diabetes mellitus. Based on observations in the proceeding section, it is unclear how well this function predicts the onset of diabetes.

Can anybody tell me the definition of this pedi feature? I mean the mathematical definition.

Update README file

Hey! I love how simple and amazing your repository is. I noticed that it needs a README file, and I would love to contribute one. If you think it is a good idea, we can discuss it further.

Best regards,
Rafay

Recommendation system dataset

Hello Susan:
It was nice to follow your GitHub and come across so many good Python applications and code. I am a Ph.D. student who is currently learning some basic Python examples. I found your 'recommendation system' topic quite interesting and I am trying to learn about it, but I could not find the BX-Books.csv file you used. Would you mind sharing it with me? My email address is: [email protected]
Thanks very much.

Getting Error in Forecasting Graph

When I try to run the code in the Validating Forecasts section, it gives me an error. I have attached the code and the error. Please have a look and suggest a solution.

Many Thanks

Minor Typo

Hi Susan,

In the Time Series of Price Anomaly Detection Expedia.ipynb, there is a minor typo in the function markovAnomaly.

In lines 39 and 40 of the block, it should be:
if (j < windows_size): df_anomaly.append(0)

Overall I find the notebook really helpful. Thanks.

Chase

How to find the color score?

Error in 'topic_modeling_Gensim.ipynb'

Hi,

I have tried to run 'topic_modeling_Gensim.ipynb' and I get this error at this stage in the notebook. Can anyone help?

import random
text_data = []
with open('dataset.csv') as f:
    for line in f:
        tokens = prepare_text_for_lda(line)
        if random.random() > .99:
            print(tokens)
            text_data.append(tokens)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-54-7369a1356984> in <module>()
      3 with open('dataset.csv') as f:
      4     for line in f:
----> 5         tokens = prepare_text_for_lda(line)
      6         if random.random() > .99:
      7             print(tokens)

<ipython-input-51-4f0710beb9ee> in prepare_text_for_lda(text)
      1 def prepare_text_for_lda(text):
----> 2     tokens = tokenize(text)
      3     tokens = [token for token in tokens if len(token) > 4]
      4     tokens = [token for token in tokens if token not in en_stop]
      5     tokens = [get_lemma(token) for token in tokens]

<ipython-input-45-f5c7dc83eb04> in tokenize(text)
      3 def tokenize(text):
      4     lda_tokens = []
----> 5     tokens = parser(text)
      6     for token in tokens:
      7         if token.orth_.isspace():

NameError: name 'parser' is not defined

Expected 2D array, got 1D array instead

Hi,

I'm getting the following error when I run the cell below.
What should I do?

scaler = MinMaxScaler(feature_range=(-1, 1))
train_sc = scaler.fit_transform(train)
test_sc = scaler.transform(test)

Expected 2D array, got 1D array instead: array=[17.24 18.190001 19.219999 ... 10.47 10.18 11.04 ]. Reshape your data either using array.reshape(-1, 1)
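The scaler expects input of shape (n_samples, n_features), so a single series has to be turned into one column first, as the error message suggests. A sketch with made-up values standing in for the notebook's `train` and `test`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy 1D series (hypothetical values) standing in for `train` and `test`.
train = np.array([17.24, 18.19, 19.22, 10.47])
test = np.array([10.18, 11.04])

scaler = MinMaxScaler(feature_range=(-1, 1))

# reshape(-1, 1) turns shape (n,) into (n, 1): n samples, one feature.
train_sc = scaler.fit_transform(train.reshape(-1, 1))
test_sc = scaler.transform(test.reshape(-1, 1))

print(train_sc.shape, test_sc.shape)
```

If `train` is a pandas Series rather than a NumPy array, `train.values.reshape(-1, 1)` or `train.to_frame()` gives the same 2D shape.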

Missing data for Multi label text classification.ipynb: train 2.csv

This notebook references "train 2.csv" but it's not included. Is there a source for this data?

I am getting a problem in cell [80] when running from imblearn.over_sampling import SMOTE

DeprecationWarning, DataConversionWarning, NameErrors, FutureWarnings

Hej Susan,

I am trying to retrace your steps on this logistic regression. I started with your article “Building A Logistic Regression in Python, Step by Step” on DataScience+ and I am now working through the (latest commit of your) Jupyter notebook used to make that post.

I have tried to reproduce your results with a clone of your notebook.

I have replaced from sklearn.cross_validation import train_test_split with from sklearn.model_selection import train_test_split because of a DeprecationWarning in cell 1.

Cell 24 raises a DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). y = column_or_1d(y, warn=True)
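The two fixes described so far (the `sklearn.model_selection` import and flattening `y`) can be sketched together on toy data. The array values here are made up; only the import path and the `ravel()` call reflect the warnings quoted above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # replaces sklearn.cross_validation

# Toy data (hypothetical): y is a column vector of shape (10, 1),
# which is what triggers the DataConversionWarning.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([[0], [0], [0], [0], [0], [1], [1], [1], [1], [1]])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# ravel() flattens (n_samples, 1) to (n_samples,), the shape fit() expects.
logreg = LogisticRegression().fit(X_train, y_train.ravel())
score = logreg.score(X_test, y_test.ravel())
print(score)
```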

In cell 32 there's a NameError: name 'classifier' is not defined. You have used logreg.score in cell 30. Replacing classifier with logreg in cell 32 works but obviously produces exactly the same results as in cell 30. I am not sure what you are trying to do here (using a different classifier?), or if this is just an accidental duplicate (after changing classifier to a more specific logreg).

In the last section titled “ROC Curvefrom sklearn import metrics” it looks to me like you (accidentally) converted some Python code to MarkDown. This code (cell 34) produces two FutureWarnings: pandas.tslib is deprecated and will be removed in a future version., one NameError: name 'clf1' is not defined and another NameError: name 'Y_test' is not defined.

Kind regards

date of date_account_created is greater than timestamp_first_active in Airbnb New User Bookings.ipynb

The date in date_account_created is later than timestamp_first_active.
You can check at Out[28]:
https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Airbnb%20New%20User%20Bookings.ipynb

	affiliate_channel	affiliate_provider	age	country_destination	date_account_created	first_affiliate_tracked	first_browser	first_device_type	gender	id	language	signup_app	signup_flow	signup_method	timestamp_first_active	date_account_created_day	date_account_created_month
direct	direct	56.0	US	2010-09-28	untracked	IE	Windows Desktop	FEMALE	4ft3gnwmtx	en	Web	3	basic	2009-06-09	Tuesday	9	2010
direct	direct	42.0	other	2011-12-05	untracked	Firefox	Mac Desktop	FEMALE	bjjt8pjhuk	en	Web	0	facebook	2009-10-31	Monday	12	2011
direct	direct	41.0	US	2010-09-14	untracked	Chrome	Mac Desktop	M	87mebub9p4	en	Web	0	basic	2009-12-08	Tuesday	9	2010
other	other	NaN	US	2010-01-01	omg	Chrome	Mac Desktop	M	osr2jwljor	en	Web	0	basic	2010-01-01	Friday	1	2010
other	craigslist	46.0	US	2010-01-02	untracked	Safari	Mac Desktop	FEMALE	lsw9q7uk0j	en	Web	0	basic	2010-01-02	Saturday

I want to get the data to practice

Hello,
I want to learn from the code in Consumer_complaints.ipynb, but I don't have the data. Can you help me?
Thanks

My email: [email protected]

Missing data 'train.gz' for Click-Through Rate Prediction.ipynb

'train.gz' for Click-Through Rate Prediction.ipynb is missing. Could you please provide a sample version of the data?

Training Data Set.csv data set

Where can I find the loyalty member clustering dataset, Training Data Set.csv?

Error when running print(result.summary())

AttributeError: module 'scipy.stats' has no attribute 'chisqprob'

Is there a fix or an alternative approach?
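`chisqprob` was removed from newer SciPy releases, while older statsmodels versions still call it from `summary()`. Upgrading statsmodels may be enough on its own. Failing that, a commonly used workaround is to restore the function before calling `summary()`, sketched here with `chi2.sf`, which computes the same upper-tail probability:

```python
from scipy import stats

# Re-create the removed helper only if this SciPy version lacks it.
if not hasattr(stats, "chisqprob"):
    stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

# Sanity check: 3.8415 is the 5% critical value for 1 degree of freedom.
p = stats.chisqprob(3.841458820694124, 1)
print(p)
```

Run this once near the top of the notebook, before `print(result.summary())`.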

How can I show accuracy, precision, recall, f1-measure of this model?
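The issue does not name the model, so this is a generic sketch: given true labels and predictions from any scikit-learn classifier (the `y_test` and `y_pred` values below are made up), `sklearn.metrics` reports all four measures.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_recall_fscore_support)

# Toy labels (hypothetical); in practice y_pred = model.predict(X_test).
y_test = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

accuracy = accuracy_score(y_test, y_pred)

# average='binary' reports the scores for the positive class (label 1).
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average='binary')

print(accuracy, precision, recall, f1)

# Per-class table with precision, recall, f1-score and support.
print(classification_report(y_test, y_pred))
```

For multi-class models, `average='macro'` or `average='weighted'` replaces `'binary'`, and `classification_report` already lists every class.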

ValueError: Length of endogenous variable must be larger the the number of lags used in the model and the number of observations burned in the log-likelihood calculation

Hi,

I tried to run Time Series Forecastings.ipynb both in Jupyter and as a Python script. In Jupyter it seems fine. If I run it as a Python file (pasting the sections one by one and running the whole file), then at

results.plot_diagnostics(figsize=(16, 8))
plt.show()

I got

Traceback (most recent call last):
  File "time-series.py", line 71, in <module>
    results.plot_diagnostics(figsize=(16, 8))
  File "/home/user/anaconda3/lib/python3.8/site-packages/statsmodels/tsa/statespace/mlemodel.py", line 4284, in plot_diagnostics
    raise ValueError(
ValueError: Length of endogenous variable must be larger the the number of lags used in the model and the number of observations burned in the log-likelihood calculation.

May I know the reason for this? Thanks

Need data for Expedia anomaly detection

I need the data to check the Expedia Anomaly Detection notebook. Can anybody help me?

[bpr_OnlineRetail_Implicit.ipynb]: operands could not be broadcast together with shapes (3664,) (4338,)

ValueError Traceback (most recent call last)
Input In [9], in
28 # Create recommendations for customer with id 2
29 customer_id = 2
---> 30 recommendations = recommend(customer_id, sparse_customer_item, customer_vecs, item_vecs)
32 print(recommendations)

Input In [9], in recommend(customer_id, sparse_customer_item, customer_vecs, item_vecs, num_items)
9 min_max = MinMaxScaler()
10 rec_vector_scaled = min_max.fit_transform(rec_vector.reshape(-1,1))[:,0]
---> 11 recommend_vector = customer_interactions * rec_vector_scaled
13 item_idx = np.argsort(recommend_vector)[::-1][:num_items]
15 descriptions = []

ValueError: operands could not be broadcast together with shapes (3664,) (4338,)