Python code for common Machine Learning Algorithms
susanli2016 / machine-learning-with-python
Hello Mrs. Susan, I hope you are doing well. I have followed your project carefully and am very interested in it.
I could not find the dataset "sample_data_search.tar.gz" for the "Trip Segmentation by User Search Behaviors.ipynb" notebook. I tried to reach you on several platforms but could not find your contact information. I need this dataset urgently and would appreciate it very much if you could provide it.
Does anyone know the data source of ./data/Sales_Product_Price_by_Store.csv?
I hope to use this data in research.
I am trying to predict the class of a single string, but clf.predict always raises a dimension mismatch error. Here I pass the very first row to check whether it is classified correctly, but I still get the mismatch error, even though I have done everything as described in the notebook.
s = []
s.append((df['final'][0]))
print(clf.predict(count_vect.transform(s)))
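A dimension mismatch at predict time usually means the feature matrix passed to the classifier has a different number of columns than the one it was trained on. The sketch below assumes, without having seen the poster's code, that the classifier was trained on TF-IDF features produced by a CountVectorizer followed by a TfidfTransformer; the training texts and labels are made up for illustration. The point is that the new string must go through the same fitted objects, in the same order, with transform rather than fit_transform.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data, for illustration only.
train_texts = [
    "late fee charged on credit card",
    "credit card interest rate too high",
    "mortgage payment was not applied",
    "mortgage escrow account error",
]
train_labels = ["card", "card", "mortgage", "mortgage"]

# Fit the vectorizer and the TF-IDF transformer once, on the training data.
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(train_texts)
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X_counts)
clf = MultinomialNB().fit(X_tfidf, train_labels)

# Reuse the SAME fitted objects for new text; never refit them here.
s = [train_texts[0]]  # a list containing one string
X_new = tfidf.transform(count_vect.transform(s))

# The column count must match what the classifier saw during training.
assert X_new.shape[1] == X_tfidf.shape[1]
pred = clf.predict(X_new)
print(pred)
```

If the column counts differ, the vectorizer was most likely refit on different data, or a different vectorizer was used for training than for prediction.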
def plot_fruit_knn(...
clf = KNeighborsClassifier(....
In https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Employee_Turnover.ipynb,
for the confusion matrix of the random forest model, I think the precision is 991/(991+47) and the recall is 991/(991+54).
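The two fractions above can be checked with a few lines of arithmetic. This is a minimal sketch that takes the issue's reading as an assumption: 991 true positives, 47 false positives, and 54 false negatives. The notebook's actual matrix layout has not been verified here, so the roles of 47 and 54 may be the other way around.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from raw confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    return precision, recall

# Counts as read in the issue (assumed assignment of FP and FN).
precision, recall = precision_recall(tp=991, fp=47, fn=54)
print(f"precision={precision:.4f}, recall={recall:.4f}")
```

With scikit-learn's confusion_matrix, rows are true labels and columns are predicted labels, so for a binary problem the layout is [[TN, FP], [FN, TP]].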
Hello Susan,
Where can I find the data file "019_Polo_Towers.csv"? This data is needed for the "Analysis - Polo Towers OCC & ADR & Rental RevPar & Time Series" project.
Thanks.
Manny
I'm getting the error 'TypeError: object of type 'numpy.int64' has no len()' for the last section of code.
My data file doesn't have column headings, so I used 'header=None' when reading the csv file.
My data file also uses integers as the labels rather than text.
# Load data file from GCS
%gcs read --object "gs://projectname/data/data.csv" --variable csv_as_bytes
df = pd.read_csv(BytesIO(csv_as_bytes), header=None, encoding='latin-1')
df.head()
MemoryError Traceback (most recent call last)
in ()
3 tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
4
----> 5 features = tfidf.fit_transform(df.Consumer_complaint_narrative).toarray()
6 labels = df.category_id
7 features.shape
~\Anaconda3\lib\site-packages\scipy\sparse\compressed.py in toarray(self, order, out)
945 if out is None and order is None:
946 order = self._swap('cf')[0]
--> 947 out = self._process_toarray_args(order, out)
948 if not (out.flags.c_contiguous or out.flags.f_contiguous):
949 raise ValueError('Output array must be C or F contiguous')
~\Anaconda3\lib\site-packages\scipy\sparse\base.py in _process_toarray_args(self, order, out)
1182 return out
1183 else:
-> 1184 return np.zeros(self.shape, dtype=self.dtype, order=order)
1185
1186
MemoryError:
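The traceback shows the failure inside .toarray(), which tries to allocate a dense array for the whole TF-IDF matrix. A minimal sketch of one way around it is to keep the matrix sparse, since many scikit-learn estimators accept sparse input directly. The texts and labels below are made up for illustration, and min_df=5 is dropped only because this toy corpus is tiny.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical stand-ins for df.Consumer_complaint_narrative and df.category_id.
texts = [
    "credit card charged twice for one purchase",
    "bank closed my checking account without notice",
    "credit card late fee was never explained",
    "checking account overdraft fee dispute",
]
labels = [0, 1, 0, 1]

tfidf = TfidfVectorizer(sublinear_tf=True, norm="l2",
                        ngram_range=(1, 2), stop_words="english")

# fit_transform already returns a SciPy sparse matrix; do not call .toarray().
features = tfidf.fit_transform(texts)
print(type(features), features.shape)

# LinearSVC works on the sparse matrix as is.
model = LinearSVC().fit(features, labels)
predictions = model.predict(features)
```

If a dense array is truly required downstream, reducing the vocabulary first (for example with max_features on the vectorizer) keeps the allocation smaller.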
I was looking for the dataset used in "Logistic Regression balanced" but was not able to find it.
Can you share the dataset?
I am not able to run this line...
higgs = h2o.import_file('higgs_boston_train.csv')
Getting this error:
H2OResponseError Traceback (most recent call last)
in
----> 1 higgs = h2o.import_file('higgs_boston_train.csv')
/opt/conda/lib/python3.7/site-packages/h2o/h2o.py in import_file(path, destination_frame, parse, header, sep, col_names, col_types, na_strings, pattern, skipped_columns, custom_non_data_line_markers)
434 else:
435 return H2OFrame()._import_parse(path, pattern, destination_frame, header, sep, col_names, col_types, na_strings,
--> 436 skipped_columns, custom_non_data_line_markers)
437
438
/opt/conda/lib/python3.7/site-packages/h2o/frame.py in _import_parse(self, path, pattern, destination_frame, header, separator, column_names, column_types, na_strings, skipped_columns, custom_non_data_line_markers)
334 if H2OFrame.LOCAL_EXPANSION_ON_SINGLE_IMPORT and is_type(path, str) and "://" not in path: # fixme: delete those 2 lines, cf. PUBDEV-5717
335 path = os.path.abspath(path)
--> 336 rawkey = h2o.lazy_import(path, pattern)
337 self._parse(rawkey, destination_frame, header, separator, column_names, column_types, na_strings,
338 skipped_columns, custom_non_data_line_markers)
/opt/conda/lib/python3.7/site-packages/h2o/h2o.py in lazy_import(path, pattern)
296 assert_is_type(pattern, str, None)
297 paths = [path] if is_type(path, str) else path
--> 298 return _import_multi(paths, pattern)
299
300
/opt/conda/lib/python3.7/site-packages/h2o/h2o.py in _import_multi(paths, pattern)
302 assert_is_type(paths, [str])
303 assert_is_type(pattern, str, None)
--> 304 j = api("POST /3/ImportFilesMulti", {"paths": paths, "pattern": pattern})
305 if j["fails"]: raise ValueError("ImportFiles of '" + ".".join(paths) + "' failed on " + str(j["fails"]))
306 return j["destination_frames"]
/opt/conda/lib/python3.7/site-packages/h2o/h2o.py in api(endpoint, data, json, filename, save_to)
102 # type checks are performed in H2OConnection class
103 _check_connection()
--> 104 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
105
106
/opt/conda/lib/python3.7/site-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to)
405 auth=self._auth, verify=self._verify_ssl_cert, proxies=self._proxies)
406 self._log_end_transaction(start_time, resp)
--> 407 return self._process_response(resp, save_to)
408
409 except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError) as e:
/opt/conda/lib/python3.7/site-packages/h2o/backend/connection.py in _process_response(response, save_to)
741 # Client errors (400 = "Bad Request", 404 = "Not Found", 412 = "Precondition Failed")
742 if status_code in {400, 404, 412} and isinstance(data, (H2OErrorV3, H2OModelBuilderErrorV3)):
--> 743 raise H2OResponseError(data)
744
745 # Server errors (notably 500 = "Server Error")
H2OResponseError: Server error water.exceptions.H2ONotFoundArgumentException:
Error: File /tmp/Machine-Learning-with-Python/higgs_boston_train.csv does not exist
Request: POST /3/ImportFilesMulti
data: {'paths': '[/tmp/Machine-Learning-with-Python/higgs_boston_train.csv]'}
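The last lines of the trace say that H2O resolved the relative name to /tmp/Machine-Learning-with-Python/higgs_boston_train.csv and found no file there. A small sketch for locating the file before handing it to H2O follows; the helper name resolve_data_file and the temporary directory used in the demo are hypothetical, and the h2o call is left as a comment so the snippet runs without a cluster.

```python
import tempfile
from pathlib import Path

def resolve_data_file(name, search_dirs):
    """Return the absolute path of `name` from the first directory that has it."""
    for directory in search_dirs:
        candidate = Path(directory).expanduser().resolve() / name
        if candidate.is_file():
            return candidate
    searched = ", ".join(str(d) for d in search_dirs)
    raise FileNotFoundError(f"{name!r} not found in: {searched}")

# Demo with a throwaway directory standing in for the real download location.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "higgs_boston_train.csv").write_text("col_a,col_b\n1,2\n")
    found = resolve_data_file("higgs_boston_train.csv", [Path.cwd() / "missing", tmp])
    found_name = found.name
    found_is_absolute = found.is_absolute()
    # higgs = h2o.import_file(str(found))  # pass the absolute path once it exists

    try:
        resolve_data_file("not_there.csv", [tmp])
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True
```

Note that h2o.import_file reads the path on the machine where the H2O server runs, so the file has to exist there; when the server is remote, h2o.upload_file pushes a local file from the client instead.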
Hi,
I tried to run the Time Series Forecastings example. From
first_date = store.ix[np.min(list(np.where(store['office_sales'] > store['furniture_sales'])[0])), 'Order Date']
print("Office supplies first time produced higher sales than furniture is {}.".format(first_date.date()))
I got
AttributeError: 'DataFrame' object has no attribute 'ix'
If I replace ix with iloc, based on https://stackoverflow.com/questions/59991397/attributeerror-dataframe-object-has-no-attribute-ix, then I get
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
How can I fix it? Thanks
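The old .ix accessor accepted a row position together with a column label, which neither .loc nor .iloc allows on its own: .iloc wants positions for both axes, hence the ValueError. One way to rewrite the line is to select the column by label first and then take the row by position. The sketch below uses a small made-up frame with the same column names as the notebook; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's `store` frame.
store = pd.DataFrame({
    "Order Date": pd.to_datetime(["2014-01-01", "2014-02-01", "2014-03-01"]),
    "furniture_sales": [500.0, 400.0, 300.0],
    "office_sales": [300.0, 350.0, 450.0],
})

# Positions of the rows where office supplies outsold furniture.
mask = (store["office_sales"] > store["furniture_sales"]).to_numpy()
first_pos = int(np.flatnonzero(mask)[0])

# Column by label, row by position: the combination .ix used to allow.
first_date = store["Order Date"].iloc[first_pos]
print("Office supplies first time produced higher sales than furniture is {}."
      .format(first_date.date()))
```

An equivalent form is store.iloc[first_pos, store.columns.get_loc("Order Date")], which keeps everything positional.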
Hi, I have a problem with the logistic regression implementation. When I run
logit_model=sm.Logit(y,X)
I get
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data)
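This error typically appears when at least one column of X (or y) is not numeric, so the whole frame converts to a NumPy array of dtype object. A minimal sketch of the check and a possible fix follows; the column names and values are hypothetical, and the assumption is that the offending columns hold numbers stored as text or boolean dummy variables.

```python
import numpy as np
import pandas as pd

# Hypothetical design matrix with mixed column types.
X = pd.DataFrame({
    "age": ["25", "32", "47", "51"],         # numbers stored as text
    "housing": [True, False, True, False],   # boolean dummy column
    "balance": [100.5, 250.0, 80.25, 310.0],
})

# Inspect per-column dtypes to find the non-numeric columns.
print(X.dtypes)

# Cast everything to float so the array handed to the model is numeric.
X_numeric = X.astype(float)
print(np.asarray(X_numeric).dtype)

# logit_model = sm.Logit(y, X_numeric)  # y must be numeric as well
```

If a column holds real category names rather than numbers, astype(float) will fail; such columns need encoding first (for example with pd.get_dummies) before the cast.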
There is a feature in Diabetes.csv named Diabetes Pedigree Function (pedi). I searched for a description and found the following:
A particularly interesting attribute used in the study was the Diabetes Pedigree Function, pedi. It provided some data on diabetes mellitus history in relatives and the genetic relationship of those relatives to the patient. This measure of genetic influence gave us an idea of the hereditary risk one might have with the onset of diabetes mellitus. Based on observations in the proceeding section, it is unclear how well this function predicts the onset of diabetes.
Can anybody tell me the definition of this pedi feature? I mean the mathematical definition.
Hey! I love how simple and amazing your repository is. I noticed that it needs a README file, and I would love to contribute one. If you think it is a good idea, we can discuss it further.
Best regards,
Rafay
Hello Susan:
It was nice to follow your GitHub and get in touch with so many good Python applications and code. I am a Ph.D. student who is currently learning some basic Python examples. I found your 'recommendation system' topic quite interesting, and I am trying to learn about it, but I did not find the BX-Books.csv you used. Would you mind sharing it with me? My email address is: [email protected]
Thanks very much.
Hi Susan,
In the Time Series of Price Anomaly Detection Expedia.ipynb notebook, there is a minor typo in the function markovAnomaly.
In lines 39 and 40 of the block, it should be:
if (j < windows_size): df_anomaly.append(0)
Overall I find the notebook really helpful. Thanks.
Chase
Hi,
I have tried to run 'topic_modeling_Gensim.ipynb' and I get this error at this stage in the notebook. Can anyone help?
import random
text_data = []
with open('dataset.csv') as f:
    for line in f:
        tokens = prepare_text_for_lda(line)
        if random.random() > .99:
            print(tokens)
            text_data.append(tokens)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-54-7369a1356984> in <module>()
3 with open('dataset.csv') as f:
4 for line in f:
----> 5 tokens = prepare_text_for_lda(line)
6 if random.random() > .99:
7 print(tokens)
<ipython-input-51-4f0710beb9ee> in prepare_text_for_lda(text)
1 def prepare_text_for_lda(text):
----> 2 tokens = tokenize(text)
3 tokens = [token for token in tokens if len(token) > 4]
4 tokens = [token for token in tokens if token not in en_stop]
5 tokens = [get_lemma(token) for token in tokens]
<ipython-input-45-f5c7dc83eb04> in tokenize(text)
3 def tokenize(text):
4 lda_tokens = []
----> 5 tokens = parser(text)
6 for token in tokens:
7 if token.orth_.isspace():
NameError: name 'parser' is not defined
Hi,
I'm getting the following error when I run the following cell.
What should I do?
scaler = MinMaxScaler(feature_range=(-1, 1))
train_sc = scaler.fit_transform(train)
test_sc = scaler.transform(test)
Expected 2D array, got 1D array instead: array=[17.24 18.190001 19.219999 ... 10.47 10.18 11.04 ]. Reshape your data either using array.reshape(-1, 1)
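The error message itself points at the fix: scikit-learn scalers expect a 2D array of shape (n_samples, n_features), so a single series has to be reshaped into one column. A minimal sketch follows, using a few of the values printed in the error; how they are split between train and test here is an assumption made for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A few values from the error message; the train/test split is assumed.
train = np.array([17.24, 18.190001, 19.219999])
test = np.array([10.47, 10.18, 11.04])

scaler = MinMaxScaler(feature_range=(-1, 1))

# reshape(-1, 1) turns a 1D series into a single-feature column.
train_sc = scaler.fit_transform(train.reshape(-1, 1))
test_sc = scaler.transform(test.reshape(-1, 1))

print(train_sc.shape, test_sc.shape)
```

If train is a pandas Series, train.to_numpy().reshape(-1, 1) or train.to_frame() gives the same single-column shape.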
This notebook references "train 2.csv" but it's not included. Is there a source for this data?
Hi Susan,
I am trying to retrace your steps on this logistic regression. I started with your article “Building A Logistic Regression in Python, Step by Step” on DataScience+ and am now working through the latest commit of the Jupyter notebook used to make that post.
I have tried to reproduce your results with a clone of your notebook.
In cell 1, I replaced from sklearn.cross_validation import train_test_split with from sklearn.model_selection import train_test_split because of a DeprecationWarning.
Cell 24 raises a DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). y = column_or_1d(y, warn=True)
In cell 32 there's a NameError: name 'classifier' is not defined. You used logreg.score in cell 30. Replacing classifier with logreg in cell 32 works but obviously produces exactly the same results as cell 30. I am not sure what you were trying to do here (using a different classifier?), or whether this is just an accidental duplicate (after renaming classifier to the more specific logreg).
In the last section, titled “ROC Curvefrom sklearn import metrics”, it looks to me like you accidentally converted some Python code to Markdown. This code (cell 34) produces two FutureWarnings (pandas.tslib is deprecated and will be removed in a future version.), one NameError: name 'clf1' is not defined, and another NameError: name 'Y_test' is not defined.
Kind regards
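The first two points in the report above (the moved import and the DataConversionWarning) can be illustrated with a short sketch. The data is synthetic and the column names are hypothetical; only the import path and the ravel() call are the parts that relate to the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # replaces sklearn.cross_validation

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y = pd.DataFrame({"y": (X["a"] + 0.5 * rng.normal(size=200) > 0).astype(int)})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# y_train is a one-column DataFrame; ravel() flattens it to shape (n_samples,),
# which avoids the DataConversionWarning.
y_train_1d = y_train.values.ravel()

logreg = LogisticRegression()
logreg.fit(X_train, y_train_1d)
accuracy = logreg.score(X_test, y_test.values.ravel())
print(f"accuracy on the test split: {accuracy:.2f}")
```

Selecting the target as a Series in the first place (y = data["y"]) has the same effect and makes the ravel() call unnecessary.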
The date in date_account_created is later than timestamp_first_active.
You can check at Out[28]:
https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Airbnb%20New%20User%20Bookings.ipynb
affiliate_channel | affiliate_provider | age | country_destination | date_account_created | first_affiliate_tracked | first_browser | first_device_type | gender | id | language | signup_app | signup_flow | signup_method | timestamp_first_active | date_account_created_day | date_account_created_month | date_account_created_year
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
direct | direct | 56.0 | US | 2010-09-28 | untracked | IE | Windows Desktop | FEMALE | 4ft3gnwmtx | en | Web | 3 | basic | 2009-06-09 | Tuesday | 9 | 2010
direct | direct | 42.0 | other | 2011-12-05 | untracked | Firefox | Mac Desktop | FEMALE | bjjt8pjhuk | en | Web | 0 | | 2009-10-31 | Monday | 12 | 2011
direct | direct | 41.0 | US | 2010-09-14 | untracked | Chrome | Mac Desktop | M | 87mebub9p4 | en | Web | 0 | basic | 2009-12-08 | Tuesday | 9 | 2010
other | other | NaN | US | 2010-01-01 | omg | Chrome | Mac Desktop | M | osr2jwljor | en | Web | 0 | basic | 2010-01-01 | Friday | 1 | 2010
other | craigslist | 46.0 | US | 2010-01-02 | untracked | Safari | Mac Desktop | FEMALE | lsw9q7uk0j | en | Web | 0 | basic | 2010-01-02 | Saturday |
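The observation can be checked directly in pandas by comparing the two date columns. The sketch below rebuilds only the three relevant columns from the five rows shown in the table; it is a check on this excerpt, not on the full dataset.

```python
import pandas as pd

# The id and date columns of the five rows shown above.
users = pd.DataFrame({
    "id": ["4ft3gnwmtx", "bjjt8pjhuk", "87mebub9p4", "osr2jwljor", "lsw9q7uk0j"],
    "date_account_created": pd.to_datetime(
        ["2010-09-28", "2011-12-05", "2010-09-14", "2010-01-01", "2010-01-02"]
    ),
    "timestamp_first_active": pd.to_datetime(
        ["2009-06-09", "2009-10-31", "2009-12-08", "2010-01-01", "2010-01-02"]
    ),
})

# Rows where the account was created after the first recorded activity.
created_after_active = users["date_account_created"] > users["timestamp_first_active"]

# Size of the gap in whole days.
gap_days = (users["date_account_created"] - users["timestamp_first_active"]).dt.days

print(users.loc[created_after_active, "id"].tolist())
print(gap_days.tolist())
```

In this excerpt the first three users were active well before their account creation date, while the last two have identical dates.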
Hello,
I want to study the code in Consumer_complaints.ipynb, but I don't have the data. Can you help me?
Thanks.
My email: [email protected]
'train.gz' for Click-Through Rate Prediction.ipynb is missing. Could you please provide a sample version of the data?
Where can I find the loyalty member clustering dataset, Training Data Set.csv?
AttributeError: module 'scipy.stats' has no attribute 'chisqprob'
Is there a fix or a workaround for this?
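The chisqprob helper was removed from newer SciPy releases, and older code that still calls stats.chisqprob fails with this AttributeError. Its documented replacement is the survival function of the chi-square distribution, stats.chi2.sf. The sketch below defines a stand-in and, as a commonly used stopgap, attaches it to scipy.stats only when the attribute is missing; the test statistic 3.84 is just an example value.

```python
from scipy import stats

def chisqprob(chisq, df):
    """Upper-tail probability of a chi-square statistic (stand-in for the removed helper)."""
    return stats.chi2.sf(chisq, df)

# Stopgap for older code that still looks up stats.chisqprob.
if not hasattr(stats, "chisqprob"):
    stats.chisqprob = chisqprob

# Example: a statistic of 3.84 with 1 degree of freedom is close to the 5% level.
p_value = chisqprob(3.84, 1)
print(round(p_value, 4))
```

Upgrading the library that makes the call (often an old statsmodels version) removes the need for the shim altogether.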
Hi,
I tried to run Time Series Forecastings.ipynb both in Jupyter and as a Python script. In Jupyter it seems fine. If I run it as a Python file (pasting the sections one by one and running the whole file), in
results.plot_diagnostics(figsize=(16, 8))
plt.show()
I got
Traceback (most recent call last):
File "time-series.py", line 71, in <module>
results.plot_diagnostics(figsize=(16, 8))
File "/home/user/anaconda3/lib/python3.8/site-packages/statsmodels/tsa/statespace/mlemodel.py", line 4284, in plot_diagnostics
raise ValueError(
ValueError: Length of endogenous variable must be larger the the number of lags used in the model and the number of observations burned in the log-likelihood calculation.
May I know the reason for it? Thanks
I need the data to check the Expedia Anomaly Detection notebook. Can anybody help me?
ValueError Traceback (most recent call last)
Input In [9], in
28 # Create recommendations for customer with id 2
29 customer_id = 2
---> 30 recommendations = recommend(customer_id, sparse_customer_item, customer_vecs, item_vecs)
32 print(recommendations)
Input In [9], in recommend(customer_id, sparse_customer_item, customer_vecs, item_vecs, num_items)
9 min_max = MinMaxScaler()
10 rec_vector_scaled = min_max.fit_transform(rec_vector.reshape(-1,1))[:,0]
---> 11 recommend_vector = customer_interactions * rec_vector_scaled
13 item_idx = np.argsort(recommend_vector)[::-1][:num_items]
15 descriptions = []
ValueError: operands could not be broadcast together with shapes (3664,) (4338,)
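The shapes (3664,) and (4338,) show that the customer's interaction row and the score vector cover different axes, so they cannot be multiplied element by element. One possible cause, stated here purely as an assumption, is that the customer and item factor matrices are swapped relative to the orientation of the customer-item matrix. A small guard like the hypothetical check_factor_shapes below makes that kind of mismatch visible before the multiplication; all sizes and arrays are toy values.

```python
import numpy as np

def check_factor_shapes(customer_item, customer_vecs, item_vecs):
    """Raise if the factor matrices do not line up with a customer x item matrix."""
    n_customers, n_items = customer_item.shape
    if customer_vecs.shape[0] != n_customers or item_vecs.shape[0] != n_items:
        raise ValueError(
            f"expected {n_customers} customer rows and {n_items} item rows, got "
            f"{customer_vecs.shape[0]} and {item_vecs.shape[0]}"
        )

# Toy sizes: 5 customers, 3 items, 2 latent factors.
rng = np.random.default_rng(0)
customer_item = np.zeros((5, 3))
customer_vecs = rng.normal(size=(5, 2))
item_vecs = rng.normal(size=(3, 2))

# Correct orientation: one score per item, same length as the interaction row.
check_factor_shapes(customer_item, customer_vecs, item_vecs)
rec_vector = customer_vecs[2].dot(item_vecs.T)
scores = customer_item[2] * rec_vector

# Swapped factors are caught by the guard instead of failing later.
try:
    check_factor_shapes(customer_item, item_vecs, customer_vecs)
    swapped_detected = False
except ValueError:
    swapped_detected = True
```

If the guard fires, swapping the two factor matrices or transposing the matrix passed to the model's fit step are the two things to look at first.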