Comments (9)
Hi there,
could you maybe upload an example notebook highlighting the error? I was not able to reproduce it. Please see the attached files.
test_fit_frame_split.zip
Cheers,
Eren
from st_dbscan.
Hi there,
thanks for your patience. I finally had some time. You are right: in some cases fit_frame_split does not work correctly, while in others it does. I have to investigate this further. Meanwhile, you can use a smaller or larger frame size (1000 or 3000). I hope this helps you resolve the problem. For instance:

    import pandas as pd
    from st_dbscan import ST_DBSCAN

    def test_fit_split():
        df = pd.read_csv('ST_DBSCAN_2024_03_14.csv')
        # transform to a NumPy array with columns (timestamp, x, y)
        data = df.loc[:, ['timestamp', 'x', 'y']].values
        # cluster frame by frame, using a frame size of 3000
        st_dbscan = ST_DBSCAN(eps1=0.25, eps2=250, min_samples=10).fit_frame_split(data, 3000)
        df['cluster'] = st_dbscan.labels
        return df

    df_fit_split = test_fit_split()

I expected that with the sparse matrices people would no longer need to rely on the fit_frame_split method. But yeah, I see your problem. I will fix this over the next few weeks.
Cheers,
Eren
Hi,
thanks for the tip. I'll have a look at it sometime this week in order to find out what the problem is.
Cheers,
Eren
Hi, thanks for the tip. I'll have a look at it sometime this week in order to find out what the problem is.
awesome, thank you! let me know how I can assist 🙇♂️
I was not able to reproduce your error.
You didn't use fit_frame_split correctly, based on how you implemented it! fit_frame_split uses fit internally, right? Your demo runs fit_frame_split on the data, but then runs another fit immediately afterwards, which nullifies the labels from the fit_frame_split run.
Please run the following notebook mod; note that cell 2 works as expected, but cell 3 does not.
mod_test_fit_frame_split.zip
@eren-ck hey! Just wanted to check if there was anything wrong with that notebook or intuition? Let me know!
No worries on the delay -- and sounds good, happy to hear I'm not crazy! Will keep my eyes peeled on future follow-up, as it would become the most memory-efficient approach when loads of parallel Spark jobs are being executed at the same time on millions of grouped rows and thousands of groups. Exciting stuff!
I had the same error when sorting my dataframe before passing it into fit_frame_split. I believe this is not an issue with fit_frame_split alone: any method that uses pandas.DataFrame.sort_values and subsequently sklearn.utils.check_array on a dataframe will fail unless the index is reset.
I was not able to reproduce your error.
I am able to reproduce it on my own dataset. If your data is not sorted by time and you use the pandas.DataFrame.sort_values method, the error occurs. The following snippets illustrate the error and how to fix it.

    from scipy.spatial.distance import pdist
    from sklearn.utils import check_array

    # selected_df, start_idx and end_idx are defined elsewhere (not shown)
    sorted_df = selected_df.sort_values(by='UTC')

    X_original_unsorted = selected_df.loc[start_idx:end_idx-1, ['UTC', 'x', 'y', 'z']]
    print(f"Unsorted shape: {X_original_unsorted.shape}")  # (10000, 4)
    X_original = sorted_df.loc[start_idx:end_idx-1, ['UTC', 'x', 'y', 'z']]
    print(f"Sorted shape: {X_original.shape}")  # (11309, 4)
    X_checked = check_array(X_original)
    print(f"Checked shape: {X_checked.shape}")  # (11309, 4)
    n, m = X_checked.shape
    # pdist errors
    time_dist = pdist(X_checked[:, 0].reshape(n, 1), metric='euclidean')
    # ValueError: Found input variables with inconsistent numbers of samples: [11309, 10000]

Resetting the index after sorting fixes it:

    sorted_df = selected_df.sort_values(by='UTC')
    # This fixes it: labels and row positions agree again
    sorted_df.reset_index(drop=True, inplace=True)

    X_original_unsorted = selected_df.loc[start_idx:end_idx-1, ['UTC', 'x', 'y', 'z']]
    print(f"Unsorted shape: {X_original_unsorted.shape}")  # (10000, 4)
    X_original = sorted_df.loc[start_idx:end_idx-1, ['UTC', 'x', 'y', 'z']]
    print(f"Sorted shape: {X_original.shape}")  # (10000, 4)
    X_checked = check_array(X_original)
    print(f"Checked shape: {X_checked.shape}")  # (10000, 4)
    n, m = X_checked.shape
    # No error
    time_dist = pdist(X_checked[:, 0].reshape(n, 1), metric='euclidean')

This error is probably still the intended behavior of sklearn.utils.check_array, but I am not sure. I will follow up at some point after looking into it. I would suggest that fit_frame_split sort by time itself, so that preventing this behavior is not left to the user. Alternatively, if copying or sorting the dataframe is out of the question, inform the user that fit will fail unless both the indices and the timestamps are in strictly increasing order, or raise an exception stating that the indices must be ordered as well as the timestamps.
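As a rough illustration of the "sort by time for the user" suggestion, here is a minimal sketch in plain NumPy. The helper names `sort_by_time` and `restore_order` are hypothetical and not part of st_dbscan; the sketch only assumes that the time values sit in the first column, as in the examples in this thread.

```python
import numpy as np

def sort_by_time(X):
    """Return X sorted by its first (time) column, plus the permutation used."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        raise ValueError("expected a 2D array of shape (n_samples, n_features)")
    # A stable sort keeps rows with equal timestamps in their original order.
    order = np.argsort(X[:, 0], kind="stable")
    return X[order], order

def restore_order(labels_sorted, order):
    """Map labels computed on the sorted rows back to the original row order."""
    labels_sorted = np.asarray(labels_sorted)
    labels = np.empty_like(labels_sorted)
    labels[order] = labels_sorted
    return labels

# Toy data: columns are (time, x, y), deliberately out of time order.
X = np.array([[30.0, 0.3, 3.0],
              [10.0, 0.1, 1.0],
              [20.0, 0.2, 2.0]])
X_sorted, order = sort_by_time(X)

# Stand-in for labels that a clusterer would return for the sorted rows.
labels_for_sorted_rows = np.array([0, 0, 1])
labels_for_original_rows = restore_order(labels_for_sorted_rows, order)
```

The permutation is returned alongside the sorted array so that cluster labels can be written back next to the caller's original rows without touching any dataframe index.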
EDIT: I now realize the mistake is in using the .loc operator on an index that is not in order, and that this issue does not occur in the same place as the original reporter's code. My issue occurs because I pass a pandas.DataFrame instead of a 2D NumPy array.
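The .loc behavior described in the EDIT can be reproduced on a small, made-up frame. This is only a sketch: the data is hypothetical, and it assumes a unique integer index, as in the snippets above.

```python
import pandas as pd

# Hypothetical toy frame; 'UTC' is deliberately out of time order.
df = pd.DataFrame({
    "UTC": [30, 10, 50, 20, 40],
    "x": [0.3, 0.1, 0.5, 0.2, 0.4],
    "y": [3.0, 1.0, 5.0, 2.0, 4.0],
})

# Sorting keeps the original index labels, so the index becomes [1, 3, 0, 4, 2].
sorted_df = df.sort_values(by="UTC")

# .loc slices by index label, following the current row order.
unsorted_slice = df.loc[1:2, ["UTC", "x", "y"]]       # rows labelled 1 and 2
by_label = sorted_df.loc[1:2, ["UTC", "x", "y"]]      # every row from label 1 through label 2

# .iloc slices by position and ignores the index labels.
by_position = sorted_df.iloc[1:3][["UTC", "x", "y"]]

# Resetting the index makes labels and positions agree again.
reset_df = sorted_df.reset_index(drop=True)
by_label_after_reset = reset_df.loc[1:2, ["UTC", "x", "y"]]

# A plain 2D NumPy array carries no index at all.
data = reset_df[["UTC", "x", "y"]].to_numpy()

print(len(unsorted_slice), len(by_label), len(by_label_after_reset))
```

Passing `data` (or `.values`, as in the maintainer's example) rather than the dataframe avoids depending on index labels entirely.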