Loading data with DMatrix (xgboost issue, CLOSED)

dmlc commented on May 7, 2024

Loading data with DMatrix

from xgboost.

Comments (5)

tqchen commented on May 7, 2024

XGBoost does not read CSV files directly so far. For text input, we
support the libsvm format. Check the example to see how to read data using numpy.
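As a rough illustration of the libsvm text format mentioned above, here is a minimal sketch that converts CSV rows into libsvm lines using only the Python standard library. The helper names `csv_row_to_libsvm` and `csv_to_libsvm` are hypothetical, and the sketch assumes a headerless, all-numeric CSV with the label in the first column, 0-based feature indices (as in the record posted later in this thread), and -999.0 as the missing-value marker.

```python
import csv
import io


def csv_row_to_libsvm(row, missing=-999.0):
    """Convert one CSV row (label first, then features) to a libsvm line.

    Assumption: every field is numeric. Features equal to `missing`
    are left out, since libsvm is a sparse format.
    """
    label, features = row[0], row[1:]
    pairs = [f"{i}:{v}" for i, v in enumerate(features) if float(v) != missing]
    return " ".join([label] + pairs)


def csv_to_libsvm(src, dst, missing=-999.0):
    """Stream rows from a CSV file object into a libsvm file object."""
    for row in csv.reader(src):
        if row:  # skip blank lines
            dst.write(csv_row_to_libsvm(row, missing) + "\n")


# Small in-memory demonstration; real use would pass open file handles.
src = io.StringIO("1,0.5,-999.0,3\n0,2,4,-999.0\n")
dst = io.StringIO()
csv_to_libsvm(src, dst)
print(dst.getvalue(), end="")
```

Streaming row by row keeps memory use flat, which matters for a file with well over a million records.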

On Tuesday, September 16, 2014, explorerr wrote:

I have a large file with 300+ features in each record.
While trying to load the data with DMatrix in Python, I got the following
message:

dtest = xgb.DMatrix(tsDir+'xgbTest.csv', missing=-999.0)
86x397 matrix with 328730778 entries is loaded from ../data/xgbTest.csv

I know the file has 1834123 records.

I looked at line 86 of the file, and it is no different from any other
line.

What could be a possible reason for this?

Thanks very much!

Rui

—
Reply to this email directly or view it on GitHub
https://github.com/tqchen/xgboost/issues/78.

explorerr commented on May 7, 2024

Thanks very much for your response.

I named the file with a .csv extension, but I have actually converted it to libsvm format.

I have successfully loaded the training dataset. The problem is with the testing dataset.

One record of the training file looks like this:

0 0:1 1:0.0397 2:0 3:318.77 4:1 21:0 6:0.00 7:0.00 8:0 9:1 10:0.0281 11:0.01 12:0.0397 137:0.0397 76:0.00 68:0.03 16:0.00 18:74500602 19:0.00 100:0.00 22:0 23:1 24:1 25:0.00 136:0.47 140:1 27:9999999.99 28:0 29:1 30:4 32:42 33:0.0281 105:0.0419 35:4 99:0.00 36:-1.23 37:59 38:0 39:281.21 103:1 41:0 42:1 43:0 44:1 45:3 46:0 221:0 47:9999999.99 48:0 49:0 50:0.00 51:1 52:1.00 53:0.0006 54:0.0281 55:0 56:0 57:0.0000 58:0.0281 59:295.52 193:1.00 61:1.00 62:0.0397 145:1 65:0.00 66:0.0281 121:0.0281 208:0 69:0.000 71:0 146:0 74:0 75:999 14:699 77:0.00 78:0 79:0 80:0.0397 198:1 82:7 83:2.127 84:0 239:0.00 86:0 87:1 88:0 107:0 91:0 93:1 264:0 95:2608 96:1 5:0.00 92:0.05 67:0 20:1.00 101:4 102:0.0005 151:0.0281 104:0.00 141:711 106:0 175:0 186:-3 108:38 109:0 110:0.00 111:0.01 203:1 113:0 114:0 115:0.0397 116:262 72:9999 118:1 157:5 120:0 31:4 122:0 123:0.0397 124:9999999.99 125:41 126:0 127:14 128:9999 129:0.0871 130:5 248:80.19 132:0 160:115.62 134:2 260:0.00 135:5.00 94:0 15:1 138:0 139:0 26:0 117:0.00 267:1 142:1 143:-3 252:1 13:68 73:-0.46 147:1 148:-3 149:1.00 150:9999999.99 97:0 152:0 164:79 154:1 155:12 156:0 119:2 158:0.0397 159:2.10 133:1392 161:0.05 162:0 163:0 153:0.00 166:-0.68 167:679 90:0.00 169:0 265:0 171:0.00 173:0 174:0.00 177:0 170:0 179:9999 176:999.99 181:0 182:-1.59 183:0.00 184:12 34:0 89:1 40:96 188:1459 189:0 165:0 191:0 192:1 60:0 63:0 195:0.0397 196:0 197:19 81:0.00 199:0.0281 200:0.0011 201:0.00 168:1 112:1.00 204:1.00 216:47 206:0 207:0.05 70:1 209:33 210:0 211:0 212:0.00 190:1 213:2 214:0.0397 215:0.0281 205:1 217:-3 266:1.00 218:0.00 219:1 220:0 172:1.000 222:0.0281 223:0 224:0.25 225:0 226:76 187:1 227:2777.65 228:0.00 229:1 230:18 231:9999 232:0 233:1 234:1 235:9999 236:0.00 237:0 238:263 257:1 85:0 240:0.0397 241:0.0000 202:1 242:0 243:0.00 244:0.00 245:-1.53 246:5 247:0 131:0.00 249:-1.90 250:1.00 251:-711 144:999 253:0 98:0 254:1 255:-0.66 256:0.0000 178:999.99 258:0 259:0 185:1.00 261:1 270:1 263:0 194:23 180:1 17:0.00 64:1.00 268:538.23 
269:1 262:0 271:3073.17 337:1

The testing dataset looks exactly the same, just without the first column (the label)...

Thanks!

Rui

tqchen commented on May 7, 2024

Oh, you need a dummy label for the test set as well.

Sincerely,

Tianqi Chen
Computer Science & Engineering, University of Washington
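The dummy-label suggestion above can be sketched as a small preprocessing step. This is only an illustration under stated assumptions: the helper names `has_label` and `add_dummy_labels` are hypothetical, a line is treated as unlabeled when its first token already contains a colon, and `0` is an arbitrary placeholder label.

```python
import io


def has_label(line):
    """Return True if the first token looks like a label, not index:value."""
    tokens = line.split()
    return bool(tokens) and ":" not in tokens[0]


def add_dummy_labels(src, dst, dummy="0"):
    """Copy libsvm lines from src to dst, prepending a dummy label
    to any line that starts directly with an index:value pair."""
    for line in src:
        line = line.strip()
        if not line:
            continue  # drop blank lines
        if not has_label(line):
            line = f"{dummy} {line}"
        dst.write(line + "\n")


# Small in-memory demonstration: the first line lacks a label, the second has one.
src = io.StringIO("0:1 1:0.0397 3:318.77\n0 0:1 2:0\n")
dst = io.StringIO()
add_dummy_labels(src, dst)
print(dst.getvalue(), end="")
```

The rewritten test file can then be passed to `xgb.DMatrix` in the same way as the training file in the original post. The placeholder value should not matter for prediction, since labels are only needed when training or evaluating.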

explorerr commented on May 7, 2024

I see, thanks very much :)

explorerr commented on May 7, 2024

You did a great job with xgboost, bravo :)
